From pcaulfie at redhat.com  Mon Aug  1 07:32:43 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 01 Aug 2005 08:32:43 +0100
Subject: [Linux-cluster] Error when loading the lock_dlm module
In-Reply-To: <8293054050729235847b66f37@mail.gmail.com>
References: <8293054050729235847b66f37@mail.gmail.com>
Message-ID: <42EDD01B.40809@redhat.com>

KL Raja Sekar wrote:
> Hi,
> 
> I am trying to use GFS 6.1 with lock_dlm module.  But when i try to
> load the lock_dlm module i am getting fatal error.  Herewith attached
> the fatal error msg and the dmesg errors.
> 
> I am using HP Proliant DL360 server with XP 1024 storage.   please let
> me know if anybody has the solution for the above.
> 
> regards
> shekar
> 
> ******************************************************************************
> root at uranus1 ~]# modprobe lock_dlm
> FATAL: Error inserting lock_dlm
> (/lib/modules/2.6.9-11.ELsmp/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko):
> Unknown symbol in module, or unknown parameter (see dmesg)
> ******************************************************************************
> dmesg output -------------------------------
> lock_dlm: Unknown symbol dlm_debug_dump
> lock_dlm: Unknown symbol dlm_new_lockspace
> lock_dlm: Unknown symbol kcl_register_service
> lock_dlm: Unknown symbol dlm_unlock
> lock_dlm: Unknown symbol kcl_start_done
> lock_dlm: Unknown symbol dlm_release_lockspace
> lock_dlm: Unknown symbol kcl_join_service
> lock_dlm: Unknown symbol kcl_unregister_service
> lock_dlm: Unknown symbol kcl_leave_service
> lock_dlm: Unknown symbol dlm_query
> lock_dlm: Unknown symbol kcl_get_members
> lock_dlm: Unknown symbol kcl_releaseref_cluster
> lock_dlm: Unknown symbol dlm_lock
> lock_dlm: Unknown symbol kcl_cluster_name
> lock_dlm: Unknown symbol kcl_get_services
> lock_dlm: Unknown symbol kcl_addref_cluster


Looks like the cman & dlm modules are not loaded. If they're on the system then
try re-running depmod -a. If not, then install them ;-)


-- 

patrick



From ialberdi at histor.fr  Mon Aug  1 09:30:48 2005
From: ialberdi at histor.fr (Ion Alberdi)
Date: Mon, 01 Aug 2005 11:30:48 +0200
Subject: [Linux-cluster] File size limitation on GFS
Message-ID: <42EDEBC8.7070402@histor.fr>

Hi everybody, 
is there is a maximum file size the GFS can handle?
I tried to do some tests with big files, and I couldn't open (open(2)) 
files that
were  >= 2Go. (It works with 1Go files, I didn't try sizes between 1 and 
2 Go).

I would like to know if this limitation comes from my configuration or 
from the GFS
file system.

I searched an answer in the web and in the mailing list but I didn't 
found anything,
If I missed something I'd be very sorry and an url to the article
I missed would be a great answer :).

Thanks in advance!

Regards



From javipolo at datagrama.net  Mon Aug  1 12:14:48 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Mon, 1 Aug 2005 14:14:48 +0200
Subject: [Linux-cluster] segfault
Message-ID: <20050801121448.GA4036@gibson.drslump.org>

While doing cman_tool join ... I got a segfault and this on the logs:

Unable to handle kernel NULL pointer dereference at virtual address 0000019c
 printing eip:
c034e80a
*pde = 00000000
Oops: 0000 [#12]
Modules linked in: gfs lock_harness dlm cman ipv6 snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
CPU:    0
EIP:    0060:[<c034e80a>]    Not tainted VLI
EFLAGS: 00010296   (2.6.12.3) 
EIP is at sk_alloc+0x1b/0xd0
eax: c1551180   ebx: 00000002   ecx: ffffff9c   edx: 000000d0
esi: 00000134   edi: 000000d0   ebp: ffffff9f   esp: c2213ed8
ds: 007b   es: 007b   ss: 0068
Process cman_tool (pid: 4034, threadinfo=c2212000 task=c993fa80)
Stack: ca2850ac c886c074 00000286 00000002 cc1f3380 000000d0 ffffff9f e1aa90e7 
       0000001e 000000d0 00000134 c1551180 00000002 cc1f3380 00000002 e1aa933e 
       cc1f3380 000000d0 0000001e cc1f3380 c034c923 cc1f3380 00000002 00000001 
Call Trace:
 [<e1aa90e7>] cl_alloc_sock+0x38/0x97 [cman]
 [<e1aa933e>] cl_create+0x59/0x101 [cman]
 [<c034c923>] __sock_create+0xc3/0x1c7
 [<c034ca56>] sock_create+0x2f/0x33
 [<c034cab5>] sys_socket+0x28/0x55
 [<c034d9ff>] sys_socketcall+0x89/0x251
 [<c0149e59>] filp_close+0x52/0x96
 [<c010dcbc>] do_page_fault+0x0/0x5bf
 [<c0102865>] syscall_call+0x7/0xb
Code: ff ff ff c7 44 24 14 04 00 00 00 e9 75 fc ff ff 83 ec 1c 89 74 24 10 89 5c 24 0c 89 7c 24 14 89 6c 24 18 8b 74 24 28 8b 54 24 24 <8b> 46 68 85 c0 0f 84 8c 00 00 00 89 54 24 04 89 04 24 e8 a8 79 

Has anybody a hint?
I compiled the kernel modules, and also made debian packages from the sources out there at ubuntu ... I'm with 2.6.12.3 in debian/sid .....

thx

-- 
Javier Polo @ Datagrama
902 136 126



From pcaulfie at redhat.com  Mon Aug  1 12:24:27 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 01 Aug 2005 13:24:27 +0100
Subject: [Linux-cluster] segfault
In-Reply-To: <20050801121448.GA4036@gibson.drslump.org>
References: <20050801121448.GA4036@gibson.drslump.org>
Message-ID: <42EE147B.30702@redhat.com>

Javi Polo wrote:
> While doing cman_tool join ... I got a segfault and this on the logs:
> 
> Unable to handle kernel NULL pointer dereference at virtual address 0000019c
>  printing eip:
> c034e80a
> *pde = 00000000
> Oops: 0000 [#12]
> Modules linked in: gfs lock_harness dlm cman ipv6 snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
> CPU:    0
> EIP:    0060:[<c034e80a>]    Not tainted VLI
> EFLAGS: 00010296   (2.6.12.3) 
> EIP is at sk_alloc+0x1b/0xd0
> eax: c1551180   ebx: 00000002   ecx: ffffff9c   edx: 000000d0
> esi: 00000134   edi: 000000d0   ebp: ffffff9f   esp: c2213ed8
> ds: 007b   es: 007b   ss: 0068
> Process cman_tool (pid: 4034, threadinfo=c2212000 task=c993fa80)
> Stack: ca2850ac c886c074 00000286 00000002 cc1f3380 000000d0 ffffff9f e1aa90e7 
>        0000001e 000000d0 00000134 c1551180 00000002 cc1f3380 00000002 e1aa933e 
>        cc1f3380 000000d0 0000001e cc1f3380 c034c923 cc1f3380 00000002 00000001 
> Call Trace:
>  [<e1aa90e7>] cl_alloc_sock+0x38/0x97 [cman]
>  [<e1aa933e>] cl_create+0x59/0x101 [cman]
>  [<c034c923>] __sock_create+0xc3/0x1c7
>  [<c034ca56>] sock_create+0x2f/0x33
>  [<c034cab5>] sys_socket+0x28/0x55
>  [<c034d9ff>] sys_socketcall+0x89/0x251
>  [<c0149e59>] filp_close+0x52/0x96
>  [<c010dcbc>] do_page_fault+0x0/0x5bf
>  [<c0102865>] syscall_call+0x7/0xb
> Code: ff ff ff c7 44 24 14 04 00 00 00 e9 75 fc ff ff 83 ec 1c 89 74 24 10 89 5c 24 0c 89 7c 24 14 89 6c 24 18 8b 74 24 28 8b 54 24 24 <8b> 46 68 85 c0 0f 84 8c 00 00 00 89 54 24 04 89 04 24 e8 a8 79 
> 
> Has anybody a hint?
> I compiled the kernel modules, and also made debian packages from the sources out there at ubuntu ... I'm with 2.6.12.3 in debian/sid .....
> 

The sk_alloc code has changed a few times in the kernel so it might be that the
source you have compiled doesn't match the kernel it is running on. Though that
usually results in a compile error rather than a runtime one!

Which source are you using?

-- 

patrick



From pcaulfie at redhat.com  Mon Aug  1 12:28:49 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 01 Aug 2005 13:28:49 +0100
Subject: [Linux-cluster] How do nodes in a cluster authenticate each other?
In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A364F4BF@mailxchg01.corp.opsource.net>
References: <38A48FA2F0103444906AD22E14F1B5A364F4BF@mailxchg01.corp.opsource.net>
Message-ID: <42EE1581.2040604@redhat.com>

Jeff Harr wrote:
> I asked this under a different heading earlier but nobody answered J 
> Don?t mean to spam the group but I can?t figure it out.  This is on a
> Redhat Cluster 4.
> 

Basically they don't. In a CMAN cluster there is a join protocol that all nodes
have to go through to become a member and cluster nodes will only talk to known
members.

But as things currently stand if someone spoofs the IP address, port & cluster
number then it will get through I'm afraid. The solution is to use a private
network for cluster communications.

-- 

patrick



From javipolo at datagrama.net  Mon Aug  1 12:57:09 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Mon, 1 Aug 2005 14:57:09 +0200
Subject: [Linux-cluster] segfault
In-Reply-To: <42EE147B.30702@redhat.com>
References: <20050801121448.GA4036@gibson.drslump.org>
	<42EE147B.30702@redhat.com>
Message-ID: <20050801125709.GA4173@gibson.drslump.org>

On Aug/01/2005, Patrick Caulfield wrote:

> The sk_alloc code has changed a few times in the kernel so it might be that the
> source you have compiled doesn't match the kernel it is running on. Though that
> usually results in a compile error rather than a runtime one!
> Which source are you using?

I downloaded the debian patches:
kernel-patch-2.6-cman - Cluster manager - kernel patch

It did not compile, and I hand-fixed it with some patches that appeared in
this list ...

I'm gonna check out now the svn code in open.datacore.ch .....

-- 
Javier Polo @ Datagrama
902 136 126



From jharr at opsource.net  Mon Aug  1 13:17:29 2005
From: jharr at opsource.net (Jeff Harr)
Date: Mon, 1 Aug 2005 09:17:29 -0400
Subject: [Linux-cluster] How do nodes in a cluster authenticate each other?
Message-ID: <38A48FA2F0103444906AD22E14F1B5A364FB1F@mailxchg01.corp.opsource.net>

Ok, thank you very much, I just wanted to be sure that I wasn't
overlooking something.

Jeff

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Patrick Caulfield
Sent: Monday, August 01, 2005 8:29 AM
To: linux clustering
Subject: Re: [Linux-cluster] How do nodes in a cluster authenticate each
other?

Jeff Harr wrote:
> I asked this under a different heading earlier but nobody answered J 
> Don't mean to spam the group but I can't figure it out.  This is on a
> Redhat Cluster 4.
> 

Basically they don't. In a CMAN cluster there is a join protocol that
all nodes
have to go through to become a member and cluster nodes will only talk
to known
members.

But as things currently stand if someone spoofs the IP address, port &
cluster
number then it will get through I'm afraid. The solution is to use a
private
network for cluster communications.

-- 

patrick

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster




From javiermarasco at yahoo.com.ar  Mon Aug  1 12:32:20 2005
From: javiermarasco at yahoo.com.ar (javier marasco)
Date: Mon, 1 Aug 2005 09:32:20 -0300
Subject: [Linux-cluster] job opening in Houston
In-Reply-To: <78fcc84a050730211531f4a197@mail.gmail.com>
Message-ID: <200508011332.j71DWOGh025030@mx3.redhat.com>

I don't think so. Im only ask for information purpose. If you can of course
, tell me how much pay for that job.

thanks 


Javier Marasco
System Administrator
Digbang (Vera 358)
Argentina 54-11-4857-6585 
www.digbang.com

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of y f
Sent: Sunday, July 31, 2005 1:16 AM
To: keith at clearpathit.com; linux clustering
Subject: Re: [Linux-cluster] job opening in Houston

Hi, Keith,

Can the work be done remotely ?

On 7/29/05, Keith Grammer <keith at clearpathit.com> wrote:
>  
>  
> 
> I am looking for a Linux Cluster Specialist for a contract position. 
> Please respond with a resume in Word.
> 
> Thank You,
> 
> Keith
> 
>   
> 
>  
> 
>   
> 
>   
> 
> Keith Grammer
> 
> Partner
> 
> ClearPath IT LLC
> 
> 713-344-0232
> 
> keith at clearpathit.com
> 
>   
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 
>

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster


	

	
		
___________________________________________________________ 
1GB gratis, Antivirus y Antispam 
Correo Yahoo!, el mejor correo web del mundo 
http://correo.yahoo.com.ar 



From teigland at redhat.com  Tue Aug  2 07:18:28 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 2 Aug 2005 15:18:28 +0800
Subject: [Linux-cluster] [PATCH 00/14] GFS
Message-ID: <20050802071828.GA11217@redhat.com>

Hi, GFS (Global File System) is a cluster file system that we'd like to
see added to the kernel.  The 14 patches total about 900K so I won't send
them to the list unless that's requested.  Comments and suggestions are
welcome.  Thanks

http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050801/broken-out/

Dave



From javipolo at datagrama.net  Tue Aug  2 07:45:18 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Tue, 2 Aug 2005 09:45:18 +0200
Subject: [Linux-cluster] segfault
In-Reply-To: <20050801125709.GA4173@gibson.drslump.org>
References: <20050801121448.GA4036@gibson.drslump.org>
	<42EE147B.30702@redhat.com>
	<20050801125709.GA4173@gibson.drslump.org>
Message-ID: <20050802074518.GA20528@gibson.drslump.org>

On Aug/01/2005, Javi Polo wrote:

> > The sk_alloc code has changed a few times in the kernel so it might be that the
> > source you have compiled doesn't match the kernel it is running on. Though that
> > usually results in a compile error rather than a runtime one!
> > Which source are you using?
> I'm gonna check out now the svn code in open.datacore.ch .....

Weeeeeeek, error :P

cluster/cman/cnxman.c: In function `cl_alloc_sock':
cluster/cman/cnxman.c:922: warning: passing arg 3 of `sk_alloc' makes pointer from integer without a cast
cluster/cman/cnxman.c:922: warning: passing arg 4 of `sk_alloc' makes integer from pointer without a cast
cluster/cman/cnxman.c: In function `cl_bind':
cluster/cman/cnxman.c:1062: error: structure has no member named `sk_zapped'
cluster/cman/cnxman.c:1086: error: structure has no member named `sk_zapped'
make[2]: *** [cluster/cman/cnxman.o] Error 1
make[1]: *** [cluster/cman] Error 2
make: *** [cluster] Error 2
kinoko:/usr/src/linux# 

I fixed those as described in https://www.redhat.com/archives/linux-cluster/2005-April/msg00051.html
and https://www.redhat.com/archives/linux-cluster/2005-April/msg00034.html

(with the debian patch I just had to fix the sk_zapped thing)

and the results areeeeeeee:

yuppp, now it does not segfaults .... :)


-- 
Javier Polo @ Datagrama
902 136 126



From pegasus at nerv.eu.org  Tue Aug  2 08:49:52 2005
From: pegasus at nerv.eu.org (Jure =?iso-8859-2?Q?Pe=E8ar?=)
Date: Tue, 02 Aug 2005 10:49:52 +0200
Subject: [Linux-cluster] gfs max number of subdirs per directory
Message-ID: <1122972593.8420.7.camel@localhost.localdomain>

Hi all,

I want to know what is the maximum number of subdirectories one can
create on a GFS filesystem. For example, both ext2/3 and Veritas vxfs
are limited to 32k, but that will soon become a limiting factor for my
application. Because of this my only choice right now is reiserfs ...

-- 

Jure Pe?ar
http://jure.pecar.org



From teigland at redhat.com  Tue Aug  2 07:18:28 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 2 Aug 2005 15:18:28 +0800
Subject: [Linux-cluster] [PATCH 00/14] GFS
Message-ID: <20050802071828.GA11217@redhat.com>

Hi, GFS (Global File System) is a cluster file system that we'd like to
see added to the kernel.  The 14 patches total about 900K so I won't send
them to the list unless that's requested.  Comments and suggestions are
welcome.  Thanks

http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050801/broken-out/

Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



From oldmoonster at gmail.com  Tue Aug  2 09:48:07 2005
From: oldmoonster at gmail.com (Q.L)
Date: Tue, 2 Aug 2005 17:48:07 +0800
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>
Message-ID: <359782e705080202484184f323@mail.gmail.com>

Hi,

I have tried 2.6.9,2.6.11,2.6.12, but the kernel build always fails,
:-(, could you share with me which kernel did you patch to? Is there
any instructions to help me build and install cluster-1.00.00.tar.gz
on my redhat 9 system? Indeed, I have refered to
http://gfs.wikidev.net/Installation, but I still can't successfully
build sources... .

Thanks!!!!

Q.L

On 7/30/05, Jacob Liff <jacobl at ccbill.com> wrote:
>  
>  
> 
> Howdy, 
> 
>   
> 
> I have gone through the list before buging you guys but haven't found the
> answer. A few days ago someone else was having issues and someone
> recommended getting the source from: 
> 
>   
> 
> ftp://sources.redhat.com/pub/cluster/releases/cluster-1.00.00.tar.gz
> 
>   
> 
> Instead of the CSV and to compile it against vanilla 2.6.12. I followed
> those instructions and everything appeared to compile fine except when I
> modprobe gfs I get this fun error: 
> 
>   
> 
> ATAL: Error inserting gfs
> (/lib/modules/2.6.12/kernel/fs/gfs/gfs.ko): Unknown symbol
> in module, or unknown parameter (see dmesg) 
> 
>   
> 
> dmesg output: 
> 
>   
> 
> gfs: Unknown symbol posix_acl_from_xattr 
> 
> gfs: Unknown symbol posix_acl_valid 
> 
> gfs: Unknown symbol posix_acl_permission 
> 
> gfs: Unknown symbol posix_acl_equiv_mode 
> 
> gfs: Unknown symbol posix_acl_chmod_masq 
> 
> gfs: Unknown symbol posix_acl_to_xattr 
> 
> gfs: Unknown symbol posix_acl_create_masq 
> 
> gfs: Unknown symbol posix_acl_clone 
> 
>   
> 
> Looking at the source its linking against the correct headers for the
> functions.. and the functions do exist in the headers. 
> 
>   
> 
> Maybe I have not compiled someone into the kernel that I needed in order to
> make this work? Could I be trying to use this against the wrong
> kernel(Vanilla 2.6.12
> http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.12.3.tar.bz2
> )? I have been trying to either compile modules or patch against most of the
> newer kernels for the last two days with no luck. Any help would be greatly
> appreciated. 
> 
>   
> 
> Jacob L. 
> 
>   
> 
>   
> 
>   
> 
>   
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 
>



From pcaulfie at redhat.com  Tue Aug  2 10:03:26 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 02 Aug 2005 11:03:26 +0100
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <359782e705080202484184f323@mail.gmail.com>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>
	<359782e705080202484184f323@mail.gmail.com>
Message-ID: <42EF44EE.7020804@redhat.com>

Q.L wrote:
> Hi,
> 
> I have tried 2.6.9,2.6.11,2.6.12, but the kernel build always fails,
> :-(, could you share with me which kernel did you patch to? Is there
> any instructions to help me build and install cluster-1.00.00.tar.gz
> on my redhat 9 system? Indeed, I have refered to
> http://gfs.wikidev.net/Installation, but I still can't successfully
> build sources... .
> 

It compiles cleanly for me against 2.6.12.2.

-- 

patrick



From oldmoonster at gmail.com  Tue Aug  2 10:09:52 2005
From: oldmoonster at gmail.com (Q.L)
Date: Tue, 2 Aug 2005 18:09:52 +0800
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <42EF44EE.7020804@redhat.com>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>
	<359782e705080202484184f323@mail.gmail.com>
	<42EF44EE.7020804@redhat.com>
Message-ID: <359782e7050802030967157c38@mail.gmail.com>

Hi, Patrick,

Could you share me the instructions to build against kernel 2.6.12.2?

Thanks!

Q.L 

On 8/2/05, Patrick Caulfield <pcaulfie at redhat.com> wrote:
> Q.L wrote:
> > Hi,
> >
> > I have tried 2.6.9,2.6.11,2.6.12, but the kernel build always fails,
> > :-(, could you share with me which kernel did you patch to? Is there
> > any instructions to help me build and install cluster-1.00.00.tar.gz
> > on my redhat 9 system? Indeed, I have refered to
> > http://gfs.wikidev.net/Installation, but I still can't successfully
> > build sources... .
> >
> 
> It compiles cleanly for me against 2.6.12.2.
> 
> --
> 
> patrick
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>



From pcaulfie at redhat.com  Tue Aug  2 10:16:54 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 02 Aug 2005 11:16:54 +0100
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <359782e7050802030967157c38@mail.gmail.com>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>	
	<359782e705080202484184f323@mail.gmail.com>	
	<42EF44EE.7020804@redhat.com>
	<359782e7050802030967157c38@mail.gmail.com>
Message-ID: <42EF4816.1090200@redhat.com>

Q.L wrote:
> Hi, Patrick,
> 
> Could you share me the instructions to build against kernel 2.6.12.2?
> 


Certainly:

./configure --kernel_src=<path-to-kernel-src>
make


-- 

patrick



From oldmoonster at gmail.com  Tue Aug  2 10:23:08 2005
From: oldmoonster at gmail.com (Q.L)
Date: Tue, 2 Aug 2005 18:23:08 +0800
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <42EF4816.1090200@redhat.com>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>
	<359782e705080202484184f323@mail.gmail.com>
	<42EF44EE.7020804@redhat.com>
	<359782e7050802030967157c38@mail.gmail.com>
	<42EF4816.1090200@redhat.com>
Message-ID: <359782e70508020323165f669@mail.gmail.com>

On 8/2/05, Patrick Caulfield <pcaulfie at redhat.com> wrote:
> Q.L wrote:
> > Hi, Patrick,
> >
> > Could you share me the instructions to build against kernel 2.6.12.2?
> >
> 
> 
> Certainly:
> 
> ./configure --kernel_src=<path-to-kernel-src>
> make
> 
> 
Yes, I know that option, but to me, output just like this, it then make error.

[root at buckupy cluster-1.00.00]# ./configure --kernel_src=/usr/src/linux-2.6.9
configure cman-kernel

Configuring Makefiles for your system...
Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95.
configure dlm-kernel

Configuring Makefiles for your system...
Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95.
configure gfs-kernel

Configuring Makefiles for your system...
Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95.
configure gnbd-kernel

Configuring Makefiles for your system...
Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95.
configure magma

Configuring Makefiles for your system...
Completed Makefile configuration

configure ccs

Configuring Makefiles for your system...
Completed Makefile configuration

configure cman

Configuring Makefiles for your system...
Completed Makefile configuration

configure dlm

Configuring Makefiles for your system...
Completed Makefile configuration

configure fence

Configuring Makefiles for your system...
Completed Makefile configuration

configure iddev

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs

Configuring Makefiles for your system...
Completed Makefile configuration

configure gnbd

Configuring Makefiles for your system...
Completed Makefile configuration

configure gulm

Configuring Makefiles for your system...
Completed Makefile configuration

configure magma-plugins

Configuring Makefiles for your system...
Completed Makefile configuration

configure rgmanager

Configuring Makefiles for your system...
Completed Makefile configuration

Thanks!

Q.L
> --
> 
> patrick
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>



From pcaulfie at redhat.com  Tue Aug  2 10:28:33 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 02 Aug 2005 11:28:33 +0100
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <1122318870.12824.29.camel@localhost.localdomain>
References: <42DB63F6.5070600@redhat.com>
	<1122318870.12824.29.camel@localhost.localdomain>
Message-ID: <42EF4AD1.6010809@redhat.com>

Steven Dake wrote:
> On Mon, 2005-07-18 at 09:10 +0100, Patrick Caulfield wrote:
> 
>>As I see it there are two things we can do with userland cman that's current in
>>the head of CVS:
>>
>>1. Leave it as it is - a port of the kernel one. This has some benefits: it's
>>easy (plus a few bug fixes that need to go in), it's protocol-compatible with
>>the kernel one. There are a small number of extra features that could go in
>>there (that would, annoyingly, break that compatibility) but nothing really
>>serious. It doesn't give us anything new, but what new is neeed ?
>>
>>2. Migrate it to something much more sophisticated. I've mentioned Virtual
>>Synchrony a few times before and I've been looking into this in some detail
>>since. The benefits are largely internal but they do provide a reliable, robust
>>and well-performing messaging system that other cluster subsystems can use.
>>While the application programmers at the cluster summit maintained they had no
>>use for a cluster messaging system, I still believe that it is a useful thing to
>>have at a lower level - if only for our own programming needs. I know that Jon
>>looked into the existing cman messaging system before rejecting it as too slow
>>and unreliable for he needs of the cluster mirroring code.
>>
>>There are two suboptions here.
>>  a) write it ourself. Quite a big job this. Bigger than I would like. To be
>>honest I did make a start at this and now realise just what a huge job it is to
>>get something that both performs well and is reliable. REALLY reliable. even
>>worse if the academics want something provably reliable.
>>   b) adopt something else. The obvious candidate here is the openAIS code[1].
>>This looks to be quite mature now and has all the features we need of a low
>>level messaging system. It's very nicely abstracted out so we can pick out just
>>the bits we need without having the whole (rather heavyweight) system on top of it.
>>
>>The one problem with the openAIS code is that it doesn't support IPv6, and much
>>of the code is tied to IPv4. Having had a look at it and emailed Steven Dake
>>about this he reckons it's about 2 weeks work to add.[2]
>>
>>The advantages of doing this are several.
>>- It saves time. We get something that is known to work, even though it needs
>>extra features added for our own use.
>>- we're not inventing something new that already exists in several other places.
>>- we get more people who know the code. Currently only I know the internals of
>>cman as it stands and it's quite scary code that people don't want to get
>>involved with (we've have several DLM patches in the past, but no CMAN ones).
>>This way we get at least 2 (Steven and me) as well as anyone else who is
>>following openAIS. Of course there will be CMAN-specific stuff on top of their
>>comms layer to make it quorum-based and capable of supporting GFS and DLM that
> 
> 
> sorry my response is so late I missed this mail while at OLS.
> 
> The quorum problem is commonly referred to in the literature as a
> "virtual synchrony filter".  I'd love to have some implementations of
> virtual synchrony filters that exist within libtotem itself..
> Definately an area of interest for openais as we need some services to
> operate only in one partition (like the amf).
> 
> 
>>will be Red Hat specific but these are not going to be large.
>>- the APIs are all open (based on SAforum specifications) and already
>>implemented. Although adding saCLM to CMAN is pretty easy as I proved last week.
>>
> 
> 
>>The disadvantages are
>>- Need to learn the internals of someone else's code.
> 
> 
> indeed this part is somewhat painful :(
> 
> 
>>- We don't have full control over the code. Although we can obviously fork it if
>>we feel the need it would, obviously be preferable not to.
> 
> 
> My view is that open source influence is dictated by level of
> contribution just like any kind of community.  ie: the more a person
> contributes the more influence they can exert over a project or
> direction.  Even as maintainer I don't have full control over the
> openais code as the community really decides where we go and what work
> we do.
> 
> My point here is that if you are willing to fork, then you probably have
> some time to maintain the code..  which is better spent influencing the
> current openais tree :)
> 
> 
>>- non-compatibility with "old" cman, making rolling upgrades har or even
>>impossible. I'm not sure what to do about this yet, but it's worth pointing out
>>that the DLM has a new line-protocol too.
> 
> 
> yes upgrades are a real pain.  We have not fully tackled this problem in
> the openais project yet, because we havn't released a stable version.
> Ideally we would like two versions (older, newer) to interoperate, even
> if that means uglifying the implementation to coexist with two line
> types.  We have some work in place to address this problem but before
> our first production release I'm planning to really think through
> interoperability with new implementations for features of the totem
> protocol (like redundant ring, multi ring gateway (for local area
> networks), group key generation, multi-ring-bridged (for wide area
> networks), etc).
> 
> 
>>- openAIS is BSD licensed, I don't think this is a problem but it probably needs
>>checking.
>>
> 
> 
> Originally I had planned to use spread for openais, but the license was
> not compatible with the lawyers "approved list".  So we had to implement
> a protocol completely from scratch because of the license issue which
> took about 1.5 years of work (sigh).  I wanted to be sure other projects
> could reuse the totem code so chose the most liberal license I could
> find.
> 
> 
>>In short, I'm advocating adopting the openAIS core (libtotem basically) as
>>CMAN's communications/membership protocol. If we're going to do a "CMAN V2" that
>>has anything significant over V1 then re-inventing it is going to be a huge
>>amount of work that someone else has already done.
>>
>>Comments?
>>
> 
> 
> sounds good Patrick  if you need any help from us let us know
> 

Thanks for that Steven. I'm going to make a start on this when I get back from
UKUUG next week. I've managed to knock up something that looks like cman from
the outside but uses libtotem for it's comms layer so it's looking good. On
other thing I need to look into (apart from IPv6) is multi-home. cman had a
(primitive) failover system but it's not currently in use by anyone because DLM
doesn't support it but I think it's something we need to provide at some stage.

Don't worry about the mention of a fork - the chances of it happening are almost
nil!
-- 

patrick



From pcaulfie at redhat.com  Tue Aug  2 10:29:46 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 02 Aug 2005 11:29:46 +0100
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <359782e70508020323165f669@mail.gmail.com>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>	
	<359782e705080202484184f323@mail.gmail.com>	
	<42EF44EE.7020804@redhat.com>	
	<359782e7050802030967157c38@mail.gmail.com>	
	<42EF4816.1090200@redhat.com>
	<359782e70508020323165f669@mail.gmail.com>
Message-ID: <42EF4B1A.2030405@redhat.com>

Q.L wrote:
> On 8/2/05, Patrick Caulfield <pcaulfie at redhat.com> wrote:
> 
>>Q.L wrote:
>>
>>>Hi, Patrick,
>>>
>>>Could you share me the instructions to build against kernel 2.6.12.2?
>>>
>>
>>
>>Certainly:
>>
>>./configure --kernel_src=<path-to-kernel-src>
>>make
>>
>>
> 
> Yes, I know that option, but to me, output just like this, it then make error.
> 
> [root at buckupy cluster-1.00.00]# ./configure --kernel_src=/usr/src/linux-2.6.9
> configure cman-kernel
> 
> Configuring Makefiles for your system...
> Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95.
> configure dlm-kernel
> 


Have you configured the kernel source using make menuconfig ?


-- 

patrick



From oldmoonster at gmail.com  Tue Aug  2 10:44:33 2005
From: oldmoonster at gmail.com (Q.L)
Date: Tue, 2 Aug 2005 18:44:33 +0800
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <42EF4B1A.2030405@redhat.com>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>
	<359782e705080202484184f323@mail.gmail.com>
	<42EF44EE.7020804@redhat.com>
	<359782e7050802030967157c38@mail.gmail.com>
	<42EF4816.1090200@redhat.com>
	<359782e70508020323165f669@mail.gmail.com>
	<42EF4B1A.2030405@redhat.com>
Message-ID: <359782e70508020344510f473a@mail.gmail.com>

On 8/2/05, Patrick Caulfield <pcaulfie at redhat.com> wrote:
> Q.L wrote:
> > On 8/2/05, Patrick Caulfield <pcaulfie at redhat.com> wrote:
> >
> >>Q.L wrote:
> >>
> >>>Hi, Patrick,
> >>>
> >>>Could you share me the instructions to build against kernel 2.6.12.2?
> >>>
> >>
> >>
> >>Certainly:
> >>
> >>./configure --kernel_src=<path-to-kernel-src>
> >>make
> >>
> >>
> >
> > Yes, I know that option, but to me, output just like this, it then make error.
> >
> > [root at buckupy cluster-1.00.00]# ./configure --kernel_src=/usr/src/linux-2.6.9
> > configure cman-kernel
> >
> > Configuring Makefiles for your system...
> > Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95.
> > configure dlm-kernel
> >
> 
> 
> Have you configured the kernel source using make menuconfig ?
> 
> 
Thanks! and this time I ran:
# cd /usr/src/linux-2.6.9
# find /path/to/cluster -name '*.patch' | xargs cat | patch -t -p1
it seems ok, but no following options in .config
# GFS-specific
CONFIG_LOCK_HARNESS=m
CONFIG_GFS_FS=m
CONFIG_LOCK_NOLOCK=m
CONFIG_LOCK_DLM=m
CONFIG_LOCK_GULM=m

further more, when I run above instructions in linux-2.6.12 source
tree, it reports many conflicts... see following:
[root at buckupy linux-2.6.12]# find /home/share/cluster-1.00.00 -name
'*.patch' | xargs cat | patch -t -p1
patching file arch/alpha/Kconfig
Hunk #1 succeeded at 608 (offset 8 lines).
patching file arch/arm/Kconfig
Hunk #1 succeeded at 744 (offset 54 lines).
patching file arch/arm26/Kconfig
Hunk #1 FAILED at 222.
1 out of 1 hunk FAILED -- saving rejects to file arch/arm26/Kconfig.rej
patching file arch/cris/Kconfig
Hunk #1 succeeded at 178 (offset 4 lines).
patching file arch/i386/Kconfig
Hunk #1 succeeded at 1263 with fuzz 2 (offset 69 lines).
patching file arch/ia64/Kconfig
Hunk #1 succeeded at 441 (offset 51 lines).
patching file arch/m68k/Kconfig
Hunk #1 succeeded at 668 (offset 13 lines).
patching file arch/mips/Kconfig
Hunk #1 FAILED at 1563.

Thanks!!

Q.L

> --
> 
> patrick
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>



From pcaulfie at redhat.com  Tue Aug  2 11:33:06 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 02 Aug 2005 12:33:06 +0100
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <359782e70508020344510f473a@mail.gmail.com>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>	<359782e705080202484184f323@mail.gmail.com>	<42EF44EE.7020804@redhat.com>	<359782e7050802030967157c38@mail.gmail.com>	<42EF4816.1090200@redhat.com>	<359782e70508020323165f669@mail.gmail.com>	<42EF4B1A.2030405@redhat.com>
	<359782e70508020344510f473a@mail.gmail.com>
Message-ID: <42EF59F2.2060305@redhat.com>

Q.L wrote:

> Thanks! and this time I ran:
> # cd /usr/src/linux-2.6.9
> # find /path/to/cluster -name '*.patch' | xargs cat | patch -t -p1


two things.

1. I didn't say anything about patching, just ./configure  && make
2. I also said it works against a 2.6.12.2 kernel, not 2.6.9


-- 

patrick



From teigland at redhat.com  Tue Aug  2 11:47:58 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 2 Aug 2005 19:47:58 +0800
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <359782e70508020344510f473a@mail.gmail.com>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>
	<359782e705080202484184f323@mail.gmail.com>
	<42EF44EE.7020804@redhat.com>
	<359782e7050802030967157c38@mail.gmail.com>
	<42EF4816.1090200@redhat.com>
	<359782e70508020323165f669@mail.gmail.com>
	<42EF4B1A.2030405@redhat.com>
	<359782e70508020344510f473a@mail.gmail.com>
Message-ID: <20050802114758.GD11217@redhat.com>

> [root at buckupy linux-2.6.12]# find /home/share/cluster-1.00.00 -name
> '*.patch' | xargs cat | patch -t -p1

We don't do kernel patches any more, ignore what's there; we'll remove
those last few in the next release.  You have to build the modules within
the cluster tree now.

Dave



From natecars at natecarlson.com  Tue Aug  2 12:37:45 2005
From: natecars at natecarlson.com (Nate Carlson)
Date: Tue, 2 Aug 2005 07:37:45 -0500 (CDT)
Subject: [Linux-cluster] [PATCH 00/14] GFS
In-Reply-To: <20050802071828.GA11217@redhat.com>
References: <20050802071828.GA11217@redhat.com>
Message-ID: <Pine.LNX.4.61.0508020737100.31542@tungsten.technicality.org>

On Tue, 2 Aug 2005, David Teigland wrote:
> Hi, GFS (Global File System) is a cluster file system that we'd like to 
> see added to the kernel.  The 14 patches total about 900K so I won't 
> send them to the list unless that's requested.  Comments and suggestions 
> are welcome.  Thanks
>
> http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050801/broken-out/

I see that these patches are for GFS2.. does this mean that GFS2 is ready 
for prime time?

------------------------------------------------------------------------
| nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------



From djani22 at dynamicweb.hu  Tue Aug  2 16:16:15 2005
From: djani22 at dynamicweb.hu (djani22 at dynamicweb.hu)
Date: Tue, 2 Aug 2005 18:16:15 +0200
Subject: [Linux-cluster] gnbd question
References: <1122972593.8420.7.camel@localhost.localdomain>
Message-ID: <010c01c5977d$8f2e2780$0400a8c0@LocalHost>

Hi all,

 I want to know is there a way to speed up gnbd-server?

I have a "big" (~8TB) free web store based on (g)nbd.

(4 server + 1 client)

When I try to read linear the one big device, it can generate 380-400 Mb/s
transfers on Gig-Eth.
But when I start the web serving from it, the result is only 80-90Mb/s , and
the client's load goes up to 100-150!

This is the best performance what I can set up and the settings are these:

Scheduler: deadline
nr_requests: 255
read_ahead_kb: 0

iosched:
front_merges: 0
read_expire: 50

The all server's load is always  ~1.00 (0.95-1.14).

Is there a way to stress more the servers?
(And with small modification the code?)

GNBD 1.0.0
no cluster, fs: xfs


I try the nbd, enbd, anbd, and gnbd, but the gnbd's stability is the best!
;-)
The another nbds always generate a deadlock for me...

(Sorry for my english)

Thanks

Janos Haar
(Hungary)





From haydar2906 at hotmail.com  Tue Aug  2 17:08:27 2005
From: haydar2906 at hotmail.com (haydar Ali)
Date: Tue, 02 Aug 2005 13:08:27 -0400
Subject: [Linux-cluster] GFS : important questions
Message-ID: <BAY104-F28E61C4E7F3C64FCB9791CC5C20@phx.gbl>

Hi,

Now, we have 3 servers HP Proliant 380 G3 (RedHat Advanced Server 3) 
attached by 2 fiber channels each to the storage area network SAN HP MSA1000 
and we want to install and configure GFS to allow 2 servers to 
simultaneously read and write to a single shared file system (Word documents 
located into /u04) located on the Storage area network SAN HP MSA1000.
I know that I have to install GFS on the 3 nodes and one of them will be a 
master and the 2 others will be the slaves.

My questions are:
1 - If one of the GFS slaves RAC1 or RAC2 is down, will be the other slave 
server able to access to the shared file system /u04?
2 - If the master is down, will be the both GFS slave servers able to access 
to the shared file system /u04 simultaneously?

Thanks for your help
Cheers!


Haydar




From natecars at natecarlson.com  Tue Aug  2 18:50:17 2005
From: natecars at natecarlson.com (Nate Carlson)
Date: Tue, 2 Aug 2005 13:50:17 -0500 (CDT)
Subject: [Linux-cluster] CLVM and Snapshots?
Message-ID: <Pine.LNX.4.61.0508021346530.25673@tungsten.technicality.org>

Hey all,

I'm curious if it should be possible to take a snapshot of a filesystem 
sitting on top of CLVM (*not* a shared filesystem like GFS; something like 
XFS or ext3, but still on a shared block device). I recall trying it a 
couple weeks ago, and it failing miserably, but don't recall why, and 
figured I'd ask if it should work before experimenting again.

If it should work, a few questions:
- Should I be able to create the snapshot on any node, or just the node
   that is using the LV that I want to create a snapshot of?
- Is the syntax identical to normal linux snapshots?

Thanks!

------------------------------------------------------------------------
| nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------



From natecars at natecarlson.com  Tue Aug  2 19:10:25 2005
From: natecars at natecarlson.com (Nate Carlson)
Date: Tue, 2 Aug 2005 14:10:25 -0500 (CDT)
Subject: [Linux-cluster] CLVM and Snapshots?
In-Reply-To: <Pine.LNX.4.61.0508021346530.25673@tungsten.technicality.org>
References: <Pine.LNX.4.61.0508021346530.25673@tungsten.technicality.org>
Message-ID: <Pine.LNX.4.61.0508021355500.25673@tungsten.technicality.org>

On Tue, 2 Aug 2005, Nate Carlson wrote:
> I'm curious if it should be possible to take a snapshot of a filesystem 
> sitting on top of CLVM (*not* a shared filesystem like GFS; something 
> like XFS or ext3, but still on a shared block device). I recall trying 
> it a couple weeks ago, and it failing miserably, but don't recall why, 
> and figured I'd ask if it should work before experimenting again.
>
> If it should work, a few questions:
> - Should I be able to create the snapshot on any node, or just the node
>  that is using the LV that I want to create a snapshot of?
> - Is the syntax identical to normal linux snapshots?
>
> Thanks!

OK, decided to try it.. here's what I get:

xen1:~# lvcreate -L 1G -s -n snaptest /dev/XenSystemDisks/iron
   Error locking on node nitrogen: Internal lvm error, check syslog
   Aborting. Failed to activate snapshot exception store. Remove new LV and retry.

Error on Nitrogen:

Aug  2 14:09:05 nitrogen lvm[295]: Volume group for uuid not found: doCTDCC376pE3g2JA35fNAieVNTWpAC1B3UIGOGQpn5Um5FcmPsa8yQPa0p9o9ZP

Nitrogen does not have access to the PV that this snapshot is on, which 
could be part of the problem.

------------------------------------------------------------------------
| nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------



From alewis at redhat.com  Tue Aug  2 19:14:33 2005
From: alewis at redhat.com (AJ Lewis)
Date: Tue, 2 Aug 2005 14:14:33 -0500
Subject: [Linux-cluster] CLVM and Snapshots?
In-Reply-To: <Pine.LNX.4.61.0508021355500.25673@tungsten.technicality.org>
References: <Pine.LNX.4.61.0508021346530.25673@tungsten.technicality.org>
	<Pine.LNX.4.61.0508021355500.25673@tungsten.technicality.org>
Message-ID: <20050802191433.GP4954@null.msp.redhat.com>

On Tue, Aug 02, 2005 at 02:10:25PM -0500, Nate Carlson wrote:
> On Tue, 2 Aug 2005, Nate Carlson wrote:
> >I'm curious if it should be possible to take a snapshot of a filesystem 
> >sitting on top of CLVM (*not* a shared filesystem like GFS; something 
> >like XFS or ext3, but still on a shared block device). I recall trying 
> >it a couple weeks ago, and it failing miserably, but don't recall why, 
> >and figured I'd ask if it should work before experimenting again.
> >
> >If it should work, a few questions:
> >- Should I be able to create the snapshot on any node, or just the node
> > that is using the LV that I want to create a snapshot of?
> >- Is the syntax identical to normal linux snapshots?
> >
> >Thanks!
> 
> OK, decided to try it.. here's what I get:
> 
> xen1:~# lvcreate -L 1G -s -n snaptest /dev/XenSystemDisks/iron
>   Error locking on node nitrogen: Internal lvm error, check syslog
>   Aborting. Failed to activate snapshot exception store. Remove new LV and 
>   retry.
> 
> Error on Nitrogen:
> 
> Aug  2 14:09:05 nitrogen lvm[295]: Volume group for uuid not found: 
> doCTDCC376pE3g2JA35fNAieVNTWpAC1B3UIGOGQpn5Um5FcmPsa8yQPa0p9o9ZP
> 
> Nitrogen does not have access to the PV that this snapshot is on, which 
> could be part of the problem.

...yeah, you can't do this.  How is the node is managing the snapshot going to
know that blocks changed on the node with the origin?  You need cluster
snapshots, which aren't finished yet.  Do *not* do this - you will seriously
screw yourself over.  

-- 
AJ Lewis                                   Voice:  612-638-0500
Red Hat                                    E-Mail: alewis at redhat.com
One Main Street SE, Suite 209
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050802/1f85e2f0/attachment.sig>

From natecars at natecarlson.com  Tue Aug  2 19:14:52 2005
From: natecars at natecarlson.com (Nate Carlson)
Date: Tue, 2 Aug 2005 14:14:52 -0500 (CDT)
Subject: [Linux-cluster] CLVM and Snapshots?
In-Reply-To: <Pine.LNX.4.61.0508021355500.25673@tungsten.technicality.org>
References: <Pine.LNX.4.61.0508021346530.25673@tungsten.technicality.org>
	<Pine.LNX.4.61.0508021355500.25673@tungsten.technicality.org>
Message-ID: <Pine.LNX.4.61.0508021413310.25673@tungsten.technicality.org>

On Tue, 2 Aug 2005, Nate Carlson wrote:
> OK, decided to try it.. here's what I get:
>
> xen1:~# lvcreate -L 1G -s -n snaptest /dev/XenSystemDisks/iron
>  Error locking on node nitrogen: Internal lvm error, check syslog
>  Aborting. Failed to activate snapshot exception store. Remove new LV and 
> retry.
>
> Error on Nitrogen:
>
> Aug  2 14:09:05 nitrogen lvm[295]: Volume group for uuid not found: 
> doCTDCC376pE3g2JA35fNAieVNTWpAC1B3UIGOGQpn5Um5FcmPsa8yQPa0p9o9ZP
>
> Nitrogen does not have access to the PV that this snapshot is on, which could 
> be part of the problem.

Shut down Nitrogen and the rest of the nodes that do not have access to 
that PV, and get:

xen1:~# lvcreate -L 1G -s -n snaptest /dev/XenSystemDisks/iron
   Error locking on node xen2: Internal lvm error, check syslog
   Error locking on node xen1: Internal lvm error, check syslog
   Problem reactivating origin iron

xen1 dmesg:
[  491.091753] clvmd: page allocation failure. order:0, mode:0xd0
[  491.168600]  [<c013f653>] __alloc_pages+0x2b3/0x430
[  491.232788]  [<c0143379>] kmem_cache_alloc+0x69/0x70
[  491.298194]  [<c68e8044>] alloc_pl+0x34/0x60 [dm_mod]
[  491.364848]  [<c68e8185>] client_alloc_pages+0x25/0x60 [dm_mod]
[  491.442834]  [<c68e8c6d>] kcopyd_client_create+0x6d/0xc0 [dm_mod]
[  491.523111]  [<c6acc8a5>] snapshot_ctr+0x2d5/0x3b0 [dm_snapshot]
[  491.602352]  [<c68e3b59>] dm_table_add_target+0x149/0x200 [dm_mod]
[  491.683980]  [<c68e6690>] populate_table+0x90/0xf0 [dm_mod]
[  491.757601]  [<c68e6758>] table_load+0x68/0x170 [dm_mod]
[  491.827788]  [<c68e7319>] ctl_ioctl+0xf9/0x160 [dm_mod]
[  491.896622]  [<c68e66f0>] table_load+0x0/0x170 [dm_mod]
[  491.965460]  [<c016b598>] do_ioctl+0x58/0x80
[  492.021717]  [<c016b725>] vfs_ioctl+0x65/0x1e0
[  492.080366]  [<c016b907>] sys_ioctl+0x67/0x90
[  492.137764]  [<c0108ec8>] syscall_call+0x7/0xb
[  492.199824] device-mapper: Could not create kcopyd client
[  492.271951] device-mapper: error adding target to table

xen2 dmesg:
[  497.254501] clvmd: page allocation failure. order:0, mode:0xd0
[  497.331260]  [<c013f653>] __alloc_pages+0x2b3/0x430
[  497.395454]  [<c0143379>] kmem_cache_alloc+0x69/0x70
[  497.460858]  [<c68e8044>] alloc_pl+0x34/0x60 [dm_mod]
[  497.527518]  [<c68e8185>] client_alloc_pages+0x25/0x60 [dm_mod]
[  497.605612]  [<c68e8c6d>] kcopyd_client_create+0x6d/0xc0 [dm_mod]
[  497.685994]  [<c6acc8a5>] snapshot_ctr+0x2d5/0x3b0 [dm_snapshot]
[  497.765238]  [<c68e3b59>] dm_table_add_target+0x149/0x200 [dm_mod]
[  497.846871]  [<c68e6690>] populate_table+0x90/0xf0 [dm_mod]
[  497.920384]  [<c68e6758>] table_load+0x68/0x170 [dm_mod]
[  497.990365]  [<c68e7319>] ctl_ioctl+0xf9/0x160 [dm_mod]
[  498.059305]  [<c68e66f0>] table_load+0x0/0x170 [dm_mod]
[  498.128602]  [<c016b598>] do_ioctl+0x58/0x80
[  498.185030]  [<c016b725>] vfs_ioctl+0x65/0x1e0
[  498.243674]  [<c016b907>] sys_ioctl+0x67/0x90
[  498.301075]  [<c0108ec8>] syscall_call+0x7/0xb
[  498.363072] device-mapper: Could not create kcopyd client
[  498.434127] device-mapper: error adding target to table


------------------------------------------------------------------------
| nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------



From natecars at natecarlson.com  Tue Aug  2 19:21:26 2005
From: natecars at natecarlson.com (Nate Carlson)
Date: Tue, 2 Aug 2005 14:21:26 -0500 (CDT)
Subject: [Linux-cluster] CLVM and Snapshots?
In-Reply-To: <20050802191433.GP4954@null.msp.redhat.com>
References: <Pine.LNX.4.61.0508021346530.25673@tungsten.technicality.org>
	<Pine.LNX.4.61.0508021355500.25673@tungsten.technicality.org>
	<20050802191433.GP4954@null.msp.redhat.com>
Message-ID: <Pine.LNX.4.61.0508021420100.25673@tungsten.technicality.org>

On Tue, 2 Aug 2005, AJ Lewis wrote:
> ...yeah, you can't do this.  How is the node is managing the snapshot 
> going to know that blocks changed on the node with the origin?  You need 
> cluster snapshots, which aren't finished yet.  Do *not* do this - you 
> will seriously screw yourself over.

Good to know - I won't try doing that anymore, then.  :)

The docs on the cluster snapshot page seemed to indicate that it was only 
necessary for clustered file systems - guess not!

------------------------------------------------------------------------
| nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------



From keith at clearpathit.com  Tue Aug  2 19:24:18 2005
From: keith at clearpathit.com (Keith Grammer)
Date: Tue, 2 Aug 2005 14:24:18 -0500
Subject: [Linux-cluster] CLVM and Snapshots?
In-Reply-To: <Pine.LNX.4.61.0508021420100.25673@tungsten.technicality.org>
Message-ID: <0MKyxe-1E02N92ISu-0007Z6@mrelay.perfora.net>

Please do not the broadcast in this manor. 

 
 
Keith Grammer
Partner
ClearPath IT LLC
713-344-0232
keith at clearpathit.com
 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Nate Carlson
Sent: Tuesday, August 02, 2005 2:21 PM
To: linux clustering
Subject: Re: [Linux-cluster] CLVM and Snapshots?

On Tue, 2 Aug 2005, AJ Lewis wrote:
> ...yeah, you can't do this.  How is the node is managing the snapshot 
> going to know that blocks changed on the node with the origin?  You need 
> cluster snapshots, which aren't finished yet.  Do *not* do this - you 
> will seriously screw yourself over.

Good to know - I won't try doing that anymore, then.  :)

The docs on the cluster snapshot page seemed to indicate that it was only 
necessary for clustered file systems - guess not!

------------------------------------------------------------------------
| nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster



From JACOB_LIBERMAN at Dell.com  Tue Aug  2 19:29:31 2005
From: JACOB_LIBERMAN at Dell.com (JACOB_LIBERMAN at Dell.com)
Date: Tue, 2 Aug 2005 14:29:31 -0500
Subject: [Linux-cluster] GF support across architectures
Message-ID: <BC430F453501174992B9D9E8AFB7519A085671@ausx3mps309.aus.amer.dell.com>

I have 3 hosts: an IA32, IA64, and an EM64T that I would like to
cluster.

Can 3 RHEL 4 hosts running with different architectures participate in
the same cluster? 

Can they both access the same GFS 6.1 file system? 

Many thanks, Jacob



From dawson at fnal.gov  Tue Aug  2 21:31:37 2005
From: dawson at fnal.gov (Troy Dawson)
Date: Tue, 02 Aug 2005 16:31:37 -0500
Subject: [Linux-cluster] GF support across architectures
In-Reply-To: <BC430F453501174992B9D9E8AFB7519A085671@ausx3mps309.aus.amer.dell.com>
References: <BC430F453501174992B9D9E8AFB7519A085671@ausx3mps309.aus.amer.dell.com>
Message-ID: <42EFE639.4000805@fnal.gov>

I have combination of i686 and x86_64 machines all accessing the same 
GFS file system, and they all seem to be happy.  There hasn't been any 
architecture problems at least.
So I would say yes they can.
Troy

JACOB_LIBERMAN at Dell.com wrote:
> I have 3 hosts: an IA32, IA64, and an EM64T that I would like to
> cluster.
> 
> Can 3 RHEL 4 hosts running with different architectures participate in
> the same cluster? 
> 
> Can they both access the same GFS 6.1 file system? 
> 
> Many thanks, Jacob
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster


-- 
__________________________________________________
Troy Dawson  dawson at fnal.gov  (630)840-6468
Fermilab  ComputingDivision/CSS  CSI Group
__________________________________________________



From anu.matthew at bms.com  Tue Aug  2 23:20:55 2005
From: anu.matthew at bms.com (Anu Matthew)
Date: Tue, 02 Aug 2005 19:20:55 -0400
Subject: [Linux-cluster] IP alias over a bonded interface
Message-ID: <42EFFFD7.7070703@bms.com>

Greetings..!!

I have eth0 and eth1 bonded together as bond0 -- Works good, everything 
is fine, redundancy is good etc.

I created an alias bond0:0 and assigned another IP in the same subnet to 
it. This new IP is now pingable from other systems, other subnets etc.

Okay, here it gets interesting:

Traceroutes to any hosts from bond0 succeeds, but traceroutes using 
bond0:0 fails.

[root at linux1 root]# traceroute linux12 -i bond0
traceroute to linux12  (A.B.C.D), 30 hops max, 38 byte packets
 1  linux12 (A.B.C.D)  0.067 ms  0.029 ms  0.019 ms

[root at linux1 root]# traceroute linux12 -i bond0:0
setsockopt: No such device
unable to bind to device: bond0:0


Any ideas on why I cannot traceroute using bond0:0? (BTW, if it were on 
a non_bonded interface, say, eth0:0, it works.).

Thanks in advance,

--~~AM



From rajkum2002 at rediffmail.com  Wed Aug  3 02:34:10 2005
From: rajkum2002 at rediffmail.com (Raj  Kumar)
Date: 3 Aug 2005 02:34:10 -0000
Subject: [Linux-cluster] GFS : important questions
Message-ID: <20050803023410.12171.qmail@webmail46.rediffmail.com>


>My questions are:
>1 - If one of the GFS slaves RAC1 or RAC2 is down, will be the other slave server able to access to the shared file system /u04?
If the slave server was shutdown cleanly, the other slave server will be able to access the shared file system. However, if the slave was fenced off and was not rebooted the locks it holds will not be released. In that case the other slave server may not be able to access all files on /u04. If you have ilo or rilo in your HP servers you can use remote ilo fencing agent. 

>2 - If the master is down, will be the both GFS slave servers able to access to the shared file system /u04 simultaneously?
No. The slave servers will not be able to access the shared file system if the master is down!

this is my two cents knowledge... experts please add to this!!

Good luck!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050803/83ac52a7/attachment.htm>

From haydar2906 at hotmail.com  Wed Aug  3 02:54:30 2005
From: haydar2906 at hotmail.com (haydar Ali)
Date: Tue, 02 Aug 2005 22:54:30 -0400
Subject: [Linux-cluster] GFS : important questions
In-Reply-To: <20050803023410.12171.qmail@webmail46.rediffmail.com>
Message-ID: <BAY104-F8B3BE4D25321D8C8C52DAC5C50@phx.gbl>

Thanks Raj
Very Kind

Haydar

>From: "Raj  Kumar" <rajkum2002 at rediffmail.com>
>Reply-To: "Raj  Kumar" <rajkum2002 at rediffmail.com>
>To: "linux clustering" <linux-cluster at redhat.com>
>CC: "haydar Ali" <haydar2906 at hotmail.com>
>Subject: Re: [Linux-cluster] GFS : important questions
>Date: 3 Aug 2005 02:34:10 -0000
>
>
> >My questions are:
> >1 - If one of the GFS slaves RAC1 or RAC2 is down, will be the other 
>slave server able to access to the shared file system /u04?
>If the slave server was shutdown cleanly, the other slave server will be 
>able to access the shared file system. However, if the slave was fenced off 
>and was not rebooted the locks it holds will not be released. In that case 
>the other slave server may not be able to access all files on /u04. If you 
>have ilo or rilo in your HP servers you can use remote ilo fencing agent.
>
> >2 - If the master is down, will be the both GFS slave servers able to 
>access to the shared file system /u04 simultaneously?
>No. The slave servers will not be able to access the shared file system if 
>the master is down!
>
>this is my two cents knowledge... experts please add to this!!
>
>Good luck!




From teigland at redhat.com  Wed Aug  3 03:56:18 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 3 Aug 2005 11:56:18 +0800
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <1122968724.3247.22.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
Message-ID: <20050803035618.GB9812@redhat.com>

On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

> * The on disk structures are defined in terms of uint32_t and friends,
> which are NOT endian neutral. Why are they not le32/be32 and thus
> endian-defined? Did you run bitwise-sparse on GFS yet ?

GFS has had proper endian handling for many years, it's still correct as
far as we've been able to test.  I ran bitwise-sparse yesterday and didn't
find anything alarming.

> * None of your on disk structures are packet. Are you sure?

Quite, particular attention has been paid to aligning the structure
fields, you'll find "pad" fields throughout.  We'll write a quick test to
verify that packing doesn't change anything.

> +#define gfs2_16_to_cpu be16_to_cpu
> +#define gfs2_32_to_cpu be32_to_cpu
> +#define gfs2_64_to_cpu be64_to_cpu
> 
> why this pointless abstracting?

#ifdef GFS2_ENDIAN_BIG

#define gfs2_16_to_cpu be16_to_cpu
#define gfs2_32_to_cpu be32_to_cpu
#define gfs2_64_to_cpu be64_to_cpu

#define cpu_to_gfs2_16 cpu_to_be16
#define cpu_to_gfs2_32 cpu_to_be32
#define cpu_to_gfs2_64 cpu_to_be64

#else /* GFS2_ENDIAN_BIG */

#define gfs2_16_to_cpu le16_to_cpu
#define gfs2_32_to_cpu le32_to_cpu
#define gfs2_64_to_cpu le64_to_cpu

#define cpu_to_gfs2_16 cpu_to_le16
#define cpu_to_gfs2_32 cpu_to_le32
#define cpu_to_gfs2_64 cpu_to_le64

#endif /* GFS2_ENDIAN_BIG */

The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
on-disk instead of LE which is another useful way to verify endian
correctness.

You should be able to use gfs in mixed architecture and mixed endian
clusters.  We don't have a mixed endian cluster to test, though.

> * +static const uint32_t crc_32_tab[] = .....
> why do you duplicate this? The kernel has a perfectly good set of generic
> crc32 tables/functions just fine

We'll try them, they'll probably do fine.

> * Why use your own journalling layer and not say ... jbd ?

Here's an analysis of three approaches to cluster-fs journaling and their
pros/cons (including using jbd):  http://tinyurl.com/7sbqq

> * +	while (!kthread_should_stop()) {
> +		gfs2_scand_internal(sdp);
> +
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
> +	}
> 
> you probably really want to check for signals if you do interruptible sleeps

I don't know why we'd be interested in signals here.

> * why not use msleep() and friends instead of schedule_timeout(), you're
> not using the complex variants anyway

When unmounting we really appreciate waking up more often than the
timeout, otherwise the unmount sits and waits for the longest daemon's
msleep to complete.  I converted this to msleep recently but it was too
painful and had to go back.

We'll get to your other comments, thanks.
Dave



From oldmoonster at gmail.com  Wed Aug  3 05:21:41 2005
From: oldmoonster at gmail.com (Q.L)
Date: Wed, 3 Aug 2005 13:21:41 +0800
Subject: [Linux-cluster] Compile Issues
In-Reply-To: <20050802114758.GD11217@redhat.com>
References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local>
	<359782e705080202484184f323@mail.gmail.com>
	<42EF44EE.7020804@redhat.com>
	<359782e7050802030967157c38@mail.gmail.com>
	<42EF4816.1090200@redhat.com>
	<359782e70508020323165f669@mail.gmail.com>
	<42EF4B1A.2030405@redhat.com>
	<359782e70508020344510f473a@mail.gmail.com>
	<20050802114758.GD11217@redhat.com>
Message-ID: <359782e705080222217bf4f733@mail.gmail.com>

I began to love GFS more and more. :-)

Thanks,
Q.L

On 8/2/05, David Teigland <teigland at redhat.com> wrote:
> > [root at buckupy linux-2.6.12]# find /home/share/cluster-1.00.00 -name
> > '*.patch' | xargs cat | patch -t -p1
> 
> We don't do kernel patches any more, ignore what's there; we'll remove
> those last few in the next release.  You have to build the modules within
> the cluster tree now.
> 
> Dave
> 
>



From teigland at redhat.com  Wed Aug  3 06:36:44 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 3 Aug 2005 14:36:44 +0800
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <84144f0205080203163cab015c@mail.gmail.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
Message-ID: <20050803063644.GD9812@redhat.com>

On Tue, Aug 02, 2005 at 01:16:53PM +0300, Pekka Enberg wrote:

> > +void *gmalloc_nofail_real(unsigned int size, int flags, char *file,
> > +			  unsigned int line)
> > +{
> > +	void *x;
> > +	for (;;) {
> > +		x = kmalloc(size, flags);
> > +		if (x)
> > +			return x;
> > +		if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) {
> > +			printk("GFS2: out of memory: %s, %u\n",
> > +			       __FILE__, __LINE__);
> > +			gfs2_malloc_warning = jiffies;
> > +		}
> > +		yield();
> 
> This does not belong in a filesystem. It also seems like a very bad
> idea. What are you trying to do here? If you absolutely must not fail,
> use __GFP_NOFAIL instead.

will do, carried over from before NOFAIL existed

> -mm has memory leak detection patches and there are others floating
> around. Please do not introduce yet another subsystem-specific debug
> allocator.

ok, thanks
Dave



From teigland at redhat.com  Wed Aug  3 10:08:49 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 3 Aug 2005 18:08:49 +0800
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <1123060630.3363.10.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050803035618.GB9812@redhat.com>
	<1123060630.3363.10.camel@laptopd505.fenrus.org>
Message-ID: <20050803100849.GE9812@redhat.com>

On Wed, Aug 03, 2005 at 11:17:09AM +0200, Arjan van de Ven wrote:
> On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote:
> > The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
> > on-disk instead of LE which is another useful way to verify endian
> > correctness.
> 
> that sounds wrong to be a compile option. If you really want to deal
> with dual disk endianness it really ought to be a runtime one (see jffs2
> for example).

We don't want BE to be an "option" per se; as developers we'd just like to
be able to compile it that way to verify gfs's endianness handling.  If
you think that's unmaintainable or a bad idea we'll rip it out.

> > > * +	while (!kthread_should_stop()) {
> > > +		gfs2_scand_internal(sdp);
> > > +
> > > +		set_current_state(TASK_INTERRUPTIBLE);
> > > +		schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
> > > 
> > > you probably really want to check for signals if you do
> > > interruptible sleeps
> > 
> > I don't know why we'd be interested in signals here.
> 
> well.. because if you don't your schedule_timeout becomes a nop when you
> get one, which makes your loop a busy waiting one.

OK, it looks like we need to block/flush signals a la daemonize(); I guess
I mistakenly figured the kthread routines did everything daemonize did.
Thanks,
Dave



From lmb at suse.de  Wed Aug  3 10:37:44 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Wed, 3 Aug 2005 12:37:44 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050803035618.GB9812@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050803035618.GB9812@redhat.com>
Message-ID: <20050803103744.GG11081@marowsky-bree.de>

On 2005-08-03T11:56:18, David Teigland <teigland at redhat.com> wrote:

> > * Why use your own journalling layer and not say ... jbd ?
> Here's an analysis of three approaches to cluster-fs journaling and their
> pros/cons (including using jbd):  http://tinyurl.com/7sbqq

Very instructive read, thanks for the link.



-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From addi at hugsmidjan.is  Wed Aug  3 11:58:47 2005
From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=)
Date: Wed, 03 Aug 2005 11:58:47 +0000
Subject: [Linux-cluster] Fencing agents
Message-ID: <42F0B177.7050907@hugsmidjan.is>

I'm implementing a shared storage between multiple (2 at the moment) 
Blade machines (Dell PowerEdge 1855) running RHEL4 ES connected to a EMC 
AX100 through FC.

The SAN has two FC ports so the need for a FC Switch has not yet come 
however we will add other Blades in the coming months.
The one thing I haven't got figured out with GFS and the Cluster-Suite 
is the whole idea about fencing.

We have a working setup using Centos rebuilds of the Cluster-Suite and 
GFS (http://rpm.karan.org/el4/csgfs/) which we are not planning to use 
in the final implementation where we plan to use the official GFS 
packages from Red Hat.
The fencing agents in that setup is manual fencing.

Both machines have the file system mounted and there appears to be no 
problems.

What does "automatic" fencing have to offer that the manual fencing lacks.
If we decide to buy the FC switch right away is it recomended that we 
buy one of the ones that have fencing agent available for the 
Cluster-Suite ?

If can't get our hands on supported FC switchs can we do fencing in 
another manner than throught a FC switch ?




-- 
S?valdur Gunnarsson :: Hugsmi?jan



From arjan at infradead.org  Tue Aug  2 07:45:24 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Tue, 02 Aug 2005 09:45:24 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050802071828.GA11217@redhat.com>
References: <20050802071828.GA11217@redhat.com>
Message-ID: <1122968724.3247.22.camel@laptopd505.fenrus.org>

On Tue, 2005-08-02 at 15:18 +0800, David Teigland wrote:
> Hi, GFS (Global File System) is a cluster file system that we'd like to
> see added to the kernel.  The 14 patches total about 900K so I won't send
> them to the list unless that's requested.  Comments and suggestions are
> welcome.  Thanks
> 
> http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050801/broken-out/

* The on disk structures are defined in terms of uint32_t and friends,
which are NOT endian neutral. Why are they not le32/be32 and thus
endian-defined? Did you run bitwise-sparse on GFS yet ?

* None of your on disk structures are packet. Are you sure?

* 
+#define gfs2_16_to_cpu be16_to_cpu
+#define gfs2_32_to_cpu be32_to_cpu
+#define gfs2_64_to_cpu be64_to_cpu

why this pointless abstracting?

* +static const uint32_t crc_32_tab[] = .....

why do you duplicate this? The kernel has a perfectly good set of generic crc32 tables/functions just fine

* Why are you using bufferheads extensively in a new filesystem?

* +	if (create)
+		down_write(&ip->i_rw_mutex);
+	else
+		down_read(&ip->i_rw_mutex);

why do you use a rwsem and not a regular semaphore? You are aware that rwsems are far more expensive than regular ones right?
How skewed is the read/write ratio?

* Why use your own journalling layer and not say ... jbd ?

* +	while (!kthread_should_stop()) {
+		gfs2_scand_internal(sdp);
+
+		set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
+	}

you probably really want to check for signals if you do interruptible sleeps
(multiple places)

* why not use msleep() and friends instead of schedule_timeout(), you're not using the complex variants anyway

* +++ b/fs/gfs2/fixed_div64.h	2005-08-01 14:13:08.009808200 +0800

ehhhh why?

* int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
+		   unsigned int size)
+{
+	int error;
+
+	if (bh)
+		error = copy_to_user(*buf, bh->b_data + offset, size);
+	else
+		error = clear_user(*buf, size);

that looks to be missing a few kmaps.. whats the guarantee that b_data is actually, like in lowmem?

* [PATCH 08/14] GFS: diaper device

The diaper device is a block device within gfs that gets transparently
inserted between the real device the and rest of the filesystem.

hmmmm why not use device mapper or something? Is this really needed? Should it live in drivers/block ? Doesn't
this wrapper just increase the risk for memory deadlocks?

* [PATCH 06/14] GFS: logging and recovery

quoting the ren and stimpy show is nice.. but did the ren ans stimpy authors agree to license their stuff under the GPL?


* do_lock_wait

that almost screams for using wait_event and related APIs


*
+static inline void gfs2_log_lock(struct gfs2_sbd *sdp)
+{
+	spin_lock(&sdp->sd_log_lock);
+}
why the abstraction ?





From arjan at infradead.org  Tue Aug  2 07:45:24 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Tue, 02 Aug 2005 09:45:24 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050802071828.GA11217@redhat.com>
References: <20050802071828.GA11217@redhat.com>
Message-ID: <1122968724.3247.22.camel@laptopd505.fenrus.org>

On Tue, 2005-08-02 at 15:18 +0800, David Teigland wrote:
> Hi, GFS (Global File System) is a cluster file system that we'd like to
> see added to the kernel.  The 14 patches total about 900K so I won't send
> them to the list unless that's requested.  Comments and suggestions are
> welcome.  Thanks
> 
> http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050801/broken-out/

* The on disk structures are defined in terms of uint32_t and friends,
which are NOT endian neutral. Why are they not le32/be32 and thus
endian-defined? Did you run bitwise-sparse on GFS yet ?

* None of your on disk structures are packet. Are you sure?

* 
+#define gfs2_16_to_cpu be16_to_cpu
+#define gfs2_32_to_cpu be32_to_cpu
+#define gfs2_64_to_cpu be64_to_cpu

why this pointless abstracting?

* +static const uint32_t crc_32_tab[] = .....

why do you duplicate this? The kernel has a perfectly good set of generic crc32 tables/functions just fine

* Why are you using bufferheads extensively in a new filesystem?

* +	if (create)
+		down_write(&ip->i_rw_mutex);
+	else
+		down_read(&ip->i_rw_mutex);

why do you use a rwsem and not a regular semaphore? You are aware that rwsems are far more expensive than regular ones right?
How skewed is the read/write ratio?

* Why use your own journalling layer and not say ... jbd ?

* +	while (!kthread_should_stop()) {
+		gfs2_scand_internal(sdp);
+
+		set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
+	}

you probably really want to check for signals if you do interruptible sleeps
(multiple places)

* why not use msleep() and friends instead of schedule_timeout(), you're not using the complex variants anyway

* +++ b/fs/gfs2/fixed_div64.h	2005-08-01 14:13:08.009808200 +0800

ehhhh why?

* int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
+		   unsigned int size)
+{
+	int error;
+
+	if (bh)
+		error = copy_to_user(*buf, bh->b_data + offset, size);
+	else
+		error = clear_user(*buf, size);

that looks to be missing a few kmaps.. whats the guarantee that b_data is actually, like in lowmem?

* [PATCH 08/14] GFS: diaper device

The diaper device is a block device within gfs that gets transparently
inserted between the real device the and rest of the filesystem.

hmmmm why not use device mapper or something? Is this really needed? Should it live in drivers/block ? Doesn't
this wrapper just increase the risk for memory deadlocks?

* [PATCH 06/14] GFS: logging and recovery

quoting the ren and stimpy show is nice.. but did the ren ans stimpy authors agree to license their stuff under the GPL?


* do_lock_wait

that almost screams for using wait_event and related APIs


*
+static inline void gfs2_log_lock(struct gfs2_sbd *sdp)
+{
+	spin_lock(&sdp->sd_log_lock);
+}
why the abstraction ?



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



From penberg at gmail.com  Tue Aug  2 10:16:53 2005
From: penberg at gmail.com (Pekka Enberg)
Date: Tue, 2 Aug 2005 13:16:53 +0300
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050802071828.GA11217@redhat.com>
References: <20050802071828.GA11217@redhat.com>
Message-ID: <84144f0205080203163cab015c@mail.gmail.com>

Hi David,

On 8/2/05, David Teigland <teigland at redhat.com> wrote:
> Hi, GFS (Global File System) is a cluster file system that we'd like to
> see added to the kernel.  The 14 patches total about 900K so I won't send
> them to the list unless that's requested.  Comments and suggestions are
> welcome.  Thanks

> +#define kmalloc_nofail(size, flags) \
> +	gmalloc_nofail((size), (flags), __FILE__, __LINE__)

[snip]

> +void *gmalloc_nofail_real(unsigned int size, int flags, char *file,
> +			  unsigned int line)
> +{
> +	void *x;
> +	for (;;) {
> +		x = kmalloc(size, flags);
> +		if (x)
> +			return x;
> +		if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) {
> +			printk("GFS2: out of memory: %s, %u\n",
> +			       __FILE__, __LINE__);
> +			gfs2_malloc_warning = jiffies;
> +		}
> +		yield();

This does not belong in a filesystem. It also seems like a very bad
idea. What are you trying to do here? If you absolutely must not fail,
use __GFP_NOFAIL instead.

> +	}
> +}
> +
> +#if defined(GFS2_MEMORY_SIMPLE)
> +
> +atomic_t gfs2_memory_count;
> +
> +void gfs2_memory_add_i(void *data, char *file, unsigned int line)
> +{
> +	atomic_inc(&gfs2_memory_count);
> +}
> +
> +void gfs2_memory_rm_i(void *data, char *file, unsigned int line)
> +{
> +	if (data)
> +		atomic_dec(&gfs2_memory_count);
> +}
> +
> +void *gmalloc(unsigned int size, int flags, char *file, unsigned int line)
> +{
> +	void *data = kmalloc(size, flags);
> +	if (data)
> +		atomic_inc(&gfs2_memory_count);
> +	return data;
> +}
> +
> +void *gmalloc_nofail(unsigned int size, int flags, char *file,
> +		     unsigned int line)
> +{
> +	atomic_inc(&gfs2_memory_count);
> +	return gmalloc_nofail_real(size, flags, file, line);
> +}
> +
> +void gfree(void *data, char *file, unsigned int line)
> +{
> +	if (data) {
> +		atomic_dec(&gfs2_memory_count);
> +		kfree(data);
> +	}
> +}

-mm has memory leak detection patches and there are others floating
around. Please do not introduce yet another subsystem-specific debug allocator.

                                    Pekka



From jengelh at linux01.gwdg.de  Tue Aug  2 14:57:11 2005
From: jengelh at linux01.gwdg.de (Jan Engelhardt)
Date: Tue, 2 Aug 2005 16:57:11 +0200 (MEST)
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <1122968724.3247.22.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
Message-ID: <Pine.LNX.4.61.0508021655580.4138@yvahk01.tjqt.qr>


>* Why use your own journalling layer and not say ... jbd ?

Why does reiser use its own journalling layer and not say ... jbd ?



Jan Engelhardt
-- 



From arjan at infradead.org  Tue Aug  2 15:02:52 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Tue, 02 Aug 2005 17:02:52 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <Pine.LNX.4.61.0508021655580.4138@yvahk01.tjqt.qr>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<Pine.LNX.4.61.0508021655580.4138@yvahk01.tjqt.qr>
Message-ID: <1122994972.3247.31.camel@laptopd505.fenrus.org>

On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote:
> >* Why use your own journalling layer and not say ... jbd ?
> 
> Why does reiser use its own journalling layer and not say ... jbd ?

because reiser got merged before jbd. Next question.

Now the question for GFS is still a valid one; there might be reasons to
not use it (which is fair enough) but if there's no real reason then
using jdb sounds a lot better given it's maturity (and it is used by 2
filesystems in -mm already).





From reiser at namesys.com  Wed Aug  3 01:00:02 2005
From: reiser at namesys.com (Hans Reiser)
Date: Tue, 02 Aug 2005 18:00:02 -0700
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <1122994972.3247.31.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>	
	<1122968724.3247.22.camel@laptopd505.fenrus.org>	
	<Pine.LNX.4.61.0508021655580.4138@yvahk01.tjqt.qr>
	<1122994972.3247.31.camel@laptopd505.fenrus.org>
Message-ID: <42F01712.2030105@namesys.com>

Arjan van de Ven wrote:

>On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote:
>  
>
>>>* Why use your own journalling layer and not say ... jbd ?
>>>      
>>>
>>Why does reiser use its own journalling layer and not say ... jbd ?
>>    
>>
>
>because reiser got merged before jbd. Next question.
>  
>
That is the wrong reason.  We use our own journaling layer for the
reason that Vivaldi used his own melody.

I don't know anything about GFS, but expecting a filesystem author to
use a journaling layer he does not want to is a bit arrogant.  Now, if
you got into details, and said jbd does X, Y and Z, and GFS does the
same X and Y, and does not do Z as well as jbd, that would be a more
serious comment.  He might want to look at how reiser4 does wandering
logs instead of using jbd..... but I would never claim that for sure
some other author should be expected to use it.....  and something like
changing one's journaling system is not something to do just before a
merge.....

>Now the question for GFS is still a valid one; there might be reasons to
>not use it (which is fair enough) but if there's no real reason then
>using jdb sounds a lot better given it's maturity (and it is used by 2
>filesystems in -mm already).
>
>
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo at vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>
>
>  
>



From mrmacman_g4 at mac.com  Wed Aug  3 04:07:38 2005
From: mrmacman_g4 at mac.com (Kyle Moffett)
Date: Wed, 3 Aug 2005 00:07:38 -0400
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <42F01712.2030105@namesys.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<Pine.LNX.4.61.0508021655580.4138@yvahk01.tjqt.qr>
	<1122994972.3247.31.camel@laptopd505.fenrus.org>
	<42F01712.2030105@namesys.com>
Message-ID: <4CBCB111-36B9-4F8C-9A3F-A9126ADE1CA2@mac.com>

On Aug 2, 2005, at 21:00:02, Hans Reiser wrote:
> Arjan van de Ven wrote:
>> because reiser got merged before jbd. Next question.
> That is the wrong reason.  We use our own journaling layer for the
> reason that Vivaldi used his own melody.
>
> I don't know anything about GFS, but expecting a filesystem author to
> use a journaling layer he does not want to is a bit arrogant.  Now, if
> you got into details, and said jbd does X, Y and Z, and GFS does the
> same X and Y, and does not do Z as well as jbd, that would be a more
> serious comment.  He might want to look at how reiser4 does wandering
> logs instead of using jbd..... but I would never claim that for sure
> some other author should be expected to use it.....  and something  
> like
> changing one's journaling system is not something to do just before a
> merge.....

I don't want to start another big reiser4 flamewar, but...

"I don't know anything about Reiser4, but expecting a filesystem author
to use a VFS layer he does not want to is a bit arrogant.  Now, if you
got into details, and said the linux VFS does X, Y, and Z, and Reiser4
does..."

Do you see my point here?  If every person who added new kernel code
just wrote their own thing without checking to see if it had already
been done before, then there would be a lot of poorly maintained code
in the kernel.  If a journalling layer already exists, _new_ journaled
filesystems should either (A) use the layer as is, or (B) fix the layer
so it has sufficient functionality for them to use, and submit patches.
That way if somebody later says, "Ah, crap, there's a bug in the kernel
journalling layer", and fixes it, there are not eight other filesystems
with their own open-coded layers that need to be audited for similar
mistakes.

This is similar to why some kernel developers did not like the Reiser4
code, because it implemented some private layers that looked kinda like
stuff the VFS should be doing  (Again, I don't want to get into that
argument again, I'm just bringing up the similarities to clarify _this_
particular point, as that one has been beaten to death enough already).

>> Now the question for GFS is still a valid one; there might be  
>> reasons to
>> not use it (which is fair enough) but if there's no real reason then
>> using jdb sounds a lot better given it's maturity (and it is used  
>> by 2
>> filesystems in -mm already).

Personally, I am of the opinion that if GFS cannot use jdb, the  
developers
ought to clarify why it isn't useable, and possibly submit fixes to make
it useful, so that others can share the benefits.

Cheers,
Kyle Moffett

--
I lost interest in "blade servers" when I found they didn't throw  
knives at
people who weren't supposed to be in your machine room.
   -- Anthony de Boer




From jengelh at linux01.gwdg.de  Wed Aug  3 06:37:19 2005
From: jengelh at linux01.gwdg.de (Jan Engelhardt)
Date: Wed, 3 Aug 2005 08:37:19 +0200 (MEST)
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <4CBCB111-36B9-4F8C-9A3F-A9126ADE1CA2@mac.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<Pine.LNX.4.61.0508021655580.4138@yvahk01.tjqt.qr>
	<1122994972.3247.31.camel@laptopd505.fenrus.org>
	<42F01712.2030105@namesys.com>
	<4CBCB111-36B9-4F8C-9A3F-A9126ADE1CA2@mac.com>
Message-ID: <Pine.LNX.4.61.0508030826000.2263@yvahk01.tjqt.qr>


>> > because reiser got merged before jbd. Next question.
>>
>> That is the wrong reason.  We use our own journaling layer for the
>> reason that Vivaldi used his own melody.
>> 
>> [...] He might want to look at how reiser4 does wandering
>> logs instead of using jbd..... but I would never claim that for sure
>> some other author should be expected to use it.....  and something like
>> changing one's journaling system is not something to do just before a
>> merge.....
>
> Do you see my point here?  If every person who added new kernel code
> just wrote their own thing without checking to see if it had already
> been done before, then there would be a lot of poorly maintained code
> in the kernel.  If a journalling layer already exists, _new_ journaled
> filesystems should either (A) use the layer as is, or (B) fix the layer
> so it has sufficient functionality for them to use, and submit patches.

Maybe jbd 'sucks' for something 'cool' like reiser*, and modifying jbd to be 
'eleet enough' for reiser* would overwhelm ext.

Lastly, there is the 'political' thing, when a <your-favorite-jbd-fs>-only 
specific change to jbd is rejected by all other jbd-using fs. (Basically the 
situation thing that leads to software forks, in any area.)



Jan Engelhardt
-- 



From penberg at gmail.com  Wed Aug  3 06:44:06 2005
From: penberg at gmail.com (Pekka Enberg)
Date: Wed, 3 Aug 2005 09:44:06 +0300
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050802071828.GA11217@redhat.com>
References: <20050802071828.GA11217@redhat.com>
Message-ID: <84144f0205080223445375c907@mail.gmail.com>

Hi David,

Some more comments below.

                                Pekka

On 8/2/05, David Teigland <teigland at redhat.com> wrote:
> +/**
> + * inode_create - create a struct gfs2_inode
> + * @i_gl: The glock covering the inode
> + * @inum: The inode number
> + * @io_gl: the iopen glock to acquire/hold (using holder in new gfs2_inode)
> + * @io_state: the state the iopen glock should be acquired in
> + * @ipp: pointer to put the returned inode in
> + *
> + * Returns: errno
> + */
> +
> +static int inode_create(struct gfs2_glock *i_gl, struct gfs2_inum *inum,
> +			struct gfs2_glock *io_gl, unsigned int io_state,
> +			struct gfs2_inode **ipp)
> +{
> +	struct gfs2_sbd *sdp = i_gl->gl_sbd;
> +	struct gfs2_inode *ip;
> +	int error = 0;
> +
> +	RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL), ip);

Why do you want to do this? The callers can handle ENOMEM just fine.

> +/**
> + * gfs2_random - Generate a random 32-bit number
> + *
> + * Generate a semi-crappy 32-bit pseudo-random number without using
> + * floating point.
> + *
> + * The PRNG is from "Numerical Recipes in C" (second edition), page 284.
> + *
> + * Returns: a 32-bit random number
> + */
> +
> +uint32_t gfs2_random(void)
> +{
> +	gfs2_random_number = 0x0019660D * gfs2_random_number + 0x3C6EF35F;
> +	return gfs2_random_number;
> +}

Please consider moving this into lib/random.c. This one already appears in
drivers/net/hamradio/dmascc.c.

> +/**
> + * gfs2_hash - hash an array of data
> + * @data: the data to be hashed
> + * @len: the length of data to be hashed
> + *
> + * Take some data and convert it to a 32-bit hash.
> + *
> + * This is the 32-bit FNV-1a hash from:
> + * http://www.isthe.com/chongo/tech/comp/fnv/
> + *
> + * Returns: the hash
> + */
> +
> +uint32_t gfs2_hash(const void *data, unsigned int len)
> +{
> +     uint32_t h = 0x811C9DC5;
> +     h = hash_more_internal(data, len, h);
> +     return h;
> +}

Is there a reason why you cannot use <linux/hash.h> or <linux/jhash.h>?

> +void gfs2_sort(void *base, unsigned int num_elem, unsigned int size,
> +            int (*compar) (const void *, const void *))
> +{
> +     register char *pbase = (char *)base;
> +     int i, j, k, h;
> +     static int cols[16] = {1391376, 463792, 198768, 86961,
> +                            33936, 13776, 4592, 1968,
> +                            861, 336, 112, 48,
> +                            21, 7, 3, 1};
> +
> +     for (k = 0; k < 16; k++) {
> +             h = cols[k];
> +             for (i = h; i < num_elem; i++) {
> +                     j = i;
> +                     while (j >= h &&
> +                            (*compar)((void *)(pbase + size * (j - h)),
> +                                      (void *)(pbase + size * j)) > 0) {
> +                             SWAP(pbase + size * j,
> +                                  pbase + size * (j - h),
> +                                  size);
> +                             j = j - h;
> +                     }
> +             }
> +     }
> +}

Please use sort() from lib/sort.c.

> +/**
> + * gfs2_io_error_inode_i - Flag an inode I/O error and withdraw
> + * @ip:
> + * @function:
> + * @file:
> + * @line:

Please drop empty kerneldoc tags. (Appears in various other places as well.)

> +#define RETRY_MALLOC(do_this, until_this) \
> +for (;;) { \
> +     { do_this; } \
> +     if (until_this) \
> +             break; \
> +     if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
> +             printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \
> +             gfs2_malloc_warning = jiffies; \
> +     } \
> +     yield(); \
> +}

Please drop this.

> +int gfs2_acl_create(struct gfs2_inode *dip, struct gfs2_inode *ip)
> +{
> +             struct gfs2_sbd *sdp = dip->i_sbd;
> +     struct posix_acl *acl = NULL;
> +     struct gfs2_ea_request er;
> +     mode_t mode = ip->i_di.di_mode;
> +     int error;
> +
> +     if (!sdp->sd_args.ar_posix_acl)
> +             return 0;
> +     if (S_ISLNK(ip->i_di.di_mode))
> +             return 0;
> +
> +     memset(&er, 0, sizeof(struct gfs2_ea_request));
> +     er.er_type = GFS2_EATYPE_SYS;
> +
> +     error = acl_get(dip, ACL_DEFAULT, &acl, NULL,
> +                     &er.er_data, &er.er_data_len);
> +     if (error)
> +             return error;
> +     if (!acl) {
> +             mode &= ~current->fs->umask;
> +             if (mode != ip->i_di.di_mode)
> +                     error = munge_mode(ip, mode);
> +             return error;
> +     }
> +
> +     {
> +             struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL);
> +             error = -ENOMEM;
> +             if (!clone)
> +                     goto out;
> +             gfs2_memory_add(clone);
> +             gfs2_memory_rm(acl);
> +             posix_acl_release(acl);
> +             acl = clone;
> +     }

Please make this a real function. It is duplicated below.

> +     if (error > 0) {
> +             er.er_name = GFS2_POSIX_ACL_ACCESS;
> +             er.er_name_len = GFS2_POSIX_ACL_ACCESS_LEN;
> +             posix_acl_to_xattr(acl, er.er_data, er.er_data_len);
> +             er.er_mode = mode;
> +             er.er_flags = GFS2_ERF_MODE;
> +             error = gfs2_system_eaops.eo_set(ip, &er);
> +             if (error)
> +                     goto out;
> +     } else
> +             munge_mode(ip, mode);
> +
> + out:
> +     gfs2_memory_rm(acl);
> +     posix_acl_release(acl);
> +     kfree(er.er_data);
> +
> +             return error;

Whitespace damage.

> +int gfs2_acl_chmod(struct gfs2_inode *ip, struct iattr *attr)
> +{
> +     struct posix_acl *acl = NULL;
> +     struct gfs2_ea_location el;
> +     char *data;
> +     unsigned int len;
> +     int error;
> +
> +     error = acl_get(ip, ACL_ACCESS, &acl, &el, &data, &len);
> +     if (error)
> +             return error;
> +     if (!acl)
> +             return gfs2_setattr_simple(ip, attr);
> +
> +     {
> +             struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL);
> +             error = -ENOMEM;
> +             if (!clone)
> +                     goto out;
> +             gfs2_memory_add(clone);
> +             gfs2_memory_rm(acl);
> +             posix_acl_release(acl);
> +             acl = clone;
> +     }

Duplicated above.

> +static int ea_foreach(struct gfs2_inode *ip, ea_call_t ea_call, void *data)
> +{
> +     struct buffer_head *bh;
> +     int error;
> +
> +     error = gfs2_meta_read(ip->i_gl, ip->i_di.di_eattr,
> +                            DIO_START | DIO_WAIT, &bh);
> +     if (error)
> +             return error;
> +
> +     if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
> +             error = ea_foreach_i(ip, bh, ea_call, data);

goto out here so you can drop the else branch below.

> +     else {
> +             struct buffer_head *eabh;
> +             uint64_t *eablk, *end;
> +
> +             if (gfs2_metatype_check(ip->i_sbd, bh, GFS2_METATYPE_IN)) {
> +                     error = -EIO;
> +                     goto out;
> +             }
> +
> +             eablk = (uint64_t *)(bh->b_data +
> +                                  sizeof(struct gfs2_meta_header));
> +             end = eablk + ip->i_sbd->sd_inptrs;
> +

> +static int ea_find_i(struct gfs2_inode *ip, struct buffer_head *bh,
> +                  struct gfs2_ea_header *ea, struct gfs2_ea_header *prev,
> +                  void *private)
> +{
> +     struct ea_find *ef = (struct ea_find *)private;
> +     struct gfs2_ea_request *er = ef->ef_er;
> +
> +     if (ea->ea_type == GFS2_EATYPE_UNUSED)
> +             return 0;
> +
> +     if (ea->ea_type == er->er_type) {
> +             if (ea->ea_name_len == er->er_name_len &&
> +                 !memcmp(GFS2_EA2NAME(ea), er->er_name, ea->ea_name_len)) {
> +                     struct gfs2_ea_location *el = ef->ef_el;
> +                     get_bh(bh);
> +                     el->el_bh = bh;
> +                     el->el_ea = ea;
> +                     el->el_prev = prev;
> +                     return 1;
> +             }
> +     }
> +
> +#if 0
> +     else if ((ip->i_di.di_flags & GFS2_DIF_EA_PACKED) &&
> +              er->er_type == GFS2_EATYPE_SYS)
> +             return 1;
> +#endif

Please drop commented out code.

> +static int ea_list_i(struct gfs2_inode *ip, struct buffer_head *bh,
> +                  struct gfs2_ea_header *ea, struct gfs2_ea_header *prev,
> +                  void *private)
> +{
> +     struct ea_list *ei = (struct ea_list *)private;

Please drop redundant cast.

> +static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
> +                 struct gfs2_ea_location *el)
> +{
> +     {
> +             struct ea_set es;
> +             int error;
> +
> +             memset(&es, 0, sizeof(struct ea_set));
> +             es.es_er = er;
> +             es.es_el = el;
> +
> +             error = ea_foreach(ip, ea_set_simple, &es);
> +             if (error > 0)
> +                     return 0;
> +             if (error)
> +                     return error;
> +     }
> +     {
> +             unsigned int blks = 2;
> +             if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
> +                     blks++;
> +             if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize)
> +                     blks += DIV_RU(er->er_data_len,
> +                                    ip->i_sbd->sd_jbsize);
> +
> +             return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
> +     }

Please drop the extra braces.



From arjan at infradead.org  Wed Aug  3 09:09:02 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Wed, 03 Aug 2005 11:09:02 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <42F01712.2030105@namesys.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<Pine.LNX.4.61.0508021655580.4138@yvahk01.tjqt.qr>
	<1122994972.3247.31.camel@laptopd505.fenrus.org>
	<42F01712.2030105@namesys.com>
Message-ID: <1123060142.3363.8.camel@laptopd505.fenrus.org>


> I don't know anything about GFS, but expecting a filesystem author to
> use a journaling layer he does not want to is a bit arrogant.

good that I didn't expect that then.
I think it's fair enough to ask people if they can use it. If the answer
is "No because it doesn't fit our model <here>" then that's fine. If the
answer is "eh yeah we could" then I think it's entirely reasonable to
expect people to use common code as opposed to adding new code.




From arjan at infradead.org  Wed Aug  3 09:17:09 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Wed, 03 Aug 2005 11:17:09 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050803035618.GB9812@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050803035618.GB9812@redhat.com>
Message-ID: <1123060630.3363.10.camel@laptopd505.fenrus.org>

On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote:
> The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
> on-disk instead of LE which is another useful way to verify endian
> correctness.

that sounds wrong to be a compile option. If you really want to deal
with dual disk endianness it really ought to be a runtime one (see jffs2
for example).



> > * Why use your own journalling layer and not say ... jbd ?
> 
> Here's an analysis of three approaches to cluster-fs journaling and their
> pros/cons (including using jbd):  http://tinyurl.com/7sbqq
> 
> > * +	while (!kthread_should_stop()) {
> > +		gfs2_scand_internal(sdp);
> > +
> > +		set_current_state(TASK_INTERRUPTIBLE);
> > +		schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
> > +	}
> > 
> > you probably really want to check for signals if you do interruptible sleeps
> 
> I don't know why we'd be interested in signals here.

well.. because if you don't your schedule_timeout becomes a nop when you
get one, which makes your loop a busy waiting one.





From travellig at gmail.com  Wed Aug  3 15:57:45 2005
From: travellig at gmail.com (travellig travellig)
Date: Wed, 3 Aug 2005 16:57:45 +0100
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 16, Issue 4
In-Reply-To: <20050803140914.6801B739B8@hormel.redhat.com>
References: <20050803140914.6801B739B8@hormel.redhat.com>
Message-ID: <6944872105080308573f220551@mail.gmail.com>

On Wed, 2005-08-03 at 11:58 +0000, "S?valdur Arnar Gunnarsson
[Hugsmi?jan]" wrote:
> What does "automatic" fencing have to offer that the manual fencing
> lacks.
Automatic fencing uses hardware to fence a node and reboot it.
Manual fencing relay on you to manually fence the node whenever you
release there is a problem in the cluster and relays on you to
prowercycle the faulty node manually, no very convenient when you are
sysadmin the cluster remotely.

> If we decide to buy the FC switch right away is it recomended that we 
> buy one of the ones that have fencing agent available for the 
> Cluster-Suite ?
 If you look at the configuration manual for RHCS, there is a list of
supported fencing agents.
> If can't get our hands on supported FC switchs can we do fencing in 
> another manner than throught a FC switch ?

Manual fencing.

Nando


On 8/3/05, linux-cluster-request at redhat.com
<linux-cluster-request at redhat.com> wrote:
> Send Linux-cluster mailing list submissions to
>         linux-cluster at redhat.com
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://www.redhat.com/mailman/listinfo/linux-cluster
> or, via email, send a message with subject or body 'help' to
>         linux-cluster-request at redhat.com
> 
> You can reach the person managing the list at
>         linux-cluster-owner at redhat.com
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Linux-cluster digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: [PATCH 00/14] GFS (Lars Marowsky-Bree)
>    2. Fencing agents (S?valdur Arnar Gunnarsson [Hugsmi?jan])
>    3. Re: [PATCH 00/14] GFS (Arjan van de Ven)
>    4. Re: [PATCH 00/14] GFS (Arjan van de Ven)
>    5. Re: [PATCH 00/14] GFS (Pekka Enberg)
>    6. Re: [PATCH 00/14] GFS (Jan Engelhardt)
>    7. Re: [PATCH 00/14] GFS (Arjan van de Ven)
>    8. Re: [PATCH 00/14] GFS (Hans Reiser)
>    9. Re: [PATCH 00/14] GFS (Jan Engelhardt)
>   10. Re: [PATCH 00/14] GFS (Pekka Enberg)
>   11. Re: [PATCH 00/14] GFS (Kyle Moffett)
>   12. Re: [PATCH 00/14] GFS (Arjan van de Ven)
>   13. Re: [PATCH 00/14] GFS (Arjan van de Ven)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Wed, 3 Aug 2005 12:37:44 +0200
> From: Lars Marowsky-Bree <lmb at suse.de>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: David Teigland <teigland at redhat.com>,       Arjan van de Ven
>         <arjan at infradead.org>
> Cc: akpm at osdl.org, linux-cluster at redhat.com,
>         linux-kernel at vger.kernel.org
> Message-ID: <20050803103744.GG11081 at marowsky-bree.de>
> Content-Type: text/plain; charset=us-ascii
> 
> On 2005-08-03T11:56:18, David Teigland <teigland at redhat.com> wrote:
> 
> > > * Why use your own journalling layer and not say ... jbd ?
> > Here's an analysis of three approaches to cluster-fs journaling and their
> > pros/cons (including using jbd):  http://tinyurl.com/7sbqq
> 
> Very instructive read, thanks for the link.
> 
> 
> 
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business     -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Wed, 03 Aug 2005 11:58:47 +0000
> From: "S?valdur Arnar Gunnarsson [Hugsmi?jan]" <addi at hugsmidjan.is>
> Subject: [Linux-cluster] Fencing agents
> To: linux-cluster at redhat.com
> Message-ID: <42F0B177.7050907 at hugsmidjan.is>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> I'm implementing a shared storage between multiple (2 at the moment)
> Blade machines (Dell PowerEdge 1855) running RHEL4 ES connected to a EMC
> AX100 through FC.
> 
> The SAN has two FC ports so the need for a FC Switch has not yet come
> however we will add other Blades in the coming months.
> The one thing I haven't got figured out with GFS and the Cluster-Suite
> is the whole idea about fencing.
> 
> We have a working setup using Centos rebuilds of the Cluster-Suite and
> GFS (http://rpm.karan.org/el4/csgfs/) which we are not planning to use
> in the final implementation where we plan to use the official GFS
> packages from Red Hat.
> The fencing agents in that setup is manual fencing.
> 
> Both machines have the file system mounted and there appears to be no
> problems.
> 
> What does "automatic" fencing have to offer that the manual fencing lacks.
> If we decide to buy the FC switch right away is it recomended that we
> buy one of the ones that have fencing agent available for the
> Cluster-Suite ?
> 
> If can't get our hands on supported FC switchs can we do fencing in
> another manner than throught a FC switch ?
> 
> 
> 
> 
> --
> S?valdur Gunnarsson :: Hugsmi?jan
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Tue, 02 Aug 2005 09:45:24 +0200
> From: Arjan van de Ven <arjan at infradead.org>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: David Teigland <teigland at redhat.com>
> Cc: akpm at osdl.org, linux-cluster at redhat.com,
>         linux-kernel at vger.kernel.org
> Message-ID: <1122968724.3247.22.camel at laptopd505.fenrus.org>
> Content-Type: text/plain
> 
> On Tue, 2005-08-02 at 15:18 +0800, David Teigland wrote:
> > Hi, GFS (Global File System) is a cluster file system that we'd like to
> > see added to the kernel.  The 14 patches total about 900K so I won't send
> > them to the list unless that's requested.  Comments and suggestions are
> > welcome.  Thanks
> >
> > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> > http://redhat.com/~teigland/gfs2/20050801/broken-out/
> 
> * The on disk structures are defined in terms of uint32_t and friends,
> which are NOT endian neutral. Why are they not le32/be32 and thus
> endian-defined? Did you run bitwise-sparse on GFS yet ?
> 
> * None of your on disk structures are packet. Are you sure?
> 
> *
> +#define gfs2_16_to_cpu be16_to_cpu
> +#define gfs2_32_to_cpu be32_to_cpu
> +#define gfs2_64_to_cpu be64_to_cpu
> 
> why this pointless abstracting?
> 
> * +static const uint32_t crc_32_tab[] = .....
> 
> why do you duplicate this? The kernel has a perfectly good set of generic crc32 tables/functions just fine
> 
> * Why are you using bufferheads extensively in a new filesystem?
> 
> * +     if (create)
> +               down_write(&ip->i_rw_mutex);
> +       else
> +               down_read(&ip->i_rw_mutex);
> 
> why do you use a rwsem and not a regular semaphore? You are aware that rwsems are far more expensive than regular ones right?
> How skewed is the read/write ratio?
> 
> * Why use your own journalling layer and not say ... jbd ?
> 
> * +     while (!kthread_should_stop()) {
> +               gfs2_scand_internal(sdp);
> +
> +               set_current_state(TASK_INTERRUPTIBLE);
> +               schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
> +       }
> 
> you probably really want to check for signals if you do interruptible sleeps
> (multiple places)
> 
> * why not use msleep() and friends instead of schedule_timeout(), you're not using the complex variants anyway
> 
> * +++ b/fs/gfs2/fixed_div64.h   2005-08-01 14:13:08.009808200 +0800
> 
> ehhhh why?
> 
> * int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
> +                  unsigned int size)
> +{
> +       int error;
> +
> +       if (bh)
> +               error = copy_to_user(*buf, bh->b_data + offset, size);
> +       else
> +               error = clear_user(*buf, size);
> 
> that looks to be missing a few kmaps.. whats the guarantee that b_data is actually, like in lowmem?
> 
> * [PATCH 08/14] GFS: diaper device
> 
> The diaper device is a block device within gfs that gets transparently
> inserted between the real device the and rest of the filesystem.
> 
> hmmmm why not use device mapper or something? Is this really needed? Should it live in drivers/block ? Doesn't
> this wrapper just increase the risk for memory deadlocks?
> 
> * [PATCH 06/14] GFS: logging and recovery
> 
> quoting the ren and stimpy show is nice.. but did the ren ans stimpy authors agree to license their stuff under the GPL?
> 
> 
> * do_lock_wait
> 
> that almost screams for using wait_event and related APIs
> 
> 
> *
> +static inline void gfs2_log_lock(struct gfs2_sbd *sdp)
> +{
> +       spin_lock(&sdp->sd_log_lock);
> +}
> why the abstraction ?
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Tue, 02 Aug 2005 09:45:24 +0200
> From: Arjan van de Ven <arjan at infradead.org>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: David Teigland <teigland at redhat.com>
> Cc: akpm at osdl.org, linux-cluster at redhat.com,
>         linux-kernel at vger.kernel.org
> Message-ID: <1122968724.3247.22.camel at laptopd505.fenrus.org>
> Content-Type: text/plain
> 
> On Tue, 2005-08-02 at 15:18 +0800, David Teigland wrote:
> > Hi, GFS (Global File System) is a cluster file system that we'd like to
> > see added to the kernel.  The 14 patches total about 900K so I won't send
> > them to the list unless that's requested.  Comments and suggestions are
> > welcome.  Thanks
> >
> > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> > http://redhat.com/~teigland/gfs2/20050801/broken-out/
> 
> * The on disk structures are defined in terms of uint32_t and friends,
> which are NOT endian neutral. Why are they not le32/be32 and thus
> endian-defined? Did you run bitwise-sparse on GFS yet ?
> 
> * None of your on disk structures are packet. Are you sure?
> 
> *
> +#define gfs2_16_to_cpu be16_to_cpu
> +#define gfs2_32_to_cpu be32_to_cpu
> +#define gfs2_64_to_cpu be64_to_cpu
> 
> why this pointless abstracting?
> 
> * +static const uint32_t crc_32_tab[] = .....
> 
> why do you duplicate this? The kernel has a perfectly good set of generic crc32 tables/functions just fine
> 
> * Why are you using bufferheads extensively in a new filesystem?
> 
> * +     if (create)
> +               down_write(&ip->i_rw_mutex);
> +       else
> +               down_read(&ip->i_rw_mutex);
> 
> why do you use a rwsem and not a regular semaphore? You are aware that rwsems are far more expensive than regular ones right?
> How skewed is the read/write ratio?
> 
> * Why use your own journalling layer and not say ... jbd ?
> 
> * +     while (!kthread_should_stop()) {
> +               gfs2_scand_internal(sdp);
> +
> +               set_current_state(TASK_INTERRUPTIBLE);
> +               schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
> +       }
> 
> you probably really want to check for signals if you do interruptible sleeps
> (multiple places)
> 
> * why not use msleep() and friends instead of schedule_timeout(), you're not using the complex variants anyway
> 
> * +++ b/fs/gfs2/fixed_div64.h   2005-08-01 14:13:08.009808200 +0800
> 
> ehhhh why?
> 
> * int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
> +                  unsigned int size)
> +{
> +       int error;
> +
> +       if (bh)
> +               error = copy_to_user(*buf, bh->b_data + offset, size);
> +       else
> +               error = clear_user(*buf, size);
> 
> that looks to be missing a few kmaps.. whats the guarantee that b_data is actually, like in lowmem?
> 
> * [PATCH 08/14] GFS: diaper device
> 
> The diaper device is a block device within gfs that gets transparently
> inserted between the real device the and rest of the filesystem.
> 
> hmmmm why not use device mapper or something? Is this really needed? Should it live in drivers/block ? Doesn't
> this wrapper just increase the risk for memory deadlocks?
> 
> * [PATCH 06/14] GFS: logging and recovery
> 
> quoting the ren and stimpy show is nice.. but did the ren ans stimpy authors agree to license their stuff under the GPL?
> 
> 
> * do_lock_wait
> 
> that almost screams for using wait_event and related APIs
> 
> 
> *
> +static inline void gfs2_log_lock(struct gfs2_sbd *sdp)
> +{
> +       spin_lock(&sdp->sd_log_lock);
> +}
> why the abstraction ?
> 
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Tue, 2 Aug 2005 13:16:53 +0300
> From: Pekka Enberg <penberg at gmail.com>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: David Teigland <teigland at redhat.com>
> Cc: akpm at osdl.org, Pekka Enberg <penberg at cs.helsinki.fi>,
>         linux-cluster at redhat.com, linux-kernel at vger.kernel.org
> Message-ID: <84144f0205080203163cab015c at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi David,
> 
> On 8/2/05, David Teigland <teigland at redhat.com> wrote:
> > Hi, GFS (Global File System) is a cluster file system that we'd like to
> > see added to the kernel.  The 14 patches total about 900K so I won't send
> > them to the list unless that's requested.  Comments and suggestions are
> > welcome.  Thanks
> 
> > +#define kmalloc_nofail(size, flags) \
> > +     gmalloc_nofail((size), (flags), __FILE__, __LINE__)
> 
> [snip]
> 
> > +void *gmalloc_nofail_real(unsigned int size, int flags, char *file,
> > +                       unsigned int line)
> > +{
> > +     void *x;
> > +     for (;;) {
> > +             x = kmalloc(size, flags);
> > +             if (x)
> > +                     return x;
> > +             if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) {
> > +                     printk("GFS2: out of memory: %s, %u\n",
> > +                            __FILE__, __LINE__);
> > +                     gfs2_malloc_warning = jiffies;
> > +             }
> > +             yield();
> 
> This does not belong in a filesystem. It also seems like a very bad
> idea. What are you trying to do here? If you absolutely must not fail,
> use __GFP_NOFAIL instead.
> 
> > +     }
> > +}
> > +
> > +#if defined(GFS2_MEMORY_SIMPLE)
> > +
> > +atomic_t gfs2_memory_count;
> > +
> > +void gfs2_memory_add_i(void *data, char *file, unsigned int line)
> > +{
> > +     atomic_inc(&gfs2_memory_count);
> > +}
> > +
> > +void gfs2_memory_rm_i(void *data, char *file, unsigned int line)
> > +{
> > +     if (data)
> > +             atomic_dec(&gfs2_memory_count);
> > +}
> > +
> > +void *gmalloc(unsigned int size, int flags, char *file, unsigned int line)
> > +{
> > +     void *data = kmalloc(size, flags);
> > +     if (data)
> > +             atomic_inc(&gfs2_memory_count);
> > +     return data;
> > +}
> > +
> > +void *gmalloc_nofail(unsigned int size, int flags, char *file,
> > +                  unsigned int line)
> > +{
> > +     atomic_inc(&gfs2_memory_count);
> > +     return gmalloc_nofail_real(size, flags, file, line);
> > +}
> > +
> > +void gfree(void *data, char *file, unsigned int line)
> > +{
> > +     if (data) {
> > +             atomic_dec(&gfs2_memory_count);
> > +             kfree(data);
> > +     }
> > +}
> 
> -mm has memory leak detection patches and there are others floating
> around. Please do not introduce yet another subsystem-specific debug allocator.
> 
>                                     Pekka
> 
> 
> 
> ------------------------------
> 
> Message: 6
> Date: Tue, 2 Aug 2005 16:57:11 +0200 (MEST)
> From: Jan Engelhardt <jengelh at linux01.gwdg.de>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: Arjan van de Ven <arjan at infradead.org>
> Cc: akpm at osdl.org, linux-cluster at redhat.com,
>         linux-kernel at vger.kernel.org
> Message-ID: <Pine.LNX.4.61.0508021655580.4138 at yvahk01.tjqt.qr>
> Content-Type: TEXT/PLAIN; charset=US-ASCII
> 
> 
> >* Why use your own journalling layer and not say ... jbd ?
> 
> Why does reiser use its own journalling layer and not say ... jbd ?
> 
> 
> 
> Jan Engelhardt
> --
> 
> 
> 
> ------------------------------
> 
> Message: 7
> Date: Tue, 02 Aug 2005 17:02:52 +0200
> From: Arjan van de Ven <arjan at infradead.org>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: Jan Engelhardt <jengelh at linux01.gwdg.de>
> Cc: akpm at osdl.org, linux-cluster at redhat.com,
>         linux-kernel at vger.kernel.org
> Message-ID: <1122994972.3247.31.camel at laptopd505.fenrus.org>
> Content-Type: text/plain
> 
> On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote:
> > >* Why use your own journalling layer and not say ... jbd ?
> >
> > Why does reiser use its own journalling layer and not say ... jbd ?
> 
> because reiser got merged before jbd. Next question.
> 
> Now the question for GFS is still a valid one; there might be reasons to
> not use it (which is fair enough) but if there's no real reason then
> using jdb sounds a lot better given it's maturity (and it is used by 2
> filesystems in -mm already).
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 8
> Date: Tue, 02 Aug 2005 18:00:02 -0700
> From: Hans Reiser <reiser at namesys.com>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: Arjan van de Ven <arjan at infradead.org>
> Cc: akpm at osdl.org, linux-cluster at redhat.com,    Jan Engelhardt
>         <jengelh at linux01.gwdg.de>, linux-kernel at vger.kernel.org
> Message-ID: <42F01712.2030105 at namesys.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Arjan van de Ven wrote:
> 
> >On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote:
> >
> >
> >>>* Why use your own journalling layer and not say ... jbd ?
> >>>
> >>>
> >>Why does reiser use its own journalling layer and not say ... jbd ?
> >>
> >>
> >
> >because reiser got merged before jbd. Next question.
> >
> >
> That is the wrong reason.  We use our own journaling layer for the
> reason that Vivaldi used his own melody.
> 
> I don't know anything about GFS, but expecting a filesystem author to
> use a journaling layer he does not want to is a bit arrogant.  Now, if
> you got into details, and said jbd does X, Y and Z, and GFS does the
> same X and Y, and does not do Z as well as jbd, that would be a more
> serious comment.  He might want to look at how reiser4 does wandering
> logs instead of using jbd..... but I would never claim that for sure
> some other author should be expected to use it.....  and something like
> changing one's journaling system is not something to do just before a
> merge.....
> 
> >Now the question for GFS is still a valid one; there might be reasons to
> >not use it (which is fair enough) but if there's no real reason then
> >using jdb sounds a lot better given it's maturity (and it is used by 2
> >filesystems in -mm already).
> >
> >
> >
> >-
> >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >the body of a message to majordomo at vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >Please read the FAQ at  http://www.tux.org/lkml/
> >
> >
> >
> >
> 
> 
> 
> ------------------------------
> 
> Message: 9
> Date: Wed, 3 Aug 2005 08:37:19 +0200 (MEST)
> From: Jan Engelhardt <jengelh at linux01.gwdg.de>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: Kyle Moffett <mrmacman_g4 at mac.com>
> Cc: akpm at osdl.org, linux-cluster at redhat.com, Hans Reiser
>         <reiser at namesys.com>,   linux-kernel at vger.kernel.org, Arjan van de Ven
>         <arjan at infradead.org>
> Message-ID: <Pine.LNX.4.61.0508030826000.2263 at yvahk01.tjqt.qr>
> Content-Type: TEXT/PLAIN; charset=US-ASCII
> 
> 
> >> > because reiser got merged before jbd. Next question.
> >>
> >> That is the wrong reason.  We use our own journaling layer for the
> >> reason that Vivaldi used his own melody.
> >>
> >> [...] He might want to look at how reiser4 does wandering
> >> logs instead of using jbd..... but I would never claim that for sure
> >> some other author should be expected to use it.....  and something like
> >> changing one's journaling system is not something to do just before a
> >> merge.....
> >
> > Do you see my point here?  If every person who added new kernel code
> > just wrote their own thing without checking to see if it had already
> > been done before, then there would be a lot of poorly maintained code
> > in the kernel.  If a journalling layer already exists, _new_ journaled
> > filesystems should either (A) use the layer as is, or (B) fix the layer
> > so it has sufficient functionality for them to use, and submit patches.
> 
> Maybe jbd 'sucks' for something 'cool' like reiser*, and modifying jbd to be
> 'eleet enough' for reiser* would overwhelm ext.
> 
> Lastly, there is the 'political' thing, when a <your-favorite-jbd-fs>-only
> specific change to jbd is rejected by all other jbd-using fs. (Basically the
> situation thing that leads to software forks, in any area.)
> 
> 
> 
> Jan Engelhardt
> --
> 
> 
> 
> ------------------------------
> 
> Message: 10
> Date: Wed, 3 Aug 2005 09:44:06 +0300
> From: Pekka Enberg <penberg at gmail.com>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: David Teigland <teigland at redhat.com>
> Cc: akpm at osdl.org, Pekka Enberg <penberg at cs.helsinki.fi>,
>         linux-cluster at redhat.com, linux-kernel at vger.kernel.org
> Message-ID: <84144f0205080223445375c907 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi David,
> 
> Some more comments below.
> 
>                                 Pekka
> 
> On 8/2/05, David Teigland <teigland at redhat.com> wrote:
> > +/**
> > + * inode_create - create a struct gfs2_inode
> > + * @i_gl: The glock covering the inode
> > + * @inum: The inode number
> > + * @io_gl: the iopen glock to acquire/hold (using holder in new gfs2_inode)
> > + * @io_state: the state the iopen glock should be acquired in
> > + * @ipp: pointer to put the returned inode in
> > + *
> > + * Returns: errno
> > + */
> > +
> > +static int inode_create(struct gfs2_glock *i_gl, struct gfs2_inum *inum,
> > +                     struct gfs2_glock *io_gl, unsigned int io_state,
> > +                     struct gfs2_inode **ipp)
> > +{
> > +     struct gfs2_sbd *sdp = i_gl->gl_sbd;
> > +     struct gfs2_inode *ip;
> > +     int error = 0;
> > +
> > +     RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL), ip);
> 
> Why do you want to do this? The callers can handle ENOMEM just fine.
> 
> > +/**
> > + * gfs2_random - Generate a random 32-bit number
> > + *
> > + * Generate a semi-crappy 32-bit pseudo-random number without using
> > + * floating point.
> > + *
> > + * The PRNG is from "Numerical Recipes in C" (second edition), page 284.
> > + *
> > + * Returns: a 32-bit random number
> > + */
> > +
> > +uint32_t gfs2_random(void)
> > +{
> > +     gfs2_random_number = 0x0019660D * gfs2_random_number + 0x3C6EF35F;
> > +     return gfs2_random_number;
> > +}
> 
> Please consider moving this into lib/random.c. This one already appears in
> drivers/net/hamradio/dmascc.c.
> 
> > +/**
> > + * gfs2_hash - hash an array of data
> > + * @data: the data to be hashed
> > + * @len: the length of data to be hashed
> > + *
> > + * Take some data and convert it to a 32-bit hash.
> > + *
> > + * This is the 32-bit FNV-1a hash from:
> > + * http://www.isthe.com/chongo/tech/comp/fnv/
> > + *
> > + * Returns: the hash
> > + */
> > +
> > +uint32_t gfs2_hash(const void *data, unsigned int len)
> > +{
> > +     uint32_t h = 0x811C9DC5;
> > +     h = hash_more_internal(data, len, h);
> > +     return h;
> > +}
> 
> Is there a reason why you cannot use <linux/hash.h> or <linux/jhash.h>?
> 
> > +void gfs2_sort(void *base, unsigned int num_elem, unsigned int size,
> > +            int (*compar) (const void *, const void *))
> > +{
> > +     register char *pbase = (char *)base;
> > +     int i, j, k, h;
> > +     static int cols[16] = {1391376, 463792, 198768, 86961,
> > +                            33936, 13776, 4592, 1968,
> > +                            861, 336, 112, 48,
> > +                            21, 7, 3, 1};
> > +
> > +     for (k = 0; k < 16; k++) {
> > +             h = cols[k];
> > +             for (i = h; i < num_elem; i++) {
> > +                     j = i;
> > +                     while (j >= h &&
> > +                            (*compar)((void *)(pbase + size * (j - h)),
> > +                                      (void *)(pbase + size * j)) > 0) {
> > +                             SWAP(pbase + size * j,
> > +                                  pbase + size * (j - h),
> > +                                  size);
> > +                             j = j - h;
> > +                     }
> > +             }
> > +     }
> > +}
> 
> Please use sort() from lib/sort.c.
> 
> > +/**
> > + * gfs2_io_error_inode_i - Flag an inode I/O error and withdraw
> > + * @ip:
> > + * @function:
> > + * @file:
> > + * @line:
> 
> Please drop empty kerneldoc tags. (Appears in various other places as well.)
> 
> > +#define RETRY_MALLOC(do_this, until_this) \
> > +for (;;) { \
> > +     { do_this; } \
> > +     if (until_this) \
> > +             break; \
> > +     if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
> > +             printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \
> > +             gfs2_malloc_warning = jiffies; \
> > +     } \
> > +     yield(); \
> > +}
> 
> Please drop this.
> 
> > +int gfs2_acl_create(struct gfs2_inode *dip, struct gfs2_inode *ip)
> > +{
> > +             struct gfs2_sbd *sdp = dip->i_sbd;
> > +     struct posix_acl *acl = NULL;
> > +     struct gfs2_ea_request er;
> > +     mode_t mode = ip->i_di.di_mode;
> > +     int error;
> > +
> > +     if (!sdp->sd_args.ar_posix_acl)
> > +             return 0;
> > +     if (S_ISLNK(ip->i_di.di_mode))
> > +             return 0;
> > +
> > +     memset(&er, 0, sizeof(struct gfs2_ea_request));
> > +     er.er_type = GFS2_EATYPE_SYS;
> > +
> > +     error = acl_get(dip, ACL_DEFAULT, &acl, NULL,
> > +                     &er.er_data, &er.er_data_len);
> > +     if (error)
> > +             return error;
> > +     if (!acl) {
> > +             mode &= ~current->fs->umask;
> > +             if (mode != ip->i_di.di_mode)
> > +                     error = munge_mode(ip, mode);
> > +             return error;
> > +     }
> > +
> > +     {
> > +             struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL);
> > +             error = -ENOMEM;
> > +             if (!clone)
> > +                     goto out;
> > +             gfs2_memory_add(clone);
> > +             gfs2_memory_rm(acl);
> > +             posix_acl_release(acl);
> > +             acl = clone;
> > +     }
> 
> Please make this a real function. It is duplicated below.
> 
> > +     if (error > 0) {
> > +             er.er_name = GFS2_POSIX_ACL_ACCESS;
> > +             er.er_name_len = GFS2_POSIX_ACL_ACCESS_LEN;
> > +             posix_acl_to_xattr(acl, er.er_data, er.er_data_len);
> > +             er.er_mode = mode;
> > +             er.er_flags = GFS2_ERF_MODE;
> > +             error = gfs2_system_eaops.eo_set(ip, &er);
> > +             if (error)
> > +                     goto out;
> > +     } else
> > +             munge_mode(ip, mode);
> > +
> > + out:
> > +     gfs2_memory_rm(acl);
> > +     posix_acl_release(acl);
> > +     kfree(er.er_data);
> > +
> > +             return error;
> 
> Whitespace damage.
> 
> > +int gfs2_acl_chmod(struct gfs2_inode *ip, struct iattr *attr)
> > +{
> > +     struct posix_acl *acl = NULL;
> > +     struct gfs2_ea_location el;
> > +     char *data;
> > +     unsigned int len;
> > +     int error;
> > +
> > +     error = acl_get(ip, ACL_ACCESS, &acl, &el, &data, &len);
> > +     if (error)
> > +             return error;
> > +     if (!acl)
> > +             return gfs2_setattr_simple(ip, attr);
> > +
> > +     {
> > +             struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL);
> > +             error = -ENOMEM;
> > +             if (!clone)
> > +                     goto out;
> > +             gfs2_memory_add(clone);
> > +             gfs2_memory_rm(acl);
> > +             posix_acl_release(acl);
> > +             acl = clone;
> > +     }
> 
> Duplicated above.
> 
> > +static int ea_foreach(struct gfs2_inode *ip, ea_call_t ea_call, void *data)
> > +{
> > +     struct buffer_head *bh;
> > +     int error;
> > +
> > +     error = gfs2_meta_read(ip->i_gl, ip->i_di.di_eattr,
> > +                            DIO_START | DIO_WAIT, &bh);
> > +     if (error)
> > +             return error;
> > +
> > +     if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
> > +             error = ea_foreach_i(ip, bh, ea_call, data);
> 
> goto out here so you can drop the else branch below.
> 
> > +     else {
> > +             struct buffer_head *eabh;
> > +             uint64_t *eablk, *end;
> > +
> > +             if (gfs2_metatype_check(ip->i_sbd, bh, GFS2_METATYPE_IN)) {
> > +                     error = -EIO;
> > +                     goto out;
> > +             }
> > +
> > +             eablk = (uint64_t *)(bh->b_data +
> > +                                  sizeof(struct gfs2_meta_header));
> > +             end = eablk + ip->i_sbd->sd_inptrs;
> > +
> 
> > +static int ea_find_i(struct gfs2_inode *ip, struct buffer_head *bh,
> > +                  struct gfs2_ea_header *ea, struct gfs2_ea_header *prev,
> > +                  void *private)
> > +{
> > +     struct ea_find *ef = (struct ea_find *)private;
> > +     struct gfs2_ea_request *er = ef->ef_er;
> > +
> > +     if (ea->ea_type == GFS2_EATYPE_UNUSED)
> > +             return 0;
> > +
> > +     if (ea->ea_type == er->er_type) {
> > +             if (ea->ea_name_len == er->er_name_len &&
> > +                 !memcmp(GFS2_EA2NAME(ea), er->er_name, ea->ea_name_len)) {
> > +                     struct gfs2_ea_location *el = ef->ef_el;
> > +                     get_bh(bh);
> > +                     el->el_bh = bh;
> > +                     el->el_ea = ea;
> > +                     el->el_prev = prev;
> > +                     return 1;
> > +             }
> > +     }
> > +
> > +#if 0
> > +     else if ((ip->i_di.di_flags & GFS2_DIF_EA_PACKED) &&
> > +              er->er_type == GFS2_EATYPE_SYS)
> > +             return 1;
> > +#endif
> 
> Please drop commented out code.
> 
> > +static int ea_list_i(struct gfs2_inode *ip, struct buffer_head *bh,
> > +                  struct gfs2_ea_header *ea, struct gfs2_ea_header *prev,
> > +                  void *private)
> > +{
> > +     struct ea_list *ei = (struct ea_list *)private;
> 
> Please drop redundant cast.
> 
> > +static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
> > +                 struct gfs2_ea_location *el)
> > +{
> > +     {
> > +             struct ea_set es;
> > +             int error;
> > +
> > +             memset(&es, 0, sizeof(struct ea_set));
> > +             es.es_er = er;
> > +             es.es_el = el;
> > +
> > +             error = ea_foreach(ip, ea_set_simple, &es);
> > +             if (error > 0)
> > +                     return 0;
> > +             if (error)
> > +                     return error;
> > +     }
> > +     {
> > +             unsigned int blks = 2;
> > +             if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
> > +                     blks++;
> > +             if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize)
> > +                     blks += DIV_RU(er->er_data_len,
> > +                                    ip->i_sbd->sd_jbsize);
> > +
> > +             return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
> > +     }
> 
> Please drop the extra braces.
> 
> 
> 
> ------------------------------
> 
> Message: 11
> Date: Wed, 3 Aug 2005 00:07:38 -0400
> From: Kyle Moffett <mrmacman_g4 at mac.com>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: Hans Reiser <reiser at namesys.com>
> Cc: akpm at osdl.org, linux-cluster at redhat.com,    Jan Engelhardt
>         <jengelh at linux01.gwdg.de>, linux-kernel at vger.kernel.org,        Arjan van de
>         Ven <arjan at infradead.org>
> Message-ID: <4CBCB111-36B9-4F8C-9A3F-A9126ADE1CA2 at mac.com>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
> 
> On Aug 2, 2005, at 21:00:02, Hans Reiser wrote:
> > Arjan van de Ven wrote:
> >> because reiser got merged before jbd. Next question.
> > That is the wrong reason.  We use our own journaling layer for the
> > reason that Vivaldi used his own melody.
> >
> > I don't know anything about GFS, but expecting a filesystem author to
> > use a journaling layer he does not want to is a bit arrogant.  Now, if
> > you got into details, and said jbd does X, Y and Z, and GFS does the
> > same X and Y, and does not do Z as well as jbd, that would be a more
> > serious comment.  He might want to look at how reiser4 does wandering
> > logs instead of using jbd..... but I would never claim that for sure
> > some other author should be expected to use it.....  and something
> > like
> > changing one's journaling system is not something to do just before a
> > merge.....
> 
> I don't want to start another big reiser4 flamewar, but...
> 
> "I don't know anything about Reiser4, but expecting a filesystem author
> to use a VFS layer he does not want to is a bit arrogant.  Now, if you
> got into details, and said the linux VFS does X, Y, and Z, and Reiser4
> does..."
> 
> Do you see my point here?  If every person who added new kernel code
> just wrote their own thing without checking to see if it had already
> been done before, then there would be a lot of poorly maintained code
> in the kernel.  If a journalling layer already exists, _new_ journaled
> filesystems should either (A) use the layer as is, or (B) fix the layer
> so it has sufficient functionality for them to use, and submit patches.
> That way if somebody later says, "Ah, crap, there's a bug in the kernel
> journalling layer", and fixes it, there are not eight other filesystems
> with their own open-coded layers that need to be audited for similar
> mistakes.
> 
> This is similar to why some kernel developers did not like the Reiser4
> code, because it implemented some private layers that looked kinda like
> stuff the VFS should be doing  (Again, I don't want to get into that
> argument again, I'm just bringing up the similarities to clarify _this_
> particular point, as that one has been beaten to death enough already).
> 
> >> Now the question for GFS is still a valid one; there might be
> >> reasons to
> >> not use it (which is fair enough) but if there's no real reason then
> >> using jdb sounds a lot better given it's maturity (and it is used
> >> by 2
> >> filesystems in -mm already).
> 
> Personally, I am of the opinion that if GFS cannot use jdb, the
> developers
> ought to clarify why it isn't useable, and possibly submit fixes to make
> it useful, so that others can share the benefits.
> 
> Cheers,
> Kyle Moffett
> 
> --
> I lost interest in "blade servers" when I found they didn't throw
> knives at
> people who weren't supposed to be in your machine room.
>    -- Anthony de Boer
> 
> 
> 
> 
> ------------------------------
> 
> Message: 12
> Date: Wed, 03 Aug 2005 11:09:02 +0200
> From: Arjan van de Ven <arjan at infradead.org>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: Hans Reiser <reiser at namesys.com>
> Cc: akpm at osdl.org, linux-cluster at redhat.com,    Jan Engelhardt
>         <jengelh at linux01.gwdg.de>, linux-kernel at vger.kernel.org
> Message-ID: <1123060142.3363.8.camel at laptopd505.fenrus.org>
> Content-Type: text/plain
> 
> 
> > I don't know anything about GFS, but expecting a filesystem author to
> > use a journaling layer he does not want to is a bit arrogant.
> 
> good that I didn't expect that then.
> I think it's fair enough to ask people if they can use it. If the answer
> is "No because it doesn't fit our model <here>" then that's fine. If the
> answer is "eh yeah we could" then I think it's entirely reasonable to
> expect people to use common code as opposed to adding new code.
> 
> 
> 
> 
> ------------------------------
> 
> Message: 13
> Date: Wed, 03 Aug 2005 11:17:09 +0200
> From: Arjan van de Ven <arjan at infradead.org>
> Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
> To: David Teigland <teigland at redhat.com>
> Cc: akpm at osdl.org, linux-cluster at redhat.com,
>         linux-kernel at vger.kernel.org
> Message-ID: <1123060630.3363.10.camel at laptopd505.fenrus.org>
> Content-Type: text/plain
> 
> On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote:
> > The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
> > on-disk instead of LE which is another useful way to verify endian
> > correctness.
> 
> that sounds wrong to be a compile option. If you really want to deal
> with dual disk endianness it really ought to be a runtime one (see jffs2
> for example).
> 
> 
> 
> > > * Why use your own journalling layer and not say ... jbd ?
> >
> > Here's an analysis of three approaches to cluster-fs journaling and their
> > pros/cons (including using jbd):  http://tinyurl.com/7sbqq
> >
> > > * + while (!kthread_should_stop()) {
> > > +           gfs2_scand_internal(sdp);
> > > +
> > > +           set_current_state(TASK_INTERRUPTIBLE);
> > > +           schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
> > > +   }
> > >
> > > you probably really want to check for signals if you do interruptible sleeps
> >
> > I don't know why we'd be interested in signals here.
> 
> well.. because if you don't your schedule_timeout becomes a nop when you
> get one, which makes your loop a busy waiting one.
> 
> 
> 
> 
> 
> ------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 
> End of Linux-cluster Digest, Vol 16, Issue 4
> ********************************************
> 


-- 
travellig.



From JACOB_LIBERMAN at Dell.com  Wed Aug  3 16:40:14 2005
From: JACOB_LIBERMAN at Dell.com (JACOB_LIBERMAN at Dell.com)
Date: Wed, 3 Aug 2005 11:40:14 -0500
Subject: [Linux-cluster] Fencing agents
Message-ID: <BC430F453501174992B9D9E8AFB7519A08567F@ausx3mps309.aus.amer.dell.com>

The 1855 has a built in ERA controller. You can modify the fencing agents to either send "racadm serveraction powercycle" or install the PERL telent module and create your own fencing script. The former option requires that the rac management software be installed on the host. I havent tested this with the 1855 btw.

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster

The fence_drac agent out on the CVS should work for you. If you cant get it working, let me know, and ill see if I can dig up an 1855 in the lab.

Thanks, jacob

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> "S?valdur Arnar Gunnarsson [Hugsmi?jan]"
> Sent: Wednesday, August 03, 2005 6:59 AM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Fencing agents
> 
> I'm implementing a shared storage between multiple (2 at the 
> moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES 
> connected to a EMC AX100 through FC.
> 
> The SAN has two FC ports so the need for a FC Switch has not 
> yet come however we will add other Blades in the coming months.
> The one thing I haven't got figured out with GFS and the 
> Cluster-Suite is the whole idea about fencing.
> 
> We have a working setup using Centos rebuilds of the 
> Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which 
> we are not planning to use in the final implementation where 
> we plan to use the official GFS packages from Red Hat.
> The fencing agents in that setup is manual fencing.
> 
> Both machines have the file system mounted and there appears 
> to be no problems.
> 
> What does "automatic" fencing have to offer that the manual 
> fencing lacks.
> If we decide to buy the FC switch right away is it recomended 
> that we buy one of the ones that have fencing agent available 
> for the Cluster-Suite ?
> 
> If can't get our hands on supported FC switchs can we do 
> fencing in another manner than throught a FC switch ?
> 
> 
> 
> 
> -- 
> S?valdur Gunnarsson :: Hugsmi?jan
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 



From addi at hugsmidjan.is  Wed Aug  3 17:20:01 2005
From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=)
Date: Wed, 03 Aug 2005 17:20:01 +0000
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <BC430F453501174992B9D9E8AFB7519A08567F@ausx3mps309.aus.amer.dell.com>
References: <BC430F453501174992B9D9E8AFB7519A08567F@ausx3mps309.aus.amer.dell.com>
Message-ID: <42F0FCC1.7030108@hugsmidjan.is>

Could you please include a cluster.conf fencing sample on how to 
implement this.

JACOB_LIBERMAN at Dell.com wrote:
> The 1855 has a built in ERA controller. You can modify the fencing agents to either send "racadm serveraction powercycle" or install the PERL telent module and create your own fencing script. The former option requires that the rac management software be installed on the host. I havent tested this with the 1855 btw.
> 
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster
> 
> The fence_drac agent out on the CVS should work for you. If you cant get it working, let me know, and ill see if I can dig up an 1855 in the lab.
> 
> Thanks, jacob
> 
> 
>>-----Original Message-----
>>From: linux-cluster-bounces at redhat.com 
>>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
>>"S?valdur Arnar Gunnarsson [Hugsmi?jan]"
>>Sent: Wednesday, August 03, 2005 6:59 AM
>>To: linux-cluster at redhat.com
>>Subject: [Linux-cluster] Fencing agents
>>
>>I'm implementing a shared storage between multiple (2 at the 
>>moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES 
>>connected to a EMC AX100 through FC.
>>
>>The SAN has two FC ports so the need for a FC Switch has not 
>>yet come however we will add other Blades in the coming months.
>>The one thing I haven't got figured out with GFS and the 
>>Cluster-Suite is the whole idea about fencing.
>>
>>We have a working setup using Centos rebuilds of the 
>>Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which 
>>we are not planning to use in the final implementation where 
>>we plan to use the official GFS packages from Red Hat.
>>The fencing agents in that setup is manual fencing.
>>
>>Both machines have the file system mounted and there appears 
>>to be no problems.
>>
>>What does "automatic" fencing have to offer that the manual 
>>fencing lacks.
>>If we decide to buy the FC switch right away is it recomended 
>>that we buy one of the ones that have fencing agent available 
>>for the Cluster-Suite ?
>>
>>If can't get our hands on supported FC switchs can we do 
>>fencing in another manner than throught a FC switch ?
>>
>>
>>
>>
>>-- 
>>S?valdur Gunnarsson :: Hugsmi?jan
>>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster


-- 
S?valdur Gunnarsson :: Hugsmi?jan



From mark.fasheh at oracle.com  Wed Aug  3 18:54:01 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Wed, 3 Aug 2005 11:54:01 -0700
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050803103744.GG11081@marowsky-bree.de>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050803035618.GB9812@redhat.com>
	<20050803103744.GG11081@marowsky-bree.de>
Message-ID: <20050803185401.GB21228@ca-server1.us.oracle.com>

On Wed, Aug 03, 2005 at 12:37:44PM +0200, Lars Marowsky-Bree wrote:
> On 2005-08-03T11:56:18, David Teigland <teigland at redhat.com> wrote:
> 
> > > * Why use your own journalling layer and not say ... jbd ?
> > Here's an analysis of three approaches to cluster-fs journaling and their
> > pros/cons (including using jbd):  http://tinyurl.com/7sbqq
> 
> Very instructive read, thanks for the link.
While it may be true that for a full log, flushing for a *single* lock may
be more expensive in OCFS2, Ken ignores the fact that in our one big flush
we've made all locks on journalled resources immediately releasable.
According to that description, GFS2 would have to do a seperate transaction
flush (including the extra step of writing revoke records) for each lock
protecting a journalled resource. Assuming the same number of locks are
required to be dropped under both systems then for a number of locks > 1
OCFS2 will actually do less work - the actual metadata blocks would be the
same on either end, but JBD only has to write that the journal is now clean
to the journal superblock whereas GFS2 has to revoke the blocks for each
dropped lock.

Of course all of this talk completely avoids the fact that in any case these
things are expensive so a cluster file system has to take care to ping locks
as little as possible. OCFS2 takes great pains to make as many operations
node local (requiring no cluster locks) as possible - data allocation is
usually done from a node local pool which is refreshed from the main bitmap.
Deallocation happens similarly - we have a truncate log in which we record
deleted clusters. Each node has their own inode and metadata chain
allocators which another node will only lock for delete (a truncate log
style local metadata delete log could easily be added if that ever became a
problem).
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From jacobl at ccbill.com  Wed Aug  3 20:36:21 2005
From: jacobl at ccbill.com (Jacob Liff)
Date: Wed, 3 Aug 2005 13:36:21 -0700
Subject: [Linux-cluster] Live Changes To Cluster
Message-ID: <889A47B16278164FB657E0FFB1CAB8C7CB1375@hq-exchange.ccbill-hq.local>

Hello,

 

I've done some searching on the mailing list and found  one link that
talks about what I'm trying to accomplish. These are also the steps in
the usage.txt for changing the config in a live cluster.

 

http://tinyurl.com/duxqt

 

I have tried this but it does not seem to work. I changed the config on
one box HUP'ed ccsd but the new cluster.conf is not sent to the other
members. I am using the latest gzip from:

 

ftp://sources.redhat.com/pub/cluster/releases/cluster-1.00.00.tar.gz

 

It will allow me to change the config number on the other
members(cman_tool version -r 2) but something just doesn't seem right
with the new configs not residing on the machines. Has this process been
updated at some point?

 

Jacob L.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050803/fdadb6d7/attachment.htm>

From eric at bootseg.com  Wed Aug  3 20:48:01 2005
From: eric at bootseg.com (Eric Kerin)
Date: Wed, 03 Aug 2005 16:48:01 -0400
Subject: [Linux-cluster] Live Changes To Cluster
In-Reply-To: <889A47B16278164FB657E0FFB1CAB8C7CB1375@hq-exchange.ccbill-hq.local>
References: <889A47B16278164FB657E0FFB1CAB8C7CB1375@hq-exchange.ccbill-hq.local>
Message-ID: <1123102081.3344.6.camel@auh5-0479.corp.jabil.org>

On Wed, 2005-08-03 at 13:36 -0700, Jacob Liff wrote:
> Hello,
> 
>  
> It will allow me to change the config number on the other members(cman_tool version ?r 2) but something just doesn?t seem right with the new configs not residing on the machines. Has this process been updated at some point?
> 

This is the process I use when I update my cluster.xml file:
1. make my changes, upping the version number
2. run: "ccs_tool update /etc/cluster/cluster.xml"
3. run: "cman_tool version -r <new version number>"
4. on each node, run: "cman_tool status" and make sure that the version
number is the new one.

Thanks,
Eric Kerin




From jacobl at ccbill.com  Wed Aug  3 20:53:30 2005
From: jacobl at ccbill.com (Jacob Liff)
Date: Wed, 3 Aug 2005 13:53:30 -0700
Subject: [Linux-cluster] Live Changes To Cluster
Message-ID: <889A47B16278164FB657E0FFB1CAB8C7CB1384@hq-exchange.ccbill-hq.local>

Eric,

Thanks very much, this did the trick for me. You guys maintain an
incredibly helpful list.

Jacob L.

-----Original Message-----
From: Eric Kerin [mailto:eric at bootseg.com] 
Sent: Wednesday, August 03, 2005 1:48 PM
To: Jacob Liff; ; linux clustering
Subject: Re: [Linux-cluster] Live Changes To Cluster

On Wed, 2005-08-03 at 13:36 -0700, Jacob Liff wrote:
> Hello,
> 
>  
> It will allow me to change the config number on the other
members(cman_tool version -r 2) but something just doesn't seem right
with the new configs not residing on the machines. Has this process been
updated at some point?
> 

This is the process I use when I update my cluster.xml file:
1. make my changes, upping the version number
2. run: "ccs_tool update /etc/cluster/cluster.xml"
3. run: "cman_tool version -r <new version number>"
4. on each node, run: "cman_tool status" and make sure that the version
number is the new one.

Thanks,
Eric Kerin




From amanthei at redhat.com  Wed Aug  3 21:29:49 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Wed, 3 Aug 2005 16:29:49 -0500
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <42F0B177.7050907@hugsmidjan.is>
References: <42F0B177.7050907@hugsmidjan.is>
Message-ID: <20050803212949.GA3268@redhat.com>

On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote:
> I'm implementing a shared storage between multiple (2 at the moment) 
> Blade machines (Dell PowerEdge 1855) running RHEL4 ES connected to a EMC 
> AX100 through FC.
>
> The SAN has two FC ports so the need for a FC Switch has not yet come 
> however we will add other Blades in the coming months.
> The one thing I haven't got figured out with GFS and the Cluster-Suite 
> is the whole idea about fencing.

Funny timing :)  I just checked in the fencing agent for the PowerEdge
1855's a couple days ago!  

(http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb-markup&cvsroot=cluster)

> The fencing agents in that setup is manual fencing.

I would strongly discourage this.

> What does "automatic" fencing have to offer that the manual fencing lacks.
> If we decide to buy the FC switch right away is it recomended that we 
> buy one of the ones that have fencing agent available for the 
> Cluster-Suite ?

In this case, you already have a fencing agent (fence_drac) that works with
the PE 1855 blades so there is no need for further fencing hardware (unless
you are going to be connecting other machines to the cluster that aren't
going to have any other form of fencing)

The main advantage that "automatic" fencing gives you over manual fencing is
that in the event that a fencing operation is required, your cluster can
automatically recover (on the order of seconds to minutes) instead of waiting
for user intervention (which can take minutes to hours to days depending on
how attentive the admins are :).    

-- 
Adam Manthei  <amanthei at redhat.com>



From amanthei at redhat.com  Wed Aug  3 21:34:25 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Wed, 3 Aug 2005 16:34:25 -0500
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <BC430F453501174992B9D9E8AFB7519A08567F@ausx3mps309.aus.amer.dell.com>
References: <BC430F453501174992B9D9E8AFB7519A08567F@ausx3mps309.aus.amer.dell.com>
Message-ID: <20050803213425.GB3268@redhat.com>

On Wed, Aug 03, 2005 at 11:40:14AM -0500, JACOB_LIBERMAN at Dell.com wrote:
> The 1855 has a built in ERA controller. You can modify the fencing agents to either send "racadm serveraction powercycle" or install the PERL telent module and create your own fencing script. The former option requires that the rac management software be installed on the host. I havent tested this with the 1855 btw.
> 
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster

"racadm" should be avoided.  The interface does not provide the feedback
necessary to guarantee that nodes have been properly fenced.  The telnet
interface is the preferred method for support DRAC hardware.  Fortunately, as
your link above shows, the fence_drac agent now supports the 1855 as of
Monday.


> The fence_drac agent out on the CVS should work for you. If you cant get it working, let me know, and ill see if I can dig up an 1855 in the lab.
> 
> Thanks, jacob
> 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> > "S?valdur Arnar Gunnarsson [Hugsmi?jan]"
> > Sent: Wednesday, August 03, 2005 6:59 AM
> > To: linux-cluster at redhat.com
> > Subject: [Linux-cluster] Fencing agents
> > 
> > I'm implementing a shared storage between multiple (2 at the 
> > moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES 
> > connected to a EMC AX100 through FC.
> > 
> > The SAN has two FC ports so the need for a FC Switch has not 
> > yet come however we will add other Blades in the coming months.
> > The one thing I haven't got figured out with GFS and the 
> > Cluster-Suite is the whole idea about fencing.
> > 
> > We have a working setup using Centos rebuilds of the 
> > Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which 
> > we are not planning to use in the final implementation where 
> > we plan to use the official GFS packages from Red Hat.
> > The fencing agents in that setup is manual fencing.
> > 
> > Both machines have the file system mounted and there appears 
> > to be no problems.
> > 
> > What does "automatic" fencing have to offer that the manual 
> > fencing lacks.
> > If we decide to buy the FC switch right away is it recomended 
> > that we buy one of the ones that have fencing agent available 
> > for the Cluster-Suite ?
> > 
> > If can't get our hands on supported FC switchs can we do 
> > fencing in another manner than throught a FC switch ?
> > 
> > 
> > 
> > 
> > -- 
> > S?valdur Gunnarsson :: Hugsmi?jan
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > http://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Adam Manthei  <amanthei at redhat.com>



From JACOB_LIBERMAN at Dell.com  Wed Aug  3 21:42:23 2005
From: JACOB_LIBERMAN at Dell.com (JACOB_LIBERMAN at Dell.com)
Date: Wed, 3 Aug 2005 16:42:23 -0500
Subject: [Linux-cluster] Fencing agents
Message-ID: <BC430F453501174992B9D9E8AFB7519A08568A@ausx3mps309.aus.amer.dell.com>

Hi Adam,

I noticed that you updated this script quite a bit from previous versions. If I'm not mistaken, the previous version actually used the "racadm serveraction powercycle/shutdown/etc" commands. This version uses telnet exclusively. How about adding some logic that checks whether racadm is installed locally and uses that if it is, and then uses telnet if it is not?

I think that adding the racadm commands to enable telnet on the rac is a good idea, but if they can use racadm to configure telnet access, they should also be able to use racadm to fence the node.

Just my 2 cents. I think its great that you wrote an agent for the drac.

Thanks, jacob 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Adam Manthei
> Sent: Wednesday, August 03, 2005 4:30 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Fencing agents
> 
> On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar 
> Gunnarsson [Hugsmi?jan]" wrote:
> > I'm implementing a shared storage between multiple (2 at 
> the moment) 
> > Blade machines (Dell PowerEdge 1855) running RHEL4 ES 
> connected to a 
> > EMC AX100 through FC.
> >
> > The SAN has two FC ports so the need for a FC Switch has 
> not yet come 
> > however we will add other Blades in the coming months.
> > The one thing I haven't got figured out with GFS and the 
> Cluster-Suite 
> > is the whole idea about fencing.
> 
> Funny timing :)  I just checked in the fencing agent for the 
> PowerEdge 1855's a couple days ago!  
> 
> (http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/ag
> ents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb
> -markup&cvsroot=cluster)
> 
> > The fencing agents in that setup is manual fencing.
> 
> I would strongly discourage this.
> 
> > What does "automatic" fencing have to offer that the manual 
> fencing lacks.
> > If we decide to buy the FC switch right away is it 
> recomended that we 
> > buy one of the ones that have fencing agent available for the 
> > Cluster-Suite ?
> 
> In this case, you already have a fencing agent (fence_drac) 
> that works with the PE 1855 blades so there is no need for 
> further fencing hardware (unless you are going to be 
> connecting other machines to the cluster that aren't going to 
> have any other form of fencing)
> 
> The main advantage that "automatic" fencing gives you over 
> manual fencing is that in the event that a fencing operation 
> is required, your cluster can automatically recover (on the 
> order of seconds to minutes) instead of waiting for user 
> intervention (which can take minutes to hours to days depending on
> how attentive the admins are :).    
> 
> --
> Adam Manthei  <amanthei at redhat.com>
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 



From JACOB_LIBERMAN at Dell.com  Wed Aug  3 21:43:16 2005
From: JACOB_LIBERMAN at Dell.com (JACOB_LIBERMAN at Dell.com)
Date: Wed, 3 Aug 2005 16:43:16 -0500
Subject: [Linux-cluster] Fencing agents
Message-ID: <BC430F453501174992B9D9E8AFB7519A08568B@ausx3mps309.aus.amer.dell.com>

Oops! Looks like I sent this 1 second too soon. 8) 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> JACOB_LIBERMAN at Dell.com
> Sent: Wednesday, August 03, 2005 4:42 PM
> To: linux-cluster at redhat.com
> Subject: RE: [Linux-cluster] Fencing agents
> 
> Hi Adam,
> 
> I noticed that you updated this script quite a bit from 
> previous versions. If I'm not mistaken, the previous version 
> actually used the "racadm serveraction 
> powercycle/shutdown/etc" commands. This version uses telnet 
> exclusively. How about adding some logic that checks whether 
> racadm is installed locally and uses that if it is, and then 
> uses telnet if it is not?
> 
> I think that adding the racadm commands to enable telnet on 
> the rac is a good idea, but if they can use racadm to 
> configure telnet access, they should also be able to use 
> racadm to fence the node.
> 
> Just my 2 cents. I think its great that you wrote an agent 
> for the drac.
> 
> Thanks, jacob 
> 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Adam Manthei
> > Sent: Wednesday, August 03, 2005 4:30 PM
> > To: linux clustering
> > Subject: Re: [Linux-cluster] Fencing agents
> > 
> > On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar 
> Gunnarsson 
> > [Hugsmi?jan]" wrote:
> > > I'm implementing a shared storage between multiple (2 at
> > the moment)
> > > Blade machines (Dell PowerEdge 1855) running RHEL4 ES
> > connected to a
> > > EMC AX100 through FC.
> > >
> > > The SAN has two FC ports so the need for a FC Switch has
> > not yet come
> > > however we will add other Blades in the coming months.
> > > The one thing I haven't got figured out with GFS and the
> > Cluster-Suite
> > > is the whole idea about fencing.
> > 
> > Funny timing :)  I just checked in the fencing agent for 
> the PowerEdge 
> > 1855's a couple days ago!
> > 
> > (http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/ag
> > ents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb
> > -markup&cvsroot=cluster)
> > 
> > > The fencing agents in that setup is manual fencing.
> > 
> > I would strongly discourage this.
> > 
> > > What does "automatic" fencing have to offer that the manual
> > fencing lacks.
> > > If we decide to buy the FC switch right away is it
> > recomended that we
> > > buy one of the ones that have fencing agent available for the 
> > > Cluster-Suite ?
> > 
> > In this case, you already have a fencing agent (fence_drac) 
> that works 
> > with the PE 1855 blades so there is no need for further fencing 
> > hardware (unless you are going to be connecting other 
> machines to the 
> > cluster that aren't going to have any other form of fencing)
> > 
> > The main advantage that "automatic" fencing gives you over manual 
> > fencing is that in the event that a fencing operation is required, 
> > your cluster can automatically recover (on the order of seconds to 
> > minutes) instead of waiting for user intervention (which can take 
> > minutes to hours to days depending on
> > how attentive the admins are :).    
> > 
> > --
> > Adam Manthei  <amanthei at redhat.com>
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > http://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 



From amanthei at redhat.com  Wed Aug  3 22:11:44 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Wed, 3 Aug 2005 17:11:44 -0500
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <BC430F453501174992B9D9E8AFB7519A08568A@ausx3mps309.aus.amer.dell.com>
References: <BC430F453501174992B9D9E8AFB7519A08568A@ausx3mps309.aus.amer.dell.com>
Message-ID: <20050803221144.GC3268@redhat.com>

On Wed, Aug 03, 2005 at 04:42:23PM -0500, JACOB_LIBERMAN at Dell.com wrote:
> Hi Adam,
> 
> I noticed that you updated this script quite a bit from previous versions. If I'm not mistaken, the previous version actually used the "racadm serveraction powercycle/shutdown/etc" commands. This version uses telnet exclusively. How about adding some logic that checks whether racadm is installed locally and uses that if it is, and then uses telnet if it is not?

The problem that I experienced with the racadm utility is that there where
times that there was no way of querying what the power status of a node was.
I know that I am unable to do that at all with the firmware that I have 
installed for my PowerEdge 750's.  Another drawback to the racadm approach
is that `serveraction` returns right away before waiting to for that command
to complete.  Given the combination of the two issues, it makes using racadm
difficult to rely upon for a fencing agent because it's possible for the
fencing agent to report success before the machine is powered off.  If that
were to happen, corruption in the filesystem could occur.

I've emailed a couple people at Dell and the linux-poweredge list and have
not been able to get an adequate response as to how to use racadm reliably.
As such, we only support the telnet interface.

> I think that adding the racadm commands to enable telnet on the rac is a good idea, but if they can use racadm to configure telnet access, they should also be able to use racadm to fence the node.

I thought about adding that functionality, but forgot about it shortly after
getting the telnet interface enabled on my DRAC card ;)  Thanks for the
reminder, I'll look into adding that feature.  In the meantime, the commands
for enabling it are documented in the man page.

       [root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

       [root]# racadm racreset


> Just my 2 cents. I think its great that you wrote an agent for the drac.
 
:)

Adam

> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Adam Manthei
> > Sent: Wednesday, August 03, 2005 4:30 PM
> > To: linux clustering
> > Subject: Re: [Linux-cluster] Fencing agents
> > 
> > On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar 
> > Gunnarsson [Hugsmi?jan]" wrote:
> > > I'm implementing a shared storage between multiple (2 at 
> > the moment) 
> > > Blade machines (Dell PowerEdge 1855) running RHEL4 ES 
> > connected to a 
> > > EMC AX100 through FC.
> > >
> > > The SAN has two FC ports so the need for a FC Switch has 
> > not yet come 
> > > however we will add other Blades in the coming months.
> > > The one thing I haven't got figured out with GFS and the 
> > Cluster-Suite 
> > > is the whole idea about fencing.
> > 
> > Funny timing :)  I just checked in the fencing agent for the 
> > PowerEdge 1855's a couple days ago!  
> > 
> > (http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/ag
> > ents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb
> > -markup&cvsroot=cluster)
> > 
> > > The fencing agents in that setup is manual fencing.
> > 
> > I would strongly discourage this.
> > 
> > > What does "automatic" fencing have to offer that the manual 
> > fencing lacks.
> > > If we decide to buy the FC switch right away is it 
> > recomended that we 
> > > buy one of the ones that have fencing agent available for the 
> > > Cluster-Suite ?
> > 
> > In this case, you already have a fencing agent (fence_drac) 
> > that works with the PE 1855 blades so there is no need for 
> > further fencing hardware (unless you are going to be 
> > connecting other machines to the cluster that aren't going to 
> > have any other form of fencing)
> > 
> > The main advantage that "automatic" fencing gives you over 
> > manual fencing is that in the event that a fencing operation 
> > is required, your cluster can automatically recover (on the 
> > order of seconds to minutes) instead of waiting for user 
> > intervention (which can take minutes to hours to days depending on
> > how attentive the admins are :).    
> > 
> > --
> > Adam Manthei  <amanthei at redhat.com>
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > http://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Adam Manthei  <amanthei at redhat.com>



From oldmoonster at gmail.com  Thu Aug  4 02:29:54 2005
From: oldmoonster at gmail.com (Q.L)
Date: Thu, 4 Aug 2005 10:29:54 +0800
Subject: [Linux-cluster] Compiling error against kernel-2.6.12.2
Message-ID: <359782e705080319293993cbf1@mail.gmail.com>

Hi,

When I began to "make" on the RH9.0, I can't pass following errors,
could you help me?
however, it seems no compiling problem happen on a host with FC1.0.
further more, what's the special config required in kernel .config
file for GFS cluster?

Thanks.

Q.L

cd ccs_tool && make install
make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_tool'
gcc -Wall -I. -I../config -I../include -I../lib
-I/home/share/cluster-1.00.00/build/incdir -Wall -O2
-D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags`
-DCCS_RELEASE_NAME=\"1.00.00\" -I. -I../config -I../include -I../lib
-I/home/share/cluster-1.00.00/build/incdir -o ccs_tool ccs_tool.c
update.c upgrade.c old_parser.c editconf.c -L../lib `xml2-config
--libs` -L/home/share/cluster-1.00.00/build/lib -lccs -lmagma
-lmagmamsg -ldl
/home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
to `pthread_rwlock_rdlock'
/home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
to `pthread_rwlock_unlock'
/home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
to `pthread_rwlock_wrlock'



From oldmoonster at gmail.com  Thu Aug  4 06:34:40 2005
From: oldmoonster at gmail.com (Q.L)
Date: Thu, 4 Aug 2005 14:34:40 +0800
Subject: [Linux-cluster] Is there backporting patches for 2.4.x kernel can
	be found?
Message-ID: <359782e7050803233458cf8e76@mail.gmail.com>

Hi, 

If I don't ask such a question, I can't give up the idea forever,
although I know it is ultimately impossible.

Thanks,

Q.L



From mdl at veles.ru  Thu Aug  4 08:32:13 2005
From: mdl at veles.ru (Denis Medvedev)
Date: Thu, 04 Aug 2005 12:32:13 +0400
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <20050803212949.GA3268@redhat.com>
References: <42F0B177.7050907@hugsmidjan.is> <20050803212949.GA3268@redhat.com>
Message-ID: <42F1D28D.10006@veles.ru>

Adam Manthei ?????:

>On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote:
>  
>
>>I'm implementing a shared storage between multiple (2 at the moment) 
>>Blade machines (Dell PowerEdge 1855) running RHEL4 ES connected to a EMC 
>>AX100 through FC.
>>
>>The SAN has two FC ports so the need for a FC Switch has not yet come 
>>however we will add other Blades in the coming months.
>>The one thing I haven't got figured out with GFS and the Cluster-Suite 
>>is the whole idea about fencing.
>>    
>>
>
>Funny timing :)  I just checked in the fencing agent for the PowerEdge
>1855's a couple days ago!  
>
>(http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb-markup&cvsroot=cluster)
>
>  
>
>>The fencing agents in that setup is manual fencing.
>>    
>>
>
>I would strongly discourage this.
>
>  
>
>>What does "automatic" fencing have to offer that the manual fencing lacks.
>>If we decide to buy the FC switch right away is it recomended that we 
>>buy one of the ones that have fencing agent available for the 
>>Cluster-Suite ?
>>    
>>
>
>In this case, you already have a fencing agent (fence_drac) that works with
>the PE 1855 blades so there is no need for further fencing hardware (unless
>you are going to be connecting other machines to the cluster that aren't
>going to have any other form of fencing)
>
>The main advantage that "automatic" fencing gives you over manual fencing is
>that in the event that a fencing operation is required, your cluster can
>automatically recover (on the order of seconds to minutes) instead of waiting
>for user intervention (which can take minutes to hours to days depending on
>how attentive the admins are :).    
>
>  
>
"recover"? You mean reboot? But if a machine need fencing, doesn't that 
mean that something is inherently wrong with that machine and simple 
reboot would't cure that?



From mdl at veles.ru  Thu Aug  4 08:49:13 2005
From: mdl at veles.ru (Denis Medvedev)
Date: Thu, 04 Aug 2005 12:49:13 +0400
Subject: [Linux-cluster] Mirrored shared disks
Message-ID: <42F1D689.40807@veles.ru>

Dear sirs,
I am trying to make the following configuration:
a two node cluster, each node exports its own (identical in size) device,
each node imports the device from its neighbour,
a md device which is a composition of own and exported neighbour device 
is created on each node,
lock_dlm is used as a locking system, a gfs is used on each node for 
that md device.
Will it work? Isi it possible to create no-single-point of failure if I 
have 2 storage that expose iSCSI devices and I want to make a mirror 
based on both of them?
                                                                      
Thanks in advance
                                                                      
Denis Medvedev



From javipolo at datagrama.net  Thu Aug  4 11:00:48 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Thu, 4 Aug 2005 13:00:48 +0200
Subject: [Linux-cluster] ipv6_loopback symbol in 2.6.12
Message-ID: <20050804110048.GA18954@gibson.drslump.org>

Hi there

I managed to compile almost everything fine, and wanted to do some
testing, but I realise that lock_gulm has some undefined reference:

gfstest1:/usr/src/linux# modprobe lock_gulm
FATAL: Error inserting lock_gulm (/lib/modules/2.6.12.3/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko): Unknown symbol in module, or unknown parameter (see dmesg)
gfstest1:/usr/src/linux# dmesg |grep gulm
lock_gulm: Unknown symbol in6addr_loopback
lock_gulm: Unknown symbol in6addr_loopback
gfstest1:/usr/src/linux# 

I added ipv6 support (though I didnt want to). Is it required?

and anyway, I suppose they must have changed something in the kernel, as
I got this error.

I had to fix several things (2.6.x is changing so much, i guess) that I
found on the archives, but no fix refering to this ipv6 thing.

Can anybody give me a hint? :P

thanks in advance ;)

-- 
Javier Polo @ Datagrama
902 136 126



From addi at hugsmidjan.is  Thu Aug  4 11:00:47 2005
From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=)
Date: Thu, 04 Aug 2005 11:00:47 +0000
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <BC430F453501174992B9D9E8AFB7519A08567F@ausx3mps309.aus.amer.dell.com>
References: <BC430F453501174992B9D9E8AFB7519A08567F@ausx3mps309.aus.amer.dell.com>
Message-ID: <42F1F55F.1080708@hugsmidjan.is>

Well .. when I manually run the fence_drac.pl perl script and supply it 
with the ip of the DRAC (-a 192.168.100.173) login name (-l root) 
DRAC/MC module name (-m Server-1) and the password (-p dummypassword) 
the machine in question (Server-1) powers down and doesn't power back on.

How do I implement this in cluster.xml (specify the ip/login/pass/module 
name) and shouldn't it power back up afterwards ?


JACOB_LIBERMAN at Dell.com wrote:
> The 1855 has a built in ERA controller. You can modify the fencing agents to either send "racadm serveraction powercycle" or install the PERL telent module and create your own fencing script. The former option requires that the rac management software be installed on the host. I havent tested this with the 1855 btw.
> 
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster
> 
> The fence_drac agent out on the CVS should work for you. If you cant get it working, let me know, and ill see if I can dig up an 1855 in the lab.
> 
> Thanks, jacob
> 
> 
>>-----Original Message-----
>>From: linux-cluster-bounces at redhat.com 
>>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
>>"S?valdur Arnar Gunnarsson [Hugsmi?jan]"
>>Sent: Wednesday, August 03, 2005 6:59 AM
>>To: linux-cluster at redhat.com
>>Subject: [Linux-cluster] Fencing agents
>>
>>I'm implementing a shared storage between multiple (2 at the 
>>moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES 
>>connected to a EMC AX100 through FC.
>>
>>The SAN has two FC ports so the need for a FC Switch has not 
>>yet come however we will add other Blades in the coming months.
>>The one thing I haven't got figured out with GFS and the 
>>Cluster-Suite is the whole idea about fencing.
>>
>>We have a working setup using Centos rebuilds of the 
>>Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which 
>>we are not planning to use in the final implementation where 
>>we plan to use the official GFS packages from Red Hat.
>>The fencing agents in that setup is manual fencing.
>>
>>Both machines have the file system mounted and there appears 
>>to be no problems.
>>
>>What does "automatic" fencing have to offer that the manual 
>>fencing lacks.
>>If we decide to buy the FC switch right away is it recomended 
>>that we buy one of the ones that have fencing agent available 
>>for the Cluster-Suite ?
>>
>>If can't get our hands on supported FC switchs can we do 
>>fencing in another manner than throught a FC switch ?
>>
>>
>>
>>
>>-- 
>>S?valdur Gunnarsson :: Hugsmi?jan
>>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster


-- 
S?valdur Gunnarsson :: Hugsmi?jan



From javipolo at datagrama.net  Thu Aug  4 12:38:19 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Thu, 4 Aug 2005 14:38:19 +0200
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <BC430F453501174992B9D9E8AFB7519A08568B@ausx3mps309.aus.amer.dell.com>
References: <BC430F453501174992B9D9E8AFB7519A08568B@ausx3mps309.aus.amer.dell.com>
Message-ID: <20050804123819.GA20306@gibson.drslump.org>

I just modified a bit fence_sanbox2.pl so it can fence hosts telnetting
to the fiber switch.

It's an IBM 2005 H16 ... dunnow if other's IBM commands are the same.

Here it is, in case anyone find it useful :P

-- 
Javier Polo @ Datagrama
902 136 126
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fence_IBMswitch.pl
Type: text/x-perl
Size: 5062 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050804/781cb49a/attachment.bin>

From mtilstra at redhat.com  Thu Aug  4 13:30:20 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Thu, 4 Aug 2005 08:30:20 -0500
Subject: [Linux-cluster] ipv6_loopback symbol in 2.6.12
In-Reply-To: <20050804110048.GA18954@gibson.drslump.org>
References: <20050804110048.GA18954@gibson.drslump.org>
Message-ID: <20050804133020.GA21470@redhat.com>

On Thu, Aug 04, 2005 at 01:00:48PM +0200, Javi Polo wrote:
> Hi there
> 
> I managed to compile almost everything fine, and wanted to do some
> testing, but I realise that lock_gulm has some undefined reference:
> 
> gfstest1:/usr/src/linux# modprobe lock_gulm
> FATAL: Error inserting lock_gulm
> (/lib/modules/2.6.12.3/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko):
> Unknown symbol in module, or unknown parameter (see dmesg)
> gfstest1:/usr/src/linux# dmesg |grep gulm
> lock_gulm: Unknown symbol in6addr_loopback
> lock_gulm: Unknown symbol in6addr_loopback
> gfstest1:/usr/src/linux# 
> 
> I added ipv6 support (though I didnt want to). Is it required?

gulm requires ipv6.  If you plan on using cman/dlm, you don't need gulm,
and so can comment it out of the makefiles.


-- 
Michael Conrad Tadpol Tilstra
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050804/87494636/attachment.sig>

From travellig at yahoo.co.uk  Wed Aug  3 14:11:58 2005
From: travellig at yahoo.co.uk (Hernando Garcia)
Date: Wed, 03 Aug 2005 15:11:58 +0100
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <42F0B177.7050907@hugsmidjan.is>
References: <42F0B177.7050907@hugsmidjan.is>
Message-ID: <1123078318.4405.32.camel@hgarcia.surrey.redhat.com>

On Wed, 2005-08-03 at 11:58 +0000, "S?valdur Arnar Gunnarsson
[Hugsmi?jan]" wrote:
> What does "automatic" fencing have to offer that the manual fencing
> lacks.
Automatic fencing uses hardware to fence a node and reboot it.
Manual fencing relay on you to manually fence the node whenever you
release there is a problem in the cluster and relays on you to
prowercycle the faulty node manually, no very convenient when you are
sysadmin the cluster remotely.

> If we decide to buy the FC switch right away is it recomended that we 
> buy one of the ones that have fencing agent available for the 
> Cluster-Suite ?
 If you look at the configuration manual for RHCS, there is a list of
supported fencing agents.
> If can't get our hands on supported FC switchs can we do fencing in 
> another manner than throught a FC switch ?

Manual fencing.

Nando



From mbrookov at mines.edu  Thu Aug  4 14:26:43 2005
From: mbrookov at mines.edu (Matthew B. Brookover)
Date: Thu, 04 Aug 2005 08:26:43 -0600
Subject: [Linux-cluster] Is there backporting patches for 2.4.x kernel
	can be found?
In-Reply-To: <359782e7050803233458cf8e76@mail.gmail.com>
References: <359782e7050803233458cf8e76@mail.gmail.com>
Message-ID: <1123165603.1143.1.camel@merlin.Mines.EDU>

There is no back port that I am aware of, but there is the old version
that works on 2.4.

See http://www.gyrate.org/misc/gfs.txt

for build instructions.

Matt

On Thu, 2005-08-04 at 00:34, Q.L wrote:

> Hi, 
> 
> If I don't ask such a question, I can't give up the idea forever,
> although I know it is ultimately impossible.
> 
> Thanks,
> 
> Q.L
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050804/50d87e51/attachment.htm>

From javipolo at datagrama.net  Thu Aug  4 14:32:00 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Thu, 4 Aug 2005 16:32:00 +0200
Subject: [Linux-cluster] ipv6_loopback symbol in 2.6.12
In-Reply-To: <20050804133020.GA21470@redhat.com>
References: <20050804110048.GA18954@gibson.drslump.org>
	<20050804133020.GA21470@redhat.com>
Message-ID: <20050804143200.GA23365@gibson.drslump.org>

On Aug/04/2005, Michael Conrad Tadpol Tilstra wrote:

> > lock_gulm: Unknown symbol in6addr_loopback
> gulm requires ipv6.  If you plan on using cman/dlm, you don't need gulm,
> and so can comment it out of the makefiles.

I understand clm/gulm are lock managers ... what advantages has each
one? (or where could I read a little bit about it, I've been on RH
cluster page, but cant understand well what's better from one or
another, and what should I use ... :?

btw, I couldnt test gulm because of this change in 2.6.12 ... has
anybody a patch? O:)

thx

-- 
Javier Polo @ Datagrama
902 136 126



From addi at hugsmidjan.is  Thu Aug  4 15:21:48 2005
From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=)
Date: Thu, 04 Aug 2005 15:21:48 +0000
Subject: [Linux-cluster] Purpose of fencing devices
Message-ID: <42F2328C.4050700@hugsmidjan.is>

Could someone explain to me the purpose of the fencing hardware in a 
cluster with a shared storage resource.

When one of the cluster member goes down all access to the shared volume 
(GFS) is closed off.
No other cluster member can read or write to the volume until the failed 
node comes back up.

Are fencing devices used to close off the access the dead node has on 
the filesystem so the other nodes can access (read/write) the fileystem 
as usual ?
-- 
S?valdur Gunnarsson :: Hugsmi?jan



From oldmoonster at gmail.com  Thu Aug  4 16:16:24 2005
From: oldmoonster at gmail.com (Qin Li)
Date: Fri, 05 Aug 2005 00:16:24 +0800
Subject: [Linux-cluster] many compiling warning.
Message-ID: <42F23F58.4080606@gmail.com>

Hi,

I am trying to install cluster-1.0 on my Redhat/Fedora core 1 system, 
but the compiling is not clean.
See following:

make[3]: Entering directory `/usr/src/linux-2.6.12.2'
  Building modules, stage 2.
  MODPOST
*** Warning: "kcl_addref_cluster" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_get_node_by_addr" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_get_node_addresses" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_releaseref_cluster" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_get_current_interface" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_get_node_by_nodeid" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_leave_service" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_remove_callback" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_global_service_id" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_unregister_service" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_join_service" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_start_done" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_add_callback" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
*** Warning: "kcl_register_service" 
[/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!

What's wrong?

Thanks,

Q.L



From amanthei at redhat.com  Thu Aug  4 16:29:09 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Thu, 4 Aug 2005 11:29:09 -0500
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <42F1D28D.10006@veles.ru>
References: <42F0B177.7050907@hugsmidjan.is>
	<20050803212949.GA3268@redhat.com> <42F1D28D.10006@veles.ru>
Message-ID: <20050804162909.GD3268@redhat.com>

On Thu, Aug 04, 2005 at 12:32:13PM +0400, Denis Medvedev wrote:
> >>What does "automatic" fencing have to offer that the manual fencing lacks.
> >>If we decide to buy the FC switch right away is it recomended that we 
> >>buy one of the ones that have fencing agent available for the 
> >>Cluster-Suite ?
> >>   
> >>
> >
> >In this case, you already have a fencing agent (fence_drac) that works with
> >the PE 1855 blades so there is no need for further fencing hardware (unless
> >you are going to be connecting other machines to the cluster that aren't
> >going to have any other form of fencing)
> >
> >The main advantage that "automatic" fencing gives you over manual fencing 
> >is
> >that in the event that a fencing operation is required, your cluster can
> >automatically recover (on the order of seconds to minutes) instead of 
> >waiting
> >for user intervention (which can take minutes to hours to days depending on
> >how attentive the admins are :).    
> >
> > 
> >
> "recover"? You mean reboot? 

In order for the filesystem to recover, an expired node must first be
fenced.  In this case, since DRAC is being used, it means that the node is
probably rebooted.

> But if a machine need fencing, doesn't that 
> mean that something is inherently wrong with that machine and simple 
> reboot would't cure that?

Perhaps.  Otherwise it might be as simple as a network hiccup that causes
a node to miss enough heartbeats that result in a node getting fenced.  If
you want to leave the node in a state to debug it, then use a SAN based
fencing setup, thus isolating the node for the cluster and keeping it's
state intact for the admin to look at later... maybe (if the machine locks
up too hard, you won't be able to get into it anyway).  If you want to
automate the recovery process, but still make sure that nodes that got
fenced aren't automatically reintegrated into the cluster, you can use a
power based fencing agent that just turns the machine off and doesn't
attempt to power it back it on again.  

-- 
Adam Manthei  <amanthei at redhat.com>



From amanthei at redhat.com  Thu Aug  4 16:59:34 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Thu, 4 Aug 2005 11:59:34 -0500
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <42F1F55F.1080708@hugsmidjan.is>
References: <BC430F453501174992B9D9E8AFB7519A08567F@ausx3mps309.aus.amer.dell.com>
	<42F1F55F.1080708@hugsmidjan.is>
Message-ID: <20050804165934.GE3268@redhat.com>

On Thu, Aug 04, 2005 at 11:00:47AM +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote:
> Well .. when I manually run the fence_drac.pl perl script and supply it 
> with the ip of the DRAC (-a 192.168.100.173) login name (-l root) 
> DRAC/MC module name (-m Server-1) and the password (-p dummypassword) 
> the machine in question (Server-1) powers down and doesn't power back on.

Interesting... :(
 
> How do I implement this in cluster.xml (specify the ip/login/pass/module 
> name) 

Typically, the parameters are suppose to be in the manpage for the agent.
If they are not, then it should be considered a bug.  I think this will work
for, but I've not tested the bellow config, so it might not be error free :)

   <fencedevices>
      <fencedevice name="dracula"
         agent="fence_drac" 
         login="root" 
         passwd="dummypassword"
         ipaddr="192.168.100.173" 
         action="reboot" />
   </fencedevices>

   <clusternodes>
      <clusternode name="servername">
         <method name="1">
            <device name="dracula" module="Server-1"/>
         </method>
      </clusternode>
   </clusternodes>

>and shouldn't it power back up afterwards ?

The default action is suppose to be "reboot" as in the machine should come
back online.  I don't know why it isn't.  If you continue to have problems,
try enabling the debugging output from the command line:

fence_drac -a 192.168.100.173 -l root -p dummypassword -m Server-1 \
	-D /tmp/drac.log -v

Keep us posted.
-Adam

> JACOB_LIBERMAN at Dell.com wrote:
> >The 1855 has a built in ERA controller. You can modify the fencing agents 
> >to either send "racadm serveraction powercycle" or install the PERL telent 
> >module and create your own fencing script. The former option requires that 
> >the rac management software be installed on the host. I havent tested this 
> >with the 1855 btw.
> >
> >http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster
> >
> >The fence_drac agent out on the CVS should work for you. If you cant get 
> >it working, let me know, and ill see if I can dig up an 1855 in the lab.
> >
> >Thanks, jacob
> >
> >
> >>-----Original Message-----
> >>From: linux-cluster-bounces at redhat.com 
> >>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> >>"S?valdur Arnar Gunnarsson [Hugsmi?jan]"
> >>Sent: Wednesday, August 03, 2005 6:59 AM
> >>To: linux-cluster at redhat.com
> >>Subject: [Linux-cluster] Fencing agents
> >>
> >>I'm implementing a shared storage between multiple (2 at the 
> >>moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES 
> >>connected to a EMC AX100 through FC.
> >>
> >>The SAN has two FC ports so the need for a FC Switch has not 
> >>yet come however we will add other Blades in the coming months.
> >>The one thing I haven't got figured out with GFS and the 
> >>Cluster-Suite is the whole idea about fencing.
> >>
> >>We have a working setup using Centos rebuilds of the 
> >>Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which 
> >>we are not planning to use in the final implementation where 
> >>we plan to use the official GFS packages from Red Hat.
> >>The fencing agents in that setup is manual fencing.
> >>
> >>Both machines have the file system mounted and there appears 
> >>to be no problems.
> >>
> >>What does "automatic" fencing have to offer that the manual 
> >>fencing lacks.
> >>If we decide to buy the FC switch right away is it recomended 
> >>that we buy one of the ones that have fencing agent available 
> >>for the Cluster-Suite ?
> >>
> >>If can't get our hands on supported FC switchs can we do 
> >>fencing in another manner than throught a FC switch ?
> >>
> >>
> >>
> >>
> >>-- 
> >>S?valdur Gunnarsson :: Hugsmi?jan
> >>
> >>--
> >>Linux-cluster mailing list
> >>Linux-cluster at redhat.com
> >>http://www.redhat.com/mailman/listinfo/linux-cluster
> >>
> >
> >
> >--
> >Linux-cluster mailing list
> >Linux-cluster at redhat.com
> >http://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> -- 
> S?valdur Gunnarsson :: Hugsmi?jan
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Adam Manthei  <amanthei at redhat.com>



From cfeist at redhat.com  Thu Aug  4 18:54:39 2005
From: cfeist at redhat.com (Chris Feist)
Date: Thu, 04 Aug 2005 13:54:39 -0500
Subject: [Linux-cluster] many compiling warning.
In-Reply-To: <42F23F58.4080606@gmail.com>
References: <42F23F58.4080606@gmail.com>
Message-ID: <42F2646F.4080703@redhat.com>

The problem is caused by versioning done during the kernel build.  Because 
dlm.ko uses symbols from cman.ko it expects to know about those in the 
Module.symvers file in the kernel build directory.  But, since we don't modify 
that file and add the cman.ko symbols when we build cman-kernel dlm.ko can't 
find the cman.ko symbols when it builds.

It's not a big deal and the modules will load fine even with those warnings.

Thanks,
Chris

Qin Li wrote:
> Hi,
> 
> I am trying to install cluster-1.0 on my Redhat/Fedora core 1 system, 
> but the compiling is not clean.
> See following:
> 
> make[3]: Entering directory `/usr/src/linux-2.6.12.2'
>  Building modules, stage 2.
>  MODPOST
> *** Warning: "kcl_addref_cluster" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_get_node_by_addr" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_get_node_addresses" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_releaseref_cluster" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_get_current_interface" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_get_node_by_nodeid" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_leave_service" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_remove_callback" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_global_service_id" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_unregister_service" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_join_service" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_start_done" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_add_callback" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> *** Warning: "kcl_register_service" 
> [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined!
> 
> What's wrong?
> 
> Thanks,
> 
> Q.L
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From addi at hugsmidjan.is  Thu Aug  4 20:22:29 2005
From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=)
Date: Thu, 04 Aug 2005 20:22:29 +0000
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <20050804165934.GE3268@redhat.com>
References: <BC430F453501174992B9D9E8AFB7519A08567F@ausx3mps309.aus.amer.dell.com>	<42F1F55F.1080708@hugsmidjan.is>
	<20050804165934.GE3268@redhat.com>
Message-ID: <42F27905.7010908@hugsmidjan.is>

I'm sorry, that was my mistake, the machine does in fact power back up.

Adam Manthei wrote:
> On Thu, Aug 04, 2005 at 11:00:47AM +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote:
> 
>>Well .. when I manually run the fence_drac.pl perl script and supply it 
>>with the ip of the DRAC (-a 192.168.100.173) login name (-l root) 
>>DRAC/MC module name (-m Server-1) and the password (-p dummypassword) 
>>the machine in question (Server-1) powers down and doesn't power back on.
> 
> 
> Interesting... :(
>  
> 
>>How do I implement this in cluster.xml (specify the ip/login/pass/module 
>>name) 
> 
> 
> Typically, the parameters are suppose to be in the manpage for the agent.
> If they are not, then it should be considered a bug.  I think this will work
> for, but I've not tested the bellow config, so it might not be error free :)
> 
>    <fencedevices>
>       <fencedevice name="dracula"
>          agent="fence_drac" 
>          login="root" 
>          passwd="dummypassword"
>          ipaddr="192.168.100.173" 
>          action="reboot" />
>    </fencedevices>
> 
>    <clusternodes>
>       <clusternode name="servername">
>          <method name="1">
>             <device name="dracula" module="Server-1"/>
>          </method>
>       </clusternode>
>    </clusternodes>
> 
>>and shouldn't it power back up afterwards ?
> 
> 
> The default action is suppose to be "reboot" as in the machine should come
> back online.  I don't know why it isn't.  If you continue to have problems,
> try enabling the debugging output from the command line:
> 
> fence_drac -a 192.168.100.173 -l root -p dummypassword -m Server-1 \
> 	-D /tmp/drac.log -v
> 
> Keep us posted.
> -Adam
> 
> 
>>JACOB_LIBERMAN at Dell.com wrote:
>>
>>>The 1855 has a built in ERA controller. You can modify the fencing agents 
>>>to either send "racadm serveraction powercycle" or install the PERL telent 
>>>module and create your own fencing script. The former option requires that 
>>>the rac management software be installed on the host. I havent tested this 
>>>with the 1855 btw.
>>>
>>>http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster
>>>
>>>The fence_drac agent out on the CVS should work for you. If you cant get 
>>>it working, let me know, and ill see if I can dig up an 1855 in the lab.
>>>
>>>Thanks, jacob
>>>
>>>
>>>
>>>>-----Original Message-----
>>>>From: linux-cluster-bounces at redhat.com 
>>>>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
>>>>"S?valdur Arnar Gunnarsson [Hugsmi?jan]"
>>>>Sent: Wednesday, August 03, 2005 6:59 AM
>>>>To: linux-cluster at redhat.com
>>>>Subject: [Linux-cluster] Fencing agents
>>>>
>>>>I'm implementing a shared storage between multiple (2 at the 
>>>>moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES 
>>>>connected to a EMC AX100 through FC.
>>>>
>>>>The SAN has two FC ports so the need for a FC Switch has not 
>>>>yet come however we will add other Blades in the coming months.
>>>>The one thing I haven't got figured out with GFS and the 
>>>>Cluster-Suite is the whole idea about fencing.
>>>>
>>>>We have a working setup using Centos rebuilds of the 
>>>>Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which 
>>>>we are not planning to use in the final implementation where 
>>>>we plan to use the official GFS packages from Red Hat.
>>>>The fencing agents in that setup is manual fencing.
>>>>
>>>>Both machines have the file system mounted and there appears 
>>>>to be no problems.
>>>>
>>>>What does "automatic" fencing have to offer that the manual 
>>>>fencing lacks.
>>>>If we decide to buy the FC switch right away is it recomended 
>>>>that we buy one of the ones that have fencing agent available 
>>>>for the Cluster-Suite ?
>>>>
>>>>If can't get our hands on supported FC switchs can we do 
>>>>fencing in another manner than throught a FC switch ?
>>>>
>>>>
>>>>
>>>>
>>>>-- 
>>>>S?valdur Gunnarsson :: Hugsmi?jan
>>>>
>>>>--
>>>>Linux-cluster mailing list
>>>>Linux-cluster at redhat.com
>>>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>
>>>
>>>--
>>>Linux-cluster mailing list
>>>Linux-cluster at redhat.com
>>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>>-- 
>>S?valdur Gunnarsson :: Hugsmi?jan
>>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>http://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 


-- 
S?valdur Gunnarsson :: Hugsmi?jan



From brianu at silvercash.com  Thu Aug  4 23:01:44 2005
From: brianu at silvercash.com (brianu)
Date: Thu, 4 Aug 2005 16:01:44 -0700
Subject: [Linux-cluster] values for gnbd multipath dmsetup
Message-ID: <20050804230208.A0D195A86A7@mail.silvercash.com>

Hello all,

 

We would like to use GFS and GNBD for a SAN setup, but I am having trouble
setting up the multipath. Currently I have three GNBD servers mounting the
storage and exporting the volumes, I read the information in the post from
the previous thread but I am having trouble finding the correct values to
echo into 

Dmsetup, example from the thread below:

https://www.redhat.com/archives/linux-cluster/2005-April/msg00062.html

 

I found references for

> echo "0 167772160  multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1
1000 " | dmsetup create dm0
>  (251:0 ist the major:minor id of /dev/gnbd0)

 

I can see the major:minor blockid and the size from /sys/block/gnbd0 & gnbd1
so my  attempt looks as so:

 

> echo "0 2293981184  multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1
1000 " | dmsetup create dm0
>  (251:0 ist the major:minor id of /dev/gnbd0)

 

Which gives me an error of:

Device-mapper ioctl cmd 9 failed: Invalid argument

 

What am I doing wrong? Also the vaules from above specifically the " 0 0 1 1
& 0 2 1" & "1000" from the previous post can someone clarify where these are
coming from?

 

We are using a vanilla kernel-2.6.12 with from kernel.org with the cluster
software from CVS stabile for 2.6.12, from the path provided at
sources.redhat.com/cluster.

 

Kernel Params:

CONFIG_DM_MULTIPATH=m

 

OS=CENTOS4

 

 

 

Brian Urrutia

System Administrator

Price Communications Inc.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050804/4085cc7b/attachment.htm>

From oldmoonster at gmail.com  Fri Aug  5 02:28:29 2005
From: oldmoonster at gmail.com (Q.L)
Date: Fri, 5 Aug 2005 10:28:29 +0800
Subject: [Linux-cluster] Can't build on a RH9.0 system with
	cluster-1.00.00.tar.gz
Message-ID: <359782e705080419284d07b58c@mail.gmail.com>

Hi,

I build the cluster-1.00.00 againt kernel-2.6.12.2, but failed, could
you help to explain what's
wrong happen?

make[2]: Leaving directory `/home/share/cluster-1.00.00/ccs/lib'
cd ccs_test && make install
make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_test'
install -d /home/share/cluster-1.00.00/build/sbin
install ccs_test /home/share/cluster-1.00.00/build/sbin
make[2]: Leaving directory `/home/share/cluster-1.00.00/ccs/ccs_test'
cd ccs_tool && make install
make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_tool'
gcc -Wall -I. -I../config -I../include -I../lib
-I/home/share/cluster-1.00.00/build/incdir -Wall -O2
-D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags`
-DCCS_RELEASE_NAME=\"1.00.00\" -I. -I../config -I../include -I../lib
-I/home/share/cluster-1.00.00/build/incdir -o ccs_tool ccs_tool.c
update.c upgrade.c old_parser.c editconf.c -L../lib `xml2-config
--libs` -L/home/share/cluster-1.00.00/build/lib -lccs -lmagma
-lmagmamsg -ldl
/home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
to `pthread_rwlock_rdlock'
/home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
to `pthread_rwlock_unlock'
/home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
to `pthread_rwlock_wrlock'
collect2: ld returned 1 exit status
make[2]: *** [ccs_tool] Error 1
make[2]: Leaving directory `/home/share/cluster-1.00.00/ccs/ccs_tool'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/home/share/cluster-1.00.00/ccs'
make: *** [all] Error 2
[root at localhost cluster-1.00.00]#



From naoki at valuecommerce.com  Fri Aug  5 02:44:05 2005
From: naoki at valuecommerce.com (Naoki)
Date: Fri, 05 Aug 2005 11:44:05 +0900
Subject: [Linux-cluster] RH exporting local disks as LUNs.
Message-ID: <1123209846.17379.18.camel@dragon.sys.intra>

RH / Fedora can export devices as iSCSI.

This question may show my total ignorance but I'm not above that ;) 

Is there anything stopping me from exporting a device or a volume as a
raw SCSI or FC-AL LUN for example. Could a linux box be made to be (act
like) a SCSI or FC disk?

The idea is this..  Hook up a couple of boxes with plenty of internal
disk (as a raid 5 set) to an FC switch. Export the two LUNs and then on
a client server, LVM mirror the sets.  

I could then put NFS (for example) on the client server.

If there is a limitation here is it driver, hardware, or both?



From fabbione at fabbione.net  Fri Aug  5 05:06:26 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Fri, 05 Aug 2005 07:06:26 +0200
Subject: [PATCH] fix Re: [Linux-cluster] Compiling error against
	kernel-2.6.12.2
In-Reply-To: <359782e705080319293993cbf1@mail.gmail.com>
References: <359782e705080319293993cbf1@mail.gmail.com>
Message-ID: <42F2F3D2.7080900@fabbione.net>

Q.L wrote:
> Hi,
> 
> When I began to "make" on the RH9.0, I can't pass following errors,
> could you help me?
> however, it seems no compiling problem happen on a host with FC1.0.
> further more, what's the special config required in kernel .config
> file for GFS cluster?
> 
> Thanks.
> 
> Q.L
> 
> cd ccs_tool && make install
> make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_tool'
> gcc -Wall -I. -I../config -I../include -I../lib
> -I/home/share/cluster-1.00.00/build/incdir -Wall -O2
> -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags`
> -DCCS_RELEASE_NAME=\"1.00.00\" -I. -I../config -I../include -I../lib
> -I/home/share/cluster-1.00.00/build/incdir -o ccs_tool ccs_tool.c
> update.c upgrade.c old_parser.c editconf.c -L../lib `xml2-config
> --libs` -L/home/share/cluster-1.00.00/build/lib -lccs -lmagma
> -lmagmamsg -ldl
> /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
> to `pthread_rwlock_rdlock'
> /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
> to `pthread_rwlock_unlock'
> /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
> to `pthread_rwlock_wrlock'
> 

It appears also in Ubuntu, but i am not sure why or what did lost a link to
pthred. The patch in attachment fix the problem.

Fabio

-- 
no signature file found.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ccs_tool_and_pthread_love.dpatch
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050805/21eb11de/attachment.ksh>

From teigland at redhat.com  Fri Aug  5 07:14:15 2005
From: teigland at redhat.com (David Teigland)
Date: Fri, 5 Aug 2005 15:14:15 +0800
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <1122968724.3247.22.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
Message-ID: <20050805071415.GC14880@redhat.com>

On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

> * +static const uint32_t crc_32_tab[] = .....
> why do you duplicate this? The kernel has a perfectly good set of
> generic crc32 tables/functions just fine

The gfs2_disk_hash() function and the crc table on which it's based are a
part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
unusual since gfs uses a hash table on-disk for its directory structure.
This header, including the hash function/table, must be included by user
space programs like fsck that want to decipher a fs, and any change to the
function or table would effectively make the fs corrupted.  Because of
this I think it's best for gfs to keep it's own copy as part of its ondisk
format spec.

> * Why are you using bufferheads extensively in a new filesystem?

bh's are used for metadata, the log, and journaled data which need to be
written at the block granularity, not page.

> why do you use a rwsem and not a regular semaphore? You are aware that
> rwsems are far more expensive than regular ones right?  How skewed is
> the read/write ratio?

Aware, yes, it's the only rwsem in gfs.  Specific skew, no, we'll have to
measure that.

> * +++ b/fs/gfs2/fixed_div64.h	2005-08-01 14:13:08.009808200 +0800
> ehhhh why?

I'm not sure, actually, apart from the comments:

do_div: /* For ia32 we need to pull some tricks to get past various versions
           of the compiler which do not like us using do_div in the middle
           of large functions. */

do_mod: /* Side effect free 64 bit mod operation */

fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing,
perhaps this is an old problem that's now fixed?

> * int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
> +		   unsigned int size)
> +{
> +	int error;
> +
> +	if (bh)
> +		error = copy_to_user(*buf, bh->b_data + offset, size);
> +	else
> +		error = clear_user(*buf, size);
> 
> that looks to be missing a few kmaps.. whats the guarantee that b_data
> is actually, like in lowmem?

This is only used in the specific case of reading a journaled-data file.
That seems to effectively be the same as reading a buffer of fs metadata.

> The diaper device is a block device within gfs that gets transparently
> inserted between the real device the and rest of the filesystem.
> 
> hmmmm why not use device mapper or something? Is this really needed?

This is needed for the "withdraw" feature (described in the comment) which
is fairly important.  We'll see if dm could be used instead.

Thanks,
Dave



From mchristi at redhat.com  Fri Aug  5 07:27:08 2005
From: mchristi at redhat.com (Mike Christie)
Date: Fri, 05 Aug 2005 02:27:08 -0500
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050805071415.GC14880@redhat.com>
References: <20050802071828.GA11217@redhat.com>	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050805071415.GC14880@redhat.com>
Message-ID: <42F314CC.4000309@redhat.com>

David Teigland wrote:
> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
> 
>>* Why are you using bufferheads extensively in a new filesystem?
> 
> 
> bh's are used for metadata, the log, and journaled data which need to be
> written at the block granularity, not page.
> 

In a scsi tree
http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary
there is a function, bio_map_kern(), in fs.c that maps a buffer into a
bio. It does not have to be page granularity. Can something like that be
used in these places?



From mchristi at redhat.com  Fri Aug  5 07:30:41 2005
From: mchristi at redhat.com (Mike Christie)
Date: Fri, 05 Aug 2005 02:30:41 -0500
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <42F314CC.4000309@redhat.com>
References: <20050802071828.GA11217@redhat.com>	<1122968724.3247.22.camel@laptopd505.fenrus.org>	<20050805071415.GC14880@redhat.com>
	<42F314CC.4000309@redhat.com>
Message-ID: <42F315A1.7050408@redhat.com>

Mike Christie wrote:
> David Teigland wrote:
> 
>>On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
>>
>>
>>>* Why are you using bufferheads extensively in a new filesystem?
>>
>>
>>bh's are used for metadata, the log, and journaled data which need to be
>>written at the block granularity, not page.
>>
> 
> 
> In a scsi tree
> http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary

oh yeah it is in -mm too.

> there is a function, bio_map_kern(), in fs.c that maps a buffer into a
> bio. It does not have to be page granularity. Can something like that be
> used in these places?
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From arjan at infradead.org  Fri Aug  5 07:34:38 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Fri, 05 Aug 2005 09:34:38 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050805071415.GC14880@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050805071415.GC14880@redhat.com>
Message-ID: <1123227279.3239.1.camel@laptopd505.fenrus.org>

On Fri, 2005-08-05 at 15:14 +0800, David Teigland wrote:
> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
> 
> > * +static const uint32_t crc_32_tab[] = .....
> > why do you duplicate this? The kernel has a perfectly good set of
> > generic crc32 tables/functions just fine
> 
> The gfs2_disk_hash() function and the crc table on which it's based are a
> part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
> unusual since gfs uses a hash table on-disk for its directory structure.
> This header, including the hash function/table, must be included by user
> space programs like fsck that want to decipher a fs, and any change to the
> function or table would effectively make the fs corrupted.  Because of
> this I think it's best for gfs to keep it's own copy as part of its ondisk
> format spec.

for userspace there's libcrc32 as well. If it's *the* bog standard crc32
I don't see a reason why your "spec" can't just reference that instead.
And esp in the kernel you should just use the in kernel one not your own
regardless; you can assume the in kernel one is optimized and it also
keeps size down.




From mdl at veles.ru  Fri Aug  5 08:00:29 2005
From: mdl at veles.ru (Denis Medvedev)
Date: Fri, 05 Aug 2005 12:00:29 +0400
Subject: [Linux-cluster] RH exporting local disks as LUNs.
In-Reply-To: <1123209846.17379.18.camel@dragon.sys.intra>
References: <1123209846.17379.18.camel@dragon.sys.intra>
Message-ID: <42F31C9D.8090601@veles.ru>

Naoki ?????:

>RH / Fedora can export devices as iSCSI.
>
>This question may show my total ignorance but I'm not above that ;) 
>
>Is there anything stopping me from exporting a device or a volume as a
>raw SCSI or FC-AL LUN for example. Could a linux box be made to be (act
>like) a SCSI or FC disk?
>
>The idea is this..  Hook up a couple of boxes with plenty of internal
>disk (as a raid 5 set) to an FC switch. Export the two LUNs and then on
>a client server, LVM mirror the sets.  
>
>I could then put NFS (for example) on the client server.
>
>If there is a limitation here is it driver, hardware, or both?
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>
>  
>
Yes, you can.
Probably a single point of failure in this will be the NFS server and 
also the FC switch.
And why in this case not to put the storage directly to the NFS server?



From jengelh at linux01.gwdg.de  Fri Aug  5 08:28:13 2005
From: jengelh at linux01.gwdg.de (Jan Engelhardt)
Date: Fri, 5 Aug 2005 10:28:13 +0200 (MEST)
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050805071415.GC14880@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050805071415.GC14880@redhat.com>
Message-ID: <Pine.LNX.4.61.0508051026540.26367@yvahk01.tjqt.qr>

>The gfs2_disk_hash() function and the crc table on which it's based are a
>part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
>unusual since gfs uses a hash table on-disk for its directory structure.
>This header, including the hash function/table, must be included by user
>space programs like fsck that want to decipher a fs, and any change to the
>function or table would effectively make the fs corrupted.  Because of
>this I think it's best for gfs to keep it's own copy as part of its ondisk
>format spec.

Tune the spec to use kernel and libcrc32 tables and bump the version number of 
the spec to e.g. GFS 2.1. That way, things transform smoothly and could go out 
eventually at some later date.



Jan Engelhardt
-- 



From arjan at infradead.org  Fri Aug  5 08:34:32 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Fri, 05 Aug 2005 10:34:32 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <Pine.LNX.4.61.0508051026540.26367@yvahk01.tjqt.qr>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050805071415.GC14880@redhat.com>
	<Pine.LNX.4.61.0508051026540.26367@yvahk01.tjqt.qr>
Message-ID: <1123230872.3239.20.camel@laptopd505.fenrus.org>

On Fri, 2005-08-05 at 10:28 +0200, Jan Engelhardt wrote:
> >The gfs2_disk_hash() function and the crc table on which it's based are a
> >part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
> >unusual since gfs uses a hash table on-disk for its directory structure.
> >This header, including the hash function/table, must be included by user
> >space programs like fsck that want to decipher a fs, and any change to the
> >function or table would effectively make the fs corrupted.  Because of
> >this I think it's best for gfs to keep it's own copy as part of its ondisk
> >format spec.
> 
> Tune the spec to use kernel and libcrc32 tables and bump the version number of 
> the spec to e.g. GFS 2.1. That way, things transform smoothly and could go out 
> eventually at some later date.

afaik the tables aren't actually different. So no need to bump the spec!



From mikore.li at gmail.com  Fri Aug  5 08:45:32 2005
From: mikore.li at gmail.com (Michael)
Date: Fri, 5 Aug 2005 16:45:32 +0800
Subject: [Linux-cluster] gfs2 Vs gfs
Message-ID: <bc5727090508050145d518c88@mail.gmail.com>

Hi, 

Does that means redhat has already turn to development of gfs2 and
only bugfix for gfs? Where is the brief of the new features in
upcoming gfs2? when will the first release achieve?
Is there 1 or 5 years roadmap for GFS development?

Thanks,

Q.L



From teigland at redhat.com  Fri Aug  5 09:44:52 2005
From: teigland at redhat.com (David Teigland)
Date: Fri, 5 Aug 2005 17:44:52 +0800
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <1123227279.3239.1.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050805071415.GC14880@redhat.com>
	<1123227279.3239.1.camel@laptopd505.fenrus.org>
Message-ID: <20050805094452.GD14880@redhat.com>

On Fri, Aug 05, 2005 at 09:34:38AM +0200, Arjan van de Ven wrote:
> On Fri, 2005-08-05 at 15:14 +0800, David Teigland wrote:
> > On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
> > 
> > > * +static const uint32_t crc_32_tab[] = .....
> > > why do you duplicate this? The kernel has a perfectly good set of
> > > generic crc32 tables/functions just fine
> > 
> > The gfs2_disk_hash() function and the crc table on which it's based are a
> > part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
> > unusual since gfs uses a hash table on-disk for its directory structure.
> > This header, including the hash function/table, must be included by user
> > space programs like fsck that want to decipher a fs, and any change to the
> > function or table would effectively make the fs corrupted.  Because of
> > this I think it's best for gfs to keep it's own copy as part of its ondisk
> > format spec.
> 
> for userspace there's libcrc32 as well. If it's *the* bog standard crc32
> I don't see a reason why your "spec" can't just reference that instead.
> And esp in the kernel you should just use the in kernel one not your own
> regardless; you can assume the in kernel one is optimized and it also
> keeps size down.

linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[].
This looks like a standard that's not going to change, as you've said, so
including crc32table.h and getting rid of our own table would work fine.

Do we go a step beyond this and use say the crc32() function from
linux/crc32.h?  Is this _function_ as standard and unchanging as the table
of crcs?  In my tests it doesn't produce the same results as our
gfs2_disk_hash() function, even with both using the same crc table.  I
don't mind adopting a new function and just writing a user space
equivalent for the tools if it's a fixed standard.

Dave



From teigland at redhat.com  Fri Aug  5 10:31:38 2005
From: teigland at redhat.com (David Teigland)
Date: Fri, 5 Aug 2005 18:31:38 +0800
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050805100750.GA9818@wohnheim.fh-wedel.de>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050805071415.GC14880@redhat.com>
	<1123227279.3239.1.camel@laptopd505.fenrus.org>
	<20050805094452.GD14880@redhat.com>
	<20050805100750.GA9818@wohnheim.fh-wedel.de>
Message-ID: <20050805103138.GE14880@redhat.com>

On Fri, Aug 05, 2005 at 12:07:50PM +0200, J?rn Engel wrote:
> On Fri, 5 August 2005 17:44:52 +0800, David Teigland wrote:
> > Do we go a step beyond this and use say the crc32() function from
> > linux/crc32.h?  Is this _function_ as standard and unchanging as the table
> > of crcs?  In my tests it doesn't produce the same results as our
> > gfs2_disk_hash() function, even with both using the same crc table.  I
> > don't mind adopting a new function and just writing a user space
> > equivalent for the tools if it's a fixed standard.
> 
> The function is basically set in stone.  Variants exists depending on
> how it is called.  I know of four variants, but there may be more:
> 
> 1. Initial value is 0
> 2. Initial value is 0xffffffff
> a) Result is taken as-is
> b) Result is XORed with 0xffffffff
> 
> Maybe your code implements 1a, while you tried 2b with the lib/crc32.c
> function or something similar?

You're right, initial value 0xffffffff and xor result with 0xffffffff
matches the results from our function.  Great, we can get rid of
gfs2_disk_hash() and use crc32() directly.
Thanks,
Dave



From natecars at natecarlson.com  Fri Aug  5 13:21:18 2005
From: natecars at natecarlson.com (Nate Carlson)
Date: Fri, 5 Aug 2005 08:21:18 -0500 (CDT)
Subject: [Linux-cluster] RH exporting local disks as LUNs.
In-Reply-To: <42F31C9D.8090601@veles.ru>
References: <1123209846.17379.18.camel@dragon.sys.intra>
	<42F31C9D.8090601@veles.ru>
Message-ID: <Pine.LNX.4.61.0508050820530.15712@tungsten.technicality.org>

On Fri, 5 Aug 2005, Denis Medvedev wrote:
>> This question may show my total ignorance but I'm not above that ;) 
>> Is there anything stopping me from exporting a device or a volume as a
>> raw SCSI or FC-AL LUN for example. Could a linux box be made to be (act
>> like) a SCSI or FC disk?
> 
> Yes, you can.

Nifty! I was actually looking at this the other day, and couldn't figure 
out a way to do it.

Do you happen to have any documentation?

------------------------------------------------------------------------
| nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------



From mikore.li at gmail.com  Fri Aug  5 05:22:30 2005
From: mikore.li at gmail.com (Mikore Li)
Date: Fri, 5 Aug 2005 13:22:30 +0800
Subject: [PATCH] fix Re: [Linux-cluster] Compiling error against
	kernel-2.6.12.2
In-Reply-To: <42F2F3D2.7080900@fabbione.net>
References: <359782e705080319293993cbf1@mail.gmail.com>
	<42F2F3D2.7080900@fabbione.net>
Message-ID: <bc572709050804222239a6f7e1@mail.gmail.com>

Yes,Yes! It works!

Thanks,

Q.L

On 8/5/05, Fabio Massimo Di Nitto <fabbione at fabbione.net> wrote:
> Q.L wrote:
> > Hi,
> >
> > When I began to "make" on the RH9.0, I can't pass following errors,
> > could you help me?
> > however, it seems no compiling problem happen on a host with FC1.0.
> > further more, what's the special config required in kernel .config
> > file for GFS cluster?
> >
> > Thanks.
> >
> > Q.L
> >
> > cd ccs_tool && make install
> > make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_tool'
> > gcc -Wall -I. -I../config -I../include -I../lib
> > -I/home/share/cluster-1.00.00/build/incdir -Wall -O2
> > -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags`
> > -DCCS_RELEASE_NAME=\"1.00.00\" -I. -I../config -I../include -I../lib
> > -I/home/share/cluster-1.00.00/build/incdir -o ccs_tool ccs_tool.c
> > update.c upgrade.c old_parser.c editconf.c -L../lib `xml2-config
> > --libs` -L/home/share/cluster-1.00.00/build/lib -lccs -lmagma
> > -lmagmamsg -ldl
> > /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
> > to `pthread_rwlock_rdlock'
> > /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
> > to `pthread_rwlock_unlock'
> > /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference
> > to `pthread_rwlock_wrlock'
> >
> 
> It appears also in Ubuntu, but i am not sure why or what did lost a link to
> pthred. The patch in attachment fix the problem.
> 
> Fabio
> 
> --
> no signature file found.
> 
> 
> #! /bin/sh /usr/share/dpatch/dpatch-run
> ## ccs_tool_and_pthread_love.dpatch by  <fabbione at localhost.localdomain>
> ##
> ## All lines beginning with `## DP:' are a description of the patch.
> ## DP: No description.
> 
> @DPATCH@
> diff -urNad --exclude=CVS --exclude=.svn ./ccs/ccs_tool/Makefile /usr/src/dpatchtemp/dpep-work.OlU9tL/redhat-cluster-suite-1.20050729/ccs/ccs_tool/Makefile
> --- ./ccs/ccs_tool/Makefile     2005-07-29 06:48:35.000000000 +0200
> +++ /usr/src/dpatchtemp/dpep-work.OlU9tL/redhat-cluster-suite-1.20050729/ccs/ccs_tool/Makefile  2005-07-29 07:00:37.000000000 +0200
> @@ -25,7 +25,7 @@
>  endif
> 
>  LDFLAGS+= -L${ccs_libdir} `xml2-config --libs` -L${libdir}
> -LOADLIBES+= -lccs -lmagma -lmagmamsg -ldl
> +LOADLIBES+= -lccs -lmagma -lmagmamsg -ldl -lpthread
> 
>  all: ccs_tool
> 
> 
> 
>



From joern at wohnheim.fh-wedel.de  Fri Aug  5 10:07:50 2005
From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel)
Date: Fri, 5 Aug 2005 12:07:50 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050805094452.GD14880@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050805071415.GC14880@redhat.com>
	<1123227279.3239.1.camel@laptopd505.fenrus.org>
	<20050805094452.GD14880@redhat.com>
Message-ID: <20050805100750.GA9818@wohnheim.fh-wedel.de>

On Fri, 5 August 2005 17:44:52 +0800, David Teigland wrote:
> 
> linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[].
> This looks like a standard that's not going to change, as you've said, so
> including crc32table.h and getting rid of our own table would work fine.
> 
> Do we go a step beyond this and use say the crc32() function from
> linux/crc32.h?  Is this _function_ as standard and unchanging as the table
> of crcs?  In my tests it doesn't produce the same results as our
> gfs2_disk_hash() function, even with both using the same crc table.  I
> don't mind adopting a new function and just writing a user space
> equivalent for the tools if it's a fixed standard.

The function is basically set in stone.  Variants exists depending on
how it is called.  I know of four variants, but there may be more:

1. Initial value is 0
2. Initial value is 0xffffffff
a) Result is taken as-is
b) Result is XORed with 0xffffffff

Maybe your code implements 1a, while you tried 2b with the lib/crc32.c
function or something similar?

J?rn

-- 
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong



From javipolo at datagrama.net  Fri Aug  5 15:02:05 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Fri, 5 Aug 2005 17:02:05 +0200
Subject: [Linux-cluster] problem with fencing
Message-ID: <20050805150205.GA13010@gibson.drslump.org>

Hi there

I'm trying to set up gfs for work with a SAN ... and I want to use a
script for fencing, instead of fence_manual, but it doesnt works :/

to try that, I do a "ifconfig eth0 down" in gfstest2, and gfstest1's syslog says:
Aug  5 16:51:13 gfstest1 fenced: gfstest2 not a cluster member after 0 sec post_fail_delay
Aug  5 16:51:13 gfstest1 fenced: fencing node "gfstest2"
Aug  5 16:51:13 gfstest1 fence_manual: Node 192.168.1.2 needs to be reset before recovery can procede.  Waiting for 192.168.1.2 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 192.168.1.2)

I want it to be automatic, and I modified fence_sanbox2.pl so it fits
our switch commands. (I attached it on another mail some days ago)

the script works fine if I run it manually:
gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4      
portDisable 4 
success: disable 4
gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 -o enable
portEnable 4 
success: enable 4
gfstest1:~# 

could anybody give me a hint?
I'm using lock_dlm

this is my cluster.conf:

<?xml version="1.0"?>
<cluster name="test_cluster" config_version="4">

        <fencedevices>
          <fencedevice name="human" agent="fence_manual"/>
          <fencedevice name="san" agent="fence_IBM" ipaddr="10.1.1.1" login="admin" passwd="tangerine"/>
        </fencedevices>

        <fence_daemon clean_start="0">
        </fence_daemon>

        <cman>
        </cman>

        <clusternodes>
          <clusternode name="gfstest1" nodeid="1" votes="1">
             <fence>
               <method name="fibre">
                 <device name="san" port="5"/>
               </method>
               <method name="single">
                 <device name="human" ipaddr="192.168.1.1"/>
               </method>
             </fence>
          </clusternode>

          <clusternode name="gfstest2" nodeid="2" votes="1">
             <fence>
               <method name="fibre">
                 <device name="san" port="4"/>
               </method>
               <method name="single">
                 <device name="human" ipaddr="192.168.1.2"/>
               </method>
             </fence>
          </clusternode>

          <clusternode name="gfstest3" nodeid="3" votes="1">
             <fence>
               <method name="fibre">
                 <device name="san" port="3"/>
               </method>
               <method name="single">
                 <device name="human" ipaddr="192.168.1.3"/>
               </method>
             </fence>
          </clusternode>
        </clusternodes>

</cluster>

-- 
Javier Polo @ Datagrama
902 136 126



From amanthei at redhat.com  Fri Aug  5 15:21:02 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Fri, 5 Aug 2005 10:21:02 -0500
Subject: [Linux-cluster] problem with fencing
In-Reply-To: <20050805150205.GA13010@gibson.drslump.org>
References: <20050805150205.GA13010@gibson.drslump.org>
Message-ID: <20050805152102.GE7385@redhat.com>

On Fri, Aug 05, 2005 at 05:02:05PM +0200, Javi Polo wrote:
> Hi there
> 
> I'm trying to set up gfs for work with a SAN ... and I want to use a
> script for fencing, instead of fence_manual, but it doesnt works :/
> 
> to try that, I do a "ifconfig eth0 down" in gfstest2, and gfstest1's syslog says:
> Aug  5 16:51:13 gfstest1 fenced: gfstest2 not a cluster member after 0 sec post_fail_delay
> Aug  5 16:51:13 gfstest1 fenced: fencing node "gfstest2"
> Aug  5 16:51:13 gfstest1 fence_manual: Node 192.168.1.2 needs to be reset before recovery can procede.  Waiting for 192.168.1.2 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 192.168.1.2)
> 
> I want it to be automatic, and I modified fence_sanbox2.pl so it fits
> our switch commands. (I attached it on another mail some days ago)
> 
> the script works fine if I run it manually:
> gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4      
> portDisable 4 
> success: disable 4
> gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 -o enable
> portEnable 4 
> success: enable 4
> gfstest1:~# 
> 
> could anybody give me a hint?
> I'm using lock_dlm

Did you update the cluster.conf file across all the nodes?  Could it be that
gfstest1 still has the old cluster.conf file?  That might account for the
manual fencing being run.  

Another way that you are going to run into manual fencing using this
configuration is if the first method ("san") fails the second method
("single") will be called.  What's odd about that is that there should still
be something in the logs listing the output of the first command.  I would
hope that there would also be an error in the logs in the even that you
forgot to but "fence_IBM" in the path or make it executable.  I'd consider
it a bug if that wasn't the case.

Lastly, I've not looked too closely at your script for fence_IBMswitch (I
think that's what you called it in the previous email... did you rename it to
fence_IBM?), but success and failure is not determined by fenced on the
basis of the output, but on the exit status of the agent.  If the agent
returns 0, then it succeeds, otherwise it's a failed fencing operation.
This might explain why the second method is being called, but it wouldn't
explain why there is no output in the logs from the first.

 
> this is my cluster.conf:
> 
> <?xml version="1.0"?>
> <cluster name="test_cluster" config_version="4">
> 
>         <fencedevices>
>           <fencedevice name="human" agent="fence_manual"/>
>           <fencedevice name="san" agent="fence_IBM" ipaddr="10.1.1.1" login="admin" passwd="tangerine"/>
>         </fencedevices>
> 
>         <fence_daemon clean_start="0">
>         </fence_daemon>
> 
>         <cman>
>         </cman>
> 
>         <clusternodes>
>           <clusternode name="gfstest1" nodeid="1" votes="1">
>              <fence>
>                <method name="fibre">
>                  <device name="san" port="5"/>
>                </method>
>                <method name="single">
>                  <device name="human" ipaddr="192.168.1.1"/>
>                </method>
>              </fence>
>           </clusternode>
> 
>           <clusternode name="gfstest2" nodeid="2" votes="1">
>              <fence>
>                <method name="fibre">
>                  <device name="san" port="4"/>
>                </method>
>                <method name="single">
>                  <device name="human" ipaddr="192.168.1.2"/>
>                </method>
>              </fence>
>           </clusternode>
> 
>           <clusternode name="gfstest3" nodeid="3" votes="1">
>              <fence>
>                <method name="fibre">
>                  <device name="san" port="3"/>
>                </method>
>                <method name="single">
>                  <device name="human" ipaddr="192.168.1.3"/>
>                </method>
>              </fence>
>           </clusternode>
>         </clusternodes>
> 
> </cluster>
> 
> -- 
> Javier Polo @ Datagrama
> 902 136 126
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Adam Manthei  <amanthei at redhat.com>



From amanthei at redhat.com  Fri Aug  5 15:28:17 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Fri, 5 Aug 2005 10:28:17 -0500
Subject: [Linux-cluster] problem with fencing
In-Reply-To: <20050805150205.GA13010@gibson.drslump.org>
References: <20050805150205.GA13010@gibson.drslump.org>
Message-ID: <20050805152817.GF7385@redhat.com>

On Fri, Aug 05, 2005 at 05:02:05PM +0200, Javi Polo wrote:
> could anybody give me a hint?
how about this?  looks like there is a typo in you cluster.conf

your test command...
> gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 -o enable
              ^^^^^^^^^^^^^^^

your cluster.conf...
>           <fencedevice name="san" agent="fence_IBM" ipaddr="10.1.1.1" 
                                           ^^^^^^^^^
>                  login="admin" passwd="tangerine"/>


-- 
Adam Manthei  <amanthei at redhat.com>



From javipolo at datagrama.net  Fri Aug  5 15:46:57 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Fri, 5 Aug 2005 17:46:57 +0200
Subject: [Linux-cluster] problem with fencing
In-Reply-To: <20050805152817.GF7385@redhat.com>
References: <20050805150205.GA13010@gibson.drslump.org>
	<20050805152817.GF7385@redhat.com>
Message-ID: <20050805154657.GA14355@gibson.drslump.org>

On Aug/05/2005, Adam Manthei wrote:

> > could anybody give me a hint?
> how about this?  looks like there is a typo in you cluster.conf
> > gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 -o enable
>               ^^^^^^^^^^^^^^^
> >           <fencedevice name="san" agent="fence_IBM" ipaddr="10.1.1.1" 
>                                            ^^^^^^^^^

yipes ... that was a big and stupid mistake :/
thanks for helping me see it (i've spent lot of time today trying to
debug this ... sigh)

anyway, now I cannot seem to even join the fence cluster with the
machine I'm doing the testing and so ..:

gfstest2:~# fence_tool -w -D join
fence_tool: wait for quorum 1
fence_tool: get our node name
fence_tool: connect to ccs
fence_tool: start fenced
fenced: 1123256223 our name from cman "gfstest2"
fenced: 1123256223 delay post_join 6s post_fail 0s
fenced: 1123256223 added 3 nodes from ccs
^C
gfstest2:~# fence_tool -w -D -c join
fence_tool: wait for quorum 1
fence_tool: get our node name
fence_tool: connect to ccs
fence_tool: start fenced
fenced: 1123256379 our name from cman "gfstest2"
fenced: 1123256379 delay post_join 6s post_fail 0s
fenced: 1123256379 clean start, skipping initial nodes
fenced: fence_domain_add: service set level failed
gfstest2:~# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,3
[]

gfstest2:~# 

I maybe leave it for monday, as today I'm having a pretty bad day :/

-- 
Javier Polo @ Datagrama
902 136 126



From eric at bootseg.com  Fri Aug  5 16:21:40 2005
From: eric at bootseg.com (Eric Kerin)
Date: Fri, 05 Aug 2005 12:21:40 -0400
Subject: [Linux-cluster] problem with fencing
In-Reply-To: <20050805152102.GE7385@redhat.com>
References: <20050805150205.GA13010@gibson.drslump.org>
	<20050805152102.GE7385@redhat.com>
Message-ID: <1123258900.3350.8.camel@auh5-0479.corp.jabil.org>

On Fri, 2005-08-05 at 10:21 -0500, Adam Manthei wrote:
> What's odd about that is that there should still
> be something in the logs listing the output of the first command. 

I just did a quick test, and confirmed that it won't produce an error
unless all of the fence devices for a node fail to fence the server.
Right now it treats inability to execute the fence script the same as if
the fence script failed to fence the node.   Also the failure condition
where the exec returns does not produce any output, so nothing gets
displayed (or sent to syslog).

The attached patch (diff against RHEL4) will produce an error message
when the exec fails with the error message. Also it display a message
when no output is produced by a fence agent, for a failed exec.

Thanks,
Eric
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fenced_badagentfix.patch
Type: text/x-patch
Size: 1180 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050805/df3d9e75/attachment.bin>

From brianu at silvercash.com  Fri Aug  5 17:38:10 2005
From: brianu at silvercash.com (brianu)
Date: Fri, 5 Aug 2005 10:38:10 -0700
Subject: [Linux-cluster] RE: values for gnbd multipath dmsetup
Message-ID: <20050805173812.CA6F65A8619@mail.silvercash.com>

Hello,

 

Ok I figured out id just try some of the vaules from the previous post
without fully understanding them, and multipath appears to be working.

 

dm-1

[size=546 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active][first]

  \_ 0:0:0:0      251:0   [undef ][active]

\_ round-robin 0 [enabled]

  \_ 0:0:0:0      251:4   [undef ][active]

 

 

 

But I stil get an error

 

[root at dell-1650-31 ~]# mount -t gfs /dev/mapper/dm-1 /mnt/gfs1

mount: /dev/dm-1: can't read superblock

 

if I do a dmsetup remove dm-1, then mount the individual gnbds all is well,
but the purpose of this is to enable some sore of failover which I am told
GNBD has the capability of doing.

 

>From redhats main site and documentation for gfs 6.1 they state that
multipath is not supported in the 6.1 realease however I optained this
source from CVS and the main docs for
http://sources.redhat.com/cluster/gnbd/  state that multipath is an option.
Can someone clarify whether the CVS stabile sources for kernel-2.6.12 is
multipath compatable, or am I doing something wrong?

 

Current specs.

SAN -> MSA-1000

 

3 GNBD servers currently using software iSCSI to mount that SAN - will prob
go fiber if I can figure this out. ( lets say this cluster is called
cluster1)

Using DLM & GNBD

 

1 client for testing separate cluster name lets say "cluster2"

Client mounted the gnbd from one of the servers that is exporting it, the
servers are not mounting it, then formatted the device with gfs & created 20
journals size of 32MB each, remounted the device and verified write and read
(bonnie++)

Ran dmsetup to round robin the devices then failed to mount the volume as
shown above.

 

 

 

Brian Urrutia

System Administrator

Price Communications Inc.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050805/f10a8bff/attachment.htm>

From brianu at silvercash.com  Fri Aug  5 19:22:34 2005
From: brianu at silvercash.com (brianu)
Date: Fri, 5 Aug 2005 12:22:34 -0700
Subject: [Linux-cluster] RE: values for gnbd multipath dmsetup
In-Reply-To: <20050805173812.CA6F65A8619@mail.silvercash.com>
Message-ID: <20050805192247.AC0C15A868B@mail.silvercash.com>

Hello,

 

Just thought I'd plug in some more info I've also tried testing this with a
stock FC4 client with all the Cluster RPMs installed.

 

GFS: fsid=sclients:mygfs.0: jid=17: Trying to acquire journal lock...

GFS: fsid=sclients:mygfs.0: jid=17: Looking at journal...

GFS: fsid=sclients:mygfs.0: jid=17: Done

GFS: fsid=sclients:mygfs.0: jid=18: Trying to acquire journal lock...

GFS: fsid=sclients:mygfs.0: jid=18: Looking at journal...

GFS: fsid=sclients:mygfs.0: jid=18: Done

GFS: fsid=sclients:mygfs.0: jid=19: Trying to acquire journal lock...

GFS: fsid=sclients:mygfs.0: jid=19: Looking at journal...

GFS: fsid=sclients:mygfs.0: jid=19: Done

GFS: fsid=sclients:mygfs.0: jid=20: Trying to acquire journal lock...

GFS: fsid=sclients:mygfs.0: jid=20: Looking at journal...

attempt to access beyond end of device

dm-0: rw=0, want=1146990600, limit=1146990592

GFS: fsid=sclients:mygfs.0: fatal: I/O error

GFS: fsid=sclients:mygfs.0:   block = 143373824

GFS: fsid=sclients:mygfs.0:   function = gfs_dreread

GFS: fsid=sclients:mygfs.0:   file =
/usr/src/build/588748-i686/BUILD/xen0/src/gfs/dio.c, line = 576

GFS: fsid=sclients:mygfs.0:   time = 1123268277

GFS: fsid=sclients:mygfs.0: about to withdraw from the cluster

GFS: fsid=sclients:mygfs.0: waiting for outstanding I/O

GFS: fsid=sclients:mygfs.0: telling LM to withdraw

lock_dlm: withdraw abandoned memory

GFS: fsid=sclients:mygfs.0: withdrawn

GFS: fsid=sclients:mygfs.0: jid=20: Failed

GFS: fsid=sclients:mygfs.0: error recovering journal 20: -5

[root at 5n@k3bi73 ~]#   

Aug  5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0:   time =
1123268277

Aug  5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: about to
withdraw from the cluster

Aug  5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: waiting for
outstanding I/O

Aug  5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: telling LM to
withdraw

Aug  5 11:57:57 5n at k3bi73 kernel: lock_dlm: withdraw abandoned memory

Aug  5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: withdrawn

Aug  5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: jid=20: Failed

 

This off a multipathed device which dmsetup status gives the output below:

 

dm-1: 0 1146990592 multipath 1 0 0 2 1 A 0 1 0 251:0 A 0 E 0 1 0 251:4 A 0

dmsetup deps gives

dm-1: 2 dependencies    : (251, 4) (251, 0)

and dmsetup info gives

Name:              dm-1

State:             ACTIVE

Tables present:    LIVE

Open count:        0

Event number:      0

Major, minor:      253, 1

Number of targets: 1

 

 

  _____  

From: brianu [mailto:brianu at silvercash.com] 
Sent: Friday, August 05, 2005 10:38 AM
To: linux-cluster at redhat.com
Cc: brianu at silvercash.com
Subject: RE: values for gnbd multipath dmsetup

 

Hello,

 

Ok I figured out id just try some of the vaules from the previous post
without fully understanding them, and multipath appears to be working.

 

dm-1

[size=546 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active][first]

  \_ 0:0:0:0      251:0   [undef ][active]

\_ round-robin 0 [enabled]

  \_ 0:0:0:0      251:4   [undef ][active]

 

 

 

But I stil get an error

 

[root at dell-1650-31 ~]# mount -t gfs /dev/mapper/dm-1 /mnt/gfs1

mount: /dev/dm-1: can't read superblock

 

if I do a dmsetup remove dm-1, then mount the individual gnbds all is well,
but the purpose of this is to enable some sore of failover which I am told
GNBD has the capability of doing.

 

>From redhats main site and documentation for gfs 6.1 they state that
multipath is not supported in the 6.1 realease however I optained this
source from CVS and the main docs for
http://sources.redhat.com/cluster/gnbd/  state that multipath is an option.
Can someone clarify whether the CVS stabile sources for kernel-2.6.12 is
multipath compatable, or am I doing something wrong?

 

Current specs.

SAN -> MSA-1000

 

3 GNBD servers currently using software iSCSI to mount that SAN - will prob
go fiber if I can figure this out. ( lets say this cluster is called
cluster1)

Using DLM & GNBD

 

1 client for testing separate cluster name lets say "cluster2"

Client mounted the gnbd from one of the servers that is exporting it, the
servers are not mounting it, then formatted the device with gfs & created 20
journals size of 32MB each, remounted the device and verified write and read
(bonnie++)

Ran dmsetup to round robin the devices then failed to mount the volume as
shown above.

 

 

 

Brian Urrutia

System Administrator

Price Communications Inc.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050805/e54f340e/attachment.htm>

From thomsonr at ucalgary.ca  Sat Aug  6 01:08:25 2005
From: thomsonr at ucalgary.ca (Ryan Thomson)
Date: Fri, 05 Aug 2005 19:08:25 -0600
Subject: [Linux-cluster] Clustered LDAP, good or bad idea?
Message-ID: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca>

Hello list,

I've been thinking about running OpenLDAP on our to-be GFS/RHCS based
storage cluster. I was thinking I could run LDAP as a service with all
of it's data on a shared disk so that if the node with that service goes
down, another node can pickup the service. It would be nice to have
failover support for LDAP without using replication to another OpenLDAP
server.

I've already done tests on a two node cluster and it seems to work fine
but it "seeming" to work fine isn't much confidence, I must admit.

I have the configuration, lock files and database all on shared storage.
I've modified the LDAP init.d script on each node to start/stop LDAP
since the config and lock files aren't in the default spot anymore.

I'm a bit concerned about failures since I can't test that properly in a
two-node cluster.

I suppose what I'm really asking is this: Is running LDAP as a cluster
service a particularly bad idea for any reason?

-- 
Ryan Thomson
Systems Administrator
University Of Calgary, Biocomputing
Phone: (403) 220-2264
Email: thomsonr at ucalgary.ca



From robert at deakin.edu.au  Sat Aug  6 01:13:22 2005
From: robert at deakin.edu.au (Robert Ruge)
Date: Sat, 6 Aug 2005 11:13:22 +1000
Subject: [Linux-cluster] I love GFS
Message-ID: <200508060113.j761DCKc013924@deakin.edu.au>

I would just like to say a big thank you to all those who have created
GFS. I have fallen in love with it in the last month.

I have a setup where all of my Windows infrastructure/servers run
under vmware and I am currently migrating the vmware images to a GFS
cluster that allows me some pretty cool failover and load balancing of
the virtual machines.

I have one question though - what is the real world experience with
the reliability of a GFS filesystem? In my case if I lose the vmware
GFS image repository I would lose 6 or more virtual servers and all of
our Windows infrastructure, which would be a major pain. Something
deep down says that putting all of my eggs in one basket is a bad
idea, but there are some great advantages to doing it this way.

What do other people think?

Robert Ruge   School of Information Technology, Deakin University 



From teigland at redhat.com  Mon Aug  8 06:26:36 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 8 Aug 2005 14:26:36 +0800
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050805071415.GC14880@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050805071415.GC14880@redhat.com>
Message-ID: <20050808062636.GB13951@redhat.com>

On Fri, Aug 05, 2005 at 03:14:15PM +0800, David Teigland wrote:
> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

> > * +++ b/fs/gfs2/fixed_div64.h	2005-08-01 14:13:08.009808200 +0800
> > ehhhh why?
> 
> I'm not sure, actually, apart from the comments:
> 
> do_div: /* For ia32 we need to pull some tricks to get past various versions
>            of the compiler which do not like us using do_div in the middle
>            of large functions. */
> 
> do_mod: /* Side effect free 64 bit mod operation */
> 
> fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing,
> perhaps this is an old problem that's now fixed?

I've looked into getting rid of these:

- The existing do_div() works fine for me with 64 bit numerators, so I'll
  get rid of the "fixed" version.

- The "fixed" do_mod() seems to be the only way to do 64 bit modulus.
  It would be great if I was wrong about that...

Thanks,
Dave



From mikore.li at gmail.com  Mon Aug  8 08:34:05 2005
From: mikore.li at gmail.com (Michael)
Date: Mon, 8 Aug 2005 16:34:05 +0800
Subject: [Linux-cluster] Any update for this document?
Message-ID: <bc57270905080801347c4165a@mail.gmail.com>

Hi,

I found a good document that introduces GFS implementation which is
release in the middle of last year, however, many APIs has been
changed since that. To have a good understanding of GFS, I'd like to
get the latest version of it for your latest release, could you point
to me the link?

ps: the doc name is "Symmetric Cluster Architecture and Component
Technical Specifications"

Thanks,

Michael



From teigland at redhat.com  Mon Aug  8 09:00:50 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 8 Aug 2005 17:00:50 +0800
Subject: [Linux-cluster] Any update for this document?
In-Reply-To: <bc57270905080801347c4165a@mail.gmail.com>
References: <bc57270905080801347c4165a@mail.gmail.com>
Message-ID: <20050808090050.GC13951@redhat.com>

On Mon, Aug 08, 2005 at 04:34:05PM +0800, Michael wrote:
> Hi,
> 
> I found a good document that introduces GFS implementation which is
> release in the middle of last year, however, many APIs has been
> changed since that. To have a good understanding of GFS, I'd like to
> get the latest version of it for your latest release, could you point
> to me the link?
> 
> ps: the doc name is "Symmetric Cluster Architecture and Component
> Technical Specifications"

In general terms, this document describes pretty well the code that's in
the RHEL4 branch.  It doesn't talk much about GFS, it's mostly about the
cluster infrastructure.  As you've said, the specific api's and code
examples are very outdated -- there's no recent version. 

If you're interested in the current development of the cluster
infrastructure (where most things run in user space), then that document
is useless I'm afraid.

Dave



From teigland at redhat.com  Mon Aug  8 09:57:47 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 8 Aug 2005 17:57:47 +0800
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <84144f0205080223445375c907@mail.gmail.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
Message-ID: <20050808095747.GD13951@redhat.com>

On Wed, Aug 03, 2005 at 09:44:06AM +0300, Pekka Enberg wrote:

> > +uint32_t gfs2_hash(const void *data, unsigned int len)
> > +{
> > +     uint32_t h = 0x811C9DC5;
> > +     h = hash_more_internal(data, len, h);
> > +     return h;
> > +}
> 
> Is there a reason why you cannot use <linux/hash.h> or <linux/jhash.h>?

See gfs2_hash_more() and comment; we hash discontiguous regions.

> > +#define RETRY_MALLOC(do_this, until_this) \
> > +for (;;) { \
> > +     { do_this; } \
> > +     if (until_this) \
> > +             break; \
> > +     if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
> > +             printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \
> > +             gfs2_malloc_warning = jiffies; \
> > +     } \
> > +     yield(); \
> > +}
> 
> Please drop this.

Done in the spot that could deal with an error, but there are three other
places that still need it.

> > +static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
> > +                 struct gfs2_ea_location *el)
> > +{
> > +     {
> > +             struct ea_set es;
> > +             int error;
> > +
> > +             memset(&es, 0, sizeof(struct ea_set));
> > +             es.es_er = er;
> > +             es.es_el = el;
> > +
> > +             error = ea_foreach(ip, ea_set_simple, &es);
> > +             if (error > 0)
> > +                     return 0;
> > +             if (error)
> > +                     return error;
> > +     }
> > +     {
> > +             unsigned int blks = 2;
> > +             if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
> > +                     blks++;
> > +             if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize)
> > +                     blks += DIV_RU(er->er_data_len,
> > +                                    ip->i_sbd->sd_jbsize);
> > +
> > +             return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
> > +     }
> 
> Please drop the extra braces.

Here and elsewhere we try to keep unused stuff off the stack.  Are you
suggesting that we're being overly cautious, or do you just dislike the
way it looks?

Thanks,
Dave



From arjan at infradead.org  Mon Aug  8 10:05:25 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Mon, 08 Aug 2005 12:05:25 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050808095747.GD13951@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
Message-ID: <1123495525.3245.36.camel@laptopd505.fenrus.org>

On Mon, 2005-08-08 at 17:57 +0800, David Teigland wrote:
> > 
> > Please drop the extra braces.
> 
> Here and elsewhere we try to keep unused stuff off the stack.  Are you
> suggesting that we're being overly cautious, or do you just dislike the
> way it looks?

nice theory. In practice gcc 3.x still adds up all the stack space
anyway and as long as gcc 3.x is a supported kernel compiler, you can't
depend on this. Also.. please favor readability. gcc is getting smarter
about stack use nowadays, and {}'s shouldn't be needed to help it, it
tracks liveness of variables already.





From teigland at redhat.com  Mon Aug  8 10:56:13 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 8 Aug 2005 18:56:13 +0800
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <courier.42F73185.00006260@courier.cs.helsinki.fi>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
	<courier.42F73185.00006260@courier.cs.helsinki.fi>
Message-ID: <20050808105613.GE13951@redhat.com>

On Mon, Aug 08, 2005 at 01:18:45PM +0300, Pekka J Enberg wrote:

> gfs2-02.patch:+ RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, 
> -> GFP_NOFAIL. 

Already gone, inode_create() can return an error.

if (create) {
        RETRY_MALLOC(page = grab_cache_page(aspace->i_mapping, index),
                     page);
} else {
        page = find_lock_page(aspace->i_mapping, index);
        if (!page)
                return NULL;
}

> I think you can set aspace->flags to GFP_NOFAIL 

will try that

> but why can't you return NULL here on failure like you do for
> find_lock_page()? 

because create is set

> gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep, 
>                                                    GFP_KERNEL),
> -> GFP_NOFAIL 

It looks to me like NOFAIL does nothing for kmem_cache_alloc().
Am I seeing that wrong?

> gfs2-10.patch:+         RETRY_MALLOC(new_gh = gfs2_holder_get(gl, state,
> gfs2_holder_get uses kmalloc which can use GFP_NOFAIL. 

Which means adding a new gfp_flags parameter... fine.

Dave



From teigland at redhat.com  Mon Aug  8 11:39:10 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 8 Aug 2005 19:39:10 +0800
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <courier.42F73AB3.00006AEE@courier.cs.helsinki.fi>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
	<courier.42F73185.00006260@courier.cs.helsinki.fi>
	<20050808105613.GE13951@redhat.com>
	<courier.42F73AB3.00006AEE@courier.cs.helsinki.fi>
Message-ID: <20050808113910.GF13951@redhat.com>

On Mon, Aug 08, 2005 at 01:57:55PM +0300, Pekka J Enberg wrote:
> David Teigland writes:
> >> but why can't you return NULL here on failure like you do for
> >> find_lock_page()?  
> >
> >because create is set
> 
> Yes, but looking at (some of the) top-level callers, there's no real reason 
> why create must not fail. Am I missing something here?

I'll trace the callers back farther and see about dealing with errors.

> >> gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep, 
> 
> It is passed to the page allocator just like with kmalloc() which uses 
> __cache_alloc() too. 

Yes, I read it wrongly, looks like NOFAIL should work fine.  I think we
can get rid of the RETRY macro entirely.
Thanks,
Dave



From penberg at cs.helsinki.fi  Mon Aug  8 10:00:43 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Mon, 08 Aug 2005 13:00:43 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050808095747.GD13951@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
Message-ID: <courier.42F72D4B.00006DE7@courier.cs.helsinki.fi>

David Teigland writes:
> > > +static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
> > > +                 struct gfs2_ea_location *el)
> > > +{
> > > +     {
> > > +             struct ea_set es;
> > > +             int error;
> > > +
> > > +             memset(&es, 0, sizeof(struct ea_set));
> > > +             es.es_er = er;
> > > +             es.es_el = el;
> > > +
> > > +             error = ea_foreach(ip, ea_set_simple, &es);
> > > +             if (error > 0)
> > > +                     return 0;
> > > +             if (error)
> > > +                     return error;
> > > +     }
> > > +     {
> > > +             unsigned int blks = 2;
> > > +             if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
> > > +                     blks++;
> > > +             if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize)
> > > +                     blks += DIV_RU(er->er_data_len,
> > > +                                    ip->i_sbd->sd_jbsize);
> > > +
> > > +             return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
> > > +     }
> > 
> > Please drop the extra braces. 
> 
> Here and elsewhere we try to keep unused stuff off the stack.  Are you
> suggesting that we're being overly cautious, or do you just dislike the
> way it looks?

The extra braces hurt readability. Please drop them or make them proper 
functions instead. 

And yes, I think you're hiding potential stack usage problems here. Small 
unused stuff on the stack don't matter and large ones should probably be 
kmalloc() anyway. 

               Pekka 



From penberg at cs.helsinki.fi  Mon Aug  8 10:18:45 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Mon, 08 Aug 2005 13:18:45 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050808095747.GD13951@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
Message-ID: <courier.42F73185.00006260@courier.cs.helsinki.fi>

David Teigland writes:
> > > +#define RETRY_MALLOC(do_this, until_this) \
> > > +for (;;) { \
> > > +     { do_this; } \
> > > +     if (until_this) \
> > > +             break; \
> > > +     if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
> > > +             printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \
> > > +             gfs2_malloc_warning = jiffies; \
> > > +     } \
> > > +     yield(); \
> > > +}
> > 
> > Please drop this. 
> 
> Done in the spot that could deal with an error, but there are three other
> places that still need it.

Which places are those? I only see these: 

gfs2-02.patch:+ RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, 
GFP_KERNEL), ip);
gfs2-02.patch-+ gfs2_memory_add(ip);
gfs2-02.patch-+ memset(ip, 0, sizeof(struct gfs2_inode));
gfs2-02.patch-+
gfs2-02.patch-+ ip->i_num = *inum;
gfs2-02.patch-+ 

 -> GFP_NOFAIL. 

gfs2-02.patch:+         RETRY_MALLOC(page = 
grab_cache_page(aspace->i_mapping, index),
gfs2-02.patch-+                      page);
gfs2-02.patch-+ } else {
gfs2-02.patch-+         page = find_lock_page(aspace->i_mapping, index);
gfs2-02.patch-+         if (!page)
gfs2-02.patch-+                 return NULL; 

I think you can set aspace->flags to GFP_NOFAIL but why can't you return 
NULL here on failure like you do for find_lock_page()? 

gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep, 
GFP_KERNEL),
gfs2-02.patch-+              bd);
gfs2-02.patch-+ gfs2_memory_add(bd);
gfs2-02.patch-+ atomic_inc(&gl->gl_sbd->sd_bufdata_count);
gfs2-02.patch-+
gfs2-02.patch-+ memset(bd, 0, sizeof(struct gfs2_bufdata)); 

 -> GFP_NOFAIL 

gfs2-08.patch:+ RETRY_MALLOC(gm = kmalloc(sizeof(struct gfs2_memory), 
GFP_KERNEL), gm);
gfs2-08.patch-+ gm->gm_data = data;
gfs2-08.patch-+ gm->gm_file = file;
gfs2-08.patch-+ gm->gm_line = line;
gfs2-08.patch-+
gfs2-08.patch-+ spin_lock(&memory_lock); 

 -> GFP_NOFAIL 

gfs2-10.patch:+         RETRY_MALLOC(new_gh = gfs2_holder_get(gl, state,
gfs2-10.patch-+                                               LM_FLAG_TRY |
gfs2-10.patch-+                                               
GL_NEVER_RECURSE),
gfs2-10.patch-+                      new_gh);
gfs2-10.patch-+         set_bit(HIF_DEMOTE, &new_gh->gh_iflags);
gfs2-10.patch-+         set_bit(HIF_DEALLOC, &new_gh->gh_iflags); 

gfs2_holder_get uses kmalloc which can use GFP_NOFAIL. 

                Pekka 



From joern at wohnheim.fh-wedel.de  Mon Aug  8 10:20:22 2005
From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel)
Date: Mon, 8 Aug 2005 12:20:22 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <1123495525.3245.36.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
	<1123495525.3245.36.camel@laptopd505.fenrus.org>
Message-ID: <20050808102022.GA17978@wohnheim.fh-wedel.de>

On Mon, 8 August 2005 12:05:25 +0200, Arjan van de Ven wrote:
> On Mon, 2005-08-08 at 17:57 +0800, David Teigland wrote:
> > > 
> > > Please drop the extra braces.
> > 
> > Here and elsewhere we try to keep unused stuff off the stack.  Are you
> > suggesting that we're being overly cautious, or do you just dislike the
> > way it looks?
> 
> nice theory. In practice gcc 3.x still adds up all the stack space
> anyway and as long as gcc 3.x is a supported kernel compiler, you can't
> depend on this. Also.. please favor readability. gcc is getting smarter
> about stack use nowadays, and {}'s shouldn't be needed to help it, it
> tracks liveness of variables already.

Plus, you don't have to guess about stack usage.  Run "make
checkstack" or, better yet, run the objdump of fs/gfs/built-in.o
through the perl script.

J?rn

-- 
It's just what we asked for, but not what we want!
-- anonymous



From penberg at cs.helsinki.fi  Mon Aug  8 10:34:56 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Mon, 08 Aug 2005 13:34:56 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050808095747.GD13951@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
Message-ID: <courier.42F73550.00004B7B@courier.cs.helsinki.fi>

David Teigland writes:
> > Is there a reason why you cannot use <linux/hash.h> or <linux/jhash.h>? 
> 
> See gfs2_hash_more() and comment; we hash discontiguous regions.

jhash() takes an initial value. Isn't that sufficient? 

                 Pekka 



From penberg at cs.helsinki.fi  Mon Aug  8 10:57:55 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Mon, 08 Aug 2005 13:57:55 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050808105613.GE13951@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
	<courier.42F73185.00006260@courier.cs.helsinki.fi>
	<20050808105613.GE13951@redhat.com>
Message-ID: <courier.42F73AB3.00006AEE@courier.cs.helsinki.fi>

David Teigland writes:
> > but why can't you return NULL here on failure like you do for
> > find_lock_page()?  
> 
> because create is set

Yes, but looking at (some of the) top-level callers, there's no real reason 
why create must not fail. Am I missing something here? 

> > gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep, 
> >                                                    GFP_KERNEL),
> > -> GFP_NOFAIL  
> 
> It looks to me like NOFAIL does nothing for kmem_cache_alloc().
> Am I seeing that wrong?

It is passed to the page allocator just like with kmalloc() which uses 
__cache_alloc() too. 

                Pekka 



From sdake at mvista.com  Fri Aug  5 17:45:49 2005
From: sdake at mvista.com (Steven Dake)
Date: Fri, 05 Aug 2005 10:45:49 -0700
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <42EF4AD1.6010809@redhat.com>
References: <42DB63F6.5070600@redhat.com>
	<1122318870.12824.29.camel@localhost.localdomain>
	<42EF4AD1.6010809@redhat.com>
Message-ID: <1123263949.16923.23.camel@localhost.localdomain>

On Tue, 2005-08-02 at 11:28 +0100, Patrick Caulfield wrote:
> Steven Dake wrote:
> > On Mon, 2005-07-18 at 09:10 +0100, Patrick Caulfield wrote:
> > 
> >>As I see it there are two things we can do with userland cman that's current in
> >>the head of CVS:
> >>
> >>1. Leave it as it is - a port of the kernel one. This has some benefits: it's
> >>easy (plus a few bug fixes that need to go in), it's protocol-compatible with
> >>the kernel one. There are a small number of extra features that could go in
> >>there (that would, annoyingly, break that compatibility) but nothing really
> >>serious. It doesn't give us anything new, but what new is neeed ?
> >>
> >>2. Migrate it to something much more sophisticated. I've mentioned Virtual
> >>Synchrony a few times before and I've been looking into this in some detail
> >>since. The benefits are largely internal but they do provide a reliable, robust
> >>and well-performing messaging system that other cluster subsystems can use.
> >>While the application programmers at the cluster summit maintained they had no
> >>use for a cluster messaging system, I still believe that it is a useful thing to
> >>have at a lower level - if only for our own programming needs. I know that Jon
> >>looked into the existing cman messaging system before rejecting it as too slow
> >>and unreliable for he needs of the cluster mirroring code.
> >>
> >>There are two suboptions here.
> >>  a) write it ourself. Quite a big job this. Bigger than I would like. To be
> >>honest I did make a start at this and now realise just what a huge job it is to
> >>get something that both performs well and is reliable. REALLY reliable. even
> >>worse if the academics want something provably reliable.
> >>   b) adopt something else. The obvious candidate here is the openAIS code[1].
> >>This looks to be quite mature now and has all the features we need of a low
> >>level messaging system. It's very nicely abstracted out so we can pick out just
> >>the bits we need without having the whole (rather heavyweight) system on top of it.
> >>
> >>The one problem with the openAIS code is that it doesn't support IPv6, and much
> >>of the code is tied to IPv4. Having had a look at it and emailed Steven Dake
> >>about this he reckons it's about 2 weeks work to add.[2]
> >>
> >>The advantages of doing this are several.
> >>- It saves time. We get something that is known to work, even though it needs
> >>extra features added for our own use.
> >>- we're not inventing something new that already exists in several other places.
> >>- we get more people who know the code. Currently only I know the internals of
> >>cman as it stands and it's quite scary code that people don't want to get
> >>involved with (we've have several DLM patches in the past, but no CMAN ones).
> >>This way we get at least 2 (Steven and me) as well as anyone else who is
> >>following openAIS. Of course there will be CMAN-specific stuff on top of their
> >>comms layer to make it quorum-based and capable of supporting GFS and DLM that
> > 
> > 
> > sorry my response is so late I missed this mail while at OLS.
> > 
> > The quorum problem is commonly referred to in the literature as a
> > "virtual synchrony filter".  I'd love to have some implementations of
> > virtual synchrony filters that exist within libtotem itself..
> > Definately an area of interest for openais as we need some services to
> > operate only in one partition (like the amf).
> > 
> > 
> >>will be Red Hat specific but these are not going to be large.
> >>- the APIs are all open (based on SAforum specifications) and already
> >>implemented. Although adding saCLM to CMAN is pretty easy as I proved last week.
> >>
> > 
> > 
> >>The disadvantages are
> >>- Need to learn the internals of someone else's code.
> > 
> > 
> > indeed this part is somewhat painful :(
> > 
> > 
> >>- We don't have full control over the code. Although we can obviously fork it if
> >>we feel the need it would, obviously be preferable not to.
> > 
> > 
> > My view is that open source influence is dictated by level of
> > contribution just like any kind of community.  ie: the more a person
> > contributes the more influence they can exert over a project or
> > direction.  Even as maintainer I don't have full control over the
> > openais code as the community really decides where we go and what work
> > we do.
> > 
> > My point here is that if you are willing to fork, then you probably have
> > some time to maintain the code..  which is better spent influencing the
> > current openais tree :)
> > 
> > 
> >>- non-compatibility with "old" cman, making rolling upgrades har or even
> >>impossible. I'm not sure what to do about this yet, but it's worth pointing out
> >>that the DLM has a new line-protocol too.
> > 
> > 
> > yes upgrades are a real pain.  We have not fully tackled this problem in
> > the openais project yet, because we havn't released a stable version.
> > Ideally we would like two versions (older, newer) to interoperate, even
> > if that means uglifying the implementation to coexist with two line
> > types.  We have some work in place to address this problem but before
> > our first production release I'm planning to really think through
> > interoperability with new implementations for features of the totem
> > protocol (like redundant ring, multi ring gateway (for local area
> > networks), group key generation, multi-ring-bridged (for wide area
> > networks), etc).
> > 
> > 
> >>- openAIS is BSD licensed, I don't think this is a problem but it probably needs
> >>checking.
> >>
> > 
> > 
> > Originally I had planned to use spread for openais, but the license was
> > not compatible with the lawyers "approved list".  So we had to implement
> > a protocol completely from scratch because of the license issue which
> > took about 1.5 years of work (sigh).  I wanted to be sure other projects
> > could reuse the totem code so chose the most liberal license I could
> > find.
> > 
> > 
> >>In short, I'm advocating adopting the openAIS core (libtotem basically) as
> >>CMAN's communications/membership protocol. If we're going to do a "CMAN V2" that
> >>has anything significant over V1 then re-inventing it is going to be a huge
> >>amount of work that someone else has already done.
> >>
> >>Comments?
> >>
> > 
> > 
> > sounds good Patrick  if you need any help from us let us know
> > 
> 
> Thanks for that Steven. I'm going to make a start on this when I get back from
> UKUUG next week. I've managed to knock up something that looks like cman from
> the outside but uses libtotem for it's comms layer so it's looking good. On
> other thing I need to look into (apart from IPv6) is multi-home. cman had a
> (primitive) failover system but it's not currently in use by anyone because DLM
> doesn't support it but I think it's something we need to provide at some stage.
> 
> Don't worry about the mention of a fork - the chances of it happening are almost
> nil!

Thats great news Patrick.  One thing you should be aware of is that I
have changed some of the internal interfaces in preparation for others
to use libtotem to be extremely more sanitary.  Unfortunately I may have
done this a little too late in your case..  But I think you will find
things are a little better.  It really only effects totempg_initialize.
Also libtotem was renamed to libtotem_pg because of requests from Daniel
about a name-space collision with some movie player in fc4.

For multihoming, I want to support the totem redundant ring protocol in
the totem code.  This is an extension of totemsrp to support multiple
nics per processor.  Then data is either actively or passively
replicated over multiple links.  There is essentially no failover and
multiple links can offer better performance and still operate properly
when one entire network fails.  It looks pretty simple to implement.
The paper is at:

http://www.rcsc.de/pdf/icdcs02.pdf

regards
-steve



From javipolo at datagrama.net  Mon Aug  8 14:12:38 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Mon, 8 Aug 2005 16:12:38 +0200
Subject: [Linux-cluster] problem with rejoining a node
Message-ID: <20050808141238.GA6455@gibson.drslump.org>

Hi there (again :P)

I'm still fighting with all this, sorry to bother so much (hope some day
when I understand it all better I'll write some article on how to set this up)

Well, I have already up the cluster and mounted the gfs filesystem in 3
machines, and if one of those goes down, it's correctly fenced. The FC
port is also disconnected, so I suppose at this point is everything ok.

The problem is on the recovery. I understand that when a node rejoins
is automaticaly unfenced, and then it can rejoin the fence and
mount again the filesystem.

I've blocked all input and output traffic on the node I want to test
with iptables.

The node gets fenced ok:
Aug  8 16:00:48 gfstest2 fenced[2594]: fencing node "gfstest1"
Aug  8 16:00:56 gfstest2 fenced[2594]: fence "gfstest1" success

Now I can access the GFS filesystem safely from my other 2 nodes, as the
FC port for gfstest1 is disabled now, but if I enable traffic for the
node, it does not rejoin the cluster. Shouldnt this be automatically?

Anyway, I cannot rejoin/leave/whatever the cluster from gfstest1:
gfstest1:~# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "primer_fs"                         2   3 run       -
[1 2 3]

GFS Mount Group: "primer_fs"                         3   4 run       -
[1 2 3]

gfstest1:~# cman_tool join
cman_tool: Node is already active
gfstest1:~# cman_tool leave
cman_tool: Can't leave cluster while there are 5 active subsystems

and also, I cannot umount /dev/sdc1 as I have no access to the SAN
(and however DLM should block him not to do so). So I get a totally
screwed up system, that I can just fix by hard-rebooting (if I do a
clean reboot, the system "hangs" while "umounting filesystems").

Also, when the system boots up, the SAN is still unaccessible, as the
fencing script does not run to re-enable the port ...

I'm loooooost diving into google querys ... and certainly it's hard to
find accurate info about all this :/

could someone spot some light?
(probably I dont understand well how the fencing system works, but also
havent find anywhere where its explained :/)

thx in advance :)
-- 
Javier Polo @ Datagrama
902 136 126



From penberg at cs.helsinki.fi  Mon Aug  8 14:14:45 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Mon, 08 Aug 2005 17:14:45 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050803063644.GD9812@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
Message-ID: <courier.42F768D5.000046F2@courier.cs.helsinki.fi>

David Teigland writes:
> +static ssize_t walk_vm_hard(struct file *file, char *buf, size_t size,
> +                         loff_t *offset, do_rw_t operation)
> +{
> +     struct gfs2_holder *ghs;
> +     unsigned int num_gh = 0;
> +     ssize_t count;
> +
> +     {

Can we please get rid of the extra braces everywhere? 

[snip] 

David Teigland writes:
> +
> +             for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
> +                     if (end <= vma->vm_start)
> +                             break;
> +                     if (vma->vm_file &&
> +                         vma->vm_file->f_dentry->d_inode->i_sb == sb) {
> +                             num_gh++;
> +                     }
> +             }
> +
> +             ghs = kmalloc((num_gh + 1) * sizeof(struct gfs2_holder),
> +                           GFP_KERNEL);
> +             if (!ghs) {
> +                     if (!dumping)
> +                             up_read(&mm->mmap_sem);
> +                     return -ENOMEM;
> +             }
> +
> +             for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {

Sorry if this is an obvious question but what prevents another thread from 
doing mmap() before we do the second walk and messing up num_gh? 

> +                     if (end <= vma->vm_start)
> +                             break;
> +                     if (vma->vm_file) {
> +                             struct inode *inode;
> +                             inode = vma->vm_file->f_dentry->d_inode;
> +                             if (inode->i_sb == sb)
> +                                     gfs2_holder_init(get_v2ip(inode)->i_gl,
> +                                                      vma2state(vma),
> +                                                      0, &ghs[x++]);
> +                     }
> +             }

            Pekka 



From pcaulfie at redhat.com  Mon Aug  8 15:30:43 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 08 Aug 2005 16:30:43 +0100
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <1123263949.16923.23.camel@localhost.localdomain>
References: <42DB63F6.5070600@redhat.com>	<1122318870.12824.29.camel@localhost.localdomain>	<42EF4AD1.6010809@redhat.com>
	<1123263949.16923.23.camel@localhost.localdomain>
Message-ID: <42F77AA3.80000@redhat.com>

Steven Dake wrote:
> Thats great news Patrick.  One thing you should be aware of is that I
> have changed some of the internal interfaces in preparation for others
> to use libtotem to be extremely more sanitary.  Unfortunately I may have
> done this a little too late in your case..  But I think you will find
> things are a little better.  It really only effects totempg_initialize.
> Also libtotem was renamed to libtotem_pg because of requests from Daniel
> about a name-space collision with some movie player in fc4.

Yes I spotted that, my current "nearly-working" cman is based on the latest SVN
sources.

> For multihoming, I want to support the totem redundant ring protocol in
> the totem code.  This is an extension of totemsrp to support multiple
> nics per processor.  Then data is either actively or passively
> replicated over multiple links.  There is essentially no failover and
> multiple links can offer better performance and still operate properly
> when one entire network fails.  It looks pretty simple to implement.
> The paper is at:
> 
> http://www.rcsc.de/pdf/icdcs02.pdf

Excellent, thanks. I'll have a read.

-- 

patrick



From pcaulfie at redhat.com  Mon Aug  8 15:35:14 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 08 Aug 2005 16:35:14 +0100
Subject: [Linux-cluster] problem with rejoining a node
In-Reply-To: <20050808141238.GA6455@gibson.drslump.org>
References: <20050808141238.GA6455@gibson.drslump.org>
Message-ID: <42F77BB2.4070705@redhat.com>

Javi Polo wrote:
> Hi there (again :P)
> 
> I'm still fighting with all this, sorry to bother so much (hope some day
> when I understand it all better I'll write some article on how to set this up)
> 
> Well, I have already up the cluster and mounted the gfs filesystem in 3
> machines, and if one of those goes down, it's correctly fenced. The FC
> port is also disconnected, so I suppose at this point is everything ok.
> 
> The problem is on the recovery. I understand that when a node rejoins
> is automaticaly unfenced, and then it can rejoin the fence and
> mount again the filesystem.
> 
> I've blocked all input and output traffic on the node I want to test
> with iptables.
> 
> The node gets fenced ok:
> Aug  8 16:00:48 gfstest2 fenced[2594]: fencing node "gfstest1"
> Aug  8 16:00:56 gfstest2 fenced[2594]: fence "gfstest1" success

What sort of fencing are you using? If it's a power-switch fence then the
node should be hard rebooted. If it's SAN fencing then you'll have to get the
node out of the cluster - the remaining two nodes /should/ tell it it leave the
cluster.

A node can't just "rejoin" a cluster after being SAN fenced. it must be removed
from the cluster and rejoin from scratch. There's far too much state involved
for it to merge  seamlessly back into a cluster.

> Now I can access the GFS filesystem safely from my other 2 nodes, as the
> FC port for gfstest1 is disabled now, but if I enable traffic for the
> node, it does not rejoin the cluster. Shouldnt this be automatically?
> 
> Anyway, I cannot rejoin/leave/whatever the cluster from gfstest1:
> gfstest1:~# cman_tool services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           1   2 run       -
> [1 2 3]
> 
> DLM Lock Space:  "primer_fs"                         2   3 run       -
> [1 2 3]
> 
> GFS Mount Group: "primer_fs"                         3   4 run       -
> [1 2 3]
> 
> gfstest1:~# cman_tool join
> cman_tool: Node is already active
> gfstest1:~# cman_tool leave
> cman_tool: Can't leave cluster while there are 5 active subsystems

cman_tool leave force will force it to leave, but you might find it still needs
a reboot to clear the filesystems.

> and also, I cannot umount /dev/sdc1 as I have no access to the SAN
> (and however DLM should block him not to do so). So I get a totally
> screwed up system, that I can just fix by hard-rebooting (if I do a
> clean reboot, the system "hangs" while "umounting filesystems").
> 
> Also, when the system boots up, the SAN is still unaccessible, as the
> fencing script does not run to re-enable the port ...
> 
> I'm loooooost diving into google querys ... and certainly it's hard to
> find accurate info about all this :/
> 
> could someone spot some light?
> (probably I dont understand well how the fencing system works, but also
> havent find anywhere where its explained :/)
> 
> thx in advance :)


-- 

patrick



From mdl at veles.ru  Tue Aug  9 07:34:57 2005
From: mdl at veles.ru (Denis Medvedev)
Date: Tue, 09 Aug 2005 11:34:57 +0400
Subject: [Linux-cluster] RH exporting local disks as LUNs.
In-Reply-To: <Pine.LNX.4.61.0508050820530.15712@tungsten.technicality.org>
References: <1123209846.17379.18.camel@dragon.sys.intra>	<42F31C9D.8090601@veles.ru>
	<Pine.LNX.4.61.0508050820530.15712@tungsten.technicality.org>
Message-ID: <42F85CA1.3040402@veles.ru>

Nate Carlson ?????:

> On Fri, 5 Aug 2005, Denis Medvedev wrote:
>
>>> This question may show my total ignorance but I'm not above that ;) 
>>> Is there anything stopping me from exporting a device or a volume as a
>>> raw SCSI or FC-AL LUN for example. Could a linux box be made to be (act
>>> like) a SCSI or FC disk?
>>
>>
>> Yes, you can.
>
>
> Nifty! I was actually looking at this the other day, and couldn't 
> figure out a way to do it.
>
> Do you happen to have any documentation?
>
> ------------------------------------------------------------------------
> | nate carlson | natecars at natecarlson.com | http://www.natecarlson.com |
> | depriving some poor village of its idiot since 1981 |
> ------------------------------------------------------------------------
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>
I just tried (successfully) TWO working implementations of
Linux was AltLinux (www.altlinux.ru, ftp.altlinux.ru - Master 
distribution upgraded to Sisyphus).
And the working iscsi "targets" (providers of iscsi disks) was 
iscsi-target.sf.net
(better for me)
and umh-iscsi ( http://www.iol.unh.edu/consortiums/iscsi/)
BOTH worked fine with Microsoft initiator (the iscsi client).





From pcaulfie at redhat.com  Tue Aug  9 07:08:01 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 09 Aug 2005 08:08:01 +0100
Subject: [Openais] Re: [Linux-cluster] Where to go with cman ?
In-Reply-To: <1123524809.16145.17.camel@localhost.localdomain>
References: <42DB63F6.5070600@redhat.com>	<1122318870.12824.29.camel@localhost.localdomain>	<42EF4AD1.6010809@redhat.com>	<1123263949.16923.23.camel@localhost.localdomain>	<42F77AA3.80000@redhat.com>
	<1123524809.16145.17.camel@localhost.localdomain>
Message-ID: <42F85651.80501@redhat.com>

Steven Dake wrote:
> On Mon, 2005-08-08 at 16:30 +0100, Patrick Caulfield wrote:
> 
>>Steven Dake wrote:
>>
>>>Thats great news Patrick.  One thing you should be aware of is that I
>>>have changed some of the internal interfaces in preparation for others
>>>to use libtotem to be extremely more sanitary.  Unfortunately I may have
>>>done this a little too late in your case..  But I think you will find
>>>things are a little better.  It really only effects totempg_initialize.
>>>Also libtotem was renamed to libtotem_pg because of requests from Daniel
>>>about a name-space collision with some movie player in fc4.
>>
>>Yes I spotted that, my current "nearly-working" cman is based on the latest SVN
>>sources.
>>
>>
>>>For multihoming, I want to support the totem redundant ring protocol in
>>>the totem code.  This is an extension of totemsrp to support multiple
>>>nics per processor.  Then data is either actively or passively
>>>replicated over multiple links.  There is essentially no failover and
>>>multiple links can offer better performance and still operate properly
>>>when one entire network fails.  It looks pretty simple to implement.
>>>The paper is at:
>>>
>>>http://www.rcsc.de/pdf/icdcs02.pdf
>>
>>Excellent, thanks. I'll have a read.
>>
> 
> 
> Patrick,
> 
> Over the weekend I reorged the totem code significantly (although the
> totempg interfaces have not changed).  The reorg was painful timewise,
> but the result is that redundant ring should be pretty easy to implement
> now.  Basically I took all of the network junk out of totemsrp and put
> it in "totemnet.c".  It also allows for multiple instances of totemnet
> binds.  This is the main feature I needed to implement redundant ring in
> a clean fashion.  The ipv6 support should be a little easier to add now
> since most of the network code is limited to totemnet.

Superb! I intend to get on to the IPv6 stuff in the next week or so, other
things permitting.

> I should have a patch in a few days with a redundant ring passive and
> active implementation.



-- 

patrick



From javipolo at datagrama.net  Mon Aug  8 21:55:30 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Mon, 8 Aug 2005 23:55:30 +0200
Subject: [Linux-cluster] problem with rejoining a node
In-Reply-To: <42F77BB2.4070705@redhat.com>
References: <20050808141238.GA6455@gibson.drslump.org>
	<42F77BB2.4070705@redhat.com>
Message-ID: <20050808215530.GA23695@gibson.drslump.org>

On Aug/08/2005, Patrick Caulfield wrote:

> What sort of fencing are you using? If it's a power-switch fence then the
> node should be hard rebooted. If it's SAN fencing then you'll have to get the

it's san fencing. I modified the fence_sanbox2.pl script to suit the
switch commands, and "by hand" it works perfectly

> node out of the cluster - the remaining two nodes /should/ tell it it leave the
> cluster.

so, when the node recovers and "says hello" to the cluster, the other
two do take him out of the cluster?

> A node can't just "rejoin" a cluster after being SAN fenced. it must be removed
> from the cluster and rejoin from scratch. There's far too much state involved
> for it to merge  seamlessly back into a cluster.

must i do it manually? or is any kind of automated process here?

what are the steps the node should perform after being fenced so it can
join again the node?
(sorry asking so much but I'm really lost here :/ )

> > gfstest1:~# cman_tool join
> > cman_tool: Node is already active
> > gfstest1:~# cman_tool leave
> > cman_tool: Can't leave cluster while there are 5 active subsystems
> cman_tool leave force will force it to leave, but you might find it still needs
> a reboot to clear the filesystems.

so, if we just simply loose conectivity between nodes, we should still
reboot the server so it can be "clean" and join again the cluster?

and if so, should I enable manually the port on the SAN, or will fenced
do it for me (as the script does actually accepts an enable parameter)
:??

-- 
Javier Polo @ Datagrama
902 136 126



From javipolo at datagrama.net  Mon Aug  8 21:55:30 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Mon, 8 Aug 2005 23:55:30 +0200
Subject: [Linux-cluster] problem with rejoining a node
In-Reply-To: <42F77BB2.4070705@redhat.com>
References: <20050808141238.GA6455@gibson.drslump.org>
	<42F77BB2.4070705@redhat.com>
Message-ID: <20050808215530.GA23695@gibson.drslump.org>

On Aug/08/2005, Patrick Caulfield wrote:

> What sort of fencing are you using? If it's a power-switch fence then the
> node should be hard rebooted. If it's SAN fencing then you'll have to get the

it's san fencing. I modified the fence_sanbox2.pl script to suit the
switch commands, and "by hand" it works perfectly

> node out of the cluster - the remaining two nodes /should/ tell it it leave the
> cluster.

so, when the node recovers and "says hello" to the cluster, the other
two do take him out of the cluster?

> A node can't just "rejoin" a cluster after being SAN fenced. it must be removed
> from the cluster and rejoin from scratch. There's far too much state involved
> for it to merge  seamlessly back into a cluster.

must i do it manually? or is any kind of automated process here?

what are the steps the node should perform after being fenced so it can
join again the node?
(sorry asking so much but I'm really lost here :/ )

> > gfstest1:~# cman_tool join
> > cman_tool: Node is already active
> > gfstest1:~# cman_tool leave
> > cman_tool: Can't leave cluster while there are 5 active subsystems
> cman_tool leave force will force it to leave, but you might find it still needs
> a reboot to clear the filesystems.

so, if we just simply loose conectivity between nodes, we should still
reboot the server so it can be "clean" and join again the cluster?

and if so, should I enable manually the port on the SAN, or will fenced
do it for me (as the script does actually accepts an enable parameter)
:??

-- 
Javier Polo @ Datagrama
902 136 126



From lhh at redhat.com  Mon Aug  8 20:14:07 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 08 Aug 2005 16:14:07 -0400
Subject: [Linux-cluster] Purpose of fencing devices
In-Reply-To: <42F2328C.4050700@hugsmidjan.is>
References: <42F2328C.4050700@hugsmidjan.is>
Message-ID: <1123532047.3992.54.camel@ayanami.boston.redhat.com>

On Thu, 2005-08-04 at 15:21 +0000, "S?valdur Arnar Gunnarsson
[Hugsmi?jan]" wrote:
> Could someone explain to me the purpose of the fencing hardware in a 
> cluster with a shared storage resource.
> 
> When one of the cluster member goes down all access to the shared volume 
> (GFS) is closed off.
> No other cluster member can read or write to the volume until the failed 
> node comes back up.

That sounds like manual fencing.

> Are fencing devices used to close off the access the dead node has on 
> the filesystem so the other nodes can access (read/write) the fileystem 
> as usual ?

Yes.

-- Lon




From lhh at redhat.com  Mon Aug  8 20:12:51 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 08 Aug 2005 16:12:51 -0400
Subject: [Linux-cluster] Clustered LDAP, good or bad idea?
In-Reply-To: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca>
References: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca>
Message-ID: <1123531971.3992.51.camel@ayanami.boston.redhat.com>

On Fri, 2005-08-05 at 19:08 -0600, Ryan Thomson wrote:

> I'm a bit concerned about failures since I can't test that properly in a
> two-node cluster.
> 
> I suppose what I'm really asking is this: Is running LDAP as a cluster
> service a particularly bad idea for any reason?
> 

Not that I'm aware of.  I had one running a while for a reason which
I've since forgotten (I stopped it on 29-Mar-2005, and has never been
restarted).  However, that LDAP server was only for testing; I've never
run one in production.

-- Lon



From lhh at redhat.com  Mon Aug  8 20:06:50 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 08 Aug 2005 16:06:50 -0400
Subject: [Linux-cluster] Fencing agents
In-Reply-To: <42F1D28D.10006@veles.ru>
References: <42F0B177.7050907@hugsmidjan.is>
	<20050803212949.GA3268@redhat.com>  <42F1D28D.10006@veles.ru>
Message-ID: <1123531610.3992.47.camel@ayanami.boston.redhat.com>

On Thu, 2005-08-04 at 12:32 +0400, Denis Medvedev wrote:
> >The main advantage that "automatic" fencing gives you over manual fencing is
> >that in the event that a fencing operation is required, your cluster can
> >automatically recover (on the order of seconds to minutes) instead of waiting
> >for user intervention (which can take minutes to hours to days depending on
> >how attentive the admins are :).    
> >
> >  
> >
> "recover"? You mean reboot? But if a machine need fencing, doesn't that 
> mean that something is inherently wrong with that machine and simple 
> reboot would't cure that?

recovery = The *cluster* can continue operation, not the *node*.

If the node is truly dead (maybe its CPU was fried from a bolt of
lightning), rebooting it doesn't hurt it.  If it was just a software or
network issue (e.g. kernel panic, router glitch), then the node should
be able to recover after its reboot and rejoin the cluster.

-- Lon



From adam.cassar at netregistry.com.au  Tue Aug  9 12:11:35 2005
From: adam.cassar at netregistry.com.au (Adam Cassar)
Date: Tue, 09 Aug 2005 22:11:35 +1000
Subject: [Linux-cluster] Clustered LDAP, good or bad idea?
In-Reply-To: <1123531971.3992.51.camel@ayanami.boston.redhat.com>
References: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca>
	<1123531971.3992.51.camel@ayanami.boston.redhat.com>
Message-ID: <42F89D77.80907@netregistry.com.au>

We run openldap quite extensively here. In my experience, if slapd fails 
it is usually due to some backend db issue, and gfs will not help you 
with that.

A master slave set up will be your best option.

Lon Hohberger wrote:

>On Fri, 2005-08-05 at 19:08 -0600, Ryan Thomson wrote:
>
>  
>
>>I'm a bit concerned about failures since I can't test that properly in a
>>two-node cluster.
>>
>>I suppose what I'm really asking is this: Is running LDAP as a cluster
>>service a particularly bad idea for any reason?
>>
>>    
>>
>
>Not that I'm aware of.  I had one running a while for a reason which
>I've since forgotten (I stopped it on 29-Mar-2005, and has never been
>restarted).  However, that LDAP server was only for testing; I've never
>run one in production.
>
>-- Lon
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>  
>



From pcaulfie at redhat.com  Tue Aug  9 12:25:08 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 09 Aug 2005 13:25:08 +0100
Subject: [Linux-cluster] problem with rejoining a node
In-Reply-To: <20050808215530.GA23695@gibson.drslump.org>
References: <20050808141238.GA6455@gibson.drslump.org>	<42F77BB2.4070705@redhat.com>
	<20050808215530.GA23695@gibson.drslump.org>
Message-ID: <42F8A0A4.5000507@redhat.com>

Javi Polo wrote:
> On Aug/08/2005, Patrick Caulfield wrote:
> 
> 
>>What sort of fencing are you using? If it's a power-switch fence then the
>>node should be hard rebooted. If it's SAN fencing then you'll have to get the
> 
> 
> it's san fencing. I modified the fence_sanbox2.pl script to suit the
> switch commands, and "by hand" it works perfectly
> 
> 
>>node out of the cluster - the remaining two nodes /should/ tell it it leave the
>>cluster.
> 
> 
> so, when the node recovers and "says hello" to the cluster, the other
> two do take him out of the cluster?

Yes. Is that not happening ?

> 
>>A node can't just "rejoin" a cluster after being SAN fenced. it must be removed
>>from the cluster and rejoin from scratch. There's far too much state involved
>>for it to merge  seamlessly back into a cluster.
> 
> 
> must i do it manually? or is any kind of automated process here?
> 
> what are the steps the node should perform after being fenced so it can
> join again the node?
> (sorry asking so much but I'm really lost here :/ )

A reboot is usually the easiest way to ensure that a node is "clean". If you
can umount all the GFS filesystems and stop all the cluster subsystems (fence,
clvmd, gfs) then you should be able to run the startup scripts again but it's
just a faff.

>>>gfstest1:~# cman_tool join
>>>cman_tool: Node is already active
>>>gfstest1:~# cman_tool leave
>>>cman_tool: Can't leave cluster while there are 5 active subsystems
>>
>>cman_tool leave force will force it to leave, but you might find it still needs
>>a reboot to clear the filesystems.
> 
> 
> so, if we just simply loose conectivity between nodes, we should still
> reboot the server so it can be "clean" and join again the cluster?
> 
> and if so, should I enable manually the port on the SAN, or will fenced
> do it for me (as the script does actually accepts an enable parameter)
> :??
> 

I don't know. I've never had access to a SAN fencing device!

-- 

patrick



From oldmoonster at gmail.com  Mon Aug  8 15:41:34 2005
From: oldmoonster at gmail.com (Michael)
Date: Mon, 08 Aug 2005 23:41:34 +0800
Subject: [Linux-cluster] [PATCH 00/14] GFS
In-Reply-To: <20050802071828.GA11217@redhat.com>
References: <20050802071828.GA11217@redhat.com>
Message-ID: <42F77D2E.8000806@gmail.com>

I patched gfs2-full.patch to 2.6.12.2 kernel, however, if I don't 
comment out "depends on DLM" in fs/Kconfig,
I can't see GFS2 in "make menuconfig", and of course, this result to 
compiling failure.

config GFS2_FS
        tristate "GFS2 file system support"
#      depends on DLM
        help
          A cluster filesystem.

Thanks,

Michael

David Teigland wrote:

>Hi, GFS (Global File System) is a cluster file system that we'd like to
>see added to the kernel.  The 14 patches total about 900K so I won't send
>them to the list unless that's requested.  Comments and suggestions are
>welcome.  Thanks
>
>http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
>http://redhat.com/~teigland/gfs2/20050801/broken-out/
>
>Dave
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>
>  
>



From sdake at mvista.com  Mon Aug  8 18:13:29 2005
From: sdake at mvista.com (Steven Dake)
Date: Mon, 08 Aug 2005 11:13:29 -0700
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <42F77AA3.80000@redhat.com>
References: <42DB63F6.5070600@redhat.com>
	<1122318870.12824.29.camel@localhost.localdomain>
	<42EF4AD1.6010809@redhat.com>
	<1123263949.16923.23.camel@localhost.localdomain>
	<42F77AA3.80000@redhat.com>
Message-ID: <1123524809.16145.17.camel@localhost.localdomain>

On Mon, 2005-08-08 at 16:30 +0100, Patrick Caulfield wrote:
> Steven Dake wrote:
> > Thats great news Patrick.  One thing you should be aware of is that I
> > have changed some of the internal interfaces in preparation for others
> > to use libtotem to be extremely more sanitary.  Unfortunately I may have
> > done this a little too late in your case..  But I think you will find
> > things are a little better.  It really only effects totempg_initialize.
> > Also libtotem was renamed to libtotem_pg because of requests from Daniel
> > about a name-space collision with some movie player in fc4.
> 
> Yes I spotted that, my current "nearly-working" cman is based on the latest SVN
> sources.
> 
> > For multihoming, I want to support the totem redundant ring protocol in
> > the totem code.  This is an extension of totemsrp to support multiple
> > nics per processor.  Then data is either actively or passively
> > replicated over multiple links.  There is essentially no failover and
> > multiple links can offer better performance and still operate properly
> > when one entire network fails.  It looks pretty simple to implement.
> > The paper is at:
> > 
> > http://www.rcsc.de/pdf/icdcs02.pdf
> 
> Excellent, thanks. I'll have a read.
> 

Patrick,

Over the weekend I reorged the totem code significantly (although the
totempg interfaces have not changed).  The reorg was painful timewise,
but the result is that redundant ring should be pretty easy to implement
now.  Basically I took all of the network junk out of totemsrp and put
it in "totemnet.c".  It also allows for multiple instances of totemnet
binds.  This is the main feature I needed to implement redundant ring in
a clean fashion.  The ipv6 support should be a little easier to add now
since most of the network code is limited to totemnet.

I should have a patch in a few days with a redundant ring passive and
active implementation.

regards
-steve



From zab at zabbo.net  Mon Aug  8 18:32:55 2005
From: zab at zabbo.net (Zach Brown)
Date: Mon, 08 Aug 2005 11:32:55 -0700
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <courier.42F768D5.000046F2@courier.cs.helsinki.fi>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
Message-ID: <42F7A557.3000200@zabbo.net>

Pekka J Enberg wrote:

> Sorry if this is an obvious question but what prevents another thread
> from doing mmap() before we do the second walk and messing up num_gh?

Nothing, I suspect.  OCFS2 has a problem like this, too.  It wants a way
for a file system to serialize mmap/munmap/mremap during file IO.  Well,
more specifically, it wants to make sure that the locks it acquired at
the start of the IO really cover the buf regions that might fault during
the IO.. mapping activity during the IO can wreck that.

- z



From teigland at redhat.com  Tue Aug  9 14:13:35 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 9 Aug 2005 22:13:35 +0800
Subject: [Linux-cluster] [PATCH 00/14] GFS
In-Reply-To: <42F77D2E.8000806@gmail.com>
References: <20050802071828.GA11217@redhat.com> <42F77D2E.8000806@gmail.com>
Message-ID: <20050809141334.GB12114@redhat.com>

On Mon, Aug 08, 2005 at 11:41:34PM +0800, Michael wrote:
> I patched gfs2-full.patch to 2.6.12.2 kernel, however, if I don't 
> comment out "depends on DLM" in fs/Kconfig,
> I can't see GFS2 in "make menuconfig", and of course, this result to 
> compiling failure.
> 
> config GFS2_FS
>        tristate "GFS2 file system support"
> #      depends on DLM
>        help
>          A cluster filesystem.

you need the dlm as found in -mm kernels

Dave



From ggilyeat at jhsph.edu  Tue Aug  9 14:29:26 2005
From: ggilyeat at jhsph.edu (Gerald G. Gilyeat)
Date: Tue, 9 Aug 2005 10:29:26 -0400
Subject: [Linux-cluster] Tuning...
Message-ID: <DF33F4DAC09B3048AA095F1B5C368915A57EAE@XCH-VN02.sph.ad.jhsph.edu>

I'm looking for information regarding what each of the tunable parameters returned by gfs_tool gettune actually -does-.
Output from one of my partitions:
ilimit1 = 100
ilimit1_tries = 3
ilimit1_min = 1
ilimit2 = 500
ilimit2_tries = 10
ilimit2_min = 3
demote_secs = 300
incore_log_blocks = 1024
jindex_refresh_secs = 60
gldep_secs = 30
scand_secs = 5
recoverd_secs = 60
logd_secs = 1
quotad_secs = 5
inoded_secs = 15
quota_simul_sync = 64
quota_warn_period = 10
atime_quantum = 3600
quota_quantum = 60
quota_scale = 1.0000   (1, 1)
quota_enforce = 1
quota_account = 1
new_files_jdata = 0
new_files_directio = 0
max_atomic_write = 4194304
max_readahead = 262144
lockdump_size = 131072
stall_secs = 600
complain_secs = 10
reclaim_limit = 5000
entries_per_readdir = 32
prefetch_secs = 10
statfs_slots = 64


Some of them are fairly obvious, but I'd like to have a more solid idea of what each does before I start mucking with things.

Thanks!

--
Jerry Gilyeat, RHCE
Systems Administrator
Molecular Microbiology and Immunology
Johns Hopkins Bloomberg School of Public Health
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050809/4260e718/attachment.htm>

From penberg at cs.helsinki.fi  Tue Aug  9 14:55:41 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Tue, 09 Aug 2005 17:55:41 +0300
Subject: [Linux-cluster] Re: GFS
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
Message-ID: <courier.42F8C3ED.00001BC1@courier.cs.helsinki.fi>

Hi David, 

Here are some more comments. 

                     Pekka 

+/************************************************************************** 
****
> +*******************************************************************************
> +**
> +**  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
> +**  Copyright (C) 2004-2005 Red Hat, Inc.  All rights reserved.
> +**
> +**  This copyrighted material is made available to anyone wishing to use,
> +**  modify, copy, or redistribute it subject to the terms and conditions
> +**  of the GNU General Public License v.2.
> +**
> +*******************************************************************************
> +******************************************************************************/

Do you really need this verbose banner? 

> +#define NO_CREATE 0
> +#define CREATE 1
> +
> +#define NO_WAIT 0
> +#define WAIT 1
> +
> +#define NO_FORCE 0
> +#define FORCE 1

The code seems to interchangeably use FORCE and NO_FORCE together
with TRUE and FALSE.  Perhaps they could be dropped? 

> +#define GLF_PLUG		0
> +#define GLF_LOCK		1
> +#define GLF_STICKY		2
> +#define GLF_PREFETCH		3
> +#define GLF_SYNC		4
> +#define GLF_DIRTY		5
> +#define GLF_SKIP_WAITERS2	6
> +#define GLF_GREEDY		7

Would be nice if these were either enums or had a comment linking
them to the struct member they are used for. 

> +#define GIF_MIN_INIT		0
> +#define GIF_QD_LOCKED		1
> +#define GIF_PAGED		2
> +#define GIF_SW_PAGED		3

Same here and in few other places too. 

> +#define LO_BEFORE_COMMIT(sdp) \
> +do { \
> +	int __lops_x; \
> +	for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \
> +		if (gfs2_log_ops[__lops_x]->lo_before_commit) \
> +			gfs2_log_ops[__lops_x]->lo_before_commit((sdp)); \
> +} while (0)
> +
> +#define LO_AFTER_COMMIT(sdp, ai) \
> +do { \
> +	int __lops_x; \
> +	for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \
> +		if (gfs2_log_ops[__lops_x]->lo_after_commit) \
> +			gfs2_log_ops[__lops_x]->lo_after_commit((sdp), (ai)); \
> +} while (0)
> +
> +#define LO_BEFORE_SCAN(jd, head, pass) \
> +do \
> +{ \
> +  int __lops_x; \
> +  for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \
> +    if (gfs2_log_ops[__lops_x]->lo_before_scan) \
> +      gfs2_log_ops[__lops_x]->lo_before_scan((jd), (head), (pass)); \
> +} \
> +while (0)

static inline functions, please. 

> +static inline int LO_SCAN_ELEMENTS(struct gfs2_jdesc *jd, unsigned int start,
> +				   struct gfs2_log_descriptor *ld,
> +				   unsigned int pass)

Lower case name, please. 

> +{
> +	unsigned int x;
> +	int error;
> +
> +	for (x = 0; gfs2_log_ops[x]; x++)
> +		if (gfs2_log_ops[x]->lo_scan_elements) {
> +			error = gfs2_log_ops[x]->lo_scan_elements(jd, start,
> +								 ld, pass);
> +			if (error)
> +				return error;
> +		}
> +
> +	return 0;
> +}
> +
> +#define LO_AFTER_SCAN(jd, error, pass) \
> +do \
> +{ \
> +  int __lops_x; \
> +  for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \
> +    if (gfs2_log_ops[__lops_x]->lo_before_scan) \
> +      gfs2_log_ops[__lops_x]->lo_after_scan((jd), (error), (pass)); \
> +} \
> +while (0)

static inline function, please. 

> +
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/smp_lock.h>
> +#include <linux/spinlock.h>
> +#include <asm/semaphore.h>
> +#include <linux/completion.h>
> +#include <linux/buffer_head.h>
> +#include <asm/uaccess.h>
> +#include <linux/pagemap.h>
> +#include <linux/uio.h>
> +#include <linux/blkdev.h>
> +#include <linux/mm.h>
> +#include <asm/uaccess.h>
> +#include <linux/gfs2_ioctl.h>

Preferred order is to include linux/ first and asm/ after that. 

> +#define vma2state(vma) \
> +((((vma)->vm_flags & (VM_MAYWRITE | VM_MAYSHARE)) == \
> + (VM_MAYWRITE | VM_MAYSHARE)) ? \
> + LM_ST_EXCLUSIVE : LM_ST_SHARED) \

static inline function, please. The above is completely unreadable. 

> +struct inode *gfs2_ip2v(struct gfs2_inode *ip, int create)
> +{
> +     struct inode *inode = NULL, *tmp;
> +
> +     gfs2_assert_warn(ip->i_sbd,
> +                      test_bit(GIF_MIN_INIT, &ip->i_flags));
> +
> +     spin_lock(&ip->i_spin);
> +     if (ip->i_vnode)
> +             inode = igrab(ip->i_vnode);
> +     spin_unlock(&ip->i_spin);

Suggestion: make the above a separate function __gfs2_lookup_inode(),
use it here and where you pass NO_CREATE to get rid of the create
parameter. 

> +
> +     if (inode || !create)
> +             return inode;
> +
> +     tmp = new_inode(ip->i_sbd->sd_vfs);
> +     if (!tmp)
> +             return NULL;

[snip] 

> +	entries = gfs2_tune_get(sdp, gt_entries_per_readdir);
> +	size = sizeof(struct filldir_bad) +
> +	    entries * (sizeof(struct filldir_bad_entry) + GFS2_FAST_NAME_SIZE);
> +
> +	fdb = kmalloc(size, GFP_KERNEL);
> +	if (!fdb)
> +		return -ENOMEM;
> +	memset(fdb, 0, size);

kzalloc(), which is in 2.6.13-rc6-mm5 please. Appears in other places as 
well. 

> +		if (error) {
> +			printk("GFS2: fsid=%s: can't make FS RW: %d\n",
> +			       sdp->sd_fsname, error);
> +			goto fail_proc;
> +		}
> +	}
> +
> +	gfs2_glock_dq_uninit(&mount_gh);
> +
> +	return 0;
> +
> + fail_proc:
> +	gfs2_proc_fs_del(sdp);
> +	init_threads(sdp, UNDO);

Please provide a release_threads instead and make it deal with partial
initialization. The above is very confusing. 

> +				     parent,
> +				     strlen(system_utsname.nodename));
> +	else if (gfs2_filecmp(&dentry->d_name, "@mach", 5))
> +		new = lookup_one_len(system_utsname.machine,
> +				     parent,
> +				     strlen(system_utsname.machine));
> +	else if (gfs2_filecmp(&dentry->d_name, "@os", 3))
> +		new = lookup_one_len(system_utsname.sysname,
> +				     parent,
> +				     strlen(system_utsname.sysname));
> +	else if (gfs2_filecmp(&dentry->d_name, "@uid", 4))
> +		new = lookup_one_len(buf,
> +				     parent,
> +				     sprintf(buf, "%u", current->fsuid));
> +	else if (gfs2_filecmp(&dentry->d_name, "@gid", 4))
> +		new = lookup_one_len(buf,
> +				     parent,
> +				     sprintf(buf, "%u", current->fsgid));
> +	else if (gfs2_filecmp(&dentry->d_name, "@sys", 4))
> +		new = lookup_one_len(buf,
> +				     parent,
> +				     sprintf(buf, "%s_%s",
> +					     system_utsname.machine,
> +					     system_utsname.sysname));
> +	else if (gfs2_filecmp(&dentry->d_name, "@jid", 4))
> +		new = lookup_one_len(buf,
> +				     parent,
> +				     sprintf(buf, "%u",
> +					     sdp->sd_jdesc->jd_jid));

Smells like policy in the kernel. Why can't this be done in the
userspace? 

> +				     parent,
> +				     strlen(system_utsname.nodename));
> +	else if (gfs2_filecmp(&dentry->d_name, "{mach}", 6))
> +		new = lookup_one_len(system_utsname.machine,
> +				     parent,
> +				     strlen(system_utsname.machine));
> +	else if (gfs2_filecmp(&dentry->d_name, "{os}", 4))
> +		new = lookup_one_len(system_utsname.sysname,
> +				     parent,
> +				     strlen(system_utsname.sysname));
> +	else if (gfs2_filecmp(&dentry->d_name, "{uid}", 5))
> +		new = lookup_one_len(buf,
> +				     parent,
> +				     sprintf(buf, "%u", current->fsuid));
> +	else if (gfs2_filecmp(&dentry->d_name, "{gid}", 5))
> +		new = lookup_one_len(buf,
> +				     parent,
> +				     sprintf(buf, "%u", current->fsgid));
> +	else if (gfs2_filecmp(&dentry->d_name, "{sys}", 5))
> +		new = lookup_one_len(buf,
> +				     parent,
> +				     sprintf(buf, "%s_%s",
> +					     system_utsname.machine,
> +					     system_utsname.sysname));
> +	else if (gfs2_filecmp(&dentry->d_name, "{jid}", 5))
> +		new = lookup_one_len(buf,
> +				     parent,
> +				     sprintf(buf, "%u",
> +					     sdp->sd_jdesc->jd_jid));

Ditto. 

> +int gfs2_statfs_slow(struct gfs2_sbd *sdp, struct gfs2_statfs_change *sc)
> +{
> +	struct gfs2_holder ri_gh;
> +	struct gfs2_rgrpd *rgd_next;
> +	struct gfs2_holder *gha, *gh;
> +	unsigned int slots = 64;
> +	unsigned int x;
> +	int done;
> +	int error = 0, err;
> +
> +	memset(sc, 0, sizeof(struct gfs2_statfs_change));
> +	gha = kmalloc(slots * sizeof(struct gfs2_holder), GFP_KERNEL);
> +	if (!gha)
> +		return -ENOMEM;
> +	memset(gha, 0, slots * sizeof(struct gfs2_holder));

kcalloc, please 

> +	line = kmalloc(256, GFP_KERNEL);
> +	if (!line)
> +		return -ENOMEM;
> +
> +	len = snprintf(line, 256, "GFS2: fsid=%s: quota %s for %s %u\r\n",
> +		       sdp->sd_fsname, type,
> +		       (test_bit(QDF_USER, &qd->qd_flags)) ? "user" : "group",
> +		       qd->qd_id);

Please use constant instead of magic number 256. 

> +struct lm_lockops gdlm_ops = {
> +	lm_proto_name:"lock_dlm",
> +	lm_mount:gdlm_mount,
> +	lm_others_may_mount:gdlm_others_may_mount,
> +	lm_unmount:gdlm_unmount,
> +	lm_withdraw:gdlm_withdraw,
> +	lm_get_lock:gdlm_get_lock,
> +	lm_put_lock:gdlm_put_lock,
> +	lm_lock:gdlm_lock,
> +	lm_unlock:gdlm_unlock,
> +	lm_plock:gdlm_plock,
> +	lm_punlock:gdlm_punlock,
> +	lm_plock_get:gdlm_plock_get,
> +	lm_cancel:gdlm_cancel,
> +	lm_hold_lvb:gdlm_hold_lvb,
> +	lm_unhold_lvb:gdlm_unhold_lvb,
> +	lm_sync_lvb:gdlm_sync_lvb,
> +	lm_recovery_done:gdlm_recovery_done,
> +	lm_owner:THIS_MODULE,
> +};

C99 initializers, please. 



From lhh at redhat.com  Tue Aug  9 15:40:50 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 09 Aug 2005 11:40:50 -0400
Subject: [Linux-cluster] Clustered LDAP, good or bad idea?
In-Reply-To: <42F89D77.80907@netregistry.com.au>
References: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca>
	<1123531971.3992.51.camel@ayanami.boston.redhat.com>
	<42F89D77.80907@netregistry.com.au>
Message-ID: <1123602050.13564.6.camel@ayanami.boston.redhat.com>

On Tue, 2005-08-09 at 22:11 +1000, Adam Cassar wrote:
> We run openldap quite extensively here. In my experience, if slapd fails 
> it is usually due to some backend db issue, and gfs will not help you 
> with that.
> 
> A master slave set up will be your best option.

Right, I think he meant to run one without a slave as a failover cluster
service, not a multi-instance app running atop of GFS...

*shrug*

-- Lon




From thomsonr at ucalgary.ca  Tue Aug  9 16:17:46 2005
From: thomsonr at ucalgary.ca (Ryan Thomson)
Date: Tue, 09 Aug 2005 10:17:46 -0600
Subject: [Linux-cluster] Clustered LDAP, good or bad idea?
In-Reply-To: <1123602050.13564.6.camel@ayanami.boston.redhat.com>
References: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca>
	<1123531971.3992.51.camel@ayanami.boston.redhat.com>
	<42F89D77.80907@netregistry.com.au>
	<1123602050.13564.6.camel@ayanami.boston.redhat.com>
Message-ID: <1123604266.27091.5.camel@porcupine.bio.ucalgary.ca>

Indeed, that is what I meant. I'm looking at running OpenLDAP as a
failover cluster service.

I understand what was meant about db failures and how GFS won't help
with that. Our LDAP directory is fairly simple and updates/changes are
made very infrequently. I've never experienced a db failure but I won't
discount that as a possible issue (that's what I have backups for. TSM,
I love you).

I'll take my chances and run it the way I'd planned with some failover
testing before I go production.

--
Ryan

On Tue, 2005-08-09 at 11:40 -0400, Lon Hohberger wrote:
> On Tue, 2005-08-09 at 22:11 +1000, Adam Cassar wrote:
> > We run openldap quite extensively here. In my experience, if slapd fails 
> > it is usually due to some backend db issue, and gfs will not help you 
> > with that.
> > 
> > A master slave set up will be your best option.
> 
> Right, I think he meant to run one without a slave as a failover cluster
> service, not a multi-instance app running atop of GFS...
> 
> *shrug*
> 
> -- Lon
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Tue Aug  9 18:24:44 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 09 Aug 2005 14:24:44 -0400
Subject: [Linux-cluster] problem with rejoining a node
In-Reply-To: <42F8A0A4.5000507@redhat.com>
References: <20050808141238.GA6455@gibson.drslump.org>
	<42F77BB2.4070705@redhat.com>
	<20050808215530.GA23695@gibson.drslump.org>
	<42F8A0A4.5000507@redhat.com>
Message-ID: <1123611884.13564.11.camel@ayanami.boston.redhat.com>

On Tue, 2005-08-09 at 13:25 +0100, Patrick Caulfield wrote:

> > and if so, should I enable manually the port on the SAN, or will fenced
> > do it for me (as the script does actually accepts an enable parameter)
> > :??
> > 
> 
> I don't know. I've never had access to a SAN fencing device!
> 

When using SAN fencing, the easiest set of steps to follow to ensure
data integrity + correct operation:

(a) power off machine
(b) un-fence at the SAN level
(c) power on machine

-- Lon




From teigland at redhat.com  Wed Aug 10 05:59:45 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 10 Aug 2005 13:59:45 +0800
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <courier.42F768D5.000046F2@courier.cs.helsinki.fi>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
Message-ID: <20050810055945.GB13926@redhat.com>

On Mon, Aug 08, 2005 at 05:14:45PM +0300, Pekka J Enberg wrote:

                 if (!dumping)
                        down_read(&mm->mmap_sem);
> >+
> >+             for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
> >+                     if (end <= vma->vm_start)
> >+                             break;
> >+                     if (vma->vm_file &&
> >+                         vma->vm_file->f_dentry->d_inode->i_sb == sb) {
> >+                             num_gh++;
> >+                     }
> >+             }
> >+
> >+             ghs = kmalloc((num_gh + 1) * sizeof(struct gfs2_holder),
> >+                           GFP_KERNEL);
> >+             if (!ghs) {
> >+                     if (!dumping)
> >+                             up_read(&mm->mmap_sem);
> >+                     return -ENOMEM;
> >+             }
> >+
> >+             for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
> 
> Sorry if this is an obvious question but what prevents another thread from 
> doing mmap() before we do the second walk and messing up num_gh? 

mm->mmap_sem ?



From penberg at cs.helsinki.fi  Wed Aug 10 06:06:42 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Wed, 10 Aug 2005 09:06:42 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050810055945.GB13926@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<20050810055945.GB13926@redhat.com>
Message-ID: <courier.42F99973.00000BD6@courier.cs.helsinki.fi>

David Teigland writes:
> 
>                  if (!dumping)
>                         down_read(&mm->mmap_sem);
> > > +
> > > +             for (vma = find_vma(mm, start); vma; vma = vma->vm_next)  {
> > > +                     if (end <= vma->vm_start)
> > > +                             break;
> > > +                     if (vma->vm_file &&
> > > +                         vma->vm_file->f_dentry->d_inode->i_sb == sb)  {
> > > +                             num_gh++;
> > > +                     }
> > > +             }
> > > +
> > > +             ghs = kmalloc((num_gh + 1) * sizeof(struct gfs2_holder),
> > > +                           GFP_KERNEL);
> > > +             if (!ghs) {
> > > +                     if (!dumping)
> > > +                             up_read(&mm->mmap_sem);
> > > +                     return -ENOMEM;
> > > +             }
> > > +
> > > +             for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
> > 
> > Sorry if this is an obvious question but what prevents another thread from 
> > doing mmap() before we do the second walk and messing up num_gh?  
> 
> mm->mmap_sem ?

Aah, I read that !dumping expression the other way around. Sorry and thanks. 

                           Pekka 



From pcaulfie at redhat.com  Wed Aug 10 07:20:44 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 10 Aug 2005 08:20:44 +0100
Subject: [Linux-cluster] problem with rejoining a node
In-Reply-To: <1123611884.13564.11.camel@ayanami.boston.redhat.com>
References: <20050808141238.GA6455@gibson.drslump.org>	<42F77BB2.4070705@redhat.com>	<20050808215530.GA23695@gibson.drslump.org>	<42F8A0A4.5000507@redhat.com>
	<1123611884.13564.11.camel@ayanami.boston.redhat.com>
Message-ID: <42F9AACC.9030009@redhat.com>

Lon Hohberger wrote:
> 
> When using SAN fencing, the easiest set of steps to follow to ensure
> data integrity + correct operation:
> 
> (a) power off machine
> (b) un-fence at the SAN level
> (c) power on machine
> 

That's pretty much what I suspected. I just thought it better it came from
someone who knew what they were talking about :)

Thanks.
-- 

patrick



From penberg at cs.helsinki.fi  Wed Aug 10 07:40:37 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Wed, 10 Aug 2005 10:40:37 +0300
Subject: [Linux-cluster] Re: GFS
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
Message-ID: <courier.42F9AF75.00005806@courier.cs.helsinki.fi>

Hi David, 

> +             return -EINVAL;
> +     if (!access_ok(VERIFY_WRITE, buf, size))
> +             return -EFAULT;
> +
> +     if (!(file->f_flags & O_LARGEFILE)) {
> +             if (*offset >= 0x7FFFFFFFull)
> +                     return -EFBIG;
> +             if (*offset + size > 0x7FFFFFFFull)
> +                     size = 0x7FFFFFFFull - *offset;

Please use a constant instead for 0x7FFFFFFFull. (Appears in various other 
places as well.) 

                                   Pekka 



From lmb at suse.de  Wed Aug 10 10:30:41 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Wed, 10 Aug 2005 12:30:41 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810070309.GA2415@infradead.org>
References: <20050802071828.GA11217@redhat.com>
	<20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
Message-ID: <20050810103041.GB4634@marowsky-bree.de>

On 2005-08-10T08:03:09, Christoph Hellwig <hch at infradead.org> wrote:

> > Kindly lose the "Context Dependent Pathname" crap.
> Same for ocfs2.

Would a generic implementation of that higher up in the VFS be more
acceptable?

It's not like context-dependent symlinks are an arbitary feature, but
rather very useful in practice.


Sincerely,
    Lars Marowsky-Br?e <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From lmb at suse.de  Wed Aug 10 10:34:24 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Wed, 10 Aug 2005 12:34:24 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810103256.GA6127@infradead.org>
References: <20050802071828.GA11217@redhat.com>
	<20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
	<20050810103256.GA6127@infradead.org>
Message-ID: <20050810103424.GC4634@marowsky-bree.de>

On 2005-08-10T11:32:56, Christoph Hellwig <hch at infradead.org> wrote:

> > Would a generic implementation of that higher up in the VFS be more
> > acceptable?
> No.  Use mount --bind

That's a working and less complex alternative for upto how many places
at once? That works for non-root users how...?


Sincerely,
    Lars Marowsky-Br?e <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From lmb at suse.de  Wed Aug 10 11:02:59 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Wed, 10 Aug 2005 13:02:59 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810105450.GA6519@infradead.org>
References: <20050802071828.GA11217@redhat.com>
	<20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
	<20050810103256.GA6127@infradead.org>
	<20050810103424.GC4634@marowsky-bree.de>
	<20050810105450.GA6519@infradead.org>
Message-ID: <20050810110259.GE4634@marowsky-bree.de>

On 2005-08-10T11:54:50, Christoph Hellwig <hch at infradead.org> wrote:

> It works now.  Unlike context link which steal totally valid symlink
> targets for magic mushroom bullshit.

Right, that is a valid concern. Avoiding context dependent symlinks
entirely certainly is one possible path around this. But, let's just for
the sake of this discussion continue the other path for a bit, to
explore the options available for implementing CPS which don't result in
shivers running down the spine, because I believe CPS do have some
applications in which bind mounts are not entirely adequate
replacements.

(Unless, of course, you want a bind mount for each homedirectory which
might include architecture-specific subdirectories or for every
host-specific configuration file.)

What would a syntax look like which in your opinion does not remove
totally valid symlink targets for magic mushroom bullshit? Prefix with
// (which, according to POSIX, allows for implementation-defined
behaviour)? Something else, not allowed in a regular pathname?

If we can't find an acceptable way of implementing them, maybe it's time
to grab some magic mushrooms and come up with a new approach, then ;-)


Sincerely,
    Lars Marowsky-Br?e <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From lmb at suse.de  Wed Aug 10 11:09:17 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Wed, 10 Aug 2005 13:09:17 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810110511.GA6728@infradead.org>
References: <20050802071828.GA11217@redhat.com>
	<20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
	<20050810103256.GA6127@infradead.org>
	<20050810103424.GC4634@marowsky-bree.de>
	<20050810105450.GA6519@infradead.org>
	<20050810110259.GE4634@marowsky-bree.de>
	<20050810110511.GA6728@infradead.org>
Message-ID: <20050810110917.GG4634@marowsky-bree.de>

On 2005-08-10T12:05:11, Christoph Hellwig <hch at infradead.org> wrote:

> > What would a syntax look like which in your opinion does not remove
> > totally valid symlink targets for magic mushroom bullshit? Prefix with
> > // (which, according to POSIX, allows for implementation-defined
> > behaviour)? Something else, not allowed in a regular pathname?
> None.  just don't do it.  Use bindmount, they're cheap and have sane
> defined semtantics.

So for every directoy hiearchy on a shared filesystem, each user needs
to have the complete list of bindmounts needed, and automatically resync
that across all nodes when a new one is added or removed? And then have
that executed by root, because a regular user can't?

Sure. Very cheap and sane. I'm buying.


Sincerely,
    Lars Marowsky-Br?e <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From javipolo at datagrama.net  Wed Aug 10 11:45:26 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Wed, 10 Aug 2005 13:45:26 +0200
Subject: [Linux-cluster] problem with rejoining a node
In-Reply-To: <1123611884.13564.11.camel@ayanami.boston.redhat.com>
References: <20050808141238.GA6455@gibson.drslump.org>
	<42F77BB2.4070705@redhat.com>
	<20050808215530.GA23695@gibson.drslump.org>
	<42F8A0A4.5000507@redhat.com>
	<1123611884.13564.11.camel@ayanami.boston.redhat.com>
Message-ID: <20050810114526.GA23149@kinoko.datagrama.net>

On Aug/09/2005, Lon Hohberger wrote:

> When using SAN fencing, the easiest set of steps to follow to ensure
> data integrity + correct operation:
> (a) power off machine
> (b) un-fence at the SAN level
> (c) power on machine

I've made a script that, prior to starting any of the cluster
infrastructure, enables his SAN port.

I can then join the cluster, but when I try to join the fence, it locks
up there ... :

gfstest1:~# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-1,80,3
[]

gfstest1:~# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    3   M   gfstest1
   2    1    3   M   gfstest2
   3    1    3   M   gfstest3
gfstest1:~# cman_tool status
Protocol version: 4.0.1
Config version: 9
Cluster name: test_cluster
Cluster ID: 61876
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2   
Active subsystems: 1
Node addresses: 192.168.0.1

gfstest1:~# 

from other nodes, I see it as recovering:
gfstest2:/etc/init.d# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 2 -
[2 3]

what happent?

-- 
Javier Polo @ Datagrama
902 136 126



From penberg at cs.helsinki.fi  Tue Aug  9 14:49:43 2005
From: penberg at cs.helsinki.fi (Pekka Enberg)
Date: Tue, 09 Aug 2005 17:49:43 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <42F7A557.3000200@zabbo.net>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net>
Message-ID: <1123598983.10790.1.camel@haji.ri.fi>

On Mon, 2005-08-08 at 11:32 -0700, Zach Brown wrote:
> > Sorry if this is an obvious question but what prevents another thread
> > from doing mmap() before we do the second walk and messing up num_gh?
> 
> Nothing, I suspect.  OCFS2 has a problem like this, too.  It wants a way
> for a file system to serialize mmap/munmap/mremap during file IO.  Well,
> more specifically, it wants to make sure that the locks it acquired at
> the start of the IO really cover the buf regions that might fault during
> the IO.. mapping activity during the IO can wreck that.

In addition, the vma walk will become an unmaintainable mess as soon as
someone introduces another mmap() capable fs that needs similar locking.

I am not an expert so could someone please explain why this cannot be
done with a_ops->prepare_write and friends?

			Pekka



From viro at parcelfarce.linux.theplanet.co.uk  Tue Aug  9 15:20:45 2005
From: viro at parcelfarce.linux.theplanet.co.uk (Al Viro)
Date: Tue, 9 Aug 2005 16:20:45 +0100
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050802071828.GA11217@redhat.com>
References: <20050802071828.GA11217@redhat.com>
Message-ID: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>

On Tue, Aug 02, 2005 at 03:18:28PM +0800, David Teigland wrote:
> Hi, GFS (Global File System) is a cluster file system that we'd like to
> see added to the kernel.  The 14 patches total about 900K so I won't send
> them to the list unless that's requested.  Comments and suggestions are
> welcome.  Thanks
> 
> http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050801/broken-out/

Kindly lose the "Context Dependent Pathname" crap.



From zab at zabbo.net  Tue Aug  9 17:17:10 2005
From: zab at zabbo.net (Zach Brown)
Date: Tue, 09 Aug 2005 10:17:10 -0700
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <1123598983.10790.1.camel@haji.ri.fi>
References: <20050802071828.GA11217@redhat.com>	
	<84144f0205080203163cab015c@mail.gmail.com>	
	<20050803063644.GD9812@redhat.com>	
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>	
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
Message-ID: <42F8E516.7020600@zabbo.net>

Pekka Enberg wrote:

> In addition, the vma walk will become an unmaintainable mess as soon
>  as someone introduces another mmap() capable fs that needs similar 
> locking.

Yup, I suspect that if the core kernel ends up caring about this problem
then the VFS will be involved in helping file systems sort the locks
they'll acquire around IO.

> I am not an expert so could someone please explain why this cannot be
>  done with a_ops->prepare_write and friends?

I'll try, briefly.

Usually clustered file systems in Linux maintain data consistency for
normal posix IO by holding DLM locks for the duration of their
file->{read,write} methods.  A task on a node won't be able to read
until all tasks on other nodes have finished any conflicting writes they
might have been performing, etc, nothing surprising here.

Now say we want to extend consistency guarantees to mmap().  This boils
down to protecting mappings with DLM locks.  Say a page is mapped for
reading, the continued presence of that mapping is protected by holding
a DLM lock.  If another node goes to write to that page, the read lock
is revoked and the mapping is torn down.  These locks are acquired in
a_ops->nopage as the task faults and tries to bring up the mapping.

And that's the problem. Because they're acquired in ->nopage they can
be acquired during a fault that is servicing the 'buf' argument to an
outer file->{read,write} operation which has grabbed a lock for the
target file. Acquiring multiple locks introduces the risk of ABBA
deadlocks. It's trivial to construct examples of mmap(), read(), and
write() on 2 nodes with 2 files that deadlock.

So clustered file systems in Linux (GFS, Lustre, OCFS2, (GPFS?)) all
walk vmas in their file->{read,write} to discover mappings that belong
to their files so that they can preemptively sort and acquire the locks
that will be needed to cover the mappings that might be established in
->nopage. As you point out, this both relies on the mappings not
changing and gets very exciting when you mix files and mappings between
file systems that are each sorting and acquiring their own DLM locks.

I brought this up with some people at the kernel summit but no one,
including myself, considers it a high priority.  It wouldn't be too hard
to construct a patch if people want to take a look.

- z



From penberg at cs.helsinki.fi  Tue Aug  9 18:35:58 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Tue, 09 Aug 2005 21:35:58 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <42F8E516.7020600@zabbo.net>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
	<42F8E516.7020600@zabbo.net>
Message-ID: <courier.42F8F78E.0000257A@courier.cs.helsinki.fi>

Hi Zach, 

Zach Brown writes:
> I'll try, briefly.

Thanks for the excellent explanation. 

Zach Brown writes:
> And that's the problem. Because they're acquired in ->nopage they can
> be acquired during a fault that is servicing the 'buf' argument to an
> outer file->{read,write} operation which has grabbed a lock for the
> target file. Acquiring multiple locks introduces the risk of ABBA
> deadlocks. It's trivial to construct examples of mmap(), read(), and
> write() on 2 nodes with 2 files that deadlock.

But couldn't we use make_pages_present() to figure which locks we need, sort 
them, and then grab them? 

Zach Brown writes:
> I brought this up with some people at the kernel summit but no one,
> including myself, considers it a high priority.  It wouldn't be too hard
> to construct a patch if people want to take a look.

I guess it's not a problem as long as the kernel has zero or one cluster 
filesystems that support mmap(). After we have two or more, we have a 
problem. 

The GFS2 vma walk needs fixing anyway, I think, as it can lead to buffer 
overflow (if we have more locks during the second walk). 

               Pekka 



From penberg at cs.helsinki.fi  Wed Aug 10 04:48:29 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Wed, 10 Aug 2005 07:48:29 +0300
Subject: [Linux-cluster] Re: GFS
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
	<42F8E516.7020600@zabbo.net>
Message-ID: <courier.42F9871D.000051DB@courier.cs.helsinki.fi>

Zach Brown writes:
> But couldn't we use make_pages_present() to figure which locks we need, 
> sort them, and then grab them?

Doh, obviously we can't as nopage() needs to bring the page in. Sorry about 
that. 

I also thought of another failure case for the vma walk. When a thread uses 
userspace memcpy() between two clusterfs mmap'd regions instead of write() 
or read(). 

              Pekka 



From hch at infradead.org  Wed Aug 10 07:03:09 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Wed, 10 Aug 2005 08:03:09 +0100
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
References: <20050802071828.GA11217@redhat.com>
	<20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
Message-ID: <20050810070309.GA2415@infradead.org>

On Tue, Aug 09, 2005 at 04:20:45PM +0100, Al Viro wrote:
> On Tue, Aug 02, 2005 at 03:18:28PM +0800, David Teigland wrote:
> > Hi, GFS (Global File System) is a cluster file system that we'd like to
> > see added to the kernel.  The 14 patches total about 900K so I won't send
> > them to the list unless that's requested.  Comments and suggestions are
> > welcome.  Thanks
> > 
> > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> > http://redhat.com/~teigland/gfs2/20050801/broken-out/
> 
> Kindly lose the "Context Dependent Pathname" crap.

Same for ocfs2.



From hch at infradead.org  Wed Aug 10 07:21:21 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Wed, 10 Aug 2005 08:21:21 +0100
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <1123598983.10790.1.camel@haji.ri.fi>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
Message-ID: <20050810072121.GA2825@infradead.org>

On Tue, Aug 09, 2005 at 05:49:43PM +0300, Pekka Enberg wrote:
> On Mon, 2005-08-08 at 11:32 -0700, Zach Brown wrote:
> > > Sorry if this is an obvious question but what prevents another thread
> > > from doing mmap() before we do the second walk and messing up num_gh?
> > 
> > Nothing, I suspect.  OCFS2 has a problem like this, too.  It wants a way
> > for a file system to serialize mmap/munmap/mremap during file IO.  Well,
> > more specifically, it wants to make sure that the locks it acquired at
> > the start of the IO really cover the buf regions that might fault during
> > the IO.. mapping activity during the IO can wreck that.
> 
> In addition, the vma walk will become an unmaintainable mess as soon as
> someone introduces another mmap() capable fs that needs similar locking.

We already have OCFS2 in -mm that does similar things.  I think we need
to solve this in common code before either of them can be merged.



From penberg at cs.helsinki.fi  Wed Aug 10 07:31:04 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Wed, 10 Aug 2005 10:31:04 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050810072121.GA2825@infradead.org>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
	<20050810072121.GA2825@infradead.org>
Message-ID: <courier.42F9AD38.000018F9@courier.cs.helsinki.fi>

On Tue, Aug 09, 2005 at 05:49:43PM +0300, Pekka Enberg wrote:
> > In addition, the vma walk will become an unmaintainable mess as soon as
> > someone introduces another mmap() capable fs that needs similar locking.

Christoph Hellwig writes:
> We already have OCFS2 in -mm that does similar things.  I think we need
> to solve this in common code before either of them can be merged.

It seems to me that the distributed locks must be acquired in ->nopage 
anyway to solve the problem with memcpy() between two mmap'd regions. One 
possible solution would be for the lock manager to detect deadlocks and 
break some locks accordingly. Don't know how well that would mix with 
 ->nopage though... 

                  Pekka 



From hch at infradead.org  Wed Aug 10 07:43:38 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Wed, 10 Aug 2005 08:43:38 +0100
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <courier.42F9AF75.00005806@courier.cs.helsinki.fi>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080223445375c907@mail.gmail.com>
	<20050808095747.GD13951@redhat.com>
	<courier.42F9AF75.00005806@courier.cs.helsinki.fi>
Message-ID: <20050810074338.GA3172@infradead.org>

On Wed, Aug 10, 2005 at 10:40:37AM +0300, Pekka J Enberg wrote:
> Hi David, 
> 
> >+             return -EINVAL;
> >+     if (!access_ok(VERIFY_WRITE, buf, size))
> >+             return -EFAULT;
> >+
> >+     if (!(file->f_flags & O_LARGEFILE)) {
> >+             if (*offset >= 0x7FFFFFFFull)
> >+                     return -EFBIG;
> >+             if (*offset + size > 0x7FFFFFFFull)
> >+                     size = 0x7FFFFFFFull - *offset;
> 
> Please use a constant instead for 0x7FFFFFFFull. (Appears in various other 
> places as well.) 

In fact this very much looks like it's duplicating generic_write_checks().
Folks, please use common code.



From hch at infradead.org  Wed Aug 10 10:32:56 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Wed, 10 Aug 2005 11:32:56 +0100
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810103041.GB4634@marowsky-bree.de>
References: <20050802071828.GA11217@redhat.com>
	<20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
Message-ID: <20050810103256.GA6127@infradead.org>

On Wed, Aug 10, 2005 at 12:30:41PM +0200, Lars Marowsky-Bree wrote:
> On 2005-08-10T08:03:09, Christoph Hellwig <hch at infradead.org> wrote:
> 
> > > Kindly lose the "Context Dependent Pathname" crap.
> > Same for ocfs2.
> 
> Would a generic implementation of that higher up in the VFS be more
> acceptable?

No.  Use mount --bind



From hch at infradead.org  Wed Aug 10 10:54:50 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Wed, 10 Aug 2005 11:54:50 +0100
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810103424.GC4634@marowsky-bree.de>
References: <20050802071828.GA11217@redhat.com>
	<20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
	<20050810103256.GA6127@infradead.org>
	<20050810103424.GC4634@marowsky-bree.de>
Message-ID: <20050810105450.GA6519@infradead.org>

On Wed, Aug 10, 2005 at 12:34:24PM +0200, Lars Marowsky-Bree wrote:
> On 2005-08-10T11:32:56, Christoph Hellwig <hch at infradead.org> wrote:
> 
> > > Would a generic implementation of that higher up in the VFS be more
> > > acceptable?
> > No.  Use mount --bind
> 
> That's a working and less complex alternative for upto how many places
> at once? That works for non-root users how...?

It works now.  Unlike context link which steal totally valid symlink
targets for magic mushroom bullshit.



From hch at infradead.org  Wed Aug 10 11:05:11 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Wed, 10 Aug 2005 12:05:11 +0100
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810110259.GE4634@marowsky-bree.de>
References: <20050802071828.GA11217@redhat.com>
	<20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
	<20050810103256.GA6127@infradead.org>
	<20050810103424.GC4634@marowsky-bree.de>
	<20050810105450.GA6519@infradead.org>
	<20050810110259.GE4634@marowsky-bree.de>
Message-ID: <20050810110511.GA6728@infradead.org>

On Wed, Aug 10, 2005 at 01:02:59PM +0200, Lars Marowsky-Bree wrote:
> What would a syntax look like which in your opinion does not remove
> totally valid symlink targets for magic mushroom bullshit? Prefix with
> // (which, according to POSIX, allows for implementation-defined
> behaviour)? Something else, not allowed in a regular pathname?

None.  just don't do it.  Use bindmount, they're cheap and have sane
defined semtantics.



From hch at infradead.org  Wed Aug 10 11:11:10 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Wed, 10 Aug 2005 12:11:10 +0100
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810110917.GG4634@marowsky-bree.de>
References: <20050802071828.GA11217@redhat.com>
	<20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
	<20050810103256.GA6127@infradead.org>
	<20050810103424.GC4634@marowsky-bree.de>
	<20050810105450.GA6519@infradead.org>
	<20050810110259.GE4634@marowsky-bree.de>
	<20050810110511.GA6728@infradead.org>
	<20050810110917.GG4634@marowsky-bree.de>
Message-ID: <20050810111110.GA6878@infradead.org>

On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:
> On 2005-08-10T12:05:11, Christoph Hellwig <hch at infradead.org> wrote:
> 
> > > What would a syntax look like which in your opinion does not remove
> > > totally valid symlink targets for magic mushroom bullshit? Prefix with
> > > // (which, according to POSIX, allows for implementation-defined
> > > behaviour)? Something else, not allowed in a regular pathname?
> > None.  just don't do it.  Use bindmount, they're cheap and have sane
> > defined semtantics.
> 
> So for every directoy hiearchy on a shared filesystem, each user needs
> to have the complete list of bindmounts needed, and automatically resync
> that across all nodes when a new one is added or removed? And then have
> that executed by root, because a regular user can't?

Do it in an initscripts and let users simply not do it, they shouldn't
even know what kind of filesystem they are on.



From alewis at redhat.com  Wed Aug 10 13:26:26 2005
From: alewis at redhat.com (AJ Lewis)
Date: Wed, 10 Aug 2005 08:26:26 -0500
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810111110.GA6878@infradead.org>
References: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
	<20050810103256.GA6127@infradead.org>
	<20050810103424.GC4634@marowsky-bree.de>
	<20050810105450.GA6519@infradead.org>
	<20050810110259.GE4634@marowsky-bree.de>
	<20050810110511.GA6728@infradead.org>
	<20050810110917.GG4634@marowsky-bree.de>
	<20050810111110.GA6878@infradead.org>
Message-ID: <20050810132626.GC4954@null.msp.redhat.com>

On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote:
> On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:
> > So for every directoy hiearchy on a shared filesystem, each user needs
> > to have the complete list of bindmounts needed, and automatically resync
> > that across all nodes when a new one is added or removed? And then have
> > that executed by root, because a regular user can't?
> 
> Do it in an initscripts and let users simply not do it, they shouldn't
> even know what kind of filesystem they are on.

I'm just thinking of a 100-node cluster that has different mounts on different
nodes, and trying to update the bind mounts in a sane and efficient manner
without clobbering the various mount setups.  Ouch.

-- 
AJ Lewis                                   Voice:  612-638-0500
Red Hat                                    E-Mail: alewis at redhat.com
One Main Street SE, Suite 209
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050810/698c8166/attachment.sig>

From gforte at leopard.us.udel.edu  Wed Aug 10 14:20:55 2005
From: gforte at leopard.us.udel.edu (Greg Forte)
Date: Wed, 10 Aug 2005 10:20:55 -0400
Subject: [Linux-cluster] Which APC fence device ?
Message-ID: <42FA0D47.1090308@leopard.us.udel.edu>

There was a message on this list in January to the effect of "does APC 
7900 work with fence_apc?" / "I think so ...".

Has anyone confirmed this?

-g

Greg Forte
gforte at udel.edu
IT - User Services
University of Delaware
302-831-1982
Newark, DE



From eric at bootseg.com  Wed Aug 10 14:43:05 2005
From: eric at bootseg.com (Eric Kerin)
Date: Wed, 10 Aug 2005 10:43:05 -0400
Subject: [Linux-cluster] Which APC fence device ?
In-Reply-To: <42FA0D47.1090308@leopard.us.udel.edu>
References: <42FA0D47.1090308@leopard.us.udel.edu>
Message-ID: <1123684986.3402.14.camel@auh5-0479.corp.jabil.org>

On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote:
> There was a message on this list in January to the effect of "does APC 
> 7900 work with fence_apc?" / "I think so ...".
> 
> Has anyone confirmed this?

I'm using them in my cluster, they work well with the fence_apc agent.

The only caveat is that the agent won't work if you use outlet groups
for cluster nodes, without a few changes to the agent.  I can give you
my modified version if you want to use outlet groups for cluster nodes.
(A patch will be coming that will work with both port group and non-port
group modes, but I haven't had a chance to write it up yet.)

Thanks,
Eric





From ggilyeat at jhsph.edu  Wed Aug 10 14:47:15 2005
From: ggilyeat at jhsph.edu (Gerald G. Gilyeat)
Date: Wed, 10 Aug 2005 10:47:15 -0400
Subject: [Linux-cluster] Test Env.
Message-ID: <DF33F4DAC09B3048AA095F1B5C368915A57EBF@XCH-VN02.sph.ad.jhsph.edu>


Another question - What is the -minimal- hardware configuration to test GFS6.1 (with all the RHEL4 and LVM2 goodness...)

Thanks :)

--
Jerry Gilyeat, RHCE
Systems Administrator
Molecular Microbiology and Immunology
Johns Hopkins Bloomberg School of Public Health
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050810/61275476/attachment.htm>

From gforte at leopard.us.udel.edu  Wed Aug 10 15:01:56 2005
From: gforte at leopard.us.udel.edu (Greg Forte)
Date: Wed, 10 Aug 2005 11:01:56 -0400
Subject: [Linux-cluster] Which APC fence device ?
In-Reply-To: <1123684986.3402.14.camel@auh5-0479.corp.jabil.org>
References: <42FA0D47.1090308@leopard.us.udel.edu>
	<1123684986.3402.14.camel@auh5-0479.corp.jabil.org>
Message-ID: <42FA16E4.9050101@leopard.us.udel.edu>

Eric Kerin wrote:
> On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote:
> 
>>There was a message on this list in January to the effect of "does APC 
>>7900 work with fence_apc?" / "I think so ...".
>>
>>Has anyone confirmed this?
> 
> 
> I'm using them in my cluster, they work well with the fence_apc agent.
> 
> The only caveat is that the agent won't work if you use outlet groups
> for cluster nodes, without a few changes to the agent.  I can give you
> my modified version if you want to use outlet groups for cluster nodes.
> (A patch will be coming that will work with both port group and non-port
> group modes, but I haven't had a chance to write it up yet.)

Great, thanks.

I'm new to this sort of hardware, not sure exactly what you mean by 
"outlet groups", could you elaborate?

-g

Greg Forte
gforte at udel.edu
IT - User Services
University of Delaware
302-831-1982
Newark, DE



From eric at bootseg.com  Wed Aug 10 15:16:24 2005
From: eric at bootseg.com (Eric Kerin)
Date: Wed, 10 Aug 2005 11:16:24 -0400
Subject: [Linux-cluster] Which APC fence device ?
In-Reply-To: <42FA16E4.9050101@leopard.us.udel.edu>
References: <42FA0D47.1090308@leopard.us.udel.edu>
	<1123684986.3402.14.camel@auh5-0479.corp.jabil.org>
	<42FA16E4.9050101@leopard.us.udel.edu>
Message-ID: <1123686984.3402.19.camel@auh5-0479.corp.jabil.org>

On Wed, 2005-08-10 at 11:01 -0400, Greg Forte wrote:
> Eric Kerin wrote:
> > On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote:
> > 
> >>There was a message on this list in January to the effect of "does APC 
> >>7900 work with fence_apc?" / "I think so ...".
> >>
> >>Has anyone confirmed this?
> > 
> > 
> > I'm using them in my cluster, they work well with the fence_apc agent.
> > 
> > The only caveat is that the agent won't work if you use outlet groups
> > for cluster nodes, without a few changes to the agent.  I can give you
> > my modified version if you want to use outlet groups for cluster nodes.
> > (A patch will be coming that will work with both port group and non-port
> > group modes, but I haven't had a chance to write it up yet.)
> 
> Great, thanks.
> 
> I'm new to this sort of hardware, not sure exactly what you mean by 
> "outlet groups", could you elaborate?
> 
It's a feature on the APC 7900 that will allow a single command to one
7900 to automatically turn off/on ports on other 7900s via the network.
I use it to turn off both power supplies on my cluster nodes at once,
while still providing redundant paths for power.

Eric



From mrmacman_g4 at mac.com  Wed Aug 10 15:43:02 2005
From: mrmacman_g4 at mac.com (Kyle Moffett)
Date: Wed, 10 Aug 2005 11:43:02 -0400
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810132626.GC4954@null.msp.redhat.com>
References: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
	<20050810103256.GA6127@infradead.org>
	<20050810103424.GC4634@marowsky-bree.de>
	<20050810105450.GA6519@infradead.org>
	<20050810110259.GE4634@marowsky-bree.de>
	<20050810110511.GA6728@infradead.org>
	<20050810110917.GG4634@marowsky-bree.de>
	<20050810111110.GA6878@infradead.org>
	<20050810132626.GC4954@null.msp.redhat.com>
Message-ID: <F0B76E9F-6B4F-4984-A540-F431500F9B96@mac.com>

On Aug 10, 2005, at 09:26:26, AJ Lewis wrote:
> On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote:
>
>> On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:
>>
>>> So for every directory hierarchy on a shared filesystem, each  
>>> user needs
>>> to have the complete list of bindmounts needed, and automatically  
>>> resync
>>> that across all nodes when a new one is added or removed? And  
>>> then have
>>> that executed by root, because a regular user can't?
>>
>> Do it in an initscripts and let users simply not do it, they  
>> shouldn't
>> even know what kind of filesystem they are on.
>
> I'm just thinking of a 100-node cluster that has different mounts  
> on different
> nodes, and trying to update the bind mounts in a sane and efficient  
> manner
> without clobbering the various mount setups.  Ouch.

How about something like the following:
     cpslink()      => Create a Context Dependent Symlink
     readcpslink()  => Return the Context Dependent path data
     readlink()     => Return the path of the Context Dependent  
Symlink as it
                       would be evaluated in the current context,  
basically as a
                       normal symlink.
     lstat()        => Return information on the Context Dependent  
Symlink in
                       the same format as a regular symlink.
     unlink()       => Delete the Context Dependent Symlink.

You would need an extra userspace tool that understands cpslink/ 
readcpslink to
create and get information on the links for now, but ls and ln could  
eventually
be updated, and until then the would provide sane behavior.  Perhaps  
this
should be extended into a new API for some of the strange things several
filesystems want to do in the VFS:
     extlink()      => Create an extended filesystem link (with type  
specified)
     readextlink()  => Return the path (and type) for the link

The filesystem could define how each type of link acts with respect  
to other
syscalls.  OpenAFS could use extlink() instead of their symlink magic  
for
adjusting the AFS volume hierarchy.  The new in-kernel AFS client  
could use it
in similar fashion (It has no method to adjust hierarchy, because  
it's still
read-only).  GFS could use it for their Context Dependent Symlinks.   
Since it
would pass the type in as well, it would be possible to use it for  
different
kinds of links on the same filesystem.

Cheers,
Kyle Moffett

--
Simple things should be simple and complex things should be possible
   -- Alan Kay





From aspanke at hpce.nec.com  Wed Aug 10 17:18:06 2005
From: aspanke at hpce.nec.com (Alexander Spanke)
Date: Wed, 10 Aug 2005 19:18:06 +0200
Subject: [Linux-cluster] /bin/login hangs on NFS troubles
Message-ID: <42FA36CE.9030100@hpce.nec.com>

Hi all,

we have a problem with the configuration of a linux cluster based on RH 
ES3 WS. The cluster contains 230+ nodes; these nodes are sharing 
/usr/local/bin over NFS due to shared scripts for health checking.

Now, if we encounter NFS problems, the login process is hanging non-stop 
  until the NFS connection is back again.
After checking all startup scripts and $PATH definitions we found out, 
that the /bin/login binary itself is pre-defining the $PATH hardwired 
with /usr/local/bin in the beginning.

Unfortunately we cannot change the /usr/local/bin down to the local disk 
again, but need a solution to be able to login even or especially with 
NFS problems to start investigation.

Do you have any nice idea, how to get rid of this problem ? Changing the 
binary and rebuild the package is one, but we don't like it because of 
missing update possibility after it ...

Thnx in advance

Cheers
  Alex


-- 
=======================================================
Alexander Spanke			 System Analyst
       NEC High Performance Computing Europe GmbH
Prinzenallee 11		   D-40549 Duesseldorf, Germany
Tel:	+49 211 5369 146   	   aspanke at hpce.nec.com
Fax:	+49 211 5369 199    	http://www.hpce.nec.com
=======================================================



From gforte at leopard.us.udel.edu  Wed Aug 10 17:22:08 2005
From: gforte at leopard.us.udel.edu (Greg Forte)
Date: Wed, 10 Aug 2005 13:22:08 -0400
Subject: [Linux-cluster] Which APC fence device ?
In-Reply-To: <1123686984.3402.19.camel@auh5-0479.corp.jabil.org>
References: <42FA0D47.1090308@leopard.us.udel.edu>	<1123684986.3402.14.camel@auh5-0479.corp.jabil.org>	<42FA16E4.9050101@leopard.us.udel.edu>
	<1123686984.3402.19.camel@auh5-0479.corp.jabil.org>
Message-ID: <42FA37C0.6000807@leopard.us.udel.edu>

Eric Kerin wrote:
> On Wed, 2005-08-10 at 11:01 -0400, Greg Forte wrote:
> 
>>Eric Kerin wrote:
>>
>>>On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote:
>>>
>>>
>>>>There was a message on this list in January to the effect of "does APC 
>>>>7900 work with fence_apc?" / "I think so ...".
>>>>
>>>>Has anyone confirmed this?
>>>
>>>
>>>I'm using them in my cluster, they work well with the fence_apc agent.
>>>
>>>The only caveat is that the agent won't work if you use outlet groups
>>>for cluster nodes, without a few changes to the agent.  I can give you
>>>my modified version if you want to use outlet groups for cluster nodes.
>>>(A patch will be coming that will work with both port group and non-port
>>>group modes, but I haven't had a chance to write it up yet.)
>>
>>Great, thanks.
>>
>>I'm new to this sort of hardware, not sure exactly what you mean by 
>>"outlet groups", could you elaborate?
>>
> 
> It's a feature on the APC 7900 that will allow a single command to one
> 7900 to automatically turn off/on ports on other 7900s via the network.
> I use it to turn off both power supplies on my cluster nodes at once,
> while still providing redundant paths for power.

Ah, I see.  Yes, I will probably be wanting to do the same thing, if you 
could forward me your patch I'd appreciate it.

-g

Greg Forte
gforte at udel.edu
IT - User Services
University of Delaware
302-831-1982
Newark, DE



From eric at bootseg.com  Wed Aug 10 17:43:32 2005
From: eric at bootseg.com (Eric Kerin)
Date: Wed, 10 Aug 2005 13:43:32 -0400
Subject: [Linux-cluster] Which APC fence device ?
In-Reply-To: <42FA37C0.6000807@leopard.us.udel.edu>
References: <42FA0D47.1090308@leopard.us.udel.edu>
	<1123684986.3402.14.camel@auh5-0479.corp.jabil.org>
	<42FA16E4.9050101@leopard.us.udel.edu>
	<1123686984.3402.19.camel@auh5-0479.corp.jabil.org>
	<42FA37C0.6000807@leopard.us.udel.edu>
Message-ID: <1123695813.3402.31.camel@auh5-0479.corp.jabil.org>

On Wed, 2005-08-10 at 13:22 -0400, Greg Forte wrote:
> Eric Kerin wrote:
> >On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote:
> >>Great, thanks.
> >>
> >>I'm new to this sort of hardware, not sure exactly what you mean by 
> >>"outlet groups", could you elaborate?
> >>
> > 
> > It's a feature on the APC 7900 that will allow a single command to one
> > 7900 to automatically turn off/on ports on other 7900s via the network.
> > I use it to turn off both power supplies on my cluster nodes at once,
> > while still providing redundant paths for power.
> 
> Ah, I see.  Yes, I will probably be wanting to do the same thing, if you 
> could forward me your patch I'd appreciate it.

I attached the modified fence_apc agent for use with Outlet Groups, just
replace fence_apc in /sbin with this one (as long as you're using outlet
groups, if you're not you'll have to use the original one that comes
with the cluster software.)  I'll whip up a proper, non-outlet group
compatible, patch when I get some time.

When you setup your fence devices, you'll want to use the name of the
port, instead of the port number.  My config looks like this for the
node's fence device config:
<device name="AUHAPC01" port="AUHJPSN01A" switch="0"/>

Or if you don't rename the ports: 
<device name="AUHAPC01" port="Outlet 1" switch="0"/>

By default the port names are "Outlet X", but if you change them like I
did just use whatever you called the port.

Hope this helps

Eric

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fence_apc
Type: application/x-perl
Size: 10776 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050810/a9394e4b/attachment.pl>

From lhh at redhat.com  Wed Aug 10 18:41:28 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 10 Aug 2005 14:41:28 -0400
Subject: [Linux-cluster] Test Env.
In-Reply-To: <DF33F4DAC09B3048AA095F1B5C368915A57EBF@XCH-VN02.sph.ad.jhsph.edu>
References: <DF33F4DAC09B3048AA095F1B5C368915A57EBF@XCH-VN02.sph.ad.jhsph.edu>
Message-ID: <1123699288.13564.17.camel@ayanami.boston.redhat.com>

On Wed, 2005-08-10 at 10:47 -0400, Gerald G. Gilyeat wrote:
> 
> Another question - What is the -minimal- hardware configuration to
> test GFS6.1 (with all the RHEL4 and LVM2 goodness...)
> 
2 machines, a network switch, and a SAN

If you don't have a SAN, a third machine acting as a GNBD server.  (You
even get software-fencing [fence_gnbd] if you do it this way!)

-- Lon



From amanthei at redhat.com  Wed Aug 10 19:08:04 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Wed, 10 Aug 2005 14:08:04 -0500
Subject: [Linux-cluster] Which APC fence device ?
In-Reply-To: <42FA0D47.1090308@leopard.us.udel.edu>
References: <42FA0D47.1090308@leopard.us.udel.edu>
Message-ID: <20050810190804.GG17678@redhat.com>

On Wed, Aug 10, 2005 at 10:20:55AM -0400, Greg Forte wrote:
> There was a message on this list in January to the effect of "does APC 
> 7900 work with fence_apc?" / "I think so ...".
> 
> Has anyone confirmed this?

Yes, the APC 79xx Series is supported.

-- 
Adam Manthei  <amanthei at redhat.com>



From teigland at redhat.com  Thu Aug 11 03:45:14 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 11 Aug 2005 11:45:14 +0800
Subject: [Linux-cluster] problem with rejoining a node
In-Reply-To: <20050810114526.GA23149@kinoko.datagrama.net>
References: <20050808141238.GA6455@gibson.drslump.org>
	<42F77BB2.4070705@redhat.com>
	<20050808215530.GA23695@gibson.drslump.org>
	<42F8A0A4.5000507@redhat.com>
	<1123611884.13564.11.camel@ayanami.boston.redhat.com>
	<20050810114526.GA23149@kinoko.datagrama.net>
Message-ID: <20050811034513.GA11132@redhat.com>

On Wed, Aug 10, 2005 at 01:45:26PM +0200, Javi Polo wrote:
> I've made a script that, prior to starting any of the cluster
> infrastructure, enables his SAN port.

I'm not sure if this is related to the rest.

> I can then join the cluster, but when I try to join the fence, it locks
> up there ... :
> 
> gfstest1:~# cman_tool services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           0   2 join      S-1,80,3
> []

it's waiting to join the fence domain, the others won't let him yet...

> gfstest1:~# cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    3   M   gfstest1
>    2    1    3   M   gfstest2
>    3    1    3   M   gfstest3
> 
> from other nodes, I see it as recovering:
> gfstest2:/etc/init.d# cman_tool services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           1   2 recover 2 -
> [2 3]

These two appear to be trying to fence gfstest1, but the fencing operation
hasn't completed.  They won't let anyone join the domain until they
finish.  You could check /var/log/messages on 2&3 for any fencing messages
or errors.

Dave



From teigland at redhat.com  Thu Aug 11 06:06:02 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 11 Aug 2005 14:06:02 +0800
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <1122968724.3247.22.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
Message-ID: <20050811060602.GA12438@redhat.com>

On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

> * +	if (create)
> +		down_write(&ip->i_rw_mutex);
> +	else
> +		down_read(&ip->i_rw_mutex);
> 
> why do you use a rwsem and not a regular semaphore? You are aware that
> rwsems are far more expensive than regular ones right?  How skewed is
> the read/write ratio?

Rough tests show around 4/1, that high or low?



From arjan at infradead.org  Thu Aug 11 06:55:49 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Thu, 11 Aug 2005 08:55:49 +0200
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050811060602.GA12438@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<1122968724.3247.22.camel@laptopd505.fenrus.org>
	<20050811060602.GA12438@redhat.com>
Message-ID: <1123743349.3201.15.camel@laptopd505.fenrus.org>

On Thu, 2005-08-11 at 14:06 +0800, David Teigland wrote:
> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
> 
> > * +	if (create)
> > +		down_write(&ip->i_rw_mutex);
> > +	else
> > +		down_read(&ip->i_rw_mutex);
> > 
> > why do you use a rwsem and not a regular semaphore? You are aware that
> > rwsems are far more expensive than regular ones right?  How skewed is
> > the read/write ratio?
> 
> Rough tests show around 4/1, that high or low?

that's quite borderline; if it was my code I'd not use a rwsem for that
ratio (my own rule of thumb, based on not a lot other than gut feeling)
is a 10/1 ratio at minimum... but it's not so low that it screams for
removing it. However.... it might well make your code a lot simpler so
it might still be worth simplifying.




From teigland at redhat.com  Thu Aug 11 08:17:29 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 11 Aug 2005 16:17:29 +0800
Subject: [Linux-cluster] GFS - updated patches
In-Reply-To: <20050802071828.GA11217@redhat.com>
References: <20050802071828.GA11217@redhat.com>
Message-ID: <20050811081729.GB12438@redhat.com>

Thanks for all the review and comments.  This is a new set of patches that
incorporates the suggestions we've received.

http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050811/broken-out/

Dave



From mikore.li at gmail.com  Thu Aug 11 08:21:04 2005
From: mikore.li at gmail.com (Michael)
Date: Thu, 11 Aug 2005 16:21:04 +0800
Subject: [Linux-cluster] GFS - updated patches
In-Reply-To: <20050811081729.GB12438@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
Message-ID: <bc57270905081101217fdd4c5f@mail.gmail.com>

I have the same question as I asked before, how can I see GFS in "make
menuconfig", after I patch gfs2-full.patch into a 2.6.12.2 kernel?

Michael

On 8/11/05, David Teigland <teigland at redhat.com> wrote:
> Thanks for all the review and comments.  This is a new set of patches that
> incorporates the suggestions we've received.
> 
> http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050811/broken-out/
> 
> Dave
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>



From arjan at infradead.org  Thu Aug 11 08:32:38 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Thu, 11 Aug 2005 10:32:38 +0200
Subject: [Linux-cluster] Re: GFS - updated patches
In-Reply-To: <20050811081729.GB12438@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
Message-ID: <1123749159.3201.19.camel@laptopd505.fenrus.org>

On Thu, 2005-08-11 at 16:17 +0800, David Teigland wrote:
> Thanks for all the review and comments.  This is a new set of patches that
> incorporates the suggestions we've received.

all of them or only a subset?




From teigland at redhat.com  Thu Aug 11 08:46:45 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 11 Aug 2005 16:46:45 +0800
Subject: [Linux-cluster] Re: GFS - updated patches
In-Reply-To: <bc57270905081101217fdd4c5f@mail.gmail.com>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
	<bc57270905081101217fdd4c5f@mail.gmail.com>
Message-ID: <20050811084645.GD12438@redhat.com>

On Thu, Aug 11, 2005 at 04:21:04PM +0800, Michael wrote:
> I have the same question as I asked before, how can I see GFS in "make
> menuconfig", after I patch gfs2-full.patch into a 2.6.12.2 kernel?

You need to select the dlm under drivers.  It's in -mm, or apply
  http://redhat.com/~teigland/dlm.patch



From teigland at redhat.com  Thu Aug 11 08:50:06 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 11 Aug 2005 16:50:06 +0800
Subject: [Linux-cluster] Re: GFS - updated patches
In-Reply-To: <1123749159.3201.19.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
	<1123749159.3201.19.camel@laptopd505.fenrus.org>
Message-ID: <20050811085006.GA19972@redhat.com>

On Thu, Aug 11, 2005 at 10:32:38AM +0200, Arjan van de Ven wrote:
> On Thu, 2005-08-11 at 16:17 +0800, David Teigland wrote:
> > Thanks for all the review and comments.  This is a new set of patches that
> > incorporates the suggestions we've received.
> 
> all of them or only a subset?

All patches, now 01-13 (what was patch 08 disappeared entirely)



From mikore.li at gmail.com  Thu Aug 11 08:49:42 2005
From: mikore.li at gmail.com (Michael)
Date: Thu, 11 Aug 2005 16:49:42 +0800
Subject: [Linux-cluster] Re: GFS - updated patches
In-Reply-To: <20050811084645.GD12438@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
	<bc57270905081101217fdd4c5f@mail.gmail.com>
	<20050811084645.GD12438@redhat.com>
Message-ID: <bc57270905081101495fea825e@mail.gmail.com>

yes, after apply dlm.patch, I saw it! although I don't know what's "-mm".

Thanks,

Michael

On 8/11/05, David Teigland <teigland at redhat.com> wrote:
> On Thu, Aug 11, 2005 at 04:21:04PM +0800, Michael wrote:
> > I have the same question as I asked before, how can I see GFS in "make
> > menuconfig", after I patch gfs2-full.patch into a 2.6.12.2 kernel?
> 
> You need to select the dlm under drivers.  It's in -mm, or apply
>   http://redhat.com/~teigland/dlm.patch
> 
>



From arjan at infradead.org  Thu Aug 11 08:50:32 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Thu, 11 Aug 2005 10:50:32 +0200
Subject: [Linux-cluster] Re: GFS - updated patches
In-Reply-To: <20050811085006.GA19972@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
	<1123749159.3201.19.camel@laptopd505.fenrus.org>
	<20050811085006.GA19972@redhat.com>
Message-ID: <1123750232.3201.22.camel@laptopd505.fenrus.org>

On Thu, 2005-08-11 at 16:50 +0800, David Teigland wrote:
> On Thu, Aug 11, 2005 at 10:32:38AM +0200, Arjan van de Ven wrote:
> > On Thu, 2005-08-11 at 16:17 +0800, David Teigland wrote:
> > > Thanks for all the review and comments.  This is a new set of patches that
> > > incorporates the suggestions we've received.
> > 
> > all of them or only a subset?
> 
> All patches, now 01-13 (what was patch 08 disappeared entirely)

with them I meant the suggestions not the patches ;)




From teigland at redhat.com  Thu Aug 11 09:16:51 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 11 Aug 2005 17:16:51 +0800
Subject: [Linux-cluster] Re: GFS - updated patches
In-Reply-To: <1123750232.3201.22.camel@laptopd505.fenrus.org>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
	<1123749159.3201.19.camel@laptopd505.fenrus.org>
	<20050811085006.GA19972@redhat.com>
	<1123750232.3201.22.camel@laptopd505.fenrus.org>
Message-ID: <20050811091651.GB19972@redhat.com>

On Thu, Aug 11, 2005 at 10:50:32AM +0200, Arjan van de Ven wrote:

> > > > Thanks for all the review and comments.  This is a new set of
> > > > patches that incorporates the suggestions we've received.
> > > 
> > > all of them or only a subset?
> 
> with them I meant the suggestions not the patches ;)

The large majority, and I think all that people care about.  If we ignored
something that someone thinks is important, a reminder would be useful.



From mikore.li at gmail.com  Thu Aug 11 09:54:33 2005
From: mikore.li at gmail.com (Michael)
Date: Thu, 11 Aug 2005 17:54:33 +0800
Subject: [Linux-cluster] GFS - updated patches
In-Reply-To: <20050811081729.GB12438@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
Message-ID: <bc57270905081102545e9555c1@mail.gmail.com>

Hi, Dave,

I quickly applied gfs2 and dlm patches in kernel 2.6.12.2, it passed
compiling but has some warning log, see attachment. maybe helpful to
you.

Thanks,

Michael

On 8/11/05, David Teigland <teigland at redhat.com> wrote:
> Thanks for all the review and comments.  This is a new set of patches that
> incorporates the suggestions we've received.
> 
> http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050811/broken-out/
> 
> Dave
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: gfs2_and_linux-2.6.12.2.txt
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050811/8f2e5ef7/attachment.txt>

From penberg at gmail.com  Thu Aug 11 10:00:35 2005
From: penberg at gmail.com (Pekka Enberg)
Date: Thu, 11 Aug 2005 13:00:35 +0300
Subject: [Linux-cluster] GFS - updated patches
In-Reply-To: <bc57270905081102545e9555c1@mail.gmail.com>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
	<bc57270905081102545e9555c1@mail.gmail.com>
Message-ID: <84144f0205081103003cf4fbe7@mail.gmail.com>

On 8/11/05, Michael <mikore.li at gmail.com> wrote:
> Hi, Dave,
> 
> I quickly applied gfs2 and dlm patches in kernel 2.6.12.2, it passed
> compiling but has some warning log, see attachment. maybe helpful to
> you.

kzalloc is not in Linus' tree yet. Try with 2.6.13-rc5-mm1.

                                 Pekka



From javipolo at datagrama.net  Thu Aug 11 10:03:35 2005
From: javipolo at datagrama.net (Javi Polo)
Date: Thu, 11 Aug 2005 12:03:35 +0200
Subject: [Linux-cluster] problem with rejoining a node
In-Reply-To: <20050811034513.GA11132@redhat.com>
References: <20050808141238.GA6455@gibson.drslump.org>
	<42F77BB2.4070705@redhat.com>
	<20050808215530.GA23695@gibson.drslump.org>
	<42F8A0A4.5000507@redhat.com>
	<1123611884.13564.11.camel@ayanami.boston.redhat.com>
	<20050810114526.GA23149@kinoko.datagrama.net>
	<20050811034513.GA11132@redhat.com>
Message-ID: <20050811100335.GA15222@kinoko.datagrama.net>

On Aug/11/2005, David Teigland wrote:

> > I've made a script that, prior to starting any of the cluster
> > infrastructure, enables his SAN port.
> I'm not sure if this is related to the rest.

I did it because the san port never turned on, and I thought it could be
part of the problem, but I see is not ...

> > gfstest1:~# cman_tool services
> > Service          Name                              GID LID State     Code
> > Fence Domain:    "default"                           0   2 join      S-1,80,3
> > []
> it's waiting to join the fence domain, the others won't let him yet...

> > Service          Name                              GID LID State     Code
> > Fence Domain:    "default"                           1   2 recover 2 -
> > [2 3]
> These two appear to be trying to fence gfstest1, but the fencing operation
> hasn't completed.  They won't let anyone join the domain until they
> finish.  You could check /var/log/messages on 2&3 for any fencing messages
> or errors.

I tried fence_tool with -D on those, and found the problem ....
dont know why, but sometimes the switch sets the port status to "FAULTY"
instead of "OFFLINE", and so the fence_IBMswitch failed and so the node
wasnt completely fenced ....

Now it seems to be working fine! :)))
thx a lot

Now there's another doubt I have ... when the system rejoins the fence,
does the fence_XXX script runs to enable the port switch, or should I do
it by other means (ie enabling it on boot and so) :?

I though about making a boot script that runs cman_tool services, checks if
the host is in the fence, and if so, enable the SAN port and then rescan
for SCSI devices ... but I dont know if that's "the right way" to do it,
or at least a polite one
:?

-- 
Javier Polo @ Datagrama
902 136 126



From penberg at gmail.com  Thu Aug 11 10:04:10 2005
From: penberg at gmail.com (Pekka Enberg)
Date: Thu, 11 Aug 2005 13:04:10 +0300
Subject: [Linux-cluster] Re: GFS - updated patches
In-Reply-To: <20050811091651.GB19972@redhat.com>
References: <20050802071828.GA11217@redhat.com>
	<20050811081729.GB12438@redhat.com>
	<1123749159.3201.19.camel@laptopd505.fenrus.org>
	<20050811085006.GA19972@redhat.com>
	<1123750232.3201.22.camel@laptopd505.fenrus.org>
	<20050811091651.GB19972@redhat.com>
Message-ID: <84144f0205081103043067d36e@mail.gmail.com>

Hi,

On 8/11/05, David Teigland <teigland at redhat.com> wrote:
> The large majority, and I think all that people care about.  If we ignored
> something that someone thinks is important, a reminder would be useful.

The only remaining issue for me is the vma walk. Thanks, David!

                                 Pekka



From pcaulfie at redhat.com  Thu Aug 11 10:56:26 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 11 Aug 2005 11:56:26 +0100
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <42F77AA3.80000@redhat.com>
References: <42DB63F6.5070600@redhat.com>	<1122318870.12824.29.camel@localhost.localdomain>	<42EF4AD1.6010809@redhat.com>	<1123263949.16923.23.camel@localhost.localdomain>
	<42F77AA3.80000@redhat.com>
Message-ID: <42FB2EDA.4010300@redhat.com>

For those not reading the commit list the ais-based cman is now in CVS - be
careful with it...

For the moment it downloads a prepackaged/patched version of the openais source
from my people.redhat.com web site. This /will/ change. In fact the only
additional patch in there is one Steven posted to the openais mailing list so
don't think I'm hiding anything!
	
There's still a lot of work to do on this code but is basically works with a few
caveats:
	
1. Barriers are completely untested and may not work at all.
2. Don't start several nodes up at the same time, they might get the same
   node ID(!) unless you used static node IDs.
3. The exec path for cmand is hard coded (in the Makefile) to ../daemon/cmand
   so you must currently always run cman_tool from the dev directory unless
   you change it.
4. Broadcast is no longer supported. If you fail to specify a multicast address
   cman_tool will provide one.
5. IPv6 is unsupported, I'm going to start on that next!
6. Error reporting is probably rubbish.

Generally it seems to work. I can certainly get the DLM up with it now.
-- 

patrick



From Axel.Thimm at ATrpms.net  Thu Aug 11 13:31:17 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Thu, 11 Aug 2005 15:31:17 +0200
Subject: [Linux-cluster] exporting gfs quotas over nfs (was: GFS -> NFS,
	Locking and quotas)
In-Reply-To: <2a5b11a870bfe22794ec127f5e34cf22@valuecommerce.co.jp>
References: <2a5b11a870bfe22794ec127f5e34cf22@valuecommerce.co.jp>
Message-ID: <20050811133117.GC19032@neu.nirvana>

On Tue, Feb 22, 2005 at 06:14:04PM +0900, Nathan Ollerenshaw wrote:
> 1. It should support quotas.
> 
> It seems that GFS supports quotas just fine, using some commands on the 
> machines mounting the GFS filesystems. While this is fine, ideally I'd 
> like to be able to query and set the quota information using 
> rpc.rquotad and the associated commands 'edquota' and 'quota'.
> 
> Is this possible?

Found this in the archives w/o an answer. What is the best practice to
export quotas over NFS? Does rpc.rquotad interact with gfs' quotas? If
not one could think of using the conventional quota utils, but are
they cluster-safe?

Thanks.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050811/d1c9df4f/attachment.sig>

From dharma_deep at yahoo.com  Thu Aug 11 13:35:25 2005
From: dharma_deep at yahoo.com (dharma deep)
Date: Thu, 11 Aug 2005 06:35:25 -0700 (PDT)
Subject: [Linux-cluster] GFS cluster using UML(User mode linux)
Message-ID: <20050811133525.66333.qmail@web60611.mail.yahoo.com>

Hi,

   Is it possible to create a GFS cluster using UML
?If
so, please point me to the howto doc.

Thanks,
dharmadeep


		
____________________________________________________
Start your day with Yahoo! - make it your home page 
http://www.yahoo.com/r/hs 
 



From mark.fasheh at oracle.com  Wed Aug 10 16:26:18 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Wed, 10 Aug 2005 09:26:18 -0700
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <courier.42F9AD38.000018F9@courier.cs.helsinki.fi>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
	<20050810072121.GA2825@infradead.org>
	<courier.42F9AD38.000018F9@courier.cs.helsinki.fi>
Message-ID: <20050810162618.GH21228@ca-server1.us.oracle.com>

On Wed, Aug 10, 2005 at 10:31:04AM +0300, Pekka J Enberg wrote:
> It seems to me that the distributed locks must be acquired in ->nopage 
> anyway to solve the problem with memcpy() between two mmap'd regions. One 
> possible solution would be for the lock manager to detect deadlocks and 
> break some locks accordingly. Don't know how well that would mix with 
> ->nopage though... 
Yeah, my experience with ->nopage so far has indicated to me that we are to
avoid erroring out if at all possible which I believe is what we'd have to
do if a deadlock is found.
Also, I'm not sure how multiple dlms would coordinate deadlock detection in
that case.
This may sound naive, but so far OCFS2 has avoided the nead for deadlock
detection... I'd hate to have to add it now -- better to try avoiding them
in the first place.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From penberg at cs.helsinki.fi  Wed Aug 10 16:57:43 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Wed, 10 Aug 2005 19:57:43 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050810162618.GH21228@ca-server1.us.oracle.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
	<20050810072121.GA2825@infradead.org>
	<courier.42F9AD38.000018F9@courier.cs.helsinki.fi>
	<20050810162618.GH21228@ca-server1.us.oracle.com>
Message-ID: <courier.42FA3207.00002648@courier.cs.helsinki.fi>

Hi Mark, 

Mark Fasheh writes:
> This may sound naive, but so far OCFS2 has avoided the nead for deadlock
> detection... I'd hate to have to add it now -- better to try avoiding them
> in the first place.

Surely avoiding them is preferred but how do you do that when you have to 
mmap'd regions where userspace does memcpy()? The kernel won't much saying 
in it until ->nopage. We cannot grab all the required locks in proper order 
here because we don't know what size the buffer is. That's why I think lock 
sorting won't work of all the cases and thus the problem needs to be taken 
care of by the dlm. 

                       Pekka 



From mark.fasheh at oracle.com  Wed Aug 10 18:21:56 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Wed, 10 Aug 2005 11:21:56 -0700
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <courier.42FA3207.00002648@courier.cs.helsinki.fi>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
	<20050810072121.GA2825@infradead.org>
	<courier.42F9AD38.000018F9@courier.cs.helsinki.fi>
	<20050810162618.GH21228@ca-server1.us.oracle.com>
	<courier.42FA3207.00002648@courier.cs.helsinki.fi>
Message-ID: <20050810182155.GI21228@ca-server1.us.oracle.com>

On Wed, Aug 10, 2005 at 07:57:43PM +0300, Pekka J Enberg wrote:
> Surely avoiding them is preferred but how do you do that when you have to 
> mmap'd regions where userspace does memcpy()? The kernel won't much saying 
> in it until ->nopage. We cannot grab all the required locks in proper order 
> here because we don't know what size the buffer is. That's why I think lock 
> sorting won't work of all the cases and thus the problem needs to be taken 
> care of by the dlm. 
Hmm, well today in OCFS2 if you're not coming from read or write, the lock
is held only for the duration of ->nopage so I don't think we could get into
any deadlocks for that usage.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From penberg at cs.helsinki.fi  Wed Aug 10 20:18:48 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Wed, 10 Aug 2005 23:18:48 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050810182155.GI21228@ca-server1.us.oracle.com>
References: <20050802071828.GA11217@redhat.com>
	<84144f0205080203163cab015c@mail.gmail.com>
	<20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
	<20050810072121.GA2825@infradead.org>
	<courier.42F9AD38.000018F9@courier.cs.helsinki.fi>
	<20050810162618.GH21228@ca-server1.us.oracle.com>
	<courier.42FA3207.00002648@courier.cs.helsinki.fi>
	<20050810182155.GI21228@ca-server1.us.oracle.com>
Message-ID: <courier.42FA6128.000009AE@courier.cs.helsinki.fi>

Mark Fasheh writes:
> Hmm, well today in OCFS2 if you're not coming from read or write, the lock
> is held only for the duration of ->nopage so I don't think we could get into
> any deadlocks for that usage.

Aah, I see GFS2 does that too so no deadlocks here. Thanks. You, however,
don't maintain the same level of data consistency when reads and writes
are from other filesystems as they use ->nopage. 

Fixing this requires a generic vma walk in every write() and read(), no?
That doesn't seem such an hot idea which brings us back to using ->nopage
for taking the locks (but now the deadlocks are back). 

                       Pekka 



From mark.fasheh at oracle.com  Wed Aug 10 22:07:44 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Wed, 10 Aug 2005 15:07:44 -0700
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <courier.42FA6128.000009AE@courier.cs.helsinki.fi>
References: <20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
	<20050810072121.GA2825@infradead.org>
	<courier.42F9AD38.000018F9@courier.cs.helsinki.fi>
	<20050810162618.GH21228@ca-server1.us.oracle.com>
	<courier.42FA3207.00002648@courier.cs.helsinki.fi>
	<20050810182155.GI21228@ca-server1.us.oracle.com>
	<courier.42FA6128.000009AE@courier.cs.helsinki.fi>
Message-ID: <20050810220744.GJ21228@ca-server1.us.oracle.com>

On Wed, Aug 10, 2005 at 11:18:48PM +0300, Pekka J Enberg wrote:
> Aah, I see GFS2 does that too so no deadlocks here. Thanks.
Yep, no problem :)

> You, however, don't maintain the same level of data consistency when reads
> and writes are from other filesystems as they use ->nopage.
I'm not sure what you mean here...

> Fixing this requires a generic vma walk in every write() and read(), no?
> That doesn't seem such an hot idea which brings us back to using ->nopage
> for taking the locks (but now the deadlocks are back). 
Yeah if you look through mmap.c in ocfs2_fill_ctxt_from_buf() we do this...
Or am I misunderstanding what you mean?
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From penberg at cs.helsinki.fi  Thu Aug 11 04:41:17 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Thu, 11 Aug 2005 07:41:17 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050810220744.GJ21228@ca-server1.us.oracle.com>
References: <20050803063644.GD9812@redhat.com>
	<courier.42F768D5.000046F2@courier.cs.helsinki.fi>
	<42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi>
	<20050810072121.GA2825@infradead.org>
	<courier.42F9AD38.000018F9@courier.cs.helsinki.fi>
	<20050810162618.GH21228@ca-server1.us.oracle.com>
	<courier.42FA3207.00002648@courier.cs.helsinki.fi>
	<20050810182155.GI21228@ca-server1.us.oracle.com>
	<courier.42FA6128.000009AE@courier.cs.helsinki.fi>
	<20050810220744.GJ21228@ca-server1.us.oracle.com>
Message-ID: <courier.42FAD6ED.00002E45@courier.cs.helsinki.fi>

Hi, 

On Wed, Aug 10, 2005 at 11:18:48PM +0300, Pekka J Enberg wrote:
> > You, however, don't maintain the same level of data consistency when reads
> > and writes are from other filesystems as they use ->nopage.

Mark Fasheh writes:
> I'm not sure what you mean here...

Reading and writing from other filesystems to a GFS2 mmap'd file
does not walk the vmas. Therefore, data consistency guarantees
are different: 

 - A GFS2 filesystem does a read that writes to a GFS2 mmap'd
  file -> we take all locks for the mmap'd buffer in order and
  release them after read() is done. 

 - A ext3 filesystem, for example, does a read that writes to
  a GFS2 mmap'd file -> we now take locks one page at the
  time releasing them before we exit ->nopage(). Other nodes
  are now free to write to the same GFS2 mmap'd file. 

Or am I missing something here? 

On Wed, Aug 10, 2005 at 11:18:48PM +0300, Pekka J Enberg wrote:
> > Fixing this requires a generic vma walk in every write() and read(), no?
> > That doesn't seem such an hot idea which brings us back to using ->nopage
> > for taking the locks (but now the deadlocks are back).

Mark Fasheh writes:
> Yeah if you look through mmap.c in ocfs2_fill_ctxt_from_buf() we do this...
> Or am I misunderstanding what you mean?

If are doing write() or read() from some other filesystem, we don't walk the 
vmas but instead rely on ->nopage for locking, right? 

                   Pekka 



From penberg at cs.Helsinki.FI  Thu Aug 11 07:10:16 2005
From: penberg at cs.Helsinki.FI (Pekka J Enberg)
Date: Thu, 11 Aug 2005 10:10:16 +0300 (EEST)
Subject: [Linux-cluster] Re: GFS
Message-ID: <Pine.LNX.4.58.0508111006470.13379@sbz-30.cs.Helsinki.FI>

Hi Mark,

On Thu, 11 Aug 2005, Pekka J Enberg wrote:
> Reading and writing from other filesystems to a GFS2 mmap'd file
> does not walk the vmas. Therefore, data consistency guarantees
> are different:

What I meant was that, if a filesystem requires vma walks, we need to do 
it VFS level with something like the following patch. With this, your 
filesystem would implement a_ops->iolock_acquire that sorts the locks
and takes them all. In case of GFS2, this would replace walk_vm().

Thoughts?

			Pekka

[PATCH] vfs: iolock

This patch introduces iolock which can be used by filesystems that require
special locking when accessing an mmap'd region.

Unfinished and untested.

Signed-off-by: Pekka Enberg <penberg at cs.helsinki.fi>
---

 fs/Makefile            |    2 -
 fs/iolock.c            |   88 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/read_write.c        |   15 ++++++++
 include/linux/fs.h     |    2 +
 include/linux/iolock.h |   11 ++++++
 5 files changed, 117 insertions(+), 1 deletion(-)

Index: 2.6-mm/fs/iolock.c
===================================================================
--- /dev/null
+++ 2.6-mm/fs/iolock.c
@@ -0,0 +1,88 @@
+/*
+ * fs/iolock.c
+ *
+ * Derived from GFS2.
+ */
+
+#include <linux/iolock.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+/*
+ * I/O lock contains all files that participate in locking a memory region.
+ * It is used for filesystems that require special locks to access mmap'd
+ * memory.
+ */
+struct iolock {
+	struct address_space	*mapping;
+	unsigned long		nr_files;
+	struct file		**files;
+};
+
+struct iolock *iolock_region(const char __user *buf, size_t size)
+{
+	int err = -ENOMEM;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	unsigned long start = (unsigned long)buf;
+	unsigned long end = start + size;
+	struct iolock *ret;
+
+	ret = kcalloc(1, sizeof(*ret), GFP_KERNEL);
+	if (!ret)
+		return ERR_PTR(-ENOMEM);
+
+	down_read(&mm->mmap_sem);
+
+	ret->files = kcalloc(mm->map_count, sizeof(struct file*), GFP_KERNEL);
+	if (!ret->files)
+		goto error;
+
+	for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
+		struct file *file;
+		struct address_space *mapping;
+
+		if (end <= vma->vm_start)
+			break;
+
+		file = vma->vm_file;
+		if (!file)
+			continue;
+
+		mapping = file->f_mapping;
+		if (!mapping->a_ops->iolock_acquire ||
+		    !mapping->a_ops->iolock_release)
+			continue;
+
+ 		/* FIXME: This only works when one address_space participates
+		          in the iolock. */
+		ret->mapping = mapping;
+		ret->files[ret->nr_files++] = file;
+	}
+out:
+	up_read(&mm->mmap_sem);
+
+	if (ret->mapping->a_ops->iolock_acquire) {
+		err = ret->mapping->a_ops->iolock_acquire(ret->files, ret->nr_files);
+		if (!err)
+			goto error;
+	}
+
+	return ret;
+
+error:
+	iolock_release(ret);
+	ret = ERR_PTR(err);
+	goto out;
+}
+
+void iolock_release(struct iolock *iolock)
+{
+	struct address_space *mapping = iolock->mapping;
+	if (mapping && mapping->a_ops->iolock_release)
+		mapping->a_ops->iolock_release(iolock->files, iolock->nr_files);
+	kfree(iolock->files);
+	kfree(iolock);
+}
Index: 2.6-mm/fs/read_write.c
===================================================================
--- 2.6-mm.orig/fs/read_write.c
+++ 2.6-mm/fs/read_write.c
@@ -14,6 +14,7 @@
 #include <linux/security.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/iolock.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -247,14 +248,21 @@ ssize_t vfs_read(struct file *file, char
 	if (!ret) {
 		ret = security_file_permission (file, MAY_READ);
 		if (!ret) {
+			struct iolock * iolock = iolock_region(buf, count);
+			if (IS_ERR(iolock)) {
+				ret = PTR_ERR(iolock);
+				goto out;
+			}
 			if (file->f_op->read)
 				ret = file->f_op->read(file, buf, count, pos);
 			else
 				ret = do_sync_read(file, buf, count, pos);
+			iolock_release(iolock);
 			if (ret > 0) {
 				fsnotify_access(file->f_dentry);
 				current->rchar += ret;
 			}
+  out:
 			current->syscr++;
 		}
 	}
@@ -298,14 +306,21 @@ ssize_t vfs_write(struct file *file, con
 	if (!ret) {
 		ret = security_file_permission (file, MAY_WRITE);
 		if (!ret) {
+			struct iolock * iolock = iolock_region(buf, count);
+			if (IS_ERR(iolock)) {
+				ret = PTR_ERR(iolock);
+				goto out;
+			}
 			if (file->f_op->write)
 				ret = file->f_op->write(file, buf, count, pos);
 			else
 				ret = do_sync_write(file, buf, count, pos);
+			iolock_release(iolock);
 			if (ret > 0) {
 				fsnotify_modify(file->f_dentry);
 				current->wchar += ret;
 			}
+  out:
 			current->syscw++;
 		}
 	}
Index: 2.6-mm/include/linux/iolock.h
===================================================================
--- /dev/null
+++ 2.6-mm/include/linux/iolock.h
@@ -0,0 +1,11 @@
+#ifndef __LINUX_IOLOCK_H
+#define __LINUX_IOLOCK_H
+
+#include <linux/kernel.h>
+
+struct iolock;
+
+struct iolock *iolock_region(const char __user *buf, size_t count);
+void iolock_release(struct iolock *lock);
+
+#endif
Index: 2.6-mm/fs/Makefile
===================================================================
--- 2.6-mm.orig/fs/Makefile
+++ 2.6-mm/fs/Makefile
@@ -10,7 +10,7 @@ obj-y :=	open.o read_write.o file_table.
 		ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
-		ioprio.o
+		ioprio.o iolock.o
 
 obj-$(CONFIG_INOTIFY)		+= inotify.o
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
Index: 2.6-mm/include/linux/fs.h
===================================================================
--- 2.6-mm.orig/include/linux/fs.h
+++ 2.6-mm/include/linux/fs.h
@@ -334,6 +334,8 @@ struct address_space_operations {
 			loff_t offset, unsigned long nr_segs);
 	struct page* (*get_xip_page)(struct address_space *, sector_t,
 			int);
+	int (*iolock_acquire)(struct file **, unsigned long);
+	void (*iolock_release)(struct file **, unsigned long);
 };
 
 struct backing_dev_info;



From zab at zabbo.net  Thu Aug 11 16:33:41 2005
From: zab at zabbo.net (Zach Brown)
Date: Thu, 11 Aug 2005 09:33:41 -0700
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <Pine.LNX.4.58.0508111006470.13379@sbz-30.cs.Helsinki.FI>
References: <Pine.LNX.4.58.0508111006470.13379@sbz-30.cs.Helsinki.FI>
Message-ID: <42FB7DE5.2080506@zabbo.net>


> What I meant was that, if a filesystem requires vma walks, we need to do 
> it VFS level with something like the following patch.

I don't think this patch is the way to go at all.  It imposes an
allocation and vma walking overhead for the vast majority of IOs that
aren't interested.  It doesn't look like it will get a consistent
ordering when multiple file systems are concerned.  It doesn't record
the ranges of the mappings involved so Lustre can't properly use its
range locks.  And finally, it doesn't prohibit mapping operations for
the duration of the IO -- the whole reason we ended up in this thread in
the first place :)

Christoph, would you be interested in looking at a more thorough patch
if I threw one together?

- z



From hch at infradead.org  Thu Aug 11 16:35:41 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Thu, 11 Aug 2005 17:35:41 +0100
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <42FB7DE5.2080506@zabbo.net>
References: <Pine.LNX.4.58.0508111006470.13379@sbz-30.cs.Helsinki.FI>
	<42FB7DE5.2080506@zabbo.net>
Message-ID: <20050811163541.GA4351@infradead.org>

On Thu, Aug 11, 2005 at 09:33:41AM -0700, Zach Brown wrote:
> ordering when multiple file systems are concerned.  It doesn't record
> the ranges of the mappings involved so Lustre can't properly use its
> range locks.

That doesn't matter.  Please don't put in any effort for lustre special
cases - they are unwilling to cooperate and they'll get what they deserve.

> And finally, it doesn't prohibit mapping operations for
> the duration of the IO -- the whole reason we ended up in this thread in
> the first place :)
> 
> Christoph, would you be interested in looking at a more thorough patch
> if I threw one together?

Sure, I'm not sure that'll happen in a timely fashion, though.



From zab at zabbo.net  Thu Aug 11 16:39:43 2005
From: zab at zabbo.net (Zach Brown)
Date: Thu, 11 Aug 2005 09:39:43 -0700
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050811163541.GA4351@infradead.org>
References: <Pine.LNX.4.58.0508111006470.13379@sbz-30.cs.Helsinki.FI>
	<42FB7DE5.2080506@zabbo.net> <20050811163541.GA4351@infradead.org>
Message-ID: <42FB7F4F.80507@zabbo.net>


> That doesn't matter.  Please don't put in any effort for lustre special
> cases - they are unwilling to cooperate and they'll get what they deserve.

Sure, we can add that extra functional layer in another pass.  I thought
I'd still bring it up, though, as OCFS2 is slated to care at some point
in the not too distant future.

> Sure, I'm not sure that'll happen in a timely fashion, though.

Roger.

- z



From penberg at cs.helsinki.fi  Thu Aug 11 16:44:50 2005
From: penberg at cs.helsinki.fi (Pekka Enberg)
Date: Thu, 11 Aug 2005 19:44:50 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <42FB7DE5.2080506@zabbo.net>
References: <Pine.LNX.4.58.0508111006470.13379@sbz-30.cs.Helsinki.FI>
	<42FB7DE5.2080506@zabbo.net>
Message-ID: <1123778691.24181.8.camel@localhost>

On Thu, 2005-08-11 at 09:33 -0700, Zach Brown wrote:
> I don't think this patch is the way to go at all.  It imposes an
> allocation and vma walking overhead for the vast majority of IOs that
> aren't interested.  It doesn't look like it will get a consistent
> ordering when multiple file systems are concerned.  It doesn't record
> the ranges of the mappings involved so Lustre can't properly use its
> range locks.  And finally, it doesn't prohibit mapping operations for
> the duration of the IO -- the whole reason we ended up in this thread in
> the first place :)

Hmm. So how do you propose we get rid of the mandatory vma walk? I was
thinking of making iolock a config option so when you don't have any
filesystems that need it, it can go away. I have also optimized the
extra allocation away when there are none mmap'd files that require
locking.

As for the rest of your comments, I heartly agree with them and
hopefully some interested party will take care of them :-).

			Pekka

Index: 2.6-mm/fs/iolock.c
===================================================================
--- /dev/null
+++ 2.6-mm/fs/iolock.c
@@ -0,0 +1,183 @@
+/*
+ * I/O locking for memory regions. Used by filesystems that need special
+ * locking for mmap'd files.
+ */
+
+#include <linux/iolock.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+/*
+ * TODO:
+ *
+ *  - Deadlock when two nodes acquire iolocks in reverse order for two
+ *    different filesystems. Solution: use rbtree in iolock_chain so we
+ *    can walk iolocks in order. XXX: what order is stable for two nodes
+ *    that don't know about each other?
+ */
+
+/*
+ * I/O lock contains all files that participate in locking a memory region
+ * in an address_space.
+ */
+struct iolock {
+	struct address_space	*mapping;
+	unsigned long		nr_files;
+	struct file		**files;
+	struct list_head	chain;
+};
+
+struct iolock_chain {
+	struct list_head	list;
+};
+
+static struct iolock *iolock_new(unsigned long max_files)
+{
+	struct iolock *ret = kzalloc(sizeof(*ret), GFP_KERNEL);
+	if (!ret)
+		goto out;
+	ret->files = kcalloc(max_files, sizeof(struct file *), GFP_KERNEL);
+	if (!ret->files) {
+		kfree(ret);
+		ret = NULL;
+		goto out;
+	}
+	INIT_LIST_HEAD(&ret->chain);
+out:
+	return ret;
+}
+
+static struct iolock_chain *iolock_chain_new(void)
+{
+	struct iolock_chain * ret = kzalloc(sizeof(*ret), GFP_KERNEL);
+	if (ret) {
+		INIT_LIST_HEAD(&ret->list);
+	}
+	return ret;
+}
+
+static int iolock_chain_acquire(struct iolock_chain *chain)
+{
+	struct iolock * iolock;
+	int err = 0;
+
+	list_for_each_entry(iolock, &chain->list, chain) {
+		if (iolock->mapping->a_ops->iolock_acquire) {
+			err = iolock->mapping->a_ops->iolock_acquire(
+					iolock->files, iolock->nr_files);
+			if (!err)
+				goto error;
+		}
+	}
+error:
+	return err;
+}
+
+static struct iolock *iolock_lookup(struct iolock_chain *chain,
+				    struct address_space *mapping)
+{
+	struct iolock *ret = NULL;
+	struct iolock *iolock;
+
+	list_for_each_entry(iolock, &chain->list, chain) {
+		if (iolock->mapping == mapping) {
+			ret = iolock;
+			break;
+		}
+	}
+	return ret;
+}
+
+/**
+ * iolock_region - Lock memory region for file I/O.
+ * @buf: the buffer we want to lock.
+ * @size: size of the buffer.
+ *
+ * Returns a pointer to the iolock_chain or NULL to denote an empty chain;
+ * otherwise returns ERR_PTR().
+ */
+struct iolock_chain *iolock_region(const char __user *buf, size_t size)
+{
+	struct iolock_chain *ret = NULL;
+	int err = -ENOMEM;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	unsigned long start = (unsigned long)buf;
+	unsigned long end = start + size;
+	int max_files;
+
+	down_read(&mm->mmap_sem);
+	max_files = mm->map_count;
+
+	for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
+		struct file *file;
+		struct address_space *mapping;
+		struct iolock *iolock;
+
+		if (end <= vma->vm_start)
+			break;
+
+		file = vma->vm_file;
+		if (!file)
+			continue;
+
+		mapping = file->f_mapping;
+		if (!mapping->a_ops->iolock_acquire ||
+		    !mapping->a_ops->iolock_release)
+			continue;
+
+		/* Allocate chain lazily to avoid initialization overhead
+		   when we don't have any files that require iolock.  */
+		if (!ret) {
+			ret = iolock_chain_new();
+			if (!ret)
+				goto error;
+		}
+
+		iolock = iolock_lookup(ret, mapping);
+		if (!iolock) {
+			iolock = iolock_new(max_files);
+			if (!iolock)
+				goto error;
+			iolock->mapping = mapping;
+		}
+
+		iolock->files[iolock->nr_files++] = file;
+		list_add(&iolock->chain, &ret->list);
+	}
+	err = iolock_chain_acquire(ret);
+	if (!err)
+		goto error;
+
+out:
+	up_read(&mm->mmap_sem);
+	return ret;
+
+error:
+	iolock_release(ret);
+	ret = ERR_PTR(err);
+	goto out;
+}
+
+/**
+ * iolock_release - Release file I/O locks for a memory region.
+ * @chain: The I/O lock chain to release. Passing NULL means no-op.
+ */
+void iolock_release(struct iolock_chain *chain)
+{
+	struct iolock *iolock;
+
+	if (!chain)
+		return;
+
+	list_for_each_entry(iolock, &chain->list, chain) {
+		struct address_space *mapping = iolock->mapping;
+		if (mapping && mapping->a_ops->iolock_release)
+			mapping->a_ops->iolock_release(iolock->files, iolock->nr_files);
+		kfree(iolock->files);
+		kfree(iolock);
+	}
+	kfree(chain);
+}
Index: 2.6-mm/fs/read_write.c
===================================================================
--- 2.6-mm.orig/fs/read_write.c
+++ 2.6-mm/fs/read_write.c
@@ -14,6 +14,7 @@
 #include <linux/security.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/iolock.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -247,14 +248,21 @@ ssize_t vfs_read(struct file *file, char
 	if (!ret) {
 		ret = security_file_permission (file, MAY_READ);
 		if (!ret) {
+			struct iolock_chain * lock = iolock_region(buf, count);
+			if (IS_ERR(lock)) {
+				ret = PTR_ERR(lock);
+				goto out;
+			}
 			if (file->f_op->read)
 				ret = file->f_op->read(file, buf, count, pos);
 			else
 				ret = do_sync_read(file, buf, count, pos);
+			iolock_release(lock);
 			if (ret > 0) {
 				fsnotify_access(file->f_dentry);
 				current->rchar += ret;
 			}
+  out:
 			current->syscr++;
 		}
 	}
@@ -298,14 +306,21 @@ ssize_t vfs_write(struct file *file, con
 	if (!ret) {
 		ret = security_file_permission (file, MAY_WRITE);
 		if (!ret) {
+			struct iolock_chain * lock = iolock_region(buf, count);
+			if (IS_ERR(lock)) {
+				ret = PTR_ERR(lock);
+				goto out;
+			}
 			if (file->f_op->write)
 				ret = file->f_op->write(file, buf, count, pos);
 			else
 				ret = do_sync_write(file, buf, count, pos);
+			iolock_release(lock);
 			if (ret > 0) {
 				fsnotify_modify(file->f_dentry);
 				current->wchar += ret;
 			}
+  out:
 			current->syscw++;
 		}
 	}
Index: 2.6-mm/include/linux/iolock.h
===================================================================
--- /dev/null
+++ 2.6-mm/include/linux/iolock.h
@@ -0,0 +1,11 @@
+#ifndef __LINUX_IOLOCK_H
+#define __LINUX_IOLOCK_H
+
+#include <linux/kernel.h>
+
+struct iolock_chain;
+
+extern struct iolock_chain *iolock_region(const char __user *, size_t);
+extern void iolock_release(struct iolock_chain *);
+
+#endif
Index: 2.6-mm/fs/Makefile
===================================================================
--- 2.6-mm.orig/fs/Makefile
+++ 2.6-mm/fs/Makefile
@@ -10,7 +10,7 @@ obj-y :=	open.o read_write.o file_table.
 		ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
-		ioprio.o
+		ioprio.o iolock.o
 
 obj-$(CONFIG_INOTIFY)		+= inotify.o
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
Index: 2.6-mm/include/linux/fs.h
===================================================================
--- 2.6-mm.orig/include/linux/fs.h
+++ 2.6-mm/include/linux/fs.h
@@ -334,6 +334,8 @@ struct address_space_operations {
 			loff_t offset, unsigned long nr_segs);
 	struct page* (*get_xip_page)(struct address_space *, sector_t,
 			int);
+	int (*iolock_acquire)(struct file **, unsigned long);
+	void (*iolock_release)(struct file **, unsigned long);
 };
 
 struct backing_dev_info;




From lhh at redhat.com  Thu Aug 11 16:50:04 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 11 Aug 2005 12:50:04 -0400
Subject: [Linux-cluster] GFS cluster using UML(User mode linux)
In-Reply-To: <20050811133525.66333.qmail@web60611.mail.yahoo.com>
References: <20050811133525.66333.qmail@web60611.mail.yahoo.com>
Message-ID: <1123779004.13564.55.camel@ayanami.boston.redhat.com>

On Thu, 2005-08-11 at 06:35 -0700, dharma deep wrote:
> Hi,
> 
>    Is it possible to create a GFS cluster using UML
> ?If
> so, please point me to the howto doc.

No idea, but there's been success with Xen.

-- Lon




From Axel.Thimm at ATrpms.net  Thu Aug 11 19:14:45 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Thu, 11 Aug 2005 21:14:45 +0200
Subject: [Linux-cluster] filesystem consistency error upon umount
Message-ID: <20050811191445.GA6362@neu.nirvana>

Hi,

this is from an FC4/x86_64 node that forms a cluster with three
RHEL4/x86_64 nodes. All of them running latest errata kernels and
(vendor packaged) cluster/gfs bits.

a) to start with: is it OK to mix FC4 and RHEL4, or did I do something
   forbidden?

b) the cluster wasn't doing anything with the GFS filesystem at that
   time, i.e. it was just mounted on all 4 nodes, no data was being
   moved in any direction.

c) The other nodes correctly replayed the journal, This node was
   removed from the cluster w/o fencing and w/o any traces in the logs
   other than gfs' "about to withdraw from the cluster". I expected cman
   to report this, too. The other nodes' logs only contained
   information about the journal acquisition and replay.

d) There is a 10 min. delay from the moment of the mysterious
   filesystem consistency error and a series of Glock messages

e) And most importantly why did the gfs issue a filesystem consistency
   error upon a simple umount? FC4 vs RHEL4 issue?

Thanks!

Aug 11 19:11:48 zs01 rgmanager: [25900]: <notice> Shutting down Cluster Service Manager... 
Aug 11 19:11:48 zs01 clurgmgrd[3660]: <notice> Shutting down 
Aug 11 19:11:48 zs01 clurgmgrd[3660]: <notice> Stopping service homes-cifs 
Aug 11 19:11:48 zs01 clurgmgrd[3660]: <notice> Stopping service backup 
Aug 11 19:11:48 zs01 clurgmgrd[3660]: <notice> Service homes-cifs is stopped 
Aug 11 19:11:48 zs01 clurgmgrd[3660]: <notice> Service backup is stopped 
Aug 11 19:11:52 zs01 clurgmgrd[3660]: <notice> Shutdown complete, exiting 
Aug 11 19:11:53 zs01 rgmanager: [25900]: <notice> Cluster Service Manager is stopped. 
Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: fatal: filesystem consistency error
Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2:   function = trans_go_xmote_bh
Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2:   file = /usr/src/build/588747-x86_64/BUILD/smp/src/gfs/glops.c, line = 542
Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2:   time = 1123780394
Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: about to withdraw from the cluster
Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: waiting for outstanding I/O
Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: telling LM to withdraw
Aug 11 19:13:27 zs01 kernel: lock_dlm: withdraw abandoned memory
Aug 11 19:13:27 zs01 kernel: GFS: fsid=physik:data.2: withdrawn
Aug 11 19:23:27 zs01 kernel: ror = 0
Aug 11 19:23:27 zs01 kernel:     gh_iflags = 2 4 5 
Aug 11 19:23:27 zs01 kernel: Glock (5, 8676146)
Aug 11 19:23:27 zs01 kernel:   gl_flags = 1 
Aug 11 19:23:27 zs01 kernel:   gl_count = 3
Aug 11 19:23:27 zs01 kernel:   gl_state = 3
Aug 11 19:23:27 zs01 kernel:   req_gh = yes
Aug 11 19:23:27 zs01 kernel:   req_bh = yes
Aug 11 19:23:27 zs01 kernel:   lvb_count = 0
Aug 11 19:23:27 zs01 kernel:   object = no
Aug 11 19:23:27 zs01 kernel:   new_le = no
Aug 11 19:23:27 zs01 kernel:   incore_le = no
Aug 11 19:23:27 zs01 kernel:   reclaim = no
Aug 11 19:23:27 zs01 kernel:   aspace = no
Aug 11 19:23:27 zs01 kernel:   ail_bufs = no
Aug 11 19:23:27 zs01 kernel:   Request
Aug 11 19:23:27 zs01 kernel:     owner = -1
Aug 11 19:23:27 zs01 kernel:     gh_state = 0
Aug 11 19:23:27 zs01 kernel:     gh_flags = 0 
Aug 11 19:23:27 zs01 kernel:     error = 0
Aug 11 19:23:27 zs01 kernel:     gh_iflags = 2 4 5 
Aug 11 19:23:27 zs01 kernel:   Waiter2
Aug 11 19:23:27 zs01 kernel:     owner = -1
Aug 11 19:23:27 zs01 kernel:     gh_state = 0
Aug 11 19:23:27 zs01 kernel:     gh_flags = 0 
Aug 11 19:23:27 zs01 kernel:     error = 0
Aug 11 19:23:27 zs01 kernel:     gh_iflags = 2 4 5 
Aug 11 19:23:27 zs01 kernel: Glock (5, 7146196)
Aug 11 19:23:27 zs01 kernel:   gl_flags = 1 
Aug 11 19:23:27 zs01 kernel:   gl_count = 3
Aug 11 19:23:27 zs01 kernel:   gl_state = 3
Aug 11 19:23:27 zs01 kernel:   req_gh = yes
Aug 11 19:23:27 zs01 kernel:   req_bh = yes
Aug 11 19:23:27 zs01 kernel:   lvb_count = 0
Aug 11 19:23:27 zs01 kernel:   object = no
Aug 11 19:23:27 zs01 kernel:   new_le = no
Aug 11 19:23:27 zs01 kernel:   incore_le = no
Aug 11 19:23:27 zs01 kernel:   reclaim = no
Aug 11 19:23:27 zs01 kernel:   aspace = no
Aug 11 19:23:27 zs01 kernel:   ail_bufs = no
Aug 11 19:23:27 zs01 kernel:   Request
Aug 11 19:23:27 zs01 kernel:     owner = -1
Aug 11 19:23:27 zs01 kernel:     gh_state = 0
Aug 11 19:23:27 zs01 kernel:     gh_flags = 0 
Aug 11 19:23:27 zs01 kernel:     error = 0
Aug 11 19:23:27 zs01 kernel:     gh_iflags = 2 4 5 
Aug 11 19:23:27 zs01 kernel:   Waiter2
Aug 11 19:23:27 zs01 kernel:     owner = -1
Aug 11 19:23:27 zs01 kernel:     gh_state = 0
Aug 11 19:23:27 zs01 kernel:     gh_flags = 0 
Aug 11 19:23:27 zs01 kernel:     error = 0
Aug 11 19:23:27 zs01 kernel:     gh_iflags = 2 4 5 
Aug 11 19:23:27 zs01 kernel: Glock (5, 190905665)
[...]
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050811/98b46c94/attachment.sig>

From Joel.Becker at oracle.com  Fri Aug 12 02:46:56 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Thu, 11 Aug 2005 19:46:56 -0700
Subject: [Linux-cluster] Re: [PATCH 00/14] GFS
In-Reply-To: <20050810111110.GA6878@infradead.org>
References: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk>
	<20050810070309.GA2415@infradead.org>
	<20050810103041.GB4634@marowsky-bree.de>
	<20050810103256.GA6127@infradead.org>
	<20050810103424.GC4634@marowsky-bree.de>
	<20050810105450.GA6519@infradead.org>
	<20050810110259.GE4634@marowsky-bree.de>
	<20050810110511.GA6728@infradead.org>
	<20050810110917.GG4634@marowsky-bree.de>
	<20050810111110.GA6878@infradead.org>
Message-ID: <20050812024655.GF5586@ca-server1.us.oracle.com>

On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote:
> On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:
> > 
> > So for every directoy hiearchy on a shared filesystem, each user needs
> > to have the complete list of bindmounts needed, and automatically resync
> > that across all nodes when a new one is added or removed? And then have
> > that executed by root, because a regular user can't?
> 
> Do it in an initscripts and let users simply not do it, they shouldn't
> even know what kind of filesystem they are on.

Christoph,
	Users know.  They want to know.  They want to install git on a
shared filesystem, and have it Just Work no matter what architecture
they're on ({arch} context symlink).
	I've yet to see a sane way to replace symlinks with bind mounts
for anything but the most trivial of usages.

1) You can't make them as non-root
2) It's not stored in the filesystem, so permanence is separate.  Other
   nodes and namespaces don't see them automatically if you want it.
   These both violate KISS and PoLS.
3) You pollute the output of "mount", and when you have as many bind
   mounts as you might have symlinks, that's a ton of output you didn't
   want to see when you were wondering what disks are mounted.
4) When I'm looking at a file, ls -l doesn't tell me what I'm really
   looking at.  With symlinks it does.  In some circumstances, that's a
   good thing.  For most symlink-like uses it is not.  The two uses
   (security and "symlink-like") are both valid approaches, and one
   should not preclude the other.

	Now, (3) can easily be fixed with an option.  (4) can probably
be massaged the same way.  But (1) and (2) can't be, and that needs
fixing before this is even viable to most real users.
	CDSL, or whatever you call it, exists in most can-be-shared
filesystems for a reason.  On AFS and DFS, it was @foo.
/.../thisdcecell/foo/@sys/bin/git-ls-tree would be
/.../thisdcecell/foo/linux-i386/bin/git-ls-tree on my machine.  I'd just
put the @sys path in my PATH, and never worry whether I was on x86, ppc,
or s390.  I don't know how GFS/GFS2 do theirs, but OCFS2 copied straight
from VMS clustering, where they used it as well.  They seem to have set
the standard on the topic of clustering.  It would be
/usr/{arch}/bin/git-ls-tree -> /usr/i386/bin/git-ls-tree or whatever.
	If you can't do this as a user, it's irrelevant to you.  Major
installations, where the person installing the application never gets
root, would expect it to work easily and nicely.  Bind mounts, as of
now, do not.

Joel

-- 

"Here's something to think about:  How come you never see a headline
 like ``Psychic Wins Lottery''?"
	- Jay Leno

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From thomasie at eyou.com  Fri Aug 12 07:23:49 2005
From: thomasie at eyou.com (thomasie)
Date: 12 Aug 2005 15:23:49 +0800
Subject: [Linux-cluster] GFS/cman related problems?
Message-ID: <323831431.12922@eyou.com>

Dear ALL:
    Hello, I hope I'm sending to the correct list.  I'm having problems
starting up gfs 6.1, and hopefully it's just something incorrect with my
configuration.

    I used two standard machine . My system is Fedora Core 4. The kernel
modules load fine.
    root # modprobe lock_dlm

first node:
    root # ccsd
    root # cman_tool join

second node:
    root # ccsd
    root # cman_tool join

    when we are going to start both nodes we get that the first node without
problems, but the second one never get connected at cman and remain trying to
connect with the following error: udp port 6809 unreachable.
    Then, I run cman_tool with DEBUG defined and get this:
first node: sending HELLO
second node: sending membership request, but I get udp port 6809
unreachable.

    My configuration file is 
<?xml version="1.0"?>
<cluster name="alpha" config_version="1">

<cman two_node="1" expected_votes="1">
  <multicast addr="224.0.0.1"/>
</cman>

<clusternodes>
  <clusternode name="node1" nodeid="1" votes="1">
    <multicast addr="224.0.0.1" interface="eth1"/>
    <fence>
      <method name="human">
        <device name="net_dev1"/>
      </method>
    </fence>
  </clusternode>

  <clusternode name="node2" nodeid="2" votes="1">
    <multicast addr="224.0.0.1" interface="eth0"/>
    <fence>
      <method name="human">
        <device name="net_dev2"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>

<fencedevices>
  <fencedevice name="net_dev1" agent="fence_gnbd" server="node1" login="root"
passwd="111111"/>
  <fencedevice name="net_dev2" agent="fence_gnbd" server="node2" login="root"
passwd="111111"/>
</fencedevices>

</cluster>

any help is much appreciated.

Rregards,
thomasie





--http://www.eyou.com
--??????????????????  ????????  ????????  ????????  ????????...????????

--http://vip.eyou.com
--????????????VIP????  ??????????????????

--http://sms.eyou.com
--??????????????????????...????????????



From pcaulfie at redhat.com  Fri Aug 12 08:38:28 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 12 Aug 2005 09:38:28 +0100
Subject: [Linux-cluster] GFS/cman related problems?
In-Reply-To: <323831431.12922@eyou.com>
References: <323831431.12922@eyou.com>
Message-ID: <42FC6004.5080107@redhat.com>

thomasie wrote:
> Dear ALL:
>     Hello, I hope I'm sending to the correct list.  I'm having problems
> starting up gfs 6.1, and hopefully it's just something incorrect with my
> configuration.
> 
>     I used two standard machine . My system is Fedora Core 4. The kernel
> modules load fine.
>     root # modprobe lock_dlm
> 
> first node:
>     root # ccsd
>     root # cman_tool join
> 
> second node:
>     root # ccsd
>     root # cman_tool join
> 
>     when we are going to start both nodes we get that the first node without
> problems, but the second one never get connected at cman and remain trying to
> connect with the following error: udp port 6809 unreachable.
>     Then, I run cman_tool with DEBUG defined and get this:
> first node: sending HELLO
> second node: sending membership request, but I get udp port 6809
> unreachable.

Do you have a firewall configured ? if so either disable it or allow through UDP
port 6809.

-- 

patrick



From thomasie at eyou.com  Tue Aug 16 02:42:42 2005
From: thomasie at eyou.com (thomasie)
Date: 16 Aug 2005 10:42:42 +0800
Subject: [Linux-cluster] GFS/cman related problems
Message-ID: <324160162.18275@eyou.com>

Dear ALL:

Hello, I hope I'm sending to the correct list.  I'm having problems
starting up gfs 6.1, and hopefully it's just something incorrect with my
configuration.

I used two standard machine . My system is Fedora Core 4. The kernel modules
load fine.

First of all,  I load all the needed kernel modules for the cluster. To do
this, execute the following commands (on both nodes):
    root at node1 # modprobe lock_dlm 

    root at node2 # modprobe lock_dlm 


Next, I start the cluster configuration service daemon on both nodes with
    root at node1 # ccsd 

    root at node2 # ccsd
    

Having started ccsd I need to create the cluster by starting the cluster
manager on both nodes:
    root at node1 # /sbin/cman_tool join 

    root at node2 # /sbin/cman_tool join    
    
when we are going to start both nodes we get that the first node without
problems, but the second one never get connected at cman and remain trying to
connect with the following error: 10.190.5.174 (node1) udp port 6809
unreachable.
I have not start firewall. And I use the command of  "netstat -a" , 
the result is 
udp        0      0 127.0.0.1:6809              0.0.0.0:*                     
         
udp        0      0 224.0.0.1:6809              0.0.0.0:*                     
         
but it receive package is 10.190.6.82:6809 (node2) -> 10.190.5.174 :6809
(node1)

Then, I run cman_tool with DEBUG defined and get this:
node1: sending HELLO
node2: sending membership request, but I get udp port 6809 unreachable.


My configuration file (cluster.xml) is 
<?xml version="1.0"?>
<cluster name="alpha" config_version="1">

<cman two_node="1" expected_votes="1">
  <multicast addr="224.0.0.1"/>
</cman>

<clusternodes>
  <clusternode name="node1" nodeid="1" votes="1">
    <multicast addr="224.0.0.1" interface="eth1"/>
    <fence>
      <method name="human">
        <device name="net_dev1"/>
      </method>
    </fence>
  </clusternode>

  <clusternode name="node2" nodeid="2" votes="1">
    <multicast addr="224.0.0.1" interface="eth0"/>
    <fence>
      <method name="human">
        <device name="net_dev2"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>

<fencedevices>
  <fencedevice name="net_dev1" agent="fence_gnbd" server="node1" login="root"
passwd="111111"/>
  <fencedevice name="net_dev2" agent="fence_gnbd" server="node2" login="root"
passwd="111111"/>
</fencedevices>

</cluster>

any help is much appreciated.

Regards,
thomasie





--http://www.eyou.com
--??????????????????  ????????  ????????  ????????  ????????...????????

--http://vip.eyou.com
--????????????VIP????  ??????????????????

--http://sms.eyou.com
--??????????????????????...????????????



From pcaulfie at redhat.com  Tue Aug 16 07:05:35 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 16 Aug 2005 08:05:35 +0100
Subject: [Linux-cluster] GFS/cman related problems
In-Reply-To: <324160162.18275@eyou.com>
References: <324160162.18275@eyou.com>
Message-ID: <4301903F.4090302@redhat.com>

thomasie wrote:
> Dear ALL:
> 
> Hello, I hope I'm sending to the correct list.  I'm having problems
> starting up gfs 6.1, and hopefully it's just something incorrect with my
> configuration.
> 
> I used two standard machine . My system is Fedora Core 4. The kernel modules
> load fine.
> 
> First of all,  I load all the needed kernel modules for the cluster. To do
> this, execute the following commands (on both nodes):
>     root at node1 # modprobe lock_dlm 
> 
>     root at node2 # modprobe lock_dlm 
> 
> 
> Next, I start the cluster configuration service daemon on both nodes with
>     root at node1 # ccsd 
> 
>     root at node2 # ccsd
>     
> 
> Having started ccsd I need to create the cluster by starting the cluster
> manager on both nodes:
>     root at node1 # /sbin/cman_tool join 
> 
>     root at node2 # /sbin/cman_tool join    
>     
> when we are going to start both nodes we get that the first node without
> problems, but the second one never get connected at cman and remain trying to
> connect with the following error: 10.190.5.174 (node1) udp port 6809
> unreachable.
> I have not start firewall. And I use the command of  "netstat -a" , 
> the result is 
> udp        0      0 127.0.0.1:6809              0.0.0.0:*                               

Oh this old chestnut. Your local host name resolves to 127.0.0.1 rather than to
a real IP address. Remove the host name from the 127.0.0.1 line in /etc/hosts.

sigh.



-- 

patrick



From jpyeron at pdinc.us  Tue Aug 16 14:13:50 2005
From: jpyeron at pdinc.us (Jason Pyeron)
Date: Tue, 16 Aug 2005 10:13:50 -0400 (EDT)
Subject: [Linux-cluster] seeking guidance before opening a bug on cman
Message-ID: <Pine.LNX.4.62.0508161007450.626@ns.pyerotechnics.com>


I was tring to compile a rpm from the srpm generated from cvs STABLE 
(also tried from HEAD).

I got errors on prep stage, and then again on the files stage.

I have patched the spec file, and I was wondering if I was wrong or was it 
a bug?

:pserver:cvs at sources.redhat.com:/cvs/cluster cluster
--- cman/make/cman.spec.in      1 Nov 2004 23:23:18 -0000       1.2
+++ cman/make/cman.spec.in      16 Aug 2005 14:09:55 -0000
@@ -33,7 +33,7 @@
  cman - The Cluster Manager

  %prep
-%setup -q
+%setup -q -n %{name}-%{version}-%{release}

  %build
  ./configure --mandir=%{_mandir}
@@ -51,6 +51,13 @@
  # Binaries
  /sbin/cman_tool

+/etc/init.d/cman
+/usr/include/libcman.h
+/usr/lib/libcman.a
+/usr/lib/libcman.so
+/usr/lib/libcman.so.%{version}
+/usr/lib/libcman.so.%{version}.%{release}
+
  %doc
  %{_mandir}



-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Partner & Sr. Manager             7 West 24th Street #100     -
- +1 (443) 921-0381                 Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

This message is for the designated recipient only and may contain 
privileged, proprietary, or otherwise private information. If you 
have received it in error, purge the message from your system and 
notify the sender immediately.  Any other use of the email by you 
is prohibited.



From jpyeron at pdinc.us  Tue Aug 16 14:38:56 2005
From: jpyeron at pdinc.us (Jason Pyeron)
Date: Tue, 16 Aug 2005 10:38:56 -0400 (EDT)
Subject: [Linux-cluster] seeking guidance before opening a bug on cman
In-Reply-To: <Pine.LNX.4.62.0508161007450.626@ns.pyerotechnics.com>
References: <Pine.LNX.4.62.0508161007450.626@ns.pyerotechnics.com>
Message-ID: <Pine.LNX.4.62.0508161033570.626@ns.pyerotechnics.com>


I opened a bug after further googling.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166060

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Partner & Sr. Manager             7 West 24th Street #100     -
- +1 (443) 921-0381                 Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

This message is for the designated recipient only and may contain 
privileged, proprietary, or otherwise private information. If you 
have received it in error, purge the message from your system and 
notify the sender immediately.  Any other use of the email by you 
is prohibited.



From cfeist at redhat.com  Tue Aug 16 15:29:48 2005
From: cfeist at redhat.com (Chris Feist)
Date: Tue, 16 Aug 2005 10:29:48 -0500
Subject: [Linux-cluster] seeking guidance before opening a bug on cman
In-Reply-To: <Pine.LNX.4.62.0508161007450.626@ns.pyerotechnics.com>
References: <Pine.LNX.4.62.0508161007450.626@ns.pyerotechnics.com>
Message-ID: <4302066C.5010101@redhat.com>

Jason,

We don't actually use those .spec.in files anymore.  I have removed them 
because they have become outdated.

Thanks,
Chris

Jason Pyeron wrote:
> 
> I was tring to compile a rpm from the srpm generated from cvs STABLE 
> (also tried from HEAD).
> 
> I got errors on prep stage, and then again on the files stage.
> 
> I have patched the spec file, and I was wondering if I was wrong or was 
> it a bug?
> 
> :pserver:cvs at sources.redhat.com:/cvs/cluster cluster
> --- cman/make/cman.spec.in      1 Nov 2004 23:23:18 -0000       1.2
> +++ cman/make/cman.spec.in      16 Aug 2005 14:09:55 -0000
> @@ -33,7 +33,7 @@
>  cman - The Cluster Manager
> 
>  %prep
> -%setup -q
> +%setup -q -n %{name}-%{version}-%{release}
> 
>  %build
>  ./configure --mandir=%{_mandir}
> @@ -51,6 +51,13 @@
>  # Binaries
>  /sbin/cman_tool
> 
> +/etc/init.d/cman
> +/usr/include/libcman.h
> +/usr/lib/libcman.a
> +/usr/lib/libcman.so
> +/usr/lib/libcman.so.%{version}
> +/usr/lib/libcman.so.%{version}.%{release}
> +
>  %doc
>  %{_mandir}
> 
> 
> 



From jpyeron at pdinc.us  Tue Aug 16 15:36:16 2005
From: jpyeron at pdinc.us (Jason Pyeron)
Date: Tue, 16 Aug 2005 11:36:16 -0400 (EDT)
Subject: [Linux-cluster] seeking guidance before opening a bug on cman
In-Reply-To: <4302066C.5010101@redhat.com>
References: <Pine.LNX.4.62.0508161007450.626@ns.pyerotechnics.com>
	<4302066C.5010101@redhat.com>
Message-ID: <Pine.LNX.4.62.0508161135350.626@ns.pyerotechnics.com>

On Tue, 16 Aug 2005, Chris Feist wrote:

> We don't actually use those .spec.in files anymore.  I have removed them 
> because they have become outdated.

then where are the spec files?

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Partner & Sr. Manager             7 West 24th Street #100     -
- +1 (443) 921-0381                 Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

This message is for the designated recipient only and may contain 
privileged, proprietary, or otherwise private information. If you 
have received it in error, purge the message from your system and 
notify the sender immediately.  Any other use of the email by you 
is prohibited.



From cfeist at redhat.com  Tue Aug 16 15:41:56 2005
From: cfeist at redhat.com (Chris Feist)
Date: Tue, 16 Aug 2005 10:41:56 -0500
Subject: [Linux-cluster] seeking guidance before opening a bug on cman
In-Reply-To: <Pine.LNX.4.62.0508161135350.626@ns.pyerotechnics.com>
References: <Pine.LNX.4.62.0508161007450.626@ns.pyerotechnics.com>	<4302066C.5010101@redhat.com>
	<Pine.LNX.4.62.0508161135350.626@ns.pyerotechnics.com>
Message-ID: <43020944.2000302@redhat.com>

Jason Pyeron wrote:
> On Tue, 16 Aug 2005, Chris Feist wrote:
> 
>> We don't actually use those .spec.in files anymore.  I have removed 
>> them because they have become outdated.
> 
> 
> then where are the spec files?
> 

They're available in the release srpm files on
ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/

Thanks,
Chris



From jpyeron at pdinc.us  Tue Aug 16 15:51:43 2005
From: jpyeron at pdinc.us (Jason Pyeron)
Date: Tue, 16 Aug 2005 11:51:43 -0400 (EDT)
Subject: [Linux-cluster] seeking guidance before opening a bug on cman
In-Reply-To: <43020944.2000302@redhat.com>
References: <Pine.LNX.4.62.0508161007450.626@ns.pyerotechnics.com>
	<4302066C.5010101@redhat.com>
	<Pine.LNX.4.62.0508161135350.626@ns.pyerotechnics.com>
	<43020944.2000302@redhat.com>
Message-ID: <Pine.LNX.4.62.0508161146060.626@ns.pyerotechnics.com>

On Tue, 16 Aug 2005, Chris Feist wrote:

> Jason Pyeron wrote:
>> On Tue, 16 Aug 2005, Chris Feist wrote:
>> 
>>> We don't actually use those .spec.in files anymore.  I have removed them 
>>> because they have become outdated.
>> 
>> 
>> then where are the spec files?
>> 
>
> They're available in the release srpm files on
> ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/
>

The spec files inside the srpms seem a bit dated, too.

This begs the issue of the fake-build-provides from late July on the list 
and cman-kernel.

Are there more recent versions?

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Partner & Sr. Manager             7 West 24th Street #100     -
- +1 (443) 921-0381                 Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

This message is for the designated recipient only and may contain 
privileged, proprietary, or otherwise private information. If you 
have received it in error, purge the message from your system and 
notify the sender immediately.  Any other use of the email by you 
is prohibited.



From johngw at msi.umn.edu  Tue Aug 16 20:49:20 2005
From: johngw at msi.umn.edu (John Griffin-Wiesner)
Date: Tue, 16 Aug 2005 15:49:20 -0500
Subject: [Linux-cluster] updating ccs and running lock_gulmd
Message-ID: <20050816204920.GA18577@fog.msi.umn.edu>

I want to change the heartbeat allowed_misses setting.  I
extracted the current config with "ccs_tool extract", modified
cluster.ccs, then ran
"ccs_tool create -O <dir of files> /dev/pool/cca"

Does lock_gulmd automagically see changes, or do I have to tell
it to look?  "service lock_gulmd" offers only these options:
  start|stop|restart|status|forcestop
(I was hoping for a "reload".)

This is on a production system so I'd prefer to do this without
causing everyone to be fenced off.

Thanks for any suggestions.

-- 
John Griffin-Wiesner
Linux Cluster/Unix Systems Administrator
Univ. MN Supercomputing Institute
http://www.msi.umn.edu
612-624-4167
johngw at msi.umn.edu



From amanthei at redhat.com  Tue Aug 16 20:58:35 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Tue, 16 Aug 2005 15:58:35 -0500
Subject: [Linux-cluster] updating ccs and running lock_gulmd
In-Reply-To: <20050816204920.GA18577@fog.msi.umn.edu>
References: <20050816204920.GA18577@fog.msi.umn.edu>
Message-ID: <20050816205835.GA25135@redhat.com>

On Tue, Aug 16, 2005 at 03:49:20PM -0500, John Griffin-Wiesner wrote:
> I want to change the heartbeat allowed_misses setting.  I
> extracted the current config with "ccs_tool extract", modified
> cluster.ccs, then ran
> "ccs_tool create -O <dir of files> /dev/pool/cca"
> 
> Does lock_gulmd automagically see changes, or do I have to tell
> it to look?  "service lock_gulmd" offers only these options:
>   start|stop|restart|status|forcestop
> (I was hoping for a "reload".)
> 
> This is on a production system so I'd prefer to do this without
> causing everyone to be fenced off.
> 
> Thanks for any suggestions.

You can not change the heartbeat_rate or allowed_misses for lock_gulmd 
while the cluster is active.  In order to make the changes take effect, 
you will have to:

1. unmount all GFS in the cluster
2. stop all lock_gulmd in the cluster
3. modify ccs
4. start lock_gulmd
5. mount gfs

Note.  there is a bug that requires you to have to specify the
heartbeat_rate as a floating point number.  (see
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166009)

Changing these values while the cluster is up will prevent new instances of
lock_gulmd (for both the clients and servers) from connecting to the active 
instance of the cluster, but it should not cause the running nodes to be
fenced.

-- 
Adam Manthei  <amanthei at redhat.com>



From brianu at silvercash.com  Wed Aug 17 23:55:19 2005
From: brianu at silvercash.com (brianu)
Date: Wed, 17 Aug 2005 16:55:19 -0700
Subject: [Linux-cluster] complex gnbd setup example
Message-ID: <20050817235255.E21635A8678@mail.silvercash.com>

Hello,

 

 

I am trying to setup a redundant system using gnbd ( linux-iscsi for now
fiber if I can figure this out).

 

Hardware and softwaredetails

HP MSA100 SAN

3 gnbd servers mounting the storage via linux-scsi in a backend lan &
exporting to 3 test nodes

All nodes import the gnbds, but in the event that one GNBD server dies I
want it to not affect the clients.

OS Centos 4 - kernel.org kernel 2.6.12

 

 

Problems-> 

1) multipath doesn't appear to work with GFS 6.1 obtained from CVS Stable
for latest kernel.org kernels.

2) LVM doesn't allow duplicate PVs ( the SAN to be created in the Volume
group)

i.e.

[root at dell-1650-31 ~]# vgcreate my_new_gfs /dev/gnbd/l108_sdata
/dev/gnbd/l109_sdata

  Found duplicate PV TbSaubq5Si2S1nrMbuMF620HPb6kFojE: using /dev/gnbd2 not
/dev/gnbd0

  Found duplicate PV TbSaubq5Si2S1nrMbuMF620HPb6kFojE: using /dev/gnbd2 not
/dev/gnbd0

  Volume group "my_new_gfs" successfully created

[root at dell-1650-31 ~]#

[root at dell-1650-31 ~]# vgchange -a y my_new_gfs

  Found duplicate PV TbSaubq5Si2S1nrMbuMF620HPb6kFojE: using /dev/gnbd2 not
/dev/gnbd0

  0 logical volume(s) in volume group "my_new_gfs" now active

[root at dell-1650-31 ~]#   

 

Ideas->

Heartbeat -linux-HA for a dynamic ip in which the nodes will mount  via
gnbd_import -I vip

 

Thoughts? Can someone suggest a better route, or how they might have
accomplished this ? the idea is a failover to another gnbd server should it
one go down.

I am open to hearing suggestions.

 

Thanks in advance.

 

 

Brian Urrutia

System Administrator

Price Communications Inc.

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050817/b4110af3/attachment.htm>

From jpyeron at pdinc.us  Thu Aug 18 04:01:56 2005
From: jpyeron at pdinc.us (Jason Pyeron)
Date: Thu, 18 Aug 2005 00:01:56 -0400 (EDT)
Subject: [Linux-cluster] Source RPM debacle
Message-ID: <Pine.LNX.4.62.0508172344300.31247@ns.pyerotechnics.com>


I have been working with the CVS versions of the cluster suite and GFS. 
As well the versions on ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/

I am trying to avoid reinventing the wheel on the spec files.

I filed a bug but the spec files in CVS are 'not' the correct spec files.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166060

The spec files from the ftp site are broken.
https://www.redhat.com/archives/linux-cluster/2005-July/msg00224.html

The bug report indicates that it is fixed but I can't find any indication 
of it.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=163963

So where does one get the current versions?

CVS does not have the spec files. Where are the spec files published? 
Where are the SRPMs?

-Jason Pyeron

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Partner & Sr. Manager             7 West 24th Street #100     -
- +1 (443) 921-0381                 Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

This message is for the designated recipient only and may contain 
privileged, proprietary, or otherwise private information. If you 
have received it in error, purge the message from your system and 
notify the sender immediately.  Any other use of the email by you 
is prohibited.



From teigland at redhat.com  Thu Aug 18 06:10:14 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 18 Aug 2005 14:10:14 +0800
Subject: [Linux-cluster] [PATCH 2/3] dlm: remove file
Message-ID: <20050818061014.GB10133@redhat.com>

The reduced member_sysfs.c is no longer related to lockspace members.
Move what's left into lockspace.c which is the only file that uses the
remaining functions.

Signed-off-by: David Teigland <teigland at redhat.com>

---

 Makefile       |    1 
 lockspace.c    |  155 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 lockspace.h    |    1 
 main.c         |   14 +---
 member.c       |    2 
 member_sysfs.c |  165 ---------------------------------------------------------
 member_sysfs.h |   22 -------
 7 files changed, 156 insertions(+), 204 deletions(-)

diff -urpN a/drivers/dlm/Makefile b/drivers/dlm/Makefile
--- a/drivers/dlm/Makefile	2005-08-18 13:26:02.648375344 +0800
+++ b/drivers/dlm/Makefile	2005-08-18 13:26:25.736865360 +0800
@@ -9,7 +9,6 @@ dlm-y :=			ast.o \
 				lowcomms.o \
 				main.o \
 				member.o \
-				member_sysfs.o \
 				memory.o \
 				midcomms.o \
 				rcom.o \
diff -urpN a/drivers/dlm/lockspace.c b/drivers/dlm/lockspace.c
--- a/drivers/dlm/lockspace.c	2005-08-18 13:26:02.651374888 +0800
+++ b/drivers/dlm/lockspace.c	2005-08-18 13:26:25.737865208 +0800
@@ -14,7 +14,6 @@
 #include "dlm_internal.h"
 #include "lockspace.h"
 #include "member.h"
-#include "member_sysfs.h"
 #include "recoverd.h"
 #include "ast.h"
 #include "dir.h"
@@ -38,13 +37,159 @@ static spinlock_t		lslist_lock;
 static struct task_struct *	scand_task;
 
 
+static ssize_t dlm_control_store(struct dlm_ls *ls, const char *buf, size_t len)
+{
+	ssize_t ret = len;
+	int n = simple_strtol(buf, NULL, 0);
+
+	switch (n) {
+	case 0:
+		dlm_ls_stop(ls);
+		break;
+	case 1:
+		dlm_ls_start(ls);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+	return ret;
+}
+
+static ssize_t dlm_event_store(struct dlm_ls *ls, const char *buf, size_t len)
+{
+	ls->ls_uevent_result = simple_strtol(buf, NULL, 0);
+	set_bit(LSFL_UEVENT_WAIT, &ls->ls_flags);
+	wake_up(&ls->ls_uevent_wait);
+	return len;
+}
+
+static ssize_t dlm_id_show(struct dlm_ls *ls, char *buf)
+{
+	return sprintf(buf, "%u\n", ls->ls_global_id);
+}
+
+static ssize_t dlm_id_store(struct dlm_ls *ls, const char *buf, size_t len)
+{
+	ls->ls_global_id = simple_strtoul(buf, NULL, 0);
+	return len;
+}
+
+struct dlm_attr {
+	struct attribute attr;
+	ssize_t (*show)(struct dlm_ls *, char *);
+	ssize_t (*store)(struct dlm_ls *, const char *, size_t);
+};
+
+static struct dlm_attr dlm_attr_control = {
+	.attr  = {.name = "control", .mode = S_IWUSR},
+	.store = dlm_control_store
+};
+
+static struct dlm_attr dlm_attr_event = {
+	.attr  = {.name = "event_done", .mode = S_IWUSR},
+	.store = dlm_event_store
+};
+
+static struct dlm_attr dlm_attr_id = {
+	.attr  = {.name = "id", .mode = S_IRUGO | S_IWUSR},
+	.show  = dlm_id_show,
+	.store = dlm_id_store
+};
+
+static struct attribute *dlm_attrs[] = {
+	&dlm_attr_control.attr,
+	&dlm_attr_event.attr,
+	&dlm_attr_id.attr,
+	NULL,
+};
+
+static ssize_t dlm_attr_show(struct kobject *kobj, struct attribute *attr,
+			     char *buf)
+{
+	struct dlm_ls *ls  = container_of(kobj, struct dlm_ls, ls_kobj);
+	struct dlm_attr *a = container_of(attr, struct dlm_attr, attr);
+	return a->show ? a->show(ls, buf) : 0;
+}
+
+static ssize_t dlm_attr_store(struct kobject *kobj, struct attribute *attr,
+			      const char *buf, size_t len)
+{
+	struct dlm_ls *ls  = container_of(kobj, struct dlm_ls, ls_kobj);
+	struct dlm_attr *a = container_of(attr, struct dlm_attr, attr);
+	return a->store ? a->store(ls, buf, len) : len;
+}
+
+static struct sysfs_ops dlm_attr_ops = {
+	.show  = dlm_attr_show,
+	.store = dlm_attr_store,
+};
+
+static struct kobj_type dlm_ktype = {
+	.default_attrs = dlm_attrs,
+	.sysfs_ops     = &dlm_attr_ops,
+};
+
+static struct kset dlm_kset = {
+	.subsys = &kernel_subsys,
+	.kobj   = {.name = "dlm",},
+	.ktype  = &dlm_ktype,
+};
+
+static int kobject_setup(struct dlm_ls *ls)
+{
+	char lsname[DLM_LOCKSPACE_LEN];
+	int error;
+
+	memset(lsname, 0, DLM_LOCKSPACE_LEN);
+	snprintf(lsname, DLM_LOCKSPACE_LEN, "%s", ls->ls_name);
+
+	error = kobject_set_name(&ls->ls_kobj, "%s", lsname);
+	if (error)
+		return error;
+
+	ls->ls_kobj.kset = &dlm_kset;
+	ls->ls_kobj.ktype = &dlm_ktype;
+	return 0;
+}
+
+static int do_uevent(struct dlm_ls *ls, int in)
+{
+	int error;
+
+	if (in)
+		kobject_uevent(&ls->ls_kobj, KOBJ_ONLINE, NULL);
+	else
+		kobject_uevent(&ls->ls_kobj, KOBJ_OFFLINE, NULL);
+
+	error = wait_event_interruptible(ls->ls_uevent_wait,
+			test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
+	if (error)
+		goto out;
+
+	error = ls->ls_uevent_result;
+ out:
+	return error;
+}
+
+
 int dlm_lockspace_init(void)
 {
+	int error;
+
 	ls_count = 0;
 	init_MUTEX(&ls_lock);
 	INIT_LIST_HEAD(&lslist);
 	spin_lock_init(&lslist_lock);
-	return 0;
+
+	error = kset_register(&dlm_kset);
+	if (error)
+		printk("dlm_lockspace_init: cannot register kset %d\n", error);
+	return error;
+}
+
+void dlm_lockspace_exit(void)
+{
+	kset_unregister(&dlm_kset);
 }
 
 static int dlm_scand(void *data)
@@ -310,7 +455,7 @@ static int new_lockspace(char *name, int
 
 	dlm_create_debug_file(ls);
 
-	error = dlm_kobject_setup(ls);
+	error = kobject_setup(ls);
 	if (error)
 		goto out_del;
 
@@ -318,7 +463,7 @@ static int new_lockspace(char *name, int
 	if (error)
 		goto out_del;
 
-	error = dlm_uevent(ls, 1);
+	error = do_uevent(ls, 1);
 	if (error)
 		goto out_unreg;
 
@@ -409,7 +554,7 @@ static int release_lockspace(struct dlm_
 		return -EBUSY;
 
 	if (force < 3)
-		dlm_uevent(ls, 0);
+		do_uevent(ls, 0);
 
 	dlm_recoverd_stop(ls);
 
diff -urpN a/drivers/dlm/lockspace.h b/drivers/dlm/lockspace.h
--- a/drivers/dlm/lockspace.h	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/lockspace.h	2005-08-18 13:26:25.737865208 +0800
@@ -15,6 +15,7 @@
 #define __LOCKSPACE_DOT_H__
 
 int dlm_lockspace_init(void);
+void dlm_lockspace_exit(void);
 struct dlm_ls *dlm_find_lockspace_global(uint32_t id);
 struct dlm_ls *dlm_find_lockspace_local(void *id);
 struct dlm_ls *dlm_find_lockspace_name(char *name, int namelen);
diff -urpN a/drivers/dlm/main.c b/drivers/dlm/main.c
--- a/drivers/dlm/main.c	2005-08-18 13:26:02.653374584 +0800
+++ b/drivers/dlm/main.c	2005-08-18 13:26:25.738865056 +0800
@@ -13,9 +13,7 @@
 
 #include "dlm_internal.h"
 #include "lockspace.h"
-#include "member_sysfs.h"
 #include "lock.h"
-#include "device.h"
 #include "memory.h"
 #include "lowcomms.h"
 #include "config.h"
@@ -40,13 +38,9 @@ static int __init init_dlm(void)
 	if (error)
 		goto out_mem;
 
-	error = dlm_member_sysfs_init();
-	if (error)
-		goto out_mem;
-
 	error = dlm_config_init();
 	if (error)
-		goto out_member;
+		goto out_lockspace;
 
 	error = dlm_register_debugfs();
 	if (error)
@@ -64,8 +58,8 @@ static int __init init_dlm(void)
 	dlm_unregister_debugfs();
  out_config:
 	dlm_config_exit();
- out_member:
-	dlm_member_sysfs_exit();
+ out_lockspace:
+	dlm_lockspace_exit();
  out_mem:
 	dlm_memory_exit();
  out:
@@ -75,9 +69,9 @@ static int __init init_dlm(void)
 static void __exit exit_dlm(void)
 {
 	dlm_lowcomms_exit();
-	dlm_member_sysfs_exit();
 	dlm_config_exit();
 	dlm_memory_exit();
+	dlm_lockspace_exit();
 	dlm_unregister_debugfs();
 }
 
diff -urpN a/drivers/dlm/member.c b/drivers/dlm/member.c
--- a/drivers/dlm/member.c	2005-08-18 13:26:02.654374432 +0800
+++ b/drivers/dlm/member.c	2005-08-18 13:26:25.738865056 +0800
@@ -221,7 +221,7 @@ int dlm_recover_members(struct dlm_ls *l
 }
 
 /*
- * Following called from member_sysfs.c
+ * Following called from lockspace.c
  */
 
 int dlm_ls_stop(struct dlm_ls *ls)
diff -urpN a/drivers/dlm/member_sysfs.c b/drivers/dlm/member_sysfs.c
--- a/drivers/dlm/member_sysfs.c	2005-08-18 13:26:02.655374280 +0800
+++ b/drivers/dlm/member_sysfs.c	1970-01-01 07:30:00.000000000 +0730
@@ -1,165 +0,0 @@
-/******************************************************************************
-*******************************************************************************
-**
-**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
-**
-**  This copyrighted material is made available to anyone wishing to use,
-**  modify, copy, or redistribute it subject to the terms and conditions
-**  of the GNU General Public License v.2.
-**
-*******************************************************************************
-******************************************************************************/
-
-#include "dlm_internal.h"
-#include "member.h"
-
-
-static ssize_t dlm_control_store(struct dlm_ls *ls, const char *buf, size_t len)
-{
-	ssize_t ret = len;
-	int n = simple_strtol(buf, NULL, 0);
-
-	switch (n) {
-	case 0:
-		dlm_ls_stop(ls);
-		break;
-	case 1:
-		dlm_ls_start(ls);
-		break;
-	default:
-		ret = -EINVAL;
-	}
-	return ret;
-}
-
-static ssize_t dlm_event_store(struct dlm_ls *ls, const char *buf, size_t len)
-{
-	ls->ls_uevent_result = simple_strtol(buf, NULL, 0);
-	set_bit(LSFL_UEVENT_WAIT, &ls->ls_flags);
-	wake_up(&ls->ls_uevent_wait);
-	return len;
-}
-
-static ssize_t dlm_id_show(struct dlm_ls *ls, char *buf)
-{
-	return sprintf(buf, "%u\n", ls->ls_global_id);
-}
-
-static ssize_t dlm_id_store(struct dlm_ls *ls, const char *buf, size_t len)
-{
-	ls->ls_global_id = simple_strtoul(buf, NULL, 0);
-	return len;
-}
-
-struct dlm_attr {
-	struct attribute attr;
-	ssize_t (*show)(struct dlm_ls *, char *);
-	ssize_t (*store)(struct dlm_ls *, const char *, size_t);
-};
-
-static struct dlm_attr dlm_attr_control = {
-	.attr  = {.name = "control", .mode = S_IWUSR},
-	.store = dlm_control_store
-};
-
-static struct dlm_attr dlm_attr_event = {
-	.attr  = {.name = "event_done", .mode = S_IWUSR},
-	.store = dlm_event_store
-};
-
-static struct dlm_attr dlm_attr_id = {
-	.attr  = {.name = "id", .mode = S_IRUGO | S_IWUSR},
-	.show  = dlm_id_show,
-	.store = dlm_id_store
-};
-
-static struct attribute *dlm_attrs[] = {
-	&dlm_attr_control.attr,
-	&dlm_attr_event.attr,
-	&dlm_attr_id.attr,
-	NULL,
-};
-
-static ssize_t dlm_attr_show(struct kobject *kobj, struct attribute *attr,
-			     char *buf)
-{
-	struct dlm_ls *ls  = container_of(kobj, struct dlm_ls, ls_kobj);
-	struct dlm_attr *a = container_of(attr, struct dlm_attr, attr);
-	return a->show ? a->show(ls, buf) : 0;
-}
-
-static ssize_t dlm_attr_store(struct kobject *kobj, struct attribute *attr,
-			      const char *buf, size_t len)
-{
-	struct dlm_ls *ls  = container_of(kobj, struct dlm_ls, ls_kobj);
-	struct dlm_attr *a = container_of(attr, struct dlm_attr, attr);
-	return a->store ? a->store(ls, buf, len) : len;
-}
-
-static struct sysfs_ops dlm_attr_ops = {
-	.show  = dlm_attr_show,
-	.store = dlm_attr_store,
-};
-
-static struct kobj_type dlm_ktype = {
-	.default_attrs = dlm_attrs,
-	.sysfs_ops     = &dlm_attr_ops,
-};
-
-static struct kset dlm_kset = {
-	.subsys = &kernel_subsys,
-	.kobj   = {.name = "dlm",},
-	.ktype  = &dlm_ktype,
-};
-
-int dlm_member_sysfs_init(void)
-{
-	int error;
-
-	error = kset_register(&dlm_kset);
-	if (error)
-		printk("dlm_lockspace_init: cannot register kset %d\n", error);
-	return error;
-}
-
-void dlm_member_sysfs_exit(void)
-{
-	kset_unregister(&dlm_kset);
-}
-
-int dlm_kobject_setup(struct dlm_ls *ls)
-{
-	char lsname[DLM_LOCKSPACE_LEN];
-	int error;
-
-	memset(lsname, 0, DLM_LOCKSPACE_LEN);
-	snprintf(lsname, DLM_LOCKSPACE_LEN, "%s", ls->ls_name);
-
-	error = kobject_set_name(&ls->ls_kobj, "%s", lsname);
-	if (error)
-		return error;
-
-	ls->ls_kobj.kset = &dlm_kset;
-	ls->ls_kobj.ktype = &dlm_ktype;
-	return 0;
-}
-
-int dlm_uevent(struct dlm_ls *ls, int in)
-{
-	int error;
-
-	if (in)
-		kobject_uevent(&ls->ls_kobj, KOBJ_ONLINE, NULL);
-	else
-		kobject_uevent(&ls->ls_kobj, KOBJ_OFFLINE, NULL);
-
-	error = wait_event_interruptible(ls->ls_uevent_wait,
-			test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
-	if (error)
-		goto out;
-
-	error = ls->ls_uevent_result;
- out:
-	return error;
-}
-
diff -urpN a/drivers/dlm/member_sysfs.h b/drivers/dlm/member_sysfs.h
--- a/drivers/dlm/member_sysfs.h	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/member_sysfs.h	1970-01-01 07:30:00.000000000 +0730
@@ -1,22 +0,0 @@
-/******************************************************************************
-*******************************************************************************
-**
-**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
-**
-**  This copyrighted material is made available to anyone wishing to use,
-**  modify, copy, or redistribute it subject to the terms and conditions
-**  of the GNU General Public License v.2.
-**
-*******************************************************************************
-******************************************************************************/
-
-#ifndef __MEMBER_SYSFS_DOT_H__
-#define __MEMBER_SYSFS_DOT_H__
-
-int dlm_member_sysfs_init(void);
-void dlm_member_sysfs_exit(void);
-int dlm_kobject_setup(struct dlm_ls *ls);
-int dlm_uevent(struct dlm_ls *ls, int in);
-
-#endif                          /* __MEMBER_SYSFS_DOT_H__ */
-



From teigland at redhat.com  Thu Aug 18 06:11:17 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 18 Aug 2005 14:11:17 +0800
Subject: [Linux-cluster] [PATCH 3/3] dlm: use jhash
Message-ID: <20050818061117.GC10133@redhat.com>

Use linux/jhash.h instead of our own hash function.

Signed-off-by: David Teigland <teigland at redhat.com>

---

 dir.c          |    2 +-
 dlm_internal.h |    1 +
 lock.c         |    2 +-
 util.c         |   34 ----------------------------------
 util.h         |    2 --
 5 files changed, 3 insertions(+), 38 deletions(-)

diff -urpN a/drivers/dlm/dir.c b/drivers/dlm/dir.c
--- a/drivers/dlm/dir.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/dir.c	2005-08-18 13:47:29.112803024 +0800
@@ -120,7 +120,7 @@ static inline uint32_t dir_hash(struct d
 {
 	uint32_t val;
 
-	val = dlm_hash(name, len);
+	val = jhash(name, len, 0);
 	val &= (ls->ls_dirtbl_size - 1);
 
 	return val;
diff -urpN a/drivers/dlm/dlm_internal.h b/drivers/dlm/dlm_internal.h
--- a/drivers/dlm/dlm_internal.h	2005-08-18 13:26:02.651374888 +0800
+++ b/drivers/dlm/dlm_internal.h	2005-08-18 13:47:29.112803024 +0800
@@ -34,6 +34,7 @@
 #include <linux/kobject.h>
 #include <linux/kref.h>
 #include <linux/kernel.h>
+#include <linux/jhash.h>
 #include <asm/semaphore.h>
 #include <asm/uaccess.h>
 
diff -urpN a/drivers/dlm/lock.c b/drivers/dlm/lock.c
--- a/drivers/dlm/lock.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/lock.c	2005-08-18 13:47:29.114802720 +0800
@@ -369,7 +369,7 @@ static int find_rsb(struct dlm_ls *ls, c
 	if (dlm_no_directory(ls))
 		flags |= R_CREATE;
 
-	hash = dlm_hash(name, namelen);
+	hash = jhash(name, namelen, 0);
 	bucket = hash & (ls->ls_rsbtbl_size - 1);
 
 	error = search_rsb(ls, name, namelen, bucket, flags, &r);
diff -urpN a/drivers/dlm/util.c b/drivers/dlm/util.c
--- a/drivers/dlm/util.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/util.c	2005-08-18 13:47:29.115802568 +0800
@@ -13,40 +13,6 @@
 #include "dlm_internal.h"
 #include "rcom.h"
 
-/**
- * dlm_hash - hash an array of data
- * @data: the data to be hashed
- * @len: the length of data to be hashed
- *
- * Copied from GFS which copied from...
- *
- * Take some data and convert it to a 32-bit hash.
- * This is the 32-bit FNV-1a hash from:
- * http://www.isthe.com/chongo/tech/comp/fnv/
- */
-
-static inline uint32_t hash_more_internal(const void *data, unsigned int len,
-					  uint32_t hash)
-{
-	unsigned char *p = (unsigned char *)data;
-	unsigned char *e = p + len;
-	uint32_t h = hash;
-
-	while (p < e) {
-		h ^= (uint32_t)(*p++);
-		h *= 0x01000193;
-	}
-
-	return h;
-}
-
-uint32_t dlm_hash(const void *data, int len)
-{
-	uint32_t h = 0x811C9DC5;
-	h = hash_more_internal(data, len, h);
-	return h;
-}
-
 static void header_out(struct dlm_header *hd)
 {
 	hd->h_version		= cpu_to_le32(hd->h_version);
diff -urpN a/drivers/dlm/util.h b/drivers/dlm/util.h
--- a/drivers/dlm/util.h	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/util.h	2005-08-18 13:47:29.115802568 +0800
@@ -13,8 +13,6 @@
 #ifndef __UTIL_DOT_H__
 #define __UTIL_DOT_H__
 
-uint32_t dlm_hash(const char *data, int len);
-
 void dlm_message_out(struct dlm_message *ms);
 void dlm_message_in(struct dlm_message *ms);
 void dlm_rcom_out(struct dlm_rcom *rc);



From akpm at osdl.org  Thu Aug 18 06:22:18 2005
From: akpm at osdl.org (Andrew Morton)
Date: Wed, 17 Aug 2005 23:22:18 -0700
Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs
In-Reply-To: <20050818060750.GA10133@redhat.com>
References: <20050818060750.GA10133@redhat.com>
Message-ID: <20050817232218.56a06fd6.akpm@osdl.org>

David Teigland <teigland at redhat.com> wrote:
>
> Use configfs to configure lockspace members and node addresses.  This was
>  previously done with sysfs and ioctl.

Fair enough.  This really means that the configfs patch should be split out
of the ocfs2 megapatch...



From teigland at redhat.com  Thu Aug 18 06:07:50 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 18 Aug 2005 14:07:50 +0800
Subject: [Linux-cluster] [PATCH 1/3] dlm: use configfs
Message-ID: <20050818060750.GA10133@redhat.com>

Use configfs to configure lockspace members and node addresses.  This was
previously done with sysfs and ioctl.

Signed-off-by: David Teigland <teigland at redhat.com>

---

 drivers/dlm/Makefile       |    1 
 drivers/dlm/config.c       |  759 ++++++++++++++++++++++++++++++++++++++++++++-
 drivers/dlm/config.h       |   12 
 drivers/dlm/dlm_internal.h |    2 
 drivers/dlm/lockspace.c    |    7 
 drivers/dlm/lowcomms.c     |  195 +----------
 drivers/dlm/lowcomms.h     |    4 
 drivers/dlm/main.c         |   18 -
 drivers/dlm/member.c       |   40 +-
 drivers/dlm/member_sysfs.c |   76 ----
 drivers/dlm/node_ioctl.c   |  126 -------
 drivers/dlm/requestqueue.c |    2 
 include/linux/dlm_node.h   |   44 --
 13 files changed, 828 insertions(+), 458 deletions(-)

diff -urpN a/drivers/dlm/Makefile b/drivers/dlm/Makefile
--- a/drivers/dlm/Makefile	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/Makefile	2005-08-18 13:22:00.718154328 +0800
@@ -12,7 +12,6 @@ dlm-y :=			ast.o \
 				member_sysfs.o \
 				memory.o \
 				midcomms.o \
-				node_ioctl.o \
 				rcom.o \
 				recover.o \
 				recoverd.o \
diff -urpN a/drivers/dlm/config.c b/drivers/dlm/config.c
--- a/drivers/dlm/config.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/config.c	2005-08-18 13:22:00.719154176 +0800
@@ -11,9 +11,756 @@
 *******************************************************************************
 ******************************************************************************/
 
-#include "dlm_internal.h"
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/configfs.h>
+#include <net/sock.h>
+
 #include "config.h"
 
+/*
+ * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/nodeid
+ * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/weight
+ * /config/dlm/<cluster>/comms/<comm>/nodeid
+ * /config/dlm/<cluster>/comms/<comm>/local
+ * /config/dlm/<cluster>/comms/<comm>/addr
+ * The <cluster> level is useless, but I haven't figured out how to avoid it.
+ */
+
+static struct config_group *space_list;
+static struct config_group *comm_list;
+static struct comm *local_comm;
+
+struct clusters;
+struct cluster;
+struct spaces;
+struct space;
+struct comms;
+struct comm;
+struct nodes;
+struct node;
+
+static struct config_group *make_cluster(struct config_group *, const char *);
+static void drop_cluster(struct config_group *, struct config_item *);
+static void release_cluster(struct config_item *);
+static struct config_group *make_space(struct config_group *, const char *);
+static void drop_space(struct config_group *, struct config_item *);
+static void release_space(struct config_item *);
+static struct config_item *make_comm(struct config_group *, const char *);
+static void drop_comm(struct config_group *, struct config_item *);
+static void release_comm(struct config_item *);
+static struct config_item *make_node(struct config_group *, const char *);
+static void drop_node(struct config_group *, struct config_item *);
+static void release_node(struct config_item *);
+
+static ssize_t show_comm(struct config_item *i, struct configfs_attribute *a,
+			 char *buf);
+static ssize_t store_comm(struct config_item *i, struct configfs_attribute *a,
+			  const char *buf, size_t len);
+static ssize_t show_node(struct config_item *i, struct configfs_attribute *a,
+			 char *buf);
+static ssize_t store_node(struct config_item *i, struct configfs_attribute *a,
+			  const char *buf, size_t len);
+
+static ssize_t comm_nodeid_read(struct comm *cm, char *buf);
+static ssize_t comm_nodeid_write(struct comm *cm, const char *buf, size_t len);
+static ssize_t comm_local_read(struct comm *cm, char *buf);
+static ssize_t comm_local_write(struct comm *cm, const char *buf, size_t len);
+static ssize_t comm_addr_write(struct comm *cm, const char *buf, size_t len);
+static ssize_t node_nodeid_read(struct node *nd, char *buf);
+static ssize_t node_nodeid_write(struct node *nd, const char *buf, size_t len);
+static ssize_t node_weight_read(struct node *nd, char *buf);
+static ssize_t node_weight_write(struct node *nd, const char *buf, size_t len);
+
+enum {
+	COMM_ATTR_NODEID = 0,
+	COMM_ATTR_LOCAL,
+	COMM_ATTR_ADDR,
+};
+
+struct comm_attribute {
+	struct configfs_attribute attr;
+	ssize_t (*show)(struct comm *, char *);
+	ssize_t (*store)(struct comm *, const char *, size_t);
+};
+
+static struct comm_attribute comm_attr_nodeid = {
+	.attr   = { .ca_owner = THIS_MODULE,
+                    .ca_name = "nodeid",
+                    .ca_mode = S_IRUGO | S_IWUSR },
+	.show   = comm_nodeid_read,
+	.store  = comm_nodeid_write,
+};
+
+static struct comm_attribute comm_attr_local = {
+	.attr   = { .ca_owner = THIS_MODULE,
+                    .ca_name = "local",
+                    .ca_mode = S_IRUGO | S_IWUSR },
+	.show   = comm_local_read,
+	.store  = comm_local_write,
+};
+
+static struct comm_attribute comm_attr_addr = {
+	.attr   = { .ca_owner = THIS_MODULE,
+                    .ca_name = "addr",
+                    .ca_mode = S_IRUGO | S_IWUSR },
+	.store  = comm_addr_write,
+};
+
+static struct configfs_attribute *comm_attrs[] = {
+	[COMM_ATTR_NODEID] = &comm_attr_nodeid.attr,
+	[COMM_ATTR_LOCAL] = &comm_attr_local.attr,
+	[COMM_ATTR_ADDR] = &comm_attr_addr.attr,
+	NULL,
+};
+
+enum {
+	NODE_ATTR_NODEID = 0,
+	NODE_ATTR_WEIGHT,
+};
+
+struct node_attribute {
+	struct configfs_attribute attr;
+	ssize_t (*show)(struct node *, char *);
+	ssize_t (*store)(struct node *, const char *, size_t);
+};
+
+static struct node_attribute node_attr_nodeid = {
+	.attr   = { .ca_owner = THIS_MODULE,
+                    .ca_name = "nodeid",
+                    .ca_mode = S_IRUGO | S_IWUSR },
+	.show   = node_nodeid_read,
+	.store  = node_nodeid_write,
+};
+
+static struct node_attribute node_attr_weight = {
+	.attr   = { .ca_owner = THIS_MODULE,
+                    .ca_name = "weight",
+                    .ca_mode = S_IRUGO | S_IWUSR },
+	.show   = node_weight_read,
+	.store  = node_weight_write,
+};
+
+static struct configfs_attribute *node_attrs[] = {
+	[NODE_ATTR_NODEID] = &node_attr_nodeid.attr,
+	[NODE_ATTR_WEIGHT] = &node_attr_weight.attr,
+	NULL,
+};
+
+struct clusters {
+	struct configfs_subsystem subsys;
+};
+
+struct cluster {
+	struct config_group group;
+};
+
+struct spaces {
+	struct config_group ss_group;
+};
+
+struct space {
+	struct config_group group;
+	struct list_head members;
+	struct semaphore members_lock;
+	int members_count;
+};
+
+struct comms {
+	struct config_group cs_group;
+};
+
+struct comm {
+	struct config_item item;
+	int nodeid;
+	int local;
+	int addr_count;
+	struct sockaddr_storage *addr[DLM_MAX_ADDR_COUNT];
+};
+
+struct nodes {
+	struct config_group ns_group;
+};
+
+struct node {
+	struct config_item item;
+	struct list_head list; /* space->members */
+	int nodeid;
+	int weight;
+};
+
+static struct configfs_group_operations clusters_ops = {
+	.make_group = make_cluster,
+	.drop_item = drop_cluster,
+};
+
+static struct configfs_item_operations cluster_ops = {
+	.release = release_cluster,
+};
+
+static struct configfs_group_operations spaces_ops = {
+	.make_group = make_space,
+	.drop_item = drop_space,
+};
+
+static struct configfs_item_operations space_ops = {
+	.release = release_space,
+};
+
+static struct configfs_group_operations comms_ops = {
+	.make_item = make_comm,
+	.drop_item = drop_comm,
+};
+
+static struct configfs_item_operations comm_ops = {
+	.release = release_comm,
+	.show_attribute = show_comm,
+	.store_attribute = store_comm,
+};
+
+static struct configfs_group_operations nodes_ops = {
+	.make_item = make_node,
+	.drop_item = drop_node,
+};
+
+static struct configfs_item_operations node_ops = {
+	.release = release_node,
+	.show_attribute = show_node,
+	.store_attribute = store_node,
+};
+
+static struct config_item_type clusters_type = {
+	.ct_group_ops = &clusters_ops,
+	.ct_owner = THIS_MODULE,
+};
+
+static struct config_item_type cluster_type = {
+	.ct_item_ops = &cluster_ops,
+	.ct_owner = THIS_MODULE,
+};
+
+static struct config_item_type spaces_type = {
+	.ct_group_ops = &spaces_ops,
+	.ct_owner = THIS_MODULE,
+};
+
+static struct config_item_type space_type = {
+	.ct_item_ops = &space_ops,
+	.ct_owner = THIS_MODULE,
+};
+
+static struct config_item_type comms_type = {
+	.ct_group_ops = &comms_ops,
+	.ct_owner = THIS_MODULE,
+};
+
+static struct config_item_type comm_type = {
+	.ct_item_ops = &comm_ops,
+	.ct_attrs = comm_attrs,
+	.ct_owner = THIS_MODULE,
+};
+
+static struct config_item_type nodes_type = {
+	.ct_group_ops = &nodes_ops,
+	.ct_owner = THIS_MODULE,
+};
+
+static struct config_item_type node_type = {
+	.ct_item_ops = &node_ops,
+	.ct_attrs = node_attrs,
+	.ct_owner = THIS_MODULE,
+};
+
+static struct cluster *to_cluster(struct config_item *i)
+{
+	return i ? container_of(to_config_group(i), struct cluster, group):NULL;
+}
+
+static struct space *to_space(struct config_item *i)
+{
+	return i ? container_of(to_config_group(i), struct space, group) : NULL;
+}
+
+static struct comm *to_comm(struct config_item *i)
+{
+	return i ? container_of(i, struct comm, item) : NULL;
+}
+
+static struct node *to_node(struct config_item *i)
+{
+	return i ? container_of(i, struct node, item) : NULL;
+}
+
+static struct config_group *make_cluster(struct config_group *g,
+					 const char *name)
+{
+	struct cluster *cl = NULL;
+	struct spaces *sps = NULL;
+	struct comms *cms = NULL;
+	void *gps = NULL;
+
+	cl = kzalloc(sizeof(struct cluster), GFP_KERNEL);
+	gps = kcalloc(3, sizeof(struct config_group *), GFP_KERNEL);
+	sps = kzalloc(sizeof(struct spaces), GFP_KERNEL);
+	cms = kzalloc(sizeof(struct comms), GFP_KERNEL);
+
+	if (!cl || !gps || !sps || !cms)
+		goto fail;
+
+	config_group_init_type_name(&cl->group, name, &cluster_type);
+	config_group_init_type_name(&sps->ss_group, "spaces", &spaces_type);
+	config_group_init_type_name(&cms->cs_group, "comms", &comms_type);
+
+	cl->group.default_groups = gps;
+	cl->group.default_groups[0] = &sps->ss_group;
+	cl->group.default_groups[1] = &cms->cs_group;
+	cl->group.default_groups[2] = NULL;
+
+	space_list = &sps->ss_group;
+	comm_list = &cms->cs_group;
+	return &cl->group;
+
+ fail:
+	kfree(cl);
+	kfree(gps);
+	kfree(sps);
+	kfree(cms);
+	return NULL;
+}
+
+static void drop_cluster(struct config_group *g, struct config_item *i)
+{
+	struct cluster *cl = to_cluster(i);
+	struct config_item *tmp;
+	int j;
+
+	for (j = 0; cl->group.default_groups[j]; j++) {
+		tmp = &cl->group.default_groups[j]->cg_item;
+		cl->group.default_groups[j] = NULL;
+		config_item_put(tmp);
+	}
+
+	space_list = NULL;
+	comm_list = NULL;
+
+	config_item_put(i);
+}
+
+static void release_cluster(struct config_item *i)
+{
+	struct cluster *cl = to_cluster(i);
+	kfree(cl->group.default_groups);
+	kfree(cl);
+}
+
+static struct config_group *make_space(struct config_group *g, const char *name)
+{
+	struct space *sp = NULL;
+	struct nodes *nds = NULL;
+	void *gps = NULL;
+
+	sp = kzalloc(sizeof(struct space), GFP_KERNEL);
+	gps = kcalloc(2, sizeof(struct config_group *), GFP_KERNEL);
+	nds = kzalloc(sizeof(struct nodes), GFP_KERNEL);
+
+	if (!sp || !gps || !nds)
+		goto fail;
+
+	config_group_init_type_name(&sp->group, name, &space_type);
+	config_group_init_type_name(&nds->ns_group, "nodes", &nodes_type);
+
+	sp->group.default_groups = gps;
+	sp->group.default_groups[0] = &nds->ns_group;
+	sp->group.default_groups[1] = NULL;
+
+	INIT_LIST_HEAD(&sp->members);
+	init_MUTEX(&sp->members_lock);
+	sp->members_count = 0;
+	return &sp->group;
+
+ fail:
+	kfree(sp);
+	kfree(gps);
+	kfree(nds);
+	return NULL;
+}
+
+static void drop_space(struct config_group *g, struct config_item *i)
+{
+	struct space *sp = to_space(i);
+	struct config_item *tmp;
+	int j;
+
+	/* assert list_empty(&sp->members) */
+
+	for (j = 0; sp->group.default_groups[j]; j++) {
+		tmp = &sp->group.default_groups[j]->cg_item;
+		sp->group.default_groups[j] = NULL;
+		config_item_put(tmp);
+	}
+
+	config_item_put(i);
+}
+
+static void release_space(struct config_item *i)
+{
+	struct space *sp = to_space(i);
+	kfree(sp->group.default_groups);
+	kfree(sp);
+}
+
+static struct config_item *make_comm(struct config_group *g, const char *name)
+{
+	struct comm *cm;
+
+	cm = kzalloc(sizeof(struct comm), GFP_KERNEL);
+	if (!cm)
+		return NULL;
+
+	config_item_init_type_name(&cm->item, name, &comm_type);
+	cm->nodeid = -1;
+	cm->local = 0;
+	cm->addr_count = 0;
+	return &cm->item;
+}
+
+static void drop_comm(struct config_group *g, struct config_item *i)
+{
+	struct comm *cm = to_comm(i);
+	if (local_comm == cm)
+		local_comm = NULL;
+	while (cm->addr_count--)
+		kfree(cm->addr[cm->addr_count]);
+	config_item_put(i);
+}
+
+static void release_comm(struct config_item *i)
+{
+	struct comm *cm = to_comm(i);
+	kfree(cm);
+}
+
+static struct config_item *make_node(struct config_group *g, const char *name)
+{
+	struct space *sp = to_space(g->cg_item.ci_parent);
+	struct node *nd;
+
+	nd = kzalloc(sizeof(struct node), GFP_KERNEL);
+	if (!nd)
+		return NULL;
+
+	config_item_init_type_name(&nd->item, name, &node_type);
+	nd->nodeid = -1;
+	nd->weight = 1;  /* default weight of 1 if none is set */
+
+	down(&sp->members_lock);
+	list_add(&nd->list, &sp->members);
+	sp->members_count++;
+	up(&sp->members_lock);
+
+	return &nd->item;
+}
+
+static void drop_node(struct config_group *g, struct config_item *i)
+{
+	struct space *sp = to_space(g->cg_item.ci_parent);
+	struct node *nd = to_node(i);
+
+	down(&sp->members_lock);
+	list_del(&nd->list);
+	sp->members_count--;
+	up(&sp->members_lock);
+
+	config_item_put(i);
+}
+
+static void release_node(struct config_item *i)
+{
+	struct node *nd = to_node(i);
+	kfree(nd);
+}
+
+static struct clusters clusters_root = {
+	.subsys = {
+		.su_group = {
+			.cg_item = {
+				.ci_namebuf = "dlm",
+				.ci_type = &clusters_type,
+			},
+		},
+	},
+};
+
+int dlm_config_init(void)
+{
+	config_group_init(&clusters_root.subsys.su_group);
+	init_MUTEX(&clusters_root.subsys.su_sem);
+	return configfs_register_subsystem(&clusters_root.subsys);
+}
+
+void dlm_config_exit(void)
+{
+	configfs_unregister_subsystem(&clusters_root.subsys);
+}
+
+/*
+ * Functions for user space to read/write attributes
+ */
+
+static ssize_t show_comm(struct config_item *i, struct configfs_attribute *a,
+			 char *buf)
+{
+	struct comm *cm = to_comm(i);
+	struct comm_attribute *cma =
+			container_of(a, struct comm_attribute, attr);
+	return cma->show ? cma->show(cm, buf) : 0;
+}
+
+static ssize_t store_comm(struct config_item *i, struct configfs_attribute *a,
+			  const char *buf, size_t len)
+{
+	struct comm *cm = to_comm(i);
+	struct comm_attribute *cma =
+		container_of(a, struct comm_attribute, attr);
+	return cma->store ? cma->store(cm, buf, len) : -EINVAL;
+}
+
+static ssize_t comm_nodeid_read(struct comm *cm, char *buf)
+{
+	return sprintf(buf, "%d\n", cm->nodeid);
+}
+
+static ssize_t comm_nodeid_write(struct comm *cm, const char *buf, size_t len)
+{
+	cm->nodeid = simple_strtol(buf, NULL, 0);
+	return len;
+}
+
+static ssize_t comm_local_read(struct comm *cm, char *buf)
+{
+	return sprintf(buf, "%d\n", cm->local);
+}
+
+static ssize_t comm_local_write(struct comm *cm, const char *buf, size_t len)
+{
+	cm->local= simple_strtol(buf, NULL, 0);
+	if (cm->local && !local_comm)
+		local_comm = cm;
+	return len;
+}
+
+static ssize_t comm_addr_write(struct comm *cm, const char *buf, size_t len)
+{
+	struct sockaddr_storage *addr;
+
+	if (len != sizeof(struct sockaddr_storage))
+		return -EINVAL;
+
+	if (cm->addr_count >= DLM_MAX_ADDR_COUNT)
+		return -ENOSPC;
+
+	addr = kzalloc(sizeof(*addr), GFP_KERNEL);
+	if (!addr)
+		return -ENOMEM;
+
+	memcpy(addr, buf, len);
+	cm->addr[cm->addr_count++] = addr;
+	return len;
+}
+
+static ssize_t show_node(struct config_item *i, struct configfs_attribute *a,
+			 char *buf)
+{
+	struct node *nd = to_node(i);
+	struct node_attribute *nda =
+			container_of(a, struct node_attribute, attr);
+	return nda->show ? nda->show(nd, buf) : 0;
+}
+
+static ssize_t store_node(struct config_item *i, struct configfs_attribute *a,
+			  const char *buf, size_t len)
+{
+	struct node *nd = to_node(i);
+	struct node_attribute *nda =
+		container_of(a, struct node_attribute, attr);
+	return nda->store ? nda->store(nd, buf, len) : -EINVAL;
+}
+
+static ssize_t node_nodeid_read(struct node *nd, char *buf)
+{
+	return sprintf(buf, "%d\n", nd->nodeid);
+}
+
+static ssize_t node_nodeid_write(struct node *nd, const char *buf, size_t len)
+{
+	nd->nodeid = simple_strtol(buf, NULL, 0);
+	return len;
+}
+
+static ssize_t node_weight_read(struct node *nd, char *buf)
+{
+	return sprintf(buf, "%d\n", nd->weight);
+}
+
+static ssize_t node_weight_write(struct node *nd, const char *buf, size_t len)
+{
+	nd->weight = simple_strtol(buf, NULL, 0);
+	return len;
+}
+
+/*
+ * Functions for the dlm to get the info that's been configured
+ */
+
+static struct space *get_space(char *name)
+{
+	if (!space_list)
+		return NULL;
+	return to_space(config_group_find_obj(space_list, name));
+}
+
+static void put_space(struct space *sp)
+{
+	config_item_put(&sp->group.cg_item);
+}
+
+static struct comm *get_comm(int nodeid, struct sockaddr_storage *addr)
+{
+	struct config_item *i;
+	struct comm *cm;
+	int found = 0;
+
+	if (!comm_list)
+		return NULL;
+
+	list_for_each_entry(i, &comm_list->cg_children, ci_entry) {
+		cm = to_comm(i);
+
+		if (nodeid) {
+			if (cm->nodeid != nodeid)
+				continue;
+			found = 1;
+			break;
+		} else {
+			if (!cm->addr_count ||
+			    memcmp(cm->addr[0], addr, sizeof(*addr)))
+				continue;
+			found = 1;
+			break;
+		}
+	}
+
+	if (found)
+		config_item_get(i);
+	else
+		cm = NULL;
+	return cm;
+}
+
+static void put_comm(struct comm *cm)
+{
+	config_item_put(&cm->item);
+}
+
+/* caller must free mem */
+int dlm_nodeid_list(char *lsname, int **ids_out)
+{
+	struct space *sp;
+	struct node *nd;
+	int i = 0, rv = 0;
+	int *ids;
+
+	sp = get_space(lsname);
+	if (!sp)
+		return -EEXIST;
+
+	down(&sp->members_lock);
+	if (!sp->members_count) {
+		rv = 0;
+		goto out;
+	}
+
+	ids = kcalloc(sp->members_count, sizeof(int), GFP_KERNEL);
+	if (!ids) {
+		rv = -ENOMEM;
+		goto out;
+	}
+
+	rv = sp->members_count;
+	list_for_each_entry(nd, &sp->members, list)
+		ids[i++] = nd->nodeid;
+
+	if (rv != i)
+		printk("bad nodeid count %d %d\n", rv, i);
+
+	*ids_out = ids;
+ out:
+	up(&sp->members_lock);
+	put_space(sp);
+	return rv;
+}
+
+int dlm_node_weight(char *lsname, int nodeid)
+{
+	struct space *sp;
+	struct node *nd;
+	int w = -EEXIST;
+
+	sp = get_space(lsname);
+	if (!sp)
+		goto out;
+
+	down(&sp->members_lock);
+	list_for_each_entry(nd, &sp->members, list) {
+		if (nd->nodeid != nodeid)
+			continue;
+		w = nd->weight;
+		break;
+	}
+	up(&sp->members_lock);
+	put_space(sp);
+ out:
+	return w;
+}
+
+int dlm_nodeid_to_addr(int nodeid, struct sockaddr_storage *addr)
+{
+	struct comm *cm = get_comm(nodeid, NULL);
+	if (!cm)
+		return -EEXIST;
+	if (!cm->addr_count)
+		return -ENOENT;
+	memcpy(addr, cm->addr[0], sizeof(*addr));
+	put_comm(cm);
+	return 0;
+}
+
+int dlm_addr_to_nodeid(struct sockaddr_storage *addr, int *nodeid)
+{
+	struct comm *cm = get_comm(0, addr);
+	if (!cm)
+		return -EEXIST;
+	*nodeid = cm->nodeid;
+	put_comm(cm);
+	return 0;
+}
+
+int dlm_our_nodeid(void)
+{
+	return local_comm ? local_comm->nodeid : 0;
+}
+
+/* num 0 is first addr, num 1 is second addr */
+int dlm_our_addr(struct sockaddr_storage *addr, int num)
+{
+	if (!local_comm)
+		return -1;
+	if (num + 1 > local_comm->addr_count)
+		return -1;
+	memcpy(addr, local_comm->addr[num], sizeof(*addr));
+	return 0;
+}
+
 /* Config file defaults */
 #define DEFAULT_TCP_PORT       21064
 #define DEFAULT_BUFFER_SIZE     4096
@@ -35,13 +782,3 @@ struct dlm_config_info dlm_config = {
 	.scan_secs = DEFAULT_SCAN_SECS
 };
 
-int dlm_config_init(void)
-{
-	/* FIXME: hook the config values into sysfs */
-	return 0;
-}
-
-void dlm_config_exit(void)
-{
-}
-
diff -urpN a/drivers/dlm/config.h b/drivers/dlm/config.h
--- a/drivers/dlm/config.h	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/config.h	2005-08-18 13:22:00.720154024 +0800
@@ -14,6 +14,8 @@
 #ifndef __CONFIG_DOT_H__
 #define __CONFIG_DOT_H__
 
+#define DLM_MAX_ADDR_COUNT 3
+
 struct dlm_config_info {
 	int tcp_port;
 	int buffer_size;
@@ -27,8 +29,14 @@ struct dlm_config_info {
 
 extern struct dlm_config_info dlm_config;
 
-extern int dlm_config_init(void);
-extern void dlm_config_exit(void);
+int dlm_config_init(void);
+void dlm_config_exit(void);
+int dlm_node_weight(char *lsname, int nodeid);
+int dlm_nodeid_list(char *lsname, int **ids_out);
+int dlm_nodeid_to_addr(int nodeid, struct sockaddr_storage *addr);
+int dlm_addr_to_nodeid(struct sockaddr_storage *addr, int *nodeid);
+int dlm_our_nodeid(void);
+int dlm_our_addr(struct sockaddr_storage *addr, int num);
 
 #endif				/* __CONFIG_DOT_H__ */
 
diff -urpN a/drivers/dlm/dlm_internal.h b/drivers/dlm/dlm_internal.h
--- a/drivers/dlm/dlm_internal.h	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/dlm_internal.h	2005-08-18 13:22:00.720154024 +0800
@@ -457,8 +457,6 @@ struct dlm_ls {
 	int			ls_low_nodeid;
 	int			ls_total_weight;
 	int			*ls_node_array;
-	int			*ls_nodeids_next;
-	int			ls_nodeids_next_count;
 
 	struct dlm_rsb		ls_stub_rsb;	/* for returning errors */
 	struct dlm_lkb		ls_stub_lkb;	/* for returning errors */
diff -urpN a/drivers/dlm/lockspace.c b/drivers/dlm/lockspace.c
--- a/drivers/dlm/lockspace.c	2005-08-18 12:14:09.000000000 +0800
+++ b/drivers/dlm/lockspace.c	2005-08-18 13:22:00.721153872 +0800
@@ -94,6 +94,11 @@ static struct dlm_ls *find_lockspace_nam
 	return ls;
 }
 
+struct dlm_ls *dlm_find_lockspace_name(char *name, int namelen)
+{
+	return find_lockspace_name(name, namelen);
+}
+
 struct dlm_ls *dlm_find_lockspace_global(uint32_t id)
 {
 	struct dlm_ls *ls;
@@ -261,8 +266,6 @@ static int new_lockspace(char *name, int
 	ls->ls_low_nodeid = 0;
 	ls->ls_total_weight = 0;
 	ls->ls_node_array = NULL;
-	ls->ls_nodeids_next = NULL;
-	ls->ls_nodeids_next_count = 0;
 
 	memset(&ls->ls_stub_rsb, 0, sizeof(struct dlm_rsb));
 	ls->ls_stub_rsb.res_ls = ls;
diff -urpN a/drivers/dlm/lowcomms.c b/drivers/dlm/lowcomms.c
--- a/drivers/dlm/lowcomms.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/lowcomms.c	2005-08-18 13:22:00.722153720 +0800
@@ -50,29 +50,14 @@
 #include <linux/socket.h>
 #include <linux/idr.h>
 
-#include <linux/dlm_node.h>
-
 #include "dlm_internal.h"
 #include "lowcomms.h"
 #include "config.h"
-#include "member.h"
 #include "midcomms.h"
 
 static struct sockaddr_storage *local_addr[DLM_MAX_ADDR_COUNT];
-static int			local_nodeid;
-static int			local_weight;
 static int			local_count;
-static struct list_head		nodes;
-static struct semaphore		nodes_sem;
-
-/* One of these per configured node */
-
-struct dlm_node {
-	struct list_head	list;
-	int			nodeid;
-	int			weight;
-	struct sockaddr_storage	addr;
-};
+static int			local_nodeid;
 
 /* One of these per connected node */
 
@@ -163,89 +148,24 @@ static atomic_t accepting;
 static struct connection sctp_con;
 
 
-static struct dlm_node *search_node(int nodeid)
-{
-	struct dlm_node *node;
-
-	list_for_each_entry(node, &nodes, list) {
-		if (node->nodeid == nodeid)
-			goto out;
-	}
-	node = NULL;
- out:
-	return node;
-}
-
-static struct dlm_node *search_node_addr(struct sockaddr_storage *addr)
-{
-	struct dlm_node *node;
-
-	list_for_each_entry(node, &nodes, list) {
-		if (!memcmp(&node->addr, addr, sizeof(*addr)))
-			goto out;
-	}
-	node = NULL;
- out:
-	return node;
-}
-
-static int _get_node(int nodeid, struct dlm_node **node_ret)
-{
-	struct dlm_node *node;
-	int error = 0;
-
-	node = search_node(nodeid);
-	if (node)
-		goto out;
-
-	node = kmalloc(sizeof(struct dlm_node), GFP_KERNEL);
-	if (!node) {
-		error = -ENOMEM;
-		goto out;
-	}
-	memset(node, 0, sizeof(struct dlm_node));
-	node->nodeid = nodeid;
-	list_add_tail(&node->list, &nodes);
- out:
-	*node_ret = node;
-	return error;
-}
-
-static int addr_to_nodeid(struct sockaddr_storage *addr, int *nodeid)
-{
-	struct dlm_node *node;
-
-	down(&nodes_sem);
-	node = search_node_addr(addr);
-	up(&nodes_sem);
-	if (!node)
-		return -1;
-	*nodeid = node->nodeid;
-	return 0;
-}
-
 static int nodeid_to_addr(int nodeid, struct sockaddr *retaddr)
 {
-	struct dlm_node *node;
-	struct sockaddr_storage *addr;
+	struct sockaddr_storage addr;
+	int error;
 
 	if (!local_count)
 		return -1;
 
-	down(&nodes_sem);
-	node = search_node(nodeid);
-	up(&nodes_sem);
-	if (!node)
-		return -1;
-
-	addr = &node->addr;
+	error = dlm_nodeid_to_addr(nodeid, &addr);
+	if (error)
+		return error;
 
 	if (local_addr[0]->ss_family == AF_INET) {
-	        struct sockaddr_in *in4  = (struct sockaddr_in *) addr;
+	        struct sockaddr_in *in4  = (struct sockaddr_in *) &addr;
 		struct sockaddr_in *ret4 = (struct sockaddr_in *) retaddr;
 		ret4->sin_addr.s_addr = in4->sin_addr.s_addr;
 	} else {
-	        struct sockaddr_in6 *in6  = (struct sockaddr_in6 *) addr;
+	        struct sockaddr_in6 *in6  = (struct sockaddr_in6 *) &addr;
 		struct sockaddr_in6 *ret6 = (struct sockaddr_in6 *) retaddr;
 		memcpy(&ret6->sin6_addr, &in6->sin6_addr,
 		       sizeof(in6->sin6_addr));
@@ -254,67 +174,6 @@ static int nodeid_to_addr(int nodeid, st
 	return 0;
 }
 
-int dlm_node_weight(int nodeid)
-{
-	struct dlm_node *node;
-	int weight = -1;
-
-	down(&nodes_sem);
-	node = search_node(nodeid);
-	if (node)
-		weight = node->weight;
-	up(&nodes_sem);
-	return weight;
-}
-
-int dlm_set_node(int nodeid, int weight, char *addr_buf)
-{
-	struct dlm_node *node;
-	int error;
-
-	down(&nodes_sem);
-	error = _get_node(nodeid, &node);
-	if (!error) {
-		memcpy(&node->addr, addr_buf, sizeof(struct sockaddr_storage));
-		node->weight = weight;
-	}
-	up(&nodes_sem);
-	return error;
-}
-
-int dlm_set_local(int nodeid, int weight, char *addr_buf)
-{
-	struct sockaddr_storage *addr;
-	int i;
-
-	if (local_count > DLM_MAX_ADDR_COUNT - 1) {
-		log_print("too many local addresses set %d", local_count);
-		return -EINVAL;
-	}
-	local_nodeid = nodeid;
-	local_weight = weight;
-
-	addr = kmalloc(sizeof(*addr), GFP_KERNEL);
-	if (!addr)
-		return -ENOMEM;
-	memcpy(addr, addr_buf, sizeof(*addr));
-
-	for (i = 0; i < local_count; i++) {
-		if (!memcmp(local_addr[i], addr, sizeof(*addr))) {
-			kfree(addr);
-			goto out;
-		}
-	}
-	local_addr[local_count++] = addr;
- out:
-	return 0;
-}
-
-int dlm_our_nodeid(void)
-{
-	return local_nodeid;
-}
-
 static struct nodeinfo *nodeid2nodeinfo(int nodeid, int alloc)
 {
 	struct nodeinfo *ni;
@@ -556,7 +415,7 @@ static void process_sctp_notification(st
 				return;
 			}
 			make_sockaddr(&prim.ssp_addr, 0, &addr_len);
-			if (addr_to_nodeid(&prim.ssp_addr, &nodeid)) {
+			if (dlm_addr_to_nodeid(&prim.ssp_addr, &nodeid)) {
 				log_print("reject connect from unknown addr");
 				send_shutdown(prim.ssp_assoc_id);
 				return;
@@ -772,6 +631,24 @@ static int add_bind_addr(struct sockaddr
 	return result;
 }
 
+static void init_local(void)
+{
+	struct sockaddr_storage sas, *addr;
+	int i;
+
+	local_nodeid = dlm_our_nodeid();
+
+	for (i = 0; i < DLM_MAX_ADDR_COUNT - 1; i++) {
+		if (dlm_our_addr(&sas, i))
+			break;
+
+		addr = kmalloc(sizeof(*addr), GFP_KERNEL);
+		if (!addr)
+			break;
+		memcpy(addr, &sas, sizeof(*addr));
+		local_addr[local_count++] = addr;
+	}
+}
 
 /* Initialise SCTP socket and bind to all interfaces */
 static int init_sock(void)
@@ -783,8 +660,11 @@ static int init_sock(void)
 	int result = -EINVAL, num = 1, i, addr_len;
 
 	if (!local_count) {
-		log_print("no local IP address has been set");
-		goto out;
+		init_local();
+		if (!local_count) {
+			log_print("no local IP address has been set");
+			goto out;
+		}
 	}
 
 	result = sock_create_kern(local_addr[0]->ss_family, SOCK_SEQPACKET,
@@ -1323,25 +1203,16 @@ void dlm_lowcomms_stop(void)
 int dlm_lowcomms_init(void)
 {
 	init_waitqueue_head(&lowcomms_recv_wait);
-	INIT_LIST_HEAD(&nodes);
-	init_MUTEX(&nodes_sem);
 	return 0;
 }
 
 void dlm_lowcomms_exit(void)
 {
-	struct dlm_node *node, *safe;
 	int i;
 
 	for (i = 0; i < local_count; i++)
 		kfree(local_addr[i]);
-	local_nodeid = 0;
-	local_weight = 0;
 	local_count = 0;
-
-	list_for_each_entry_safe(node, safe, &nodes, list) {
-		list_del(&node->list);
-		kfree(node);
-	}
+	local_nodeid = 0;
 }
 
diff -urpN a/drivers/dlm/lowcomms.h b/drivers/dlm/lowcomms.h
--- a/drivers/dlm/lowcomms.h	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/lowcomms.h	2005-08-18 13:22:00.722153720 +0800
@@ -20,10 +20,6 @@ int dlm_lowcomms_start(void);
 void dlm_lowcomms_stop(void);
 void *dlm_lowcomms_get_buffer(int nodeid, int len, int allocation, char **ppc);
 void dlm_lowcomms_commit_buffer(void *mh);
-int dlm_set_node(int nodeid, int weight, char *addr_buf);
-int dlm_set_local(int nodeid, int weight, char *addr_buf);
-int dlm_our_nodeid(void);
-int dlm_node_weight(int nodeid);
 
 #endif				/* __LOWCOMMS_DOT_H__ */
 
diff -urpN a/drivers/dlm/main.c b/drivers/dlm/main.c
--- a/drivers/dlm/main.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/main.c	2005-08-18 13:22:00.723153568 +0800
@@ -18,6 +18,7 @@
 #include "device.h"
 #include "memory.h"
 #include "lowcomms.h"
+#include "config.h"
 
 #ifdef CONFIG_DLM_DEBUG
 int dlm_register_debugfs(void);
@@ -27,9 +28,6 @@ static inline int dlm_register_debugfs(v
 static inline void dlm_unregister_debugfs(void) { }
 #endif
 
-int dlm_node_ioctl_init(void);
-void dlm_node_ioctl_exit(void);
-
 static int __init init_dlm(void)
 {
 	int error;
@@ -42,17 +40,17 @@ static int __init init_dlm(void)
 	if (error)
 		goto out_mem;
 
-	error = dlm_node_ioctl_init();
+	error = dlm_member_sysfs_init();
 	if (error)
 		goto out_mem;
 
-	error = dlm_member_sysfs_init();
+	error = dlm_config_init();
 	if (error)
-		goto out_node;
+		goto out_member;
 
 	error = dlm_register_debugfs();
 	if (error)
-		goto out_member;
+		goto out_config;
 
 	error = dlm_lowcomms_init();
 	if (error)
@@ -64,10 +62,10 @@ static int __init init_dlm(void)
 
  out_debug:
 	dlm_unregister_debugfs();
+ out_config:
+	dlm_config_exit();
  out_member:
 	dlm_member_sysfs_exit();
- out_node:
-	dlm_node_ioctl_exit();
  out_mem:
 	dlm_memory_exit();
  out:
@@ -78,7 +76,7 @@ static void __exit exit_dlm(void)
 {
 	dlm_lowcomms_exit();
 	dlm_member_sysfs_exit();
-	dlm_node_ioctl_exit();
+	dlm_config_exit();
 	dlm_memory_exit();
 	dlm_unregister_debugfs();
 }
diff -urpN a/drivers/dlm/member.c b/drivers/dlm/member.c
--- a/drivers/dlm/member.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/member.c	2005-08-18 13:22:00.724153416 +0800
@@ -11,13 +11,13 @@
 ******************************************************************************/
 
 #include "dlm_internal.h"
-#include "member_sysfs.h"
 #include "lockspace.h"
 #include "member.h"
 #include "recoverd.h"
 #include "recover.h"
 #include "lowcomms.h"
 #include "rcom.h"
+#include "config.h"
 
 /*
  * Following called by dlm_recoverd thread
@@ -50,13 +50,18 @@ static void add_ordered_member(struct dl
 static int dlm_add_member(struct dlm_ls *ls, int nodeid)
 {
 	struct dlm_member *memb;
+	int w;
 
 	memb = kmalloc(sizeof(struct dlm_member), GFP_KERNEL);
 	if (!memb)
 		return -ENOMEM;
 
+	w = dlm_node_weight(ls->ls_name, nodeid);
+	if (w < 0)
+		return w;
+
 	memb->nodeid = nodeid;
-	memb->weight = dlm_node_weight(nodeid);
+	memb->weight = w;
 	add_ordered_member(ls, memb);
 	ls->ls_num_nodes++;
 	return 0;
@@ -262,14 +267,19 @@ int dlm_ls_stop(struct dlm_ls *ls)
 
 int dlm_ls_start(struct dlm_ls *ls)
 {
-	struct dlm_recover *rv, *rv_old;
-	int error = 0;
+	struct dlm_recover *rv = NULL, *rv_old;
+	int *ids = NULL;
+	int error, count;
 
 	rv = kmalloc(sizeof(struct dlm_recover), GFP_KERNEL);
 	if (!rv)
 		return -ENOMEM;
 	memset(rv, 0, sizeof(struct dlm_recover));
 
+	error = count = dlm_nodeid_list(ls->ls_name, &ids);
+	if (error <= 0)
+		goto fail;
+
 	spin_lock(&ls->ls_recover_lock);
 
 	/* the lockspace needs to be stopped before it can be started */
@@ -277,22 +287,12 @@ int dlm_ls_start(struct dlm_ls *ls)
 	if (!dlm_locking_stopped(ls)) {
 		spin_unlock(&ls->ls_recover_lock);
 		log_error(ls, "start ignored: lockspace running");
-		kfree(rv);
-		error = -EINVAL;
-		goto out;
-	}
-
-	if (!ls->ls_nodeids_next) {
-		spin_unlock(&ls->ls_recover_lock);
-		log_error(ls, "start ignored: existing nodeids_next");
-		kfree(rv);
 		error = -EINVAL;
-		goto out;
+		goto fail;
 	}
 
-	rv->nodeids = ls->ls_nodeids_next;
-	ls->ls_nodeids_next = NULL;
-	rv->node_count = ls->ls_nodeids_next_count;
+	rv->nodeids = ids;
+	rv->node_count = count;
 	rv->seq = ++ls->ls_recover_seq;
 	rv_old = ls->ls_recover_args;
 	ls->ls_recover_args = rv;
@@ -304,7 +304,11 @@ int dlm_ls_start(struct dlm_ls *ls)
 	}
 
 	dlm_recoverd_kick(ls);
- out:
+	return 0;
+
+ fail:
+	kfree(rv);
+	kfree(ids);
 	return error;
 }
 
diff -urpN a/drivers/dlm/member_sysfs.c b/drivers/dlm/member_sysfs.c
--- a/drivers/dlm/member_sysfs.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/member_sysfs.c	2005-08-18 13:22:00.724153416 +0800
@@ -47,77 +47,10 @@ static ssize_t dlm_id_show(struct dlm_ls
 
 static ssize_t dlm_id_store(struct dlm_ls *ls, const char *buf, size_t len)
 {
-	ls->ls_global_id = simple_strtol(buf, NULL, 0);
+	ls->ls_global_id = simple_strtoul(buf, NULL, 0);
 	return len;
 }
 
-static ssize_t dlm_members_show(struct dlm_ls *ls, char *buf)
-{
-	struct dlm_member *memb;
-	ssize_t ret = 0;
-
-	if (!down_read_trylock(&ls->ls_in_recovery))
-		return -EBUSY;
-	list_for_each_entry(memb, &ls->ls_nodes, list)
-		ret += sprintf(buf+ret, "%u ", memb->nodeid);
-	ret += sprintf(buf+ret, "\n");
-	up_read(&ls->ls_in_recovery);
-	return ret;
-}
-
-static ssize_t dlm_members_store(struct dlm_ls *ls, const char *buf, size_t len)
-{
-	int *nodeids, id, count = 1, i;
-	ssize_t ret = len;
-	char *p, *t;
-
-	/* count number of id's in buf, assumes no trailing spaces */
-	for (i = 0; i < len; i++)
-		if (isspace(buf[i]))
-			count++;
-
-	nodeids = kmalloc(sizeof(int) * count, GFP_KERNEL);
-	if (!nodeids)
-		return -ENOMEM;
-
-	p = kmalloc(len+1, GFP_KERNEL);
-	if (!p) {
-		kfree(nodeids);
-		return -ENOMEM;
-	}
-	memcpy(p, buf, len);
-	p[len+1] = '\0';
-
-	for (i = 0; i < count; i++) {
-		if ((t = strsep(&p, " ")) == NULL)
-			break;
-		if (sscanf(t, "%u", &id) != 1)
-			break;
-		nodeids[i] = id;
-	}
-
-	if (i != count) {
-		kfree(nodeids);
-		ret = -EINVAL;
-		goto out;
-	}
-
-	spin_lock(&ls->ls_recover_lock);
-	if (ls->ls_nodeids_next) {
-		kfree(nodeids);
-		ret = -EINVAL;
-		goto out_unlock;
-	}
-	ls->ls_nodeids_next = nodeids;
-	ls->ls_nodeids_next_count = count;
-
- out_unlock:
-	spin_unlock(&ls->ls_recover_lock);
- out:
-	kfree(p);
-	return ret;
-}
-
 struct dlm_attr {
 	struct attribute attr;
 	ssize_t (*show)(struct dlm_ls *, char *);
@@ -140,17 +73,10 @@ static struct dlm_attr dlm_attr_id = {
 	.store = dlm_id_store
 };
 
-static struct dlm_attr dlm_attr_members = {
-	.attr  = {.name = "members", .mode = S_IRUGO | S_IWUSR},
-	.show  = dlm_members_show,
-	.store = dlm_members_store
-};
-
 static struct attribute *dlm_attrs[] = {
 	&dlm_attr_control.attr,
 	&dlm_attr_event.attr,
 	&dlm_attr_id.attr,
-	&dlm_attr_members.attr,
 	NULL,
 };
 
diff -urpN a/drivers/dlm/node_ioctl.c b/drivers/dlm/node_ioctl.c
--- a/drivers/dlm/node_ioctl.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/node_ioctl.c	1970-01-01 07:30:00.000000000 +0730
@@ -1,126 +0,0 @@
-/******************************************************************************
-*******************************************************************************
-**
-**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
-**
-**  This copyrighted material is made available to anyone wishing to use,
-**  modify, copy, or redistribute it subject to the terms and conditions
-**  of the GNU General Public License v.2.
-**
-*******************************************************************************
-******************************************************************************/
-
-#include <linux/miscdevice.h>
-#include <linux/fs.h>
-
-#include <linux/dlm_node.h>
-
-#include "dlm_internal.h"
-#include "lowcomms.h"
-
-
-static int check_version(unsigned int cmd,
-			 struct dlm_node_ioctl __user *u_param)
-{
-	u32 version[3];
-	int error = 0;
-
-	if (copy_from_user(version, u_param->version, sizeof(version)))
-		return -EFAULT;
-
-	if ((DLM_NODE_VERSION_MAJOR != version[0]) ||
-	    (DLM_NODE_VERSION_MINOR < version[1])) {
-		log_print("node_ioctl: interface mismatch: "
-		          "kernel(%u.%u.%u), user(%u.%u.%u), cmd(%d)",
-		          DLM_NODE_VERSION_MAJOR,
-		          DLM_NODE_VERSION_MINOR,
-		          DLM_NODE_VERSION_PATCH,
-		          version[0], version[1], version[2], cmd);
-		error = -EINVAL;
-	}
-
-	version[0] = DLM_NODE_VERSION_MAJOR;
-	version[1] = DLM_NODE_VERSION_MINOR;
-	version[2] = DLM_NODE_VERSION_PATCH;
-
-	if (copy_to_user(u_param->version, version, sizeof(version)))
-		return -EFAULT;
-	return error;
-}
-
-static int node_ioctl(struct inode *inode, struct file *file,
-	              uint command, ulong u)
-{
-	struct dlm_node_ioctl *k_param;
-	struct dlm_node_ioctl __user *u_param;
-	unsigned int cmd, type;
-	int error;
-
-	u_param = (struct dlm_node_ioctl __user *) u;
-
-	if (!capable(CAP_SYS_ADMIN))
-		return -EACCES;
-
-	type = _IOC_TYPE(command);
-	cmd = _IOC_NR(command);
-
-	if (type != DLM_IOCTL) {
-		log_print("node_ioctl: bad ioctl 0x%x 0x%x 0x%x",
-		          command, type, cmd);
-		return -ENOTTY;
-	}
-
-	error = check_version(cmd, u_param);
-	if (error)
-		return error;
-
-	if (cmd == DLM_NODE_VERSION_CMD)
-		return 0;
-
-	k_param = kmalloc(sizeof(*k_param), GFP_KERNEL);
-	if (!k_param)
-		return -ENOMEM;
-
-	if (copy_from_user(k_param, u_param, sizeof(*k_param))) {
-		kfree(k_param);
-		return -EFAULT;
-	}
-
-	if (cmd == DLM_SET_NODE_CMD)
-		error = dlm_set_node(k_param->nodeid, k_param->weight,
-				     k_param->addr);
-	else if (cmd == DLM_SET_LOCAL_CMD)
-		error = dlm_set_local(k_param->nodeid, k_param->weight,
-				      k_param->addr);
-
-	kfree(k_param);
-	return error;
-}
-
-static struct file_operations node_fops = {
-	.ioctl	= node_ioctl,
-	.owner	= THIS_MODULE,
-};
-
-static struct miscdevice node_misc = {
-	.minor	= MISC_DYNAMIC_MINOR,
-	.name	= DLM_NODE_MISC_NAME,
-	.fops	= &node_fops
-};
-
-int dlm_node_ioctl_init(void)
-{
-	int error;
-
-	error = misc_register(&node_misc);
-	if (error)
-		log_print("node_ioctl: misc_register failed %d", error);
-	return error;
-}
-
-void dlm_node_ioctl_exit(void)
-{
-	if (misc_deregister(&node_misc) < 0)
-		log_print("node_ioctl: misc_deregister failed");
-}
-
diff -urpN a/drivers/dlm/requestqueue.c b/drivers/dlm/requestqueue.c
--- a/drivers/dlm/requestqueue.c	2005-08-17 17:19:22.000000000 +0800
+++ b/drivers/dlm/requestqueue.c	2005-08-18 13:22:00.725153264 +0800
@@ -14,7 +14,7 @@
 #include "member.h"
 #include "lock.h"
 #include "dir.h"
-#include "lowcomms.h"
+#include "config.h"
 
 struct rq_entry {
 	struct list_head list;
diff -urpN a/include/linux/dlm_node.h b/include/linux/dlm_node.h
--- a/include/linux/dlm_node.h	2005-08-17 17:19:23.000000000 +0800
+++ b/include/linux/dlm_node.h	1970-01-01 07:30:00.000000000 +0730
@@ -1,44 +0,0 @@
-/******************************************************************************
-*******************************************************************************
-**
-**  Copyright (C) 2005 Red Hat, Inc.  All rights reserved.
-**
-**  This copyrighted material is made available to anyone wishing to use,
-**  modify, copy, or redistribute it subject to the terms and conditions
-**  of the GNU General Public License v.2.
-**
-*******************************************************************************
-******************************************************************************/
-
-#ifndef __DLM_NODE_DOT_H__
-#define __DLM_NODE_DOT_H__
-
-#define DLM_ADDR_LEN			256
-#define DLM_MAX_ADDR_COUNT		3
-#define DLM_NODE_MISC_NAME		"dlm-node"
-
-#define DLM_NODE_VERSION_MAJOR		1
-#define DLM_NODE_VERSION_MINOR		0
-#define DLM_NODE_VERSION_PATCH		0
-
-struct dlm_node_ioctl {
-	__u32	version[3];
-	int	nodeid;
-	int	weight;
-	char	addr[DLM_ADDR_LEN];
-};
-
-enum {
-	DLM_NODE_VERSION_CMD = 0,
-	DLM_SET_NODE_CMD,
-	DLM_SET_LOCAL_CMD,
-};
-
-#define DLM_IOCTL			0xd1
-
-#define DLM_NODE_VERSION _IOWR(DLM_IOCTL, DLM_NODE_VERSION_CMD, struct dlm_node_ioctl)
-#define DLM_SET_NODE     _IOWR(DLM_IOCTL, DLM_SET_NODE_CMD, struct dlm_node_ioctl)
-#define DLM_SET_LOCAL    _IOWR(DLM_IOCTL, DLM_SET_LOCAL_CMD, struct dlm_node_ioctl)
-
-#endif
-



From nish.aravamudan at gmail.com  Thu Aug 18 06:29:03 2005
From: nish.aravamudan at gmail.com (Nish Aravamudan)
Date: Wed, 17 Aug 2005 23:29:03 -0700
Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs
In-Reply-To: <20050818060750.GA10133@redhat.com>
References: <20050818060750.GA10133@redhat.com>
Message-ID: <29495f1d05081723293c2bd337@mail.gmail.com>

On 8/17/05, David Teigland <teigland at redhat.com> wrote:
> Use configfs to configure lockspace members and node addresses.  This was
> previously done with sysfs and ioctl.
> 
> Signed-off-by: David Teigland <teigland at redhat.com>

Are you the official maintainer of the DLM subsystem? Could you submit
a patch to add a MAINTAINERS entry? I was looking for a maintainer to
send the dlm portion of my schedule_timeout() fixes to, but there
wasn't one listed.

Thanks,
Nish



From treddy at rallydev.com  Thu Aug 18 20:16:49 2005
From: treddy at rallydev.com (Tarun Reddy)
Date: Thu, 18 Aug 2005 14:16:49 -0600
Subject: [Linux-cluster] RHEL/RHCS3: /usr/lib/clumanager/services/service
	status # stays up
Message-ID: <DCF764FF-09C6-4C6F-B7CE-973E4708BFCE@rallydev.com>

I'm running a new instance of RHCS on RHEL3 and am having an issue  
where I get many instances (over how many ever days of the machine  
running) of the following:

/bin/bash /usr/lib/clumanager/services/service status 1

all showing up when I do a ps auxww | grep status. The number on the  
end changes and is not always the same but currently on my system I  
have status 1 and status 0 both "stuck" running. These happen to be  
checks for mysql (1) and httpd (0), both of which are using standard  
redhat startup/shutdown/status scripts.

If I kill them, the service that it is associated with restarts  
thinking that the result didn't return back correctly. Not desirable  
since the service is actually up.

The number of occurrences has been greatly reduced since I increased  
the time between checks from 1 to 10. I didn't realize it was in  
seconds (RTFM) and so I'll probably boost that up to 30 or 60 seconds.

Anyway, in an attempt to debug this, I started a while loop that  
called the above statement with a -x after the bash and found that  
the command occassionally hangs at

+ retVal=0
+ '[' -n 6 -a -n 5 -a 6 -le 5 ']'
+ return 3
+ return 0
+ rm -f /tmp/cluster-httpd_status.z16209
+ ip status 0


Anybody venture a guess as to why this might be occurring? And are my  
check intervals too low?

Thanks,
Tarun



From treddy at rallydev.com  Thu Aug 18 20:19:47 2005
From: treddy at rallydev.com (Tarun Reddy)
Date: Thu, 18 Aug 2005 14:19:47 -0600
Subject: [Linux-cluster] RHEL/RHCS3: many pipe files keep building up in /tmp
Message-ID: <F74FE087-B326-4CDE-936E-4B376D4BEB7F@rallydev.com>

Under a new install of RHCS, after the system has been up for a while  
I get tons of files with a name in the format of sh-np-1124010202  
(the numbers change) in /tmp. They are all pipe files like this:

prw-------    1 root     root            0 Aug 14 09:42 sh-np-1124010202

I can't find anything attaching to them with lsof | grep sh-np  
however the only show up after I've started up clustering.

Can anyone explain these files and if they are supposed to be cleaned  
up somehow?

Thanks,
Tarun



From mark.fasheh at oracle.com  Thu Aug 18 21:23:48 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Thu, 18 Aug 2005 14:23:48 -0700
Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs
In-Reply-To: <20050818060750.GA10133@redhat.com>
References: <20050818060750.GA10133@redhat.com>
Message-ID: <20050818212348.GW21228@ca-server1.us.oracle.com>

Hi David,

On Thu, Aug 18, 2005 at 02:07:50PM +0800, David Teigland wrote:
> +/*
> + * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/nodeid
> + * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/weight
> + * /config/dlm/<cluster>/comms/<comm>/nodeid
> + * /config/dlm/<cluster>/comms/<comm>/local
> + * /config/dlm/<cluster>/comms/<comm>/addr
> + * The <cluster> level is useless, but I haven't figured out how to avoid it.
> + */
	So what happened to factoring out the common parts of
ocfs2_nodemanager? I was quite a big fan of that approach :) Or am I just
misunderstanding what these patches do?
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From ocrete at max-t.com  Thu Aug 18 21:32:30 2005
From: ocrete at max-t.com (Olivier Crete)
Date: Thu, 18 Aug 2005 17:32:30 -0400
Subject: [Linux-cluster] zero vote node with cman
Message-ID: <1124400750.12024.52.camel@cocagne.max-t.internal>

Hi,

We're using cman from the STABLE branch and we're pretty satisfied.. But
there is one thing that I dont seem to be able to get working. In a
client-server application, I would like the client nodes to be able to
take actions when the system becomes inquorate or a server dies, but not
could towards the quorum.

I tried setting the votes to 0, but it seems that it wont let me do it..
Is there another solution?


-- 
Olivier Cr?te
ocrete at max-t.com
Maximum Throughput Inc.




From mikore.li at gmail.com  Fri Aug 19 05:59:05 2005
From: mikore.li at gmail.com (Michael)
Date: Fri, 19 Aug 2005 13:59:05 +0800
Subject: [Linux-cluster] Cache in GFS?
Message-ID: <bc5727090508182259677b2985@mail.gmail.com>

Hi, 

Is there cache principle in GFS client? Can I increase read/write
performance by adding much more kernel memory cache for GFS?

Thanks,

Q.L



From teigland at redhat.com  Fri Aug 19 07:13:44 2005
From: teigland at redhat.com (David Teigland)
Date: Fri, 19 Aug 2005 15:13:44 +0800
Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs
In-Reply-To: <20050818212348.GW21228@ca-server1.us.oracle.com>
References: <20050818060750.GA10133@redhat.com>
	<20050818212348.GW21228@ca-server1.us.oracle.com>
Message-ID: <20050819071344.GB10864@redhat.com>

On Thu, Aug 18, 2005 at 02:23:48PM -0700, Mark Fasheh wrote:
> On Thu, Aug 18, 2005 at 02:07:50PM +0800, David Teigland wrote:

> > + * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/nodeid
> > + * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/weight
> > + * /config/dlm/<cluster>/comms/<comm>/nodeid
> > + * /config/dlm/<cluster>/comms/<comm>/local
> > + * /config/dlm/<cluster>/comms/<comm>/addr
>
> So what happened to factoring out the common parts of ocfs2_nodemanager?
> I was quite a big fan of that approach :) Or am I just misunderstanding
> what these patches do?

The nodemanager RFC I sent a month ago
  http://marc.theaimsgroup.com/?l=linux-kernel&m=112166723919347&w=2

amounts to half of dlm/config.c (everything under comms/ above) moved into
a separate kernel module.  That would be trivial to do, and is still an
option to bat around.

I question whether factoring such a small chunk into a separate module is
really worth it, though?  Making all of config.c (all of /config/dlm/
above) into a separate module wouldn't seem quite so strange.  It would
require just a few lines of code to turn it into a stand alone module.

Dave



From pcaulfie at redhat.com  Fri Aug 19 07:19:58 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 19 Aug 2005 08:19:58 +0100
Subject: [Linux-cluster] zero vote node with cman
In-Reply-To: <1124400750.12024.52.camel@cocagne.max-t.internal>
References: <1124400750.12024.52.camel@cocagne.max-t.internal>
Message-ID: <4305881E.5000106@redhat.com>

Olivier Crete wrote:
> Hi,
> 
> We're using cman from the STABLE branch and we're pretty satisfied.. But
> there is one thing that I dont seem to be able to get working. In a
> client-server application, I would like the client nodes to be able to
> take actions when the system becomes inquorate or a server dies, but not
> could towards the quorum.
> 
> I tried setting the votes to 0, but it seems that it wont let me do it..
> Is there another solution?
> 
> 

It seems to be a bug in cman_tool that's overriding the votes rather
over-enthusiastically.

This patch should fix:

Index: main.c
===================================================================
RCS file: /cvs/cluster/cluster/cman/cman_tool/main.c,v
retrieving revision 1.12.2.7
diff -u -p -r1.12.2.7 main.c
--- main.c      21 Mar 2005 16:17:06 -0000      1.12.2.7
+++ main.c      19 Aug 2005 07:18:47 -0000
@@ -552,7 +552,7 @@ static void check_arguments(commandline_
        if (!comline->clustername[0])
                die("cluster name not set");

-       if (!comline->votes)
+       if (!comline->votes_opt && comline->no_ccs)
                comline->votes = DEFAULT_VOTES;

        if (!comline->port)


-- 

patrick



From Tarun.Reddy at rallydev.com  Thu Aug 18 20:12:25 2005
From: Tarun.Reddy at rallydev.com (Tarun Reddy)
Date: Thu, 18 Aug 2005 14:12:25 -0600
Subject: [Linux-cluster] RHEL/RHCS3: /usr/lib/clumanager/services/service
	status # stays up
Message-ID: <437DFE3B-80E4-4D14-A4A1-DDE56BD2ED5B@rallydev.com>

I'm running a new instance of RHCS on RHEL3 and am having an issue  
where I get many instances (over how many ever days of the machine  
running) of the following:

/bin/bash /usr/lib/clumanager/services/service status 1

all showing up when I do a ps auxww | grep status. The number on the  
end changes and is not always the same but currently on my system I  
have status 1 and status 0 both "stuck" running. These happen to be  
checks for mysql (1) and httpd (0), both of which are using standard  
redhat startup/shutdown/status scripts.

If I kill them, the service that it is associated with restarts  
thinking that the result didn't return back correctly. Not desirable  
since the service is actually up.

The number of occurrences has been greatly reduced since I increased  
the time between checks from 1 to 10. I didn't realize it was in  
seconds (RTFM) and so I'll probably boost that up to 30 or 60 seconds.

Anyway, in an attempt to debug this, I started a while loop that  
called the above statement with a -x after the bash and found that  
the command occassionally hangs at

+ retVal=0
+ '[' -n 6 -a -n 5 -a 6 -le 5 ']'
+ return 3
+ return 0
+ rm -f /tmp/cluster-httpd_status.z16209
+ ip status 0


Anybody venture a guess as to why this might be occurring? And are my  
check intervals too low?

Thanks,
Tarun



From forgue at oakland.edu  Fri Aug 19 15:43:50 2005
From: forgue at oakland.edu (Andrew Forgue)
Date: Fri, 19 Aug 2005 11:43:50 -0400
Subject: [Linux-cluster] Trying to compile dlm-kernel-2.6.9-34.0.src.rpm
Message-ID: <1124466230.15510.7.camel@localhost.localdomain>

Hello,

I'm trying to build the dlm-kernel SRPM and I'm getting this error when
building the SMP module of DLM.

+ /lib/modules/2.6.9-11.ELsmp/build//scripts/mod/modpost -m
-i /lib/modules/2.6.9-11.ELsmp/kernel/cluster/cman.symvers src/dlm.o -o
dlm.s ymvers
src/dlm.o: No such file or directory
/var/tmp/rpm-tmp.45995: line 31: 31466 Aborted
$kernel_src/scripts/mod/modpost -m -i /lib/modules/2.6.9-11.EL
$flavor/kern el/cluster/cman.symvers src/dlm.o -o dlm.symvers
error: Bad exit status from /var/tmp/rpm-tmp.45995 (%build)


This is on a RHEL-4 machine that's completely up to date.


Thanks, 
Andrew

Complete Output:


[forgue at server02 SPECS]$ sudo rpmbuild -bb --target=i686 dlm-kernel.spec
Building target platforms: i686
Building for target i686
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.45995
+ umask 022
+ cd /usr/src/redhat/BUILD
+ LANG=C
+ export LANG
+ unset DISPLAY
+ cd /usr/src/redhat/BUILD
+ rm -rf dlm-kernel-2.6.9-34
+ /usr/bin/gzip -dc /usr/src/redhat/SOURCES/dlm-kernel-2.6.9-34.tar.gz
+ tar -xf -
+ STATUS=0
+ '[' 0 -ne 0 ']'
+ cd dlm-kernel-2.6.9-34
++ /usr/bin/id -u
+ '[' 0 = 0 ']'
+ /bin/chown -Rhf root .
++ /usr/bin/id -u
+ '[' 0 = 0 ']'
+ /bin/chgrp -Rhf root .
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ sed -i -e '/RELEASE_NAME/s/"<CVS>"/"2.6.9-34.0"/' src/dlm_internal.h
+ exit 0
Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.45995
+ umask 022
+ cd /usr/src/redhat/BUILD
+ cd dlm-kernel-2.6.9-34
+ LANG=C
+ export LANG
+ unset DISPLAY
++ pwd
+ cp -r /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34 ../smp
++ pwd
+ cp -r /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34 ../hugemem
+ Build_dlm i686
+ cpu_type=i686
+ flavor=
+ kernel_src=/lib/modules/2.6.9-11.EL/build/
+ '[' -d /lib/modules/2.6.9-11.EL/build//. ']'
+ echo 'Kernel 2.6.9-11.EL found.'
Kernel 2.6.9-11.EL found.
+ echo /lib/modules/2.6.9-11.EL/build/
/lib/modules/2.6.9-11.EL/build/
+ ./configure --kernel_src=/lib/modules/2.6.9-11.EL/build/
--incdir=/usr/include

Configuring Makefiles for your system...
Completed Makefile configuration

+ make symverfile=/lib/modules/2.6.9-11.EL/kernel/cluster/cman.symvers
cd src && make all
make[1]: Entering directory
`/usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src'
if [ ! -e cluster ]; then ln -s . cluster; fi
if [ ! -e service.h ]; then cp //usr/include/cluster/service.h .; fi
if [ ! -e cnxman.h ]; then cp //usr/include/cluster/cnxman.h .; fi
if [ ! -e cnxman-socket.h ]; then
cp //usr/include/cluster/cnxman-socket.h .; fi
make -C /lib/modules/2.6.9-11.EL/build/
M=/usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src modules USING_KBUILD=yes
make[2]: Entering directory `/usr/src/kernels/2.6.9-11.EL-i686'
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/ast.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/config.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/device.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dir.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/lkb.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/locking.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/lockqueue.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/lockspace.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/lowcomms.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/main.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/memory.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/midcomms.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/nodes.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/proc.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/queries.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/rebuild.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/reccomms.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/recover.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/recoverd.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/rsb.o
  CC [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/util.o
  LD [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dlm.o
  Building modules, stage 2.
  MODPOST
  CC      /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dlm.mod.o
  LD [M]  /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dlm.ko
make[2]: Leaving directory `/usr/src/kernels/2.6.9-11.EL-i686'
make[1]: Leaving directory
`/usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src'
+ /lib/modules/2.6.9-11.EL/build//scripts/mod/modpost -m
-i /lib/modules/2.6.9-11.EL/kernel/cluster/cman.symvers src/dlm.o -o
dlm.symvers
+ cd ../smp
+ Build_dlm i686 smp
+ cpu_type=i686
+ flavor=smp
+ kernel_src=/lib/modules/2.6.9-11.ELsmp/build/
+ '[' -d /lib/modules/2.6.9-11.ELsmp/build//. ']'
+ echo 'Kernel 2.6.9-11.EL found.'
Kernel 2.6.9-11.EL found.
+ echo /lib/modules/2.6.9-11.ELsmp/build/
/lib/modules/2.6.9-11.ELsmp/build/
+ ./configure --kernel_src=/lib/modules/2.6.9-11.ELsmp/build/
--incdir=/usr/include

Configuring Makefiles for your system...
Completed Makefile configuration

+ make
symverfile=/lib/modules/2.6.9-11.ELsmp/kernel/cluster/cman.symvers
cd src && make all
make[1]: Entering directory `/usr/src/redhat/BUILD/smp/src'
rm -f cluster
ln -s . cluster
make -C /lib/modules/2.6.9-11.ELsmp/build/
M=/usr/src/redhat/BUILD/smp/src modules USING_KBUILD=yes
make[2]: Entering directory `/usr/src/kernels/2.6.9-11.EL-smp-i686'
  Building modules, stage 2.
  MODPOST
make[2]: Leaving directory `/usr/src/kernels/2.6.9-11.EL-smp-i686'
make[1]: Leaving directory `/usr/src/redhat/BUILD/smp/src'
+ /lib/modules/2.6.9-11.ELsmp/build//scripts/mod/modpost -m
-i /lib/modules/2.6.9-11.ELsmp/kernel/cluster/cman.symvers src/dlm.o -o
dlm.s ymvers
src/dlm.o: No such file or directory
/var/tmp/rpm-tmp.45995: line 31: 31466 Aborted
$kernel_src/scripts/mod/modpost -m -i /lib/modules/2.6.9-11.EL
$flavor/kern el/cluster/cman.symvers src/dlm.o -o dlm.symvers
error: Bad exit status from /var/tmp/rpm-tmp.45995 (%build)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050819/d6286bee/attachment.sig>

From ocrete at max-t.com  Fri Aug 19 15:51:24 2005
From: ocrete at max-t.com (Olivier Crete)
Date: Fri, 19 Aug 2005 11:51:24 -0400
Subject: [Linux-cluster] zero vote node with cman
In-Reply-To: <4305881E.5000106@redhat.com>
References: <1124400750.12024.52.camel@cocagne.max-t.internal>
	<4305881E.5000106@redhat.com>
Message-ID: <1124466684.12024.58.camel@cocagne.max-t.internal>

On Fri, 2005-19-08 at 08:19 +0100, Patrick Caulfield wrote:
> Olivier Crete wrote:
> > Hi,
> > 
> > We're using cman from the STABLE branch and we're pretty satisfied.. But
> > there is one thing that I dont seem to be able to get working. In a
> > client-server application, I would like the client nodes to be able to
> > take actions when the system becomes inquorate or a server dies, but not
> > could towards the quorum.
> > 
> > I tried setting the votes to 0, but it seems that it wont let me do it..
> > Is there another solution?
> > 
> > 
> 
> It seems to be a bug in cman_tool that's overriding the votes rather
> over-enthusiastically.
> 
> This patch should fix:

Actually it doesnt.. it sets the default to 0... the attached patch
seems to work better. 


-- 
Olivier Cr?te
ocrete at max-t.com
Maximum Throughput Inc.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cman_tool-zero-vote.patch
Type: text/x-patch
Size: 632 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050819/f75067e7/attachment.bin>

From Joel.Becker at oracle.com  Sat Aug 20 00:40:11 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Fri, 19 Aug 2005 17:40:11 -0700
Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs
In-Reply-To: <20050818210747.GC22742@insight>
References: <20050818060750.GA10133@redhat.com>
	<20050817232218.56a06fd6.akpm@osdl.org>
	<20050818210747.GC22742@insight>
Message-ID: <20050820004011.GF4100@insight.us.oracle.com>

On Thu, Aug 18, 2005 at 02:07:47PM -0700, Joel Becker wrote:
> On Wed, Aug 17, 2005 at 11:22:18PM -0700, Andrew Morton wrote:
> > Fair enough.  This really means that the configfs patch should be split out
> > of the ocfs2 megapatch...
> 
> 	Easy to do, it's a separate commit in the ocfs2.git repository.
> Would you rather
> 
> 	a) Do the diffs yourself (configfs commit, remaining ocfs2 commits)
> 	b) Have two repositories, configfs.git and ocfs2.git, where
> 	   ocfs2.git is configfs.git+ocfs2
> 	c) Just take the configfs patch (which really hasn't changed in
> 	   months)

	Well, I included the patch in my last email.  For the latest
spin, I've created http://oss.oracle.com/git/configfs.git.  The ocfs2
git repositories (http://oss.oracle.com/git/ocfs2-dev.git,
http://oss.oracle.com/git/ocfs2.git) are now based on the configfs one.
	If there's any other way you want me to do it, let me know.

Joel

-- 

"If the human brain were so simple we could understand it, we would
 be so simple that we could not."
	- W. A. Clouston

			http://www.jlbec.org/
			jlbec at evilplan.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050819/8f24d941/attachment.sig>

From Joel.Becker at oracle.com  Fri Aug 19 15:09:09 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Fri, 19 Aug 2005 08:09:09 -0700
Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs
In-Reply-To: <20050817232218.56a06fd6.akpm@osdl.org>
References: <20050818060750.GA10133@redhat.com>
	<20050817232218.56a06fd6.akpm@osdl.org>
Message-ID: <20050819150909.GA18991@ca-server1.us.oracle.com>

On Wed, Aug 17, 2005 at 11:22:18PM -0700, Andrew Morton wrote:
> David Teigland <teigland at redhat.com> wrote:
> >
> > Use configfs to configure lockspace members and node addresses.  This was
> >  previously done with sysfs and ioctl.
> 
> Fair enough.  This really means that the configfs patch should be split out
> of the ocfs2 megapatch...

	Easy to do, it's a separate commit in the ocfs2.git repository.
Would you rather

	a) Do the diffs yourself (configfs commit, remaining ocfs2 commits)
	b) Have two repositories, configfs.git and ocfs2.git, where
	   ocfs2.git is configfs.git+ocfs2
	c) Just take the configfs patch (which really hasn't changed in
	   months)


Joel

-------------------------------------------

[PATCH] configfs: Userspace-driven configuration filesystem

Configfs, a file system for userspace-driven kernel object configuration.
The OCFS2 stack makes extensive use of this for propagation of cluster
configuration information into kernel.

Signed-off-by: Joel Becker <joel.becker at oracle.com>
---

diff -ruN linux-2.6.15-rc6.old/Documentation/filesystems/00-INDEX linux-2.6.13-rc6/Documentation/filesystems/00-INDEX
--- linux-2.6.13-rc6.old/Documentation/filesystems/00-INDEX	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/Documentation/filesystems/00-INDEX	2005-08-12 18:09:41.923178911 -0700
@@ -12,6 +12,8 @@
 	- description of the CIFS filesystem
 coda.txt
 	- description of the CODA filesystem.
+configfs/
+	- directory containing configfs documentation and example code.
 cramfs.txt
 	- info on the cram filesystem for small storage (ROMs etc)
 devfs/
diff -ruN linux-2.6.13-rc6.old/Documentation/filesystems/configfs/configfs.txt linux-2.6.13-rc6/Documentation/filesystems/configfs/configfs.txt
--- linux-2.6.13-rc6.old/Documentation/filesystems/configfs/configfs.txt	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/Documentation/filesystems/configfs/configfs.txt	2005-08-12 18:09:41.924178946 -0700
@@ -0,0 +1,434 @@
+
+configfs - Userspace-driven kernel object configuation.
+
+Joel Becker <joel.becker at oracle.com>
+
+Updated: 31 March 2005
+
+Copyright (c) 2005 Oracle Corporation,
+	Joel Becker <joel.becker at oracle.com>
+
+
+[What is configfs?]
+
+configfs is a ram-based filesystem that provides the converse of
+sysfs's functionality.  Where sysfs is a filesystem-based view of
+kernel objects, configfs is a filesystem-based manager of kernel
+objects, or config_items.
+
+With sysfs, an object is created in kernel (for example, when a device
+is discovered) and it is registered with sysfs.  Its attributes then
+appear in sysfs, allowing userspace to read the attributes via
+readdir(3)/read(2).  It may allow some attributes to be modified via
+write(2).  The important point is that the object is created and
+destroyed in kernel, the kernel controls the lifecycle of the sysfs
+representation, and sysfs is merely a window on all this.
+
+A configfs config_item is created via an explicit userspace operation:
+mkdir(2).  It is destroyed via rmdir(2).  The attributes appear at
+mkdir(2) time, and can be read or modified via read(2) and write(2).
+As with sysfs, readdir(3) queries the list of items and/or attributes.
+symlink(2) can be used to group items together.  Unlike sysfs, the
+lifetime of the representation is completely driven by userspace.  The
+kernel modules backing the items must respond to this.
+
+Both sysfs and configfs can and should exist together on the same
+system.  One is not a replacement for the other.
+
+[Using configfs]
+
+configfs can be compiled as a module or into the kernel.  You can access
+it by doing
+
+	mount -t configfs none /config
+
+The configfs tree will be empty unless client modules are also loaded.
+These are modules that register their item types with configfs as
+subsystems.  Once a client subsystem is loaded, it will appear as a
+subdirectory (or more than one) under /config.  Like sysfs, the
+configfs tree is always there, whether mounted on /config or not.
+
+An item is created via mkdir(2).  The item's attributes will also
+appear at this time.  readdir(3) can determine what the attributes are,
+read(2) can query their default values, and write(2) can store new
+values.  Like sysfs, attributes should be ASCII text files, preferably
+with only one value per file.  The same efficiency caveats from sysfs
+apply.  Don't mix more than one attribute in one attribute file.
+
+Like sysfs, configfs expects write(2) to store the entire buffer at
+once.  When writing to configfs attributes, userspace processes should
+first read the entire file, modify the portions they wish to change, and
+then write the entire buffer back.  Attribute files have a maximum size
+of one page (PAGE_SIZE, 4096 on i386).
+
+When an item needs to be destroyed, remove it with rmdir(2).  An
+item cannot be destroyed if any other item has a link to it (via
+symlink(2)).  Links can be removed via unlink(2).
+
+[Configuring FakeNBD: an Example]
+
+Imagine there's a Network Block Device (NBD) driver that allows you to
+access remote block devices.  Call it FakeNBD.  FakeNBD uses configfs
+for its configuration.  Obviously, there will be a nice program that
+sysadmins use to configure FakeNBD, but somehow that program has to tell
+the driver about it.  Here's where configfs comes in.
+
+When the FakeNBD driver is loaded, it registers itself with configfs.
+readdir(3) sees this just fine:
+
+	# ls /config
+	fakenbd
+
+A fakenbd connection can be created with mkdir(2).  The name is
+arbitrary, but likely the tool will make some use of the name.  Perhaps
+it is a uuid or a disk name:
+
+	# mkdir /config/fakenbd/disk1
+	# ls /config/fakenbd/disk1
+	target device rw
+
+The target attribute contains the IP address of the server FakeNBD will
+connect to.  The device attribute is the device on the server.
+Predictably, the rw attribute determines whether the connection is
+read-only or read-write.
+
+	# echo 10.0.0.1 > /config/fakenbd/disk1/target
+	# echo /dev/sda1 > /config/fakenbd/disk1/device
+	# echo 1 > /config/fakenbd/disk1/rw
+
+That's it.  That's all there is.  Now the device is configured, via the
+shell no less.
+
+[Coding With configfs]
+
+Every object in configfs is a config_item.  A config_item reflects an
+object in the subsystem.  It has attributes that match values on that
+object.  configfs handles the filesystem representation of that object
+and its attributes, allowing the subsystem to ignore all but the
+basic show/store interaction.
+
+Items are created and destroyed inside a config_group.  A group is a
+collection of items that share the same attributes and operations.
+Items are created by mkdir(2) and removed by rmdir(2), but configfs
+handles that.  The group has a set of operations to perform these tasks
+
+A subsystem is the top level of a client module.  During initialization,
+the client module registers the subsystem with configfs, the subsystem
+appears as a directory at the top of the configfs filesystem.  A
+subsystem is also a config_group, and can do everything a config_group
+can.
+
+[struct config_item]
+
+	struct config_item {
+		char                    *ci_name;
+		char                    ci_namebuf[UOBJ_NAME_LEN];
+		struct kref             ci_kref;
+		struct list_head        ci_entry;
+		struct config_item      *ci_parent;
+		struct config_group     *ci_group;
+		struct config_item_type *ci_type;
+		struct dentry           *ci_dentry;
+	};
+
+	void config_item_init(struct config_item *);
+	void config_item_init_type_name(struct config_item *,
+					const char *name,
+					struct config_item_type *type);
+	struct config_item *config_item_get(struct config_item *);
+	void config_item_put(struct config_item *);
+
+Generally, struct config_item is embedded in a container structure, a
+structure that actually represents what the subsystem is doing.  The
+config_item portion of that structure is how the object interacts with
+configfs.
+
+Whether statically defined in a source file or created by a parent
+config_group, a config_item must have one of the _init() functions
+called on it.  This initializes the reference count and sets up the
+appropriate fields.
+
+All users of a config_item should have a reference on it via
+config_item_get(), and drop the reference when they are done via
+config_item_put().
+
+By itself, a config_item cannot do much more than appear in configfs.
+Usually a subsystem wants the item to display and/or store attributes,
+among other things.  For that, it needs a type.
+
+[struct config_item_type]
+
+	struct configfs_item_operations {
+		void (*release)(struct config_item *);
+		ssize_t (*show_attribute)(struct config_item *,
+					  struct configfs_attribute *,
+					  char *);
+		ssize_t (*store_attribute)(struct config_item *,
+					   struct configfs_attribute *,
+					   const char *, size_t);
+		int (*allow_link)(struct config_item *src,
+				  struct config_item *target);
+		int (*drop_link)(struct config_item *src,
+				 struct config_item *target);
+	};
+
+	struct config_item_type {
+		struct module                           *ct_owner;
+		struct configfs_item_operations         *ct_item_ops;
+		struct configfs_group_operations        *ct_group_ops;
+		struct configfs_attribute               **ct_attrs;
+	};
+
+The most basic function of a config_item_type is to define what
+operations can be performed on a config_item.  All items that have been
+allocated dynamically will need to provide the ct_item_ops->release()
+method.  This method is called when the config_item's reference count
+reaches zero.  Items that wish to display an attribute need to provide
+the ct_item_ops->show_attribute() method.  Similarly, storing a new
+attribute value uses the store_attribute() method.
+
+[struct configfs_attribute]
+
+	struct configfs_attribute {
+		char                    *ca_name;
+		struct module           *ca_owner;
+		mode_t                  ca_mode;
+	};
+
+When a config_item wants an attribute to appear as a file in the item's
+configfs directory, it must define a configfs_attribute describing it.
+It then adds the attribute to the NULL-terminated array
+config_item_type->ct_attrs.  When the item appears in configfs, the
+attribute file will appear with the configfs_attribute->ca_name
+filename.  configfs_attribute->ca_mode specifies the file permissions.
+
+If an attribute is readable and the config_item provides a
+ct_item_ops->show_attribute() method, that method will be called
+whenever userspace asks for a read(2) on the attribute.  The converse
+will happen for write(2).
+
+[struct config_group]
+
+A config_item cannot live in a vaccum.  The only way one can be created
+is via mkdir(2) on a config_group.  This will trigger creation of a
+child item.
+
+	struct config_group {
+		struct config_item		cg_item;
+		struct list_head		cg_children;
+		struct configfs_subsystem 	*cg_subsys;
+		struct config_group		**default_groups;
+	};
+
+	void config_group_init(struct config_group *group);
+	void config_group_init_type_name(struct config_group *group,
+					 const char *name,
+					 struct config_item_type *type);
+
+
+The config_group structure contains a config_item.  Properly configuring
+that item means that a group can behave as an item in its own right.
+However, it can do more: it can create child items or groups.  This is
+accomplished via the group operations specified on the group's
+config_item_type.
+
+	struct configfs_group_operations {
+		struct config_item *(*make_item)(struct config_group *group,
+						 const char *name);
+		struct config_group *(*make_group)(struct config_group *group,
+						   const char *name);
+		int (*commit_item)(struct config_item *item);
+		void (*drop_item)(struct config_group *group,
+				  struct config_item *item);
+	};
+
+A group creates child items by providing the
+ct_group_ops->make_item() method.  If provided, this method is called from mkdir(2) in the group's directory.  The subsystem allocates a new
+config_item (or more likely, its container structure), initializes it,
+and returns it to configfs.  Configfs will then populate the filesystem
+tree to reflect the new item.
+
+If the subsystem wants the child to be a group itself, the subsystem
+provides ct_group_ops->make_group().  Everything else behaves the same,
+using the group _init() functions on the group.
+
+Finally, when userspace calls rmdir(2) on the item or group,
+ct_group_ops->drop_item() is called.  As a config_group is also a
+config_item, it is not necessary for a seperate drop_group() method.
+The subsystem must config_item_put() the reference that was initialized
+upon item allocation.  If a subsystem has no work to do, it may omit
+the ct_group_ops->drop_item() method, and configfs will call
+config_item_put() on the item on behalf of the subsystem.
+
+IMPORTANT: drop_item() is void, and as such cannot fail.  When rmdir(2)
+is called, configfs WILL remove the item from the filesystem tree
+(assuming that it has no children to keep it busy).  The subsystem is
+responsible for responding to this.  If the subsystem has references to
+the item in other threads, the memory is safe.  It may take some time
+for the item to actually disappear from the subsystem's usage.  But it
+is gone from configfs.
+
+A config_group cannot be removed while it still has child items.  This
+is implemented in the configfs rmdir(2) code.  ->drop_item() will not be
+called, as the item has not been dropped.  rmdir(2) will fail, as the
+directory is not empty.
+
+[struct configfs_subsystem]
+
+A subsystem must register itself, ususally at module_init time.  This
+tells configfs to make the subsystem appear in the file tree.
+
+	struct configfs_subsystem {
+		struct config_group	su_group;
+		struct semaphore	su_sem;
+	};
+
+	int configfs_register_subsystem(struct configfs_subsystem *subsys);
+	void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
+
+	A subsystem consists of a toplevel config_group and a semaphore.
+The group is where child config_items are created.  For a subsystem,
+this group is usually defined statically.  Before calling
+configfs_register_subsystem(), the subsystem must have initialized the
+group via the usual group _init() functions, and it must also have
+initialized the semaphore.
+	When the register call returns, the subsystem is live, and it
+will be visible via configfs.  At that point, mkdir(2) can be called and
+the subsystem must be ready for it.
+
+[An Example]
+
+The best example of these basic concepts is the simple_children
+subsystem/group and the simple_child item in configfs_example.c  It
+shows a trivial object displaying and storing an attribute, and a simple
+group creating and destroying these children.
+
+[Hierarchy Navigation and the Subsystem Semaphore]
+
+There is an extra bonus that configfs provides.  The config_groups and
+config_items are arranged in a hierarchy due to the fact that they
+appear in a filesystem.  A subsystem is NEVER to touch the filesystem
+parts, but the subsystem might be interested in this hierarchy.  For
+this reason, the hierarchy is mirrored via the config_group->cg_children
+and config_item->ci_parent structure members.
+
+A subsystem can navigate the cg_children list and the ci_parent pointer
+to see the tree created by the subsystem.  This can race with configfs'
+management of the hierarchy, so configfs uses the subsystem semaphore to
+protect modifications.  Whenever a subsystem wants to navigate the
+hierarchy, it must do so under the protection of the subsystem
+semaphore.
+
+A subsystem will be prevented from acquiring the semaphore while a newly
+allocated item has not been linked into this hierarchy.   Similarly, it
+will not be able to acquire the semaphore while a dropping item has not
+yet been unlinked.  This means that an item's ci_parent pointer will
+never be NULL while the item is in configfs, and that an item will only
+be in its parent's cg_children list for the same duration.  This allows
+a subsystem to trust ci_parent and cg_children while they hold the
+semaphore.
+
+[Item Aggregation Via symlink(2)]
+
+configfs provides a simple group via the group->item parent/child
+relationship.  Often, however, a larger environment requires aggregation
+outside of the parent/child connection.  This is implemented via
+symlink(2).
+
+A config_item may provide the ct_item_ops->allow_link() and
+ct_item_ops->drop_link() methods.  If the ->allow_link() method exists,
+symlink(2) may be called with the config_item as the source of the link.
+These links are only allowed between configfs config_items.  Any
+symlink(2) attempt outside the configfs filesystem will be denied.
+
+When symlink(2) is called, the source config_item's ->allow_link()
+method is called with itself and a target item.  If the source item
+allows linking to target item, it returns 0.  A source item may wish to
+reject a link if it only wants links to a certain type of object (say,
+in its own subsystem).
+
+When unlink(2) is called on the symbolic link, the source item is
+notified via the ->drop_link() method.  Like the ->drop_item() method,
+this is a void function and cannot return failure.  The subsystem is
+responsible for responding to the change.
+
+A config_item cannot be removed while it links to any other item, nor
+can it be removed while an item links to it.  Dangling symlinks are not
+allowed in configfs.
+
+[Automatically Created Subgroups]
+
+A new config_group may want to have two types of child config_items.
+While this could be codified by magic names in ->make_item(), it is much
+more explicit to have a method whereby userspace sees this divergence.
+
+Rather than have a group where some items behave differently than
+others, configfs provides a method whereby one or many subgroups are
+automatically created inside the parent at its creation.  Thus,
+mkdir("parent) results in "parent", "parent/subgroup1", up through
+"parent/subgroupN".  Items of type 1 can now be created in
+"parent/subgroup1", and items of type N can be created in
+"parent/subgroupN".
+
+These automatic subgroups, or default groups, do not preclude other
+children of the parent group.  If ct_group_ops->make_group() exists,
+other child groups can be created on the parent group directly.
+
+A configfs subsystem specifies default groups by filling in the
+NULL-terminated array default_groups on the config_group structure.
+Each group in that array is populated in the configfs tree at the same
+time as the parent group.  Similarly, they are removed at the same time
+as the parent.  No extra notification is provided.  When a ->drop_item()
+method call notifies the subsystem the parent group is going away, it
+also means every default group child associated with that parent group.
+
+As a consequence of this, default_groups cannot be removed directly via
+rmdir(2).  They also are not considered when rmdir(2) on the parent
+group is checking for children.
+
+[Committable Items]
+
+NOTE: Committable items are currently unimplemented.
+
+Some config_items cannot have a valid initial state.  That is, no
+default values can be specified for the item's attributes such that the
+item can do its work.  Userspace must configure one or more attributes,
+after which the subsystem can start whatever entity this item
+represents.
+
+Consider the FakeNBD device from above.  Without a target address *and*
+a target device, the subsystem has no idea what block device to import.
+The simple example assumes that the subsystem merely waits until all the
+appropriate attributes are configured, and then connects.  This will,
+indeed, work, but now every attribute store must check if the attributes
+are initialized.  Every attribute store must fire off the connection if
+that condition is met.
+
+Far better would be an explicit action notifying the subsystem that the
+config_item is ready to go.  More importantly, an explicit action allows
+the subsystem to provide feedback as to whether the attibutes are
+initialized in a way that makes sense.  configfs provides this as
+committable items.
+
+configfs still uses only normal filesystem operations.  An item is
+committed via rename(2).  The item is moved from a directory where it
+can be modified to a directory where it cannot.
+
+Any group that provides the ct_group_ops->commit_item() method has
+committable items.  When this group appears in configfs, mkdir(2) will
+not work directly in the group.  Instead, the group will have two
+subdirectories: "live" and "pending".  The "live" directory does not
+support mkdir(2) or rmdir(2) either.  It only allows rename(2).  The
+"pending" directory does allow mkdir(2) and rmdir(2).  An item is
+created in the "pending" directory.  Its attributes can be modified at
+will.  Userspace commits the item by renaming it into the "live"
+directory.  At this point, the subsystem recieves the ->commit_item()
+callback.  If all required attributes are filled to satisfaction, the
+method returns zero and the item is moved to the "live" directory.
+
+As rmdir(2) does not work in the "live" directory, an item must be
+shutdown, or "uncommitted".  Again, this is done via rename(2), this
+time from the "live" directory back to the "pending" one.  The subsystem
+is notified by the ct_group_ops->uncommit_object() method.
+
+
diff -ruN linux-2.6.13-rc6.old/Documentation/filesystems/configfs/configfs_example.c linux-2.6.13-rc6/Documentation/filesystems/configfs/configfs_example.c
--- linux-2.6.13-rc6.old/Documentation/filesystems/configfs/configfs_example.c	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/Documentation/filesystems/configfs/configfs_example.c	2005-08-12 18:09:41.925178981 -0700
@@ -0,0 +1,474 @@
+/*
+ * vim: noexpandtab ts=8 sts=0 sw=8:
+ *
+ * configfs_example.c - This file is a demonstration module containing
+ *      a number of configfs subsystems.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on sysfs:
+ * 	sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+#include <linux/configfs.h>
+
+
+
+/*
+ * 01-childless
+ *
+ * This first example is a childless subsystem.  It cannot create
+ * any config_items.  It just has attributes.
+ *
+ * Note that we are enclosing the configfs_subsystem inside a container.
+ * This is not necessary if a subsystem has no attributes directly
+ * on the subsystem.  See the next example, 02-simple-children, for
+ * such a subsystem.
+ */
+
+struct childless {
+	struct configfs_subsystem subsys;
+	int showme;
+	int storeme;
+};
+
+struct childless_attribute {
+	struct configfs_attribute attr;
+	ssize_t (*show)(struct childless *, char *);
+	ssize_t (*store)(struct childless *, const char *, size_t);
+};
+
+static inline struct childless *to_childless(struct config_item *item)
+{
+	return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
+}
+
+static ssize_t childless_showme_read(struct childless *childless,
+				     char *page)
+{
+	ssize_t pos;
+
+	pos = sprintf(page, "%d\n", childless->showme);
+	childless->showme++;
+
+	return pos;
+}
+
+static ssize_t childless_storeme_read(struct childless *childless,
+				      char *page)
+{
+	return sprintf(page, "%d\n", childless->storeme);
+}
+
+static ssize_t childless_storeme_write(struct childless *childless,
+				       const char *page,
+				       size_t count)
+{
+	unsigned long tmp;
+	char *p = (char *) page;
+
+	tmp = simple_strtoul(p, &p, 10);
+	if (!p || (*p && (*p != '\n')))
+		return -EINVAL;
+
+	if (tmp > INT_MAX)
+		return -ERANGE;
+
+	childless->storeme = tmp;
+
+	return count;
+}
+
+static ssize_t childless_description_read(struct childless *childless,
+					  char *page)
+{
+	return sprintf(page,
+"[01-childless]\n"
+"\n"
+"The childless subsystem is the simplest possible subsystem in\n"
+"configfs.  It does not support the creation of child config_items.\n"
+"It only has a few attributes.  In fact, it isn't much different\n"
+"than a directory in /proc.\n");
+}
+
+static struct childless_attribute childless_attr_showme = {
+	.attr	= { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO },
+	.show	= childless_showme_read,
+};
+static struct childless_attribute childless_attr_storeme = {
+	.attr	= { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR },
+	.show	= childless_storeme_read,
+	.store	= childless_storeme_write,
+};
+static struct childless_attribute childless_attr_description = {
+	.attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO },
+	.show = childless_description_read,
+};
+
+static struct configfs_attribute *childless_attrs[] = {
+	&childless_attr_showme.attr,
+	&childless_attr_storeme.attr,
+	&childless_attr_description.attr,
+	NULL,
+};
+
+static ssize_t childless_attr_show(struct config_item *item,
+				   struct configfs_attribute *attr,
+				   char *page)
+{
+	struct childless *childless = to_childless(item);
+	struct childless_attribute *childless_attr =
+		container_of(attr, struct childless_attribute, attr);
+	ssize_t ret = 0;
+
+	if (childless_attr->show)
+		ret = childless_attr->show(childless, page);
+	return ret;
+}
+
+static ssize_t childless_attr_store(struct config_item *item,
+				    struct configfs_attribute *attr,
+				    const char *page, size_t count)
+{
+	struct childless *childless = to_childless(item);
+	struct childless_attribute *childless_attr =
+		container_of(attr, struct childless_attribute, attr);
+	ssize_t ret = -EINVAL;
+
+	if (childless_attr->store)
+		ret = childless_attr->store(childless, page, count);
+	return ret;
+}
+
+static struct configfs_item_operations childless_item_ops = {
+	.show_attribute		= childless_attr_show,
+	.store_attribute	= childless_attr_store,
+};
+
+static struct config_item_type childless_type = {
+	.ct_item_ops	= &childless_item_ops,
+	.ct_attrs	= childless_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct childless childless_subsys = {
+	.subsys = {
+		.su_group = {
+			.cg_item = {
+				.ci_namebuf = "01-childless",
+				.ci_type = &childless_type,
+			},
+		},
+	},
+};
+
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * 02-simple-children
+ *
+ * This example merely has a simple one-attribute child.  Note that
+ * there is no extra attribute structure, as the child's attribute is
+ * known from the get-go.  Also, there is no container for the
+ * subsystem, as it has no attributes of its own.
+ */
+
+struct simple_child {
+	struct config_item item;
+	int storeme;
+};
+
+static inline struct simple_child *to_simple_child(struct config_item *item)
+{
+	return item ? container_of(item, struct simple_child, item) : NULL;
+}
+
+static struct configfs_attribute simple_child_attr_storeme = {
+	.ca_owner = THIS_MODULE,
+	.ca_name = "storeme",
+	.ca_mode = S_IRUGO | S_IWUSR,
+};
+
+static struct configfs_attribute *simple_child_attrs[] = {
+	&simple_child_attr_storeme,
+	NULL,
+};
+
+static ssize_t simple_child_attr_show(struct config_item *item,
+				      struct configfs_attribute *attr,
+				      char *page)
+{
+	ssize_t count;
+	struct simple_child *simple_child = to_simple_child(item);
+
+	count = sprintf(page, "%d\n", simple_child->storeme);
+
+	return count;
+}
+
+static ssize_t simple_child_attr_store(struct config_item *item,
+				       struct configfs_attribute *attr,
+				       const char *page, size_t count)
+{
+	struct simple_child *simple_child = to_simple_child(item);
+	unsigned long tmp;
+	char *p = (char *) page;
+
+	tmp = simple_strtoul(p, &p, 10);
+	if (!p || (*p && (*p != '\n')))
+		return -EINVAL;
+
+	if (tmp > INT_MAX)
+		return -ERANGE;
+
+	simple_child->storeme = tmp;
+
+	return count;
+}
+
+static void simple_child_release(struct config_item *item)
+{
+	kfree(to_simple_child(item));
+}
+
+static struct configfs_item_operations simple_child_item_ops = {
+	.release		= simple_child_release,
+	.show_attribute		= simple_child_attr_show,
+	.store_attribute	= simple_child_attr_store,
+};
+
+static struct config_item_type simple_child_type = {
+	.ct_item_ops	= &simple_child_item_ops,
+	.ct_attrs	= simple_child_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+
+static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
+{
+	struct simple_child *simple_child;
+
+	simple_child = kmalloc(sizeof(struct simple_child), GFP_KERNEL);
+	if (!simple_child)
+		return NULL;
+
+	memset(simple_child, 0, sizeof(struct simple_child));
+
+	config_item_init_type_name(&simple_child->item, name,
+				   &simple_child_type);
+
+	simple_child->storeme = 0;
+
+	return &simple_child->item;
+}
+
+static struct configfs_attribute simple_children_attr_description = {
+	.ca_owner = THIS_MODULE,
+	.ca_name = "description",
+	.ca_mode = S_IRUGO,
+};
+
+static struct configfs_attribute *simple_children_attrs[] = {
+	&simple_children_attr_description,
+	NULL,
+};
+
+static ssize_t simple_children_attr_show(struct config_item *item,
+			   		 struct configfs_attribute *attr,
+			   		 char *page)
+{
+	return sprintf(page,
+"[02-simple-children]\n"
+"\n"
+"This subsystem allows the creation of child config_items.  These\n"
+"items have only one attribute that is readable and writeable.\n");
+}
+
+static struct configfs_item_operations simple_children_item_ops = {
+	.show_attribute	= simple_children_attr_show,
+};
+
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations simple_children_group_ops = {
+	.make_item	= simple_children_make_item,
+};
+
+static struct config_item_type simple_children_type = {
+	.ct_item_ops	= &simple_children_item_ops,
+	.ct_group_ops	= &simple_children_group_ops,
+	.ct_attrs	= simple_children_attrs,
+};
+
+static struct configfs_subsystem simple_children_subsys = {
+	.su_group = {
+		.cg_item = {
+			.ci_namebuf = "02-simple-children",
+			.ci_type = &simple_children_type,
+		},
+	},
+};
+
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * 03-group-children
+ *
+ * This example reuses the simple_children group from above.  However,
+ * the simple_children group is not the subsystem itself, it is a
+ * child of the subsystem.  Creation of a group in the subsystem creates
+ * a new simple_children group.  That group can then have simple_child
+ * children of its own.
+ */
+
+struct simple_children {
+	struct config_group group;
+};
+
+static struct config_group *group_children_make_group(struct config_group *group, const char *name)
+{
+	struct simple_children *simple_children;
+
+	simple_children = kmalloc(sizeof(struct simple_children),
+				  GFP_KERNEL);
+	if (!simple_children)
+		return NULL;
+
+	memset(simple_children, 0, sizeof(struct simple_children));
+
+	config_group_init_type_name(&simple_children->group, name,
+				    &simple_children_type);
+
+	return &simple_children->group;
+}
+
+static struct configfs_attribute group_children_attr_description = {
+	.ca_owner = THIS_MODULE,
+	.ca_name = "description",
+	.ca_mode = S_IRUGO,
+};
+
+static struct configfs_attribute *group_children_attrs[] = {
+	&group_children_attr_description,
+	NULL,
+};
+
+static ssize_t group_children_attr_show(struct config_item *item,
+			   		struct configfs_attribute *attr,
+			   		char *page)
+{
+	return sprintf(page,
+"[03-group-children]\n"
+"\n"
+"This subsystem allows the creation of child config_groups.  These\n"
+"groups are like the subsystem simple-children.\n");
+}
+
+static struct configfs_item_operations group_children_item_ops = {
+	.show_attribute	= group_children_attr_show,
+};
+
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations group_children_group_ops = {
+	.make_group	= group_children_make_group,
+};
+
+static struct config_item_type group_children_type = {
+	.ct_item_ops	= &group_children_item_ops,
+	.ct_group_ops	= &group_children_group_ops,
+	.ct_attrs	= group_children_attrs,
+};
+
+static struct configfs_subsystem group_children_subsys = {
+	.su_group = {
+		.cg_item = {
+			.ci_namebuf = "03-group-children",
+			.ci_type = &group_children_type,
+		},
+	},
+};
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * We're now done with our subsystem definitions.
+ * For convenience in this module, here's a list of them all.  It
+ * allows the init function to easily register them.  Most modules
+ * will only have one subsystem, and will only call register_subsystem
+ * on it directly.
+ */
+static struct configfs_subsystem *example_subsys[] = {
+	&childless_subsys.subsys,
+	&simple_children_subsys,
+	&group_children_subsys,
+	NULL,
+};
+
+static int __init configfs_example_init(void)
+{
+	int ret;
+	int i;
+	struct configfs_subsystem *subsys;
+
+	for (i = 0; example_subsys[i]; i++) {
+		subsys = example_subsys[i];
+
+		config_group_init(&subsys->su_group);
+		init_MUTEX(&subsys->su_sem);
+		ret = configfs_register_subsystem(subsys);
+		if (ret) {
+			printk(KERN_ERR "Error %d while registering subsystem %s\n",
+			       ret,
+			       subsys->su_group.cg_item.ci_namebuf);
+			goto out_unregister;
+		}
+	}
+
+	return 0;
+
+out_unregister:
+	for (; i >= 0; i--) {
+		configfs_unregister_subsystem(example_subsys[i]);
+	}
+
+	return ret;
+}
+
+static void __exit configfs_example_exit(void)
+{
+	int i;
+
+	for (i = 0; example_subsys[i]; i++) {
+		configfs_unregister_subsystem(example_subsys[i]);
+	}
+}
+
+module_init(configfs_example_init);
+module_exit(configfs_example_exit);
+MODULE_LICENSE("GPL");
diff -ruN linux-2.6.13-rc6.old/fs/Kconfig linux-2.6.13-rc6/fs/Kconfig
--- linux-2.6.13-rc6.old/fs/Kconfig	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/fs/Kconfig	2005-08-12 18:09:46.778349585 -0700
@@ -859,6 +859,20 @@
 	  To compile this as a module, choose M here: the module will be called
 	  ramfs.
 
+config CONFIGFS_FS
+	tristate "Userspace-driven configuration filesystem (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	help
+	  configfs is a ram-based filesystem that provides the converse
+	  of sysfs's functionality. Where sysfs is a filesystem-based
+	  view of kernel objects, configfs is a filesystem-based manager
+	  of kernel objects, or config_items.
+
+	  Both sysfs and configfs can and should exist together on the
+	  same system. One is not a replacement for the other.
+
+	  If unsure, say N.
+
 endmenu
 
 menu "Miscellaneous filesystems"
diff -ruN linux-2.6.13-rc6.old/fs/Makefile linux-2.6.13-rc6/fs/Makefile
--- linux-2.6.13-rc6.old/fs/Makefile	2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/fs/Makefile	2005-08-12 18:09:46.778349585 -0700
@@ -98,3 +98,4 @@
 obj-$(CONFIG_HOSTFS)		+= hostfs/
 obj-$(CONFIG_HPPFS)		+= hppfs/
 obj-$(CONFIG_DEBUG_FS)		+= debugfs/
+obj-$(CONFIG_CONFIGFS_FS)	+= configfs/
diff -ruN linux-2.6.13-rc6.old/fs/configfs/Makefile linux-2.6.13-rc6/fs/configfs/Makefile
--- linux-2.6.13-rc6.old/fs/configfs/Makefile	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/fs/configfs/Makefile	2005-08-12 18:09:46.784349796 -0700
@@ -0,0 +1,7 @@
+#
+# Makefile for the configfs virtual filesystem
+#
+
+obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o
+
+configfs-objs	:= inode.o file.o dir.o symlink.o mount.o item.o
diff -ruN linux-2.6.13-rc6.old/fs/configfs/configfs_internal.h linux-2.6.13-rc6/fs/configfs/configfs_internal.h
--- linux-2.6.13-rc6.old/fs/configfs/configfs_internal.h	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/fs/configfs/configfs_internal.h	2005-08-12 18:09:46.784349796 -0700
@@ -0,0 +1,142 @@
+/* -*- mode: c; c-basic-offset:8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * configfs_internal.h - Internal stuff for configfs
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on sysfs:
+ * 	sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+
+#include <linux/slab.h>
+#include <linux/list.h>
+
+struct configfs_dirent {
+	atomic_t		s_count;
+	struct list_head	s_sibling;
+	struct list_head	s_children;
+	struct list_head	s_links;
+	void 			* s_element;
+	int			s_type;
+	umode_t			s_mode;
+	struct dentry		* s_dentry;
+};
+
+#define CONFIGFS_ROOT		0x0001
+#define CONFIGFS_DIR		0x0002
+#define CONFIGFS_ITEM_ATTR 	0x0004
+#define CONFIGFS_ITEM_LINK 	0x0020
+#define CONFIGFS_USET_DIR	0x0040
+#define CONFIGFS_USET_DEFAULT	0x0080
+#define CONFIGFS_USET_DROPPING	0x0100
+#define CONFIGFS_NOT_PINNED	(CONFIGFS_ITEM_ATTR)
+
+extern struct vfsmount * configfs_mount;
+
+extern int configfs_is_root(struct config_item *item);
+
+extern struct inode * configfs_new_inode(mode_t mode);
+extern int configfs_create(struct dentry *, int mode, int (*init)(struct inode *));
+
+extern int configfs_create_file(struct config_item *, const struct configfs_attribute *);
+extern int configfs_make_dirent(struct configfs_dirent *,
+				struct dentry *, void *, umode_t, int);
+
+extern int configfs_add_file(struct dentry *, const struct configfs_attribute *, int);
+extern void configfs_hash_and_remove(struct dentry * dir, const char * name);
+
+extern const unsigned char * configfs_get_name(struct configfs_dirent *sd);
+extern void configfs_drop_dentry(struct configfs_dirent *sd, struct dentry *parent);
+
+extern int configfs_pin_fs(void);
+extern void configfs_release_fs(void);
+
+extern struct rw_semaphore configfs_rename_sem;
+extern struct super_block * configfs_sb;
+extern struct file_operations configfs_dir_operations;
+extern struct file_operations configfs_file_operations;
+extern struct file_operations bin_fops;
+extern struct inode_operations configfs_dir_inode_operations;
+extern struct inode_operations configfs_symlink_inode_operations;
+
+extern int configfs_symlink(struct inode *dir, struct dentry *dentry,
+			    const char *symname);
+extern int configfs_unlink(struct inode *dir, struct dentry *dentry);
+
+struct configfs_symlink {
+	struct list_head sl_list;
+	struct config_item *sl_target;
+};
+
+extern int configfs_create_link(struct configfs_symlink *sl,
+				struct dentry *parent,
+				struct dentry *dentry);
+
+static inline struct config_item * to_item(struct dentry * dentry)
+{
+	struct configfs_dirent * sd = dentry->d_fsdata;
+	return ((struct config_item *) sd->s_element);
+}
+
+static inline struct configfs_attribute * to_attr(struct dentry * dentry)
+{
+	struct configfs_dirent * sd = dentry->d_fsdata;
+	return ((struct configfs_attribute *) sd->s_element);
+}
+
+static inline struct config_item *configfs_get_config_item(struct dentry *dentry)
+{
+	struct config_item * item = NULL;
+
+	spin_lock(&dcache_lock);
+	if (!d_unhashed(dentry)) {
+		struct configfs_dirent * sd = dentry->d_fsdata;
+		if (sd->s_type & CONFIGFS_ITEM_LINK) {
+			struct configfs_symlink * sl = sd->s_element;
+			item = config_item_get(sl->sl_target);
+		} else
+			item = config_item_get(sd->s_element);
+	}
+	spin_unlock(&dcache_lock);
+
+	return item;
+}
+
+static inline void release_configfs_dirent(struct configfs_dirent * sd)
+{
+	if (!(sd->s_type & CONFIGFS_ROOT))
+		kfree(sd);
+}
+
+static inline struct configfs_dirent * configfs_get(struct configfs_dirent * sd)
+{
+	if (sd) {
+		WARN_ON(!atomic_read(&sd->s_count));
+		atomic_inc(&sd->s_count);
+	}
+	return sd;
+}
+
+static inline void configfs_put(struct configfs_dirent * sd)
+{
+	WARN_ON(!atomic_read(&sd->s_count));
+	if (atomic_dec_and_test(&sd->s_count))
+		release_configfs_dirent(sd);
+}
+
diff -ruN linux-2.6.13-rc6.old/fs/configfs/dir.c linux-2.6.13-rc6/fs/configfs/dir.c
--- linux-2.6.13-rc6.old/fs/configfs/dir.c	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/fs/configfs/dir.c	2005-08-12 18:09:46.786349866 -0700
@@ -0,0 +1,1102 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * dir.c - Operations for configfs directories.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on sysfs:
+ * 	sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+
+#undef DEBUG
+
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+#include <linux/configfs.h>
+#include "configfs_internal.h"
+
+DECLARE_RWSEM(configfs_rename_sem);
+
+static void configfs_d_iput(struct dentry * dentry,
+			    struct inode * inode)
+{
+	struct configfs_dirent * sd = dentry->d_fsdata;
+
+	if (sd) {
+		BUG_ON(sd->s_dentry != dentry);
+		sd->s_dentry = NULL;
+		configfs_put(sd);
+	}
+	iput(inode);
+}
+
+/*
+ * We _must_ delete our dentries on last dput, as the chain-to-parent
+ * behavior is required to clear the parents of default_groups.
+ */
+static int configfs_d_delete(struct dentry *dentry)
+{
+	return 1;
+}
+
+static struct dentry_operations configfs_dentry_ops = {
+	.d_iput		= configfs_d_iput,
+	/* simple_delete_dentry() isn't exported */
+	.d_delete	= configfs_d_delete,
+};
+
+/*
+ * Allocates a new configfs_dirent and links it to the parent configfs_dirent
+ */
+static struct configfs_dirent *configfs_new_dirent(struct configfs_dirent * parent_sd,
+						void * element)
+{
+	struct configfs_dirent * sd;
+
+	sd = kmalloc(sizeof(*sd), GFP_KERNEL);
+	if (!sd)
+		return NULL;
+
+	memset(sd, 0, sizeof(*sd));
+	atomic_set(&sd->s_count, 1);
+	INIT_LIST_HEAD(&sd->s_links);
+	INIT_LIST_HEAD(&sd->s_children);
+	list_add(&sd->s_sibling, &parent_sd->s_children);
+	sd->s_element = element;
+
+	return sd;
+}
+
+int configfs_make_dirent(struct configfs_dirent * parent_sd,
+			 struct dentry * dentry, void * element,
+			 umode_t mode, int type)
+{
+	struct configfs_dirent * sd;
+
+	sd = configfs_new_dirent(parent_sd, element);
+	if (!sd)
+		return -ENOMEM;
+
+	sd->s_mode = mode;
+	sd->s_type = type;
+	sd->s_dentry = dentry;
+	if (dentry) {
+		dentry->d_fsdata = configfs_get(sd);
+		dentry->d_op = &configfs_dentry_ops;
+	}
+
+	return 0;
+}
+
+static int init_dir(struct inode * inode)
+{
+	inode->i_op = &configfs_dir_inode_operations;
+	inode->i_fop = &configfs_dir_operations;
+
+	/* directory inodes start off with i_nlink == 2 (for "." entry) */
+	inode->i_nlink++;
+	return 0;
+}
+
+static int init_file(struct inode * inode)
+{
+	inode->i_size = PAGE_SIZE;
+	inode->i_fop = &configfs_file_operations;
+	return 0;
+}
+
+static int init_symlink(struct inode * inode)
+{
+	inode->i_op = &configfs_symlink_inode_operations;
+	return 0;
+}
+
+static int create_dir(struct config_item * k, struct dentry * p,
+		      struct dentry * d)
+{
+	int error;
+	umode_t mode = S_IFDIR| S_IRWXU | S_IRUGO | S_IXUGO;
+
+	error = configfs_create(d, mode, init_dir);
+	if (!error) {
+		error = configfs_make_dirent(p->d_fsdata, d, k, mode,
+					   CONFIGFS_DIR);
+		if (!error) {
+			p->d_inode->i_nlink++;
+			(d)->d_op = &configfs_dentry_ops;
+		}
+	}
+	return error;
+}
+
+
+/**
+ *	configfs_create_dir - create a directory for an config_item.
+ *	@item:		config_itemwe're creating directory for.
+ *	@dentry:	config_item's dentry.
+ */
+
+static int configfs_create_dir(struct config_item * item, struct dentry *dentry)
+{
+	struct dentry * parent;
+	int error = 0;
+
+	BUG_ON(!item);
+
+	if (item->ci_parent)
+		parent = item->ci_parent->ci_dentry;
+	else if (configfs_mount && configfs_mount->mnt_sb)
+		parent = configfs_mount->mnt_sb->s_root;
+	else
+		return -EFAULT;
+
+	error = create_dir(item,parent,dentry);
+	if (!error)
+		item->ci_dentry = dentry;
+	return error;
+}
+
+int configfs_create_link(struct configfs_symlink *sl,
+			 struct dentry *parent,
+			 struct dentry *dentry)
+{
+	int err = 0;
+	umode_t mode = S_IFLNK | S_IRWXUGO;
+
+	err = configfs_create(dentry, mode, init_symlink);
+	if (!err) {
+		err = configfs_make_dirent(parent->d_fsdata, dentry, sl,
+					 mode, CONFIGFS_ITEM_LINK);
+		if (!err)
+			dentry->d_op = &configfs_dentry_ops;
+	}
+	return err;
+}
+
+static void remove_dir(struct dentry * d)
+{
+	struct dentry * parent = dget(d->d_parent);
+	struct configfs_dirent * sd;
+
+	sd = d->d_fsdata;
+ 	list_del_init(&sd->s_sibling);
+	configfs_put(sd);
+	if (d->d_inode)
+		simple_rmdir(parent->d_inode,d);
+
+	pr_debug(" o %s removing done (%d)\n",d->d_name.name,
+		 atomic_read(&d->d_count));
+
+	dput(parent);
+}
+
+/**
+ * configfs_remove_dir - remove an config_item's directory.
+ * @item:	config_item we're removing.
+ *
+ * The only thing special about this is that we remove any files in
+ * the directory before we remove the directory, and we've inlined
+ * what used to be configfs_rmdir() below, instead of calling separately.
+ */
+
+static void configfs_remove_dir(struct config_item * item)
+{
+	struct dentry * dentry = dget(item->ci_dentry);
+
+	if (!dentry)
+		return;
+
+	remove_dir(dentry);
+	/**
+	 * Drop reference from dget() on entrance.
+	 */
+	dput(dentry);
+}
+
+
+/* attaches attribute's configfs_dirent to the dentry corresponding to the
+ * attribute file
+ */
+static int configfs_attach_attr(struct configfs_dirent * sd, struct dentry * dentry)
+{
+	struct configfs_attribute * attr = sd->s_element;
+	int error;
+
+	error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG, init_file);
+	if (error)
+		return error;
+
+	dentry->d_op = &configfs_dentry_ops;
+	dentry->d_fsdata = configfs_get(sd);
+	sd->s_dentry = dentry;
+	d_rehash(dentry);
+
+	return 0;
+}
+
+static struct dentry * configfs_lookup(struct inode *dir,
+				       struct dentry *dentry,
+				       struct nameidata *nd)
+{
+	struct configfs_dirent * parent_sd = dentry->d_parent->d_fsdata;
+	struct configfs_dirent * sd;
+	int found = 0;
+	int err = 0;
+
+	list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
+		if (sd->s_type & CONFIGFS_NOT_PINNED) {
+			const unsigned char * name = configfs_get_name(sd);
+
+			if (strcmp(name, dentry->d_name.name))
+				continue;
+
+			found = 1;
+			err = configfs_attach_attr(sd, dentry);
+			break;
+		}
+	}
+
+	if (!found) {
+		/*
+		 * If it doesn't exist and it isn't a NOT_PINNED item,
+		 * it must be negative.
+		 */
+		return simple_lookup(dir, dentry, nd);
+	}
+
+	return ERR_PTR(err);
+}
+
+/*
+ * Only subdirectories count here.  Files (CONFIGFS_NOT_PINNED) are
+ * attributes and are removed by rmdir().  We recurse, taking i_sem
+ * on all children that are candidates for default detach.  If the
+ * result is clean, then configfs_detach_group() will handle dropping
+ * i_sem.  If there is an error, the caller will clean up the i_sem
+ * holders via configfs_detach_rollback().
+ */
+static int configfs_detach_prep(struct dentry *dentry)
+{
+	struct configfs_dirent *parent_sd = dentry->d_fsdata;
+	struct configfs_dirent *sd;
+	int ret;
+
+	ret = -EBUSY;
+	if (!list_empty(&parent_sd->s_links))
+		goto out;
+
+	ret = 0;
+	list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
+		if (sd->s_type & CONFIGFS_NOT_PINNED)
+			continue;
+		if (sd->s_type & CONFIGFS_USET_DEFAULT) {
+			down(&sd->s_dentry->d_inode->i_sem);
+			/* Mark that we've taken i_sem */
+			sd->s_type |= CONFIGFS_USET_DROPPING;
+
+			ret = configfs_detach_prep(sd->s_dentry);
+			if (!ret)
+			       	continue;
+		} else
+			ret = -ENOTEMPTY;
+
+		break;
+	}
+
+out:
+	return ret;
+}
+
+/*
+ * Walk the tree, dropping i_sem wherever CONFIGFS_USET_DROPPING is
+ * set.
+ */
+static void configfs_detach_rollback(struct dentry *dentry)
+{
+	struct configfs_dirent *parent_sd = dentry->d_fsdata;
+	struct configfs_dirent *sd;
+
+	list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
+		if (sd->s_type & CONFIGFS_USET_DEFAULT) {
+			configfs_detach_rollback(sd->s_dentry);
+
+			if (sd->s_type & CONFIGFS_USET_DROPPING) {
+				sd->s_type &= ~CONFIGFS_USET_DROPPING;
+				up(&sd->s_dentry->d_inode->i_sem);
+			}
+		}
+	}
+}
+
+static void detach_attrs(struct config_item * item)
+{
+	struct dentry * dentry = dget(item->ci_dentry);
+	struct configfs_dirent * parent_sd;
+	struct configfs_dirent * sd, * tmp;
+
+	if (!dentry)
+		return;
+
+	pr_debug("configfs %s: dropping attrs for  dir\n",
+		 dentry->d_name.name);
+
+	parent_sd = dentry->d_fsdata;
+	list_for_each_entry_safe(sd, tmp, &parent_sd->s_children, s_sibling) {
+		if (!sd->s_element || !(sd->s_type & CONFIGFS_NOT_PINNED))
+			continue;
+		list_del_init(&sd->s_sibling);
+		configfs_drop_dentry(sd, dentry);
+		configfs_put(sd);
+	}
+
+	/**
+	 * Drop reference from dget() on entrance.
+	 */
+	dput(dentry);
+}
+
+static int populate_attrs(struct config_item *item)
+{
+	struct config_item_type *t = item->ci_type;
+	struct configfs_attribute *attr;
+	int error = 0;
+	int i;
+
+	if (!t)
+		return -EINVAL;
+	if (t->ct_attrs) {
+		for (i = 0; (attr = t->ct_attrs[i]) != NULL; i++) {
+			if ((error = configfs_create_file(item, attr)))
+				break;
+		}
+	}
+
+	if (error)
+		detach_attrs(item);
+
+	return error;
+}
+
+static int configfs_attach_group(struct config_item *parent_item,
+				 struct config_item *item,
+				 struct dentry *dentry);
+static void configfs_detach_group(struct config_item *item);
+
+static void detach_groups(struct config_group *group)
+{
+	struct dentry * dentry = dget(group->cg_item.ci_dentry);
+	struct dentry *child;
+	struct configfs_dirent *parent_sd;
+	struct configfs_dirent *sd, *tmp;
+
+	if (!dentry)
+		return;
+
+	parent_sd = dentry->d_fsdata;
+	list_for_each_entry_safe(sd, tmp, &parent_sd->s_children, s_sibling) {
+		if (!sd->s_element ||
+		    !(sd->s_type & CONFIGFS_USET_DEFAULT))
+			continue;
+
+		child = sd->s_dentry;
+
+		configfs_detach_group(sd->s_element);
+		child->d_inode->i_flags |= S_DEAD;
+
+		/*
+		 * From rmdir/unregister, a configfs_detach_prep() pass
+		 * has taken our i_sem for us.  Drop it.
+		 * From mkdir/register cleanup, there is no sem held.
+		 */
+		if (sd->s_type & CONFIGFS_USET_DROPPING)
+			up(&child->d_inode->i_sem);
+
+		d_delete(child);
+		dput(child);
+	}
+
+	/**
+	 * Drop reference from dget() on entrance.
+	 */
+	dput(dentry);
+}
+
+/*
+ * This fakes mkdir(2) on a default_groups[] entry.  It
+ * creates a dentry, attachs it, and then does fixup
+ * on the sd->s_type.
+ *
+ * We could, perhaps, tweak our parent's ->mkdir for a minute and
+ * try using vfs_mkdir.  Just a thought.
+ */
+static int create_default_group(struct config_group *parent_group,
+				struct config_group *group)
+{
+	int ret;
+	struct qstr name;
+	struct configfs_dirent *sd;
+	/* We trust the caller holds a reference to parent */
+	struct dentry *child, *parent = parent_group->cg_item.ci_dentry;
+
+	if (!group->cg_item.ci_name)
+		group->cg_item.ci_name = group->cg_item.ci_namebuf;
+	name.name = group->cg_item.ci_name;
+	name.len = strlen(name.name);
+	name.hash = full_name_hash(name.name, name.len);
+
+	ret = -ENOMEM;
+	child = d_alloc(parent, &name);
+	if (child) {
+		d_add(child, NULL);
+
+		ret = configfs_attach_group(&parent_group->cg_item,
+					    &group->cg_item, child);
+		if (!ret) {
+			sd = child->d_fsdata;
+			sd->s_type |= CONFIGFS_USET_DEFAULT;
+		} else {
+			d_delete(child);
+			dput(child);
+		}
+	}
+
+	return ret;
+}
+
+static int populate_groups(struct config_group *group)
+{
+	struct config_group *new_group;
+	struct dentry *dentry = group->cg_item.ci_dentry;
+	int ret = 0;
+	int i;
+
+	if (group && group->default_groups) {
+		/* FYI, we're faking mkdir here
+		 * I'm not sure we need this semaphore, as we're called
+		 * from our parent's mkdir.  That holds our parent's
+		 * i_sem, so afaik lookup cannot continue through our
+		 * parent to find us, let alone mess with our tree.
+		 * That said, taking our i_sem is closer to mkdir
+		 * emulation, and shouldn't hurt. */
+		down(&dentry->d_inode->i_sem);
+
+		for (i = 0; group->default_groups[i]; i++) {
+			new_group = group->default_groups[i];
+
+			ret = create_default_group(group, new_group);
+			if (ret)
+				break;
+		}
+
+		up(&dentry->d_inode->i_sem);
+	}
+
+	if (ret)
+		detach_groups(group);
+
+	return ret;
+}
+
+/*
+ * All of link_obj/unlink_obj/link_group/unlink_group require that
+ * subsys->su_sem is held.
+ */
+
+static void unlink_obj(struct config_item *item)
+{
+	struct config_group *group;
+
+	group = item->ci_group;
+	if (group) {
+		list_del_init(&item->ci_entry);
+
+		item->ci_group = NULL;
+		item->ci_parent = NULL;
+		config_item_put(item);
+
+		config_group_put(group);
+	}
+}
+
+static void link_obj(struct config_item *parent_item, struct config_item *item)
+{
+	/* Parent seems redundant with group, but it makes certain
+	 * traversals much nicer. */
+	item->ci_parent = parent_item;
+	item->ci_group = config_group_get(to_config_group(parent_item));
+	list_add_tail(&item->ci_entry, &item->ci_group->cg_children);
+
+	config_item_get(item);
+}
+
+static void unlink_group(struct config_group *group)
+{
+	int i;
+	struct config_group *new_group;
+
+	if (group->default_groups) {
+		for (i = 0; group->default_groups[i]; i++) {
+			new_group = group->default_groups[i];
+			unlink_group(new_group);
+		}
+	}
+
+	group->cg_subsys = NULL;
+	unlink_obj(&group->cg_item);
+}
+
+static void link_group(struct config_group *parent_group, struct config_group *group)
+{
+	int i;
+	struct config_group *new_group;
+	struct configfs_subsystem *subsys = NULL; /* gcc is a turd */
+
+	link_obj(&parent_group->cg_item, &group->cg_item);
+
+	if (parent_group->cg_subsys)
+		subsys = parent_group->cg_subsys;
+	else if (configfs_is_root(&parent_group->cg_item))
+		subsys = to_configfs_subsystem(group);
+	else
+		BUG();
+	group->cg_subsys = subsys;
+
+	if (group->default_groups) {
+		for (i = 0; group->default_groups[i]; i++) {
+			new_group = group->default_groups[i];
+			link_group(group, new_group);
+		}
+	}
+}
+
+/*
+ * The goal is that configfs_attach_item() (and
+ * configfs_attach_group()) can be called from either the VFS or this
+ * module.  That is, they assume that the items have been created,
+ * the dentry allocated, and the dcache is all ready to go.
+ *
+ * If they fail, they must clean up after themselves as if they
+ * had never been called.  The caller (VFS or local function) will
+ * handle cleaning up the dcache bits.
+ *
+ * configfs_detach_group() and configfs_detach_item() behave similarly on
+ * the way out.  They assume that the proper semaphores are held, they
+ * clean up the configfs items, and they expect their callers will
+ * handle the dcache bits.
+ */
+static int configfs_attach_item(struct config_item *parent_item,
+				struct config_item *item,
+				struct dentry *dentry)
+{
+	int ret;
+
+	ret = configfs_create_dir(item, dentry);
+	if (!ret) {
+		ret = populate_attrs(item);
+		if (ret) {
+			configfs_remove_dir(item);
+			d_delete(dentry);
+		}
+	}
+
+	return ret;
+}
+
+static void configfs_detach_item(struct config_item *item)
+{
+	detach_attrs(item);
+	configfs_remove_dir(item);
+}
+
+static int configfs_attach_group(struct config_item *parent_item,
+				 struct config_item *item,
+				 struct dentry *dentry)
+{
+	int ret;
+	struct configfs_dirent *sd;
+
+	ret = configfs_attach_item(parent_item, item, dentry);
+	if (!ret) {
+		sd = dentry->d_fsdata;
+		sd->s_type |= CONFIGFS_USET_DIR;
+
+		ret = populate_groups(to_config_group(item));
+		if (ret) {
+			configfs_detach_item(item);
+			d_delete(dentry);
+		}
+	}
+
+	return ret;
+}
+
+static void configfs_detach_group(struct config_item *item)
+{
+	detach_groups(to_config_group(item));
+	configfs_detach_item(item);
+}
+
+/*
+ * Drop the initial reference from make_item()/make_group()
+ * This function assumes that reference is held on item
+ * and that item holds a valid reference to the parent.  Also, it
+ * assumes the caller has validated ci_type.
+ */
+static void client_drop_item(struct config_item *parent_item,
+			     struct config_item *item)
+{
+	struct config_item_type *type;
+
+	type = parent_item->ci_type;
+	BUG_ON(!type);
+
+	if (type->ct_group_ops && type->ct_group_ops->drop_item)
+		type->ct_group_ops->drop_item(to_config_group(parent_item),
+						item);
+	else
+		config_item_put(item);
+}
+
+
+static int configfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+	int ret;
+	struct config_group *group;
+	struct config_item *item;
+	struct config_item *parent_item;
+	struct configfs_subsystem *subsys;
+	struct configfs_dirent *sd;
+	struct config_item_type *type;
+	struct module *owner;
+	char *name;
+
+	if (dentry->d_parent == configfs_sb->s_root)
+		return -EPERM;
+
+	sd = dentry->d_parent->d_fsdata;
+	if (!(sd->s_type & CONFIGFS_USET_DIR))
+		return -EPERM;
+
+	parent_item = configfs_get_config_item(dentry->d_parent);
+	type = parent_item->ci_type;
+	subsys = to_config_group(parent_item)->cg_subsys;
+	BUG_ON(!subsys);
+
+	if (!type || !type->ct_group_ops ||
+	    (!type->ct_group_ops->make_group &&
+	     !type->ct_group_ops->make_item)) {
+		config_item_put(parent_item);
+		return -EPERM;  /* What lack-of-mkdir returns */
+	}
+
+	name = kmalloc(dentry->d_name.len + 1, GFP_KERNEL);
+	if (!name) {
+		config_item_put(parent_item);
+		return -ENOMEM;
+	}
+	snprintf(name, dentry->d_name.len + 1, "%s", dentry->d_name.name);
+
+	down(&subsys->su_sem);
+	group = NULL;
+	item = NULL;
+	if (type->ct_group_ops->make_group) {
+		group = type->ct_group_ops->make_group(to_config_group(parent_item), name);
+		if (group) {
+			link_group(to_config_group(parent_item), group);
+			item = &group->cg_item;
+		}
+	} else {
+		item = type->ct_group_ops->make_item(to_config_group(parent_item), name);
+		if (item)
+			link_obj(parent_item, item);
+	}
+	up(&subsys->su_sem);
+
+	kfree(name);
+	if (!item) {
+		config_item_put(parent_item);
+		return -ENOMEM;
+	}
+
+	ret = -EINVAL;
+	type = item->ci_type;
+	if (type) {
+		owner = type->ct_owner;
+		if (try_module_get(owner)) {
+			if (group) {
+				ret = configfs_attach_group(parent_item,
+							    item,
+							    dentry);
+			} else {
+				ret = configfs_attach_item(parent_item,
+							   item,
+							   dentry);
+			}
+
+			if (ret) {
+				down(&subsys->su_sem);
+				if (group)
+					unlink_group(group);
+				else
+					unlink_obj(item);
+				client_drop_item(parent_item, item);
+				up(&subsys->su_sem);
+
+				config_item_put(parent_item);
+				module_put(owner);
+			}
+		}
+	}
+
+	return ret;
+}
+
+static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct config_item *parent_item;
+	struct config_item *item;
+	struct configfs_subsystem *subsys;
+	struct configfs_dirent *sd;
+	struct module *owner = NULL;
+	int ret;
+
+	if (dentry->d_parent == configfs_sb->s_root)
+		return -EPERM;
+
+	sd = dentry->d_fsdata;
+	if (sd->s_type & CONFIGFS_USET_DEFAULT)
+		return -EPERM;
+
+	parent_item = configfs_get_config_item(dentry->d_parent);
+	subsys = to_config_group(parent_item)->cg_subsys;
+	BUG_ON(!subsys);
+
+	if (!parent_item->ci_type) {
+		config_item_put(parent_item);
+		return -EINVAL;
+	}
+
+	ret = configfs_detach_prep(dentry);
+	if (ret) {
+		configfs_detach_rollback(dentry);
+		config_item_put(parent_item);
+		return ret;
+	}
+
+	item = configfs_get_config_item(dentry);
+
+	/* Drop reference from above, item already holds one. */
+	config_item_put(parent_item);
+
+	if (item->ci_type)
+		owner = item->ci_type->ct_owner;
+
+	if (sd->s_type & CONFIGFS_USET_DIR) {
+		configfs_detach_group(item);
+
+		down(&subsys->su_sem);
+		unlink_group(to_config_group(item));
+	} else {
+		configfs_detach_item(item);
+
+		down(&subsys->su_sem);
+		unlink_obj(item);
+	}
+
+	client_drop_item(parent_item, item);
+	up(&subsys->su_sem);
+
+	/* Drop our reference from above */
+	config_item_put(item);
+
+	module_put(owner);
+
+	return 0;
+}
+
+struct inode_operations configfs_dir_inode_operations = {
+	.mkdir		= configfs_mkdir,
+	.rmdir		= configfs_rmdir,
+	.symlink	= configfs_symlink,
+	.unlink		= configfs_unlink,
+	.lookup		= configfs_lookup,
+};
+
+#if 0
+int configfs_rename_dir(struct config_item * item, const char *new_name)
+{
+	int error = 0;
+	struct dentry * new_dentry, * parent;
+
+	if (!strcmp(config_item_name(item), new_name))
+		return -EINVAL;
+
+	if (!item->parent)
+		return -EINVAL;
+
+	down_write(&configfs_rename_sem);
+	parent = item->parent->dentry;
+
+	down(&parent->d_inode->i_sem);
+
+	new_dentry = lookup_one_len(new_name, parent, strlen(new_name));
+	if (!IS_ERR(new_dentry)) {
+  		if (!new_dentry->d_inode) {
+			error = config_item_set_name(item, "%s", new_name);
+			if (!error) {
+				d_add(new_dentry, NULL);
+				d_move(item->dentry, new_dentry);
+			}
+			else
+				d_delete(new_dentry);
+		} else
+			error = -EEXIST;
+		dput(new_dentry);
+	}
+	up(&parent->d_inode->i_sem);
+	up_write(&configfs_rename_sem);
+
+	return error;
+}
+#endif
+
+static int configfs_dir_open(struct inode *inode, struct file *file)
+{
+	struct dentry * dentry = file->f_dentry;
+	struct configfs_dirent * parent_sd = dentry->d_fsdata;
+
+	down(&dentry->d_inode->i_sem);
+	file->private_data = configfs_new_dirent(parent_sd, NULL);
+	up(&dentry->d_inode->i_sem);
+
+	return file->private_data ? 0 : -ENOMEM;
+
+}
+
+static int configfs_dir_close(struct inode *inode, struct file *file)
+{
+	struct dentry * dentry = file->f_dentry;
+	struct configfs_dirent * cursor = file->private_data;
+
+	down(&dentry->d_inode->i_sem);
+	list_del_init(&cursor->s_sibling);
+	up(&dentry->d_inode->i_sem);
+
+	release_configfs_dirent(cursor);
+
+	return 0;
+}
+
+/* Relationship between s_mode and the DT_xxx types */
+static inline unsigned char dt_type(struct configfs_dirent *sd)
+{
+	return (sd->s_mode >> 12) & 15;
+}
+
+static int configfs_readdir(struct file * filp, void * dirent, filldir_t filldir)
+{
+	struct dentry *dentry = filp->f_dentry;
+	struct configfs_dirent * parent_sd = dentry->d_fsdata;
+	struct configfs_dirent *cursor = filp->private_data;
+	struct list_head *p, *q = &cursor->s_sibling;
+	ino_t ino;
+	int i = filp->f_pos;
+
+	switch (i) {
+		case 0:
+			ino = dentry->d_inode->i_ino;
+			if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0)
+				break;
+			filp->f_pos++;
+			i++;
+			/* fallthrough */
+		case 1:
+			ino = parent_ino(dentry);
+			if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0)
+				break;
+			filp->f_pos++;
+			i++;
+			/* fallthrough */
+		default:
+			if (filp->f_pos == 2) {
+				list_del(q);
+				list_add(q, &parent_sd->s_children);
+			}
+			for (p=q->next; p!= &parent_sd->s_children; p=p->next) {
+				struct configfs_dirent *next;
+				const char * name;
+				int len;
+
+				next = list_entry(p, struct configfs_dirent,
+						   s_sibling);
+				if (!next->s_element)
+					continue;
+
+				name = configfs_get_name(next);
+				len = strlen(name);
+				if (next->s_dentry)
+					ino = next->s_dentry->d_inode->i_ino;
+				else
+					ino = iunique(configfs_sb, 2);
+
+				if (filldir(dirent, name, len, filp->f_pos, ino,
+						 dt_type(next)) < 0)
+					return 0;
+
+				list_del(q);
+				list_add(q, p);
+				p = q;
+				filp->f_pos++;
+			}
+	}
+	return 0;
+}
+
+static loff_t configfs_dir_lseek(struct file * file, loff_t offset, int origin)
+{
+	struct dentry * dentry = file->f_dentry;
+
+	down(&dentry->d_inode->i_sem);
+	switch (origin) {
+		case 1:
+			offset += file->f_pos;
+		case 0:
+			if (offset >= 0)
+				break;
+		default:
+			up(&file->f_dentry->d_inode->i_sem);
+			return -EINVAL;
+	}
+	if (offset != file->f_pos) {
+		file->f_pos = offset;
+		if (file->f_pos >= 2) {
+			struct configfs_dirent *sd = dentry->d_fsdata;
+			struct configfs_dirent *cursor = file->private_data;
+			struct list_head *p;
+			loff_t n = file->f_pos - 2;
+
+			list_del(&cursor->s_sibling);
+			p = sd->s_children.next;
+			while (n && p != &sd->s_children) {
+				struct configfs_dirent *next;
+				next = list_entry(p, struct configfs_dirent,
+						   s_sibling);
+				if (next->s_element)
+					n--;
+				p = p->next;
+			}
+			list_add_tail(&cursor->s_sibling, p);
+		}
+	}
+	up(&dentry->d_inode->i_sem);
+	return offset;
+}
+
+struct file_operations configfs_dir_operations = {
+	.open		= configfs_dir_open,
+	.release	= configfs_dir_close,
+	.llseek		= configfs_dir_lseek,
+	.read		= generic_read_dir,
+	.readdir	= configfs_readdir,
+};
+
+int configfs_register_subsystem(struct configfs_subsystem *subsys)
+{
+	int err;
+	struct config_group *group = &subsys->su_group;
+	struct qstr name;
+	struct dentry *dentry;
+	struct configfs_dirent *sd;
+
+	err = configfs_pin_fs();
+	if (err)
+		return err;
+
+	if (!group->cg_item.ci_name)
+		group->cg_item.ci_name = group->cg_item.ci_namebuf;
+
+	sd = configfs_sb->s_root->d_fsdata;
+	link_group(to_config_group(sd->s_element), group);
+
+	down(&configfs_sb->s_root->d_inode->i_sem);
+
+	name.name = group->cg_item.ci_name;
+	name.len = strlen(name.name);
+	name.hash = full_name_hash(name.name, name.len);
+
+	err = -ENOMEM;
+	dentry = d_alloc(configfs_sb->s_root, &name);
+	if (!dentry)
+		goto out_release;
+
+	d_add(dentry, NULL);
+
+	err = configfs_attach_group(sd->s_element, &group->cg_item,
+				    dentry);
+	if (!err)
+		dentry = NULL;
+	else
+		d_delete(dentry);
+
+	up(&configfs_sb->s_root->d_inode->i_sem);
+
+	if (dentry) {
+	    dput(dentry);
+out_release:
+	    unlink_group(group);
+	    configfs_release_fs();
+	}
+
+	return err;
+}
+
+void configfs_unregister_subsystem(struct configfs_subsystem *subsys)
+{
+	struct config_group *group = &subsys->su_group;
+	struct dentry *dentry = group->cg_item.ci_dentry;
+
+	if (dentry->d_parent != configfs_sb->s_root) {
+		printk(KERN_ERR "configfs: Tried to unregister non-subsystem!\n");
+		return;
+	}
+
+	down(&configfs_sb->s_root->d_inode->i_sem);
+	down(&dentry->d_inode->i_sem);
+	if (configfs_detach_prep(dentry)) {
+		printk(KERN_ERR "configfs: Tried to unregister non-empty subsystem!\n");
+	}
+	configfs_detach_group(&group->cg_item);
+	dentry->d_inode->i_flags |= S_DEAD;
+	up(&dentry->d_inode->i_sem);
+
+	d_delete(dentry);
+
+	up(&configfs_sb->s_root->d_inode->i_sem);
+
+	dput(dentry);
+
+	unlink_group(group);
+	configfs_release_fs();
+}
+
+EXPORT_SYMBOL(configfs_register_subsystem);
+EXPORT_SYMBOL(configfs_unregister_subsystem);
diff -ruN linux-2.6.13-rc6.old/fs/configfs/file.c linux-2.6.13-rc6/fs/configfs/file.c
--- linux-2.6.13-rc6.old/fs/configfs/file.c	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/fs/configfs/file.c	2005-08-12 18:09:46.786349866 -0700
@@ -0,0 +1,360 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * file.c - operations for regular (text) files.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on sysfs:
+ * 	sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/dnotify.h>
+#include <linux/slab.h>
+#include <asm/uaccess.h>
+#include <asm/semaphore.h>
+
+#include <linux/configfs.h>
+#include "configfs_internal.h"
+
+
+struct configfs_buffer {
+	size_t			count;
+	loff_t			pos;
+	char			* page;
+	struct configfs_item_operations	* ops;
+	struct semaphore	sem;
+	int			needs_read_fill;
+};
+
+
+/**
+ *	fill_read_buffer - allocate and fill buffer from item.
+ *	@dentry:	dentry pointer.
+ *	@buffer:	data buffer for file.
+ *
+ *	Allocate @buffer->page, if it hasn't been already, then call the
+ *	config_item's show() method to fill the buffer with this attribute's
+ *	data.
+ *	This is called only once, on the file's first read.
+ */
+static int fill_read_buffer(struct dentry * dentry, struct configfs_buffer * buffer)
+{
+	struct configfs_attribute * attr = to_attr(dentry);
+	struct config_item * item = to_item(dentry->d_parent);
+	struct configfs_item_operations * ops = buffer->ops;
+	int ret = 0;
+	ssize_t count;
+
+	if (!buffer->page)
+		buffer->page = (char *) get_zeroed_page(GFP_KERNEL);
+	if (!buffer->page)
+		return -ENOMEM;
+
+	count = ops->show_attribute(item,attr,buffer->page);
+	buffer->needs_read_fill = 0;
+	BUG_ON(count > (ssize_t)PAGE_SIZE);
+	if (count >= 0)
+		buffer->count = count;
+	else
+		ret = count;
+	return ret;
+}
+
+
+/**
+ *	flush_read_buffer - push buffer to userspace.
+ *	@buffer:	data buffer for file.
+ *	@userbuf:	user-passed buffer.
+ *	@count:		number of bytes requested.
+ *	@ppos:		file position.
+ *
+ *	Copy the buffer we filled in fill_read_buffer() to userspace.
+ *	This is done at the reader's leisure, copying and advancing
+ *	the amount they specify each time.
+ *	This may be called continuously until the buffer is empty.
+ */
+static int flush_read_buffer(struct configfs_buffer * buffer, char __user * buf,
+			     size_t count, loff_t * ppos)
+{
+	int error;
+
+	if (*ppos > buffer->count)
+		return 0;
+
+	if (count > (buffer->count - *ppos))
+		count = buffer->count - *ppos;
+
+	error = copy_to_user(buf,buffer->page + *ppos,count);
+	if (!error)
+		*ppos += count;
+	return error ? -EFAULT : count;
+}
+
+/**
+ *	configfs_read_file - read an attribute.
+ *	@file:	file pointer.
+ *	@buf:	buffer to fill.
+ *	@count:	number of bytes to read.
+ *	@ppos:	starting offset in file.
+ *
+ *	Userspace wants to read an attribute file. The attribute descriptor
+ *	is in the file's ->d_fsdata. The target item is in the directory's
+ *	->d_fsdata.
+ *
+ *	We call fill_read_buffer() to allocate and fill the buffer from the
+ *	item's show() method exactly once (if the read is happening from
+ *	the beginning of the file). That should fill the entire buffer with
+ *	all the data the item has to offer for that attribute.
+ *	We then call flush_read_buffer() to copy the buffer to userspace
+ *	in the increments specified.
+ */
+
+static ssize_t
+configfs_read_file(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct configfs_buffer * buffer = file->private_data;
+	ssize_t retval = 0;
+
+	down(&buffer->sem);
+	if (buffer->needs_read_fill) {
+		if ((retval = fill_read_buffer(file->f_dentry,buffer)))
+			goto out;
+	}
+	pr_debug("%s: count = %d, ppos = %lld, buf = %s\n",
+		 __FUNCTION__,count,*ppos,buffer->page);
+	retval = flush_read_buffer(buffer,buf,count,ppos);
+out:
+	up(&buffer->sem);
+	return retval;
+}
+
+
+/**
+ *	fill_write_buffer - copy buffer from userspace.
+ *	@buffer:	data buffer for file.
+ *	@userbuf:	data from user.
+ *	@count:		number of bytes in @userbuf.
+ *
+ *	Allocate @buffer->page if it hasn't been already, then
+ *	copy the user-supplied buffer into it.
+ */
+
+static int
+fill_write_buffer(struct configfs_buffer * buffer, const char __user * buf, size_t count)
+{
+	int error;
+
+	if (!buffer->page)
+		buffer->page = (char *)get_zeroed_page(GFP_KERNEL);
+	if (!buffer->page)
+		return -ENOMEM;
+
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+	error = copy_from_user(buffer->page,buf,count);
+	buffer->needs_read_fill = 1;
+	return error ? -EFAULT : count;
+}
+
+
+/**
+ *	flush_write_buffer - push buffer to config_item.
+ *	@file:		file pointer.
+ *	@buffer:	data buffer for file.
+ *
+ *	Get the correct pointers for the config_item and the attribute we're
+ *	dealing with, then call the store() method for the attribute,
+ *	passing the buffer that we acquired in fill_write_buffer().
+ */
+
+static int
+flush_write_buffer(struct dentry * dentry, struct configfs_buffer * buffer, size_t count)
+{
+	struct configfs_attribute * attr = to_attr(dentry);
+	struct config_item * item = to_item(dentry->d_parent);
+	struct configfs_item_operations * ops = buffer->ops;
+
+	return ops->store_attribute(item,attr,buffer->page,count);
+}
+
+
+/**
+ *	configfs_write_file - write an attribute.
+ *	@file:	file pointer
+ *	@buf:	data to write
+ *	@count:	number of bytes
+ *	@ppos:	starting offset
+ *
+ *	Similar to configfs_read_file(), though working in the opposite direction.
+ *	We allocate and fill the data from the user in fill_write_buffer(),
+ *	then push it to the config_item in flush_write_buffer().
+ *	There is no easy way for us to know if userspace is only doing a partial
+ *	write, so we don't support them. We expect the entire buffer to come
+ *	on the first write.
+ *	Hint: if you're writing a value, first read the file, modify only the
+ *	the value you're changing, then write entire buffer back.
+ */
+
+static ssize_t
+configfs_write_file(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
+{
+	struct configfs_buffer * buffer = file->private_data;
+
+	down(&buffer->sem);
+	count = fill_write_buffer(buffer,buf,count);
+	if (count > 0)
+		count = flush_write_buffer(file->f_dentry,buffer,count);
+	if (count > 0)
+		*ppos += count;
+	up(&buffer->sem);
+	return count;
+}
+
+static int check_perm(struct inode * inode, struct file * file)
+{
+	struct config_item *item = configfs_get_config_item(file->f_dentry->d_parent);
+	struct configfs_attribute * attr = to_attr(file->f_dentry);
+	struct configfs_buffer * buffer;
+	struct configfs_item_operations * ops = NULL;
+	int error = 0;
+
+	if (!item || !attr)
+		goto Einval;
+
+	/* Grab the module reference for this attribute if we have one */
+	if (!try_module_get(attr->ca_owner)) {
+		error = -ENODEV;
+		goto Done;
+	}
+
+	if (item->ci_type)
+		ops = item->ci_type->ct_item_ops;
+	else
+		goto Eaccess;
+
+	/* File needs write support.
+	 * The inode's perms must say it's ok,
+	 * and we must have a store method.
+	 */
+	if (file->f_mode & FMODE_WRITE) {
+
+		if (!(inode->i_mode & S_IWUGO) || !ops->store_attribute)
+			goto Eaccess;
+
+	}
+
+	/* File needs read support.
+	 * The inode's perms must say it's ok, and we there
+	 * must be a show method for it.
+	 */
+	if (file->f_mode & FMODE_READ) {
+		if (!(inode->i_mode & S_IRUGO) || !ops->show_attribute)
+			goto Eaccess;
+	}
+
+	/* No error? Great, allocate a buffer for the file, and store it
+	 * it in file->private_data for easy access.
+	 */
+	buffer = kmalloc(sizeof(struct configfs_buffer),GFP_KERNEL);
+	if (buffer) {
+		memset(buffer,0,sizeof(struct configfs_buffer));
+		init_MUTEX(&buffer->sem);
+		buffer->needs_read_fill = 1;
+		buffer->ops = ops;
+		file->private_data = buffer;
+	} else
+		error = -ENOMEM;
+	goto Done;
+
+ Einval:
+	error = -EINVAL;
+	goto Done;
+ Eaccess:
+	error = -EACCES;
+	module_put(attr->ca_owner);
+ Done:
+	if (error && item)
+		config_item_put(item);
+	return error;
+}
+
+static int configfs_open_file(struct inode * inode, struct file * filp)
+{
+	return check_perm(inode,filp);
+}
+
+static int configfs_release(struct inode * inode, struct file * filp)
+{
+	struct config_item * item = to_item(filp->f_dentry->d_parent);
+	struct configfs_attribute * attr = to_attr(filp->f_dentry);
+	struct module * owner = attr->ca_owner;
+	struct configfs_buffer * buffer = filp->private_data;
+
+	if (item)
+		config_item_put(item);
+	/* After this point, attr should not be accessed. */
+	module_put(owner);
+
+	if (buffer) {
+		if (buffer->page)
+			free_page((unsigned long)buffer->page);
+		kfree(buffer);
+	}
+	return 0;
+}
+
+struct file_operations configfs_file_operations = {
+	.read		= configfs_read_file,
+	.write		= configfs_write_file,
+	.llseek		= generic_file_llseek,
+	.open		= configfs_open_file,
+	.release	= configfs_release,
+};
+
+
+int configfs_add_file(struct dentry * dir, const struct configfs_attribute * attr, int type)
+{
+	struct configfs_dirent * parent_sd = dir->d_fsdata;
+	umode_t mode = (attr->ca_mode & S_IALLUGO) | S_IFREG;
+	int error = 0;
+
+	down(&dir->d_inode->i_sem);
+	error = configfs_make_dirent(parent_sd, NULL, (void *) attr, mode, type);
+	up(&dir->d_inode->i_sem);
+
+	return error;
+}
+
+
+/**
+ *	configfs_create_file - create an attribute file for an item.
+ *	@item:	item we're creating for.
+ *	@attr:	atrribute descriptor.
+ */
+
+int configfs_create_file(struct config_item * item, const struct configfs_attribute * attr)
+{
+	BUG_ON(!item || !item->ci_dentry || !attr);
+
+	return configfs_add_file(item->ci_dentry, attr,
+				 CONFIGFS_ITEM_ATTR);
+}
+
diff -ruN linux-2.6.13-rc6.old/fs/configfs/inode.c linux-2.6.13-rc6/fs/configfs/inode.c
--- linux-2.6.13-rc6.old/fs/configfs/inode.c	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/fs/configfs/inode.c	2005-08-12 18:09:46.787349901 -0700
@@ -0,0 +1,162 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * inode.c - basic inode and dentry operations.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on sysfs:
+ * 	sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ *
+ * Please see Documentation/filesystems/configfs.txt for more information.
+ */
+
+#undef DEBUG
+
+#include <linux/pagemap.h>
+#include <linux/namei.h>
+#include <linux/backing-dev.h>
+
+#include <linux/configfs.h>
+#include "configfs_internal.h"
+
+extern struct super_block * configfs_sb;
+
+static struct address_space_operations configfs_aops = {
+	.readpage	= simple_readpage,
+	.prepare_write	= simple_prepare_write,
+	.commit_write	= simple_commit_write
+};
+
+static struct backing_dev_info configfs_backing_dev_info = {
+	.ra_pages	= 0,	/* No readahead */
+	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
+};
+
+struct inode * configfs_new_inode(mode_t mode)
+{
+	struct inode * inode = new_inode(configfs_sb);
+	if (inode) {
+		inode->i_mode = mode;
+		inode->i_uid = 0;
+		inode->i_gid = 0;
+		inode->i_blksize = PAGE_CACHE_SIZE;
+		inode->i_blocks = 0;
+		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		inode->i_mapping->a_ops = &configfs_aops;
+		inode->i_mapping->backing_dev_info = &configfs_backing_dev_info;
+	}
+	return inode;
+}
+
+int configfs_create(struct dentry * dentry, int mode, int (*init)(struct inode *))
+{
+	int error = 0;
+	struct inode * inode = NULL;
+	if (dentry) {
+		if (!dentry->d_inode) {
+			if ((inode = configfs_new_inode(mode))) {
+				if (dentry->d_parent && dentry->d_parent->d_inode) {
+					struct inode *p_inode = dentry->d_parent->d_inode;
+					p_inode->i_mtime = p_inode->i_ctime = CURRENT_TIME;
+				}
+				goto Proceed;
+			}
+			else
+				error = -ENOMEM;
+		} else
+			error = -EEXIST;
+	} else
+		error = -ENOENT;
+	goto Done;
+
+ Proceed:
+	if (init)
+		error = init(inode);
+	if (!error) {
+		d_instantiate(dentry, inode);
+		if (S_ISDIR(mode) || S_ISLNK(mode))
+			dget(dentry);  /* pin link and directory dentries in core */
+	} else
+		iput(inode);
+ Done:
+	return error;
+}
+
+/*
+ * Get the name for corresponding element represented by the given configfs_dirent
+ */
+const unsigned char * configfs_get_name(struct configfs_dirent *sd)
+{
+	struct attribute * attr;
+
+	if (!sd || !sd->s_element)
+		BUG();
+
+	/* These always have a dentry, so use that */
+	if (sd->s_type & (CONFIGFS_DIR | CONFIGFS_ITEM_LINK))
+		return sd->s_dentry->d_name.name;
+
+	if (sd->s_type & CONFIGFS_ITEM_ATTR) {
+		attr = sd->s_element;
+		return attr->name;
+	}
+	return NULL;
+}
+
+
+/*
+ * Unhashes the dentry corresponding to given configfs_dirent
+ * Called with parent inode's i_sem held.
+ */
+void configfs_drop_dentry(struct configfs_dirent * sd, struct dentry * parent)
+{
+	struct dentry * dentry = sd->s_dentry;
+
+	if (dentry) {
+		spin_lock(&dcache_lock);
+		if (!(d_unhashed(dentry) && dentry->d_inode)) {
+			dget_locked(dentry);
+			__d_drop(dentry);
+			spin_unlock(&dcache_lock);
+			simple_unlink(parent->d_inode, dentry);
+		} else
+			spin_unlock(&dcache_lock);
+	}
+}
+
+void configfs_hash_and_remove(struct dentry * dir, const char * name)
+{
+	struct configfs_dirent * sd;
+	struct configfs_dirent * parent_sd = dir->d_fsdata;
+
+	down(&dir->d_inode->i_sem);
+	list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
+		if (!sd->s_element)
+			continue;
+		if (!strcmp(configfs_get_name(sd), name)) {
+			list_del_init(&sd->s_sibling);
+			configfs_drop_dentry(sd, dir);
+			configfs_put(sd);
+			break;
+		}
+	}
+	up(&dir->d_inode->i_sem);
+}
+
+
diff -ruN linux-2.6.13-rc6.old/fs/configfs/item.c linux-2.6.13-rc6/fs/configfs/item.c
--- linux-2.6.13-rc6.old/fs/configfs/item.c	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/fs/configfs/item.c	2005-08-12 18:09:46.787349901 -0700
@@ -0,0 +1,227 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * item.c - library routines for handling generic config items
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on kobject:
+ * 	kobject is Copyright (c) 2002-2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ *
+ * Please see the file Documentation/filesystems/configfs.txt for
+ * critical information about using the config_item interface.
+ */
+
+#include <linux/string.h>
+#include <linux/module.h>
+#include <linux/stat.h>
+#include <linux/slab.h>
+
+#include <linux/configfs.h>
+
+
+static inline struct config_item * to_item(struct list_head * entry)
+{
+	return container_of(entry,struct config_item,ci_entry);
+}
+
+/* Evil kernel */
+static void config_item_release(struct kref *kref);
+
+/**
+ *	config_item_init - initialize item.
+ *	@item:	item in question.
+ */
+void config_item_init(struct config_item * item)
+{
+	kref_init(&item->ci_kref);
+	INIT_LIST_HEAD(&item->ci_entry);
+}
+
+/**
+ *	config_item_set_name - Set the name of an item
+ *	@item:	item.
+ *	@name:	name.
+ *
+ *	If strlen(name) >= CONFIGFS_ITEM_NAME_LEN, then use a
+ *	dynamically allocated string that @item->ci_name points to.
+ *	Otherwise, use the static @item->ci_namebuf array.
+ */
+
+int config_item_set_name(struct config_item * item, const char * fmt, ...)
+{
+	int error = 0;
+	int limit = CONFIGFS_ITEM_NAME_LEN;
+	int need;
+	va_list args;
+	char * name;
+
+	/*
+	 * First, try the static array
+	 */
+	va_start(args,fmt);
+	need = vsnprintf(item->ci_namebuf,limit,fmt,args);
+	va_end(args);
+	if (need < limit)
+		name = item->ci_namebuf;
+	else {
+		/*
+		 * Need more space? Allocate it and try again
+		 */
+		limit = need + 1;
+		name = kmalloc(limit,GFP_KERNEL);
+		if (!name) {
+			error = -ENOMEM;
+			goto Done;
+		}
+		va_start(args,fmt);
+		need = vsnprintf(name,limit,fmt,args);
+		va_end(args);
+
+		/* Still? Give up. */
+		if (need >= limit) {
+			kfree(name);
+			error = -EFAULT;
+			goto Done;
+		}
+	}
+
+	/* Free the old name, if necessary. */
+	if (item->ci_name && item->ci_name != item->ci_namebuf)
+		kfree(item->ci_name);
+
+	/* Now, set the new name */
+	item->ci_name = name;
+ Done:
+	return error;
+}
+
+EXPORT_SYMBOL(config_item_set_name);
+
+void config_item_init_type_name(struct config_item *item,
+				const char *name,
+				struct config_item_type *type)
+{
+	config_item_set_name(item, name);
+	item->ci_type = type;
+	config_item_init(item);
+}
+EXPORT_SYMBOL(config_item_init_type_name);
+
+void config_group_init_type_name(struct config_group *group, const char *name,
+			 struct config_item_type *type)
+{
+	config_item_set_name(&group->cg_item, name);
+	group->cg_item.ci_type = type;
+	config_group_init(group);
+}
+EXPORT_SYMBOL(config_group_init_type_name);
+
+struct config_item * config_item_get(struct config_item * item)
+{
+	if (item)
+		kref_get(&item->ci_kref);
+	return item;
+}
+
+/**
+ *	config_item_cleanup - free config_item resources.
+ *	@item:	item.
+ */
+
+void config_item_cleanup(struct config_item * item)
+{
+	struct config_item_type * t = item->ci_type;
+	struct config_group * s = item->ci_group;
+	struct config_item * parent = item->ci_parent;
+
+	pr_debug("config_item %s: cleaning up\n",config_item_name(item));
+	if (item->ci_name != item->ci_namebuf)
+		kfree(item->ci_name);
+	item->ci_name = NULL;
+	if (t && t->ct_item_ops && t->ct_item_ops->release)
+		t->ct_item_ops->release(item);
+	if (s)
+		config_group_put(s);
+	if (parent)
+		config_item_put(parent);
+}
+
+static void config_item_release(struct kref *kref)
+{
+	config_item_cleanup(container_of(kref, struct config_item, ci_kref));
+}
+
+/**
+ *	config_item_put - decrement refcount for item.
+ *	@item:	item.
+ *
+ *	Decrement the refcount, and if 0, call config_item_cleanup().
+ */
+void config_item_put(struct config_item * item)
+{
+	if (item)
+		kref_put(&item->ci_kref, config_item_release);
+}
+
+
+/**
+ *	config_group_init - initialize a group for use
+ *	@k:	group
+ */
+
+void config_group_init(struct config_group *group)
+{
+	config_item_init(&group->cg_item);
+	INIT_LIST_HEAD(&group->cg_children);
+}
+
+
+/**
+ *	config_group_find_obj - search for item in group.
+ *	@group:	group we're looking in.
+ *	@name:	item's name.
+ *
+ *	Lock group via @group->cg_subsys, and iterate over @group->cg_list,
+ *	looking for a matching config_item. If matching item is found
+ *	take a reference and return the item.
+ */
+
+struct config_item * config_group_find_obj(struct config_group * group, const char * name)
+{
+	struct list_head * entry;
+	struct config_item * ret = NULL;
+
+        /* XXX LOCKING! */
+	list_for_each(entry,&group->cg_children) {
+		struct config_item * item = to_item(entry);
+		if (config_item_name(item) &&
+                    !strcmp(config_item_name(item), name)) {
+			ret = config_item_get(item);
+			break;
+		}
+	}
+	return ret;
+}
+
+
+EXPORT_SYMBOL(config_item_init);
+EXPORT_SYMBOL(config_group_init);
+EXPORT_SYMBOL(config_item_get);
+EXPORT_SYMBOL(config_item_put);
+
diff -ruN linux-2.6.13-rc6.old/fs/configfs/mount.c linux-2.6.13-rc6/fs/configfs/mount.c
--- linux-2.6.13-rc6.old/fs/configfs/mount.c	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/fs/configfs/mount.c	2005-08-12 18:09:46.788349937 -0700
@@ -0,0 +1,149 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * mount.c - operations for initializing and mounting configfs.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on sysfs:
+ * 	sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/pagemap.h>
+#include <linux/init.h>
+
+#include <linux/configfs.h>
+#include "configfs_internal.h"
+
+/* Random magic number */
+#define CONFIGFS_MAGIC 0x62656570
+
+struct vfsmount * configfs_mount = NULL;
+struct super_block * configfs_sb = NULL;
+static int configfs_mnt_count = 0;
+
+static struct super_operations configfs_ops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+};
+
+static struct config_group configfs_root_group = {
+	.cg_item = {
+		.ci_namebuf	= "root",
+		.ci_name	= configfs_root_group.cg_item.ci_namebuf,
+	},
+};
+
+int configfs_is_root(struct config_item *item)
+{
+	return item == &configfs_root_group.cg_item;
+}
+
+static struct configfs_dirent configfs_root = {
+	.s_sibling	= LIST_HEAD_INIT(configfs_root.s_sibling),
+	.s_children	= LIST_HEAD_INIT(configfs_root.s_children),
+	.s_element	= &configfs_root_group.cg_item,
+	.s_type		= CONFIGFS_ROOT,
+};
+
+static int configfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct inode *inode;
+	struct dentry *root;
+
+	sb->s_blocksize = PAGE_CACHE_SIZE;
+	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+	sb->s_magic = CONFIGFS_MAGIC;
+	sb->s_op = &configfs_ops;
+	configfs_sb = sb;
+
+	inode = configfs_new_inode(S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
+	if (inode) {
+		inode->i_op = &configfs_dir_inode_operations;
+		inode->i_fop = &configfs_dir_operations;
+		/* directory inodes start off with i_nlink == 2 (for "." entry) */
+		inode->i_nlink++;
+	} else {
+		pr_debug("configfs: could not get root inode\n");
+		return -ENOMEM;
+	}
+
+	root = d_alloc_root(inode);
+	if (!root) {
+		pr_debug("%s: could not get root dentry!\n",__FUNCTION__);
+		iput(inode);
+		return -ENOMEM;
+	}
+	config_group_init(&configfs_root_group);
+	configfs_root_group.cg_item.ci_dentry = root;
+	root->d_fsdata = &configfs_root;
+	sb->s_root = root;
+	return 0;
+}
+
+static struct super_block *configfs_get_sb(struct file_system_type *fs_type,
+	int flags, const char *dev_name, void *data)
+{
+	return get_sb_single(fs_type, flags, data, configfs_fill_super);
+}
+
+static struct file_system_type configfs_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "configfs",
+	.get_sb		= configfs_get_sb,
+	.kill_sb	= kill_litter_super,
+};
+
+int configfs_pin_fs(void)
+{
+	return simple_pin_fs("configfs", &configfs_mount,
+			     &configfs_mnt_count);
+}
+
+void configfs_release_fs(void)
+{
+	simple_release_fs(&configfs_mount, &configfs_mnt_count);
+}
+
+static int __init configfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&configfs_fs_type);
+	if (err) {
+                printk(KERN_ERR "configfs: Unable to register filesystem!\n");
+	}
+
+	return err;
+}
+
+static void __exit configfs_exit(void)
+{
+        unregister_filesystem(&configfs_fs_type);
+}
+
+MODULE_AUTHOR("Oracle");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.0.1");
+MODULE_DESCRIPTION("Simple RAM filesystem for user driven kernel subsystem configuration.");
+
+module_init(configfs_init);
+module_exit(configfs_exit);
diff -ruN linux-2.6.13-rc6.old/fs/configfs/symlink.c linux-2.6.13-rc6/fs/configfs/symlink.c
--- linux-2.6.13-rc6.old/fs/configfs/symlink.c	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/fs/configfs/symlink.c	2005-08-12 18:09:46.788349937 -0700
@@ -0,0 +1,272 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * symlink.c - operations for configfs symlinks.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on sysfs:
+ * 	sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/namei.h>
+
+#include <linux/configfs.h>
+#include "configfs_internal.h"
+
+static int item_depth(struct config_item * item)
+{
+	struct config_item * p = item;
+	int depth = 0;
+	do { depth++; } while ((p = p->ci_parent) && !configfs_is_root(p));
+	return depth;
+}
+
+static int item_path_length(struct config_item * item)
+{
+	struct config_item * p = item;
+	int length = 1;
+	do {
+		length += strlen(config_item_name(p)) + 1;
+		p = p->ci_parent;
+	} while (p && !configfs_is_root(p));
+	return length;
+}
+
+static void fill_item_path(struct config_item * item, char * buffer, int length)
+{
+	struct config_item * p;
+
+	--length;
+	for (p = item; p && !configfs_is_root(p); p = p->ci_parent) {
+		int cur = strlen(config_item_name(p));
+
+		/* back up enough to print this bus id with '/' */
+		length -= cur;
+		strncpy(buffer + length,config_item_name(p),cur);
+		*(buffer + --length) = '/';
+	}
+}
+
+static int create_link(struct config_item *parent_item,
+ 		       struct config_item *item,
+		       struct dentry *dentry)
+{
+	struct configfs_dirent *target_sd = item->ci_dentry->d_fsdata;
+	struct configfs_symlink *sl;
+	int ret;
+
+	ret = -ENOMEM;
+	sl = kmalloc(sizeof(struct configfs_symlink), GFP_KERNEL);
+	if (sl) {
+		sl->sl_target = config_item_get(item);
+		/* FIXME: needs a lock, I'd bet */
+		list_add(&sl->sl_list, &target_sd->s_links);
+		ret = configfs_create_link(sl, parent_item->ci_dentry,
+					   dentry);
+		if (ret) {
+			list_del_init(&sl->sl_list);
+			config_item_put(item);
+			kfree(sl);
+		}
+	}
+
+	return ret;
+}
+
+
+static int get_target(const char *symname, struct nameidata *nd,
+		      struct config_item **target)
+{
+	int ret;
+
+	ret = path_lookup(symname, LOOKUP_FOLLOW|LOOKUP_DIRECTORY, nd);
+	if (!ret) {
+		if (nd->dentry->d_sb == configfs_sb) {
+			*target = configfs_get_config_item(nd->dentry);
+			if (!*target) {
+				ret = -ENOENT;
+				path_release(nd);
+			}
+		} else
+			ret = -EPERM;
+	}
+
+	return ret;
+}
+
+
+int configfs_symlink(struct inode *dir, struct dentry *dentry, const char *symname)
+{
+	int ret;
+	struct nameidata nd;
+	struct config_item *parent_item;
+	struct config_item *target_item;
+	struct config_item_type *type;
+
+	ret = -EPERM;  /* What lack-of-symlink returns */
+	if (dentry->d_parent == configfs_sb->s_root)
+		goto out;
+
+	parent_item = configfs_get_config_item(dentry->d_parent);
+	type = parent_item->ci_type;
+
+	if (!type || !type->ct_item_ops ||
+	    !type->ct_item_ops->allow_link)
+		goto out_put;
+
+	ret = get_target(symname, &nd, &target_item);
+	if (ret)
+		goto out_put;
+
+	ret = type->ct_item_ops->allow_link(parent_item, target_item);
+	if (!ret)
+		ret = create_link(parent_item, target_item, dentry);
+
+	config_item_put(target_item);
+	path_release(&nd);
+
+out_put:
+	config_item_put(parent_item);
+
+out:
+	return ret;
+}
+
+int configfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct configfs_dirent *sd = dentry->d_fsdata;
+	struct configfs_symlink *sl;
+	struct config_item *parent_item;
+	struct config_item_type *type;
+	int ret;
+
+	ret = -EPERM;  /* What lack-of-symlink returns */
+	if (!(sd->s_type & CONFIGFS_ITEM_LINK))
+		goto out;
+
+	if (dentry->d_parent == configfs_sb->s_root)
+		BUG();
+
+	sl = sd->s_element;
+
+	parent_item = configfs_get_config_item(dentry->d_parent);
+	type = parent_item->ci_type;
+
+	list_del_init(&sd->s_sibling);
+	configfs_drop_dentry(sd, dentry->d_parent);
+	dput(dentry);
+	configfs_put(sd);
+
+	/*
+	 * drop_link() must be called before
+	 * list_del_init(&sl->sl_list), so that the order of
+	 * drop_link(this, target) and drop_item(target) is preserved.
+	 */
+	if (type && type->ct_item_ops &&
+	    type->ct_item_ops->drop_link)
+		type->ct_item_ops->drop_link(parent_item,
+					       sl->sl_target);
+
+	/* FIXME: Needs lock */
+	list_del_init(&sl->sl_list);
+
+	/* Put reference from create_link() */
+	config_item_put(sl->sl_target);
+	kfree(sl);
+
+	config_item_put(parent_item);
+
+	ret = 0;
+
+out:
+	return ret;
+}
+
+static int configfs_get_target_path(struct config_item * item, struct config_item * target,
+				   char *path)
+{
+	char * s;
+	int depth, size;
+
+	depth = item_depth(item);
+	size = item_path_length(target) + depth * 3 - 1;
+	if (size > PATH_MAX)
+		return -ENAMETOOLONG;
+
+	pr_debug("%s: depth = %d, size = %d\n", __FUNCTION__, depth, size);
+
+	for (s = path; depth--; s += 3)
+		strcpy(s,"../");
+
+	fill_item_path(target, path, size);
+	pr_debug("%s: path = '%s'\n", __FUNCTION__, path);
+
+	return 0;
+}
+
+static int configfs_getlink(struct dentry *dentry, char * path)
+{
+	struct config_item *item, *target_item;
+	int error = 0;
+
+	item = configfs_get_config_item(dentry->d_parent);
+	if (!item)
+		return -EINVAL;
+
+	target_item = configfs_get_config_item(dentry);
+	if (!target_item) {
+		config_item_put(item);
+		return -EINVAL;
+	}
+
+	down_read(&configfs_rename_sem);
+	error = configfs_get_target_path(item, target_item, path);
+	up_read(&configfs_rename_sem);
+
+	config_item_put(item);
+	config_item_put(target_item);
+	return error;
+
+}
+
+static int configfs_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	int error = -ENOMEM;
+	unsigned long page = get_zeroed_page(GFP_KERNEL);
+	if (page)
+		error = configfs_getlink(dentry, (char *) page);
+	nd_set_link(nd, error ? ERR_PTR(error) : (char *)page);
+	return 0;
+}
+
+static void configfs_put_link(struct dentry *dentry, struct nameidata *nd)
+{
+	char *page = nd_get_link(nd);
+	if (!IS_ERR(page))
+		free_page((unsigned long)page);
+}
+
+struct inode_operations configfs_symlink_inode_operations = {
+	.follow_link = configfs_follow_link,
+	.readlink = generic_readlink,
+	.put_link = configfs_put_link,
+};
+
diff -ruN linux-2.6.13-rc6.old/include/linux/configfs.h linux-2.6.13-rc6/include/linux/configfs.h
--- linux-2.6.13-rc6.old/include/linux/configfs.h	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.13-rc6/include/linux/configfs.h	2005-08-12 18:09:48.026393458 -0700
@@ -0,0 +1,205 @@
+/* -*- mode: c; c-basic-offset: 8; -*-
+ * vim: noexpandtab sw=8 ts=8 sts=0:
+ *
+ * configfs.h - definitions for the device driver filesystem
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Based on sysfs:
+ * 	sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * Based on kobject.h:
+ *      Copyright (c) 2002-2003	Patrick Mochel
+ *      Copyright (c) 2002-2003	Open Source Development Labs
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ *
+ * Please read Documentation/filesystems/configfs.txt before using the
+ * configfs interface, ESPECIALLY the parts about reference counts and
+ * item destructors.
+ */
+
+#ifndef _CONFIGFS_H_
+#define _CONFIGFS_H_
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/kref.h>
+
+#include <asm/atomic.h>
+#include <asm/semaphore.h>
+
+#define CONFIGFS_ITEM_NAME_LEN	20
+
+struct module;
+
+struct configfs_item_operations;
+struct configfs_group_operations;
+struct configfs_attribute;
+struct configfs_subsystem;
+
+struct config_item {
+	char			*ci_name;
+	char			ci_namebuf[CONFIGFS_ITEM_NAME_LEN];
+	struct kref		ci_kref;
+	struct list_head	ci_entry;
+	struct config_item	*ci_parent;
+	struct config_group	*ci_group;
+	struct config_item_type	*ci_type;
+	struct dentry		*ci_dentry;
+};
+
+extern int config_item_set_name(struct config_item *, const char *, ...);
+
+static inline char *config_item_name(struct config_item * item)
+{
+	return item->ci_name;
+}
+
+extern void config_item_init(struct config_item *);
+extern void config_item_init_type_name(struct config_item *item,
+				       const char *name,
+				       struct config_item_type *type);
+extern void config_item_cleanup(struct config_item *);
+
+extern struct config_item * config_item_get(struct config_item *);
+extern void config_item_put(struct config_item *);
+
+struct config_item_type {
+	struct module				*ct_owner;
+	struct configfs_item_operations		*ct_item_ops;
+	struct configfs_group_operations	*ct_group_ops;
+	struct configfs_attribute		**ct_attrs;
+};
+
+
+/**
+ *	group - a group of config_items of a specific type, belonging
+ *	to a specific subsystem.
+ */
+
+struct config_group {
+	struct config_item		cg_item;
+	struct list_head		cg_children;
+	struct configfs_subsystem 	*cg_subsys;
+	struct config_group		**default_groups;
+};
+
+
+extern void config_group_init(struct config_group *group);
+extern void config_group_init_type_name(struct config_group *group,
+					const char *name,
+					struct config_item_type *type);
+
+
+static inline struct config_group *to_config_group(struct config_item *item)
+{
+	return item ? container_of(item,struct config_group,cg_item) : NULL;
+}
+
+static inline struct config_group *config_group_get(struct config_group *group)
+{
+	return group ? to_config_group(config_item_get(&group->cg_item)) : NULL;
+}
+
+static inline void config_group_put(struct config_group *group)
+{
+	config_item_put(&group->cg_item);
+}
+
+extern struct config_item *config_group_find_obj(struct config_group *, const char *);
+
+
+struct configfs_attribute {
+	char			*ca_name;
+	struct module 		*ca_owner;
+	mode_t			ca_mode;
+};
+
+
+/*
+ * If allow_link() exists, the item can symlink(2) out to other
+ * items.  If the item is a group, it may support mkdir(2).
+ * Groups supply one of make_group() and make_item().  If the
+ * group supports make_group(), one can create group children.  If it
+ * supports make_item(), one can create config_item children.  If it has
+ * default_groups on group->default_groups, it has automatically created
+ * group children.  default_groups may coexist alongsize make_group() or
+ * make_item(), but if the group wishes to have only default_groups
+ * children (disallowing mkdir(2)), it need not provide either function.
+ * If the group has commit(), it supports pending and commited (active)
+ * items.
+ */
+struct configfs_item_operations {
+	void (*release)(struct config_item *);
+	ssize_t	(*show_attribute)(struct config_item *, struct configfs_attribute *,char *);
+	ssize_t	(*store_attribute)(struct config_item *,struct configfs_attribute *,const char *, size_t);
+	int (*allow_link)(struct config_item *src, struct config_item *target);
+	int (*drop_link)(struct config_item *src, struct config_item *target);
+};
+
+struct configfs_group_operations {
+	struct config_item *(*make_item)(struct config_group *group, const char *name);
+	struct config_group *(*make_group)(struct config_group *group, const char *name);
+	int (*commit_item)(struct config_item *item);
+	void (*drop_item)(struct config_group *group, struct config_item *item);
+};
+
+
+
+/**
+ * Use these macros to make defining attributes easier. See include/linux/device.h
+ * for examples..
+ */
+
+#if 0
+#define __ATTR(_name,_mode,_show,_store) { \
+	.attr = {.ca_name = __stringify(_name), .ca_mode = _mode, .ca_owner = THIS_MODULE },	\
+	.show	= _show,					\
+	.store	= _store,					\
+}
+
+#define __ATTR_RO(_name) { \
+	.attr	= { .ca_name = __stringify(_name), .ca_mode = 0444, .ca_owner = THIS_MODULE },	\
+	.show	= _name##_show,	\
+}
+
+#define __ATTR_NULL { .attr = { .name = NULL } }
+
+#define attr_name(_attr) (_attr).attr.name
+#endif
+
+
+struct configfs_subsystem {
+	struct config_group	su_group;
+	struct semaphore	su_sem;
+};
+
+static inline struct configfs_subsystem *to_configfs_subsystem(struct config_group *group)
+{
+	return group ?
+		container_of(group, struct configfs_subsystem, su_group) :
+		NULL;
+}
+
+int configfs_register_subsystem(struct configfs_subsystem *subsys);
+void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
+
+#endif  /* __KERNEL__ */
+
+#endif /* _CONFIGFS_H_ */

-- 

"What does it say about a society's priorities when the time you
 spend in meetings on Monday is greater than the total number of
 hours you spent sleeping over the weekend?"
	- Nat Friedman

			http://www.jlbec.org/
			jlbec at evilplan.org



From linux-cluster at redhat.com  Sun Aug 21 17:11:07 2005
From: linux-cluster at redhat.com (Cluster2005)
Date: Sun, 21 Aug 2005 13:11:07 -0400
Subject: [Linux-cluster] IEEE Cluster2005 Registration Online
Message-ID: <20050821171103.D99744097E@villi.rcsnetworks.com>

see http://cluster2005.org



From jan at bruvoll.com  Mon Aug 22 19:19:52 2005
From: jan at bruvoll.com (Jan Bruvoll)
Date: Mon, 22 Aug 2005 21:19:52 +0200
Subject: [Linux-cluster] Fencing woes
Message-ID: <430A2558.5060404@bruvoll.com>

Dear list,

I am having problems with a node where I can't get it to rejoin the
fence domain. It has been rebooted before, and it has so far
automatically joined the fence domain so that that it could pick up the
rest of the depending services, but not this time. I upgraded the kernel
and cluster/GFS suite (this is a Gentoo system) to
gentoo-sources-2.6.12-r9 and cluster software v1.00.00.

I guess the biggest problem is that I don't know what to actually do to
unfence the node that has been shut out. Since I have set the cluster up
to use manual fencing, I suppose the un-fence command to use is
fence_ack_manual, however using that only produces a warning about a
missing /tmp/fence_manual.fifo. Manually creating this fifo before
running the command only removes the fifo -and- produces the warning.

This is what a cman_tool services emits:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-2,2,1
[]


I don't seem to be able to find any information anywhere on the "Codes"
- any pointers there?

The cluster has 6 members: one "file server" and five "clients". Excerpt
from cluster.conf follows:

<?xml version="1.0"?>
<cluster name="nbs-sc-1" config_version="1">

  <cman></cman>

  <dlm></dlm>

  <clusternodes>
    <clusternode name="fs-2" votes="2">
      <fence>
        <method name="single">
          <device name="human" ipaddr="10.42.0.200"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="app-1" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="10.42.0.202"/>
        </method>
      </fence>
    </clusternode>
    [...]
    <clusternode name="app-5" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="10.42.0.206"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fence_devices>
    <device name="human" agent="fence_manual"/>
  </fence_devices>
</cluster>

I also found this from dmesg - is this important?:
SM: process_reply invalid id=0 nodeid=4
SM: process_reply invalid id=0 nodeid=1
SM: process_reply invalid id=0 nodeid=2
SM: process_reply invalid id=0 nodeid=6
SM: process_reply invalid id=0 nodeid=5

Any help or pointers to more information would be most appreciated. I
have read through everything I could find on the i'net without becoming
much wiser, and the status today is that I can't upgrade single servers
in my cluster without taking down the whole group - which is hardly useful.

Thanks in advance for any assistance!

Best regards
Jan Bruvoll



From Axel.Thimm at ATrpms.net  Mon Aug 22 22:33:41 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Tue, 23 Aug 2005 00:33:41 +0200
Subject: [Linux-cluster] rgmanager/src/resources/samba.sh
Message-ID: <20050822223341.GI24127@neu.nirvana>

is missing :)

services.sh references "samba" as a further resource agent. Is this an
old remnant still to be removed, or will there be a samba.sh?

In contrast to nfs samba has too many options to map them adequately
to parameters of a resource agent, so perhaps the idea is to have
users copy /etc/init.d/smb and modify accordingly.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050823/e3e64d09/attachment.sig>

From Axel.Thimm at ATrpms.net  Mon Aug 22 22:52:27 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Tue, 23 Aug 2005 00:52:27 +0200
Subject: [Linux-cluster] iptables protection wrapper;
	nfsexport.sh vs ip.sh racing
Message-ID: <20050822225227.GJ24127@neu.nirvana>

The typical NFS cluster setups seem to fail for Gigabit NFS/tcp. Some
clients that are busy during the relocation of services either bail
out with RPC garbage, or set the filesytem to EACCES, or timeout for
17 min.

This has to do with some racing/timing in the NFS vs ip setup/teardown
procedure. Protecting the service startup/shutdown with an iptables
rule is a good workaround to fix this.

But what is the proper way to integrate this workaround? I could setup
new resource agents, one with start=1 and another with start=6 to
start/stop dropping packages. Or I could modify the current resource
agents to allow for child entities and wrap one script around the
service and one in the inner element.

I could probably also hack ip.sh to introduce some delay, to make sure
the NFS services are really up/down before proceeding. Or maybe fix
the true evil by making nfsexport.sh wait for NFS startup/stop
completion (how?)?

What's the best way?
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050823/04b14368/attachment.sig>

From mark.fasheh at oracle.com  Tue Aug 23 02:41:16 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Mon, 22 Aug 2005 19:41:16 -0700
Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs
In-Reply-To: <20050819071344.GB10864@redhat.com>
References: <20050818060750.GA10133@redhat.com>
	<20050818212348.GW21228@ca-server1.us.oracle.com>
	<20050819071344.GB10864@redhat.com>
Message-ID: <20050823024116.GY21228@ca-server1.us.oracle.com>

On Fri, Aug 19, 2005 at 03:13:44PM +0800, David Teigland wrote:
> The nodemanager RFC I sent a month ago
>   http://marc.theaimsgroup.com/?l=linux-kernel&m=112166723919347&w=2
> 
> amounts to half of dlm/config.c (everything under comms/ above) moved into
> a separate kernel module.  That would be trivial to do, and is still an
> option to bat around.
Yeah ok, so the address/id/local part is still there. As is much of the API
to query those attributes.

> I question whether factoring such a small chunk into a separate module is
> really worth it, though?
IMHO, yes. Mostly because we both have very similar basic requirements there
and it seems a waste to have duplicated code (even if it's not a huge
amount). Future projects wanting to query basic node information from the
kernel could have simply used that API without having to further duplicate
code too. That said, I'm not sure it has to be done *now*

Was there anything in my comments which made going forward with that
approach difficult for dlm?

> Making all of config.c (all of /config/dlm/ above) into a separate module
> wouldn't seem quite so strange. It would require just a few lines of code
> to turn it into a stand alone module.
Without the dlm specifics, right? It's perfectly fine with me if dlm has a
couple more attributes that it wants on a node object - OCFS2 simply won't
query them.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From teigland at redhat.com  Tue Aug 23 03:46:09 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 23 Aug 2005 11:46:09 +0800
Subject: [Linux-cluster] Fencing woes
In-Reply-To: <430A2558.5060404@bruvoll.com>
References: <430A2558.5060404@bruvoll.com>
Message-ID: <20050823034609.GA13360@redhat.com>

On Mon, Aug 22, 2005 at 09:19:52PM +0200, Jan Bruvoll wrote:
> Dear list,
> 
> I am having problems with a node where I can't get it to rejoin the
> fence domain. It has been rebooted before, and it has so far
> automatically joined the fence domain so that that it could pick up the
> rest of the depending services, but not this time. I upgraded the kernel
> and cluster/GFS suite (this is a Gentoo system) to
> gentoo-sources-2.6.12-r9 and cluster software v1.00.00.

Are the nodes running slightly different versions of the cluster software?
They must all be running the same version -- there was a change to the
cman message formats shortly before 1.00.00 was released.

> I guess the biggest problem is that I don't know what to actually do to
> unfence the node that has been shut out. Since I have set the cluster up
> to use manual fencing, I suppose the un-fence command to use is
> fence_ack_manual, however using that only produces a warning about a
> missing /tmp/fence_manual.fifo. Manually creating this fifo before
> running the command only removes the fifo -and- produces the warning.
> 
> This is what a cman_tool services emits:
> 
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           0   2 join      S-2,2,1
> []

Manual fencing is hard to use and get right, first recommendation is to
not use it.  You only need to run fence_ack_manual when instructed to do
so by a message in /var/log/messages on some node.

Dave



From Hansjoerg.Maurer at dlr.de  Tue Aug 23 05:30:55 2005
From: Hansjoerg.Maurer at dlr.de (=?ISO-8859-15?Q?Hansj=F6rg_Maurer?=)
Date: Tue, 23 Aug 2005 07:30:55 +0200
Subject: [Linux-cluster] iptables protection wrapper;	nfsexport.sh vs
	ip.sh racing
In-Reply-To: <20050822225227.GJ24127@neu.nirvana>
References: <20050822225227.GJ24127@neu.nirvana>
Message-ID: <430AB48F.9000005@dlr.de>

Hi



Axel Thimm schrieb:

>The typical NFS cluster setups seem to fail for Gigabit NFS/tcp. Some
>clients that are busy during the relocation of services either bail
>out with RPC garbage, or set the filesytem to EACCES, or timeout for
>17 min.
>  
>
we observe this problem to, using NFS over TCP.
Mounting the filesystem with -o tcp,timeo=600,retrans=1
reduces the  timeout for about one minute on Linux and Solaris 10.

Greetings

Hansj?rg



>This has to do with some racing/timing in the NFS vs ip setup/teardown
>procedure. Protecting the service startup/shutdown with an iptables
>rule is a good workaround to fix this.
>
>But what is the proper way to integrate this workaround? I could setup
>new resource agents, one with start=1 and another with start=6 to
>start/stop dropping packages. Or I could modify the current resource
>agents to allow for child entities and wrap one script around the
>service and one in the inner element.
>
>I could probably also hack ip.sh to introduce some delay, to make sure
>the NFS services are really up/down before proceeding. Or maybe fix
>the true evil by making nfsexport.sh wait for NFS startup/stop
>completion (how?)?
>
>What's the best way?
>  
>
>------------------------------------------------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>

-- 
_________________________________________________________________

Dr.  Hansjoerg Maurer           | LAN- & System-Manager
                                |
Deutsches Zentrum               | DLR Oberpfaffenhofen
  f. Luft- und Raumfahrt e.V.   |
Institut f. Robotik             |
Postfach 1116                   | Muenchner Strasse 20
82230 Wessling                  | 82234 Wessling
Germany                         |
                                |
Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
__________________________________________________________________


There are 10 types of people in this world, 
those who understand binary and those who don't.



From Rob.TerVeer at getronics.com  Tue Aug 23 10:21:53 2005
From: Rob.TerVeer at getronics.com (Veer, Rob ter)
Date: Tue, 23 Aug 2005 12:21:53 +0200
Subject: [Linux-cluster] Workings of Tiebreaker IP (RHCS)
Message-ID: <09B5576A48518947960A89175FB6C23DE38604@excbebr204.europe.unity>

Hello,

To completely understand what the role of a tiebreaker IP within a two
or four node RHCS cluster is, I've searched redhat and Google. I can't
however find anything describing the precise workings of the
tiebreaker-IP. I would really like to know what happens excactly when
the tiebreaker is used an how (maybe even somekind of flow diagram). 

Can anyone here maybe explain that to me, or point me in the direction
of more specific information regarding tiebreaker?

Regards,
Rob.



From sco at adviseo.fr  Tue Aug 23 13:21:46 2005
From: sco at adviseo.fr (Sylvain COUTANT)
Date: Tue, 23 Aug 2005 15:21:46 +0200
Subject: [Linux-cluster] gnbd/clvm and device mapper : 256 devices
	limitation ?
Message-ID: <20050823132153.CCC383181C2@smtp.cegetel.net>

Hello the list,

I was wondering if it were at all possible to have more than 256 block devices shared in a cluster.

I'd like to export gnbd devices (10-15) with volume groups on top. There would be many lvs (up to 256) in each vg.

Question is : how the device mapper will handle this on each cluster member ?

I ran a basic test by creating more than 256 lvs in a single vg and device mapper did create devices twice with the same major/minor (wrapping after minor 255).

Basically, that would mean I won't be able to share more than 256 lvs amongst the entire architecture. This limitation is far too low for me. I'd prefer hear about 10000+ ;-)

I know this question is not directly related to the cluster project (except the clvm part), but since I have chances to find here some people with knowledge about large architectures, I try it anyway ...


Thanks in advance for any tip.


--
Sylvain COUTANT

ADVISEO
http://www.adviseo.fr/
http://www.open-sp.fr/

Tel: +33 (0)1 30 42 72 95
Fax: +33 (0)1 30 42 72 95
Gsm: +33 (0)6 30 79 26 33
sco at adviseo.fr






From djani22 at dynamicweb.hu  Tue Aug 23 21:33:28 2005
From: djani22 at dynamicweb.hu (djani22 at dynamicweb.hu)
Date: Tue, 23 Aug 2005 23:33:28 +0200
Subject: [Linux-cluster] GNBD 1.0.0-1 bug
References: <20050818060750.GA10133@redhat.com><20050818212348.GW21228@ca-server1.us.oracle.com><20050819071344.GB10864@redhat.com>
	<20050823024116.GY21228@ca-server1.us.oracle.com>
Message-ID: <028b01c5a82a$52732da0$0400a8c0@LocalHost>

Hello list,

I am not sure this is the right place, but I can't found better...

I have found one bug in gnbd 1.0.0-1!

There are the messages:

#1

 Unable to handle kernel paging request at virtual address a014d7a5
  printing eip:
 c0118cee
 *pde = f7bedd02
 Oops: 0000 [#1]
 SMP
 Modules linked in: netconsole gnbd
 CPU:    0
 EIP:    0060:[<c0118cee>]    Not tainted VLI
 EFLAGS: 00010296   (2.6.13-rc6)
 EIP is at kmap+0x1e/0x54
 eax: 00000246   ebx: a014d7a5   ecx: c11ef260   edx: cabbc400
 esi: 00008000   edi: 00000001   ebp: f6c7fe00   esp: f6c7fdf4
 ds: 007b   es: 007b   ss: 0068
 Process md3_raid1 (pid: 2769, threadinfo=f6c7e000 task=f7eef020)
 Stack: c0577800 00000006 f5f93cfc f6c7fe54 f895a9cc a014d7a5 00000001
cf793000
        00001000 00004000 d3fc3180 f73e9bf0 f895e718 cabbc400 007ea037
01000000
        d4175a4c f895e6f0 65000000 00f03d8d 00100000 d4175a4c f895e6f0
f895e700
 Call Trace:
  [<c0103ca2>] show_stack+0x9a/0xd0
  [<c0103e6d>] show_registers+0x175/0x209
  [<c010408c>] die+0xfa/0x17c
  [<c0117b68>] do_page_fault+0x269/0x7bd
  [<c01038d7>] error_code+0x4f/0x54
  [<f895a9cc>] __gnbd_send_req+0x196/0x28d [gnbd]
  [<f895af12>] do_gnbd_request+0xe5/0x198 [gnbd]
  [<c0383a0d>] __generic_unplug_device+0x28/0x2e
  [<c038150f>] __elv_add_request+0xaa/0xac
  [<c0384e5b>] __make_request+0x20d/0x512
  [<c0385490>] generic_make_request+0xb2/0x27a
  [<c04748a2>] raid1d+0xbf/0x2cb
  [<c04825c9>] md_thread+0x134/0x16f
  [<c01010d5>] kernel_thread_helper+0x5/0xb
 Code: 89 c1 81 e1 ff ff 0f 00 eb b0 90 90 90 55 89 e5 53 83 ec 08 8b 5d 08
c7 44 24 04 06 00 00 00 c7 04 24 00 78 57 c0 e8 72
 47 00 00 <8b> 03 c1 e8 1e 8b 14 85 14 db 73 c0 8b 82 0c 04 00 00 05 00 09
  <0>Fatal exception: panic in 5 seconds



#2
 ------------[ cut here ]------------
 kernel BUG at mm/highmem.c:183!
 invalid operand: 0000 [#1]
 SMP
 Modules linked in: netconsole gnbd
 CPU:    0
 EIP:    0060:[<c0151167>]    Not tainted VLI
 EFLAGS: 00010246   (2.6.13-rc6)
 EIP is at kunmap_high+0x1f/0x93
 eax: 00000000   ebx: c1c1a3a0   ecx: c0740e88   edx: 00000292
 esi: 00001000   edi: 00000000   ebp: f6e81df8   esp: f6e81df0
 ds: 007b   es: 007b   ss: 0068
 Process md2_raid1 (pid: 2813, threadinfo=f6e80000 task=f7eba020)
 Stack: c1c1a3a0 d3708c10 f6e81e00 c0118d66 f6e81e54 f895a9fd c1c1a3a0
00000001
        d0990000 00001000 00004000 ebddff84 f6cb5d8c f895e620 d452f900
007ea037
        01000000 f1c1d9ac f895e5f8 b9000000 00106937 00100000 f1c1d9ac
f895e5f8
 Call Trace:
  [<c0103ca2>] show_stack+0x9a/0xd0
  [<c0103e6d>] show_registers+0x175/0x209
  [<c010408c>] die+0xfa/0x17c
  [<c010418c>] do_trap+0x7e/0xb2
  [<c01044a8>] do_invalid_op+0xa9/0xb3
  [<c01038d7>] error_code+0x4f/0x54
  [<c0118d66>] kunmap+0x42/0x44
  [<f895a9fd>] __gnbd_send_req+0x1c7/0x28d [gnbd]
  [<f895af12>] do_gnbd_request+0xe5/0x198 [gnbd]
  [<c0383a0d>] __generic_unplug_device+0x28/0x2e
  [<c038150f>] __elv_add_request+0xaa/0xac
  [<c0384e5b>] __make_request+0x20d/0x512
  [<c0385490>] generic_make_request+0xb2/0x27a
  [<c04748a2>] raid1d+0xbf/0x2cb
  [<c04825c9>] md_thread+0x134/0x16f
  [<c01010d5>] kernel_thread_helper+0x5/0xb
 Code: e8 08 06 00 00 89 c7 e9 38 ff ff ff 55 89 e5 53 83 ec 04 89 c3 b8 80
6c 6c c0 e8 59 c3 40 00 89 1c 24 e8 e6 05 00 00 85
 c0 75 08 <0f> 0b b7 00 0a 78 57 c0 05 00 00 40 00 c1 e8 0c 8b 14 85 20 dc


#3

------------[ cut here ]------------
kernel BUG at mm/highmem.c:183!
invalid operand: 0000 [#1]
SMP
Modules linked in: netconsole gnbd
CPU:    0
EIP:    0060:[<c0151167>]    Not tainted VLI
EFLAGS: 00010246   (2.6.13-rc6)
EIP is at kunmap_high+0x1f/0x93
eax: 00000000   ebx: c25c99a0   ecx: c0740508   edx: 00000292
esi: 00001000   edi: 00000000   ebp: f6c3dd34   esp: f6c3dd2c
ds: 007b   es: 007b   ss: 0068
Process md10_raid1 (pid: 2865, threadinfo=f6c3c000 task=f7be4020)
Stack: c25c99a0 e287c8c0 f6c3dd3c c0118d66 f6c3dd90 f895a9fd c25c99a0
00000001
    f6d59000 00001000 00004000 f6c3ddb4 c010377e f895e810 dd1c4200 007ea037
    01000000 f10392cc f6c3ddb4 20000000 00004899 00100000 f10392cc f895e7e8
Call Trace:
 [<c0103ca2>] show_stack+0x9a/0xd0
 [<c0103e6d>] show_registers+0x175/0x209
 [<c010408c>] die+0xfa/0x17c
 [<c010418c>] do_trap+0x7e/0xb2
 [<c01044a8>] do_invalid_op+0xa9/0xb3
 [<c01038d7>] error_code+0x4f/0x54
 [<c0118d66>] kunmap+0x42/0x44
 [<f895a9fd>] __gnbd_send_req+0x1c7/0x28d [gnbd]
 [<f895af12>] do_gnbd_request+0xe5/0x198 [gnbd]
 [<c0383a0d>] __generic_unplug_device+0x28/0x2e
 [<c0384e8d>] __make_request+0x23f/0x512
 [<c0385490>] generic_make_request+0xb2/0x27a
 [<c03856a9>] submit_bio+0x51/0xe7
 [<c047de1f>] md_super_write+0x75/0x7d
 [<c047fa8d>] md_update_sb+0xd1/0x1e2
 [<c0483fa5>] md_check_recovery+0x197/0x3e9
 [<c0474805>] raid1d+0x22/0x2cb
 [<c04825c9>] md_thread+0x134/0x16f
 [<c01010d5>] kernel_thread_helper+0x5/0xb
 [43139186.670000] Code: e8 08 06 00 00 89 c7 e9 38 ff ff ff 55 89 e5 53 83
ec 04 89 c3 b8  80 6c 6c c0 e8 59 c3 40 00 89 1c 2
4 e8 e6 05 00 00 85 c0 75 08 <0f> 0b b7 00 0a 78 57 c0 05 00 00 40 00 c1 e8
0c 8b 14 85 20 dc
 <0>Fatal exception: panic in 5 seconds



And additionally I have 2 deadlock-messages, if it helps for something...

I hope I can help some....

Thanks

Janos



From brianu at silvercash.com  Wed Aug 24 17:24:43 2005
From: brianu at silvercash.com (brianu)
Date: Wed, 24 Aug 2005 10:24:43 -0700
Subject: [Linux-cluster] gnbd/clvm and device mapper : 256 devices
	limitation ?
Message-ID: <20050824172215.332115A867C@mail.silvercash.com>

Hello, 

I have a setup that is similar (except not on that scale), but I'm
missing one piece, I keep getting duplicate PV's when I run vgcreate
-aly & have not had any luck setting up multipath in LVM, although I
know this is not the response you wanted, could you share how you got
device-mapper to create devices twice with the same major/minor numbers?

Regards,

Brian Urrutia
Price Communications Inc.

On Tue, 2005-08-23 at 12:01 -0400, linux-cluster-request at redhat.com
wrote:
> 
> Date: Tue, 23 Aug 2005 15:21:46 +0200
> From: "Sylvain COUTANT" <sco at adviseo.fr>
     A. > Subject: [Linux-cluster] gnbd/clvm and device mapper : 256
        devices
     B. >         limitation ?
> To: <linux-cluster at redhat.com>
> Message-ID: <20050823132153.CCC383181C2 at smtp.cegetel.net>
> Content-Type: text/plain;       charset="utf-8"
> 
> Hello the list,
> 
> I was wondering if it were at all possible to have more than 256 block
> devices shared in a cluster.
> 
> I'd like to export gnbd devices (10-15) with volume groups on top.
> There would be many lvs (up to 256) in each vg.
> 
> Question is : how the device mapper will handle this on each cluster
> member ?
> 
> I ran a basic test by creating more than 256 lvs in a single vg and
> device mapper did create devices twice with the same major/minor
> (wrapping after minor 255).
> 
> Basically, that would mean I won't be able to share more than 256 lvs
> amongst the entire architecture. This limitation is far too low for
> me. I'd prefer hear about 10000+ ;-)
> 
> I know this question is not directly related to the cluster project
> (except the clvm part), but since I have chances to find here some
> people with knowledge about large architectures, I try it anyway ...
> 
> 
> Thanks in advance for any tip.





From sco at adviseo.fr  Wed Aug 24 17:29:45 2005
From: sco at adviseo.fr (Sylvain COUTANT)
Date: Wed, 24 Aug 2005 19:29:45 +0200
Subject: [Linux-cluster] gnbd/clvm and device mapper : 256
	deviceslimitation ?
In-Reply-To: <20050824172215.332115A867C@mail.silvercash.com>
Message-ID: <20050824172951.5A77D318514@smtp.cegetel.net>

> could you share how you got
> device-mapper to create devices twice with the same major/minor numbers?

Quite easy :

for a in `seq 1 300`; do lvcreate -L 32M -n test$a myvg; done

This was able to create around 260 (don't remember the exact number) lvs before failing. The 20 last ones before failing (between 240 and 260) were created with duplicate major/minor (debian kernel 2.6.11.12).


Regards,

--
Sylvain COUTANT

ADVISEO
http://www.adviseo.fr/
http://www.open-sp.fr/





From fabiomirmar at gmail.com  Wed Aug 24 18:56:25 2005
From: fabiomirmar at gmail.com (=?ISO-8859-1?Q?F=E1bio_Augusto?=)
Date: Wed, 24 Aug 2005 15:56:25 -0300
Subject: [Linux-cluster] Problems while installing the Heartbeat 2.0.0 on a
	RH AS 3.0 Update 5
Message-ID: <180003a6050824115668c4641f@mail.gmail.com>

Good Afternoon,

I'm trying to install the Heartbeat 2.0.0 package to create a High
Availability solution.

I have two IBM XSeries x455 using Red Hat Advanced Server 3 Update 5 here.

I've downloaded the following files from the linux-ha.org official
download home page:
- heartbeat-2.0.0-1.x86_64.rpm
- heartbeat-pils-2.0.0-1.x86_64.rpm
- heartbeat-stonith-2.0.0-1.x86_64.rpm

While trying to install them using rpm -ivh *, I receive the following
errors (dependencies):

[root at svdb2-1 RPMS]# rpm -ivh * > /root/dependencies-heartbeat.log
error: Failed dependencies:
        libc.so.6()(64bit) is needed by heartbeat-2.0.0-1
        libc.so.6(GLIBC_2.2.5)(64bit) is needed by heartbeat-2.0.0-1
        libc.so.6(GLIBC_2.3)(64bit) is needed by heartbeat-2.0.0-1
        libm.so.6()(64bit) is needed by heartbeat-2.0.0-1
        libnet.so.0()(64bit) is needed by heartbeat-2.0.0-1
        libpthread.so.0(GLIBC_2.2.5)(64bit) is needed by heartbeat-2.0.0-1
        libc.so.6()(64bit) is needed by heartbeat-pils-2.0.0-1
        libc.so.6(GLIBC_2.2.5)(64bit) is needed by heartbeat-pils-2.0.0-1
        libcrypto.so.0.9.7()(64bit) is needed by heartbeat-stonith-2.0.0-1
        libc.so.6()(64bit) is needed by heartbeat-stonith-2.0.0-1
        libc.so.6(GLIBC_2.2.5)(64bit) is needed by heartbeat-stonith-2.0.0-1
        libc.so.6(GLIBC_2.3)(64bit) is needed by heartbeat-stonith-2.0.0-1


I checked using a #rpm -qa | grep -i glibc and I have the following
files already installed:

glibc-kernheaders-2.4-8.34.1
glibc-2.3.2-95.33
glibc-headers-2.3.2-95.33
glibc-common-2.3.2-95.33
glibc-2.3.2-95.33
glibc-profile-2.3.2-95.33
glibc-devel-2.3.2-95.33
glibc-utils-2.3.2-95.33


I checked which of the files above are missing using the following command:

[root at svdb2-1 logs-glibc-fabio]# rpm -q --provides glibc | grep -i libc.so.6
libc.so.6  
libc.so.6(GCC_3.0)  
libc.so.6(GLIBC_2.0)  
libc.so.6(GLIBC_2.1)  
libc.so.6(GLIBC_2.1.1)  
libc.so.6(GLIBC_2.1.2)  
libc.so.6(GLIBC_2.1.3)  
libc.so.6(GLIBC_2.2)  
libc.so.6(GLIBC_2.2.1)  
libc.so.6(GLIBC_2.2.2)  
libc.so.6(GLIBC_2.2.3)  
libc.so.6(GLIBC_2.2.4)  
libc.so.6(GLIBC_2.2.6)  
libc.so.6(GLIBC_2.3)  
libc.so.6(GLIBC_2.3.2)  
libc.so.6(GLIBC_2.3.3)  
libc.so.6.1()(64bit)  
libc.so.6.1(GLIBC_2.2)(64bit)  
libc.so.6.1(GLIBC_2.2.1)(64bit)  
libc.so.6.1(GLIBC_2.2.2)(64bit)  
libc.so.6.1(GLIBC_2.2.3)(64bit)  
libc.so.6.1(GLIBC_2.2.4)(64bit)  
libc.so.6.1(GLIBC_2.2.6)(64bit)  
libc.so.6.1(GLIBC_2.3)(64bit)  
libc.so.6.1(GLIBC_2.3.2)(64bit)  
libc.so.6.1(GLIBC_2.3.3)(64bit)  

Do someone has ever encountered that problem? Which package can I
install to solve all the dependencies?

Thanks a Lot!!
-- 
F?bio Augusto Miranda Martins
E-mail: fabiomirmar at gmail.com



From jason_wilk at stircrazy.net  Wed Aug 24 19:05:02 2005
From: jason_wilk at stircrazy.net (Jason Wilkinson)
Date: Wed, 24 Aug 2005 14:05:02 -0500
Subject: [Linux-cluster] Quorum without fibre channel or shared scsi
Message-ID: <A1752420D10F65429F82C257C05AFAA301EB978E@godzilla.corp.digitalims.com>

Hi all,

I'm rather new to clustering...so I'd appreciate any help you may be able to
give me. I'm trying to set up a test cluster in advance of purchasing our
SAN. Currently I have several old laptops set up. What I'm trying to do is
to find a way to set up the quorum partition without the more expensive
infrastructure. Is it possible to do it using DRBD or some piece of
software? Do any of you know where I could find a really good howto?

Any help would be appreciated.

Thanks in advance,

Jason





From eric at bootseg.com  Wed Aug 24 19:17:21 2005
From: eric at bootseg.com (Eric Kerin)
Date: Wed, 24 Aug 2005 15:17:21 -0400
Subject: [Linux-cluster] Problems while installing the Heartbeat 2.0.0
	on a RH AS 3.0 Update 5
In-Reply-To: <180003a6050824115668c4641f@mail.gmail.com>
References: <180003a6050824115668c4641f@mail.gmail.com>
Message-ID: <1124911042.4244.7.camel@auh5-0479.corp.jabil.org>

On Wed, 2005-08-24 at 15:56 -0300, F?bio Augusto wrote:
> Good Afternoon,
> 
> I'm trying to install the Heartbeat 2.0.0 package to create a High
> Availability solution.
> 
> I have two IBM XSeries x455 using Red Hat Advanced Server 3 Update 5 here.
> 
> I've downloaded the following files from the linux-ha.org official
> download home page:
> - heartbeat-2.0.0-1.x86_64.rpm
> - heartbeat-pils-2.0.0-1.x86_64.rpm
> - heartbeat-stonith-2.0.0-1.x86_64.rpm

You most likely want the i586 packages, instead of the x86_64 versions
you downloaded, since you don't seem to have the x86_64 glibc installed.

Also, for help on linux-ha you might want to try the linux-ha mailing
lists (http://linux-ha.org/ContactUs)

Thanks,
Eric





From hansjoerg.maurer at dlr.de  Wed Aug 24 19:57:31 2005
From: hansjoerg.maurer at dlr.de (=?ISO-8859-1?Q?Hansj=F6rg_Maurer?=)
Date: Wed, 24 Aug 2005 21:57:31 +0200
Subject: [Linux-cluster] Problems running service exclusivly
Message-ID: <430CD12B.8060705@dlr.de>

Hi

I am trying to get the following config to work.

4 node cluster with
1 node providing a mysql service (only to nodes (bs1 and bs2) have the 
hardare to do so)
2 nodes providing a custom service (the same service on each node (bi1 
and bi2))
1 node fallback for both services

It is important, that on every node ony one service is running.
I created the following cluster configuration (RHEL4U1),
but the exclusice flag seems not to work.
I have created three priorisised restricted failoverdomains (one for 
eache services).
I assigned each service to one failoverdomain (with run exclusively on),
but if I start the rgmanger. als services are running on one node.
I can move them by hand to the right nodes, but this is nor persistent.

Here my cluster.conf

I would be glad, if someone could help me.

Greetings

Hansj?rg


<?xml version="1.0"?>
<cluster config_version="46" name="betty_call1">
        <fence_daemon clean_start="1" post_fail_delay="0" 
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="bs1" votes="1">
                </clusternode>
                <clusternode name="bs2" votes="1">
                </clusternode>
                <clusternode name="bi1" votes="1">
                </clusternode>
                <clusternode name="bi2" votes="1">
                </clusternode>
        </clusternodes>
        <cman/>

        <rm>
                <failoverdomains>
                        <failoverdomain name="mysql" ordered="1" 
restricted="1">
                                <failoverdomainnode name="bs1" 
priority="1"/>
                                <failoverdomainnode name="bs2" 
priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="betty-bi1" ordered="1" 
restricted="1">
                                <failoverdomainnode name="bi1" 
priority="1"/>
                                <failoverdomainnode name="bs1" 
priority="4"/>
                                <failoverdomainnode name="bs2" 
priority="3"/>
                                <failoverdomainnode name="bi2" 
priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="betty-bi2" ordered="1" 
restricted="1">
                                <failoverdomainnode name="bi2" 
priority="1"/>
                                <failoverdomainnode name="bs1" 
priority="3"/>
                                <failoverdomainnode name="bs2" 
priority="4"/>
                                <failoverdomainnode name="bi1" 
priority="2"/>
                        </failoverdomain>
                </failoverdomains>

                <resources>
                        <ip address="172.20.27.3" monitor_link="1"/>
                        <fs device="/dev/drbd0" fstype="ext3" 
mountpoint="/drbd" name="drbd"/>
                        <script file="/etc/init.d/mysqld" name="mysql"/>
                        <ip address="172.20.27.1" monitor_link="0"/>
                        <ip address="172.20.27.2" monitor_link="0"/>
                        <script file="/etc/init.d/foo" name="foo"/>
                </resources>

                <service autostart="1" exclusive="1" domain="mysql" 
name="mysql-service">
                        <fs ref="drbd">
                                <script ref="mysql"/>
                        </fs>
                        <ip ref="172.20.27.1"/>
                </service>

                <service autostart="1" exclusive="1" domain="foo-bi1" 
name="foo-bi1-service">
                        <ip ref="172.20.27.2">
                                <script ref="foo"/>
                        </ip>
                </service>

                <service autostart="1" exclusive="1" domain="foo-bi2" 
name="foo-bi2-service">
                        <ip ref="172.20.27.3">
                                <script ref="foo"/>
                        </ip>
                </service>

        </rm>
</cluster>





From mgoebel at workforcesoftware.com  Wed Aug 24 21:55:33 2005
From: mgoebel at workforcesoftware.com (Matt Goebel)
Date: Wed, 24 Aug 2005 17:55:33 -0400 (EDT)
Subject: [Linux-cluster] GFS + Oracle storage hardware suggestions?
Message-ID: <44425.10.200.200.228.1124920533.squirrel@mail.workforcesoftware.com>

I am in the process of specing out a high availability Oracle database
solution and need some advice from those of you experienced in doing this
as to what storage hardware to get.

The plan right now is to have 3 Oracle 9i or 10g nodes (2 failover)
running on Redhat Enterprise 3.5 or 4.1 with a shared filesystem for
Oracle via GFS (6.0 or 6.1 depending on which version of RHEL)  We'd also
like to use multipathing and 2 mirrored SANS of some sort.  The
application we will be running requires very very little in terms of
storage space, only ~3GB per year, per database.  DB load will also
probably not be that high.  Uptime is critical.  So for hardware I have
been looking into the following:

iSCSI SAN: I've tested a low end (nice price) EMC AX100i /w GFS 6.0 (using
GULM) and RHEL 3.5 (4.1 doesn't have iSCSI working yet).  Performance was
awful... 2.5-5 MB/s writes, 25-30 MB/s reads.  So it looks like iSCSI is
out of the question unless there is another hardware option that would
give me the performance I'd want?

Fibre Channel SAN: I've been trying to avoid using a FC solution if I can
because of cost but if it's what I need it's what I have to get.  Any
suggestions on this?  Are these reliable enough to safely use just one SAN
in a 99% uptime environment?  Any good entry/mid model's to look at?

AOE (Coraid.com):  This looks to be a perfect solution.. low cost and
potentially decent performance.  It's relatively new and I haven't heard
of anyone using it yet though.

There doesn't seem to be much out there to tell me what sort I can expect
out of these with GFS...  Any info would be helpful.



From yfttyfs at gmail.com  Thu Aug 25 03:12:55 2005
From: yfttyfs at gmail.com (y f)
Date: Thu, 25 Aug 2005 11:12:55 +0800
Subject: [Linux-cluster] GFS + Oracle storage hardware suggestions?
In-Reply-To: <44425.10.200.200.228.1124920533.squirrel@mail.workforcesoftware.com>
References: <44425.10.200.200.228.1124920533.squirrel@mail.workforcesoftware.com>
Message-ID: <78fcc84a050824201273c84de9@mail.gmail.com>

can the GFS be used in multiple gnbd server mode, so we can build it
all in IP network to gain cheap cost ?

I show my idea in the picture attached.

On 8/25/05, Matt Goebel <mgoebel at workforcesoftware.com> wrote:
> I am in the process of specing out a high availability Oracle database
> solution and need some advice from those of you experienced in doing this
> as to what storage hardware to get.
> 
> The plan right now is to have 3 Oracle 9i or 10g nodes (2 failover)
> running on Redhat Enterprise 3.5 or 4.1 with a shared filesystem for
> Oracle via GFS (6.0 or 6.1 depending on which version of RHEL)  We'd also
> like to use multipathing and 2 mirrored SANS of some sort.  The
> application we will be running requires very very little in terms of
> storage space, only ~3GB per year, per database.  DB load will also
> probably not be that high.  Uptime is critical.  So for hardware I have
> been looking into the following:
> 
> iSCSI SAN: I've tested a low end (nice price) EMC AX100i /w GFS 6.0 (using
> GULM) and RHEL 3.5 (4.1 doesn't have iSCSI working yet).  Performance was
> awful... 2.5-5 MB/s writes, 25-30 MB/s reads.  So it looks like iSCSI is
> out of the question unless there is another hardware option that would
> give me the performance I'd want?
> 
> Fibre Channel SAN: I've been trying to avoid using a FC solution if I can
> because of cost but if it's what I need it's what I have to get.  Any
> suggestions on this?  Are these reliable enough to safely use just one SAN
> in a 99% uptime environment?  Any good entry/mid model's to look at?
> 
> AOE (Coraid.com):  This looks to be a perfect solution.. low cost and
> potentially decent performance.  It's relatively new and I haven't heard
> of anyone using it yet though.
> 
> There doesn't seem to be much out there to tell me what sort I can expect
> out of these with GFS...  Any info would be helpful.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: multi_gnbd.png
Type: image/png
Size: 9531 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050825/6264aefe/attachment.png>

From mgoebel at workforcesoftware.com  Thu Aug 25 04:06:38 2005
From: mgoebel at workforcesoftware.com (Matt Goebel)
Date: Thu, 25 Aug 2005 00:06:38 -0400 (EDT)
Subject: [Linux-cluster] GFS + Oracle storage hardware suggestions?
In-Reply-To: <78fcc84a050824201273c84de9@mail.gmail.com>
References: <44425.10.200.200.228.1124920533.squirrel@mail.workforcesoftware.com>
	<78fcc84a050824201273c84de9@mail.gmail.com>
Message-ID: <48818.69.133.94.192.1124942798.squirrel@mail.workforcesoftware.com>

That setup is possible.  It adds extra servers and more things to go wrong
though.  Plus GNBD would likely cause a hit in performance.  Direct
attached storage (SCSI) units have been ruled out as an option since we
use a lot of them now and we've had lot of problems.  Every so often a
unit will just flip out and take all the drives offline.  You bring them
all back online and they go back to working fine.  These are all Dell
powervaults.  Dell support says "oh yeah, that happens sometimes.. one
drive will hicup and cause a cascading failure for the other drives on
that bus..."  We had another non-Dell DAS unit do something similiar.  So
now when we use them it's only in an md raid1 across two hardware raid5
units.  For this we are looking at SANs which hopefully don't have this
problem.

y f said:
> can the GFS be used in multiple gnbd server mode, so we can build it
> all in IP network to gain cheap cost ?
>
> I show my idea in the picture attached.
>
> On 8/25/05, Matt Goebel <mgoebel at workforcesoftware.com> wrote:
>> I am in the process of specing out a high availability Oracle database
>> solution and need some advice from those of you experienced in doing
>> this
>> as to what storage hardware to get.
>>
>> The plan right now is to have 3 Oracle 9i or 10g nodes (2 failover)
>> running on Redhat Enterprise 3.5 or 4.1 with a shared filesystem for
>> Oracle via GFS (6.0 or 6.1 depending on which version of RHEL)  We'd
>> also
>> like to use multipathing and 2 mirrored SANS of some sort.  The
>> application we will be running requires very very little in terms of
>> storage space, only ~3GB per year, per database.  DB load will also
>> probably not be that high.  Uptime is critical.  So for hardware I have
>> been looking into the following:
>>
>> iSCSI SAN: I've tested a low end (nice price) EMC AX100i /w GFS 6.0
>> (using
>> GULM) and RHEL 3.5 (4.1 doesn't have iSCSI working yet).  Performance
>> was
>> awful... 2.5-5 MB/s writes, 25-30 MB/s reads.  So it looks like iSCSI is
>> out of the question unless there is another hardware option that would
>> give me the performance I'd want?
>>
>> Fibre Channel SAN: I've been trying to avoid using a FC solution if I
>> can
>> because of cost but if it's what I need it's what I have to get.  Any
>> suggestions on this?  Are these reliable enough to safely use just one
>> SAN
>> in a 99% uptime environment?  Any good entry/mid model's to look at?
>>
>> AOE (Coraid.com):  This looks to be a perfect solution.. low cost and
>> potentially decent performance.  It's relatively new and I haven't heard
>> of anyone using it yet though.
>>
>> There doesn't seem to be much out there to tell me what sort I can
>> expect
>> out of these with GFS...  Any info would be helpful.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> http://www.redhat.com/mailman/listinfo/linux-cluster
>>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster




From cboudjnah at squiz.net  Thu Aug 25 04:59:20 2005
From: cboudjnah at squiz.net (Chmouel Boudjnah)
Date: Thu, 25 Aug 2005 14:59:20 +1000
Subject: [Linux-cluster] GFS + Oracle storage hardware suggestions?
In-Reply-To: <44425.10.200.200.228.1124920533.squirrel@mail.workforcesoftware.com>
References: <44425.10.200.200.228.1124920533.squirrel@mail.workforcesoftware.com>
Message-ID: <1124945960.26649.12.camel@paris.squiz.net>

On Wed, 2005-08-24 at 17:55 -0400, Matt Goebel wrote:
> AOE (Coraid.com):  This looks to be a perfect solution.. low cost and
> potentially decent performance.  It's relatively new and I haven't heard
> of anyone using it yet though.

Works perfectly fine for me with very good performances.

-- 
Chmouel Boudjnah - Squiz.net - http://www.squiz.net



From mgoebel at workforcesoftware.com  Thu Aug 25 05:33:18 2005
From: mgoebel at workforcesoftware.com (Matt Goebel)
Date: Thu, 25 Aug 2005 01:33:18 -0400 (EDT)
Subject: [Linux-cluster] GFS + Oracle storage hardware suggestions?
In-Reply-To: <1124945960.26649.12.camel@paris.squiz.net>
References: <44425.10.200.200.228.1124920533.squirrel@mail.workforcesoftware.com>
	<1124945960.26649.12.camel@paris.squiz.net>
Message-ID: <49404.69.133.94.192.1124947998.squirrel@mail.workforcesoftware.com>

Are you using DFS/Oracle on AOE?  If so how many drives, PATA or SATA,
hardware and/or software RAID, etc...?  Any additional info would be
appreciated.  On paper it looks like my best option right now.

Chmouel Boudjnah said:
> On Wed, 2005-08-24 at 17:55 -0400, Matt Goebel wrote:
>> AOE (Coraid.com):  This looks to be a perfect solution.. low cost and
>> potentially decent performance.  It's relatively new and I haven't heard
>> of anyone using it yet though.
>
> Works perfectly fine for me with very good performances.
>
> --
> Chmouel Boudjnah - Squiz.net - http://www.squiz.net
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>
>




From gshi at ncsa.uiuc.edu  Thu Aug 25 08:05:36 2005
From: gshi at ncsa.uiuc.edu (Guochun Shi)
Date: Thu, 25 Aug 2005 03:05:36 -0500
Subject: [Linux-cluster] documents for technical specification
Message-ID: <5.1.0.14.2.20050825025909.04077dc0@pop.ncsa.uiuc.edu>

hi,

This paper seems outdate and dlm/fencing part are empty
http://people.redhat.com/%7Eteigland/sca.pdf

Where can I find a new one?

Thanks
-Guochun



From teigland at redhat.com  Thu Aug 25 08:36:34 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 25 Aug 2005 16:36:34 +0800
Subject: [Linux-cluster] documents for technical specification
In-Reply-To: <5.1.0.14.2.20050825025909.04077dc0@pop.ncsa.uiuc.edu>
References: <5.1.0.14.2.20050825025909.04077dc0@pop.ncsa.uiuc.edu>
Message-ID: <20050825083634.GB19180@redhat.com>

On Thu, Aug 25, 2005 at 03:05:36AM -0500, Guochun Shi wrote:
> hi,
> 
> This paper seems outdate and dlm/fencing part are empty
> http://people.redhat.com/%7Eteigland/sca.pdf
> 
> Where can I find a new one?

There is none, it's too early to write a new one with things bound to
change.

Dave



From Rob.TerVeer at getronics.com  Thu Aug 25 09:44:57 2005
From: Rob.TerVeer at getronics.com (Veer, Rob ter)
Date: Thu, 25 Aug 2005 11:44:57 +0200
Subject: [Linux-cluster] Tiebreaker IP
Message-ID: <09B5576A48518947960A89175FB6C23DEA20A1@excbebr204.europe.unity>

Hello,

I'm trying to get a deeper insight into the workings of the tiebreaker
with a standard RHCS cluster configuration. It's not clear to me what
the role of the tiebreaker within a 2 or 4 node cluster exactly is. 

Is there any documentation on this subject? I'm particulary interested
in the flow of events when the tiebreaker is used. I know the tiebreaker
is used to prevent node isolation, but how exactly?

Regards,
Rob.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050825/576544d8/attachment.htm>

From hernando.garcia at gmail.com  Thu Aug 25 08:10:15 2005
From: hernando.garcia at gmail.com (Hernando Garcia)
Date: Thu, 25 Aug 2005 09:10:15 +0100
Subject: [Linux-cluster] GFS + Oracle storage hardware suggestions?
In-Reply-To: <49404.69.133.94.192.1124947998.squirrel@mail.workforcesoftware.com>
References: <44425.10.200.200.228.1124920533.squirrel@mail.workforcesoftware.com>
	<1124945960.26649.12.camel@paris.squiz.net>
	<49404.69.133.94.192.1124947998.squirrel@mail.workforcesoftware.com>
Message-ID: <1124957415.8570.13.camel@hgarcia.surrey.redhat.com>

You might want to talk to the Software Vendor ( Oracle ) and OS Vendor
( Red Hat ) regarding the best option in terms of hardware. The OS and
application layer GFS will benefit tremendously by using FC HBA
( qlogic/Emulex ) those being the preferred and supported options, the
latter is VERY important should anything goes wrong and you need to ask
for assistance.

Storage MSAxxx are really reliable.

DO NOT USE SCSI/SATA for this very important project of yours. If you
require high availability and maximum uptime for your DB, it does not
make sense to pay big money for the OS/Software and penuts for the
hardware.

REDHAT/ORACLE do not do miracles is you have a flaky hardware layer. 

On Thu, 2005-08-25 at 01:33 -0400, Matt Goebel wrote:
> now.
> 



From cjk at techma.com  Thu Aug 25 19:53:59 2005
From: cjk at techma.com (Kovacs, Corey J.)
Date: Thu, 25 Aug 2005 15:53:59 -0400
Subject: [Linux-cluster] usedev directive not working correctly?
Message-ID: <EE32D921D7601547AD9CA5C87906C56609518A@tmaemail.techma.com>

It's been a while since I lst dealt with this issue. Shortly after I posted
this problem I was put on another issue... Alas, here I am again....

I've rebuilt the cluster from the ground up in a clean state. RHEL2 update 4,
with GFS 6.0.2-25 re-compiled for these machines.


Here are my configs....

#/etc/hosts
10.0.0.1	clua
10.0.0.2	club
10.0.0.3	cluc
192.168.0.1 clua-ic
192.168.0.2 club-ic
192.168.0.3 cluc-ic


cluster {
	name = "cluster"
	lock_gulm {
			servers = ["clua", "club", "cluc"]
	}
}


nodes {
	clua {
		ip_interfaces {
			eth1 = "192.168.0.1"
		}
		usedev="eth1"
		fence {
			iLO {
				clua-ilo {
					action="reboot"
				}
			}
		}
	}

	club {
		ip_interfaces {
			eth1 = "192.168.0.2"
		}
		usedev="eth1"
		fence {
			iLO {
				club-ilo {
					action="reboot"
				}
			}
		}
	}

	cluc {
		ip_interfaces {
			eth1 = "192.168.0.3"
		}
		usedev="eth1"
		fence {
			iLO {
				cluc-ilo {
					action="reboot"
				}
			}
		}
	}

}

fence_devices {
	clua-ilo {
		agent="fence_ilo"
		hostname = "10.0.0.10"
		login = "xxxxx"
		passwd = "yyyyyy"
	}
	club-ilo {
		agent="fence_ilo"
		hostname = "10.0.0.11"
		login = "xxxxx"
		passwd = "yyyyyy"
	}
	cluc-ilo {
		agent="fence_ilo"
		hostname = "10.0.0.12"
		login = "xxxxx"
		passwd = "yyyyyy"
	}



}


when I run lock_gulmd -C against this config (after doing a ccs_tool create)
I get the following output


cluster {
	name="cluster"
	lock_gulm {
		heartbeat_rate = 15.000
		allowed_misses = 2
		coreport = 40040
		new_connection_timeout = 15.000
		# server cnt: 3
		# servers = ["clua", "club", "cluc"]
		servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
		lt_partitions = 1
		lt_base_port = 41040
		lt_high_locks = 1048576
		lt_drop_req_rate = 10
		prealloc_locks = 90000
		prealloc_holders = 130000
		prealloc_lkrqs = 60
		ltpx_port = 40042
	}
}


So, the question is, what am I doing blatently wrong? The docs seem fairly
simple
but this is just not working for me...

Any suggestions would be appreciated and acted upon much quicker this time.

Thanks


Corey



-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Conrad Tadpol
Tilstra
Sent: Tuesday, June 07, 2005 11:41 AM
To: linux clustering
Subject: Re: [Linux-cluster] usedev directive not working correctly?

On Tue, Jun 07, 2005 at 11:03:36AM -0400, Kovacs, Corey J. wrote:
> Sorry about the previous message subject, too lazy to type the address 
> and didn't change the subject.
> 
> On one hand I agree with that, however I've gone as far as to set up 
> static routes for the addresses and lock_gulmd won't start at all 
> since it can't talk to the other lock servers at all. As I said in the 
> original message, 'gulm_tool nodelist node1' reports that the lock 
> manager on node3 is NOT using the directed interface but node2 and node1
are.

maybe.  but it really looks like the gulm on node3 is miss-configuring
itself.  So look in syslog when you start lock_gulmd on that node, it prints
what it thinks the hostname and ip is.  If its picking hte wrong one there,
gulm is reading the config wrong.  You can run lock_gulmd with the -C option,
and it will just parse config data and dump it out in /tmp/Gulm_config.??????
(the ? will be random chars.)  Look at that to see if it looks like what
you've configured.

And, I'd like to see the complete nodes.ccs, if you don't mind.

--
Michael Conrad Tadpol Tilstra
Push to test.  <click>  Release to detonate.



From cjk at techma.com  Thu Aug 25 20:15:15 2005
From: cjk at techma.com (Kovacs, Corey J.)
Date: Thu, 25 Aug 2005 16:15:15 -0400
Subject: [Linux-cluster] usedev directive not working correctly?
Message-ID: <EE32D921D7601547AD9CA5C87906C56609518B@tmaemail.techma.com>

Sorry, the description of the build is supposed to say RHEL3 update 4 ...

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J.
Sent: Thursday, August 25, 2005 3:54 PM
To: mtilstra at redhat.com; linux clustering
Subject: RE: [Linux-cluster] usedev directive not working correctly?

It's been a while since I lst dealt with this issue. Shortly after I posted
this problem I was put on another issue... Alas, here I am again....

I've rebuilt the cluster from the ground up in a clean state. RHEL2 update 4,
with GFS 6.0.2-25 re-compiled for these machines.


Here are my configs....

#/etc/hosts
10.0.0.1	clua
10.0.0.2	club
10.0.0.3	cluc
192.168.0.1 clua-ic
192.168.0.2 club-ic
192.168.0.3 cluc-ic


cluster {
	name = "cluster"
	lock_gulm {
			servers = ["clua", "club", "cluc"]
	}
}


nodes {
	clua {
		ip_interfaces {
			eth1 = "192.168.0.1"
		}
		usedev="eth1"
		fence {
			iLO {
				clua-ilo {
					action="reboot"
				}
			}
		}
	}

	club {
		ip_interfaces {
			eth1 = "192.168.0.2"
		}
		usedev="eth1"
		fence {
			iLO {
				club-ilo {
					action="reboot"
				}
			}
		}
	}

	cluc {
		ip_interfaces {
			eth1 = "192.168.0.3"
		}
		usedev="eth1"
		fence {
			iLO {
				cluc-ilo {
					action="reboot"
				}
			}
		}
	}

}

fence_devices {
	clua-ilo {
		agent="fence_ilo"
		hostname = "10.0.0.10"
		login = "xxxxx"
		passwd = "yyyyyy"
	}
	club-ilo {
		agent="fence_ilo"
		hostname = "10.0.0.11"
		login = "xxxxx"
		passwd = "yyyyyy"
	}
	cluc-ilo {
		agent="fence_ilo"
		hostname = "10.0.0.12"
		login = "xxxxx"
		passwd = "yyyyyy"
	}



}


when I run lock_gulmd -C against this config (after doing a ccs_tool create)
I get the following output


cluster {
	name="cluster"
	lock_gulm {
		heartbeat_rate = 15.000
		allowed_misses = 2
		coreport = 40040
		new_connection_timeout = 15.000
		# server cnt: 3
		# servers = ["clua", "club", "cluc"]
		servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
		lt_partitions = 1
		lt_base_port = 41040
		lt_high_locks = 1048576
		lt_drop_req_rate = 10
		prealloc_locks = 90000
		prealloc_holders = 130000
		prealloc_lkrqs = 60
		ltpx_port = 40042
	}
}


So, the question is, what am I doing blatently wrong? The docs seem fairly
simple but this is just not working for me...

Any suggestions would be appreciated and acted upon much quicker this time.

Thanks


Corey



-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Conrad Tadpol
Tilstra
Sent: Tuesday, June 07, 2005 11:41 AM
To: linux clustering
Subject: Re: [Linux-cluster] usedev directive not working correctly?

On Tue, Jun 07, 2005 at 11:03:36AM -0400, Kovacs, Corey J. wrote:
> Sorry about the previous message subject, too lazy to type the address 
> and didn't change the subject.
> 
> On one hand I agree with that, however I've gone as far as to set up 
> static routes for the addresses and lock_gulmd won't start at all 
> since it can't talk to the other lock servers at all. As I said in the 
> original message, 'gulm_tool nodelist node1' reports that the lock 
> manager on node3 is NOT using the directed interface but node2 and 
> node1
are.

maybe.  but it really looks like the gulm on node3 is miss-configuring
itself.  So look in syslog when you start lock_gulmd on that node, it prints
what it thinks the hostname and ip is.  If its picking hte wrong one there,
gulm is reading the config wrong.  You can run lock_gulmd with the -C option,
and it will just parse config data and dump it out in /tmp/Gulm_config.??????
(the ? will be random chars.)  Look at that to see if it looks like what
you've configured.

And, I'd like to see the complete nodes.ccs, if you don't mind.

--
Michael Conrad Tadpol Tilstra
Push to test.  <click>  Release to detonate.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster



From robert at deakin.edu.au  Fri Aug 26 03:22:43 2005
From: robert at deakin.edu.au (Robert Ruge)
Date: Fri, 26 Aug 2005 13:22:43 +1000
Subject: [Linux-cluster] CLVM snapshot question
Message-ID: <200508260322.j7Q3Mit7005308@deakin.edu.au>

Due to problems with samba locking up on a GFS filesystem, which
appears to be a known problem, we will be reverting back to an ext3
filesystem for our home directories. I have two questions resulting
from this:

1. Is it possible to create a snapshot of an ext3 filesystem but mount
the snapshot on a second computer? I would like to do this so that the
actual file service will be on server 1 but the snapshot mounted on
the backups server for the backup to tape.

2. In system-config-cluster the GFS resource has an options field for
mount options but the FS resource has no such field. The
/usr/share/cluster/fs.sh script would appear to handle an options
entry ok so can I hand edit the cluster.conf file and add an options
entry and have it work ok. I really need to get the ext3 filesystems
mounted with mount options QUOTA and ACL.

Thanks.

Robert Ruge   School of Information Technology, Deakin University 



From pcaulfie at redhat.com  Fri Aug 26 07:13:53 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 26 Aug 2005 08:13:53 +0100
Subject: [Linux-cluster] CLVM snapshot question
In-Reply-To: <200508260322.j7Q3Mit7005308@deakin.edu.au>
References: <200508260322.j7Q3Mit7005308@deakin.edu.au>
Message-ID: <430EC131.30208@redhat.com>

Robert Ruge wrote:
> Due to problems with samba locking up on a GFS filesystem, which
> appears to be a known problem, we will be reverting back to an ext3
> filesystem for our home directories. I have two questions resulting
> from this:
> 
> 1. Is it possible to create a snapshot of an ext3 filesystem but mount
> the snapshot on a second computer? 
> 

no.

-- 

patrick



From mikore.li at gmail.com  Fri Aug 26 07:17:56 2005
From: mikore.li at gmail.com (Michael)
Date: Fri, 26 Aug 2005 15:17:56 +0800
Subject: [Linux-cluster] If I have 5 GNBD server?
Message-ID: <bc572709050826001742c6b316@mail.gmail.com>

Hi, 

If I have 5 gnbd servers in the network, each one export 1 block
device, can I import all gnbd devices on each gfs client, and use lvm
to manage them as 1 shared pool, then mkfs_gfs on it?

Thanks,

Michael



From fajar at telkom.co.id  Fri Aug 26 07:51:59 2005
From: fajar at telkom.co.id (Fajar A. Nugraha)
Date: Fri, 26 Aug 2005 14:51:59 +0700
Subject: [Linux-cluster] CLVM snapshot question
In-Reply-To: <430EC131.30208@redhat.com>
References: <200508260322.j7Q3Mit7005308@deakin.edu.au>
	<430EC131.30208@redhat.com>
Message-ID: <430ECA1F.2030100@telkom.co.id>

Patrick Caulfield wrote:

>Robert Ruge wrote:
>  
>
>>1. Is it possible to create a snapshot of an ext3 filesystem but mount
>>the snapshot on a second computer? 
>>
>>    
>>
>
>no.
>
>  
>
As I recall, pvmove also requires the functionality of LVM snapshot.
Is it possible to use pvmove on a PV which contains a mounted GFS 
filesystem?

To be specific, I have a two-node cluster with GFS mount. I know that 
cluster snapshot is not finished yet, but is it possible to shut down 
one node, and use pvmove on the live node?

Regards,

Fajar



From pcaulfie at redhat.com  Fri Aug 26 08:11:18 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 26 Aug 2005 09:11:18 +0100
Subject: [Linux-cluster] CLVM snapshot question
In-Reply-To: <430ECA1F.2030100@telkom.co.id>
References: <200508260322.j7Q3Mit7005308@deakin.edu.au>	<430EC131.30208@redhat.com>
	<430ECA1F.2030100@telkom.co.id>
Message-ID: <430ECEA6.1070705@redhat.com>

Fajar A. Nugraha wrote:
> Patrick Caulfield wrote:
> 
>> Robert Ruge wrote:
>>  
>>
>>> 1. Is it possible to create a snapshot of an ext3 filesystem but mount
>>> the snapshot on a second computer?
>>>   
>>
>>
>> no.
>>
>>  
>>
> As I recall, pvmove also requires the functionality of LVM snapshot.

Well, it's mirroring actually :)

> Is it possible to use pvmove on a PV which contains a mounted GFS
> filesystem?
> 
> To be specific, I have a two-node cluster with GFS mount. I know that
> cluster snapshot is not finished yet, but is it possible to shut down
> one node, and use pvmove on the live node?

If it's only mounted on one node then, yes, you should be able to pvmove a GFS
filesystem.

-- 

patrick



From mikore.li at gmail.com  Fri Aug 26 08:26:43 2005
From: mikore.li at gmail.com (Michael)
Date: Fri, 26 Aug 2005 16:26:43 +0800
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <bc572709050826001742c6b316@mail.gmail.com>
References: <bc572709050826001742c6b316@mail.gmail.com>
Message-ID: <bc57270905082601262d520201@mail.gmail.com>

Can anyone answer my question?

On 8/26/05, Michael <mikore.li at gmail.com> wrote:
> Hi,
> 
> If I have 5 gnbd servers in the network, each one export 1 block
> device, can I import all gnbd devices on each gfs client, and use lvm
> to manage them as 1 shared pool, then mkfs_gfs on it?
> 
> Thanks,
> 
> Michael
>



From fajar at telkom.co.id  Fri Aug 26 09:33:51 2005
From: fajar at telkom.co.id (Fajar A. Nugraha)
Date: Fri, 26 Aug 2005 16:33:51 +0700
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <bc57270905082601262d520201@mail.gmail.com>
References: <bc572709050826001742c6b316@mail.gmail.com>
	<bc57270905082601262d520201@mail.gmail.com>
Message-ID: <430EE1FF.6050505@telkom.co.id>

Michael wrote:

>Can anyone answer my question?
>
>  
>
I don't think LVM cares what kind of storage it uses, as long as it's a 
block device.
So theoretically, you can mix local disc, FC storage, GNBD, and ATOE and 
combine them using LVM. You might run into performance issues (GNBD are 
slower than FC disks) and startup issues (LVM has to start after cluster 
and gnbd-import successfuly start), but it should be possible.

The real question is WHY you want to do that.
AFAIK, if you combine 5 gnbd from 5 gnbd servers into one LVM, and you 
are accessing a volume on that volume group, and one of the gnbd server 
dies (or hangs), gnbd import will wait forever until that server is back 
up. So you'll have five more single-point-of-failures.

Regards,

Fajar

>On 8/26/05, Michael <mikore.li at gmail.com> wrote:
>  
>
>>Hi,
>>
>>If I have 5 gnbd servers in the network, each one export 1 block
>>device, can I import all gnbd devices on each gfs client, and use lvm
>>to manage them as 1 shared pool, then mkfs_gfs on it?
>>
>>Thanks,
>>
>>Michael
>>
>>    
>>
>
>  
>



From mikore.li at gmail.com  Fri Aug 26 09:50:27 2005
From: mikore.li at gmail.com (Michael)
Date: Fri, 26 Aug 2005 17:50:27 +0800
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <430EE1FF.6050505@telkom.co.id>
References: <bc572709050826001742c6b316@mail.gmail.com>
	<bc57270905082601262d520201@mail.gmail.com>
	<430EE1FF.6050505@telkom.co.id>
Message-ID: <bc5727090508260250382295cd@mail.gmail.com>

Fajar,

Thanks for you, quite clear.

Michael

On 8/26/05, Fajar A. Nugraha <fajar at telkom.co.id> wrote:
> Michael wrote:
> 
> >Can anyone answer my question?
> >
> >
> >
> I don't think LVM cares what kind of storage it uses, as long as it's a
> block device.
> So theoretically, you can mix local disc, FC storage, GNBD, and ATOE and
> combine them using LVM. You might run into performance issues (GNBD are
> slower than FC disks) and startup issues (LVM has to start after cluster
> and gnbd-import successfuly start), but it should be possible.
> 
> The real question is WHY you want to do that.
> AFAIK, if you combine 5 gnbd from 5 gnbd servers into one LVM, and you
> are accessing a volume on that volume group, and one of the gnbd server
> dies (or hangs), gnbd import will wait forever until that server is back
> up. So you'll have five more single-point-of-failures.
> 
> Regards,
> 
> Fajar
> 
> >On 8/26/05, Michael <mikore.li at gmail.com> wrote:
> >
> >
> >>Hi,
> >>
> >>If I have 5 gnbd servers in the network, each one export 1 block
> >>device, can I import all gnbd devices on each gfs client, and use lvm
> >>to manage them as 1 shared pool, then mkfs_gfs on it?
> >>
> >>Thanks,
> >>
> >>Michael
> >>
> >>
> >>
> >
> >
> >
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>



From yanj at brainaire.com  Fri Aug 26 11:03:28 2005
From: yanj at brainaire.com (yanj)
Date: Fri, 26 Aug 2005 19:03:28 +0800
Subject: [Linux-cluster] Redhat 9.0+GFS+iSCSI
Message-ID: <000501c5aa2d$cdb5ef70$2e00a8c0@yanzijie>

Hello, 
 
I try to setup 64 nodes GFS system in Redhat 9.0. So all the 64 nodes
can share the iSCSI storage.
The node OS must be Redhat 9.0 .
 
So far I cannot find any GFS source release or rpm which is used in
Redhat 9.0.
Is GFS working in Redhat 9.0?
 
Thanks,
Jeffrey Yan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050826/35e286da/attachment.htm>

From cjk at techma.com  Fri Aug 26 17:10:24 2005
From: cjk at techma.com (Kovacs, Corey J.)
Date: Fri, 26 Aug 2005 13:10:24 -0400
Subject: [Linux-cluster] usedev directive not working correctly?
Message-ID: <EE32D921D7601547AD9CA5C87906C56609518D@tmaemail.techma.com>

Sorry, slight difference between 6.0.2-25 and 6.0.2.20 ....

seems to be working now...


Corey

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J.
Sent: Thursday, August 25, 2005 4:15 PM
To: linux clustering; mtilstra at redhat.com
Subject: RE: [Linux-cluster] usedev directive not working correctly?

Sorry, the description of the build is supposed to say RHEL3 update 4 ...

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J.
Sent: Thursday, August 25, 2005 3:54 PM
To: mtilstra at redhat.com; linux clustering
Subject: RE: [Linux-cluster] usedev directive not working correctly?

It's been a while since I lst dealt with this issue. Shortly after I posted
this problem I was put on another issue... Alas, here I am again....

I've rebuilt the cluster from the ground up in a clean state. RHEL2 update 4,
with GFS 6.0.2-25 re-compiled for these machines.


Here are my configs....

#/etc/hosts
10.0.0.1	clua
10.0.0.2	club
10.0.0.3	cluc
192.168.0.1 clua-ic
192.168.0.2 club-ic
192.168.0.3 cluc-ic


cluster {
	name = "cluster"
	lock_gulm {
			servers = ["clua", "club", "cluc"]
	}
}


nodes {
	clua {
		ip_interfaces {
			eth1 = "192.168.0.1"
		}
		usedev="eth1"
		fence {
			iLO {
				clua-ilo {
					action="reboot"
				}
			}
		}
	}

	club {
		ip_interfaces {
			eth1 = "192.168.0.2"
		}
		usedev="eth1"
		fence {
			iLO {
				club-ilo {
					action="reboot"
				}
			}
		}
	}

	cluc {
		ip_interfaces {
			eth1 = "192.168.0.3"
		}
		usedev="eth1"
		fence {
			iLO {
				cluc-ilo {
					action="reboot"
				}
			}
		}
	}

}

fence_devices {
	clua-ilo {
		agent="fence_ilo"
		hostname = "10.0.0.10"
		login = "xxxxx"
		passwd = "yyyyyy"
	}
	club-ilo {
		agent="fence_ilo"
		hostname = "10.0.0.11"
		login = "xxxxx"
		passwd = "yyyyyy"
	}
	cluc-ilo {
		agent="fence_ilo"
		hostname = "10.0.0.12"
		login = "xxxxx"
		passwd = "yyyyyy"
	}



}


when I run lock_gulmd -C against this config (after doing a ccs_tool create)
I get the following output


cluster {
	name="cluster"
	lock_gulm {
		heartbeat_rate = 15.000
		allowed_misses = 2
		coreport = 40040
		new_connection_timeout = 15.000
		# server cnt: 3
		# servers = ["clua", "club", "cluc"]
		servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
		lt_partitions = 1
		lt_base_port = 41040
		lt_high_locks = 1048576
		lt_drop_req_rate = 10
		prealloc_locks = 90000
		prealloc_holders = 130000
		prealloc_lkrqs = 60
		ltpx_port = 40042
	}
}


So, the question is, what am I doing blatently wrong? The docs seem fairly
simple but this is just not working for me...

Any suggestions would be appreciated and acted upon much quicker this time.

Thanks


Corey



-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Conrad Tadpol
Tilstra
Sent: Tuesday, June 07, 2005 11:41 AM
To: linux clustering
Subject: Re: [Linux-cluster] usedev directive not working correctly?

On Tue, Jun 07, 2005 at 11:03:36AM -0400, Kovacs, Corey J. wrote:
> Sorry about the previous message subject, too lazy to type the address 
> and didn't change the subject.
> 
> On one hand I agree with that, however I've gone as far as to set up 
> static routes for the addresses and lock_gulmd won't start at all 
> since it can't talk to the other lock servers at all. As I said in the 
> original message, 'gulm_tool nodelist node1' reports that the lock 
> manager on node3 is NOT using the directed interface but node2 and
> node1
are.

maybe.  but it really looks like the gulm on node3 is miss-configuring
itself.  So look in syslog when you start lock_gulmd on that node, it prints
what it thinks the hostname and ip is.  If its picking hte wrong one there,
gulm is reading the config wrong.  You can run lock_gulmd with the -C option,
and it will just parse config data and dump it out in /tmp/Gulm_config.??????
(the ? will be random chars.)  Look at that to see if it looks like what
you've configured.

And, I'd like to see the complete nodes.ccs, if you don't mind.

--
Michael Conrad Tadpol Tilstra
Push to test.  <click>  Release to detonate.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster



From hzhong at gmail.com  Fri Aug 26 17:27:04 2005
From: hzhong at gmail.com (Hua Zhong)
Date: Fri, 26 Aug 2005 10:27:04 -0700
Subject: [Linux-cluster] DLM: lock timeout and deadlock detection
Message-ID: <924c2883050826102720e5ebd4@mail.gmail.com>

Hi list:

I've recently started to read about the Redhat's implementation of DLM
that David posted on LKML. I've few questions:

1. There is no lock timeout imeplemented (at least from API point of
view). Is this something you plan to add? Is there any draft of the
design?
2. With no lock timeout in DLM, how does upper-layer applications
(like GFS) implement such lock timeout (or does GFS also have no
timeout)? One thought is that it could has its own timer and when it
expires just cancel the pending lock.
3. How much does DLM do wrt deadlock detection? Especially when it
doesn't have timeout. Is it solely the application's responsibility to
detect it?
4. This is not a technical question. I'm trying to convince our
management to use DLM, so I'd like to know how stable it is and on
what kind of scale the Redhat clustering solution is being used by the
enterprise. Is it stable enough in production (data like average
uptime, etc)?

Thanks for your help.

Hua



From JACOB_LIBERMAN at Dell.com  Fri Aug 26 18:24:39 2005
From: JACOB_LIBERMAN at Dell.com (JACOB_LIBERMAN at Dell.com)
Date: Fri, 26 Aug 2005 13:24:39 -0500
Subject: [Linux-cluster] Tiebreaker IP
Message-ID: <BC430F453501174992B9D9E8AFB7519A518A10@ausx3mps309.aus.amer.dell.com>

Rob,

Heres a summary of what I have observed with this configuration. You may
want to verify the accuracy of my observations on your own. 

Starting with RHEL3, the RHCS verified node membership via a network
heartbeat rather than/in addition to a disk timestamp. The network
heartbeat traffic moves over the same interface that is used to access
the network resources. This means that there is no dedicated heartbeat
interface like you would see in a microsoft cluster.

The tiebreaker IP is used to prevent a split brain situation in a a
cluster with an even number of nodes. Lets say you have 2 active cluster
nodes... say nodeA and nodeB, and nodeA owns an NFS disk resource and
IP. Then lets say nodeA fails to receive a heartbeat from nodeB over its
primary interface. This could mean several things: nodeA's interface is
down, nodeB's interface is down, or their shared switch is down. So if
nodeA and nodeB stop communicating with eachother, they will both try to
ping the tiebreaker IP, which is usually your default gateway IP. If
nodeA gets a response from the tiebreaker IP, it will continue to own
the resource. If it cant, it will assume its external interface is down
and fence/reboot itself. The same holds true for nodeB. Unlike RHEL2.1
which used STONITH, RHEL3 cluster nodes reboot themselves. Therefor,
even if nodeB can reach the tiebreaker and CANT reach nodeA, it will not
get the cluster resource until nodeA releases it. This prevents the
nodes from accessing the shared disk resource concommitantly.

This configuration prevents split brain by ensuring the resource owner
doesn't get killed accidentally by its peer. For those that remember,
ping=ponging was a big problem with RHEL2.1 clusters. If they couldn't
read their partners disk timestamp update in a timely manner -- due to
IO latency or whatever -- they would reboot their partner node. On
reboot, the rebooted node would STONITH the other node, etc.

Anyway, I hope this answers your questions. It is fairly easy to test.
Set up a 2 node cluster, then reboot the service owner. If the service
starts on the other node, you should be configured correctly. Next
disconnect the service owner from the network. The service owner should
reboot itself with the watchdog or fail over the resource, depending on
how its ocnfigured. Repeat this test with the non-service owner. (the
resources should not move in this case.) then take turns disconnecting
them from the shared storage. 

Cheers, jacob

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Veer, Rob ter
> Sent: Thursday, August 25, 2005 4:45 AM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Tiebreaker IP
> 
> Hello, 
> 
> I'm trying to get a deeper insight into the workings of the 
> tiebreaker with a standard RHCS cluster configuration. It's 
> not clear to me what the role of the tiebreaker within a 2 or 
> 4 node cluster exactly is. 
> 
> Is there any documentation on this subject? I'm particulary 
> interested in the flow of events when the tiebreaker is used. 
> I know the tiebreaker is used to prevent node isolation, but 
> how exactly?
> 
> Regards,
> Rob. 
> 
> 



From brianu at silvercash.com  Sat Aug 27 00:51:30 2005
From: brianu at silvercash.com (brianu)
Date: Fri, 26 Aug 2005 17:51:30 -0700
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
Message-ID: <20050827004859.D66965A8690@mail.silvercash.com>

Hello all,

 

This is a question I have basically been asking, the question on why you
would want to do it is failover, the docs at
http://sourceware.org/cluster/gnbd/gnbd_usage.txt state that dm-multipath is
an option for gnbd, and documents elsewhere also indicate that GNBD can be
configured as a redundancy, yet I cannot find any documentation on how to
configure it.

 

If using LVM to make a volume of imported gnbds is not the answer for
redundancy can anyone suggest a method that is? Im not opposed to using any
other resource of cluster or GFS but I would really like to implement a
redundant solution, ( gnbd, gulm, etc.).

 

 

Does anyone has an example of a redundant solution for a cluster/gfs
filesystem i.e ( gnbd, gulm etc) ?

 

Regards,

 

Brian

 

Message: 9

Date: Fri, 26 Aug 2005 16:33:51 +0700

From: "Fajar A. Nugraha" <fajar at telkom.co.id>

Subject: Re: [Linux-cluster] Re: If I have 5 GNBD server?

To: linux clustering <linux-cluster at redhat.com>

Message-ID: <430EE1FF.6050505 at telkom.co.id>

Content-Type: text/plain; charset=ISO-8859-1; format=flowed

 

Michael wrote:

 

>Can anyone answer my question?

> 

>  

> 

I don't think LVM cares what kind of storage it uses, as long as it's a 

block device.

So theoretically, you can mix local disc, FC storage, GNBD, and ATOE and 

combine them using LVM. You might run into performance issues (GNBD are 

slower than FC disks) and startup issues (LVM has to start after cluster 

and gnbd-import successfuly start), but it should be possible.

 

The real question is WHY you want to do that.

AFAIK, if you combine 5 gnbd from 5 gnbd servers into one LVM, and you 

are accessing a volume on that volume group, and one of the gnbd server 

dies (or hangs), gnbd import will wait forever until that server is back 

up. So you'll have five more single-point-of-failures.

 

Regards,

 

Fajar

 

>On 8/26/05, Michael <mikore.li at gmail.com> wrote:

>  

> 

>>Hi,

>> 

>>If I have 5 gnbd servers in the network, each one export 1 block

>>device, can I import all gnbd devices on each gfs client, and use lvm

>>to manage them as 1 shared pool, then mkfs_gfs on it?

>> 

>>Thanks,

>> 

>>Michael

>> 

>>    

>> 

> 

>  

> 

 

 

 

Brian Urrutia

System Administrator

Price Communications Inc.

 

There are 10 types of people in this world, 
those who understand binary and those who don't.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050826/fcc632de/attachment.htm>

From sco at adviseo.fr  Sat Aug 27 00:26:04 2005
From: sco at adviseo.fr (Sylvain COUTANT)
Date: Sat, 27 Aug 2005 02:26:04 +0200
Subject: [Linux-cluster] do_dlm_unlock kernel panic
Message-ID: <20050827002611.F36FE31C5BC@postfix4-2.free.fr>

Hi all,

I got the following error tonight after a network blackout on one -test- server :

dlm: dlm_unlock: lkid 10301 lockspace not found
exp1 move flags 0,1,0 ids 0,2,0
exp1 move use event 2
exp1 recover event 2 (first)
exp1 add nodes
exp1 total nodes 3
exp1 rebuild resource directory
exp1 rebuilt 5 resources
exp1 recover event 2 done
exp1 move flags 0,0,1 ids 0,2,2
exp1 process held requests
exp1 processed 0 requests
exp1 recover event 2 finished
exp1 move flags 1,0,0 ids 2,2,2

8140 un 2,13f 10107 3 0
8140 un 2,40e 10074 3 0
8140 un 2,9c 201c7 3 0
8140 un 2,27b 201ab 3 0
8133 qc 2,38d 3,3 id 203a8 sts -65538 0
8133 qc 2,a4 3,3 id 20000 sts -65538 0
8133 qc 2,393 3,3 id 20381 sts -65538 0
8133 qc 2,308 3,3 id 10350 sts -65538 0
8133 qc 2,8e 3,3 id 1015f sts -65538 0
8133 qc 2,3e0 3,3 id 20171 sts -65538 0
8133 qc 2,411 3,3 id 100cc sts -65538 0
8133 qc 2,397 3,3 id 302ef sts -65538 0
8133 qc 2,13f 3,3 id 10107 sts -65538 0
8133 qc 2,40e 3,3 id 10074 sts -65538 0
8133 qc 2,9c 3,3 id 201c7 sts -65538 0
8133 qc 2,27b 3,3 id 201ab sts -65538 0
8140 un 2,38b 202ab 3 0
8140 un 2,3d8 20351 3 0
8140 un 2,304 203f6 3 0
8140 un 2,3dc 103ea 3 0
8140 un 2,319 10023 3 0
8140 un 2,16e 203b9 3 0
8140 un 2,4b 101bf 3 0
8140 un 2,414 10342 3 0
8140 un 2,31d 1026f 3 0
8140 un 2,187 1024e 3 0
8140 un 2,3d4 1016a 3 0
8140 un 2,34 2011a 3 0
8133 qc 2,38b 3,3 id 202ab sts -65538 0
8133 qc 2,3d8 3,3 id 20351 sts -65538 0
8133 qc 2,304 3,3 id 203f6 sts -65538 0
8133 qc 2,3dc 3,3 id 103ea sts -65538 0
8133 qc 2,319 3,3 id 10023 sts -65538 0
8133 qc 2,16e 3,3 id 203b9 sts -65538 0
8133 qc 2,4b 3,3 id 101bf sts -65538 0
8133 qc 2,414 3,3 id 10342 sts -65538 0
8133 qc 2,31d 3,3 id 1026f sts -65538 0
8133 qc 2,187 3,3 id 1024e sts -65538 0
8133 qc 2,3d4 3,3 id 1016a sts -65538 0
8133 qc 2,34 3,3 id 2011a sts -65538 0
8140 un 2,311 10161 3 0
8133 qc 2,311 3,3 id 10161 sts -65538 0
8140 un 2,148 1036c 3 0
8133 qc 2,148 3,3 id 1036c sts -65538 0
8140 un 2,2c 2038d 3 0
8133 qc 2,2c 3,3 id 2038d sts -65538 0
8140 un 2,31b 100be 3 0
8133 qc 2,31b 3,3 id 100be sts -65538 0
8140 un 2,263 1016f 3 0
8133 qc 2,263 3,3 id 1016f sts -65538 0
8140 un 2,1e1 10068 3 0
8133 qc 2,1e1 3,3 id 10068 sts -65538 0
8140 un 2,3b2 10175 3 0
8133 qc 2,3b2 3,3 id 10175 sts -65538 0
8140 un 2,309 100af 3 0
8133 qc 2,309 3,3 id 100af sts -65538 0
8140 un 2,3e1 102fe 3 0
8133 qc 2,3e1 3,3 id 102fe sts -65538 0
8140 un 2,31f 1037f 3 0
8133 qc 2,31f 3,3 id 1037f sts -65538 0
8140 un 2,f8 201f7 3 0
8133 qc 2,f8 3,3 id 201f7 sts -65538 0
8140 un 2,55 102cc 3 0
8133 qc 2,55 3,3 id 102cc sts -65538 0
8140 un 2,30d 2007e 3 0
8133 qc 2,30d 3,3 id 2007e sts -65538 0
8140 un 2,313 100b0 3 0
8133 qc 2,313 3,3 id 100b0 sts -65538 0
8140 un 2,25b 1006f 3 0
8133 qc 2,25b 3,3 id 1006f sts -65538 0
8140 un 2,144 103a1 3 0
8133 qc 2,144 3,3 id 103a1 sts -65538 0
8140 un 2,3aa 103f5 3 0
8133 qc 2,3aa 3,3 id 103f5 sts -65538 0
8140 un 2,156 103ab 3 0
8133 qc 2,156 3,3 id 103ab sts -65538 0
8140 un 2,3d9 10099 3 0
8133 qc 2,3d9 3,3 id 10099 sts -65538 0
8140 un 2,b3 103e3 3 0
8133 qc 2,b3 3,3 id 103e3 sts -65538 0
8140 un 2,18 10040 3 0
8133 qc 2,18 3,3 id 10040 sts -65538 0
8140 un 2,317 201ec 3 0
8133 qc 2,317 3,3 id 201ec sts -65538 0
8140 un 2,4d 1017e 3 0
8133 qc 2,4d 3,3 id 1017e sts -65538 0
8140 un 2,3ae 1027e 3 0
8133 qc 2,3ae 3,3 id 1027e sts -65538 0
8140 un 2,3dd 1038e 3 0
8133 qc 2,3dd 3,3 id 1038e sts -65538 0
8140 un 2,179 10168 3 0
8133 qc 2,179 3,3 id 10168 sts -65538 0
8140 un 2,3a6 102ee 3 0
8133 qc 2,3a6 3,3 id 102ee sts -65538 0
8140 un 2,142 1039f 3 0
8133 qc 2,142 3,3 id 1039f sts -65538 0
8140 un 2,3d5 1016e 3 0
8133 qc 2,3d5 3,3 id 1016e sts -65538 0
8140 un 2,3db 10334 3 0
8133 qc 2,3db 3,3 id 10334 sts -65538 0
8140 un 2,39a 10016 3 0
8133 qc 2,39a 3,3 id 10016 sts -65538 0
8140 un 2,146 203cb 3 0
8133 qc 2,146 3,3 id 203cb sts -65538 0
8140 un 2,a3 100e5 3 0
8133 qc 2,a3 3,3 id 100e5 sts -65538 0
8140 un 2,307 100aa 3 0
8133 qc 2,307 3,3 id 100aa sts -65538 0
8140 un 2,27e 200c6 3 0
8133 qc 2,27e 3,3 id 200c6 sts -65538 0
8140 un 2,3df 20326 3 0
8133 qc 2,3df 3,3 id 20326 sts -65538 0
8140 un 2,a7 10366 3 0
8133 qc 2,a7 3,3 id 10366 sts -65538 0
8140 un 2,272 10287 3 0
8133 qc 2,272 3,3 id 10287 sts -65538 0
8140 un 2,3d3 10308 3 0
8133 qc 2,3d3 3,3 id 10308 sts -65538 0
8140 un 2,412 10146 3 0
8140 un 2,3d7 103f6 3 0
8133 qc 2,412 3,3 id 10146 sts -65538 0
8133 qc 2,3d7 3,3 id 103f6 sts -65538 0
8140 un 2,396 20001 3 0
8133 qc 2,396 3,3 id 20001 sts -65538 0
1974 lk 2,18 id 0 -1,3 10000
8133 qc 2,18 -1,3 id 10301 sts 0 0
8140 un 2,18 10301 3 0

lock_dlm:  Assertion failed on line 353 of file /usr/src/cluster-1.00.00/gfs-kernel/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 265742
exp1: error=-22 num=2,18 lkf=10000 flags=84

kernel BUG at /usr/src/cluster-1.00.00/gfs-kernel/src/dlm/lock.c:353 (do_dlm_unlock)!
 [<c48ffde4>] do_dlm_unlock+0x134/0x140 [lock_dlm]  [<c49001d0>] lm_dlm_unlock+0x20/0x30 [lock_dlm]  [<c49eac29>] gfs_lm_unlock+0x39/0x60 [gfs]  [<c49ddd9d>] gfs_glock_drop_th+0x8d/0x1d0 [gfs]  [<c49dcee6>] rq_demote+0xb6/0xd0 [gfs]  [<c49dc945>] gfs_holder_get+0x45/0x60 [gfs]  [<c49dd00c>] run_queue+0xac/0xe0 [gfs]  [<c49dd15b>] unlock_on_glock+0x2b/0x40 [gfs]  [<c49dff84>] gfs_reclaim_glock+0x144/0x1b0 [gfs]  [<c49d0369>] gfs_glockd+0x109/0x120 [gfs]  [<c011a780>] default_wake_function+0x0/0x20  [<c010a082>] ret_from_fork+0x6/0x14  [<c011a780>] default_wake_function+0x0/0x20  [<c49d0260>] gfs_glockd+0x0/0x120 [gfs]  [<c01080a5>] kernel_thread_helper+0x5/0x10 Kernel panic - not syncing: BUG!

Do someone know if there is any reason for this ?


This one was built from the cluster-1.00.00 tarball.

Regards,
--
Sylvain COUTANT

ADVISEO
http://www.adviseo.fr/
http://www.open-sp.fr/





From teigland at redhat.com  Sat Aug 27 02:40:27 2005
From: teigland at redhat.com (David Teigland)
Date: Sat, 27 Aug 2005 10:40:27 +0800
Subject: [Linux-cluster] do_dlm_unlock kernel panic
In-Reply-To: <20050827002611.F36FE31C5BC@postfix4-2.free.fr>
References: <20050827002611.F36FE31C5BC@postfix4-2.free.fr>
Message-ID: <20050827024027.GA10660@redhat.com>

On Sat, Aug 27, 2005 at 02:26:04AM +0200, Sylvain COUTANT wrote:

> I got the following error tonight after a network blackout on one -test-
> server :
> dlm: dlm_unlock: lkid 10301 lockspace not found

This means that the cluster manager, cman, has shut down.  When that
happens it forcibily takes down all the dlm lockspaces which results in
the assertion failures below if gfs happens to be running.

There may be more information in /var/log/messages about why cman shut
down, but it's probably because of the network failure.

Dave

> lock_dlm:  Assertion failed on line 353 of file
> /usr/src/cluster-1.00.00/gfs-kernel/src/dlm/lock.c
> lock_dlm:  assertion:  "!error"
> lock_dlm:  time = 265742
> exp1: error=-22 num=2,18 lkf=10000 flags=84



From teigland at redhat.com  Sat Aug 27 03:15:33 2005
From: teigland at redhat.com (David Teigland)
Date: Sat, 27 Aug 2005 11:15:33 +0800
Subject: [Linux-cluster] Re: DLM: lock timeout and deadlock detection
In-Reply-To: <924c2883050826102720e5ebd4@mail.gmail.com>
References: <924c2883050826102720e5ebd4@mail.gmail.com>
Message-ID: <20050827031533.GB10660@redhat.com>

On Fri, Aug 26, 2005 at 10:27:04AM -0700, Hua Zhong wrote:

> 1. There is no lock timeout imeplemented (at least from API point of
> view). Is this something you plan to add? Is there any draft of the
> design?

Yes, we will probably add timeouts (or take patches adding them.)  We
haven't thought much about the API, but one starting point would be the
first version of the dlm that had timeouts:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm-kernel/src/?cvsroot=cluster

> 2. With no lock timeout in DLM, how does upper-layer applications
> (like GFS) implement such lock timeout (or does GFS also have no
> timeout)? One thought is that it could has its own timer and when it
> expires just cancel the pending lock.

GFS does not use timeouts.

> 3. How much does DLM do wrt deadlock detection? Especially when it
> doesn't have timeout. Is it solely the application's responsibility to
> detect it?

GFS avoids deadlock in general by ordering its lock requests, but it
cannot avoid conversion-deadlock.  GFS depends on a special feature of the
dlm to resolve conversion deadlocks:  the flag DLM_LKF_CONVDEADLK permits
the dlm to detect conversion deadlock and resolve it by demoting one of
the locks involved.  GFS is notified when this happens with the
DLM_SBF_DEMOTED flag that's returned with the lock result.

> 4. This is not a technical question. I'm trying to convince our
> management to use DLM, so I'd like to know how stable it is and on
> what kind of scale the Redhat clustering solution is being used by the
> enterprise. Is it stable enough in production (data like average
> uptime, etc)?

The dlm on linux-kernel (and in -mm releases) is new code and is still in
somewhat of a development stage, it's not been thoroughly tested yet.  The
first version of the dlm (url above) is currently a part of the RHEL4
gfs/cluster product.

Dave



From sco at adviseo.fr  Sat Aug 27 09:42:49 2005
From: sco at adviseo.fr (Sylvain COUTANT)
Date: Sat, 27 Aug 2005 11:42:49 +0200
Subject: [Linux-cluster] do_dlm_unlock kernel panic
In-Reply-To: <20050827024027.GA10660@redhat.com>
Message-ID: <20050827094257.124BA323494@postfix4-2.free.fr>

> > I got the following error tonight after a network blackout on one -test-
> > server :
> > dlm: dlm_unlock: lkid 10301 lockspace not found
> 
> This means that the cluster manager, cman, has shut down.  When that
> happens it forcibily takes down all the dlm lockspaces which results in
> the assertion failures below if gfs happens to be running.

Fine. Do you mean any failure will lead to kernel panic ?


> There may be more information in /var/log/messages about why cman shut
> down, but it's probably because of the network failure.

It is. Not worth checking.


Regards,

--
Sylvain COUTANT

ADVISEO
http://www.adviseo.fr/
http://www.open-sp.fr/




From oldmoonster at gmail.com  Sun Aug 28 17:57:27 2005
From: oldmoonster at gmail.com (Michael)
Date: Mon, 29 Aug 2005 01:57:27 +0800
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <20050827004859.D66965A8690@mail.silvercash.com>
References: <20050827004859.D66965A8690@mail.silvercash.com>
Message-ID: <4311FB07.50703@gmail.com>

Brianu and all,

If you know the way  to implement a redundant solution for gnbd, share 
with me please.

Thanks,

Michael
 

brianu wrote:

> Hello all,
>
>  
>
> This is a question I have basically been asking, the question on why 
> you would want to do it is failover, the docs at 
> http://sourceware.org/cluster/gnbd/gnbd_usage.txt state that 
> dm-multipath is an option for gnbd, and documents elsewhere also 
> indicate that GNBD can be configured as a redundancy, yet I cannot 
> find any documentation on how to configure it.
>
>  
>
> If using LVM to make a volume of imported gnbds is not the answer for 
> redundancy can anyone suggest a method that is? Im not opposed to 
> using any other resource of cluster or GFS but I would really like to 
> implement a redundant solution, ( gnbd, gulm, etc.).
>
>  
>
>  
>
> Does anyone has an example of a redundant solution for a cluster/gfs 
> filesystem i.e ( gnbd, gulm etc) ?
>
>  
>
> Regards,
>
>  
>
> Brian
>
>  
>
> Message: 9
>
> Date: Fri, 26 Aug 2005 16:33:51 +0700
>
> From: "Fajar A. Nugraha" <fajar at telkom.co.id>
>
> Subject: Re: [Linux-cluster] Re: If I have 5 GNBD server?
>
> To: linux clustering <linux-cluster at redhat.com>
>
> Message-ID: <430EE1FF.6050505 at telkom.co.id>
>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>  
>
> Michael wrote:
>
>  
>
>>Can anyone answer my question?
>
>> 
>
>> 
>
>> 
>
> I don't think LVM cares what kind of storage it uses, as long as it's a
>
> block device.
>
> So theoretically, you can mix local disc, FC storage, GNBD, and ATOE and
>
> combine them using LVM. You might run into performance issues (GNBD are
>
> slower than FC disks) and startup issues (LVM has to start after cluster
>
> and gnbd-import successfuly start), but it should be possible.
>
>  
>
> The real question is WHY you want to do that.
>
> AFAIK, if you combine 5 gnbd from 5 gnbd servers into one LVM, and you
>
> are accessing a volume on that volume group, and one of the gnbd server
>
> dies (or hangs), gnbd import will wait forever until that server is back
>
> up. So you'll have five more single-point-of-failures.
>
>  
>
> Regards,
>
>  
>
> Fajar
>
>  
>
>>On 8/26/05, Michael <mikore.li at gmail.com> wrote:
>
>> 
>
>> 
>
>>>Hi,
>
>>> 
>
>>>If I have 5 gnbd servers in the network, each one export 1 block
>
>>>device, can I import all gnbd devices on each gfs client, and use lvm
>
>>>to manage them as 1 shared pool, then mkfs_gfs on it?
>
>>> 
>
>>>Thanks,
>
>>> 
>
>>>Michael
>
>>> 
>
>>>    
>
>>> 
>
>> 
>
>> 
>
>> 
>
>  
>
>  
>
>  
>
> Brian Urrutia
>
> System Administrator
>
> Price Communications Inc.
>
>  
>
>There are 10 types of people in this world, 
>
>those who understand binary and those who don't.
>
>  
>
>------------------------------------------------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>



From nattaponv at hotmail.com  Sun Aug 28 01:25:02 2005
From: nattaponv at hotmail.com (nattapon viroonsri)
Date: Sun, 28 Aug 2005 01:25:02 +0000
Subject: [Linux-cluster] cluster v4  send_broadcast_message error -22
Message-ID: <BAY22-F12230D2FEDEC813D928BB0A6AC0@phx.gbl>

Would any please recommend about this error it show in /var/log/message, 
when i start cman after ccsd.
error messange show like this " SM: send_broadcast_message error -22"
I use redhat cluster manager v.4 with rhel4 update1

_________________________________________________________________
FREE pop-up blocking with the new MSN Toolbar - get it now! 
http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/



From aspanke at hpce.nec.com  Sun Aug 28 09:00:55 2005
From: aspanke at hpce.nec.com (Alexander Spanke)
Date: Sun, 28 Aug 2005 11:00:55 +0200
Subject: [Linux-cluster] Rx error on broadcast ping
Message-ID: <43117D47.2000308@hpce.nec.com>

Dear all,

I was got a reminder by a friedn about our network situation; we are
maintaining a cluster of ~200 nodes, connected via Gigbit ethernet. The
network configuration of the cluster is usual, nothing special; but if I
do a ping -b <bcast address> I see the Rx error counter increasing.

During normal network activity like NFS, scp or other applications,
everything is fine, only with doing broadcast pings I can reproduce the
increasing Rx error.

Does someone of you saw such a problem before ?

Thnx for any idea,

Cheers
 Alex




From mikore.li at gmail.com  Sun Aug 28 16:39:53 2005
From: mikore.li at gmail.com (Michael)
Date: Mon, 29 Aug 2005 00:39:53 +0800
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <20050827004859.D66965A8690@mail.silvercash.com>
References: <20050827004859.D66965A8690@mail.silvercash.com>
Message-ID: <bc57270905082809391fced031@mail.gmail.com>

On 8/27/05, brianu <brianu at silvercash.com> wrote:
>  
>  
> 
> Hello all, 
> 
>   
> 
> This is a question I have basically been asking, the question on why you
> would want to do it is failover, the docs at
> http://sourceware.org/cluster/gnbd/gnbd_usage.txt state
> that dm-multipath is an option for gnbd, and documents elsewhere also
> indicate that GNBD can be configured as a redundancy, yet I cannot find any
> documentation on how to configure it. 
> 
>   
> 
> If using LVM to make a volume of imported gnbds is not the answer for
> redundancy can anyone suggest a method that is? Im not opposed to using any
> other resource of cluster or GFS but I would really like to implement a
> redundant solution, ( gnbd, gulm, etc.). 
> 
Hi, Brianu, maybe LVM + md + gnbd should be one of the solution for
redundancy, for example, you have 2 gnbd servers, each one exports 1
disk. Then, the steps should be:
1. create a RAID-1  /dev/md0 on GFS client with imported 2 gnbd block devices.
2. use LVM  create /dev/vg0 on top of them.
3. mkfs_gfs on /dev/vg0.
I haven't tried this configuration, theoretically, it should work.

Thanks,

Michael
>   
> 
>   
> 
> Does anyone has an example of a redundant solution for a cluster/gfs
> filesystem i.e ( gnbd, gulm etc) ? 
> 
>   
> 
> Regards, 
> 
>   
> 
> Brian 
> 
>   
> 
> Message: 9 
> 
> Date: Fri, 26 Aug 2005 16:33:51 +0700 
> 
> From: "Fajar A. Nugraha" <fajar at telkom.co.id> 
> 
> Subject: Re: [Linux-cluster] Re: If I have 5 GNBD server? 
> 
> To: linux clustering <linux-cluster at redhat.com> 
> 
> Message-ID: <430EE1FF.6050505 at telkom.co.id> 
> 
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed 
> 
>   
> 
> Michael wrote: 
> 
>   
> 
> >Can anyone answer my question? 
> 
> >  
> 
> >  
> 
> >  
> 
> I don't think LVM cares what kind of storage it uses, as long as it's a 
> 
> block device. 
> 
> So theoretically, you can mix local disc, FC storage, GNBD, and ATOE and 
> 
> combine them using LVM. You might run into performance issues (GNBD are 
> 
> slower than FC disks) and startup issues (LVM has to start after cluster 
> 
> and gnbd-import successfuly start), but it should be possible. 
> 
>   
> 
> The real question is WHY you want to do that. 
> 
> AFAIK, if you combine 5 gnbd from 5 gnbd servers into one LVM, and you 
> 
> are accessing a volume on that volume group, and one of the gnbd server 
> 
> dies (or hangs), gnbd import will wait forever until that server is back 
> 
> up. So you'll have five more single-point-of-failures. 
> 
>   
> 
> Regards, 
> 
>   
> 
> Fajar 
> 
>   
> 
> >On 8/26/05, Michael <mikore.li at gmail.com> wrote: 
> 
> >  
> 
> >  
> 
> >>Hi, 
> 
> >>  
> 
> >>If I have 5 gnbd servers in the network, each one export 1 block 
> 
> >>device, can I import all gnbd devices on each gfs client, and use lvm 
> 
> >>to manage them as 1 shared pool, then mkfs_gfs on it? 
> 
> >>  
> 
> >>Thanks, 
> 
> >>  
> 
> >>Michael 
> 
> >>  
> 
> >>     
> 
> >>  
> 
> >  
> 
> >  
> 
> >  
> 
>   
> 
>   
> 
>   
> 
> Brian Urrutia 
> 
> System Administrator 
> 
> Price Communications Inc. 
> 
>   There are 10 types of people in this world, 
> those who understand binary and those who don't.
>  
> 
>   
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
> 
>



From cboudjnah at squiz.net  Mon Aug 29 00:49:49 2005
From: cboudjnah at squiz.net (Chmouel Boudjnah)
Date: Mon, 29 Aug 2005 10:49:49 +1000
Subject: [Linux-cluster] GFS + Oracle storage hardware suggestions?
In-Reply-To: <49404.69.133.94.192.1124947998.squirrel@mail.workforcesoftware.com>
References: <44425.10.200.200.228.1124920533.squirrel@mail.workforcesoftware.com>
	<1124945960.26649.12.camel@paris.squiz.net>
	<49404.69.133.94.192.1124947998.squirrel@mail.workforcesoftware.com>
Message-ID: <1125276589.24747.4.camel@paris.squiz.net>

Hello,

No i am not using DFS/Oracle i am using it with GFS (i guess DFS is a
kind of Oracle Distribued File System). I am using it with on SATA disks
and the way to do Raid on a Coraid server is i guess only by software
Raid (could not find anywhere else how to make it in hardware way).
Performances are still great on a testload benchmark but the clusters
are not live yet.

Cheers, Chmouel.

On Thu, 2005-08-25 at 01:33 -0400, Matt Goebel wrote:
> Are you using DFS/Oracle on AOE?  If so how many drives, PATA or SATA,
> hardware and/or software RAID, etc...?  Any additional info would be
> appreciated.  On paper it looks like my best option right now.
> 
> Chmouel Boudjnah said:
> > On Wed, 2005-08-24 at 17:55 -0400, Matt Goebel wrote:
> >> AOE (Coraid.com):  This looks to be a perfect solution.. low cost and
> >> potentially decent performance.  It's relatively new and I haven't heard
> >> of anyone using it yet though.
> >
> > Works perfectly fine for me with very good performances.
> >
> > --
> > Chmouel Boudjnah - Squiz.net - http://www.squiz.net
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > http://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
-- 
Chmouel Boudjnah - Squiz.net - http://www.squiz.net



From fajar at telkom.co.id  Mon Aug 29 01:49:10 2005
From: fajar at telkom.co.id (Fajar A. Nugraha)
Date: Mon, 29 Aug 2005 08:49:10 +0700
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <20050827004859.D66965A8690@mail.silvercash.com>
References: <20050827004859.D66965A8690@mail.silvercash.com>
Message-ID: <43126996.8000606@telkom.co.id>

brianu wrote:

> Hello all,
>
>  
>
> This is a question I have basically been asking, the question on why 
> you would want to do it is failover, the docs at 
> http://sourceware.org/cluster/gnbd/gnbd_usage.txt state that 
> dm-multipath is an option for gnbd,
>
I'm not sure about dm-multipath. The thing is, when a gnbd server dies, 
instead of saying "read/write failed" as normal block device does, gnbd 
simply retries the request and tries to reconnect if it's disconnected. 
Forever.

> and documents elsewhere also indicate that GNBD can be configured as a 
> redundancy, yet I cannot find any documentation on how to configure it.
>
>  
>
> If using LVM to make a volume of imported gnbds is not the answer for 
> redundancy can anyone suggest a method that is? Im not opposed to 
> using any other resource of cluster or GFS but I would really like to 
> implement a redundant solution, ( gnbd, gulm, etc.).
>
>  
>
It would be possible if you have at least two servers, connected to the 
same storage, running as gnbd server and exporting the same block devices.

You need to have one IP address that can failover to any available node 
(use rgmanager or keepalived to achieve this). That way, if one server 
node dies the IP address will be moved to the other node. Client will be 
disconnected, but since gnbd-import will automatically reconnect (it 
actually connects to a different node since the gnbd server IP address 
was moved) the process will be transparent to the client (all they see 
is a slight delay during reconnect).

Regards,

Fajar



From nattaponv at hotmail.com  Mon Aug 29 02:26:20 2005
From: nattaponv at hotmail.com (nattapon viroonsri)
Date: Mon, 29 Aug 2005 02:26:20 +0000
Subject: [Linux-cluster] Rhcs4 fence with no hardware power switch
Message-ID: <BAY22-F9D130F8A207AA1336DEB4A6AF0@phx.gbl>

I use RedHat cluster v4 for HA and have no power switch hardware
When i disconnected  network cable for client access resource , Failover 
have occur as expect
But the server that was disconnect network cable not be rebooted or fenced 
like rhcs V3(even no power switch).  And the failed server  can't not join 
cluster again.

In Rhcs V3. it reboot fail node automatically without hardware power switch

So, Have anyway to reboot failed node(network problem or server problem) 
without use hardware power switch to make failnode to clean state again ?

_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today it's FREE! 
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/



From sco at adviseo.fr  Mon Aug 29 13:07:07 2005
From: sco at adviseo.fr (Sylvain COUTANT)
Date: Mon, 29 Aug 2005 15:07:07 +0200
Subject: [Linux-cluster] GFS journals
Message-ID: <20050829130714.5A617318444@smtp.cegetel.net>

Hi all,

AFAIK, I need one GFS journal per server connected to the device. How can I adjust the number of journal if I need to add one server as a new "client" ?

If not possible, what is the overhead of oversizing the number of journals when first mfks'ing the device ?

Thanks in advance for tips.


Regards,

--
Sylvain COUTANT

ADVISEO
http://www.adviseo.fr/
http://www.open-sp.fr/





From mbrookov at mines.edu  Mon Aug 29 14:14:59 2005
From: mbrookov at mines.edu (Matthew B. Brookover)
Date: Mon, 29 Aug 2005 08:14:59 -0600
Subject: [Linux-cluster] GFS file system corruption?
Message-ID: <1125324899.12438.11.camel@merlin.Mines.EDU>

I have 6 computers running Redhat Enterrpise 3 release 5, and kernel
2.4.21-32.0.1.ELsmp.

I compiled GFS 6.0.2.20-2 from the source code.  The SAN is an ISCSI
based storage system from LeftHand Networks.   Using ext3, the postmark
disk test works fine, on a GFS file system, we get a number of errors. 
The output from both postmark runs is below.

I unmounted the file systems, and ran gfs_fsck on the GFS system.  It
produced a number of errors like these: 

[root at imagine root]# gfs_fsck -y /dev/pool/u_as
Initializing fsck
Starting pass1
Pass1 complete
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete
Starting pass3
Pass3 complete
Starting pass4
Pass4 complete
Starting pass5
ondisk and fsck bitmaps differ at block 17887
Succeeded.
ondisk and fsck bitmaps differ at block 17888
Succeeded.
ondisk and fsck bitmaps differ at block 17889
Succeeded.
ondisk and fsck bitmaps differ at block 17890
Succeeded.
ondisk and fsck bitmaps differ at block 17891
Succeeded.
ondisk and fsck bitmaps differ at block 17892
Succeeded.
ondisk and fsck bitmaps differ at block 17893
Succeeded.
ondisk and fsck bitmaps differ at block 17894
Succeeded.
ondisk and fsck bitmaps differ at block 17895
Succeeded.
ondisk and fsck bitmaps differ at block 17896
Succeeded.
ondisk and fsck bitmaps differ at block 17897
Succeeded.
ondisk and fsck bitmaps differ at block 17898
Succeeded.
ondisk and fsck bitmaps differ at block 17899
Succeeded.
ondisk and fsck bitmaps differ at block 17900
Succeeded.
ondisk and fsck bitmaps differ at block 17901
Succeeded.

The complete output was from gfs_fsck was 935k.

I have included the the cluster configuration files below.  Fencing is
handled by a perl script that I wrote.  It uses SNMP to turn off the
ports in a Cisco 3750 switch.  There were no log entries from GULM or
GFS on any of the hosts.

PostMark is available from http://www.netapp.com/tech_library/3022.html.

Does any body have any ideas what would do this?

Other GFS file systems on these servers have had similar problems.  It
seems that gfs_fsck repairs bitmap errors after some number of file
creates and deletes.  PostMark is the only program that I have used on
GFS that reports errors.  The EXT3 fsck was clean after PostMark was
ran.

thank you for your assistance.

Matt Brookover
Academic Computing and Networking
Colorado School of Mines
303-273-3436
mbrookov at mines.edu

PostMark output when running on EXT3 file system: 

PostMark v1.5 : 3/27/01
pm>set size 10000 100000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
        1941 seconds total
        1350 seconds of transactions (0 per second)

Files:
        771 created (0 per second)
                Creation alone: 500 files (0 per second)
                Mixed with transactions: 271 files (0 per second)
        247 read (0 per second)
        253 appended (0 per second)
        771 deleted (0 per second)
                Deletion alone: 542 files (27 per second)
                Mixed with transactions: 229 files (0 per second)

Data:
        13100.09 megabytes read (6.75 megabytes per second)
        42926.13 megabytes written (22.12 megabytes per second)
pm>exit


PostMark output when running on GFS file system: 

PostMark v1.5 : 3/27/01
pm>set size 10000 100000000
pm>run
Creating files...Done
Performing transactions....Error: cannot open '615' for writing
Error: cannot open '616' for writing
Error: cannot open '617' for writing
Error: cannot open '619' for writing
Error: cannot open '623' for writing
.Error: cannot open '624' for writing
Error: cannot open '625' for writing
Error: cannot open '626' for writing
Error: cannot open '629' for writing
Error: cannot open '633' for writing
Error: cannot open '634' for writing
Error: cannot open '635' for writing
Error: cannot open '636' for writing
Error: cannot open '637' for writing
Error: cannot open '641' for writing
Error: cannot open '642' for writing
Error: cannot open '643' for writing
Error: cannot open '644' for writing
Error: Cannot delete '637'
Error: Cannot delete '615'
Error: Cannot delete '634'
.Error: cannot open '650' for writing
Error: Cannot delete '625'
Error: cannot open '667' for writing
Error: cannot open '668' for writing
Error: cannot open '669' for writing
.Error: cannot open '687' for writing
.Error: cannot open '696' for writing
Error: cannot open '709' for writing
Error: cannot open '712' for writing
Error: cannot open '719' for writing
Error: cannot open '720' for writing
.Error: cannot open '721' for writing
Error: cannot open '642' for reading
Error: cannot open '722' for writing
Error: cannot open '626' for append
Error: cannot open '642' for append
Error: Cannot delete '624'
Error: cannot open '731' for writing
Error: cannot open '735' for writing
Error: cannot open '736' for writing
Error: cannot open '720' for append
Error: cannot open '737' for writing
Error: cannot open '741' for writing
Error: cannot open '742' for writing
Error: cannot open '743' for writing
Error: cannot open '744' for writing
Error: cannot open '746' for writing
Error: cannot open '748' for writing
Error: Cannot delete '743'
.Error: Cannot delete '721'
Error: cannot open '741' for reading
Error: Cannot delete '719'
Error: cannot open '755' for writing
Error: cannot open '756' for writing
Error: cannot open '641' for reading
Error: cannot open '760' for writing
Error: cannot open '743' for reading
Error: Cannot delete '636'
Error: Cannot delete '669'
Error: cannot open '687' for reading
Done
Deleting files...Error: Cannot delete '615'
Error: Cannot delete '617'
Error: Cannot delete '619'
Error: Cannot delete '623'
Error: Cannot delete '624'
Error: Cannot delete '625'
Error: Cannot delete '626'
Error: Cannot delete '629'
Error: Cannot delete '633'
Error: Cannot delete '634'
Error: Cannot delete '635'
Error: Cannot delete '636'
Error: Cannot delete '637'
Error: Cannot delete '641'
Error: Cannot delete '642'
Error: Cannot delete '643'
Error: Cannot delete '644'
Error: Cannot delete '667'
Error: Cannot delete '668'
Error: Cannot delete '669'
Error: Cannot delete '687'
Error: Cannot delete '696'
Error: Cannot delete '709'
Error: Cannot delete '712'
Error: Cannot delete '719'
Error: Cannot delete '720'
Error: Cannot delete '721'
Error: Cannot delete '722'
Error: Cannot delete '731'
Error: Cannot delete '735'
Error: Cannot delete '736'
Error: Cannot delete '737'
Error: Cannot delete '741'
Error: Cannot delete '742'
Error: Cannot delete '743'
Error: Cannot delete '744'
Error: Cannot delete '746'
Error: Cannot delete '748'
Error: Cannot delete '755'
Error: Cannot delete '756'
Error: Cannot delete '760'
Done
Time:
        1773 seconds total
        1086 seconds of transactions (0 per second)

Files:
        768 created (0 per second)
                Creation alone: 500 files (0 per second)
                Mixed with transactions: 268 files (0 per second)
        239 read (0 per second)
        253 appended (0 per second)
        768 deleted (0 per second)
                Deletion alone: 505 files (16 per second)
                Mixed with transactions: 222 files (0 per second)

Data:
        12812.81 megabytes read (7.23 megabytes per second)
        40532.43 megabytes written (22.86 megabytes per second)
pm>exit


Cluster.css file:

cluster
{
        name = "CSM_ACN"
        lock_gulm
        {
                servers = ["imagine.Mines.EDU","illuminate.Mines.EDU","illusion.Mines.EDU"]
                heartbeat_rate = 3.0
                allowed_misses = 5
        }
}


fence.css file:

fence_devices
{
        CSMACN_fence
        {
                agent = "fence_cisco"
        }
}


Nodes.css file:

nodes
{
        imagine.Mines.EDU
        {
                ip_interfaces
                {
                        eth0 = "138.67.130.1"
                }
                fence
                {
                        snmpfence
                        {
                                CSMACN_fence
                                {
                                        port="imagine"
                                }
                        }
                }
        }

        illuminate.Mines.EDU
        {
                ip_interfaces
                {
                        eth0 = "138.67.130.2"
                }
                fence
                {
                        snmpfence
                        {
                                CSMACN_fence
                                {
                                        port="illuminate"
                                }
                        }
                }
        }

        illusion.Mines.EDU
        {
                ip_interfaces
                {
                        eth0 = "138.67.130.3"
                }
                fence
                {
                        snmpfence
                        {
                                CSMACN_fence
                                {
                                        port="illusion"
                                }
                        }
                }
        }

        inspire.Mines.EDU
        {
                ip_interfaces
                {
                        eth0 = "138.67.130.5"
                }
                fence
                {
                        snmpfence
                        {
                                CSMACN_fence
                                {
                                        port="inspire"
                                }
                        }
                }
        }
        inception.Mines.EDU
        {
                ip_interfaces
                {
                        eth0 = "138.67.130.4"
                }
                fence
                {
                        snmpfence
                        {
                                CSMACN_fence
                                {
                                        port="inception"
                                }
                        }
                }
        }
        incantation.Mines.EDU
        {
                ip_interfaces
                {
                        eth0 = "138.67.130.6"
                }
                fence
                {
                        snmpfence
                        {
                                CSMACN_fence
                                {
                                        port="incantation"
                                }
                        }
                }
        }
}


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050829/71902757/attachment.htm>

From karasiov at infobox.ru  Mon Aug 29 13:06:31 2005
From: karasiov at infobox.ru (karasiov at infobox.ru)
Date: Mon, 29 Aug 2005 17:06:31 +0400
Subject: [Linux-cluster] kernel tainted
In-Reply-To: <BAY22-F9D130F8A207AA1336DEB4A6AF0@phx.gbl>
References: <BAY22-F9D130F8A207AA1336DEB4A6AF0@phx.gbl>
Message-ID: <1249792925.20050829170631@infobox.ru>

Hello, linux clustering

Lock_Harness 1.00.00 (built Aug 29 2005 15:41:33) installed
GFS 1.00.00 (built Aug 29 2005 15:41:56) installed
CMAN 1.00.00 (built Aug 29 2005 15:41:16) installed
NET: Registered protocol family 30
dlm: no version for "kcl_register_service" found: kernel tainted.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
DLM 1.00.00 (built Aug 29 2005 15:41:28) installed
Lock_DLM (built Aug 29 2005 15:41:36) installed

I try to build GFS from cluster-1.00.00.tgz
on Debian Sarge with kernel 2.6.12.3

All works fine - but I can not understand
what string about DLM means. Please, explane me.

Thank you.

-- 
? ?????????,
 karasiov                          mailto:karasiov at infobox.ru



From scoutant at free.fr  Fri Aug 26 23:41:29 2005
From: scoutant at free.fr (Sylvain)
Date: Sat, 27 Aug 2005 01:41:29 +0200
Subject: [Linux-cluster] do_dlm_unlock kernel panic
Message-ID: <20050826234137.5480831D964@postfix4-2.free.fr>

Hi all,

I got the following error tonight after a network blackout on one -test- server :

dlm: dlm_unlock: lkid 10301 lockspace not found
exp1 move flags 0,1,0 ids 0,2,0
exp1 move use event 2
exp1 recover event 2 (first)
exp1 add nodes
exp1 total nodes 3
exp1 rebuild resource directory
exp1 rebuilt 5 resources
exp1 recover event 2 done
exp1 move flags 0,0,1 ids 0,2,2
exp1 process held requests
exp1 processed 0 requests
exp1 recover event 2 finished
exp1 move flags 1,0,0 ids 2,2,2

8140 un 2,13f 10107 3 0
8140 un 2,40e 10074 3 0
8140 un 2,9c 201c7 3 0
8140 un 2,27b 201ab 3 0
8133 qc 2,38d 3,3 id 203a8 sts -65538 0
8133 qc 2,a4 3,3 id 20000 sts -65538 0
8133 qc 2,393 3,3 id 20381 sts -65538 0
8133 qc 2,308 3,3 id 10350 sts -65538 0
8133 qc 2,8e 3,3 id 1015f sts -65538 0
8133 qc 2,3e0 3,3 id 20171 sts -65538 0
8133 qc 2,411 3,3 id 100cc sts -65538 0
8133 qc 2,397 3,3 id 302ef sts -65538 0
8133 qc 2,13f 3,3 id 10107 sts -65538 0
8133 qc 2,40e 3,3 id 10074 sts -65538 0
8133 qc 2,9c 3,3 id 201c7 sts -65538 0
8133 qc 2,27b 3,3 id 201ab sts -65538 0
8140 un 2,38b 202ab 3 0
8140 un 2,3d8 20351 3 0
8140 un 2,304 203f6 3 0
8140 un 2,3dc 103ea 3 0
8140 un 2,319 10023 3 0
8140 un 2,16e 203b9 3 0
8140 un 2,4b 101bf 3 0
8140 un 2,414 10342 3 0
8140 un 2,31d 1026f 3 0
8140 un 2,187 1024e 3 0
8140 un 2,3d4 1016a 3 0
8140 un 2,34 2011a 3 0
8133 qc 2,38b 3,3 id 202ab sts -65538 0
8133 qc 2,3d8 3,3 id 20351 sts -65538 0
8133 qc 2,304 3,3 id 203f6 sts -65538 0
8133 qc 2,3dc 3,3 id 103ea sts -65538 0
8133 qc 2,319 3,3 id 10023 sts -65538 0
8133 qc 2,16e 3,3 id 203b9 sts -65538 0
8133 qc 2,4b 3,3 id 101bf sts -65538 0
8133 qc 2,414 3,3 id 10342 sts -65538 0
8133 qc 2,31d 3,3 id 1026f sts -65538 0
8133 qc 2,187 3,3 id 1024e sts -65538 0
8133 qc 2,3d4 3,3 id 1016a sts -65538 0
8133 qc 2,34 3,3 id 2011a sts -65538 0
8140 un 2,311 10161 3 0
8133 qc 2,311 3,3 id 10161 sts -65538 0
8140 un 2,148 1036c 3 0
8133 qc 2,148 3,3 id 1036c sts -65538 0
8140 un 2,2c 2038d 3 0
8133 qc 2,2c 3,3 id 2038d sts -65538 0
8140 un 2,31b 100be 3 0
8133 qc 2,31b 3,3 id 100be sts -65538 0
8140 un 2,263 1016f 3 0
8133 qc 2,263 3,3 id 1016f sts -65538 0
8140 un 2,1e1 10068 3 0
8133 qc 2,1e1 3,3 id 10068 sts -65538 0
8140 un 2,3b2 10175 3 0
8133 qc 2,3b2 3,3 id 10175 sts -65538 0
8140 un 2,309 100af 3 0
8133 qc 2,309 3,3 id 100af sts -65538 0
8140 un 2,3e1 102fe 3 0
8133 qc 2,3e1 3,3 id 102fe sts -65538 0
8140 un 2,31f 1037f 3 0
8133 qc 2,31f 3,3 id 1037f sts -65538 0
8140 un 2,f8 201f7 3 0
8133 qc 2,f8 3,3 id 201f7 sts -65538 0
8140 un 2,55 102cc 3 0
8133 qc 2,55 3,3 id 102cc sts -65538 0
8140 un 2,30d 2007e 3 0
8133 qc 2,30d 3,3 id 2007e sts -65538 0
8140 un 2,313 100b0 3 0
8133 qc 2,313 3,3 id 100b0 sts -65538 0
8140 un 2,25b 1006f 3 0
8133 qc 2,25b 3,3 id 1006f sts -65538 0
8140 un 2,144 103a1 3 0
8133 qc 2,144 3,3 id 103a1 sts -65538 0
8140 un 2,3aa 103f5 3 0
8133 qc 2,3aa 3,3 id 103f5 sts -65538 0
8140 un 2,156 103ab 3 0
8133 qc 2,156 3,3 id 103ab sts -65538 0
8140 un 2,3d9 10099 3 0
8133 qc 2,3d9 3,3 id 10099 sts -65538 0
8140 un 2,b3 103e3 3 0
8133 qc 2,b3 3,3 id 103e3 sts -65538 0
8140 un 2,18 10040 3 0
8133 qc 2,18 3,3 id 10040 sts -65538 0
8140 un 2,317 201ec 3 0
8133 qc 2,317 3,3 id 201ec sts -65538 0
8140 un 2,4d 1017e 3 0
8133 qc 2,4d 3,3 id 1017e sts -65538 0
8140 un 2,3ae 1027e 3 0
8133 qc 2,3ae 3,3 id 1027e sts -65538 0
8140 un 2,3dd 1038e 3 0
8133 qc 2,3dd 3,3 id 1038e sts -65538 0
8140 un 2,179 10168 3 0
8133 qc 2,179 3,3 id 10168 sts -65538 0
8140 un 2,3a6 102ee 3 0
8133 qc 2,3a6 3,3 id 102ee sts -65538 0
8140 un 2,142 1039f 3 0
8133 qc 2,142 3,3 id 1039f sts -65538 0
8140 un 2,3d5 1016e 3 0
8133 qc 2,3d5 3,3 id 1016e sts -65538 0
8140 un 2,3db 10334 3 0
8133 qc 2,3db 3,3 id 10334 sts -65538 0
8140 un 2,39a 10016 3 0
8133 qc 2,39a 3,3 id 10016 sts -65538 0
8140 un 2,146 203cb 3 0
8133 qc 2,146 3,3 id 203cb sts -65538 0
8140 un 2,a3 100e5 3 0
8133 qc 2,a3 3,3 id 100e5 sts -65538 0
8140 un 2,307 100aa 3 0
8133 qc 2,307 3,3 id 100aa sts -65538 0
8140 un 2,27e 200c6 3 0
8133 qc 2,27e 3,3 id 200c6 sts -65538 0
8140 un 2,3df 20326 3 0
8133 qc 2,3df 3,3 id 20326 sts -65538 0
8140 un 2,a7 10366 3 0
8133 qc 2,a7 3,3 id 10366 sts -65538 0
8140 un 2,272 10287 3 0
8133 qc 2,272 3,3 id 10287 sts -65538 0
8140 un 2,3d3 10308 3 0
8133 qc 2,3d3 3,3 id 10308 sts -65538 0
8140 un 2,412 10146 3 0
8140 un 2,3d7 103f6 3 0
8133 qc 2,412 3,3 id 10146 sts -65538 0
8133 qc 2,3d7 3,3 id 103f6 sts -65538 0
8140 un 2,396 20001 3 0
8133 qc 2,396 3,3 id 20001 sts -65538 0
1974 lk 2,18 id 0 -1,3 10000
8133 qc 2,18 -1,3 id 10301 sts 0 0
8140 un 2,18 10301 3 0

lock_dlm:  Assertion failed on line 353 of file /usr/src/cluster-1.00.00/gfs-kernel/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 265742
exp1: error=-22 num=2,18 lkf=10000 flags=84

kernel BUG at /usr/src/cluster-1.00.00/gfs-kernel/src/dlm/lock.c:353 (do_dlm_unlock)!
 [<c48ffde4>] do_dlm_unlock+0x134/0x140 [lock_dlm]
 [<c49001d0>] lm_dlm_unlock+0x20/0x30 [lock_dlm]
 [<c49eac29>] gfs_lm_unlock+0x39/0x60 [gfs]
 [<c49ddd9d>] gfs_glock_drop_th+0x8d/0x1d0 [gfs]
 [<c49dcee6>] rq_demote+0xb6/0xd0 [gfs]
 [<c49dc945>] gfs_holder_get+0x45/0x60 [gfs]
 [<c49dd00c>] run_queue+0xac/0xe0 [gfs]
 [<c49dd15b>] unlock_on_glock+0x2b/0x40 [gfs]
 [<c49dff84>] gfs_reclaim_glock+0x144/0x1b0 [gfs]
 [<c49d0369>] gfs_glockd+0x109/0x120 [gfs]
 [<c011a780>] default_wake_function+0x0/0x20
 [<c010a082>] ret_from_fork+0x6/0x14
 [<c011a780>] default_wake_function+0x0/0x20
 [<c49d0260>] gfs_glockd+0x0/0x120 [gfs]
 [<c01080a5>] kernel_thread_helper+0x5/0x10
Kernel panic - not syncing: BUG!

Do someone know if there is any reason for this ?


This one was built from the cluster-1.00.00 tarball.

Regards,
--
Sylvain COUTANT

ADVISEO
http://www.adviseo.fr/
http://www.open-sp.fr/





From linux-cluster at redhat.com  Sat Aug 27 15:16:35 2005
From: linux-cluster at redhat.com (Cluster2005)
Date: Sat, 27 Aug 2005 11:16:35 -0400
Subject: [Linux-cluster] IEEE Cluster2005 Boston Early Bird Deadline
	Approaching
Message-ID: <20050827151635.1EADC4098A@villi.rcsnetworks.com>

see http://cluster2005.org



From mbrookov at mines.edu  Sun Aug 28 20:31:13 2005
From: mbrookov at mines.edu (Matthew B. Brookover)
Date: Sun, 28 Aug 2005 14:31:13 -0600
Subject: [Linux-cluster] GFS file system corruption?
Message-ID: <1125261073.10424.74.camel@merlin.Mines.EDU>

I have 6 computers running Redhat Enterrpise 3 release 5, running kernel
2.4.21-32.0.1.ELsmp.

>From the source code I compiled GFS 6.0.2.20-2.  The SAN is an ISCSI
based storage system from LeftHand Networks.   Using ext3, the postmark
disk test works fine, on a GFS file system, we get a number of errors. 
The output from both postmark runs is below.

I unmounted the file systems, and ran gfs_fsck on the GFS system.  It
produced a number of errors like these:

Pass4 complete
Starting pass5
ondisk and fsck bitmaps differ at block 17887
Succeeded.
ondisk and fsck bitmaps differ at block 17888
Succeeded.
ondisk and fsck bitmaps differ at block 17889
Succeeded.
ondisk and fsck bitmaps differ at block 17890
Succeeded.
ondisk and fsck bitmaps differ at block 17891
Succeeded.
ondisk and fsck bitmaps differ at block 17892

(The complete gfs_fsck output is attached, I compressed it because the
file is quite large)

I have attached the cluster configuration files if you are interested. 
Fencing is handeled by a perl script that I wrote.  It uses SNMP to turn
off the ports in a Cisco 3750 switch.  There were no log entries from
GULM or GFS on any of the hosts.

PostMark is available from http://www.netapp.com/tech_library/3022.html.

Does any body have any ideas what would do this?

Other GFS file systems on these servers have had similar problems.  It
seems that gfs_fsck repairs bitmap errors after some number of file
creates and deletes.  PostMark is the only program that I have used on
GFS that reports errors.  The EXT3 fsck was clean after PostMark was
ran.

thank you for your assistance.

Matt Brookover
Academic Computing and Networking
Colorado School of Mines
303-273-3436
mbrookov at mines.edu

PostMark output when running on EXT3 file system:

PostMark v1.5 : 3/27/01
pm>set size 10000 100000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
        1941 seconds total
        1350 seconds of transactions (0 per second)

Files:
        771 created (0 per second)
                Creation alone: 500 files (0 per second)
                Mixed with transactions: 271 files (0 per second)
        247 read (0 per second)
        253 appended (0 per second)
        771 deleted (0 per second)
                Deletion alone: 542 files (27 per second)
                Mixed with transactions: 229 files (0 per second)

Data:
        13100.09 megabytes read (6.75 megabytes per second)
        42926.13 megabytes written (22.12 megabytes per second)
pm>exit


PostMark output when running on GFS file system:

PostMark v1.5 : 3/27/01
pm>set size 10000 100000000
pm>run
Creating files...Done
Performing transactions....Error: cannot open '615' for writing
Error: cannot open '616' for writing
Error: cannot open '617' for writing
Error: cannot open '619' for writing
Error: cannot open '623' for writing
.Error: cannot open '624' for writing
Error: cannot open '625' for writing
Error: cannot open '626' for writing
Error: cannot open '629' for writing
Error: cannot open '633' for writing
Error: cannot open '634' for writing
Error: cannot open '635' for writing
Error: cannot open '636' for writing
Error: cannot open '637' for writing
Error: cannot open '641' for writing
Error: cannot open '642' for writing
Error: cannot open '643' for writing
Error: cannot open '644' for writing
Error: Cannot delete '637'
Error: Cannot delete '615'
Error: Cannot delete '634'
.Error: cannot open '650' for writing
Error: Cannot delete '625'
Error: cannot open '667' for writing
Error: cannot open '668' for writing
Error: cannot open '669' for writing
.Error: cannot open '687' for writing
.Error: cannot open '696' for writing
Error: cannot open '709' for writing
Error: cannot open '712' for writing
Error: cannot open '719' for writing
Error: cannot open '720' for writing
.Error: cannot open '721' for writing
Error: cannot open '642' for reading
Error: cannot open '722' for writing
Error: cannot open '626' for append
Error: cannot open '642' for append
Error: Cannot delete '624'
Error: cannot open '731' for writing
Error: cannot open '735' for writing
Error: cannot open '736' for writing
Error: cannot open '720' for append
Error: cannot open '737' for writing
Error: cannot open '741' for writing
Error: cannot open '742' for writing
Error: cannot open '743' for writing
Error: cannot open '744' for writing
Error: cannot open '746' for writing
Error: cannot open '748' for writing
Error: Cannot delete '743'
.Error: Cannot delete '721'
Error: cannot open '741' for reading
Error: Cannot delete '719'
Error: cannot open '755' for writing
Error: cannot open '756' for writing
Error: cannot open '641' for reading
Error: cannot open '760' for writing
Error: cannot open '743' for reading
Error: Cannot delete '636'
Error: Cannot delete '669'
Error: cannot open '687' for reading
Done
Deleting files...Error: Cannot delete '615'
Error: Cannot delete '617'
Error: Cannot delete '619'
Error: Cannot delete '623'
Error: Cannot delete '624'
Error: Cannot delete '625'
Error: Cannot delete '626'
Error: Cannot delete '629'
Error: Cannot delete '633'
Error: Cannot delete '634'
Error: Cannot delete '635'
Error: Cannot delete '636'
Error: Cannot delete '637'
Error: Cannot delete '641'
Error: Cannot delete '642'
Error: Cannot delete '643'
Error: Cannot delete '644'
Error: Cannot delete '667'
Error: Cannot delete '668'
Error: Cannot delete '669'
Error: Cannot delete '687'
Error: Cannot delete '696'
Error: Cannot delete '709'
Error: Cannot delete '712'
Error: Cannot delete '719'
Error: Cannot delete '720'
Error: Cannot delete '721'
Error: Cannot delete '722'
Error: Cannot delete '731'
Error: Cannot delete '735'
Error: Cannot delete '736'
Error: Cannot delete '737'
Error: Cannot delete '741'
Error: Cannot delete '742'
Error: Cannot delete '743'
Error: Cannot delete '744'
Error: Cannot delete '746'
Error: Cannot delete '748'
Error: Cannot delete '755'
Error: Cannot delete '756'
Error: Cannot delete '760'
Done
Time:
        1773 seconds total
        1086 seconds of transactions (0 per second)

Files:
        768 created (0 per second)
                Creation alone: 500 files (0 per second)
                Mixed with transactions: 268 files (0 per second)
        239 read (0 per second)
        253 appended (0 per second)
        768 deleted (0 per second)
                Deletion alone: 505 files (16 per second)
                Mixed with transactions: 222 files (0 per second)

Data:
        12812.81 megabytes read (7.23 megabytes per second)
        40532.43 megabytes written (22.86 megabytes per second)
pm>exit


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050828/743faa0b/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gfs_fsck_all.out1.bz2
Type: application/x-bzip
Size: 36928 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050828/743faa0b/attachment.bin>
-------------- next part --------------
cluster
{ 
	name = "CSM_ACN" 
	lock_gulm
	{
		servers = ["imagine.Mines.EDU","illuminate.Mines.EDU","illusion.Mines.EDU"]
		heartbeat_rate = 3.0
		allowed_misses = 5
	} 
}
-------------- next part --------------
fence_devices
{
	CSMACN_fence
	{
		agent = "fence_cisco"
	}
}
-------------- next part --------------
nodes
{
	imagine.Mines.EDU
	{
		ip_interfaces
		{
			eth0 = "138.67.130.1"
		}
		fence
		{
			snmpfence
			{
				CSMACN_fence
				{
					port="imagine"
				}
			}
		}
	}

	illuminate.Mines.EDU
	{
		ip_interfaces
		{
			eth0 = "138.67.130.2"
		}
		fence
		{
			snmpfence
			{
				CSMACN_fence
				{
					port="illuminate"
				}
			}
		}
	}

	illusion.Mines.EDU
	{
		ip_interfaces
		{
			eth0 = "138.67.130.3"
		}
		fence
		{
			snmpfence
			{
				CSMACN_fence
				{
					port="illusion"
				} 
			} 
		} 
	}
	
	inspire.Mines.EDU
	{
		ip_interfaces
		{
			eth0 = "138.67.130.5"
		}
		fence
		{
			snmpfence
			{
				CSMACN_fence
				{
					port="inspire"
				}
			}
		}
	}
	inception.Mines.EDU
	{
		ip_interfaces
		{
			eth0 = "138.67.130.4"
		}
		fence
		{
			snmpfence
			{
				CSMACN_fence
				{
					port="inception"
				}
			}
		}
	}
	incantation.Mines.EDU
	{
		ip_interfaces
		{
			eth0 = "138.67.130.6"
		}
		fence
		{
			snmpfence
			{
				CSMACN_fence
				{
					port="incantation"
				}
			}
		}
	}
}

From brianu at mail.silvercash.com  Mon Aug 29 07:56:20 2005
From: brianu at mail.silvercash.com (brian urrutia)
Date: Mon, 29 Aug 2005 00:56:20 -0700
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <bc57270905082809391fced031@mail.gmail.com>
References: <20050827004859.D66965A8690@mail.silvercash.com>
	<bc57270905082809391fced031@mail.gmail.com>
Message-ID: <1125302180.8421.11.camel@bfc4-dev1>

> > If using LVM to make a volume of imported gnbds is not the answer for
> > redundancy can anyone suggest a method that is? Im not opposed to using any
> > other resource of cluster or GFS but I would really like to implement a
> > redundant solution, ( gnbd, gulm, etc.). 
> > 
> Hi, Brianu, maybe LVM + md + gnbd should be one of the solution for
> redundancy, for example, you have 2 gnbd servers, each one exports 1
> disk. Then, the steps should be:
> 1. create a RAID-1  /dev/md0 on GFS client with imported 2 gnbd block devices.
> 2. use LVM  create /dev/vg0 on top of them.
> 3. mkfs_gfs on /dev/vg0.
> I haven't tried this configuration, theoretically, it should work.
> 
> Thanks,
> 
> Michael

I will look into trying a md & lvm combo, as far as keepalived or
rgmanager to failover an ip, i havent seen a clear example on how to use
rgmanager, but i have tried heartbeat (linux HA) to failover the ip, and
the problem is that the gnbd clients still seem to lock on the former
server regardless of that the ip has failed over to another ip ( and continuly try and reconnect as Fajar had mentioned).


The shared storage I have is a HP MSA 100 SAN
It might be a config error on my part as far as rgmanager is concerned i
will have to post my cluster.conf tommorrow.

-Brian



From mbrookov at mines.edu  Mon Aug 29 14:54:52 2005
From: mbrookov at mines.edu (Matthew B. Brookover)
Date: Mon, 29 Aug 2005 08:54:52 -0600
Subject: [Linux-cluster] GFS file system corruption?
In-Reply-To: <1125261073.10424.74.camel@merlin.Mines.EDU>
References: <1125261073.10424.74.camel@merlin.Mines.EDU>
Message-ID: <1125327291.12438.17.camel@merlin.Mines.EDU>

Sorry for posting this twice, I sent out the first version yesterday and
figured it did not make it out when I did not see it this morning, so I
posted a second version without any attachments this morning.  Next time
I will be more patient.

Matt


On Sun, 2005-08-28 at 14:31, Matthew B. Brookover wrote:

> I have 6 computers running Redhat Enterrpise 3 release 5, running
> kernel 2.4.21-32.0.1.ELsmp.
> 
> >From the source code I compiled GFS 6.0.2.20-2.  The SAN is an ISCSI
> based storage system from LeftHand Networks.   Using ext3, the
> postmark disk test works fine, on a GFS file system, we get a number
> of errors.  The output from both postmark runs is below.
> 
> I unmounted the file systems, and ran gfs_fsck on the GFS system.  It
> produced a number of errors like these: 
> 
> Pass4 complete
> Starting pass5
> ondisk and fsck bitmaps differ at block 17887
> Succeeded.
> ondisk and fsck bitmaps differ at block 17888
> Succeeded.
> ondisk and fsck bitmaps differ at block 17889
> Succeeded.
> ondisk and fsck bitmaps differ at block 17890
> Succeeded.
> ondisk and fsck bitmaps differ at block 17891
> Succeeded.
> ondisk and fsck bitmaps differ at block 17892
> 
> (The complete gfs_fsck output is attached, I compressed it because the
> file is quite large)
> 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050829/35e4611b/attachment.htm>

From alewis at redhat.com  Mon Aug 29 16:48:54 2005
From: alewis at redhat.com (AJ Lewis)
Date: Mon, 29 Aug 2005 11:48:54 -0500
Subject: [Linux-cluster] GFS file system corruption?
In-Reply-To: <1125261073.10424.74.camel@merlin.Mines.EDU>
References: <1125261073.10424.74.camel@merlin.Mines.EDU>
Message-ID: <20050829164854.GA19475@null.msp.redhat.com>

On Sun, Aug 28, 2005 at 02:31:13PM -0600, Matthew B. Brookover wrote:
> I have 6 computers running Redhat Enterrpise 3 release 5, running kernel
> 2.4.21-32.0.1.ELsmp.
> 
> >From the source code I compiled GFS 6.0.2.20-2.  The SAN is an ISCSI
> based storage system from LeftHand Networks.   Using ext3, the postmark
> disk test works fine, on a GFS file system, we get a number of errors. 
> The output from both postmark runs is below.

Looks like you're running into bugs
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160835
and
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160525
which have been fixed in the latest code.  Unfortunately there is not a GFS
6.0 srpm released yet that contains this fix, but I've attached a patch that
clears them up, as well as a couple other minor issues.

I'm not sure about the postmark errors you're seeing - hopefully someone else
on the list can shed some light on them.  But try the fsck with these patches
and see if there are any actual problems with the fs shown.

Regards,
-- 
AJ Lewis                                   Voice:  612-638-0500
Red Hat                                    E-Mail: alewis at redhat.com
One Main Street SE, Suite 209
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...

-------------- next part --------------
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/fs_recovery.c gfs_fsck-6.0-cvs/fs_recovery.c
--- gfs_fsck-6.0.2.20/fs_recovery.c	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/fs_recovery.c	2005-06-21 13:39:10.000000000 -0500
@@ -37,6 +37,8 @@ static int reconstruct_single_journal(st
   srandom(time(NULL));
   sequence = jdesc->ji_nsegment / (RAND_MAX + 1.0) * random();
 
+  log_info("Clearing journal %d\n", jnum);
+
   for (seg = 0; seg < jdesc->ji_nsegment; seg++){
     memset(buf, 0, sdp->sb.sb_bsize);
     memset(&lh, 0, sizeof(struct gfs_log_header));
@@ -75,9 +77,13 @@ static int reconstruct_single_journal(st
 int reconstruct_journals(struct fsck_sb *sdp){
   int i;
 
-  for(i=0; i < sdp->journals; i++)
+  log_warn("Clearing journals (this may take a while)\n");
+  for(i=0; i < sdp->journals; i++) {
+    if((i % 10) == 0)
+      log_at_notice(".");
     if(reconstruct_single_journal(sdp, i))
       return -1;
-
+  }
+  log_notice("Cleared journals\n");
   return 0;
 }
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/fsck.h gfs_fsck-6.0-cvs/fsck.h
--- gfs_fsck-6.0.2.20/fsck.h	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/fsck.h	2005-06-20 09:55:52.000000000 -0500
@@ -46,6 +46,7 @@ struct options {
 
 int initialize(struct fsck_sb *sbp);
 void destroy(struct fsck_sb *sbp);
+int block_mounters(struct fsck_sb *sbp, int block_em);
 int pass1(struct fsck_sb *sbp);
 int pass1b(struct fsck_sb *sbp);
 int pass1c(struct fsck_sb *sbp);
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/initialize.c gfs_fsck-6.0-cvs/initialize.c
--- gfs_fsck-6.0.2.20/initialize.c	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/initialize.c	2005-06-20 09:55:52.000000000 -0500
@@ -52,6 +52,37 @@ int init_journals(struct fsck_sb *sbp)
 	return 0;
 }
 
+/**
+ * block_mounters
+ *
+ * Change the lock protocol so nobody can mount the fs
+ *
+ */
+int block_mounters(struct fsck_sb *sbp, int block_em)
+{
+	if(block_em) {
+		/* verify it starts with lock_ */
+		if(!strncmp(sbp->sb.sb_lockproto, "lock_", 5)) {
+			/* Change lock_ to fsck_ */
+			memcpy(sbp->sb.sb_lockproto, "fsck_", 5);
+		}
+		/* FIXME: Need to do other verification in the else
+		 * case */
+	} else {
+		/* verify it starts with fsck_ */
+		/* verify it starts with lock_ */
+		if(!strncmp(sbp->sb.sb_lockproto, "fsck_", 5)) {
+			/* Change fsck_ to lock_ */
+			memcpy(sbp->sb.sb_lockproto, "lock_", 5);
+		}
+	}
+
+	if(write_sb(sbp)) {
+		stack;
+		return -1;
+	}
+	return 0;
+}
 
 
 /*
@@ -330,8 +361,18 @@ int init_sbp(struct fsck_sb *sbp)
 		}
 	}
 	if (fill_super_block(sbp)) {
+		stack;
 		return -1;
 	}
+
+	/* Change lock protocol to be fsck_* instead of lock_* */
+	if(!sbp->opts->no) {
+		if(block_mounters(sbp, 1)) {
+			log_err("Unable to block other mounters\n");
+			return -1;
+		}
+	}
+
 	/* verify various things */
 
 	if(init_journals(sbp)) {
@@ -344,9 +385,14 @@ int init_sbp(struct fsck_sb *sbp)
 
 void destroy_sbp(struct fsck_sb *sbp)
 {
-	empty_super_block(sbp);
-	if(!sbp->opts->no)
+	if(!sbp->opts->no) {
+		if(block_mounters(sbp, 0)) {
+			log_warn("Unable to unblock other mounters - manual intevention required\n");
+			log_warn("Use 'gfs_tool sb <device> proto' to fix\n");
+		}
 		fsync(sbp->diskfd);
+	}
+	empty_super_block(sbp);
 	close(sbp->diskfd);
 }
 
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/log.c gfs_fsck-6.0-cvs/log.c
--- gfs_fsck-6.0.2.20/log.c	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/log.c	2005-06-21 13:39:10.000000000 -0500
@@ -60,7 +60,8 @@ void print_msg(int priority, char *file,
 	return;
 }
 
-void print_fsck_log(int priority, char *file, int line, const char *format, ...)
+
+void print_fsck_log(int iif, int priority, char *file, int line, const char *format, ...)
 {
 
 	va_list args;
@@ -70,10 +71,9 @@ void print_fsck_log(int priority, char *
 
 	transform = _(format);
 
-	if(_state.print_level >= priority) {
+	if((_state.print_level == priority) ||
+	   (!iif && (_state.print_level >= priority)))
 		print_msg(priority, file, line, transform, args);
-	}
-
 
 	va_end(args);
 }
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/log.h gfs_fsck-6.0-cvs/log.h
--- gfs_fsck-6.0.2.20/log.h	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/log.h	2005-06-21 13:39:10.000000000 -0500
@@ -23,46 +23,76 @@
 
 
 
-#define print_log(priority, format...) \
+#define print_log(iif, priority, format...)	\
 do { \
-	print_fsck_log(priority, __FILE__, __LINE__, ## format); \
+	print_fsck_log(iif, priority, __FILE__, __LINE__, ## format);	\
 } while(0)
 
 #define log_debug(format...) \
 do { \
-	print_log(MSG_DEBUG, format); \
+	print_log(0, MSG_DEBUG, format);		\
 } while(0)
 
 #define log_info(format...) \
 do { \
-	print_log(MSG_INFO, format); \
+	print_log(0, MSG_INFO, format);		\
 } while(0)
 
 #define log_notice(format...) \
 do { \
-	print_log(MSG_NOTICE, format); \
+	print_log(0, MSG_NOTICE, format);	\
 } while(0)
 
 #define log_warn(format...) \
 do { \
-	print_log(MSG_WARN, format); \
+	print_log(0, MSG_WARN, format);		\
 } while(0)
 
 #define log_err(format...) \
 do { \
-	print_log(MSG_ERROR, format); \
+	print_log(0, MSG_ERROR, format);		\
 } while(0)
 
 #define log_crit(format...) \
 do { \
-	print_log(MSG_CRITICAL, format); \
+	print_log(0, MSG_CRITICAL, format);	\
 } while(0)
 
 #define stack log_debug("<backtrace> - %s()\n", __func__)
 
+#define log_at_debug(format...)		\
+do { \
+	print_log(1, MSG_DEBUG, format);	\
+} while(0)
+
+#define log_at_info(format...) \
+do { \
+	print_log(1, MSG_INFO, format);		\
+} while(0)
+
+#define log_at_notice(format...) \
+do { \
+	print_log(1, MSG_NOTICE, format);	\
+} while(0)
+
+#define log_at_warn(format...) \
+do { \
+	print_log(1, MSG_WARN, format);		\
+} while(0)
+
+#define log_at_err(format...) \
+do { \
+	print_log(1, MSG_ERROR, format);		\
+} while(0)
+
+#define log_at_crit(format...) \
+do { \
+	print_log(1, MSG_CRITICAL, format);	\
+} while(0)
+
 void increase_verbosity(void);
 void decrease_verbosity(void);
-void print_fsck_log(int priority, char *file, int line, const char *format, ...);
+void print_fsck_log(int iif, int priority, char *file, int line, const char *format, ...);
 int query(struct fsck_sb *sbp, const char *format, ...);
 
 
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/metawalk.c gfs_fsck-6.0-cvs/metawalk.c
--- gfs_fsck-6.0.2.20/metawalk.c	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/metawalk.c	2005-07-11 13:22:39.000000000 -0500
@@ -464,6 +464,8 @@ static int build_metalist(struct fsck_in
 			for (ptr = (uint64 *)(bh->b_data + head_size);
 			     (char *)ptr < (bh->b_data + bh->b_size);
 			     ptr++) {
+				nbh = NULL;
+
 				if (!*ptr)
 					continue;
 
@@ -481,9 +483,11 @@ static int build_metalist(struct fsck_in
 					continue;
 				}
 				if(!nbh) {
-					/* FIXME: error checking */
-					get_and_read_buf(ip->i_sbd, block,
-							 &nbh, 0);
+					if(get_and_read_buf(ip->i_sbd, block,
+							    &nbh, 0)) {
+						stack;
+						goto fail;
+					}
 				}
 				osi_list_add(&nbh->b_list, cur_list);
 			}
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/pass1b.c gfs_fsck-6.0-cvs/pass1b.c
--- gfs_fsck-6.0.2.20/pass1b.c	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/pass1b.c	2005-07-11 13:22:39.000000000 -0500
@@ -77,7 +77,17 @@ static int check_data(struct fsck_inode 
 static int check_eattr_indir(struct fsck_inode *ip, uint64_t block,
 			     uint64_t parent, osi_buf_t **bh, void *private)
 {
+	struct fsck_sb *sbp = ip->i_sbd;
+	osi_buf_t *indir_bh = NULL;
+
 	inc_if_found(block, 0, private);
+	if(get_and_read_buf(sbp, block, &indir_bh, 0)){
+		log_warn("Unable to read EA leaf block #%"PRIu64".\n",
+			 block);
+		return 1;
+	}
+
+	*bh = indir_bh;
 
 	return 0;
 }
@@ -85,8 +95,17 @@ static int check_eattr_indir(struct fsck
 static int check_eattr_leaf(struct fsck_inode *ip, uint64_t block,
 			    uint64_t parent, osi_buf_t **bh, void *private)
 {
+	struct fsck_sb *sbp = ip->i_sbd;
+	osi_buf_t *leaf_bh = NULL;
+
 	inc_if_found(block, 0, private);
+	if(get_and_read_buf(sbp, block, &leaf_bh, 0)){
+		log_warn("Unable to read EA leaf block #%"PRIu64".\n",
+			 block);
+		return 1;
+	}
 
+	*bh = leaf_bh;
 	return 0;
 }
 
@@ -326,11 +345,14 @@ int find_block_ref(struct fsck_sb *sbp, 
 		stack;
 		return -1;
 	}
+	log_info("Checking inode %"PRIu64"'s metatree for references to block %"PRIu64"\n",
+		 inode, b->block_no);
 	if(check_metatree(ip, &find_refs)) {
 		stack;
 		free_inode(&ip);
 		return -1;
 	}
+	log_info("Done checking metatree\n");
 
 	if (myfi.found) {
 		if(!(id = malloc(sizeof(*id)))) {
@@ -463,7 +485,9 @@ int pass1b(struct fsck_sb *sbp)
 	/* Rescan the fs looking for pointers to blocks that are in
 	 * the duplicate block map */
 	log_info("Scanning filesystem for inodes containing duplicate blocks...\n");
-	for(i = 0; i < sbp->last_fs_block; i++) {
+	log_debug("Filesystem has %"PRIu64" blocks total\n", sbp->last_fs_block);
+	for(i = 0; i < sbp->last_fs_block; i += 1) {
+		log_debug("Scanning block %"PRIu64" for inodes\n", i);
 		if(block_check(sbp->bl, i, &q)) {
 			stack;
 			return -1;
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/pass1c.c gfs_fsck-6.0-cvs/pass1c.c
--- gfs_fsck-6.0.2.20/pass1c.c	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/pass1c.c	2005-06-15 10:51:08.000000000 -0500
@@ -46,6 +46,7 @@ int check_eattr_indir(struct fsck_inode 
 	int *update = (int *) private;
 	struct fsck_sb *sbp = ip->i_sbd;
 	struct block_query q;
+	osi_buf_t *indir_bh;
 
 	if(check_range(sbp, block)) {
 		log_err("Extended attributes indirect block out of range...removing\n");
@@ -63,7 +64,15 @@ int check_eattr_indir(struct fsck_inode 
 		*update = 1;
 		return 1;
 	}
+	else if(get_and_read_buf(sbp, block, &indir_bh, 0)){
+		log_warn("Unable to read EA leaf block #%"PRIu64".\n",
+			 block);
+		ip->i_di.di_eattr = 0;
+		*update = 1;
+		return 1;
+	}
 
+	*bh = indir_bh;
 	return 0;
 }
 int check_eattr_leaf(struct fsck_inode *ip, uint64_t block,
@@ -72,6 +81,7 @@ int check_eattr_leaf(struct fsck_inode *
 	int *update = (int *) private;
 	struct fsck_sb *sbp = ip->i_sbd;
 	struct block_query q;
+	osi_buf_t *leaf_bh;
 
 	if(check_range(sbp, block)) {
 		log_err("Extended attributes block out of range...removing\n");
@@ -89,6 +99,15 @@ int check_eattr_leaf(struct fsck_inode *
 		*update = 1;
 		return 1;
 	}
+	else if(get_and_read_buf(sbp, block, &leaf_bh, 0)){
+		log_warn("Unable to read EA leaf block #%"PRIu64".\n",
+			 block);
+		ip->i_di.di_eattr = 0;
+		*update = 1;
+		return 1;
+	}
+
+	*bh = leaf_bh;
 
 	return 0;
 
@@ -200,7 +219,7 @@ int pass1c(struct fsck_sb *sbp)
 	while (!find_next_block_type(sbp->bl, eattr_block, &block_no)) {
 
 		log_info("EA in inode %"PRIu64"\n", block_no);
-		if(get_and_read_buf(sbp, eattr_block, &bh, 0)) {
+		if(get_and_read_buf(sbp, block_no, &bh, 0)) {
 			stack;
 			return -1;
 		}
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/pass5.c gfs_fsck-6.0-cvs/pass5.c
--- gfs_fsck-6.0.2.20/pass5.c	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/pass5.c	2005-06-20 16:22:23.000000000 -0500
@@ -196,21 +196,38 @@ int check_block_status(struct fsck_sb *s
 		block_status = convert_mark(q.block_type, count);
 
 		if(rg_status != block_status) {
-			log_err("ondisk and fsck bitmaps differ at block %"
-				PRIu64"\n", block);
 			log_debug("Ondisk is %u - FSCK thinks it is %u (%u)\n",
 				  rg_status, block_status, q.block_type);
-			if(query(sbp, "Fix bitmap for block %"PRIu64"? (y/n) ",
-				 block)) {
-				if(fs_set_bitmap(sbp, block, block_status)) {
-					log_err("Failed.\n");
+			if((rg_status == GFS_BLKST_FREEMETA) &&
+			   (block_status == GFS_BLKST_FREE)) {
+				log_info("Converting free metadata block at %"
+					 PRIu64" to a free data block\n", block);
+				if(!sbp->opts->no) {
+					if(fs_set_bitmap(sbp, block, block_status)) {
+						log_warn("Failed to convert free metadata block to free data block at %PRIu64.\n", block);
+					}
+					else {
+						log_info("Succeeded.\n");
+					}
 				}
-				else {
-					log_err("Succeeded.\n");
+			}
+			else {
+
+				log_err("ondisk and fsck bitmaps differ at"
+					" block %"PRIu64"\n", block);
+
+				if(query(sbp, "Fix bitmap for block %"
+					 PRIu64"? (y/n) ", block)) {
+					if(fs_set_bitmap(sbp, block, block_status)) {
+						log_err("Failed.\n");
+					}
+					else {
+						log_err("Succeeded.\n");
+					}
+				} else {
+					log_err("Bitmap at block %"PRIu64
+						" left inconsistent\n", block);
 				}
-			} else {
-				log_err("Bitmap at block %"PRIu64
-					" left inconsistent\n", block);
 			}
 		}
 		(*rg_block)++;
@@ -224,13 +241,19 @@ int check_block_status(struct fsck_sb *s
 	return 0;
 }
 
+#define FREE_COUNT       1
+#define USED_INODE_COUNT 2
+#define FREE_INODE_COUNT 4
+#define USED_META_COUNT  8
+#define FREE_META_COUNT  16
+#define CONVERT_FREEMETA_TO_FREE (FREE_COUNT | FREE_META_COUNT)
 
 int update_rgrp(struct fsck_rgrp *rgp, uint32_t *count)
 {
-	int update = 0;
 	uint32_t i;
 	fs_bitmap_t *bits;
 	uint64_t rg_block = 0;
+	uint8_t bmap = 0;
 
 	for(i = 0; i < rgp->rd_ri.ri_length; i++) {
 		bits = &rgp->rd_bits[i];
@@ -243,42 +266,63 @@ int update_rgrp(struct fsck_rgrp *rgp, u
 	}
 
 	/* Compare the rgrps counters with what we found */
-	/* actually adjust counters and write out to disk */
 	if(rgp->rd_rg.rg_free != count[0]) {
-		log_err("free count inconsistent: is %u should be %u\n",
-			rgp->rd_rg.rg_free, count[0] );
-		rgp->rd_rg.rg_free = count[0];
-		update = 1;
+		bmap |= FREE_COUNT;
 	}
 	if(rgp->rd_rg.rg_useddi != count[1]) {
-		log_err("used inode count inconsistent: is %u should be %u\n",
-			rgp->rd_rg.rg_useddi, count[1]);
-		rgp->rd_rg.rg_useddi = count[1];
-		update = 1;
+		bmap |= USED_INODE_COUNT;
 	}
 	if(rgp->rd_rg.rg_freedi != count[2]) {
-		log_err("free inode count inconsistent: is %u should be %u\n",
-			rgp->rd_rg.rg_freedi, count[2]);
-		rgp->rd_rg.rg_freedi = count[2];
-		update = 1;
+		bmap |= FREE_INODE_COUNT;
 	}
 	if(rgp->rd_rg.rg_usedmeta != count[3]) {
-		log_err("used meta count inconsistent: is %u should be %u\n",
-			rgp->rd_rg.rg_usedmeta, count[3]);
-		rgp->rd_rg.rg_usedmeta = count[3];
-		update = 1;
+		bmap |= USED_META_COUNT;
 	}
 	if(rgp->rd_rg.rg_freemeta != count[4]) {
-		log_err("free meta count inconsistent: is %u should be %u\n",
-			rgp->rd_rg.rg_freemeta, count[4]);
-		rgp->rd_rg.rg_freemeta = count[4];
-		update = 1;
+		bmap |= FREE_META_COUNT;
 	}
 
-	if(update) {
+	if(bmap && !(bmap & ~CONVERT_FREEMETA_TO_FREE)) {
+		log_notice("Converting %d unused metadata blocks to free data blocks...\n",
+			   rgp->rd_rg.rg_freemeta - count[4]);
+		rgp->rd_rg.rg_free = count[0];
+		rgp->rd_rg.rg_freemeta = count[4];
+		gfs_rgrp_out(&rgp->rd_rg, BH_DATA(rgp->rd_bh[0]));
+		if(!rgp->rd_sbd->opts->no) {
+			write_buf(rgp->rd_sbd, rgp->rd_bh[0], 0);
+		}
+	} else if(bmap) {
+		/* actually adjust counters and write out to disk */
+		if(bmap & FREE_COUNT) {
+			log_err("free count inconsistent: is %u should be %u\n",
+				rgp->rd_rg.rg_free, count[0] );
+			rgp->rd_rg.rg_free = count[0];
+		}
+		if(bmap & USED_INODE_COUNT) {
+			log_err("used inode count inconsistent: is %u should be %u\n",
+				rgp->rd_rg.rg_useddi, count[1]);
+			rgp->rd_rg.rg_useddi = count[1];
+		}
+		if(bmap & FREE_INODE_COUNT) {
+			log_err("free inode count inconsistent: is %u should be %u\n",
+				rgp->rd_rg.rg_freedi, count[2]);
+			rgp->rd_rg.rg_freedi = count[2];
+		}
+		if(bmap & USED_META_COUNT) {
+			log_err("used meta count inconsistent: is %u should be %u\n",
+				rgp->rd_rg.rg_usedmeta, count[3]);
+			rgp->rd_rg.rg_usedmeta = count[3];
+		}
+		if(bmap & FREE_META_COUNT) {
+			log_err("free meta count inconsistent: is %u should be %u\n",
+				rgp->rd_rg.rg_freemeta, count[4]);
+			rgp->rd_rg.rg_freemeta = count[4];
+		}
+
 		if(query(rgp->rd_sbd,
 			 "Update resource group counts? (y/n) ")) {
-		/* write out the rgrp */
+			log_warn("Resource group counts updated\n");
+			/* write out the rgrp */
 			gfs_rgrp_out(&rgp->rd_rg, BH_DATA(rgp->rd_bh[0]));
 			write_buf(rgp->rd_sbd, rgp->rd_bh[0], 0);
 		} else {
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/super.c gfs_fsck-6.0-cvs/super.c
--- gfs_fsck-6.0.2.20/super.c	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/super.c	2005-06-20 10:02:58.000000000 -0500
@@ -303,3 +303,30 @@ int ri_update(struct fsck_sb *sdp)
 
 	return -1;
 }
+
+int write_sb(struct fsck_sb *sbp)
+{
+	int error = 0;
+	osi_buf_t *bh;
+
+	error = get_and_read_buf(sbp, GFS_SB_ADDR >> sbp->fsb2bb_shift,
+				 &bh, 0);
+	if (error){
+		log_crit("Unable to read superblock\n");
+		goto out;
+	}
+
+	gfs_sb_out(&sbp->sb, BH_DATA(bh));
+
+	/* FIXME: Should this set the BW_WAIT flag? */
+	if((error = write_buf(sbp, bh, 0))) {
+		stack;
+		goto out;
+	}
+
+	relse_buf(sbp, bh);
+out:
+	return error;
+
+}
+
diff -P -Nurp --exclude CVS gfs_fsck-6.0.2.20/super.h gfs_fsck-6.0-cvs/super.h
--- gfs_fsck-6.0.2.20/super.h	2005-08-29 11:27:21.000000000 -0500
+++ gfs_fsck-6.0-cvs/super.h	2005-06-20 09:50:35.000000000 -0500
@@ -19,5 +19,6 @@
 int read_sb(struct fsck_sb *sdp);
 int ji_update(struct fsck_sb *sdp);
 int ri_update(struct fsck_sb *sdp);
+int write_sb(struct fsck_sb *sdp);
 
 #endif /* _SUPER_H */
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050829/eaa435fa/attachment.sig>

From lhh at redhat.com  Mon Aug 29 18:38:33 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 29 Aug 2005 14:38:33 -0400
Subject: [Linux-cluster] rgmanager/src/resources/samba.sh
In-Reply-To: <20050822223341.GI24127@neu.nirvana>
References: <20050822223341.GI24127@neu.nirvana>
Message-ID: <1125340713.24205.27.camel@ayanami.boston.redhat.com>

On Tue, 2005-08-23 at 00:33 +0200, Axel Thimm wrote:
> is missing :)
> 
> services.sh references "samba" as a further resource agent. Is this an
> old remnant still to be removed, or will there be a samba.sh?
> 
> In contrast to nfs samba has too many options to map them adequately
> to parameters of a resource agent, so perhaps the idea is to have
> users copy /etc/init.d/smb and modify accordingly.

Undecided.  The old one (from clumanager) basically did a "start samba
with this config file".  The config file was automatically populated
with some info by the old clumanager admin utils, but had to be
hand-edited if anything changed.

So, currently, to get samba running, it's like you said.  Copy and
modify.

-- Lon




From lhh at redhat.com  Mon Aug 29 18:41:19 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 29 Aug 2005 14:41:19 -0400
Subject: [Linux-cluster] iptables protection wrapper; nfsexport.sh vs
	ip.sh racing
In-Reply-To: <20050822225227.GJ24127@neu.nirvana>
References: <20050822225227.GJ24127@neu.nirvana>
Message-ID: <1125340879.24205.30.camel@ayanami.boston.redhat.com>

On Tue, 2005-08-23 at 00:52 +0200, Axel Thimm wrote:
> The typical NFS cluster setups seem to fail for Gigabit NFS/tcp. Some
> clients that are busy during the relocation of services either bail
> out with RPC garbage, or set the filesytem to EACCES, or timeout for
> 17 min.
> 
> This has to do with some racing/timing in the NFS vs ip setup/teardown
> procedure. Protecting the service startup/shutdown with an iptables
> rule is a good workaround to fix this.
> 
> But what is the proper way to integrate this workaround? I could setup
> new resource agents, one with start=1 and another with start=6 to
> start/stop dropping packages. Or I could modify the current resource
> agents to allow for child entities and wrap one script around the
> service and one in the inner element.
> 
> I could probably also hack ip.sh to introduce some delay, to make sure
> the NFS services are really up/down before proceeding. Or maybe fix
> the true evil by making nfsexport.sh wait for NFS startup/stop
> completion (how?)?

Traditionally, we start the NFS daemons as a service to people who
forget to start them before starting rgmanager.

I.e.  Red Hat / Fedora Core users are supposed to do this prior to
configuring NFS services in rgmanager:

   chkconfig --level 345 nfslock on
   chkconfig --level 345 nfs on

It's really an attempt at a workaround a configuration problem -- and
nothing more.

-- Lon




From lhh at redhat.com  Mon Aug 29 19:04:03 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 29 Aug 2005 15:04:03 -0400
Subject: [Linux-cluster] Workings of Tiebreaker IP (RHCS)
In-Reply-To: <09B5576A48518947960A89175FB6C23DE38604@excbebr204.europe.unity>
References: <09B5576A48518947960A89175FB6C23DE38604@excbebr204.europe.unity>
Message-ID: <1125342243.24205.54.camel@ayanami.boston.redhat.com>

On Tue, 2005-08-23 at 12:21 +0200, Veer, Rob ter wrote:
> Hello,
> 
> To completely understand what the role of a tiebreaker IP within a two
> or four node RHCS cluster is, I've searched redhat and Google. I can't
> however find anything describing the precise workings of the
> tiebreaker-IP. I would really like to know what happens excactly when
> the tiebreaker is used an how (maybe even somekind of flow diagram). 
> 
> Can anyone here maybe explain that to me, or point me in the direction
> of more specific information regarding tiebreaker?

The tiebreaker IP address is used as an additional vote in the event
that half the nodes become unreachable or dead in a 2 or 4 node cluster
on RHCS.

The IP address must reside on the same network as is used for cluster
communication.  To be a little more specific, if your cluster is using
eth0 for communication, your IP address used for a tiebreaker must be
reachable only via eth0 (otherwise, you will end up with a split brain).

When enabled, the nodes ping the given IP address at regular intervals.
When the IP address is not reachable, the tiebreaker is considered
"dead".  When it is reachable, it is considered "alive".

It acts as an additional vote (like an extra cluster member), except for
one key difference: Unless the default configuration is overridden, the
IP tiebreaker may not be used to *form* a quorum where one did not exist
previously.

So, if one node of a two node cluster is online, it will never become
quorate unless the other node comes online (or administrator override,
see man pages for "cluforce" and "cludb").

So, in a 2 node cluster, if one node fails and the other node is online
(and the tiebreaker is still "alive" according to that node), the
remaining node considers itself quorate and "shoots" (aka STONITHs, aka
fences) the dead node and takes over services.

If a network partition occurs such that both nodes see the tiebreaker
but not each other, the first one to fence the other will naturally win.


Ok, moving on...

The disk tiebreaker works in a similar way, except that it lets the
cluster limp in along in a safe, semi-split-brain (split brain) in a
network outage.  What I mean is that because there's state information
written to/read from the shared raw partitions, the nodes can actually
tell via other means whether or not the other node is "alive" or not as
opposed to relying solely on the network traffic.

Both nodes update state information on the shared partitions.  When one
node detects that the other node has not updated its information for a
period of time, that node is "down" according to the disk subsystem.  If
this coincides with a "down" status from the membership daemon, the node
is fenced and services are failed over.  If the node never goes down
(and keeps updating its information on the shared partitions), then the
node is never fenced and services never fail over.

-- Lon






From lhh at redhat.com  Mon Aug 29 19:05:19 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 29 Aug 2005 15:05:19 -0400
Subject: [Linux-cluster] Quorum without fibre channel or shared scsi
In-Reply-To: <A1752420D10F65429F82C257C05AFAA301EB978E@godzilla.corp.digitalims.com>
References: <A1752420D10F65429F82C257C05AFAA301EB978E@godzilla.corp.digitalims.com>
Message-ID: <1125342319.24205.55.camel@ayanami.boston.redhat.com>

On Wed, 2005-08-24 at 14:05 -0500, Jason Wilkinson wrote:
> Hi all,
> 
> I'm rather new to clustering...so I'd appreciate any help you may be able to
> give me. I'm trying to set up a test cluster in advance of purchasing our
> SAN. Currently I have several old laptops set up. What I'm trying to do is
> to find a way to set up the quorum partition without the more expensive
> infrastructure. Is it possible to do it using DRBD or some piece of
> software? Do any of you know where I could find a really good howto?
> 
> Any help would be appreciated.

Unlike clumanager, with the linux-cluster project, you don't need a SAN
at all.  You don't need quorum partitions...

-- Lon



From lhh at redhat.com  Mon Aug 29 19:06:23 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 29 Aug 2005 15:06:23 -0400
Subject: [Linux-cluster] Problems running service exclusivly
In-Reply-To: <430CD12B.8060705@dlr.de>
References: <430CD12B.8060705@dlr.de>
Message-ID: <1125342383.24205.57.camel@ayanami.boston.redhat.com>

On Wed, 2005-08-24 at 21:57 +0200, Hansj?rg Maurer wrote:
> Hi
> 
> I am trying to get the following config to work.
> 
> 4 node cluster with
> 1 node providing a mysql service (only to nodes (bs1 and bs2) have the 
> hardare to do so)
> 2 nodes providing a custom service (the same service on each node (bi1 
> and bi2))
> 1 node fallback for both services
> 
> It is important, that on every node ony one service is running.
> I created the following cluster configuration (RHEL4U1),
> but the exclusice flag seems not to work.
> I have created three priorisised restricted failoverdomains (one for 
> eache services).
> I assigned each service to one failoverdomain (with run exclusively on),
> but if I start the rgmanger. als services are running on one node.
> I can move them by hand to the right nodes, but this is nor persistent.

This sounds like a bug, can you bugzilla this?

Thanks!

-- Lon




From lhh at redhat.com  Mon Aug 29 19:07:39 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 29 Aug 2005 15:07:39 -0400
Subject: [Linux-cluster] Redhat 9.0+GFS+iSCSI
In-Reply-To: <000501c5aa2d$cdb5ef70$2e00a8c0@yanzijie>
References: <000501c5aa2d$cdb5ef70$2e00a8c0@yanzijie>
Message-ID: <1125342459.24205.59.camel@ayanami.boston.redhat.com>

On Fri, 2005-08-26 at 19:03 +0800, yanj wrote:

> Is GFS working in Redhat 9.0?

Not that I am aware of.

-- Lon




From lhh at redhat.com  Mon Aug 29 19:19:40 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 29 Aug 2005 15:19:40 -0400
Subject: [Linux-cluster] Tiebreaker IP
In-Reply-To: <BC430F453501174992B9D9E8AFB7519A518A10@ausx3mps309.aus.amer.dell.com>
References: <BC430F453501174992B9D9E8AFB7519A518A10@ausx3mps309.aus.amer.dell.com>
Message-ID: <1125343180.24205.72.camel@ayanami.boston.redhat.com>

On Fri, 2005-08-26 at 13:24 -0500, JACOB_LIBERMAN at Dell.com wrote:

> If it cant, it will assume its external interface is down
> and fence/reboot itself. The same holds true for nodeB. Unlike RHEL2.1
> which used STONITH, RHEL3 cluster nodes reboot themselves. 

Both use STONITH.  RHEL3 cluster nodes are more paranoid about running
without STONITH.

If STONITH is configured on a RHEL3 cluster, the node will instead wait
to be shot -- or for a new quorum to form -- if it loses network
connectivity.

> Anyway, I hope this answers your questions. It is fairly easy to test.
> Set up a 2 node cluster, then reboot the service owner. If the service
> starts on the other node, you should be configured correctly. Next
> disconnect the service owner from the network. The service owner should
> reboot itself with the watchdog or fail over the resource, depending on
> how its ocnfigured.

It should reboot itself because it loses quorum, really.  Basically,
without STONITH, a node thinks like this on RHEL3:

"I was quorate and now I'm not, and no one can cut me off from shared
storage... Uh, oh, REBOOT!"

(Note: On RHCS4/RHEL4, you *must* have I/O fencing (aka STONITH)
configured!)

-- Lon



From lhh at redhat.com  Mon Aug 29 19:22:05 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 29 Aug 2005 15:22:05 -0400
Subject: [Linux-cluster] Rhcs4 fence with no hardware power switch
In-Reply-To: <BAY22-F9D130F8A207AA1336DEB4A6AF0@phx.gbl>
References: <BAY22-F9D130F8A207AA1336DEB4A6AF0@phx.gbl>
Message-ID: <1125343325.24205.75.camel@ayanami.boston.redhat.com>

On Mon, 2005-08-29 at 02:26 +0000, nattapon viroonsri wrote:
> I use RedHat cluster v4 for HA and have no power switch hardware
> When i disconnected  network cable for client access resource , Failover 
> have occur as expect
> But the server that was disconnect network cable not be rebooted or fenced 
> like rhcs V3(even no power switch).  And the failed server  can't not join 
> cluster again.

Eep, no failover should have occurred if you don't have fencing.

> In Rhcs V3. it reboot fail node automatically without hardware power switch
> 
> So, Have anyway to reboot failed node(network problem or server problem) 
> without use hardware power switch to make failnode to clean state again ?

You need fencing hardware with RHCS4/RHEL4.  I'd check ebay if you're
not looking to spend a lot, you can get a WTI NPS115 for pretty cheap
nowadays (under $100!).

You can run with manual fencing, but no automatic recovery will occur,
and this isn't a supported solution.

-- Lon



From bmarzins at redhat.com  Mon Aug 29 20:59:51 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Mon, 29 Aug 2005 15:59:51 -0500
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <43126996.8000606@telkom.co.id>
References: <20050827004859.D66965A8690@mail.silvercash.com>
	<43126996.8000606@telkom.co.id>
Message-ID: <20050829205951.GK12333@phlogiston.msp.redhat.com>

On Mon, Aug 29, 2005 at 08:49:10AM +0700, Fajar A. Nugraha wrote:
> brianu wrote:
> 
> >Hello all,
> >
> > 
> >
> >This is a question I have basically been asking, the question on why 
> >you would want to do it is failover, the docs at 
> >http://sourceware.org/cluster/gnbd/gnbd_usage.txt state that 
> >dm-multipath is an option for gnbd,
> >
> I'm not sure about dm-multipath. The thing is, when a gnbd server dies, 
> instead of saying "read/write failed" as normal block device does, gnbd 
> simply retries the request and tries to reconnect if it's disconnected. 
> Forever.

If the gnbds are exported uncached (the default), the client will fail back IO
if it can no longer talk to the server after a specified timeout.  However
the userspace tools for dm-multipath are still to SCSI-centric to allow you
to run on top of gnbd.  You can manually run dm-setup commands to build
the appropriate multipath map, scan the map to check if a path has failed,
remove the failed gnbd from the map (so the device can close and gnbd can
start trying to reconnect), and them manually add the gnbd device back into
the map when it has reconnected. That's pretty much all the dm-multipath
userspace tools do.  Someone could even write a pretty simple daemon that did
this, and become the personal hero of many people on this list.

The only problem is that if you manually execute the commands, or write the
daemon in bash or some other scripting language, you can run into a memory
deadlock. If you are in a very low memory situation, and you need to complete
gnbd IO requests to free up memory, the daemon can't allocate any memory in
doing it's job.

> >and documents elsewhere also indicate that GNBD can be configured as a 
> >redundancy, yet I cannot find any documentation on how to configure it.
> >
> > 
> >
> >If using LVM to make a volume of imported gnbds is not the answer for 
> >redundancy can anyone suggest a method that is? Im not opposed to 
> >using any other resource of cluster or GFS but I would really like to 
> >implement a redundant solution, ( gnbd, gulm, etc.).
> >
> > 
> >
> It would be possible if you have at least two servers, connected to the 
> same storage, running as gnbd server and exporting the same block devices.
> 
> You need to have one IP address that can failover to any available node 
> (use rgmanager or keepalived to achieve this). That way, if one server 
> node dies the IP address will be moved to the other node. Client will be 
> disconnected, but since gnbd-import will automatically reconnect (it 
> actually connects to a different node since the gnbd server IP address 
> was moved) the process will be transparent to the client (all they see 
> is a slight delay during reconnect).

If you have the gnbd exported in caching mode, each server will maintain it's
own cache, So if you write a block to one server, and then the server crashes,
when you read the block from the second server, if it was already cached
before the read, you will get invalid data, so that won't work. If you
set the gnbd to uncached mode, the client will fail the IO back, and something
(a multipath driver) needs to be there to reissue the request.

-Ben
 
> Regards,
> 
> Fajar
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From mbrookov at mines.edu  Mon Aug 29 21:06:19 2005
From: mbrookov at mines.edu (Matthew B. Brookover)
Date: Mon, 29 Aug 2005 15:06:19 -0600
Subject: [Linux-cluster] GFS file system corruption?
In-Reply-To: <20050829164854.GA19475@null.msp.redhat.com>
References: <1125261073.10424.74.camel@merlin.Mines.EDU>
	<20050829164854.GA19475@null.msp.redhat.com>
Message-ID: <1125349579.12438.28.camel@merlin.Mines.EDU>

Aj, thank you for the quick response.

The patches changed the situation.  Postmark now runs without any
errors, gfs_fsck still has some problems. I do not understand why
postmark did not report any errors because the patches all appeared to
be for fsck.

Output from gfs_fsck:

[root at illusion root]# gfs_fsck -y /dev/pool/u_as
Initializing fsck
Clearing journals (this may take a while)
.Cleared journals
Starting pass1
Pass1 complete
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete
Starting pass3
Pass3 complete
Starting pass4
Pass4 complete
Starting pass5
Converting 134 unused metadata blocks to free data blocks...
Converting 162 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 152 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 132 unused metadata blocks to free data blocks...
Converting 152 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 168 unused metadata blocks to free data blocks...
Converting 161 unused metadata blocks to free data blocks...
Converting 174 unused metadata blocks to free data blocks...
Converting 147 unused metadata blocks to free data blocks...
Converting 148 unused metadata blocks to free data blocks...
Converting 134 unused metadata blocks to free data blocks...
Converting 134 unused metadata blocks to free data blocks...
Converting 137 unused metadata blocks to free data blocks...
Converting 143 unused metadata blocks to free data blocks...
Converting 138 unused metadata blocks to free data blocks...
Converting 132 unused metadata blocks to free data blocks...
Converting 148 unused metadata blocks to free data blocks...
Converting 156 unused metadata blocks to free data blocks...
Converting 179 unused metadata blocks to free data blocks...
Converting 132 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 164 unused metadata blocks to free data blocks...
Converting 139 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 132 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 152 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 141 unused metadata blocks to free data blocks...
Converting 151 unused metadata blocks to free data blocks...
Converting 150 unused metadata blocks to free data blocks...
Converting 174 unused metadata blocks to free data blocks...
Converting 132 unused metadata blocks to free data blocks...
Converting 170 unused metadata blocks to free data blocks...
Converting 140 unused metadata blocks to free data blocks...
Converting 150 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 135 unused metadata blocks to free data blocks...
Converting 145 unused metadata blocks to free data blocks...
Converting 143 unused metadata blocks to free data blocks...
Converting 187 unused metadata blocks to free data blocks...
Converting 166 unused metadata blocks to free data blocks...
Converting 152 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 152 unused metadata blocks to free data blocks...
Converting 168 unused metadata blocks to free data blocks...
Converting 180 unused metadata blocks to free data blocks...
Converting 139 unused metadata blocks to free data blocks...
Converting 134 unused metadata blocks to free data blocks...
Converting 145 unused metadata blocks to free data blocks...
Converting 178 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 135 unused metadata blocks to free data blocks...
Converting 140 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 161 unused metadata blocks to free data blocks...
Converting 160 unused metadata blocks to free data blocks...
Converting 151 unused metadata blocks to free data blocks...
Converting 132 unused metadata blocks to free data blocks...
Converting 171 unused metadata blocks to free data blocks...
Converting 156 unused metadata blocks to free data blocks...
Converting 183 unused metadata blocks to free data blocks...
Converting 132 unused metadata blocks to free data blocks...
Converting 142 unused metadata blocks to free data blocks...
Converting 142 unused metadata blocks to free data blocks...
Converting 163 unused metadata blocks to free data blocks...
Converting 166 unused metadata blocks to free data blocks...
Converting 134 unused metadata blocks to free data blocks...
Converting 180 unused metadata blocks to free data blocks...
Converting 156 unused metadata blocks to free data blocks...
Converting 150 unused metadata blocks to free data blocks...
Converting 141 unused metadata blocks to free data blocks...
Converting 143 unused metadata blocks to free data blocks...
Converting 177 unused metadata blocks to free data blocks...
Converting 163 unused metadata blocks to free data blocks...
Converting 136 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 143 unused metadata blocks to free data blocks...
Converting 174 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 137 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 134 unused metadata blocks to free data blocks...
Converting 134 unused metadata blocks to free data blocks...
Converting 132 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 172 unused metadata blocks to free data blocks...
Converting 169 unused metadata blocks to free data blocks...
Converting 158 unused metadata blocks to free data blocks...
Converting 135 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 133 unused metadata blocks to free data blocks...
Converting 136 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 134 unused metadata blocks to free data blocks...
Converting 132 unused metadata blocks to free data blocks...
Converting 140 unused metadata blocks to free data blocks...
Converting 187 unused metadata blocks to free data blocks...
Converting 162 unused metadata blocks to free data blocks...
Converting 174 unused metadata blocks to free data blocks...
Converting 131 unused metadata blocks to free data blocks...
Converting 130 unused metadata blocks to free data blocks...
Converting 159 unused metadata blocks to free data blocks...
Converting 171 unused metadata blocks to free data blocks...
Pass5 complete
[root at illusion root]#


What I did:
* compiled GFS with the patch
* shut down GFS, lock_gulmd, ccsd, and pool manager on all 6 nodes
* loaded the new RPM
* rebooted all 6 nodes and started up the pool manager, and ran gfs_fsck
on all of the file sytems.  It was clean.
* Started gulmd
* mounted the GFS file systems and ran post mark several times. 
Postmark did not report any errors.
* unmounted the GFS file system, and ran gfs_fsck and got the above
messages.

Any ideas?

thanks you for your assistance

Matt

On Mon, 2005-08-29 at 10:48, AJ Lewis wrote:

> On Sun, Aug 28, 2005 at 02:31:13PM -0600, Matthew B. Brookover wrote:
> > I have 6 computers running Redhat Enterrpise 3 release 5, running kernel
> > 2.4.21-32.0.1.ELsmp.
> > 
> > >From the source code I compiled GFS 6.0.2.20-2.  The SAN is an ISCSI
> > based storage system from LeftHand Networks.   Using ext3, the postmark
> > disk test works fine, on a GFS file system, we get a number of errors. 
> > The output from both postmark runs is below.
> 
> Looks like you're running into bugs
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160835
> and
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160525
> which have been fixed in the latest code.  Unfortunately there is not a GFS
> 6.0 srpm released yet that contains this fix, but I've attached a patch that
> clears them up, as well as a couple other minor issues.
> 
> I'm not sure about the postmark errors you're seeing - hopefully someone else
> on the list can shed some light on them.  But try the fsck with these patches
> and see if there are any actual problems with the fs shown.
> 
> Regards,
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050829/a82bb51d/attachment.htm>

From alewis at redhat.com  Mon Aug 29 21:17:25 2005
From: alewis at redhat.com (AJ Lewis)
Date: Mon, 29 Aug 2005 16:17:25 -0500
Subject: [Linux-cluster] GFS file system corruption?
In-Reply-To: <1125349579.12438.28.camel@merlin.Mines.EDU>
References: <1125261073.10424.74.camel@merlin.Mines.EDU>
	<20050829164854.GA19475@null.msp.redhat.com>
	<1125349579.12438.28.camel@merlin.Mines.EDU>
Message-ID: <20050829211725.GC19475@null.msp.redhat.com>

On Mon, Aug 29, 2005 at 03:06:19PM -0600, Matthew B. Brookover wrote:
> Aj, thank you for the quick response.
> 
> The patches changed the situation.  Postmark now runs without any
> errors, gfs_fsck still has some problems. I do not understand why
> postmark did not report any errors because the patches all appeared to
> be for fsck.

Actually, this is normal output for the fsck.  The way GFS works is that any
metadata that is deleted is actually converted to unused metadata blocks for
later reuse instead of being freed completely.  The way the fsck tracks block
use, it ends up converting all these blocks back to free blocks.  So basically
this is what a normal fsck looks like when run on a gfs filesystem that had
lots of files deleted.
 
> Output from gfs_fsck:
> 
> [root at illusion root]# gfs_fsck -y /dev/pool/u_as
> Initializing fsck
> Clearing journals (this may take a while)
> .Cleared journals
> Starting pass1
> Pass1 complete
> Starting pass1b
> Pass1b complete
> Starting pass1c
> Pass1c complete
> Starting pass2
> Pass2 complete
> Starting pass3
> Pass3 complete
> Starting pass4
> Pass4 complete
> Starting pass5
> Converting 134 unused metadata blocks to free data blocks...
> Converting 162 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 152 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 132 unused metadata blocks to free data blocks...
> Converting 152 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 168 unused metadata blocks to free data blocks...
> Converting 161 unused metadata blocks to free data blocks...
> Converting 174 unused metadata blocks to free data blocks...
> Converting 147 unused metadata blocks to free data blocks...
> Converting 148 unused metadata blocks to free data blocks...
> Converting 134 unused metadata blocks to free data blocks...
> Converting 134 unused metadata blocks to free data blocks...
> Converting 137 unused metadata blocks to free data blocks...
> Converting 143 unused metadata blocks to free data blocks...
> Converting 138 unused metadata blocks to free data blocks...
> Converting 132 unused metadata blocks to free data blocks...
> Converting 148 unused metadata blocks to free data blocks...
> Converting 156 unused metadata blocks to free data blocks...
> Converting 179 unused metadata blocks to free data blocks...
> Converting 132 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 164 unused metadata blocks to free data blocks...
> Converting 139 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 132 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 152 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 141 unused metadata blocks to free data blocks...
> Converting 151 unused metadata blocks to free data blocks...
> Converting 150 unused metadata blocks to free data blocks...
> Converting 174 unused metadata blocks to free data blocks...
> Converting 132 unused metadata blocks to free data blocks...
> Converting 170 unused metadata blocks to free data blocks...
> Converting 140 unused metadata blocks to free data blocks...
> Converting 150 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 135 unused metadata blocks to free data blocks...
> Converting 145 unused metadata blocks to free data blocks...
> Converting 143 unused metadata blocks to free data blocks...
> Converting 187 unused metadata blocks to free data blocks...
> Converting 166 unused metadata blocks to free data blocks...
> Converting 152 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 152 unused metadata blocks to free data blocks...
> Converting 168 unused metadata blocks to free data blocks...
> Converting 180 unused metadata blocks to free data blocks...
> Converting 139 unused metadata blocks to free data blocks...
> Converting 134 unused metadata blocks to free data blocks...
> Converting 145 unused metadata blocks to free data blocks...
> Converting 178 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 135 unused metadata blocks to free data blocks...
> Converting 140 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 161 unused metadata blocks to free data blocks...
> Converting 160 unused metadata blocks to free data blocks...
> Converting 151 unused metadata blocks to free data blocks...
> Converting 132 unused metadata blocks to free data blocks...
> Converting 171 unused metadata blocks to free data blocks...
> Converting 156 unused metadata blocks to free data blocks...
> Converting 183 unused metadata blocks to free data blocks...
> Converting 132 unused metadata blocks to free data blocks...
> Converting 142 unused metadata blocks to free data blocks...
> Converting 142 unused metadata blocks to free data blocks...
> Converting 163 unused metadata blocks to free data blocks...
> Converting 166 unused metadata blocks to free data blocks...
> Converting 134 unused metadata blocks to free data blocks...
> Converting 180 unused metadata blocks to free data blocks...
> Converting 156 unused metadata blocks to free data blocks...
> Converting 150 unused metadata blocks to free data blocks...
> Converting 141 unused metadata blocks to free data blocks...
> Converting 143 unused metadata blocks to free data blocks...
> Converting 177 unused metadata blocks to free data blocks...
> Converting 163 unused metadata blocks to free data blocks...
> Converting 136 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 143 unused metadata blocks to free data blocks...
> Converting 174 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 137 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 134 unused metadata blocks to free data blocks...
> Converting 134 unused metadata blocks to free data blocks...
> Converting 132 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 172 unused metadata blocks to free data blocks...
> Converting 169 unused metadata blocks to free data blocks...
> Converting 158 unused metadata blocks to free data blocks...
> Converting 135 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 133 unused metadata blocks to free data blocks...
> Converting 136 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 134 unused metadata blocks to free data blocks...
> Converting 132 unused metadata blocks to free data blocks...
> Converting 140 unused metadata blocks to free data blocks...
> Converting 187 unused metadata blocks to free data blocks...
> Converting 162 unused metadata blocks to free data blocks...
> Converting 174 unused metadata blocks to free data blocks...
> Converting 131 unused metadata blocks to free data blocks...
> Converting 130 unused metadata blocks to free data blocks...
> Converting 159 unused metadata blocks to free data blocks...
> Converting 171 unused metadata blocks to free data blocks...
> Pass5 complete
> [root at illusion root]#
> 
> 
> What I did:
> * compiled GFS with the patch
> * shut down GFS, lock_gulmd, ccsd, and pool manager on all 6 nodes
> * loaded the new RPM
> * rebooted all 6 nodes and started up the pool manager, and ran gfs_fsck
> on all of the file sytems.  It was clean.
> * Started gulmd
> * mounted the GFS file systems and ran post mark several times. 
> Postmark did not report any errors.
> * unmounted the GFS file system, and ran gfs_fsck and got the above
> messages.
> 
> Any ideas?
> 
> thanks you for your assistance
> 
> Matt
> 
> On Mon, 2005-08-29 at 10:48, AJ Lewis wrote:
> 
> > On Sun, Aug 28, 2005 at 02:31:13PM -0600, Matthew B. Brookover wrote:
> > > I have 6 computers running Redhat Enterrpise 3 release 5, running kernel
> > > 2.4.21-32.0.1.ELsmp.
> > > 
> > > >From the source code I compiled GFS 6.0.2.20-2.  The SAN is an ISCSI
> > > based storage system from LeftHand Networks.   Using ext3, the postmark
> > > disk test works fine, on a GFS file system, we get a number of errors. 
> > > The output from both postmark runs is below.
> > 
> > Looks like you're running into bugs
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160835
> > and
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160525
> > which have been fixed in the latest code.  Unfortunately there is not a GFS
> > 6.0 srpm released yet that contains this fix, but I've attached a patch that
> > clears them up, as well as a couple other minor issues.
> > 
> > I'm not sure about the postmark errors you're seeing - hopefully someone else
> > on the list can shed some light on them.  But try the fsck with these patches
> > and see if there are any actual problems with the fs shown.
> > 
> > Regards,

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
AJ Lewis                                   Voice:  612-638-0500
Red Hat                                    E-Mail: alewis at redhat.com
One Main Street SE, Suite 209
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050829/ac5b7b11/attachment.sig>

From mbrookov at mines.edu  Mon Aug 29 21:43:13 2005
From: mbrookov at mines.edu (Matthew B. Brookover)
Date: Mon, 29 Aug 2005 15:43:13 -0600
Subject: [Linux-cluster] GFS file system corruption?
In-Reply-To: <20050829211725.GC19475@null.msp.redhat.com>
References: <1125261073.10424.74.camel@merlin.Mines.EDU>
	<20050829164854.GA19475@null.msp.redhat.com>
	<1125349579.12438.28.camel@merlin.Mines.EDU>
	<20050829211725.GC19475@null.msp.redhat.com>
Message-ID: <1125351792.12438.32.camel@merlin.Mines.EDU>

Good deal!  I will run more test to see what else I can find.  It still
seams odd that postmark started working also.  

thanks

Matt


On Mon, 2005-08-29 at 15:17, AJ Lewis wrote:

> On Mon, Aug 29, 2005 at 03:06:19PM -0600, Matthew B. Brookover wrote:
> > Aj, thank you for the quick response.
> > 
> > The patches changed the situation.  Postmark now runs without any
> > errors, gfs_fsck still has some problems. I do not understand why
> > postmark did not report any errors because the patches all appeared to
> > be for fsck.
> 
> Actually, this is normal output for the fsck.  The way GFS works is that any
> metadata that is deleted is actually converted to unused metadata blocks for
> later reuse instead of being freed completely.  The way the fsck tracks block
> use, it ends up converting all these blocks back to free blocks.  So basically
> this is what a normal fsck looks like when run on a gfs filesystem that had
> lots of files deleted.
>  
> > Output from gfs_fsck:
> > 
> > [root at illusion root]# gfs_fsck -y /dev/pool/u_as
> > Initializing fsck


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050829/18020b4a/attachment.htm>

From Axel.Thimm at ATrpms.net  Mon Aug 29 23:35:23 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Tue, 30 Aug 2005 01:35:23 +0200
Subject: [Linux-cluster] Re: iptables protection wrapper;
	nfsexport.sh vs ip.sh racing
In-Reply-To: <1125340879.24205.30.camel@ayanami.boston.redhat.com>
References: <20050822225227.GJ24127@neu.nirvana>
	<1125340879.24205.30.camel@ayanami.boston.redhat.com>
Message-ID: <20050829233523.GD5908@neu.nirvana>

On Mon, Aug 29, 2005 at 02:41:19PM -0400, Lon Hohberger wrote:
> On Tue, 2005-08-23 at 00:52 +0200, Axel Thimm wrote:
> > The typical NFS cluster setups seem to fail for Gigabit NFS/tcp. Some
> > clients that are busy during the relocation of services either bail
> > out with RPC garbage, or set the filesytem to EACCES, or timeout for
> > 17 min.
> > 
> > This has to do with some racing/timing in the NFS vs ip setup/teardown
> > procedure. Protecting the service startup/shutdown with an iptables
> > rule is a good workaround to fix this.
> > 
> > But what is the proper way to integrate this workaround? I could setup
> > new resource agents, one with start=1 and another with start=6 to
> > start/stop dropping packages. Or I could modify the current resource
> > agents to allow for child entities and wrap one script around the
> > service and one in the inner element.
> > 
> > I could probably also hack ip.sh to introduce some delay, to make sure
> > the NFS services are really up/down before proceeding. Or maybe fix
> > the true evil by making nfsexport.sh wait for NFS startup/stop
> > completion (how?)?
> 
> Traditionally, we start the NFS daemons as a service to people who
> forget to start them before starting rgmanager.
> 
> I.e.  Red Hat / Fedora Core users are supposed to do this prior to
> configuring NFS services in rgmanager:
> 
>    chkconfig --level 345 nfslock on
>    chkconfig --level 345 nfs on
> 
> It's really an attempt at a workaround a configuration problem -- and
> nothing more.

The above is with nfs running on all nodes already. The racing seems
to be with the exportfs commands and ip setup/teardown.

It is easy to reproduce (>=50%) if the client connects over Gigabit
and is in write transaction while the service is moved. We saw this in
two different setups. If you throttle the network bandwidth to <=
20MB/sec you don't trigger the bug, so it really seems like a racing
problem.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050830/c19cef72/attachment.sig>

From fajar at telkom.co.id  Tue Aug 30 01:34:56 2005
From: fajar at telkom.co.id (Fajar A. Nugraha)
Date: Tue, 30 Aug 2005 08:34:56 +0700
Subject: [Linux-cluster] GFS journals
In-Reply-To: <20050829130714.5A617318444@smtp.cegetel.net>
References: <20050829130714.5A617318444@smtp.cegetel.net>
Message-ID: <4313B7C0.9060701@telkom.co.id>

Sylvain COUTANT wrote:

>Hi all,
>
>AFAIK, I need one GFS journal per server connected to the device. How can I adjust the number of journal if I need to add one server as a new "client" ?
>
>  
>
gfs_jadd.
However, note that it requires some space, so you might need to increase 
the size of the block device as well. If you use LVM, the process is 
lvextend -> gfs_jadd -> gfs_grow
The gfs_grow is optional, but it will make use of whatever space is 
available in case you allocated too much space during lvextend :)

Regards,

Fajar



From fajar at telkom.co.id  Tue Aug 30 01:41:12 2005
From: fajar at telkom.co.id (Fajar A. Nugraha)
Date: Tue, 30 Aug 2005 08:41:12 +0700
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <20050829205951.GK12333@phlogiston.msp.redhat.com>
References: <20050827004859.D66965A8690@mail.silvercash.com>	<43126996.8000606@telkom.co.id>
	<20050829205951.GK12333@phlogiston.msp.redhat.com>
Message-ID: <4313B938.2090409@telkom.co.id>

Benjamin Marzinski wrote:

>If the gnbds are exported uncached (the default), the client will fail back IO
>if it can no longer talk to the server after a specified timeout.  
>
What is the default timeout anyway, and how can I set it?
Last time I test gnbd-import timeout was on a development version 
(DEVEL.1104982050) and after more than 30 minutes, the client still 
tries to reconnect.

Regards,

Fajar



From nattaponv at hotmail.com  Tue Aug 30 02:00:54 2005
From: nattaponv at hotmail.com (nattapon viroonsri)
Date: Tue, 30 Aug 2005 02:00:54 +0000
Subject: [Linux-cluster] Rhcs4 fence with no hardware power switch
In-Reply-To: <1125343325.24205.75.camel@ayanami.boston.redhat.com>
Message-ID: <BAY22-F30DF0A9DA9F0E76AA68C3FA6AE0@phx.gbl>




>On Mon, 2005-08-29 at 02:26 +0000, nattapon viroonsri wrote:
> > I use RedHat cluster v4 for HA and have no power switch hardware
> > When i disconnected  network cable for client access resource , Failover
> > have occur as expect
> > But the server that was disconnect network cable not be rebooted or 
>fenced
> > like rhcs V3(even no power switch).  And the failed server  can't not 
>join
> > cluster again.
>
>Eep, no failover should have occurred if you don't have fencing.

I see resource(virtual ip, serviice) tranfer to another node. but dont see 
any mesage in /var/log/message said about fencing, may be i miss something .
and i try to reboot failnode with command "init 6" manually but it show that 
cant stop rgmanager service.

>
> > In Rhcs V3. it reboot fail node automatically without hardware power 
>switch
> >
> > So, Have anyway to reboot failed node(network problem or server problem)
> > without use hardware power switch to make failnode to clean state again 
>?
>
>You need fencing hardware with RHCS4/RHEL4.  I'd check ebay if you're
>not looking to spend a lot, you can get a WTI NPS115 for pretty cheap
>nowadays (under $100!).
>
Have any addition module or modify version of  perl script "fence_xxx"  or 
use wathdog with rhcs v4. ?

>You can run with manual fencing, but no automatic recovery will occur,
>and this isn't a supported solution.
>

Thanks for your suggestion. :)


>-- Lon
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today it's FREE! 
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/



From teigland at redhat.com  Tue Aug 30 02:44:42 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 30 Aug 2005 10:44:42 +0800
Subject: [Linux-cluster] do_dlm_unlock kernel panic
In-Reply-To: <20050827094257.124BA323494@postfix4-2.free.fr>
References: <20050827024027.GA10660@redhat.com>
	<20050827094257.124BA323494@postfix4-2.free.fr>
Message-ID: <20050830024441.GA11575@redhat.com>

On Sat, Aug 27, 2005 at 11:42:49AM +0200, Sylvain COUTANT wrote:
> > > I got the following error tonight after a network blackout on one -test-
> > > server :
> > > dlm: dlm_unlock: lkid 10301 lockspace not found
> > 
> > This means that the cluster manager, cman, has shut down.  When that
> > happens it forcibily takes down all the dlm lockspaces which results in
> > the assertion failures below if gfs happens to be running.
> 
> Fine. Do you mean any failure will lead to kernel panic ?

Any network failure will likely result in a gfs panic.

Dave



From yanj at brainaire.com  Tue Aug 30 01:50:09 2005
From: yanj at brainaire.com (yanj)
Date: Tue, 30 Aug 2005 09:50:09 +0800
Subject: [Linux-cluster] Nodename problem in GFS based on iSCSI
Message-ID: <000201c5ad05$2ce45390$2e00a8c0@yanzijie>

Hello,
 
I am a beginner in GFS. I am trying to setup a GFS system based on iSCSI
storage in RHEL3.
Kernel: 2.4.21-27.EL
GFS: GFS-6.0.2-25
Question
I have successful setup the GFS+iSCSI system if I use the default
hostname: "localhost".
However, if I change the hostname to another name, e.g. "test1" (update
/etc/sysconfig/network and /etc/hosts), I FAIL to start lock_gulmd.
 
Command sequence:
 
Service ccsd start [ok]
Service lock_gulmd start [fail]
Service lock_gulmd status
lock_gulmd (pid 2616 2614 2611) is running...
gulm_master: gulm master not found
 
lock_gulmd (pid 2616 2614 2611) is running...
gulm_master: gulm master not found
 
I have also tried this within one machine, the error message is same.
 
The GFS configuration in the machine (192.168.0.22) is:
[root at test1 gfs-config]# uname -n
test1
[root at test1 gfs-config]# cat cluster.ccs 
      cluster { 
              name = "gfs" 
              lock_gulm { 
                  servers = ["test1"] 
                  heartbeat_rate = 0.9 
                  allowed_misses = 10 
              } 
      }
[root at test1 gfs-config]# cat fence.ccs 
fence_devices{ 
              admin { 
                    agent = "fence_manual" 
              } 
      }
[root at test1 gfs-config]# cat nodes.ccs 
nodes { 
         test1 { 
            ip_interfaces { 
               hsi0 = "192.168.0.22" 
            } 
            fence { 
               human { 
                  admin { 
                     ipaddr = "192.168.0.22" 
                  } 
               } 
            } 
         } 
}
 
[root at test1 gfs-config]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
192.168.0.22            test1
 
 
Thanks,
Jeffrey Yan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050830/bc01207b/attachment.htm>

From erling.nygaard at gmail.com  Tue Aug 30 09:03:17 2005
From: erling.nygaard at gmail.com (Erling Nygaard)
Date: Tue, 30 Aug 2005 11:03:17 +0200
Subject: [Linux-cluster] Redhat 9.0+GFS+iSCSI
In-Reply-To: <1125342459.24205.59.camel@ayanami.boston.redhat.com>
References: <000501c5aa2d$cdb5ef70$2e00a8c0@yanzijie>
	<1125342459.24205.59.camel@ayanami.boston.redhat.com>
Message-ID: <adb721b405083002037869f576@mail.gmail.com>

GFS for RedHat 9.0 was available as closed source from Sistina Software. It 
worked fine back in the days. 
As far as I know these RPMs are no longer available anywhere. Feel free to 
backport the current GFS ;)

-
Erling

On 8/29/05, Lon Hohberger <lhh at redhat.com> wrote:
> 
> On Fri, 2005-08-26 at 19:03 +0800, yanj wrote:
> 
> > Is GFS working in Redhat 9.0?
> 
> Not that I am aware of.
> 
> -- Lon
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050830/9f28e48c/attachment.htm>

From mbrookov at mines.edu  Tue Aug 30 13:55:09 2005
From: mbrookov at mines.edu (Matthew B. Brookover)
Date: Tue, 30 Aug 2005 07:55:09 -0600
Subject: [Linux-cluster] Nodename problem in GFS based on iSCSI
In-Reply-To: <000201c5ad05$2ce45390$2e00a8c0@yanzijie>
References: <000201c5ad05$2ce45390$2e00a8c0@yanzijie>
Message-ID: <1125410109.14992.5.camel@merlin.Mines.EDU>

Check you /etc/hosts, make sure that the host name is not on the same
line with localhost.  This works for me:

127.0.0.1    localhost.localdomain localhost
1.2.3.4      mycomputer.example.com mycomputer


I also have set the fully qualified hostname in /etc/sysconfig/network
and in the nodes.css file.

Best I can tell, if your host name is on the localhost line, lock_gulmd
tries to use the loop back device instead of the Ethernet device.  

Matt

On Mon, 2005-08-29 at 19:50, yanj wrote:

> Hello,
> 
>  
> 
> I am a beginner in GFS. I am trying to setup a GFS system based on
> iSCSI storage in RHEL3.
> 
> Kernel: 2.4.21-27.EL
> 
> GFS: GFS-6.0.2-25
> 
> Question
> 
> I have successful setup the GFS+iSCSI system if I use the default
> hostname: ?localhost?.
> 
> However, if I change the hostname to another name, e.g. ?test1?
> (update /etc/sysconfig/network and /etc/hosts), I FAIL to start
> lock_gulmd.
> 
>  
> 
> Command sequence:
> 
>  
> 
> Service ccsd start [ok]
> 
> Service lock_gulmd start [fail]
> 
> Service lock_gulmd status
> 
> lock_gulmd (pid 2616 2614 2611) is running...
> 
> gulm_master: gulm master not found
> 
>  
> 
> lock_gulmd (pid 2616 2614 2611) is running...
> 
> gulm_master: gulm master not found
> 
>  
> 
> I have also tried this within one machine, the error message is same.
> 
>  
> 
> The GFS configuration in the machine (192.168.0.22) is:
> 
> [root at test1 gfs-config]# uname -n
> 
> test1
> 
> [root at test1 gfs-config]# cat cluster.ccs 
> 
>       cluster { 
> 
>              name = "gfs" 
> 
>              lock_gulm { 
> 
>                  servers = ["test1"] 
> 
>                  heartbeat_rate = 0.9 
> 
>                  allowed_misses = 10 
> 
>              } 
> 
>       }
> 
> [root at test1 gfs-config]# cat fence.ccs 
> 
> fence_devices{ 
> 
>              admin { 
> 
>                    agent = "fence_manual" 
> 
>              } 
> 
>       }
> 
> [root at test1 gfs-config]# cat nodes.ccs 
> 
> nodes { 
> 
>         test1 { 
> 
>            ip_interfaces { 
> 
>               hsi0 = "192.168.0.22" 
> 
>            } 
> 
>            fence { 
> 
>               human { 
> 
>                  admin { 
> 
>                     ipaddr = "192.168.0.22" 
> 
>                  } 
> 
>               } 
> 
>            } 
> 
>         } 
> 
> }
> 
>  
> 
> [root at test1 gfs-config]# cat /etc/hosts
> 
> # Do not remove the following line, or various programs
> 
> # that require network functionality will fail.
> 
> 192.168.0.22           test1
> 
>  
> 
>  
> 
> Thanks,
> 
> Jeffrey Yan
> 
> 
> 
> ______________________________________________________________________
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050830/a0c97d07/attachment.htm>

From lhh at redhat.com  Tue Aug 30 18:03:32 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 30 Aug 2005 14:03:32 -0400
Subject: [Linux-cluster] Re: iptables protection wrapper; nfsexport.sh
	vs ip.sh racing
In-Reply-To: <20050829233523.GD5908@neu.nirvana>
References: <20050822225227.GJ24127@neu.nirvana>
	<1125340879.24205.30.camel@ayanami.boston.redhat.com>
	<20050829233523.GD5908@neu.nirvana>
Message-ID: <1125425012.21943.1.camel@ayanami.boston.redhat.com>

On Tue, 2005-08-30 at 01:35 +0200, Axel Thimm wrote:

> > It's really an attempt at a workaround a configuration problem -- and
> > nothing more.
> 
> The above is with nfs running on all nodes already. The racing seems
> to be with the exportfs commands and ip setup/teardown.
> 
> It is easy to reproduce (>=50%) if the client connects over Gigabit
> and is in write transaction while the service is moved. We saw this in
> two different setups. If you throttle the network bandwidth to <=
> 20MB/sec you don't trigger the bug, so it really seems like a racing
> problem.

ewww...  Can you bugzilla this so we can track it?  =)

-- Lon




From bmarzins at redhat.com  Tue Aug 30 19:52:40 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Tue, 30 Aug 2005 14:52:40 -0500
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <4313B938.2090409@telkom.co.id>
References: <20050827004859.D66965A8690@mail.silvercash.com>
	<43126996.8000606@telkom.co.id>
	<20050829205951.GK12333@phlogiston.msp.redhat.com>
	<4313B938.2090409@telkom.co.id>
Message-ID: <20050830195240.GM12333@phlogiston.msp.redhat.com>

On Tue, Aug 30, 2005 at 08:41:12AM +0700, Fajar A. Nugraha wrote:
> Benjamin Marzinski wrote:
> 
> >If the gnbds are exported uncached (the default), the client will fail 
> >back IO
> >if it can no longer talk to the server after a specified timeout.  
> >
> What is the default timeout anyway, and how can I set it?
> Last time I test gnbd-import timeout was on a development version 
> (DEVEL.1104982050) and after more than 30 minutes, the client still 
> tries to reconnect.

The default timeout is 1 minute. It is tuneable with the -t option (see the
gnbd man page). However you only timeout if you export the device in uncached
mode.

-Ben
 
> Regards,
> 
> Fajar
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From fajar at telkom.co.id  Wed Aug 31 03:10:37 2005
From: fajar at telkom.co.id (Fajar A. Nugraha)
Date: Wed, 31 Aug 2005 10:10:37 +0700
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <20050830195240.GM12333@phlogiston.msp.redhat.com>
References: <20050827004859.D66965A8690@mail.silvercash.com>	<43126996.8000606@telkom.co.id>	<20050829205951.GK12333@phlogiston.msp.redhat.com>	<4313B938.2090409@telkom.co.id>
	<20050830195240.GM12333@phlogiston.msp.redhat.com>
Message-ID: <43151FAD.20803@telkom.co.id>

Benjamin Marzinski wrote:

>On Tue, Aug 30, 2005 at 08:41:12AM +0700, Fajar A. Nugraha wrote:
>  
>
>>Benjamin Marzinski wrote:
>>
>>    
>>
>>>If the gnbds are exported uncached (the default), the client will fail 
>>>back IO
>>>if it can no longer talk to the server after a specified timeout.  
>>>
>>>      
>>>
>>What is the default timeout anyway, and how can I set it?
>>Last time I test gnbd-import timeout was on a development version 
>>(DEVEL.1104982050) and after more than 30 minutes, the client still 
>>tries to reconnect.
>>    
>>
>
>The default timeout is 1 minute. It is tuneable with the -t option (see the
>gnbd man page). However you only timeout if you export the device in uncached
>mode.
>
>  
>
I find something interesting :
gnbd_import man page (no mention of timeout):
       -t server
              Fence from Server.
              Specify a server for the IO fence (only used with the -s 
option).

gnbd_export man page :
       -t [seconds]
              Timeout.
              Set the exported GNBD to timeout mode 
              This option is used with -p.
              This  is  the  default  for uncached  GNBDs

Isn't the client the one that has to determine whether it's in wait mode 
or timeout mode? How does the parameter from gnbd_export passed to 
gnbd_import?

I tested it today with gnbd 1.00.00, by adding an extra ip address to 
the server -> gnbd_export on the server (IP address 192.168.17.193, 
cluster member, no extra parameter, so it should be exported as uncached 
gnbd in timeout mode) -> gnbd_import on the client (member of a 
different cluster) -> mount the gnbd_import -> remove the IP addresss 
192.168.17.193 from the server -> do df -k on the client, and I got 
these on the client's syslog

Aug 31 09:55:58 node1 gnbd_recvd[9792]: client lost connection with 
192.168.17.193 : Interrupted system call
Aug 31 09:55:58 node1 gnbd_recvd[9792]: reconnecting
Aug 31 09:55:58 node1 kernel: gnbd (pid 9792: gnbd_recvd) got signal 1
Aug 31 09:55:58 node1 kernel: gnbd2: Receive control failed (result -4)
Aug 31 09:55:58 node1 kernel: gnbd2: shutting down socket
Aug 31 09:55:58 node1 kernel: exitting GNBD_DO_IT ioctl
Aug 31 09:56:03 node1 gnbd_monitor[9781]: ERROR [gnbd_monitor.c:486] 
server D?? is not a cluster member, cannot fence.
Aug 31 09:56:08 node1 gnbd_monitor[9781]: ERROR [gnbd_monitor.c:486] 
server D?? is not a cluster member, cannot fence.
Aug 31 09:56:08 node1 gnbd_recvd[9792]: ERROR [gnbd_recvd.c:213] cannot 
connect to server 192.168.17.193 (-1) : Interrupted system call
Aug 31 09:56:08 node1 gnbd_recvd[9792]: reconnecting
Aug 31 09:56:13 node1 gnbd_monitor[9781]: ERROR [gnbd_monitor.c:486] 
server D?? is not a cluster member, cannot fence.
Aug 31 09:56:13 node1 gnbd_recvd[9792]: ERROR [gnbd_recvd.c:213] cannot 
connect to server 192.168.17.193 (-1) : Interrupted system call
Aug 31 09:56:13 node1 gnbd_recvd[9792]: reconnecting

And it goes on, and on, and on :) After ten minutes, I add the IP 
address back to the server and these appear on syslog :
Aug 31 10:06:13 node1 gnbd_recvd[9792]: reconnecting
Aug 31 10:06:16 node1 kernel: resending requests

So it looks like by default gnbd runs in wait mode, and after it 
reconnects the kernel automatically resends the request without the need 
of dm-multipath.

Is my setup incorrect, or is this how it's supposed to work?

Regards,

Fajar



From gkapitany at dundeesecurities.com  Wed Aug 31 13:52:49 2005
From: gkapitany at dundeesecurities.com (Gabriel Kapitany)
Date: Wed, 31 Aug 2005 09:52:49 -0400
Subject: [Linux-cluster] RedHat 3 GFS support
Message-ID: <C596E4F435FDF8488A438382D2635E38AC214A@TOR-AC-EVS02.accounts.dundeecorp.com>

Hi guys,
I'd like to know if Red Hat Enterprise Linux AS release 3 has support
for GFS.
Thanks,
Gabriel 



From rkenna at redhat.com  Wed Aug 31 14:05:59 2005
From: rkenna at redhat.com (Rob Kenna)
Date: Wed, 31 Aug 2005 10:05:59 -0400
Subject: [Linux-cluster] RedHat 3 GFS support
In-Reply-To: <C596E4F435FDF8488A438382D2635E38AC214A@TOR-AC-EVS02.accounts.dundeecorp.com>
References: <C596E4F435FDF8488A438382D2635E38AC214A@TOR-AC-EVS02.accounts.dundeecorp.com>
Message-ID: <4315B947.5090209@redhat.com>

Gabriel Kapitany wrote:

>Hi guys,
>I'd like to know if Red Hat Enterprise Linux AS release 3 has support
>for GFS.
>Thanks,
>Gabriel 
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>  
>
GFS 6.0 is a supported option on top of RHEL 3.

http://www.redhat.com/software/rha/gfs/


- Rob

-- 
Robert Kenna / Red Hat
Sr Product Mgr - Storage
10 Technology Park Drive
Westford, MA 01886
o: (978) 392-2410 (x22410)
f: (978) 392-1001
c: (978) 771-6314
rkenna at redhat.com



From gkapitany at dundeesecurities.com  Wed Aug 31 14:13:47 2005
From: gkapitany at dundeesecurities.com (Gabriel Kapitany)
Date: Wed, 31 Aug 2005 10:13:47 -0400
Subject: [Linux-cluster] RedHat 3 GFS support
Message-ID: <C596E4F435FDF8488A438382D2635E38AC2157@TOR-AC-EVS02.accounts.dundeecorp.com>

Sorry,
I was blind :-)
Thanks,
Gabriel

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rob Kenna
Sent: August 31, 2005 10:06 AM
To: linux clustering
Subject: Re: [Linux-cluster] RedHat 3 GFS support

Gabriel Kapitany wrote:

>Hi guys,
>I'd like to know if Red Hat Enterprise Linux AS release 3 has support
>for GFS.
>Thanks,
>Gabriel 
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>  
>
GFS 6.0 is a supported option on top of RHEL 3.

http://www.redhat.com/software/rha/gfs/


- Rob

-- 
Robert Kenna / Red Hat
Sr Product Mgr - Storage
10 Technology Park Drive
Westford, MA 01886
o: (978) 392-2410 (x22410)
f: (978) 392-1001
c: (978) 771-6314
rkenna at redhat.com

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Wed Aug 31 16:27:44 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 31 Aug 2005 12:27:44 -0400
Subject: [Linux-cluster] Rhcs4 fence with no hardware power switch
In-Reply-To: <BAY22-F30DF0A9DA9F0E76AA68C3FA6AE0@phx.gbl>
References: <BAY22-F30DF0A9DA9F0E76AA68C3FA6AE0@phx.gbl>
Message-ID: <1125505664.21943.80.camel@ayanami.boston.redhat.com>

On Tue, 2005-08-30 at 02:00 +0000, nattapon viroonsri wrote:
> >On Mon, 2005-08-29 at 02:26 +0000, nattapon viroonsri wrote:
> > > I use RedHat cluster v4 for HA and have no power switch hardware
> > > When i disconnected  network cable for client access resource , Failover
> > > have occur as expect
> > > But the server that was disconnect network cable not be rebooted or 
> >fenced
> > > like rhcs V3(even no power switch).  And the failed server  can't not 
> >join
> > > cluster again.
> >
> >Eep, no failover should have occurred if you don't have fencing.
> 
> I see resource(virtual ip, serviice) tranfer to another node. but dont see 
> any mesage in /var/log/message said about fencing, may be i miss something .
> and i try to reboot failnode with command "init 6" manually but it show that 
> cant stop rgmanager service.

Interesting... the event that the node has died, failover shouldn't
occur until the user-level service group recovers.

The user-level service groups should not recover until *after* fencing
recovers and completes.  Without fencing, there should never be a
failover (because fencing *fails* unlike in RHCS3, where it just
complained and continued), unless something's been done that I'm not
aware of?


> > > So, Have anyway to reboot failed node(network problem or server problem)
> > > without use hardware power switch to make failnode to clean state again 
> >?
> >
> >You need fencing hardware with RHCS4/RHEL4.  I'd check ebay if you're
> >not looking to spend a lot, you can get a WTI NPS115 for pretty cheap
> >nowadays (under $100!).
> >
> Have any addition module or modify version of  perl script "fence_xxx"  or 
> use wathdog with rhcs v4. ?

We are looking in to using them as an additional means, though I doubt
we'll allow reliance on them like we did in RHEL2.1.  That is, I suspect
that there won't be a "fence_watchdog" driver, but you never know...

On RHCS3, they were used as an additional means of protection, but the
cluster software complained about lack of "STONITH devices" when only
using the watchdog...

I will state this plainly: Fencing hardware is *required* for RHCS4 (the
Red Hat product/support offering). 


> >You can run with manual fencing, but no automatic recovery will occur,
> >and this isn't a supported solution.
> >
> 
> Thanks for your suggestion. :)

To clarify - my suggestion was to get a supported fence device, and I
noted one which could be had cheaply.  Personally, I recommend against
using manual fencing for anything, including evaluation testing.

You can also use GNBD + fence_gnbd -- no fencing hardware required! :)

-- Lon



From lhh at redhat.com  Wed Aug 31 16:28:51 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 31 Aug 2005 12:28:51 -0400
Subject: [Linux-cluster] RHEL/RHCS3:
	/usr/lib/clumanager/services/service status # stays up
In-Reply-To: <437DFE3B-80E4-4D14-A4A1-DDE56BD2ED5B@rallydev.com>
References: <437DFE3B-80E4-4D14-A4A1-DDE56BD2ED5B@rallydev.com>
Message-ID: <1125505731.21943.82.camel@ayanami.boston.redhat.com>

On Thu, 2005-08-18 at 14:12 -0600, Tarun Reddy wrote:

> Anybody venture a guess as to why this might be occurring? And are my  
> check intervals too low?

What are they set to?

You get multiple running at the same time if you have multiple services.

-- Lon




From lhh at redhat.com  Wed Aug 31 16:29:55 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 31 Aug 2005 12:29:55 -0400
Subject: [Linux-cluster] RHEL/RHCS3: many pipe files keep building up
	in /tmp
In-Reply-To: <F74FE087-B326-4CDE-936E-4B376D4BEB7F@rallydev.com>
References: <F74FE087-B326-4CDE-936E-4B376D4BEB7F@rallydev.com>
Message-ID: <1125505795.21943.84.camel@ayanami.boston.redhat.com>

On Thu, 2005-08-18 at 14:19 -0600, Tarun Reddy wrote:
> sh-np-1124010202

I'm not aware of what's causing this...  Can you file a bugzilla about
it?

-- Lon





From brianu at silvercash.com  Wed Aug 31 17:26:19 2005
From: brianu at silvercash.com (brianu)
Date: Wed, 31 Aug 2005 10:26:19 -0700
Subject: FW: [Linux-cluster] Re: If I have 5 GNBD server?
Message-ID: <20050831172345.C9FDB5A8684@mail.silvercash.com>

Ok, so I tried using heartbeat to create a virtual ip that floats between
the gnbd servers it worked out okay I can actually mount under that virtual
gnbd server ip. And If I wanted to manually fail out one gnbd server for
another no prob if all was done cleany, i.e. ( vgchange -aln, gnbd_import
-R) then would remount by gnbd_import -i gnbd_vip:

Which would give the unique gnbd in /dev/gnbd, then I'd run vgchange -aly,
would would bring it in lvm and device mapper.

However a manual failover test hangs with "vgchange -aln" as the old device
"unique gnbd" is still attempted to be accessed, also kill the process with
a killall or a kill -11, still doesn't cleanly allow clmvd to return to a
clean state.


As for Multipath as  Ben has wrote

> If the gnbds are exported uncached (the default), the client will fail
back IO
> if it can no longer talk to the server after a specified timeout.  However
> the userspace tools for dm-multipath are still to SCSI-centric to allow
you
> to run on top of gnbd.  You can manually run dm-setup commands to build
> the appropriate multipath map, scan the map to check if a path has failed,
> remove the failed gnbd from the map (so the device can close and gnbd can
> start trying to reconnect), and them manually add the gnbd device back
into
> the map when it has reconnected. That's pretty much all the dm-multipath
> userspace tools do.  Someone could even write a pretty simple daemon that
did
> this, and become the personal hero of many people on this list.

> The only problem is that if you manually execute the commands, or write
the
> daemon in bash or some other scripting language, you can run into a memory
> deadlock. If you are in a very low memory situation, and you need to
complete
> gnbd IO requests to free up memory, the daemon can't allocate any memory
in
> doing it's job.
& 
> If you have the gnbd exported in caching mode, each server will maintain
it's
> own cache, So if you write a block to one server, and then the server
crashes,
> when you read the block from the second server, if it was already cached
> before the read, you will get invalid data, so that won't work. If you
> set the gnbd to uncached mode, the client will fail the IO back, and
something
> (a multipath driver) needs to be there to reissue the request.

> -Ben

I have tried to get the dm-multipath setup working correctly but had little
success I had started a earlier thread on it and didn't get any response,

My test was based on 
https://www.redhat.com/archives/linux-cluster/2005-April/msg00062.html

I posted my issues here.

https://www.redhat.com/archives/linux-cluster/2005-August/msg00080.html

The initial command used was 
echo "0 1146990592  multipath 0 0 1 1 round-robin 0 2 1 251:1 1000 251:3
1000 " | dmsetup create dm-1

(251:1 & 251:3 are the major:minor ids of the gnbds obtained from the
command cat /proc/partitions)
(1146990592 -> I believe is the size of the block device.) 
This resulted in a block device of which I still could not mount, I tried
multipath -0ll (after installing multipath and create a rudimentary
multipath.conf) and the result was

dm-1

[size=546 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active][first]
  \_ 0:0:0:0      251:0   [undef ][active]
\_ round-robin 0 [enabled]
  \_ 0:0:0:0      251:4   [undef ][active]

"notice that the size was 1/2 the actual size!?! (I have no idea what this
means "somebody enlighten me, please!)

When I attempted to mount
 
[root dell-1650-31 ~]# mount -t gfs /dev/mapper/dm-1 /mnt/gfs1

mount: /dev/dm-1: can't read superblock

This was tried previously off a multipathed device in  which dmsetup status
gives the output below:

dm-1: 0 1146990592 multipath 1 0 0 2 1 A 0 1 0 251:1 A 0 E 0 1 0 251:3 A 0

dmsetup deps gives
dm-1: 2 dependencies    : (251, 3) (251, 1)

and dmsetup info gives
Name:              dm-1
State:             ACTIVE
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 1
Number of targets: 1

I have managed to get a nfs/gnbd failover type scenario working, in which
gnbd_servers export the shared storage via nfs and the clients mount via a
heartbeat VIP. I then created a script which I will rewrite into into a mini
daemon soon, that checks status of the servers then when the ip is token
over stops apache, unmounts nfs via the "umount -l -t nfs $mountpoint" then
"mount $mountpoint" & starts apache again. I have tested it and it works (
by checking for stale handles and remounting cleanly) Some of the same
principles can go into one for gnbd. But my bump right now is LVM.


Can GNBD be used without LVM? Or does anyone know how to enable failover
correctly on dm-multipath?


Any help would be appreciated.

Brian Urrutia 


-----Original Message-----
From: brian urrutia [mailto:brianu at mail.silvercash.com] 
Sent: Monday, August 29, 2005 12:56 AM
To: linux-cluster at redhat.com
Cc: mikore.li at gmail.com; brianu
Subject: Re: [Linux-cluster] Re: If I have 5 GNBD server?

> > If using LVM to make a volume of imported gnbds is not the answer for
> > redundancy can anyone suggest a method that is? Im not opposed to using
any
> > other resource of cluster or GFS but I would really like to implement a
> > redundant solution, ( gnbd, gulm, etc.). 
> > 
> Hi, Brianu, maybe LVM + md + gnbd should be one of the solution for
> redundancy, for example, you have 2 gnbd servers, each one exports 1
> disk. Then, the steps should be:
> 1. create a RAID-1  /dev/md0 on GFS client with imported 2 gnbd block
devices.
> 2. use LVM  create /dev/vg0 on top of them.
> 3. mkfs_gfs on /dev/vg0.
> I haven't tried this configuration, theoretically, it should work.
> 
> Thanks,
> 
> Michael

I will look into trying a md & lvm combo, as far as keepalived or
rgmanager to failover an ip, i havent seen a clear example on how to use
rgmanager, but i have tried heartbeat (linux HA) to failover the ip, and
the problem is that the gnbd clients still seem to lock on the former
server regardless of that the ip has failed over to another ip ( and
continuly try and reconnect as Fajar had mentioned).


The shared storage I have is a HP MSA 100 SAN
It might be a config error on my part as far as rgmanager is concerned i
will have to post my cluster.conf tommorrow.

-Brian





From brianu at silvercash.com  Wed Aug 31 22:17:19 2005
From: brianu at silvercash.com (brianu)
Date: Wed, 31 Aug 2005 15:17:19 -0700
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <20050831172345.C9FDB5A8684@mail.silvercash.com>
Message-ID: <20050831221444.BB7845A866A@mail.silvercash.com>

Hello all,

I tried creating a md device from the GNBD's and I was able to create one, I
created a raid 1 device (however it is still going thru its sync process(yet
there is no difference the volume is from shared storage)). The problem is I
cannot mount the device as shown below

Aug 31 14:46:27 dell-1650-31 kernel: GFS: Trying to join cluster "lock_dlm",
"sansilvercash:gfs2"
Aug 31 14:46:27 dell-1650-31 kernel: lock_dlm: fence domain not found; check
fenced Aug 31 14:46:27 dell-1650-31 kernel: GFS: can't mount proto =
lock_dlm, table = sansilvercash:gfs2, hostdata = Aug 31 14:46:39
dell-1650-31 hal.hotplug[5236]: timout(10000 ms) waiting for
/block/diapered_dm-1 Aug 31 14:50:01 dell-1650-31 crond(pam_unix)[7698]:
session opened for user root by (uid=0) Aug 31 14:50:01 dell-1650-31
crond(pam_unix)[7698]: session closed for user root.

Yet I can mount the device by gnbds individually.

PVSCAN shows:

[root at dell-1650-31 ~]# pvscan
  PV /dev/sda3   VG VolGroupData   lvm2 [31.81 GB / 32.00 MB free]
  PV /dev/md0    VG space          lvm2 [1.07 TB / 0    free]
  Total: 2 [1.10 TB] / in use: 2 [1.10 TB] / in no VG: 0 [0   ]
[root at dell-1650-31 ~]#


Any clues on why I wouldn't be able to mount the device? This is the same
behavior when I tried to mount a multipathed device ( which I may have
configured right or wrong)

-Brian
-----Original Message-----
From: brianu [mailto:brianu at silvercash.com] 
Sent: Wednesday, August 31, 2005 10:26 AM
To: linux-cluster at redhat.com
Cc: 'brianu'
Subject: FW: [Linux-cluster] Re: If I have 5 GNBD server?

Ok, so I tried using heartbeat to create a virtual ip that floats between
the gnbd servers it worked out okay I can actually mount under that virtual
gnbd server ip. And If I wanted to manually fail out one gnbd server for
another no prob if all was done cleany, i.e. ( vgchange -aln, gnbd_import
-R) then would remount by gnbd_import -i gnbd_vip:

Which would give the unique gnbd in /dev/gnbd, then I'd run vgchange -aly,
would would bring it in lvm and device mapper.

However a manual failover test hangs with "vgchange -aln" as the old device
"unique gnbd" is still attempted to be accessed, also kill the process with
a killall or a kill -11, still doesn't cleanly allow clmvd to return to a
clean state.


As for Multipath as  Ben has wrote

> If the gnbds are exported uncached (the default), the client will fail
back IO
> if it can no longer talk to the server after a specified timeout.  However
> the userspace tools for dm-multipath are still to SCSI-centric to allow
you
> to run on top of gnbd.  You can manually run dm-setup commands to build
> the appropriate multipath map, scan the map to check if a path has failed,
> remove the failed gnbd from the map (so the device can close and gnbd can
> start trying to reconnect), and them manually add the gnbd device back
into
> the map when it has reconnected. That's pretty much all the dm-multipath
> userspace tools do.  Someone could even write a pretty simple daemon that
did
> this, and become the personal hero of many people on this list.

> The only problem is that if you manually execute the commands, or write
the
> daemon in bash or some other scripting language, you can run into a memory
> deadlock. If you are in a very low memory situation, and you need to
complete
> gnbd IO requests to free up memory, the daemon can't allocate any memory
in
> doing it's job.
& 
> If you have the gnbd exported in caching mode, each server will maintain
it's
> own cache, So if you write a block to one server, and then the server
crashes,
> when you read the block from the second server, if it was already cached
> before the read, you will get invalid data, so that won't work. If you
> set the gnbd to uncached mode, the client will fail the IO back, and
something
> (a multipath driver) needs to be there to reissue the request.

> -Ben

I have tried to get the dm-multipath setup working correctly but had little
success I had started a earlier thread on it and didn't get any response,

My test was based on 
https://www.redhat.com/archives/linux-cluster/2005-April/msg00062.html

I posted my issues here.

https://www.redhat.com/archives/linux-cluster/2005-August/msg00080.html

The initial command used was 
echo "0 1146990592  multipath 0 0 1 1 round-robin 0 2 1 251:1 1000 251:3
1000 " | dmsetup create dm-1

(251:1 & 251:3 are the major:minor ids of the gnbds obtained from the
command cat /proc/partitions)
(1146990592 -> I believe is the size of the block device.) 
This resulted in a block device of which I still could not mount, I tried
multipath -0ll (after installing multipath and create a rudimentary
multipath.conf) and the result was

dm-1

[size=546 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active][first]
  \_ 0:0:0:0      251:0   [undef ][active]
\_ round-robin 0 [enabled]
  \_ 0:0:0:0      251:4   [undef ][active]

"notice that the size was 1/2 the actual size!?! (I have no idea what this
means "somebody enlighten me, please!)

When I attempted to mount
 
[root dell-1650-31 ~]# mount -t gfs /dev/mapper/dm-1 /mnt/gfs1

mount: /dev/dm-1: can't read superblock

This was tried previously off a multipathed device in  which dmsetup status
gives the output below:

dm-1: 0 1146990592 multipath 1 0 0 2 1 A 0 1 0 251:1 A 0 E 0 1 0 251:3 A 0

dmsetup deps gives
dm-1: 2 dependencies    : (251, 3) (251, 1)

and dmsetup info gives
Name:              dm-1
State:             ACTIVE
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 1
Number of targets: 1

I have managed to get a nfs/gnbd failover type scenario working, in which
gnbd_servers export the shared storage via nfs and the clients mount via a
heartbeat VIP. I then created a script which I will rewrite into into a mini
daemon soon, that checks status of the servers then when the ip is token
over stops apache, unmounts nfs via the "umount -l -t nfs $mountpoint" then
"mount $mountpoint" & starts apache again. I have tested it and it works (
by checking for stale handles and remounting cleanly) Some of the same
principles can go into one for gnbd. But my bump right now is LVM.


Can GNBD be used without LVM? Or does anyone know how to enable failover
correctly on dm-multipath?


Any help would be appreciated.

Brian Urrutia 


-----Original Message-----
From: brian urrutia [mailto:brianu at mail.silvercash.com] 
Sent: Monday, August 29, 2005 12:56 AM
To: linux-cluster at redhat.com
Cc: mikore.li at gmail.com; brianu
Subject: Re: [Linux-cluster] Re: If I have 5 GNBD server?

> > If using LVM to make a volume of imported gnbds is not the answer for
> > redundancy can anyone suggest a method that is? Im not opposed to using
any
> > other resource of cluster or GFS but I would really like to implement a
> > redundant solution, ( gnbd, gulm, etc.). 
> > 
> Hi, Brianu, maybe LVM + md + gnbd should be one of the solution for
> redundancy, for example, you have 2 gnbd servers, each one exports 1
> disk. Then, the steps should be:
> 1. create a RAID-1  /dev/md0 on GFS client with imported 2 gnbd block
devices.
> 2. use LVM  create /dev/vg0 on top of them.
> 3. mkfs_gfs on /dev/vg0.
> I haven't tried this configuration, theoretically, it should work.
> 
> Thanks,
> 
> Michael

I will look into trying a md & lvm combo, as far as keepalived or
rgmanager to failover an ip, i havent seen a clear example on how to use
rgmanager, but i have tried heartbeat (linux HA) to failover the ip, and
the problem is that the gnbd clients still seem to lock on the former
server regardless of that the ip has failed over to another ip ( and
continuly try and reconnect as Fajar had mentioned).


The shared storage I have is a HP MSA 100 SAN
It might be a config error on my part as far as rgmanager is concerned i
will have to post my cluster.conf tommorrow.

-Brian







From zach at zachlowry.net  Wed Aug 31 15:30:15 2005
From: zach at zachlowry.net (Zach Lowry)
Date: Wed, 31 Aug 2005 10:30:15 -0500
Subject: [Linux-cluster] Fencing Module for VMware ESX and GSX
Message-ID: <D0167E4E-813A-4C63-A8B8-48C51316CC9E@zachlowry.net>

Hello!

I recently deployed GFS between 2 virtual machines on a VMware ESX  
server and had a problem because there was no fencing module that  
could handle this architecture. So, using the fence_apc module as a  
template, I wrote a compatible module, fence_vmware. Now, since I  
didn't want to rewrite GFS and recompile to use this new module, I am  
currently using it as a drop-in replacement for the fence_apc module,  
and using the APC configuration syntax. However, this code seems to  
work just right when a machine misses sync, it will log into the  
VMware ESX server and attempt to do a soft reboot of the VM, then a  
hard reboot if necessary. I hope to see this incorporated into the  
GFS tree, but if not it will be available on my website.

Attached is a copy of the source, also available at http:// 
www.zachlowry.net/software.php, along with a sample configuration.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fence.ccs
Type: application/octet-stream
Size: 238 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050831/8aef918b/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nodes.ccs
Type: application/octet-stream
Size: 364 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050831/8aef918b/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fence_vmware
Type: application/octet-stream
Size: 7664 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050831/8aef918b/attachment-0002.obj>
-------------- next part --------------



--
Zach Lowry
MTSU, Murfreesboro, TN
zach at zachlowry.net

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGP.sig
Type: application/pgp-signature
Size: 304 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050831/8aef918b/attachment.sig>