From pcaulfie at redhat.com Mon Aug 1 07:32:43 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 01 Aug 2005 08:32:43 +0100 Subject: [Linux-cluster] Error when loading the lock_dlm module In-Reply-To: <8293054050729235847b66f37@mail.gmail.com> References: <8293054050729235847b66f37@mail.gmail.com> Message-ID: <42EDD01B.40809@redhat.com> KL Raja Sekar wrote: > Hi, > > I am trying to use GFS 6.1 with lock_dlm module. But when i try to > load the lock_dlm module i am getting fatal error. Herewith attached > the fatal error msg and the dmesg errors. > > I am using HP Proliant DL360 server with XP 1024 storage. please let > me know if anybody has the solution for the above. > > regards > shekar > > ****************************************************************************** > root at uranus1 ~]# modprobe lock_dlm > FATAL: Error inserting lock_dlm > (/lib/modules/2.6.9-11.ELsmp/kernel/fs/gfs_locking/lock_dlm/lock_dlm.ko): > Unknown symbol in module, or unknown parameter (see dmesg) > ****************************************************************************** > dmesg output ------------------------------- > lock_dlm: Unknown symbol dlm_debug_dump > lock_dlm: Unknown symbol dlm_new_lockspace > lock_dlm: Unknown symbol kcl_register_service > lock_dlm: Unknown symbol dlm_unlock > lock_dlm: Unknown symbol kcl_start_done > lock_dlm: Unknown symbol dlm_release_lockspace > lock_dlm: Unknown symbol kcl_join_service > lock_dlm: Unknown symbol kcl_unregister_service > lock_dlm: Unknown symbol kcl_leave_service > lock_dlm: Unknown symbol dlm_query > lock_dlm: Unknown symbol kcl_get_members > lock_dlm: Unknown symbol kcl_releaseref_cluster > lock_dlm: Unknown symbol dlm_lock > lock_dlm: Unknown symbol kcl_cluster_name > lock_dlm: Unknown symbol kcl_get_services > lock_dlm: Unknown symbol kcl_addref_cluster Looks like the cman & dlm modules are not loaded. If they're on the system then try re-running depmod -a. If not, then install them ;-) -- patrick From ialberdi at histor.fr Mon Aug 1 09:30:48 2005 From: ialberdi at histor.fr (Ion Alberdi) Date: Mon, 01 Aug 2005 11:30:48 +0200 Subject: [Linux-cluster] File size limitation on GFS Message-ID: <42EDEBC8.7070402@histor.fr> Hi everybody, is there is a maximum file size the GFS can handle? I tried to do some tests with big files, and I couldn't open (open(2)) files that were >= 2Go. (It works with 1Go files, I didn't try sizes between 1 and 2 Go). I would like to know if this limitation comes from my configuration or from the GFS file system. I searched an answer in the web and in the mailing list but I didn't found anything, If I missed something I'd be very sorry and an url to the article I missed would be a great answer :). Thanks in advance! Regards From javipolo at datagrama.net Mon Aug 1 12:14:48 2005 From: javipolo at datagrama.net (Javi Polo) Date: Mon, 1 Aug 2005 14:14:48 +0200 Subject: [Linux-cluster] segfault Message-ID: <20050801121448.GA4036@gibson.drslump.org> While doing cman_tool join ... 
I got a segfault and this on the logs: Unable to handle kernel NULL pointer dereference at virtual address 0000019c printing eip: c034e80a *pde = 00000000 Oops: 0000 [#12] Modules linked in: gfs lock_harness dlm cman ipv6 snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010296 (2.6.12.3) EIP is at sk_alloc+0x1b/0xd0 eax: c1551180 ebx: 00000002 ecx: ffffff9c edx: 000000d0 esi: 00000134 edi: 000000d0 ebp: ffffff9f esp: c2213ed8 ds: 007b es: 007b ss: 0068 Process cman_tool (pid: 4034, threadinfo=c2212000 task=c993fa80) Stack: ca2850ac c886c074 00000286 00000002 cc1f3380 000000d0 ffffff9f e1aa90e7 0000001e 000000d0 00000134 c1551180 00000002 cc1f3380 00000002 e1aa933e cc1f3380 000000d0 0000001e cc1f3380 c034c923 cc1f3380 00000002 00000001 Call Trace: [] cl_alloc_sock+0x38/0x97 [cman] [] cl_create+0x59/0x101 [cman] [] __sock_create+0xc3/0x1c7 [] sock_create+0x2f/0x33 [] sys_socket+0x28/0x55 [] sys_socketcall+0x89/0x251 [] filp_close+0x52/0x96 [] do_page_fault+0x0/0x5bf [] syscall_call+0x7/0xb Code: ff ff ff c7 44 24 14 04 00 00 00 e9 75 fc ff ff 83 ec 1c 89 74 24 10 89 5c 24 0c 89 7c 24 14 89 6c 24 18 8b 74 24 28 8b 54 24 24 <8b> 46 68 85 c0 0f 84 8c 00 00 00 89 54 24 04 89 04 24 e8 a8 79 Has anybody a hint? I compiled the kernel modules, and also made debian packages from the sources out there at ubuntu ... I'm with 2.6.12.3 in debian/sid ..... thx -- Javier Polo @ Datagrama 902 136 126 From pcaulfie at redhat.com Mon Aug 1 12:24:27 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 01 Aug 2005 13:24:27 +0100 Subject: [Linux-cluster] segfault In-Reply-To: <20050801121448.GA4036@gibson.drslump.org> References: <20050801121448.GA4036@gibson.drslump.org> Message-ID: <42EE147B.30702@redhat.com> Javi Polo wrote: > While doing cman_tool join ... I got a segfault and this on the logs: > > Unable to handle kernel NULL pointer dereference at virtual address 0000019c > printing eip: > c034e80a > *pde = 00000000 > Oops: 0000 [#12] > Modules linked in: gfs lock_harness dlm cman ipv6 snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc > CPU: 0 > EIP: 0060:[] Not tainted VLI > EFLAGS: 00010296 (2.6.12.3) > EIP is at sk_alloc+0x1b/0xd0 > eax: c1551180 ebx: 00000002 ecx: ffffff9c edx: 000000d0 > esi: 00000134 edi: 000000d0 ebp: ffffff9f esp: c2213ed8 > ds: 007b es: 007b ss: 0068 > Process cman_tool (pid: 4034, threadinfo=c2212000 task=c993fa80) > Stack: ca2850ac c886c074 00000286 00000002 cc1f3380 000000d0 ffffff9f e1aa90e7 > 0000001e 000000d0 00000134 c1551180 00000002 cc1f3380 00000002 e1aa933e > cc1f3380 000000d0 0000001e cc1f3380 c034c923 cc1f3380 00000002 00000001 > Call Trace: > [] cl_alloc_sock+0x38/0x97 [cman] > [] cl_create+0x59/0x101 [cman] > [] __sock_create+0xc3/0x1c7 > [] sock_create+0x2f/0x33 > [] sys_socket+0x28/0x55 > [] sys_socketcall+0x89/0x251 > [] filp_close+0x52/0x96 > [] do_page_fault+0x0/0x5bf > [] syscall_call+0x7/0xb > Code: ff ff ff c7 44 24 14 04 00 00 00 e9 75 fc ff ff 83 ec 1c 89 74 24 10 89 5c 24 0c 89 7c 24 14 89 6c 24 18 8b 74 24 28 8b 54 24 24 <8b> 46 68 85 c0 0f 84 8c 00 00 00 89 54 24 04 89 04 24 e8 a8 79 > > Has anybody a hint? > I compiled the kernel modules, and also made debian packages from the sources out there at ubuntu ... I'm with 2.6.12.3 in debian/sid ..... > The sk_alloc code has changed a few times in the kernel so it might be that the source you have compiled doesn't match the kernel it is running on. 
Though that usually results in a compile error rather than a runtime one! Which source are you using? -- patrick From pcaulfie at redhat.com Mon Aug 1 12:28:49 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 01 Aug 2005 13:28:49 +0100 Subject: [Linux-cluster] How do nodes in a cluster authenticate each other? In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A364F4BF@mailxchg01.corp.opsource.net> References: <38A48FA2F0103444906AD22E14F1B5A364F4BF@mailxchg01.corp.opsource.net> Message-ID: <42EE1581.2040604@redhat.com> Jeff Harr wrote: > I asked this under a different heading earlier but nobody answered J > Don?t mean to spam the group but I can?t figure it out. This is on a > Redhat Cluster 4. > Basically they don't. In a CMAN cluster there is a join protocol that all nodes have to go through to become a member and cluster nodes will only talk to known members. But as things currently stand if someone spoofs the IP address, port & cluster number then it will get through I'm afraid. The solution is to use a private network for cluster communications. -- patrick From javipolo at datagrama.net Mon Aug 1 12:57:09 2005 From: javipolo at datagrama.net (Javi Polo) Date: Mon, 1 Aug 2005 14:57:09 +0200 Subject: [Linux-cluster] segfault In-Reply-To: <42EE147B.30702@redhat.com> References: <20050801121448.GA4036@gibson.drslump.org> <42EE147B.30702@redhat.com> Message-ID: <20050801125709.GA4173@gibson.drslump.org> On Aug/01/2005, Patrick Caulfield wrote: > The sk_alloc code has changed a few times in the kernel so it might be that the > source you have compiled doesn't match the kernel it is running on. Though that > usually results in a compile error rather than a runtime one! > Which source are you using? I downloaded the debian patches: kernel-patch-2.6-cman - Cluster manager - kernel patch It did not compile, and I hand-fixed it with some patches that appeared in this list ... I'm gonna check out now the svn code in open.datacore.ch ..... -- Javier Polo @ Datagrama 902 136 126 From jharr at opsource.net Mon Aug 1 13:17:29 2005 From: jharr at opsource.net (Jeff Harr) Date: Mon, 1 Aug 2005 09:17:29 -0400 Subject: [Linux-cluster] How do nodes in a cluster authenticate each other? Message-ID: <38A48FA2F0103444906AD22E14F1B5A364FB1F@mailxchg01.corp.opsource.net> Ok, thank you very much, I just wanted to be sure that I wasn't overlooking something. Jeff -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Patrick Caulfield Sent: Monday, August 01, 2005 8:29 AM To: linux clustering Subject: Re: [Linux-cluster] How do nodes in a cluster authenticate each other? Jeff Harr wrote: > I asked this under a different heading earlier but nobody answered J > Don't mean to spam the group but I can't figure it out. This is on a > Redhat Cluster 4. > Basically they don't. In a CMAN cluster there is a join protocol that all nodes have to go through to become a member and cluster nodes will only talk to known members. But as things currently stand if someone spoofs the IP address, port & cluster number then it will get through I'm afraid. The solution is to use a private network for cluster communications. 
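
If a genuinely private network isn't practical straight away, ordinary packet
filtering at least narrows the window. A rough iptables sketch follows -- the
eth1 interface, the 10.0.0.0/24 subnet and UDP port 6809 (cman's usual default)
are only example values, so substitute whatever your cluster really uses:

# accept cluster traffic only from the private interconnect
iptables -A INPUT -i eth1 -s 10.0.0.0/24 -p udp --dport 6809 -j ACCEPT
# drop cman traffic arriving from anywhere else
iptables -A INPUT -p udp --dport 6809 -j DROP

That doesn't authenticate anything, of course; it just limits who can reach
the port.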
-- patrick -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From javiermarasco at yahoo.com.ar Mon Aug 1 12:32:20 2005 From: javiermarasco at yahoo.com.ar (javier marasco) Date: Mon, 1 Aug 2005 09:32:20 -0300 Subject: [Linux-cluster] job opening in Houston In-Reply-To: <78fcc84a050730211531f4a197@mail.gmail.com> Message-ID: <200508011332.j71DWOGh025030@mx3.redhat.com> I don't think so. Im only ask for information purpose. If you can of course , tell me how much pay for that job. thanks Javier Marasco System Administrator Digbang (Vera 358) Argentina 54-11-4857-6585 www.digbang.com -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of y f Sent: Sunday, July 31, 2005 1:16 AM To: keith at clearpathit.com; linux clustering Subject: Re: [Linux-cluster] job opening in Houston Hi, Keith, Can the work be done remotely ? On 7/29/05, Keith Grammer wrote: > > > > I am looking for a Linux Cluster Specialist for a contract position. > Please respond with a resume in Word. > > Thank You, > > Keith > > > > > > > > > > Keith Grammer > > Partner > > ClearPath IT LLC > > 713-344-0232 > > keith at clearpathit.com > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > > -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster ___________________________________________________________ 1GB gratis, Antivirus y Antispam Correo Yahoo!, el mejor correo web del mundo http://correo.yahoo.com.ar From teigland at redhat.com Tue Aug 2 07:18:28 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 2 Aug 2005 15:18:28 +0800 Subject: [Linux-cluster] [PATCH 00/14] GFS Message-ID: <20050802071828.GA11217@redhat.com> Hi, GFS (Global File System) is a cluster file system that we'd like to see added to the kernel. The 14 patches total about 900K so I won't send them to the list unless that's requested. Comments and suggestions are welcome. Thanks http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch http://redhat.com/~teigland/gfs2/20050801/broken-out/ Dave From javipolo at datagrama.net Tue Aug 2 07:45:18 2005 From: javipolo at datagrama.net (Javi Polo) Date: Tue, 2 Aug 2005 09:45:18 +0200 Subject: [Linux-cluster] segfault In-Reply-To: <20050801125709.GA4173@gibson.drslump.org> References: <20050801121448.GA4036@gibson.drslump.org> <42EE147B.30702@redhat.com> <20050801125709.GA4173@gibson.drslump.org> Message-ID: <20050802074518.GA20528@gibson.drslump.org> On Aug/01/2005, Javi Polo wrote: > > The sk_alloc code has changed a few times in the kernel so it might be that the > > source you have compiled doesn't match the kernel it is running on. Though that > > usually results in a compile error rather than a runtime one! > > Which source are you using? > I'm gonna check out now the svn code in open.datacore.ch ..... 
Weeeeeeek, error :P cluster/cman/cnxman.c: In function `cl_alloc_sock': cluster/cman/cnxman.c:922: warning: passing arg 3 of `sk_alloc' makes pointer from integer without a cast cluster/cman/cnxman.c:922: warning: passing arg 4 of `sk_alloc' makes integer from pointer without a cast cluster/cman/cnxman.c: In function `cl_bind': cluster/cman/cnxman.c:1062: error: structure has no member named `sk_zapped' cluster/cman/cnxman.c:1086: error: structure has no member named `sk_zapped' make[2]: *** [cluster/cman/cnxman.o] Error 1 make[1]: *** [cluster/cman] Error 2 make: *** [cluster] Error 2 kinoko:/usr/src/linux# I fixed those as described in https://www.redhat.com/archives/linux-cluster/2005-April/msg00051.html and https://www.redhat.com/archives/linux-cluster/2005-April/msg00034.html (with the debian patch I just had to fix the sk_zapped thing) and the results areeeeeeee: yuppp, now it does not segfaults .... :) -- Javier Polo @ Datagrama 902 136 126 From pegasus at nerv.eu.org Tue Aug 2 08:49:52 2005 From: pegasus at nerv.eu.org (Jure =?iso-8859-2?Q?Pe=E8ar?=) Date: Tue, 02 Aug 2005 10:49:52 +0200 Subject: [Linux-cluster] gfs max number of subdirs per directory Message-ID: <1122972593.8420.7.camel@localhost.localdomain> Hi all, I want to know what is the maximum number of subdirectories one can create on a GFS filesystem. For example, both ext2/3 and Veritas vxfs are limited to 32k, but that will soon become a limiting factor for my application. Because of this my only choice right now is reiserfs ... -- Jure Pe?ar http://jure.pecar.org From teigland at redhat.com Tue Aug 2 07:18:28 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 2 Aug 2005 15:18:28 +0800 Subject: [Linux-cluster] [PATCH 00/14] GFS Message-ID: <20050802071828.GA11217@redhat.com> Hi, GFS (Global File System) is a cluster file system that we'd like to see added to the kernel. The 14 patches total about 900K so I won't send them to the list unless that's requested. Comments and suggestions are welcome. Thanks http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch http://redhat.com/~teigland/gfs2/20050801/broken-out/ Dave - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo at vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ From oldmoonster at gmail.com Tue Aug 2 09:48:07 2005 From: oldmoonster at gmail.com (Q.L) Date: Tue, 2 Aug 2005 17:48:07 +0800 Subject: [Linux-cluster] Compile Issues In-Reply-To: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> Message-ID: <359782e705080202484184f323@mail.gmail.com> Hi, I have tried 2.6.9,2.6.11,2.6.12, but the kernel build always fails, :-(, could you share with me which kernel did you patch to? Is there any instructions to help me build and install cluster-1.00.00.tar.gz on my redhat 9 system? Indeed, I have refered to http://gfs.wikidev.net/Installation, but I still can't successfully build sources... . Thanks!!!! Q.L On 7/30/05, Jacob Liff wrote: > > > > Howdy, > > > > I have gone through the list before buging you guys but haven't found the > answer. A few days ago someone else was having issues and someone > recommended getting the source from: > > > > ftp://sources.redhat.com/pub/cluster/releases/cluster-1.00.00.tar.gz > > > > Instead of the CSV and to compile it against vanilla 2.6.12. 
I followed > those instructions and everything appeared to compile fine except when I > modprobe gfs I get this fun error: > > > > ATAL: Error inserting gfs > (/lib/modules/2.6.12/kernel/fs/gfs/gfs.ko): Unknown symbol > in module, or unknown parameter (see dmesg) > > > > dmesg output: > > > > gfs: Unknown symbol posix_acl_from_xattr > > gfs: Unknown symbol posix_acl_valid > > gfs: Unknown symbol posix_acl_permission > > gfs: Unknown symbol posix_acl_equiv_mode > > gfs: Unknown symbol posix_acl_chmod_masq > > gfs: Unknown symbol posix_acl_to_xattr > > gfs: Unknown symbol posix_acl_create_masq > > gfs: Unknown symbol posix_acl_clone > > > > Looking at the source its linking against the correct headers for the > functions.. and the functions do exist in the headers. > > > > Maybe I have not compiled someone into the kernel that I needed in order to > make this work? Could I be trying to use this against the wrong > kernel(Vanilla 2.6.12 > http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.12.3.tar.bz2 > )? I have been trying to either compile modules or patch against most of the > newer kernels for the last two days with no luck. Any help would be greatly > appreciated. > > > > Jacob L. > > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > > From pcaulfie at redhat.com Tue Aug 2 10:03:26 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 02 Aug 2005 11:03:26 +0100 Subject: [Linux-cluster] Compile Issues In-Reply-To: <359782e705080202484184f323@mail.gmail.com> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> <359782e705080202484184f323@mail.gmail.com> Message-ID: <42EF44EE.7020804@redhat.com> Q.L wrote: > Hi, > > I have tried 2.6.9,2.6.11,2.6.12, but the kernel build always fails, > :-(, could you share with me which kernel did you patch to? Is there > any instructions to help me build and install cluster-1.00.00.tar.gz > on my redhat 9 system? Indeed, I have refered to > http://gfs.wikidev.net/Installation, but I still can't successfully > build sources... . > It compiles cleanly for me against 2.6.12.2. -- patrick From oldmoonster at gmail.com Tue Aug 2 10:09:52 2005 From: oldmoonster at gmail.com (Q.L) Date: Tue, 2 Aug 2005 18:09:52 +0800 Subject: [Linux-cluster] Compile Issues In-Reply-To: <42EF44EE.7020804@redhat.com> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> <359782e705080202484184f323@mail.gmail.com> <42EF44EE.7020804@redhat.com> Message-ID: <359782e7050802030967157c38@mail.gmail.com> Hi, Patrick, Could you share me the instructions to build against kernel 2.6.12.2? Thanks! Q.L On 8/2/05, Patrick Caulfield wrote: > Q.L wrote: > > Hi, > > > > I have tried 2.6.9,2.6.11,2.6.12, but the kernel build always fails, > > :-(, could you share with me which kernel did you patch to? Is there > > any instructions to help me build and install cluster-1.00.00.tar.gz > > on my redhat 9 system? Indeed, I have refered to > > http://gfs.wikidev.net/Installation, but I still can't successfully > > build sources... . > > > > It compiles cleanly for me against 2.6.12.2. 
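
Roughly, the sequence I use is the one below. The paths are only examples, and
the kernel tree must have been configured at least once so that
include/linux/version.h exists, otherwise the cluster configure script will
complain. If you want ACLs, also enable POSIX ACL support for at least one
filesystem (ext3, say) so that CONFIG_FS_POSIX_ACL ends up set; I believe that
is where the posix_acl_* symbols quoted earlier come from.

# kernel side: configure and build a vanilla tree first
cd /usr/src/linux-2.6.12.2
make menuconfig
make && make modules_install && make install

# cluster side: point configure at that tree -- no kernel patching needed
cd /usr/src/cluster-1.00.00
./configure --kernel_src=/usr/src/linux-2.6.12.2
make && make install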
> > -- > > patrick > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From pcaulfie at redhat.com Tue Aug 2 10:16:54 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 02 Aug 2005 11:16:54 +0100 Subject: [Linux-cluster] Compile Issues In-Reply-To: <359782e7050802030967157c38@mail.gmail.com> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> <359782e705080202484184f323@mail.gmail.com> <42EF44EE.7020804@redhat.com> <359782e7050802030967157c38@mail.gmail.com> Message-ID: <42EF4816.1090200@redhat.com> Q.L wrote: > Hi, Patrick, > > Could you share me the instructions to build against kernel 2.6.12.2? > Certainly: ./configure --kernel_src= make -- patrick From oldmoonster at gmail.com Tue Aug 2 10:23:08 2005 From: oldmoonster at gmail.com (Q.L) Date: Tue, 2 Aug 2005 18:23:08 +0800 Subject: [Linux-cluster] Compile Issues In-Reply-To: <42EF4816.1090200@redhat.com> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> <359782e705080202484184f323@mail.gmail.com> <42EF44EE.7020804@redhat.com> <359782e7050802030967157c38@mail.gmail.com> <42EF4816.1090200@redhat.com> Message-ID: <359782e70508020323165f669@mail.gmail.com> On 8/2/05, Patrick Caulfield wrote: > Q.L wrote: > > Hi, Patrick, > > > > Could you share me the instructions to build against kernel 2.6.12.2? > > > > > Certainly: > > ./configure --kernel_src= > make > > Yes, I know that option, but to me, output just like this, it then make error. [root at buckupy cluster-1.00.00]# ./configure --kernel_src=/usr/src/linux-2.6.9 configure cman-kernel Configuring Makefiles for your system... Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95. configure dlm-kernel Configuring Makefiles for your system... Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95. configure gfs-kernel Configuring Makefiles for your system... Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95. configure gnbd-kernel Configuring Makefiles for your system... Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95. configure magma Configuring Makefiles for your system... Completed Makefile configuration configure ccs Configuring Makefiles for your system... Completed Makefile configuration configure cman Configuring Makefiles for your system... Completed Makefile configuration configure dlm Configuring Makefiles for your system... Completed Makefile configuration configure fence Configuring Makefiles for your system... Completed Makefile configuration configure iddev Configuring Makefiles for your system... Completed Makefile configuration configure gfs Configuring Makefiles for your system... Completed Makefile configuration configure gnbd Configuring Makefiles for your system... Completed Makefile configuration configure gulm Configuring Makefiles for your system... Completed Makefile configuration configure magma-plugins Configuring Makefiles for your system... Completed Makefile configuration configure rgmanager Configuring Makefiles for your system... Completed Makefile configuration Thanks! Q.L > -- > > patrick > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From pcaulfie at redhat.com Tue Aug 2 10:28:33 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 02 Aug 2005 11:28:33 +0100 Subject: [Linux-cluster] Where to go with cman ? 
In-Reply-To: <1122318870.12824.29.camel@localhost.localdomain> References: <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> Message-ID: <42EF4AD1.6010809@redhat.com> Steven Dake wrote: > On Mon, 2005-07-18 at 09:10 +0100, Patrick Caulfield wrote: > >>As I see it there are two things we can do with userland cman that's current in >>the head of CVS: >> >>1. Leave it as it is - a port of the kernel one. This has some benefits: it's >>easy (plus a few bug fixes that need to go in), it's protocol-compatible with >>the kernel one. There are a small number of extra features that could go in >>there (that would, annoyingly, break that compatibility) but nothing really >>serious. It doesn't give us anything new, but what new is neeed ? >> >>2. Migrate it to something much more sophisticated. I've mentioned Virtual >>Synchrony a few times before and I've been looking into this in some detail >>since. The benefits are largely internal but they do provide a reliable, robust >>and well-performing messaging system that other cluster subsystems can use. >>While the application programmers at the cluster summit maintained they had no >>use for a cluster messaging system, I still believe that it is a useful thing to >>have at a lower level - if only for our own programming needs. I know that Jon >>looked into the existing cman messaging system before rejecting it as too slow >>and unreliable for he needs of the cluster mirroring code. >> >>There are two suboptions here. >> a) write it ourself. Quite a big job this. Bigger than I would like. To be >>honest I did make a start at this and now realise just what a huge job it is to >>get something that both performs well and is reliable. REALLY reliable. even >>worse if the academics want something provably reliable. >> b) adopt something else. The obvious candidate here is the openAIS code[1]. >>This looks to be quite mature now and has all the features we need of a low >>level messaging system. It's very nicely abstracted out so we can pick out just >>the bits we need without having the whole (rather heavyweight) system on top of it. >> >>The one problem with the openAIS code is that it doesn't support IPv6, and much >>of the code is tied to IPv4. Having had a look at it and emailed Steven Dake >>about this he reckons it's about 2 weeks work to add.[2] >> >>The advantages of doing this are several. >>- It saves time. We get something that is known to work, even though it needs >>extra features added for our own use. >>- we're not inventing something new that already exists in several other places. >>- we get more people who know the code. Currently only I know the internals of >>cman as it stands and it's quite scary code that people don't want to get >>involved with (we've have several DLM patches in the past, but no CMAN ones). >>This way we get at least 2 (Steven and me) as well as anyone else who is >>following openAIS. Of course there will be CMAN-specific stuff on top of their >>comms layer to make it quorum-based and capable of supporting GFS and DLM that > > > sorry my response is so late I missed this mail while at OLS. > > The quorum problem is commonly referred to in the literature as a > "virtual synchrony filter". I'd love to have some implementations of > virtual synchrony filters that exist within libtotem itself.. > Definately an area of interest for openais as we need some services to > operate only in one partition (like the amf). > > >>will be Red Hat specific but these are not going to be large. 
>>- the APIs are all open (based on SAforum specifications) and already >>implemented. Although adding saCLM to CMAN is pretty easy as I proved last week. >> > > >>The disadvantages are >>- Need to learn the internals of someone else's code. > > > indeed this part is somewhat painful :( > > >>- We don't have full control over the code. Although we can obviously fork it if >>we feel the need it would, obviously be preferable not to. > > > My view is that open source influence is dictated by level of > contribution just like any kind of community. ie: the more a person > contributes the more influence they can exert over a project or > direction. Even as maintainer I don't have full control over the > openais code as the community really decides where we go and what work > we do. > > My point here is that if you are willing to fork, then you probably have > some time to maintain the code.. which is better spent influencing the > current openais tree :) > > >>- non-compatibility with "old" cman, making rolling upgrades har or even >>impossible. I'm not sure what to do about this yet, but it's worth pointing out >>that the DLM has a new line-protocol too. > > > yes upgrades are a real pain. We have not fully tackled this problem in > the openais project yet, because we havn't released a stable version. > Ideally we would like two versions (older, newer) to interoperate, even > if that means uglifying the implementation to coexist with two line > types. We have some work in place to address this problem but before > our first production release I'm planning to really think through > interoperability with new implementations for features of the totem > protocol (like redundant ring, multi ring gateway (for local area > networks), group key generation, multi-ring-bridged (for wide area > networks), etc). > > >>- openAIS is BSD licensed, I don't think this is a problem but it probably needs >>checking. >> > > > Originally I had planned to use spread for openais, but the license was > not compatible with the lawyers "approved list". So we had to implement > a protocol completely from scratch because of the license issue which > took about 1.5 years of work (sigh). I wanted to be sure other projects > could reuse the totem code so chose the most liberal license I could > find. > > >>In short, I'm advocating adopting the openAIS core (libtotem basically) as >>CMAN's communications/membership protocol. If we're going to do a "CMAN V2" that >>has anything significant over V1 then re-inventing it is going to be a huge >>amount of work that someone else has already done. >> >>Comments? >> > > > sounds good Patrick if you need any help from us let us know > Thanks for that Steven. I'm going to make a start on this when I get back from UKUUG next week. I've managed to knock up something that looks like cman from the outside but uses libtotem for it's comms layer so it's looking good. On other thing I need to look into (apart from IPv6) is multi-home. cman had a (primitive) failover system but it's not currently in use by anyone because DLM doesn't support it but I think it's something we need to provide at some stage. Don't worry about the mention of a fork - the chances of it happening are almost nil! 
-- patrick From pcaulfie at redhat.com Tue Aug 2 10:29:46 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 02 Aug 2005 11:29:46 +0100 Subject: [Linux-cluster] Compile Issues In-Reply-To: <359782e70508020323165f669@mail.gmail.com> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> <359782e705080202484184f323@mail.gmail.com> <42EF44EE.7020804@redhat.com> <359782e7050802030967157c38@mail.gmail.com> <42EF4816.1090200@redhat.com> <359782e70508020323165f669@mail.gmail.com> Message-ID: <42EF4B1A.2030405@redhat.com> Q.L wrote: > On 8/2/05, Patrick Caulfield wrote: > >>Q.L wrote: >> >>>Hi, Patrick, >>> >>>Could you share me the instructions to build against kernel 2.6.12.2? >>> >> >> >>Certainly: >> >>./configure --kernel_src= >>make >> >> > > Yes, I know that option, but to me, output just like this, it then make error. > > [root at buckupy cluster-1.00.00]# ./configure --kernel_src=/usr/src/linux-2.6.9 > configure cman-kernel > > Configuring Makefiles for your system... > Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95. > configure dlm-kernel > Have you configured the kernel source using make menuconfig ? -- patrick From oldmoonster at gmail.com Tue Aug 2 10:44:33 2005 From: oldmoonster at gmail.com (Q.L) Date: Tue, 2 Aug 2005 18:44:33 +0800 Subject: [Linux-cluster] Compile Issues In-Reply-To: <42EF4B1A.2030405@redhat.com> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> <359782e705080202484184f323@mail.gmail.com> <42EF44EE.7020804@redhat.com> <359782e7050802030967157c38@mail.gmail.com> <42EF4816.1090200@redhat.com> <359782e70508020323165f669@mail.gmail.com> <42EF4B1A.2030405@redhat.com> Message-ID: <359782e70508020344510f473a@mail.gmail.com> On 8/2/05, Patrick Caulfield wrote: > Q.L wrote: > > On 8/2/05, Patrick Caulfield wrote: > > > >>Q.L wrote: > >> > >>>Hi, Patrick, > >>> > >>>Could you share me the instructions to build against kernel 2.6.12.2? > >>> > >> > >> > >>Certainly: > >> > >>./configure --kernel_src= > >>make > >> > >> > > > > Yes, I know that option, but to me, output just like this, it then make error. > > > > [root at buckupy cluster-1.00.00]# ./configure --kernel_src=/usr/src/linux-2.6.9 > > configure cman-kernel > > > > Configuring Makefiles for your system... > > Can't open /usr/src/linux-2.6.9/include/linux/version.h at ./configure line 95. > > configure dlm-kernel > > > > > Have you configured the kernel source using make menuconfig ? > > Thanks! and this time I ran: # cd /usr/src/linux-2.6.9 # find /path/to/cluster -name '*.patch' | xargs cat | patch -t -p1 it seems ok, but no following options in .config # GFS-specific CONFIG_LOCK_HARNESS=m CONFIG_GFS_FS=m CONFIG_LOCK_NOLOCK=m CONFIG_LOCK_DLM=m CONFIG_LOCK_GULM=m further more, when I run above instructions in linux-2.6.12 source tree, it reports many conflicts... see following: [root at buckupy linux-2.6.12]# find /home/share/cluster-1.00.00 -name '*.patch' | xargs cat | patch -t -p1 patching file arch/alpha/Kconfig Hunk #1 succeeded at 608 (offset 8 lines). patching file arch/arm/Kconfig Hunk #1 succeeded at 744 (offset 54 lines). patching file arch/arm26/Kconfig Hunk #1 FAILED at 222. 1 out of 1 hunk FAILED -- saving rejects to file arch/arm26/Kconfig.rej patching file arch/cris/Kconfig Hunk #1 succeeded at 178 (offset 4 lines). patching file arch/i386/Kconfig Hunk #1 succeeded at 1263 with fuzz 2 (offset 69 lines). patching file arch/ia64/Kconfig Hunk #1 succeeded at 441 (offset 51 lines). 
patching file arch/m68k/Kconfig Hunk #1 succeeded at 668 (offset 13 lines). patching file arch/mips/Kconfig Hunk #1 FAILED at 1563. Thanks!! Q.L > -- > > patrick > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From pcaulfie at redhat.com Tue Aug 2 11:33:06 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 02 Aug 2005 12:33:06 +0100 Subject: [Linux-cluster] Compile Issues In-Reply-To: <359782e70508020344510f473a@mail.gmail.com> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> <359782e705080202484184f323@mail.gmail.com> <42EF44EE.7020804@redhat.com> <359782e7050802030967157c38@mail.gmail.com> <42EF4816.1090200@redhat.com> <359782e70508020323165f669@mail.gmail.com> <42EF4B1A.2030405@redhat.com> <359782e70508020344510f473a@mail.gmail.com> Message-ID: <42EF59F2.2060305@redhat.com> Q.L wrote: > Thanks! and this time I ran: > # cd /usr/src/linux-2.6.9 > # find /path/to/cluster -name '*.patch' | xargs cat | patch -t -p1 two things. 1. I didn't say anything about patching, just ./configure && make 2. I also said it works against a 2.6.12.2 kernel, not 2.6.9 -- patrick From teigland at redhat.com Tue Aug 2 11:47:58 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 2 Aug 2005 19:47:58 +0800 Subject: [Linux-cluster] Compile Issues In-Reply-To: <359782e70508020344510f473a@mail.gmail.com> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> <359782e705080202484184f323@mail.gmail.com> <42EF44EE.7020804@redhat.com> <359782e7050802030967157c38@mail.gmail.com> <42EF4816.1090200@redhat.com> <359782e70508020323165f669@mail.gmail.com> <42EF4B1A.2030405@redhat.com> <359782e70508020344510f473a@mail.gmail.com> Message-ID: <20050802114758.GD11217@redhat.com> > [root at buckupy linux-2.6.12]# find /home/share/cluster-1.00.00 -name > '*.patch' | xargs cat | patch -t -p1 We don't do kernel patches any more, ignore what's there; we'll remove those last few in the next release. You have to build the modules within the cluster tree now. Dave From natecars at natecarlson.com Tue Aug 2 12:37:45 2005 From: natecars at natecarlson.com (Nate Carlson) Date: Tue, 2 Aug 2005 07:37:45 -0500 (CDT) Subject: [Linux-cluster] [PATCH 00/14] GFS In-Reply-To: <20050802071828.GA11217@redhat.com> References: <20050802071828.GA11217@redhat.com> Message-ID: On Tue, 2 Aug 2005, David Teigland wrote: > Hi, GFS (Global File System) is a cluster file system that we'd like to > see added to the kernel. The 14 patches total about 900K so I won't > send them to the list unless that's requested. Comments and suggestions > are welcome. Thanks > > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch > http://redhat.com/~teigland/gfs2/20050801/broken-out/ I see that these patches are for GFS2.. does this mean that GFS2 is ready for prime time? ------------------------------------------------------------------------ | nate carlson | natecars at natecarlson.com | http://www.natecarlson.com | | depriving some poor village of its idiot since 1981 | ------------------------------------------------------------------------ From djani22 at dynamicweb.hu Tue Aug 2 16:16:15 2005 From: djani22 at dynamicweb.hu (djani22 at dynamicweb.hu) Date: Tue, 2 Aug 2005 18:16:15 +0200 Subject: [Linux-cluster] gnbd question References: <1122972593.8420.7.camel@localhost.localdomain> Message-ID: <010c01c5977d$8f2e2780$0400a8c0@LocalHost> Hi all, I want to know is there a way to speed up gnbd-server? 
I have a "big" (~8TB) free web store based on (g)nbd. (4 server + 1 client) When I try to read linear the one big device, it can generate 380-400 Mb/s transfers on Gig-Eth. But when I start the web serving from it, the result is only 80-90Mb/s , and the client's load goes up to 100-150! This is the best performance what I can set up and the settings are these: Scheduler: deadline nr_requests: 255 read_ahead_kb: 0 iosched: front_merges: 0 read_expire: 50 The all server's load is always ~1.00 (0.95-1.14). Is there a way to stress more the servers? (And with small modification the code?) GNBD 1.0.0 no cluster, fs: xfs I try the nbd, enbd, anbd, and gnbd, but the gnbd's stability is the best! ;-) The another nbds always generate a deadlock for me... (Sorry for my english) Thanks Janos Haar (Hungary) From haydar2906 at hotmail.com Tue Aug 2 17:08:27 2005 From: haydar2906 at hotmail.com (haydar Ali) Date: Tue, 02 Aug 2005 13:08:27 -0400 Subject: [Linux-cluster] GFS : important questions Message-ID: Hi, Now, we have 3 servers HP Proliant 380 G3 (RedHat Advanced Server 3) attached by 2 fiber channels each to the storage area network SAN HP MSA1000 and we want to install and configure GFS to allow 2 servers to simultaneously read and write to a single shared file system (Word documents located into /u04) located on the Storage area network SAN HP MSA1000. I know that I have to install GFS on the 3 nodes and one of them will be a master and the 2 others will be the slaves. My questions are: 1 - If one of the GFS slaves RAC1 or RAC2 is down, will be the other slave server able to access to the shared file system /u04? 2 - If the master is down, will be the both GFS slave servers able to access to the shared file system /u04 simultaneously? Thanks for your help Cheers! Haydar From natecars at natecarlson.com Tue Aug 2 18:50:17 2005 From: natecars at natecarlson.com (Nate Carlson) Date: Tue, 2 Aug 2005 13:50:17 -0500 (CDT) Subject: [Linux-cluster] CLVM and Snapshots? Message-ID: Hey all, I'm curious if it should be possible to take a snapshot of a filesystem sitting on top of CLVM (*not* a shared filesystem like GFS; something like XFS or ext3, but still on a shared block device). I recall trying it a couple weeks ago, and it failing miserably, but don't recall why, and figured I'd ask if it should work before experimenting again. If it should work, a few questions: - Should I be able to create the snapshot on any node, or just the node that is using the LV that I want to create a snapshot of? - Is the syntax identical to normal linux snapshots? Thanks! ------------------------------------------------------------------------ | nate carlson | natecars at natecarlson.com | http://www.natecarlson.com | | depriving some poor village of its idiot since 1981 | ------------------------------------------------------------------------ From natecars at natecarlson.com Tue Aug 2 19:10:25 2005 From: natecars at natecarlson.com (Nate Carlson) Date: Tue, 2 Aug 2005 14:10:25 -0500 (CDT) Subject: [Linux-cluster] CLVM and Snapshots? In-Reply-To: References: Message-ID: On Tue, 2 Aug 2005, Nate Carlson wrote: > I'm curious if it should be possible to take a snapshot of a filesystem > sitting on top of CLVM (*not* a shared filesystem like GFS; something > like XFS or ext3, but still on a shared block device). I recall trying > it a couple weeks ago, and it failing miserably, but don't recall why, > and figured I'd ask if it should work before experimenting again. 
> > If it should work, a few questions: > - Should I be able to create the snapshot on any node, or just the node > that is using the LV that I want to create a snapshot of? > - Is the syntax identical to normal linux snapshots? > > Thanks! OK, decided to try it.. here's what I get: xen1:~# lvcreate -L 1G -s -n snaptest /dev/XenSystemDisks/iron Error locking on node nitrogen: Internal lvm error, check syslog Aborting. Failed to activate snapshot exception store. Remove new LV and retry. Error on Nitrogen: Aug 2 14:09:05 nitrogen lvm[295]: Volume group for uuid not found: doCTDCC376pE3g2JA35fNAieVNTWpAC1B3UIGOGQpn5Um5FcmPsa8yQPa0p9o9ZP Nitrogen does not have access to the PV that this snapshot is on, which could be part of the problem. ------------------------------------------------------------------------ | nate carlson | natecars at natecarlson.com | http://www.natecarlson.com | | depriving some poor village of its idiot since 1981 | ------------------------------------------------------------------------ From alewis at redhat.com Tue Aug 2 19:14:33 2005 From: alewis at redhat.com (AJ Lewis) Date: Tue, 2 Aug 2005 14:14:33 -0500 Subject: [Linux-cluster] CLVM and Snapshots? In-Reply-To: References: Message-ID: <20050802191433.GP4954@null.msp.redhat.com> On Tue, Aug 02, 2005 at 02:10:25PM -0500, Nate Carlson wrote: > On Tue, 2 Aug 2005, Nate Carlson wrote: > >I'm curious if it should be possible to take a snapshot of a filesystem > >sitting on top of CLVM (*not* a shared filesystem like GFS; something > >like XFS or ext3, but still on a shared block device). I recall trying > >it a couple weeks ago, and it failing miserably, but don't recall why, > >and figured I'd ask if it should work before experimenting again. > > > >If it should work, a few questions: > >- Should I be able to create the snapshot on any node, or just the node > > that is using the LV that I want to create a snapshot of? > >- Is the syntax identical to normal linux snapshots? > > > >Thanks! > > OK, decided to try it.. here's what I get: > > xen1:~# lvcreate -L 1G -s -n snaptest /dev/XenSystemDisks/iron > Error locking on node nitrogen: Internal lvm error, check syslog > Aborting. Failed to activate snapshot exception store. Remove new LV and > retry. > > Error on Nitrogen: > > Aug 2 14:09:05 nitrogen lvm[295]: Volume group for uuid not found: > doCTDCC376pE3g2JA35fNAieVNTWpAC1B3UIGOGQpn5Um5FcmPsa8yQPa0p9o9ZP > > Nitrogen does not have access to the PV that this snapshot is on, which > could be part of the problem. ...yeah, you can't do this. How is the node is managing the snapshot going to know that blocks changed on the node with the origin? You need cluster snapshots, which aren't finished yet. Do *not* do this - you will seriously screw yourself over. -- AJ Lewis Voice: 612-638-0500 Red Hat E-Mail: alewis at redhat.com One Main Street SE, Suite 209 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From natecars at natecarlson.com Tue Aug 2 19:14:52 2005 From: natecars at natecarlson.com (Nate Carlson) Date: Tue, 2 Aug 2005 14:14:52 -0500 (CDT) Subject: [Linux-cluster] CLVM and Snapshots? In-Reply-To: References: Message-ID: On Tue, 2 Aug 2005, Nate Carlson wrote: > OK, decided to try it.. 
here's what I get: > > xen1:~# lvcreate -L 1G -s -n snaptest /dev/XenSystemDisks/iron > Error locking on node nitrogen: Internal lvm error, check syslog > Aborting. Failed to activate snapshot exception store. Remove new LV and > retry. > > Error on Nitrogen: > > Aug 2 14:09:05 nitrogen lvm[295]: Volume group for uuid not found: > doCTDCC376pE3g2JA35fNAieVNTWpAC1B3UIGOGQpn5Um5FcmPsa8yQPa0p9o9ZP > > Nitrogen does not have access to the PV that this snapshot is on, which could > be part of the problem. Shut down Nitrogen and the rest of the nodes that do not have access to that PV, and get: xen1:~# lvcreate -L 1G -s -n snaptest /dev/XenSystemDisks/iron Error locking on node xen2: Internal lvm error, check syslog Error locking on node xen1: Internal lvm error, check syslog Problem reactivating origin iron xen1 dmesg: [ 491.091753] clvmd: page allocation failure. order:0, mode:0xd0 [ 491.168600] [] __alloc_pages+0x2b3/0x430 [ 491.232788] [] kmem_cache_alloc+0x69/0x70 [ 491.298194] [] alloc_pl+0x34/0x60 [dm_mod] [ 491.364848] [] client_alloc_pages+0x25/0x60 [dm_mod] [ 491.442834] [] kcopyd_client_create+0x6d/0xc0 [dm_mod] [ 491.523111] [] snapshot_ctr+0x2d5/0x3b0 [dm_snapshot] [ 491.602352] [] dm_table_add_target+0x149/0x200 [dm_mod] [ 491.683980] [] populate_table+0x90/0xf0 [dm_mod] [ 491.757601] [] table_load+0x68/0x170 [dm_mod] [ 491.827788] [] ctl_ioctl+0xf9/0x160 [dm_mod] [ 491.896622] [] table_load+0x0/0x170 [dm_mod] [ 491.965460] [] do_ioctl+0x58/0x80 [ 492.021717] [] vfs_ioctl+0x65/0x1e0 [ 492.080366] [] sys_ioctl+0x67/0x90 [ 492.137764] [] syscall_call+0x7/0xb [ 492.199824] device-mapper: Could not create kcopyd client [ 492.271951] device-mapper: error adding target to table xen2 dmesg: [ 497.254501] clvmd: page allocation failure. order:0, mode:0xd0 [ 497.331260] [] __alloc_pages+0x2b3/0x430 [ 497.395454] [] kmem_cache_alloc+0x69/0x70 [ 497.460858] [] alloc_pl+0x34/0x60 [dm_mod] [ 497.527518] [] client_alloc_pages+0x25/0x60 [dm_mod] [ 497.605612] [] kcopyd_client_create+0x6d/0xc0 [dm_mod] [ 497.685994] [] snapshot_ctr+0x2d5/0x3b0 [dm_snapshot] [ 497.765238] [] dm_table_add_target+0x149/0x200 [dm_mod] [ 497.846871] [] populate_table+0x90/0xf0 [dm_mod] [ 497.920384] [] table_load+0x68/0x170 [dm_mod] [ 497.990365] [] ctl_ioctl+0xf9/0x160 [dm_mod] [ 498.059305] [] table_load+0x0/0x170 [dm_mod] [ 498.128602] [] do_ioctl+0x58/0x80 [ 498.185030] [] vfs_ioctl+0x65/0x1e0 [ 498.243674] [] sys_ioctl+0x67/0x90 [ 498.301075] [] syscall_call+0x7/0xb [ 498.363072] device-mapper: Could not create kcopyd client [ 498.434127] device-mapper: error adding target to table ------------------------------------------------------------------------ | nate carlson | natecars at natecarlson.com | http://www.natecarlson.com | | depriving some poor village of its idiot since 1981 | ------------------------------------------------------------------------ From natecars at natecarlson.com Tue Aug 2 19:21:26 2005 From: natecars at natecarlson.com (Nate Carlson) Date: Tue, 2 Aug 2005 14:21:26 -0500 (CDT) Subject: [Linux-cluster] CLVM and Snapshots? In-Reply-To: <20050802191433.GP4954@null.msp.redhat.com> References: <20050802191433.GP4954@null.msp.redhat.com> Message-ID: On Tue, 2 Aug 2005, AJ Lewis wrote: > ...yeah, you can't do this. How is the node is managing the snapshot > going to know that blocks changed on the node with the origin? You need > cluster snapshots, which aren't finished yet. Do *not* do this - you > will seriously screw yourself over. Good to know - I won't try doing that anymore, then. 
:) The docs on the cluster snapshot page seemed to indicate that it was only necessary for clustered file systems - guess not! ------------------------------------------------------------------------ | nate carlson | natecars at natecarlson.com | http://www.natecarlson.com | | depriving some poor village of its idiot since 1981 | ------------------------------------------------------------------------ From keith at clearpathit.com Tue Aug 2 19:24:18 2005 From: keith at clearpathit.com (Keith Grammer) Date: Tue, 2 Aug 2005 14:24:18 -0500 Subject: [Linux-cluster] CLVM and Snapshots? In-Reply-To: Message-ID: <0MKyxe-1E02N92ISu-0007Z6@mrelay.perfora.net> Please do not the broadcast in this manor. Keith Grammer Partner ClearPath IT LLC 713-344-0232 keith at clearpathit.com -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Nate Carlson Sent: Tuesday, August 02, 2005 2:21 PM To: linux clustering Subject: Re: [Linux-cluster] CLVM and Snapshots? On Tue, 2 Aug 2005, AJ Lewis wrote: > ...yeah, you can't do this. How is the node is managing the snapshot > going to know that blocks changed on the node with the origin? You need > cluster snapshots, which aren't finished yet. Do *not* do this - you > will seriously screw yourself over. Good to know - I won't try doing that anymore, then. :) The docs on the cluster snapshot page seemed to indicate that it was only necessary for clustered file systems - guess not! ------------------------------------------------------------------------ | nate carlson | natecars at natecarlson.com | http://www.natecarlson.com | | depriving some poor village of its idiot since 1981 | ------------------------------------------------------------------------ -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From JACOB_LIBERMAN at Dell.com Tue Aug 2 19:29:31 2005 From: JACOB_LIBERMAN at Dell.com (JACOB_LIBERMAN at Dell.com) Date: Tue, 2 Aug 2005 14:29:31 -0500 Subject: [Linux-cluster] GF support across architectures Message-ID: I have 3 hosts: an IA32, IA64, and an EM64T that I would like to cluster. Can 3 RHEL 4 hosts running with different architectures participate in the same cluster? Can they both access the same GFS 6.1 file system? Many thanks, Jacob From dawson at fnal.gov Tue Aug 2 21:31:37 2005 From: dawson at fnal.gov (Troy Dawson) Date: Tue, 02 Aug 2005 16:31:37 -0500 Subject: [Linux-cluster] GF support across architectures In-Reply-To: References: Message-ID: <42EFE639.4000805@fnal.gov> I have combination of i686 and x86_64 machines all accessing the same GFS file system, and they all seem to be happy. There hasn't been any architecture problems at least. So I would say yes they can. Troy JACOB_LIBERMAN at Dell.com wrote: > I have 3 hosts: an IA32, IA64, and an EM64T that I would like to > cluster. > > Can 3 RHEL 4 hosts running with different architectures participate in > the same cluster? > > Can they both access the same GFS 6.1 file system? 
> > Many thanks, Jacob > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- __________________________________________________ Troy Dawson dawson at fnal.gov (630)840-6468 Fermilab ComputingDivision/CSS CSI Group __________________________________________________ From anu.matthew at bms.com Tue Aug 2 23:20:55 2005 From: anu.matthew at bms.com (Anu Matthew) Date: Tue, 02 Aug 2005 19:20:55 -0400 Subject: [Linux-cluster] IP alias over a bonded interface Message-ID: <42EFFFD7.7070703@bms.com> Greetings..!! I have eth0 and eth1 bonded together as bond0 -- Works good, everything is fine, redundancy is good etc. I created an alias bond0:0 and assigned another IP in the same subnet to it. This new IP is now pingable from other systems, other subnets etc. Okay, here it gets interesting: Traceroutes to any hosts from bond0 succeeds, but traceroutes using bond0:0 fails. [root at linux1 root]# traceroute linux12 -i bond0 traceroute to linux12 (A.B.C.D), 30 hops max, 38 byte packets 1 linux12 (A.B.C.D) 0.067 ms 0.029 ms 0.019 ms [root at linux1 root]# traceroute linux12 -i bond0:0 setsockopt: No such device unable to bind to device: bond0:0 Any ideas on why I cannot traceroute using bond0:0? (BTW, if it were on a non_bonded interface, say, eth0:0, it works.). Thanks in advance, --~~AM From rajkum2002 at rediffmail.com Wed Aug 3 02:34:10 2005 From: rajkum2002 at rediffmail.com (Raj Kumar) Date: 3 Aug 2005 02:34:10 -0000 Subject: [Linux-cluster] GFS : important questions Message-ID: <20050803023410.12171.qmail@webmail46.rediffmail.com> >My questions are: >1 - If one of the GFS slaves RAC1 or RAC2 is down, will be the other slave server able to access to the shared file system /u04? If the slave server was shutdown cleanly, the other slave server will be able to access the shared file system. However, if the slave was fenced off and was not rebooted the locks it holds will not be released. In that case the other slave server may not be able to access all files on /u04. If you have ilo or rilo in your HP servers you can use remote ilo fencing agent. >2 - If the master is down, will be the both GFS slave servers able to access to the shared file system /u04 simultaneously? No. The slave servers will not be able to access the shared file system if the master is down! this is my two cents knowledge... experts please add to this!! Good luck! -------------- next part -------------- An HTML attachment was scrubbed... URL: From haydar2906 at hotmail.com Wed Aug 3 02:54:30 2005 From: haydar2906 at hotmail.com (haydar Ali) Date: Tue, 02 Aug 2005 22:54:30 -0400 Subject: [Linux-cluster] GFS : important questions In-Reply-To: <20050803023410.12171.qmail@webmail46.rediffmail.com> Message-ID: Thanks Raj Very Kind Haydar >From: "Raj Kumar" >Reply-To: "Raj Kumar" >To: "linux clustering" >CC: "haydar Ali" >Subject: Re: [Linux-cluster] GFS : important questions >Date: 3 Aug 2005 02:34:10 -0000 > > > >My questions are: > >1 - If one of the GFS slaves RAC1 or RAC2 is down, will be the other >slave server able to access to the shared file system /u04? >If the slave server was shutdown cleanly, the other slave server will be >able to access the shared file system. However, if the slave was fenced off >and was not rebooted the locks it holds will not be released. In that case >the other slave server may not be able to access all files on /u04. If you >have ilo or rilo in your HP servers you can use remote ilo fencing agent. 
> > >2 - If the master is down, will be the both GFS slave servers able to >access to the shared file system /u04 simultaneously? >No. The slave servers will not be able to access the shared file system if >the master is down! > >this is my two cents knowledge... experts please add to this!! > >Good luck! From teigland at redhat.com Wed Aug 3 03:56:18 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 3 Aug 2005 11:56:18 +0800 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <1122968724.3247.22.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> Message-ID: <20050803035618.GB9812@redhat.com> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote: > * The on disk structures are defined in terms of uint32_t and friends, > which are NOT endian neutral. Why are they not le32/be32 and thus > endian-defined? Did you run bitwise-sparse on GFS yet ? GFS has had proper endian handling for many years, it's still correct as far as we've been able to test. I ran bitwise-sparse yesterday and didn't find anything alarming. > * None of your on disk structures are packet. Are you sure? Quite, particular attention has been paid to aligning the structure fields, you'll find "pad" fields throughout. We'll write a quick test to verify that packing doesn't change anything. > +#define gfs2_16_to_cpu be16_to_cpu > +#define gfs2_32_to_cpu be32_to_cpu > +#define gfs2_64_to_cpu be64_to_cpu > > why this pointless abstracting? #ifdef GFS2_ENDIAN_BIG #define gfs2_16_to_cpu be16_to_cpu #define gfs2_32_to_cpu be32_to_cpu #define gfs2_64_to_cpu be64_to_cpu #define cpu_to_gfs2_16 cpu_to_be16 #define cpu_to_gfs2_32 cpu_to_be32 #define cpu_to_gfs2_64 cpu_to_be64 #else /* GFS2_ENDIAN_BIG */ #define gfs2_16_to_cpu le16_to_cpu #define gfs2_32_to_cpu le32_to_cpu #define gfs2_64_to_cpu le64_to_cpu #define cpu_to_gfs2_16 cpu_to_le16 #define cpu_to_gfs2_32 cpu_to_le32 #define cpu_to_gfs2_64 cpu_to_le64 #endif /* GFS2_ENDIAN_BIG */ The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE on-disk instead of LE which is another useful way to verify endian correctness. You should be able to use gfs in mixed architecture and mixed endian clusters. We don't have a mixed endian cluster to test, though. > * +static const uint32_t crc_32_tab[] = ..... > why do you duplicate this? The kernel has a perfectly good set of generic > crc32 tables/functions just fine We'll try them, they'll probably do fine. > * Why use your own journalling layer and not say ... jbd ? Here's an analysis of three approaches to cluster-fs journaling and their pros/cons (including using jbd): http://tinyurl.com/7sbqq > * + while (!kthread_should_stop()) { > + gfs2_scand_internal(sdp); > + > + set_current_state(TASK_INTERRUPTIBLE); > + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ); > + } > > you probably really want to check for signals if you do interruptible sleeps I don't know why we'd be interested in signals here. > * why not use msleep() and friends instead of schedule_timeout(), you're > not using the complex variants anyway When unmounting we really appreciate waking up more often than the timeout, otherwise the unmount sits and waits for the longest daemon's msleep to complete. I converted this to msleep recently but it was too painful and had to go back. We'll get to your other comments, thanks. 
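
(On the packing point above: the "quick test" I have in mind is just a set of
compile-time size/offset asserts, along these lines -- the struct and the
numbers here are made up for illustration, not the real gfs2 on-disk layout:

#include <stddef.h>
#include <stdint.h>

struct example_ondisk {
	uint32_t magic;
	uint32_t type;
	uint64_t blkno;
	uint32_t flags;
	uint32_t pad;
};

/* fails to compile if the size or an offset isn't what we expect */
#define ASSERT_CONST(name, expr) typedef char assert_##name[(expr) ? 1 : -1]
ASSERT_CONST(size, sizeof(struct example_ondisk) == 24);
ASSERT_CONST(blkno_off, offsetof(struct example_ondisk, blkno) == 8);

Comparing the results with and without __attribute__((packed)) on each real
structure should tell us whether the manual padding really keeps them
identical.)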
Dave From oldmoonster at gmail.com Wed Aug 3 05:21:41 2005 From: oldmoonster at gmail.com (Q.L) Date: Wed, 3 Aug 2005 13:21:41 +0800 Subject: [Linux-cluster] Compile Issues In-Reply-To: <20050802114758.GD11217@redhat.com> References: <889A47B16278164FB657E0FFB1CAB8C7CB0C3C@hq-exchange.ccbill-hq.local> <359782e705080202484184f323@mail.gmail.com> <42EF44EE.7020804@redhat.com> <359782e7050802030967157c38@mail.gmail.com> <42EF4816.1090200@redhat.com> <359782e70508020323165f669@mail.gmail.com> <42EF4B1A.2030405@redhat.com> <359782e70508020344510f473a@mail.gmail.com> <20050802114758.GD11217@redhat.com> Message-ID: <359782e705080222217bf4f733@mail.gmail.com> I began to love GFS more and more. :-) Thanks, Q.L On 8/2/05, David Teigland wrote: > > [root at buckupy linux-2.6.12]# find /home/share/cluster-1.00.00 -name > > '*.patch' | xargs cat | patch -t -p1 > > We don't do kernel patches any more, ignore what's there; we'll remove > those last few in the next release. You have to build the modules within > the cluster tree now. > > Dave > > From teigland at redhat.com Wed Aug 3 06:36:44 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 3 Aug 2005 14:36:44 +0800 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <84144f0205080203163cab015c@mail.gmail.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> Message-ID: <20050803063644.GD9812@redhat.com> On Tue, Aug 02, 2005 at 01:16:53PM +0300, Pekka Enberg wrote: > > +void *gmalloc_nofail_real(unsigned int size, int flags, char *file, > > + unsigned int line) > > +{ > > + void *x; > > + for (;;) { > > + x = kmalloc(size, flags); > > + if (x) > > + return x; > > + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { > > + printk("GFS2: out of memory: %s, %u\n", > > + __FILE__, __LINE__); > > + gfs2_malloc_warning = jiffies; > > + } > > + yield(); > > This does not belong in a filesystem. It also seems like a very bad > idea. What are you trying to do here? If you absolutely must not fail, > use __GFP_NOFAIL instead. will do, carried over from before NOFAIL existed > -mm has memory leak detection patches and there are others floating > around. Please do not introduce yet another subsystem-specific debug > allocator. ok, thanks Dave From teigland at redhat.com Wed Aug 3 10:08:49 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 3 Aug 2005 18:08:49 +0800 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <1123060630.3363.10.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050803035618.GB9812@redhat.com> <1123060630.3363.10.camel@laptopd505.fenrus.org> Message-ID: <20050803100849.GE9812@redhat.com> On Wed, Aug 03, 2005 at 11:17:09AM +0200, Arjan van de Ven wrote: > On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote: > > The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE > > on-disk instead of LE which is another useful way to verify endian > > correctness. > > that sounds wrong to be a compile option. If you really want to deal > with dual disk endianness it really ought to be a runtime one (see jffs2 > for example). We don't want BE to be an "option" per se; as developers we'd just like to be able to compile it that way to verify gfs's endianness handling. If you think that's unmaintainable or a bad idea we'll rip it out. 
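For reference, the le32/be32 style Arjan is asking about looks something like the sketch below (hypothetical field names, not GFS structures). The endianness then lives in the type itself, so a sparse run with endian checking enabled flags any access that forgets a conversion, without needing a big-endian build of the filesystem:

#include <linux/types.h>
#include <asm/byteorder.h>

/* Hypothetical on-disk structure: every field carries an explicit
 * little-endian type, so sparse can verify each access. */
struct example_ondisk {
	__le32 od_magic;
	__le32 od_flags;
	__le64 od_blkno;
};

static inline u64 example_get_blkno(const struct example_ondisk *od)
{
	return le64_to_cpu(od->od_blkno);	/* conversion checked by sparse */
}

static inline void example_set_blkno(struct example_ondisk *od, u64 blkno)
{
	od->od_blkno = cpu_to_le64(blkno);
}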
> > > * + while (!kthread_should_stop()) { > > > + gfs2_scand_internal(sdp); > > > + > > > + set_current_state(TASK_INTERRUPTIBLE); > > > + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ); > > > > > > you probably really want to check for signals if you do > > > interruptible sleeps > > > > I don't know why we'd be interested in signals here. > > well.. because if you don't your schedule_timeout becomes a nop when you > get one, which makes your loop a busy waiting one. OK, it looks like we need to block/flush signals a la daemonize(); I guess I mistakenly figured the kthread routines did everything daemonize did. Thanks, Dave From lmb at suse.de Wed Aug 3 10:37:44 2005 From: lmb at suse.de (Lars Marowsky-Bree) Date: Wed, 3 Aug 2005 12:37:44 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050803035618.GB9812@redhat.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050803035618.GB9812@redhat.com> Message-ID: <20050803103744.GG11081@marowsky-bree.de> On 2005-08-03T11:56:18, David Teigland wrote: > > * Why use your own journalling layer and not say ... jbd ? > Here's an analysis of three approaches to cluster-fs journaling and their > pros/cons (including using jbd): http://tinyurl.com/7sbqq Very instructive read, thanks for the link. -- High Availability & Clustering SUSE Labs, Research and Development SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin "Ignorance more frequently begets confidence than does knowledge" From addi at hugsmidjan.is Wed Aug 3 11:58:47 2005 From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=) Date: Wed, 03 Aug 2005 11:58:47 +0000 Subject: [Linux-cluster] Fencing agents Message-ID: <42F0B177.7050907@hugsmidjan.is> I'm implementing a shared storage between multiple (2 at the moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES connected to a EMC AX100 through FC. The SAN has two FC ports so the need for a FC Switch has not yet come however we will add other Blades in the coming months. The one thing I haven't got figured out with GFS and the Cluster-Suite is the whole idea about fencing. We have a working setup using Centos rebuilds of the Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which we are not planning to use in the final implementation where we plan to use the official GFS packages from Red Hat. The fencing agents in that setup is manual fencing. Both machines have the file system mounted and there appears to be no problems. What does "automatic" fencing have to offer that the manual fencing lacks. If we decide to buy the FC switch right away is it recomended that we buy one of the ones that have fencing agent available for the Cluster-Suite ? If can't get our hands on supported FC switchs can we do fencing in another manner than throught a FC switch ? -- S?valdur Gunnarsson :: Hugsmi?jan From arjan at infradead.org Tue Aug 2 07:45:24 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Tue, 02 Aug 2005 09:45:24 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050802071828.GA11217@redhat.com> References: <20050802071828.GA11217@redhat.com> Message-ID: <1122968724.3247.22.camel@laptopd505.fenrus.org> On Tue, 2005-08-02 at 15:18 +0800, David Teigland wrote: > Hi, GFS (Global File System) is a cluster file system that we'd like to > see added to the kernel. The 14 patches total about 900K so I won't send > them to the list unless that's requested. 
Comments and suggestions are > welcome. Thanks > > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch > http://redhat.com/~teigland/gfs2/20050801/broken-out/ * The on disk structures are defined in terms of uint32_t and friends, which are NOT endian neutral. Why are they not le32/be32 and thus endian-defined? Did you run bitwise-sparse on GFS yet ? * None of your on disk structures are packet. Are you sure? * +#define gfs2_16_to_cpu be16_to_cpu +#define gfs2_32_to_cpu be32_to_cpu +#define gfs2_64_to_cpu be64_to_cpu why this pointless abstracting? * +static const uint32_t crc_32_tab[] = ..... why do you duplicate this? The kernel has a perfectly good set of generic crc32 tables/functions just fine * Why are you using bufferheads extensively in a new filesystem? * + if (create) + down_write(&ip->i_rw_mutex); + else + down_read(&ip->i_rw_mutex); why do you use a rwsem and not a regular semaphore? You are aware that rwsems are far more expensive than regular ones right? How skewed is the read/write ratio? * Why use your own journalling layer and not say ... jbd ? * + while (!kthread_should_stop()) { + gfs2_scand_internal(sdp); + + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ); + } you probably really want to check for signals if you do interruptible sleeps (multiple places) * why not use msleep() and friends instead of schedule_timeout(), you're not using the complex variants anyway * +++ b/fs/gfs2/fixed_div64.h 2005-08-01 14:13:08.009808200 +0800 ehhhh why? * int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset, + unsigned int size) +{ + int error; + + if (bh) + error = copy_to_user(*buf, bh->b_data + offset, size); + else + error = clear_user(*buf, size); that looks to be missing a few kmaps.. whats the guarantee that b_data is actually, like in lowmem? * [PATCH 08/14] GFS: diaper device The diaper device is a block device within gfs that gets transparently inserted between the real device the and rest of the filesystem. hmmmm why not use device mapper or something? Is this really needed? Should it live in drivers/block ? Doesn't this wrapper just increase the risk for memory deadlocks? * [PATCH 06/14] GFS: logging and recovery quoting the ren and stimpy show is nice.. but did the ren ans stimpy authors agree to license their stuff under the GPL? * do_lock_wait that almost screams for using wait_event and related APIs * +static inline void gfs2_log_lock(struct gfs2_sbd *sdp) +{ + spin_lock(&sdp->sd_log_lock); +} why the abstraction ? From arjan at infradead.org Tue Aug 2 07:45:24 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Tue, 02 Aug 2005 09:45:24 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050802071828.GA11217@redhat.com> References: <20050802071828.GA11217@redhat.com> Message-ID: <1122968724.3247.22.camel@laptopd505.fenrus.org> On Tue, 2005-08-02 at 15:18 +0800, David Teigland wrote: > Hi, GFS (Global File System) is a cluster file system that we'd like to > see added to the kernel. The 14 patches total about 900K so I won't send > them to the list unless that's requested. Comments and suggestions are > welcome. Thanks > > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch > http://redhat.com/~teigland/gfs2/20050801/broken-out/ * The on disk structures are defined in terms of uint32_t and friends, which are NOT endian neutral. Why are they not le32/be32 and thus endian-defined? Did you run bitwise-sparse on GFS yet ? * None of your on disk structures are packet. 
Are you sure? * +#define gfs2_16_to_cpu be16_to_cpu +#define gfs2_32_to_cpu be32_to_cpu +#define gfs2_64_to_cpu be64_to_cpu why this pointless abstracting? * +static const uint32_t crc_32_tab[] = ..... why do you duplicate this? The kernel has a perfectly good set of generic crc32 tables/functions just fine * Why are you using bufferheads extensively in a new filesystem? * + if (create) + down_write(&ip->i_rw_mutex); + else + down_read(&ip->i_rw_mutex); why do you use a rwsem and not a regular semaphore? You are aware that rwsems are far more expensive than regular ones right? How skewed is the read/write ratio? * Why use your own journalling layer and not say ... jbd ? * + while (!kthread_should_stop()) { + gfs2_scand_internal(sdp); + + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ); + } you probably really want to check for signals if you do interruptible sleeps (multiple places) * why not use msleep() and friends instead of schedule_timeout(), you're not using the complex variants anyway * +++ b/fs/gfs2/fixed_div64.h 2005-08-01 14:13:08.009808200 +0800 ehhhh why? * int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset, + unsigned int size) +{ + int error; + + if (bh) + error = copy_to_user(*buf, bh->b_data + offset, size); + else + error = clear_user(*buf, size); that looks to be missing a few kmaps.. whats the guarantee that b_data is actually, like in lowmem? * [PATCH 08/14] GFS: diaper device The diaper device is a block device within gfs that gets transparently inserted between the real device the and rest of the filesystem. hmmmm why not use device mapper or something? Is this really needed? Should it live in drivers/block ? Doesn't this wrapper just increase the risk for memory deadlocks? * [PATCH 06/14] GFS: logging and recovery quoting the ren and stimpy show is nice.. but did the ren ans stimpy authors agree to license their stuff under the GPL? * do_lock_wait that almost screams for using wait_event and related APIs * +static inline void gfs2_log_lock(struct gfs2_sbd *sdp) +{ + spin_lock(&sdp->sd_log_lock); +} why the abstraction ? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo at vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ From penberg at gmail.com Tue Aug 2 10:16:53 2005 From: penberg at gmail.com (Pekka Enberg) Date: Tue, 2 Aug 2005 13:16:53 +0300 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050802071828.GA11217@redhat.com> References: <20050802071828.GA11217@redhat.com> Message-ID: <84144f0205080203163cab015c@mail.gmail.com> Hi David, On 8/2/05, David Teigland wrote: > Hi, GFS (Global File System) is a cluster file system that we'd like to > see added to the kernel. The 14 patches total about 900K so I won't send > them to the list unless that's requested. Comments and suggestions are > welcome. Thanks > +#define kmalloc_nofail(size, flags) \ > + gmalloc_nofail((size), (flags), __FILE__, __LINE__) [snip] > +void *gmalloc_nofail_real(unsigned int size, int flags, char *file, > + unsigned int line) > +{ > + void *x; > + for (;;) { > + x = kmalloc(size, flags); > + if (x) > + return x; > + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { > + printk("GFS2: out of memory: %s, %u\n", > + __FILE__, __LINE__); > + gfs2_malloc_warning = jiffies; > + } > + yield(); This does not belong in a filesystem. 
It also seems like a very bad idea. What are you trying to do here? If you absolutely must not fail, use __GFP_NOFAIL instead. > + } > +} > + > +#if defined(GFS2_MEMORY_SIMPLE) > + > +atomic_t gfs2_memory_count; > + > +void gfs2_memory_add_i(void *data, char *file, unsigned int line) > +{ > + atomic_inc(&gfs2_memory_count); > +} > + > +void gfs2_memory_rm_i(void *data, char *file, unsigned int line) > +{ > + if (data) > + atomic_dec(&gfs2_memory_count); > +} > + > +void *gmalloc(unsigned int size, int flags, char *file, unsigned int line) > +{ > + void *data = kmalloc(size, flags); > + if (data) > + atomic_inc(&gfs2_memory_count); > + return data; > +} > + > +void *gmalloc_nofail(unsigned int size, int flags, char *file, > + unsigned int line) > +{ > + atomic_inc(&gfs2_memory_count); > + return gmalloc_nofail_real(size, flags, file, line); > +} > + > +void gfree(void *data, char *file, unsigned int line) > +{ > + if (data) { > + atomic_dec(&gfs2_memory_count); > + kfree(data); > + } > +} -mm has memory leak detection patches and there are others floating around. Please do not introduce yet another subsystem-specific debug allocator. Pekka From jengelh at linux01.gwdg.de Tue Aug 2 14:57:11 2005 From: jengelh at linux01.gwdg.de (Jan Engelhardt) Date: Tue, 2 Aug 2005 16:57:11 +0200 (MEST) Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <1122968724.3247.22.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> Message-ID: >* Why use your own journalling layer and not say ... jbd ? Why does reiser use its own journalling layer and not say ... jbd ? Jan Engelhardt -- From arjan at infradead.org Tue Aug 2 15:02:52 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Tue, 02 Aug 2005 17:02:52 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> Message-ID: <1122994972.3247.31.camel@laptopd505.fenrus.org> On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote: > >* Why use your own journalling layer and not say ... jbd ? > > Why does reiser use its own journalling layer and not say ... jbd ? because reiser got merged before jbd. Next question. Now the question for GFS is still a valid one; there might be reasons to not use it (which is fair enough) but if there's no real reason then using jdb sounds a lot better given it's maturity (and it is used by 2 filesystems in -mm already). From reiser at namesys.com Wed Aug 3 01:00:02 2005 From: reiser at namesys.com (Hans Reiser) Date: Tue, 02 Aug 2005 18:00:02 -0700 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <1122994972.3247.31.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <1122994972.3247.31.camel@laptopd505.fenrus.org> Message-ID: <42F01712.2030105@namesys.com> Arjan van de Ven wrote: >On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote: > > >>>* Why use your own journalling layer and not say ... jbd ? >>> >>> >>Why does reiser use its own journalling layer and not say ... jbd ? >> >> > >because reiser got merged before jbd. Next question. > > That is the wrong reason. We use our own journaling layer for the reason that Vivaldi used his own melody. I don't know anything about GFS, but expecting a filesystem author to use a journaling layer he does not want to is a bit arrogant. 
Now, if you got into details, and said jbd does X, Y and Z, and GFS does the same X and Y, and does not do Z as well as jbd, that would be a more serious comment. He might want to look at how reiser4 does wandering logs instead of using jbd..... but I would never claim that for sure some other author should be expected to use it..... and something like changing one's journaling system is not something to do just before a merge..... >Now the question for GFS is still a valid one; there might be reasons to >not use it (which is fair enough) but if there's no real reason then >using jdb sounds a lot better given it's maturity (and it is used by 2 >filesystems in -mm already). > > > >- >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >the body of a message to majordomo at vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html >Please read the FAQ at http://www.tux.org/lkml/ > > > > From mrmacman_g4 at mac.com Wed Aug 3 04:07:38 2005 From: mrmacman_g4 at mac.com (Kyle Moffett) Date: Wed, 3 Aug 2005 00:07:38 -0400 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <42F01712.2030105@namesys.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <1122994972.3247.31.camel@laptopd505.fenrus.org> <42F01712.2030105@namesys.com> Message-ID: <4CBCB111-36B9-4F8C-9A3F-A9126ADE1CA2@mac.com> On Aug 2, 2005, at 21:00:02, Hans Reiser wrote: > Arjan van de Ven wrote: >> because reiser got merged before jbd. Next question. > That is the wrong reason. We use our own journaling layer for the > reason that Vivaldi used his own melody. > > I don't know anything about GFS, but expecting a filesystem author to > use a journaling layer he does not want to is a bit arrogant. Now, if > you got into details, and said jbd does X, Y and Z, and GFS does the > same X and Y, and does not do Z as well as jbd, that would be a more > serious comment. He might want to look at how reiser4 does wandering > logs instead of using jbd..... but I would never claim that for sure > some other author should be expected to use it..... and something > like > changing one's journaling system is not something to do just before a > merge..... I don't want to start another big reiser4 flamewar, but... "I don't know anything about Reiser4, but expecting a filesystem author to use a VFS layer he does not want to is a bit arrogant. Now, if you got into details, and said the linux VFS does X, Y, and Z, and Reiser4 does..." Do you see my point here? If every person who added new kernel code just wrote their own thing without checking to see if it had already been done before, then there would be a lot of poorly maintained code in the kernel. If a journalling layer already exists, _new_ journaled filesystems should either (A) use the layer as is, or (B) fix the layer so it has sufficient functionality for them to use, and submit patches. That way if somebody later says, "Ah, crap, there's a bug in the kernel journalling layer", and fixes it, there are not eight other filesystems with their own open-coded layers that need to be audited for similar mistakes. This is similar to why some kernel developers did not like the Reiser4 code, because it implemented some private layers that looked kinda like stuff the VFS should be doing (Again, I don't want to get into that argument again, I'm just bringing up the similarities to clarify _this_ particular point, as that one has been beaten to death enough already). 
>> Now the question for GFS is still a valid one; there might be >> reasons to >> not use it (which is fair enough) but if there's no real reason then >> using jdb sounds a lot better given it's maturity (and it is used >> by 2 >> filesystems in -mm already). Personally, I am of the opinion that if GFS cannot use jdb, the developers ought to clarify why it isn't useable, and possibly submit fixes to make it useful, so that others can share the benefits. Cheers, Kyle Moffett -- I lost interest in "blade servers" when I found they didn't throw knives at people who weren't supposed to be in your machine room. -- Anthony de Boer From jengelh at linux01.gwdg.de Wed Aug 3 06:37:19 2005 From: jengelh at linux01.gwdg.de (Jan Engelhardt) Date: Wed, 3 Aug 2005 08:37:19 +0200 (MEST) Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <4CBCB111-36B9-4F8C-9A3F-A9126ADE1CA2@mac.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <1122994972.3247.31.camel@laptopd505.fenrus.org> <42F01712.2030105@namesys.com> <4CBCB111-36B9-4F8C-9A3F-A9126ADE1CA2@mac.com> Message-ID: >> > because reiser got merged before jbd. Next question. >> >> That is the wrong reason. We use our own journaling layer for the >> reason that Vivaldi used his own melody. >> >> [...] He might want to look at how reiser4 does wandering >> logs instead of using jbd..... but I would never claim that for sure >> some other author should be expected to use it..... and something like >> changing one's journaling system is not something to do just before a >> merge..... > > Do you see my point here? If every person who added new kernel code > just wrote their own thing without checking to see if it had already > been done before, then there would be a lot of poorly maintained code > in the kernel. If a journalling layer already exists, _new_ journaled > filesystems should either (A) use the layer as is, or (B) fix the layer > so it has sufficient functionality for them to use, and submit patches. Maybe jbd 'sucks' for something 'cool' like reiser*, and modifying jbd to be 'eleet enough' for reiser* would overwhelm ext. Lastly, there is the 'political' thing, when a -only specific change to jbd is rejected by all other jbd-using fs. (Basically the situation thing that leads to software forks, in any area.) Jan Engelhardt -- From penberg at gmail.com Wed Aug 3 06:44:06 2005 From: penberg at gmail.com (Pekka Enberg) Date: Wed, 3 Aug 2005 09:44:06 +0300 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050802071828.GA11217@redhat.com> References: <20050802071828.GA11217@redhat.com> Message-ID: <84144f0205080223445375c907@mail.gmail.com> Hi David, Some more comments below. Pekka On 8/2/05, David Teigland wrote: > +/** > + * inode_create - create a struct gfs2_inode > + * @i_gl: The glock covering the inode > + * @inum: The inode number > + * @io_gl: the iopen glock to acquire/hold (using holder in new gfs2_inode) > + * @io_state: the state the iopen glock should be acquired in > + * @ipp: pointer to put the returned inode in > + * > + * Returns: errno > + */ > + > +static int inode_create(struct gfs2_glock *i_gl, struct gfs2_inum *inum, > + struct gfs2_glock *io_gl, unsigned int io_state, > + struct gfs2_inode **ipp) > +{ > + struct gfs2_sbd *sdp = i_gl->gl_sbd; > + struct gfs2_inode *ip; > + int error = 0; > + > + RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL), ip); Why do you want to do this? The callers can handle ENOMEM just fine. 
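A minimal sketch of the failure path being asked for: let the allocation fail and hand -ENOMEM to the caller instead of retrying. The gfs2 names follow the quoted patch and the rest of the function is elided:

static int inode_create_sketch(struct gfs2_inode **ipp)
{
	struct gfs2_inode *ip;

	ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL);
	if (!ip)
		return -ENOMEM;	/* callers already deal with errno returns */

	/* ... initialise ip and take references as the real code does ... */
	*ipp = ip;
	return 0;
}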
> +/** > + * gfs2_random - Generate a random 32-bit number > + * > + * Generate a semi-crappy 32-bit pseudo-random number without using > + * floating point. > + * > + * The PRNG is from "Numerical Recipes in C" (second edition), page 284. > + * > + * Returns: a 32-bit random number > + */ > + > +uint32_t gfs2_random(void) > +{ > + gfs2_random_number = 0x0019660D * gfs2_random_number + 0x3C6EF35F; > + return gfs2_random_number; > +} Please consider moving this into lib/random.c. This one already appears in drivers/net/hamradio/dmascc.c. > +/** > + * gfs2_hash - hash an array of data > + * @data: the data to be hashed > + * @len: the length of data to be hashed > + * > + * Take some data and convert it to a 32-bit hash. > + * > + * This is the 32-bit FNV-1a hash from: > + * http://www.isthe.com/chongo/tech/comp/fnv/ > + * > + * Returns: the hash > + */ > + > +uint32_t gfs2_hash(const void *data, unsigned int len) > +{ > + uint32_t h = 0x811C9DC5; > + h = hash_more_internal(data, len, h); > + return h; > +} Is there a reason why you cannot use or ? > +void gfs2_sort(void *base, unsigned int num_elem, unsigned int size, > + int (*compar) (const void *, const void *)) > +{ > + register char *pbase = (char *)base; > + int i, j, k, h; > + static int cols[16] = {1391376, 463792, 198768, 86961, > + 33936, 13776, 4592, 1968, > + 861, 336, 112, 48, > + 21, 7, 3, 1}; > + > + for (k = 0; k < 16; k++) { > + h = cols[k]; > + for (i = h; i < num_elem; i++) { > + j = i; > + while (j >= h && > + (*compar)((void *)(pbase + size * (j - h)), > + (void *)(pbase + size * j)) > 0) { > + SWAP(pbase + size * j, > + pbase + size * (j - h), > + size); > + j = j - h; > + } > + } > + } > +} Please use sort() from lib/sort.c. > +/** > + * gfs2_io_error_inode_i - Flag an inode I/O error and withdraw > + * @ip: > + * @function: > + * @file: > + * @line: Please drop empty kerneldoc tags. (Appears in various other places as well.) > +#define RETRY_MALLOC(do_this, until_this) \ > +for (;;) { \ > + { do_this; } \ > + if (until_this) \ > + break; \ > + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \ > + printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \ > + gfs2_malloc_warning = jiffies; \ > + } \ > + yield(); \ > +} Please drop this. > +int gfs2_acl_create(struct gfs2_inode *dip, struct gfs2_inode *ip) > +{ > + struct gfs2_sbd *sdp = dip->i_sbd; > + struct posix_acl *acl = NULL; > + struct gfs2_ea_request er; > + mode_t mode = ip->i_di.di_mode; > + int error; > + > + if (!sdp->sd_args.ar_posix_acl) > + return 0; > + if (S_ISLNK(ip->i_di.di_mode)) > + return 0; > + > + memset(&er, 0, sizeof(struct gfs2_ea_request)); > + er.er_type = GFS2_EATYPE_SYS; > + > + error = acl_get(dip, ACL_DEFAULT, &acl, NULL, > + &er.er_data, &er.er_data_len); > + if (error) > + return error; > + if (!acl) { > + mode &= ~current->fs->umask; > + if (mode != ip->i_di.di_mode) > + error = munge_mode(ip, mode); > + return error; > + } > + > + { > + struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL); > + error = -ENOMEM; > + if (!clone) > + goto out; > + gfs2_memory_add(clone); > + gfs2_memory_rm(acl); > + posix_acl_release(acl); > + acl = clone; > + } Please make this a real function. It is duplicated below. 
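One possible shape for the helper being requested: the clone-and-release sequence shared by gfs2_acl_create() and gfs2_acl_chmod(), pulled into one place. The name is hypothetical, and the gfs2_memory_add/rm debug calls are omitted since they are being asked to go away anyway:

static struct posix_acl *acl_clone_and_release(struct posix_acl *acl)
{
	struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL);

	if (!clone)
		return NULL;		/* caller maps this to -ENOMEM */

	posix_acl_release(acl);		/* drop the reference to the original */
	return clone;
}

Each caller would then do something like "clone = acl_clone_and_release(acl); if (!clone) { error = -ENOMEM; goto out; } acl = clone;", so on failure the original reference is still dropped by the existing out: path, matching the behaviour of the open-coded blocks.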
> + if (error > 0) { > + er.er_name = GFS2_POSIX_ACL_ACCESS; > + er.er_name_len = GFS2_POSIX_ACL_ACCESS_LEN; > + posix_acl_to_xattr(acl, er.er_data, er.er_data_len); > + er.er_mode = mode; > + er.er_flags = GFS2_ERF_MODE; > + error = gfs2_system_eaops.eo_set(ip, &er); > + if (error) > + goto out; > + } else > + munge_mode(ip, mode); > + > + out: > + gfs2_memory_rm(acl); > + posix_acl_release(acl); > + kfree(er.er_data); > + > + return error; Whitespace damage. > +int gfs2_acl_chmod(struct gfs2_inode *ip, struct iattr *attr) > +{ > + struct posix_acl *acl = NULL; > + struct gfs2_ea_location el; > + char *data; > + unsigned int len; > + int error; > + > + error = acl_get(ip, ACL_ACCESS, &acl, &el, &data, &len); > + if (error) > + return error; > + if (!acl) > + return gfs2_setattr_simple(ip, attr); > + > + { > + struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL); > + error = -ENOMEM; > + if (!clone) > + goto out; > + gfs2_memory_add(clone); > + gfs2_memory_rm(acl); > + posix_acl_release(acl); > + acl = clone; > + } Duplicated above. > +static int ea_foreach(struct gfs2_inode *ip, ea_call_t ea_call, void *data) > +{ > + struct buffer_head *bh; > + int error; > + > + error = gfs2_meta_read(ip->i_gl, ip->i_di.di_eattr, > + DIO_START | DIO_WAIT, &bh); > + if (error) > + return error; > + > + if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT)) > + error = ea_foreach_i(ip, bh, ea_call, data); goto out here so you can drop the else branch below. > + else { > + struct buffer_head *eabh; > + uint64_t *eablk, *end; > + > + if (gfs2_metatype_check(ip->i_sbd, bh, GFS2_METATYPE_IN)) { > + error = -EIO; > + goto out; > + } > + > + eablk = (uint64_t *)(bh->b_data + > + sizeof(struct gfs2_meta_header)); > + end = eablk + ip->i_sbd->sd_inptrs; > + > +static int ea_find_i(struct gfs2_inode *ip, struct buffer_head *bh, > + struct gfs2_ea_header *ea, struct gfs2_ea_header *prev, > + void *private) > +{ > + struct ea_find *ef = (struct ea_find *)private; > + struct gfs2_ea_request *er = ef->ef_er; > + > + if (ea->ea_type == GFS2_EATYPE_UNUSED) > + return 0; > + > + if (ea->ea_type == er->er_type) { > + if (ea->ea_name_len == er->er_name_len && > + !memcmp(GFS2_EA2NAME(ea), er->er_name, ea->ea_name_len)) { > + struct gfs2_ea_location *el = ef->ef_el; > + get_bh(bh); > + el->el_bh = bh; > + el->el_ea = ea; > + el->el_prev = prev; > + return 1; > + } > + } > + > +#if 0 > + else if ((ip->i_di.di_flags & GFS2_DIF_EA_PACKED) && > + er->er_type == GFS2_EATYPE_SYS) > + return 1; > +#endif Please drop commented out code. > +static int ea_list_i(struct gfs2_inode *ip, struct buffer_head *bh, > + struct gfs2_ea_header *ea, struct gfs2_ea_header *prev, > + void *private) > +{ > + struct ea_list *ei = (struct ea_list *)private; Please drop redundant cast. > +static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er, > + struct gfs2_ea_location *el) > +{ > + { > + struct ea_set es; > + int error; > + > + memset(&es, 0, sizeof(struct ea_set)); > + es.es_er = er; > + es.es_el = el; > + > + error = ea_foreach(ip, ea_set_simple, &es); > + if (error > 0) > + return 0; > + if (error) > + return error; > + } > + { > + unsigned int blks = 2; > + if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT)) > + blks++; > + if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize) > + blks += DIV_RU(er->er_data_len, > + ip->i_sbd->sd_jbsize); > + > + return ea_alloc_skeleton(ip, er, blks, ea_set_block, el); > + } Please drop the extra braces. 
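To illustrate the last point, here is ea_set_i() with the block-scoped braces flattened out and the declarations hoisted to the top of the function; the logic is copied from the quoted patch, only the layout changes:

static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
		    struct gfs2_ea_location *el)
{
	struct ea_set es;
	unsigned int blks = 2;
	int error;

	memset(&es, 0, sizeof(struct ea_set));
	es.es_er = er;
	es.es_el = el;

	error = ea_foreach(ip, ea_set_simple, &es);
	if (error > 0)
		return 0;
	if (error)
		return error;

	if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
		blks++;
	if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize)
		blks += DIV_RU(er->er_data_len, ip->i_sbd->sd_jbsize);

	return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
}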
From arjan at infradead.org Wed Aug 3 09:09:02 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Wed, 03 Aug 2005 11:09:02 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <42F01712.2030105@namesys.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <1122994972.3247.31.camel@laptopd505.fenrus.org> <42F01712.2030105@namesys.com> Message-ID: <1123060142.3363.8.camel@laptopd505.fenrus.org> > I don't know anything about GFS, but expecting a filesystem author to > use a journaling layer he does not want to is a bit arrogant. good that I didn't expect that then. I think it's fair enough to ask people if they can use it. If the answer is "No because it doesn't fit our model " then that's fine. If the answer is "eh yeah we could" then I think it's entirely reasonable to expect people to use common code as opposed to adding new code. From arjan at infradead.org Wed Aug 3 09:17:09 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Wed, 03 Aug 2005 11:17:09 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050803035618.GB9812@redhat.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050803035618.GB9812@redhat.com> Message-ID: <1123060630.3363.10.camel@laptopd505.fenrus.org> On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote: > The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE > on-disk instead of LE which is another useful way to verify endian > correctness. that sounds wrong to be a compile option. If you really want to deal with dual disk endianness it really ought to be a runtime one (see jffs2 for example). > > * Why use your own journalling layer and not say ... jbd ? > > Here's an analysis of three approaches to cluster-fs journaling and their > pros/cons (including using jbd): http://tinyurl.com/7sbqq > > > * + while (!kthread_should_stop()) { > > + gfs2_scand_internal(sdp); > > + > > + set_current_state(TASK_INTERRUPTIBLE); > > + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ); > > + } > > > > you probably really want to check for signals if you do interruptible sleeps > > I don't know why we'd be interested in signals here. well.. because if you don't your schedule_timeout becomes a nop when you get one, which makes your loop a busy waiting one. From travellig at gmail.com Wed Aug 3 15:57:45 2005 From: travellig at gmail.com (travellig travellig) Date: Wed, 3 Aug 2005 16:57:45 +0100 Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 16, Issue 4 In-Reply-To: <20050803140914.6801B739B8@hormel.redhat.com> References: <20050803140914.6801B739B8@hormel.redhat.com> Message-ID: <6944872105080308573f220551@mail.gmail.com> On Wed, 2005-08-03 at 11:58 +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote: > What does "automatic" fencing have to offer that the manual fencing > lacks. Automatic fencing uses hardware to fence a node and reboot it. Manual fencing relay on you to manually fence the node whenever you release there is a problem in the cluster and relays on you to prowercycle the faulty node manually, no very convenient when you are sysadmin the cluster remotely. > If we decide to buy the FC switch right away is it recomended that we > buy one of the ones that have fencing agent available for the > Cluster-Suite ? If you look at the configuration manual for RHCS, there is a list of supported fencing agents. 
> If can't get our hands on supported FC switchs can we do fencing in
> another manner than throught a FC switch ?

Manual fencing.

Nando
> > > > > ------------------------------ > > Message: 13 > Date: Wed, 03 Aug 2005 11:17:09 +0200 > From: Arjan van de Ven > Subject: [Linux-cluster] Re: [PATCH 00/14] GFS > To: David Teigland > Cc: akpm at osdl.org, linux-cluster at redhat.com, > linux-kernel at vger.kernel.org > Message-ID: <1123060630.3363.10.camel at laptopd505.fenrus.org> > Content-Type: text/plain > > On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote: > > The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE > > on-disk instead of LE which is another useful way to verify endian > > correctness. > > that sounds wrong to be a compile option. If you really want to deal > with dual disk endianness it really ought to be a runtime one (see jffs2 > for example). > > > > > > * Why use your own journalling layer and not say ... jbd ? > > > > Here's an analysis of three approaches to cluster-fs journaling and their > > pros/cons (including using jbd): http://tinyurl.com/7sbqq > > > > > * + while (!kthread_should_stop()) { > > > + gfs2_scand_internal(sdp); > > > + > > > + set_current_state(TASK_INTERRUPTIBLE); > > > + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ); > > > + } > > > > > > you probably really want to check for signals if you do interruptible sleeps > > > > I don't know why we'd be interested in signals here. > > well.. because if you don't your schedule_timeout becomes a nop when you > get one, which makes your loop a busy waiting one. > > > > > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 16, Issue 4 > ******************************************** > -- travellig. From JACOB_LIBERMAN at Dell.com Wed Aug 3 16:40:14 2005 From: JACOB_LIBERMAN at Dell.com (JACOB_LIBERMAN at Dell.com) Date: Wed, 3 Aug 2005 11:40:14 -0500 Subject: [Linux-cluster] Fencing agents Message-ID: The 1855 has a built in ERA controller. You can modify the fencing agents to either send "racadm serveraction powercycle" or install the PERL telent module and create your own fencing script. The former option requires that the rac management software be installed on the host. I havent tested this with the 1855 btw. http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster The fence_drac agent out on the CVS should work for you. If you cant get it working, let me know, and ill see if I can dig up an 1855 in the lab. Thanks, jacob > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > "S?valdur Arnar Gunnarsson [Hugsmi?jan]" > Sent: Wednesday, August 03, 2005 6:59 AM > To: linux-cluster at redhat.com > Subject: [Linux-cluster] Fencing agents > > I'm implementing a shared storage between multiple (2 at the > moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES > connected to a EMC AX100 through FC. > > The SAN has two FC ports so the need for a FC Switch has not > yet come however we will add other Blades in the coming months. > The one thing I haven't got figured out with GFS and the > Cluster-Suite is the whole idea about fencing. > > We have a working setup using Centos rebuilds of the > Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which > we are not planning to use in the final implementation where > we plan to use the official GFS packages from Red Hat. > The fencing agents in that setup is manual fencing. 
> > Both machines have the file system mounted and there appears > to be no problems. > > What does "automatic" fencing have to offer that the manual > fencing lacks. > If we decide to buy the FC switch right away is it recomended > that we buy one of the ones that have fencing agent available > for the Cluster-Suite ? > > If can't get our hands on supported FC switchs can we do > fencing in another manner than throught a FC switch ? > > > > > -- > S?valdur Gunnarsson :: Hugsmi?jan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From addi at hugsmidjan.is Wed Aug 3 17:20:01 2005 From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=) Date: Wed, 03 Aug 2005 17:20:01 +0000 Subject: [Linux-cluster] Fencing agents In-Reply-To: References: Message-ID: <42F0FCC1.7030108@hugsmidjan.is> Could you please include a cluster.conf fencing sample on how to implement this. JACOB_LIBERMAN at Dell.com wrote: > The 1855 has a built in ERA controller. You can modify the fencing agents to either send "racadm serveraction powercycle" or install the PERL telent module and create your own fencing script. The former option requires that the rac management software be installed on the host. I havent tested this with the 1855 btw. > > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster > > The fence_drac agent out on the CVS should work for you. If you cant get it working, let me know, and ill see if I can dig up an 1855 in the lab. > > Thanks, jacob > > >>-----Original Message----- >>From: linux-cluster-bounces at redhat.com >>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of >>"S?valdur Arnar Gunnarsson [Hugsmi?jan]" >>Sent: Wednesday, August 03, 2005 6:59 AM >>To: linux-cluster at redhat.com >>Subject: [Linux-cluster] Fencing agents >> >>I'm implementing a shared storage between multiple (2 at the >>moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES >>connected to a EMC AX100 through FC. >> >>The SAN has two FC ports so the need for a FC Switch has not >>yet come however we will add other Blades in the coming months. >>The one thing I haven't got figured out with GFS and the >>Cluster-Suite is the whole idea about fencing. >> >>We have a working setup using Centos rebuilds of the >>Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which >>we are not planning to use in the final implementation where >>we plan to use the official GFS packages from Red Hat. >>The fencing agents in that setup is manual fencing. >> >>Both machines have the file system mounted and there appears >>to be no problems. >> >>What does "automatic" fencing have to offer that the manual >>fencing lacks. >>If we decide to buy the FC switch right away is it recomended >>that we buy one of the ones that have fencing agent available >>for the Cluster-Suite ? >> >>If can't get our hands on supported FC switchs can we do >>fencing in another manner than throught a FC switch ? 
>> >> >> >> >>-- >>S?valdur Gunnarsson :: Hugsmi?jan >> >>-- >>Linux-cluster mailing list >>Linux-cluster at redhat.com >>http://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- S?valdur Gunnarsson :: Hugsmi?jan From mark.fasheh at oracle.com Wed Aug 3 18:54:01 2005 From: mark.fasheh at oracle.com (Mark Fasheh) Date: Wed, 3 Aug 2005 11:54:01 -0700 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050803103744.GG11081@marowsky-bree.de> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050803035618.GB9812@redhat.com> <20050803103744.GG11081@marowsky-bree.de> Message-ID: <20050803185401.GB21228@ca-server1.us.oracle.com> On Wed, Aug 03, 2005 at 12:37:44PM +0200, Lars Marowsky-Bree wrote: > On 2005-08-03T11:56:18, David Teigland wrote: > > > > * Why use your own journalling layer and not say ... jbd ? > > Here's an analysis of three approaches to cluster-fs journaling and their > > pros/cons (including using jbd): http://tinyurl.com/7sbqq > > Very instructive read, thanks for the link. While it may be true that for a full log, flushing for a *single* lock may be more expensive in OCFS2, Ken ignores the fact that in our one big flush we've made all locks on journalled resources immediately releasable. According to that description, GFS2 would have to do a seperate transaction flush (including the extra step of writing revoke records) for each lock protecting a journalled resource. Assuming the same number of locks are required to be dropped under both systems then for a number of locks > 1 OCFS2 will actually do less work - the actual metadata blocks would be the same on either end, but JBD only has to write that the journal is now clean to the journal superblock whereas GFS2 has to revoke the blocks for each dropped lock. Of course all of this talk completely avoids the fact that in any case these things are expensive so a cluster file system has to take care to ping locks as little as possible. OCFS2 takes great pains to make as many operations node local (requiring no cluster locks) as possible - data allocation is usually done from a node local pool which is refreshed from the main bitmap. Deallocation happens similarly - we have a truncate log in which we record deleted clusters. Each node has their own inode and metadata chain allocators which another node will only lock for delete (a truncate log style local metadata delete log could easily be added if that ever became a problem). --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh at oracle.com From jacobl at ccbill.com Wed Aug 3 20:36:21 2005 From: jacobl at ccbill.com (Jacob Liff) Date: Wed, 3 Aug 2005 13:36:21 -0700 Subject: [Linux-cluster] Live Changes To Cluster Message-ID: <889A47B16278164FB657E0FFB1CAB8C7CB1375@hq-exchange.ccbill-hq.local> Hello, I've done some searching on the mailing list and found one link that talks about what I'm trying to accomplish. These are also the steps in the usage.txt for changing the config in a live cluster. http://tinyurl.com/duxqt I have tried this but it does not seem to work. I changed the config on one box HUP'ed ccsd but the new cluster.conf is not sent to the other members. 
I am using the latest gzip from: ftp://sources.redhat.com/pub/cluster/releases/cluster-1.00.00.tar.gz It will allow me to change the config number on the other members(cman_tool version -r 2) but something just doesn't seem right with the new configs not residing on the machines. Has this process been updated at some point? Jacob L. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric at bootseg.com Wed Aug 3 20:48:01 2005 From: eric at bootseg.com (Eric Kerin) Date: Wed, 03 Aug 2005 16:48:01 -0400 Subject: [Linux-cluster] Live Changes To Cluster In-Reply-To: <889A47B16278164FB657E0FFB1CAB8C7CB1375@hq-exchange.ccbill-hq.local> References: <889A47B16278164FB657E0FFB1CAB8C7CB1375@hq-exchange.ccbill-hq.local> Message-ID: <1123102081.3344.6.camel@auh5-0479.corp.jabil.org> On Wed, 2005-08-03 at 13:36 -0700, Jacob Liff wrote: > Hello, > > > It will allow me to change the config number on the other members(cman_tool version ?r 2) but something just doesn?t seem right with the new configs not residing on the machines. Has this process been updated at some point? > This is the process I use when I update my cluster.xml file: 1. make my changes, upping the version number 2. run: "ccs_tool update /etc/cluster/cluster.xml" 3. run: "cman_tool version -r " 4. on each node, run: "cman_tool status" and make sure that the version number is the new one. Thanks, Eric Kerin From jacobl at ccbill.com Wed Aug 3 20:53:30 2005 From: jacobl at ccbill.com (Jacob Liff) Date: Wed, 3 Aug 2005 13:53:30 -0700 Subject: [Linux-cluster] Live Changes To Cluster Message-ID: <889A47B16278164FB657E0FFB1CAB8C7CB1384@hq-exchange.ccbill-hq.local> Eric, Thanks very much, this did the trick for me. You guys maintain an incredibly helpful list. Jacob L. -----Original Message----- From: Eric Kerin [mailto:eric at bootseg.com] Sent: Wednesday, August 03, 2005 1:48 PM To: Jacob Liff; ; linux clustering Subject: Re: [Linux-cluster] Live Changes To Cluster On Wed, 2005-08-03 at 13:36 -0700, Jacob Liff wrote: > Hello, > > > It will allow me to change the config number on the other members(cman_tool version -r 2) but something just doesn't seem right with the new configs not residing on the machines. Has this process been updated at some point? > This is the process I use when I update my cluster.xml file: 1. make my changes, upping the version number 2. run: "ccs_tool update /etc/cluster/cluster.xml" 3. run: "cman_tool version -r " 4. on each node, run: "cman_tool status" and make sure that the version number is the new one. Thanks, Eric Kerin From amanthei at redhat.com Wed Aug 3 21:29:49 2005 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 3 Aug 2005 16:29:49 -0500 Subject: [Linux-cluster] Fencing agents In-Reply-To: <42F0B177.7050907@hugsmidjan.is> References: <42F0B177.7050907@hugsmidjan.is> Message-ID: <20050803212949.GA3268@redhat.com> On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote: > I'm implementing a shared storage between multiple (2 at the moment) > Blade machines (Dell PowerEdge 1855) running RHEL4 ES connected to a EMC > AX100 through FC. > > The SAN has two FC ports so the need for a FC Switch has not yet come > however we will add other Blades in the coming months. > The one thing I haven't got figured out with GFS and the Cluster-Suite > is the whole idea about fencing. Funny timing :) I just checked in the fencing agent for the PowerEdge 1855's a couple days ago! 
(http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb-markup&cvsroot=cluster) > The fencing agents in that setup is manual fencing. I would strongly discourage this. > What does "automatic" fencing have to offer that the manual fencing lacks. > If we decide to buy the FC switch right away is it recomended that we > buy one of the ones that have fencing agent available for the > Cluster-Suite ? In this case, you already have a fencing agent (fence_drac) that works with the PE 1855 blades so there is no need for further fencing hardware (unless you are going to be connecting other machines to the cluster that aren't going to have any other form of fencing) The main advantage that "automatic" fencing gives you over manual fencing is that in the event that a fencing operation is required, your cluster can automatically recover (on the order of seconds to minutes) instead of waiting for user intervention (which can take minutes to hours to days depending on how attentive the admins are :). -- Adam Manthei From amanthei at redhat.com Wed Aug 3 21:34:25 2005 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 3 Aug 2005 16:34:25 -0500 Subject: [Linux-cluster] Fencing agents In-Reply-To: References: Message-ID: <20050803213425.GB3268@redhat.com> On Wed, Aug 03, 2005 at 11:40:14AM -0500, JACOB_LIBERMAN at Dell.com wrote: > The 1855 has a built in ERA controller. You can modify the fencing agents to either send "racadm serveraction powercycle" or install the PERL telent module and create your own fencing script. The former option requires that the rac management software be installed on the host. I havent tested this with the 1855 btw. > > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster "racadm" should be avoided. The interface does not provide the feedback necessary to guarantee that nodes have been properly fenced. The telnet interface is the preferred method for support DRAC hardware. Fortunately, as your link above shows, the fence_drac agent now supports the 1855 as of Monday. > The fence_drac agent out on the CVS should work for you. If you cant get it working, let me know, and ill see if I can dig up an 1855 in the lab. > > Thanks, jacob > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > "S?valdur Arnar Gunnarsson [Hugsmi?jan]" > > Sent: Wednesday, August 03, 2005 6:59 AM > > To: linux-cluster at redhat.com > > Subject: [Linux-cluster] Fencing agents > > > > I'm implementing a shared storage between multiple (2 at the > > moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES > > connected to a EMC AX100 through FC. > > > > The SAN has two FC ports so the need for a FC Switch has not > > yet come however we will add other Blades in the coming months. > > The one thing I haven't got figured out with GFS and the > > Cluster-Suite is the whole idea about fencing. > > > > We have a working setup using Centos rebuilds of the > > Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which > > we are not planning to use in the final implementation where > > we plan to use the official GFS packages from Red Hat. > > The fencing agents in that setup is manual fencing. > > > > Both machines have the file system mounted and there appears > > to be no problems. > > > > What does "automatic" fencing have to offer that the manual > > fencing lacks. 
> > If we decide to buy the FC switch right away is it recomended > > that we buy one of the ones that have fencing agent available > > for the Cluster-Suite ? > > > > If can't get our hands on supported FC switchs can we do > > fencing in another manner than throught a FC switch ? > > > > > > > > > > -- > > S?valdur Gunnarsson :: Hugsmi?jan > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From JACOB_LIBERMAN at Dell.com Wed Aug 3 21:42:23 2005 From: JACOB_LIBERMAN at Dell.com (JACOB_LIBERMAN at Dell.com) Date: Wed, 3 Aug 2005 16:42:23 -0500 Subject: [Linux-cluster] Fencing agents Message-ID: Hi Adam, I noticed that you updated this script quite a bit from previous versions. If I'm not mistaken, the previous version actually used the "racadm serveraction powercycle/shutdown/etc" commands. This version uses telnet exclusively. How about adding some logic that checks whether racadm is installed locally and uses that if it is, and then uses telnet if it is not? I think that adding the racadm commands to enable telnet on the rac is a good idea, but if they can use racadm to configure telnet access, they should also be able to use racadm to fence the node. Just my 2 cents. I think its great that you wrote an agent for the drac. Thanks, jacob > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Adam Manthei > Sent: Wednesday, August 03, 2005 4:30 PM > To: linux clustering > Subject: Re: [Linux-cluster] Fencing agents > > On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar > Gunnarsson [Hugsmi?jan]" wrote: > > I'm implementing a shared storage between multiple (2 at > the moment) > > Blade machines (Dell PowerEdge 1855) running RHEL4 ES > connected to a > > EMC AX100 through FC. > > > > The SAN has two FC ports so the need for a FC Switch has > not yet come > > however we will add other Blades in the coming months. > > The one thing I haven't got figured out with GFS and the > Cluster-Suite > > is the whole idea about fencing. > > Funny timing :) I just checked in the fencing agent for the > PowerEdge 1855's a couple days ago! > > (http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/ag > ents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb > -markup&cvsroot=cluster) > > > The fencing agents in that setup is manual fencing. > > I would strongly discourage this. > > > What does "automatic" fencing have to offer that the manual > fencing lacks. > > If we decide to buy the FC switch right away is it > recomended that we > > buy one of the ones that have fencing agent available for the > > Cluster-Suite ? > > In this case, you already have a fencing agent (fence_drac) > that works with the PE 1855 blades so there is no need for > further fencing hardware (unless you are going to be > connecting other machines to the cluster that aren't going to > have any other form of fencing) > > The main advantage that "automatic" fencing gives you over > manual fencing is that in the event that a fencing operation > is required, your cluster can automatically recover (on the > order of seconds to minutes) instead of waiting for user > intervention (which can take minutes to hours to days depending on > how attentive the admins are :). 
> > -- > Adam Manthei > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From JACOB_LIBERMAN at Dell.com Wed Aug 3 21:43:16 2005 From: JACOB_LIBERMAN at Dell.com (JACOB_LIBERMAN at Dell.com) Date: Wed, 3 Aug 2005 16:43:16 -0500 Subject: [Linux-cluster] Fencing agents Message-ID: Oops! Looks like I sent this 1 second too soon. 8) > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > JACOB_LIBERMAN at Dell.com > Sent: Wednesday, August 03, 2005 4:42 PM > To: linux-cluster at redhat.com > Subject: RE: [Linux-cluster] Fencing agents > > Hi Adam, > > I noticed that you updated this script quite a bit from > previous versions. If I'm not mistaken, the previous version > actually used the "racadm serveraction > powercycle/shutdown/etc" commands. This version uses telnet > exclusively. How about adding some logic that checks whether > racadm is installed locally and uses that if it is, and then > uses telnet if it is not? > > I think that adding the racadm commands to enable telnet on > the rac is a good idea, but if they can use racadm to > configure telnet access, they should also be able to use > racadm to fence the node. > > Just my 2 cents. I think its great that you wrote an agent > for the drac. > > Thanks, jacob > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Adam Manthei > > Sent: Wednesday, August 03, 2005 4:30 PM > > To: linux clustering > > Subject: Re: [Linux-cluster] Fencing agents > > > > On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar > Gunnarsson > > [Hugsmi?jan]" wrote: > > > I'm implementing a shared storage between multiple (2 at > > the moment) > > > Blade machines (Dell PowerEdge 1855) running RHEL4 ES > > connected to a > > > EMC AX100 through FC. > > > > > > The SAN has two FC ports so the need for a FC Switch has > > not yet come > > > however we will add other Blades in the coming months. > > > The one thing I haven't got figured out with GFS and the > > Cluster-Suite > > > is the whole idea about fencing. > > > > Funny timing :) I just checked in the fencing agent for > the PowerEdge > > 1855's a couple days ago! > > > > (http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/ag > > ents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb > > -markup&cvsroot=cluster) > > > > > The fencing agents in that setup is manual fencing. > > > > I would strongly discourage this. > > > > > What does "automatic" fencing have to offer that the manual > > fencing lacks. > > > If we decide to buy the FC switch right away is it > > recomended that we > > > buy one of the ones that have fencing agent available for the > > > Cluster-Suite ? > > > > In this case, you already have a fencing agent (fence_drac) > that works > > with the PE 1855 blades so there is no need for further fencing > > hardware (unless you are going to be connecting other > machines to the > > cluster that aren't going to have any other form of fencing) > > > > The main advantage that "automatic" fencing gives you over manual > > fencing is that in the event that a fencing operation is required, > > your cluster can automatically recover (on the order of seconds to > > minutes) instead of waiting for user intervention (which can take > > minutes to hours to days depending on > > how attentive the admins are :). 
> > > > -- > > Adam Manthei > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From amanthei at redhat.com Wed Aug 3 22:11:44 2005 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 3 Aug 2005 17:11:44 -0500 Subject: [Linux-cluster] Fencing agents In-Reply-To: References: Message-ID: <20050803221144.GC3268@redhat.com> On Wed, Aug 03, 2005 at 04:42:23PM -0500, JACOB_LIBERMAN at Dell.com wrote: > Hi Adam, > > I noticed that you updated this script quite a bit from previous versions. If I'm not mistaken, the previous version actually used the "racadm serveraction powercycle/shutdown/etc" commands. This version uses telnet exclusively. How about adding some logic that checks whether racadm is installed locally and uses that if it is, and then uses telnet if it is not? The problem that I experienced with the racadm utility is that there where times that there was no way of querying what the power status of a node was. I know that I am unable to do that at all with the firmware that I have installed for my PowerEdge 750's. Another drawback to the racadm approach is that `serveraction` returns right away before waiting to for that command to complete. Given the combination of the two issues, it makes using racadm difficult to rely upon for a fencing agent because it's possible for the fencing agent to report success before the machine is powered off. If that were to happen, corruption in the filesystem could occur. I've emailed a couple people at Dell and the linux-poweredge list and have not been able to get an adequate response as to how to use racadm reliably. As such, we only support the telnet interface. > I think that adding the racadm commands to enable telnet on the rac is a good idea, but if they can use racadm to configure telnet access, they should also be able to use racadm to fence the node. I thought about adding that functionality, but forgot about it shortly after getting the telnet interface enabled on my DRAC card ;) Thanks for the reminder, I'll look into adding that feature. In the meantime, the commands for enabling it are documented in the man page. [root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1 [root]# racadm racreset > Just my 2 cents. I think its great that you wrote an agent for the drac. :) Adam > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Adam Manthei > > Sent: Wednesday, August 03, 2005 4:30 PM > > To: linux clustering > > Subject: Re: [Linux-cluster] Fencing agents > > > > On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar > > Gunnarsson [Hugsmi?jan]" wrote: > > > I'm implementing a shared storage between multiple (2 at > > the moment) > > > Blade machines (Dell PowerEdge 1855) running RHEL4 ES > > connected to a > > > EMC AX100 through FC. > > > > > > The SAN has two FC ports so the need for a FC Switch has > > not yet come > > > however we will add other Blades in the coming months. > > > The one thing I haven't got figured out with GFS and the > > Cluster-Suite > > > is the whole idea about fencing. > > > > Funny timing :) I just checked in the fencing agent for the > > PowerEdge 1855's a couple days ago! 
> > > > (http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/ag > > ents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb > > -markup&cvsroot=cluster) > > > > > The fencing agents in that setup is manual fencing. > > > > I would strongly discourage this. > > > > > What does "automatic" fencing have to offer that the manual > > fencing lacks. > > > If we decide to buy the FC switch right away is it > > recomended that we > > > buy one of the ones that have fencing agent available for the > > > Cluster-Suite ? > > > > In this case, you already have a fencing agent (fence_drac) > > that works with the PE 1855 blades so there is no need for > > further fencing hardware (unless you are going to be > > connecting other machines to the cluster that aren't going to > > have any other form of fencing) > > > > The main advantage that "automatic" fencing gives you over > > manual fencing is that in the event that a fencing operation > > is required, your cluster can automatically recover (on the > > order of seconds to minutes) instead of waiting for user > > intervention (which can take minutes to hours to days depending on > > how attentive the admins are :). > > > > -- > > Adam Manthei > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From oldmoonster at gmail.com Thu Aug 4 02:29:54 2005 From: oldmoonster at gmail.com (Q.L) Date: Thu, 4 Aug 2005 10:29:54 +0800 Subject: [Linux-cluster] Compiling error against kernel-2.6.12.2 Message-ID: <359782e705080319293993cbf1@mail.gmail.com> Hi, When I began to "make" on the RH9.0, I can't pass following errors, could you help me? however, it seems no compiling problem happen on a host with FC1.0. further more, what's the special config required in kernel .config file for GFS cluster? Thanks. Q.L cd ccs_tool && make install make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_tool' gcc -Wall -I. -I../config -I../include -I../lib -I/home/share/cluster-1.00.00/build/incdir -Wall -O2 -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags` -DCCS_RELEASE_NAME=\"1.00.00\" -I. -I../config -I../include -I../lib -I/home/share/cluster-1.00.00/build/incdir -o ccs_tool ccs_tool.c update.c upgrade.c old_parser.c editconf.c -L../lib `xml2-config --libs` -L/home/share/cluster-1.00.00/build/lib -lccs -lmagma -lmagmamsg -ldl /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference to `pthread_rwlock_rdlock' /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference to `pthread_rwlock_unlock' /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference to `pthread_rwlock_wrlock' From oldmoonster at gmail.com Thu Aug 4 06:34:40 2005 From: oldmoonster at gmail.com (Q.L) Date: Thu, 4 Aug 2005 14:34:40 +0800 Subject: [Linux-cluster] Is there backporting patches for 2.4.x kernel can be found? Message-ID: <359782e7050803233458cf8e76@mail.gmail.com> Hi, If I don't ask such a question, I can't give up the idea forever, although I know it is ultimately impossible. 
Thanks, Q.L From mdl at veles.ru Thu Aug 4 08:32:13 2005 From: mdl at veles.ru (Denis Medvedev) Date: Thu, 04 Aug 2005 12:32:13 +0400 Subject: [Linux-cluster] Fencing agents In-Reply-To: <20050803212949.GA3268@redhat.com> References: <42F0B177.7050907@hugsmidjan.is> <20050803212949.GA3268@redhat.com> Message-ID: <42F1D28D.10006@veles.ru> Adam Manthei ?????: >On Wed, Aug 03, 2005 at 11:58:47AM +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote: > > >>I'm implementing a shared storage between multiple (2 at the moment) >>Blade machines (Dell PowerEdge 1855) running RHEL4 ES connected to a EMC >>AX100 through FC. >> >>The SAN has two FC ports so the need for a FC Switch has not yet come >>however we will add other Blades in the coming months. >>The one thing I haven't got figured out with GFS and the Cluster-Suite >>is the whole idea about fencing. >> >> > >Funny timing :) I just checked in the fencing agent for the PowerEdge >1855's a couple days ago! > >(http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/fence_drac.pl?rev=1.3.4.2&content-type=text/x-cvsweb-markup&cvsroot=cluster) > > > >>The fencing agents in that setup is manual fencing. >> >> > >I would strongly discourage this. > > > >>What does "automatic" fencing have to offer that the manual fencing lacks. >>If we decide to buy the FC switch right away is it recomended that we >>buy one of the ones that have fencing agent available for the >>Cluster-Suite ? >> >> > >In this case, you already have a fencing agent (fence_drac) that works with >the PE 1855 blades so there is no need for further fencing hardware (unless >you are going to be connecting other machines to the cluster that aren't >going to have any other form of fencing) > >The main advantage that "automatic" fencing gives you over manual fencing is >that in the event that a fencing operation is required, your cluster can >automatically recover (on the order of seconds to minutes) instead of waiting >for user intervention (which can take minutes to hours to days depending on >how attentive the admins are :). > > > "recover"? You mean reboot? But if a machine need fencing, doesn't that mean that something is inherently wrong with that machine and simple reboot would't cure that? From mdl at veles.ru Thu Aug 4 08:49:13 2005 From: mdl at veles.ru (Denis Medvedev) Date: Thu, 04 Aug 2005 12:49:13 +0400 Subject: [Linux-cluster] Mirrored shared disks Message-ID: <42F1D689.40807@veles.ru> Dear sirs, I am trying to make the following configuration: a two node cluster, each node exports its own (identical in size) device, each node imports the device from its neighbour, a md device which is a composition of own and exported neighbour device is created on each node, lock_dlm is used as a locking system, a gfs is used on each node for that md device. Will it work? Isi it possible to create no-single-point of failure if I have 2 storage that expose iSCSI devices and I want to make a mirror based on both of them? 
Thanks in advance Denis Medvedev From javipolo at datagrama.net Thu Aug 4 11:00:48 2005 From: javipolo at datagrama.net (Javi Polo) Date: Thu, 4 Aug 2005 13:00:48 +0200 Subject: [Linux-cluster] ipv6_loopback symbol in 2.6.12 Message-ID: <20050804110048.GA18954@gibson.drslump.org> Hi there I managed to compile almost everything fine, and wanted to do some testing, but I realise that lock_gulm has some undefined reference: gfstest1:/usr/src/linux# modprobe lock_gulm FATAL: Error inserting lock_gulm (/lib/modules/2.6.12.3/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko): Unknown symbol in module, or unknown parameter (see dmesg) gfstest1:/usr/src/linux# dmesg |grep gulm lock_gulm: Unknown symbol in6addr_loopback lock_gulm: Unknown symbol in6addr_loopback gfstest1:/usr/src/linux# I added ipv6 support (though I didnt want to). Is it required? and anyway, I suppose they must have changed something in the kernel, as I got this error. I had to fix several things (2.6.x is changing so much, i guess) that I found on the archives, but no fix refering to this ipv6 thing. Can anybody give me a hint? :P thanks in advance ;) -- Javier Polo @ Datagrama 902 136 126 From addi at hugsmidjan.is Thu Aug 4 11:00:47 2005 From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=) Date: Thu, 04 Aug 2005 11:00:47 +0000 Subject: [Linux-cluster] Fencing agents In-Reply-To: References: Message-ID: <42F1F55F.1080708@hugsmidjan.is> Well .. when I manually run the fence_drac.pl perl script and supply it with the ip of the DRAC (-a 192.168.100.173) login name (-l root) DRAC/MC module name (-m Server-1) and the password (-p dummypassword) the machine in question (Server-1) powers down and doesn't power back on. How do I implement this in cluster.xml (specify the ip/login/pass/module name) and shouldn't it power back up afterwards ? JACOB_LIBERMAN at Dell.com wrote: > The 1855 has a built in ERA controller. You can modify the fencing agents to either send "racadm serveraction powercycle" or install the PERL telent module and create your own fencing script. The former option requires that the rac management software be installed on the host. I havent tested this with the 1855 btw. > > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster > > The fence_drac agent out on the CVS should work for you. If you cant get it working, let me know, and ill see if I can dig up an 1855 in the lab. > > Thanks, jacob > > >>-----Original Message----- >>From: linux-cluster-bounces at redhat.com >>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of >>"S?valdur Arnar Gunnarsson [Hugsmi?jan]" >>Sent: Wednesday, August 03, 2005 6:59 AM >>To: linux-cluster at redhat.com >>Subject: [Linux-cluster] Fencing agents >> >>I'm implementing a shared storage between multiple (2 at the >>moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES >>connected to a EMC AX100 through FC. >> >>The SAN has two FC ports so the need for a FC Switch has not >>yet come however we will add other Blades in the coming months. >>The one thing I haven't got figured out with GFS and the >>Cluster-Suite is the whole idea about fencing. >> >>We have a working setup using Centos rebuilds of the >>Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which >>we are not planning to use in the final implementation where >>we plan to use the official GFS packages from Red Hat. >>The fencing agents in that setup is manual fencing. 
>> >>Both machines have the file system mounted and there appears >>to be no problems. >> >>What does "automatic" fencing have to offer that the manual >>fencing lacks. >>If we decide to buy the FC switch right away is it recomended >>that we buy one of the ones that have fencing agent available >>for the Cluster-Suite ? >> >>If can't get our hands on supported FC switchs can we do >>fencing in another manner than throught a FC switch ? >> >> >> >> >>-- >>S?valdur Gunnarsson :: Hugsmi?jan >> >>-- >>Linux-cluster mailing list >>Linux-cluster at redhat.com >>http://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- S?valdur Gunnarsson :: Hugsmi?jan From javipolo at datagrama.net Thu Aug 4 12:38:19 2005 From: javipolo at datagrama.net (Javi Polo) Date: Thu, 4 Aug 2005 14:38:19 +0200 Subject: [Linux-cluster] Fencing agents In-Reply-To: References: Message-ID: <20050804123819.GA20306@gibson.drslump.org> I just modified a bit fence_sanbox2.pl so it can fence hosts telnetting to the fiber switch. It's an IBM 2005 H16 ... dunnow if other's IBM commands are the same. Here it is, in case anyone find it useful :P -- Javier Polo @ Datagrama 902 136 126 -------------- next part -------------- A non-text attachment was scrubbed... Name: fence_IBMswitch.pl Type: text/x-perl Size: 5062 bytes Desc: not available URL: From mtilstra at redhat.com Thu Aug 4 13:30:20 2005 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Thu, 4 Aug 2005 08:30:20 -0500 Subject: [Linux-cluster] ipv6_loopback symbol in 2.6.12 In-Reply-To: <20050804110048.GA18954@gibson.drslump.org> References: <20050804110048.GA18954@gibson.drslump.org> Message-ID: <20050804133020.GA21470@redhat.com> On Thu, Aug 04, 2005 at 01:00:48PM +0200, Javi Polo wrote: > Hi there > > I managed to compile almost everything fine, and wanted to do some > testing, but I realise that lock_gulm has some undefined reference: > > gfstest1:/usr/src/linux# modprobe lock_gulm > FATAL: Error inserting lock_gulm > (/lib/modules/2.6.12.3/kernel/fs/gfs_locking/lock_gulm/lock_gulm.ko): > Unknown symbol in module, or unknown parameter (see dmesg) > gfstest1:/usr/src/linux# dmesg |grep gulm > lock_gulm: Unknown symbol in6addr_loopback > lock_gulm: Unknown symbol in6addr_loopback > gfstest1:/usr/src/linux# > > I added ipv6 support (though I didnt want to). Is it required? gulm requires ipv6. If you plan on using cman/dlm, you don't need gulm, and so can comment it out of the makefiles. -- Michael Conrad Tadpol Tilstra -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From travellig at yahoo.co.uk Wed Aug 3 14:11:58 2005 From: travellig at yahoo.co.uk (Hernando Garcia) Date: Wed, 03 Aug 2005 15:11:58 +0100 Subject: [Linux-cluster] Fencing agents In-Reply-To: <42F0B177.7050907@hugsmidjan.is> References: <42F0B177.7050907@hugsmidjan.is> Message-ID: <1123078318.4405.32.camel@hgarcia.surrey.redhat.com> On Wed, 2005-08-03 at 11:58 +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote: > What does "automatic" fencing have to offer that the manual fencing > lacks. Automatic fencing uses hardware to fence a node and reboot it. 
Manual fencing relay on you to manually fence the node whenever you release there is a problem in the cluster and relays on you to prowercycle the faulty node manually, no very convenient when you are sysadmin the cluster remotely. > If we decide to buy the FC switch right away is it recomended that we > buy one of the ones that have fencing agent available for the > Cluster-Suite ? If you look at the configuration manual for RHCS, there is a list of supported fencing agents. > If can't get our hands on supported FC switchs can we do fencing in > another manner than throught a FC switch ? Manual fencing. Nando From mbrookov at mines.edu Thu Aug 4 14:26:43 2005 From: mbrookov at mines.edu (Matthew B. Brookover) Date: Thu, 04 Aug 2005 08:26:43 -0600 Subject: [Linux-cluster] Is there backporting patches for 2.4.x kernel can be found? In-Reply-To: <359782e7050803233458cf8e76@mail.gmail.com> References: <359782e7050803233458cf8e76@mail.gmail.com> Message-ID: <1123165603.1143.1.camel@merlin.Mines.EDU> There is no back port that I am aware of, but there is the old version that works on 2.4. See http://www.gyrate.org/misc/gfs.txt for build instructions. Matt On Thu, 2005-08-04 at 00:34, Q.L wrote: > Hi, > > If I don't ask such a question, I can't give up the idea forever, > although I know it is ultimately impossible. > > Thanks, > > Q.L > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From javipolo at datagrama.net Thu Aug 4 14:32:00 2005 From: javipolo at datagrama.net (Javi Polo) Date: Thu, 4 Aug 2005 16:32:00 +0200 Subject: [Linux-cluster] ipv6_loopback symbol in 2.6.12 In-Reply-To: <20050804133020.GA21470@redhat.com> References: <20050804110048.GA18954@gibson.drslump.org> <20050804133020.GA21470@redhat.com> Message-ID: <20050804143200.GA23365@gibson.drslump.org> On Aug/04/2005, Michael Conrad Tadpol Tilstra wrote: > > lock_gulm: Unknown symbol in6addr_loopback > gulm requires ipv6. If you plan on using cman/dlm, you don't need gulm, > and so can comment it out of the makefiles. I understand clm/gulm are lock managers ... what advantages has each one? (or where could I read a little bit about it, I've been on RH cluster page, but cant understand well what's better from one or another, and what should I use ... :? btw, I couldnt test gulm because of this change in 2.6.12 ... has anybody a patch? O:) thx -- Javier Polo @ Datagrama 902 136 126 From addi at hugsmidjan.is Thu Aug 4 15:21:48 2005 From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=) Date: Thu, 04 Aug 2005 15:21:48 +0000 Subject: [Linux-cluster] Purpose of fencing devices Message-ID: <42F2328C.4050700@hugsmidjan.is> Could someone explain to me the purpose of the fencing hardware in a cluster with a shared storage resource. When one of the cluster member goes down all access to the shared volume (GFS) is closed off. No other cluster member can read or write to the volume until the failed node comes back up. Are fencing devices used to close off the access the dead node has on the filesystem so the other nodes can access (read/write) the fileystem as usual ? -- S?valdur Gunnarsson :: Hugsmi?jan From oldmoonster at gmail.com Thu Aug 4 16:16:24 2005 From: oldmoonster at gmail.com (Qin Li) Date: Fri, 05 Aug 2005 00:16:24 +0800 Subject: [Linux-cluster] many compiling warning. 
Message-ID: <42F23F58.4080606@gmail.com> Hi, I am trying to install cluster-1.0 on my Redhat/Fedora core 1 system, but the compiling is not clean. See following: make[3]: Entering directory `/usr/src/linux-2.6.12.2' Building modules, stage 2. MODPOST *** Warning: "kcl_addref_cluster" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_by_addr" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_addresses" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_releaseref_cluster" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_current_interface" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_by_nodeid" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_leave_service" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_remove_callback" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_global_service_id" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_unregister_service" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_join_service" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_start_done" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_add_callback" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_register_service" [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! What's wrong? Thanks, Q.L From amanthei at redhat.com Thu Aug 4 16:29:09 2005 From: amanthei at redhat.com (Adam Manthei) Date: Thu, 4 Aug 2005 11:29:09 -0500 Subject: [Linux-cluster] Fencing agents In-Reply-To: <42F1D28D.10006@veles.ru> References: <42F0B177.7050907@hugsmidjan.is> <20050803212949.GA3268@redhat.com> <42F1D28D.10006@veles.ru> Message-ID: <20050804162909.GD3268@redhat.com> On Thu, Aug 04, 2005 at 12:32:13PM +0400, Denis Medvedev wrote: > >>What does "automatic" fencing have to offer that the manual fencing lacks. > >>If we decide to buy the FC switch right away is it recomended that we > >>buy one of the ones that have fencing agent available for the > >>Cluster-Suite ? > >> > >> > > > >In this case, you already have a fencing agent (fence_drac) that works with > >the PE 1855 blades so there is no need for further fencing hardware (unless > >you are going to be connecting other machines to the cluster that aren't > >going to have any other form of fencing) > > > >The main advantage that "automatic" fencing gives you over manual fencing > >is > >that in the event that a fencing operation is required, your cluster can > >automatically recover (on the order of seconds to minutes) instead of > >waiting > >for user intervention (which can take minutes to hours to days depending on > >how attentive the admins are :). > > > > > > > "recover"? You mean reboot? In order for the filesystem to recover, an expired node must first be fenced. In this case, since DRAC is being used, it means that the node is probably rebooted. > But if a machine need fencing, doesn't that > mean that something is inherently wrong with that machine and simple > reboot would't cure that? Perhaps. Otherwise it might be as simple as a network hiccup that causes a node to miss enough heartbeats that result in a node getting fenced. 
If you want to leave the node in a state to debug it, then use a SAN based fencing setup, thus isolating the node for the cluster and keeping it's state intact for the admin to look at later... maybe (if the machine locks up too hard, you won't be able to get into it anyway). If you want to automate the recovery process, but still make sure that nodes that got fenced aren't automatically reintegrated into the cluster, you can use a power based fencing agent that just turns the machine off and doesn't attempt to power it back it on again. -- Adam Manthei From amanthei at redhat.com Thu Aug 4 16:59:34 2005 From: amanthei at redhat.com (Adam Manthei) Date: Thu, 4 Aug 2005 11:59:34 -0500 Subject: [Linux-cluster] Fencing agents In-Reply-To: <42F1F55F.1080708@hugsmidjan.is> References: <42F1F55F.1080708@hugsmidjan.is> Message-ID: <20050804165934.GE3268@redhat.com> On Thu, Aug 04, 2005 at 11:00:47AM +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote: > Well .. when I manually run the fence_drac.pl perl script and supply it > with the ip of the DRAC (-a 192.168.100.173) login name (-l root) > DRAC/MC module name (-m Server-1) and the password (-p dummypassword) > the machine in question (Server-1) powers down and doesn't power back on. Interesting... :( > How do I implement this in cluster.xml (specify the ip/login/pass/module > name) Typically, the parameters are suppose to be in the manpage for the agent. If they are not, then it should be considered a bug. I think this will work for, but I've not tested the bellow config, so it might not be error free :) >and shouldn't it power back up afterwards ? The default action is suppose to be "reboot" as in the machine should come back online. I don't know why it isn't. If you continue to have problems, try enabling the debugging output from the command line: fence_drac -a 192.168.100.173 -l root -p dummypassword -m Server-1 \ -D /tmp/drac.log -v Keep us posted. -Adam > JACOB_LIBERMAN at Dell.com wrote: > >The 1855 has a built in ERA controller. You can modify the fencing agents > >to either send "racadm serveraction powercycle" or install the PERL telent > >module and create your own fencing script. The former option requires that > >the rac management software be installed on the host. I havent tested this > >with the 1855 btw. > > > >http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster > > > >The fence_drac agent out on the CVS should work for you. If you cant get > >it working, let me know, and ill see if I can dig up an 1855 in the lab. > > > >Thanks, jacob > > > > > >>-----Original Message----- > >>From: linux-cluster-bounces at redhat.com > >>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of > >>"S?valdur Arnar Gunnarsson [Hugsmi?jan]" > >>Sent: Wednesday, August 03, 2005 6:59 AM > >>To: linux-cluster at redhat.com > >>Subject: [Linux-cluster] Fencing agents > >> > >>I'm implementing a shared storage between multiple (2 at the > >>moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES > >>connected to a EMC AX100 through FC. > >> > >>The SAN has two FC ports so the need for a FC Switch has not > >>yet come however we will add other Blades in the coming months. > >>The one thing I haven't got figured out with GFS and the > >>Cluster-Suite is the whole idea about fencing. 
> >> > >>We have a working setup using Centos rebuilds of the > >>Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which > >>we are not planning to use in the final implementation where > >>we plan to use the official GFS packages from Red Hat. > >>The fencing agents in that setup is manual fencing. > >> > >>Both machines have the file system mounted and there appears > >>to be no problems. > >> > >>What does "automatic" fencing have to offer that the manual > >>fencing lacks. > >>If we decide to buy the FC switch right away is it recomended > >>that we buy one of the ones that have fencing agent available > >>for the Cluster-Suite ? > >> > >>If can't get our hands on supported FC switchs can we do > >>fencing in another manner than throught a FC switch ? > >> > >> > >> > >> > >>-- > >>S?valdur Gunnarsson :: Hugsmi?jan > >> > >>-- > >>Linux-cluster mailing list > >>Linux-cluster at redhat.com > >>http://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > > >-- > >Linux-cluster mailing list > >Linux-cluster at redhat.com > >http://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > S?valdur Gunnarsson :: Hugsmi?jan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From cfeist at redhat.com Thu Aug 4 18:54:39 2005 From: cfeist at redhat.com (Chris Feist) Date: Thu, 04 Aug 2005 13:54:39 -0500 Subject: [Linux-cluster] many compiling warning. In-Reply-To: <42F23F58.4080606@gmail.com> References: <42F23F58.4080606@gmail.com> Message-ID: <42F2646F.4080703@redhat.com> The problem is caused by versioning done during the kernel build. Because dlm.ko uses symbols from cman.ko it expects to know about those in the Module.symvers file in the kernel build directory. But, since we don't modify that file and add the cman.ko symbols when we build cman-kernel dlm.ko can't find the cman.ko symbols when it builds. It's not a big deal and the modules will load fine even with those warnings. Thanks, Chris Qin Li wrote: > Hi, > > I am trying to install cluster-1.0 on my Redhat/Fedora core 1 system, > but the compiling is not clean. > See following: > > make[3]: Entering directory `/usr/src/linux-2.6.12.2' > Building modules, stage 2. > MODPOST > *** Warning: "kcl_addref_cluster" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_get_node_by_addr" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_get_node_addresses" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_releaseref_cluster" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_get_current_interface" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_get_node_by_nodeid" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_leave_service" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_remove_callback" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_global_service_id" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_unregister_service" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_join_service" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_start_done" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! 
> *** Warning: "kcl_add_callback" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > *** Warning: "kcl_register_service" > [/usr/src/cluster-1.00.00/dlm-kernel/src/dlm.ko] undefined! > > What's wrong? > > Thanks, > > Q.L > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From addi at hugsmidjan.is Thu Aug 4 20:22:29 2005 From: addi at hugsmidjan.is (=?ISO-8859-1?Q?=22S=E6valdur_Arnar_Gunnarsson_=5BHugsmi=F0jan?= =?ISO-8859-1?Q?=5D=22?=) Date: Thu, 04 Aug 2005 20:22:29 +0000 Subject: [Linux-cluster] Fencing agents In-Reply-To: <20050804165934.GE3268@redhat.com> References: <42F1F55F.1080708@hugsmidjan.is> <20050804165934.GE3268@redhat.com> Message-ID: <42F27905.7010908@hugsmidjan.is> I'm sorry, that was my mistake, the machine does in fact power back up. Adam Manthei wrote: > On Thu, Aug 04, 2005 at 11:00:47AM +0000, "S?valdur Arnar Gunnarsson [Hugsmi?jan]" wrote: > >>Well .. when I manually run the fence_drac.pl perl script and supply it >>with the ip of the DRAC (-a 192.168.100.173) login name (-l root) >>DRAC/MC module name (-m Server-1) and the password (-p dummypassword) >>the machine in question (Server-1) powers down and doesn't power back on. > > > Interesting... :( > > >>How do I implement this in cluster.xml (specify the ip/login/pass/module >>name) > > > Typically, the parameters are suppose to be in the manpage for the agent. > If they are not, then it should be considered a bug. I think this will work > for, but I've not tested the bellow config, so it might not be error free :) > > > agent="fence_drac" > login="root" > passwd="dummypassword" > ipaddr="192.168.100.173" > action="reboot" /> > > > > > > > > > > >>and shouldn't it power back up afterwards ? > > > The default action is suppose to be "reboot" as in the machine should come > back online. I don't know why it isn't. If you continue to have problems, > try enabling the debugging output from the command line: > > fence_drac -a 192.168.100.173 -l root -p dummypassword -m Server-1 \ > -D /tmp/drac.log -v > > Keep us posted. > -Adam > > >>JACOB_LIBERMAN at Dell.com wrote: >> >>>The 1855 has a built in ERA controller. You can modify the fencing agents >>>to either send "racadm serveraction powercycle" or install the PERL telent >>>module and create your own fencing script. The former option requires that >>>the rac management software be installed on the host. I havent tested this >>>with the 1855 btw. >>> >>>http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/drac/?cvsroot=cluster >>> >>>The fence_drac agent out on the CVS should work for you. If you cant get >>>it working, let me know, and ill see if I can dig up an 1855 in the lab. >>> >>>Thanks, jacob >>> >>> >>> >>>>-----Original Message----- >>>>From: linux-cluster-bounces at redhat.com >>>>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of >>>>"S?valdur Arnar Gunnarsson [Hugsmi?jan]" >>>>Sent: Wednesday, August 03, 2005 6:59 AM >>>>To: linux-cluster at redhat.com >>>>Subject: [Linux-cluster] Fencing agents >>>> >>>>I'm implementing a shared storage between multiple (2 at the >>>>moment) Blade machines (Dell PowerEdge 1855) running RHEL4 ES >>>>connected to a EMC AX100 through FC. >>>> >>>>The SAN has two FC ports so the need for a FC Switch has not >>>>yet come however we will add other Blades in the coming months. >>>>The one thing I haven't got figured out with GFS and the >>>>Cluster-Suite is the whole idea about fencing. 
>>>> >>>>We have a working setup using Centos rebuilds of the >>>>Cluster-Suite and GFS (http://rpm.karan.org/el4/csgfs/) which >>>>we are not planning to use in the final implementation where >>>>we plan to use the official GFS packages from Red Hat. >>>>The fencing agents in that setup is manual fencing. >>>> >>>>Both machines have the file system mounted and there appears >>>>to be no problems. >>>> >>>>What does "automatic" fencing have to offer that the manual >>>>fencing lacks. >>>>If we decide to buy the FC switch right away is it recomended >>>>that we buy one of the ones that have fencing agent available >>>>for the Cluster-Suite ? >>>> >>>>If can't get our hands on supported FC switchs can we do >>>>fencing in another manner than throught a FC switch ? >>>> >>>> >>>> >>>> >>>>-- >>>>S?valdur Gunnarsson :: Hugsmi?jan >>>> >>>>-- >>>>Linux-cluster mailing list >>>>Linux-cluster at redhat.com >>>>http://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> >>>-- >>>Linux-cluster mailing list >>>Linux-cluster at redhat.com >>>http://www.redhat.com/mailman/listinfo/linux-cluster >> >> >>-- >>S?valdur Gunnarsson :: Hugsmi?jan >> >>-- >>Linux-cluster mailing list >>Linux-cluster at redhat.com >>http://www.redhat.com/mailman/listinfo/linux-cluster > > -- S?valdur Gunnarsson :: Hugsmi?jan From brianu at silvercash.com Thu Aug 4 23:01:44 2005 From: brianu at silvercash.com (brianu) Date: Thu, 4 Aug 2005 16:01:44 -0700 Subject: [Linux-cluster] values for gnbd multipath dmsetup Message-ID: <20050804230208.A0D195A86A7@mail.silvercash.com> Hello all, We would like to use GFS and GNBD for a SAN setup, but I am having trouble setting up the multipath. Currently I have three GNBD servers mounting the storage and exporting the volumes, I read the information in the post from the previous thread but I am having trouble finding the correct values to echo into Dmsetup, example from the thread below: https://www.redhat.com/archives/linux-cluster/2005-April/msg00062.html I found references for > echo "0 167772160 multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 1000 " | dmsetup create dm0 > (251:0 ist the major:minor id of /dev/gnbd0) I can see the major:minor blockid and the size from /sys/block/gnbd0 & gnbd1 so my attempt looks as so: > echo "0 2293981184 multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 1000 " | dmsetup create dm0 > (251:0 ist the major:minor id of /dev/gnbd0) Which gives me an error of: Device-mapper ioctl cmd 9 failed: Invalid argument What am I doing wrong? Also the vaules from above specifically the " 0 0 1 1 & 0 2 1" & "1000" from the previous post can someone clarify where these are coming from? We are using a vanilla kernel-2.6.12 with from kernel.org with the cluster software from CVS stabile for 2.6.12, from the path provided at sources.redhat.com/cluster. Kernel Params: CONFIG_DM_MULTIPATH=m OS=CENTOS4 Brian Urrutia System Administrator Price Communications Inc. -------------- next part -------------- An HTML attachment was scrubbed... URL: From oldmoonster at gmail.com Fri Aug 5 02:28:29 2005 From: oldmoonster at gmail.com (Q.L) Date: Fri, 5 Aug 2005 10:28:29 +0800 Subject: [Linux-cluster] Can't build on a RH9.0 system with cluster-1.00.00.tar.gz Message-ID: <359782e705080419284d07b58c@mail.gmail.com> Hi, I build the cluster-1.00.00 againt kernel-2.6.12.2, but failed, could you help to explain what's wrong happen? 
make[2]: Leaving directory `/home/share/cluster-1.00.00/ccs/lib' cd ccs_test && make install make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_test' install -d /home/share/cluster-1.00.00/build/sbin install ccs_test /home/share/cluster-1.00.00/build/sbin make[2]: Leaving directory `/home/share/cluster-1.00.00/ccs/ccs_test' cd ccs_tool && make install make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_tool' gcc -Wall -I. -I../config -I../include -I../lib -I/home/share/cluster-1.00.00/build/incdir -Wall -O2 -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags` -DCCS_RELEASE_NAME=\"1.00.00\" -I. -I../config -I../include -I../lib -I/home/share/cluster-1.00.00/build/incdir -o ccs_tool ccs_tool.c update.c upgrade.c old_parser.c editconf.c -L../lib `xml2-config --libs` -L/home/share/cluster-1.00.00/build/lib -lccs -lmagma -lmagmamsg -ldl /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference to `pthread_rwlock_rdlock' /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference to `pthread_rwlock_unlock' /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference to `pthread_rwlock_wrlock' collect2: ld returned 1 exit status make[2]: *** [ccs_tool] Error 1 make[2]: Leaving directory `/home/share/cluster-1.00.00/ccs/ccs_tool' make[1]: *** [install] Error 2 make[1]: Leaving directory `/home/share/cluster-1.00.00/ccs' make: *** [all] Error 2 [root at localhost cluster-1.00.00]# From naoki at valuecommerce.com Fri Aug 5 02:44:05 2005 From: naoki at valuecommerce.com (Naoki) Date: Fri, 05 Aug 2005 11:44:05 +0900 Subject: [Linux-cluster] RH exporting local disks as LUNs. Message-ID: <1123209846.17379.18.camel@dragon.sys.intra> RH / Fedora can export devices as iSCSI. This question may show my total ignorance but I'm not above that ;) Is there anything stopping me from exporting a device or a volume as a raw SCSI or FC-AL LUN for example. Could a linux box be made to be (act like) a SCSI or FC disk? The idea is this.. Hook up a couple of boxes with plenty of internal disk (as a raid 5 set) to an FC switch. Export the two LUNs and then on a client server, LVM mirror the sets. I could then put NFS (for example) on the client server. If there is a limitation here is it driver, hardware, or both? From fabbione at fabbione.net Fri Aug 5 05:06:26 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Fri, 05 Aug 2005 07:06:26 +0200 Subject: [PATCH] fix Re: [Linux-cluster] Compiling error against kernel-2.6.12.2 In-Reply-To: <359782e705080319293993cbf1@mail.gmail.com> References: <359782e705080319293993cbf1@mail.gmail.com> Message-ID: <42F2F3D2.7080900@fabbione.net> Q.L wrote: > Hi, > > When I began to "make" on the RH9.0, I can't pass following errors, > could you help me? > however, it seems no compiling problem happen on a host with FC1.0. > further more, what's the special config required in kernel .config > file for GFS cluster? > > Thanks. > > Q.L > > cd ccs_tool && make install > make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_tool' > gcc -Wall -I. -I../config -I../include -I../lib > -I/home/share/cluster-1.00.00/build/incdir -Wall -O2 > -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags` > -DCCS_RELEASE_NAME=\"1.00.00\" -I. 
-I../config -I../include -I../lib > -I/home/share/cluster-1.00.00/build/incdir -o ccs_tool ccs_tool.c > update.c upgrade.c old_parser.c editconf.c -L../lib `xml2-config > --libs` -L/home/share/cluster-1.00.00/build/lib -lccs -lmagma > -lmagmamsg -ldl > /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference > to `pthread_rwlock_rdlock' > /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference > to `pthread_rwlock_unlock' > /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference > to `pthread_rwlock_wrlock' > It appears also in Ubuntu, but i am not sure why or what did lost a link to pthred. The patch in attachment fix the problem. Fabio -- no signature file found. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ccs_tool_and_pthread_love.dpatch URL: From teigland at redhat.com Fri Aug 5 07:14:15 2005 From: teigland at redhat.com (David Teigland) Date: Fri, 5 Aug 2005 15:14:15 +0800 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <1122968724.3247.22.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> Message-ID: <20050805071415.GC14880@redhat.com> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote: > * +static const uint32_t crc_32_tab[] = ..... > why do you duplicate this? The kernel has a perfectly good set of > generic crc32 tables/functions just fine The gfs2_disk_hash() function and the crc table on which it's based are a part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit unusual since gfs uses a hash table on-disk for its directory structure. This header, including the hash function/table, must be included by user space programs like fsck that want to decipher a fs, and any change to the function or table would effectively make the fs corrupted. Because of this I think it's best for gfs to keep it's own copy as part of its ondisk format spec. > * Why are you using bufferheads extensively in a new filesystem? bh's are used for metadata, the log, and journaled data which need to be written at the block granularity, not page. > why do you use a rwsem and not a regular semaphore? You are aware that > rwsems are far more expensive than regular ones right? How skewed is > the read/write ratio? Aware, yes, it's the only rwsem in gfs. Specific skew, no, we'll have to measure that. > * +++ b/fs/gfs2/fixed_div64.h 2005-08-01 14:13:08.009808200 +0800 > ehhhh why? I'm not sure, actually, apart from the comments: do_div: /* For ia32 we need to pull some tricks to get past various versions of the compiler which do not like us using do_div in the middle of large functions. */ do_mod: /* Side effect free 64 bit mod operation */ fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing, perhaps this is an old problem that's now fixed? > * int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset, > + unsigned int size) > +{ > + int error; > + > + if (bh) > + error = copy_to_user(*buf, bh->b_data + offset, size); > + else > + error = clear_user(*buf, size); > > that looks to be missing a few kmaps.. whats the guarantee that b_data > is actually, like in lowmem? This is only used in the specific case of reading a journaled-data file. That seems to effectively be the same as reading a buffer of fs metadata. > The diaper device is a block device within gfs that gets transparently > inserted between the real device the and rest of the filesystem. 
> > hmmmm why not use device mapper or something? Is this really needed? This is needed for the "withdraw" feature (described in the comment) which is fairly important. We'll see if dm could be used instead. Thanks, Dave From mchristi at redhat.com Fri Aug 5 07:27:08 2005 From: mchristi at redhat.com (Mike Christie) Date: Fri, 05 Aug 2005 02:27:08 -0500 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050805071415.GC14880@redhat.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050805071415.GC14880@redhat.com> Message-ID: <42F314CC.4000309@redhat.com> David Teigland wrote: > On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote: > >>* Why are you using bufferheads extensively in a new filesystem? > > > bh's are used for metadata, the log, and journaled data which need to be > written at the block granularity, not page. > In a scsi tree http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary there is a function, bio_map_kern(), in fs.c that maps a buffer into a bio. It does not have to be page granularity. Can something like that be used in these places? From mchristi at redhat.com Fri Aug 5 07:30:41 2005 From: mchristi at redhat.com (Mike Christie) Date: Fri, 05 Aug 2005 02:30:41 -0500 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <42F314CC.4000309@redhat.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050805071415.GC14880@redhat.com> <42F314CC.4000309@redhat.com> Message-ID: <42F315A1.7050408@redhat.com> Mike Christie wrote: > David Teigland wrote: > >>On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote: >> >> >>>* Why are you using bufferheads extensively in a new filesystem? >> >> >>bh's are used for metadata, the log, and journaled data which need to be >>written at the block granularity, not page. >> > > > In a scsi tree > http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary oh yeah it is in -mm too. > there is a function, bio_map_kern(), in fs.c that maps a buffer into a > bio. It does not have to be page granularity. Can something like that be > used in these places? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From arjan at infradead.org Fri Aug 5 07:34:38 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Fri, 05 Aug 2005 09:34:38 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050805071415.GC14880@redhat.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050805071415.GC14880@redhat.com> Message-ID: <1123227279.3239.1.camel@laptopd505.fenrus.org> On Fri, 2005-08-05 at 15:14 +0800, David Teigland wrote: > On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote: > > > * +static const uint32_t crc_32_tab[] = ..... > > why do you duplicate this? The kernel has a perfectly good set of > > generic crc32 tables/functions just fine > > The gfs2_disk_hash() function and the crc table on which it's based are a > part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit > unusual since gfs uses a hash table on-disk for its directory structure. > This header, including the hash function/table, must be included by user > space programs like fsck that want to decipher a fs, and any change to the > function or table would effectively make the fs corrupted. 
Because of > this I think it's best for gfs to keep it's own copy as part of its ondisk > format spec. for userspace there's libcrc32 as well. If it's *the* bog standard crc32 I don't see a reason why your "spec" can't just reference that instead. And esp in the kernel you should just use the in kernel one not your own regardless; you can assume the in kernel one is optimized and it also keeps size down. From mdl at veles.ru Fri Aug 5 08:00:29 2005 From: mdl at veles.ru (Denis Medvedev) Date: Fri, 05 Aug 2005 12:00:29 +0400 Subject: [Linux-cluster] RH exporting local disks as LUNs. In-Reply-To: <1123209846.17379.18.camel@dragon.sys.intra> References: <1123209846.17379.18.camel@dragon.sys.intra> Message-ID: <42F31C9D.8090601@veles.ru> Naoki ?????: >RH / Fedora can export devices as iSCSI. > >This question may show my total ignorance but I'm not above that ;) > >Is there anything stopping me from exporting a device or a volume as a >raw SCSI or FC-AL LUN for example. Could a linux box be made to be (act >like) a SCSI or FC disk? > >The idea is this.. Hook up a couple of boxes with plenty of internal >disk (as a raid 5 set) to an FC switch. Export the two LUNs and then on >a client server, LVM mirror the sets. > >I could then put NFS (for example) on the client server. > >If there is a limitation here is it driver, hardware, or both? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >http://www.redhat.com/mailman/listinfo/linux-cluster > > > Yes, you can. Probably a single point of failure in this will be the NFS server and also the FC switch. And why in this case not to put the storage directly to the NFS server? From jengelh at linux01.gwdg.de Fri Aug 5 08:28:13 2005 From: jengelh at linux01.gwdg.de (Jan Engelhardt) Date: Fri, 5 Aug 2005 10:28:13 +0200 (MEST) Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050805071415.GC14880@redhat.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050805071415.GC14880@redhat.com> Message-ID: >The gfs2_disk_hash() function and the crc table on which it's based are a >part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit >unusual since gfs uses a hash table on-disk for its directory structure. >This header, including the hash function/table, must be included by user >space programs like fsck that want to decipher a fs, and any change to the >function or table would effectively make the fs corrupted. Because of >this I think it's best for gfs to keep it's own copy as part of its ondisk >format spec. Tune the spec to use kernel and libcrc32 tables and bump the version number of the spec to e.g. GFS 2.1. That way, things transform smoothly and could go out eventually at some later date. Jan Engelhardt -- From arjan at infradead.org Fri Aug 5 08:34:32 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Fri, 05 Aug 2005 10:34:32 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050805071415.GC14880@redhat.com> Message-ID: <1123230872.3239.20.camel@laptopd505.fenrus.org> On Fri, 2005-08-05 at 10:28 +0200, Jan Engelhardt wrote: > >The gfs2_disk_hash() function and the crc table on which it's based are a > >part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit > >unusual since gfs uses a hash table on-disk for its directory structure. 
> >This header, including the hash function/table, must be included by user > >space programs like fsck that want to decipher a fs, and any change to the > >function or table would effectively make the fs corrupted. Because of > >this I think it's best for gfs to keep it's own copy as part of its ondisk > >format spec. > > Tune the spec to use kernel and libcrc32 tables and bump the version number of > the spec to e.g. GFS 2.1. That way, things transform smoothly and could go out > eventually at some later date. afaik the tables aren't actually different. So no need to bump the spec! From mikore.li at gmail.com Fri Aug 5 08:45:32 2005 From: mikore.li at gmail.com (Michael) Date: Fri, 5 Aug 2005 16:45:32 +0800 Subject: [Linux-cluster] gfs2 Vs gfs Message-ID: Hi, Does that means redhat has already turn to development of gfs2 and only bugfix for gfs? Where is the brief of the new features in upcoming gfs2? when will the first release achieve? Is there 1 or 5 years roadmap for GFS development? Thanks, Q.L From teigland at redhat.com Fri Aug 5 09:44:52 2005 From: teigland at redhat.com (David Teigland) Date: Fri, 5 Aug 2005 17:44:52 +0800 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <1123227279.3239.1.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050805071415.GC14880@redhat.com> <1123227279.3239.1.camel@laptopd505.fenrus.org> Message-ID: <20050805094452.GD14880@redhat.com> On Fri, Aug 05, 2005 at 09:34:38AM +0200, Arjan van de Ven wrote: > On Fri, 2005-08-05 at 15:14 +0800, David Teigland wrote: > > On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote: > > > > > * +static const uint32_t crc_32_tab[] = ..... > > > why do you duplicate this? The kernel has a perfectly good set of > > > generic crc32 tables/functions just fine > > > > The gfs2_disk_hash() function and the crc table on which it's based are a > > part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit > > unusual since gfs uses a hash table on-disk for its directory structure. > > This header, including the hash function/table, must be included by user > > space programs like fsck that want to decipher a fs, and any change to the > > function or table would effectively make the fs corrupted. Because of > > this I think it's best for gfs to keep it's own copy as part of its ondisk > > format spec. > > for userspace there's libcrc32 as well. If it's *the* bog standard crc32 > I don't see a reason why your "spec" can't just reference that instead. > And esp in the kernel you should just use the in kernel one not your own > regardless; you can assume the in kernel one is optimized and it also > keeps size down. linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[]. This looks like a standard that's not going to change, as you've said, so including crc32table.h and getting rid of our own table would work fine. Do we go a step beyond this and use say the crc32() function from linux/crc32.h? Is this _function_ as standard and unchanging as the table of crcs? In my tests it doesn't produce the same results as our gfs2_disk_hash() function, even with both using the same crc table. I don't mind adopting a new function and just writing a user space equivalent for the tools if it's a fixed standard. 
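For what it's worth, the difference is probably just the calling convention rather than the table; a quick sketch of the two common ways crc32_le() gets seeded (untested here; data and len stand for whatever buffer is being hashed):

    #include <linux/crc32.h>

    u32 a = crc32_le(0, data, len);            /* zero seed, result used as-is */
    u32 b = crc32_le(~0, data, len) ^ ~0;      /* all-ones seed, result inverted */

If gfs2_disk_hash() lines up with one of these, the private table can go and the in-kernel helper be called directly.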
Dave From teigland at redhat.com Fri Aug 5 10:31:38 2005 From: teigland at redhat.com (David Teigland) Date: Fri, 5 Aug 2005 18:31:38 +0800 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050805100750.GA9818@wohnheim.fh-wedel.de> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050805071415.GC14880@redhat.com> <1123227279.3239.1.camel@laptopd505.fenrus.org> <20050805094452.GD14880@redhat.com> <20050805100750.GA9818@wohnheim.fh-wedel.de> Message-ID: <20050805103138.GE14880@redhat.com> On Fri, Aug 05, 2005 at 12:07:50PM +0200, J?rn Engel wrote: > On Fri, 5 August 2005 17:44:52 +0800, David Teigland wrote: > > Do we go a step beyond this and use say the crc32() function from > > linux/crc32.h? Is this _function_ as standard and unchanging as the table > > of crcs? In my tests it doesn't produce the same results as our > > gfs2_disk_hash() function, even with both using the same crc table. I > > don't mind adopting a new function and just writing a user space > > equivalent for the tools if it's a fixed standard. > > The function is basically set in stone. Variants exists depending on > how it is called. I know of four variants, but there may be more: > > 1. Initial value is 0 > 2. Initial value is 0xffffffff > a) Result is taken as-is > b) Result is XORed with 0xffffffff > > Maybe your code implements 1a, while you tried 2b with the lib/crc32.c > function or something similar? You're right, initial value 0xffffffff and xor result with 0xffffffff matches the results from our function. Great, we can get rid of gfs2_disk_hash() and use crc32() directly. Thanks, Dave From natecars at natecarlson.com Fri Aug 5 13:21:18 2005 From: natecars at natecarlson.com (Nate Carlson) Date: Fri, 5 Aug 2005 08:21:18 -0500 (CDT) Subject: [Linux-cluster] RH exporting local disks as LUNs. In-Reply-To: <42F31C9D.8090601@veles.ru> References: <1123209846.17379.18.camel@dragon.sys.intra> <42F31C9D.8090601@veles.ru> Message-ID: On Fri, 5 Aug 2005, Denis Medvedev wrote: >> This question may show my total ignorance but I'm not above that ;) >> Is there anything stopping me from exporting a device or a volume as a >> raw SCSI or FC-AL LUN for example. Could a linux box be made to be (act >> like) a SCSI or FC disk? > > Yes, you can. Nifty! I was actually looking at this the other day, and couldn't figure out a way to do it. Do you happen to have any documentation? ------------------------------------------------------------------------ | nate carlson | natecars at natecarlson.com | http://www.natecarlson.com | | depriving some poor village of its idiot since 1981 | ------------------------------------------------------------------------ From mikore.li at gmail.com Fri Aug 5 05:22:30 2005 From: mikore.li at gmail.com (Mikore Li) Date: Fri, 5 Aug 2005 13:22:30 +0800 Subject: [PATCH] fix Re: [Linux-cluster] Compiling error against kernel-2.6.12.2 In-Reply-To: <42F2F3D2.7080900@fabbione.net> References: <359782e705080319293993cbf1@mail.gmail.com> <42F2F3D2.7080900@fabbione.net> Message-ID: Yes,Yes! It works! Thanks, Q.L On 8/5/05, Fabio Massimo Di Nitto wrote: > Q.L wrote: > > Hi, > > > > When I began to "make" on the RH9.0, I can't pass following errors, > > could you help me? > > however, it seems no compiling problem happen on a host with FC1.0. > > further more, what's the special config required in kernel .config > > file for GFS cluster? > > > > Thanks. 
> > > > Q.L > > > > cd ccs_tool && make install > > make[2]: Entering directory `/home/share/cluster-1.00.00/ccs/ccs_tool' > > gcc -Wall -I. -I../config -I../include -I../lib > > -I/home/share/cluster-1.00.00/build/incdir -Wall -O2 > > -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE `xml2-config --cflags` > > -DCCS_RELEASE_NAME=\"1.00.00\" -I. -I../config -I../include -I../lib > > -I/home/share/cluster-1.00.00/build/incdir -o ccs_tool ccs_tool.c > > update.c upgrade.c old_parser.c editconf.c -L../lib `xml2-config > > --libs` -L/home/share/cluster-1.00.00/build/lib -lccs -lmagma > > -lmagmamsg -ldl > > /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference > > to `pthread_rwlock_rdlock' > > /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference > > to `pthread_rwlock_unlock' > > /home/share/cluster-1.00.00/build/lib/libmagma.so: undefined reference > > to `pthread_rwlock_wrlock' > > > > It appears also in Ubuntu, but i am not sure why or what did lost a link to > pthred. The patch in attachment fix the problem. > > Fabio > > -- > no signature file found. > > > #! /bin/sh /usr/share/dpatch/dpatch-run > ## ccs_tool_and_pthread_love.dpatch by > ## > ## All lines beginning with `## DP:' are a description of the patch. > ## DP: No description. > > @DPATCH@ > diff -urNad --exclude=CVS --exclude=.svn ./ccs/ccs_tool/Makefile /usr/src/dpatchtemp/dpep-work.OlU9tL/redhat-cluster-suite-1.20050729/ccs/ccs_tool/Makefile > --- ./ccs/ccs_tool/Makefile 2005-07-29 06:48:35.000000000 +0200 > +++ /usr/src/dpatchtemp/dpep-work.OlU9tL/redhat-cluster-suite-1.20050729/ccs/ccs_tool/Makefile 2005-07-29 07:00:37.000000000 +0200 > @@ -25,7 +25,7 @@ > endif > > LDFLAGS+= -L${ccs_libdir} `xml2-config --libs` -L${libdir} > -LOADLIBES+= -lccs -lmagma -lmagmamsg -ldl > +LOADLIBES+= -lccs -lmagma -lmagmamsg -ldl -lpthread > > all: ccs_tool > > > > From joern at wohnheim.fh-wedel.de Fri Aug 5 10:07:50 2005 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Fri, 5 Aug 2005 12:07:50 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050805094452.GD14880@redhat.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050805071415.GC14880@redhat.com> <1123227279.3239.1.camel@laptopd505.fenrus.org> <20050805094452.GD14880@redhat.com> Message-ID: <20050805100750.GA9818@wohnheim.fh-wedel.de> On Fri, 5 August 2005 17:44:52 +0800, David Teigland wrote: > > linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[]. > This looks like a standard that's not going to change, as you've said, so > including crc32table.h and getting rid of our own table would work fine. > > Do we go a step beyond this and use say the crc32() function from > linux/crc32.h? Is this _function_ as standard and unchanging as the table > of crcs? In my tests it doesn't produce the same results as our > gfs2_disk_hash() function, even with both using the same crc table. I > don't mind adopting a new function and just writing a user space > equivalent for the tools if it's a fixed standard. The function is basically set in stone. Variants exists depending on how it is called. I know of four variants, but there may be more: 1. Initial value is 0 2. Initial value is 0xffffffff a) Result is taken as-is b) Result is XORed with 0xffffffff Maybe your code implements 1a, while you tried 2b with the lib/crc32.c function or something similar? J?rn -- And spam is a useful source of entropy for /dev/random too! 
-- Jasmine Strong From javipolo at datagrama.net Fri Aug 5 15:02:05 2005 From: javipolo at datagrama.net (Javi Polo) Date: Fri, 5 Aug 2005 17:02:05 +0200 Subject: [Linux-cluster] problem with fencing Message-ID: <20050805150205.GA13010@gibson.drslump.org> Hi there I'm trying to set up gfs for work with a SAN ... and I want to use a script for fencing, instead of fence_manual, but it doesnt works :/ to try that, I do a "ifconfig eth0 down" in gfstest2, and gfstest1's syslog says: Aug 5 16:51:13 gfstest1 fenced: gfstest2 not a cluster member after 0 sec post_fail_delay Aug 5 16:51:13 gfstest1 fenced: fencing node "gfstest2" Aug 5 16:51:13 gfstest1 fence_manual: Node 192.168.1.2 needs to be reset before recovery can procede. Waiting for 192.168.1.2 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 192.168.1.2) I want it to be automatic, and I modified fence_sanbox2.pl so it fits our switch commands. (I attached it on another mail some days ago) the script works fine if I run it manually: gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 portDisable 4 success: disable 4 gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 -o enable portEnable 4 success: enable 4 gfstest1:~# could anybody give me a hint? I'm using lock_dlm this is my cluster.conf: -- Javier Polo @ Datagrama 902 136 126 From amanthei at redhat.com Fri Aug 5 15:21:02 2005 From: amanthei at redhat.com (Adam Manthei) Date: Fri, 5 Aug 2005 10:21:02 -0500 Subject: [Linux-cluster] problem with fencing In-Reply-To: <20050805150205.GA13010@gibson.drslump.org> References: <20050805150205.GA13010@gibson.drslump.org> Message-ID: <20050805152102.GE7385@redhat.com> On Fri, Aug 05, 2005 at 05:02:05PM +0200, Javi Polo wrote: > Hi there > > I'm trying to set up gfs for work with a SAN ... and I want to use a > script for fencing, instead of fence_manual, but it doesnt works :/ > > to try that, I do a "ifconfig eth0 down" in gfstest2, and gfstest1's syslog says: > Aug 5 16:51:13 gfstest1 fenced: gfstest2 not a cluster member after 0 sec post_fail_delay > Aug 5 16:51:13 gfstest1 fenced: fencing node "gfstest2" > Aug 5 16:51:13 gfstest1 fence_manual: Node 192.168.1.2 needs to be reset before recovery can procede. Waiting for 192.168.1.2 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 192.168.1.2) > > I want it to be automatic, and I modified fence_sanbox2.pl so it fits > our switch commands. (I attached it on another mail some days ago) > > the script works fine if I run it manually: > gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 > portDisable 4 > success: disable 4 > gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 -o enable > portEnable 4 > success: enable 4 > gfstest1:~# > > could anybody give me a hint? > I'm using lock_dlm Did you update the cluster.conf file across all the nodes? Could it be that gfstest1 still has the old cluster.conf file? That might account for the manual fencing being run. Another way that you are going to run into manual fencing using this configuration is if the first method ("san") fails the second method ("single") will be called. What's odd about that is that there should still be something in the logs listing the output of the first command. I would hope that there would also be an error in the logs in the even that you forgot to but "fence_IBM" in the path or make it executable. I'd consider it a bug if that wasn't the case. 
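(For reference, that fallback comes straight from the order of the <method> blocks. The general shape looks roughly like the following -- values lifted from the commands in this thread, not from the actual config, and parameter names should be checked against the agent man pages:

    <fencedevices>
            <fencedevice name="ibmswitch" agent="fence_IBMswitch"
                         ipaddr="10.1.1.1" login="admin" passwd="tangerine"/>
            <fencedevice name="human" agent="fence_manual"/>
    </fencedevices>

and per <clusternode>:

    <fence>
            <method name="san">
                    <device name="ibmswitch" port="4"/>
            </method>
            <method name="single">
                    <device name="human" ipaddr="192.168.1.2"/>
            </method>
    </fence>

fenced only falls through to "single" after everything in "san" has failed, and the agent= value has to match the script name as installed in the path.)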
Lastly, I've not looked too closely at your script for fence_IBMswitch (I think that's what you called it in the previous email... did you rename it to fence_IBM?), but success and failure is not determined by fenced on the basis of the output, but on the exit status of the agent. If the agent returns 0, then it succeeds, otherwise it's a failed fencing operation. This might explain why the second method is being called, but it wouldn't explain why there is no output in the logs from the first. > this is my cluster.conf: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > Javier Polo @ Datagrama > 902 136 126 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From amanthei at redhat.com Fri Aug 5 15:28:17 2005 From: amanthei at redhat.com (Adam Manthei) Date: Fri, 5 Aug 2005 10:28:17 -0500 Subject: [Linux-cluster] problem with fencing In-Reply-To: <20050805150205.GA13010@gibson.drslump.org> References: <20050805150205.GA13010@gibson.drslump.org> Message-ID: <20050805152817.GF7385@redhat.com> On Fri, Aug 05, 2005 at 05:02:05PM +0200, Javi Polo wrote: > could anybody give me a hint? how about this? looks like there is a typo in you cluster.conf your test command... > gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 -o enable ^^^^^^^^^^^^^^^ your cluster.conf... > login="admin" passwd="tangerine"/> -- Adam Manthei From javipolo at datagrama.net Fri Aug 5 15:46:57 2005 From: javipolo at datagrama.net (Javi Polo) Date: Fri, 5 Aug 2005 17:46:57 +0200 Subject: [Linux-cluster] problem with fencing In-Reply-To: <20050805152817.GF7385@redhat.com> References: <20050805150205.GA13010@gibson.drslump.org> <20050805152817.GF7385@redhat.com> Message-ID: <20050805154657.GA14355@gibson.drslump.org> On Aug/05/2005, Adam Manthei wrote: > > could anybody give me a hint? > how about this? looks like there is a typo in you cluster.conf > > gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 -o enable > ^^^^^^^^^^^^^^^ > > ^^^^^^^^^ yipes ... that was a big and stupid mistake :/ thanks for helping me see it (i've spent lot of time today trying to debug this ... 
sigh) anyway, now I cannot seem to even join the fence cluster with the machine I'm doing the testing and so ..: gfstest2:~# fence_tool -w -D join fence_tool: wait for quorum 1 fence_tool: get our node name fence_tool: connect to ccs fence_tool: start fenced fenced: 1123256223 our name from cman "gfstest2" fenced: 1123256223 delay post_join 6s post_fail 0s fenced: 1123256223 added 3 nodes from ccs ^C gfstest2:~# fence_tool -w -D -c join fence_tool: wait for quorum 1 fence_tool: get our node name fence_tool: connect to ccs fence_tool: start fenced fenced: 1123256379 our name from cman "gfstest2" fenced: 1123256379 delay post_join 6s post_fail 0s fenced: 1123256379 clean start, skipping initial nodes fenced: fence_domain_add: service set level failed gfstest2:~# cman_tool services Service Name GID LID State Code Fence Domain: "default" 0 2 join S-1,80,3 [] gfstest2:~# I maybe leave it for monday, as today I'm having a pretty bad day :/ -- Javier Polo @ Datagrama 902 136 126 From eric at bootseg.com Fri Aug 5 16:21:40 2005 From: eric at bootseg.com (Eric Kerin) Date: Fri, 05 Aug 2005 12:21:40 -0400 Subject: [Linux-cluster] problem with fencing In-Reply-To: <20050805152102.GE7385@redhat.com> References: <20050805150205.GA13010@gibson.drslump.org> <20050805152102.GE7385@redhat.com> Message-ID: <1123258900.3350.8.camel@auh5-0479.corp.jabil.org> On Fri, 2005-08-05 at 10:21 -0500, Adam Manthei wrote: > What's odd about that is that there should still > be something in the logs listing the output of the first command. I just did a quick test, and confirmed that it won't produce an error unless all of the fence devices for a node fail to fence the server. Right now it treats inability to execute the fence script the same as if the fence script failed to fence the node. Also the failure condition where the exec returns does not produce any output, so nothing gets displayed (or sent to syslog). The attached patch (diff against RHEL4) will produce an error message when the exec fails with the error message. Also it display a message when no output is produced by a fence agent, for a failed exec. Thanks, Eric -------------- next part -------------- A non-text attachment was scrubbed... Name: fenced_badagentfix.patch Type: text/x-patch Size: 1180 bytes Desc: not available URL: From brianu at silvercash.com Fri Aug 5 17:38:10 2005 From: brianu at silvercash.com (brianu) Date: Fri, 5 Aug 2005 10:38:10 -0700 Subject: [Linux-cluster] RE: values for gnbd multipath dmsetup Message-ID: <20050805173812.CA6F65A8619@mail.silvercash.com> Hello, Ok I figured out id just try some of the vaules from the previous post without fully understanding them, and multipath appears to be working. dm-1 [size=546 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active][first] \_ 0:0:0:0 251:0 [undef ][active] \_ round-robin 0 [enabled] \_ 0:0:0:0 251:4 [undef ][active] But I stil get an error [root at dell-1650-31 ~]# mount -t gfs /dev/mapper/dm-1 /mnt/gfs1 mount: /dev/dm-1: can't read superblock if I do a dmsetup remove dm-1, then mount the individual gnbds all is well, but the purpose of this is to enable some sore of failover which I am told GNBD has the capability of doing. >From redhats main site and documentation for gfs 6.1 they state that multipath is not supported in the 6.1 realease however I optained this source from CVS and the main docs for http://sources.redhat.com/cluster/gnbd/ state that multipath is an option. 
Can someone clarify whether the CVS stabile sources for kernel-2.6.12 is multipath compatable, or am I doing something wrong? Current specs. SAN -> MSA-1000 3 GNBD servers currently using software iSCSI to mount that SAN - will prob go fiber if I can figure this out. ( lets say this cluster is called cluster1) Using DLM & GNBD 1 client for testing separate cluster name lets say "cluster2" Client mounted the gnbd from one of the servers that is exporting it, the servers are not mounting it, then formatted the device with gfs & created 20 journals size of 32MB each, remounted the device and verified write and read (bonnie++) Ran dmsetup to round robin the devices then failed to mount the volume as shown above. Brian Urrutia System Administrator Price Communications Inc. -------------- next part -------------- An HTML attachment was scrubbed... URL: From brianu at silvercash.com Fri Aug 5 19:22:34 2005 From: brianu at silvercash.com (brianu) Date: Fri, 5 Aug 2005 12:22:34 -0700 Subject: [Linux-cluster] RE: values for gnbd multipath dmsetup In-Reply-To: <20050805173812.CA6F65A8619@mail.silvercash.com> Message-ID: <20050805192247.AC0C15A868B@mail.silvercash.com> Hello, Just thought I'd plug in some more info I've also tried testing this with a stock FC4 client with all the Cluster RPMs installed. GFS: fsid=sclients:mygfs.0: jid=17: Trying to acquire journal lock... GFS: fsid=sclients:mygfs.0: jid=17: Looking at journal... GFS: fsid=sclients:mygfs.0: jid=17: Done GFS: fsid=sclients:mygfs.0: jid=18: Trying to acquire journal lock... GFS: fsid=sclients:mygfs.0: jid=18: Looking at journal... GFS: fsid=sclients:mygfs.0: jid=18: Done GFS: fsid=sclients:mygfs.0: jid=19: Trying to acquire journal lock... GFS: fsid=sclients:mygfs.0: jid=19: Looking at journal... GFS: fsid=sclients:mygfs.0: jid=19: Done GFS: fsid=sclients:mygfs.0: jid=20: Trying to acquire journal lock... GFS: fsid=sclients:mygfs.0: jid=20: Looking at journal... 
attempt to access beyond end of device dm-0: rw=0, want=1146990600, limit=1146990592 GFS: fsid=sclients:mygfs.0: fatal: I/O error GFS: fsid=sclients:mygfs.0: block = 143373824 GFS: fsid=sclients:mygfs.0: function = gfs_dreread GFS: fsid=sclients:mygfs.0: file = /usr/src/build/588748-i686/BUILD/xen0/src/gfs/dio.c, line = 576 GFS: fsid=sclients:mygfs.0: time = 1123268277 GFS: fsid=sclients:mygfs.0: about to withdraw from the cluster GFS: fsid=sclients:mygfs.0: waiting for outstanding I/O GFS: fsid=sclients:mygfs.0: telling LM to withdraw lock_dlm: withdraw abandoned memory GFS: fsid=sclients:mygfs.0: withdrawn GFS: fsid=sclients:mygfs.0: jid=20: Failed GFS: fsid=sclients:mygfs.0: error recovering journal 20: -5 [root at 5n@k3bi73 ~]# Aug 5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: time = 1123268277 Aug 5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: about to withdraw from the cluster Aug 5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: waiting for outstanding I/O Aug 5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: telling LM to withdraw Aug 5 11:57:57 5n at k3bi73 kernel: lock_dlm: withdraw abandoned memory Aug 5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: withdrawn Aug 5 11:57:57 5n at k3bi73 kernel: GFS: fsid=sclients:mygfs.0: jid=20: Failed This off a multipathed device which dmsetup status gives the output below: dm-1: 0 1146990592 multipath 1 0 0 2 1 A 0 1 0 251:0 A 0 E 0 1 0 251:4 A 0 dmsetup deps gives dm-1: 2 dependencies : (251, 4) (251, 0) and dmsetup info gives Name: dm-1 State: ACTIVE Tables present: LIVE Open count: 0 Event number: 0 Major, minor: 253, 1 Number of targets: 1 _____ From: brianu [mailto:brianu at silvercash.com] Sent: Friday, August 05, 2005 10:38 AM To: linux-cluster at redhat.com Cc: brianu at silvercash.com Subject: RE: values for gnbd multipath dmsetup Hello, Ok I figured out id just try some of the vaules from the previous post without fully understanding them, and multipath appears to be working. dm-1 [size=546 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active][first] \_ 0:0:0:0 251:0 [undef ][active] \_ round-robin 0 [enabled] \_ 0:0:0:0 251:4 [undef ][active] But I stil get an error [root at dell-1650-31 ~]# mount -t gfs /dev/mapper/dm-1 /mnt/gfs1 mount: /dev/dm-1: can't read superblock if I do a dmsetup remove dm-1, then mount the individual gnbds all is well, but the purpose of this is to enable some sore of failover which I am told GNBD has the capability of doing. >From redhats main site and documentation for gfs 6.1 they state that multipath is not supported in the 6.1 realease however I optained this source from CVS and the main docs for http://sources.redhat.com/cluster/gnbd/ state that multipath is an option. Can someone clarify whether the CVS stabile sources for kernel-2.6.12 is multipath compatable, or am I doing something wrong? Current specs. SAN -> MSA-1000 3 GNBD servers currently using software iSCSI to mount that SAN - will prob go fiber if I can figure this out. ( lets say this cluster is called cluster1) Using DLM & GNBD 1 client for testing separate cluster name lets say "cluster2" Client mounted the gnbd from one of the servers that is exporting it, the servers are not mounting it, then formatted the device with gfs & created 20 journals size of 32MB each, remounted the device and verified write and read (bonnie++) Ran dmsetup to round robin the devices then failed to mount the volume as shown above. Brian Urrutia System Administrator Price Communications Inc. 
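(One thing worth checking, given the want=1146990600 / limit=1146990592 numbers above: the length in the dmsetup line defines how big dm-1 is, and here it is a few sectors short of what the filesystem was built on. The map should be exactly as long as the exported device -- a sketch only, device names as in the earlier mails:

    SECTORS=$(cat /sys/block/gnbd0/size)   # 512-byte sectors
    echo "0 $SECTORS multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:4 1000" | dmsetup create dm0

If gfs was made on the raw gnbd and the multipath map is shorter than that device, reads near the end will fail exactly like this.)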
-------------- next part -------------- An HTML attachment was scrubbed... URL: From thomsonr at ucalgary.ca Sat Aug 6 01:08:25 2005 From: thomsonr at ucalgary.ca (Ryan Thomson) Date: Fri, 05 Aug 2005 19:08:25 -0600 Subject: [Linux-cluster] Clustered LDAP, good or bad idea? Message-ID: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca> Hello list, I've been thinking about running OpenLDAP on our to-be GFS/RHCS based storage cluster. I was thinking I could run LDAP as a service with all of it's data on a shared disk so that if the node with that service goes down, another node can pickup the service. It would be nice to have failover support for LDAP without using replication to another OpenLDAP server. I've already done tests on a two node cluster and it seems to work fine but it "seeming" to work fine isn't much confidence, I must admit. I have the configuration, lock files and database all on shared storage. I've modified the LDAP init.d script on each node to start/stop LDAP since the config and lock files aren't in the default spot anymore. I'm a bit concerned about failures since I can't test that properly in a two-node cluster. I suppose what I'm really asking is this: Is running LDAP as a cluster service a particularly bad idea for any reason? -- Ryan Thomson Systems Administrator University Of Calgary, Biocomputing Phone: (403) 220-2264 Email: thomsonr at ucalgary.ca From robert at deakin.edu.au Sat Aug 6 01:13:22 2005 From: robert at deakin.edu.au (Robert Ruge) Date: Sat, 6 Aug 2005 11:13:22 +1000 Subject: [Linux-cluster] I love GFS Message-ID: <200508060113.j761DCKc013924@deakin.edu.au> I would just like to say a big thank you to all those who have created GFS. I have fallen in love with it in the last month. I have a setup where all of my Windows infrastructure/servers run under vmware and I am currently migrating the vmware images to a GFS cluster that allows me some pretty cool failover and load balancing of the virtual machines. I have one question though - what is the real world experience with the reliability of a GFS filesystem? In my case if I lose the vmware GFS image repository I would lose 6 or more virtual servers and all of our Windows infrastructure, which would be a major pain. Something deep down says that putting all of my eggs in one basket is a bad idea, but there are some great advantages to doing it this way. What do other people think? Robert Ruge School of Information Technology, Deakin University From teigland at redhat.com Mon Aug 8 06:26:36 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 8 Aug 2005 14:26:36 +0800 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050805071415.GC14880@redhat.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050805071415.GC14880@redhat.com> Message-ID: <20050808062636.GB13951@redhat.com> On Fri, Aug 05, 2005 at 03:14:15PM +0800, David Teigland wrote: > On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote: > > * +++ b/fs/gfs2/fixed_div64.h 2005-08-01 14:13:08.009808200 +0800 > > ehhhh why? > > I'm not sure, actually, apart from the comments: > > do_div: /* For ia32 we need to pull some tricks to get past various versions > of the compiler which do not like us using do_div in the middle > of large functions. */ > > do_mod: /* Side effect free 64 bit mod operation */ > > fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing, > perhaps this is an old problem that's now fixed? 
I've looked into getting rid of these: - The existing do_div() works fine for me with 64 bit numerators, so I'll get rid of the "fixed" version. - The "fixed" do_mod() seems to be the only way to do 64 bit modulus. It would be great if I was wrong about that... Thanks, Dave From mikore.li at gmail.com Mon Aug 8 08:34:05 2005 From: mikore.li at gmail.com (Michael) Date: Mon, 8 Aug 2005 16:34:05 +0800 Subject: [Linux-cluster] Any update for this document? Message-ID: Hi, I found a good document that introduces GFS implementation which is release in the middle of last year, however, many APIs has been changed since that. To have a good understanding of GFS, I'd like to get the latest version of it for your latest release, could you point to me the link? ps: the doc name is "Symmetric Cluster Architecture and Component Technical Specifications" Thanks, Michael From teigland at redhat.com Mon Aug 8 09:00:50 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 8 Aug 2005 17:00:50 +0800 Subject: [Linux-cluster] Any update for this document? In-Reply-To: References: Message-ID: <20050808090050.GC13951@redhat.com> On Mon, Aug 08, 2005 at 04:34:05PM +0800, Michael wrote: > Hi, > > I found a good document that introduces GFS implementation which is > release in the middle of last year, however, many APIs has been > changed since that. To have a good understanding of GFS, I'd like to > get the latest version of it for your latest release, could you point > to me the link? > > ps: the doc name is "Symmetric Cluster Architecture and Component > Technical Specifications" In general terms, this document describes pretty well the code that's in the RHEL4 branch. It doesn't talk much about GFS, it's mostly about the cluster infrastructure. As you've said, the specific api's and code examples are very outdated -- there's no recent version. If you're interested in the current development of the cluster infrastructure (where most things run in user space), then that document is useless I'm afraid. Dave From teigland at redhat.com Mon Aug 8 09:57:47 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 8 Aug 2005 17:57:47 +0800 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <84144f0205080223445375c907@mail.gmail.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> Message-ID: <20050808095747.GD13951@redhat.com> On Wed, Aug 03, 2005 at 09:44:06AM +0300, Pekka Enberg wrote: > > +uint32_t gfs2_hash(const void *data, unsigned int len) > > +{ > > + uint32_t h = 0x811C9DC5; > > + h = hash_more_internal(data, len, h); > > + return h; > > +} > > Is there a reason why you cannot use or ? See gfs2_hash_more() and comment; we hash discontiguous regions. > > +#define RETRY_MALLOC(do_this, until_this) \ > > +for (;;) { \ > > + { do_this; } \ > > + if (until_this) \ > > + break; \ > > + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \ > > + printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \ > > + gfs2_malloc_warning = jiffies; \ > > + } \ > > + yield(); \ > > +} > > Please drop this. Done in the spot that could deal with an error, but there are three other places that still need it. 
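(For the call sites that really must not fail, the suggestion amounts to letting the allocator do the waiting instead of the open-coded loop, e.g. roughly:

    bd = kmem_cache_alloc(gfs2_bufdata_cachep, GFP_KERNEL | __GFP_NOFAIL);

assuming __GFP_NOFAIL is honoured the whole way down for these allocations, which is what the rest of this thread tries to pin down.)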
> > +static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er, > > + struct gfs2_ea_location *el) > > +{ > > + { > > + struct ea_set es; > > + int error; > > + > > + memset(&es, 0, sizeof(struct ea_set)); > > + es.es_er = er; > > + es.es_el = el; > > + > > + error = ea_foreach(ip, ea_set_simple, &es); > > + if (error > 0) > > + return 0; > > + if (error) > > + return error; > > + } > > + { > > + unsigned int blks = 2; > > + if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT)) > > + blks++; > > + if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize) > > + blks += DIV_RU(er->er_data_len, > > + ip->i_sbd->sd_jbsize); > > + > > + return ea_alloc_skeleton(ip, er, blks, ea_set_block, el); > > + } > > Please drop the extra braces. Here and elsewhere we try to keep unused stuff off the stack. Are you suggesting that we're being overly cautious, or do you just dislike the way it looks? Thanks, Dave From arjan at infradead.org Mon Aug 8 10:05:25 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Mon, 08 Aug 2005 12:05:25 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050808095747.GD13951@redhat.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> Message-ID: <1123495525.3245.36.camel@laptopd505.fenrus.org> On Mon, 2005-08-08 at 17:57 +0800, David Teigland wrote: > > > > Please drop the extra braces. > > Here and elsewhere we try to keep unused stuff off the stack. Are you > suggesting that we're being overly cautious, or do you just dislike the > way it looks? nice theory. In practice gcc 3.x still adds up all the stack space anyway and as long as gcc 3.x is a supported kernel compiler, you can't depend on this. Also.. please favor readability. gcc is getting smarter about stack use nowadays, and {}'s shouldn't be needed to help it, it tracks liveness of variables already. From teigland at redhat.com Mon Aug 8 10:56:13 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 8 Aug 2005 18:56:13 +0800 Subject: [Linux-cluster] Re: GFS In-Reply-To: References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> Message-ID: <20050808105613.GE13951@redhat.com> On Mon, Aug 08, 2005 at 01:18:45PM +0300, Pekka J Enberg wrote: > gfs2-02.patch:+ RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, > -> GFP_NOFAIL. Already gone, inode_create() can return an error. if (create) { RETRY_MALLOC(page = grab_cache_page(aspace->i_mapping, index), page); } else { page = find_lock_page(aspace->i_mapping, index); if (!page) return NULL; } > I think you can set aspace->flags to GFP_NOFAIL will try that > but why can't you return NULL here on failure like you do for > find_lock_page()? because create is set > gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep, > GFP_KERNEL), > -> GFP_NOFAIL It looks to me like NOFAIL does nothing for kmem_cache_alloc(). Am I seeing that wrong? > gfs2-10.patch:+ RETRY_MALLOC(new_gh = gfs2_holder_get(gl, state, > gfs2_holder_get uses kmalloc which can use GFP_NOFAIL. Which means adding a new gfp_flags parameter... fine. 
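Roughly like this, presumably -- the existing signature is only inferred from the call site quoted above, so treat it as a sketch:

    /* thread a gfp argument through instead of retrying in the caller */
    struct gfs2_holder *gfs2_holder_get(struct gfs2_glock *gl, unsigned int state,
                                        int flags, unsigned int gfp_flags)
    {
            struct gfs2_holder *gh = kmalloc(sizeof(struct gfs2_holder), gfp_flags);
            /* ... initialisation as before ... */
            return gh;
    }

so the demote path can ask for GFP_KERNEL | __GFP_NOFAIL while other callers keep plain GFP_KERNEL.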
Dave From teigland at redhat.com Mon Aug 8 11:39:10 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 8 Aug 2005 19:39:10 +0800 Subject: [Linux-cluster] Re: GFS In-Reply-To: References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> <20050808105613.GE13951@redhat.com> Message-ID: <20050808113910.GF13951@redhat.com> On Mon, Aug 08, 2005 at 01:57:55PM +0300, Pekka J Enberg wrote: > David Teigland writes: > >> but why can't you return NULL here on failure like you do for > >> find_lock_page()? > > > >because create is set > > Yes, but looking at (some of the) top-level callers, there's no real reason > why create must not fail. Am I missing something here? I'll trace the callers back farther and see about dealing with errors. > >> gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep, > > It is passed to the page allocator just like with kmalloc() which uses > __cache_alloc() too. Yes, I read it wrongly, looks like NOFAIL should work fine. I think we can get rid of the RETRY macro entirely. Thanks, Dave From penberg at cs.helsinki.fi Mon Aug 8 10:00:43 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Mon, 08 Aug 2005 13:00:43 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050808095747.GD13951@redhat.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> Message-ID: David Teigland writes: > > > +static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er, > > > + struct gfs2_ea_location *el) > > > +{ > > > + { > > > + struct ea_set es; > > > + int error; > > > + > > > + memset(&es, 0, sizeof(struct ea_set)); > > > + es.es_er = er; > > > + es.es_el = el; > > > + > > > + error = ea_foreach(ip, ea_set_simple, &es); > > > + if (error > 0) > > > + return 0; > > > + if (error) > > > + return error; > > > + } > > > + { > > > + unsigned int blks = 2; > > > + if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT)) > > > + blks++; > > > + if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize) > > > + blks += DIV_RU(er->er_data_len, > > > + ip->i_sbd->sd_jbsize); > > > + > > > + return ea_alloc_skeleton(ip, er, blks, ea_set_block, el); > > > + } > > > > Please drop the extra braces. > > Here and elsewhere we try to keep unused stuff off the stack. Are you > suggesting that we're being overly cautious, or do you just dislike the > way it looks? The extra braces hurt readability. Please drop them or make them proper functions instead. And yes, I think you're hiding potential stack usage problems here. Small unused stuff on the stack don't matter and large ones should probably be kmalloc() anyway. Pekka From penberg at cs.helsinki.fi Mon Aug 8 10:18:45 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Mon, 08 Aug 2005 13:18:45 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050808095747.GD13951@redhat.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> Message-ID: David Teigland writes: > > > +#define RETRY_MALLOC(do_this, until_this) \ > > > +for (;;) { \ > > > + { do_this; } \ > > > + if (until_this) \ > > > + break; \ > > > + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \ > > > + printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \ > > > + gfs2_malloc_warning = jiffies; \ > > > + } \ > > > + yield(); \ > > > +} > > > > Please drop this. 
> > Done in the spot that could deal with an error, but there are three other > places that still need it. Which places are those? I only see these: gfs2-02.patch:+ RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL), ip); gfs2-02.patch-+ gfs2_memory_add(ip); gfs2-02.patch-+ memset(ip, 0, sizeof(struct gfs2_inode)); gfs2-02.patch-+ gfs2-02.patch-+ ip->i_num = *inum; gfs2-02.patch-+ -> GFP_NOFAIL. gfs2-02.patch:+ RETRY_MALLOC(page = grab_cache_page(aspace->i_mapping, index), gfs2-02.patch-+ page); gfs2-02.patch-+ } else { gfs2-02.patch-+ page = find_lock_page(aspace->i_mapping, index); gfs2-02.patch-+ if (!page) gfs2-02.patch-+ return NULL; I think you can set aspace->flags to GFP_NOFAIL but why can't you return NULL here on failure like you do for find_lock_page()? gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep, GFP_KERNEL), gfs2-02.patch-+ bd); gfs2-02.patch-+ gfs2_memory_add(bd); gfs2-02.patch-+ atomic_inc(&gl->gl_sbd->sd_bufdata_count); gfs2-02.patch-+ gfs2-02.patch-+ memset(bd, 0, sizeof(struct gfs2_bufdata)); -> GFP_NOFAIL gfs2-08.patch:+ RETRY_MALLOC(gm = kmalloc(sizeof(struct gfs2_memory), GFP_KERNEL), gm); gfs2-08.patch-+ gm->gm_data = data; gfs2-08.patch-+ gm->gm_file = file; gfs2-08.patch-+ gm->gm_line = line; gfs2-08.patch-+ gfs2-08.patch-+ spin_lock(&memory_lock); -> GFP_NOFAIL gfs2-10.patch:+ RETRY_MALLOC(new_gh = gfs2_holder_get(gl, state, gfs2-10.patch-+ LM_FLAG_TRY | gfs2-10.patch-+ GL_NEVER_RECURSE), gfs2-10.patch-+ new_gh); gfs2-10.patch-+ set_bit(HIF_DEMOTE, &new_gh->gh_iflags); gfs2-10.patch-+ set_bit(HIF_DEALLOC, &new_gh->gh_iflags); gfs2_holder_get uses kmalloc which can use GFP_NOFAIL. Pekka From joern at wohnheim.fh-wedel.de Mon Aug 8 10:20:22 2005 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Mon, 8 Aug 2005 12:20:22 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <1123495525.3245.36.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> <1123495525.3245.36.camel@laptopd505.fenrus.org> Message-ID: <20050808102022.GA17978@wohnheim.fh-wedel.de> On Mon, 8 August 2005 12:05:25 +0200, Arjan van de Ven wrote: > On Mon, 2005-08-08 at 17:57 +0800, David Teigland wrote: > > > > > > Please drop the extra braces. > > > > Here and elsewhere we try to keep unused stuff off the stack. Are you > > suggesting that we're being overly cautious, or do you just dislike the > > way it looks? > > nice theory. In practice gcc 3.x still adds up all the stack space > anyway and as long as gcc 3.x is a supported kernel compiler, you can't > depend on this. Also.. please favor readability. gcc is getting smarter > about stack use nowadays, and {}'s shouldn't be needed to help it, it > tracks liveness of variables already. Plus, you don't have to guess about stack usage. Run "make checkstack" or, better yet, run the objdump of fs/gfs/built-in.o through the perl script. J?rn -- It's just what we asked for, but not what we want! -- anonymous From penberg at cs.helsinki.fi Mon Aug 8 10:34:56 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Mon, 08 Aug 2005 13:34:56 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050808095747.GD13951@redhat.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> Message-ID: David Teigland writes: > > Is there a reason why you cannot use or ? 
> > See gfs2_hash_more() and comment; we hash discontiguous regions. jhash() takes an initial value. Isn't that sufficient? Pekka From penberg at cs.helsinki.fi Mon Aug 8 10:57:55 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Mon, 08 Aug 2005 13:57:55 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050808105613.GE13951@redhat.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> <20050808105613.GE13951@redhat.com> Message-ID: David Teigland writes: > > but why can't you return NULL here on failure like you do for > > find_lock_page()? > > because create is set Yes, but looking at (some of the) top-level callers, there's no real reason why create must not fail. Am I missing something here? > > gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep, > > GFP_KERNEL), > > -> GFP_NOFAIL > > It looks to me like NOFAIL does nothing for kmem_cache_alloc(). > Am I seeing that wrong? It is passed to the page allocator just like with kmalloc() which uses __cache_alloc() too. Pekka From sdake at mvista.com Fri Aug 5 17:45:49 2005 From: sdake at mvista.com (Steven Dake) Date: Fri, 05 Aug 2005 10:45:49 -0700 Subject: [Linux-cluster] Where to go with cman ? In-Reply-To: <42EF4AD1.6010809@redhat.com> References: <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> Message-ID: <1123263949.16923.23.camel@localhost.localdomain> On Tue, 2005-08-02 at 11:28 +0100, Patrick Caulfield wrote: > Steven Dake wrote: > > On Mon, 2005-07-18 at 09:10 +0100, Patrick Caulfield wrote: > > > >>As I see it there are two things we can do with userland cman that's current in > >>the head of CVS: > >> > >>1. Leave it as it is - a port of the kernel one. This has some benefits: it's > >>easy (plus a few bug fixes that need to go in), it's protocol-compatible with > >>the kernel one. There are a small number of extra features that could go in > >>there (that would, annoyingly, break that compatibility) but nothing really > >>serious. It doesn't give us anything new, but what new is neeed ? > >> > >>2. Migrate it to something much more sophisticated. I've mentioned Virtual > >>Synchrony a few times before and I've been looking into this in some detail > >>since. The benefits are largely internal but they do provide a reliable, robust > >>and well-performing messaging system that other cluster subsystems can use. > >>While the application programmers at the cluster summit maintained they had no > >>use for a cluster messaging system, I still believe that it is a useful thing to > >>have at a lower level - if only for our own programming needs. I know that Jon > >>looked into the existing cman messaging system before rejecting it as too slow > >>and unreliable for he needs of the cluster mirroring code. > >> > >>There are two suboptions here. > >> a) write it ourself. Quite a big job this. Bigger than I would like. To be > >>honest I did make a start at this and now realise just what a huge job it is to > >>get something that both performs well and is reliable. REALLY reliable. even > >>worse if the academics want something provably reliable. > >> b) adopt something else. The obvious candidate here is the openAIS code[1]. > >>This looks to be quite mature now and has all the features we need of a low > >>level messaging system. 
It's very nicely abstracted out so we can pick out just > >>the bits we need without having the whole (rather heavyweight) system on top of it. > >> > >>The one problem with the openAIS code is that it doesn't support IPv6, and much > >>of the code is tied to IPv4. Having had a look at it and emailed Steven Dake > >>about this he reckons it's about 2 weeks work to add.[2] > >> > >>The advantages of doing this are several. > >>- It saves time. We get something that is known to work, even though it needs > >>extra features added for our own use. > >>- we're not inventing something new that already exists in several other places. > >>- we get more people who know the code. Currently only I know the internals of > >>cman as it stands and it's quite scary code that people don't want to get > >>involved with (we've have several DLM patches in the past, but no CMAN ones). > >>This way we get at least 2 (Steven and me) as well as anyone else who is > >>following openAIS. Of course there will be CMAN-specific stuff on top of their > >>comms layer to make it quorum-based and capable of supporting GFS and DLM that > > > > > > sorry my response is so late I missed this mail while at OLS. > > > > The quorum problem is commonly referred to in the literature as a > > "virtual synchrony filter". I'd love to have some implementations of > > virtual synchrony filters that exist within libtotem itself.. > > Definately an area of interest for openais as we need some services to > > operate only in one partition (like the amf). > > > > > >>will be Red Hat specific but these are not going to be large. > >>- the APIs are all open (based on SAforum specifications) and already > >>implemented. Although adding saCLM to CMAN is pretty easy as I proved last week. > >> > > > > > >>The disadvantages are > >>- Need to learn the internals of someone else's code. > > > > > > indeed this part is somewhat painful :( > > > > > >>- We don't have full control over the code. Although we can obviously fork it if > >>we feel the need it would, obviously be preferable not to. > > > > > > My view is that open source influence is dictated by level of > > contribution just like any kind of community. ie: the more a person > > contributes the more influence they can exert over a project or > > direction. Even as maintainer I don't have full control over the > > openais code as the community really decides where we go and what work > > we do. > > > > My point here is that if you are willing to fork, then you probably have > > some time to maintain the code.. which is better spent influencing the > > current openais tree :) > > > > > >>- non-compatibility with "old" cman, making rolling upgrades har or even > >>impossible. I'm not sure what to do about this yet, but it's worth pointing out > >>that the DLM has a new line-protocol too. > > > > > > yes upgrades are a real pain. We have not fully tackled this problem in > > the openais project yet, because we havn't released a stable version. > > Ideally we would like two versions (older, newer) to interoperate, even > > if that means uglifying the implementation to coexist with two line > > types. We have some work in place to address this problem but before > > our first production release I'm planning to really think through > > interoperability with new implementations for features of the totem > > protocol (like redundant ring, multi ring gateway (for local area > > networks), group key generation, multi-ring-bridged (for wide area > > networks), etc). 
> > > > > >>- openAIS is BSD licensed, I don't think this is a problem but it probably needs > >>checking. > >> > > > > > > Originally I had planned to use spread for openais, but the license was > > not compatible with the lawyers "approved list". So we had to implement > > a protocol completely from scratch because of the license issue which > > took about 1.5 years of work (sigh). I wanted to be sure other projects > > could reuse the totem code so chose the most liberal license I could > > find. > > > > > >>In short, I'm advocating adopting the openAIS core (libtotem basically) as > >>CMAN's communications/membership protocol. If we're going to do a "CMAN V2" that > >>has anything significant over V1 then re-inventing it is going to be a huge > >>amount of work that someone else has already done. > >> > >>Comments? > >> > > > > > > sounds good Patrick if you need any help from us let us know > > > > Thanks for that Steven. I'm going to make a start on this when I get back from > UKUUG next week. I've managed to knock up something that looks like cman from > the outside but uses libtotem for it's comms layer so it's looking good. On > other thing I need to look into (apart from IPv6) is multi-home. cman had a > (primitive) failover system but it's not currently in use by anyone because DLM > doesn't support it but I think it's something we need to provide at some stage. > > Don't worry about the mention of a fork - the chances of it happening are almost > nil! Thats great news Patrick. One thing you should be aware of is that I have changed some of the internal interfaces in preparation for others to use libtotem to be extremely more sanitary. Unfortunately I may have done this a little too late in your case.. But I think you will find things are a little better. It really only effects totempg_initialize. Also libtotem was renamed to libtotem_pg because of requests from Daniel about a name-space collision with some movie player in fc4. For multihoming, I want to support the totem redundant ring protocol in the totem code. This is an extension of totemsrp to support multiple nics per processor. Then data is either actively or passively replicated over multiple links. There is essentially no failover and multiple links can offer better performance and still operate properly when one entire network fails. It looks pretty simple to implement. The paper is at: http://www.rcsc.de/pdf/icdcs02.pdf regards -steve From javipolo at datagrama.net Mon Aug 8 14:12:38 2005 From: javipolo at datagrama.net (Javi Polo) Date: Mon, 8 Aug 2005 16:12:38 +0200 Subject: [Linux-cluster] problem with rejoining a node Message-ID: <20050808141238.GA6455@gibson.drslump.org> Hi there (again :P) I'm still fighting with all this, sorry to bother so much (hope some day when I understand it all better I'll write some article on how to set this up) Well, I have already up the cluster and mounted the gfs filesystem in 3 machines, and if one of those goes down, it's correctly fenced. The FC port is also disconnected, so I suppose at this point is everything ok. The problem is on the recovery. I understand that when a node rejoins is automaticaly unfenced, and then it can rejoin the fence and mount again the filesystem. I've blocked all input and output traffic on the node I want to test with iptables. 
The node gets fenced ok: Aug 8 16:00:48 gfstest2 fenced[2594]: fencing node "gfstest1" Aug 8 16:00:56 gfstest2 fenced[2594]: fence "gfstest1" success Now I can access the GFS filesystem safely from my other 2 nodes, as the FC port for gfstest1 is disabled now, but if I enable traffic for the node, it does not rejoin the cluster. Shouldnt this be automatically? Anyway, I cannot rejoin/leave/whatever the cluster from gfstest1: gfstest1:~# cman_tool services Service Name GID LID State Code Fence Domain: "default" 1 2 run - [1 2 3] DLM Lock Space: "primer_fs" 2 3 run - [1 2 3] GFS Mount Group: "primer_fs" 3 4 run - [1 2 3] gfstest1:~# cman_tool join cman_tool: Node is already active gfstest1:~# cman_tool leave cman_tool: Can't leave cluster while there are 5 active subsystems and also, I cannot umount /dev/sdc1 as I have no access to the SAN (and however DLM should block him not to do so). So I get a totally screwed up system, that I can just fix by hard-rebooting (if I do a clean reboot, the system "hangs" while "umounting filesystems"). Also, when the system boots up, the SAN is still unaccessible, as the fencing script does not run to re-enable the port ... I'm loooooost diving into google querys ... and certainly it's hard to find accurate info about all this :/ could someone spot some light? (probably I dont understand well how the fencing system works, but also havent find anywhere where its explained :/) thx in advance :) -- Javier Polo @ Datagrama 902 136 126 From penberg at cs.helsinki.fi Mon Aug 8 14:14:45 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Mon, 08 Aug 2005 17:14:45 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050803063644.GD9812@redhat.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> Message-ID: David Teigland writes: > +static ssize_t walk_vm_hard(struct file *file, char *buf, size_t size, > + loff_t *offset, do_rw_t operation) > +{ > + struct gfs2_holder *ghs; > + unsigned int num_gh = 0; > + ssize_t count; > + > + { Can we please get rid of the extra braces everywhere? [snip] David Teigland writes: > + > + for (vma = find_vma(mm, start); vma; vma = vma->vm_next) { > + if (end <= vma->vm_start) > + break; > + if (vma->vm_file && > + vma->vm_file->f_dentry->d_inode->i_sb == sb) { > + num_gh++; > + } > + } > + > + ghs = kmalloc((num_gh + 1) * sizeof(struct gfs2_holder), > + GFP_KERNEL); > + if (!ghs) { > + if (!dumping) > + up_read(&mm->mmap_sem); > + return -ENOMEM; > + } > + > + for (vma = find_vma(mm, start); vma; vma = vma->vm_next) { Sorry if this is an obvious question but what prevents another thread from doing mmap() before we do the second walk and messing up num_gh? > + if (end <= vma->vm_start) > + break; > + if (vma->vm_file) { > + struct inode *inode; > + inode = vma->vm_file->f_dentry->d_inode; > + if (inode->i_sb == sb) > + gfs2_holder_init(get_v2ip(inode)->i_gl, > + vma2state(vma), > + 0, &ghs[x++]); > + } > + } Pekka From pcaulfie at redhat.com Mon Aug 8 15:30:43 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 08 Aug 2005 16:30:43 +0100 Subject: [Linux-cluster] Where to go with cman ? In-Reply-To: <1123263949.16923.23.camel@localhost.localdomain> References: <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> Message-ID: <42F77AA3.80000@redhat.com> Steven Dake wrote: > Thats great news Patrick. 
One thing you should be aware of is that I > have changed some of the internal interfaces in preparation for others > to use libtotem to be extremely more sanitary. Unfortunately I may have > done this a little too late in your case.. But I think you will find > things are a little better. It really only effects totempg_initialize. > Also libtotem was renamed to libtotem_pg because of requests from Daniel > about a name-space collision with some movie player in fc4. Yes I spotted that, my current "nearly-working" cman is based on the latest SVN sources. > For multihoming, I want to support the totem redundant ring protocol in > the totem code. This is an extension of totemsrp to support multiple > nics per processor. Then data is either actively or passively > replicated over multiple links. There is essentially no failover and > multiple links can offer better performance and still operate properly > when one entire network fails. It looks pretty simple to implement. > The paper is at: > > http://www.rcsc.de/pdf/icdcs02.pdf Excellent, thanks. I'll have a read. -- patrick From pcaulfie at redhat.com Mon Aug 8 15:35:14 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 08 Aug 2005 16:35:14 +0100 Subject: [Linux-cluster] problem with rejoining a node In-Reply-To: <20050808141238.GA6455@gibson.drslump.org> References: <20050808141238.GA6455@gibson.drslump.org> Message-ID: <42F77BB2.4070705@redhat.com> Javi Polo wrote: > Hi there (again :P) > > I'm still fighting with all this, sorry to bother so much (hope some day > when I understand it all better I'll write some article on how to set this up) > > Well, I have already up the cluster and mounted the gfs filesystem in 3 > machines, and if one of those goes down, it's correctly fenced. The FC > port is also disconnected, so I suppose at this point is everything ok. > > The problem is on the recovery. I understand that when a node rejoins > is automaticaly unfenced, and then it can rejoin the fence and > mount again the filesystem. > > I've blocked all input and output traffic on the node I want to test > with iptables. > > The node gets fenced ok: > Aug 8 16:00:48 gfstest2 fenced[2594]: fencing node "gfstest1" > Aug 8 16:00:56 gfstest2 fenced[2594]: fence "gfstest1" success What sort of fencing are you using? If it's a power-switch fence then the node should be hard rebooted. If it's SAN fencing then you'll have to get the node out of the cluster - the remaining two nodes /should/ tell it it leave the cluster. A node can't just "rejoin" a cluster after being SAN fenced. it must be removed from the cluster and rejoin from scratch. There's far too much state involved for it to merge seamlessly back into a cluster. > Now I can access the GFS filesystem safely from my other 2 nodes, as the > FC port for gfstest1 is disabled now, but if I enable traffic for the > node, it does not rejoin the cluster. Shouldnt this be automatically? > > Anyway, I cannot rejoin/leave/whatever the cluster from gfstest1: > gfstest1:~# cman_tool services > Service Name GID LID State Code > Fence Domain: "default" 1 2 run - > [1 2 3] > > DLM Lock Space: "primer_fs" 2 3 run - > [1 2 3] > > GFS Mount Group: "primer_fs" 3 4 run - > [1 2 3] > > gfstest1:~# cman_tool join > cman_tool: Node is already active > gfstest1:~# cman_tool leave > cman_tool: Can't leave cluster while there are 5 active subsystems cman_tool leave force will force it to leave, but you might find it still needs a reboot to clear the filesystems. 
> and also, I cannot umount /dev/sdc1 as I have no access to the SAN > (and however DLM should block him not to do so). So I get a totally > screwed up system, that I can just fix by hard-rebooting (if I do a > clean reboot, the system "hangs" while "umounting filesystems"). > > Also, when the system boots up, the SAN is still unaccessible, as the > fencing script does not run to re-enable the port ... > > I'm loooooost diving into google querys ... and certainly it's hard to > find accurate info about all this :/ > > could someone spot some light? > (probably I dont understand well how the fencing system works, but also > havent find anywhere where its explained :/) > > thx in advance :) -- patrick From mdl at veles.ru Tue Aug 9 07:34:57 2005 From: mdl at veles.ru (Denis Medvedev) Date: Tue, 09 Aug 2005 11:34:57 +0400 Subject: [Linux-cluster] RH exporting local disks as LUNs. In-Reply-To: References: <1123209846.17379.18.camel@dragon.sys.intra> <42F31C9D.8090601@veles.ru> Message-ID: <42F85CA1.3040402@veles.ru> Nate Carlson ?????: > On Fri, 5 Aug 2005, Denis Medvedev wrote: > >>> This question may show my total ignorance but I'm not above that ;) >>> Is there anything stopping me from exporting a device or a volume as a >>> raw SCSI or FC-AL LUN for example. Could a linux box be made to be (act >>> like) a SCSI or FC disk? >> >> >> Yes, you can. > > > Nifty! I was actually looking at this the other day, and couldn't > figure out a way to do it. > > Do you happen to have any documentation? > > ------------------------------------------------------------------------ > | nate carlson | natecars at natecarlson.com | http://www.natecarlson.com | > | depriving some poor village of its idiot since 1981 | > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > I just tried (successfully) TWO working implementations of Linux was AltLinux (www.altlinux.ru, ftp.altlinux.ru - Master distribution upgraded to Sisyphus). And the working iscsi "targets" (providers of iscsi disks) was iscsi-target.sf.net (better for me) and umh-iscsi ( http://www.iol.unh.edu/consortiums/iscsi/) BOTH worked fine with Microsoft initiator (the iscsi client). From pcaulfie at redhat.com Tue Aug 9 07:08:01 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 09 Aug 2005 08:08:01 +0100 Subject: [Openais] Re: [Linux-cluster] Where to go with cman ? In-Reply-To: <1123524809.16145.17.camel@localhost.localdomain> References: <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> <42F77AA3.80000@redhat.com> <1123524809.16145.17.camel@localhost.localdomain> Message-ID: <42F85651.80501@redhat.com> Steven Dake wrote: > On Mon, 2005-08-08 at 16:30 +0100, Patrick Caulfield wrote: > >>Steven Dake wrote: >> >>>Thats great news Patrick. One thing you should be aware of is that I >>>have changed some of the internal interfaces in preparation for others >>>to use libtotem to be extremely more sanitary. Unfortunately I may have >>>done this a little too late in your case.. But I think you will find >>>things are a little better. It really only effects totempg_initialize. >>>Also libtotem was renamed to libtotem_pg because of requests from Daniel >>>about a name-space collision with some movie player in fc4. 
>> >>Yes I spotted that, my current "nearly-working" cman is based on the latest SVN >>sources. >> >> >>>For multihoming, I want to support the totem redundant ring protocol in >>>the totem code. This is an extension of totemsrp to support multiple >>>nics per processor. Then data is either actively or passively >>>replicated over multiple links. There is essentially no failover and >>>multiple links can offer better performance and still operate properly >>>when one entire network fails. It looks pretty simple to implement. >>>The paper is at: >>> >>>http://www.rcsc.de/pdf/icdcs02.pdf >> >>Excellent, thanks. I'll have a read. >> > > > Patrick, > > Over the weekend I reorged the totem code significantly (although the > totempg interfaces have not changed). The reorg was painful timewise, > but the result is that redundant ring should be pretty easy to implement > now. Basically I took all of the network junk out of totemsrp and put > it in "totemnet.c". It also allows for multiple instances of totemnet > binds. This is the main feature I needed to implement redundant ring in > a clean fashion. The ipv6 support should be a little easier to add now > since most of the network code is limited to totemnet. Superb! I intend to get on to the IPv6 stuff in the next week or so, other things permitting. > I should have a patch in a few days with a redundant ring passive and > active implementation. -- patrick From javipolo at datagrama.net Mon Aug 8 21:55:30 2005 From: javipolo at datagrama.net (Javi Polo) Date: Mon, 8 Aug 2005 23:55:30 +0200 Subject: [Linux-cluster] problem with rejoining a node In-Reply-To: <42F77BB2.4070705@redhat.com> References: <20050808141238.GA6455@gibson.drslump.org> <42F77BB2.4070705@redhat.com> Message-ID: <20050808215530.GA23695@gibson.drslump.org> On Aug/08/2005, Patrick Caulfield wrote: > What sort of fencing are you using? If it's a power-switch fence then the > node should be hard rebooted. If it's SAN fencing then you'll have to get the it's san fencing. I modified the fence_sanbox2.pl script to suit the switch commands, and "by hand" it works perfectly > node out of the cluster - the remaining two nodes /should/ tell it it leave the > cluster. so, when the node recovers and "says hello" to the cluster, the other two do take him out of the cluster? > A node can't just "rejoin" a cluster after being SAN fenced. it must be removed > from the cluster and rejoin from scratch. There's far too much state involved > for it to merge seamlessly back into a cluster. must i do it manually? or is any kind of automated process here? what are the steps the node should perform after being fenced so it can join again the node? (sorry asking so much but I'm really lost here :/ ) > > gfstest1:~# cman_tool join > > cman_tool: Node is already active > > gfstest1:~# cman_tool leave > > cman_tool: Can't leave cluster while there are 5 active subsystems > cman_tool leave force will force it to leave, but you might find it still needs > a reboot to clear the filesystems. so, if we just simply loose conectivity between nodes, we should still reboot the server so it can be "clean" and join again the cluster? and if so, should I enable manually the port on the SAN, or will fenced do it for me (as the script does actually accepts an enable parameter) :?? 
> Not that I'm aware of. I had one running a while for a reason which I've since forgotten (I stopped it on 29-Mar-2005, and has never been restarted). However, that LDAP server was only for testing; I've never run one in production. -- Lon From lhh at redhat.com Mon Aug 8 20:06:50 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 08 Aug 2005 16:06:50 -0400 Subject: [Linux-cluster] Fencing agents In-Reply-To: <42F1D28D.10006@veles.ru> References: <42F0B177.7050907@hugsmidjan.is> <20050803212949.GA3268@redhat.com> <42F1D28D.10006@veles.ru> Message-ID: <1123531610.3992.47.camel@ayanami.boston.redhat.com> On Thu, 2005-08-04 at 12:32 +0400, Denis Medvedev wrote: > >The main advantage that "automatic" fencing gives you over manual fencing is > >that in the event that a fencing operation is required, your cluster can > >automatically recover (on the order of seconds to minutes) instead of waiting > >for user intervention (which can take minutes to hours to days depending on > >how attentive the admins are :). > > > > > > > "recover"? You mean reboot? But if a machine need fencing, doesn't that > mean that something is inherently wrong with that machine and simple > reboot would't cure that? recovery = The *cluster* can continue operation, not the *node*. If the node is truly dead (maybe its CPU was fried from a bolt of lightning), rebooting it doesn't hurt it. If it was just a software or network issue (e.g. kernel panic, router glitch), then the node should be able to recover after its reboot and rejoin the cluster. -- Lon From adam.cassar at netregistry.com.au Tue Aug 9 12:11:35 2005 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Tue, 09 Aug 2005 22:11:35 +1000 Subject: [Linux-cluster] Clustered LDAP, good or bad idea? In-Reply-To: <1123531971.3992.51.camel@ayanami.boston.redhat.com> References: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca> <1123531971.3992.51.camel@ayanami.boston.redhat.com> Message-ID: <42F89D77.80907@netregistry.com.au> We run openldap quite extensively here. In my experience, if slapd fails it is usually due to some backend db issue, and gfs will not help you with that. A master slave set up will be your best option. Lon Hohberger wrote: >On Fri, 2005-08-05 at 19:08 -0600, Ryan Thomson wrote: > > > >>I'm a bit concerned about failures since I can't test that properly in a >>two-node cluster. >> >>I suppose what I'm really asking is this: Is running LDAP as a cluster >>service a particularly bad idea for any reason? >> >> >> > >Not that I'm aware of. I had one running a while for a reason which >I've since forgotten (I stopped it on 29-Mar-2005, and has never been >restarted). However, that LDAP server was only for testing; I've never >run one in production. > >-- Lon > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >http://www.redhat.com/mailman/listinfo/linux-cluster > > From pcaulfie at redhat.com Tue Aug 9 12:25:08 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 09 Aug 2005 13:25:08 +0100 Subject: [Linux-cluster] problem with rejoining a node In-Reply-To: <20050808215530.GA23695@gibson.drslump.org> References: <20050808141238.GA6455@gibson.drslump.org> <42F77BB2.4070705@redhat.com> <20050808215530.GA23695@gibson.drslump.org> Message-ID: <42F8A0A4.5000507@redhat.com> Javi Polo wrote: > On Aug/08/2005, Patrick Caulfield wrote: > > >>What sort of fencing are you using? If it's a power-switch fence then the >>node should be hard rebooted. If it's SAN fencing then you'll have to get the > > > it's san fencing. 
I modified the fence_sanbox2.pl script to suit the > switch commands, and "by hand" it works perfectly > > >>node out of the cluster - the remaining two nodes /should/ tell it it leave the >>cluster. > > > so, when the node recovers and "says hello" to the cluster, the other > two do take him out of the cluster? Yes. Is that not happening ? > >>A node can't just "rejoin" a cluster after being SAN fenced. it must be removed >>from the cluster and rejoin from scratch. There's far too much state involved >>for it to merge seamlessly back into a cluster. > > > must i do it manually? or is any kind of automated process here? > > what are the steps the node should perform after being fenced so it can > join again the node? > (sorry asking so much but I'm really lost here :/ ) A reboot is usually the easiest way to ensure that a node is "clean". If you can umount all the GFS filesystems and stop all the cluster subsystems (fence, clvmd, gfs) then you should be able to run the startup scripts again but it's just a faff. >>>gfstest1:~# cman_tool join >>>cman_tool: Node is already active >>>gfstest1:~# cman_tool leave >>>cman_tool: Can't leave cluster while there are 5 active subsystems >> >>cman_tool leave force will force it to leave, but you might find it still needs >>a reboot to clear the filesystems. > > > so, if we just simply loose conectivity between nodes, we should still > reboot the server so it can be "clean" and join again the cluster? > > and if so, should I enable manually the port on the SAN, or will fenced > do it for me (as the script does actually accepts an enable parameter) > :?? > I don't know. I've never had access to a SAN fencing device! -- patrick From oldmoonster at gmail.com Mon Aug 8 15:41:34 2005 From: oldmoonster at gmail.com (Michael) Date: Mon, 08 Aug 2005 23:41:34 +0800 Subject: [Linux-cluster] [PATCH 00/14] GFS In-Reply-To: <20050802071828.GA11217@redhat.com> References: <20050802071828.GA11217@redhat.com> Message-ID: <42F77D2E.8000806@gmail.com> I patched gfs2-full.patch to 2.6.12.2 kernel, however, if I don't comment out "depends on DLM" in fs/Kconfig, I can't see GFS2 in "make menuconfig", and of course, this result to compiling failure. config GFS2_FS tristate "GFS2 file system support" # depends on DLM help A cluster filesystem. Thanks, Michael David Teigland wrote: >Hi, GFS (Global File System) is a cluster file system that we'd like to >see added to the kernel. The 14 patches total about 900K so I won't send >them to the list unless that's requested. Comments and suggestions are >welcome. Thanks > >http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch >http://redhat.com/~teigland/gfs2/20050801/broken-out/ > >Dave > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >http://www.redhat.com/mailman/listinfo/linux-cluster > > > From sdake at mvista.com Mon Aug 8 18:13:29 2005 From: sdake at mvista.com (Steven Dake) Date: Mon, 08 Aug 2005 11:13:29 -0700 Subject: [Linux-cluster] Where to go with cman ? In-Reply-To: <42F77AA3.80000@redhat.com> References: <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> <42F77AA3.80000@redhat.com> Message-ID: <1123524809.16145.17.camel@localhost.localdomain> On Mon, 2005-08-08 at 16:30 +0100, Patrick Caulfield wrote: > Steven Dake wrote: > > Thats great news Patrick. 
One thing you should be aware of is that I > > have changed some of the internal interfaces in preparation for others > > to use libtotem to be extremely more sanitary. Unfortunately I may have > > done this a little too late in your case.. But I think you will find > > things are a little better. It really only effects totempg_initialize. > > Also libtotem was renamed to libtotem_pg because of requests from Daniel > > about a name-space collision with some movie player in fc4. > > Yes I spotted that, my current "nearly-working" cman is based on the latest SVN > sources. > > > For multihoming, I want to support the totem redundant ring protocol in > > the totem code. This is an extension of totemsrp to support multiple > > nics per processor. Then data is either actively or passively > > replicated over multiple links. There is essentially no failover and > > multiple links can offer better performance and still operate properly > > when one entire network fails. It looks pretty simple to implement. > > The paper is at: > > > > http://www.rcsc.de/pdf/icdcs02.pdf > > Excellent, thanks. I'll have a read. > Patrick, Over the weekend I reorged the totem code significantly (although the totempg interfaces have not changed). The reorg was painful timewise, but the result is that redundant ring should be pretty easy to implement now. Basically I took all of the network junk out of totemsrp and put it in "totemnet.c". It also allows for multiple instances of totemnet binds. This is the main feature I needed to implement redundant ring in a clean fashion. The ipv6 support should be a little easier to add now since most of the network code is limited to totemnet. I should have a patch in a few days with a redundant ring passive and active implementation. regards -steve From zab at zabbo.net Mon Aug 8 18:32:55 2005 From: zab at zabbo.net (Zach Brown) Date: Mon, 08 Aug 2005 11:32:55 -0700 Subject: [Linux-cluster] Re: GFS In-Reply-To: References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> Message-ID: <42F7A557.3000200@zabbo.net> Pekka J Enberg wrote: > Sorry if this is an obvious question but what prevents another thread > from doing mmap() before we do the second walk and messing up num_gh? Nothing, I suspect. OCFS2 has a problem like this, too. It wants a way for a file system to serialize mmap/munmap/mremap during file IO. Well, more specifically, it wants to make sure that the locks it acquired at the start of the IO really cover the buf regions that might fault during the IO.. mapping activity during the IO can wreck that. - z From teigland at redhat.com Tue Aug 9 14:13:35 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 9 Aug 2005 22:13:35 +0800 Subject: [Linux-cluster] [PATCH 00/14] GFS In-Reply-To: <42F77D2E.8000806@gmail.com> References: <20050802071828.GA11217@redhat.com> <42F77D2E.8000806@gmail.com> Message-ID: <20050809141334.GB12114@redhat.com> On Mon, Aug 08, 2005 at 11:41:34PM +0800, Michael wrote: > I patched gfs2-full.patch to 2.6.12.2 kernel, however, if I don't > comment out "depends on DLM" in fs/Kconfig, > I can't see GFS2 in "make menuconfig", and of course, this result to > compiling failure. > > config GFS2_FS > tristate "GFS2 file system support" > # depends on DLM > help > A cluster filesystem. you need the dlm as found in -mm kernels Dave From ggilyeat at jhsph.edu Tue Aug 9 14:29:26 2005 From: ggilyeat at jhsph.edu (Gerald G. 
Gilyeat) Date: Tue, 9 Aug 2005 10:29:26 -0400 Subject: [Linux-cluster] Tuning... Message-ID: I'm looking for information regarding what each of the tunable parameters returned by gfs_tool gettune actually -does-. Output from one of my partitions: ilimit1 = 100 ilimit1_tries = 3 ilimit1_min = 1 ilimit2 = 500 ilimit2_tries = 10 ilimit2_min = 3 demote_secs = 300 incore_log_blocks = 1024 jindex_refresh_secs = 60 gldep_secs = 30 scand_secs = 5 recoverd_secs = 60 logd_secs = 1 quotad_secs = 5 inoded_secs = 15 quota_simul_sync = 64 quota_warn_period = 10 atime_quantum = 3600 quota_quantum = 60 quota_scale = 1.0000 (1, 1) quota_enforce = 1 quota_account = 1 new_files_jdata = 0 new_files_directio = 0 max_atomic_write = 4194304 max_readahead = 262144 lockdump_size = 131072 stall_secs = 600 complain_secs = 10 reclaim_limit = 5000 entries_per_readdir = 32 prefetch_secs = 10 statfs_slots = 64 Some of them are fairly obvious, but I'd like to have a more solid idea of what each does before I start mucking with things. Thanks! -- Jerry Gilyeat, RHCE Systems Administrator Molecular Microbiology and Immunology Johns Hopkins Bloomberg School of Public Health -------------- next part -------------- An HTML attachment was scrubbed... URL: From penberg at cs.helsinki.fi Tue Aug 9 14:55:41 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Tue, 09 Aug 2005 17:55:41 +0300 Subject: [Linux-cluster] Re: GFS References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> Message-ID: Hi David, Here are some more comments. Pekka +/************************************************************************** **** > +******************************************************************************* > +** > +** Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. > +** Copyright (C) 2004-2005 Red Hat, Inc. All rights reserved. > +** > +** This copyrighted material is made available to anyone wishing to use, > +** modify, copy, or redistribute it subject to the terms and conditions > +** of the GNU General Public License v.2. > +** > +******************************************************************************* > +******************************************************************************/ Do you really need this verbose banner? > +#define NO_CREATE 0 > +#define CREATE 1 > + > +#define NO_WAIT 0 > +#define WAIT 1 > + > +#define NO_FORCE 0 > +#define FORCE 1 The code seems to interchangeably use FORCE and NO_FORCE together with TRUE and FALSE. Perhaps they could be dropped? > +#define GLF_PLUG 0 > +#define GLF_LOCK 1 > +#define GLF_STICKY 2 > +#define GLF_PREFETCH 3 > +#define GLF_SYNC 4 > +#define GLF_DIRTY 5 > +#define GLF_SKIP_WAITERS2 6 > +#define GLF_GREEDY 7 Would be nice if these were either enums or had a comment linking them to the struct member they are used for. > +#define GIF_MIN_INIT 0 > +#define GIF_QD_LOCKED 1 > +#define GIF_PAGED 2 > +#define GIF_SW_PAGED 3 Same here and in few other places too. 
> +#define LO_BEFORE_COMMIT(sdp) \ > +do { \ > + int __lops_x; \ > + for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \ > + if (gfs2_log_ops[__lops_x]->lo_before_commit) \ > + gfs2_log_ops[__lops_x]->lo_before_commit((sdp)); \ > +} while (0) > + > +#define LO_AFTER_COMMIT(sdp, ai) \ > +do { \ > + int __lops_x; \ > + for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \ > + if (gfs2_log_ops[__lops_x]->lo_after_commit) \ > + gfs2_log_ops[__lops_x]->lo_after_commit((sdp), (ai)); \ > +} while (0) > + > +#define LO_BEFORE_SCAN(jd, head, pass) \ > +do \ > +{ \ > + int __lops_x; \ > + for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \ > + if (gfs2_log_ops[__lops_x]->lo_before_scan) \ > + gfs2_log_ops[__lops_x]->lo_before_scan((jd), (head), (pass)); \ > +} \ > +while (0) static inline functions, please. > +static inline int LO_SCAN_ELEMENTS(struct gfs2_jdesc *jd, unsigned int start, > + struct gfs2_log_descriptor *ld, > + unsigned int pass) Lower case name, please. > +{ > + unsigned int x; > + int error; > + > + for (x = 0; gfs2_log_ops[x]; x++) > + if (gfs2_log_ops[x]->lo_scan_elements) { > + error = gfs2_log_ops[x]->lo_scan_elements(jd, start, > + ld, pass); > + if (error) > + return error; > + } > + > + return 0; > +} > + > +#define LO_AFTER_SCAN(jd, error, pass) \ > +do \ > +{ \ > + int __lops_x; \ > + for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \ > + if (gfs2_log_ops[__lops_x]->lo_before_scan) \ > + gfs2_log_ops[__lops_x]->lo_after_scan((jd), (error), (pass)); \ > +} \ > +while (0) static inline function, please. > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include Preferred order is to include linux/ first and asm/ after that. > +#define vma2state(vma) \ > +((((vma)->vm_flags & (VM_MAYWRITE | VM_MAYSHARE)) == \ > + (VM_MAYWRITE | VM_MAYSHARE)) ? \ > + LM_ST_EXCLUSIVE : LM_ST_SHARED) \ static inline function, please. The above is completely unreadable. > +struct inode *gfs2_ip2v(struct gfs2_inode *ip, int create) > +{ > + struct inode *inode = NULL, *tmp; > + > + gfs2_assert_warn(ip->i_sbd, > + test_bit(GIF_MIN_INIT, &ip->i_flags)); > + > + spin_lock(&ip->i_spin); > + if (ip->i_vnode) > + inode = igrab(ip->i_vnode); > + spin_unlock(&ip->i_spin); Suggestion: make the above a separate function __gfs2_lookup_inode(), use it here and where you pass NO_CREATE to get rid of the create parameter. > + > + if (inode || !create) > + return inode; > + > + tmp = new_inode(ip->i_sbd->sd_vfs); > + if (!tmp) > + return NULL; [snip] > + entries = gfs2_tune_get(sdp, gt_entries_per_readdir); > + size = sizeof(struct filldir_bad) + > + entries * (sizeof(struct filldir_bad_entry) + GFS2_FAST_NAME_SIZE); > + > + fdb = kmalloc(size, GFP_KERNEL); > + if (!fdb) > + return -ENOMEM; > + memset(fdb, 0, size); kzalloc(), which is in 2.6.13-rc6-mm5 please. Appears in other places as well. > + if (error) { > + printk("GFS2: fsid=%s: can't make FS RW: %d\n", > + sdp->sd_fsname, error); > + goto fail_proc; > + } > + } > + > + gfs2_glock_dq_uninit(&mount_gh); > + > + return 0; > + > + fail_proc: > + gfs2_proc_fs_del(sdp); > + init_threads(sdp, UNDO); Please provide a release_threads instead and make it deal with partial initialization. The above is very confusing. 
> + parent, > + strlen(system_utsname.nodename)); > + else if (gfs2_filecmp(&dentry->d_name, "@mach", 5)) > + new = lookup_one_len(system_utsname.machine, > + parent, > + strlen(system_utsname.machine)); > + else if (gfs2_filecmp(&dentry->d_name, "@os", 3)) > + new = lookup_one_len(system_utsname.sysname, > + parent, > + strlen(system_utsname.sysname)); > + else if (gfs2_filecmp(&dentry->d_name, "@uid", 4)) > + new = lookup_one_len(buf, > + parent, > + sprintf(buf, "%u", current->fsuid)); > + else if (gfs2_filecmp(&dentry->d_name, "@gid", 4)) > + new = lookup_one_len(buf, > + parent, > + sprintf(buf, "%u", current->fsgid)); > + else if (gfs2_filecmp(&dentry->d_name, "@sys", 4)) > + new = lookup_one_len(buf, > + parent, > + sprintf(buf, "%s_%s", > + system_utsname.machine, > + system_utsname.sysname)); > + else if (gfs2_filecmp(&dentry->d_name, "@jid", 4)) > + new = lookup_one_len(buf, > + parent, > + sprintf(buf, "%u", > + sdp->sd_jdesc->jd_jid)); Smells like policy in the kernel. Why can't this be done in the userspace? > + parent, > + strlen(system_utsname.nodename)); > + else if (gfs2_filecmp(&dentry->d_name, "{mach}", 6)) > + new = lookup_one_len(system_utsname.machine, > + parent, > + strlen(system_utsname.machine)); > + else if (gfs2_filecmp(&dentry->d_name, "{os}", 4)) > + new = lookup_one_len(system_utsname.sysname, > + parent, > + strlen(system_utsname.sysname)); > + else if (gfs2_filecmp(&dentry->d_name, "{uid}", 5)) > + new = lookup_one_len(buf, > + parent, > + sprintf(buf, "%u", current->fsuid)); > + else if (gfs2_filecmp(&dentry->d_name, "{gid}", 5)) > + new = lookup_one_len(buf, > + parent, > + sprintf(buf, "%u", current->fsgid)); > + else if (gfs2_filecmp(&dentry->d_name, "{sys}", 5)) > + new = lookup_one_len(buf, > + parent, > + sprintf(buf, "%s_%s", > + system_utsname.machine, > + system_utsname.sysname)); > + else if (gfs2_filecmp(&dentry->d_name, "{jid}", 5)) > + new = lookup_one_len(buf, > + parent, > + sprintf(buf, "%u", > + sdp->sd_jdesc->jd_jid)); Ditto. > +int gfs2_statfs_slow(struct gfs2_sbd *sdp, struct gfs2_statfs_change *sc) > +{ > + struct gfs2_holder ri_gh; > + struct gfs2_rgrpd *rgd_next; > + struct gfs2_holder *gha, *gh; > + unsigned int slots = 64; > + unsigned int x; > + int done; > + int error = 0, err; > + > + memset(sc, 0, sizeof(struct gfs2_statfs_change)); > + gha = kmalloc(slots * sizeof(struct gfs2_holder), GFP_KERNEL); > + if (!gha) > + return -ENOMEM; > + memset(gha, 0, slots * sizeof(struct gfs2_holder)); kcalloc, please > + line = kmalloc(256, GFP_KERNEL); > + if (!line) > + return -ENOMEM; > + > + len = snprintf(line, 256, "GFS2: fsid=%s: quota %s for %s %u\r\n", > + sdp->sd_fsname, type, > + (test_bit(QDF_USER, &qd->qd_flags)) ? "user" : "group", > + qd->qd_id); Please use constant instead of magic number 256. > +struct lm_lockops gdlm_ops = { > + lm_proto_name:"lock_dlm", > + lm_mount:gdlm_mount, > + lm_others_may_mount:gdlm_others_may_mount, > + lm_unmount:gdlm_unmount, > + lm_withdraw:gdlm_withdraw, > + lm_get_lock:gdlm_get_lock, > + lm_put_lock:gdlm_put_lock, > + lm_lock:gdlm_lock, > + lm_unlock:gdlm_unlock, > + lm_plock:gdlm_plock, > + lm_punlock:gdlm_punlock, > + lm_plock_get:gdlm_plock_get, > + lm_cancel:gdlm_cancel, > + lm_hold_lvb:gdlm_hold_lvb, > + lm_unhold_lvb:gdlm_unhold_lvb, > + lm_sync_lvb:gdlm_sync_lvb, > + lm_recovery_done:gdlm_recovery_done, > + lm_owner:THIS_MODULE, > +}; C99 initializers, please. 
From lhh at redhat.com Tue Aug 9 15:40:50 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 09 Aug 2005 11:40:50 -0400 Subject: [Linux-cluster] Clustered LDAP, good or bad idea? In-Reply-To: <42F89D77.80907@netregistry.com.au> References: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca> <1123531971.3992.51.camel@ayanami.boston.redhat.com> <42F89D77.80907@netregistry.com.au> Message-ID: <1123602050.13564.6.camel@ayanami.boston.redhat.com> On Tue, 2005-08-09 at 22:11 +1000, Adam Cassar wrote: > We run openldap quite extensively here. In my experience, if slapd fails > it is usually due to some backend db issue, and gfs will not help you > with that. > > A master slave set up will be your best option. Right, I think he meant to run one without a slave as a failover cluster service, not a multi-instance app running atop of GFS... *shrug* -- Lon From thomsonr at ucalgary.ca Tue Aug 9 16:17:46 2005 From: thomsonr at ucalgary.ca (Ryan Thomson) Date: Tue, 09 Aug 2005 10:17:46 -0600 Subject: [Linux-cluster] Clustered LDAP, good or bad idea? In-Reply-To: <1123602050.13564.6.camel@ayanami.boston.redhat.com> References: <1123290505.31135.10.camel@porcupine.bio.ucalgary.ca> <1123531971.3992.51.camel@ayanami.boston.redhat.com> <42F89D77.80907@netregistry.com.au> <1123602050.13564.6.camel@ayanami.boston.redhat.com> Message-ID: <1123604266.27091.5.camel@porcupine.bio.ucalgary.ca> Indeed, that is what I meant. I'm looking at running OpenLDAP as a failover cluster service. I understand what was meant about db failures and how GFS won't help with that. Our LDAP directory is fairly simple and updates/changes are made very infrequently. I've never experienced a db failure but I won't discount that as a possible issue (that's what I have backups for. TSM, I love you). I'll take my chances and run it the way I'd planned with some failover testing before I go production. -- Ryan On Tue, 2005-08-09 at 11:40 -0400, Lon Hohberger wrote: > On Tue, 2005-08-09 at 22:11 +1000, Adam Cassar wrote: > > We run openldap quite extensively here. In my experience, if slapd fails > > it is usually due to some backend db issue, and gfs will not help you > > with that. > > > > A master slave set up will be your best option. > > Right, I think he meant to run one without a slave as a failover cluster > service, not a multi-instance app running atop of GFS... > > *shrug* > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Tue Aug 9 18:24:44 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 09 Aug 2005 14:24:44 -0400 Subject: [Linux-cluster] problem with rejoining a node In-Reply-To: <42F8A0A4.5000507@redhat.com> References: <20050808141238.GA6455@gibson.drslump.org> <42F77BB2.4070705@redhat.com> <20050808215530.GA23695@gibson.drslump.org> <42F8A0A4.5000507@redhat.com> Message-ID: <1123611884.13564.11.camel@ayanami.boston.redhat.com> On Tue, 2005-08-09 at 13:25 +0100, Patrick Caulfield wrote: > > and if so, should I enable manually the port on the SAN, or will fenced > > do it for me (as the script does actually accepts an enable parameter) > > :?? > > > > I don't know. I've never had access to a SAN fencing device! 
> When using SAN fencing, the easiest set of steps to follow to ensure data integrity + correct operation: (a) power off machine (b) un-fence at the SAN level (c) power on machine -- Lon From teigland at redhat.com Wed Aug 10 05:59:45 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 10 Aug 2005 13:59:45 +0800 Subject: [Linux-cluster] Re: GFS In-Reply-To: References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> Message-ID: <20050810055945.GB13926@redhat.com> On Mon, Aug 08, 2005 at 05:14:45PM +0300, Pekka J Enberg wrote: if (!dumping) down_read(&mm->mmap_sem); > >+ > >+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) { > >+ if (end <= vma->vm_start) > >+ break; > >+ if (vma->vm_file && > >+ vma->vm_file->f_dentry->d_inode->i_sb == sb) { > >+ num_gh++; > >+ } > >+ } > >+ > >+ ghs = kmalloc((num_gh + 1) * sizeof(struct gfs2_holder), > >+ GFP_KERNEL); > >+ if (!ghs) { > >+ if (!dumping) > >+ up_read(&mm->mmap_sem); > >+ return -ENOMEM; > >+ } > >+ > >+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) { > > Sorry if this is an obvious question but what prevents another thread from > doing mmap() before we do the second walk and messing up num_gh? mm->mmap_sem ? From penberg at cs.helsinki.fi Wed Aug 10 06:06:42 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Wed, 10 Aug 2005 09:06:42 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050810055945.GB13926@redhat.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <20050810055945.GB13926@redhat.com> Message-ID: David Teigland writes: > > if (!dumping) > down_read(&mm->mmap_sem); > > > + > > > + for (vma = find_vma(mm, start); vma; vma = vma->vm_next) { > > > + if (end <= vma->vm_start) > > > + break; > > > + if (vma->vm_file && > > > + vma->vm_file->f_dentry->d_inode->i_sb == sb) { > > > + num_gh++; > > > + } > > > + } > > > + > > > + ghs = kmalloc((num_gh + 1) * sizeof(struct gfs2_holder), > > > + GFP_KERNEL); > > > + if (!ghs) { > > > + if (!dumping) > > > + up_read(&mm->mmap_sem); > > > + return -ENOMEM; > > > + } > > > + > > > + for (vma = find_vma(mm, start); vma; vma = vma->vm_next) { > > > > Sorry if this is an obvious question but what prevents another thread from > > doing mmap() before we do the second walk and messing up num_gh? > > mm->mmap_sem ? Aah, I read that !dumping expression the other way around. Sorry and thanks. Pekka From pcaulfie at redhat.com Wed Aug 10 07:20:44 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 10 Aug 2005 08:20:44 +0100 Subject: [Linux-cluster] problem with rejoining a node In-Reply-To: <1123611884.13564.11.camel@ayanami.boston.redhat.com> References: <20050808141238.GA6455@gibson.drslump.org> <42F77BB2.4070705@redhat.com> <20050808215530.GA23695@gibson.drslump.org> <42F8A0A4.5000507@redhat.com> <1123611884.13564.11.camel@ayanami.boston.redhat.com> Message-ID: <42F9AACC.9030009@redhat.com> Lon Hohberger wrote: > > When using SAN fencing, the easiest set of steps to follow to ensure > data integrity + correct operation: > > (a) power off machine > (b) un-fence at the SAN level > (c) power on machine > That's pretty much what I suspected. I just thought it better it came from someone who knew what they were talking about :) Thanks. 
-- patrick From penberg at cs.helsinki.fi Wed Aug 10 07:40:37 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Wed, 10 Aug 2005 10:40:37 +0300 Subject: [Linux-cluster] Re: GFS References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> Message-ID: Hi David, > + return -EINVAL; > + if (!access_ok(VERIFY_WRITE, buf, size)) > + return -EFAULT; > + > + if (!(file->f_flags & O_LARGEFILE)) { > + if (*offset >= 0x7FFFFFFFull) > + return -EFBIG; > + if (*offset + size > 0x7FFFFFFFull) > + size = 0x7FFFFFFFull - *offset; Please use a constant instead for 0x7FFFFFFFull. (Appears in various other places as well.) Pekka From lmb at suse.de Wed Aug 10 10:30:41 2005 From: lmb at suse.de (Lars Marowsky-Bree) Date: Wed, 10 Aug 2005 12:30:41 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810070309.GA2415@infradead.org> References: <20050802071828.GA11217@redhat.com> <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> Message-ID: <20050810103041.GB4634@marowsky-bree.de> On 2005-08-10T08:03:09, Christoph Hellwig wrote: > > Kindly lose the "Context Dependent Pathname" crap. > Same for ocfs2. Would a generic implementation of that higher up in the VFS be more acceptable? It's not like context-dependent symlinks are an arbitary feature, but rather very useful in practice. Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering SUSE Labs, Research and Development SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin "Ignorance more frequently begets confidence than does knowledge" From lmb at suse.de Wed Aug 10 10:34:24 2005 From: lmb at suse.de (Lars Marowsky-Bree) Date: Wed, 10 Aug 2005 12:34:24 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810103256.GA6127@infradead.org> References: <20050802071828.GA11217@redhat.com> <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> <20050810103256.GA6127@infradead.org> Message-ID: <20050810103424.GC4634@marowsky-bree.de> On 2005-08-10T11:32:56, Christoph Hellwig wrote: > > Would a generic implementation of that higher up in the VFS be more > > acceptable? > No. Use mount --bind That's a working and less complex alternative for upto how many places at once? That works for non-root users how...? Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering SUSE Labs, Research and Development SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin "Ignorance more frequently begets confidence than does knowledge" From lmb at suse.de Wed Aug 10 11:02:59 2005 From: lmb at suse.de (Lars Marowsky-Bree) Date: Wed, 10 Aug 2005 13:02:59 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810105450.GA6519@infradead.org> References: <20050802071828.GA11217@redhat.com> <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> <20050810103256.GA6127@infradead.org> <20050810103424.GC4634@marowsky-bree.de> <20050810105450.GA6519@infradead.org> Message-ID: <20050810110259.GE4634@marowsky-bree.de> On 2005-08-10T11:54:50, Christoph Hellwig wrote: > It works now. Unlike context link which steal totally valid symlink > targets for magic mushroom bullshit. Right, that is a valid concern. Avoiding context dependent symlinks entirely certainly is one possible path around this. 
But, let's just for the sake of this discussion continue the other path for a bit, to explore the options available for implementing CPS which don't result in shivers running down the spine, because I believe CPS do have some applications in which bind mounts are not entirely adequate replacements. (Unless, of course, you want a bind mount for each homedirectory which might include architecture-specific subdirectories or for every host-specific configuration file.) What would a syntax look like which in your opinion does not remove totally valid symlink targets for magic mushroom bullshit? Prefix with // (which, according to POSIX, allows for implementation-defined behaviour)? Something else, not allowed in a regular pathname? If we can't find an acceptable way of implementing them, maybe it's time to grab some magic mushrooms and come up with a new approach, then ;-) Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering SUSE Labs, Research and Development SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin "Ignorance more frequently begets confidence than does knowledge" From lmb at suse.de Wed Aug 10 11:09:17 2005 From: lmb at suse.de (Lars Marowsky-Bree) Date: Wed, 10 Aug 2005 13:09:17 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810110511.GA6728@infradead.org> References: <20050802071828.GA11217@redhat.com> <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> <20050810103256.GA6127@infradead.org> <20050810103424.GC4634@marowsky-bree.de> <20050810105450.GA6519@infradead.org> <20050810110259.GE4634@marowsky-bree.de> <20050810110511.GA6728@infradead.org> Message-ID: <20050810110917.GG4634@marowsky-bree.de> On 2005-08-10T12:05:11, Christoph Hellwig wrote: > > What would a syntax look like which in your opinion does not remove > > totally valid symlink targets for magic mushroom bullshit? Prefix with > > // (which, according to POSIX, allows for implementation-defined > > behaviour)? Something else, not allowed in a regular pathname? > None. just don't do it. Use bindmount, they're cheap and have sane > defined semtantics. So for every directoy hiearchy on a shared filesystem, each user needs to have the complete list of bindmounts needed, and automatically resync that across all nodes when a new one is added or removed? And then have that executed by root, because a regular user can't? Sure. Very cheap and sane. I'm buying. 
Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering SUSE Labs, Research and Development SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin "Ignorance more frequently begets confidence than does knowledge" From javipolo at datagrama.net Wed Aug 10 11:45:26 2005 From: javipolo at datagrama.net (Javi Polo) Date: Wed, 10 Aug 2005 13:45:26 +0200 Subject: [Linux-cluster] problem with rejoining a node In-Reply-To: <1123611884.13564.11.camel@ayanami.boston.redhat.com> References: <20050808141238.GA6455@gibson.drslump.org> <42F77BB2.4070705@redhat.com> <20050808215530.GA23695@gibson.drslump.org> <42F8A0A4.5000507@redhat.com> <1123611884.13564.11.camel@ayanami.boston.redhat.com> Message-ID: <20050810114526.GA23149@kinoko.datagrama.net> On Aug/09/2005, Lon Hohberger wrote: > When using SAN fencing, the easiest set of steps to follow to ensure > data integrity + correct operation: > (a) power off machine > (b) un-fence at the SAN level > (c) power on machine I've made a script that, prior to starting any of the cluster infrastructure, enables his SAN port. I can then join the cluster, but when I try to join the fence, it locks up there ... : gfstest1:~# cman_tool services Service Name GID LID State Code Fence Domain: "default" 0 2 join S-1,80,3 [] gfstest1:~# cman_tool nodes Node Votes Exp Sts Name 1 1 3 M gfstest1 2 1 3 M gfstest2 3 1 3 M gfstest3 gfstest1:~# cman_tool status Protocol version: 4.0.1 Config version: 9 Cluster name: test_cluster Cluster ID: 61876 Membership state: Cluster-Member Nodes: 3 Expected_votes: 3 Total_votes: 3 Quorum: 2 Active subsystems: 1 Node addresses: 192.168.0.1 gfstest1:~# from other nodes, I see it as recovering: gfstest2:/etc/init.d# cman_tool services Service Name GID LID State Code Fence Domain: "default" 1 2 recover 2 - [2 3] what happent? -- Javier Polo @ Datagrama 902 136 126 From penberg at cs.helsinki.fi Tue Aug 9 14:49:43 2005 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Tue, 09 Aug 2005 17:49:43 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <42F7A557.3000200@zabbo.net> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> Message-ID: <1123598983.10790.1.camel@haji.ri.fi> On Mon, 2005-08-08 at 11:32 -0700, Zach Brown wrote: > > Sorry if this is an obvious question but what prevents another thread > > from doing mmap() before we do the second walk and messing up num_gh? > > Nothing, I suspect. OCFS2 has a problem like this, too. It wants a way > for a file system to serialize mmap/munmap/mremap during file IO. Well, > more specifically, it wants to make sure that the locks it acquired at > the start of the IO really cover the buf regions that might fault during > the IO.. mapping activity during the IO can wreck that. In addition, the vma walk will become an unmaintainable mess as soon as someone introduces another mmap() capable fs that needs similar locking. I am not an expert so could someone please explain why this cannot be done with a_ops->prepare_write and friends? 
Pekka From viro at parcelfarce.linux.theplanet.co.uk Tue Aug 9 15:20:45 2005 From: viro at parcelfarce.linux.theplanet.co.uk (Al Viro) Date: Tue, 9 Aug 2005 16:20:45 +0100 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050802071828.GA11217@redhat.com> References: <20050802071828.GA11217@redhat.com> Message-ID: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> On Tue, Aug 02, 2005 at 03:18:28PM +0800, David Teigland wrote: > Hi, GFS (Global File System) is a cluster file system that we'd like to > see added to the kernel. The 14 patches total about 900K so I won't send > them to the list unless that's requested. Comments and suggestions are > welcome. Thanks > > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch > http://redhat.com/~teigland/gfs2/20050801/broken-out/ Kindly lose the "Context Dependent Pathname" crap. From zab at zabbo.net Tue Aug 9 17:17:10 2005 From: zab at zabbo.net (Zach Brown) Date: Tue, 09 Aug 2005 10:17:10 -0700 Subject: [Linux-cluster] Re: GFS In-Reply-To: <1123598983.10790.1.camel@haji.ri.fi> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> Message-ID: <42F8E516.7020600@zabbo.net> Pekka Enberg wrote: > In addition, the vma walk will become an unmaintainable mess as soon > as someone introduces another mmap() capable fs that needs similar > locking. Yup, I suspect that if the core kernel ends up caring about this problem then the VFS will be involved in helping file systems sort the locks they'll acquire around IO. > I am not an expert so could someone please explain why this cannot be > done with a_ops->prepare_write and friends? I'll try, briefly. Usually clustered file systems in Linux maintain data consistency for normal posix IO by holding DLM locks for the duration of their file->{read,write} methods. A task on a node won't be able to read until all tasks on other nodes have finished any conflicting writes they might have been performing, etc, nothing surprising here. Now say we want to extend consistency guarantees to mmap(). This boils down to protecting mappings with DLM locks. Say a page is mapped for reading, the continued presence of that mapping is protected by holding a DLM lock. If another node goes to write to that page, the read lock is revoked and the mapping is torn down. These locks are acquired in a_ops->nopage as the task faults and tries to bring up the mapping. And that's the problem. Because they're acquired in ->nopage they can be acquired during a fault that is servicing the 'buf' argument to an outer file->{read,write} operation which has grabbed a lock for the target file. Acquiring multiple locks introduces the risk of ABBA deadlocks. It's trivial to construct examples of mmap(), read(), and write() on 2 nodes with 2 files that deadlock. So clustered file systems in Linux (GFS, Lustre, OCFS2, (GPFS?)) all walk vmas in their file->{read,write} to discover mappings that belong to their files so that they can preemptively sort and acquire the locks that will be needed to cover the mappings that might be established in ->nopage. As you point out, this both relies on the mappings not changing and gets very exciting when you mix files and mappings between file systems that are each sorting and acquiring their own DLM locks. I brought this up with some people at the kernel summit but no one, including myself, considers it a high priority. 
It wouldn't be too hard to construct a patch if people want to take a look. - z From penberg at cs.helsinki.fi Tue Aug 9 18:35:58 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Tue, 09 Aug 2005 21:35:58 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <42F8E516.7020600@zabbo.net> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> <42F8E516.7020600@zabbo.net> Message-ID: Hi Zach, Zach Brown writes: > I'll try, briefly. Thanks for the excellent explanation. Zach Brown writes: > And that's the problem. Because they're acquired in ->nopage they can > be acquired during a fault that is servicing the 'buf' argument to an > outer file->{read,write} operation which has grabbed a lock for the > target file. Acquiring multiple locks introduces the risk of ABBA > deadlocks. It's trivial to construct examples of mmap(), read(), and > write() on 2 nodes with 2 files that deadlock. But couldn't we use make_pages_present() to figure which locks we need, sort them, and then grab them? Zach Brown writes: > I brought this up with some people at the kernel summit but no one, > including myself, considers it a high priority. It wouldn't be too hard > to construct a patch if people want to take a look. I guess it's not a problem as long as the kernel has zero or one cluster filesystems that support mmap(). After we have two or more, we have a problem. The GFS2 vma walk needs fixing anyway, I think, as it can lead to buffer overflow (if we have more locks during the second walk). Pekka From penberg at cs.helsinki.fi Wed Aug 10 04:48:29 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Wed, 10 Aug 2005 07:48:29 +0300 Subject: [Linux-cluster] Re: GFS References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> <42F8E516.7020600@zabbo.net> Message-ID: Zach Brown writes: > But couldn't we use make_pages_present() to figure which locks we need, > sort them, and then grab them? Doh, obviously we can't as nopage() needs to bring the page in. Sorry about that. I also thought of another failure case for the vma walk. When a thread uses userspace memcpy() between two clusterfs mmap'd regions instead of write() or read(). Pekka From hch at infradead.org Wed Aug 10 07:03:09 2005 From: hch at infradead.org (Christoph Hellwig) Date: Wed, 10 Aug 2005 08:03:09 +0100 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> References: <20050802071828.GA11217@redhat.com> <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> Message-ID: <20050810070309.GA2415@infradead.org> On Tue, Aug 09, 2005 at 04:20:45PM +0100, Al Viro wrote: > On Tue, Aug 02, 2005 at 03:18:28PM +0800, David Teigland wrote: > > Hi, GFS (Global File System) is a cluster file system that we'd like to > > see added to the kernel. The 14 patches total about 900K so I won't send > > them to the list unless that's requested. Comments and suggestions are > > welcome. Thanks > > > > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch > > http://redhat.com/~teigland/gfs2/20050801/broken-out/ > > Kindly lose the "Context Dependent Pathname" crap. Same for ocfs2. 
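To make the memcpy() case Pekka raises a few messages up concrete: a user-space program like the sketch below (the mount points and file names are made up) never goes through read() or write(), so neither filesystem gets a chance to pre-sort and pre-acquire its cluster locks. Every page touched by the copy is faulted in lazily through ->nopage, one mapping at a time.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1 << 20;
	int sfd = open("/mnt/gfs2/a", O_RDONLY);
	int dfd = open("/mnt/ocfs2/b", O_RDWR);
	char *src, *dst;

	if (sfd < 0 || dfd < 0)
		return 1;
	src = mmap(NULL, len, PROT_READ, MAP_SHARED, sfd, 0);
	dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dfd, 0);
	if (src == MAP_FAILED || dst == MAP_FAILED)
		return 1;

	/* faults on both mappings happen inside this single copy;
	 * no file->read/write method runs, so no vma walk either */
	memcpy(dst, src, len);

	munmap(src, len);
	munmap(dst, len);
	close(sfd);
	close(dfd);
	return 0;
}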
From hch at infradead.org Wed Aug 10 07:21:21 2005 From: hch at infradead.org (Christoph Hellwig) Date: Wed, 10 Aug 2005 08:21:21 +0100 Subject: [Linux-cluster] Re: GFS In-Reply-To: <1123598983.10790.1.camel@haji.ri.fi> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> Message-ID: <20050810072121.GA2825@infradead.org> On Tue, Aug 09, 2005 at 05:49:43PM +0300, Pekka Enberg wrote: > On Mon, 2005-08-08 at 11:32 -0700, Zach Brown wrote: > > > Sorry if this is an obvious question but what prevents another thread > > > from doing mmap() before we do the second walk and messing up num_gh? > > > > Nothing, I suspect. OCFS2 has a problem like this, too. It wants a way > > for a file system to serialize mmap/munmap/mremap during file IO. Well, > > more specifically, it wants to make sure that the locks it acquired at > > the start of the IO really cover the buf regions that might fault during > > the IO.. mapping activity during the IO can wreck that. > > In addition, the vma walk will become an unmaintainable mess as soon as > someone introduces another mmap() capable fs that needs similar locking. We already have OCFS2 in -mm that does similar things. I think we need to solve this in common code before either of them can be merged. From penberg at cs.helsinki.fi Wed Aug 10 07:31:04 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Wed, 10 Aug 2005 10:31:04 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050810072121.GA2825@infradead.org> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> <20050810072121.GA2825@infradead.org> Message-ID: On Tue, Aug 09, 2005 at 05:49:43PM +0300, Pekka Enberg wrote: > > In addition, the vma walk will become an unmaintainable mess as soon as > > someone introduces another mmap() capable fs that needs similar locking. Christoph Hellwig writes: > We already have OCFS2 in -mm that does similar things. I think we need > to solve this in common code before either of them can be merged. It seems to me that the distributed locks must be acquired in ->nopage anyway to solve the problem with memcpy() between two mmap'd regions. One possible solution would be for the lock manager to detect deadlocks and break some locks accordingly. Don't know how well that would mix with ->nopage though... Pekka From hch at infradead.org Wed Aug 10 07:43:38 2005 From: hch at infradead.org (Christoph Hellwig) Date: Wed, 10 Aug 2005 08:43:38 +0100 Subject: [Linux-cluster] Re: GFS In-Reply-To: References: <20050802071828.GA11217@redhat.com> <84144f0205080223445375c907@mail.gmail.com> <20050808095747.GD13951@redhat.com> Message-ID: <20050810074338.GA3172@infradead.org> On Wed, Aug 10, 2005 at 10:40:37AM +0300, Pekka J Enberg wrote: > Hi David, > > >+ return -EINVAL; > >+ if (!access_ok(VERIFY_WRITE, buf, size)) > >+ return -EFAULT; > >+ > >+ if (!(file->f_flags & O_LARGEFILE)) { > >+ if (*offset >= 0x7FFFFFFFull) > >+ return -EFBIG; > >+ if (*offset + size > 0x7FFFFFFFull) > >+ size = 0x7FFFFFFFull - *offset; > > Please use a constant instead for 0x7FFFFFFFull. (Appears in various other > places as well.) In fact this very much looks like it's duplicating generic_write_checks(). Folks, please use common code. 
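For the read-side check quoted just above, the kernel already has a name for that limit: MAX_NON_LFS in include/linux/fs.h, defined as ((1UL<<31) - 1). A minimal rework of the quoted hunk would be the fragment below; the write side would presumably just call generic_write_checks() as Christoph suggests, rather than open-coding the clamp at all.

#include <linux/fs.h>	/* MAX_NON_LFS == ((1UL << 31) - 1) */

	if (!(file->f_flags & O_LARGEFILE)) {
		if (*offset >= MAX_NON_LFS)
			return -EFBIG;
		if (*offset + size > MAX_NON_LFS)
			size = MAX_NON_LFS - *offset;
	}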
From hch at infradead.org Wed Aug 10 10:32:56 2005 From: hch at infradead.org (Christoph Hellwig) Date: Wed, 10 Aug 2005 11:32:56 +0100 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810103041.GB4634@marowsky-bree.de> References: <20050802071828.GA11217@redhat.com> <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> Message-ID: <20050810103256.GA6127@infradead.org> On Wed, Aug 10, 2005 at 12:30:41PM +0200, Lars Marowsky-Bree wrote: > On 2005-08-10T08:03:09, Christoph Hellwig wrote: > > > > Kindly lose the "Context Dependent Pathname" crap. > > Same for ocfs2. > > Would a generic implementation of that higher up in the VFS be more > acceptable? No. Use mount --bind From hch at infradead.org Wed Aug 10 10:54:50 2005 From: hch at infradead.org (Christoph Hellwig) Date: Wed, 10 Aug 2005 11:54:50 +0100 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810103424.GC4634@marowsky-bree.de> References: <20050802071828.GA11217@redhat.com> <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> <20050810103256.GA6127@infradead.org> <20050810103424.GC4634@marowsky-bree.de> Message-ID: <20050810105450.GA6519@infradead.org> On Wed, Aug 10, 2005 at 12:34:24PM +0200, Lars Marowsky-Bree wrote: > On 2005-08-10T11:32:56, Christoph Hellwig wrote: > > > > Would a generic implementation of that higher up in the VFS be more > > > acceptable? > > No. Use mount --bind > > That's a working and less complex alternative for upto how many places > at once? That works for non-root users how...? It works now. Unlike context link which steal totally valid symlink targets for magic mushroom bullshit. From hch at infradead.org Wed Aug 10 11:05:11 2005 From: hch at infradead.org (Christoph Hellwig) Date: Wed, 10 Aug 2005 12:05:11 +0100 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810110259.GE4634@marowsky-bree.de> References: <20050802071828.GA11217@redhat.com> <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> <20050810103256.GA6127@infradead.org> <20050810103424.GC4634@marowsky-bree.de> <20050810105450.GA6519@infradead.org> <20050810110259.GE4634@marowsky-bree.de> Message-ID: <20050810110511.GA6728@infradead.org> On Wed, Aug 10, 2005 at 01:02:59PM +0200, Lars Marowsky-Bree wrote: > What would a syntax look like which in your opinion does not remove > totally valid symlink targets for magic mushroom bullshit? Prefix with > // (which, according to POSIX, allows for implementation-defined > behaviour)? Something else, not allowed in a regular pathname? None. just don't do it. Use bindmount, they're cheap and have sane defined semtantics. 
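For readers unfamiliar with the alternative being argued for: a bind mount simply makes an existing directory visible at a second path, so a per-node boot script (or a small helper along the lines of the sketch below, with made-up paths) is what "use mount --bind" amounts to. It requires CAP_SYS_ADMIN, which is exactly the root-only objection raised earlier in the thread.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* equivalent of: mount --bind /gfs/hosts/node1 /gfs/this-node */
	if (mount("/gfs/hosts/node1", "/gfs/this-node", NULL, MS_BIND, NULL) < 0) {
		perror("mount --bind");
		return 1;
	}
	return 0;
}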
From hch at infradead.org Wed Aug 10 11:11:10 2005 From: hch at infradead.org (Christoph Hellwig) Date: Wed, 10 Aug 2005 12:11:10 +0100 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810110917.GG4634@marowsky-bree.de> References: <20050802071828.GA11217@redhat.com> <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> <20050810103256.GA6127@infradead.org> <20050810103424.GC4634@marowsky-bree.de> <20050810105450.GA6519@infradead.org> <20050810110259.GE4634@marowsky-bree.de> <20050810110511.GA6728@infradead.org> <20050810110917.GG4634@marowsky-bree.de> Message-ID: <20050810111110.GA6878@infradead.org> On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote: > On 2005-08-10T12:05:11, Christoph Hellwig wrote: > > > > What would a syntax look like which in your opinion does not remove > > > totally valid symlink targets for magic mushroom bullshit? Prefix with > > > // (which, according to POSIX, allows for implementation-defined > > > behaviour)? Something else, not allowed in a regular pathname? > > None. just don't do it. Use bindmount, they're cheap and have sane > > defined semtantics. > > So for every directoy hiearchy on a shared filesystem, each user needs > to have the complete list of bindmounts needed, and automatically resync > that across all nodes when a new one is added or removed? And then have > that executed by root, because a regular user can't? Do it in an initscripts and let users simply not do it, they shouldn't even know what kind of filesystem they are on. From alewis at redhat.com Wed Aug 10 13:26:26 2005 From: alewis at redhat.com (AJ Lewis) Date: Wed, 10 Aug 2005 08:26:26 -0500 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810111110.GA6878@infradead.org> References: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> <20050810103256.GA6127@infradead.org> <20050810103424.GC4634@marowsky-bree.de> <20050810105450.GA6519@infradead.org> <20050810110259.GE4634@marowsky-bree.de> <20050810110511.GA6728@infradead.org> <20050810110917.GG4634@marowsky-bree.de> <20050810111110.GA6878@infradead.org> Message-ID: <20050810132626.GC4954@null.msp.redhat.com> On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote: > On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote: > > So for every directoy hiearchy on a shared filesystem, each user needs > > to have the complete list of bindmounts needed, and automatically resync > > that across all nodes when a new one is added or removed? And then have > > that executed by root, because a regular user can't? > > Do it in an initscripts and let users simply not do it, they shouldn't > even know what kind of filesystem they are on. I'm just thinking of a 100-node cluster that has different mounts on different nodes, and trying to update the bind mounts in a sane and efficient manner without clobbering the various mount setups. Ouch. -- AJ Lewis Voice: 612-638-0500 Red Hat E-Mail: alewis at redhat.com One Main Street SE, Suite 209 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From gforte at leopard.us.udel.edu Wed Aug 10 14:20:55 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Wed, 10 Aug 2005 10:20:55 -0400 Subject: [Linux-cluster] Which APC fence device ? Message-ID: <42FA0D47.1090308@leopard.us.udel.edu> There was a message on this list in January to the effect of "does APC 7900 work with fence_apc?" / "I think so ...". Has anyone confirmed this? -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From eric at bootseg.com Wed Aug 10 14:43:05 2005 From: eric at bootseg.com (Eric Kerin) Date: Wed, 10 Aug 2005 10:43:05 -0400 Subject: [Linux-cluster] Which APC fence device ? In-Reply-To: <42FA0D47.1090308@leopard.us.udel.edu> References: <42FA0D47.1090308@leopard.us.udel.edu> Message-ID: <1123684986.3402.14.camel@auh5-0479.corp.jabil.org> On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote: > There was a message on this list in January to the effect of "does APC > 7900 work with fence_apc?" / "I think so ...". > > Has anyone confirmed this? I'm using them in my cluster, they work well with the fence_apc agent. The only caveat is that the agent won't work if you use outlet groups for cluster nodes, without a few changes to the agent. I can give you my modified version if you want to use outlet groups for cluster nodes. (A patch will be coming that will work with both port group and non-port group modes, but I haven't had a chance to write it up yet.) Thanks, Eric From ggilyeat at jhsph.edu Wed Aug 10 14:47:15 2005 From: ggilyeat at jhsph.edu (Gerald G. Gilyeat) Date: Wed, 10 Aug 2005 10:47:15 -0400 Subject: [Linux-cluster] Test Env. Message-ID: Another question - What is the -minimal- hardware configuration to test GFS6.1 (with all the RHEL4 and LVM2 goodness...) Thanks :) -- Jerry Gilyeat, RHCE Systems Administrator Molecular Microbiology and Immunology Johns Hopkins Bloomberg School of Public Health -------------- next part -------------- An HTML attachment was scrubbed... URL: From gforte at leopard.us.udel.edu Wed Aug 10 15:01:56 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Wed, 10 Aug 2005 11:01:56 -0400 Subject: [Linux-cluster] Which APC fence device ? In-Reply-To: <1123684986.3402.14.camel@auh5-0479.corp.jabil.org> References: <42FA0D47.1090308@leopard.us.udel.edu> <1123684986.3402.14.camel@auh5-0479.corp.jabil.org> Message-ID: <42FA16E4.9050101@leopard.us.udel.edu> Eric Kerin wrote: > On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote: > >>There was a message on this list in January to the effect of "does APC >>7900 work with fence_apc?" / "I think so ...". >> >>Has anyone confirmed this? > > > I'm using them in my cluster, they work well with the fence_apc agent. > > The only caveat is that the agent won't work if you use outlet groups > for cluster nodes, without a few changes to the agent. I can give you > my modified version if you want to use outlet groups for cluster nodes. > (A patch will be coming that will work with both port group and non-port > group modes, but I haven't had a chance to write it up yet.) Great, thanks. I'm new to this sort of hardware, not sure exactly what you mean by "outlet groups", could you elaborate? 
-g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From eric at bootseg.com Wed Aug 10 15:16:24 2005 From: eric at bootseg.com (Eric Kerin) Date: Wed, 10 Aug 2005 11:16:24 -0400 Subject: [Linux-cluster] Which APC fence device ? In-Reply-To: <42FA16E4.9050101@leopard.us.udel.edu> References: <42FA0D47.1090308@leopard.us.udel.edu> <1123684986.3402.14.camel@auh5-0479.corp.jabil.org> <42FA16E4.9050101@leopard.us.udel.edu> Message-ID: <1123686984.3402.19.camel@auh5-0479.corp.jabil.org> On Wed, 2005-08-10 at 11:01 -0400, Greg Forte wrote: > Eric Kerin wrote: > > On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote: > > > >>There was a message on this list in January to the effect of "does APC > >>7900 work with fence_apc?" / "I think so ...". > >> > >>Has anyone confirmed this? > > > > > > I'm using them in my cluster, they work well with the fence_apc agent. > > > > The only caveat is that the agent won't work if you use outlet groups > > for cluster nodes, without a few changes to the agent. I can give you > > my modified version if you want to use outlet groups for cluster nodes. > > (A patch will be coming that will work with both port group and non-port > > group modes, but I haven't had a chance to write it up yet.) > > Great, thanks. > > I'm new to this sort of hardware, not sure exactly what you mean by > "outlet groups", could you elaborate? > It's a feature on the APC 7900 that will allow a single command to one 7900 to automatically turn off/on ports on other 7900s via the network. I use it to turn off both power supplies on my cluster nodes at once, while still providing redundant paths for power. Eric From mrmacman_g4 at mac.com Wed Aug 10 15:43:02 2005 From: mrmacman_g4 at mac.com (Kyle Moffett) Date: Wed, 10 Aug 2005 11:43:02 -0400 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810132626.GC4954@null.msp.redhat.com> References: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> <20050810103256.GA6127@infradead.org> <20050810103424.GC4634@marowsky-bree.de> <20050810105450.GA6519@infradead.org> <20050810110259.GE4634@marowsky-bree.de> <20050810110511.GA6728@infradead.org> <20050810110917.GG4634@marowsky-bree.de> <20050810111110.GA6878@infradead.org> <20050810132626.GC4954@null.msp.redhat.com> Message-ID: On Aug 10, 2005, at 09:26:26, AJ Lewis wrote: > On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote: > >> On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote: >> >>> So for every directory hierarchy on a shared filesystem, each >>> user needs >>> to have the complete list of bindmounts needed, and automatically >>> resync >>> that across all nodes when a new one is added or removed? And >>> then have >>> that executed by root, because a regular user can't? >> >> Do it in an initscripts and let users simply not do it, they >> shouldn't >> even know what kind of filesystem they are on. > > I'm just thinking of a 100-node cluster that has different mounts > on different > nodes, and trying to update the bind mounts in a sane and efficient > manner > without clobbering the various mount setups. Ouch. How about something like the following: cpslink() => Create a Context Dependent Symlink readcpslink() => Return the Context Dependent path data readlink() => Return the path of the Context Dependent Symlink as it would be evaluated in the current context, basically as a normal symlink. 
lstat() => Return information on the Context Dependent Symlink in the same format as a regular symlink. unlink() => Delete the Context Dependent Symlink. You would need an extra userspace tool that understands cpslink/ readcpslink to create and get information on the links for now, but ls and ln could eventually be updated, and until then the would provide sane behavior. Perhaps this should be extended into a new API for some of the strange things several filesystems want to do in the VFS: extlink() => Create an extended filesystem link (with type specified) readextlink() => Return the path (and type) for the link The filesystem could define how each type of link acts with respect to other syscalls. OpenAFS could use extlink() instead of their symlink magic for adjusting the AFS volume hierarchy. The new in-kernel AFS client could use it in similar fashion (It has no method to adjust hierarchy, because it's still read-only). GFS could use it for their Context Dependent Symlinks. Since it would pass the type in as well, it would be possible to use it for different kinds of links on the same filesystem. Cheers, Kyle Moffett -- Simple things should be simple and complex things should be possible -- Alan Kay From aspanke at hpce.nec.com Wed Aug 10 17:18:06 2005 From: aspanke at hpce.nec.com (Alexander Spanke) Date: Wed, 10 Aug 2005 19:18:06 +0200 Subject: [Linux-cluster] /bin/login hangs on NFS troubles Message-ID: <42FA36CE.9030100@hpce.nec.com> Hi all, we have a problem with the configuration of a linux cluster based on RH ES3 WS. The cluster contains 230+ nodes; these nodes are sharing /usr/local/bin over NFS due to shared scripts for health checking. Now, if we encounter NFS problems, the login process is hanging non-stop until the NFS connection is back again. After checking all startup scripts and $PATH definitions we found out, that the /bin/login binary itself is pre-defining the $PATH hardwired with /usr/local/bin in the beginning. Unfortunately we cannot change the /usr/local/bin down to the local disk again, but need a solution to be able to login even or especially with NFS problems to start investigation. Do you have any nice idea, how to get rid of this problem ? Changing the binary and rebuild the package is one, but we don't like it because of missing update possibility after it ... Thnx in advance Cheers Alex -- ======================================================= Alexander Spanke System Analyst NEC High Performance Computing Europe GmbH Prinzenallee 11 D-40549 Duesseldorf, Germany Tel: +49 211 5369 146 aspanke at hpce.nec.com Fax: +49 211 5369 199 http://www.hpce.nec.com ======================================================= From gforte at leopard.us.udel.edu Wed Aug 10 17:22:08 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Wed, 10 Aug 2005 13:22:08 -0400 Subject: [Linux-cluster] Which APC fence device ? In-Reply-To: <1123686984.3402.19.camel@auh5-0479.corp.jabil.org> References: <42FA0D47.1090308@leopard.us.udel.edu> <1123684986.3402.14.camel@auh5-0479.corp.jabil.org> <42FA16E4.9050101@leopard.us.udel.edu> <1123686984.3402.19.camel@auh5-0479.corp.jabil.org> Message-ID: <42FA37C0.6000807@leopard.us.udel.edu> Eric Kerin wrote: > On Wed, 2005-08-10 at 11:01 -0400, Greg Forte wrote: > >>Eric Kerin wrote: >> >>>On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote: >>> >>> >>>>There was a message on this list in January to the effect of "does APC >>>>7900 work with fence_apc?" / "I think so ...". >>>> >>>>Has anyone confirmed this? 
>>> >>> >>>I'm using them in my cluster, they work well with the fence_apc agent. >>> >>>The only caveat is that the agent won't work if you use outlet groups >>>for cluster nodes, without a few changes to the agent. I can give you >>>my modified version if you want to use outlet groups for cluster nodes. >>>(A patch will be coming that will work with both port group and non-port >>>group modes, but I haven't had a chance to write it up yet.) >> >>Great, thanks. >> >>I'm new to this sort of hardware, not sure exactly what you mean by >>"outlet groups", could you elaborate? >> > > It's a feature on the APC 7900 that will allow a single command to one > 7900 to automatically turn off/on ports on other 7900s via the network. > I use it to turn off both power supplies on my cluster nodes at once, > while still providing redundant paths for power. Ah, I see. Yes, I will probably be wanting to do the same thing, if you could forward me your patch I'd appreciate it. -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From eric at bootseg.com Wed Aug 10 17:43:32 2005 From: eric at bootseg.com (Eric Kerin) Date: Wed, 10 Aug 2005 13:43:32 -0400 Subject: [Linux-cluster] Which APC fence device ? In-Reply-To: <42FA37C0.6000807@leopard.us.udel.edu> References: <42FA0D47.1090308@leopard.us.udel.edu> <1123684986.3402.14.camel@auh5-0479.corp.jabil.org> <42FA16E4.9050101@leopard.us.udel.edu> <1123686984.3402.19.camel@auh5-0479.corp.jabil.org> <42FA37C0.6000807@leopard.us.udel.edu> Message-ID: <1123695813.3402.31.camel@auh5-0479.corp.jabil.org> On Wed, 2005-08-10 at 13:22 -0400, Greg Forte wrote: > Eric Kerin wrote: > >On Wed, 2005-08-10 at 10:20 -0400, Greg Forte wrote: > >>Great, thanks. > >> > >>I'm new to this sort of hardware, not sure exactly what you mean by > >>"outlet groups", could you elaborate? > >> > > > > It's a feature on the APC 7900 that will allow a single command to one > > 7900 to automatically turn off/on ports on other 7900s via the network. > > I use it to turn off both power supplies on my cluster nodes at once, > > while still providing redundant paths for power. > > Ah, I see. Yes, I will probably be wanting to do the same thing, if you > could forward me your patch I'd appreciate it. I attached the modified fence_apc agent for use with Outlet Groups, just replace fence_apc in /sbin with this one (as long as you're using outlet groups, if you're not you'll have to use the original one that comes with the cluster software.) I'll whip up a proper, non-outlet group compatible, patch when I get some time. When you setup your fence devices, you'll want to use the name of the port, instead of the port number. My config looks like this for the node's fence device config: Or if you don't rename the ports: By default the port names are "Outlet X", but if you change them like I did just use whatever you called the port. Hope this helps Eric -------------- next part -------------- A non-text attachment was scrubbed... Name: fence_apc Type: application/x-perl Size: 10776 bytes Desc: not available URL: From lhh at redhat.com Wed Aug 10 18:41:28 2005 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 10 Aug 2005 14:41:28 -0400 Subject: [Linux-cluster] Test Env. In-Reply-To: References: Message-ID: <1123699288.13564.17.camel@ayanami.boston.redhat.com> On Wed, 2005-08-10 at 10:47 -0400, Gerald G. Gilyeat wrote: > > Another question - What is the -minimal- hardware configuration to > test GFS6.1 (with all the RHEL4 and LVM2 goodness...) 
> 2 machines, a network switch, and a SAN If you don't have a SAN, a third machine acting as a GNBD server. (You even get software-fencing [fence_gnbd] if you do it this way!) -- Lon From amanthei at redhat.com Wed Aug 10 19:08:04 2005 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 10 Aug 2005 14:08:04 -0500 Subject: [Linux-cluster] Which APC fence device ? In-Reply-To: <42FA0D47.1090308@leopard.us.udel.edu> References: <42FA0D47.1090308@leopard.us.udel.edu> Message-ID: <20050810190804.GG17678@redhat.com> On Wed, Aug 10, 2005 at 10:20:55AM -0400, Greg Forte wrote: > There was a message on this list in January to the effect of "does APC > 7900 work with fence_apc?" / "I think so ...". > > Has anyone confirmed this? Yes, the APC 79xx Series is supported. -- Adam Manthei From teigland at redhat.com Thu Aug 11 03:45:14 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 11 Aug 2005 11:45:14 +0800 Subject: [Linux-cluster] problem with rejoining a node In-Reply-To: <20050810114526.GA23149@kinoko.datagrama.net> References: <20050808141238.GA6455@gibson.drslump.org> <42F77BB2.4070705@redhat.com> <20050808215530.GA23695@gibson.drslump.org> <42F8A0A4.5000507@redhat.com> <1123611884.13564.11.camel@ayanami.boston.redhat.com> <20050810114526.GA23149@kinoko.datagrama.net> Message-ID: <20050811034513.GA11132@redhat.com> On Wed, Aug 10, 2005 at 01:45:26PM +0200, Javi Polo wrote: > I've made a script that, prior to starting any of the cluster > infrastructure, enables his SAN port. I'm not sure if this is related to the rest. > I can then join the cluster, but when I try to join the fence, it locks > up there ... : > > gfstest1:~# cman_tool services > Service Name GID LID State Code > Fence Domain: "default" 0 2 join S-1,80,3 > [] it's waiting to join the fence domain, the others won't let him yet... > gfstest1:~# cman_tool nodes > Node Votes Exp Sts Name > 1 1 3 M gfstest1 > 2 1 3 M gfstest2 > 3 1 3 M gfstest3 > > from other nodes, I see it as recovering: > gfstest2:/etc/init.d# cman_tool services > Service Name GID LID State Code > Fence Domain: "default" 1 2 recover 2 - > [2 3] These two appear to be trying to fence gfstest1, but the fencing operation hasn't completed. They won't let anyone join the domain until they finish. You could check /var/log/messages on 2&3 for any fencing messages or errors. Dave From teigland at redhat.com Thu Aug 11 06:06:02 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 11 Aug 2005 14:06:02 +0800 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <1122968724.3247.22.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> Message-ID: <20050811060602.GA12438@redhat.com> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote: > * + if (create) > + down_write(&ip->i_rw_mutex); > + else > + down_read(&ip->i_rw_mutex); > > why do you use a rwsem and not a regular semaphore? You are aware that > rwsems are far more expensive than regular ones right? How skewed is > the read/write ratio? Rough tests show around 4/1, that high or low? 
From arjan at infradead.org Thu Aug 11 06:55:49 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Thu, 11 Aug 2005 08:55:49 +0200 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050811060602.GA12438@redhat.com> References: <20050802071828.GA11217@redhat.com> <1122968724.3247.22.camel@laptopd505.fenrus.org> <20050811060602.GA12438@redhat.com> Message-ID: <1123743349.3201.15.camel@laptopd505.fenrus.org> On Thu, 2005-08-11 at 14:06 +0800, David Teigland wrote: > On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote: > > > * + if (create) > > + down_write(&ip->i_rw_mutex); > > + else > > + down_read(&ip->i_rw_mutex); > > > > why do you use a rwsem and not a regular semaphore? You are aware that > > rwsems are far more expensive than regular ones right? How skewed is > > the read/write ratio? > > Rough tests show around 4/1, that high or low? that's quite borderline; if it was my code I'd not use a rwsem for that ratio (my own rule of thumb, based on not a lot other than gut feeling) is a 10/1 ratio at minimum... but it's not so low that it screams for removing it. However.... it might well make your code a lot simpler so it might still be worth simplifying. From teigland at redhat.com Thu Aug 11 08:17:29 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 11 Aug 2005 16:17:29 +0800 Subject: [Linux-cluster] GFS - updated patches In-Reply-To: <20050802071828.GA11217@redhat.com> References: <20050802071828.GA11217@redhat.com> Message-ID: <20050811081729.GB12438@redhat.com> Thanks for all the review and comments. This is a new set of patches that incorporates the suggestions we've received. http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch http://redhat.com/~teigland/gfs2/20050811/broken-out/ Dave From mikore.li at gmail.com Thu Aug 11 08:21:04 2005 From: mikore.li at gmail.com (Michael) Date: Thu, 11 Aug 2005 16:21:04 +0800 Subject: [Linux-cluster] GFS - updated patches In-Reply-To: <20050811081729.GB12438@redhat.com> References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> Message-ID: I have the same question as I asked before, how can I see GFS in "make menuconfig", after I patch gfs2-full.patch into a 2.6.12.2 kernel? Michael On 8/11/05, David Teigland wrote: > Thanks for all the review and comments. This is a new set of patches that > incorporates the suggestions we've received. > > http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch > http://redhat.com/~teigland/gfs2/20050811/broken-out/ > > Dave > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From arjan at infradead.org Thu Aug 11 08:32:38 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Thu, 11 Aug 2005 10:32:38 +0200 Subject: [Linux-cluster] Re: GFS - updated patches In-Reply-To: <20050811081729.GB12438@redhat.com> References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> Message-ID: <1123749159.3201.19.camel@laptopd505.fenrus.org> On Thu, 2005-08-11 at 16:17 +0800, David Teigland wrote: > Thanks for all the review and comments. This is a new set of patches that > incorporates the suggestions we've received. all of them or only a subset? 
From teigland at redhat.com Thu Aug 11 08:46:45 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 11 Aug 2005 16:46:45 +0800 Subject: [Linux-cluster] Re: GFS - updated patches In-Reply-To: References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> Message-ID: <20050811084645.GD12438@redhat.com> On Thu, Aug 11, 2005 at 04:21:04PM +0800, Michael wrote: > I have the same question as I asked before, how can I see GFS in "make > menuconfig", after I patch gfs2-full.patch into a 2.6.12.2 kernel? You need to select the dlm under drivers. It's in -mm, or apply http://redhat.com/~teigland/dlm.patch From teigland at redhat.com Thu Aug 11 08:50:06 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 11 Aug 2005 16:50:06 +0800 Subject: [Linux-cluster] Re: GFS - updated patches In-Reply-To: <1123749159.3201.19.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> <1123749159.3201.19.camel@laptopd505.fenrus.org> Message-ID: <20050811085006.GA19972@redhat.com> On Thu, Aug 11, 2005 at 10:32:38AM +0200, Arjan van de Ven wrote: > On Thu, 2005-08-11 at 16:17 +0800, David Teigland wrote: > > Thanks for all the review and comments. This is a new set of patches that > > incorporates the suggestions we've received. > > all of them or only a subset? All patches, now 01-13 (what was patch 08 disappeared entirely) From mikore.li at gmail.com Thu Aug 11 08:49:42 2005 From: mikore.li at gmail.com (Michael) Date: Thu, 11 Aug 2005 16:49:42 +0800 Subject: [Linux-cluster] Re: GFS - updated patches In-Reply-To: <20050811084645.GD12438@redhat.com> References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> <20050811084645.GD12438@redhat.com> Message-ID: yes, after apply dlm.patch, I saw it! although I don't know what's "-mm". Thanks, Michael On 8/11/05, David Teigland wrote: > On Thu, Aug 11, 2005 at 04:21:04PM +0800, Michael wrote: > > I have the same question as I asked before, how can I see GFS in "make > > menuconfig", after I patch gfs2-full.patch into a 2.6.12.2 kernel? > > You need to select the dlm under drivers. It's in -mm, or apply > http://redhat.com/~teigland/dlm.patch > > From arjan at infradead.org Thu Aug 11 08:50:32 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Thu, 11 Aug 2005 10:50:32 +0200 Subject: [Linux-cluster] Re: GFS - updated patches In-Reply-To: <20050811085006.GA19972@redhat.com> References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> <1123749159.3201.19.camel@laptopd505.fenrus.org> <20050811085006.GA19972@redhat.com> Message-ID: <1123750232.3201.22.camel@laptopd505.fenrus.org> On Thu, 2005-08-11 at 16:50 +0800, David Teigland wrote: > On Thu, Aug 11, 2005 at 10:32:38AM +0200, Arjan van de Ven wrote: > > On Thu, 2005-08-11 at 16:17 +0800, David Teigland wrote: > > > Thanks for all the review and comments. This is a new set of patches that > > > incorporates the suggestions we've received. > > > > all of them or only a subset? 
> > All patches, now 01-13 (what was patch 08 disappeared entirely) with them I meant the suggestions not the patches ;) From teigland at redhat.com Thu Aug 11 09:16:51 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 11 Aug 2005 17:16:51 +0800 Subject: [Linux-cluster] Re: GFS - updated patches In-Reply-To: <1123750232.3201.22.camel@laptopd505.fenrus.org> References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> <1123749159.3201.19.camel@laptopd505.fenrus.org> <20050811085006.GA19972@redhat.com> <1123750232.3201.22.camel@laptopd505.fenrus.org> Message-ID: <20050811091651.GB19972@redhat.com> On Thu, Aug 11, 2005 at 10:50:32AM +0200, Arjan van de Ven wrote: > > > > Thanks for all the review and comments. This is a new set of > > > > patches that incorporates the suggestions we've received. > > > > > > all of them or only a subset? > > with them I meant the suggestions not the patches ;) The large majority, and I think all that people care about. If we ignored something that someone thinks is important, a reminder would be useful. From mikore.li at gmail.com Thu Aug 11 09:54:33 2005 From: mikore.li at gmail.com (Michael) Date: Thu, 11 Aug 2005 17:54:33 +0800 Subject: [Linux-cluster] GFS - updated patches In-Reply-To: <20050811081729.GB12438@redhat.com> References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> Message-ID: Hi, Dave, I quickly applied gfs2 and dlm patches in kernel 2.6.12.2, it passed compiling but has some warning log, see attachment. maybe helpful to you. Thanks, Michael On 8/11/05, David Teigland wrote: > Thanks for all the review and comments. This is a new set of patches that > incorporates the suggestions we've received. > > http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch > http://redhat.com/~teigland/gfs2/20050811/broken-out/ > > Dave > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: gfs2_and_linux-2.6.12.2.txt URL: From penberg at gmail.com Thu Aug 11 10:00:35 2005 From: penberg at gmail.com (Pekka Enberg) Date: Thu, 11 Aug 2005 13:00:35 +0300 Subject: [Linux-cluster] GFS - updated patches In-Reply-To: References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> Message-ID: <84144f0205081103003cf4fbe7@mail.gmail.com> On 8/11/05, Michael wrote: > Hi, Dave, > > I quickly applied gfs2 and dlm patches in kernel 2.6.12.2, it passed > compiling but has some warning log, see attachment. maybe helpful to > you. kzalloc is not in Linus' tree yet. Try with 2.6.13-rc5-mm1. Pekka From javipolo at datagrama.net Thu Aug 11 10:03:35 2005 From: javipolo at datagrama.net (Javi Polo) Date: Thu, 11 Aug 2005 12:03:35 +0200 Subject: [Linux-cluster] problem with rejoining a node In-Reply-To: <20050811034513.GA11132@redhat.com> References: <20050808141238.GA6455@gibson.drslump.org> <42F77BB2.4070705@redhat.com> <20050808215530.GA23695@gibson.drslump.org> <42F8A0A4.5000507@redhat.com> <1123611884.13564.11.camel@ayanami.boston.redhat.com> <20050810114526.GA23149@kinoko.datagrama.net> <20050811034513.GA11132@redhat.com> Message-ID: <20050811100335.GA15222@kinoko.datagrama.net> On Aug/11/2005, David Teigland wrote: > > I've made a script that, prior to starting any of the cluster > > infrastructure, enables his SAN port. > I'm not sure if this is related to the rest. 
I did it because the san port never turned on, and I thought it could be part of the problem, but I see is not ... > > gfstest1:~# cman_tool services > > Service Name GID LID State Code > > Fence Domain: "default" 0 2 join S-1,80,3 > > [] > it's waiting to join the fence domain, the others won't let him yet... > > Service Name GID LID State Code > > Fence Domain: "default" 1 2 recover 2 - > > [2 3] > These two appear to be trying to fence gfstest1, but the fencing operation > hasn't completed. They won't let anyone join the domain until they > finish. You could check /var/log/messages on 2&3 for any fencing messages > or errors. I tried fence_tool with -D on those, and found the problem .... dont know why, but sometimes the switch sets the port status to "FAULTY" instead of "OFFLINE", and so the fence_IBMswitch failed and so the node wasnt completely fenced .... Now it seems to be working fine! :))) thx a lot Now there's another doubt I have ... when the system rejoins the fence, does the fence_XXX script runs to enable the port switch, or should I do it by other means (ie enabling it on boot and so) :? I though about making a boot script that runs cman_tool services, checks if the host is in the fence, and if so, enable the SAN port and then rescan for SCSI devices ... but I dont know if that's "the right way" to do it, or at least a polite one :? -- Javier Polo @ Datagrama 902 136 126 From penberg at gmail.com Thu Aug 11 10:04:10 2005 From: penberg at gmail.com (Pekka Enberg) Date: Thu, 11 Aug 2005 13:04:10 +0300 Subject: [Linux-cluster] Re: GFS - updated patches In-Reply-To: <20050811091651.GB19972@redhat.com> References: <20050802071828.GA11217@redhat.com> <20050811081729.GB12438@redhat.com> <1123749159.3201.19.camel@laptopd505.fenrus.org> <20050811085006.GA19972@redhat.com> <1123750232.3201.22.camel@laptopd505.fenrus.org> <20050811091651.GB19972@redhat.com> Message-ID: <84144f0205081103043067d36e@mail.gmail.com> Hi, On 8/11/05, David Teigland wrote: > The large majority, and I think all that people care about. If we ignored > something that someone thinks is important, a reminder would be useful. The only remaining issue for me is the vma walk. Thanks, David! Pekka From pcaulfie at redhat.com Thu Aug 11 10:56:26 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 11 Aug 2005 11:56:26 +0100 Subject: [Linux-cluster] Where to go with cman ? In-Reply-To: <42F77AA3.80000@redhat.com> References: <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> <42F77AA3.80000@redhat.com> Message-ID: <42FB2EDA.4010300@redhat.com> For those not reading the commit list the ais-based cman is now in CVS - be careful with it... For the moment it downloads a prepackaged/patched version of the openais source from my people.redhat.com web site. This /will/ change. In fact the only additional patch in there is one Steven posted to the openais mailing list so don't think I'm hiding anything! There's still a lot of work to do on this code but is basically works with a few caveats: 1. Barriers are completely untested and may not work at all. 2. Don't start several nodes up at the same time, they might get the same node ID(!) unless you used static node IDs. 3. The exec path for cmand is hard coded (in the Makefile) to ../daemon/cmand so you must currently always run cman_tool from the dev directory unless you change it. 4. Broadcast is no longer supported. 
If you fail to specify a multicast address, cman_tool will provide one.

5. IPv6 is unsupported, I'm going to start on that next!

6. Error reporting is probably rubbish.

Generally it seems to work. I can certainly get the DLM up with it now.

--
patrick

From Axel.Thimm at ATrpms.net Thu Aug 11 13:31:17 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Thu, 11 Aug 2005 15:31:17 +0200
Subject: [Linux-cluster] exporting gfs quotas over nfs (was: GFS -> NFS, Locking and quotas)
In-Reply-To: <2a5b11a870bfe22794ec127f5e34cf22@valuecommerce.co.jp>
References: <2a5b11a870bfe22794ec127f5e34cf22@valuecommerce.co.jp>
Message-ID: <20050811133117.GC19032@neu.nirvana>

On Tue, Feb 22, 2005 at 06:14:04PM +0900, Nathan Ollerenshaw wrote:
> 1. It should support quotas.
>
> It seems that GFS supports quotas just fine, using some commands on the
> machines mounting the GFS filesystems. While this is fine, ideally I'd
> like to be able to query and set the quota information using
> rpc.rquotad and the associated commands 'edquota' and 'quota'.
>
> Is this possible?

Found this in the archives w/o an answer. What is the best practice to export quotas over NFS? Does rpc.rquotad interact with gfs' quotas? If not, one could think of using the conventional quota utils, but are they cluster-safe?

Thanks.
--
Axel.Thimm at ATrpms.net

From dharma_deep at yahoo.com Thu Aug 11 13:35:25 2005
From: dharma_deep at yahoo.com (dharma deep)
Date: Thu, 11 Aug 2005 06:35:25 -0700 (PDT)
Subject: [Linux-cluster] GFS cluster using UML(User mode linux)
Message-ID: <20050811133525.66333.qmail@web60611.mail.yahoo.com>

Hi,

Is it possible to create a GFS cluster using UML? If so, please point me to the howto doc.

Thanks,
dharmadeep

From mark.fasheh at oracle.com Wed Aug 10 16:26:18 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Wed, 10 Aug 2005 09:26:18 -0700
Subject: [Linux-cluster] Re: GFS
In-Reply-To:
References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> <20050810072121.GA2825@infradead.org>
Message-ID: <20050810162618.GH21228@ca-server1.us.oracle.com>

On Wed, Aug 10, 2005 at 10:31:04AM +0300, Pekka J Enberg wrote:
> It seems to me that the distributed locks must be acquired in ->nopage
> anyway to solve the problem with memcpy() between two mmap'd regions. One
> possible solution would be for the lock manager to detect deadlocks and
> break some locks accordingly. Don't know how well that would mix with
> ->nopage though...

Yeah, my experience with ->nopage so far has indicated that we should avoid erroring out if at all possible, which I believe is what we'd have to do if a deadlock were found. Also, I'm not sure how multiple dlms would coordinate deadlock detection in that case.

This may sound naive, but so far OCFS2 has avoided the need for deadlock detection... I'd hate to have to add it now -- better to try avoiding them in the first place.
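To make the "avoid rather than detect" idea concrete, here is a minimal user-space sketch of avoidance by ordered acquisition -- plain pthreads, not OCFS2, GFS2 or DLM code, and the two-resource setup and names are made up purely for illustration. Every lock gets a stable key and every path acquires in key order, so the classic A->B versus B->A deadlock cannot form:

#include <pthread.h>
#include <stdio.h>

/* Each shared resource gets a mutex plus a stable, globally agreed key
   (in a cluster this might be a lock name or an inode number). */
struct resource {
    unsigned long key;
    pthread_mutex_t lock;
};

/* Always acquire in key order, no matter which resource the caller
   happens to name first, so lock-order inversion is impossible. */
static void lock_pair(struct resource *a, struct resource *b)
{
    struct resource *first = (a->key < b->key) ? a : b;
    struct resource *second = (first == a) ? b : a;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
}

static void unlock_pair(struct resource *a, struct resource *b)
{
    pthread_mutex_unlock(&a->lock);
    pthread_mutex_unlock(&b->lock);
}

static struct resource src = { 1, PTHREAD_MUTEX_INITIALIZER };
static struct resource dst = { 2, PTHREAD_MUTEX_INITIALIZER };

/* One thread "copies" src to dst, the other dst to src; with naive
   lock-them-in-the-order-you-touch-them locking these two can deadlock. */
static void *copy_src_to_dst(void *unused)
{
    lock_pair(&src, &dst);
    /* the memcpy() between the two mmap'd regions would go here */
    unlock_pair(&src, &dst);
    return NULL;
}

static void *copy_dst_to_src(void *unused)
{
    lock_pair(&dst, &src);
    unlock_pair(&dst, &src);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, copy_src_to_dst, NULL);
    pthread_create(&t2, NULL, copy_dst_to_src, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("both copies completed, no deadlock\n");
    return 0;
}

The whole trick is knowing the full set of locks before taking the first one; that is easy for a read()/write() path that can inspect the buffer up front, and hard in ->nopage, where only one page is visible at a time.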
--Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh at oracle.com From penberg at cs.helsinki.fi Wed Aug 10 16:57:43 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Wed, 10 Aug 2005 19:57:43 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050810162618.GH21228@ca-server1.us.oracle.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> <20050810072121.GA2825@infradead.org> <20050810162618.GH21228@ca-server1.us.oracle.com> Message-ID: Hi Mark, Mark Fasheh writes: > This may sound naive, but so far OCFS2 has avoided the nead for deadlock > detection... I'd hate to have to add it now -- better to try avoiding them > in the first place. Surely avoiding them is preferred but how do you do that when you have to mmap'd regions where userspace does memcpy()? The kernel won't much saying in it until ->nopage. We cannot grab all the required locks in proper order here because we don't know what size the buffer is. That's why I think lock sorting won't work of all the cases and thus the problem needs to be taken care of by the dlm. Pekka From mark.fasheh at oracle.com Wed Aug 10 18:21:56 2005 From: mark.fasheh at oracle.com (Mark Fasheh) Date: Wed, 10 Aug 2005 11:21:56 -0700 Subject: [Linux-cluster] Re: GFS In-Reply-To: References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> <20050810072121.GA2825@infradead.org> <20050810162618.GH21228@ca-server1.us.oracle.com> Message-ID: <20050810182155.GI21228@ca-server1.us.oracle.com> On Wed, Aug 10, 2005 at 07:57:43PM +0300, Pekka J Enberg wrote: > Surely avoiding them is preferred but how do you do that when you have to > mmap'd regions where userspace does memcpy()? The kernel won't much saying > in it until ->nopage. We cannot grab all the required locks in proper order > here because we don't know what size the buffer is. That's why I think lock > sorting won't work of all the cases and thus the problem needs to be taken > care of by the dlm. Hmm, well today in OCFS2 if you're not coming from read or write, the lock is held only for the duration of ->nopage so I don't think we could get into any deadlocks for that usage. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh at oracle.com From penberg at cs.helsinki.fi Wed Aug 10 20:18:48 2005 From: penberg at cs.helsinki.fi (Pekka J Enberg) Date: Wed, 10 Aug 2005 23:18:48 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050810182155.GI21228@ca-server1.us.oracle.com> References: <20050802071828.GA11217@redhat.com> <84144f0205080203163cab015c@mail.gmail.com> <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> <20050810072121.GA2825@infradead.org> <20050810162618.GH21228@ca-server1.us.oracle.com> <20050810182155.GI21228@ca-server1.us.oracle.com> Message-ID: Mark Fasheh writes: > Hmm, well today in OCFS2 if you're not coming from read or write, the lock > is held only for the duration of ->nopage so I don't think we could get into > any deadlocks for that usage. Aah, I see GFS2 does that too so no deadlocks here. Thanks. You, however, don't maintain the same level of data consistency when reads and writes are from other filesystems as they use ->nopage. Fixing this requires a generic vma walk in every write() and read(), no? 
That doesn't seem such a hot idea, which brings us back to using ->nopage for taking the locks (but now the deadlocks are back).

Pekka

From mark.fasheh at oracle.com Wed Aug 10 22:07:44 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Wed, 10 Aug 2005 15:07:44 -0700
Subject: [Linux-cluster] Re: GFS
In-Reply-To:
References: <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> <20050810072121.GA2825@infradead.org> <20050810162618.GH21228@ca-server1.us.oracle.com> <20050810182155.GI21228@ca-server1.us.oracle.com>
Message-ID: <20050810220744.GJ21228@ca-server1.us.oracle.com>

On Wed, Aug 10, 2005 at 11:18:48PM +0300, Pekka J Enberg wrote:
> Aah, I see GFS2 does that too so no deadlocks here. Thanks.
Yep, no problem :)

> You, however, don't maintain the same level of data consistency when reads
> and writes are from other filesystems as they use ->nopage.
I'm not sure what you mean here...

> Fixing this requires a generic vma walk in every write() and read(), no?
> That doesn't seem such a hot idea, which brings us back to using ->nopage
> for taking the locks (but now the deadlocks are back).
Yeah, if you look through mmap.c in ocfs2_fill_ctxt_from_buf() we do this... Or am I misunderstanding what you mean?
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com

From penberg at cs.helsinki.fi Thu Aug 11 04:41:17 2005
From: penberg at cs.helsinki.fi (Pekka J Enberg)
Date: Thu, 11 Aug 2005 07:41:17 +0300
Subject: [Linux-cluster] Re: GFS
In-Reply-To: <20050810220744.GJ21228@ca-server1.us.oracle.com>
References: <20050803063644.GD9812@redhat.com> <42F7A557.3000200@zabbo.net> <1123598983.10790.1.camel@haji.ri.fi> <20050810072121.GA2825@infradead.org> <20050810162618.GH21228@ca-server1.us.oracle.com> <20050810182155.GI21228@ca-server1.us.oracle.com> <20050810220744.GJ21228@ca-server1.us.oracle.com>
Message-ID:

Hi,

On Wed, Aug 10, 2005 at 11:18:48PM +0300, Pekka J Enberg wrote:
> > You, however, don't maintain the same level of data consistency when reads
> > and writes are from other filesystems as they use ->nopage.

Mark Fasheh writes:
> I'm not sure what you mean here...

Reading and writing from other filesystems to a GFS2 mmap'd file does not walk the vmas. Therefore, the data consistency guarantees are different:

- A GFS2 filesystem does a read that writes to a GFS2 mmap'd file -> we take all locks for the mmap'd buffer in order and release them after read() is done.

- An ext3 filesystem, for example, does a read that writes to a GFS2 mmap'd file -> we now take locks one page at a time, releasing them before we exit ->nopage(). Other nodes are now free to write to the same GFS2 mmap'd file.

Or am I missing something here?

On Wed, Aug 10, 2005 at 11:18:48PM +0300, Pekka J Enberg wrote:
> > Fixing this requires a generic vma walk in every write() and read(), no?
> > That doesn't seem such a hot idea, which brings us back to using ->nopage
> > for taking the locks (but now the deadlocks are back).

Mark Fasheh writes:
> Yeah, if you look through mmap.c in ocfs2_fill_ctxt_from_buf() we do this...
> Or am I misunderstanding what you mean?

If we are doing write() or read() from some other filesystem, we don't walk the vmas but instead rely on ->nopage for locking, right?
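For anyone trying to picture what "walking the vmas" for a user buffer means, here is a rough user-space analogue -- it parses /proc/self/maps instead of calling find_vma(), and it is only an illustration, not kernel code: given a buffer about to be handed to read() or write(), list every mapping it overlaps, which is the set of files (and hence cluster locks) the I/O can touch.

#include <stdio.h>
#include <sys/mman.h>

/* Print every mapping of this process that overlaps [buf, buf + len).
   The in-kernel equivalent walks the mm's vmas with find_vma(). */
static void show_overlapping_mappings(const void *buf, size_t len)
{
    unsigned long start = (unsigned long)buf;
    unsigned long end = start + len;
    char line[512];
    FILE *f = fopen("/proc/self/maps", "r");

    if (!f)
        return;
    while (fgets(line, sizeof(line), f)) {
        unsigned long lo, hi;

        if (sscanf(line, "%lx-%lx", &lo, &hi) != 2)
            continue;
        if (hi <= start || lo >= end)
            continue;    /* no overlap with the buffer */
        printf("buffer overlaps: %s", line);
    }
    fclose(f);
}

int main(void)
{
    size_t len = 2 * 4096;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return 1;
    show_overlapping_mappings(buf, len);
    munmap(buf, len);
    return 0;
}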
Pekka From penberg at cs.Helsinki.FI Thu Aug 11 07:10:16 2005 From: penberg at cs.Helsinki.FI (Pekka J Enberg) Date: Thu, 11 Aug 2005 10:10:16 +0300 (EEST) Subject: [Linux-cluster] Re: GFS Message-ID: Hi Mark, On Thu, 11 Aug 2005, Pekka J Enberg wrote: > Reading and writing from other filesystems to a GFS2 mmap'd file > does not walk the vmas. Therefore, data consistency guarantees > are different: What I meant was that, if a filesystem requires vma walks, we need to do it VFS level with something like the following patch. With this, your filesystem would implement a_ops->iolock_acquire that sorts the locks and takes them all. In case of GFS2, this would replace walk_vm(). Thoughts? Pekka [PATCH] vfs: iolock This patch introduces iolock which can be used by filesystems that require special locking when accessing an mmap'd region. Unfinished and untested. Signed-off-by: Pekka Enberg --- fs/Makefile | 2 - fs/iolock.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++ fs/read_write.c | 15 ++++++++ include/linux/fs.h | 2 + include/linux/iolock.h | 11 ++++++ 5 files changed, 117 insertions(+), 1 deletion(-) Index: 2.6-mm/fs/iolock.c =================================================================== --- /dev/null +++ 2.6-mm/fs/iolock.c @@ -0,0 +1,88 @@ +/* + * fs/iolock.c + * + * Derived from GFS2. + */ + +#include +#include +#include +#include +#include + +/* + * I/O lock contains all files that participate in locking a memory region. + * It is used for filesystems that require special locks to access mmap'd + * memory. + */ +struct iolock { + struct address_space *mapping; + unsigned long nr_files; + struct file **files; +}; + +struct iolock *iolock_region(const char __user *buf, size_t size) +{ + int err = -ENOMEM; + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + unsigned long start = (unsigned long)buf; + unsigned long end = start + size; + struct iolock *ret; + + ret = kcalloc(1, sizeof(*ret), GFP_KERNEL); + if (!ret) + return ERR_PTR(-ENOMEM); + + down_read(&mm->mmap_sem); + + ret->files = kcalloc(mm->map_count, sizeof(struct file*), GFP_KERNEL); + if (!ret->files) + goto error; + + for (vma = find_vma(mm, start); vma; vma = vma->vm_next) { + struct file *file; + struct address_space *mapping; + + if (end <= vma->vm_start) + break; + + file = vma->vm_file; + if (!file) + continue; + + mapping = file->f_mapping; + if (!mapping->a_ops->iolock_acquire || + !mapping->a_ops->iolock_release) + continue; + + /* FIXME: This only works when one address_space participates + in the iolock. 
*/ + ret->mapping = mapping; + ret->files[ret->nr_files++] = file; + } +out: + up_read(&mm->mmap_sem); + + if (ret->mapping->a_ops->iolock_acquire) { + err = ret->mapping->a_ops->iolock_acquire(ret->files, ret->nr_files); + if (!err) + goto error; + } + + return ret; + +error: + iolock_release(ret); + ret = ERR_PTR(err); + goto out; +} + +void iolock_release(struct iolock *iolock) +{ + struct address_space *mapping = iolock->mapping; + if (mapping && mapping->a_ops->iolock_release) + mapping->a_ops->iolock_release(iolock->files, iolock->nr_files); + kfree(iolock->files); + kfree(iolock); +} Index: 2.6-mm/fs/read_write.c =================================================================== --- 2.6-mm.orig/fs/read_write.c +++ 2.6-mm/fs/read_write.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -247,14 +248,21 @@ ssize_t vfs_read(struct file *file, char if (!ret) { ret = security_file_permission (file, MAY_READ); if (!ret) { + struct iolock * iolock = iolock_region(buf, count); + if (IS_ERR(iolock)) { + ret = PTR_ERR(iolock); + goto out; + } if (file->f_op->read) ret = file->f_op->read(file, buf, count, pos); else ret = do_sync_read(file, buf, count, pos); + iolock_release(iolock); if (ret > 0) { fsnotify_access(file->f_dentry); current->rchar += ret; } + out: current->syscr++; } } @@ -298,14 +306,21 @@ ssize_t vfs_write(struct file *file, con if (!ret) { ret = security_file_permission (file, MAY_WRITE); if (!ret) { + struct iolock * iolock = iolock_region(buf, count); + if (IS_ERR(iolock)) { + ret = PTR_ERR(iolock); + goto out; + } if (file->f_op->write) ret = file->f_op->write(file, buf, count, pos); else ret = do_sync_write(file, buf, count, pos); + iolock_release(iolock); if (ret > 0) { fsnotify_modify(file->f_dentry); current->wchar += ret; } + out: current->syscw++; } } Index: 2.6-mm/include/linux/iolock.h =================================================================== --- /dev/null +++ 2.6-mm/include/linux/iolock.h @@ -0,0 +1,11 @@ +#ifndef __LINUX_IOLOCK_H +#define __LINUX_IOLOCK_H + +#include + +struct iolock; + +struct iolock *iolock_region(const char __user *buf, size_t count); +void iolock_release(struct iolock *lock); + +#endif Index: 2.6-mm/fs/Makefile =================================================================== --- 2.6-mm.orig/fs/Makefile +++ 2.6-mm/fs/Makefile @@ -10,7 +10,7 @@ obj-y := open.o read_write.o file_table. ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \ attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \ seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \ - ioprio.o + ioprio.o iolock.o obj-$(CONFIG_INOTIFY) += inotify.o obj-$(CONFIG_EPOLL) += eventpoll.o Index: 2.6-mm/include/linux/fs.h =================================================================== --- 2.6-mm.orig/include/linux/fs.h +++ 2.6-mm/include/linux/fs.h @@ -334,6 +334,8 @@ struct address_space_operations { loff_t offset, unsigned long nr_segs); struct page* (*get_xip_page)(struct address_space *, sector_t, int); + int (*iolock_acquire)(struct file **, unsigned long); + void (*iolock_release)(struct file **, unsigned long); }; struct backing_dev_info; From zab at zabbo.net Thu Aug 11 16:33:41 2005 From: zab at zabbo.net (Zach Brown) Date: Thu, 11 Aug 2005 09:33:41 -0700 Subject: [Linux-cluster] Re: GFS In-Reply-To: References: Message-ID: <42FB7DE5.2080506@zabbo.net> > What I meant was that, if a filesystem requires vma walks, we need to do > it VFS level with something like the following patch. 
I don't think this patch is the way to go at all. It imposes an allocation and vma walking overhead for the vast majority of IOs that aren't interested. It doesn't look like it will get a consistent ordering when multiple file systems are concerned. It doesn't record the ranges of the mappings involved so Lustre can't properly use its range locks. And finally, it doesn't prohibit mapping operations for the duration of the IO -- the whole reason we ended up in this thread in the first place :) Christoph, would you be interested in looking at a more thorough patch if I threw one together? - z From hch at infradead.org Thu Aug 11 16:35:41 2005 From: hch at infradead.org (Christoph Hellwig) Date: Thu, 11 Aug 2005 17:35:41 +0100 Subject: [Linux-cluster] Re: GFS In-Reply-To: <42FB7DE5.2080506@zabbo.net> References: <42FB7DE5.2080506@zabbo.net> Message-ID: <20050811163541.GA4351@infradead.org> On Thu, Aug 11, 2005 at 09:33:41AM -0700, Zach Brown wrote: > ordering when multiple file systems are concerned. It doesn't record > the ranges of the mappings involved so Lustre can't properly use its > range locks. That doesn't matter. Please don't put in any effort for lustre special cases - they are unwilling to cooperate and they'll get what they deserve. > And finally, it doesn't prohibit mapping operations for > the duration of the IO -- the whole reason we ended up in this thread in > the first place :) > > Christoph, would you be interested in looking at a more thorough patch > if I threw one together? Sure, I'm not sure that'll happen in a timely fashion, though. From zab at zabbo.net Thu Aug 11 16:39:43 2005 From: zab at zabbo.net (Zach Brown) Date: Thu, 11 Aug 2005 09:39:43 -0700 Subject: [Linux-cluster] Re: GFS In-Reply-To: <20050811163541.GA4351@infradead.org> References: <42FB7DE5.2080506@zabbo.net> <20050811163541.GA4351@infradead.org> Message-ID: <42FB7F4F.80507@zabbo.net> > That doesn't matter. Please don't put in any effort for lustre special > cases - they are unwilling to cooperate and they'll get what they deserve. Sure, we can add that extra functional layer in another pass. I thought I'd still bring it up, though, as OCFS2 is slated to care at some point in the not too distant future. > Sure, I'm not sure that'll happen in a timely fashion, though. Roger. - z From penberg at cs.helsinki.fi Thu Aug 11 16:44:50 2005 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 11 Aug 2005 19:44:50 +0300 Subject: [Linux-cluster] Re: GFS In-Reply-To: <42FB7DE5.2080506@zabbo.net> References: <42FB7DE5.2080506@zabbo.net> Message-ID: <1123778691.24181.8.camel@localhost> On Thu, 2005-08-11 at 09:33 -0700, Zach Brown wrote: > I don't think this patch is the way to go at all. It imposes an > allocation and vma walking overhead for the vast majority of IOs that > aren't interested. It doesn't look like it will get a consistent > ordering when multiple file systems are concerned. It doesn't record > the ranges of the mappings involved so Lustre can't properly use its > range locks. And finally, it doesn't prohibit mapping operations for > the duration of the IO -- the whole reason we ended up in this thread in > the first place :) Hmm. So how do you propose we get rid of the mandatory vma walk? I was thinking of making iolock a config option so when you don't have any filesystems that need it, it can go away. I have also optimized the extra allocation away when there are none mmap'd files that require locking. 
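One way to make "it can go away" literal is the usual config-stub pattern: when the option is off, the hooks compile down to inline no-ops and the common I/O path pays nothing. The fragment below is only a sketch of how that could look -- CONFIG_IOLOCK is a made-up Kconfig symbol, and the prototypes simply mirror the include/linux/iolock.h in the updated patch further down; fs/Makefile would then build iolock.o only under that option.

/* Sketch of a config-gated include/linux/iolock.h; CONFIG_IOLOCK is a
   hypothetical symbol, not part of the posted patch. */
struct iolock_chain;

#ifdef CONFIG_IOLOCK
extern struct iolock_chain *iolock_region(const char __user *buf, size_t count);
extern void iolock_release(struct iolock_chain *chain);
#else
static inline struct iolock_chain *iolock_region(const char __user *buf,
                                                 size_t count)
{
    /* NULL already means "empty chain" to the vfs_read/vfs_write callers */
    return NULL;
}

static inline void iolock_release(struct iolock_chain *chain)
{
}
#endif

With something like that in place, the disabled case costs nothing beyond call sites the compiler folds away.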
As for the rest of your comments, I heartly agree with them and hopefully some interested party will take care of them :-). Pekka Index: 2.6-mm/fs/iolock.c =================================================================== --- /dev/null +++ 2.6-mm/fs/iolock.c @@ -0,0 +1,183 @@ +/* + * I/O locking for memory regions. Used by filesystems that need special + * locking for mmap'd files. + */ + +#include +#include +#include +#include +#include + +/* + * TODO: + * + * - Deadlock when two nodes acquire iolocks in reverse order for two + * different filesystems. Solution: use rbtree in iolock_chain so we + * can walk iolocks in order. XXX: what order is stable for two nodes + * that don't know about each other? + */ + +/* + * I/O lock contains all files that participate in locking a memory region + * in an address_space. + */ +struct iolock { + struct address_space *mapping; + unsigned long nr_files; + struct file **files; + struct list_head chain; +}; + +struct iolock_chain { + struct list_head list; +}; + +static struct iolock *iolock_new(unsigned long max_files) +{ + struct iolock *ret = kzalloc(sizeof(*ret), GFP_KERNEL); + if (!ret) + goto out; + ret->files = kcalloc(max_files, sizeof(struct file *), GFP_KERNEL); + if (!ret->files) { + kfree(ret); + ret = NULL; + goto out; + } + INIT_LIST_HEAD(&ret->chain); +out: + return ret; +} + +static struct iolock_chain *iolock_chain_new(void) +{ + struct iolock_chain * ret = kzalloc(sizeof(*ret), GFP_KERNEL); + if (ret) { + INIT_LIST_HEAD(&ret->list); + } + return ret; +} + +static int iolock_chain_acquire(struct iolock_chain *chain) +{ + struct iolock * iolock; + int err = 0; + + list_for_each_entry(iolock, &chain->list, chain) { + if (iolock->mapping->a_ops->iolock_acquire) { + err = iolock->mapping->a_ops->iolock_acquire( + iolock->files, iolock->nr_files); + if (!err) + goto error; + } + } +error: + return err; +} + +static struct iolock *iolock_lookup(struct iolock_chain *chain, + struct address_space *mapping) +{ + struct iolock *ret = NULL; + struct iolock *iolock; + + list_for_each_entry(iolock, &chain->list, chain) { + if (iolock->mapping == mapping) { + ret = iolock; + break; + } + } + return ret; +} + +/** + * iolock_region - Lock memory region for file I/O. + * @buf: the buffer we want to lock. + * @size: size of the buffer. + * + * Returns a pointer to the iolock_chain or NULL to denote an empty chain; + * otherwise returns ERR_PTR(). + */ +struct iolock_chain *iolock_region(const char __user *buf, size_t size) +{ + struct iolock_chain *ret = NULL; + int err = -ENOMEM; + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + unsigned long start = (unsigned long)buf; + unsigned long end = start + size; + int max_files; + + down_read(&mm->mmap_sem); + max_files = mm->map_count; + + for (vma = find_vma(mm, start); vma; vma = vma->vm_next) { + struct file *file; + struct address_space *mapping; + struct iolock *iolock; + + if (end <= vma->vm_start) + break; + + file = vma->vm_file; + if (!file) + continue; + + mapping = file->f_mapping; + if (!mapping->a_ops->iolock_acquire || + !mapping->a_ops->iolock_release) + continue; + + /* Allocate chain lazily to avoid initialization overhead + when we don't have any files that require iolock. 
*/ + if (!ret) { + ret = iolock_chain_new(); + if (!ret) + goto error; + } + + iolock = iolock_lookup(ret, mapping); + if (!iolock) { + iolock = iolock_new(max_files); + if (!iolock) + goto error; + iolock->mapping = mapping; + } + + iolock->files[iolock->nr_files++] = file; + list_add(&iolock->chain, &ret->list); + } + err = iolock_chain_acquire(ret); + if (!err) + goto error; + +out: + up_read(&mm->mmap_sem); + return ret; + +error: + iolock_release(ret); + ret = ERR_PTR(err); + goto out; +} + +/** + * iolock_release - Release file I/O locks for a memory region. + * @chain: The I/O lock chain to release. Passing NULL means no-op. + */ +void iolock_release(struct iolock_chain *chain) +{ + struct iolock *iolock; + + if (!chain) + return; + + list_for_each_entry(iolock, &chain->list, chain) { + struct address_space *mapping = iolock->mapping; + if (mapping && mapping->a_ops->iolock_release) + mapping->a_ops->iolock_release(iolock->files, iolock->nr_files); + kfree(iolock->files); + kfree(iolock); + } + kfree(chain); +} Index: 2.6-mm/fs/read_write.c =================================================================== --- 2.6-mm.orig/fs/read_write.c +++ 2.6-mm/fs/read_write.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -247,14 +248,21 @@ ssize_t vfs_read(struct file *file, char if (!ret) { ret = security_file_permission (file, MAY_READ); if (!ret) { + struct iolock_chain * lock = iolock_region(buf, count); + if (IS_ERR(lock)) { + ret = PTR_ERR(lock); + goto out; + } if (file->f_op->read) ret = file->f_op->read(file, buf, count, pos); else ret = do_sync_read(file, buf, count, pos); + iolock_release(lock); if (ret > 0) { fsnotify_access(file->f_dentry); current->rchar += ret; } + out: current->syscr++; } } @@ -298,14 +306,21 @@ ssize_t vfs_write(struct file *file, con if (!ret) { ret = security_file_permission (file, MAY_WRITE); if (!ret) { + struct iolock_chain * lock = iolock_region(buf, count); + if (IS_ERR(lock)) { + ret = PTR_ERR(lock); + goto out; + } if (file->f_op->write) ret = file->f_op->write(file, buf, count, pos); else ret = do_sync_write(file, buf, count, pos); + iolock_release(lock); if (ret > 0) { fsnotify_modify(file->f_dentry); current->wchar += ret; } + out: current->syscw++; } } Index: 2.6-mm/include/linux/iolock.h =================================================================== --- /dev/null +++ 2.6-mm/include/linux/iolock.h @@ -0,0 +1,11 @@ +#ifndef __LINUX_IOLOCK_H +#define __LINUX_IOLOCK_H + +#include + +struct iolock_chain; + +extern struct iolock_chain *iolock_region(const char __user *, size_t); +extern void iolock_release(struct iolock_chain *); + +#endif Index: 2.6-mm/fs/Makefile =================================================================== --- 2.6-mm.orig/fs/Makefile +++ 2.6-mm/fs/Makefile @@ -10,7 +10,7 @@ obj-y := open.o read_write.o file_table. 
ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \ attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \ seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \ - ioprio.o + ioprio.o iolock.o obj-$(CONFIG_INOTIFY) += inotify.o obj-$(CONFIG_EPOLL) += eventpoll.o Index: 2.6-mm/include/linux/fs.h =================================================================== --- 2.6-mm.orig/include/linux/fs.h +++ 2.6-mm/include/linux/fs.h @@ -334,6 +334,8 @@ struct address_space_operations { loff_t offset, unsigned long nr_segs); struct page* (*get_xip_page)(struct address_space *, sector_t, int); + int (*iolock_acquire)(struct file **, unsigned long); + void (*iolock_release)(struct file **, unsigned long); }; struct backing_dev_info; From lhh at redhat.com Thu Aug 11 16:50:04 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 11 Aug 2005 12:50:04 -0400 Subject: [Linux-cluster] GFS cluster using UML(User mode linux) In-Reply-To: <20050811133525.66333.qmail@web60611.mail.yahoo.com> References: <20050811133525.66333.qmail@web60611.mail.yahoo.com> Message-ID: <1123779004.13564.55.camel@ayanami.boston.redhat.com> On Thu, 2005-08-11 at 06:35 -0700, dharma deep wrote: > Hi, > > Is it possible to create a GFS cluster using UML > ?If > so, please point me to the howto doc. No idea, but there's been success with Xen. -- Lon From Axel.Thimm at ATrpms.net Thu Aug 11 19:14:45 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Thu, 11 Aug 2005 21:14:45 +0200 Subject: [Linux-cluster] filesystem consistency error upon umount Message-ID: <20050811191445.GA6362@neu.nirvana> Hi, this is from an FC4/x86_64 node that forms a cluster with three RHEL4/x86_64 nodes. All of them running latest errata kernels and (vendor packaged) cluster/gfs bits. a) to start with: is it OK to mix FC4 and RHEL4, or did I do something forbidden? b) the cluster wasn't doing anything with the GFS filesystem at that time, i.e. it was just mounted on all 4 nodes, no data was being moved in any direction. c) The other nodes correctly replayed the journal, This node was removed from the cluster w/o fencing and w/o any traces in the logs other than gfs' "about to withdraw from the cluster". I expected cman to report this, too. The other nodes' logs only contained information about the journal acquisition and replay. d) There is a 10 min. delay from the moment of the mysterious filesystem consistency error and a series of Glock messages e) And most importantly why did the gfs issue a filesystem consistency error upon a simple umount? FC4 vs RHEL4 issue? Thanks! Aug 11 19:11:48 zs01 rgmanager: [25900]: Shutting down Cluster Service Manager... Aug 11 19:11:48 zs01 clurgmgrd[3660]: Shutting down Aug 11 19:11:48 zs01 clurgmgrd[3660]: Stopping service homes-cifs Aug 11 19:11:48 zs01 clurgmgrd[3660]: Stopping service backup Aug 11 19:11:48 zs01 clurgmgrd[3660]: Service homes-cifs is stopped Aug 11 19:11:48 zs01 clurgmgrd[3660]: Service backup is stopped Aug 11 19:11:52 zs01 clurgmgrd[3660]: Shutdown complete, exiting Aug 11 19:11:53 zs01 rgmanager: [25900]: Cluster Service Manager is stopped. 
Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: fatal: filesystem consistency error Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: function = trans_go_xmote_bh Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: file = /usr/src/build/588747-x86_64/BUILD/smp/src/gfs/glops.c, line = 542 Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: time = 1123780394 Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: about to withdraw from the cluster Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: waiting for outstanding I/O Aug 11 19:13:14 zs01 kernel: GFS: fsid=physik:data.2: telling LM to withdraw Aug 11 19:13:27 zs01 kernel: lock_dlm: withdraw abandoned memory Aug 11 19:13:27 zs01 kernel: GFS: fsid=physik:data.2: withdrawn Aug 11 19:23:27 zs01 kernel: ror = 0 Aug 11 19:23:27 zs01 kernel: gh_iflags = 2 4 5 Aug 11 19:23:27 zs01 kernel: Glock (5, 8676146) Aug 11 19:23:27 zs01 kernel: gl_flags = 1 Aug 11 19:23:27 zs01 kernel: gl_count = 3 Aug 11 19:23:27 zs01 kernel: gl_state = 3 Aug 11 19:23:27 zs01 kernel: req_gh = yes Aug 11 19:23:27 zs01 kernel: req_bh = yes Aug 11 19:23:27 zs01 kernel: lvb_count = 0 Aug 11 19:23:27 zs01 kernel: object = no Aug 11 19:23:27 zs01 kernel: new_le = no Aug 11 19:23:27 zs01 kernel: incore_le = no Aug 11 19:23:27 zs01 kernel: reclaim = no Aug 11 19:23:27 zs01 kernel: aspace = no Aug 11 19:23:27 zs01 kernel: ail_bufs = no Aug 11 19:23:27 zs01 kernel: Request Aug 11 19:23:27 zs01 kernel: owner = -1 Aug 11 19:23:27 zs01 kernel: gh_state = 0 Aug 11 19:23:27 zs01 kernel: gh_flags = 0 Aug 11 19:23:27 zs01 kernel: error = 0 Aug 11 19:23:27 zs01 kernel: gh_iflags = 2 4 5 Aug 11 19:23:27 zs01 kernel: Waiter2 Aug 11 19:23:27 zs01 kernel: owner = -1 Aug 11 19:23:27 zs01 kernel: gh_state = 0 Aug 11 19:23:27 zs01 kernel: gh_flags = 0 Aug 11 19:23:27 zs01 kernel: error = 0 Aug 11 19:23:27 zs01 kernel: gh_iflags = 2 4 5 Aug 11 19:23:27 zs01 kernel: Glock (5, 7146196) Aug 11 19:23:27 zs01 kernel: gl_flags = 1 Aug 11 19:23:27 zs01 kernel: gl_count = 3 Aug 11 19:23:27 zs01 kernel: gl_state = 3 Aug 11 19:23:27 zs01 kernel: req_gh = yes Aug 11 19:23:27 zs01 kernel: req_bh = yes Aug 11 19:23:27 zs01 kernel: lvb_count = 0 Aug 11 19:23:27 zs01 kernel: object = no Aug 11 19:23:27 zs01 kernel: new_le = no Aug 11 19:23:27 zs01 kernel: incore_le = no Aug 11 19:23:27 zs01 kernel: reclaim = no Aug 11 19:23:27 zs01 kernel: aspace = no Aug 11 19:23:27 zs01 kernel: ail_bufs = no Aug 11 19:23:27 zs01 kernel: Request Aug 11 19:23:27 zs01 kernel: owner = -1 Aug 11 19:23:27 zs01 kernel: gh_state = 0 Aug 11 19:23:27 zs01 kernel: gh_flags = 0 Aug 11 19:23:27 zs01 kernel: error = 0 Aug 11 19:23:27 zs01 kernel: gh_iflags = 2 4 5 Aug 11 19:23:27 zs01 kernel: Waiter2 Aug 11 19:23:27 zs01 kernel: owner = -1 Aug 11 19:23:27 zs01 kernel: gh_state = 0 Aug 11 19:23:27 zs01 kernel: gh_flags = 0 Aug 11 19:23:27 zs01 kernel: error = 0 Aug 11 19:23:27 zs01 kernel: gh_iflags = 2 4 5 Aug 11 19:23:27 zs01 kernel: Glock (5, 190905665) [...] -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From Joel.Becker at oracle.com Fri Aug 12 02:46:56 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Thu, 11 Aug 2005 19:46:56 -0700 Subject: [Linux-cluster] Re: [PATCH 00/14] GFS In-Reply-To: <20050810111110.GA6878@infradead.org> References: <20050809152045.GT29811@parcelfarce.linux.theplanet.co.uk> <20050810070309.GA2415@infradead.org> <20050810103041.GB4634@marowsky-bree.de> <20050810103256.GA6127@infradead.org> <20050810103424.GC4634@marowsky-bree.de> <20050810105450.GA6519@infradead.org> <20050810110259.GE4634@marowsky-bree.de> <20050810110511.GA6728@infradead.org> <20050810110917.GG4634@marowsky-bree.de> <20050810111110.GA6878@infradead.org> Message-ID: <20050812024655.GF5586@ca-server1.us.oracle.com> On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote: > On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote: > > > > So for every directoy hiearchy on a shared filesystem, each user needs > > to have the complete list of bindmounts needed, and automatically resync > > that across all nodes when a new one is added or removed? And then have > > that executed by root, because a regular user can't? > > Do it in an initscripts and let users simply not do it, they shouldn't > even know what kind of filesystem they are on. Christoph, Users know. They want to know. They want to install git on a shared filesystem, and have it Just Work no matter what architecture they're on ({arch} context symlink). I've yet to see a sane way to replace symlinks with bind mounts for anything but the most trivial of usages. 1) You can't make them as non-root 2) It's not stored in the filesystem, so permanence is separate. Other nodes and namespaces don't see them automatically if you want it. These both violate KISS and PoLS. 3) You pollute the output of "mount", and when you have as many bind mounts as you might have symlinks, that's a ton of output you didn't want to see when you were wondering what disks are mounted. 4) When I'm looking at a file, ls -l doesn't tell me what I'm really looking at. With symlinks it does. In some circumstances, that's a good thing. For most symlink-like uses it is not. The two uses (security and "symlink-like") are both valid approaches, and one should not preclude the other. Now, (3) can easily be fixed with an option. (4) can probably be massaged the same way. But (1) and (2) can't be, and that needs fixing before this is even viable to most real users. CDSL, or whatever you call it, exists in most can-be-shared filesystems for a reason. On AFS and DFS, it was @foo. /.../thisdcecell/foo/@sys/bin/git-ls-tree would be /.../thisdcecell/foo/linux-i386/bin/git-ls-tree on my machine. I'd just put the @sys path in my PATH, and never worry whether I was on x86, ppc, or s390. I don't know how GFS/GFS2 do theirs, but OCFS2 copied straight from VMS clustering, where they used it as well. They seem to have set the standard on the topic of clustering. It would be /usr/{arch}/bin/git-ls-tree -> /usr/i386/bin/git-ls-tree or whatever. If you can't do this as a user, it's irrelevant to you. Major installations, where the person installing the application never gets root, would expect it to work easily and nicely. Bind mounts, as of now, do not. Joel -- "Here's something to think about: How come you never see a headline like ``Psychic Wins Lottery''?" 
- Jay Leno Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From thomasie at eyou.com Fri Aug 12 07:23:49 2005 From: thomasie at eyou.com (thomasie) Date: 12 Aug 2005 15:23:49 +0800 Subject: [Linux-cluster] GFS/cman related problems? Message-ID: <323831431.12922@eyou.com> Dear ALL: Hello, I hope I'm sending to the correct list. I'm having problems starting up gfs 6.1, and hopefully it's just something incorrect with my configuration. I used two standard machine . My system is Fedora Core 4. The kernel modules load fine. root # modprobe lock_dlm first node: root # ccsd root # cman_tool join second node: root # ccsd root # cman_tool join when we are going to start both nodes we get that the first node without problems, but the second one never get connected at cman and remain trying to connect with the following error: udp port 6809 unreachable. Then, I run cman_tool with DEBUG defined and get this: first node: sending HELLO second node: sending membership request, but I get udp port 6809 unreachable. My configuration file is any help is much appreciated. Rregards, thomasie --http://www.eyou.com --?????????????????? ???????? ???????? ???????? ????????...???????? --http://vip.eyou.com --????????????VIP???? ?????????????????? --http://sms.eyou.com --??????????????????????...???????????? From pcaulfie at redhat.com Fri Aug 12 08:38:28 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 12 Aug 2005 09:38:28 +0100 Subject: [Linux-cluster] GFS/cman related problems? In-Reply-To: <323831431.12922@eyou.com> References: <323831431.12922@eyou.com> Message-ID: <42FC6004.5080107@redhat.com> thomasie wrote: > Dear ALL: > Hello, I hope I'm sending to the correct list. I'm having problems > starting up gfs 6.1, and hopefully it's just something incorrect with my > configuration. > > I used two standard machine . My system is Fedora Core 4. The kernel > modules load fine. > root # modprobe lock_dlm > > first node: > root # ccsd > root # cman_tool join > > second node: > root # ccsd > root # cman_tool join > > when we are going to start both nodes we get that the first node without > problems, but the second one never get connected at cman and remain trying to > connect with the following error: udp port 6809 unreachable. > Then, I run cman_tool with DEBUG defined and get this: > first node: sending HELLO > second node: sending membership request, but I get udp port 6809 > unreachable. Do you have a firewall configured ? if so either disable it or allow through UDP port 6809. -- patrick From thomasie at eyou.com Tue Aug 16 02:42:42 2005 From: thomasie at eyou.com (thomasie) Date: 16 Aug 2005 10:42:42 +0800 Subject: [Linux-cluster] GFS/cman related problems Message-ID: <324160162.18275@eyou.com> Dear ALL: Hello, I hope I'm sending to the correct list. I'm having problems starting up gfs 6.1, and hopefully it's just something incorrect with my configuration. I used two standard machine . My system is Fedora Core 4. The kernel modules load fine. First of all, I load all the needed kernel modules for the cluster. 
To do this, execute the following commands (on both nodes): root at node1 # modprobe lock_dlm root at node2 # modprobe lock_dlm Next, I start the cluster configuration service daemon on both nodes with root at node1 # ccsd root at node2 # ccsd Having started ccsd I need to create the cluster by starting the cluster manager on both nodes: root at node1 # /sbin/cman_tool join root at node2 # /sbin/cman_tool join when we are going to start both nodes we get that the first node without problems, but the second one never get connected at cman and remain trying to connect with the following error: 10.190.5.174 (node1) udp port 6809 unreachable. I have not start firewall. And I use the command of "netstat -a" , the result is udp 0 0 127.0.0.1:6809 0.0.0.0:* udp 0 0 224.0.0.1:6809 0.0.0.0:* but it receive package is 10.190.6.82:6809 (node2) -> 10.190.5.174 :6809 (node1) Then, I run cman_tool with DEBUG defined and get this: node1: sending HELLO node2: sending membership request, but I get udp port 6809 unreachable. My configuration file (cluster.xml) is any help is much appreciated. Regards, thomasie --http://www.eyou.com --?????????????????? ???????? ???????? ???????? ????????...???????? --http://vip.eyou.com --????????????VIP???? ?????????????????? --http://sms.eyou.com --??????????????????????...???????????? From pcaulfie at redhat.com Tue Aug 16 07:05:35 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 16 Aug 2005 08:05:35 +0100 Subject: [Linux-cluster] GFS/cman related problems In-Reply-To: <324160162.18275@eyou.com> References: <324160162.18275@eyou.com> Message-ID: <4301903F.4090302@redhat.com> thomasie wrote: > Dear ALL: > > Hello, I hope I'm sending to the correct list. I'm having problems > starting up gfs 6.1, and hopefully it's just something incorrect with my > configuration. > > I used two standard machine . My system is Fedora Core 4. The kernel modules > load fine. > > First of all, I load all the needed kernel modules for the cluster. To do > this, execute the following commands (on both nodes): > root at node1 # modprobe lock_dlm > > root at node2 # modprobe lock_dlm > > > Next, I start the cluster configuration service daemon on both nodes with > root at node1 # ccsd > > root at node2 # ccsd > > > Having started ccsd I need to create the cluster by starting the cluster > manager on both nodes: > root at node1 # /sbin/cman_tool join > > root at node2 # /sbin/cman_tool join > > when we are going to start both nodes we get that the first node without > problems, but the second one never get connected at cman and remain trying to > connect with the following error: 10.190.5.174 (node1) udp port 6809 > unreachable. > I have not start firewall. And I use the command of "netstat -a" , > the result is > udp 0 0 127.0.0.1:6809 0.0.0.0:* Oh this old chestnut. Your local host name resolves to 127.0.0.1 rather than to a real IP address. Remove the host name from the 127.0.0.1 line in /etc/hosts. sigh. -- patrick From jpyeron at pdinc.us Tue Aug 16 14:13:50 2005 From: jpyeron at pdinc.us (Jason Pyeron) Date: Tue, 16 Aug 2005 10:13:50 -0400 (EDT) Subject: [Linux-cluster] seeking guidance before opening a bug on cman Message-ID: I was tring to compile a rpm from the srpm generated from cvs STABLE (also tried from HEAD). I got errors on prep stage, and then again on the files stage. I have patched the spec file, and I was wondering if I was wrong or was it a bug? 
:pserver:cvs at sources.redhat.com:/cvs/cluster cluster --- cman/make/cman.spec.in 1 Nov 2004 23:23:18 -0000 1.2 +++ cman/make/cman.spec.in 16 Aug 2005 14:09:55 -0000 @@ -33,7 +33,7 @@ cman - The Cluster Manager %prep -%setup -q +%setup -q -n %{name}-%{version}-%{release} %build ./configure --mandir=%{_mandir} @@ -51,6 +51,13 @@ # Binaries /sbin/cman_tool +/etc/init.d/cman +/usr/include/libcman.h +/usr/lib/libcman.a +/usr/lib/libcman.so +/usr/lib/libcman.so.%{version} +/usr/lib/libcman.so.%{version}.%{release} + %doc %{_mandir} -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Partner & Sr. Manager 7 West 24th Street #100 - - +1 (443) 921-0381 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, purge the message from your system and notify the sender immediately. Any other use of the email by you is prohibited. From jpyeron at pdinc.us Tue Aug 16 14:38:56 2005 From: jpyeron at pdinc.us (Jason Pyeron) Date: Tue, 16 Aug 2005 10:38:56 -0400 (EDT) Subject: [Linux-cluster] seeking guidance before opening a bug on cman In-Reply-To: References: Message-ID: I opened a bug after further googling. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166060 -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Partner & Sr. Manager 7 West 24th Street #100 - - +1 (443) 921-0381 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, purge the message from your system and notify the sender immediately. Any other use of the email by you is prohibited. From cfeist at redhat.com Tue Aug 16 15:29:48 2005 From: cfeist at redhat.com (Chris Feist) Date: Tue, 16 Aug 2005 10:29:48 -0500 Subject: [Linux-cluster] seeking guidance before opening a bug on cman In-Reply-To: References: Message-ID: <4302066C.5010101@redhat.com> Jason, We don't actually use those .spec.in files anymore. I have removed them because they have become outdated. Thanks, Chris Jason Pyeron wrote: > > I was tring to compile a rpm from the srpm generated from cvs STABLE > (also tried from HEAD). > > I got errors on prep stage, and then again on the files stage. > > I have patched the spec file, and I was wondering if I was wrong or was > it a bug? 
> > :pserver:cvs at sources.redhat.com:/cvs/cluster cluster > --- cman/make/cman.spec.in 1 Nov 2004 23:23:18 -0000 1.2 > +++ cman/make/cman.spec.in 16 Aug 2005 14:09:55 -0000 > @@ -33,7 +33,7 @@ > cman - The Cluster Manager > > %prep > -%setup -q > +%setup -q -n %{name}-%{version}-%{release} > > %build > ./configure --mandir=%{_mandir} > @@ -51,6 +51,13 @@ > # Binaries > /sbin/cman_tool > > +/etc/init.d/cman > +/usr/include/libcman.h > +/usr/lib/libcman.a > +/usr/lib/libcman.so > +/usr/lib/libcman.so.%{version} > +/usr/lib/libcman.so.%{version}.%{release} > + > %doc > %{_mandir} > > > From jpyeron at pdinc.us Tue Aug 16 15:36:16 2005 From: jpyeron at pdinc.us (Jason Pyeron) Date: Tue, 16 Aug 2005 11:36:16 -0400 (EDT) Subject: [Linux-cluster] seeking guidance before opening a bug on cman In-Reply-To: <4302066C.5010101@redhat.com> References: <4302066C.5010101@redhat.com> Message-ID: On Tue, 16 Aug 2005, Chris Feist wrote: > We don't actually use those .spec.in files anymore. I have removed them > because they have become outdated. then where are the spec files? -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Partner & Sr. Manager 7 West 24th Street #100 - - +1 (443) 921-0381 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, purge the message from your system and notify the sender immediately. Any other use of the email by you is prohibited. From cfeist at redhat.com Tue Aug 16 15:41:56 2005 From: cfeist at redhat.com (Chris Feist) Date: Tue, 16 Aug 2005 10:41:56 -0500 Subject: [Linux-cluster] seeking guidance before opening a bug on cman In-Reply-To: References: <4302066C.5010101@redhat.com> Message-ID: <43020944.2000302@redhat.com> Jason Pyeron wrote: > On Tue, 16 Aug 2005, Chris Feist wrote: > >> We don't actually use those .spec.in files anymore. I have removed >> them because they have become outdated. > > > then where are the spec files? > They're available in the release srpm files on ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/ Thanks, Chris From jpyeron at pdinc.us Tue Aug 16 15:51:43 2005 From: jpyeron at pdinc.us (Jason Pyeron) Date: Tue, 16 Aug 2005 11:51:43 -0400 (EDT) Subject: [Linux-cluster] seeking guidance before opening a bug on cman In-Reply-To: <43020944.2000302@redhat.com> References: <4302066C.5010101@redhat.com> <43020944.2000302@redhat.com> Message-ID: On Tue, 16 Aug 2005, Chris Feist wrote: > Jason Pyeron wrote: >> On Tue, 16 Aug 2005, Chris Feist wrote: >> >>> We don't actually use those .spec.in files anymore. I have removed them >>> because they have become outdated. >> >> >> then where are the spec files? >> > > They're available in the release srpm files on > ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/ > The spec files inside the srpms seem a bit dated, too. This begs the issue of the fake-build-provides from late July on the list and cman-kernel. Are there more recent versions? -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Partner & Sr. 
Manager 7 West 24th Street #100 - - +1 (443) 921-0381 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, purge the message from your system and notify the sender immediately. Any other use of the email by you is prohibited. From johngw at msi.umn.edu Tue Aug 16 20:49:20 2005 From: johngw at msi.umn.edu (John Griffin-Wiesner) Date: Tue, 16 Aug 2005 15:49:20 -0500 Subject: [Linux-cluster] updating ccs and running lock_gulmd Message-ID: <20050816204920.GA18577@fog.msi.umn.edu> I want to change the heartbeat allowed_misses setting. I extracted the current config with "ccs_tool extract", modified cluster.ccs, then ran "ccs_tool create -O /dev/pool/cca" Does lock_gulmd automagically see changes, or do I have to tell it to look? "service lock_gulmd" offers only these options: start|stop|restart|status|forcestop (I was hoping for a "reload".) This is on a production system so I'd prefer to do this without causing everyone to be fenced off. Thanks for any suggestions. -- John Griffin-Wiesner Linux Cluster/Unix Systems Administrator Univ. MN Supercomputing Institute http://www.msi.umn.edu 612-624-4167 johngw at msi.umn.edu From amanthei at redhat.com Tue Aug 16 20:58:35 2005 From: amanthei at redhat.com (Adam Manthei) Date: Tue, 16 Aug 2005 15:58:35 -0500 Subject: [Linux-cluster] updating ccs and running lock_gulmd In-Reply-To: <20050816204920.GA18577@fog.msi.umn.edu> References: <20050816204920.GA18577@fog.msi.umn.edu> Message-ID: <20050816205835.GA25135@redhat.com> On Tue, Aug 16, 2005 at 03:49:20PM -0500, John Griffin-Wiesner wrote: > I want to change the heartbeat allowed_misses setting. I > extracted the current config with "ccs_tool extract", modified > cluster.ccs, then ran > "ccs_tool create -O /dev/pool/cca" > > Does lock_gulmd automagically see changes, or do I have to tell > it to look? "service lock_gulmd" offers only these options: > start|stop|restart|status|forcestop > (I was hoping for a "reload".) > > This is on a production system so I'd prefer to do this without > causing everyone to be fenced off. > > Thanks for any suggestions. You can not change the heartbeat_rate or allowed_misses for lock_gulmd while the cluster is active. In order to make the changes take effect, you will have to: 1. unmount all GFS in the cluster 2. stop all lock_gulmd in the cluster 3. modify ccs 4. start lock_gulmd 5. mount gfs Note. there is a bug that requires you to have to specify the heartbeat_rate as a floating point number. (see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166009) Changing these values while the cluster is up will prevent new instances of lock_gulmd (for both the clients and servers) from connecting to the active instance of the cluster, but it should not cause the running nodes to be fenced. -- Adam Manthei From brianu at silvercash.com Wed Aug 17 23:55:19 2005 From: brianu at silvercash.com (brianu) Date: Wed, 17 Aug 2005 16:55:19 -0700 Subject: [Linux-cluster] complex gnbd setup example Message-ID: <20050817235255.E21635A8678@mail.silvercash.com> Hello, I am trying to setup a redundant system using gnbd ( linux-iscsi for now fiber if I can figure this out). 
Hardware and softwaredetails HP MSA100 SAN 3 gnbd servers mounting the storage via linux-scsi in a backend lan & exporting to 3 test nodes All nodes import the gnbds, but in the event that one GNBD server dies I want it to not affect the clients. OS Centos 4 - kernel.org kernel 2.6.12 Problems-> 1) multipath doesn't appear to work with GFS 6.1 obtained from CVS Stable for latest kernel.org kernels. 2) LVM doesn't allow duplicate PVs ( the SAN to be created in the Volume group) i.e. [root at dell-1650-31 ~]# vgcreate my_new_gfs /dev/gnbd/l108_sdata /dev/gnbd/l109_sdata Found duplicate PV TbSaubq5Si2S1nrMbuMF620HPb6kFojE: using /dev/gnbd2 not /dev/gnbd0 Found duplicate PV TbSaubq5Si2S1nrMbuMF620HPb6kFojE: using /dev/gnbd2 not /dev/gnbd0 Volume group "my_new_gfs" successfully created [root at dell-1650-31 ~]# [root at dell-1650-31 ~]# vgchange -a y my_new_gfs Found duplicate PV TbSaubq5Si2S1nrMbuMF620HPb6kFojE: using /dev/gnbd2 not /dev/gnbd0 0 logical volume(s) in volume group "my_new_gfs" now active [root at dell-1650-31 ~]# Ideas-> Heartbeat -linux-HA for a dynamic ip in which the nodes will mount via gnbd_import -I vip Thoughts? Can someone suggest a better route, or how they might have accomplished this ? the idea is a failover to another gnbd server should it one go down. I am open to hearing suggestions. Thanks in advance. Brian Urrutia System Administrator Price Communications Inc. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jpyeron at pdinc.us Thu Aug 18 04:01:56 2005 From: jpyeron at pdinc.us (Jason Pyeron) Date: Thu, 18 Aug 2005 00:01:56 -0400 (EDT) Subject: [Linux-cluster] Source RPM debacle Message-ID: I have been working with the CVS versions of the cluster suite and GFS. As well the versions on ftp://ftp.redhat.com/pub/redhat/linux/enterprise/4/en/ I am trying to avoid reinventing the wheel on the spec files. I filed a bug but the spec files in CVS are 'not' the correct spec files. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166060 The spec files from the ftp site are broken. https://www.redhat.com/archives/linux-cluster/2005-July/msg00224.html The bug report indicates that it is fixed but I can't find any indication of it. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=163963 So where does one get the current versions? CVS does not have the spec files. Where are the spec files published? Where are the SRPMs? -Jason Pyeron -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Partner & Sr. Manager 7 West 24th Street #100 - - +1 (443) 921-0381 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, purge the message from your system and notify the sender immediately. Any other use of the email by you is prohibited. From teigland at redhat.com Thu Aug 18 06:10:14 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 18 Aug 2005 14:10:14 +0800 Subject: [Linux-cluster] [PATCH 2/3] dlm: remove file Message-ID: <20050818061014.GB10133@redhat.com> The reduced member_sysfs.c is no longer related to lockspace members. Move what's left into lockspace.c which is the only file that uses the remaining functions. 
Signed-off-by: David Teigland --- Makefile | 1 lockspace.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++++++-- lockspace.h | 1 main.c | 14 +--- member.c | 2 member_sysfs.c | 165 --------------------------------------------------------- member_sysfs.h | 22 ------- 7 files changed, 156 insertions(+), 204 deletions(-) diff -urpN a/drivers/dlm/Makefile b/drivers/dlm/Makefile --- a/drivers/dlm/Makefile 2005-08-18 13:26:02.648375344 +0800 +++ b/drivers/dlm/Makefile 2005-08-18 13:26:25.736865360 +0800 @@ -9,7 +9,6 @@ dlm-y := ast.o \ lowcomms.o \ main.o \ member.o \ - member_sysfs.o \ memory.o \ midcomms.o \ rcom.o \ diff -urpN a/drivers/dlm/lockspace.c b/drivers/dlm/lockspace.c --- a/drivers/dlm/lockspace.c 2005-08-18 13:26:02.651374888 +0800 +++ b/drivers/dlm/lockspace.c 2005-08-18 13:26:25.737865208 +0800 @@ -14,7 +14,6 @@ #include "dlm_internal.h" #include "lockspace.h" #include "member.h" -#include "member_sysfs.h" #include "recoverd.h" #include "ast.h" #include "dir.h" @@ -38,13 +37,159 @@ static spinlock_t lslist_lock; static struct task_struct * scand_task; +static ssize_t dlm_control_store(struct dlm_ls *ls, const char *buf, size_t len) +{ + ssize_t ret = len; + int n = simple_strtol(buf, NULL, 0); + + switch (n) { + case 0: + dlm_ls_stop(ls); + break; + case 1: + dlm_ls_start(ls); + break; + default: + ret = -EINVAL; + } + return ret; +} + +static ssize_t dlm_event_store(struct dlm_ls *ls, const char *buf, size_t len) +{ + ls->ls_uevent_result = simple_strtol(buf, NULL, 0); + set_bit(LSFL_UEVENT_WAIT, &ls->ls_flags); + wake_up(&ls->ls_uevent_wait); + return len; +} + +static ssize_t dlm_id_show(struct dlm_ls *ls, char *buf) +{ + return sprintf(buf, "%u\n", ls->ls_global_id); +} + +static ssize_t dlm_id_store(struct dlm_ls *ls, const char *buf, size_t len) +{ + ls->ls_global_id = simple_strtoul(buf, NULL, 0); + return len; +} + +struct dlm_attr { + struct attribute attr; + ssize_t (*show)(struct dlm_ls *, char *); + ssize_t (*store)(struct dlm_ls *, const char *, size_t); +}; + +static struct dlm_attr dlm_attr_control = { + .attr = {.name = "control", .mode = S_IWUSR}, + .store = dlm_control_store +}; + +static struct dlm_attr dlm_attr_event = { + .attr = {.name = "event_done", .mode = S_IWUSR}, + .store = dlm_event_store +}; + +static struct dlm_attr dlm_attr_id = { + .attr = {.name = "id", .mode = S_IRUGO | S_IWUSR}, + .show = dlm_id_show, + .store = dlm_id_store +}; + +static struct attribute *dlm_attrs[] = { + &dlm_attr_control.attr, + &dlm_attr_event.attr, + &dlm_attr_id.attr, + NULL, +}; + +static ssize_t dlm_attr_show(struct kobject *kobj, struct attribute *attr, + char *buf) +{ + struct dlm_ls *ls = container_of(kobj, struct dlm_ls, ls_kobj); + struct dlm_attr *a = container_of(attr, struct dlm_attr, attr); + return a->show ? a->show(ls, buf) : 0; +} + +static ssize_t dlm_attr_store(struct kobject *kobj, struct attribute *attr, + const char *buf, size_t len) +{ + struct dlm_ls *ls = container_of(kobj, struct dlm_ls, ls_kobj); + struct dlm_attr *a = container_of(attr, struct dlm_attr, attr); + return a->store ? 
a->store(ls, buf, len) : len; +} + +static struct sysfs_ops dlm_attr_ops = { + .show = dlm_attr_show, + .store = dlm_attr_store, +}; + +static struct kobj_type dlm_ktype = { + .default_attrs = dlm_attrs, + .sysfs_ops = &dlm_attr_ops, +}; + +static struct kset dlm_kset = { + .subsys = &kernel_subsys, + .kobj = {.name = "dlm",}, + .ktype = &dlm_ktype, +}; + +static int kobject_setup(struct dlm_ls *ls) +{ + char lsname[DLM_LOCKSPACE_LEN]; + int error; + + memset(lsname, 0, DLM_LOCKSPACE_LEN); + snprintf(lsname, DLM_LOCKSPACE_LEN, "%s", ls->ls_name); + + error = kobject_set_name(&ls->ls_kobj, "%s", lsname); + if (error) + return error; + + ls->ls_kobj.kset = &dlm_kset; + ls->ls_kobj.ktype = &dlm_ktype; + return 0; +} + +static int do_uevent(struct dlm_ls *ls, int in) +{ + int error; + + if (in) + kobject_uevent(&ls->ls_kobj, KOBJ_ONLINE, NULL); + else + kobject_uevent(&ls->ls_kobj, KOBJ_OFFLINE, NULL); + + error = wait_event_interruptible(ls->ls_uevent_wait, + test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags)); + if (error) + goto out; + + error = ls->ls_uevent_result; + out: + return error; +} + + int dlm_lockspace_init(void) { + int error; + ls_count = 0; init_MUTEX(&ls_lock); INIT_LIST_HEAD(&lslist); spin_lock_init(&lslist_lock); - return 0; + + error = kset_register(&dlm_kset); + if (error) + printk("dlm_lockspace_init: cannot register kset %d\n", error); + return error; +} + +void dlm_lockspace_exit(void) +{ + kset_unregister(&dlm_kset); } static int dlm_scand(void *data) @@ -310,7 +455,7 @@ static int new_lockspace(char *name, int dlm_create_debug_file(ls); - error = dlm_kobject_setup(ls); + error = kobject_setup(ls); if (error) goto out_del; @@ -318,7 +463,7 @@ static int new_lockspace(char *name, int if (error) goto out_del; - error = dlm_uevent(ls, 1); + error = do_uevent(ls, 1); if (error) goto out_unreg; @@ -409,7 +554,7 @@ static int release_lockspace(struct dlm_ return -EBUSY; if (force < 3) - dlm_uevent(ls, 0); + do_uevent(ls, 0); dlm_recoverd_stop(ls); diff -urpN a/drivers/dlm/lockspace.h b/drivers/dlm/lockspace.h --- a/drivers/dlm/lockspace.h 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/lockspace.h 2005-08-18 13:26:25.737865208 +0800 @@ -15,6 +15,7 @@ #define __LOCKSPACE_DOT_H__ int dlm_lockspace_init(void); +void dlm_lockspace_exit(void); struct dlm_ls *dlm_find_lockspace_global(uint32_t id); struct dlm_ls *dlm_find_lockspace_local(void *id); struct dlm_ls *dlm_find_lockspace_name(char *name, int namelen); diff -urpN a/drivers/dlm/main.c b/drivers/dlm/main.c --- a/drivers/dlm/main.c 2005-08-18 13:26:02.653374584 +0800 +++ b/drivers/dlm/main.c 2005-08-18 13:26:25.738865056 +0800 @@ -13,9 +13,7 @@ #include "dlm_internal.h" #include "lockspace.h" -#include "member_sysfs.h" #include "lock.h" -#include "device.h" #include "memory.h" #include "lowcomms.h" #include "config.h" @@ -40,13 +38,9 @@ static int __init init_dlm(void) if (error) goto out_mem; - error = dlm_member_sysfs_init(); - if (error) - goto out_mem; - error = dlm_config_init(); if (error) - goto out_member; + goto out_lockspace; error = dlm_register_debugfs(); if (error) @@ -64,8 +58,8 @@ static int __init init_dlm(void) dlm_unregister_debugfs(); out_config: dlm_config_exit(); - out_member: - dlm_member_sysfs_exit(); + out_lockspace: + dlm_lockspace_exit(); out_mem: dlm_memory_exit(); out: @@ -75,9 +69,9 @@ static int __init init_dlm(void) static void __exit exit_dlm(void) { dlm_lowcomms_exit(); - dlm_member_sysfs_exit(); dlm_config_exit(); dlm_memory_exit(); + dlm_lockspace_exit(); dlm_unregister_debugfs(); 
} diff -urpN a/drivers/dlm/member.c b/drivers/dlm/member.c --- a/drivers/dlm/member.c 2005-08-18 13:26:02.654374432 +0800 +++ b/drivers/dlm/member.c 2005-08-18 13:26:25.738865056 +0800 @@ -221,7 +221,7 @@ int dlm_recover_members(struct dlm_ls *l } /* - * Following called from member_sysfs.c + * Following called from lockspace.c */ int dlm_ls_stop(struct dlm_ls *ls) diff -urpN a/drivers/dlm/member_sysfs.c b/drivers/dlm/member_sysfs.c --- a/drivers/dlm/member_sysfs.c 2005-08-18 13:26:02.655374280 +0800 +++ b/drivers/dlm/member_sysfs.c 1970-01-01 07:30:00.000000000 +0730 @@ -1,165 +0,0 @@ -/****************************************************************************** -******************************************************************************* -** -** Copyright (C) 2005 Red Hat, Inc. All rights reserved. -** -** This copyrighted material is made available to anyone wishing to use, -** modify, copy, or redistribute it subject to the terms and conditions -** of the GNU General Public License v.2. -** -******************************************************************************* -******************************************************************************/ - -#include "dlm_internal.h" -#include "member.h" - - -static ssize_t dlm_control_store(struct dlm_ls *ls, const char *buf, size_t len) -{ - ssize_t ret = len; - int n = simple_strtol(buf, NULL, 0); - - switch (n) { - case 0: - dlm_ls_stop(ls); - break; - case 1: - dlm_ls_start(ls); - break; - default: - ret = -EINVAL; - } - return ret; -} - -static ssize_t dlm_event_store(struct dlm_ls *ls, const char *buf, size_t len) -{ - ls->ls_uevent_result = simple_strtol(buf, NULL, 0); - set_bit(LSFL_UEVENT_WAIT, &ls->ls_flags); - wake_up(&ls->ls_uevent_wait); - return len; -} - -static ssize_t dlm_id_show(struct dlm_ls *ls, char *buf) -{ - return sprintf(buf, "%u\n", ls->ls_global_id); -} - -static ssize_t dlm_id_store(struct dlm_ls *ls, const char *buf, size_t len) -{ - ls->ls_global_id = simple_strtoul(buf, NULL, 0); - return len; -} - -struct dlm_attr { - struct attribute attr; - ssize_t (*show)(struct dlm_ls *, char *); - ssize_t (*store)(struct dlm_ls *, const char *, size_t); -}; - -static struct dlm_attr dlm_attr_control = { - .attr = {.name = "control", .mode = S_IWUSR}, - .store = dlm_control_store -}; - -static struct dlm_attr dlm_attr_event = { - .attr = {.name = "event_done", .mode = S_IWUSR}, - .store = dlm_event_store -}; - -static struct dlm_attr dlm_attr_id = { - .attr = {.name = "id", .mode = S_IRUGO | S_IWUSR}, - .show = dlm_id_show, - .store = dlm_id_store -}; - -static struct attribute *dlm_attrs[] = { - &dlm_attr_control.attr, - &dlm_attr_event.attr, - &dlm_attr_id.attr, - NULL, -}; - -static ssize_t dlm_attr_show(struct kobject *kobj, struct attribute *attr, - char *buf) -{ - struct dlm_ls *ls = container_of(kobj, struct dlm_ls, ls_kobj); - struct dlm_attr *a = container_of(attr, struct dlm_attr, attr); - return a->show ? a->show(ls, buf) : 0; -} - -static ssize_t dlm_attr_store(struct kobject *kobj, struct attribute *attr, - const char *buf, size_t len) -{ - struct dlm_ls *ls = container_of(kobj, struct dlm_ls, ls_kobj); - struct dlm_attr *a = container_of(attr, struct dlm_attr, attr); - return a->store ? 
a->store(ls, buf, len) : len; -} - -static struct sysfs_ops dlm_attr_ops = { - .show = dlm_attr_show, - .store = dlm_attr_store, -}; - -static struct kobj_type dlm_ktype = { - .default_attrs = dlm_attrs, - .sysfs_ops = &dlm_attr_ops, -}; - -static struct kset dlm_kset = { - .subsys = &kernel_subsys, - .kobj = {.name = "dlm",}, - .ktype = &dlm_ktype, -}; - -int dlm_member_sysfs_init(void) -{ - int error; - - error = kset_register(&dlm_kset); - if (error) - printk("dlm_lockspace_init: cannot register kset %d\n", error); - return error; -} - -void dlm_member_sysfs_exit(void) -{ - kset_unregister(&dlm_kset); -} - -int dlm_kobject_setup(struct dlm_ls *ls) -{ - char lsname[DLM_LOCKSPACE_LEN]; - int error; - - memset(lsname, 0, DLM_LOCKSPACE_LEN); - snprintf(lsname, DLM_LOCKSPACE_LEN, "%s", ls->ls_name); - - error = kobject_set_name(&ls->ls_kobj, "%s", lsname); - if (error) - return error; - - ls->ls_kobj.kset = &dlm_kset; - ls->ls_kobj.ktype = &dlm_ktype; - return 0; -} - -int dlm_uevent(struct dlm_ls *ls, int in) -{ - int error; - - if (in) - kobject_uevent(&ls->ls_kobj, KOBJ_ONLINE, NULL); - else - kobject_uevent(&ls->ls_kobj, KOBJ_OFFLINE, NULL); - - error = wait_event_interruptible(ls->ls_uevent_wait, - test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags)); - if (error) - goto out; - - error = ls->ls_uevent_result; - out: - return error; -} - diff -urpN a/drivers/dlm/member_sysfs.h b/drivers/dlm/member_sysfs.h --- a/drivers/dlm/member_sysfs.h 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/member_sysfs.h 1970-01-01 07:30:00.000000000 +0730 @@ -1,22 +0,0 @@ -/****************************************************************************** -******************************************************************************* -** -** Copyright (C) 2005 Red Hat, Inc. All rights reserved. -** -** This copyrighted material is made available to anyone wishing to use, -** modify, copy, or redistribute it subject to the terms and conditions -** of the GNU General Public License v.2. -** -******************************************************************************* -******************************************************************************/ - -#ifndef __MEMBER_SYSFS_DOT_H__ -#define __MEMBER_SYSFS_DOT_H__ - -int dlm_member_sysfs_init(void); -void dlm_member_sysfs_exit(void); -int dlm_kobject_setup(struct dlm_ls *ls); -int dlm_uevent(struct dlm_ls *ls, int in); - -#endif /* __MEMBER_SYSFS_DOT_H__ */ - From teigland at redhat.com Thu Aug 18 06:11:17 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 18 Aug 2005 14:11:17 +0800 Subject: [Linux-cluster] [PATCH 3/3] dlm: use jhash Message-ID: <20050818061117.GC10133@redhat.com> Use linux/jhash.h instead of our own hash function. 
Signed-off-by: David Teigland --- dir.c | 2 +- dlm_internal.h | 1 + lock.c | 2 +- util.c | 34 ---------------------------------- util.h | 2 -- 5 files changed, 3 insertions(+), 38 deletions(-) diff -urpN a/drivers/dlm/dir.c b/drivers/dlm/dir.c --- a/drivers/dlm/dir.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/dir.c 2005-08-18 13:47:29.112803024 +0800 @@ -120,7 +120,7 @@ static inline uint32_t dir_hash(struct d { uint32_t val; - val = dlm_hash(name, len); + val = jhash(name, len, 0); val &= (ls->ls_dirtbl_size - 1); return val; diff -urpN a/drivers/dlm/dlm_internal.h b/drivers/dlm/dlm_internal.h --- a/drivers/dlm/dlm_internal.h 2005-08-18 13:26:02.651374888 +0800 +++ b/drivers/dlm/dlm_internal.h 2005-08-18 13:47:29.112803024 +0800 @@ -34,6 +34,7 @@ #include #include #include +#include #include #include diff -urpN a/drivers/dlm/lock.c b/drivers/dlm/lock.c --- a/drivers/dlm/lock.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/lock.c 2005-08-18 13:47:29.114802720 +0800 @@ -369,7 +369,7 @@ static int find_rsb(struct dlm_ls *ls, c if (dlm_no_directory(ls)) flags |= R_CREATE; - hash = dlm_hash(name, namelen); + hash = jhash(name, namelen, 0); bucket = hash & (ls->ls_rsbtbl_size - 1); error = search_rsb(ls, name, namelen, bucket, flags, &r); diff -urpN a/drivers/dlm/util.c b/drivers/dlm/util.c --- a/drivers/dlm/util.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/util.c 2005-08-18 13:47:29.115802568 +0800 @@ -13,40 +13,6 @@ #include "dlm_internal.h" #include "rcom.h" -/** - * dlm_hash - hash an array of data - * @data: the data to be hashed - * @len: the length of data to be hashed - * - * Copied from GFS which copied from... - * - * Take some data and convert it to a 32-bit hash. - * This is the 32-bit FNV-1a hash from: - * http://www.isthe.com/chongo/tech/comp/fnv/ - */ - -static inline uint32_t hash_more_internal(const void *data, unsigned int len, - uint32_t hash) -{ - unsigned char *p = (unsigned char *)data; - unsigned char *e = p + len; - uint32_t h = hash; - - while (p < e) { - h ^= (uint32_t)(*p++); - h *= 0x01000193; - } - - return h; -} - -uint32_t dlm_hash(const void *data, int len) -{ - uint32_t h = 0x811C9DC5; - h = hash_more_internal(data, len, h); - return h; -} - static void header_out(struct dlm_header *hd) { hd->h_version = cpu_to_le32(hd->h_version); diff -urpN a/drivers/dlm/util.h b/drivers/dlm/util.h --- a/drivers/dlm/util.h 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/util.h 2005-08-18 13:47:29.115802568 +0800 @@ -13,8 +13,6 @@ #ifndef __UTIL_DOT_H__ #define __UTIL_DOT_H__ -uint32_t dlm_hash(const char *data, int len); - void dlm_message_out(struct dlm_message *ms); void dlm_message_in(struct dlm_message *ms); void dlm_rcom_out(struct dlm_rcom *rc); From akpm at osdl.org Thu Aug 18 06:22:18 2005 From: akpm at osdl.org (Andrew Morton) Date: Wed, 17 Aug 2005 23:22:18 -0700 Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs In-Reply-To: <20050818060750.GA10133@redhat.com> References: <20050818060750.GA10133@redhat.com> Message-ID: <20050817232218.56a06fd6.akpm@osdl.org> David Teigland wrote: > > Use configfs to configure lockspace members and node addresses. This was > previously done with sysfs and ioctl. Fair enough. This really means that the configfs patch should be split out of the ocfs2 megapatch... 
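Before the configfs patch itself, a short sketch of how the tree described in the next message might be populated from userspace may help. Everything here is illustrative: the cluster name "alpha", the lockspace name "testls" and the node ids are made up, the /config mount point follows the comment in the patch, and in practice a cluster management daemon would perform these steps rather than an administrator typing them by hand.

mount -t configfs none /config

# one directory per cluster; its "spaces" and "comms" groups appear automatically
mkdir /config/dlm/alpha

# describe the members of lockspace "testls"
mkdir /config/dlm/alpha/spaces/testls
mkdir /config/dlm/alpha/spaces/testls/nodes/1
echo 1 > /config/dlm/alpha/spaces/testls/nodes/1/nodeid
echo 1 > /config/dlm/alpha/spaces/testls/nodes/1/weight

# describe how to reach node 1, and mark it as the local node
mkdir /config/dlm/alpha/comms/1
echo 1 > /config/dlm/alpha/comms/1/nodeid
echo 1 > /config/dlm/alpha/comms/1/local
# the comms addr file expects a raw struct sockaddr_storage of exactly
# sizeof(struct sockaddr_storage) bytes, so it is written by the daemon,
# not echoed by hand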
From teigland at redhat.com Thu Aug 18 06:07:50 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 18 Aug 2005 14:07:50 +0800 Subject: [Linux-cluster] [PATCH 1/3] dlm: use configfs Message-ID: <20050818060750.GA10133@redhat.com> Use configfs to configure lockspace members and node addresses. This was previously done with sysfs and ioctl. Signed-off-by: David Teigland --- drivers/dlm/Makefile | 1 drivers/dlm/config.c | 759 ++++++++++++++++++++++++++++++++++++++++++++- drivers/dlm/config.h | 12 drivers/dlm/dlm_internal.h | 2 drivers/dlm/lockspace.c | 7 drivers/dlm/lowcomms.c | 195 +---------- drivers/dlm/lowcomms.h | 4 drivers/dlm/main.c | 18 - drivers/dlm/member.c | 40 +- drivers/dlm/member_sysfs.c | 76 ---- drivers/dlm/node_ioctl.c | 126 ------- drivers/dlm/requestqueue.c | 2 include/linux/dlm_node.h | 44 -- 13 files changed, 828 insertions(+), 458 deletions(-) diff -urpN a/drivers/dlm/Makefile b/drivers/dlm/Makefile --- a/drivers/dlm/Makefile 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/Makefile 2005-08-18 13:22:00.718154328 +0800 @@ -12,7 +12,6 @@ dlm-y := ast.o \ member_sysfs.o \ memory.o \ midcomms.o \ - node_ioctl.o \ rcom.o \ recover.o \ recoverd.o \ diff -urpN a/drivers/dlm/config.c b/drivers/dlm/config.c --- a/drivers/dlm/config.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/config.c 2005-08-18 13:22:00.719154176 +0800 @@ -11,9 +11,756 @@ ******************************************************************************* ******************************************************************************/ -#include "dlm_internal.h" +#include +#include +#include +#include + #include "config.h" +/* + * /config/dlm//spaces//nodes//nodeid + * /config/dlm//spaces//nodes//weight + * /config/dlm//comms//nodeid + * /config/dlm//comms//local + * /config/dlm//comms//addr + * The level is useless, but I haven't figured out how to avoid it. 
+ */ + +static struct config_group *space_list; +static struct config_group *comm_list; +static struct comm *local_comm; + +struct clusters; +struct cluster; +struct spaces; +struct space; +struct comms; +struct comm; +struct nodes; +struct node; + +static struct config_group *make_cluster(struct config_group *, const char *); +static void drop_cluster(struct config_group *, struct config_item *); +static void release_cluster(struct config_item *); +static struct config_group *make_space(struct config_group *, const char *); +static void drop_space(struct config_group *, struct config_item *); +static void release_space(struct config_item *); +static struct config_item *make_comm(struct config_group *, const char *); +static void drop_comm(struct config_group *, struct config_item *); +static void release_comm(struct config_item *); +static struct config_item *make_node(struct config_group *, const char *); +static void drop_node(struct config_group *, struct config_item *); +static void release_node(struct config_item *); + +static ssize_t show_comm(struct config_item *i, struct configfs_attribute *a, + char *buf); +static ssize_t store_comm(struct config_item *i, struct configfs_attribute *a, + const char *buf, size_t len); +static ssize_t show_node(struct config_item *i, struct configfs_attribute *a, + char *buf); +static ssize_t store_node(struct config_item *i, struct configfs_attribute *a, + const char *buf, size_t len); + +static ssize_t comm_nodeid_read(struct comm *cm, char *buf); +static ssize_t comm_nodeid_write(struct comm *cm, const char *buf, size_t len); +static ssize_t comm_local_read(struct comm *cm, char *buf); +static ssize_t comm_local_write(struct comm *cm, const char *buf, size_t len); +static ssize_t comm_addr_write(struct comm *cm, const char *buf, size_t len); +static ssize_t node_nodeid_read(struct node *nd, char *buf); +static ssize_t node_nodeid_write(struct node *nd, const char *buf, size_t len); +static ssize_t node_weight_read(struct node *nd, char *buf); +static ssize_t node_weight_write(struct node *nd, const char *buf, size_t len); + +enum { + COMM_ATTR_NODEID = 0, + COMM_ATTR_LOCAL, + COMM_ATTR_ADDR, +}; + +struct comm_attribute { + struct configfs_attribute attr; + ssize_t (*show)(struct comm *, char *); + ssize_t (*store)(struct comm *, const char *, size_t); +}; + +static struct comm_attribute comm_attr_nodeid = { + .attr = { .ca_owner = THIS_MODULE, + .ca_name = "nodeid", + .ca_mode = S_IRUGO | S_IWUSR }, + .show = comm_nodeid_read, + .store = comm_nodeid_write, +}; + +static struct comm_attribute comm_attr_local = { + .attr = { .ca_owner = THIS_MODULE, + .ca_name = "local", + .ca_mode = S_IRUGO | S_IWUSR }, + .show = comm_local_read, + .store = comm_local_write, +}; + +static struct comm_attribute comm_attr_addr = { + .attr = { .ca_owner = THIS_MODULE, + .ca_name = "addr", + .ca_mode = S_IRUGO | S_IWUSR }, + .store = comm_addr_write, +}; + +static struct configfs_attribute *comm_attrs[] = { + [COMM_ATTR_NODEID] = &comm_attr_nodeid.attr, + [COMM_ATTR_LOCAL] = &comm_attr_local.attr, + [COMM_ATTR_ADDR] = &comm_attr_addr.attr, + NULL, +}; + +enum { + NODE_ATTR_NODEID = 0, + NODE_ATTR_WEIGHT, +}; + +struct node_attribute { + struct configfs_attribute attr; + ssize_t (*show)(struct node *, char *); + ssize_t (*store)(struct node *, const char *, size_t); +}; + +static struct node_attribute node_attr_nodeid = { + .attr = { .ca_owner = THIS_MODULE, + .ca_name = "nodeid", + .ca_mode = S_IRUGO | S_IWUSR }, + .show = node_nodeid_read, + .store = 
node_nodeid_write, +}; + +static struct node_attribute node_attr_weight = { + .attr = { .ca_owner = THIS_MODULE, + .ca_name = "weight", + .ca_mode = S_IRUGO | S_IWUSR }, + .show = node_weight_read, + .store = node_weight_write, +}; + +static struct configfs_attribute *node_attrs[] = { + [NODE_ATTR_NODEID] = &node_attr_nodeid.attr, + [NODE_ATTR_WEIGHT] = &node_attr_weight.attr, + NULL, +}; + +struct clusters { + struct configfs_subsystem subsys; +}; + +struct cluster { + struct config_group group; +}; + +struct spaces { + struct config_group ss_group; +}; + +struct space { + struct config_group group; + struct list_head members; + struct semaphore members_lock; + int members_count; +}; + +struct comms { + struct config_group cs_group; +}; + +struct comm { + struct config_item item; + int nodeid; + int local; + int addr_count; + struct sockaddr_storage *addr[DLM_MAX_ADDR_COUNT]; +}; + +struct nodes { + struct config_group ns_group; +}; + +struct node { + struct config_item item; + struct list_head list; /* space->members */ + int nodeid; + int weight; +}; + +static struct configfs_group_operations clusters_ops = { + .make_group = make_cluster, + .drop_item = drop_cluster, +}; + +static struct configfs_item_operations cluster_ops = { + .release = release_cluster, +}; + +static struct configfs_group_operations spaces_ops = { + .make_group = make_space, + .drop_item = drop_space, +}; + +static struct configfs_item_operations space_ops = { + .release = release_space, +}; + +static struct configfs_group_operations comms_ops = { + .make_item = make_comm, + .drop_item = drop_comm, +}; + +static struct configfs_item_operations comm_ops = { + .release = release_comm, + .show_attribute = show_comm, + .store_attribute = store_comm, +}; + +static struct configfs_group_operations nodes_ops = { + .make_item = make_node, + .drop_item = drop_node, +}; + +static struct configfs_item_operations node_ops = { + .release = release_node, + .show_attribute = show_node, + .store_attribute = store_node, +}; + +static struct config_item_type clusters_type = { + .ct_group_ops = &clusters_ops, + .ct_owner = THIS_MODULE, +}; + +static struct config_item_type cluster_type = { + .ct_item_ops = &cluster_ops, + .ct_owner = THIS_MODULE, +}; + +static struct config_item_type spaces_type = { + .ct_group_ops = &spaces_ops, + .ct_owner = THIS_MODULE, +}; + +static struct config_item_type space_type = { + .ct_item_ops = &space_ops, + .ct_owner = THIS_MODULE, +}; + +static struct config_item_type comms_type = { + .ct_group_ops = &comms_ops, + .ct_owner = THIS_MODULE, +}; + +static struct config_item_type comm_type = { + .ct_item_ops = &comm_ops, + .ct_attrs = comm_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct config_item_type nodes_type = { + .ct_group_ops = &nodes_ops, + .ct_owner = THIS_MODULE, +}; + +static struct config_item_type node_type = { + .ct_item_ops = &node_ops, + .ct_attrs = node_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct cluster *to_cluster(struct config_item *i) +{ + return i ? container_of(to_config_group(i), struct cluster, group):NULL; +} + +static struct space *to_space(struct config_item *i) +{ + return i ? container_of(to_config_group(i), struct space, group) : NULL; +} + +static struct comm *to_comm(struct config_item *i) +{ + return i ? container_of(i, struct comm, item) : NULL; +} + +static struct node *to_node(struct config_item *i) +{ + return i ? 
container_of(i, struct node, item) : NULL; +} + +static struct config_group *make_cluster(struct config_group *g, + const char *name) +{ + struct cluster *cl = NULL; + struct spaces *sps = NULL; + struct comms *cms = NULL; + void *gps = NULL; + + cl = kzalloc(sizeof(struct cluster), GFP_KERNEL); + gps = kcalloc(3, sizeof(struct config_group *), GFP_KERNEL); + sps = kzalloc(sizeof(struct spaces), GFP_KERNEL); + cms = kzalloc(sizeof(struct comms), GFP_KERNEL); + + if (!cl || !gps || !sps || !cms) + goto fail; + + config_group_init_type_name(&cl->group, name, &cluster_type); + config_group_init_type_name(&sps->ss_group, "spaces", &spaces_type); + config_group_init_type_name(&cms->cs_group, "comms", &comms_type); + + cl->group.default_groups = gps; + cl->group.default_groups[0] = &sps->ss_group; + cl->group.default_groups[1] = &cms->cs_group; + cl->group.default_groups[2] = NULL; + + space_list = &sps->ss_group; + comm_list = &cms->cs_group; + return &cl->group; + + fail: + kfree(cl); + kfree(gps); + kfree(sps); + kfree(cms); + return NULL; +} + +static void drop_cluster(struct config_group *g, struct config_item *i) +{ + struct cluster *cl = to_cluster(i); + struct config_item *tmp; + int j; + + for (j = 0; cl->group.default_groups[j]; j++) { + tmp = &cl->group.default_groups[j]->cg_item; + cl->group.default_groups[j] = NULL; + config_item_put(tmp); + } + + space_list = NULL; + comm_list = NULL; + + config_item_put(i); +} + +static void release_cluster(struct config_item *i) +{ + struct cluster *cl = to_cluster(i); + kfree(cl->group.default_groups); + kfree(cl); +} + +static struct config_group *make_space(struct config_group *g, const char *name) +{ + struct space *sp = NULL; + struct nodes *nds = NULL; + void *gps = NULL; + + sp = kzalloc(sizeof(struct space), GFP_KERNEL); + gps = kcalloc(2, sizeof(struct config_group *), GFP_KERNEL); + nds = kzalloc(sizeof(struct nodes), GFP_KERNEL); + + if (!sp || !gps || !nds) + goto fail; + + config_group_init_type_name(&sp->group, name, &space_type); + config_group_init_type_name(&nds->ns_group, "nodes", &nodes_type); + + sp->group.default_groups = gps; + sp->group.default_groups[0] = &nds->ns_group; + sp->group.default_groups[1] = NULL; + + INIT_LIST_HEAD(&sp->members); + init_MUTEX(&sp->members_lock); + sp->members_count = 0; + return &sp->group; + + fail: + kfree(sp); + kfree(gps); + kfree(nds); + return NULL; +} + +static void drop_space(struct config_group *g, struct config_item *i) +{ + struct space *sp = to_space(i); + struct config_item *tmp; + int j; + + /* assert list_empty(&sp->members) */ + + for (j = 0; sp->group.default_groups[j]; j++) { + tmp = &sp->group.default_groups[j]->cg_item; + sp->group.default_groups[j] = NULL; + config_item_put(tmp); + } + + config_item_put(i); +} + +static void release_space(struct config_item *i) +{ + struct space *sp = to_space(i); + kfree(sp->group.default_groups); + kfree(sp); +} + +static struct config_item *make_comm(struct config_group *g, const char *name) +{ + struct comm *cm; + + cm = kzalloc(sizeof(struct comm), GFP_KERNEL); + if (!cm) + return NULL; + + config_item_init_type_name(&cm->item, name, &comm_type); + cm->nodeid = -1; + cm->local = 0; + cm->addr_count = 0; + return &cm->item; +} + +static void drop_comm(struct config_group *g, struct config_item *i) +{ + struct comm *cm = to_comm(i); + if (local_comm == cm) + local_comm = NULL; + while (cm->addr_count--) + kfree(cm->addr[cm->addr_count]); + config_item_put(i); +} + +static void release_comm(struct config_item *i) +{ + struct comm *cm = 
to_comm(i); + kfree(cm); +} + +static struct config_item *make_node(struct config_group *g, const char *name) +{ + struct space *sp = to_space(g->cg_item.ci_parent); + struct node *nd; + + nd = kzalloc(sizeof(struct node), GFP_KERNEL); + if (!nd) + return NULL; + + config_item_init_type_name(&nd->item, name, &node_type); + nd->nodeid = -1; + nd->weight = 1; /* default weight of 1 if none is set */ + + down(&sp->members_lock); + list_add(&nd->list, &sp->members); + sp->members_count++; + up(&sp->members_lock); + + return &nd->item; +} + +static void drop_node(struct config_group *g, struct config_item *i) +{ + struct space *sp = to_space(g->cg_item.ci_parent); + struct node *nd = to_node(i); + + down(&sp->members_lock); + list_del(&nd->list); + sp->members_count--; + up(&sp->members_lock); + + config_item_put(i); +} + +static void release_node(struct config_item *i) +{ + struct node *nd = to_node(i); + kfree(nd); +} + +static struct clusters clusters_root = { + .subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "dlm", + .ci_type = &clusters_type, + }, + }, + }, +}; + +int dlm_config_init(void) +{ + config_group_init(&clusters_root.subsys.su_group); + init_MUTEX(&clusters_root.subsys.su_sem); + return configfs_register_subsystem(&clusters_root.subsys); +} + +void dlm_config_exit(void) +{ + configfs_unregister_subsystem(&clusters_root.subsys); +} + +/* + * Functions for user space to read/write attributes + */ + +static ssize_t show_comm(struct config_item *i, struct configfs_attribute *a, + char *buf) +{ + struct comm *cm = to_comm(i); + struct comm_attribute *cma = + container_of(a, struct comm_attribute, attr); + return cma->show ? cma->show(cm, buf) : 0; +} + +static ssize_t store_comm(struct config_item *i, struct configfs_attribute *a, + const char *buf, size_t len) +{ + struct comm *cm = to_comm(i); + struct comm_attribute *cma = + container_of(a, struct comm_attribute, attr); + return cma->store ? cma->store(cm, buf, len) : -EINVAL; +} + +static ssize_t comm_nodeid_read(struct comm *cm, char *buf) +{ + return sprintf(buf, "%d\n", cm->nodeid); +} + +static ssize_t comm_nodeid_write(struct comm *cm, const char *buf, size_t len) +{ + cm->nodeid = simple_strtol(buf, NULL, 0); + return len; +} + +static ssize_t comm_local_read(struct comm *cm, char *buf) +{ + return sprintf(buf, "%d\n", cm->local); +} + +static ssize_t comm_local_write(struct comm *cm, const char *buf, size_t len) +{ + cm->local= simple_strtol(buf, NULL, 0); + if (cm->local && !local_comm) + local_comm = cm; + return len; +} + +static ssize_t comm_addr_write(struct comm *cm, const char *buf, size_t len) +{ + struct sockaddr_storage *addr; + + if (len != sizeof(struct sockaddr_storage)) + return -EINVAL; + + if (cm->addr_count >= DLM_MAX_ADDR_COUNT) + return -ENOSPC; + + addr = kzalloc(sizeof(*addr), GFP_KERNEL); + if (!addr) + return -ENOMEM; + + memcpy(addr, buf, len); + cm->addr[cm->addr_count++] = addr; + return len; +} + +static ssize_t show_node(struct config_item *i, struct configfs_attribute *a, + char *buf) +{ + struct node *nd = to_node(i); + struct node_attribute *nda = + container_of(a, struct node_attribute, attr); + return nda->show ? nda->show(nd, buf) : 0; +} + +static ssize_t store_node(struct config_item *i, struct configfs_attribute *a, + const char *buf, size_t len) +{ + struct node *nd = to_node(i); + struct node_attribute *nda = + container_of(a, struct node_attribute, attr); + return nda->store ? 
nda->store(nd, buf, len) : -EINVAL; +} + +static ssize_t node_nodeid_read(struct node *nd, char *buf) +{ + return sprintf(buf, "%d\n", nd->nodeid); +} + +static ssize_t node_nodeid_write(struct node *nd, const char *buf, size_t len) +{ + nd->nodeid = simple_strtol(buf, NULL, 0); + return len; +} + +static ssize_t node_weight_read(struct node *nd, char *buf) +{ + return sprintf(buf, "%d\n", nd->weight); +} + +static ssize_t node_weight_write(struct node *nd, const char *buf, size_t len) +{ + nd->weight = simple_strtol(buf, NULL, 0); + return len; +} + +/* + * Functions for the dlm to get the info that's been configured + */ + +static struct space *get_space(char *name) +{ + if (!space_list) + return NULL; + return to_space(config_group_find_obj(space_list, name)); +} + +static void put_space(struct space *sp) +{ + config_item_put(&sp->group.cg_item); +} + +static struct comm *get_comm(int nodeid, struct sockaddr_storage *addr) +{ + struct config_item *i; + struct comm *cm; + int found = 0; + + if (!comm_list) + return NULL; + + list_for_each_entry(i, &comm_list->cg_children, ci_entry) { + cm = to_comm(i); + + if (nodeid) { + if (cm->nodeid != nodeid) + continue; + found = 1; + break; + } else { + if (!cm->addr_count || + memcmp(cm->addr[0], addr, sizeof(*addr))) + continue; + found = 1; + break; + } + } + + if (found) + config_item_get(i); + else + cm = NULL; + return cm; +} + +static void put_comm(struct comm *cm) +{ + config_item_put(&cm->item); +} + +/* caller must free mem */ +int dlm_nodeid_list(char *lsname, int **ids_out) +{ + struct space *sp; + struct node *nd; + int i = 0, rv = 0; + int *ids; + + sp = get_space(lsname); + if (!sp) + return -EEXIST; + + down(&sp->members_lock); + if (!sp->members_count) { + rv = 0; + goto out; + } + + ids = kcalloc(sp->members_count, sizeof(int), GFP_KERNEL); + if (!ids) { + rv = -ENOMEM; + goto out; + } + + rv = sp->members_count; + list_for_each_entry(nd, &sp->members, list) + ids[i++] = nd->nodeid; + + if (rv != i) + printk("bad nodeid count %d %d\n", rv, i); + + *ids_out = ids; + out: + up(&sp->members_lock); + put_space(sp); + return rv; +} + +int dlm_node_weight(char *lsname, int nodeid) +{ + struct space *sp; + struct node *nd; + int w = -EEXIST; + + sp = get_space(lsname); + if (!sp) + goto out; + + down(&sp->members_lock); + list_for_each_entry(nd, &sp->members, list) { + if (nd->nodeid != nodeid) + continue; + w = nd->weight; + break; + } + up(&sp->members_lock); + put_space(sp); + out: + return w; +} + +int dlm_nodeid_to_addr(int nodeid, struct sockaddr_storage *addr) +{ + struct comm *cm = get_comm(nodeid, NULL); + if (!cm) + return -EEXIST; + if (!cm->addr_count) + return -ENOENT; + memcpy(addr, cm->addr[0], sizeof(*addr)); + put_comm(cm); + return 0; +} + +int dlm_addr_to_nodeid(struct sockaddr_storage *addr, int *nodeid) +{ + struct comm *cm = get_comm(0, addr); + if (!cm) + return -EEXIST; + *nodeid = cm->nodeid; + put_comm(cm); + return 0; +} + +int dlm_our_nodeid(void) +{ + return local_comm ? 
local_comm->nodeid : 0; +} + +/* num 0 is first addr, num 1 is second addr */ +int dlm_our_addr(struct sockaddr_storage *addr, int num) +{ + if (!local_comm) + return -1; + if (num + 1 > local_comm->addr_count) + return -1; + memcpy(addr, local_comm->addr[num], sizeof(*addr)); + return 0; +} + /* Config file defaults */ #define DEFAULT_TCP_PORT 21064 #define DEFAULT_BUFFER_SIZE 4096 @@ -35,13 +782,3 @@ struct dlm_config_info dlm_config = { .scan_secs = DEFAULT_SCAN_SECS }; -int dlm_config_init(void) -{ - /* FIXME: hook the config values into sysfs */ - return 0; -} - -void dlm_config_exit(void) -{ -} - diff -urpN a/drivers/dlm/config.h b/drivers/dlm/config.h --- a/drivers/dlm/config.h 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/config.h 2005-08-18 13:22:00.720154024 +0800 @@ -14,6 +14,8 @@ #ifndef __CONFIG_DOT_H__ #define __CONFIG_DOT_H__ +#define DLM_MAX_ADDR_COUNT 3 + struct dlm_config_info { int tcp_port; int buffer_size; @@ -27,8 +29,14 @@ struct dlm_config_info { extern struct dlm_config_info dlm_config; -extern int dlm_config_init(void); -extern void dlm_config_exit(void); +int dlm_config_init(void); +void dlm_config_exit(void); +int dlm_node_weight(char *lsname, int nodeid); +int dlm_nodeid_list(char *lsname, int **ids_out); +int dlm_nodeid_to_addr(int nodeid, struct sockaddr_storage *addr); +int dlm_addr_to_nodeid(struct sockaddr_storage *addr, int *nodeid); +int dlm_our_nodeid(void); +int dlm_our_addr(struct sockaddr_storage *addr, int num); #endif /* __CONFIG_DOT_H__ */ diff -urpN a/drivers/dlm/dlm_internal.h b/drivers/dlm/dlm_internal.h --- a/drivers/dlm/dlm_internal.h 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/dlm_internal.h 2005-08-18 13:22:00.720154024 +0800 @@ -457,8 +457,6 @@ struct dlm_ls { int ls_low_nodeid; int ls_total_weight; int *ls_node_array; - int *ls_nodeids_next; - int ls_nodeids_next_count; struct dlm_rsb ls_stub_rsb; /* for returning errors */ struct dlm_lkb ls_stub_lkb; /* for returning errors */ diff -urpN a/drivers/dlm/lockspace.c b/drivers/dlm/lockspace.c --- a/drivers/dlm/lockspace.c 2005-08-18 12:14:09.000000000 +0800 +++ b/drivers/dlm/lockspace.c 2005-08-18 13:22:00.721153872 +0800 @@ -94,6 +94,11 @@ static struct dlm_ls *find_lockspace_nam return ls; } +struct dlm_ls *dlm_find_lockspace_name(char *name, int namelen) +{ + return find_lockspace_name(name, namelen); +} + struct dlm_ls *dlm_find_lockspace_global(uint32_t id) { struct dlm_ls *ls; @@ -261,8 +266,6 @@ static int new_lockspace(char *name, int ls->ls_low_nodeid = 0; ls->ls_total_weight = 0; ls->ls_node_array = NULL; - ls->ls_nodeids_next = NULL; - ls->ls_nodeids_next_count = 0; memset(&ls->ls_stub_rsb, 0, sizeof(struct dlm_rsb)); ls->ls_stub_rsb.res_ls = ls; diff -urpN a/drivers/dlm/lowcomms.c b/drivers/dlm/lowcomms.c --- a/drivers/dlm/lowcomms.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/lowcomms.c 2005-08-18 13:22:00.722153720 +0800 @@ -50,29 +50,14 @@ #include #include -#include - #include "dlm_internal.h" #include "lowcomms.h" #include "config.h" -#include "member.h" #include "midcomms.h" static struct sockaddr_storage *local_addr[DLM_MAX_ADDR_COUNT]; -static int local_nodeid; -static int local_weight; static int local_count; -static struct list_head nodes; -static struct semaphore nodes_sem; - -/* One of these per configured node */ - -struct dlm_node { - struct list_head list; - int nodeid; - int weight; - struct sockaddr_storage addr; -}; +static int local_nodeid; /* One of these per connected node */ @@ -163,89 +148,24 @@ static atomic_t accepting; static 
struct connection sctp_con; -static struct dlm_node *search_node(int nodeid) -{ - struct dlm_node *node; - - list_for_each_entry(node, &nodes, list) { - if (node->nodeid == nodeid) - goto out; - } - node = NULL; - out: - return node; -} - -static struct dlm_node *search_node_addr(struct sockaddr_storage *addr) -{ - struct dlm_node *node; - - list_for_each_entry(node, &nodes, list) { - if (!memcmp(&node->addr, addr, sizeof(*addr))) - goto out; - } - node = NULL; - out: - return node; -} - -static int _get_node(int nodeid, struct dlm_node **node_ret) -{ - struct dlm_node *node; - int error = 0; - - node = search_node(nodeid); - if (node) - goto out; - - node = kmalloc(sizeof(struct dlm_node), GFP_KERNEL); - if (!node) { - error = -ENOMEM; - goto out; - } - memset(node, 0, sizeof(struct dlm_node)); - node->nodeid = nodeid; - list_add_tail(&node->list, &nodes); - out: - *node_ret = node; - return error; -} - -static int addr_to_nodeid(struct sockaddr_storage *addr, int *nodeid) -{ - struct dlm_node *node; - - down(&nodes_sem); - node = search_node_addr(addr); - up(&nodes_sem); - if (!node) - return -1; - *nodeid = node->nodeid; - return 0; -} - static int nodeid_to_addr(int nodeid, struct sockaddr *retaddr) { - struct dlm_node *node; - struct sockaddr_storage *addr; + struct sockaddr_storage addr; + int error; if (!local_count) return -1; - down(&nodes_sem); - node = search_node(nodeid); - up(&nodes_sem); - if (!node) - return -1; - - addr = &node->addr; + error = dlm_nodeid_to_addr(nodeid, &addr); + if (error) + return error; if (local_addr[0]->ss_family == AF_INET) { - struct sockaddr_in *in4 = (struct sockaddr_in *) addr; + struct sockaddr_in *in4 = (struct sockaddr_in *) &addr; struct sockaddr_in *ret4 = (struct sockaddr_in *) retaddr; ret4->sin_addr.s_addr = in4->sin_addr.s_addr; } else { - struct sockaddr_in6 *in6 = (struct sockaddr_in6 *) addr; + struct sockaddr_in6 *in6 = (struct sockaddr_in6 *) &addr; struct sockaddr_in6 *ret6 = (struct sockaddr_in6 *) retaddr; memcpy(&ret6->sin6_addr, &in6->sin6_addr, sizeof(in6->sin6_addr)); @@ -254,67 +174,6 @@ static int nodeid_to_addr(int nodeid, st return 0; } -int dlm_node_weight(int nodeid) -{ - struct dlm_node *node; - int weight = -1; - - down(&nodes_sem); - node = search_node(nodeid); - if (node) - weight = node->weight; - up(&nodes_sem); - return weight; -} - -int dlm_set_node(int nodeid, int weight, char *addr_buf) -{ - struct dlm_node *node; - int error; - - down(&nodes_sem); - error = _get_node(nodeid, &node); - if (!error) { - memcpy(&node->addr, addr_buf, sizeof(struct sockaddr_storage)); - node->weight = weight; - } - up(&nodes_sem); - return error; -} - -int dlm_set_local(int nodeid, int weight, char *addr_buf) -{ - struct sockaddr_storage *addr; - int i; - - if (local_count > DLM_MAX_ADDR_COUNT - 1) { - log_print("too many local addresses set %d", local_count); - return -EINVAL; - } - local_nodeid = nodeid; - local_weight = weight; - - addr = kmalloc(sizeof(*addr), GFP_KERNEL); - if (!addr) - return -ENOMEM; - memcpy(addr, addr_buf, sizeof(*addr)); - - for (i = 0; i < local_count; i++) { - if (!memcmp(local_addr[i], addr, sizeof(*addr))) { - kfree(addr); - goto out; - } - } - local_addr[local_count++] = addr; - out: - return 0; -} - -int dlm_our_nodeid(void) -{ - return local_nodeid; -} - static struct nodeinfo *nodeid2nodeinfo(int nodeid, int alloc) { struct nodeinfo *ni; @@ -556,7 +415,7 @@ static void process_sctp_notification(st return; } make_sockaddr(&prim.ssp_addr, 0, &addr_len); - if (addr_to_nodeid(&prim.ssp_addr, 
&nodeid)) { + if (dlm_addr_to_nodeid(&prim.ssp_addr, &nodeid)) { log_print("reject connect from unknown addr"); send_shutdown(prim.ssp_assoc_id); return; @@ -772,6 +631,24 @@ static int add_bind_addr(struct sockaddr return result; } +static void init_local(void) +{ + struct sockaddr_storage sas, *addr; + int i; + + local_nodeid = dlm_our_nodeid(); + + for (i = 0; i < DLM_MAX_ADDR_COUNT - 1; i++) { + if (dlm_our_addr(&sas, i)) + break; + + addr = kmalloc(sizeof(*addr), GFP_KERNEL); + if (!addr) + break; + memcpy(addr, &sas, sizeof(*addr)); + local_addr[local_count++] = addr; + } +} /* Initialise SCTP socket and bind to all interfaces */ static int init_sock(void) @@ -783,8 +660,11 @@ static int init_sock(void) int result = -EINVAL, num = 1, i, addr_len; if (!local_count) { - log_print("no local IP address has been set"); - goto out; + init_local(); + if (!local_count) { + log_print("no local IP address has been set"); + goto out; + } } result = sock_create_kern(local_addr[0]->ss_family, SOCK_SEQPACKET, @@ -1323,25 +1203,16 @@ void dlm_lowcomms_stop(void) int dlm_lowcomms_init(void) { init_waitqueue_head(&lowcomms_recv_wait); - INIT_LIST_HEAD(&nodes); - init_MUTEX(&nodes_sem); return 0; } void dlm_lowcomms_exit(void) { - struct dlm_node *node, *safe; int i; for (i = 0; i < local_count; i++) kfree(local_addr[i]); - local_nodeid = 0; - local_weight = 0; local_count = 0; - - list_for_each_entry_safe(node, safe, &nodes, list) { - list_del(&node->list); - kfree(node); - } + local_nodeid = 0; } diff -urpN a/drivers/dlm/lowcomms.h b/drivers/dlm/lowcomms.h --- a/drivers/dlm/lowcomms.h 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/lowcomms.h 2005-08-18 13:22:00.722153720 +0800 @@ -20,10 +20,6 @@ int dlm_lowcomms_start(void); void dlm_lowcomms_stop(void); void *dlm_lowcomms_get_buffer(int nodeid, int len, int allocation, char **ppc); void dlm_lowcomms_commit_buffer(void *mh); -int dlm_set_node(int nodeid, int weight, char *addr_buf); -int dlm_set_local(int nodeid, int weight, char *addr_buf); -int dlm_our_nodeid(void); -int dlm_node_weight(int nodeid); #endif /* __LOWCOMMS_DOT_H__ */ diff -urpN a/drivers/dlm/main.c b/drivers/dlm/main.c --- a/drivers/dlm/main.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/main.c 2005-08-18 13:22:00.723153568 +0800 @@ -18,6 +18,7 @@ #include "device.h" #include "memory.h" #include "lowcomms.h" +#include "config.h" #ifdef CONFIG_DLM_DEBUG int dlm_register_debugfs(void); @@ -27,9 +28,6 @@ static inline int dlm_register_debugfs(v static inline void dlm_unregister_debugfs(void) { } #endif -int dlm_node_ioctl_init(void); -void dlm_node_ioctl_exit(void); - static int __init init_dlm(void) { int error; @@ -42,17 +40,17 @@ static int __init init_dlm(void) if (error) goto out_mem; - error = dlm_node_ioctl_init(); + error = dlm_member_sysfs_init(); if (error) goto out_mem; - error = dlm_member_sysfs_init(); + error = dlm_config_init(); if (error) - goto out_node; + goto out_member; error = dlm_register_debugfs(); if (error) - goto out_member; + goto out_config; error = dlm_lowcomms_init(); if (error) @@ -64,10 +62,10 @@ static int __init init_dlm(void) out_debug: dlm_unregister_debugfs(); + out_config: + dlm_config_exit(); out_member: dlm_member_sysfs_exit(); - out_node: - dlm_node_ioctl_exit(); out_mem: dlm_memory_exit(); out: @@ -78,7 +76,7 @@ static void __exit exit_dlm(void) { dlm_lowcomms_exit(); dlm_member_sysfs_exit(); - dlm_node_ioctl_exit(); + dlm_config_exit(); dlm_memory_exit(); dlm_unregister_debugfs(); } diff -urpN a/drivers/dlm/member.c 
b/drivers/dlm/member.c --- a/drivers/dlm/member.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/member.c 2005-08-18 13:22:00.724153416 +0800 @@ -11,13 +11,13 @@ ******************************************************************************/ #include "dlm_internal.h" -#include "member_sysfs.h" #include "lockspace.h" #include "member.h" #include "recoverd.h" #include "recover.h" #include "lowcomms.h" #include "rcom.h" +#include "config.h" /* * Following called by dlm_recoverd thread @@ -50,13 +50,18 @@ static void add_ordered_member(struct dl static int dlm_add_member(struct dlm_ls *ls, int nodeid) { struct dlm_member *memb; + int w; memb = kmalloc(sizeof(struct dlm_member), GFP_KERNEL); if (!memb) return -ENOMEM; + w = dlm_node_weight(ls->ls_name, nodeid); + if (w < 0) + return w; + memb->nodeid = nodeid; - memb->weight = dlm_node_weight(nodeid); + memb->weight = w; add_ordered_member(ls, memb); ls->ls_num_nodes++; return 0; @@ -262,14 +267,19 @@ int dlm_ls_stop(struct dlm_ls *ls) int dlm_ls_start(struct dlm_ls *ls) { - struct dlm_recover *rv, *rv_old; - int error = 0; + struct dlm_recover *rv = NULL, *rv_old; + int *ids = NULL; + int error, count; rv = kmalloc(sizeof(struct dlm_recover), GFP_KERNEL); if (!rv) return -ENOMEM; memset(rv, 0, sizeof(struct dlm_recover)); + error = count = dlm_nodeid_list(ls->ls_name, &ids); + if (error <= 0) + goto fail; + spin_lock(&ls->ls_recover_lock); /* the lockspace needs to be stopped before it can be started */ @@ -277,22 +287,12 @@ int dlm_ls_start(struct dlm_ls *ls) if (!dlm_locking_stopped(ls)) { spin_unlock(&ls->ls_recover_lock); log_error(ls, "start ignored: lockspace running"); - kfree(rv); - error = -EINVAL; - goto out; - } - - if (!ls->ls_nodeids_next) { - spin_unlock(&ls->ls_recover_lock); - log_error(ls, "start ignored: existing nodeids_next"); - kfree(rv); error = -EINVAL; - goto out; + goto fail; } - rv->nodeids = ls->ls_nodeids_next; - ls->ls_nodeids_next = NULL; - rv->node_count = ls->ls_nodeids_next_count; + rv->nodeids = ids; + rv->node_count = count; rv->seq = ++ls->ls_recover_seq; rv_old = ls->ls_recover_args; ls->ls_recover_args = rv; @@ -304,7 +304,11 @@ int dlm_ls_start(struct dlm_ls *ls) } dlm_recoverd_kick(ls); - out: + return 0; + + fail: + kfree(rv); + kfree(ids); return error; } diff -urpN a/drivers/dlm/member_sysfs.c b/drivers/dlm/member_sysfs.c --- a/drivers/dlm/member_sysfs.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/member_sysfs.c 2005-08-18 13:22:00.724153416 +0800 @@ -47,77 +47,10 @@ static ssize_t dlm_id_show(struct dlm_ls static ssize_t dlm_id_store(struct dlm_ls *ls, const char *buf, size_t len) { - ls->ls_global_id = simple_strtol(buf, NULL, 0); + ls->ls_global_id = simple_strtoul(buf, NULL, 0); return len; } -static ssize_t dlm_members_show(struct dlm_ls *ls, char *buf) -{ - struct dlm_member *memb; - ssize_t ret = 0; - - if (!down_read_trylock(&ls->ls_in_recovery)) - return -EBUSY; - list_for_each_entry(memb, &ls->ls_nodes, list) - ret += sprintf(buf+ret, "%u ", memb->nodeid); - ret += sprintf(buf+ret, "\n"); - up_read(&ls->ls_in_recovery); - return ret; -} - -static ssize_t dlm_members_store(struct dlm_ls *ls, const char *buf, size_t len) -{ - int *nodeids, id, count = 1, i; - ssize_t ret = len; - char *p, *t; - - /* count number of id's in buf, assumes no trailing spaces */ - for (i = 0; i < len; i++) - if (isspace(buf[i])) - count++; - - nodeids = kmalloc(sizeof(int) * count, GFP_KERNEL); - if (!nodeids) - return -ENOMEM; - - p = kmalloc(len+1, GFP_KERNEL); - if (!p) { - kfree(nodeids); - 
return -ENOMEM; - } - memcpy(p, buf, len); - p[len+1] = '\0'; - - for (i = 0; i < count; i++) { - if ((t = strsep(&p, " ")) == NULL) - break; - if (sscanf(t, "%u", &id) != 1) - break; - nodeids[i] = id; - } - - if (i != count) { - kfree(nodeids); - ret = -EINVAL; - goto out; - } - - spin_lock(&ls->ls_recover_lock); - if (ls->ls_nodeids_next) { - kfree(nodeids); - ret = -EINVAL; - goto out_unlock; - } - ls->ls_nodeids_next = nodeids; - ls->ls_nodeids_next_count = count; - - out_unlock: - spin_unlock(&ls->ls_recover_lock); - out: - kfree(p); - return ret; -} - struct dlm_attr { struct attribute attr; ssize_t (*show)(struct dlm_ls *, char *); @@ -140,17 +73,10 @@ static struct dlm_attr dlm_attr_id = { .store = dlm_id_store }; -static struct dlm_attr dlm_attr_members = { - .attr = {.name = "members", .mode = S_IRUGO | S_IWUSR}, - .show = dlm_members_show, - .store = dlm_members_store -}; - static struct attribute *dlm_attrs[] = { &dlm_attr_control.attr, &dlm_attr_event.attr, &dlm_attr_id.attr, - &dlm_attr_members.attr, NULL, }; diff -urpN a/drivers/dlm/node_ioctl.c b/drivers/dlm/node_ioctl.c --- a/drivers/dlm/node_ioctl.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/node_ioctl.c 1970-01-01 07:30:00.000000000 +0730 @@ -1,126 +0,0 @@ -/****************************************************************************** -******************************************************************************* -** -** Copyright (C) 2005 Red Hat, Inc. All rights reserved. -** -** This copyrighted material is made available to anyone wishing to use, -** modify, copy, or redistribute it subject to the terms and conditions -** of the GNU General Public License v.2. -** -******************************************************************************* -******************************************************************************/ - -#include -#include - -#include - -#include "dlm_internal.h" -#include "lowcomms.h" - - -static int check_version(unsigned int cmd, - struct dlm_node_ioctl __user *u_param) -{ - u32 version[3]; - int error = 0; - - if (copy_from_user(version, u_param->version, sizeof(version))) - return -EFAULT; - - if ((DLM_NODE_VERSION_MAJOR != version[0]) || - (DLM_NODE_VERSION_MINOR < version[1])) { - log_print("node_ioctl: interface mismatch: " - "kernel(%u.%u.%u), user(%u.%u.%u), cmd(%d)", - DLM_NODE_VERSION_MAJOR, - DLM_NODE_VERSION_MINOR, - DLM_NODE_VERSION_PATCH, - version[0], version[1], version[2], cmd); - error = -EINVAL; - } - - version[0] = DLM_NODE_VERSION_MAJOR; - version[1] = DLM_NODE_VERSION_MINOR; - version[2] = DLM_NODE_VERSION_PATCH; - - if (copy_to_user(u_param->version, version, sizeof(version))) - return -EFAULT; - return error; -} - -static int node_ioctl(struct inode *inode, struct file *file, - uint command, ulong u) -{ - struct dlm_node_ioctl *k_param; - struct dlm_node_ioctl __user *u_param; - unsigned int cmd, type; - int error; - - u_param = (struct dlm_node_ioctl __user *) u; - - if (!capable(CAP_SYS_ADMIN)) - return -EACCES; - - type = _IOC_TYPE(command); - cmd = _IOC_NR(command); - - if (type != DLM_IOCTL) { - log_print("node_ioctl: bad ioctl 0x%x 0x%x 0x%x", - command, type, cmd); - return -ENOTTY; - } - - error = check_version(cmd, u_param); - if (error) - return error; - - if (cmd == DLM_NODE_VERSION_CMD) - return 0; - - k_param = kmalloc(sizeof(*k_param), GFP_KERNEL); - if (!k_param) - return -ENOMEM; - - if (copy_from_user(k_param, u_param, sizeof(*k_param))) { - kfree(k_param); - return -EFAULT; - } - - if (cmd == DLM_SET_NODE_CMD) - error = 
dlm_set_node(k_param->nodeid, k_param->weight, - k_param->addr); - else if (cmd == DLM_SET_LOCAL_CMD) - error = dlm_set_local(k_param->nodeid, k_param->weight, - k_param->addr); - - kfree(k_param); - return error; -} - -static struct file_operations node_fops = { - .ioctl = node_ioctl, - .owner = THIS_MODULE, -}; - -static struct miscdevice node_misc = { - .minor = MISC_DYNAMIC_MINOR, - .name = DLM_NODE_MISC_NAME, - .fops = &node_fops -}; - -int dlm_node_ioctl_init(void) -{ - int error; - - error = misc_register(&node_misc); - if (error) - log_print("node_ioctl: misc_register failed %d", error); - return error; -} - -void dlm_node_ioctl_exit(void) -{ - if (misc_deregister(&node_misc) < 0) - log_print("node_ioctl: misc_deregister failed"); -} - diff -urpN a/drivers/dlm/requestqueue.c b/drivers/dlm/requestqueue.c --- a/drivers/dlm/requestqueue.c 2005-08-17 17:19:22.000000000 +0800 +++ b/drivers/dlm/requestqueue.c 2005-08-18 13:22:00.725153264 +0800 @@ -14,7 +14,7 @@ #include "member.h" #include "lock.h" #include "dir.h" -#include "lowcomms.h" +#include "config.h" struct rq_entry { struct list_head list; diff -urpN a/include/linux/dlm_node.h b/include/linux/dlm_node.h --- a/include/linux/dlm_node.h 2005-08-17 17:19:23.000000000 +0800 +++ b/include/linux/dlm_node.h 1970-01-01 07:30:00.000000000 +0730 @@ -1,44 +0,0 @@ -/****************************************************************************** -******************************************************************************* -** -** Copyright (C) 2005 Red Hat, Inc. All rights reserved. -** -** This copyrighted material is made available to anyone wishing to use, -** modify, copy, or redistribute it subject to the terms and conditions -** of the GNU General Public License v.2. -** -******************************************************************************* -******************************************************************************/ - -#ifndef __DLM_NODE_DOT_H__ -#define __DLM_NODE_DOT_H__ - -#define DLM_ADDR_LEN 256 -#define DLM_MAX_ADDR_COUNT 3 -#define DLM_NODE_MISC_NAME "dlm-node" - -#define DLM_NODE_VERSION_MAJOR 1 -#define DLM_NODE_VERSION_MINOR 0 -#define DLM_NODE_VERSION_PATCH 0 - -struct dlm_node_ioctl { - __u32 version[3]; - int nodeid; - int weight; - char addr[DLM_ADDR_LEN]; -}; - -enum { - DLM_NODE_VERSION_CMD = 0, - DLM_SET_NODE_CMD, - DLM_SET_LOCAL_CMD, -}; - -#define DLM_IOCTL 0xd1 - -#define DLM_NODE_VERSION _IOWR(DLM_IOCTL, DLM_NODE_VERSION_CMD, struct dlm_node_ioctl) -#define DLM_SET_NODE _IOWR(DLM_IOCTL, DLM_SET_NODE_CMD, struct dlm_node_ioctl) -#define DLM_SET_LOCAL _IOWR(DLM_IOCTL, DLM_SET_LOCAL_CMD, struct dlm_node_ioctl) - -#endif - From nish.aravamudan at gmail.com Thu Aug 18 06:29:03 2005 From: nish.aravamudan at gmail.com (Nish Aravamudan) Date: Wed, 17 Aug 2005 23:29:03 -0700 Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs In-Reply-To: <20050818060750.GA10133@redhat.com> References: <20050818060750.GA10133@redhat.com> Message-ID: <29495f1d05081723293c2bd337@mail.gmail.com> On 8/17/05, David Teigland wrote: > Use configfs to configure lockspace members and node addresses. This was > previously done with sysfs and ioctl. > > Signed-off-by: David Teigland Are you the official maintainer of the DLM subsystem? Could you submit a patch to add a MAINTAINERS entry? I was looking for a maintainer to send the dlm portion of my schedule_timeout() fixes to, but there wasn't one listed. 
Thanks,
Nish

From treddy at rallydev.com Thu Aug 18 20:16:49 2005
From: treddy at rallydev.com (Tarun Reddy)
Date: Thu, 18 Aug 2005 14:16:49 -0600
Subject: [Linux-cluster] RHEL/RHCS3: /usr/lib/clumanager/services/service status # stays up
Message-ID:

I'm running a new instance of RHCS on RHEL3 and am having an issue where I get many instances (over however many days the machine has been running) of the following:

/bin/bash /usr/lib/clumanager/services/service status 1

all showing up when I do a ps auxww | grep status. The number on the end changes and is not always the same, but currently on my system I have status 1 and status 0 both "stuck" running. These happen to be checks for mysql (1) and httpd (0), both of which are using standard Red Hat startup/shutdown/status scripts.

If I kill them, the service that each is associated with restarts, thinking that the result didn't come back correctly. Not desirable, since the service is actually up.

The number of occurrences has been greatly reduced since I increased the time between checks from 1 to 10. I didn't realize it was in seconds (RTFM), so I'll probably boost that up to 30 or 60 seconds.

Anyway, in an attempt to debug this, I started a while loop that called the above statement with a -x after the bash and found that the command occasionally hangs at:

+ retVal=0
+ '[' -n 6 -a -n 5 -a 6 -le 5 ']'
+ return 3
+ return 0
+ rm -f /tmp/cluster-httpd_status.z16209
+ ip status 0

Anybody venture a guess as to why this might be occurring? And are my check intervals too low?

Thanks,
Tarun

From treddy at rallydev.com Thu Aug 18 20:19:47 2005
From: treddy at rallydev.com (Tarun Reddy)
Date: Thu, 18 Aug 2005 14:19:47 -0600
Subject: [Linux-cluster] RHEL/RHCS3: many pipe files keep building up in /tmp
Message-ID:

Under a new install of RHCS, after the system has been up for a while I get tons of files with names in the format sh-np-1124010202 (the numbers change) in /tmp. They are all pipe files like this:

prw------- 1 root root 0 Aug 14 09:42 sh-np-1124010202

I can't find anything attaching to them with lsof | grep sh-np; however, they only show up after I've started up clustering. Can anyone explain these files and whether they are supposed to be cleaned up somehow?

Thanks,
Tarun

From mark.fasheh at oracle.com Thu Aug 18 21:23:48 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Thu, 18 Aug 2005 14:23:48 -0700
Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs
In-Reply-To: <20050818060750.GA10133@redhat.com>
References: <20050818060750.GA10133@redhat.com>
Message-ID: <20050818212348.GW21228@ca-server1.us.oracle.com>

Hi David,

On Thu, Aug 18, 2005 at 02:07:50PM +0800, David Teigland wrote:
> +/*
> + * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/nodeid
> + * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/weight
> + * /config/dlm/<cluster>/comms/<comm>/nodeid
> + * /config/dlm/<cluster>/comms/<comm>/local
> + * /config/dlm/<cluster>/comms/<comm>/addr
> + * The <cluster> level is useless, but I haven't figured out how to avoid it.
> + */
So what happened to factoring out the common parts of ocfs2_nodemanager?
I was quite a big fan of that approach :) Or am I just misunderstanding
what these patches do?
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com

From ocrete at max-t.com Thu Aug 18 21:32:30 2005
From: ocrete at max-t.com (Olivier Crete)
Date: Thu, 18 Aug 2005 17:32:30 -0400
Subject: [Linux-cluster] zero vote node with cman
Message-ID: <1124400750.12024.52.camel@cocagne.max-t.internal>

Hi,

We're using cman from the STABLE branch and we're pretty satisfied.
But there is one thing that I don't seem to be able to get working. In a client-server application, I would like the client nodes to be able to take actions when the system becomes inquorate or a server dies, but not count towards the quorum.

I tried setting the votes to 0, but it seems that it won't let me do it. Is there another solution?

--
Olivier Crête
ocrete at max-t.com
Maximum Throughput Inc.

From mikore.li at gmail.com Fri Aug 19 05:59:05 2005
From: mikore.li at gmail.com (Michael)
Date: Fri, 19 Aug 2005 13:59:05 +0800
Subject: [Linux-cluster] Cache in GFS?
Message-ID:

Hi,

Is there a caching mechanism in the GFS client? Can I increase read/write performance by giving GFS much more kernel memory to cache with?

Thanks,
Q.L

From teigland at redhat.com Fri Aug 19 07:13:44 2005
From: teigland at redhat.com (David Teigland)
Date: Fri, 19 Aug 2005 15:13:44 +0800
Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs
In-Reply-To: <20050818212348.GW21228@ca-server1.us.oracle.com>
References: <20050818060750.GA10133@redhat.com> <20050818212348.GW21228@ca-server1.us.oracle.com>
Message-ID: <20050819071344.GB10864@redhat.com>

On Thu, Aug 18, 2005 at 02:23:48PM -0700, Mark Fasheh wrote:
> On Thu, Aug 18, 2005 at 02:07:50PM +0800, David Teigland wrote:
> > + * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/nodeid
> > + * /config/dlm/<cluster>/spaces/<space>/nodes/<node>/weight
> > + * /config/dlm/<cluster>/comms/<comm>/nodeid
> > + * /config/dlm/<cluster>/comms/<comm>/local
> > + * /config/dlm/<cluster>/comms/<comm>/addr
>
> So what happened to factoring out the common parts of ocfs2_nodemanager?
> I was quite a big fan of that approach :) Or am I just misunderstanding
> what these patches do?

The nodemanager RFC I sent a month ago
http://marc.theaimsgroup.com/?l=linux-kernel&m=112166723919347&w=2
amounts to half of dlm/config.c (everything under comms/ above) moved into
a separate kernel module. That would be trivial to do, and is still an
option to bat around. I question whether factoring such a small chunk
into a separate module is really worth it, though?

Making all of config.c (all of /config/dlm/ above) into a separate module
wouldn't seem quite so strange. It would require just a few lines of code
to turn it into a standalone module.

Dave

From pcaulfie at redhat.com Fri Aug 19 07:19:58 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 19 Aug 2005 08:19:58 +0100
Subject: [Linux-cluster] zero vote node with cman
In-Reply-To: <1124400750.12024.52.camel@cocagne.max-t.internal>
References: <1124400750.12024.52.camel@cocagne.max-t.internal>
Message-ID: <4305881E.5000106@redhat.com>

Olivier Crete wrote:
> Hi,
>
> We're using cman from the STABLE branch and we're pretty satisfied. But
> there is one thing that I don't seem to be able to get working. In a
> client-server application, I would like the client nodes to be able to
> take actions when the system becomes inquorate or a server dies, but not
> count towards the quorum.
>
> I tried setting the votes to 0, but it seems that it won't let me do it.
> Is there another solution?

It seems to be a bug in cman_tool that's overriding the votes rather
over-enthusiastically.
This patch should fix: Index: main.c =================================================================== RCS file: /cvs/cluster/cluster/cman/cman_tool/main.c,v retrieving revision 1.12.2.7 diff -u -p -r1.12.2.7 main.c --- main.c 21 Mar 2005 16:17:06 -0000 1.12.2.7 +++ main.c 19 Aug 2005 07:18:47 -0000 @@ -552,7 +552,7 @@ static void check_arguments(commandline_ if (!comline->clustername[0]) die("cluster name not set"); - if (!comline->votes) + if (!comline->votes_opt && comline->no_ccs) comline->votes = DEFAULT_VOTES; if (!comline->port) -- patrick From Tarun.Reddy at rallydev.com Thu Aug 18 20:12:25 2005 From: Tarun.Reddy at rallydev.com (Tarun Reddy) Date: Thu, 18 Aug 2005 14:12:25 -0600 Subject: [Linux-cluster] RHEL/RHCS3: /usr/lib/clumanager/services/service status # stays up Message-ID: <437DFE3B-80E4-4D14-A4A1-DDE56BD2ED5B@rallydev.com> I'm running a new instance of RHCS on RHEL3 and am having an issue where I get many instances (over how many ever days of the machine running) of the following: /bin/bash /usr/lib/clumanager/services/service status 1 all showing up when I do a ps auxww | grep status. The number on the end changes and is not always the same but currently on my system I have status 1 and status 0 both "stuck" running. These happen to be checks for mysql (1) and httpd (0), both of which are using standard redhat startup/shutdown/status scripts. If I kill them, the service that it is associated with restarts thinking that the result didn't return back correctly. Not desirable since the service is actually up. The number of occurrences has been greatly reduced since I increased the time between checks from 1 to 10. I didn't realize it was in seconds (RTFM) and so I'll probably boost that up to 30 or 60 seconds. Anyway, in an attempt to debug this, I started a while loop that called the above statement with a -x after the bash and found that the command occassionally hangs at + retVal=0 + '[' -n 6 -a -n 5 -a 6 -le 5 ']' + return 3 + return 0 + rm -f /tmp/cluster-httpd_status.z16209 + ip status 0 Anybody venture a guess as to why this might be occurring? And are my check intervals too low? Thanks, Tarun From forgue at oakland.edu Fri Aug 19 15:43:50 2005 From: forgue at oakland.edu (Andrew Forgue) Date: Fri, 19 Aug 2005 11:43:50 -0400 Subject: [Linux-cluster] Trying to compile dlm-kernel-2.6.9-34.0.src.rpm Message-ID: <1124466230.15510.7.camel@localhost.localdomain> Hello, I'm trying to build the dlm-kernel SRPM and I'm getting this error when building the SMP module of DLM. + /lib/modules/2.6.9-11.ELsmp/build//scripts/mod/modpost -m -i /lib/modules/2.6.9-11.ELsmp/kernel/cluster/cman.symvers src/dlm.o -o dlm.s ymvers src/dlm.o: No such file or directory /var/tmp/rpm-tmp.45995: line 31: 31466 Aborted $kernel_src/scripts/mod/modpost -m -i /lib/modules/2.6.9-11.EL $flavor/kern el/cluster/cman.symvers src/dlm.o -o dlm.symvers error: Bad exit status from /var/tmp/rpm-tmp.45995 (%build) This is on a RHEL-4 machine that's completely up to date. Thanks, Andrew Complete Output: [forgue at server02 SPECS]$ sudo rpmbuild -bb --target=i686 dlm-kernel.spec Building target platforms: i686 Building for target i686 Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.45995 + umask 022 + cd /usr/src/redhat/BUILD + LANG=C + export LANG + unset DISPLAY + cd /usr/src/redhat/BUILD + rm -rf dlm-kernel-2.6.9-34 + /usr/bin/gzip -dc /usr/src/redhat/SOURCES/dlm-kernel-2.6.9-34.tar.gz + tar -xf - + STATUS=0 + '[' 0 -ne 0 ']' + cd dlm-kernel-2.6.9-34 ++ /usr/bin/id -u + '[' 0 = 0 ']' + /bin/chown -Rhf root . 
++ /usr/bin/id -u + '[' 0 = 0 ']' + /bin/chgrp -Rhf root . + /bin/chmod -Rf a+rX,u+w,g-w,o-w . + sed -i -e '/RELEASE_NAME/s/""/"2.6.9-34.0"/' src/dlm_internal.h + exit 0 Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.45995 + umask 022 + cd /usr/src/redhat/BUILD + cd dlm-kernel-2.6.9-34 + LANG=C + export LANG + unset DISPLAY ++ pwd + cp -r /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34 ../smp ++ pwd + cp -r /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34 ../hugemem + Build_dlm i686 + cpu_type=i686 + flavor= + kernel_src=/lib/modules/2.6.9-11.EL/build/ + '[' -d /lib/modules/2.6.9-11.EL/build//. ']' + echo 'Kernel 2.6.9-11.EL found.' Kernel 2.6.9-11.EL found. + echo /lib/modules/2.6.9-11.EL/build/ /lib/modules/2.6.9-11.EL/build/ + ./configure --kernel_src=/lib/modules/2.6.9-11.EL/build/ --incdir=/usr/include Configuring Makefiles for your system... Completed Makefile configuration + make symverfile=/lib/modules/2.6.9-11.EL/kernel/cluster/cman.symvers cd src && make all make[1]: Entering directory `/usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src' if [ ! -e cluster ]; then ln -s . cluster; fi if [ ! -e service.h ]; then cp //usr/include/cluster/service.h .; fi if [ ! -e cnxman.h ]; then cp //usr/include/cluster/cnxman.h .; fi if [ ! -e cnxman-socket.h ]; then cp //usr/include/cluster/cnxman-socket.h .; fi make -C /lib/modules/2.6.9-11.EL/build/ M=/usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src modules USING_KBUILD=yes make[2]: Entering directory `/usr/src/kernels/2.6.9-11.EL-i686' CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/ast.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/config.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/device.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dir.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/lkb.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/locking.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/lockqueue.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/lockspace.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/lowcomms.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/main.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/memory.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/midcomms.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/nodes.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/proc.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/queries.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/rebuild.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/reccomms.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/recover.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/recoverd.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/rsb.o CC [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/util.o LD [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dlm.o Building modules, stage 2. MODPOST CC /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dlm.mod.o LD [M] /usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src/dlm.ko make[2]: Leaving directory `/usr/src/kernels/2.6.9-11.EL-i686' make[1]: Leaving directory `/usr/src/redhat/BUILD/dlm-kernel-2.6.9-34/src' + /lib/modules/2.6.9-11.EL/build//scripts/mod/modpost -m -i /lib/modules/2.6.9-11.EL/kernel/cluster/cman.symvers src/dlm.o -o dlm.symvers + cd ../smp + Build_dlm i686 smp + cpu_type=i686 + flavor=smp + kernel_src=/lib/modules/2.6.9-11.ELsmp/build/ + '[' -d /lib/modules/2.6.9-11.ELsmp/build//. ']' + echo 'Kernel 2.6.9-11.EL found.' Kernel 2.6.9-11.EL found. 
+ echo /lib/modules/2.6.9-11.ELsmp/build/ /lib/modules/2.6.9-11.ELsmp/build/ + ./configure --kernel_src=/lib/modules/2.6.9-11.ELsmp/build/ --incdir=/usr/include Configuring Makefiles for your system... Completed Makefile configuration + make symverfile=/lib/modules/2.6.9-11.ELsmp/kernel/cluster/cman.symvers cd src && make all make[1]: Entering directory `/usr/src/redhat/BUILD/smp/src' rm -f cluster ln -s . cluster make -C /lib/modules/2.6.9-11.ELsmp/build/ M=/usr/src/redhat/BUILD/smp/src modules USING_KBUILD=yes make[2]: Entering directory `/usr/src/kernels/2.6.9-11.EL-smp-i686' Building modules, stage 2. MODPOST make[2]: Leaving directory `/usr/src/kernels/2.6.9-11.EL-smp-i686' make[1]: Leaving directory `/usr/src/redhat/BUILD/smp/src' + /lib/modules/2.6.9-11.ELsmp/build//scripts/mod/modpost -m -i /lib/modules/2.6.9-11.ELsmp/kernel/cluster/cman.symvers src/dlm.o -o dlm.s ymvers src/dlm.o: No such file or directory /var/tmp/rpm-tmp.45995: line 31: 31466 Aborted $kernel_src/scripts/mod/modpost -m -i /lib/modules/2.6.9-11.EL $flavor/kern el/cluster/cman.symvers src/dlm.o -o dlm.symvers error: Bad exit status from /var/tmp/rpm-tmp.45995 (%build) -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From ocrete at max-t.com Fri Aug 19 15:51:24 2005 From: ocrete at max-t.com (Olivier Crete) Date: Fri, 19 Aug 2005 11:51:24 -0400 Subject: [Linux-cluster] zero vote node with cman In-Reply-To: <4305881E.5000106@redhat.com> References: <1124400750.12024.52.camel@cocagne.max-t.internal> <4305881E.5000106@redhat.com> Message-ID: <1124466684.12024.58.camel@cocagne.max-t.internal> On Fri, 2005-19-08 at 08:19 +0100, Patrick Caulfield wrote: > Olivier Crete wrote: > > Hi, > > > > We're using cman from the STABLE branch and we're pretty satisfied.. But > > there is one thing that I dont seem to be able to get working. In a > > client-server application, I would like the client nodes to be able to > > take actions when the system becomes inquorate or a server dies, but not > > could towards the quorum. > > > > I tried setting the votes to 0, but it seems that it wont let me do it.. > > Is there another solution? > > > > > > It seems to be a bug in cman_tool that's overriding the votes rather > over-enthusiastically. > > This patch should fix: Actually it doesnt.. it sets the default to 0... the attached patch seems to work better. -- Olivier Cr?te ocrete at max-t.com Maximum Throughput Inc. -------------- next part -------------- A non-text attachment was scrubbed... Name: cman_tool-zero-vote.patch Type: text/x-patch Size: 632 bytes Desc: not available URL: From Joel.Becker at oracle.com Sat Aug 20 00:40:11 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Fri, 19 Aug 2005 17:40:11 -0700 Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs In-Reply-To: <20050818210747.GC22742@insight> References: <20050818060750.GA10133@redhat.com> <20050817232218.56a06fd6.akpm@osdl.org> <20050818210747.GC22742@insight> Message-ID: <20050820004011.GF4100@insight.us.oracle.com> On Thu, Aug 18, 2005 at 02:07:47PM -0700, Joel Becker wrote: > On Wed, Aug 17, 2005 at 11:22:18PM -0700, Andrew Morton wrote: > > Fair enough. This really means that the configfs patch should be split out > > of the ocfs2 megapatch... > > Easy to do, it's a separate commit in the ocfs2.git repository. 
> Would you rather > > a) Do the diffs yourself (configfs commit, remaining ocfs2 commits) > b) Have two repositories, configfs.git and ocfs2.git, where > ocfs2.git is configfs.git+ocfs2 > c) Just take the configfs patch (which really hasn't changed in > months) Well, I included the patch in my last email. For the latest spin, I've created http://oss.oracle.com/git/configfs.git. The ocfs2 git repositories (http://oss.oracle.com/git/ocfs2-dev.git, http://oss.oracle.com/git/ocfs2.git) are now based on the configfs one. If there's any other way you want me to do it, let me know. Joel -- "If the human brain were so simple we could understand it, we would be so simple that we could not." - W. A. Clouston http://www.jlbec.org/ jlbec at evilplan.org -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: Digital signature URL: From Joel.Becker at oracle.com Fri Aug 19 15:09:09 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Fri, 19 Aug 2005 08:09:09 -0700 Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs In-Reply-To: <20050817232218.56a06fd6.akpm@osdl.org> References: <20050818060750.GA10133@redhat.com> <20050817232218.56a06fd6.akpm@osdl.org> Message-ID: <20050819150909.GA18991@ca-server1.us.oracle.com> On Wed, Aug 17, 2005 at 11:22:18PM -0700, Andrew Morton wrote: > David Teigland wrote: > > > > Use configfs to configure lockspace members and node addresses. This was > > previously done with sysfs and ioctl. > > Fair enough. This really means that the configfs patch should be split out > of the ocfs2 megapatch... Easy to do, it's a separate commit in the ocfs2.git repository. Would you rather a) Do the diffs yourself (configfs commit, remaining ocfs2 commits) b) Have two repositories, configfs.git and ocfs2.git, where ocfs2.git is configfs.git+ocfs2 c) Just take the configfs patch (which really hasn't changed in months) Joel ------------------------------------------- [PATCH] configfs: Userspace-driven configuration filesystem Configfs, a file system for userspace-driven kernel object configuration. The OCFS2 stack makes extensive use of this for propagation of cluster configuration information into kernel. Signed-off-by: Joel Becker --- diff -ruN linux-2.6.15-rc6.old/Documentation/filesystems/00-INDEX linux-2.6.13-rc6/Documentation/filesystems/00-INDEX --- linux-2.6.13-rc6.old/Documentation/filesystems/00-INDEX 2005-08-07 11:18:56.000000000 -0700 +++ linux-2.6.13-rc6/Documentation/filesystems/00-INDEX 2005-08-12 18:09:41.923178911 -0700 @@ -12,6 +12,8 @@ - description of the CIFS filesystem coda.txt - description of the CODA filesystem. +configfs/ + - directory containing configfs documentation and example code. cramfs.txt - info on the cram filesystem for small storage (ROMs etc) devfs/ diff -ruN linux-2.6.13-rc6.old/Documentation/filesystems/configfs/configfs.txt linux-2.6.13-rc6/Documentation/filesystems/configfs/configfs.txt --- linux-2.6.13-rc6.old/Documentation/filesystems/configfs/configfs.txt 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/Documentation/filesystems/configfs/configfs.txt 2005-08-12 18:09:41.924178946 -0700 @@ -0,0 +1,434 @@ + +configfs - Userspace-driven kernel object configuation. + +Joel Becker + +Updated: 31 March 2005 + +Copyright (c) 2005 Oracle Corporation, + Joel Becker + + +[What is configfs?] + +configfs is a ram-based filesystem that provides the converse of +sysfs's functionality. 
Where sysfs is a filesystem-based view of +kernel objects, configfs is a filesystem-based manager of kernel +objects, or config_items. + +With sysfs, an object is created in kernel (for example, when a device +is discovered) and it is registered with sysfs. Its attributes then +appear in sysfs, allowing userspace to read the attributes via +readdir(3)/read(2). It may allow some attributes to be modified via +write(2). The important point is that the object is created and +destroyed in kernel, the kernel controls the lifecycle of the sysfs +representation, and sysfs is merely a window on all this. + +A configfs config_item is created via an explicit userspace operation: +mkdir(2). It is destroyed via rmdir(2). The attributes appear at +mkdir(2) time, and can be read or modified via read(2) and write(2). +As with sysfs, readdir(3) queries the list of items and/or attributes. +symlink(2) can be used to group items together. Unlike sysfs, the +lifetime of the representation is completely driven by userspace. The +kernel modules backing the items must respond to this. + +Both sysfs and configfs can and should exist together on the same +system. One is not a replacement for the other. + +[Using configfs] + +configfs can be compiled as a module or into the kernel. You can access +it by doing + + mount -t configfs none /config + +The configfs tree will be empty unless client modules are also loaded. +These are modules that register their item types with configfs as +subsystems. Once a client subsystem is loaded, it will appear as a +subdirectory (or more than one) under /config. Like sysfs, the +configfs tree is always there, whether mounted on /config or not. + +An item is created via mkdir(2). The item's attributes will also +appear at this time. readdir(3) can determine what the attributes are, +read(2) can query their default values, and write(2) can store new +values. Like sysfs, attributes should be ASCII text files, preferably +with only one value per file. The same efficiency caveats from sysfs +apply. Don't mix more than one attribute in one attribute file. + +Like sysfs, configfs expects write(2) to store the entire buffer at +once. When writing to configfs attributes, userspace processes should +first read the entire file, modify the portions they wish to change, and +then write the entire buffer back. Attribute files have a maximum size +of one page (PAGE_SIZE, 4096 on i386). + +When an item needs to be destroyed, remove it with rmdir(2). An +item cannot be destroyed if any other item has a link to it (via +symlink(2)). Links can be removed via unlink(2). + +[Configuring FakeNBD: an Example] + +Imagine there's a Network Block Device (NBD) driver that allows you to +access remote block devices. Call it FakeNBD. FakeNBD uses configfs +for its configuration. Obviously, there will be a nice program that +sysadmins use to configure FakeNBD, but somehow that program has to tell +the driver about it. Here's where configfs comes in. + +When the FakeNBD driver is loaded, it registers itself with configfs. +readdir(3) sees this just fine: + + # ls /config + fakenbd + +A fakenbd connection can be created with mkdir(2). The name is +arbitrary, but likely the tool will make some use of the name. Perhaps +it is a uuid or a disk name: + + # mkdir /config/fakenbd/disk1 + # ls /config/fakenbd/disk1 + target device rw + +The target attribute contains the IP address of the server FakeNBD will +connect to. The device attribute is the device on the server. 
+Predictably, the rw attribute determines whether the connection is +read-only or read-write. + + # echo 10.0.0.1 > /config/fakenbd/disk1/target + # echo /dev/sda1 > /config/fakenbd/disk1/device + # echo 1 > /config/fakenbd/disk1/rw + +That's it. That's all there is. Now the device is configured, via the +shell no less. + +[Coding With configfs] + +Every object in configfs is a config_item. A config_item reflects an +object in the subsystem. It has attributes that match values on that +object. configfs handles the filesystem representation of that object +and its attributes, allowing the subsystem to ignore all but the +basic show/store interaction. + +Items are created and destroyed inside a config_group. A group is a +collection of items that share the same attributes and operations. +Items are created by mkdir(2) and removed by rmdir(2), but configfs +handles that. The group has a set of operations to perform these tasks + +A subsystem is the top level of a client module. During initialization, +the client module registers the subsystem with configfs, the subsystem +appears as a directory at the top of the configfs filesystem. A +subsystem is also a config_group, and can do everything a config_group +can. + +[struct config_item] + + struct config_item { + char *ci_name; + char ci_namebuf[UOBJ_NAME_LEN]; + struct kref ci_kref; + struct list_head ci_entry; + struct config_item *ci_parent; + struct config_group *ci_group; + struct config_item_type *ci_type; + struct dentry *ci_dentry; + }; + + void config_item_init(struct config_item *); + void config_item_init_type_name(struct config_item *, + const char *name, + struct config_item_type *type); + struct config_item *config_item_get(struct config_item *); + void config_item_put(struct config_item *); + +Generally, struct config_item is embedded in a container structure, a +structure that actually represents what the subsystem is doing. The +config_item portion of that structure is how the object interacts with +configfs. + +Whether statically defined in a source file or created by a parent +config_group, a config_item must have one of the _init() functions +called on it. This initializes the reference count and sets up the +appropriate fields. + +All users of a config_item should have a reference on it via +config_item_get(), and drop the reference when they are done via +config_item_put(). + +By itself, a config_item cannot do much more than appear in configfs. +Usually a subsystem wants the item to display and/or store attributes, +among other things. For that, it needs a type. + +[struct config_item_type] + + struct configfs_item_operations { + void (*release)(struct config_item *); + ssize_t (*show_attribute)(struct config_item *, + struct configfs_attribute *, + char *); + ssize_t (*store_attribute)(struct config_item *, + struct configfs_attribute *, + const char *, size_t); + int (*allow_link)(struct config_item *src, + struct config_item *target); + int (*drop_link)(struct config_item *src, + struct config_item *target); + }; + + struct config_item_type { + struct module *ct_owner; + struct configfs_item_operations *ct_item_ops; + struct configfs_group_operations *ct_group_ops; + struct configfs_attribute **ct_attrs; + }; + +The most basic function of a config_item_type is to define what +operations can be performed on a config_item. All items that have been +allocated dynamically will need to provide the ct_item_ops->release() +method. This method is called when the config_item's reference count +reaches zero. 
Items that wish to display an attribute need to provide +the ct_item_ops->show_attribute() method. Similarly, storing a new +attribute value uses the store_attribute() method. + +[struct configfs_attribute] + + struct configfs_attribute { + char *ca_name; + struct module *ca_owner; + mode_t ca_mode; + }; + +When a config_item wants an attribute to appear as a file in the item's +configfs directory, it must define a configfs_attribute describing it. +It then adds the attribute to the NULL-terminated array +config_item_type->ct_attrs. When the item appears in configfs, the +attribute file will appear with the configfs_attribute->ca_name +filename. configfs_attribute->ca_mode specifies the file permissions. + +If an attribute is readable and the config_item provides a +ct_item_ops->show_attribute() method, that method will be called +whenever userspace asks for a read(2) on the attribute. The converse +will happen for write(2). + +[struct config_group] + +A config_item cannot live in a vaccum. The only way one can be created +is via mkdir(2) on a config_group. This will trigger creation of a +child item. + + struct config_group { + struct config_item cg_item; + struct list_head cg_children; + struct configfs_subsystem *cg_subsys; + struct config_group **default_groups; + }; + + void config_group_init(struct config_group *group); + void config_group_init_type_name(struct config_group *group, + const char *name, + struct config_item_type *type); + + +The config_group structure contains a config_item. Properly configuring +that item means that a group can behave as an item in its own right. +However, it can do more: it can create child items or groups. This is +accomplished via the group operations specified on the group's +config_item_type. + + struct configfs_group_operations { + struct config_item *(*make_item)(struct config_group *group, + const char *name); + struct config_group *(*make_group)(struct config_group *group, + const char *name); + int (*commit_item)(struct config_item *item); + void (*drop_item)(struct config_group *group, + struct config_item *item); + }; + +A group creates child items by providing the +ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new +config_item (or more likely, its container structure), initializes it, +and returns it to configfs. Configfs will then populate the filesystem +tree to reflect the new item. + +If the subsystem wants the child to be a group itself, the subsystem +provides ct_group_ops->make_group(). Everything else behaves the same, +using the group _init() functions on the group. + +Finally, when userspace calls rmdir(2) on the item or group, +ct_group_ops->drop_item() is called. As a config_group is also a +config_item, it is not necessary for a seperate drop_group() method. +The subsystem must config_item_put() the reference that was initialized +upon item allocation. If a subsystem has no work to do, it may omit +the ct_group_ops->drop_item() method, and configfs will call +config_item_put() on the item on behalf of the subsystem. + +IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2) +is called, configfs WILL remove the item from the filesystem tree +(assuming that it has no children to keep it busy). The subsystem is +responsible for responding to this. If the subsystem has references to +the item in other threads, the memory is safe. It may take some time +for the item to actually disappear from the subsystem's usage. 
But it +is gone from configfs. + +A config_group cannot be removed while it still has child items. This +is implemented in the configfs rmdir(2) code. ->drop_item() will not be +called, as the item has not been dropped. rmdir(2) will fail, as the +directory is not empty. + +[struct configfs_subsystem] + +A subsystem must register itself, ususally at module_init time. This +tells configfs to make the subsystem appear in the file tree. + + struct configfs_subsystem { + struct config_group su_group; + struct semaphore su_sem; + }; + + int configfs_register_subsystem(struct configfs_subsystem *subsys); + void configfs_unregister_subsystem(struct configfs_subsystem *subsys); + + A subsystem consists of a toplevel config_group and a semaphore. +The group is where child config_items are created. For a subsystem, +this group is usually defined statically. Before calling +configfs_register_subsystem(), the subsystem must have initialized the +group via the usual group _init() functions, and it must also have +initialized the semaphore. + When the register call returns, the subsystem is live, and it +will be visible via configfs. At that point, mkdir(2) can be called and +the subsystem must be ready for it. + +[An Example] + +The best example of these basic concepts is the simple_children +subsystem/group and the simple_child item in configfs_example.c It +shows a trivial object displaying and storing an attribute, and a simple +group creating and destroying these children. + +[Hierarchy Navigation and the Subsystem Semaphore] + +There is an extra bonus that configfs provides. The config_groups and +config_items are arranged in a hierarchy due to the fact that they +appear in a filesystem. A subsystem is NEVER to touch the filesystem +parts, but the subsystem might be interested in this hierarchy. For +this reason, the hierarchy is mirrored via the config_group->cg_children +and config_item->ci_parent structure members. + +A subsystem can navigate the cg_children list and the ci_parent pointer +to see the tree created by the subsystem. This can race with configfs' +management of the hierarchy, so configfs uses the subsystem semaphore to +protect modifications. Whenever a subsystem wants to navigate the +hierarchy, it must do so under the protection of the subsystem +semaphore. + +A subsystem will be prevented from acquiring the semaphore while a newly +allocated item has not been linked into this hierarchy. Similarly, it +will not be able to acquire the semaphore while a dropping item has not +yet been unlinked. This means that an item's ci_parent pointer will +never be NULL while the item is in configfs, and that an item will only +be in its parent's cg_children list for the same duration. This allows +a subsystem to trust ci_parent and cg_children while they hold the +semaphore. + +[Item Aggregation Via symlink(2)] + +configfs provides a simple group via the group->item parent/child +relationship. Often, however, a larger environment requires aggregation +outside of the parent/child connection. This is implemented via +symlink(2). + +A config_item may provide the ct_item_ops->allow_link() and +ct_item_ops->drop_link() methods. If the ->allow_link() method exists, +symlink(2) may be called with the config_item as the source of the link. +These links are only allowed between configfs config_items. Any +symlink(2) attempt outside the configfs filesystem will be denied. + +When symlink(2) is called, the source config_item's ->allow_link() +method is called with itself and a target item. 
If the source item +allows linking to target item, it returns 0. A source item may wish to +reject a link if it only wants links to a certain type of object (say, +in its own subsystem). + +When unlink(2) is called on the symbolic link, the source item is +notified via the ->drop_link() method. Like the ->drop_item() method, +this is a void function and cannot return failure. The subsystem is +responsible for responding to the change. + +A config_item cannot be removed while it links to any other item, nor +can it be removed while an item links to it. Dangling symlinks are not +allowed in configfs. + +[Automatically Created Subgroups] + +A new config_group may want to have two types of child config_items. +While this could be codified by magic names in ->make_item(), it is much +more explicit to have a method whereby userspace sees this divergence. + +Rather than have a group where some items behave differently than +others, configfs provides a method whereby one or many subgroups are +automatically created inside the parent at its creation. Thus, +mkdir("parent) results in "parent", "parent/subgroup1", up through +"parent/subgroupN". Items of type 1 can now be created in +"parent/subgroup1", and items of type N can be created in +"parent/subgroupN". + +These automatic subgroups, or default groups, do not preclude other +children of the parent group. If ct_group_ops->make_group() exists, +other child groups can be created on the parent group directly. + +A configfs subsystem specifies default groups by filling in the +NULL-terminated array default_groups on the config_group structure. +Each group in that array is populated in the configfs tree at the same +time as the parent group. Similarly, they are removed at the same time +as the parent. No extra notification is provided. When a ->drop_item() +method call notifies the subsystem the parent group is going away, it +also means every default group child associated with that parent group. + +As a consequence of this, default_groups cannot be removed directly via +rmdir(2). They also are not considered when rmdir(2) on the parent +group is checking for children. + +[Committable Items] + +NOTE: Committable items are currently unimplemented. + +Some config_items cannot have a valid initial state. That is, no +default values can be specified for the item's attributes such that the +item can do its work. Userspace must configure one or more attributes, +after which the subsystem can start whatever entity this item +represents. + +Consider the FakeNBD device from above. Without a target address *and* +a target device, the subsystem has no idea what block device to import. +The simple example assumes that the subsystem merely waits until all the +appropriate attributes are configured, and then connects. This will, +indeed, work, but now every attribute store must check if the attributes +are initialized. Every attribute store must fire off the connection if +that condition is met. + +Far better would be an explicit action notifying the subsystem that the +config_item is ready to go. More importantly, an explicit action allows +the subsystem to provide feedback as to whether the attibutes are +initialized in a way that makes sense. configfs provides this as +committable items. + +configfs still uses only normal filesystem operations. An item is +committed via rename(2). The item is moved from a directory where it +can be modified to a directory where it cannot. 
+ +Any group that provides the ct_group_ops->commit_item() method has +committable items. When this group appears in configfs, mkdir(2) will +not work directly in the group. Instead, the group will have two +subdirectories: "live" and "pending". The "live" directory does not +support mkdir(2) or rmdir(2) either. It only allows rename(2). The +"pending" directory does allow mkdir(2) and rmdir(2). An item is +created in the "pending" directory. Its attributes can be modified at +will. Userspace commits the item by renaming it into the "live" +directory. At this point, the subsystem recieves the ->commit_item() +callback. If all required attributes are filled to satisfaction, the +method returns zero and the item is moved to the "live" directory. + +As rmdir(2) does not work in the "live" directory, an item must be +shutdown, or "uncommitted". Again, this is done via rename(2), this +time from the "live" directory back to the "pending" one. The subsystem +is notified by the ct_group_ops->uncommit_object() method. + + diff -ruN linux-2.6.13-rc6.old/Documentation/filesystems/configfs/configfs_example.c linux-2.6.13-rc6/Documentation/filesystems/configfs/configfs_example.c --- linux-2.6.13-rc6.old/Documentation/filesystems/configfs/configfs_example.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/Documentation/filesystems/configfs/configfs_example.c 2005-08-12 18:09:41.925178981 -0700 @@ -0,0 +1,474 @@ +/* + * vim: noexpandtab ts=8 sts=0 sw=8: + * + * configfs_example.c - This file is a demonstration module containing + * a number of configfs subsystems. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include +#include +#include + +#include + + + +/* + * 01-childless + * + * This first example is a childless subsystem. It cannot create + * any config_items. It just has attributes. + * + * Note that we are enclosing the configfs_subsystem inside a container. + * This is not necessary if a subsystem has no attributes directly + * on the subsystem. See the next example, 02-simple-children, for + * such a subsystem. + */ + +struct childless { + struct configfs_subsystem subsys; + int showme; + int storeme; +}; + +struct childless_attribute { + struct configfs_attribute attr; + ssize_t (*show)(struct childless *, char *); + ssize_t (*store)(struct childless *, const char *, size_t); +}; + +static inline struct childless *to_childless(struct config_item *item) +{ + return item ? 
container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL; +} + +static ssize_t childless_showme_read(struct childless *childless, + char *page) +{ + ssize_t pos; + + pos = sprintf(page, "%d\n", childless->showme); + childless->showme++; + + return pos; +} + +static ssize_t childless_storeme_read(struct childless *childless, + char *page) +{ + return sprintf(page, "%d\n", childless->storeme); +} + +static ssize_t childless_storeme_write(struct childless *childless, + const char *page, + size_t count) +{ + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + childless->storeme = tmp; + + return count; +} + +static ssize_t childless_description_read(struct childless *childless, + char *page) +{ + return sprintf(page, +"[01-childless]\n" +"\n" +"The childless subsystem is the simplest possible subsystem in\n" +"configfs. It does not support the creation of child config_items.\n" +"It only has a few attributes. In fact, it isn't much different\n" +"than a directory in /proc.\n"); +} + +static struct childless_attribute childless_attr_showme = { + .attr = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO }, + .show = childless_showme_read, +}; +static struct childless_attribute childless_attr_storeme = { + .attr = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR }, + .show = childless_storeme_read, + .store = childless_storeme_write, +}; +static struct childless_attribute childless_attr_description = { + .attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO }, + .show = childless_description_read, +}; + +static struct configfs_attribute *childless_attrs[] = { + &childless_attr_showme.attr, + &childless_attr_storeme.attr, + &childless_attr_description.attr, + NULL, +}; + +static ssize_t childless_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + struct childless *childless = to_childless(item); + struct childless_attribute *childless_attr = + container_of(attr, struct childless_attribute, attr); + ssize_t ret = 0; + + if (childless_attr->show) + ret = childless_attr->show(childless, page); + return ret; +} + +static ssize_t childless_attr_store(struct config_item *item, + struct configfs_attribute *attr, + const char *page, size_t count) +{ + struct childless *childless = to_childless(item); + struct childless_attribute *childless_attr = + container_of(attr, struct childless_attribute, attr); + ssize_t ret = -EINVAL; + + if (childless_attr->store) + ret = childless_attr->store(childless, page, count); + return ret; +} + +static struct configfs_item_operations childless_item_ops = { + .show_attribute = childless_attr_show, + .store_attribute = childless_attr_store, +}; + +static struct config_item_type childless_type = { + .ct_item_ops = &childless_item_ops, + .ct_attrs = childless_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct childless childless_subsys = { + .subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "01-childless", + .ci_type = &childless_type, + }, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 02-simple-children + * + * This example merely has a simple one-attribute child. Note that + * there is no extra attribute structure, as the child's attribute is + * known from the get-go. 
Also, there is no container for the + * subsystem, as it has no attributes of its own. + */ + +struct simple_child { + struct config_item item; + int storeme; +}; + +static inline struct simple_child *to_simple_child(struct config_item *item) +{ + return item ? container_of(item, struct simple_child, item) : NULL; +} + +static struct configfs_attribute simple_child_attr_storeme = { + .ca_owner = THIS_MODULE, + .ca_name = "storeme", + .ca_mode = S_IRUGO | S_IWUSR, +}; + +static struct configfs_attribute *simple_child_attrs[] = { + &simple_child_attr_storeme, + NULL, +}; + +static ssize_t simple_child_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + ssize_t count; + struct simple_child *simple_child = to_simple_child(item); + + count = sprintf(page, "%d\n", simple_child->storeme); + + return count; +} + +static ssize_t simple_child_attr_store(struct config_item *item, + struct configfs_attribute *attr, + const char *page, size_t count) +{ + struct simple_child *simple_child = to_simple_child(item); + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + simple_child->storeme = tmp; + + return count; +} + +static void simple_child_release(struct config_item *item) +{ + kfree(to_simple_child(item)); +} + +static struct configfs_item_operations simple_child_item_ops = { + .release = simple_child_release, + .show_attribute = simple_child_attr_show, + .store_attribute = simple_child_attr_store, +}; + +static struct config_item_type simple_child_type = { + .ct_item_ops = &simple_child_item_ops, + .ct_attrs = simple_child_attrs, + .ct_owner = THIS_MODULE, +}; + + +static struct config_item *simple_children_make_item(struct config_group *group, const char *name) +{ + struct simple_child *simple_child; + + simple_child = kmalloc(sizeof(struct simple_child), GFP_KERNEL); + if (!simple_child) + return NULL; + + memset(simple_child, 0, sizeof(struct simple_child)); + + config_item_init_type_name(&simple_child->item, name, + &simple_child_type); + + simple_child->storeme = 0; + + return &simple_child->item; +} + +static struct configfs_attribute simple_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *simple_children_attrs[] = { + &simple_children_attr_description, + NULL, +}; + +static ssize_t simple_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[02-simple-children]\n" +"\n" +"This subsystem allows the creation of child config_items. These\n" +"items have only one attribute that is readable and writeable.\n"); +} + +static struct configfs_item_operations simple_children_item_ops = { + .show_attribute = simple_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. 
+ */ +static struct configfs_group_operations simple_children_group_ops = { + .make_item = simple_children_make_item, +}; + +static struct config_item_type simple_children_type = { + .ct_item_ops = &simple_children_item_ops, + .ct_group_ops = &simple_children_group_ops, + .ct_attrs = simple_children_attrs, +}; + +static struct configfs_subsystem simple_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "02-simple-children", + .ci_type = &simple_children_type, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 03-group-children + * + * This example reuses the simple_children group from above. However, + * the simple_children group is not the subsystem itself, it is a + * child of the subsystem. Creation of a group in the subsystem creates + * a new simple_children group. That group can then have simple_child + * children of its own. + */ + +struct simple_children { + struct config_group group; +}; + +static struct config_group *group_children_make_group(struct config_group *group, const char *name) +{ + struct simple_children *simple_children; + + simple_children = kmalloc(sizeof(struct simple_children), + GFP_KERNEL); + if (!simple_children) + return NULL; + + memset(simple_children, 0, sizeof(struct simple_children)); + + config_group_init_type_name(&simple_children->group, name, + &simple_children_type); + + return &simple_children->group; +} + +static struct configfs_attribute group_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *group_children_attrs[] = { + &group_children_attr_description, + NULL, +}; + +static ssize_t group_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[03-group-children]\n" +"\n" +"This subsystem allows the creation of child config_groups. These\n" +"groups are like the subsystem simple-children.\n"); +} + +static struct configfs_item_operations group_children_item_ops = { + .show_attribute = group_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations group_children_group_ops = { + .make_group = group_children_make_group, +}; + +static struct config_item_type group_children_type = { + .ct_item_ops = &group_children_item_ops, + .ct_group_ops = &group_children_group_ops, + .ct_attrs = group_children_attrs, +}; + +static struct configfs_subsystem group_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "03-group-children", + .ci_type = &group_children_type, + }, + }, +}; + +/* ----------------------------------------------------------------- */ + +/* + * We're now done with our subsystem definitions. + * For convenience in this module, here's a list of them all. It + * allows the init function to easily register them. Most modules + * will only have one subsystem, and will only call register_subsystem + * on it directly. 
+ */ +static struct configfs_subsystem *example_subsys[] = { + &childless_subsys.subsys, + &simple_children_subsys, + &group_children_subsys, + NULL, +}; + +static int __init configfs_example_init(void) +{ + int ret; + int i; + struct configfs_subsystem *subsys; + + for (i = 0; example_subsys[i]; i++) { + subsys = example_subsys[i]; + + config_group_init(&subsys->su_group); + init_MUTEX(&subsys->su_sem); + ret = configfs_register_subsystem(subsys); + if (ret) { + printk(KERN_ERR "Error %d while registering subsystem %s\n", + ret, + subsys->su_group.cg_item.ci_namebuf); + goto out_unregister; + } + } + + return 0; + +out_unregister: + for (; i >= 0; i--) { + configfs_unregister_subsystem(example_subsys[i]); + } + + return ret; +} + +static void __exit configfs_example_exit(void) +{ + int i; + + for (i = 0; example_subsys[i]; i++) { + configfs_unregister_subsystem(example_subsys[i]); + } +} + +module_init(configfs_example_init); +module_exit(configfs_example_exit); +MODULE_LICENSE("GPL"); diff -ruN linux-2.6.13-rc6.old/fs/Kconfig linux-2.6.13-rc6/fs/Kconfig --- linux-2.6.13-rc6.old/fs/Kconfig 2005-08-07 11:18:56.000000000 -0700 +++ linux-2.6.13-rc6/fs/Kconfig 2005-08-12 18:09:46.778349585 -0700 @@ -859,6 +859,20 @@ To compile this as a module, choose M here: the module will be called ramfs. +config CONFIGFS_FS + tristate "Userspace-driven configuration filesystem (EXPERIMENTAL)" + depends on EXPERIMENTAL + help + configfs is a ram-based filesystem that provides the converse + of sysfs's functionality. Where sysfs is a filesystem-based + view of kernel objects, configfs is a filesystem-based manager + of kernel objects, or config_items. + + Both sysfs and configfs can and should exist together on the + same system. One is not a replacement for the other. + + If unsure, say N. + endmenu menu "Miscellaneous filesystems" diff -ruN linux-2.6.13-rc6.old/fs/Makefile linux-2.6.13-rc6/fs/Makefile --- linux-2.6.13-rc6.old/fs/Makefile 2005-08-07 11:18:56.000000000 -0700 +++ linux-2.6.13-rc6/fs/Makefile 2005-08-12 18:09:46.778349585 -0700 @@ -98,3 +98,4 @@ obj-$(CONFIG_HOSTFS) += hostfs/ obj-$(CONFIG_HPPFS) += hppfs/ obj-$(CONFIG_DEBUG_FS) += debugfs/ +obj-$(CONFIG_CONFIGFS_FS) += configfs/ diff -ruN linux-2.6.13-rc6.old/fs/configfs/Makefile linux-2.6.13-rc6/fs/configfs/Makefile --- linux-2.6.13-rc6.old/fs/configfs/Makefile 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/fs/configfs/Makefile 2005-08-12 18:09:46.784349796 -0700 @@ -0,0 +1,7 @@ +# +# Makefile for the configfs virtual filesystem +# + +obj-$(CONFIG_CONFIGFS_FS) += configfs.o + +configfs-objs := inode.o file.o dir.o symlink.o mount.o item.o diff -ruN linux-2.6.13-rc6.old/fs/configfs/configfs_internal.h linux-2.6.13-rc6/fs/configfs/configfs_internal.h --- linux-2.6.13-rc6.old/fs/configfs/configfs_internal.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/fs/configfs/configfs_internal.h 2005-08-12 18:09:46.784349796 -0700 @@ -0,0 +1,142 @@ +/* -*- mode: c; c-basic-offset:8; -*- + * vim: noexpandtab sw=8 ts=8 sts=0: + * + * configfs_internal.h - Internal stuff for configfs + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include +#include + +struct configfs_dirent { + atomic_t s_count; + struct list_head s_sibling; + struct list_head s_children; + struct list_head s_links; + void * s_element; + int s_type; + umode_t s_mode; + struct dentry * s_dentry; +}; + +#define CONFIGFS_ROOT 0x0001 +#define CONFIGFS_DIR 0x0002 +#define CONFIGFS_ITEM_ATTR 0x0004 +#define CONFIGFS_ITEM_LINK 0x0020 +#define CONFIGFS_USET_DIR 0x0040 +#define CONFIGFS_USET_DEFAULT 0x0080 +#define CONFIGFS_USET_DROPPING 0x0100 +#define CONFIGFS_NOT_PINNED (CONFIGFS_ITEM_ATTR) + +extern struct vfsmount * configfs_mount; + +extern int configfs_is_root(struct config_item *item); + +extern struct inode * configfs_new_inode(mode_t mode); +extern int configfs_create(struct dentry *, int mode, int (*init)(struct inode *)); + +extern int configfs_create_file(struct config_item *, const struct configfs_attribute *); +extern int configfs_make_dirent(struct configfs_dirent *, + struct dentry *, void *, umode_t, int); + +extern int configfs_add_file(struct dentry *, const struct configfs_attribute *, int); +extern void configfs_hash_and_remove(struct dentry * dir, const char * name); + +extern const unsigned char * configfs_get_name(struct configfs_dirent *sd); +extern void configfs_drop_dentry(struct configfs_dirent *sd, struct dentry *parent); + +extern int configfs_pin_fs(void); +extern void configfs_release_fs(void); + +extern struct rw_semaphore configfs_rename_sem; +extern struct super_block * configfs_sb; +extern struct file_operations configfs_dir_operations; +extern struct file_operations configfs_file_operations; +extern struct file_operations bin_fops; +extern struct inode_operations configfs_dir_inode_operations; +extern struct inode_operations configfs_symlink_inode_operations; + +extern int configfs_symlink(struct inode *dir, struct dentry *dentry, + const char *symname); +extern int configfs_unlink(struct inode *dir, struct dentry *dentry); + +struct configfs_symlink { + struct list_head sl_list; + struct config_item *sl_target; +}; + +extern int configfs_create_link(struct configfs_symlink *sl, + struct dentry *parent, + struct dentry *dentry); + +static inline struct config_item * to_item(struct dentry * dentry) +{ + struct configfs_dirent * sd = dentry->d_fsdata; + return ((struct config_item *) sd->s_element); +} + +static inline struct configfs_attribute * to_attr(struct dentry * dentry) +{ + struct configfs_dirent * sd = dentry->d_fsdata; + return ((struct configfs_attribute *) sd->s_element); +} + +static inline struct config_item *configfs_get_config_item(struct dentry *dentry) +{ + struct config_item * item = NULL; + + spin_lock(&dcache_lock); + if (!d_unhashed(dentry)) { + struct configfs_dirent * sd = dentry->d_fsdata; + if (sd->s_type & CONFIGFS_ITEM_LINK) { + struct configfs_symlink * sl = sd->s_element; + item = config_item_get(sl->sl_target); + } else + item = config_item_get(sd->s_element); + } + spin_unlock(&dcache_lock); + + return item; +} + +static inline void release_configfs_dirent(struct configfs_dirent * sd) +{ + if (!(sd->s_type & CONFIGFS_ROOT)) + kfree(sd); +} + +static inline struct 
configfs_dirent * configfs_get(struct configfs_dirent * sd) +{ + if (sd) { + WARN_ON(!atomic_read(&sd->s_count)); + atomic_inc(&sd->s_count); + } + return sd; +} + +static inline void configfs_put(struct configfs_dirent * sd) +{ + WARN_ON(!atomic_read(&sd->s_count)); + if (atomic_dec_and_test(&sd->s_count)) + release_configfs_dirent(sd); +} + diff -ruN linux-2.6.13-rc6.old/fs/configfs/dir.c linux-2.6.13-rc6/fs/configfs/dir.c --- linux-2.6.13-rc6.old/fs/configfs/dir.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/fs/configfs/dir.c 2005-08-12 18:09:46.786349866 -0700 @@ -0,0 +1,1102 @@ +/* -*- mode: c; c-basic-offset: 8; -*- + * vim: noexpandtab sw=8 ts=8 sts=0: + * + * dir.c - Operations for configfs directories. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#undef DEBUG + +#include +#include +#include +#include + +#include +#include "configfs_internal.h" + +DECLARE_RWSEM(configfs_rename_sem); + +static void configfs_d_iput(struct dentry * dentry, + struct inode * inode) +{ + struct configfs_dirent * sd = dentry->d_fsdata; + + if (sd) { + BUG_ON(sd->s_dentry != dentry); + sd->s_dentry = NULL; + configfs_put(sd); + } + iput(inode); +} + +/* + * We _must_ delete our dentries on last dput, as the chain-to-parent + * behavior is required to clear the parents of default_groups. + */ +static int configfs_d_delete(struct dentry *dentry) +{ + return 1; +} + +static struct dentry_operations configfs_dentry_ops = { + .d_iput = configfs_d_iput, + /* simple_delete_dentry() isn't exported */ + .d_delete = configfs_d_delete, +}; + +/* + * Allocates a new configfs_dirent and links it to the parent configfs_dirent + */ +static struct configfs_dirent *configfs_new_dirent(struct configfs_dirent * parent_sd, + void * element) +{ + struct configfs_dirent * sd; + + sd = kmalloc(sizeof(*sd), GFP_KERNEL); + if (!sd) + return NULL; + + memset(sd, 0, sizeof(*sd)); + atomic_set(&sd->s_count, 1); + INIT_LIST_HEAD(&sd->s_links); + INIT_LIST_HEAD(&sd->s_children); + list_add(&sd->s_sibling, &parent_sd->s_children); + sd->s_element = element; + + return sd; +} + +int configfs_make_dirent(struct configfs_dirent * parent_sd, + struct dentry * dentry, void * element, + umode_t mode, int type) +{ + struct configfs_dirent * sd; + + sd = configfs_new_dirent(parent_sd, element); + if (!sd) + return -ENOMEM; + + sd->s_mode = mode; + sd->s_type = type; + sd->s_dentry = dentry; + if (dentry) { + dentry->d_fsdata = configfs_get(sd); + dentry->d_op = &configfs_dentry_ops; + } + + return 0; +} + +static int init_dir(struct inode * inode) +{ + inode->i_op = &configfs_dir_inode_operations; + inode->i_fop = &configfs_dir_operations; + + /* directory inodes start off with i_nlink == 2 (for "." 
entry) */ + inode->i_nlink++; + return 0; +} + +static int init_file(struct inode * inode) +{ + inode->i_size = PAGE_SIZE; + inode->i_fop = &configfs_file_operations; + return 0; +} + +static int init_symlink(struct inode * inode) +{ + inode->i_op = &configfs_symlink_inode_operations; + return 0; +} + +static int create_dir(struct config_item * k, struct dentry * p, + struct dentry * d) +{ + int error; + umode_t mode = S_IFDIR| S_IRWXU | S_IRUGO | S_IXUGO; + + error = configfs_create(d, mode, init_dir); + if (!error) { + error = configfs_make_dirent(p->d_fsdata, d, k, mode, + CONFIGFS_DIR); + if (!error) { + p->d_inode->i_nlink++; + (d)->d_op = &configfs_dentry_ops; + } + } + return error; +} + + +/** + * configfs_create_dir - create a directory for an config_item. + * @item: config_itemwe're creating directory for. + * @dentry: config_item's dentry. + */ + +static int configfs_create_dir(struct config_item * item, struct dentry *dentry) +{ + struct dentry * parent; + int error = 0; + + BUG_ON(!item); + + if (item->ci_parent) + parent = item->ci_parent->ci_dentry; + else if (configfs_mount && configfs_mount->mnt_sb) + parent = configfs_mount->mnt_sb->s_root; + else + return -EFAULT; + + error = create_dir(item,parent,dentry); + if (!error) + item->ci_dentry = dentry; + return error; +} + +int configfs_create_link(struct configfs_symlink *sl, + struct dentry *parent, + struct dentry *dentry) +{ + int err = 0; + umode_t mode = S_IFLNK | S_IRWXUGO; + + err = configfs_create(dentry, mode, init_symlink); + if (!err) { + err = configfs_make_dirent(parent->d_fsdata, dentry, sl, + mode, CONFIGFS_ITEM_LINK); + if (!err) + dentry->d_op = &configfs_dentry_ops; + } + return err; +} + +static void remove_dir(struct dentry * d) +{ + struct dentry * parent = dget(d->d_parent); + struct configfs_dirent * sd; + + sd = d->d_fsdata; + list_del_init(&sd->s_sibling); + configfs_put(sd); + if (d->d_inode) + simple_rmdir(parent->d_inode,d); + + pr_debug(" o %s removing done (%d)\n",d->d_name.name, + atomic_read(&d->d_count)); + + dput(parent); +} + +/** + * configfs_remove_dir - remove an config_item's directory. + * @item: config_item we're removing. + * + * The only thing special about this is that we remove any files in + * the directory before we remove the directory, and we've inlined + * what used to be configfs_rmdir() below, instead of calling separately. + */ + +static void configfs_remove_dir(struct config_item * item) +{ + struct dentry * dentry = dget(item->ci_dentry); + + if (!dentry) + return; + + remove_dir(dentry); + /** + * Drop reference from dget() on entrance. 
+ */ + dput(dentry); +} + + +/* attaches attribute's configfs_dirent to the dentry corresponding to the + * attribute file + */ +static int configfs_attach_attr(struct configfs_dirent * sd, struct dentry * dentry) +{ + struct configfs_attribute * attr = sd->s_element; + int error; + + error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG, init_file); + if (error) + return error; + + dentry->d_op = &configfs_dentry_ops; + dentry->d_fsdata = configfs_get(sd); + sd->s_dentry = dentry; + d_rehash(dentry); + + return 0; +} + +static struct dentry * configfs_lookup(struct inode *dir, + struct dentry *dentry, + struct nameidata *nd) +{ + struct configfs_dirent * parent_sd = dentry->d_parent->d_fsdata; + struct configfs_dirent * sd; + int found = 0; + int err = 0; + + list_for_each_entry(sd, &parent_sd->s_children, s_sibling) { + if (sd->s_type & CONFIGFS_NOT_PINNED) { + const unsigned char * name = configfs_get_name(sd); + + if (strcmp(name, dentry->d_name.name)) + continue; + + found = 1; + err = configfs_attach_attr(sd, dentry); + break; + } + } + + if (!found) { + /* + * If it doesn't exist and it isn't a NOT_PINNED item, + * it must be negative. + */ + return simple_lookup(dir, dentry, nd); + } + + return ERR_PTR(err); +} + +/* + * Only subdirectories count here. Files (CONFIGFS_NOT_PINNED) are + * attributes and are removed by rmdir(). We recurse, taking i_sem + * on all children that are candidates for default detach. If the + * result is clean, then configfs_detach_group() will handle dropping + * i_sem. If there is an error, the caller will clean up the i_sem + * holders via configfs_detach_rollback(). + */ +static int configfs_detach_prep(struct dentry *dentry) +{ + struct configfs_dirent *parent_sd = dentry->d_fsdata; + struct configfs_dirent *sd; + int ret; + + ret = -EBUSY; + if (!list_empty(&parent_sd->s_links)) + goto out; + + ret = 0; + list_for_each_entry(sd, &parent_sd->s_children, s_sibling) { + if (sd->s_type & CONFIGFS_NOT_PINNED) + continue; + if (sd->s_type & CONFIGFS_USET_DEFAULT) { + down(&sd->s_dentry->d_inode->i_sem); + /* Mark that we've taken i_sem */ + sd->s_type |= CONFIGFS_USET_DROPPING; + + ret = configfs_detach_prep(sd->s_dentry); + if (!ret) + continue; + } else + ret = -ENOTEMPTY; + + break; + } + +out: + return ret; +} + +/* + * Walk the tree, dropping i_sem wherever CONFIGFS_USET_DROPPING is + * set. + */ +static void configfs_detach_rollback(struct dentry *dentry) +{ + struct configfs_dirent *parent_sd = dentry->d_fsdata; + struct configfs_dirent *sd; + + list_for_each_entry(sd, &parent_sd->s_children, s_sibling) { + if (sd->s_type & CONFIGFS_USET_DEFAULT) { + configfs_detach_rollback(sd->s_dentry); + + if (sd->s_type & CONFIGFS_USET_DROPPING) { + sd->s_type &= ~CONFIGFS_USET_DROPPING; + up(&sd->s_dentry->d_inode->i_sem); + } + } + } +} + +static void detach_attrs(struct config_item * item) +{ + struct dentry * dentry = dget(item->ci_dentry); + struct configfs_dirent * parent_sd; + struct configfs_dirent * sd, * tmp; + + if (!dentry) + return; + + pr_debug("configfs %s: dropping attrs for dir\n", + dentry->d_name.name); + + parent_sd = dentry->d_fsdata; + list_for_each_entry_safe(sd, tmp, &parent_sd->s_children, s_sibling) { + if (!sd->s_element || !(sd->s_type & CONFIGFS_NOT_PINNED)) + continue; + list_del_init(&sd->s_sibling); + configfs_drop_dentry(sd, dentry); + configfs_put(sd); + } + + /** + * Drop reference from dget() on entrance. 
+ */ + dput(dentry); +} + +static int populate_attrs(struct config_item *item) +{ + struct config_item_type *t = item->ci_type; + struct configfs_attribute *attr; + int error = 0; + int i; + + if (!t) + return -EINVAL; + if (t->ct_attrs) { + for (i = 0; (attr = t->ct_attrs[i]) != NULL; i++) { + if ((error = configfs_create_file(item, attr))) + break; + } + } + + if (error) + detach_attrs(item); + + return error; +} + +static int configfs_attach_group(struct config_item *parent_item, + struct config_item *item, + struct dentry *dentry); +static void configfs_detach_group(struct config_item *item); + +static void detach_groups(struct config_group *group) +{ + struct dentry * dentry = dget(group->cg_item.ci_dentry); + struct dentry *child; + struct configfs_dirent *parent_sd; + struct configfs_dirent *sd, *tmp; + + if (!dentry) + return; + + parent_sd = dentry->d_fsdata; + list_for_each_entry_safe(sd, tmp, &parent_sd->s_children, s_sibling) { + if (!sd->s_element || + !(sd->s_type & CONFIGFS_USET_DEFAULT)) + continue; + + child = sd->s_dentry; + + configfs_detach_group(sd->s_element); + child->d_inode->i_flags |= S_DEAD; + + /* + * From rmdir/unregister, a configfs_detach_prep() pass + * has taken our i_sem for us. Drop it. + * From mkdir/register cleanup, there is no sem held. + */ + if (sd->s_type & CONFIGFS_USET_DROPPING) + up(&child->d_inode->i_sem); + + d_delete(child); + dput(child); + } + + /** + * Drop reference from dget() on entrance. + */ + dput(dentry); +} + +/* + * This fakes mkdir(2) on a default_groups[] entry. It + * creates a dentry, attachs it, and then does fixup + * on the sd->s_type. + * + * We could, perhaps, tweak our parent's ->mkdir for a minute and + * try using vfs_mkdir. Just a thought. + */ +static int create_default_group(struct config_group *parent_group, + struct config_group *group) +{ + int ret; + struct qstr name; + struct configfs_dirent *sd; + /* We trust the caller holds a reference to parent */ + struct dentry *child, *parent = parent_group->cg_item.ci_dentry; + + if (!group->cg_item.ci_name) + group->cg_item.ci_name = group->cg_item.ci_namebuf; + name.name = group->cg_item.ci_name; + name.len = strlen(name.name); + name.hash = full_name_hash(name.name, name.len); + + ret = -ENOMEM; + child = d_alloc(parent, &name); + if (child) { + d_add(child, NULL); + + ret = configfs_attach_group(&parent_group->cg_item, + &group->cg_item, child); + if (!ret) { + sd = child->d_fsdata; + sd->s_type |= CONFIGFS_USET_DEFAULT; + } else { + d_delete(child); + dput(child); + } + } + + return ret; +} + +static int populate_groups(struct config_group *group) +{ + struct config_group *new_group; + struct dentry *dentry = group->cg_item.ci_dentry; + int ret = 0; + int i; + + if (group && group->default_groups) { + /* FYI, we're faking mkdir here + * I'm not sure we need this semaphore, as we're called + * from our parent's mkdir. That holds our parent's + * i_sem, so afaik lookup cannot continue through our + * parent to find us, let alone mess with our tree. + * That said, taking our i_sem is closer to mkdir + * emulation, and shouldn't hurt. */ + down(&dentry->d_inode->i_sem); + + for (i = 0; group->default_groups[i]; i++) { + new_group = group->default_groups[i]; + + ret = create_default_group(group, new_group); + if (ret) + break; + } + + up(&dentry->d_inode->i_sem); + } + + if (ret) + detach_groups(group); + + return ret; +} + +/* + * All of link_obj/unlink_obj/link_group/unlink_group require that + * subsys->su_sem is held. 
+ */ + +static void unlink_obj(struct config_item *item) +{ + struct config_group *group; + + group = item->ci_group; + if (group) { + list_del_init(&item->ci_entry); + + item->ci_group = NULL; + item->ci_parent = NULL; + config_item_put(item); + + config_group_put(group); + } +} + +static void link_obj(struct config_item *parent_item, struct config_item *item) +{ + /* Parent seems redundant with group, but it makes certain + * traversals much nicer. */ + item->ci_parent = parent_item; + item->ci_group = config_group_get(to_config_group(parent_item)); + list_add_tail(&item->ci_entry, &item->ci_group->cg_children); + + config_item_get(item); +} + +static void unlink_group(struct config_group *group) +{ + int i; + struct config_group *new_group; + + if (group->default_groups) { + for (i = 0; group->default_groups[i]; i++) { + new_group = group->default_groups[i]; + unlink_group(new_group); + } + } + + group->cg_subsys = NULL; + unlink_obj(&group->cg_item); +} + +static void link_group(struct config_group *parent_group, struct config_group *group) +{ + int i; + struct config_group *new_group; + struct configfs_subsystem *subsys = NULL; /* gcc is a turd */ + + link_obj(&parent_group->cg_item, &group->cg_item); + + if (parent_group->cg_subsys) + subsys = parent_group->cg_subsys; + else if (configfs_is_root(&parent_group->cg_item)) + subsys = to_configfs_subsystem(group); + else + BUG(); + group->cg_subsys = subsys; + + if (group->default_groups) { + for (i = 0; group->default_groups[i]; i++) { + new_group = group->default_groups[i]; + link_group(group, new_group); + } + } +} + +/* + * The goal is that configfs_attach_item() (and + * configfs_attach_group()) can be called from either the VFS or this + * module. That is, they assume that the items have been created, + * the dentry allocated, and the dcache is all ready to go. + * + * If they fail, they must clean up after themselves as if they + * had never been called. The caller (VFS or local function) will + * handle cleaning up the dcache bits. + * + * configfs_detach_group() and configfs_detach_item() behave similarly on + * the way out. They assume that the proper semaphores are held, they + * clean up the configfs items, and they expect their callers will + * handle the dcache bits. + */ +static int configfs_attach_item(struct config_item *parent_item, + struct config_item *item, + struct dentry *dentry) +{ + int ret; + + ret = configfs_create_dir(item, dentry); + if (!ret) { + ret = populate_attrs(item); + if (ret) { + configfs_remove_dir(item); + d_delete(dentry); + } + } + + return ret; +} + +static void configfs_detach_item(struct config_item *item) +{ + detach_attrs(item); + configfs_remove_dir(item); +} + +static int configfs_attach_group(struct config_item *parent_item, + struct config_item *item, + struct dentry *dentry) +{ + int ret; + struct configfs_dirent *sd; + + ret = configfs_attach_item(parent_item, item, dentry); + if (!ret) { + sd = dentry->d_fsdata; + sd->s_type |= CONFIGFS_USET_DIR; + + ret = populate_groups(to_config_group(item)); + if (ret) { + configfs_detach_item(item); + d_delete(dentry); + } + } + + return ret; +} + +static void configfs_detach_group(struct config_item *item) +{ + detach_groups(to_config_group(item)); + configfs_detach_item(item); +} + +/* + * Drop the initial reference from make_item()/make_group() + * This function assumes that reference is held on item + * and that item holds a valid reference to the parent. Also, it + * assumes the caller has validated ci_type. 
+ */ +static void client_drop_item(struct config_item *parent_item, + struct config_item *item) +{ + struct config_item_type *type; + + type = parent_item->ci_type; + BUG_ON(!type); + + if (type->ct_group_ops && type->ct_group_ops->drop_item) + type->ct_group_ops->drop_item(to_config_group(parent_item), + item); + else + config_item_put(item); +} + + +static int configfs_mkdir(struct inode *dir, struct dentry *dentry, int mode) +{ + int ret; + struct config_group *group; + struct config_item *item; + struct config_item *parent_item; + struct configfs_subsystem *subsys; + struct configfs_dirent *sd; + struct config_item_type *type; + struct module *owner; + char *name; + + if (dentry->d_parent == configfs_sb->s_root) + return -EPERM; + + sd = dentry->d_parent->d_fsdata; + if (!(sd->s_type & CONFIGFS_USET_DIR)) + return -EPERM; + + parent_item = configfs_get_config_item(dentry->d_parent); + type = parent_item->ci_type; + subsys = to_config_group(parent_item)->cg_subsys; + BUG_ON(!subsys); + + if (!type || !type->ct_group_ops || + (!type->ct_group_ops->make_group && + !type->ct_group_ops->make_item)) { + config_item_put(parent_item); + return -EPERM; /* What lack-of-mkdir returns */ + } + + name = kmalloc(dentry->d_name.len + 1, GFP_KERNEL); + if (!name) { + config_item_put(parent_item); + return -ENOMEM; + } + snprintf(name, dentry->d_name.len + 1, "%s", dentry->d_name.name); + + down(&subsys->su_sem); + group = NULL; + item = NULL; + if (type->ct_group_ops->make_group) { + group = type->ct_group_ops->make_group(to_config_group(parent_item), name); + if (group) { + link_group(to_config_group(parent_item), group); + item = &group->cg_item; + } + } else { + item = type->ct_group_ops->make_item(to_config_group(parent_item), name); + if (item) + link_obj(parent_item, item); + } + up(&subsys->su_sem); + + kfree(name); + if (!item) { + config_item_put(parent_item); + return -ENOMEM; + } + + ret = -EINVAL; + type = item->ci_type; + if (type) { + owner = type->ct_owner; + if (try_module_get(owner)) { + if (group) { + ret = configfs_attach_group(parent_item, + item, + dentry); + } else { + ret = configfs_attach_item(parent_item, + item, + dentry); + } + + if (ret) { + down(&subsys->su_sem); + if (group) + unlink_group(group); + else + unlink_obj(item); + client_drop_item(parent_item, item); + up(&subsys->su_sem); + + config_item_put(parent_item); + module_put(owner); + } + } + } + + return ret; +} + +static int configfs_rmdir(struct inode *dir, struct dentry *dentry) +{ + struct config_item *parent_item; + struct config_item *item; + struct configfs_subsystem *subsys; + struct configfs_dirent *sd; + struct module *owner = NULL; + int ret; + + if (dentry->d_parent == configfs_sb->s_root) + return -EPERM; + + sd = dentry->d_fsdata; + if (sd->s_type & CONFIGFS_USET_DEFAULT) + return -EPERM; + + parent_item = configfs_get_config_item(dentry->d_parent); + subsys = to_config_group(parent_item)->cg_subsys; + BUG_ON(!subsys); + + if (!parent_item->ci_type) { + config_item_put(parent_item); + return -EINVAL; + } + + ret = configfs_detach_prep(dentry); + if (ret) { + configfs_detach_rollback(dentry); + config_item_put(parent_item); + return ret; + } + + item = configfs_get_config_item(dentry); + + /* Drop reference from above, item already holds one. 
*/ + config_item_put(parent_item); + + if (item->ci_type) + owner = item->ci_type->ct_owner; + + if (sd->s_type & CONFIGFS_USET_DIR) { + configfs_detach_group(item); + + down(&subsys->su_sem); + unlink_group(to_config_group(item)); + } else { + configfs_detach_item(item); + + down(&subsys->su_sem); + unlink_obj(item); + } + + client_drop_item(parent_item, item); + up(&subsys->su_sem); + + /* Drop our reference from above */ + config_item_put(item); + + module_put(owner); + + return 0; +} + +struct inode_operations configfs_dir_inode_operations = { + .mkdir = configfs_mkdir, + .rmdir = configfs_rmdir, + .symlink = configfs_symlink, + .unlink = configfs_unlink, + .lookup = configfs_lookup, +}; + +#if 0 +int configfs_rename_dir(struct config_item * item, const char *new_name) +{ + int error = 0; + struct dentry * new_dentry, * parent; + + if (!strcmp(config_item_name(item), new_name)) + return -EINVAL; + + if (!item->parent) + return -EINVAL; + + down_write(&configfs_rename_sem); + parent = item->parent->dentry; + + down(&parent->d_inode->i_sem); + + new_dentry = lookup_one_len(new_name, parent, strlen(new_name)); + if (!IS_ERR(new_dentry)) { + if (!new_dentry->d_inode) { + error = config_item_set_name(item, "%s", new_name); + if (!error) { + d_add(new_dentry, NULL); + d_move(item->dentry, new_dentry); + } + else + d_delete(new_dentry); + } else + error = -EEXIST; + dput(new_dentry); + } + up(&parent->d_inode->i_sem); + up_write(&configfs_rename_sem); + + return error; +} +#endif + +static int configfs_dir_open(struct inode *inode, struct file *file) +{ + struct dentry * dentry = file->f_dentry; + struct configfs_dirent * parent_sd = dentry->d_fsdata; + + down(&dentry->d_inode->i_sem); + file->private_data = configfs_new_dirent(parent_sd, NULL); + up(&dentry->d_inode->i_sem); + + return file->private_data ? 
0 : -ENOMEM; + +} + +static int configfs_dir_close(struct inode *inode, struct file *file) +{ + struct dentry * dentry = file->f_dentry; + struct configfs_dirent * cursor = file->private_data; + + down(&dentry->d_inode->i_sem); + list_del_init(&cursor->s_sibling); + up(&dentry->d_inode->i_sem); + + release_configfs_dirent(cursor); + + return 0; +} + +/* Relationship between s_mode and the DT_xxx types */ +static inline unsigned char dt_type(struct configfs_dirent *sd) +{ + return (sd->s_mode >> 12) & 15; +} + +static int configfs_readdir(struct file * filp, void * dirent, filldir_t filldir) +{ + struct dentry *dentry = filp->f_dentry; + struct configfs_dirent * parent_sd = dentry->d_fsdata; + struct configfs_dirent *cursor = filp->private_data; + struct list_head *p, *q = &cursor->s_sibling; + ino_t ino; + int i = filp->f_pos; + + switch (i) { + case 0: + ino = dentry->d_inode->i_ino; + if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0) + break; + filp->f_pos++; + i++; + /* fallthrough */ + case 1: + ino = parent_ino(dentry); + if (filldir(dirent, "..", 2, i, ino, DT_DIR) < 0) + break; + filp->f_pos++; + i++; + /* fallthrough */ + default: + if (filp->f_pos == 2) { + list_del(q); + list_add(q, &parent_sd->s_children); + } + for (p=q->next; p!= &parent_sd->s_children; p=p->next) { + struct configfs_dirent *next; + const char * name; + int len; + + next = list_entry(p, struct configfs_dirent, + s_sibling); + if (!next->s_element) + continue; + + name = configfs_get_name(next); + len = strlen(name); + if (next->s_dentry) + ino = next->s_dentry->d_inode->i_ino; + else + ino = iunique(configfs_sb, 2); + + if (filldir(dirent, name, len, filp->f_pos, ino, + dt_type(next)) < 0) + return 0; + + list_del(q); + list_add(q, p); + p = q; + filp->f_pos++; + } + } + return 0; +} + +static loff_t configfs_dir_lseek(struct file * file, loff_t offset, int origin) +{ + struct dentry * dentry = file->f_dentry; + + down(&dentry->d_inode->i_sem); + switch (origin) { + case 1: + offset += file->f_pos; + case 0: + if (offset >= 0) + break; + default: + up(&file->f_dentry->d_inode->i_sem); + return -EINVAL; + } + if (offset != file->f_pos) { + file->f_pos = offset; + if (file->f_pos >= 2) { + struct configfs_dirent *sd = dentry->d_fsdata; + struct configfs_dirent *cursor = file->private_data; + struct list_head *p; + loff_t n = file->f_pos - 2; + + list_del(&cursor->s_sibling); + p = sd->s_children.next; + while (n && p != &sd->s_children) { + struct configfs_dirent *next; + next = list_entry(p, struct configfs_dirent, + s_sibling); + if (next->s_element) + n--; + p = p->next; + } + list_add_tail(&cursor->s_sibling, p); + } + } + up(&dentry->d_inode->i_sem); + return offset; +} + +struct file_operations configfs_dir_operations = { + .open = configfs_dir_open, + .release = configfs_dir_close, + .llseek = configfs_dir_lseek, + .read = generic_read_dir, + .readdir = configfs_readdir, +}; + +int configfs_register_subsystem(struct configfs_subsystem *subsys) +{ + int err; + struct config_group *group = &subsys->su_group; + struct qstr name; + struct dentry *dentry; + struct configfs_dirent *sd; + + err = configfs_pin_fs(); + if (err) + return err; + + if (!group->cg_item.ci_name) + group->cg_item.ci_name = group->cg_item.ci_namebuf; + + sd = configfs_sb->s_root->d_fsdata; + link_group(to_config_group(sd->s_element), group); + + down(&configfs_sb->s_root->d_inode->i_sem); + + name.name = group->cg_item.ci_name; + name.len = strlen(name.name); + name.hash = full_name_hash(name.name, name.len); + + err = -ENOMEM; + dentry = 
d_alloc(configfs_sb->s_root, &name); + if (!dentry) + goto out_release; + + d_add(dentry, NULL); + + err = configfs_attach_group(sd->s_element, &group->cg_item, + dentry); + if (!err) + dentry = NULL; + else + d_delete(dentry); + + up(&configfs_sb->s_root->d_inode->i_sem); + + if (dentry) { + dput(dentry); +out_release: + unlink_group(group); + configfs_release_fs(); + } + + return err; +} + +void configfs_unregister_subsystem(struct configfs_subsystem *subsys) +{ + struct config_group *group = &subsys->su_group; + struct dentry *dentry = group->cg_item.ci_dentry; + + if (dentry->d_parent != configfs_sb->s_root) { + printk(KERN_ERR "configfs: Tried to unregister non-subsystem!\n"); + return; + } + + down(&configfs_sb->s_root->d_inode->i_sem); + down(&dentry->d_inode->i_sem); + if (configfs_detach_prep(dentry)) { + printk(KERN_ERR "configfs: Tried to unregister non-empty subsystem!\n"); + } + configfs_detach_group(&group->cg_item); + dentry->d_inode->i_flags |= S_DEAD; + up(&dentry->d_inode->i_sem); + + d_delete(dentry); + + up(&configfs_sb->s_root->d_inode->i_sem); + + dput(dentry); + + unlink_group(group); + configfs_release_fs(); +} + +EXPORT_SYMBOL(configfs_register_subsystem); +EXPORT_SYMBOL(configfs_unregister_subsystem); diff -ruN linux-2.6.13-rc6.old/fs/configfs/file.c linux-2.6.13-rc6/fs/configfs/file.c --- linux-2.6.13-rc6.old/fs/configfs/file.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/fs/configfs/file.c 2005-08-12 18:09:46.786349866 -0700 @@ -0,0 +1,360 @@ +/* -*- mode: c; c-basic-offset: 8; -*- + * vim: noexpandtab sw=8 ts=8 sts=0: + * + * file.c - operations for regular (text) files. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include +#include +#include +#include +#include +#include + +#include +#include "configfs_internal.h" + + +struct configfs_buffer { + size_t count; + loff_t pos; + char * page; + struct configfs_item_operations * ops; + struct semaphore sem; + int needs_read_fill; +}; + + +/** + * fill_read_buffer - allocate and fill buffer from item. + * @dentry: dentry pointer. + * @buffer: data buffer for file. + * + * Allocate @buffer->page, if it hasn't been already, then call the + * config_item's show() method to fill the buffer with this attribute's + * data. + * This is called only once, on the file's first read. 
+ */ +static int fill_read_buffer(struct dentry * dentry, struct configfs_buffer * buffer) +{ + struct configfs_attribute * attr = to_attr(dentry); + struct config_item * item = to_item(dentry->d_parent); + struct configfs_item_operations * ops = buffer->ops; + int ret = 0; + ssize_t count; + + if (!buffer->page) + buffer->page = (char *) get_zeroed_page(GFP_KERNEL); + if (!buffer->page) + return -ENOMEM; + + count = ops->show_attribute(item,attr,buffer->page); + buffer->needs_read_fill = 0; + BUG_ON(count > (ssize_t)PAGE_SIZE); + if (count >= 0) + buffer->count = count; + else + ret = count; + return ret; +} + + +/** + * flush_read_buffer - push buffer to userspace. + * @buffer: data buffer for file. + * @userbuf: user-passed buffer. + * @count: number of bytes requested. + * @ppos: file position. + * + * Copy the buffer we filled in fill_read_buffer() to userspace. + * This is done at the reader's leisure, copying and advancing + * the amount they specify each time. + * This may be called continuously until the buffer is empty. + */ +static int flush_read_buffer(struct configfs_buffer * buffer, char __user * buf, + size_t count, loff_t * ppos) +{ + int error; + + if (*ppos > buffer->count) + return 0; + + if (count > (buffer->count - *ppos)) + count = buffer->count - *ppos; + + error = copy_to_user(buf,buffer->page + *ppos,count); + if (!error) + *ppos += count; + return error ? -EFAULT : count; +} + +/** + * configfs_read_file - read an attribute. + * @file: file pointer. + * @buf: buffer to fill. + * @count: number of bytes to read. + * @ppos: starting offset in file. + * + * Userspace wants to read an attribute file. The attribute descriptor + * is in the file's ->d_fsdata. The target item is in the directory's + * ->d_fsdata. + * + * We call fill_read_buffer() to allocate and fill the buffer from the + * item's show() method exactly once (if the read is happening from + * the beginning of the file). That should fill the entire buffer with + * all the data the item has to offer for that attribute. + * We then call flush_read_buffer() to copy the buffer to userspace + * in the increments specified. + */ + +static ssize_t +configfs_read_file(struct file *file, char __user *buf, size_t count, loff_t *ppos) +{ + struct configfs_buffer * buffer = file->private_data; + ssize_t retval = 0; + + down(&buffer->sem); + if (buffer->needs_read_fill) { + if ((retval = fill_read_buffer(file->f_dentry,buffer))) + goto out; + } + pr_debug("%s: count = %d, ppos = %lld, buf = %s\n", + __FUNCTION__,count,*ppos,buffer->page); + retval = flush_read_buffer(buffer,buf,count,ppos); +out: + up(&buffer->sem); + return retval; +} + + +/** + * fill_write_buffer - copy buffer from userspace. + * @buffer: data buffer for file. + * @userbuf: data from user. + * @count: number of bytes in @userbuf. + * + * Allocate @buffer->page if it hasn't been already, then + * copy the user-supplied buffer into it. + */ + +static int +fill_write_buffer(struct configfs_buffer * buffer, const char __user * buf, size_t count) +{ + int error; + + if (!buffer->page) + buffer->page = (char *)get_zeroed_page(GFP_KERNEL); + if (!buffer->page) + return -ENOMEM; + + if (count > PAGE_SIZE) + count = PAGE_SIZE; + error = copy_from_user(buffer->page,buf,count); + buffer->needs_read_fill = 1; + return error ? -EFAULT : count; +} + + +/** + * flush_write_buffer - push buffer to config_item. + * @file: file pointer. + * @buffer: data buffer for file. 
+ * + * Get the correct pointers for the config_item and the attribute we're + * dealing with, then call the store() method for the attribute, + * passing the buffer that we acquired in fill_write_buffer(). + */ + +static int +flush_write_buffer(struct dentry * dentry, struct configfs_buffer * buffer, size_t count) +{ + struct configfs_attribute * attr = to_attr(dentry); + struct config_item * item = to_item(dentry->d_parent); + struct configfs_item_operations * ops = buffer->ops; + + return ops->store_attribute(item,attr,buffer->page,count); +} + + +/** + * configfs_write_file - write an attribute. + * @file: file pointer + * @buf: data to write + * @count: number of bytes + * @ppos: starting offset + * + * Similar to configfs_read_file(), though working in the opposite direction. + * We allocate and fill the data from the user in fill_write_buffer(), + * then push it to the config_item in flush_write_buffer(). + * There is no easy way for us to know if userspace is only doing a partial + * write, so we don't support them. We expect the entire buffer to come + * on the first write. + * Hint: if you're writing a value, first read the file, modify only the + * the value you're changing, then write entire buffer back. + */ + +static ssize_t +configfs_write_file(struct file *file, const char __user *buf, size_t count, loff_t *ppos) +{ + struct configfs_buffer * buffer = file->private_data; + + down(&buffer->sem); + count = fill_write_buffer(buffer,buf,count); + if (count > 0) + count = flush_write_buffer(file->f_dentry,buffer,count); + if (count > 0) + *ppos += count; + up(&buffer->sem); + return count; +} + +static int check_perm(struct inode * inode, struct file * file) +{ + struct config_item *item = configfs_get_config_item(file->f_dentry->d_parent); + struct configfs_attribute * attr = to_attr(file->f_dentry); + struct configfs_buffer * buffer; + struct configfs_item_operations * ops = NULL; + int error = 0; + + if (!item || !attr) + goto Einval; + + /* Grab the module reference for this attribute if we have one */ + if (!try_module_get(attr->ca_owner)) { + error = -ENODEV; + goto Done; + } + + if (item->ci_type) + ops = item->ci_type->ct_item_ops; + else + goto Eaccess; + + /* File needs write support. + * The inode's perms must say it's ok, + * and we must have a store method. + */ + if (file->f_mode & FMODE_WRITE) { + + if (!(inode->i_mode & S_IWUGO) || !ops->store_attribute) + goto Eaccess; + + } + + /* File needs read support. + * The inode's perms must say it's ok, and we there + * must be a show method for it. + */ + if (file->f_mode & FMODE_READ) { + if (!(inode->i_mode & S_IRUGO) || !ops->show_attribute) + goto Eaccess; + } + + /* No error? Great, allocate a buffer for the file, and store it + * it in file->private_data for easy access. 
+ */ + buffer = kmalloc(sizeof(struct configfs_buffer),GFP_KERNEL); + if (buffer) { + memset(buffer,0,sizeof(struct configfs_buffer)); + init_MUTEX(&buffer->sem); + buffer->needs_read_fill = 1; + buffer->ops = ops; + file->private_data = buffer; + } else + error = -ENOMEM; + goto Done; + + Einval: + error = -EINVAL; + goto Done; + Eaccess: + error = -EACCES; + module_put(attr->ca_owner); + Done: + if (error && item) + config_item_put(item); + return error; +} + +static int configfs_open_file(struct inode * inode, struct file * filp) +{ + return check_perm(inode,filp); +} + +static int configfs_release(struct inode * inode, struct file * filp) +{ + struct config_item * item = to_item(filp->f_dentry->d_parent); + struct configfs_attribute * attr = to_attr(filp->f_dentry); + struct module * owner = attr->ca_owner; + struct configfs_buffer * buffer = filp->private_data; + + if (item) + config_item_put(item); + /* After this point, attr should not be accessed. */ + module_put(owner); + + if (buffer) { + if (buffer->page) + free_page((unsigned long)buffer->page); + kfree(buffer); + } + return 0; +} + +struct file_operations configfs_file_operations = { + .read = configfs_read_file, + .write = configfs_write_file, + .llseek = generic_file_llseek, + .open = configfs_open_file, + .release = configfs_release, +}; + + +int configfs_add_file(struct dentry * dir, const struct configfs_attribute * attr, int type) +{ + struct configfs_dirent * parent_sd = dir->d_fsdata; + umode_t mode = (attr->ca_mode & S_IALLUGO) | S_IFREG; + int error = 0; + + down(&dir->d_inode->i_sem); + error = configfs_make_dirent(parent_sd, NULL, (void *) attr, mode, type); + up(&dir->d_inode->i_sem); + + return error; +} + + +/** + * configfs_create_file - create an attribute file for an item. + * @item: item we're creating for. + * @attr: atrribute descriptor. + */ + +int configfs_create_file(struct config_item * item, const struct configfs_attribute * attr) +{ + BUG_ON(!item || !item->ci_dentry || !attr); + + return configfs_add_file(item->ci_dentry, attr, + CONFIGFS_ITEM_ATTR); +} + diff -ruN linux-2.6.13-rc6.old/fs/configfs/inode.c linux-2.6.13-rc6/fs/configfs/inode.c --- linux-2.6.13-rc6.old/fs/configfs/inode.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/fs/configfs/inode.c 2005-08-12 18:09:46.787349901 -0700 @@ -0,0 +1,162 @@ +/* -*- mode: c; c-basic-offset: 8; -*- + * vim: noexpandtab sw=8 ts=8 sts=0: + * + * inode.c - basic inode and dentry operations. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + * + * Please see Documentation/filesystems/configfs.txt for more information. 
+ */ + +#undef DEBUG + +#include +#include +#include + +#include +#include "configfs_internal.h" + +extern struct super_block * configfs_sb; + +static struct address_space_operations configfs_aops = { + .readpage = simple_readpage, + .prepare_write = simple_prepare_write, + .commit_write = simple_commit_write +}; + +static struct backing_dev_info configfs_backing_dev_info = { + .ra_pages = 0, /* No readahead */ + .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, +}; + +struct inode * configfs_new_inode(mode_t mode) +{ + struct inode * inode = new_inode(configfs_sb); + if (inode) { + inode->i_mode = mode; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_mapping->a_ops = &configfs_aops; + inode->i_mapping->backing_dev_info = &configfs_backing_dev_info; + } + return inode; +} + +int configfs_create(struct dentry * dentry, int mode, int (*init)(struct inode *)) +{ + int error = 0; + struct inode * inode = NULL; + if (dentry) { + if (!dentry->d_inode) { + if ((inode = configfs_new_inode(mode))) { + if (dentry->d_parent && dentry->d_parent->d_inode) { + struct inode *p_inode = dentry->d_parent->d_inode; + p_inode->i_mtime = p_inode->i_ctime = CURRENT_TIME; + } + goto Proceed; + } + else + error = -ENOMEM; + } else + error = -EEXIST; + } else + error = -ENOENT; + goto Done; + + Proceed: + if (init) + error = init(inode); + if (!error) { + d_instantiate(dentry, inode); + if (S_ISDIR(mode) || S_ISLNK(mode)) + dget(dentry); /* pin link and directory dentries in core */ + } else + iput(inode); + Done: + return error; +} + +/* + * Get the name for corresponding element represented by the given configfs_dirent + */ +const unsigned char * configfs_get_name(struct configfs_dirent *sd) +{ + struct attribute * attr; + + if (!sd || !sd->s_element) + BUG(); + + /* These always have a dentry, so use that */ + if (sd->s_type & (CONFIGFS_DIR | CONFIGFS_ITEM_LINK)) + return sd->s_dentry->d_name.name; + + if (sd->s_type & CONFIGFS_ITEM_ATTR) { + attr = sd->s_element; + return attr->name; + } + return NULL; +} + + +/* + * Unhashes the dentry corresponding to given configfs_dirent + * Called with parent inode's i_sem held. 
+ */ +void configfs_drop_dentry(struct configfs_dirent * sd, struct dentry * parent) +{ + struct dentry * dentry = sd->s_dentry; + + if (dentry) { + spin_lock(&dcache_lock); + if (!(d_unhashed(dentry) && dentry->d_inode)) { + dget_locked(dentry); + __d_drop(dentry); + spin_unlock(&dcache_lock); + simple_unlink(parent->d_inode, dentry); + } else + spin_unlock(&dcache_lock); + } +} + +void configfs_hash_and_remove(struct dentry * dir, const char * name) +{ + struct configfs_dirent * sd; + struct configfs_dirent * parent_sd = dir->d_fsdata; + + down(&dir->d_inode->i_sem); + list_for_each_entry(sd, &parent_sd->s_children, s_sibling) { + if (!sd->s_element) + continue; + if (!strcmp(configfs_get_name(sd), name)) { + list_del_init(&sd->s_sibling); + configfs_drop_dentry(sd, dir); + configfs_put(sd); + break; + } + } + up(&dir->d_inode->i_sem); +} + + diff -ruN linux-2.6.13-rc6.old/fs/configfs/item.c linux-2.6.13-rc6/fs/configfs/item.c --- linux-2.6.13-rc6.old/fs/configfs/item.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/fs/configfs/item.c 2005-08-12 18:09:46.787349901 -0700 @@ -0,0 +1,227 @@ +/* -*- mode: c; c-basic-offset: 8; -*- + * vim: noexpandtab sw=8 ts=8 sts=0: + * + * item.c - library routines for handling generic config items + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on kobject: + * kobject is Copyright (c) 2002-2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + * + * Please see the file Documentation/filesystems/configfs.txt for + * critical information about using the config_item interface. + */ + +#include +#include +#include +#include + +#include + + +static inline struct config_item * to_item(struct list_head * entry) +{ + return container_of(entry,struct config_item,ci_entry); +} + +/* Evil kernel */ +static void config_item_release(struct kref *kref); + +/** + * config_item_init - initialize item. + * @item: item in question. + */ +void config_item_init(struct config_item * item) +{ + kref_init(&item->ci_kref); + INIT_LIST_HEAD(&item->ci_entry); +} + +/** + * config_item_set_name - Set the name of an item + * @item: item. + * @name: name. + * + * If strlen(name) >= CONFIGFS_ITEM_NAME_LEN, then use a + * dynamically allocated string that @item->ci_name points to. + * Otherwise, use the static @item->ci_namebuf array. + */ + +int config_item_set_name(struct config_item * item, const char * fmt, ...) +{ + int error = 0; + int limit = CONFIGFS_ITEM_NAME_LEN; + int need; + va_list args; + char * name; + + /* + * First, try the static array + */ + va_start(args,fmt); + need = vsnprintf(item->ci_namebuf,limit,fmt,args); + va_end(args); + if (need < limit) + name = item->ci_namebuf; + else { + /* + * Need more space? 
Allocate it and try again + */ + limit = need + 1; + name = kmalloc(limit,GFP_KERNEL); + if (!name) { + error = -ENOMEM; + goto Done; + } + va_start(args,fmt); + need = vsnprintf(name,limit,fmt,args); + va_end(args); + + /* Still? Give up. */ + if (need >= limit) { + kfree(name); + error = -EFAULT; + goto Done; + } + } + + /* Free the old name, if necessary. */ + if (item->ci_name && item->ci_name != item->ci_namebuf) + kfree(item->ci_name); + + /* Now, set the new name */ + item->ci_name = name; + Done: + return error; +} + +EXPORT_SYMBOL(config_item_set_name); + +void config_item_init_type_name(struct config_item *item, + const char *name, + struct config_item_type *type) +{ + config_item_set_name(item, name); + item->ci_type = type; + config_item_init(item); +} +EXPORT_SYMBOL(config_item_init_type_name); + +void config_group_init_type_name(struct config_group *group, const char *name, + struct config_item_type *type) +{ + config_item_set_name(&group->cg_item, name); + group->cg_item.ci_type = type; + config_group_init(group); +} +EXPORT_SYMBOL(config_group_init_type_name); + +struct config_item * config_item_get(struct config_item * item) +{ + if (item) + kref_get(&item->ci_kref); + return item; +} + +/** + * config_item_cleanup - free config_item resources. + * @item: item. + */ + +void config_item_cleanup(struct config_item * item) +{ + struct config_item_type * t = item->ci_type; + struct config_group * s = item->ci_group; + struct config_item * parent = item->ci_parent; + + pr_debug("config_item %s: cleaning up\n",config_item_name(item)); + if (item->ci_name != item->ci_namebuf) + kfree(item->ci_name); + item->ci_name = NULL; + if (t && t->ct_item_ops && t->ct_item_ops->release) + t->ct_item_ops->release(item); + if (s) + config_group_put(s); + if (parent) + config_item_put(parent); +} + +static void config_item_release(struct kref *kref) +{ + config_item_cleanup(container_of(kref, struct config_item, ci_kref)); +} + +/** + * config_item_put - decrement refcount for item. + * @item: item. + * + * Decrement the refcount, and if 0, call config_item_cleanup(). + */ +void config_item_put(struct config_item * item) +{ + if (item) + kref_put(&item->ci_kref, config_item_release); +} + + +/** + * config_group_init - initialize a group for use + * @k: group + */ + +void config_group_init(struct config_group *group) +{ + config_item_init(&group->cg_item); + INIT_LIST_HEAD(&group->cg_children); +} + + +/** + * config_group_find_obj - search for item in group. + * @group: group we're looking in. + * @name: item's name. + * + * Lock group via @group->cg_subsys, and iterate over @group->cg_list, + * looking for a matching config_item. If matching item is found + * take a reference and return the item. + */ + +struct config_item * config_group_find_obj(struct config_group * group, const char * name) +{ + struct list_head * entry; + struct config_item * ret = NULL; + + /* XXX LOCKING! 
*/ + list_for_each(entry,&group->cg_children) { + struct config_item * item = to_item(entry); + if (config_item_name(item) && + !strcmp(config_item_name(item), name)) { + ret = config_item_get(item); + break; + } + } + return ret; +} + + +EXPORT_SYMBOL(config_item_init); +EXPORT_SYMBOL(config_group_init); +EXPORT_SYMBOL(config_item_get); +EXPORT_SYMBOL(config_item_put); + diff -ruN linux-2.6.13-rc6.old/fs/configfs/mount.c linux-2.6.13-rc6/fs/configfs/mount.c --- linux-2.6.13-rc6.old/fs/configfs/mount.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/fs/configfs/mount.c 2005-08-12 18:09:46.788349937 -0700 @@ -0,0 +1,149 @@ +/* -*- mode: c; c-basic-offset: 8; -*- + * vim: noexpandtab sw=8 ts=8 sts=0: + * + * mount.c - operations for initializing and mounting configfs. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include +#include +#include +#include +#include + +#include +#include "configfs_internal.h" + +/* Random magic number */ +#define CONFIGFS_MAGIC 0x62656570 + +struct vfsmount * configfs_mount = NULL; +struct super_block * configfs_sb = NULL; +static int configfs_mnt_count = 0; + +static struct super_operations configfs_ops = { + .statfs = simple_statfs, + .drop_inode = generic_delete_inode, +}; + +static struct config_group configfs_root_group = { + .cg_item = { + .ci_namebuf = "root", + .ci_name = configfs_root_group.cg_item.ci_namebuf, + }, +}; + +int configfs_is_root(struct config_item *item) +{ + return item == &configfs_root_group.cg_item; +} + +static struct configfs_dirent configfs_root = { + .s_sibling = LIST_HEAD_INIT(configfs_root.s_sibling), + .s_children = LIST_HEAD_INIT(configfs_root.s_children), + .s_element = &configfs_root_group.cg_item, + .s_type = CONFIGFS_ROOT, +}; + +static int configfs_fill_super(struct super_block *sb, void *data, int silent) +{ + struct inode *inode; + struct dentry *root; + + sb->s_blocksize = PAGE_CACHE_SIZE; + sb->s_blocksize_bits = PAGE_CACHE_SHIFT; + sb->s_magic = CONFIGFS_MAGIC; + sb->s_op = &configfs_ops; + configfs_sb = sb; + + inode = configfs_new_inode(S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); + if (inode) { + inode->i_op = &configfs_dir_inode_operations; + inode->i_fop = &configfs_dir_operations; + /* directory inodes start off with i_nlink == 2 (for "." 
entry) */ + inode->i_nlink++; + } else { + pr_debug("configfs: could not get root inode\n"); + return -ENOMEM; + } + + root = d_alloc_root(inode); + if (!root) { + pr_debug("%s: could not get root dentry!\n",__FUNCTION__); + iput(inode); + return -ENOMEM; + } + config_group_init(&configfs_root_group); + configfs_root_group.cg_item.ci_dentry = root; + root->d_fsdata = &configfs_root; + sb->s_root = root; + return 0; +} + +static struct super_block *configfs_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, configfs_fill_super); +} + +static struct file_system_type configfs_fs_type = { + .owner = THIS_MODULE, + .name = "configfs", + .get_sb = configfs_get_sb, + .kill_sb = kill_litter_super, +}; + +int configfs_pin_fs(void) +{ + return simple_pin_fs("configfs", &configfs_mount, + &configfs_mnt_count); +} + +void configfs_release_fs(void) +{ + simple_release_fs(&configfs_mount, &configfs_mnt_count); +} + +static int __init configfs_init(void) +{ + int err; + + err = register_filesystem(&configfs_fs_type); + if (err) { + printk(KERN_ERR "configfs: Unable to register filesystem!\n"); + } + + return err; +} + +static void __exit configfs_exit(void) +{ + unregister_filesystem(&configfs_fs_type); +} + +MODULE_AUTHOR("Oracle"); +MODULE_LICENSE("GPL"); +MODULE_VERSION("0.0.1"); +MODULE_DESCRIPTION("Simple RAM filesystem for user driven kernel subsystem configuration."); + +module_init(configfs_init); +module_exit(configfs_exit); diff -ruN linux-2.6.13-rc6.old/fs/configfs/symlink.c linux-2.6.13-rc6/fs/configfs/symlink.c --- linux-2.6.13-rc6.old/fs/configfs/symlink.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/fs/configfs/symlink.c 2005-08-12 18:09:46.788349937 -0700 @@ -0,0 +1,272 @@ +/* -*- mode: c; c-basic-offset: 8; -*- + * vim: noexpandtab sw=8 ts=8 sts=0: + * + * symlink.c - operations for configfs symlinks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. 
+ */ + +#include +#include +#include + +#include +#include "configfs_internal.h" + +static int item_depth(struct config_item * item) +{ + struct config_item * p = item; + int depth = 0; + do { depth++; } while ((p = p->ci_parent) && !configfs_is_root(p)); + return depth; +} + +static int item_path_length(struct config_item * item) +{ + struct config_item * p = item; + int length = 1; + do { + length += strlen(config_item_name(p)) + 1; + p = p->ci_parent; + } while (p && !configfs_is_root(p)); + return length; +} + +static void fill_item_path(struct config_item * item, char * buffer, int length) +{ + struct config_item * p; + + --length; + for (p = item; p && !configfs_is_root(p); p = p->ci_parent) { + int cur = strlen(config_item_name(p)); + + /* back up enough to print this bus id with '/' */ + length -= cur; + strncpy(buffer + length,config_item_name(p),cur); + *(buffer + --length) = '/'; + } +} + +static int create_link(struct config_item *parent_item, + struct config_item *item, + struct dentry *dentry) +{ + struct configfs_dirent *target_sd = item->ci_dentry->d_fsdata; + struct configfs_symlink *sl; + int ret; + + ret = -ENOMEM; + sl = kmalloc(sizeof(struct configfs_symlink), GFP_KERNEL); + if (sl) { + sl->sl_target = config_item_get(item); + /* FIXME: needs a lock, I'd bet */ + list_add(&sl->sl_list, &target_sd->s_links); + ret = configfs_create_link(sl, parent_item->ci_dentry, + dentry); + if (ret) { + list_del_init(&sl->sl_list); + config_item_put(item); + kfree(sl); + } + } + + return ret; +} + + +static int get_target(const char *symname, struct nameidata *nd, + struct config_item **target) +{ + int ret; + + ret = path_lookup(symname, LOOKUP_FOLLOW|LOOKUP_DIRECTORY, nd); + if (!ret) { + if (nd->dentry->d_sb == configfs_sb) { + *target = configfs_get_config_item(nd->dentry); + if (!*target) { + ret = -ENOENT; + path_release(nd); + } + } else + ret = -EPERM; + } + + return ret; +} + + +int configfs_symlink(struct inode *dir, struct dentry *dentry, const char *symname) +{ + int ret; + struct nameidata nd; + struct config_item *parent_item; + struct config_item *target_item; + struct config_item_type *type; + + ret = -EPERM; /* What lack-of-symlink returns */ + if (dentry->d_parent == configfs_sb->s_root) + goto out; + + parent_item = configfs_get_config_item(dentry->d_parent); + type = parent_item->ci_type; + + if (!type || !type->ct_item_ops || + !type->ct_item_ops->allow_link) + goto out_put; + + ret = get_target(symname, &nd, &target_item); + if (ret) + goto out_put; + + ret = type->ct_item_ops->allow_link(parent_item, target_item); + if (!ret) + ret = create_link(parent_item, target_item, dentry); + + config_item_put(target_item); + path_release(&nd); + +out_put: + config_item_put(parent_item); + +out: + return ret; +} + +int configfs_unlink(struct inode *dir, struct dentry *dentry) +{ + struct configfs_dirent *sd = dentry->d_fsdata; + struct configfs_symlink *sl; + struct config_item *parent_item; + struct config_item_type *type; + int ret; + + ret = -EPERM; /* What lack-of-symlink returns */ + if (!(sd->s_type & CONFIGFS_ITEM_LINK)) + goto out; + + if (dentry->d_parent == configfs_sb->s_root) + BUG(); + + sl = sd->s_element; + + parent_item = configfs_get_config_item(dentry->d_parent); + type = parent_item->ci_type; + + list_del_init(&sd->s_sibling); + configfs_drop_dentry(sd, dentry->d_parent); + dput(dentry); + configfs_put(sd); + + /* + * drop_link() must be called before + * list_del_init(&sl->sl_list), so that the order of + * drop_link(this, target) and drop_item(target) 
is preserved. + */ + if (type && type->ct_item_ops && + type->ct_item_ops->drop_link) + type->ct_item_ops->drop_link(parent_item, + sl->sl_target); + + /* FIXME: Needs lock */ + list_del_init(&sl->sl_list); + + /* Put reference from create_link() */ + config_item_put(sl->sl_target); + kfree(sl); + + config_item_put(parent_item); + + ret = 0; + +out: + return ret; +} + +static int configfs_get_target_path(struct config_item * item, struct config_item * target, + char *path) +{ + char * s; + int depth, size; + + depth = item_depth(item); + size = item_path_length(target) + depth * 3 - 1; + if (size > PATH_MAX) + return -ENAMETOOLONG; + + pr_debug("%s: depth = %d, size = %d\n", __FUNCTION__, depth, size); + + for (s = path; depth--; s += 3) + strcpy(s,"../"); + + fill_item_path(target, path, size); + pr_debug("%s: path = '%s'\n", __FUNCTION__, path); + + return 0; +} + +static int configfs_getlink(struct dentry *dentry, char * path) +{ + struct config_item *item, *target_item; + int error = 0; + + item = configfs_get_config_item(dentry->d_parent); + if (!item) + return -EINVAL; + + target_item = configfs_get_config_item(dentry); + if (!target_item) { + config_item_put(item); + return -EINVAL; + } + + down_read(&configfs_rename_sem); + error = configfs_get_target_path(item, target_item, path); + up_read(&configfs_rename_sem); + + config_item_put(item); + config_item_put(target_item); + return error; + +} + +static int configfs_follow_link(struct dentry *dentry, struct nameidata *nd) +{ + int error = -ENOMEM; + unsigned long page = get_zeroed_page(GFP_KERNEL); + if (page) + error = configfs_getlink(dentry, (char *) page); + nd_set_link(nd, error ? ERR_PTR(error) : (char *)page); + return 0; +} + +static void configfs_put_link(struct dentry *dentry, struct nameidata *nd) +{ + char *page = nd_get_link(nd); + if (!IS_ERR(page)) + free_page((unsigned long)page); +} + +struct inode_operations configfs_symlink_inode_operations = { + .follow_link = configfs_follow_link, + .readlink = generic_readlink, + .put_link = configfs_put_link, +}; + diff -ruN linux-2.6.13-rc6.old/include/linux/configfs.h linux-2.6.13-rc6/include/linux/configfs.h --- linux-2.6.13-rc6.old/include/linux/configfs.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.13-rc6/include/linux/configfs.h 2005-08-12 18:09:48.026393458 -0700 @@ -0,0 +1,205 @@ +/* -*- mode: c; c-basic-offset: 8; -*- + * vim: noexpandtab sw=8 ts=8 sts=0: + * + * configfs.h - definitions for the device driver filesystem + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * Based on kobject.h: + * Copyright (c) 2002-2003 Patrick Mochel + * Copyright (c) 2002-2003 Open Source Development Labs + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. 
+ * + * Please read Documentation/filesystems/configfs.txt before using the + * configfs interface, ESPECIALLY the parts about reference counts and + * item destructors. + */ + +#ifndef _CONFIGFS_H_ +#define _CONFIGFS_H_ + +#ifdef __KERNEL__ + +#include +#include +#include + +#include +#include + +#define CONFIGFS_ITEM_NAME_LEN 20 + +struct module; + +struct configfs_item_operations; +struct configfs_group_operations; +struct configfs_attribute; +struct configfs_subsystem; + +struct config_item { + char *ci_name; + char ci_namebuf[CONFIGFS_ITEM_NAME_LEN]; + struct kref ci_kref; + struct list_head ci_entry; + struct config_item *ci_parent; + struct config_group *ci_group; + struct config_item_type *ci_type; + struct dentry *ci_dentry; +}; + +extern int config_item_set_name(struct config_item *, const char *, ...); + +static inline char *config_item_name(struct config_item * item) +{ + return item->ci_name; +} + +extern void config_item_init(struct config_item *); +extern void config_item_init_type_name(struct config_item *item, + const char *name, + struct config_item_type *type); +extern void config_item_cleanup(struct config_item *); + +extern struct config_item * config_item_get(struct config_item *); +extern void config_item_put(struct config_item *); + +struct config_item_type { + struct module *ct_owner; + struct configfs_item_operations *ct_item_ops; + struct configfs_group_operations *ct_group_ops; + struct configfs_attribute **ct_attrs; +}; + + +/** + * group - a group of config_items of a specific type, belonging + * to a specific subsystem. + */ + +struct config_group { + struct config_item cg_item; + struct list_head cg_children; + struct configfs_subsystem *cg_subsys; + struct config_group **default_groups; +}; + + +extern void config_group_init(struct config_group *group); +extern void config_group_init_type_name(struct config_group *group, + const char *name, + struct config_item_type *type); + + +static inline struct config_group *to_config_group(struct config_item *item) +{ + return item ? container_of(item,struct config_group,cg_item) : NULL; +} + +static inline struct config_group *config_group_get(struct config_group *group) +{ + return group ? to_config_group(config_item_get(&group->cg_item)) : NULL; +} + +static inline void config_group_put(struct config_group *group) +{ + config_item_put(&group->cg_item); +} + +extern struct config_item *config_group_find_obj(struct config_group *, const char *); + + +struct configfs_attribute { + char *ca_name; + struct module *ca_owner; + mode_t ca_mode; +}; + + +/* + * If allow_link() exists, the item can symlink(2) out to other + * items. If the item is a group, it may support mkdir(2). + * Groups supply one of make_group() and make_item(). If the + * group supports make_group(), one can create group children. If it + * supports make_item(), one can create config_item children. If it has + * default_groups on group->default_groups, it has automatically created + * group children. default_groups may coexist alongsize make_group() or + * make_item(), but if the group wishes to have only default_groups + * children (disallowing mkdir(2)), it need not provide either function. + * If the group has commit(), it supports pending and commited (active) + * items. 
+ */
+struct configfs_item_operations {
+        void (*release)(struct config_item *);
+        ssize_t (*show_attribute)(struct config_item *, struct configfs_attribute *,char *);
+        ssize_t (*store_attribute)(struct config_item *,struct configfs_attribute *,const char *, size_t);
+        int (*allow_link)(struct config_item *src, struct config_item *target);
+        int (*drop_link)(struct config_item *src, struct config_item *target);
+};
+
+struct configfs_group_operations {
+        struct config_item *(*make_item)(struct config_group *group, const char *name);
+        struct config_group *(*make_group)(struct config_group *group, const char *name);
+        int (*commit_item)(struct config_item *item);
+        void (*drop_item)(struct config_group *group, struct config_item *item);
+};
+
+
+
+/**
+ * Use these macros to make defining attributes easier. See include/linux/device.h
+ * for examples..
+ */
+
+#if 0
+#define __ATTR(_name,_mode,_show,_store) { \
+        .attr = {.ca_name = __stringify(_name), .ca_mode = _mode, .ca_owner = THIS_MODULE }, \
+        .show = _show, \
+        .store = _store, \
+}
+
+#define __ATTR_RO(_name) { \
+        .attr = { .ca_name = __stringify(_name), .ca_mode = 0444, .ca_owner = THIS_MODULE }, \
+        .show = _name##_show, \
+}
+
+#define __ATTR_NULL { .attr = { .name = NULL } }
+
+#define attr_name(_attr) (_attr).attr.name
+#endif
+
+
+struct configfs_subsystem {
+        struct config_group su_group;
+        struct semaphore su_sem;
+};
+
+static inline struct configfs_subsystem *to_configfs_subsystem(struct config_group *group)
+{
+        return group ?
+                container_of(group, struct configfs_subsystem, su_group) :
+                NULL;
+}
+
+int configfs_register_subsystem(struct configfs_subsystem *subsys);
+void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
+
+#endif /* __KERNEL__ */
+
+#endif /* _CONFIGFS_H_ */

--

"What does it say about a society's priorities when the time you spend
in meetings on Monday is greater than the total number of hours you
spent sleeping over the weekend?"
        - Nat Friedman

http://www.jlbec.org/
jlbec at evilplan.org

From linux-cluster at redhat.com Sun Aug 21 17:11:07 2005
From: linux-cluster at redhat.com (Cluster2005)
Date: Sun, 21 Aug 2005 13:11:07 -0400
Subject: [Linux-cluster] IEEE Cluster2005 Registration Online
Message-ID: <20050821171103.D99744097E@villi.rcsnetworks.com>

see http://cluster2005.org

From jan at bruvoll.com Mon Aug 22 19:19:52 2005
From: jan at bruvoll.com (Jan Bruvoll)
Date: Mon, 22 Aug 2005 21:19:52 +0200
Subject: [Linux-cluster] Fencing woes
Message-ID: <430A2558.5060404@bruvoll.com>

Dear list,

I am having problems with a node where I can't get it to rejoin the fence domain. It has been rebooted before, and it has so far automatically joined the fence domain so that it could pick up the rest of the dependent services, but not this time. I upgraded the kernel and cluster/GFS suite (this is a Gentoo system) to gentoo-sources-2.6.12-r9 and cluster software v1.00.00.

I guess the biggest problem is that I don't know what to actually do to unfence the node that has been shut out. Since I have set the cluster up to use manual fencing, I suppose the un-fence command to use is fence_ack_manual; however, using that only produces a warning about a missing /tmp/fence_manual.fifo. Manually creating this fifo before running the command only removes the fifo -and- produces the warning.

This is what cman_tool services emits:

Service          Name        GID  LID  State  Code
Fence Domain:    "default"     0    2  join   S-2,2,1
[]

I don't seem to be able to find any information anywhere on the "Codes" - any pointers there?
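To make the fence_ack_manual attempt above concrete, the failing sequence boils down to roughly the following (the node name is a placeholder, and the exact invocation is whatever fence_ack_manual(8) documents for this version):

  # fence_ack_manual -n <fenced-node>      <- warns that /tmp/fence_manual.fifo is missing
  # mkfifo /tmp/fence_manual.fifo
  # fence_ack_manual -n <fenced-node>      <- removes the fifo again and prints the same warning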
The cluster has 6 members: one "file server" and five "clients". Excerpt from cluster.conf follows:

[...]

I also found this from dmesg - is this important?:

SM: process_reply invalid id=0 nodeid=4
SM: process_reply invalid id=0 nodeid=1
SM: process_reply invalid id=0 nodeid=2
SM: process_reply invalid id=0 nodeid=6
SM: process_reply invalid id=0 nodeid=5

Any help or pointers to more information would be most appreciated. I have read through everything I could find on the i'net without becoming much wiser, and the status today is that I can't upgrade single servers in my cluster without taking down the whole group - which is hardly useful.

Thanks in advance for any assistance!

Best regards
Jan Bruvoll

From Axel.Thimm at ATrpms.net Mon Aug 22 22:33:41 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Tue, 23 Aug 2005 00:33:41 +0200
Subject: [Linux-cluster] rgmanager/src/resources/samba.sh
Message-ID: <20050822223341.GI24127@neu.nirvana>

is missing :)

services.sh references "samba" as a further resource agent. Is this an old remnant still to be removed, or will there be a samba.sh?

In contrast to nfs, samba has too many options to map them adequately to parameters of a resource agent, so perhaps the idea is to have users copy /etc/init.d/smb and modify accordingly.
--
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL:

From Axel.Thimm at ATrpms.net Mon Aug 22 22:52:27 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Tue, 23 Aug 2005 00:52:27 +0200
Subject: [Linux-cluster] iptables protection wrapper; nfsexport.sh vs ip.sh racing
Message-ID: <20050822225227.GJ24127@neu.nirvana>

The typical NFS cluster setups seem to fail for Gigabit NFS/tcp. Some clients that are busy during the relocation of services either bail out with RPC garbage, or set the filesystem to EACCES, or time out for 17 min.

This has to do with some racing/timing in the NFS vs ip setup/teardown procedure. Protecting the service startup/shutdown with an iptables rule is a good workaround to fix this.

But what is the proper way to integrate this workaround? I could set up new resource agents, one with start=1 and another with start=6 to start/stop dropping packets. Or I could modify the current resource agents to allow for child entities and wrap one script around the service and one in the inner element.

I could probably also hack ip.sh to introduce some delay, to make sure the NFS services are really up/down before proceeding. Or maybe fix the true evil by making nfsexport.sh wait for NFS startup/stop completion (how?)?

What's the best way?
--
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From mark.fasheh at oracle.com Tue Aug 23 02:41:16 2005 From: mark.fasheh at oracle.com (Mark Fasheh) Date: Mon, 22 Aug 2005 19:41:16 -0700 Subject: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs In-Reply-To: <20050819071344.GB10864@redhat.com> References: <20050818060750.GA10133@redhat.com> <20050818212348.GW21228@ca-server1.us.oracle.com> <20050819071344.GB10864@redhat.com> Message-ID: <20050823024116.GY21228@ca-server1.us.oracle.com> On Fri, Aug 19, 2005 at 03:13:44PM +0800, David Teigland wrote: > The nodemanager RFC I sent a month ago > http://marc.theaimsgroup.com/?l=linux-kernel&m=112166723919347&w=2 > > amounts to half of dlm/config.c (everything under comms/ above) moved into > a separate kernel module. That would be trivial to do, and is still an > option to bat around. Yeah ok, so the address/id/local part is still there. As is much of the API to query those attributes. > I question whether factoring such a small chunk into a separate module is > really worth it, though? IMHO, yes. Mostly because we both have very similar basic requirements there and it seems a waste to have duplicated code (even if it's not a huge amount). Future projects wanting to query basic node information from the kernel could have simply used that API without having to further duplicate code too. That said, I'm not sure it has to be done *now* Was there anything in my comments which made going forward with that approach difficult for dlm? > Making all of config.c (all of /config/dlm/ above) into a separate module > wouldn't seem quite so strange. It would require just a few lines of code > to turn it into a stand alone module. Without the dlm specifics, right? It's perfectly fine with me if dlm has a couple more attributes that it wants on a node object - OCFS2 simply won't query them. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh at oracle.com From teigland at redhat.com Tue Aug 23 03:46:09 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 23 Aug 2005 11:46:09 +0800 Subject: [Linux-cluster] Fencing woes In-Reply-To: <430A2558.5060404@bruvoll.com> References: <430A2558.5060404@bruvoll.com> Message-ID: <20050823034609.GA13360@redhat.com> On Mon, Aug 22, 2005 at 09:19:52PM +0200, Jan Bruvoll wrote: > Dear list, > > I am having problems with a node where I can't get it to rejoin the > fence domain. It has been rebooted before, and it has so far > automatically joined the fence domain so that that it could pick up the > rest of the depending services, but not this time. I upgraded the kernel > and cluster/GFS suite (this is a Gentoo system) to > gentoo-sources-2.6.12-r9 and cluster software v1.00.00. Are the nodes running slightly different versions of the cluster software? They must all be running the same version -- there was a change to the cman message formats shortly before 1.00.00 was released. > I guess the biggest problem is that I don't know what to actually do to > unfence the node that has been shut out. Since I have set the cluster up > to use manual fencing, I suppose the un-fence command to use is > fence_ack_manual, however using that only produces a warning about a > missing /tmp/fence_manual.fifo. Manually creating this fifo before > running the command only removes the fifo -and- produces the warning. 
> This is what cman_tool services emits:
>
> Service          Name        GID  LID  State  Code
> Fence Domain:    "default"     0    2  join   S-2,2,1
> []

Manual fencing is hard to use and get right; the first recommendation is not to use it. You only need to run fence_ack_manual when instructed to do so by a message in /var/log/messages on some node.

Dave

From Hansjoerg.Maurer at dlr.de Tue Aug 23 05:30:55 2005
From: Hansjoerg.Maurer at dlr.de (=?ISO-8859-15?Q?Hansj=F6rg_Maurer?=)
Date: Tue, 23 Aug 2005 07:30:55 +0200
Subject: [Linux-cluster] iptables protection wrapper; nfsexport.sh vs ip.sh racing
In-Reply-To: <20050822225227.GJ24127@neu.nirvana>
References: <20050822225227.GJ24127@neu.nirvana>
Message-ID: <430AB48F.9000005@dlr.de>

Hi

Axel Thimm wrote:

>The typical NFS cluster setups seem to fail for Gigabit NFS/tcp. Some
>clients that are busy during the relocation of services either bail
>out with RPC garbage, or set the filesystem to EACCES, or time out for
>17 min.
>
>
we observe this problem too, using NFS over TCP.

Mounting the filesystem with -o tcp,timeo=600,retrans=1 reduces the timeout to about one minute on Linux and Solaris 10.

Greetings

Hansjörg

>This has to do with some racing/timing in the NFS vs ip setup/teardown
>procedure. Protecting the service startup/shutdown with an iptables
>rule is a good workaround to fix this.
>
>But what is the proper way to integrate this workaround? I could set up
>new resource agents, one with start=1 and another with start=6 to
>start/stop dropping packets. Or I could modify the current resource
>agents to allow for child entities and wrap one script around the
>service and one in the inner element.
>
>I could probably also hack ip.sh to introduce some delay, to make sure
>the NFS services are really up/down before proceeding. Or maybe fix
>the true evil by making nfsexport.sh wait for NFS startup/stop
>completion (how?)?
>
>What's the best way?
>
>
>------------------------------------------------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>

--
_________________________________________________________________

Dr. Hansjoerg Maurer          | LAN- & System-Manager
                              |
Deutsches Zentrum             | DLR Oberpfaffenhofen
f. Luft- und Raumfahrt e.V.   |
Institut f. Robotik           |
Postfach 1116                 | Muenchner Strasse 20
82230 Wessling                | 82234 Wessling
Germany                       |
                              |
Tel: 08153/28-2431            | E-mail: Hansjoerg.Maurer at dlr.de
Fax: 08153/28-1134            | WWW: http://www.robotic.dlr.de/
__________________________________________________________________

There are 10 types of people in this world, those who understand binary and those who don't.

From Rob.TerVeer at getronics.com Tue Aug 23 10:21:53 2005
From: Rob.TerVeer at getronics.com (Veer, Rob ter)
Date: Tue, 23 Aug 2005 12:21:53 +0200
Subject: [Linux-cluster] Workings of Tiebreaker IP (RHCS)
Message-ID: <09B5576A48518947960A89175FB6C23DE38604@excbebr204.europe.unity>

Hello,

To completely understand the role of a tiebreaker IP within a two or four node RHCS cluster, I've searched Red Hat and Google. I can't however find anything describing the precise workings of the tiebreaker IP. I would really like to know what happens exactly when the tiebreaker is used and how (maybe even some kind of flow diagram).

Can anyone here maybe explain that to me, or point me in the direction of more specific information regarding the tiebreaker?

Regards, Rob.
From sco at adviseo.fr Tue Aug 23 13:21:46 2005 From: sco at adviseo.fr (Sylvain COUTANT) Date: Tue, 23 Aug 2005 15:21:46 +0200 Subject: [Linux-cluster] gnbd/clvm and device mapper : 256 devices limitation ? Message-ID: <20050823132153.CCC383181C2@smtp.cegetel.net> Hello the list, I was wondering if it were at all possible to have more than 256 block devices shared in a cluster. I'd like to export gnbd devices (10-15) with volume groups on top. There would be many lvs (up to 256) in each vg. Question is : how the device mapper will handle this on each cluster member ? I ran a basic test by creating more than 256 lvs in a single vg and device mapper did create devices twice with the same major/minor (wrapping after minor 255). Basically, that would mean I won't be able to share more than 256 lvs amongst the entire architecture. This limitation is far too low for me. I'd prefer hear about 10000+ ;-) I know this question is not directly related to the cluster project (except the clvm part), but since I have chances to find here some people with knowledge about large architectures, I try it anyway ... Thanks in advance for any tip. -- Sylvain COUTANT ADVISEO http://www.adviseo.fr/ http://www.open-sp.fr/ Tel: +33 (0)1 30 42 72 95 Fax: +33 (0)1 30 42 72 95 Gsm: +33 (0)6 30 79 26 33 sco at adviseo.fr From djani22 at dynamicweb.hu Tue Aug 23 21:33:28 2005 From: djani22 at dynamicweb.hu (djani22 at dynamicweb.hu) Date: Tue, 23 Aug 2005 23:33:28 +0200 Subject: [Linux-cluster] GNBD 1.0.0-1 bug References: <20050818060750.GA10133@redhat.com><20050818212348.GW21228@ca-server1.us.oracle.com><20050819071344.GB10864@redhat.com> <20050823024116.GY21228@ca-server1.us.oracle.com> Message-ID: <028b01c5a82a$52732da0$0400a8c0@LocalHost> Hello list, I am not sure this is the right place, but I can't found better... I have found one bug in gnbd 1.0.0-1! There are the messages: #1 Unable to handle kernel paging request at virtual address a014d7a5 printing eip: c0118cee *pde = f7bedd02 Oops: 0000 [#1] SMP Modules linked in: netconsole gnbd CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010296 (2.6.13-rc6) EIP is at kmap+0x1e/0x54 eax: 00000246 ebx: a014d7a5 ecx: c11ef260 edx: cabbc400 esi: 00008000 edi: 00000001 ebp: f6c7fe00 esp: f6c7fdf4 ds: 007b es: 007b ss: 0068 Process md3_raid1 (pid: 2769, threadinfo=f6c7e000 task=f7eef020) Stack: c0577800 00000006 f5f93cfc f6c7fe54 f895a9cc a014d7a5 00000001 cf793000 00001000 00004000 d3fc3180 f73e9bf0 f895e718 cabbc400 007ea037 01000000 d4175a4c f895e6f0 65000000 00f03d8d 00100000 d4175a4c f895e6f0 f895e700 Call Trace: [] show_stack+0x9a/0xd0 [] show_registers+0x175/0x209 [] die+0xfa/0x17c [] do_page_fault+0x269/0x7bd [] error_code+0x4f/0x54 [] __gnbd_send_req+0x196/0x28d [gnbd] [] do_gnbd_request+0xe5/0x198 [gnbd] [] __generic_unplug_device+0x28/0x2e [] __elv_add_request+0xaa/0xac [] __make_request+0x20d/0x512 [] generic_make_request+0xb2/0x27a [] raid1d+0xbf/0x2cb [] md_thread+0x134/0x16f [] kernel_thread_helper+0x5/0xb Code: 89 c1 81 e1 ff ff 0f 00 eb b0 90 90 90 55 89 e5 53 83 ec 08 8b 5d 08 c7 44 24 04 06 00 00 00 c7 04 24 00 78 57 c0 e8 72 47 00 00 <8b> 03 c1 e8 1e 8b 14 85 14 db 73 c0 8b 82 0c 04 00 00 05 00 09 <0>Fatal exception: panic in 5 seconds #2 ------------[ cut here ]------------ kernel BUG at mm/highmem.c:183! 
invalid operand: 0000 [#1] SMP Modules linked in: netconsole gnbd CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010246 (2.6.13-rc6) EIP is at kunmap_high+0x1f/0x93 eax: 00000000 ebx: c1c1a3a0 ecx: c0740e88 edx: 00000292 esi: 00001000 edi: 00000000 ebp: f6e81df8 esp: f6e81df0 ds: 007b es: 007b ss: 0068 Process md2_raid1 (pid: 2813, threadinfo=f6e80000 task=f7eba020) Stack: c1c1a3a0 d3708c10 f6e81e00 c0118d66 f6e81e54 f895a9fd c1c1a3a0 00000001 d0990000 00001000 00004000 ebddff84 f6cb5d8c f895e620 d452f900 007ea037 01000000 f1c1d9ac f895e5f8 b9000000 00106937 00100000 f1c1d9ac f895e5f8 Call Trace: [] show_stack+0x9a/0xd0 [] show_registers+0x175/0x209 [] die+0xfa/0x17c [] do_trap+0x7e/0xb2 [] do_invalid_op+0xa9/0xb3 [] error_code+0x4f/0x54 [] kunmap+0x42/0x44 [] __gnbd_send_req+0x1c7/0x28d [gnbd] [] do_gnbd_request+0xe5/0x198 [gnbd] [] __generic_unplug_device+0x28/0x2e [] __elv_add_request+0xaa/0xac [] __make_request+0x20d/0x512 [] generic_make_request+0xb2/0x27a [] raid1d+0xbf/0x2cb [] md_thread+0x134/0x16f [] kernel_thread_helper+0x5/0xb Code: e8 08 06 00 00 89 c7 e9 38 ff ff ff 55 89 e5 53 83 ec 04 89 c3 b8 80 6c 6c c0 e8 59 c3 40 00 89 1c 24 e8 e6 05 00 00 85 c0 75 08 <0f> 0b b7 00 0a 78 57 c0 05 00 00 40 00 c1 e8 0c 8b 14 85 20 dc #3 ------------[ cut here ]------------ kernel BUG at mm/highmem.c:183! invalid operand: 0000 [#1] SMP Modules linked in: netconsole gnbd CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010246 (2.6.13-rc6) EIP is at kunmap_high+0x1f/0x93 eax: 00000000 ebx: c25c99a0 ecx: c0740508 edx: 00000292 esi: 00001000 edi: 00000000 ebp: f6c3dd34 esp: f6c3dd2c ds: 007b es: 007b ss: 0068 Process md10_raid1 (pid: 2865, threadinfo=f6c3c000 task=f7be4020) Stack: c25c99a0 e287c8c0 f6c3dd3c c0118d66 f6c3dd90 f895a9fd c25c99a0 00000001 f6d59000 00001000 00004000 f6c3ddb4 c010377e f895e810 dd1c4200 007ea037 01000000 f10392cc f6c3ddb4 20000000 00004899 00100000 f10392cc f895e7e8 Call Trace: [] show_stack+0x9a/0xd0 [] show_registers+0x175/0x209 [] die+0xfa/0x17c [] do_trap+0x7e/0xb2 [] do_invalid_op+0xa9/0xb3 [] error_code+0x4f/0x54 [] kunmap+0x42/0x44 [] __gnbd_send_req+0x1c7/0x28d [gnbd] [] do_gnbd_request+0xe5/0x198 [gnbd] [] __generic_unplug_device+0x28/0x2e [] __make_request+0x23f/0x512 [] generic_make_request+0xb2/0x27a [] submit_bio+0x51/0xe7 [] md_super_write+0x75/0x7d [] md_update_sb+0xd1/0x1e2 [] md_check_recovery+0x197/0x3e9 [] raid1d+0x22/0x2cb [] md_thread+0x134/0x16f [] kernel_thread_helper+0x5/0xb [43139186.670000] Code: e8 08 06 00 00 89 c7 e9 38 ff ff ff 55 89 e5 53 83 ec 04 89 c3 b8 80 6c 6c c0 e8 59 c3 40 00 89 1c 2 4 e8 e6 05 00 00 85 c0 75 08 <0f> 0b b7 00 0a 78 57 c0 05 00 00 40 00 c1 e8 0c 8b 14 85 20 dc <0>Fatal exception: panic in 5 seconds And additionally I have 2 deadlock-messages, if it helps for something... I hope I can help some.... Thanks Janos From brianu at silvercash.com Wed Aug 24 17:24:43 2005 From: brianu at silvercash.com (brianu) Date: Wed, 24 Aug 2005 10:24:43 -0700 Subject: [Linux-cluster] gnbd/clvm and device mapper : 256 devices limitation ? Message-ID: <20050824172215.332115A867C@mail.silvercash.com> Hello, I have a setup that is similar (except not on that scale), but I'm missing one piece, I keep getting duplicate PV's when I run vgcreate -aly & have not had any luck setting up multipath in LVM, although I know this is not the response you wanted, could you share how you got device-mapper to create devices twice with the same major/minor numbers? Regards, Brian Urrutia Price Communications Inc. 
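As an aside for anyone checking whether device-mapper has started wrapping minor numbers: the major/minor pairs of the nodes under /dev/mapper can simply be compared for duplicates, along the lines of the sketch below (the field positions in the ls -l output are an assumption; adjust as needed):

  ls -l /dev/mapper | awk '$5 ~ /,/ {print $5 $6}' | sort | uniq -d

Any output means two mapped devices share the same major/minor pair.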
On Tue, 2005-08-23 at 12:01 -0400, linux-cluster-request at redhat.com wrote: > > Date: Tue, 23 Aug 2005 15:21:46 +0200 > From: "Sylvain COUTANT" A. > Subject: [Linux-cluster] gnbd/clvm and device mapper : 256 devices B. > limitation ? > To: > Message-ID: <20050823132153.CCC383181C2 at smtp.cegetel.net> > Content-Type: text/plain; charset="utf-8" > > Hello the list, > > I was wondering if it were at all possible to have more than 256 block > devices shared in a cluster. > > I'd like to export gnbd devices (10-15) with volume groups on top. > There would be many lvs (up to 256) in each vg. > > Question is : how the device mapper will handle this on each cluster > member ? > > I ran a basic test by creating more than 256 lvs in a single vg and > device mapper did create devices twice with the same major/minor > (wrapping after minor 255). > > Basically, that would mean I won't be able to share more than 256 lvs > amongst the entire architecture. This limitation is far too low for > me. I'd prefer hear about 10000+ ;-) > > I know this question is not directly related to the cluster project > (except the clvm part), but since I have chances to find here some > people with knowledge about large architectures, I try it anyway ... > > > Thanks in advance for any tip. From sco at adviseo.fr Wed Aug 24 17:29:45 2005 From: sco at adviseo.fr (Sylvain COUTANT) Date: Wed, 24 Aug 2005 19:29:45 +0200 Subject: [Linux-cluster] gnbd/clvm and device mapper : 256 deviceslimitation ? In-Reply-To: <20050824172215.332115A867C@mail.silvercash.com> Message-ID: <20050824172951.5A77D318514@smtp.cegetel.net> > could you share how you got > device-mapper to create devices twice with the same major/minor numbers? Quite easy : for a in `seq 1 300`; do lvcreate -L 32M -n test$a myvg; done This was able to create around 260 (don't remember the exact number) lvs before failing. The 20 last ones before failing (between 240 and 260) were created with duplicate major/minor (debian kernel 2.6.11.12). Regards, -- Sylvain COUTANT ADVISEO http://www.adviseo.fr/ http://www.open-sp.fr/ From fabiomirmar at gmail.com Wed Aug 24 18:56:25 2005 From: fabiomirmar at gmail.com (=?ISO-8859-1?Q?F=E1bio_Augusto?=) Date: Wed, 24 Aug 2005 15:56:25 -0300 Subject: [Linux-cluster] Problems while installing the Heartbeat 2.0.0 on a RH AS 3.0 Update 5 Message-ID: <180003a6050824115668c4641f@mail.gmail.com> Good Afternoon, I'm trying to install the Heartbeat 2.0.0 package to create a High Availability solution. I have two IBM XSeries x455 using Red Hat Advanced Server 3 Update 5 here. 
I've downloaded the following files from the linux-ha.org official download home page: - heartbeat-2.0.0-1.x86_64.rpm - heartbeat-pils-2.0.0-1.x86_64.rpm - heartbeat-stonith-2.0.0-1.x86_64.rpm While trying to install them using rpm -ivh *, I receive the following errors (dependencies): [root at svdb2-1 RPMS]# rpm -ivh * > /root/dependencies-heartbeat.log error: Failed dependencies: libc.so.6()(64bit) is needed by heartbeat-2.0.0-1 libc.so.6(GLIBC_2.2.5)(64bit) is needed by heartbeat-2.0.0-1 libc.so.6(GLIBC_2.3)(64bit) is needed by heartbeat-2.0.0-1 libm.so.6()(64bit) is needed by heartbeat-2.0.0-1 libnet.so.0()(64bit) is needed by heartbeat-2.0.0-1 libpthread.so.0(GLIBC_2.2.5)(64bit) is needed by heartbeat-2.0.0-1 libc.so.6()(64bit) is needed by heartbeat-pils-2.0.0-1 libc.so.6(GLIBC_2.2.5)(64bit) is needed by heartbeat-pils-2.0.0-1 libcrypto.so.0.9.7()(64bit) is needed by heartbeat-stonith-2.0.0-1 libc.so.6()(64bit) is needed by heartbeat-stonith-2.0.0-1 libc.so.6(GLIBC_2.2.5)(64bit) is needed by heartbeat-stonith-2.0.0-1 libc.so.6(GLIBC_2.3)(64bit) is needed by heartbeat-stonith-2.0.0-1 I checked using a #rpm -qa | grep -i glibc and I have the following files already installed: glibc-kernheaders-2.4-8.34.1 glibc-2.3.2-95.33 glibc-headers-2.3.2-95.33 glibc-common-2.3.2-95.33 glibc-2.3.2-95.33 glibc-profile-2.3.2-95.33 glibc-devel-2.3.2-95.33 glibc-utils-2.3.2-95.33 I checked which of the files above are missing using the following command: [root at svdb2-1 logs-glibc-fabio]# rpm -q --provides glibc | grep -i libc.so.6 libc.so.6 libc.so.6(GCC_3.0) libc.so.6(GLIBC_2.0) libc.so.6(GLIBC_2.1) libc.so.6(GLIBC_2.1.1) libc.so.6(GLIBC_2.1.2) libc.so.6(GLIBC_2.1.3) libc.so.6(GLIBC_2.2) libc.so.6(GLIBC_2.2.1) libc.so.6(GLIBC_2.2.2) libc.so.6(GLIBC_2.2.3) libc.so.6(GLIBC_2.2.4) libc.so.6(GLIBC_2.2.6) libc.so.6(GLIBC_2.3) libc.so.6(GLIBC_2.3.2) libc.so.6(GLIBC_2.3.3) libc.so.6.1()(64bit) libc.so.6.1(GLIBC_2.2)(64bit) libc.so.6.1(GLIBC_2.2.1)(64bit) libc.so.6.1(GLIBC_2.2.2)(64bit) libc.so.6.1(GLIBC_2.2.3)(64bit) libc.so.6.1(GLIBC_2.2.4)(64bit) libc.so.6.1(GLIBC_2.2.6)(64bit) libc.so.6.1(GLIBC_2.3)(64bit) libc.so.6.1(GLIBC_2.3.2)(64bit) libc.so.6.1(GLIBC_2.3.3)(64bit) Do someone has ever encountered that problem? Which package can I install to solve all the dependencies? Thanks a Lot!! -- F?bio Augusto Miranda Martins E-mail: fabiomirmar at gmail.com From jason_wilk at stircrazy.net Wed Aug 24 19:05:02 2005 From: jason_wilk at stircrazy.net (Jason Wilkinson) Date: Wed, 24 Aug 2005 14:05:02 -0500 Subject: [Linux-cluster] Quorum without fibre channel or shared scsi Message-ID: Hi all, I'm rather new to clustering...so I'd appreciate any help you may be able to give me. I'm trying to set up a test cluster in advance of purchasing our SAN. Currently I have several old laptops set up. What I'm trying to do is to find a way to set up the quorum partition without the more expensive infrastructure. Is it possible to do it using DRBD or some piece of software? Do any of you know where I could find a really good howto? Any help would be appreciated. 
Thanks in advance,
Jason

From eric at bootseg.com Wed Aug 24 19:17:21 2005
From: eric at bootseg.com (Eric Kerin)
Date: Wed, 24 Aug 2005 15:17:21 -0400
Subject: [Linux-cluster] Problems while installing the Heartbeat 2.0.0 on a RH AS 3.0 Update 5
In-Reply-To: <180003a6050824115668c4641f@mail.gmail.com>
References: <180003a6050824115668c4641f@mail.gmail.com>
Message-ID: <1124911042.4244.7.camel@auh5-0479.corp.jabil.org>

On Wed, 2005-08-24 at 15:56 -0300, Fábio Augusto wrote:
> Good Afternoon,
>
> I'm trying to install the Heartbeat 2.0.0 package to create a High
> Availability solution.
>
> I have two IBM XSeries x455 using Red Hat Advanced Server 3 Update 5 here.
>
> I've downloaded the following files from the linux-ha.org official
> download home page:
> - heartbeat-2.0.0-1.x86_64.rpm
> - heartbeat-pils-2.0.0-1.x86_64.rpm
> - heartbeat-stonith-2.0.0-1.x86_64.rpm

You most likely want the i586 packages instead of the x86_64 versions you downloaded, since you don't seem to have the x86_64 glibc installed.

Also, for help on linux-ha you might want to try the linux-ha mailing lists (http://linux-ha.org/ContactUs)

Thanks,
Eric

From hansjoerg.maurer at dlr.de Wed Aug 24 19:57:31 2005
From: hansjoerg.maurer at dlr.de (=?ISO-8859-1?Q?Hansj=F6rg_Maurer?=)
Date: Wed, 24 Aug 2005 21:57:31 +0200
Subject: [Linux-cluster] Problems running service exclusivly
Message-ID: <430CD12B.8060705@dlr.de>

Hi

I am trying to get the following config to work:

4 node cluster with
- 1 node providing a mysql service (only two nodes (bs1 and bs2) have the hardware to do so)
- 2 nodes providing a custom service (the same service on each node (bi1 and bi2))
- 1 node as fallback for both services

It is important that on every node only one service is running. I created the following cluster configuration (RHEL4U1), but the exclusive flag seems not to work.

I have created three prioritised restricted failoverdomains (one for each service). I assigned each service to one failoverdomain (with "run exclusively" on), but if I start the rgmanager, all services are running on one node. I can move them by hand to the right nodes, but this is not persistent.

Here is my cluster.conf:

I would be glad if someone could help me.

Greetings

Hansjörg
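For anyone trying to follow along, here is a minimal sketch of the kind of failoverdomain/service stanza being described (node and service names are taken from the description above and the attribute names follow the RHEL4 rgmanager schema, so treat this as an illustration rather than the original cluster.conf):

  <rm>
    <failoverdomains>
      <failoverdomain name="mysql-dom" ordered="1" restricted="1">
        <failoverdomainnode name="bs1" priority="1"/>
        <failoverdomainnode name="bs2" priority="2"/>
      </failoverdomain>
      <!-- two analogous domains for the custom service on bi1 and bi2 -->
    </failoverdomains>
    <service name="mysql" domain="mysql-dom" exclusive="1" autostart="1">
      <!-- mysql resources go here -->
    </service>
    <!-- plus one service per custom-service domain, each with exclusive="1" -->
  </rm>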