From buytenh at wantstofly.org Sun Aug 1 10:54:48 2004 From: buytenh at wantstofly.org (Lennert Buytenhek) Date: Sun, 1 Aug 2004 12:54:48 +0200 Subject: [Linux-cluster] new FC2 RPMS for CVS GFS snapshot of today Message-ID: <20040801105448.GA9682@xi.wantstofly.org> Hi, I've made some new FC2 GFS RPMS available for your enjoyment at: http://www2.wantstofly.org/gfs/20040801/ These are totally untested as of yet, since I unfortunately got distracted by some other projects requiring attention. Feedback would be most welcome. --L
From laza at yu.net Sun Aug 1 17:59:19 2004 From: laza at yu.net (Lazar Obradovic) Date: Sun, 01 Aug 2004 19:59:19 +0200 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1090861715.13809.3.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> Message-ID: <1091383159.32177.14.camel@laza.eunet.yu> ok, here's the patch for ibm blade fencing agent... qlogic sanbox2, coming up next :) On Mon, 2004-07-26 at 19:08, Lazar Obradovic wrote: > Hello all, > > I'd like to develop my own fencing agents (for IBM BladeCenter and > QLogic SANBox2 switches), but they will require SNMP bindings. > > Is that ok with general development philosophy, since I'd like to > contribute them? net-snmp-5.x.x-based API? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- -------------- next part -------------- A non-text attachment was scrubbed... Name: fence-ibmblade.patch Type: text/x-patch Size: 12640 bytes Desc: not available URL:
From laza at yu.net Sun Aug 1 23:40:55 2004 From: laza at yu.net (Lazar Obradovic) Date: Mon, 02 Aug 2004 01:40:55 +0200 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1091383159.32177.14.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> <1091383159.32177.14.camel@laza.eunet.yu> Message-ID: <1091403655.6495.17.camel@laza.eunet.yu> both things in one patch... On Sun, 2004-08-01 at 19:59, Lazar Obradovic wrote: > ok, here's the patch for ibm blade fencing agent... > qlogic sanbox2, coming up next :) > > On Mon, 2004-07-26 at 19:08, Lazar Obradovic wrote: > > Hello all, > > > > I'd like to develop my own fencing agents (for IBM BladeCenter and > > QLogic SANBox2 switches), but they will require SNMP bindings. > > > > Is that ok with general development philosophy, since I'd like to > > contribute them? net-snmp-5.x.x-based API? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- -------------- next part -------------- A non-text attachment was scrubbed... Name: fence-blade_sanbox2.patch Type: text/x-patch Size: 23698 bytes Desc: not available URL:
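Since both patches were scrubbed from the archive above, here is a rough sketch of what an SNMP-based fencing agent of this kind boils down to: one SNMP set against the switch or management module, followed by a read-back to verify. The address, community string and power-control OID below are placeholders, not values taken from Lazar's patches.

  #!/bin/sh
  # Sketch only: fence a blade by powering it off over SNMP with the
  # net-snmp command-line tools. All values here are placeholders.
  MM="192.168.1.10"                      # management module address (placeholder)
  COMMUNITY="private"                    # SNMP write community (placeholder)
  BAY="3"                                # blade bay to fence
  POWER_OID="<power-control-OID>.$BAY"   # hypothetical OID, indexed by bay
  snmpset -v1 -c "$COMMUNITY" "$MM" "$POWER_OID" i 0   # 0 = power off (assumed)
  snmpget -v1 -c "$COMMUNITY" "$MM" "$POWER_OID"       # read back to verify

A real agent would typically also read its options as key=value pairs on standard input, the way the fencing daemon invokes agents, and report success or failure through its exit status.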
From smelkovs at worldsoft.ch Mon Aug 2 11:27:19 2004 From: smelkovs at worldsoft.ch (Konrads Smelkovs) Date: Mon, 02 Aug 2004 13:27:19 +0200 Subject: [Linux-cluster] incompatible configurations Message-ID: <410E2517.4040505@worldsoft.ch> Hello, I have two nodes connected to the SAN (san1 and san2), and one that is not (eth1, for locking purposes). I've set the fencing method to manual. All three nodes run the same cluster configuration (same file). However, when eth1 connects to san1 or san2, the other boxes display: Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Config CRC doesn't match. ( 20681140 != 4177759411 ) Aug 2 13:26:21 san1 lock_gulmd_core[2762]: We gave them(cmsfe01) an error (1004:Incompatible configurations). Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Closing connection idx:4, fd:9 to 192.168.100.151 Why? P.S. This is for purely testing purposes.
From danderso at redhat.com Mon Aug 2 14:45:02 2004 From: danderso at redhat.com (Derek Anderson) Date: Mon, 2 Aug 2004 09:45:02 -0500 Subject: [Linux-cluster] incompatible configurations In-Reply-To: <410E2517.4040505@worldsoft.ch> References: <410E2517.4040505@worldsoft.ch> Message-ID: <200408020945.02895.danderso@redhat.com> Check your /etc/hosts file on each machine. If the hostname is in the loopback address line (127.0.0.1) take it out and retry. i.e. s/127.0.0.1 san1 localhost.localdomain localhost/127.0.0.1 localhost.localdomain localhost/ On Monday 02 August 2004 06:27, Konrads Smelkovs wrote: > Hello, > I have two nodes connected to SAN(san1 and san2), and one that is not > (eth1, for locking purposes). > I've set the fencing method to manual. > All three nodes run the same cluster configuration (same file). However > when eth1 connects to san1 or san2 ,the other boxes display: > Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Config CRC doesn't match. ( > 20681140 != 4177759411 ) > Aug 2 13:26:21 san1 lock_gulmd_core[2762]: We gave them(cmsfe01) an > error (1004:Incompatible configurations). > Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Closing connection idx:4, > fd:9 to 192.168.100.151 > > why? > P.S. This is for purely testing purposes. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster
From mnerren at paracel.com Mon Aug 2 14:48:02 2004 From: mnerren at paracel.com (micah nerren) Date: Mon, 02 Aug 2004 07:48:02 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine Message-ID: <1091458082.8356.23.camel@angmar> Hello, I am having some problems setting up GFS 6.0. I have the src rpms, and built them against kernel-2.4.21-15.ELsmp on x86_64. The kernel modules all load, and everything seems to build properly. I create the pool device, and am able to create the file system as well. However, when I do the mount command (mount -t gfs /dev/pool/pool_gfs01 /gfs01), the machine crashes instantly. lock_gulm seems to be the culprit; however, I cannot get any useful information out of the system about why this is happening. No logs, just whammo - the system dies instantly.
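For readers following along, the sequence just described boils down to roughly the commands below. The cluster name, filesystem name and journal count are placeholders rather than values taken from this report, and the pool/CCS configuration is assumed to already be in place.

  # Rough sketch of the reported sequence (GFS 6.0 with lock_gulm); the
  # names "alpha"/"gfs01" and the journal count are placeholders.
  rpmbuild --rebuild GFS-6.0.0-1.2.src.rpm              # build GFS for the running kernel
  modprobe pool && modprobe lock_gulm && modprobe gfs   # load the kernel modules
  pool_assemble -a                                      # activate the pool devices
  service lock_gulmd start                              # lock server must be running
  gfs_mkfs -p lock_gulm -t alpha:gfs01 -j 2 /dev/pool/pool_gfs01
  mount -t gfs /dev/pool/pool_gfs01 /gfs01              # the step that kills the machine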
My platform information is as follows: Rocks 3.1 (based on RHEL WS 3.0) Dual 1.4 opterons Kernel from rpm: kernel-smp-2.4.21-15.EL kernel source from rpm: kernel-source-2.4.21-15.EL GFS: GFS-6.0.0-1.2.src.rpm My dev files look like: /dev/pool: brw------- 2 root root 254, 65 Jul 31 01:12 hopkins_cca brw------- 2 root root 254, 66 Jul 31 01:12 pool_gfs01 Modules are loaded: gfs 261792 0 (unused) lock_gulm 68960 0 (unused) lock_harness 4048 0 [gfs lock_gulm] pool 85760 3 uname -a: Linux frontend-0.public 2.4.21-15.ELsmp #1 SMP Thu Apr 22 00:09:01 EDT 2004 x86_64 x86_64 x86_64 GNU/Linux [root at frontend-0 root]# pool_tool -s Device Pool Label ====== ========== /dev/pool/hopkins_cca <- CCA device -> /dev/pool/pool_gfs01 <- GFS filesystem -> /dev/sda <- partition information -> /dev/sda1 hopkins_cca /dev/sda2 pool_gfs01 And when I do the following mount command: mount -t gfs /dev/pool/pool_gfs01 /gfs01 The system crashes. At the console, there are tons of system calls being listed, and at the bottom of the screen: Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 Console Shuts up: pid: 3547, lock_gulmd Not tainted RIP: 0010 So... Any ideas on what may be causing this? This seems to be a supported platform for this tool according to redhat. Has anybody used GFS 6.0 on this kernel rev on x86_64 and if so, how did you get it to work? Other information: HBA: QLogic Corp. QLA2312 Fibre Channel Adapter SAN Switch: Qlogic SANBOX Thank you for any help you may have!! Micah From mtilstra at redhat.com Mon Aug 2 14:54:35 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Mon, 2 Aug 2004 09:54:35 -0500 Subject: [Linux-cluster] incompatible configurations In-Reply-To: <410E2517.4040505@worldsoft.ch> References: <410E2517.4040505@worldsoft.ch> Message-ID: <20040802145435.GA3266@redhat.com> On Mon, Aug 02, 2004 at 01:27:19PM +0200, Konrads Smelkovs wrote: > Hello, > I have two nodes connected to SAN(san1 and san2), and one that is not > (eth1, for locking purposes). > I've set the fencing method to manual. > All three nodes run the same cluster configuration (same file). However > when eth1 connects to san1 or san2 ,the other boxes display: > Aug 2 13:26:21 san1 lock_gulmd_core[2762]: Config CRC doesn't match. ( > 20681140 != 4177759411 ) config files aren't matching in the important bits. run lock_gulmd -C (plus any other cmd line params you had) the -C will cause lock_gulmd to print what it thinks the config is, and exit. Do this on both nodes, and see what is different. Common problem is what derek pointed at. (with the loopback address getting the host's name.) -- Michael Conrad Tadpol Tilstra an experation date on distilled water......? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From amanthei at redhat.com Mon Aug 2 15:46:05 2004 From: amanthei at redhat.com (Adam Manthei) Date: Mon, 2 Aug 2004 10:46:05 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091458082.8356.23.camel@angmar> References: <1091458082.8356.23.camel@angmar> Message-ID: <20040802154605.GC1518@redhat.com> On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > The system crashes. At the console, there are tons of system calls being > listed, and at the bottom of the screen: > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > Console Shuts up: > pid: 3547, lock_gulmd Not tainted > RIP: 0010 > > > So... 
Any ideas on what may be causing this? Those "tons of system calls being listed" are really quite useful if not necessary to tell you what the problem is. My gut feeling is that there is a stack overrun that is happening. > This seems to be a supported platform for this tool according to redhat. Correct. > Has anybody used GFS 6.0 on this kernel rev on x86_64 and if so, > how did you get it to work? Yup. I just used the rpms. Perhaps you compiled it with debugging options enabled? (I don't know if that would make the stack bigger) -- Adam Manthei From mnerren at paracel.com Mon Aug 2 16:06:48 2004 From: mnerren at paracel.com (micah nerren) Date: Mon, 02 Aug 2004 09:06:48 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040802154605.GC1518@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> Message-ID: <1091462808.8356.39.camel@angmar> Hi, On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: > On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > > > The system crashes. At the console, there are tons of system calls being > > listed, and at the bottom of the screen: > > > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > > Console Shuts up: > > pid: 3547, lock_gulmd Not tainted > > RIP: 0010 > > > > > > So... Any ideas on what may be causing this? > > Those "tons of system calls being listed" are really quite useful if not > necessary to tell you what the problem is. My gut feeling is that there is > a stack overrun that is happening. I could try to post them if anybody would find that useful. I will write all that down and attempt to post it coherently. Is there any way to capture that kind of info to a file? > > This seems to be a supported platform for this tool according to redhat. > > Correct. > > > Has anybody used GFS 6.0 on this kernel rev on x86_64 and if so, > > how did you get it to work? > > Yup. I just used the rpms. Perhaps you compiled it with debugging options > enabled? (I don't know if that would make the stack bigger) All I did was 'rpmbuild --rebuild GFS-6.0.0-1.2.src.rpm' That created the following rpms: GFS-6.0.0-1.2.x86_64.rpm GFS-debuginfo-6.0.0-1.2.x86_64.rpm GFS-devel-6.0.0-1.2.x86_64.rpm GFS-modules-6.0.0-1.2.x86_64.rpm GFS-modules-smp-6.0.0-1.2.x86_64.rpm Of those, I have the following actually installed: GFS-modules-smp-6.0.0-1.2 GFS-6.0.0-1.2 Do you have any build instructions for getting them to work properly? Could something built into my running kernel cause this? I am building a new kernel from source right now to see if the binary kernel rpm I used had some sort of problem. Could it be related to the HBA I am using as well? Thanks! Micah From phillips at redhat.com Mon Aug 2 16:15:01 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 2 Aug 2004 12:15:01 -0400 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091462808.8356.39.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091462808.8356.39.camel@angmar> Message-ID: <200408021215.01082.phillips@redhat.com> On Monday 02 August 2004 12:06, micah nerren wrote: > Hi, > > On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: > > On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > > The system crashes. 
At the console, there are tons of system > > > calls being listed, and at the bottom of the screen: > > > > > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > > > Console Shuts up: > > > pid: 3547, lock_gulmd Not tainted > > > RIP: 0010 > > > > > > > > > So... Any ideas on what may be causing this? > > > > Those "tons of system calls being listed" are really quite useful > > if not necessary to tell you what the problem is. My gut feeling > > is that there is a stack overrun that is happening. > > I could try to post them if anybody would find that useful. I will > write all that down and attempt to post it coherently. Is there any > way to capture that kind of info to a file? The traditional way is to connect a serial cable and direct console output to the serial port. You can then cut and paste the console messages from a nice graphical buffer. Other ways are proposed, such as crash dump to disk and network crash dump, but I don't think any have made it to mainline yet, though patches are available. (More voices asking for it on lkml would help.) Regards, Daniel From amanthei at redhat.com Mon Aug 2 16:21:23 2004 From: amanthei at redhat.com (Adam Manthei) Date: Mon, 2 Aug 2004 11:21:23 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091462808.8356.39.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091462808.8356.39.camel@angmar> Message-ID: <20040802162123.GE1518@redhat.com> On Mon, Aug 02, 2004 at 09:06:48AM -0700, micah nerren wrote: > Hi, > > > On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: > > On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > > > > > The system crashes. At the console, there are tons of system calls being > > > listed, and at the bottom of the screen: > > > > > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > > > Console Shuts up: > > > pid: 3547, lock_gulmd Not tainted > > > RIP: 0010 > > > > > > > > > So... Any ideas on what may be causing this? > > > > Those "tons of system calls being listed" are really quite useful if not > > necessary to tell you what the problem is. My gut feeling is that there is > > a stack overrun that is happening. > > I could try to post them if anybody would find that useful. I will write > all that down and attempt to post it coherently. Is there any way to > capture that kind of info to a file? The best way is to connect it to a serial console to grab the output. Depending on the state of the machine, you may even be able to grab that with 'dmesg' or the syslogs (although it's not likely to have made it to the syslog). > > > Has anybody used GFS 6.0 on this kernel rev on x86_64 and if so, > > > how did you get it to work? > > > > Yup. I just used the rpms. Perhaps you compiled it with debugging options > > enabled? (I don't know if that would make the stack bigger) > > All I did was 'rpmbuild --rebuild GFS-6.0.0-1.2.src.rpm' > > That created the following rpms: > GFS-6.0.0-1.2.x86_64.rpm > GFS-debuginfo-6.0.0-1.2.x86_64.rpm > GFS-devel-6.0.0-1.2.x86_64.rpm > GFS-modules-6.0.0-1.2.x86_64.rpm > GFS-modules-smp-6.0.0-1.2.x86_64.rpm > > Of those, I have the following actually installed: > > GFS-modules-smp-6.0.0-1.2 > GFS-6.0.0-1.2 > > > Do you have any build instructions for getting them to work properly? What you did makes sense to me. > Could something built into my running kernel cause this? 
I am building a > new kernel from source right now to see if the binary kernel rpm I used > had some sort of problem. > > Could it be related to the HBA I am using as well? If it is a stack overflow, then yes, it _could_ be related, but I'm not going to blame that just yet ;) > Thanks! > > Micah > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From david.n.lombard at intel.com Mon Aug 2 16:33:06 2004 From: david.n.lombard at intel.com (Lombard, David N) Date: Mon, 2 Aug 2004 09:33:06 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine Message-ID: <187D3A7CAB42A54DB61F1D05F0125722039C1567@orsmsx402.amr.corp.intel.com> From: micah Nerren; Monday, August 02, 2004 9:07 AM > >On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: >> On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: >> > >> > The system crashes. At the console, there are tons of system calls >being >> > listed, and at the bottom of the screen: >> > >> > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 >> > Console Shuts up: >> > pid: 3547, lock_gulmd Not tainted >> > RIP: 0010 >> > >> > >> > So... Any ideas on what may be causing this? >> >> Those "tons of system calls being listed" are really quite useful if not >> necessary to tell you what the problem is. My gut feeling is that there >is >> a stack overrun that is happening. > >I could try to post them if anybody would find that useful. I will write >all that down and attempt to post it coherently. Is there any way to >capture that kind of info to a file? Set up the kernel for serial console, and capture with minicom on another system via a null modem cable. -- David N. Lombard My comments represent my opinions, not those of Intel Corporation. From jeff at intersystems.com Mon Aug 2 20:20:45 2004 From: jeff at intersystems.com (Jeff) Date: Mon, 2 Aug 2004 16:20:45 -0400 Subject: [Linux-cluster] segfault if dlm is loaded while cman is still joining the cluster Message-ID: <118314539.20040802162045@intersystems.com> Is there a bug tracker somewhere or should we just post them to this list? -------------------------------------------------------------- This is on a dual-cpu box (FC2) with hyperthreading enabled (eg. for a total of 4 logical CPUs). If I issue the following commands where I type each command as soon as the prior command completes I get a segfault loading the dlm. The code is from CVS/latest. [root at lx4 cluster_orig]# ccsd [root at lx4 cluster_orig]# cman_tool join [root at lx4 cluster_orig]# modprobe dlm Segmentation fault [root at lx4 cluster_orig]# modprobe dlm [root at lx4 cluster_orig]# dmesg CMAN: Waiting to join or form a Linux-cluster CMAN (built Aug 2 2004 15:04:09) installed kmem_cache_create: duplicate cache cluster_sock ------------[ cut here ]------------ kernel BUG at mm/slab.c:1392! 
invalid operand: 0000 [#1] SMP Modules linked in: cman parport_pc lp parport autofs4 nfs lockd sunrpc e1000 3c59x floppy sg microcode dm_mod uhci_hcd button battery asus_acpi ac ipv6 ext3 jbd aic7xxx sd_mod scsi_mod CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010202 (2.6.7-clu-smp) EIP is at kmem_cache_create+0x4c6/0x660 eax: 00000030 ebx: c22f4770 ecx: c0487c98 edx: 00004ce1 esi: c033a366 edi: f8aa662d ebp: f51d7b80 esp: f3fb0f5c ds: 007b es: 007b ss: 0068 Process modprobe (pid: 5476, threadinfo=f3fb0000 task=f43ce230) Stack: c031b3c8 f8aa6620 f51d7c38 0000000a c0000000 ffffff80 00000080 f8aa6620 00000080 c0356fe0 f8aae200 c0356fc4 c0356fc4 f88a804e 00002000 00000000 00000000 f8aa6605 c013a5c7 f6b7daa0 00000000 40018008 0807a1a0 00ccaffc Call Trace: [] cluster_init+0x4e/0x3f9 [cman] [] sys_init_module+0x107/0x220 [] sysenter_past_esp+0x52/0x71 Code: 0f 0b 70 05 2d ad 31 c0 8b 0b e9 5b ff ff ff 8b 87 b0 00 00 DLM (built Aug 2 2004 15:04:29) installed CMAN: sending membership request CMAN: got node lx3 CMAN: quorum regained, resuming activity [root at lx4 cluster_orig]# The cpuinfo for the 4 cpu's is pretty much the same. Here's one of them: processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) XEON(TM) CPU 1.80GHz stepping : 4 cpu MHz : 1779.842 cache size : 512 KB physical id : 0 siblings : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm bogomips : 3514.36 From mnerren at paracel.com Mon Aug 2 20:59:55 2004 From: mnerren at paracel.com (micah nerren) Date: Mon, 02 Aug 2004 13:59:55 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040802154605.GC1518@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> Message-ID: <1091480394.8356.58.camel@angmar> On Mon, 2004-08-02 at 08:46, Adam Manthei wrote: > On Mon, Aug 02, 2004 at 07:48:02AM -0700, micah nerren wrote: > > > > The system crashes. At the console, there are tons of system calls being > > listed, and at the bottom of the screen: > > > > Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 > > Console Shuts up: > > pid: 3547, lock_gulmd Not tainted > > RIP: 0010 > > > > > > So... Any ideas on what may be causing this? > > Those "tons of system calls being listed" are really quite useful if not > necessary to tell you what the problem is. My gut feeling is that there is > a stack overrun that is happening. Ok, here is a capture of the crash occurring. Note that the message is slightly different than the one I posted before, the end changes, however the calls it is making look very similar. I also went and upgraded the kernel to the lastest from RHEL 3 WS. I upgraded GFS to GFS-6.0.0-7.src.rpm. Still crashing. Here is my entire boot log from power on, to mount crash. Prior to the crash, I did the following: logged in as root via ssh depmod -a modprobe lock_gulm modprobe gfs (module pool was already loaded at boot time.) mount -t gfs /dev/pool/pool_gfs01 /mnt/gfs CRASH I hope this helps!! 
Micah //////////////////////// Bootdata ok (command line is ro root=LABEL=/ noapic console=ttyS0,38400) Linux version 2.4.21-15.0.3.ELsmp (bhcompile at thor.perf.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-37)) #1 SMP Tue Jun 29 17:46:55 EDT 2004 BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009b800 (usable) BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved) BIOS-e820: 00000000000cc000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 000000007ff80000 (usable) BIOS-e820: 000000007ff80000 - 0000000080000000 (reserved) BIOS-e820: 00000000fec00000 - 00000000fec00400 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved) kernel direct mapping tables upto 10100000000 @ 8000-d000 Scanning NUMA topology in Northbridge 24 Node 0 using interleaving mode 1/0 No NUMA configuration found Faking a node at 0000000000000000-000000007ff80000 Bootmem setup node 0 0000000000000000-000000007ff80000 found SMP MP-table at 000f69a0 hm, page 000f6000 reserved twice. hm, page 000f7000 reserved twice. hm, page 0009b000 reserved twice. hm, page 0009c000 reserved twice. setting up node 0 0-7ff80 On node 0 totalpages: 524160 zone(0): 4096 pages. zone(1): 520064 pages. zone(2): 0 pages. ACPI: Unable to locate RSDP Intel MultiProcessor Specification v1.4 Virtual Wire compatibility mode. OEM ID: AMD <6>Product ID: HAMMER <6>APIC at: 0xFEE00000 Processor #0 15:5 APIC version 16 Processor #1 15:5 APIC version 16 I/O APIC #2 Version 17 at 0xFEC00000. I/O APIC #3 Version 17 at 0xFC000000. I/O APIC #4 Version 17 at 0xFC001000. Processors: 2 Kernel command line: ro root=LABEL=/ noapic console=ttyS0,38400 Initializing CPU#0 time.c: Detected 1.193182 MHz PIT timer. time.c: Detected 1403.229 MHz TSC timer. Console: colour VGA+ 80x25 Calibrating delay loop... 2798.38 BogoMIPS Memory: 2034216k/2096640k available (1797k kernel code, 0k reserved, 1862k data, 224k init) Dentry cache hash table entries: 262144 (order: 10, 4194304 bytes) Inode cache hash table entries: 131072 (order: 9, 2097152 bytes) Mount cache hash table entries: 256 (order: 0, 4096 bytes) Buffer cache hash table entries: 131072 (order: 8, 1048576 bytes) Page-cache hash table entries: 524288 (order: 10, 4194304 bytes) CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64 bytes/line/2 way) CPU: L2 Cache: 1024K (64 bytes/line/8 way) Machine Check Reporting enabled for CPU#0 POSIX conformance testing by UNIFIX mtrr: v2.02 (20020716)) CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64 bytes/line/2 way) CPU: L2 Cache: 1024K (64 bytes/line/8 way) CPU0: AMD Opteron(tm) Processor 240 stepping 01 per-CPU timeslice cutoff: 5119.55 usecs. task migration cache decay timeout: 10 msecs. Booting processor 1/1 rip 6000 page 00000100077e2000 Initializing CPU#1 Calibrating delay loop... 2804.94 BogoMIPS CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64 bytes/line/2 way) CPU: L2 Cache: 1024K (64 bytes/line/8 way) Machine Check Reporting enabled for CPU#1 CPU1: AMD Opteron(tm) Processor 240 stepping 01 Total of 2 processors activated (5603.32 BogoMIPS). Using local APIC timer interrupts. Detected 12.528 MHz APIC timer. cpu: 0, clocks: 2004614, slice: 668204 CPU0 cpu: 1, clocks: 2004614, slice: 668204 CPU1 checking TSC synchronization across CPUs: passed. time.c: Using PIT based timekeeping. 
Starting migration thread for cpu 0 Starting migration thread for cpu 1 ACPI: Subsystem revision 20030619 PCI: Using configuration type 1 ACPI: System description tables not found ACPI-0084: *** Error: acpi_load_tables: Could not get RSDP, AE_NOT_FOUND ACPI-0134: *** Error: acpi_load_tables: Could not load tables: AE_NOT_FOUND ACPI: Unable to load the System Description Tables PCI: Probing PCI hardware PCI: Using IRQ router default [1022/746b] at 00:07.3 Linux agpgart interface v0.99 (c) Jeff Hartmann agpgart: Maximum main memory to use for agp memory: 1919M PCI-DMA: Disabling IOMMU. Linux NET4.0 for Linux 2.4 Based upon Swansea University Computer Society NET3.039 Initializing RT netlink socket Starting kswapd VFS: Disk quotas vdquot_6.5.1 aio_setup: num_physpages = 131040 aio_setup: sizeof(struct page) = 104 Hugetlbfs mounted. Total HugeTLB memory allocated, 0 IA32 emulation $Id: sys_ia32.c,v 1.56 2003/04/10 10:45:37 ak Exp $ pty: 2048 Unix98 ptys configured Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT SHARE_IRQ SERIAL_PCI SERIAL_ACPI enabled ttyS0 at 0x03f8 (irq = 4) is a 16550A Real Time Clock Driver v1.10e NET4: Frame Diverter 0.46 RAMDISK driver initialized: 256 RAM disks of 8192K size 1024 blocksize Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx AMD8111: IDE controller at PCI slot 00:07.1 AMD8111: chipset revision 3 AMD8111: not 100% native mode: will probe irqs later ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx AMD_IDE: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03) UDMA100 controller on pci00:07.1 ide0: BM-DMA at 0x1020-0x1027, BIOS settings: hda:DMA, hdb:pio ide1: BM-DMA at 0x1028-0x102f, BIOS settings: hdc:pio, hdd:pio hda: WDC WD600JB-00CRA1, ATA DISK drive ide0 at 0x1f0-0x1f7,0x3f6 on irq 14 hda: attached ide-disk driver. hda: host protected area => 1 hda: 117231408 sectors (60022 MB) w/8192KiB Cache, CHS=116301/16/63, UDMA(100) ide-floppy driver 0.99.newide Partition check: hda: hda1 hda2 hda3 ide-floppy driver 0.99.newide md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27 md: Autodetecting RAID arrays. md: autorun ... md: ... autorun DONE. Initializing Cryptographic API NET4: Linux TCP/IP 1.0 for NET4.0 IP: routing cache hash table of 8192 buckets, 128Kbytes TCP: Hash tables configured (established 262144 bind 65536) Linux IP multicast router 0.06 plus PIM-SM Initializing IPsec netlink socket NET4: Unix domain sockets 1.0/SMP for Linux NET4.0. RAMDISK: Compressed image found at block 0 VFS: Mounted root (ext2 filesystem). Red Hat nash version 3.5.13 starSCSI subsystem driver Revision: 1.00 ting Loading scsi_mod.o module Loading sd_mod.o module Loadinqla2x00_set_info starts at address = ffffffffa00230c0 g qla2300.o moduqla2x00: Found VID=1077 DID=2312 SSVID=1077 SSDID=101 scsi(0): Found a QLA2312 @ bus 2, device 0x1, irq 5, iobase 0xffffff0000013000 le scsi(0): Allocated 4096 SRB(s). scsi(0): Configure NVRAM parameters... scsi(0): 64 Bit PCI Addressing Enabled. qla2x00_nvram_config ZIO enabled:intr_timer_delay=3 scsi(0): Verifying loaded RISC code... scsi(0): Verifying chip... scsi(0): Waiting for LIP to complete... scsi(0): Cable is unplugged... scsi-qla0-adapter-node=200000e08b17cf0f\; scsi-qla0-adapter-port=210000e08b17cf0f\; qla2x00: Found VID=1077 DID=2312 SSVID=1077 SSDID=101 scsi(1): Found a QLA2312 @ bus 2, device 0x1, irq 10, iobase 0xffffff0000015000 scsi(1): Allocated 4096 SRB(s). scsi(1): Configure NVRAM parameters... 
scsi(1): 64 Bit PCI Addressing Enabled. qla2x00_nvram_config ZIO enabled:intr_timer_delay=3 scsi(1): Verifying loaded RISC code... scsi(1): Verifying chip... scsi(1): Waiting for LIP to complete... scsi(1): LOOP UP detected. scsi(1): Port database changed. scsi(1): Topology - (F_Port), Host Loop address 0xffff qla2x00_configure_fcports(1): LOOP READY scsi-qla1-adapter-node=200100e08b37cf0f\; scsi-qla1-adapter-port=210100e08b37cf0f\; scsi-qla1-tgt-0-di-0-port=22000004cffd1447\; scsi-qla1-tgt-1-di-0-port=22000004cffd1411\; scsi-qla1-tgt-2-di-0-port=22000004cffd0254\; scsi-qla1-tgt-3-di-0-port=22000004cffcec36\; scsi(1) qla2x00_isr MBA_PORT_UPDATE ignored scsi0 : QLogic QLA2312 PCI to Fibre Channel Host Adapter: bus 2 device 1 irq 5 Firmware version: 3.02.24, Driver version 6.07.02-RH2 scsi1 : QLogic QLA2312 PCI to Fibre Channel Host Adapter: bus 2 device 1 irq 10 Firmware version: 3.02.24, Driver version 6.07.02-RH2 Vendor: SEAGATE Model: ST336607FC Rev: 0006 Type: Direct-Access ANSI SCSI revision: 03 Vendor: SEAGATE Model: ST336607FC Rev: 0006 Type: Direct-Access ANSI SCSI revision: 03 Vendor: SEAGATE Model: ST336607FC Rev: 0006 Type: Direct-Access ANSI SCSI revision: 03 Vendor: SEAGATE Model: ST336607FC Rev: 0006 Type: Direct-Access ANSI SCSI revision: 03 scsi(1:0:0:0): Enabled tagged queuing, queue depth 64. scsi(1:0:1:0): Enabled tagged queuing, queue depth 64. scsi(1:0:2:0): Enabled tagged queuing, queue depth 64. scsi(1:0:3:0): Enabled tagged queuing, queue depth 64. Attached scsi disk sda at scsi1, channel 0, id 0, lun 0 Attached scsi disk sdb at scsi1, channel 0, id 1, lun 0 Attached scsi disk sdc at scsi1, channel 0, id 2, lun 0 Attached scsi disk sdd at scsi1, channel 0, id 3, lun 0 SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB) sda: sda1 sda2 SCSI device sdb: 71687372 512-byte hdwr sectors (36704 MB) sdb: sdb1 sdb2 SCSI device sdc: 71687372 512-byte hdwr sectors (36704 MB) sdc: sdc1 sdc2 SCSI device sdd: 71687372 512-byte hdwr sectors (36704 MB) sdd: sdd1 sdd2 Loading jbd.o module Journalled Block Device driver loaded Loading ext3.o module Mounting /proc filesystem Creating block devices Creating root device Mounting root filesystem kjournald starting. Commit interval 5 seconds EXT3-fs: mounted filesystem with ordered data mode. spurious 8259A interrupt: IRQ7. Freeing unused kernel memory: 224k freed INIT: version 2.85 booting Welcome to Rocks Press 'I' to enter interactive startup. Unmounting initrd: [ OK ] Configuring kernel parameters: [ OK ] Setting clock (utc): Mon Aug 2 20:42:30 GMT 2004 [ OK ] Setting hostname frontend-0.public: [ OK ] Initializing USB controller (usb-ohci): [ OK ] Mounting USB filesystem: [ OK ] Initializing USB HID interface: [ OK ] Initializing USB keyboard: [ OK ] Initializing USB mouse: [ OK ] Checking root filesystem /: clean, 158083/7061504 files, 1680159/14116410 blocks [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/hda2 [ OK ] Remounting root filesystem in read-write mode: [ OK ] Activating swap partitions: [ OK ] Finding module dependencies: [ OK ] Checking filesystems /boot: clean, 69/25584 files, 75707/102280 blocks Checking all file systems. 
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/hda1 [ OK ] Mounting local filesystems: [ OK ] Enabling local filesystem quotas: [ OK ] Enabling swap space: [ OK ] INIT: Entering runlevel: 3 Entering non-interactive startup Applying iptables firewall rules: [ OK ] Setting network parameters: [ OK ] Bringing up loopback interface: [ OK ] Bringing up interface eth0: [ OK ] Bringing up interface eth1: [ OK ] Starting system logger: [ OK ] Starting kernel logger: [ OK ] Starting portmapper: [ OK ] Starting NFS statd: [ OK ] Starting pool: Pool v6.0.0 (built Aug 2 2004 18:51:15) installed [ OK ] Starting ganglia-restore-rrds: [ OK ] Starting ccsd: [ OK ] Starting GANGLIA gmetad: [ OK ] Initializing random number generator: [ OK ] Starting Ganglia Receptor: [ OK ] Starting lock_gulmd: [ OK ] modprobe: Can't locate module pvfs Starting PVFS daemon: (pvfsd.c, 683): Could not setup device /dev/pvfsd. (pvfsd.c, 684): Did you remember to load the pvfs module? (pvfsd.c, 453): pvfsd: setup_pvfsdev() failed [FAILED][ OK ] Mounting other filesystems: [ OK ] Publishing login files via 411...[ OK ] Starting automount:[ OK ] Starting named: [ OK ] Starting sshd:[ OK ] Starting xinetd: [ OK ] ntpd: Synchronizing with time server: [FAILED] Starting ntpd: [ OK ] Starting NFS services: [ OK ] Starting NFS quotas: [ OK ] Starting NFS daemon: [ OK ] Starting NFS mountd: [ OK ] Starting dhcpd: [ OK ] Starting GANGLIA gmond: [ OK ] Starting MySQL: [ OK ] Starting httpd: [ OK ] Starting crond: [ OK ] Starting xfs: [ OK ] Starting atd: [ OK ] Starting firstboot: [ OK ] starting sge_qmaster starting program: /opt/gridengine/bin/amd64linux/sge_commd using service "sge_commd" bound to port 535 Reading in complexes: Complex "host". Complex "queue". Reading in execution hosts. Reading in administrative hosts. Reading in submit hosts. Reading in usersets: Userset "defaultdepartment". Userset "deadlineusers". Reading in queues: Queue "compute-0-0.q". Reading in parallel environments: PE "make". PE "mpich". PE "mpi". 
Reading in scheduler configuration cant load sharetree (cant open file sharetree: No such file or directory), starting up with empty sharetree starting sge_schedd Turn off kernel logging to console: [ OK ] /wet^H^H^HUnable to handle kernel NULL pointer dereference at virtual address 0000000000000000 printing rip: ffffffff8024a875 PML4 78215067 PGD 77f93067 PMD 0 Oops: 0002 CPU 1 Pid: 4027, comm: mount Not tainted RIP: 0010:[]{net_rx_action+213} RSP: 0018:0000010078051048 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffff80607ae8 RCX: ffffffff80607c88 RDX: ffffffff80607ae8 RSI: 0000010078986080 RDI: ffffffff80607ad0 RBP: ffffffff80607968 R08: 0000000080e76a9c R09: 0000000000e780e7 R10: 000000000100007f R11: 0000000000000000 R12: ffffffff80607ae8 R13: ffffffff80607ac0 R14: 00000000000071c2 R15: 0000000000000001 FS: 0000002a955764c0(0000) GS:ffffffff805d98c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000079d2000 CR4: 00000000000006e0 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{release_task+763} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+211} []{strnlen_user+56} []{create_elf_tables+871} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Process mount (pid: 4027, stackpage=10078051000) Stack: 0000010078051048 0000000000000018 ffffffff8024a84d 0000012a80445d20 0000000000000001 ffffffff80606c60 0000000000000001 000000000000000a 0000000000000001 0000000000000002 ffffffff8012a72e ffffffff80267cf0 0000000000000246 0000000000000000 0000000000000003 ffffffff80445d20 ffffffff80267cc0 0000000000000000 ffffffff802b5915 0000000000000043 0000000000000006 000001007a05309e 000001007c97bd80 0000000000000000 0000000000000300 ffffffff8049c688 0000000000000001 ffffffff806077c0 ffffffff802533a7 ffffffff80267cc0 ffffffff80445d20 0000000000000002 000001007c97bd80 ffffffff805abcd0 000001007a0530ac 000001007c97bd80 0000010078986080 0000000000000000 0000010078986080 000001007c97bde8 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} 
[]{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{release_task+763} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+211} []{strnlen_user+56} []{create_elf_tables+871} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Code: 48 89 18 48 89 43 08 8b 85 90 01 00 00 85 c0 79 08 03 85 94 Kernel panic: Fatal exception In interrupt handler - not syncing NMI Watchdog detected LOCKUP on CPU0, eip ffffffff801a5419, registers: CPU 0 Pid: 3532, comm: lock_gulmd Not tainted RIP: 0010:[]{.text.lock.fault+7} RSP: 0018:000001007ba7b978 EFLAGS: 00000086 RAX: 000000000000000f RBX: ffffffff806077e8 RCX: 0000000000000000 RDX: ffffffff803042e0 RSI: ffffffff803042e0 RDI: ffffffff8024a875 RBP: ffffffff80607668 R08: ffffffff803042d0 R09: 0000000000e780e7 R10: 000000000100007f R11: 0000000000000000 R12: 0000010037dcbc00 R13: 0000000000000000 R14: 0000000000000002 R15: 000001007ba7ba58 FS: 0000002a95576ce0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{nf_hook_slow+305} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{ip_local_deliver_finish+0} []{error_exit+0} []{net_rx_action+213} []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{sys_sendto+195} []{free_pages+132} []{__poll_freewait+136} []{system_call+119} Process lock_gulmd (pid: 3532, stackpage=1007ba7b000) Stack: 000001007ba7b978 0000000000000018 0000000000100000 0000000000000000 00000100079c4c80 ffffffff803e89a0 0000000000000000 00000100000fdea0 ffffffff803e8d00 00000100079bf000 00000100079d6400 0000000000000042 00000100079de280 ffffff0000000000 000000fffffff000 0000000000000000 00000100079d7a80 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000010078050d48 0000000000000000 00000000006d9994 0000000000000003 0000000000000000 0000000000000000 0000000100000000 ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff Call Trace: []{nf_hook_slow+305} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{ip_local_deliver_finish+0} []{error_exit+0} []{net_rx_action+213} []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{sys_sendto+195} []{free_pages+132} []{__poll_freewait+136} []{system_call+119} Code: f3 90 7e f5 e9 c8 fd ff ff 90 90 90 90 90 90 90 90 90 90 90 console shuts up ... 
NMI Watchdog detected LOCKUP on CPU1, eip ffffffff8011a948, registers:
From bruce.walker at hp.com Mon Aug 2 22:25:27 2004 From: bruce.walker at hp.com (Walker, Bruce J) Date: Mon, 2 Aug 2004 15:25:27 -0700 Subject: [Linux-cluster] RE: [SSI-devel] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! Message-ID: <3689AF909D816446BA505D21F1461AE4C750F0@cacexc04.americas.cpqcorp.net> As I indicated earlier, we are going to redo the hooks for 2.6 and submit them in a more manageable way. I expect that to take several months. Bruce Walker Project manager for OpenSSI. > -----Original Message----- > From: ssic-linux-devel-admin at lists.sourceforge.net > [mailto:ssic-linux-devel-admin at lists.sourceforge.net] On > Behalf Of Erich Focht > Sent: Monday, August 02, 2004 6:51 AM > To: K V, Aneesh Kumar > Cc: ssic-linux-devel at lists.sourceforge.net; Andi Kleen; Linux > Kernel Mailing List; linux-cluster at redhat.com > Subject: Re: [SSI-devel] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! > > > On Monday 02 August 2004 08:30, Aneesh Kumar K.V wrote: > > > [....] Congratulations. But I was a bit disappointed that there > > > wasn't a tarball with the kernel patches and other sources. > > > Any chance to add that to the site? > > > > I have posted the diff at > > http://www.openssi.org/contrib/linux-ssi.diff.gz > > Hmmm, that's too huge to get an overview on what it does... > The current CVS ci/kernel touches 137 files, openssi/kernel touches > 350 files. Plus the ci/kernel.patches and openssi/kernel.patches... > > > For 2.6 we are planning to group the changes into small > patches that is > > easy to review. > > Sounds great! Having groups sorted by functionality will help a > lot. When will these be visible in the CVS? > > Thanks, > best regards, > Erich > > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by OSTG. Have you noticed the > changes on > Linux.com, ITManagersJournal and NewsForge in the past few weeks? Now, > one more big change to announce. We are now OSTG- Open Source > Technology > Group. Come see the changes on the new OSTG site. www.ostg.com > _______________________________________________ > ssic-linux-devel mailing list > ssic-linux-devel at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel >
From pcaulfie at redhat.com Tue Aug 3 06:55:51 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 3 Aug 2004 07:55:51 +0100 Subject: [Linux-cluster] segfault if dlm is loaded while cman is still joining the cluster In-Reply-To: <118314539.20040802162045@intersystems.com> References: <118314539.20040802162045@intersystems.com> Message-ID: <20040803065551.GB23467@tykepenguin.com> On Mon, Aug 02, 2004 at 04:20:45PM -0400, Jeff wrote: > Is there a bug tracker somewhere or should we just post > them to this list? There is a bugzilla at bugzilla.redhat.com, but posting to the list is generally OK too. Thanks, I'll have a look at this.
Patrick
From pcaulfie at redhat.com Tue Aug 3 07:28:20 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 3 Aug 2004 08:28:20 +0100 Subject: [Linux-cluster] segfault if dlm is loaded while cman is still joining the cluster In-Reply-To: <118314539.20040802162045@intersystems.com> References: <118314539.20040802162045@intersystems.com> Message-ID: <20040803072819.GE23467@tykepenguin.com> On Mon, Aug 02, 2004 at 04:20:45PM -0400, Jeff wrote: > CMAN: Waiting to join or form a Linux-cluster > CMAN (built Aug 2 2004 15:04:09) installed > kmem_cache_create: duplicate cache cluster_sock Hang on - how did you manage that? The cman code looks like it has been loaded twice, or something.... The module load message is BELOW the "Waiting" message, which says that the cman code was already in the kernel when the "modprobe dlm" was executed, which loaded the cman.ko module as a dependency. The only way I can think this could happen is that you have cman in the kernel AND as a module. -- patrick
From pcaulfie at redhat.com Tue Aug 3 13:11:54 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 3 Aug 2004 14:11:54 +0100 Subject: [Linux-cluster] Is this intentional: specifying a new completion ast routine on a convert In-Reply-To: <1524056551.20040728124948@intersystems.com> References: <1524056551.20040728124948@intersystems.com> Message-ID: <20040803131154.GI23467@tykepenguin.com> I've changed this so that all the AST parameters passed into a lock request will always override the ones that are there. This makes more sense really, and seems to be what VMS does too. Of course, if you change the blocking AST routine or argument during a convert, the DLM makes no guarantees that a call with the old values won't be in flight and waiting for you :-) -- patrick
From mtilstra at redhat.com Tue Aug 3 14:40:19 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 09:40:19 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091480394.8356.58.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> Message-ID: <20040803144019.GA4365@redhat.com> On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: [snip] > I hope this helps!! [snip] yeah, looks like a stack overflow. here's a patch that I put in for 6.0. (patch works on 6.0.0-7) -- Michael Conrad Tadpol Tilstra Duct tape is like the force. It has a light side and a dark side and it holds the universe together.
-------------- next part -------------- =================================================================== RCS file: /mnt/export/cvs/GFS/locking/lock_gulm/kernel/gulm_fs.c,v retrieving revision 1.1.2.16 retrieving revision 1.1.2.17 diff -u -r1.1.2.16 -r1.1.2.17 --- GFS/locking/lock_gulm/kernel/gulm_fs.c 2004/07/20 16:54:18 1.1.2.16 +++ GFS/locking/lock_gulm/kernel/gulm_fs.c 2004/08/02 16:12:39 1.1.2.17 @@ -335,11 +335,17 @@ unsigned int min_lvb_size, struct lm_lockstruct *lockstruct) { gulm_fs_t *gulm; - char work[256], *tbln; + char *work=NULL, *tbln; int first; int error = -1; struct list_head *lltmp; + work = kmalloc(256, GFP_KERNEL); + if(work == NULL ) { + log_err("Out of Memory.\n"); + error = -ENOMEM; + goto fail; + } strncpy (work, table_name, 256); tbln = strstr (work, ":"); @@ -483,6 +489,7 @@ fail: + if(work != NULL ) kfree(work); gulm_cm.starts = FALSE; log_msg (lgm_Always, "fsid=%s: Exiting gulm_mount with errors %d\n", table_name, error); @@ -570,7 +577,7 @@ { gulm_fs_t *fs = (gulm_fs_t *) lockspace; int err; - uint8_t name[256]; + uint8_t name[64]; if (message != LM_RD_SUCCESS) { /* Need to start thinking about how I want to use this... */ @@ -579,7 +586,7 @@ if (jid == fs->fsJID) { /* this may be drifting crud through. */ /* hey! its me! */ - strncpy (name, gulm_cm.myName, 256); + strncpy (name, gulm_cm.myName, 64); } else if (lookup_name_by_jid (fs, jid, name) != 0) { log_msg (lgm_JIDMap, "fsid=%s: Could not find a client for jid %d\n", -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From jeff at intersystems.com Tue Aug 3 14:56:38 2004 From: jeff at intersystems.com (Jeff) Date: Tue, 3 Aug 2004 10:56:38 -0400 Subject: [Linux-cluster] segfault if dlm is loaded while cman is still joining the cluster In-Reply-To: <20040803072819.GE23467@tykepenguin.com> References: <118314539.20040802162045@intersystems.com> <20040803072819.GE23467@tykepenguin.com> Message-ID: <1574160181.20040803105638@intersystems.com> Tuesday, August 3, 2004, 3:28:20 AM, Patrick Caulfield wrote: > On Mon, Aug 02, 2004 at 04:20:45PM -0400, Jeff wrote: >> CMAN: Waiting to join or form a Linux-cluster >> CMAN (built Aug 2 2004 15:04:09) installed >> kmem_cache_create: duplicate cache cluster_sock > Hang on - how did you manage that? the cman code looks like it has been loaded > twice, or something.... > The module load message is BELOW the "Waiting" message which says that the cman > code was already in the kernel when the "modprobe dlm" was executed which loaded > the cman.ko module as a dependancy. > The only way I can think this could happen is that you have cman in the kernel > AND as a module. Hmmm. When I look at the Cluster Infrastructure item in 'menuconfig' its marked with a *. Is this 'cman'? Would changing this to M solve the problem or is there something else going on here. Starting over with a vanilla kernel I find that if I try to build I get a bunch of undefined symbol warnings during 'make install' from dlm-kernel and gfs-kernel. For instance, from dlm-kernel: [root at lx4 src]# make install if [ ! -e cluster ]; then ln -s . cluster; fi if [ ! -e service.h ]; then cp //usr/include/cluster/service.h .; fi if [ ! -e cnxman.h ]; then cp //usr/include/cluster/cnxman.h .; fi if [ ! 
-e cnxman-socket.h ]; then cp //usr/include/cluster/cnxman-socket.h .; fi make -C /usr/src/linux-2.6.7 M=/usr/src/cvs/cluster_orig/dlm-kernel/src modules USING_KBUILD=yes make[1]: Entering directory `/usr/src/linux-2.6.7' Building modules, stage 2. MODPOST *** Warning: "kcl_addref_cluster" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_by_addr" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_addresses" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_releaseref_cluster" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_current_interface" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_get_node_by_nodeid" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_leave_service" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_remove_callback" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_global_service_id" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_unregister_service" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_join_service" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_start_done" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_add_callback" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! *** Warning: "kcl_register_service" [/usr/src/cvs/cluster_orig/dlm-kernel/src/dlm.ko] undefined! make[1]: Leaving directory `/usr/src/linux-2.6.7' install -d //lib/modules/2.6.7-clu-smp/kernel/cluster install dlm.ko //lib/modules/2.6.7-clu-smp/kernel/cluster install -d //usr/include/cluster install dlm.h dlm_device.h //usr/include/cluster [root at lx4 src]# From mtilstra at redhat.com Tue Aug 3 14:59:05 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 09:59:05 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040803144019.GA4365@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> Message-ID: <20040803145905.GA5818@redhat.com> On Tue, Aug 03, 2004 at 09:40:19AM -0500, michael tilstra wrote: > On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: > [snip] > > I hope this helps!! > [snip] > > yeah, looks like a stack overflow. > here's a patch that I put in for 6.0. (patch works on 6.0.0-7) oh, it is entered at: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=129042 -- Michael Conrad Tadpol Tilstra How? My roommate is dying to get some! -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From phillips at istop.com Sun Aug 1 17:23:01 2004 From: phillips at istop.com (Daniel Phillips) Date: Sun, 1 Aug 2004 13:23:01 -0400 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <410B80BC.4060100@hp.com> References: <410B80BC.4060100@hp.com> Message-ID: <200408011323.02478.phillips@istop.com> On Saturday 31 July 2004 07:21, Aneesh Kumar K.V wrote: > 10. DLM > * is integrated with CLMS and is HA As briefly mentioned at last week's cluster summit, we'd like to try to integrate the Red Hat (nee Sistina) GDLM, want to give it a try? 
"One DLM to rule them all, one DLM to mind them, one DLM to sync them all, and in the cluster, bind them" Regards, Daniel From phillips at istop.com Sun Aug 1 17:30:01 2004 From: phillips at istop.com (Daniel Phillips) Date: Sun, 1 Aug 2004 13:30:01 -0400 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> Message-ID: <200408011330.01848.phillips@istop.com> On Saturday 31 July 2004 12:00, Walker, Bruce J wrote: > In the 2.4 implementation, providing this one capability by > leveraging devfs was quite economic, efficient and has been very stable. I wonder if device-mapper (slightly hacked) wouldn't be a better approach for 2.6+. Regards, Daniel From kpfleming at backtobasicsmgmt.com Sun Aug 1 17:32:57 2004 From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming) Date: Sun, 01 Aug 2004 10:32:57 -0700 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <200408011330.01848.phillips@istop.com> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> Message-ID: <410D2949.20503@backtobasicsmgmt.com> Daniel Phillips wrote: > On Saturday 31 July 2004 12:00, Walker, Bruce J wrote: > >>In the 2.4 implementation, providing this one capability by >>leveraging devfs was quite economic, efficient and has been very stable. > > > I wonder if device-mapper (slightly hacked) wouldn't be a better approach for > 2.6+. It appeared from the original posting that their "cluster-wide devfs" actually supported all types of device nodes, not just block devices. I don't know whether accessing a character device on another node would ever be useful, but certainly using device-mapper wouldn't help for that case. From phillips at istop.com Mon Aug 2 01:53:46 2004 From: phillips at istop.com (Daniel Phillips) Date: Sun, 1 Aug 2004 21:53:46 -0400 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <410D2949.20503@backtobasicsmgmt.com> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> <410D2949.20503@backtobasicsmgmt.com> Message-ID: <200408012153.46835.phillips@istop.com> On Sunday 01 August 2004 13:32, Kevin P. Fleming wrote: > Daniel Phillips wrote: > > On Saturday 31 July 2004 12:00, Walker, Bruce J wrote: > >>In the 2.4 implementation, providing this one capability by > >>leveraging devfs was quite economic, efficient and has been very stable. > > > > I wonder if device-mapper (slightly hacked) wouldn't be a better approach > > for 2.6+. > > It appeared from the original posting that their "cluster-wide devfs" > actually supported all types of device nodes, not just block devices. I > don't know whether accessing a character device on another node would > ever be useful, but certainly using device-mapper wouldn't help for that > case. Unless device-mapper learned how to deal with char devices... Just a thought. Regards, Daniel From aneesh.kumar at hp.com Mon Aug 2 06:30:58 2004 From: aneesh.kumar at hp.com (Aneesh Kumar K.V) Date: Mon, 02 Aug 2004 12:00:58 +0530 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: References: <2o0e0-6qx-5@gated-at.bofh.it> Message-ID: <410DDFA2.40107@hp.com> Andi Kleen wrote: > "Aneesh Kumar K.V" writes: > > >>Hi, >> >>Sorry for the cross post. I came across this on OpenSSI website. 
I >>guess others may also be interested. > > > > [....] Congratulations. But I was a bit disappointed that there > wasn't a tarball with the kernel patches and other sources. > Any chance to add that to the site? > > I have posted the diff at http://www.openssi.org/contrib/linux-ssi.diff.gz This is against kernel linux-rh-2.4.20-31.9 which can be found in the OpenSSI CVS as srpms/linux-rh-2.4.20-31.9.tar.bz2 $cvs -d:pserver:anonymous at cvs.openssi.org:/cvsroot/ssic-linux login $cvs -z3 -d:pserver:anonymous at cvs.openssi.org:/cvsroot/sic-linux co -r OPENSSI-RH srpms/linux-rh-2.4.20-31.9.tar.bz2 This patch include the IPVS, KDB and OpenSSI changes For 2.6 we are planning to group the changes into small patches that is easy to review. All the other sources can be found as tar.gz at ( http://www.openssi.org/contrib/debian/openssidebs/sources/ )or better by doing apt-get source package on a debian system :) -aneesh From bruce.walker at hp.com Mon Aug 2 00:00:32 2004 From: bruce.walker at hp.com (Walker, Bruce J) Date: Sun, 1 Aug 2004 17:00:32 -0700 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! Message-ID: <3689AF909D816446BA505D21F1461AE4C750EA@cacexc04.americas.cpqcorp.net> When processes can freely and transparently move around the cluster (at exec time, fork time or during any system call), being able to transparently access your controlling tty is pretty handy. In 2.4 we stack our CFS on top of each node's devfs to give us naming of and access to all devices on all nodes. TBD on how will do this in 2.6. Bruce > > > > I wonder if device-mapper (slightly hacked) wouldn't be a > better approach for > > 2.6+. > > It appeared from the original posting that their "cluster-wide devfs" > actually supported all types of device nodes, not just block > devices. I > don't know whether accessing a character device on another node would > ever be useful, but certainly using device-mapper wouldn't > help for that > case. > From raven at themaw.net Mon Aug 2 03:13:39 2004 From: raven at themaw.net (Ian Kent) Date: Mon, 2 Aug 2004 11:13:39 +0800 (WST) Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <410D2949.20503@backtobasicsmgmt.com> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> <410D2949.20503@backtobasicsmgmt.com> Message-ID: On Sun, 1 Aug 2004, Kevin P. Fleming wrote: > Daniel Phillips wrote: > > > On Saturday 31 July 2004 12:00, Walker, Bruce J wrote: > > > >>In the 2.4 implementation, providing this one capability by > >>leveraging devfs was quite economic, efficient and has been very stable. > > > > > > I wonder if device-mapper (slightly hacked) wouldn't be a better approach for > > 2.6+. > > It appeared from the original posting that their "cluster-wide devfs" > actually supported all types of device nodes, not just block devices. I > don't know whether accessing a character device on another node would > ever be useful, but certainly using device-mapper wouldn't help for that > case. Does the reduced function 2.6 devfs still have what's needed? If it does then you should have a fair amount of breathing space. From efocht at gmx.net Mon Aug 2 13:50:39 2004 From: efocht at gmx.net (Erich Focht) Date: Mon, 2 Aug 2004 15:50:39 +0200 Subject: [Linux-cluster] Re: [SSI-devel] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! 
In-Reply-To: <410DDFA2.40107@hp.com> References: <2o0e0-6qx-5@gated-at.bofh.it> <410DDFA2.40107@hp.com> Message-ID: <200408021550.39219.efocht@gmx.net> On Monday 02 August 2004 08:30, Aneesh Kumar K.V wrote: > > [....] Congratulations. But I was a bit disappointed that there > > wasn't a tarball with the kernel patches and other sources. > > Any chance to add that to the site? > > I have posted the diff at > http://www.openssi.org/contrib/linux-ssi.diff.gz Hmmm, that's too huge to get an overview on what it does... The current CVS ci/kernel touches 137 files, openssi/kernel touches 350 files. Plus the ci/kernel.patches and openssi/kernel.patches... > For 2.6 we are planning to group the changes into small patches that is > easy to review. Sounds great! Having groups sorted by functionality will help a lot. When will these be visible in the CVS? Thanks, best regards, Erich From bernd.schumacher at hp.com Tue Aug 3 11:55:42 2004 From: bernd.schumacher at hp.com (Schumacher, Bernd) Date: Tue, 3 Aug 2004 13:55:42 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: Hi, I have three nodes oben, mitte and unten. Test: I have disabled eth0 on mitte, so that mitte will be excluded. Result: Oben and unten are trying to fence mitte and build a new cluster. OK! But mitte tries to fence oben and unten. PROBLEM! Why can this happen? Mitte knows that it can not build a cluster. See Logfile from mitte: "Have 1, need 2" Logfile from mitte: Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating. Aug 3 12:53:17 mitte lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing fence method, manual, on oben. cluster.ccs: cluster { name = "tom" lock_gulm { servers = ["oben", "mitte", "unten"] } } fence.ccs: fence_devices { manual_oben { agent = "fence_manual" } manual_mitte ... nodes.ccs: nodes { oben { ip_interfaces { eth0 = "192.168.100.241" } fence { manual { manual_oben { ipaddr = "192.168.100.241" } } } } mitte ... regards Bernd Schumacher From danderso at redhat.com Tue Aug 3 15:49:26 2004 From: danderso at redhat.com (Derek Anderson) Date: Tue, 3 Aug 2004 10:49:26 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: References: Message-ID: <200408031049.26537.danderso@redhat.com> Bernd, Please see http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128635. There are some outstanding fence issues here. On Tuesday 03 August 2004 06:55, Schumacher, Bernd wrote: > Hi, > I have three nodes oben, mitte and unten. > > Test: > I have disabled eth0 on mitte, so that mitte will be excluded. > > Result: > Oben and unten are trying to fence mitte and build a new cluster. OK! > But mitte tries to fence oben and unten. PROBLEM! > > Why can this happen? Mitte knows that it can not build a cluster. See > Logfile from mitte: "Have 1, need 2" > > Logfile from mitte: > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) expired Aug > 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost slave quorum. Have 1, > need 2. Switching to Arbitrating. Aug 3 12:53:17 mitte > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 12:53:17 mitte > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 pause. Aug > 3 12:53:17 mitte fence_node[2120]: Performing fence method, manual, on > oben. 
> > cluster.ccs: > cluster { > name = "tom" > lock_gulm { > servers = ["oben", "mitte", "unten"] > } > } > > fence.ccs: > fence_devices { > manual_oben { > agent = "fence_manual" > } > manual_mitte ... > > > nodes.ccs: > nodes { > oben { > ip_interfaces { > eth0 = "192.168.100.241" > } > fence { > manual { > manual_oben { > ipaddr = "192.168.100.241" > } > } > } > } > mitte ... > > regards > Bernd Schumacher > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From danderso at redhat.com Tue Aug 3 16:00:06 2004 From: danderso at redhat.com (Derek Anderson) Date: Tue, 3 Aug 2004 11:00:06 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <200408031049.26537.danderso@redhat.com> References: <200408031049.26537.danderso@redhat.com> Message-ID: <200408031100.06404.danderso@redhat.com> Please disregard my last post. Too quick of a scan; thought you were referring to the pubilc CVS branch. On Tuesday 03 August 2004 10:49, Derek Anderson wrote: > Bernd, > > Please see http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128635. > There are some outstanding fence issues here. > > On Tuesday 03 August 2004 06:55, Schumacher, Bernd wrote: > > Hi, > > I have three nodes oben, mitte and unten. > > > > Test: > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > Result: > > Oben and unten are trying to fence mitte and build a new cluster. OK! > > But mitte tries to fence oben and unten. PROBLEM! > > > > Why can this happen? Mitte knows that it can not build a cluster. See > > Logfile from mitte: "Have 1, need 2" > > > > Logfile from mitte: > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) expired Aug > > 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost slave quorum. Have 1, > > need 2. Switching to Arbitrating. Aug 3 12:53:17 mitte > > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 12:53:17 mitte > > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 pause. Aug > > 3 12:53:17 mitte fence_node[2120]: Performing fence method, manual, on > > oben. > > > > cluster.ccs: > > cluster { > > name = "tom" > > lock_gulm { > > servers = ["oben", "mitte", "unten"] > > } > > } > > > > fence.ccs: > > fence_devices { > > manual_oben { > > agent = "fence_manual" > > } > > manual_mitte ... > > > > > > nodes.ccs: > > nodes { > > oben { > > ip_interfaces { > > eth0 = "192.168.100.241" > > } > > fence { > > manual { > > manual_oben { > > ipaddr = "192.168.100.241" > > } > > } > > } > > } > > mitte ... > > > > regards > > Bernd Schumacher > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From mtilstra at redhat.com Tue Aug 3 16:12:47 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 11:12:47 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <200408031049.26537.danderso@redhat.com> References: <200408031049.26537.danderso@redhat.com> Message-ID: <20040803161247.GA6095@redhat.com> On Tue, Aug 03, 2004 at 10:49:26AM -0500, Derek Anderson wrote: > Bernd, > > Please see http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128635. There > are some outstanding fence issues here. except that bug is on cman, and this is about gulm. 
> On Tuesday 03 August 2004 06:55, Schumacher, Bernd wrote: > > Hi, > > I have three nodes oben, mitte and unten. > > > > Test: > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > Result: > > Oben and unten are trying to fence mitte and build a new cluster. OK! > > But mitte tries to fence oben and unten. PROBLEM! Actually not a problem, just not what you expected. Hopefully I can explain why... (You have a netsplit: neither side knows what the other is doing, and each must assume that the other is dead and that it is right.) > > Why can this happen? Mitte knows that it can not build a cluster. See > > Logfile from mitte: "Have 1, need 2" So looking at what you gave below, mitte was master. (making this guess from the "Core lost slave quorum" part of the message below.) It knows that it doesn't have quorum, but it is still going to try to be the Master. It does not know "that it can not build a cluster." The only thing it knows right now about the other nodes is that they failed to send heartbeats. Therefore they must have left the cluster abnormally. Therefore it must fence them. The other two nodes see that mitte has failed to reply to heartbeats. Therefore it must have left the cluster abnormally. Therefore it must be fenced.
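A minimal sketch of the decision each node is making here (illustration only, not the actual lock_gulmd code; fence_node() and log_msg() are assumed helpers, and MAX_MISSED is an arbitrary threshold):

    /*
     * Illustration only -- not the actual lock_gulmd code.  Each node makes
     * this decision on its own when a peer stops heartbeating: it cannot
     * tell a dead peer from a netsplit, so it expires the peer and tries to
     * fence it, whether or not it still has quorum.
     */
    #define MAX_MISSED 3

    struct peer {
        const char *name;
        int missed_heartbeats;
        int expired;
    };

    extern int fence_node(const char *name);     /* assumed helper */
    extern void log_msg(const char *fmt, ...);   /* assumed helper */

    static void check_peer(struct peer *p, int have_quorum)
    {
        if (p->missed_heartbeats < MAX_MISSED)
            return;                 /* peer still answering, nothing to do */

        p->expired = 1;
        log_msg("Client (%s) expired\n", p->name);

        /*
         * gulm-style behaviour as described above: fence even while merely
         * Arbitrating without quorum.  A quorum-gated design would return
         * here when !have_quorum.
         */
        (void)have_quorum;
        fence_node(p->name);
    }
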
Both sides of the netsplit are trying to resolve things to regain the cluster. From an outsiders view point (which you and I have, the nodes do not.) We can see that mitte's attempts are futile, oben and unten will get control of the cluter. But the node cannot see this. This is what makes netsplits kind of ugly. (using ifdown to test cluster stuff causes extra confusion in my opinion. because you actually are creating a netsplit case. Not a simpler node down case. The power switch is nice for this.) I hope that made some sence. -- Michael Conrad Tadpol Tilstra Blood is thicker than water, and much tastier. From bernd.schumacher at hp.com Tue Aug 3 16:44:06 2004 From: bernd.schumacher at hp.com (Schumacher, Bernd) Date: Tue, 3 Aug 2004 18:44:06 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: before I tried with manual fencing I tried this with automatic fencing (fence_rib). And always mitte was faster and fenced oben and unten. This means, one faulty node can reboot all other nodes. I think this is not ok. And even after reboot the problem is not solved, because the faulty node is still faulty. A node should only be allowed to fence if it is Master and if it has the qourum. And never if it is in arbitrating mode. > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Steve Landherr > Sent: Dienstag, 3. August 2004 18:23 > To: Discussion of clustering software components including GFS > Subject: RE: [Linux-cluster] GFS 6.0 node without quorum > tries to fence > > > In a netsplit, what does fencing achieve when done by a node > that doesn't have quorum? It still won't have quorum. It > should probably just clean up as best it can and leave the > rest of the cluster alone. > > -steve > -- > Steve Landherr -- landherr at kazeon.com > Kazeon Systems, Inc. > Mountain View, California > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On > Behalf Of > Michael Conrad Tadpol Tilstra > Sent: Tuesday, August 03, 2004 9:13 AM > To: Discussion of clustering software components including GFS > Subject: Re: [Linux-cluster] GFS 6.0 node without quorum > tries to fence > > So looking at what you gave below, mitte was master. (making > this guess from the "Core lost slave quorum" part of the > message below.) It knows that it doesn't have quorum, it > still is going to try to be the Master. It does not know > "that it can not build a cluster." The only thing it knows > right now about the other nodes is that they failed to send > heartbeats. Therefor they must have left the cluter > abnormally. Therefor it must fence them. > > The other two nodes see that mitte have failed to reply to > heartbeats. Therefor it must have left the cluster > abnormally. Therefor it must be fenced. > > Both sides of the netsplit are trying to resolve things to > regain the cluster. From an outsiders view point (which you > and I have, the nodes do not.) We can see that mitte's > attempts are futile, oben and unten will get control of the > cluter. But the node cannot see this. > > This is what makes netsplits kind of ugly. > > (using ifdown to test cluster stuff causes extra confusion in > my opinion. because you actually are creating a netsplit > case. Not a simpler node down case. The power switch is > nice for this.) > > > I hope that made some sence. > > -- > Michael Conrad Tadpol Tilstra > Blood is thicker than water, and much tastier. 
> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-> cluster > From laza at yu.net Tue Aug 3 17:17:51 2004 From: laza at yu.net (Lazar Obradovic) Date: Tue, 03 Aug 2004 19:17:51 +0200 Subject: [Linux-cluster] Multicast for GFS? Message-ID: <1091553471.16747.165.camel@laza.eunet.yu> Hi, can someone, please, give some advice about configuring multicast with GFS? I know it might go out of topic, but it's perhaps useful for others. I'd use broadcast instead, but I have a problem that two groups of servers sharing the same storage, but that are located in different vlans, separated by router-on-a-stick, so I guess I have to use multicast. I've configured the router for multicast (config is right below), but it doesn't seem to work. Here's ascii pic of what I'm trying to make: +--------+ | router | +--------+ / \ +----------+ +----------+ | switch A | | switch B | | vlan 100 | | vlan 200 | +----------+ +----------+ | | +----------+ +----------+ | server A | | server B | +----------+ +----------+ | | +---------------------+ | san / storage | +---------------------+ and relevant config (that I made this far): router (cisco ios): ip multicast-routing ! interface FastEthernet0/0 description Branch A ip address 1.1.1.1 255.255.255.0 ip pim sparse-dense-mode ip igmp version 1 encapsulation dot1q 100 ! interface FastEthernet0/1 description Branch B ip address 1.1.2.1 255.255.255.0 ip pim sparse-dense-mode ip igmp version 1 encapsulation dot1q 200 ! ip pim send-rp-announce FastEthernet0/0 scope 16 ip pim send-rp-discovery scope 16 ! switch A and switch B are manageable Intel switches (dunno the exact model; they are bundled with my IBM Blades), but have IGMP Snooping turned on on every interface, and show default cisco pim groups (224.0.0.40 and 224.0.0.39) on upstream ports. /etc/cluster/cluster.conf for each cluster node is same (only important part of config is here, ask for more, if needed): hosts ping each other, so networking part, as far as basic ip and unicast is concerned, is working properly. When starting, cman_tool says: cluster # cman_tool join -d multicast address 224.0.0.11 if eth0 for mcast address 224.0.0.11 setup up interface for address: node1 and as I can see from strace, "cman_tool join", this is what happens: socket(0x1f /* PF_??? */, SOCK_DGRAM, 2) = 3 ioctl(3, 0x780b, 0x2) = 0 setsockopt(3, 0x2 /* SOL_?? */, 109, [6516590], 4) = 0 [...] socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 bind(4, {sa_family=AF_INET, sin_port=htons(6809), sin_addr=inet_addr("224.0.0.11")}, 16) = 0 socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 5 bind(5, {sa_family=AF_INET, sin_port=htons(6809), sin_addr=inet_addr("1.1.1.2")}, 16) = 0 setsockopt(3, 0x2 /* SOL_?? */, 100, "\4\0\0\0\0\0\0\0", 8) = 0 setsockopt(3, 0x2 /* SOL_?? */, 103, "\5\0\0\0\0\0\0\0", 8) = 0 setsockopt(3, 0x2 /* SOL_?? */, 101, "\1\344\377\277\3\0\0\0\0\0\0\0\1\0\0\0smtp\0\'\1@\210\0"..., 36) = 0 close(3) = 0 exit_group(0) = ? I've checked some programming examples on multicast as well as code for cman, and I thing cman_tool/join.c has two problems: - it never seems to issue setsockopt(..., IP_ADD_MEMBERSHIP...), thus, never joins the group. I believe the problem is in if (!bcast) check, which, if replaced with "if (bhe)" should work fine... - it binds the socket with multicast address (fd = 4 in my case) instead of local address. If the examples I looked are true, one should bind local interface, and then specify mcast address in setsockopt call. 
Can someone comment this issue? Am I going in completly wrong direction, or multicast support isn't ready yet? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From mtilstra at redhat.com Tue Aug 3 17:58:12 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 12:58:12 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <4E022DDAB8F45741914ACD6EDFE2309B324A8D@BIGFOOT.kazeon.local> References: <4E022DDAB8F45741914ACD6EDFE2309B324A8D@BIGFOOT.kazeon.local> Message-ID: <20040803175812.GA6470@redhat.com> On Tue, Aug 03, 2004 at 09:22:59AM -0700, Steve Landherr wrote: > In a netsplit, what does fencing achieve when done by a node that > doesn't have quorum? It still won't have quorum. It should probably > just clean up as best it can and leave the rest of the cluster alone. Think of the world as the nodes see it, not as you see it. You cannot tell if it is a netsplit or truely nodes dieing. It looks the same. -- Michael Conrad Tadpol Tilstra Caution: breathing may be hazardous to your health. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From alewis at redhat.com Tue Aug 3 18:03:07 2004 From: alewis at redhat.com (AJ Lewis) Date: Tue, 3 Aug 2004 13:03:07 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <20040803175812.GA6470@redhat.com> References: <4E022DDAB8F45741914ACD6EDFE2309B324A8D@BIGFOOT.kazeon.local> <20040803175812.GA6470@redhat.com> Message-ID: <20040803180307.GD25464@null.msp.redhat.com> On Tue, Aug 03, 2004 at 12:58:12PM -0500, Michael Conrad Tadpol Tilstra wrote: > On Tue, Aug 03, 2004 at 09:22:59AM -0700, Steve Landherr wrote: > > In a netsplit, what does fencing achieve when done by a node that > > doesn't have quorum? It still won't have quorum. It should probably > > just clean up as best it can and leave the rest of the cluster alone. > > Think of the world as the nodes see it, not as you see it. You cannot > tell if it is a netsplit or truely nodes dieing. It looks the same. Or in other words, if 2 of 3 nodes go AWOL, do you really want your cluster to stop until someone comes in an manually starts things up again - especially if in the meantime, the 2 bad nodes can write to the disks? The fact that a single node can kill of the other nodes is a good thing! -- AJ Lewis Voice: 612-638-0500 Red Hat Inc. E-Mail: alewis at redhat.com 720 Washington Ave. SE, Suite 200 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -----Begin Obligatory Humorous Quote---------------------------------------- Behind every good computer -- is a jumble of wires 'n stuff. -----End Obligatory Humorous Quote------------------------------------------ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From landherr at kazeon.com Tue Aug 3 18:03:40 2004 From: landherr at kazeon.com (Steve Landherr) Date: Tue, 3 Aug 2004 11:03:40 -0700 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: <4E022DDAB8F45741914ACD6EDFE2309B324AAD@BIGFOOT.kazeon.local> Agreed. But I though that the quorum concept was there to protect against a netsplit causing independent clusters to form. If I'm a node and I can't contact enough peers to gain quorum, then I can't be a part of the cluster. If I'm not part of the cluster, then I shouldn't be fencing other nodes. Am I missing something? -steve -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Conrad Tadpol Tilstra Sent: Tuesday, August 03, 2004 10:58 AM To: Discussion of clustering software components including GFS Subject: Re: [Linux-cluster] GFS 6.0 node without quorum tries to fence On Tue, Aug 03, 2004 at 09:22:59AM -0700, Steve Landherr wrote: > In a netsplit, what does fencing achieve when done by a node that > doesn't have quorum? It still won't have quorum. It should probably > just clean up as best it can and leave the rest of the cluster alone. Think of the world as the nodes see it, not as you see it. You cannot tell if it is a netsplit or truely nodes dieing. It looks the same. From laza at yu.net Tue Aug 3 18:04:39 2004 From: laza at yu.net (Lazar Obradovic) Date: Tue, 03 Aug 2004 20:04:39 +0200 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1091553471.16747.165.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> Message-ID: <1091556279.30938.179.camel@laza.eunet.yu> Just an update, I've wrote my own mcast server and client based on the examples i've got, and they work perfectly with this kind of network setup (separate vlans and everything else), so I guess problem is actually in cman and it's mcast interface... I'll try to correct this today and send a patch. On Tue, 2004-08-03 at 19:17, Lazar Obradovic wrote: > Hi, > > can someone, please, give some advice about configuring multicast with > GFS? I know it might go out of topic, but it's perhaps useful for > others. > > I'd use broadcast instead, but I have a problem that two groups of > servers sharing the same storage, but that are located in different > vlans, separated by router-on-a-stick, so I guess I have to use > multicast. > > I've configured the router for multicast (config is right below), but it > doesn't seem to work. > > Here's ascii pic of what I'm trying to make: > > +--------+ > | router | > +--------+ > / \ > +----------+ +----------+ > | switch A | | switch B | > | vlan 100 | | vlan 200 | > +----------+ +----------+ > | | > +----------+ +----------+ > | server A | | server B | > +----------+ +----------+ > | | > +---------------------+ > | san / storage | > +---------------------+ > > and relevant config (that I made this far): > > router (cisco ios): > > ip multicast-routing > ! > interface FastEthernet0/0 > description Branch A > ip address 1.1.1.1 255.255.255.0 > ip pim sparse-dense-mode > ip igmp version 1 > encapsulation dot1q 100 > ! > interface FastEthernet0/1 > description Branch B > ip address 1.1.2.1 255.255.255.0 > ip pim sparse-dense-mode > ip igmp version 1 > encapsulation dot1q 200 > ! > ip pim send-rp-announce FastEthernet0/0 scope 16 > ip pim send-rp-discovery scope 16 > ! 
> > switch A and switch B are manageable Intel switches (dunno the exact > model; they are bundled with my IBM Blades), but have IGMP Snooping > turned on on every interface, and show default cisco pim groups > (224.0.0.40 and 224.0.0.39) on upstream ports. > > /etc/cluster/cluster.conf for each cluster node is same (only important > part of config is here, ask for more, if needed): > > > > > > > > > > > > > > > > hosts ping each other, so networking part, as far as basic ip and > unicast is concerned, is working properly. > > When starting, cman_tool says: > cluster # cman_tool join -d > multicast address 224.0.0.11 > if eth0 for mcast address 224.0.0.11 > setup up interface for address: node1 > > and as I can see from strace, "cman_tool join", this is what happens: > > socket(0x1f /* PF_??? */, SOCK_DGRAM, 2) = 3 > ioctl(3, 0x780b, 0x2) = 0 > setsockopt(3, 0x2 /* SOL_?? */, 109, [6516590], 4) = 0 > [...] > socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 > bind(4, {sa_family=AF_INET, sin_port=htons(6809), sin_addr=inet_addr("224.0.0.11")}, 16) = 0 > socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 5 > bind(5, {sa_family=AF_INET, sin_port=htons(6809), sin_addr=inet_addr("1.1.1.2")}, 16) = 0 > setsockopt(3, 0x2 /* SOL_?? */, 100, "\4\0\0\0\0\0\0\0", 8) = 0 > setsockopt(3, 0x2 /* SOL_?? */, 103, "\5\0\0\0\0\0\0\0", 8) = 0 > setsockopt(3, 0x2 /* SOL_?? */, 101, "\1\344\377\277\3\0\0\0\0\0\0\0\1\0\0\0smtp\0\'\1@\210\0"..., 36) = 0 > close(3) = 0 > exit_group(0) = ? > > I've checked some programming examples on multicast as well as code for > cman, and I thing cman_tool/join.c has two problems: > > - it never seems to issue setsockopt(..., IP_ADD_MEMBERSHIP...), thus, > never joins the group. I believe the problem is in if (!bcast) check, > which, if replaced with "if (bhe)" should work fine... > > - it binds the socket with multicast address (fd = 4 in my case) instead > of local address. If the examples I looked are true, one should bind > local interface, and then specify mcast address in setsockopt call. > > Can someone comment this issue? Am I going in completly wrong direction, > or multicast support isn't ready yet? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From mtilstra at redhat.com Tue Aug 3 18:44:37 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 13:44:37 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <4E022DDAB8F45741914ACD6EDFE2309B324AAD@BIGFOOT.kazeon.local> References: <4E022DDAB8F45741914ACD6EDFE2309B324AAD@BIGFOOT.kazeon.local> Message-ID: <20040803184437.GA6865@redhat.com> On Tue, Aug 03, 2004 at 11:03:40AM -0700, Steve Landherr wrote: > Agreed. But I though that the quorum concept was there to protect > against a netsplit causing independent clusters to form. If I'm a node > and I can't contact enough peers to gain quorum, then I can't be a part > of the cluster. If I'm not part of the cluster, then I shouldn't be > fencing other nodes. > > Am I missing something? In a lights out setup. (no user required to keep things running) 2 of three nodes go AWOL. 
One node now needs to reset (NPS fencing) the other two, without quorum, to keep things running. -- Michael Conrad Tadpol Tilstra A hacker is a machine for turning caffeine into code. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From landherr at kazeon.com Tue Aug 3 19:03:15 2004 From: landherr at kazeon.com (Steve Landherr) Date: Tue, 3 Aug 2004 12:03:15 -0700 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> But if you do it that way, and you really have a netsplit, won't you get into a "quickdraw" situation where each of the newly formed clusters are trying to fence out the others? In the worst case, all the nodes get reset and nobody is happy. But maybe the worst case happens so infrequently that it is better than always losing the cluster whenever quorum is lost. But then again, from any one node's perspective, how often do multiple nodes drop out of a cluster at the same time and the problem is not either a netsplit or a glitch on the local node? Just pondering... -steve -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Conrad Tadpol Tilstra Sent: Tuesday, August 03, 2004 11:45 AM To: Discussion of clustering software components including GFS Subject: Re: [Linux-cluster] GFS 6.0 node without quorum tries to fence In a lights out setup. (no user required to keep things running) 2 of three nodes go AWOL. One node now needs to reset (NPS fencing) the other two, without quorum, to keep things running. From gshi at ncsa.uiuc.edu Tue Aug 3 19:19:43 2004 From: gshi at ncsa.uiuc.edu (Guochun Shi) Date: Tue, 03 Aug 2004 14:19:43 -0500 Subject: [Linux-cluster] gnbd: finiband supprt and multi-device to one device mapping Message-ID: <5.1.0.14.2.20040803140638.03a99680@pop.ncsa.uiuc.edu> hi, I am interested in adding 2 features in gnbd: 1. finiband support for communication between a gnbd client and a gnbd server. 2. multi-device to one device mapping: multiples servers export their devices and one client import from those servers. The client sees one device, who is a wrapper device for all devices in servers. Writing/reading to the device will be distributed to different servers according to different writing/reading position, size, etc .... I did not find any design document for gnbd so far. I will certainly be grateful if someone can point me to any URL for that. comments and suggestions are welcome. thanks -Guochun From mtilstra at redhat.com Tue Aug 3 19:41:54 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 3 Aug 2004 14:41:54 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> References: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> Message-ID: <20040803194154.GA7790@redhat.com> On Tue, Aug 03, 2004 at 12:03:15PM -0700, Steve Landherr wrote: > But if you do it that way, and you really have a netsplit, won't you get > into a "quickdraw" situation where each of the newly formed clusters are > trying to fence out the others? In the worst case, all the nodes get > reset and nobody is happy. But maybe the worst case happens so > infrequently that it is better than always losing the cluster whenever > quorum is lost. yes, worst case everyone reboots. BUT! 
the data on disc is safe. There is probbly a better way to go about this, but we've currently kept the idea that an unexpected reboot is better than looking for backup tapes. > But then again, from any one node's perspective, how often do multiple > nodes drop out of a cluster at the same time and the problem is not > either a netsplit or a glitch on the local node? no idea. > Just pondering... cool. -- Michael Conrad Tadpol Tilstra It's never too late to have a happy childhood. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From smelkovs at worldsoft.ch Tue Aug 3 20:04:53 2004 From: smelkovs at worldsoft.ch (Konrads Smelkovs) Date: Tue, 03 Aug 2004 22:04:53 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <20040803194154.GA7790@redhat.com> References: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> <20040803194154.GA7790@redhat.com> Message-ID: <410FEFE5.8090301@worldsoft.ch> Michael Conrad Tadpol Tilstra wrote: > >yes, worst case everyone reboots. BUT! the data on disc is safe. There >is probbly a better way to go about this, but we've currently kept the >idea that an unexpected reboot is better than looking for backup tapes. > > > I don't think it is smart enough. This kind of assumes that the fencing method is power. Suppose people are running only on san fencing. From bmarzins at redhat.com Tue Aug 3 20:25:15 2004 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Tue, 3 Aug 2004 15:25:15 -0500 Subject: [Linux-cluster] gnbd: finiband supprt and multi-device to one device mapping In-Reply-To: <5.1.0.14.2.20040803140638.03a99680@pop.ncsa.uiuc.edu> References: <5.1.0.14.2.20040803140638.03a99680@pop.ncsa.uiuc.edu> Message-ID: <20040803202515.GR23619@phlogiston.msp.redhat.com> On Tue, Aug 03, 2004 at 02:19:43PM -0500, Guochun Shi wrote: > hi, > > I am interested in adding 2 features in gnbd: > > 1. finiband support for communication between a gnbd client and a gnbd server. > > 2. multi-device to one device mapping: multiples servers export their devices and one client import from those servers. The client sees one device, who is a wrapper device for all devices in servers. Writing/reading to the device will be distributed to different servers according to different writing/reading position, size, etc .... > > I did not find any design document for gnbd so far. I will certainly be grateful if someone can point me to any URL for that. > > comments and suggestions are welcome. There isn't very much in the way of current GNBD documentation https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage has the most current info Otherwise, there are the man pages, which I haven't got around to updating yet, but are still mostly accurate. Oh, and http://www.redhat.com/docs/manuals/csgfs/admin-guide/ This is the old administrator's guide, but again, it's mostly uptodate. As soon as I finish up some coding that needs to get done, I will set to work on some documentation. If this isn't enough, I'd be glad to answer any questions you have, either via email, or via IRC (#linux-cluster on freenode) Good luck. 
-Ben Marzinski bmarzins at redhat.com > thanks > -Guochun > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From patrick.seinguerlet at e-asc.com Tue Aug 3 18:00:00 2004 From: patrick.seinguerlet at e-asc.com (Seinguerlet Patrick) Date: Tue, 3 Aug 2004 20:00:00 +0200 Subject: [Linux-cluster] lock_dlm: init_fence error -1 Message-ID: <003701c47983$b59787e0$0100a8c0@amdk6> When I would like to mount the GFS file system, this messages appear. What can I do? mount -t gfs /dev/test_gfs/lv_test /mnt lock_dlm: init_fence error -1 GFS: can't mount proto = lock_dlm, table = test:partage1, hostdata = mount: permission denied I use a debian and I use the documentation file for install. Patrick From amir at datacore.ch Tue Aug 3 21:02:10 2004 From: amir at datacore.ch (Amir Guindehi) Date: Tue, 03 Aug 2004 23:02:10 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <20040803194154.GA7790@redhat.com> References: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> <20040803194154.GA7790@redhat.com> Message-ID: <410FFD52.2050705@datacore.ch> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi all, |>But if you do it that way, and you really have a netsplit, won't you get |>into a "quickdraw" situation where each of the newly formed clusters are |>trying to fence out the others? In the worst case, all the nodes get |>reset and nobody is happy. But maybe the worst case happens so |>infrequently that it is better than always losing the cluster whenever |>quorum is lost. | | yes, worst case everyone reboots. BUT! the data on disc is safe. There | is probbly a better way to go about this, but we've currently kept the | idea that an unexpected reboot is better than looking for backup tapes. I think you would need a form of /atomic fencing/ )which is not really possible afaik) to resolve this problem. It's a race, and as long as you can't do it atomically you can always reach above worst case state. - - Amir - -- Amir Guindehi, nospam.amir at datacore.ch DataCore GmbH, Witikonerstrasse 289, 8053 Zurich, Switzerland -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2-nr1 (Windows 2000) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBD/1PbycOjskSVCwRAoRcAJ9cq5+FiQnIx817IdEthaB6HTgPTQCg9G3n MLA+ulKC4Jh3BQZLbPq59/0= =Gy9x -----END PGP SIGNATURE----- From amanthei at redhat.com Tue Aug 3 21:17:44 2004 From: amanthei at redhat.com (Adam Manthei) Date: Tue, 3 Aug 2004 16:17:44 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <410FEFE5.8090301@worldsoft.ch> References: <4E022DDAB8F45741914ACD6EDFE2309B324AC2@BIGFOOT.kazeon.local> <20040803194154.GA7790@redhat.com> <410FEFE5.8090301@worldsoft.ch> Message-ID: <20040803211744.GD26705@redhat.com> > > Michael Conrad Tadpol Tilstra wrote: > >yes, worst case everyone reboots. BUT! the data on disc is safe. There > >is probbly a better way to go about this, but we've currently kept the > >idea that an unexpected reboot is better than looking for backup tapes. On Tue, Aug 03, 2004 at 10:04:53PM +0200, Konrads Smelkovs wrote: > I don't think it is smart enough. This kind of assumes that the fencing > method is power. Suppose people are running only on san fencing. Is a node/resource that has access to the cluster only through IP (say if it is a dedicated lock_gulmd server) really fenced if SAN fencing is used? I would argue not. 
After the a SAN fencing action has successfully returned, the lock_gulmd resources still would have access to the rest of the cluster. In this case, you may as well have used /bin/true as your fencing agent. SAN fencing for the lock_gulmd cluster resource is the wrong tool for the job (unless you are doing IP traffic through that SAN). So, you're right, it's not smart enough. What's worse is that it relies upon the admins being smart enough to realize this before their cluster is configured ;) This is probably a point worth adding to our FAQ if it's not already there. -- Adam Manthei From alewis at redhat.com Tue Aug 3 22:11:45 2004 From: alewis at redhat.com (AJ Lewis) Date: Tue, 3 Aug 2004 17:11:45 -0500 Subject: [Linux-cluster] Re: Call for presentation materials; Attendee List In-Reply-To: <200408031427.46419.phillips@redhat.com> References: <200408031427.46419.phillips@redhat.com> Message-ID: <20040803221145.GG25464@null.msp.redhat.com> On Tue, Aug 03, 2004 at 02:27:46PM -0400, Daniel Phillips wrote: > And thanks for your part in making the first-ever Minneapolis Cluster > Summit a great success. Here is the attendee list as promised, in the > form of a massive cc list. You're encouraged to "Reply All" with any > comments, suggestions, gripes, flames or other constructive material. > Please don't worry about generating n-squared traffic, as n is yet low. The attendee list is good for everyone who attended to have, but let's not use it as a primary means of communication (see below) > There may be a few names on the list who didn't actually make it; no > matter, I'd rather err on the side of not leaving anybody out who was > there. If you know of anybody I left out, could you please email me. Trying to keep track of people that need to be added to this CC list is going to be a real pain. I know for a fact at least 3 people from the Red Hat team have been accidentally left off of the list, not to mention what happens if people's e-mail address change, or they don't want to get this traffic anymore. I strongly recommend people post to linux-cluster at redhat.com if they want everyone involved to see what's going on. There is also the added advantage that discussions will be preserved for future reference in the mailing list archives. > Speakers: > > Including non-Red Hat speakers, please send me your slides and any > supporting materials you feel are relevant. Even (pointers to) code > would be fine, and white papers certainly qualify. Even png scans of > napkins might get posted :-) This is all for posting to the community > cluster site. > > Please send attachments, not URL's, except for code and/or links to > project homepages. Please indicate which text/links in your email are > to appear on the web page. Please include a short bio, just a few > words. Anyone on linux-cluster who has material to post on the source.redhat.com/cluster web page, please post to the list asking for its inclusion. > Everybody: > > This cc list is for ongoing discussion of the material we covered at the > summit, in particular: > > * Possible amendments to CMAN interfaces to accommodate your > own project - now is a good time to speak. > > * Possible amendments to GDLM, with a view to adoption by other > projects besides GFS > > * Mainline submission track. Code and [RFC]'s should start > appearing on lkml in October. What code? How changed? What > supporting arguments? Now is the time to sort that out. 
> > * Clustered Samba: anybody out there willing to look at how to > graft oplocks, weird case translations, clustered tdb, etc onto > gfs, please shout > > Anything else that's on your mind Seems to me all this should be discussed on linux-cluster as well. > The conference information page has been updated with the "as built" > conference schedule: > > http://sources.redhat.com/cluster/events/summit2004/info.html > > Thanks once again for your tremendous support. > > Regards, > > Daniel Regards, -- AJ Lewis Voice: 612-638-0500 Red Hat Inc. E-Mail: alewis at redhat.com 720 Washington Ave. SE, Suite 200 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From phillips at redhat.com Wed Aug 4 00:54:25 2004 From: phillips at redhat.com (Daniel Phillips) Date: Tue, 3 Aug 2004 20:54:25 -0400 Subject: [Linux-cluster] Re: Call for presentation materials; Attendee List In-Reply-To: <20040803221145.GG25464@null.msp.redhat.com> References: <200408031427.46419.phillips@redhat.com> <20040803221145.GG25464@null.msp.redhat.com> Message-ID: <200408032054.25297.phillips@redhat.com> On Tuesday 03 August 2004 18:11, AJ Lewis wrote: Anybody who spoke at the cluster summit, please email me their slides and other supporting material. Thanks for your input, AJ. Regards, Daniel From mnerren at paracel.com Wed Aug 4 01:12:01 2004 From: mnerren at paracel.com (micah nerren) Date: Tue, 03 Aug 2004 18:12:01 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040803144019.GA4365@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> Message-ID: <1091581920.8356.257.camel@angmar> Hi, On Tue, 2004-08-03 at 07:40, Michael Conrad Tadpol Tilstra wrote: > On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: > [snip] > > I hope this helps!! > [snip] > > yeah, looks like a stack overflow. > here's a patch that I put in for 6.0. (patch works on 6.0.0-7) > I applied the patch to 6.0.0-7, rebuild the entire package, and I still get the crash when I mount. Below is the text of the crash. Any ideas? I double and triple checked that the patch was indeed applied to the code I was building and it was. 
Thanks, Micah /////////////// Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 printing rip: ffffffff8024a875 PML4 77caf067 PGD 7a78f067 PMD 0 Oops: 0002 CPU 0 Pid: 4056, comm: mount Not tainted RIP: 0010:[]{net_rx_action+213} RSP: 0018:0000010077d93048 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffff806077e8 RCX: ffffffff80607988 RDX: ffffffff806077e8 RSI: 0000010077d68800 RDI: ffffffff806077d0 RBP: ffffffff80607668 R08: 00000000824c6a9c R09: 00000000004c824c R10: 000000000100007f R11: 0000000000000000 R12: ffffffff806077e8 R13: ffffffff806077c0 R14: 000000000000ed06 R15: 0000000000000000 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+261} []{__alloc_pages+156} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Process mount (pid: 4056, stackpage=10077d93000) Stack: 0000010077d93048 0000000000000018 ffffffff8024a84d 0000012a80445d20 0000000000000001 ffffffff80606c60 0000000000000000 000000000000000a 0000000000000000 0000000000000002 ffffffff8012a72e ffffffff80267cf0 0000000000000246 0000000000000000 0000000000000003 ffffffff80445d20 ffffffff80267cc0 0000000000000000 ffffffff802b5915 0000000000000043 0000000000000006 00000100796a109e 000001007c6231c0 0000000000000000 0000000000000000 ffffffff8049c648 0000000000000000 ffffffff806077c0 ffffffff802533a7 ffffffff80267cc0 ffffffff80445d20 0000000000000002 000001007c6231c0 ffffffff805abcd0 00000100796a10ac 000001007c6231c0 0000010077d68800 0000000000000000 0000010077d68800 000001007c623228 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} 
[]{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+261} []{__alloc_pages+156} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Code: 48 89 18 48 89 43 08 8b 85 90 01 00 00 85 c0 79 08 03 85 94 Kernel panic: Fatal exception In interrupt handler - not syncing NMI Watchdog detected LOCKUP on CPU0, eip ffffffff8011a948, registers: CPU 0 Pid: 4056, comm: mount Not tainted RIP: 0010:[]{smp_call_function+120} RSP: 0018:0000010077d92d48 EFLAGS: 00000097 RAX: 0000000000000000 RBX: ffffffff802cfc1a RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff8011a970 RBP: 0000000000000002 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000000 R11: 00000000000003c8 R12: ffffffff802da247 R13: 0000000000000000 R14: 0000000000000002 R15: 0000010077d92f98 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{stop_this_cpu+0} []{smp_send_stop+25} []{panic+312} []{show_trace+666} []{show_stack+205} []{show_registers+304} []{die+268} []{do_page_fault+989} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{error_exit+0} []{net_rx_action+213} []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+261} []{__alloc_pages+156} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Process mount (pid: 4056, stackpage=10077d93000) Stack: 0000010077d92d48 0000000000000018 0000000000100000 0000000000000000 00000100079c4c80 ffffffff803e89a0 0000000000000000 00000100000fdea0 ffffffff803e8d00 00000100079bf000 00000100079d6400 0000000000000042 00000100079de280 ffffff0000000000 000000fffffff000 0000000000000000 00000100079d7a80 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000010077d92d48 0000000000000000 00000000006d9994 0000000000000003 0000000000000000 0000000000000000 0000000100000000 ffffffffffffffff 
ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff Call Trace: []{stop_this_cpu+0} []{smp_send_stop+25} []{panic+312} []{show_trace+666} []{show_stack+205} []{show_registers+304} []{die+268} []{do_page_fault+989} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{error_exit+0} []{net_rx_action+213} []{net_rx_action+173} []{do_softirq+174} []{ip_finish_output2+0} []{dst_output+0} []{do_softirq_thunk+53} []{.text.lock.netfilter+165} []{dst_output+0} []{ip_queue_xmit+1019} []{ip_rcv_finish+0} []{ip_rcv_finish+528} []{nf_hook_slow+305} []{ip_rcv_finish+0} []{tcp_transmit_skb+1295} []{tcp_write_xmit+198} []{tcp_sendmsg+4051} []{inet_sendmsg+69} []{sock_sendmsg+142} []{:lock_gulm:do_tfer+369} []{:lock_gulm:.rodata.str1.1+467} []{:lock_gulm:xdr_send+37} []{:lock_gulm:xdr_enc_flush+56} []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_gulm:gulm_core_login_reply+164} []{:lock_gulm:core_cb+0} []{:lock_gulm:lg_core_handle_messages+315} []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock_gulm:start_gulm_threads+174} []{:lock_gulm:gulm_mount+616} []{:gfs:gfs_glock_cb+0} []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_mount_lockproto+313} []{do_anonymous_page+1234} []{do_no_page+95} []{do_page_fault+627} []{error_exit+0} []{create_elf_tables+261} []{__alloc_pages+156} []{:gfs:gfs_read_super+1307} []{:gfs:gfs_fs_type+0} []{get_sb_bdev+588} []{:gfs:gfs_fs_type+0} []{do_kern_mount+121} []{do_add_mount+161} []{do_mount+345} []{__get_free_pages+16} []{sys_mount+197} []{system_call+119} Code: 39 d0 75 f8 85 c9 74 10 8b 44 24 14 39 d0 74 08 8b 44 24 14 console shuts up ... NM I Watchdog detected LOCKUP on CPU1, eip ffffffff801a5419, registers: From bernd.schumacher at hp.com Wed Aug 4 06:12:51 2004 From: bernd.schumacher at hp.com (Schumacher, Bernd) Date: Wed, 4 Aug 2004 08:12:51 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: So, what I have learned from all answers is very bad news for me. It seems, what happened is as expected by most of you. But this means: ----------------------------------------------------------------------- --- One single point of failure in one node can stop the whole gfs. --- ----------------------------------------------------------------------- The single point of failure is: The lancard specified in "nodes.ccs:ip_interfaces" stops working on one node. No matter if this node was master or slave. The whole gfs is stopped: The rest of the cluster seems to need time to form a new cluster. The bad node does not need so much time for switching to arbitrary mode. So the bad node has enough time to fence all other nodes, before it would be fenced by the new master. The bad node lives but it can not form a cluster. GFS is not working. Now all other nodes will reboot. But after reboot they can not join the cluster, because they can not contact the bad node. The lancard is still broken. GFS is not working. Did I miss something? Please tell me that I am wrong! > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Schumacher, Bernd > Sent: Dienstag, 3. August 2004 13:56 > To: linux-cluster at redhat.com > Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence > > > Hi, > I have three nodes oben, mitte and unten. > > Test: > I have disabled eth0 on mitte, so that mitte will be excluded. 
> > Result: > Oben and unten are trying to fence mitte and build a new > cluster. OK! But mitte tries to fence oben and unten. PROBLEM! > > Why can this happen? Mitte knows that it can not build a > cluster. See Logfile from mitte: "Have 1, need 2" > > Logfile from mitte: > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) > expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost > slave quorum. Have 1, need 2. Switching to Arbitrating. Aug > 3 12:53:17 mitte > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 > 12:53:17 mitte > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 > pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing > fence method, manual, on oben. > > cluster.ccs: > cluster { > name = "tom" > lock_gulm { > servers = ["oben", "mitte", "unten"] > } > } > > fence.ccs: > fence_devices { > manual_oben { > agent = "fence_manual" > } > manual_mitte ... > > > nodes.ccs: > nodes { > oben { > ip_interfaces { > eth0 = "192.168.100.241" > } > fence { > manual { > manual_oben { > ipaddr = "192.168.100.241" > } > } > } > } > mitte ... > > regards > Bernd Schumacher > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-> cluster > From teigland at redhat.com Wed Aug 4 06:51:26 2004 From: teigland at redhat.com (David Teigland) Date: Wed, 4 Aug 2004 14:51:26 +0800 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: References: Message-ID: <20040804065126.GB13816@redhat.com> On Wed, Aug 04, 2004 at 08:12:51AM +0200, Schumacher, Bernd wrote: > So, what I have learned from all answers is very bad news for me. It > seems, what happened is as expected by most of you. But this means: > > ----------------------------------------------------------------------- > --- One single point of failure in one node can stop the whole gfs. --- > ----------------------------------------------------------------------- > > The single point of failure is: > The lancard specified in "nodes.ccs:ip_interfaces" stops working on one > node. No matter if this node was master or slave. > > The whole gfs is stopped: > The rest of the cluster seems to need time to form a new cluster. The > bad node does not need so much time for switching to arbitrary mode. So > the bad node has enough time to fence all other nodes, before it would > be fenced by the new master. > > The bad node lives but it can not form a cluster. GFS is not working. > > Now all other nodes will reboot. But after reboot they can not join the > cluster, because they can not contact the bad node. The lancard is still > broken. GFS is not working. > > Did I miss something? > Please tell me that I am wrong! Although it's still in development/testing, what you're looking for is the way cman/fenced works. When there's a network partition, the group with quorum will fence the group without quorum. If neither has quorum then no one will be fenced and neither side can run. Gulm could probably be designed to do fencing differently but I'm not sure how likely that is at this point. 
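To make that concrete, here is a minimal sketch of the quorum rule -- illustrative code only, not the actual cman/fenced source, and the vote counts are made up. A partition may fence the other side only while it still holds a strict majority of the expected votes:

    /* illustrative only: quorum-gated fencing decision */
    #include <stdio.h>

    static int have_quorum(int votes, int expected_votes)
    {
            /* quorate = strictly more than half of the expected votes */
            return 2 * votes > expected_votes;
    }

    int main(void)
    {
            int expected_votes = 3;   /* e.g. three single-vote nodes */
            int votes_visible = 1;    /* what an isolated node still sees */

            if (have_quorum(votes_visible, expected_votes))
                    printf("quorate: may fence the other partition\n");
            else
                    printf("inquorate: must not fence; suspend activity\n");
            return 0;
    }

Under that rule an isolated node sees only 1 of 3 votes, stays inquorate and never gets the chance to shoot the healthy majority -- which is the behaviour the gulm arbitrating node described above does not have.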
-- Dave Teigland From tom at regio.net Wed Aug 4 08:39:57 2004 From: tom at regio.net (tom at regio.net) Date: Wed, 4 Aug 2004 10:39:57 +0200 Subject: [Linux-cluster] errors on inode.c Message-ID: Hi all, im getting errors on inode.c make[4]: Entering directory `/usr/src/linux-2.6.7' CC [M] /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.o /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.c: In function `inode_init_and_link': /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.c:1139: error: structure has no member named `ar_suiddir' make[5]: *** [/tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.o] Error 1 make[4]: *** [_module_/tmp/rhgfs/cluster/gfs-kernel/src/gfs] Error 2 anyone have an idea? m.f.G. regio[.NET] GmbH, Support Thomas Marmetschke Bahnhofstrasse 16 36037 Fulda Tel. +49 661 25000-0 Fax. +49 661 25000-49 From jeff at intersystems.com Wed Aug 4 10:37:16 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 4 Aug 2004 06:37:16 -0400 Subject: [Linux-cluster] errors on inode.c In-Reply-To: References: Message-ID: <775913971.20040804063716@intersystems.com> Wednesday, August 4, 2004, 4:39:57 AM, tom at regio.net wrote: > Hi all, > im getting errors on inode.c > make[4]: Entering directory `/usr/src/linux-2.6.7' > CC [M] /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.o > /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.c: In function > `inode_init_and_link': > /tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.c:1139: error: structure has no > member named `ar_suiddir' > make[5]: *** [/tmp/rhgfs/cluster/gfs-kernel/src/gfs/inode.o] Error 1 > make[4]: *** [_module_/tmp/rhgfs/cluster/gfs-kernel/src/gfs] Error 2 > anyone have an idea? I ran into this when I moved from one of the snapshots to the cvs-latest. Issue "updatedb" and then "locate gfs_ioctl.h". Remove the copies outside of the source tree. The make script looks for header files in various places other than the source tree and if it finds them, it uses them in preference to the source tree. There may be similar problems with header files for cman-kernel and gfs-kernel. Also, the libraries moved between the snapshots and latest so if you did install the snapshot you need to execute: rm -rf /lib/libmagma* /lib/magma /lib/libgulm* rm -rf /lib/libccs* /lib/libdlm* before you build from cvs. From alewis at redhat.com Wed Aug 4 13:54:20 2004 From: alewis at redhat.com (AJ Lewis) Date: Wed, 4 Aug 2004 08:54:20 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: References: Message-ID: <20040804135420.GI25464@null.msp.redhat.com> On Wed, Aug 04, 2004 at 08:12:51AM +0200, Schumacher, Bernd wrote: > So, what I have learned from all answers is very bad news for me. It > seems, what happened is as expected by most of you. But this means: > > ----------------------------------------------------------------------- > --- One single point of failure in one node can stop the whole gfs. --- > ----------------------------------------------------------------------- > > The single point of failure is: > The lancard specified in "nodes.ccs:ip_interfaces" stops working on one > node. No matter if this node was master or slave. > > The whole gfs is stopped: > The rest of the cluster seems to need time to form a new cluster. The > bad node does not need so much time for switching to arbitrary mode. So > the bad node has enough time to fence all other nodes, before it would > be fenced by the new master. > > The bad node lives but it can not form a cluster. GFS is not working. > > Now all other nodes will reboot. 
But after reboot they can not join the > cluster, because they can not contact the bad node. The lancard is still > broken. GFS is not working. > > Did I miss something? > Please tell me that I am wrong! Well, I guess I'm confused how the node with the bad lan card can contact the fencing device to fence the other nodes. If it can't communicate with the other nodes because it's NIC is down, it can't contact the fencing device over that NIC either, right? Or are you using some alternate transport to contact the fencing device? > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > Schumacher, Bernd > > Sent: Dienstag, 3. August 2004 13:56 > > To: linux-cluster at redhat.com > > Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence > > > > > > Hi, > > I have three nodes oben, mitte and unten. > > > > Test: > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > Result: > > Oben and unten are trying to fence mitte and build a new > > cluster. OK! But mitte tries to fence oben and unten. PROBLEM! > > > > Why can this happen? Mitte knows that it can not build a > > cluster. See Logfile from mitte: "Have 1, need 2" > > > > Logfile from mitte: > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) > > expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost > > slave quorum. Have 1, need 2. Switching to Arbitrating. Aug > > 3 12:53:17 mitte > > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 > > 12:53:17 mitte > > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 > > pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing > > fence method, manual, on oben. > > > > cluster.ccs: > > cluster { > > name = "tom" > > lock_gulm { > > servers = ["oben", "mitte", "unten"] > > } > > } > > > > fence.ccs: > > fence_devices { > > manual_oben { > > agent = "fence_manual" > > } > > manual_mitte ... > > > > > > nodes.ccs: > > nodes { > > oben { > > ip_interfaces { > > eth0 = "192.168.100.241" > > } > > fence { > > manual { > > manual_oben { > > ipaddr = "192.168.100.241" > > } > > } > > } > > } > > mitte ... > > > > regards > > Bernd Schumacher > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-> cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- AJ Lewis Voice: 612-638-0500 Red Hat Inc. E-Mail: alewis at redhat.com 720 Washington Ave. SE, Suite 200 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -----Begin Obligatory Humorous Quote---------------------------------------- "In this time of war against Osama bin Laden and the oppressive Taliban regime, we are thankful that OUR leader isn't the spoiled son of a powerful politician from a wealthy oil family who is supported by religious fundamentalists, operates through clandestine organizations, has no respect for the democratic electoral process, bombs innocents, and uses war to deny people their civil liberties." --The Boondocks -----End Obligatory Humorous Quote------------------------------------------ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From bernd.schumacher at hp.com Wed Aug 4 14:06:32 2004 From: bernd.schumacher at hp.com (Schumacher, Bernd) Date: Wed, 4 Aug 2004 16:06:32 +0200 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence Message-ID: > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of AJ Lewis > Sent: Mittwoch, 4. August 2004 15:54 > To: Discussion of clustering software components including GFS > Subject: Re: [Linux-cluster] GFS 6.0 node without quorum > tries to fence > > > On Wed, Aug 04, 2004 at 08:12:51AM +0200, Schumacher, Bernd wrote: > > So, what I have learned from all answers is very bad news > for me. It > > seems, what happened is as expected by most of you. But this means: > > > > > ---------------------------------------------------------------------- > > - > > --- One single point of failure in one node can stop the > whole gfs. --- > > > -------------------------------------------------------------- > --------- > > > > The single point of failure is: > > The lancard specified in "nodes.ccs:ip_interfaces" stops working on > > one node. No matter if this node was master or slave. > > > > The whole gfs is stopped: > > The rest of the cluster seems to need time to form a new > cluster. The > > bad node does not need so much time for switching to > arbitrary mode. > > So the bad node has enough time to fence all other nodes, before it > > would be fenced by the new master. > > > > The bad node lives but it can not form a cluster. GFS is > not working. > > > > Now all other nodes will reboot. But after reboot they can not join > > the cluster, because they can not contact the bad node. The > lancard is > > still broken. GFS is not working. > > > > Did I miss something? > > Please tell me that I am wrong! > > Well, I guess I'm confused how the node with the bad lan card > can contact the fencing device to fence the other nodes. If > it can't communicate with the other nodes because it's NIC is > down, it can't contact the fencing device over that NIC > either, right? Or are you using some alternate transport to > contact the fencing device? There is a second admin Lan which is used for fencing. Could I probably use this second admin Lan for GFS Heartbeats too. Can I define two LAN-Cards in "nodes.ccs:ip_interfaces". If this works I would not have a single point of failure anymore. But the documentation seems not to allow this. I will test this tomorrow. > > > > -----Original Message----- > > > From: linux-cluster-bounces at redhat.com > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > > Schumacher, Bernd > > > Sent: Dienstag, 3. August 2004 13:56 > > > To: linux-cluster at redhat.com > > > Subject: [Linux-cluster] GFS 6.0 node without quorum > tries to fence > > > > > > > > > Hi, > > > I have three nodes oben, mitte and unten. > > > > > > Test: > > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > > > Result: > > > Oben and unten are trying to fence mitte and build a new > > > cluster. OK! But mitte tries to fence oben and unten. PROBLEM! > > > > > > Why can this happen? Mitte knows that it can not build a > > > cluster. See Logfile from mitte: "Have 1, need 2" > > > > > > Logfile from mitte: > > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) > > > expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost > > > slave quorum. Have 1, need 2. 
Switching to Arbitrating. Aug > > > 3 12:53:17 mitte > > > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 > > > 12:53:17 mitte > > > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 > > > pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing > > > fence method, manual, on oben. > > > > > > cluster.ccs: > > > cluster { > > > name = "tom" > > > lock_gulm { > > > servers = ["oben", "mitte", "unten"] > > > } > > > } > > > > > > fence.ccs: > > > fence_devices { > > > manual_oben { > > > agent = "fence_manual" > > > } > > > manual_mitte ... > > > > > > > > > nodes.ccs: > > > nodes { > > > oben { > > > ip_interfaces { > > > eth0 = "192.168.100.241" > > > } > > > fence { > > > manual { > > > manual_oben { > > > ipaddr = "192.168.100.241" > > > } > > > } > > > } > > > } > > > mitte ... > > > > > > regards > > > Bernd Schumacher > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > http://www.redhat.com/mailman/listinfo/linux-> cluster > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > -- > AJ Lewis Voice: 612-638-0500 > Red Hat Inc. E-Mail: alewis at redhat.com > 720 Washington Ave. SE, Suite 200 > Minneapolis, MN 55414 > > Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C > 54A8 578C 8715 Grab the key at: > http://people.redhat.com/alewis/gpg.html or > one of the many > keyservers out there... -----Begin Obligatory Humorous > Quote---------------------------------------- > "In this time of war against Osama bin Laden and the > oppressive Taliban regime, we are thankful that OUR leader > isn't the spoiled son of a powerful politician from a wealthy > oil family who is supported by religious fundamentalists, > operates through clandestine organizations, has no respect > for the democratic electoral process, bombs innocents, and > uses war to deny people their civil liberties." --The > Boondocks -----End Obligatory Humorous > Quote------------------------------------------ > From amanthei at redhat.com Wed Aug 4 14:20:22 2004 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 4 Aug 2004 09:20:22 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: References: Message-ID: <20040804142022.GG26705@redhat.com> On Wed, Aug 04, 2004 at 04:06:32PM +0200, Schumacher, Bernd wrote: > > > The single point of failure is: > > > The lancard specified in "nodes.ccs:ip_interfaces" stops working on > > > one node. No matter if this node was master or slave. > > > > > > The whole gfs is stopped: > > > The rest of the cluster seems to need time to form a new cluster. The > > > bad node does not need so much time for switching to > > > arbitrary mode. So the bad node has enough time to fence all other > > > nodes, before it would be fenced by the new master. > > > > > > The bad node lives but it can not form a cluster. GFS is not working. > > > > > > Now all other nodes will reboot. But after reboot they can not join > > > the cluster, because they can not contact the bad node. The > > > lancard is still broken. GFS is not working. > > > > > > Did I miss something? > > > Please tell me that I am wrong! > > > > Well, I guess I'm confused how the node with the bad lan card > > can contact the fencing device to fence the other nodes. If > > it can't communicate with the other nodes because it's NIC is > > down, it can't contact the fencing device over that NIC > > either, right? 
Or are you using some alternate transport to > > contact the fencing device? > > There is a second admin Lan which is used for fencing. > > Could I probably use this second admin Lan for GFS Heartbeats too. Can I > > define two LAN-Cards in "nodes.ccs:ip_interfaces". If this works I would > > not have a single point of failure anymore. But the documentation seems > > not to allow this. > > I will test this tomorrow. GULM does not support multiple ethernet devices. In this case, you would want to architect your network so that the fence devices are on the same network as the heartbeats. However, if you did _NOT_ do that, the problem isn't as bad as you make it out to be. You're correct in thinking that there will be a shootout. One of your gulm servers will try to fence the others, and the others will try to fence the one. When the smoke clears, you will at worst be left with a single server. If that remaining server can no longer talk to the other lock_gulmd servers due to a net split, it will continue to sit in the arbitrating state waiting for the other nodes to login. The other nodes however will be able to start a new generation of the cluster when they restart because they will be quorate. If the other quorate part of the netsplit wins the shootout, you only lose the one node. If this is not acceptable, then you really need to rethink why the heartbeats are not going over the same interface as the fencing device. -Adam > > > > -----Original Message----- > > > > From: linux-cluster-bounces at redhat.com > > > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > > > Schumacher, Bernd > > > > Sent: Dienstag, 3. August 2004 13:56 > > > > To: linux-cluster at redhat.com > > > > Subject: [Linux-cluster] GFS 6.0 node without quorum > > tries to fence > > > > > > > > > > > > Hi, > > > > I have three nodes oben, mitte and unten. > > > > > > > > Test: > > > > I have disabled eth0 on mitte, so that mitte will be excluded. > > > > > > > > Result: > > > > Oben and unten are trying to fence mitte and build a new > > > > cluster. OK! But mitte tries to fence oben and unten. PROBLEM! > > > > > > > > Why can this happen? Mitte knows that it can not build a > > > > cluster. See Logfile from mitte: "Have 1, need 2" > > > > > > > > Logfile from mitte: > > > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) > > > > expired Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost > > > > slave quorum. Have 1, need 2. Switching to Arbitrating. Aug > > > > 3 12:53:17 mitte > > > > lock_gulmd_core[2120]: Gonna exec fence_node oben Aug 3 > > > > 12:53:17 mitte > > > > lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 > > > > pause. Aug 3 12:53:17 mitte fence_node[2120]: Performing > > > > fence method, manual, on oben. > > > > > > > > cluster.ccs: > > > > cluster { > > > > name = "tom" > > > > lock_gulm { > > > > servers = ["oben", "mitte", "unten"] > > > > } > > > > } > > > > > > > > fence.ccs: > > > > fence_devices { > > > > manual_oben { > > > > agent = "fence_manual" > > > > } > > > > manual_mitte ... > > > > > > > > > > > > nodes.ccs: > > > > nodes { > > > > oben { > > > > ip_interfaces { > > > > eth0 = "192.168.100.241" > > > > } > > > > fence { > > > > manual { > > > > manual_oben { > > > > ipaddr = "192.168.100.241" > > > > } > > > > } > > > > } > > > > } > > > > mitte ...
> > > > > > > > regards > > > > Bernd Schumacher > > > > > > > > -- > > > > Linux-cluster mailing list > > > > Linux-cluster at redhat.com > > > > http://www.redhat.com/mailman/listinfo/linux-> cluster > > > > > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > http://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > AJ Lewis Voice: 612-638-0500 > > Red Hat Inc. E-Mail: alewis at redhat.com > > 720 Washington Ave. SE, Suite 200 > > Minneapolis, MN 55414 > > > > Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C > > 54A8 578C 8715 Grab the key at: > > http://people.redhat.com/alewis/gpg.html or > one of the many > > keyservers out there... -----Begin Obligatory Humorous > > Quote---------------------------------------- > > "In this time of war against Osama bin Laden and the > > oppressive Taliban regime, we are thankful that OUR leader > > isn't the spoiled son of a powerful politician from a wealthy > > oil family who is supported by religious fundamentalists, > > operates through clandestine organizations, has no respect > > for the democratic electoral process, bombs innocents, and > > uses war to deny people their civil liberties." --The > > Boondocks -----End Obligatory Humorous > > Quote------------------------------------------ > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From mtilstra at redhat.com Wed Aug 4 15:33:38 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Wed, 4 Aug 2004 10:33:38 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091581920.8356.257.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> Message-ID: <20040804153338.GA10091@redhat.com> On Tue, Aug 03, 2004 at 06:12:01PM -0700, micah nerren wrote: > Hi, > > On Tue, 2004-08-03 at 07:40, Michael Conrad Tadpol Tilstra wrote: > > On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: > > [snip] > > > I hope this helps!! > > [snip] > > > > yeah, looks like a stack overflow. > > here's a patch that I put in for 6.0. (patch works on 6.0.0-7) > > > > I applied the patch to 6.0.0-7, rebuild the entire package, and I still > get the crash when I mount. Below is the text of the crash. > > Any ideas? I double and triple checked that the patch was indeed applied > to the code I was building and it was. well, it could still be a stack overflow, just some other function pushing it over the edge. I'll look over things later. Mostly just looking for things in the stack space of the functions listed in the backtrace for things that can take out of the stack and put onto the heap. (run sentence run!) -- Michael Conrad Tadpol Tilstra Today, I am the bug. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From mtilstra at redhat.com Wed Aug 4 15:40:36 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Wed, 4 Aug 2004 10:40:36 -0500 Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence In-Reply-To: <20040804142022.GG26705@redhat.com> References: <20040804142022.GG26705@redhat.com> Message-ID: <20040804154036.GB10091@redhat.com> On Wed, Aug 04, 2004 at 09:20:22AM -0500, adam manthei wrote: > On Wed, Aug 04, 2004 at 04:06:32PM +0200, Schumacher, Bernd wrote: > > > > The single point of failure is: > > > > The lancard specified in "nodes.ccs:ip_interfaces" stops working on > > > > one node. No matter if this node was master or slave. > > > > > > > > The whole gfs is stopped: > > > > The rest of the cluster seems to need time to form a new cluster. The > > > > bad node does not need so much time for switching to > > > > arbitrary mode. So the bad node has enough time to fence all other > > > > nodes, before it would be fenced by the new master. > > > > > > > > The bad node lives but it can not form a cluster. GFS is not working. > > > > > > > > Now all other nodes will reboot. But after reboot they can not join > > > > the cluster, because they can not contact the bad node. The > > > > lancard is still broken. GFS is not working. > > > > > > > > Did I miss something? > > > > Please tell me that I am wrong! > > > > > > Well, I guess I'm confused how the node with the bad lan card > > > can contact the fencing device to fence the other nodes. If > > > it can't communicate with the other nodes because it's NIC is > > > down, it can't contact the fencing device over that NIC > > > either, right? Or are you using some alternate transport to > > > contact the fencing device? > > > > There is a second admin Lan which is used for fencing. > > > > Could I probably use this second admin Lan for GFS Heartbeats too. Can I > > define two LAN-Cards in "nodes.ccs:ip_interfaces". If this works I would > > not have a single point of failure anymore. But the documentation seems > > not to allow this. > > I will test this tomorrow. > > GULM does not support multiple ethernet devices. In this case, you would > want to architect your network so that the fence devices are on the same > network as the heartbeats. > > However, if you did _NOT_ do that, the problem isn't as bad as you make it out > to be. You're correct in thinking that there will be a shootout. One of > your gulm servers will try to fence the others, and the others will try to > fence the one. When the smoke clears, you will at worst be left with a > single server. If that remaining server can no longer talk to the other > lock_gulmd servers due to a net split, it will continue to sit in the > arbitrating state waiting for the other nodes to login. The other nodes > however will be able to start a new generation of the cluster when they > restart because they will be quorate. If the other quorate part of the > netsplit wins the shootout, you only lose the one node. > > If this is not acceptable, then you really need to rethink why the > heartbeats are not going over the same interface as the fencing device. Unfortunately gulm has not yet had multiple network device support added. We've always meant to, but lacked the time and resources to do it. You really *must* put heartbeats/locktraffic/fencing/etc on the same network device. Things won't work the way they should otherwise.
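For example -- a sketch only, copying the style of the ccs fragments earlier in this thread; the fence device name, address and credentials below are invented -- a network power switch used for fencing would sit on the same 192.168.100.x network that nodes.ccs:ip_interfaces already uses, so heartbeats, lock traffic and fencing all ride on the same interface:

    fence.ccs:
    fence_devices {
        apc1 {
            agent = "fence_apc"
            ipaddr = "192.168.100.250"
            login = "apc"
            passwd = "apc"
        }
    }

    nodes.ccs:
    nodes {
        oben {
            ip_interfaces {
                eth0 = "192.168.100.241"
            }
            fence {
                power {
                    apc1 {
                        port = 1
                    }
                }
            }
        }
        ...
    }

Laid out like that, a node that loses eth0 also loses its path to the fence device, so it cannot shoot the healthy majority in the first place.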
-- Michael Conrad Tadpol Tilstra I used to be indecisive, but now I'm not sure. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From danderso at redhat.com Wed Aug 4 18:07:42 2004 From: danderso at redhat.com (Derek Anderson) Date: Wed, 4 Aug 2004 13:07:42 -0500 Subject: [Linux-cluster] lock_dlm: init_fence error -1 In-Reply-To: <003701c47983$b59787e0$0100a8c0@amdk6> References: <003701c47983$b59787e0$0100a8c0@amdk6> Message-ID: <200408041307.42543.danderso@redhat.com> Patrick, Please attach 'cat /proc/cluster/nodes' and 'cat /proc/cluster/services' from each of the nodes prior to the mount attempt. Also, any messages produced in /var/log/messages from the mount. On Tuesday 03 August 2004 13:00, Seinguerlet Patrick wrote: > When I would like to mount the GFS file system, this messages appear. > What can I do? > > mount -t gfs /dev/test_gfs/lv_test /mnt > lock_dlm: init_fence error -1 > GFS: can't mount proto = lock_dlm, table = test:partage1, hostdata = > mount: permission denied > > I use a debian and I use the documentation file for install. > > Patrick > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From hanafim at asc.hpc.mil Wed Aug 4 18:37:42 2004 From: hanafim at asc.hpc.mil (MAHMOUD HANAFI) Date: Wed, 04 Aug 2004 14:37:42 -0400 Subject: [Linux-cluster] GFS 5.2.1-28.3.0.11 file system corruption Message-ID: <41112CF6.1070609@asc.hpc.mil> We are currently running GFS with 8 IO nodes attached to a DDN S2A. We recently had GFS crashing any time a large number of files were being accessed. It turned out that one of the file systems was corrupted. We discovered the issue only by chance when I ran fsck.gfs on the file system. It ran for 20+ and corrected many corruptions. My question is: how robust is GFS? How can one test a file system for corruption without running fsck? Thanks From phillips at redhat.com Wed Aug 4 15:31:47 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 4 Aug 2004 11:31:47 -0400 Subject: [Linux-cluster] Nag for summit presentation materials Message-ID: <200408041131.47099.phillips@redhat.com> Hi all, It looks like this now: http://sources.redhat.com/cluster/events/summit2004/presentations.html * Patrick, thanks for the slides, but could you please suggest how to distribute them across your three presentations? * Lon, I don't have anything on "Cluster Resource Management", do I? * Mike... mike... earth to mike... :-) * Alan and Bruce, you've got lots of great stuff, can I please have some? * Alasdair, perhaps you didn't see the first two emails? It would be very nice to complete this process today. Regards, Daniel From lhh at redhat.com Wed Aug 4 16:20:41 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 04 Aug 2004 12:20:41 -0400 Subject: [Linux-cluster] Re: Nag for summit presentation materials In-Reply-To: <200408041131.47099.phillips@redhat.com> References: <200408041131.47099.phillips@redhat.com> Message-ID: <1091636441.13608.206.camel@atlantis.boston.redhat.com> On Wed, 2004-08-04 at 11:31 -0400, Daniel Phillips wrote: > * Lon, I don't have anything on "Cluster Resource Management", do I? I sent it to you a few days ago; you noted that you would rename it to "lon.resources.sxi".
In any case: http://metamorphism.com/~lon/resources.sxi From phillips at redhat.com Wed Aug 4 16:28:41 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 4 Aug 2004 12:28:41 -0400 Subject: [Linux-cluster] Re: Nag for summit presentation materials In-Reply-To: <1091636441.13608.206.camel@atlantis.boston.redhat.com> References: <200408041131.47099.phillips@redhat.com> <1091636441.13608.206.camel@atlantis.boston.redhat.com> Message-ID: <200408041228.41434.phillips@redhat.com> On Wednesday 04 August 2004 12:20, Lon Hohberger wrote: > On Wed, 2004-08-04 at 11:31 -0400, Daniel Phillips wrote: > > * Lon, I don't have anything on "Cluster Resource Management", do > > I? > > I sent it to you a few days ago; you noted that you would rename it > to "lon.resources.sxi". Oh, then I put it under the wrong talk, I'll move it. But then, what do I put under "Magma - User level Cluster and Lock manager transparent library interface" ? Regards, Daniel From lhh at redhat.com Wed Aug 4 16:44:45 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 04 Aug 2004 12:44:45 -0400 Subject: [Linux-cluster] Re: Nag for summit presentation materials In-Reply-To: <200408041228.41434.phillips@redhat.com> References: <200408041131.47099.phillips@redhat.com> <1091636441.13608.206.camel@atlantis.boston.redhat.com> <200408041228.41434.phillips@redhat.com> Message-ID: <1091637885.13608.209.camel@atlantis.boston.redhat.com> On Wed, 2004-08-04 at 12:28 -0400, Daniel Phillips wrote: > > Oh, then I put it under the wrong talk, I'll move it. > > But then, what do I put under "Magma - User level Cluster and Lock > manager transparent library interface" ? Oh, right. I'll bring that in tomorrow. It's on by defunct notebook. -- Lon From john.l.villalovos at intel.com Wed Aug 4 19:58:18 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Wed, 4 Aug 2004 12:58:18 -0700 Subject: [Linux-cluster] Re: Nag for summit presentation materials Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B018A9299@orsmsx410> linux-cluster-bounces at redhat.com wrote: > On Wednesday 04 August 2004 12:20, Lon Hohberger wrote: >> On Wed, 2004-08-04 at 11:31 -0400, Daniel Phillips wrote: >>> * Lon, I don't have anything on "Cluster Resource Management", do >>> I? >> >> I sent it to you a few days ago; you noted that you would rename it >> to "lon.resources.sxi". The web page: http://sources.redhat.com/cluster/events/summit2004/presentations.html Cluster resource management Presented by: Lon Hohberger, Red Hat Slides - Cluster Resources Has a link of: file:///src/sources.redhat.cvs/htdocs/events/summit2004/lon.magma.resour ces.sxi Which doesn't work :( John From jeff at intersystems.com Wed Aug 4 22:30:28 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 4 Aug 2004 18:30:28 -0400 Subject: [Linux-cluster] cman doesn't load building out of cvs outside of the kernel Message-ID: <1502355489.20040804183028@intersystems.com> Following the current doc/usage.txt instructions for building outside of the kernel from cvs/latest (as of this afternoon) I get the following error trying to load the cman module. [root at lx3 cman-kernel]# modprobe cman FATAL: Error inserting cman (/lib/modules/2.6.7-smp/kernel/cluster/cman.ko): Operation not permitted [root at lx3 cman-kernel]# dmesg CMAN (built Aug 4 2004 12:34:28) installed NET: Registered protocol family 31 Unable to register cluster socket type Any suggestions on what I need to do to resolve this? 
TIA From buytenh at wantstofly.org Wed Aug 4 23:02:36 2004 From: buytenh at wantstofly.org (Lennert Buytenhek) Date: Thu, 5 Aug 2004 01:02:36 +0200 Subject: [Linux-cluster] cman doesn't load building out of cvs outside of the kernel In-Reply-To: <1502355489.20040804183028@intersystems.com> References: <1502355489.20040804183028@intersystems.com> Message-ID: <20040804230236.GB10696@xi.wantstofly.org> On Wed, Aug 04, 2004 at 06:30:28PM -0400, Jeff wrote: > [root at lx3 cman-kernel]# modprobe cman > FATAL: Error inserting cman (/lib/modules/2.6.7-smp/kernel/cluster/cman.ko): > Operation not permitted > [root at lx3 cman-kernel]# dmesg > > CMAN (built Aug 4 2004 12:34:28) installed > NET: Registered protocol family 31 > Unable to register cluster socket type > > Any suggestions on what I need to do to resolve this? Remove the bluetooth modules that you already have loaded (there is an AF_* identifier conflict still), then manually load cman, dlm, and such. --L From jeff at intersystems.com Thu Aug 5 03:41:45 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 4 Aug 2004 23:41:45 -0400 Subject: [Linux-cluster] Strange behavior(s) of DLM Message-ID: <1909350721.20040804234145@intersystems.com> The attached routine demonstrates some strange behavior in the DLM and it was responsible for the dmesg text at the end of this note. This is on a FC2, SMP box running cvs/latest version of cman and the dlm. Its a 2 CPU box configured with 4 logical CPUs. I have a two node cluster and the two machines are identical as far as I can tell with the exception of which order they are listed in the cluster config file. On node #1 (in the config file) when I run the attached test from two terminals the output looks reasonable. The same as it does if I run it on Tru64 or VMS (more or less). 8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0 18730: over last 9.001 seconds, grant 9807, blkast 0, cancel 0 28403: over last 9.001 seconds, grant 9673, blkast 0, cancel 0 If you shut this down and start it up on node #2 (lx4) you start to get messages that look like: 91280: over last 10.000 seconds, grant 91279, blkast 0, cancel 0 125138: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ 125138: NL Blocking Notification on lockid 0x00010312 (mode 0) 125138: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ 141370: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ 141371: NL Blocking Notification on lockid 0x00010312 (mode 0) 141371: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ 141373: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ There are two strange things about this: 1) why does node #2 behave differently than node #1. I get the same results if I reboot both nodes and only node #2 joins the cluster. This seems to imply that the nodes aren't as identical as I think they are but... They are running the same kernel build and the same source from cvs (moved over as a tar file from one to another). 2) Why is a blocking ast routine associated with a NL lock being triggered. The test code may be a bit hard to follow but you can look at where this message comes from (nlblkrtn) and where nlblkrtn is used (DLM_CVT requests to convert to NULL). This looks like a race condition between queuing a new conversion request and delivering a blocking AST on the existing lock. I'm guessing that the conversion to NL is updating the AST pointers at a time when the blocking AST can still be delivered for the existing lock. 
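One possible defensive measure at the application level -- a sketch only; the struct, field and comments below are made up and not taken from the attached test program, and it assumes the userland libdlm header with its struct dlm_lksb and DLM_LOCK_NL definitions -- is to have the blocking-AST routine simply ignore notifications for a lock whose granted mode is already NL:

    #include <libdlm.h>    /* assumed: struct dlm_lksb, DLM_LOCK_NL */

    struct app_lock {                  /* hypothetical per-lock state */
            struct dlm_lksb lksb;
            int granted_mode;          /* updated by the completion AST */
    };

    /* Blocking-AST callback that tolerates a late or spurious
     * notification delivered while (or after) the lock is being
     * converted down to NL. */
    static void blocking_ast(void *astarg)
    {
            struct app_lock *lk = astarg;

            if (lk->granted_mode == DLM_LOCK_NL)
                    return;            /* nothing to give up; ignore it */

            /* otherwise queue the usual down-conversion here */
    }

That only papers over the symptom, though; the race in the lock module itself still needs fixing.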
I tripped over this because do_dlm_dispatch() ends in /* Call AST */ result.astaddr(result.astparam); return 0; and it doesn't check whether result.astaddr() is null or not. Its not valid to have a NULL completion AST routine but it is valid to have a NULL blocking AST routine. To go a bit further, its pretty common to have a null blocking AST routine on a conversion to NULL because the NULL lock can't block any other locks. dmesg output: ------------------------------------------------------------ CMAN: quorum regained, resuming activity dlm: default: recover event 1 (first) dlm: default: add nodes dlm: got connection from 1 dlm: default: total nodes 2 dlm: default: rebuild resource directory dlm: default: rebuilt 0 resources dlm: default: recover event 1 done dlm: default: recover event 1 finished dlm: default: release lkb with status 3 dlm: lkb id 102c9 remid 0 flags 4000 status 3 rqmode 5 grmode 3 nodeid 0 lqstate 0 lqflags 44 name "Test Lock" flags 4 nodeid 4294967295 ref 0 grant queue 000102c9 gr 5 rq -1 flg 24000 sts 2 node 0 remid 0 lq 0,44 est Lock" default cv 5 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 3 102c9 "Test Lock" default cv 5 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 5 102c9 "Test Lock" default cv 5 1018a "Test Lock" default un 1018a ref 1 flg 4 nodeid 0/-1 "Test Lock" DLM: Assertion failed on line 64 of file /usr/src/cvs/cluster_orig/dlm-kernel/src/rsb.c DLM: assertion: "list_empty(&r->res_grantqueue)" DLM: time = 948604 name "Test Lock" flags 4 nodeid 4294967295 ref 0 convert queue 000102c9 gr 5 rq 0 flg 4000 sts 3 node 0 remid 0 lq 2,44 est Lock" default cv 3 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 3 102c9 "Test Lock" default cv 5 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 102c9 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018adlm: rsb name "Test Lock" nodeid -1 flags 4 ref 0 "Test Lock" default cv 0 1018a "Test Lock" default cv 3 1018a "Test Lock" default cv 5 1018a "Test Lock" default cv 0 1018a "Test Lock" default cv 3 102c9 "Test Lock" default cv 3 1018a "Test Lock" default cv 5 102c9 "Test Lock" default cv 5 1018a "Test Lock" default un 1018a ref 1 flg 4 nodeid 0/-1 "Test Lock" default cv 0 102c9 "Test Lock" DLM: Assertion failed on line 661 of file /usr/src/cvs/cluster_orig/dlm-kernel/src/lockqueue.c DLM: assertion: 
"target_nodeid && target_nodeid != -1" DLM: time = 948606 dlm: lkb id 102c9 remid 0 flags 4000 status 3 rqmode 0 grmode 5 nodeid 0 lqstate 2 lqflags 44 dlm: rsb name "Test Lock" nodeid -1 flags 4 ref 0 target_nodeid 0 ------------[ cut here ]------------ kernel BUG at /usr/src/cvs/cluster_orig/dlm-kernel/src/lockqueue.c:661! invalid operand: 0000 [#1] SMP Modules linked in: dlm cman parport_pc lp parport autofs4 nfs lockd sunrpc e1000 3c59x floppy sg microcode dm_mod uhci_hcd button battery asus_acpi ac ipv6 ext3 jbd aic7xxx sd_mod scsi_mod CPU: 2 EIP: 0060:[] Not tainted EFLAGS: 00010246 (2.6.7-smp) EIP is at send_cluster_request+0x577/0x590 [dlm] eax: 00000001 ebx: f4442810 ecx: f40c5dec edx: 000059ab esi: 00000000 edi: 00000295 ebp: f8ad6b48 esp: f40c5de8 ds: 007b es: 007b ss: 0068 Process cp (pid: 3565, threadinfo=f40c4000 task=f6cca1b0) Stack: f8ad5994 00000000 f8ad6b48 f8ad6e54 000e797e f4442810 f7fcca00 f438f934 f4442810 f4442810 00000002 f438f934 f7fcca00 f8ac6510 f4442810 f7fcca80 f4442810 f7fcca80 f8ac58e4 f7fcca00 f8ad5871 00000000 000102c9 f438f9ad Call Trace: [] remote_stage+0x20/0x50 [dlm] [] convert_lock+0x1a4/0x1d0 [dlm] [] dlm_lock+0x347/0x350 [dlm] [] ast_routine+0x0/0x150 [dlm] [] bast_routine+0x0/0x20 [dlm] [] do_user_lock+0x123/0x220 [dlm] [] ast_routine+0x0/0x150 [dlm] [] bast_routine+0x0/0x20 [dlm] [] sigprocmask+0x59/0xe0 [] dlm_write+0xbb/0xe0 [dlm] [] vfs_write+0xd1/0x120 [] sys_write+0x38/0x60 [] sysenter_past_esp+0x52/0x71 Code: 0f 0b 95 02 48 6b ad f8 e9 09 fc ff ff e8 37 bd ff ff 89 c6 ------------[ cut here ]------------ kernel BUG at /usr/src/cvs/cluster_orig/dlm-kernel/src/rsb.c:64! invalid operand: 0000 [#2] SMP Modules linked in: dlm cman parport_pc lp parport autofs4 nfs lockd sunrpc e1000 3c59x floppy sg microcode dm_mod uhci_hcd button battery asus_acpi ac ipv6 ext3 jbd aic7xxx sd_mod scsi_mod CPU: 3 EIP: 0060:[] Not tainted EFLAGS: 00010246 (2.6.7-smp) EIP is at _release_rsb+0x29d/0x2b0 [dlm] eax: 00000001 ebx: f7fcca80 ecx: f6a93f40 edx: 000059af esi: f438f934 edi: f7fcca00 ebp: 00000000 esp: f6a93f3c ds: 007b es: 007b ss: 0068 Process dlm_astd (pid: 3404, threadinfo=f6a92000 task=f6af48c0) Stack: f8ad640e 00000040 f8ad8370 f8ad83ec 000e797c f7fcca80 f4442ec4 f7fcca00 00000005 f8ac1465 f8adfa80 dd2514c0 000f431c f6d1e640 c2031ce0 f438f934 f7b83f58 f8ac2590 f8ac25b0 f8adfa68 f6a92000 f6a93fb4 f6a93fc0 f8ac1f5a Call Trace: [] process_asts+0xe5/0x1b0 [dlm] [] bast_routine+0x0/0x20 [dlm] [] ast_routine+0x0/0x150 [dlm] [] dlm_astd+0x29a/0x2b0 [dlm] [] default_wake_function+0x0/0x10 [] ret_from_fork+0x6/0x14 [] default_wake_function+0x0/0x10 [] dlm_astd+0x0/0x2b0 [dlm] [] kernel_thread_helper+0x5/0x10 Code: 0f 0b 40 00 70 83 ad f8 e9 43 ff ff ff 8d b6 00 00 00 00 31 [root at lx4 -------------- next part -------------- A non-text attachment was scrubbed... Name: conv_play.c Type: application/octet-stream Size: 20488 bytes Desc: not available URL: From patrick.seinguerlet at e-asc.com Thu Aug 5 09:39:20 2004 From: patrick.seinguerlet at e-asc.com (SEINGUERLET Patrick) Date: Thu, 5 Aug 2004 11:39:20 +0200 Subject: [Linux-cluster] lock_dlm: init_fence error -1 In-Reply-To: <200408041307.42543.danderso@redhat.com> Message-ID: <001201c47ad0$1800d330$1d0224d5@porpat> Thanks for all, but I do a error when I configure my cluster.xml file. -----Message d'origine----- De : linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Derek Anderson Envoy? : mercredi 4 ao?t 2004 20:08 ? 
: Discussion of clustering software components including GFS; Seinguerlet Patrick Objet : Re: [Linux-cluster] lock_dlm: init_fence error -1 Patrick, Please attach 'cat /proc/cluster/nodes' and 'cat /proc/cluster/services' from each of the nodes prior to the mount attempt. Also, any messages produced in /var/log/messages from the mount. On Tuesday 03 August 2004 13:00, Seinguerlet Patrick wrote: > When I would like to mount the GFS file system, this messages appear. > What can I do? > > mount -t gfs /dev/test_gfs/lv_test /mnt > lock_dlm: init_fence error -1 > GFS: can't mount proto = lock_dlm, table = test:partage1, hostdata = > mount: permission denied > > I use a debian and I use the documentation file for install. > > Patrick > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From danderso at redhat.com Thu Aug 5 13:41:41 2004 From: danderso at redhat.com (Derek Anderson) Date: Thu, 5 Aug 2004 08:41:41 -0500 Subject: [Linux-cluster] cman doesn't load building out of cvs outside of the kernel In-Reply-To: <20040804230236.GB10696@xi.wantstofly.org> References: <1502355489.20040804183028@intersystems.com> <20040804230236.GB10696@xi.wantstofly.org> Message-ID: <200408050841.41294.danderso@redhat.com> On Wednesday 04 August 2004 18:02, Lennert Buytenhek wrote: > On Wed, Aug 04, 2004 at 06:30:28PM -0400, Jeff wrote: > > [root at lx3 cman-kernel]# modprobe cman > > FATAL: Error inserting cman > > (/lib/modules/2.6.7-smp/kernel/cluster/cman.ko): Operation not permitted > > [root at lx3 cman-kernel]# dmesg > > > > CMAN (built Aug 4 2004 12:34:28) installed > > NET: Registered protocol family 31 > > Unable to register cluster socket type > > > > Any suggestions on what I need to do to resolve this? > > Remove the bluetooth modules that you already have loaded (there is > an AF_* identifier conflict still), then manually load cman, dlm, and > such. Yep. Or make sure that the cman module is loaded before ccsd is started. http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127019 > > > --L > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From phillips at redhat.com Thu Aug 5 14:34:44 2004 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 5 Aug 2004 10:34:44 -0400 Subject: [Linux-cluster] Another nag for summit presentation materials Message-ID: <200408051034.44608.phillips@redhat.com> If you are cc'd on this mail then there is still a cluster summit presentation that you made, for which I haven't received any presentation materials. Patrick and Lon have both sent me stuff (thanks) but not for every presentation. Alasdair, I haven't seen anything from you. Surely you must have something, somewhere, that you can send. 
Presentations still lacking slides or other supporting material: Patrick: * CMAN - Kernel cluster membership * DLM - Kernel distributed lock manager Lon: * Magma - User level Cluster and Lock manager transparent library interface Alasdair: * CLVM - Architecture and extensions of LVM2 Regards, Daniel From lhh at redhat.com Thu Aug 5 15:40:42 2004 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 05 Aug 2004 11:40:42 -0400 Subject: [Linux-cluster] Cluster Summit Pictures Message-ID: <1091720442.25665.2.camel@atlantis.boston.redhat.com> http://people.redhat.com/lhh/cs-pics/ I didn't include the 2048x1536 originals to save space and bandwidth. These will probably be migrated to sources.redhat.com, but for now, this should work. -- Lon From laza at yu.net Thu Aug 5 18:28:21 2004 From: laza at yu.net (Lazar Obradovic) Date: Thu, 05 Aug 2004 20:28:21 +0200 Subject: [Linux-cluster] bug in cman-kernel / membership.c Message-ID: <1091730500.15503.302.camel@laza.eunet.yu> Just got this when joining one node. CMAN: Waiting to join or form a Linux-cluster CMAN: sending membership request CMAN: got node new-noc Got ENDTRANS from a node not the master: master: 6, sender: 1 CMAN: node new-noc is not responding - removing from the cluster ------------[ cut here ]------------ kernel BUG at /usr/src/cvs/cluster/cman-kernel/src/membership.c:2892! invalid operand: 0000 [#1] PREEMPT SMP Modules linked in: ipv6 qla2300 qla2xxx ohci_hcd gfs lock_dlm lock_harness dlm cman CPU: 2 EIP: 0060:[] Tainted: GF EFLAGS: 00010246 (2.6.7-gentoo-r11) EIP is at elect_master+0x2a/0x41 [cman] eax: 00000080 ebx: 00000080 ecx: f88a4000 edx: 00000000 esi: f8870c08 edi: f8870c00 ebp: f7139fc0 esp: f7139f90 ds: 007b es: 007b ss: 0068 Process cman_memb (pid: 7327, threadinfo=f7138000 task=c22ed1e0) Stack: f7afdb28 f8859d34 f7139fa4 00000001 f8858883 f7bf7494 fffffffb 00000000 f7138000 0000001f 00000000 c0103fb6 00000000 c22ed1e0 c01176e2 00100100 00200200 00000000 00000000 00000000 f88584d8 00000000 00000000 00000000 Call Trace: [] a_node_just_died+0x130/0x181 [cman] [] membership_kthread+0x3ab/0x3e4 [cman] [] ret_from_fork+0x6/0x14 [] default_wake_function+0x0/0x12 [] membership_kthread+0x0/0x3e4 [cman] [] kernel_thread_helper+0x5/0xb Code: 0f 0b 4c 0b a0 51 86 f8 31 c0 5b c3 8b 44 24 08 89 10 8b 42 -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From agk at redhat.com Thu Aug 5 19:13:37 2004 From: agk at redhat.com (Alasdair G Kergon) Date: Thu, 5 Aug 2004 20:13:37 +0100 Subject: [Linux-cluster] Cluster Summit Pictures In-Reply-To: <1091720442.25665.2.camel@atlantis.boston.redhat.com> References: <1091720442.25665.2.camel@atlantis.boston.redhat.com> Message-ID: <20040805191337.GF18235@agk.surrey.redhat.com> On Thu, Aug 05, 2004 at 11:40:42AM -0400, Lon Hohberger wrote: > http://people.redhat.com/lhh/cs-pics/ > These will probably be migrated to sources.redhat.com, but for now, this > should work. We can *link* to them from sources.redhat.com, but photos are best stored elsewhere e.g. in one of the many professional photo gallery websites such as fotango.com, fotopic.net to mention a couple of UK-based ones. 
Alasdair -- agk at redhat.com From phillips at redhat.com Thu Aug 5 19:37:26 2004 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 5 Aug 2004 15:37:26 -0400 Subject: [Linux-cluster] Cluster Summit Pictures In-Reply-To: <20040805191337.GF18235@agk.surrey.redhat.com> References: <1091720442.25665.2.camel@atlantis.boston.redhat.com> <20040805191337.GF18235@agk.surrey.redhat.com> Message-ID: <200408051537.26614.phillips@redhat.com> On Thursday 05 August 2004 15:13, Alasdair G Kergon wrote: > On Thu, Aug 05, 2004 at 11:40:42AM -0400, Lon Hohberger wrote: > > http://people.redhat.com/lhh/cs-pics/ > > > > These will probably be migrated to sources.redhat.com, but for now, > > this should work. > > We can *link* to them from sources.redhat.com, but photos are best > stored elsewhere e.g. in one of the many professional photo gallery > websites such as fotango.com, fotopic.net to mention a couple of > UK-based ones. Where they will summarily disappear in the shifting sands of the internet. Regards, Daniel From laza at yu.net Thu Aug 5 20:02:52 2004 From: laza at yu.net (Lazar Obradovic) Date: Thu, 05 Aug 2004 22:02:52 +0200 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1091556279.30938.179.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> Message-ID: <1091736172.19762.336.camel@laza.eunet.yu> It took some time... Attached is the patch for correcting cman_tool join into ipv4 mcast group. Note to ipv6 developers / users: you may also need to set mcast ttl to higher value via setsockopt() if you want to have cluster with network traversing L3 devices, since linux default ttl is set for local scope (ttl = 1), which would make first router drop the packet. -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- -------------- next part -------------- A non-text attachment was scrubbed... Name: cman-mcast.diff Type: text/x-patch Size: 1150 bytes Desc: not available URL: From phillips at redhat.com Thu Aug 5 20:23:27 2004 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 5 Aug 2004 16:23:27 -0400 Subject: [Linux-cluster] Final nag for summit presentation materials Message-ID: <200408051623.27287.phillips@redhat.com> Hi all, This is the final nag for presentation materials. Patrick, I didn't receive anything for the CMAN or DLM presentations so I extracted portions of your overview and placed them on those talks. Therefore, some material is duplicated. Please send me more stuff if you have issues with this. That leaves only Alasdair with nothing on the site. Alasdair?? Otherwise, it's starting to look decent, though it does tend to send the message "we can't be bothered to post more detailed information". It also sends the message "at least there's something, we're trying". Speakers, please look over your own presentations and at least try all the links. I you think your presentation looks threadbare, just send more material. http://sources.redhat.com/cluster/events/summit2004/presentations.html Thanks to everybody who sent stuff. In general it is first-rate, even if terse. 
Regards, Daniel From mnerren at paracel.com Thu Aug 5 23:52:53 2004 From: mnerren at paracel.com (micah nerren) Date: Thu, 05 Aug 2004 16:52:53 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040804153338.GA10091@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> Message-ID: <1091749973.18842.70.camel@angmar> On Wed, 2004-08-04 at 08:33, Michael Conrad Tadpol Tilstra wrote: > On Tue, Aug 03, 2004 at 06:12:01PM -0700, micah nerren wrote: > > Hi, > > > > On Tue, 2004-08-03 at 07:40, Michael Conrad Tadpol Tilstra wrote: > > > On Mon, Aug 02, 2004 at 01:59:55PM -0700, micah nerren wrote: > > > [snip] > > > > I hope this helps!! > > > [snip] > > > > > > yeah, looks like a stack overflow. > > > here's a patch that I put in for 6.0. (patch works on 6.0.0-7) > > > > > > > I applied the patch to 6.0.0-7, rebuild the entire package, and I still > > get the crash when I mount. Below is the text of the crash. > > > > Any ideas? I double and triple checked that the patch was indeed applied > > to the code I was building and it was. > > well, it could still be a stack overflow, just some other function > pushing it over the edge. I'll look over things later. Mostly just > looking for things in the stack space of the functions listed in the > backtrace for things that can take out of the stack and put onto the > heap. (run sentence run!) FYI, I tried this with a few different HBA's, that didn't work. I thought perhaps it could be some funny interaction with the driver but that doesn't seem to be the case. If there is anything I can do to help, please let me know! Up to and including allowing access to the machines running the software if that will help you debug it. Thanks, Micah From fuscof at cli.di.unipi.it Fri Aug 6 08:45:27 2004 From: fuscof at cli.di.unipi.it (Francesco Fusco) Date: Fri, 6 Aug 2004 10:45:27 +0200 (CEST) Subject: [Linux-cluster] CLVM and redundant RAID levels Message-ID: Hi! I want to use some IDE servers to build a highly available cluster filesystem. I don't have a Fibre Channel/SCSI disk array, only inexpensive IDE disks. Can GFS be a good choice? Does it support redundant RAID levels between GNBD servers? Thanks -- Fusco Francesco From anton at hq.310.ru Fri Aug 6 08:52:50 2004 From: anton at hq.310.ru (Anton Nekhoroshikh) Date: Fri, 6 Aug 2004 12:52:50 +0400 Subject: [Linux-cluster] announce services through DLM Message-ID: <828806003.20040806125250@hq.310.ru> Hi all! Can I announce the services that are running on each node through the DLM? I need to determine which services are running on which nodes, so that requests are sent only to those nodes. -- e-mail: anton at hq.310.ru http://www.310.ru From teigland at redhat.com Fri Aug 6 12:54:29 2004 From: teigland at redhat.com (David Teigland) Date: Fri, 6 Aug 2004 20:54:29 +0800 Subject: [Linux-cluster] Strange behavior(s) of DLM In-Reply-To: <1909350721.20040804234145@intersystems.com> References: <1909350721.20040804234145@intersystems.com> Message-ID: <20040806125429.GG16109@redhat.com> On Wed, Aug 04, 2004 at 11:41:45PM -0400, Jeff wrote: > The attached routine demonstrates some strange > behavior in the DLM and it was responsible for the > dmesg text at the end of this note. > > This is on a FC2, SMP box running cvs/latest version of > cman and the dlm. Its a 2 CPU box configured with 4 logical > CPUs.
> > I have a two node cluster and the two machines are identical > as far as I can tell with the exception of which order they are > listed in the cluster config file. > > On node #1 (in the config file) when I run the attached test from > two terminals the output looks reasonable. The same as it does if > I run it on Tru64 or VMS (more or less). > > 8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0 > 18730: over last 9.001 seconds, grant 9807, blkast 0, cancel 0 > 28403: over last 9.001 seconds, grant 9673, blkast 0, cancel 0 > > If you shut this down and start it up on node #2 (lx4) you start > to get messages that look like: > 91280: over last 10.000 seconds, grant 91279, blkast 0, cancel 0 > 125138: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 125138: NL Blocking Notification on lockid 0x00010312 (mode 0) > 125138: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ > 141370: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 141371: NL Blocking Notification on lockid 0x00010312 (mode 0) > 141371: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ > 141373: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ You're running the program on two nodes at once right? The line with "*" is when I started the program on a second node, so it appears I get the same thing. I don't get any assertion failure, though. That may be the result of changes I've checked in for some other bugs over the past couple days. 57150: over last 10.000 seconds, grant 57149, blkast 0, cancel 0 116825: over last 9.001 seconds, grant 59675, blkast 0, cancel 0 * 123790: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ 123790: NL Blocking Notification on lockid 0x00010373 (mode 0) 123790: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ 123822: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ 123822: NL Blocking Notification on lockid 0x00010373 (mode 0) 123822: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ -- Dave Teigland From jeff at intersystems.com Fri Aug 6 13:35:39 2004 From: jeff at intersystems.com (Jeff) Date: Fri, 6 Aug 2004 09:35:39 -0400 Subject: [Linux-cluster] Strange behavior(s) of DLM In-Reply-To: <20040806125429.GG16109@redhat.com> References: <1909350721.20040804234145@intersystems.com> <20040806125429.GG16109@redhat.com> Message-ID: <1323209748.20040806093539@intersystems.com> Friday, August 6, 2004, 8:54:29 AM, David Teigland wrote: > On Wed, Aug 04, 2004 at 11:41:45PM -0400, Jeff wrote: >> The attached routine demonstrates some strange >> behavior in the DLM and it was responsible for the >> dmesg text at the end of this note. >> >> This is on a FC2, SMP box running cvs/latest version of >> cman and the dlm. Its a 2 CPU box configured with 4 logical >> CPUs. >> >> I have a two node cluster and the two machines are identical >> as far as I can tell with the exception of which order they are >> listed in the cluster config file. >> >> On node #1 (in the config file) when I run the attached test from >> two terminals the output looks reasonable. The same as it does if >> I run it on Tru64 or VMS (more or less). 
>> >> 8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0 >> 18730: over last 9.001 seconds, grant 9807, blkast 0, cancel 0 >> 28403: over last 9.001 seconds, grant 9673, blkast 0, cancel 0 >> >> If you shut this down and start it up on node #2 (lx4) you start >> to get messages that look like: >> 91280: over last 10.000 seconds, grant 91279, blkast 0, cancel 0 >> 125138: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 125138: NL Blocking Notification on lockid 0x00010312 (mode 0) >> 125138: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ >> 141370: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 141371: NL Blocking Notification on lockid 0x00010312 (mode 0) >> 141371: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ >> 141373: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > You're running the program on two nodes at once right? The line with "*" > is when I started the program on a second node, so it appears I get the > same thing. I don't get any assertion failure, though. That may be the > result of changes I've checked in for some other bugs over the past couple > days. > 57150: over last 10.000 seconds, grant 57149, blkast 0, cancel 0 > 116825: over last 9.001 seconds, grant 59675, blkast 0, cancel 0 > * 123790: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 123790: NL Blocking Notification on lockid 0x00010373 (mode 0) > 123790: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ > 123822: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 123822: NL Blocking Notification on lockid 0x00010373 (mode 0) > 123822: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ I'm running the program from two processes on a single node. On the two nodes if I run the program from two processes on node #1, I don't get the above behavior. If I run it from two processes on node #2, I do (the 'NL Blocking'). When you run it from two nodes I suspect you only see the NL blocking on one of the nodes, never on the other one. I'll update the lock module with the recent changes and try to reproduce the assertion failure. The way I produce it is: Starting from both nodes rebooted... install the modules and have both nodes join the cluster. First node #1 then node #2. Run the program on node #1 and ctrl/c it to stop after a minute or so. Start the program on node #2 (one process) and let it run for 10-20 seconds (one or two status lines). Start another copy on node #2. This usually generates the NL messages. CTRL/C that copy and start it again. Maybe CTRL/C the other copy and start it again. At some point after CTRL/Cing and restarting, the program just hangs. At that point the process doesn't respond to CTRL/C any more and dmesg will show the various failure messages. From mtilstra at redhat.com Fri Aug 6 16:45:10 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Fri, 6 Aug 2004 11:45:10 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091749973.18842.70.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> Message-ID: <20040806164510.GA20479@redhat.com> On Thu, Aug 05, 2004 at 04:52:53PM -0700, micah nerren wrote: > > FYI, I tried this with a few different HBA's, that didn't work. 
I > thought perhaps it could be some funny interaction with the driver but > that doesn't seem to be the case. > > If there is anything I can do to help, please let me know! Up to and > including allowing access to the machines running the software if that > will help you debug it. well, at this point I'd try things without the hbas and without gulm. So first off, try mounting gfs using nolock instead of gulm on a single node. Then gets some space on a local drive to put gfs (without pool first) and use gulm to mount that. (kinda pointless other than just seeing if it does an oops.) If that works, put pool onto the local disk and try again. That should give us a good idea of what parts need to be involved to get the oops. -- Michael Conrad Tadpol Tilstra Sharpies don't just sniff themselves. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From lhh at redhat.com Fri Aug 6 19:01:09 2004 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 06 Aug 2004 15:01:09 -0400 Subject: [Linux-cluster] ccsd patch to allow retrieval of child type + CDATA Message-ID: <1091818869.23658.43.camel@atlantis.boston.redhat.com> This is so we have a way to figure out child types as well as the CDATA value. Ex: stuff Old behavior: [root at red lhh]# ccs_test connect Connect successful. Connection descriptor = 0 [root at red lhh]# ccs_test get 0 /cluster/nodes/child::*[1] Get successful. Value = New behavior: [root at red lhh]# ccs_test connect Connect successful. Connection descriptor = 0 [root at red lhh]# ccs_test get 0 /cluster/nodes/child::*[1] Get successful. Value = -------------- next part -------------- A non-text attachment was scrubbed... Name: ccsd-child.patch Type: text/x-patch Size: 900 bytes Desc: not available URL: From mnerren at paracel.com Fri Aug 6 22:03:39 2004 From: mnerren at paracel.com (micah nerren) Date: Fri, 06 Aug 2004 15:03:39 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040806164510.GA20479@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> Message-ID: <1091829819.22512.14.camel@angmar> On Fri, 2004-08-06 at 09:45, Michael Conrad Tadpol Tilstra wrote: > On Thu, Aug 05, 2004 at 04:52:53PM -0700, micah nerren wrote: > > > > FYI, I tried this with a few different HBA's, that didn't work. I > > thought perhaps it could be some funny interaction with the driver but > > that doesn't seem to be the case. > > > > If there is anything I can do to help, please let me know! Up to and > > including allowing access to the machines running the software if that > > will help you debug it. > > well, at this point I'd try things without the hbas and without gulm. > So first off, try mounting gfs using nolock instead of gulm on a single > node. > Then gets some space on a local drive to put gfs (without pool first) > and use gulm to mount that. (kinda pointless other than just seeing if > it does an oops.) > If that works, put pool onto the local disk and try again. > > That should give us a good idea of what parts need to be involved to get > the oops. Alrighty, I thought I'd give you the latest on our efforts along these lines. 
We are progressing down the paths you suggested, and wanted to post a few results before the weekend. We have used nolock instead of gulm, still on the pool device over the HBA, and received a crash. Attached are two traces of the crashes. We edited the code sprinkling printk's throughout to get some output. Using lock_nolock instead of lock_gulm still crashes, but slightly differently. See koops-nolock.txt The tracing printk()'s added to lock_gulm and gfs don't show much, but the crash is different yet again. See koops-gulm-traced.txt The tracing messages use -> for enter, <- for leave and ?? for "This function returns in far too many places to bother." Later this evening or monday, I will attempt building a local file system without a pool, then with a pool, to give you some more data. Thanks, Micah -------------- next part -------------- Lock_Harness v6.0.0 (built Aug 6 2004 20:27:11) installed Gulm v6.0.0 (built Aug 6 2004 20:27:09) installed Debugging printks added at paracel. GFS v6.0.0 (built Aug 6 2004 20:26:48) installed ->gfs_read_super(774e0000, 0, 0) ->gfs_mount_lockproto({proto="", table="", host=""}, 0) ->gulm_mount("hopkins:gfs02", "", a0128980, 1cf000, 32, 24f6b8) ->start_gulm_threads("hopkins", "") ->cm_login() ??lg_core_login(7768200, 1) ??xdr_enc_flush(776515c0) ??lg_core_handle_messages(7768200, a010ca00, 0) ??gulm_core_login_reply(0, 0, 0, -1, 3) ->lt_login() ??lg_lock_login(7768200, {71, 70, 83, 32}) Unable to handle kernel paging request at virtual address 0000000100000000 printing rip: ffffffff802b5dd2 PML4 775d3067 PGD 0 Oops: 0000 CPU 0 Pid: 4026, comm: mount Not tainted RIP: 0010:[]{memcpy+18} RSP: 0018:00000100775fb238 EFLAGS: 00010002 RAX: ffffffff805d3928 RBX: 00000100775fa760 RCX: 0000000000000001 RDX: 0000000000000080 RSI: 0000000100000000 RDI: ffffffff805d3928 RBP: 0000000000000000 R08: 00000000ffffffff R09: 00000100076bf840 R10: 0000002a95782200 R11: 0000000000000246 R12: 000001007bf46760 R13: 00000100775fa000 R14: 000001007bf46000 R15: ffffffff805d38c0 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000100000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{__switch_to+499} []{thread_r []{init_level4_pgt+0} []{schedule_ti []{do_softirq_thunk+53} []{inet_wait []{inet_stream_connect+339} []{:lock []{:lock_gulm:xdr_connect+28} []{:lo []{:lock_gulm:lt_login+63} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+346} []{:lock_gulm:cm_login+136} []{:lock []{:lock_gulm:gulm_mount+665} []{:gf []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Process mount (pid: 4026, stackpage=100775fb000) Stack: 00000100775fb238 0000000000000018 0000000000000040 00000100775fa760 ffffffff8010ed13 0000000000000006 00000100775fa000 000001007bf47ed8 000001007bf46000 ffffffff805e02c0 0000000000000000 0000000000000079 ffffffff8011f8c2 00000100775fb328 000001007ad88000 0000000000000020 0000000000000006 00000100775ffb40 0000000000000000 ffffffff80101000 000001007b571000 0000000000000069 0000000000000000 ffffffff805e02c0 00000100775fa000 00000100775fa000 000001007ad88000 0000000000000010 00000100775260c0 7fffffffffffffff 0000010077526108 
00000100775fb3c8 0000000000000010 7fffffffffffffff ffffffff8012f9b5 00000100775fb3c8 ffffffff802b5915 0000000000000020 0000000000000006 000001000478929e Call Trace: []{__switch_to+499} []{thread_r []{init_level4_pgt+0} []{schedule_ti []{do_softirq_thunk+53} []{inet_wait []{inet_stream_connect+339} []{:lock []{:lock_gulm:xdr_connect+28} []{:lo []{:lock_gulm:lt_login+63} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+346} []{:lock_gulm:cm_login+136} []{:lock []{:lock_gulm:gulm_mount+665} []{:gf []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Code: 4c 8b 1e 4c 8b 46 08 4c 89 1f 4c 89 47 08 4c 8b 4e 10 4c 8b Kernel panic: Fatal exception NMI Watchdog detected LOCKUP on CPU0, eip ffffffff8012162f, registers: CPU 0 Pid: 4026, comm: mount Not tainted RIP: 0010:[]{.text.lock.sched+131} RSP: 0018:ffffffff805de5c0 EFLAGS: 00000086 RAX: 0000000000000000 RBX: 00000100775fa000 RCX: 00000000000a6040 RDX: ffffffff8049d6a0 RSI: ffffffff8049d6b0 RDI: 0000000000000000 RBP: ffffffff805de5f0 R08: 0000000000000000 R09: ffffffff8049d6a0 R10: ffffffff8049d690 R11: 00000100775fad28 R12: ffffffff805e02c0 R13: 000000000000000b R14: 0000000000000000 R15: 00000000000033c5 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000100000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{smp_apic_timer_interrupt+291} []{apic_timer_interrupt+64} []{handl []{do_IRQ+274} []{common_interrupt+9 []{do_softirq+153} []{do_IRQ+339} []{common_interrupt+95} []{ip []{dev_queue_xmit+453} []{__make_req []{__make_request+1159} []{generic_m []{submit_bh_rsector+97} []{write_lo []{write_some_buffers+372} []{printk []{write_unlocked_buffers+23} []{syn []{fsync_dev+10} []{sys_sync+11} []{panic+286} []{show_trace+666} []{show_stack+205} []{show_registers []{die+268} []{do_page_fault+989} []{tcp_v4_rcv+1330} []{ip_local_deli []{ip_local_deliver_finish+244} []{n []{ip_local_deliver_finish+0} []{err []{memcpy+18} []{__switch_to+499} []{thread_return+0} []{init_level4_p []{schedule_timeout+37} []{do_softir []{inet_wait_for_connect+287} []{ine []{:lock_gulm:.rodata.str1.1+583} []{:lock_gulm:xdr_connect+28} []{:lo []{:lock_gulm:lt_login+63} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+346} []{:lock_gulm:cm_login+136} []{:lock []{:lock_gulm:gulm_mount+665} []{:gf []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Process mount (pid: 4026, stackpage=100775fb000) Stack: ffffffff805de5c0 0000000000000018 0000000000100000 0000000000000000 00000100079c4c80 ffffffff803e89a0 0000000000000000 00000100000fdea0 ffffffff803e8d00 00000100079bf000 00000100079d6400 0000000000000042 00000100079de280 ffffff0000000000 000000fffffff000 0000000000000000 00000100079d7a80 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
0000000000000000 00000100775fbc28 0000000000000000 00000000006d9994 0000000000000003 0000000000000000 0000000000000000 0000000100000000 ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff Call Trace: []{smp_apic_timer_interrupt+291} []{apic_timer_interrupt+64} []{handl []{do_IRQ+274} []{common_interrupt+9 []{do_softirq+153} []{do_IRQ+339} []{common_interrupt+95} []{ip []{dev_queue_xmit+453} []{__make_req []{__make_request+1159} []{generic_m []{submit_bh_rsector+97} []{write_lo []{write_some_buffers+372} []{printk []{write_unlocked_buffers+23} []{syn []{fsync_dev+10} []{sys_sync+11} []{panic+286} []{show_trace+666} []{show_stack+205} []{show_registers []{die+268} []{do_page_fault+989} []{tcp_v4_rcv+1330} []{ip_local_deli []{ip_local_deliver_finish+244} []{n []{ip_local_deliver_finish+0} []{err []{memcpy+18} []{__switch_to+499} []{thread_return+0} []{init_level4_p []{schedule_timeout+37} []{do_softir []{inet_wait_for_connect+287} []{ine []{:lock_gulm:.rodata.str1.1+583} []{:lock_gulm:xdr_connect+28} []{:lo []{:lock_gulm:lt_login+63} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+346} []{:lock_gulm:cm_login+136} []{:lock []{:lock_gulm:gulm_mount+665} []{:gf []{:lock_harness:lm_mount_Rsmp_ad6c5c21+355} []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Code: f3 90 7e f7 e9 0b db ff ff 80 ba c0 02 5e 80 00 f3 90 7e f5 console shuts up ... -------------- next part -------------- Lock_Harness v6.0.0 (built Aug 6 2004 20:27:11) installed Lock_Nolock v6.0.0 (built Aug 6 2004 20:27:12) installed GFS v6.0.0 (built Aug 6 2004 20:26:48) installed ->gfs_read_super(78a4d000, 0, 0) ->gfs_mount_lockproto({proto="", table="", host=""}, 0) Gulm v6.0.0 (built Aug 5 2004 16:27:11) installed Unable to handle kernel NULL pointer dereference at virtual address 000000000000 printing rip: ffffffff8024a875 PML4 77ae7067 PGD 7798c067 PMD 0 Oops: 0002 CPU 0 Pid: 4027, comm: mount Not tainted RIP: 0010:[]{net_rx_action+213} RSP: 0018:0000010077605048 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffff806077e8 RCX: ffffffff80607988 RDX: ffffffff806077e8 RSI: 0000010077b27080 RDI: ffffffff806077d0 RBP: ffffffff80607668 R08: 0000000080a56a9c R09: 0000000000a580a5 R10: 000000000100007f R11: 0000000000000000 R12: ffffffff806077e8 R13: ffffffff806077c0 R14: 00000000000046e6 R15: 0000000000000000 FS: 0000002a955764c0(0000) GS:ffffffff805d9840(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_outp []{dst_output+0} []{do_softirq_thunk []{.text.lock.netfilter+165} []{dst_ []{ip_queue_xmit+1019} []{ip_rcv_fin []{ip_rcv_finish+528} []{nf_hook_slo []{ip_rcv_finish+0} []{tcp_transmit_ []{tcp_write_xmit+198} []{tcp_sendms []{inet_sendmsg+69} []{sock_sendmsg+ []{:lock_gulm:do_tfer+369} []{:lock_ []{:lock_gulm:xdr_send+37} []{:lock_ []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock []{:lock_gulm:gulm_mount+616} []{:gf []{release_task+763} 
[]{:lock_harnes []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Process mount (pid: 4027, stackpage=10077605000) Stack: 0000010077605048 0000000000000018 ffffffff8024a84d 0000012a80445d20 0000000000000001 ffffffff80606c60 0000000000000000 000000000000000a 0000000000000000 0000000000000002 ffffffff8012a72e ffffffff80267cf0 0000000000000246 0000000000000000 0000000000000003 ffffffff80445d20 ffffffff80267cc0 0000000000000000 ffffffff802b5915 0000000000000043 0000000000000006 000001007a56a09e 0000010077a32d80 0000000000000000 0000000000000000 ffffffff8049c648 0000000000000000 ffffffff806077c0 ffffffff802533a7 ffffffff80267cc0 ffffffff80445d20 0000000000000002 0000010077a32d80 ffffffff805abcd0 000001007a56a0ac 0000010077a32d80 0000010077b27080 0000000000000000 0000010077b27080 0000010077a32de8 Call Trace: []{net_rx_action+173} []{do_softirq+174} []{ip_finish_outp []{dst_output+0} []{do_softirq_thunk []{.text.lock.netfilter+165} []{dst_ []{ip_queue_xmit+1019} []{ip_rcv_fin []{ip_rcv_finish+528} []{nf_hook_slo []{ip_rcv_finish+0} []{tcp_transmit_ []{tcp_write_xmit+198} []{tcp_sendms []{inet_sendmsg+69} []{sock_sendmsg+ []{:lock_gulm:do_tfer+369} []{:lock_ []{:lock_gulm:xdr_send+37} []{:lock_ []{:lock_gulm:lg_lock_login+301} []{:lock_gulm:lt_login+57} []{:lock_ []{:lock_gulm:core_cb+0} []{:lock_gu []{:lock_gulm:lg_core_login+323} []{:lock_gulm:cm_login+122} []{:lock []{:lock_gulm:gulm_mount+616} []{:gf []{release_task+763} []{:lock_harnes []{:gfs:gfs_glock_cb+0} []{:gfs:gfs_ []{do_anonymous_page+1234} []{do_no_ []{do_page_fault+627} []{error_exit+ []{serial_in+41} []{wake_up_cpu+29} []{:gfs:gfs_read_super+1338} []{:gfs []{get_sb_bdev+588} []{:gfs:gfs_fs_t []{do_kern_mount+121} []{do_add_moun []{do_mount+345} []{__get_free_pages []{sys_mount+197} []{system_call+119 Code: 48 89 18 48 89 43 08 8b 85 90 01 00 00 85 c0 79 08 03 85 94 Kernel panic: Fatal exception In interrupt handler - not syncing NMI Watchdog detected LOCKUP on CPU1, eip ffffffff801a5419, registers: CPU 1 Pid: 3534, comm: lock_gulmd Not tainted RIP: 0010:[]{.text.lock.fault+7} RSP: 0018:000001007adc1978 EFLAGS: 00000086 RAX: 000000000000000f RBX: ffffffff80607ae8 RCX: 0000000000000000 RDX: ffffffff803042e0 RSI: ffffffff803042e0 RDI: ffffffff8024a875 RBP: ffffffff80607968 R08: ffffffff803042d0 R09: 0000000000a580a5 R10: 000000000100007f R11: 0000000000000000 R12: 0000010007a0e9c0 R13: 0000000000000000 R14: 0000000000000002 R15: 000001007adc1a58 FS: 0000002a95576ce0(0000) GS:ffffffff805d98c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000079d2000 CR4: 00000000000006e0 Call Trace: Process lock_gulmd (pid: 3534, stackpage=1007adc1000) Stack: 000001007adc1978 0000000000000018 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Call Trace: Code: f3 90 7e f5 e9 c8 fd ff ff 90 90 90 90 90 90 90 90 90 90 90 console shuts up ... From mtilstra at redhat.com Fri Aug 6 22:35:57 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Fri, 6 Aug 2004 17:35:57 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091829819.22512.14.camel@angmar> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> Message-ID: <20040806223557.GA21731@redhat.com> On Fri, Aug 06, 2004 at 03:03:39PM -0700, micah nerren wrote: > We have used nolock instead of gulm, still on the pool device over the > HBA, and received a crash. Attached are two traces of the crashes. We > edited the code sprinkling printk's throughout to get some output. > > Using lock_nolock instead of lock_gulm still crashes, but slightly > differently. See koops-nolock.txt er, you might want to double check this run, looking at the oops and loging, it looks like it is still trying to use gulm. the line: Gulm v6.0.0 (built Aug 5 2004 16:27:11) installed in the file: koops-nolock.txt lead me to believe this, along with the lock_gulm sysmbols in the oops. i could be imaging things too... -- Michael Conrad Tadpol Tilstra At night as I lay in bed looking at the stars I thought 'Where the hell is the ceiling?' -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From mnerren at paracel.com Fri Aug 6 22:37:35 2004 From: mnerren at paracel.com (micah nerren) Date: Fri, 06 Aug 2004 15:37:35 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040806223557.GA21731@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> Message-ID: <1091831854.22512.18.camel@angmar> On Fri, 2004-08-06 at 15:35, Michael Conrad Tadpol Tilstra wrote: > On Fri, Aug 06, 2004 at 03:03:39PM -0700, micah nerren wrote: > > We have used nolock instead of gulm, still on the pool device over the > > HBA, and received a crash. Attached are two traces of the crashes. We > > edited the code sprinkling printk's throughout to get some output. > > > > Using lock_nolock instead of lock_gulm still crashes, but slightly > > differently. See koops-nolock.txt > > er, you might want to double check this run, looking at the oops and > loging, it looks like it is still trying to use gulm. > > the line: > Gulm v6.0.0 (built Aug 5 2004 16:27:11) installed > in the file: koops-nolock.txt > lead me to believe this, along with the lock_gulm sysmbols in the oops. > > i could be imaging things too... Yeah you are right, I had caught that and am attempting a true nolock at the moment. 
From mnerren at paracel.com Fri Aug 6 23:31:55 2004 From: mnerren at paracel.com (micah nerren) Date: Fri, 06 Aug 2004 16:31:55 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040806223557.GA21731@redhat.com> References: <1091458082.8356.23.camel@angmar> <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> Message-ID: <1091835114.22512.46.camel@angmar> On Fri, 2004-08-06 at 15:35, Michael Conrad Tadpol Tilstra wrote: > On Fri, Aug 06, 2004 at 03:03:39PM -0700, micah nerren wrote: > > We have used nolock instead of gulm, still on the pool device over the > > HBA, and received a crash. Attached are two traces of the crashes. We > > edited the code sprinkling printk's throughout to get some output. > > > > Using lock_nolock instead of lock_gulm still crashes, but slightly > > differently. See koops-nolock.txt > > er, you might want to double check this run, looking at the oops and > loging, it looks like it is still trying to use gulm. > > the line: > Gulm v6.0.0 (built Aug 5 2004 16:27:11) installed > in the file: koops-nolock.txt > lead me to believe this, along with the lock_gulm sysmbols in the oops. > > i could be imaging things too... Ok, using a disk attached via fibre channel to a single machine via an LSI hba I can create and mount a GFS file system using lock_nolock without a pool. A start! Console log: GFS: fsid=(8,2).0: Joined cluster. Now mounting FS... GFS: fsid=(8,2).0: jid=0: Trying to acquire journal lock... GFS: fsid=(8,2).0: jid=0: Looking at journal... GFS: fsid=(8,2).0: jid=0: Done I then tried to do lock_nolock on a pool device, and that worked as well: GFS: fsid=hopkins:gfs01.0: Joined cluster. Now mounting FS... GFS: fsid=hopkins:gfs01.0: jid=0: Trying to acquire journal lock... GFS: fsid=hopkins:gfs01.0: jid=0: Looking at journal... GFS: fsid=hopkins:gfs01.0: jid=0: Done So it appears to be specifically related to lock_gulm. Anything else I should try? I really appreciate all your help in debugging this! Thanks, Micah From phillips at redhat.com Sat Aug 7 02:43:03 2004 From: phillips at redhat.com (Daniel Phillips) Date: Fri, 6 Aug 2004 22:43:03 -0400 Subject: [Linux-cluster] CLVM and redundant RAID levels In-Reply-To: References: Message-ID: <200408062243.03666.phillips@redhat.com> On Friday 06 August 2004 04:45, Francesco Fusco wrote: > Hi! > I want to use some ide servers to have an high available cluster > filesystem. > I don't have Fibre Channel/Scsi disk array, but only inexpensive > ide disks. > > Can GFS be a good choice? > Does it support redoundant raid levels between GNBD servers? It's in the pipeline. Regards, Daniel From pcaulfie at redhat.com Mon Aug 9 07:36:34 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 9 Aug 2004 08:36:34 +0100 Subject: [Linux-cluster] bug in cman-kernel / membership.c In-Reply-To: <1091730500.15503.302.camel@laza.eunet.yu> References: <1091730500.15503.302.camel@laza.eunet.yu> Message-ID: <20040809073634.GB8035@tykepenguin.com> On Thu, Aug 05, 2004 at 08:28:21PM +0200, Lazar Obradovic wrote: > Just got this when joining one node. 
> > > CMAN: Waiting to join or form a Linux-cluster > CMAN: sending membership request > CMAN: got node new-noc > Got ENDTRANS from a node not the master: master: 6, sender: 1 > CMAN: node new-noc is not responding - removing from the cluster > ------------[ cut here ]------------ > kernel BUG at /usr/src/cvs/cluster/cman-kernel/src/membership.c:2892! Just checking: Is this fixed by your cman_tool patch or shall I put it into bugzilla ? patrick From pcaulfie at redhat.com Mon Aug 9 07:50:40 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 9 Aug 2004 08:50:40 +0100 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1091736172.19762.336.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> Message-ID: <20040809075039.GA9240@tykepenguin.com> On Thu, Aug 05, 2004 at 10:02:52PM +0200, Lazar Obradovic wrote: > It took some time... > > Attached is the patch for correcting cman_tool join into ipv4 mcast > group. > > Note to ipv6 developers / users: you may also need to set mcast ttl to > higher value via setsockopt() if you want to have cluster with network > traversing L3 devices, since linux default ttl is set for local scope > (ttl = 1), which would make first router drop the packet. > > --- cluster/cman/cman_tool/join.c 2004-07-23 09:48:16.000000000 +0200 > +++ new-cluster/cman/cman_tool/join.c 2004-08-06 05:59:20.353829392 +0200 > @@ -118,13 +118,22 @@ > die("Cannot bind multicast address: %s", strerror(errno)); > > /* Join the multicast group */ > - if (!bcast) { > + if (bhe) { > struct ip_mreq mreq; > + u_char mcast_opt; > > memcpy(&mreq.imr_multiaddr, bhe->h_addr, bhe->h_length); > - memcpy(&mreq.imr_interface, he->h_addr, he->h_length); > + mreq.imr_interface.s_addr = htonl(INADDR_ANY); Can you explain why this should be INADDR_ANY rather than the local IP address? You also mentioned in another email that the "cman_tool leave" should issue a setsockopt to leave the multicast group, does this not happen automatically when the socket is closed? If it isn't then cman_tool leave can do this I suppose. In the case where the cluster software exits without the help of cman_tool it will be fenced anyway so there shoudn't be a problem :-) -- patrick From laza at yu.net Mon Aug 9 09:26:45 2004 From: laza at yu.net (Lazar Obradovic) Date: Mon, 09 Aug 2004 11:26:45 +0200 Subject: [Linux-cluster] bug in cman-kernel / membership.c In-Reply-To: <20040809073634.GB8035@tykepenguin.com> References: <1091730500.15503.302.camel@laza.eunet.yu> <20040809073634.GB8035@tykepenguin.com> Message-ID: <1092043604.8966.0.camel@laza.eunet.yu> it's not fixed... it appeared on clean cvs node... On Mon, 2004-08-09 at 09:36, Patrick Caulfield wrote: > On Thu, Aug 05, 2004 at 08:28:21PM +0200, Lazar Obradovic wrote: > > Just got this when joining one node. > > > > > > CMAN: Waiting to join or form a Linux-cluster > > CMAN: sending membership request > > CMAN: got node new-noc > > Got ENDTRANS from a node not the master: master: 6, sender: 1 > > CMAN: node new-noc is not responding - removing from the cluster > > ------------[ cut here ]------------ > > kernel BUG at /usr/src/cvs/cluster/cman-kernel/src/membership.c:2892! > > Just checking: Is this fixed by your cman_tool patch or shall I put it into > bugzilla ? 
> > patrick > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. -----
From pcaulfie at redhat.com Mon Aug 9 09:38:23 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 9 Aug 2004 10:38:23 +0100 Subject: [Linux-cluster] cman_tool interface change Message-ID: <20040809093822.GC11723@tykepenguin.com> I've changed the way cman_tool starts the cluster, so if you upgrade CVS you'll need to make sure that kernel & userspace match. What I've done is to remove all the setsockopt() calls and replace them with ioctls. Sorry for the inconvenience here, but those things really are /not/ socket options, and if this code got submitted to the kernel team I'd get a roasting! -- patrick
From laza at yu.net Mon Aug 9 11:32:23 2004 From: laza at yu.net (Lazar Obradovic) Date: Mon, 09 Aug 2004 13:32:23 +0200 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <20040809075039.GA9240@tykepenguin.com> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> Message-ID: <1092051143.1114.130.camel@laza.eunet.yu> > Can you explain why this should be INADDR_ANY rather than the local IP address? I knew I forgot something... I've been having some trouble with memcpy(&mreq.imr_interface, he->h_addr, he->h_length); since it contains my mcast addr instead of the real host addr. I'm debugging it now to see where the error happens... Temporarily, I've changed it to htonl(INADDR_ANY)... seems like "temporarily" is a bit of a stretch for me :) > You also mentioned in another email that the "cman_tool leave" should issue > a setsockopt to leave the multicast group, does this not happen automatically > when the socket is closed? Actually no. When you close a socket that was a member of a mcast group, the node does not send an "IGMP leave" message to the router, so mcast packets continue to arrive until the router runs its "membership refresh" procedure (which, depending on configuration, is every 30 seconds). If the node does not confirm its membership (and it won't, since the kernel can't find any socket for the multicast group), the mcast path gets pruned at the router, but it stays valid for another 2-3 minutes (also depending on configuration). > If it isn't then cman_tool leave can do this I > suppose. In the case where the cluster software exits without the help of > cman_tool it will be fenced anyway so there shoudn't be a problem :-) That's true for fenced nodes, but it's not a clean solution, so, if it isn't too much trouble, I'd really like to have the membership drop implemented before the socket gets closed. -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited.
When you close the socket that was a member of Mcast Group, > node does not sent "IGMP leave" message to router, so mcast packets > continue to arrive until router issues "membership refresh" procedure > (which, depending on configuration, is at every 30 seconds). > > If node does not confirm it's membership (and it won't since kernel > can't find any socket for the multicast group), mcast path gets pruned > at a router, but stays valid for another 2-3 minutes (also depending on > configuration). > > > If it isn't then cman_tool leave can do this I > > suppose. In the case where the cluster software exits without the help of > > cman_tool it will be fenced anyway so there shoudn't be a problem :-) > > That's true for fenced nodes, but it's not a clean solution, so, if it > isn't much of a trouble, I'd really like to have membership drop > implemented before socket gets closed. > OK, thanks for clearing that up. It seems I have some thinking to do... patrick From mtilstra at redhat.com Mon Aug 9 15:12:08 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Mon, 9 Aug 2004 10:12:08 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1091835114.22512.46.camel@angmar> References: <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> <1091835114.22512.46.camel@angmar> Message-ID: <20040809151208.GA2189@redhat.com> On Fri, Aug 06, 2004 at 04:31:55PM -0700, micah nerren wrote: > So it appears to be specifically related to lock_gulm. hrms, so no pushing this off onto someone else. oh well. ;) > Anything else I should try? well, it still pretty much looks like a stack overflow. And looking at the calling tree, there is not much left to take out of the stacks. So I guess we'll have to try making the stack shorter. So, another patch. This still works on my intels, give it a go and lets see how it does on your opterons. > I really appreciate all your help in debugging this! np. -- Michael Conrad Tadpol Tilstra To be, or not to be, those are the parameters. -------------- next part -------------- Index: gulm_core.c =================================================================== RCS file: /cvs/GFS/locking/lock_gulm/kernel/gulm_core.c,v retrieving revision 1.1.2.14 diff -u -b -B -r1.1.2.14 gulm_core.c --- gulm_core.c 25 May 2004 20:11:23 -0000 1.1.2.14 +++ gulm_core.c 9 Aug 2004 15:11:19 -0000 @@ -51,13 +51,6 @@ } gulm_cm.GenerationID = gen; - error = lt_login (); - if (error != 0) { - log_err ("lt_login failed. %d\n", error); - lg_core_logout (gulm_cm.hookup); /* XXX is this safe? */ - return error; - } - log_msg (lgm_Network2, "Logged into local core.\n"); return 0; Index: gulm_fs.c =================================================================== RCS file: /cvs/GFS/locking/lock_gulm/kernel/gulm_fs.c,v retrieving revision 1.1.2.17 diff -u -b -B -r1.1.2.17 gulm_fs.c --- gulm_fs.c 2 Aug 2004 16:12:39 -0000 1.1.2.17 +++ gulm_fs.c 9 Aug 2004 15:11:19 -0000 @@ -287,9 +287,11 @@ goto fail; } - /* lt_login() is called after the success packet for cm_login() - * returns. - */ + error = lt_login(); + if (error != 0) { + log_err ("lt_login failed. %d\n", error); + goto fail; + } } fail: up (&start_stop_lock); -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From jeff at intersystems.com Mon Aug 9 15:53:51 2004 From: jeff at intersystems.com (Jeff) Date: Mon, 9 Aug 2004 11:53:51 -0400 Subject: [Linux-cluster] Strange behavior(s) of DLM In-Reply-To: <20040806125429.GG16109@redhat.com> References: <1909350721.20040804234145@intersystems.com> <20040806125429.GG16109@redhat.com> Message-ID: <602087288.20040809115351@intersystems.com> Friday, August 6, 2004, 8:54:29 AM, David Teigland wrote: > On Wed, Aug 04, 2004 at 11:41:45PM -0400, Jeff wrote: >> The attached routine demonstrates some strange >> behavior in the DLM and it was responsible for the >> dmesg text at the end of this note. >> >> This is on a FC2, SMP box running cvs/latest version of >> cman and the dlm. Its a 2 CPU box configured with 4 logical >> CPUs. >> >> I have a two node cluster and the two machines are identical >> as far as I can tell with the exception of which order they are >> listed in the cluster config file. >> >> On node #1 (in the config file) when I run the attached test from >> two terminals the output looks reasonable. The same as it does if >> I run it on Tru64 or VMS (more or less). >> >> 8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0 >> 18730: over last 9.001 seconds, grant 9807, blkast 0, cancel 0 >> 28403: over last 9.001 seconds, grant 9673, blkast 0, cancel 0 >> >> If you shut this down and start it up on node #2 (lx4) you start >> to get messages that look like: >> 91280: over last 10.000 seconds, grant 91279, blkast 0, cancel 0 >> 125138: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 125138: NL Blocking Notification on lockid 0x00010312 (mode 0) >> 125138: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ >> 141370: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ >> 141371: NL Blocking Notification on lockid 0x00010312 (mode 0) >> 141371: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ >> 141373: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > You're running the program on two nodes at once right? The line with "*" > is when I started the program on a second node, so it appears I get the > same thing. I don't get any assertion failure, though. That may be the > result of changes I've checked in for some other bugs over the past couple > days. > 57150: over last 10.000 seconds, grant 57149, blkast 0, cancel 0 > 116825: over last 9.001 seconds, grant 59675, blkast 0, cancel 0 > * 123790: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 123790: NL Blocking Notification on lockid 0x00010373 (mode 0) > 123790: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ > 123822: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^ > 123822: NL Blocking Notification on lockid 0x00010373 (mode 0) > 123822: NL Blocking Notification Rountine End ^^^^^^^^^^^^^^^^^^^^ I updated my sources this morning and I get neither the NL Blocking routine start messages nor the assertion failures. In the past I was able to get this quite easily so I suspect you have resolved them. From laza at yu.net Mon Aug 9 15:58:41 2004 From: laza at yu.net (Lazar Obradovic) Date: Mon, 09 Aug 2004 17:58:41 +0200 Subject: [Linux-cluster] Multicast for GFS? 
In-Reply-To: <20040809133438.GI11723@tykepenguin.com> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> Message-ID: <1092067121.23273.235.camel@laza.eunet.yu> On Mon, 2004-08-09 at 15:34, Patrick Caulfield wrote: > OK, thanks for clearing that up. It seems I have some thinking to do... holiday? egypt? :) ok, attached is the file with two new functions: my_gethostbyname2() and my_freehe(). First should be used everywhere instead of gethostbyname(), and the later should be called to free up all the used memory. I deliberately didn't send a patch, since I haven't CO'ed yet, so patch would fail with new version. I'm counting on you for incorporating it into new tree. :) With this, you can also keep the memcpy(mreq.imr_interaface, ...) since it's working now :) cheers -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- -------------- next part -------------- A non-text attachment was scrubbed... Name: mygethostbyname.c Type: text/x-csrc Size: 1247 bytes Desc: not available URL: From pcaulfie at redhat.com Mon Aug 9 16:11:14 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 9 Aug 2004 17:11:14 +0100 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1092067121.23273.235.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> Message-ID: <20040809161114.GM11723@tykepenguin.com> On Mon, Aug 09, 2004 at 05:58:41PM +0200, Lazar Obradovic wrote: > On Mon, 2004-08-09 at 15:34, Patrick Caulfield wrote: > > OK, thanks for clearing that up. It seems I have some thinking to do... > > holiday? egypt? :) Soon, soon...(and not too far from you either, Dubrovnik!) > ok, attached is the file with two new functions: my_gethostbyname2() and > my_freehe(). First should be used everywhere instead of gethostbyname(), > and the later should be called to free up all the used memory. > > I deliberately didn't send a patch, since I haven't CO'ed yet, so patch > would fail with new version. I'm counting on you for incorporating it > into new tree. :) > > With this, you can also keep the memcpy(mreq.imr_interaface, ...) since Thanks very much - I'll merge that tomorrow -- patrick From lhh at redhat.com Mon Aug 9 20:16:29 2004 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 09 Aug 2004 16:16:29 -0400 Subject: [Linux-cluster] Kernel oops Message-ID: <1092082589.20439.2.camel@atlantis.boston.redhat.com> While doing a bunch of 'while [ 0 ]; relocate_resource_group foo; done' simultaneously, I triggered this in the DLM: I haven't updated since last week; will do so and attempt to reproduce. This is just a heads-up. 
-- Lon DLM: Assertion failed on line 328 of file cluster/dlm/lockqueue.c DLM: assertion: "rsb->res_nodeid == -1 || rsb->res_nodeid == 0" DLM: time = 2154223 dlm: lkb id 200ca remid 0 flags 0 status 0 rqmode 5 grmode -1 nodeid 4294967295 lqstate 0 lqflags 0 dlm: rsb name "usrm::vf" nodeid 1 ref 2 dlm: reply rh_cmd 5 rh_lkid 200ca lockstate 0 nodeid 1 status 0 lkid c02bf515 ------------[ cut here ]------------ kernel BUG at cluster/dlm/lockqueue.c:328! invalid operand: 0000 [#1] PREEMPT SMP Modules linked in: dlm cman ipv6 CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010286 (2.6.7cman20040804) EIP is at process_lockqueue_reply+0x5e6/0x720 [dlm] eax: 00000001 ebx: 00000001 ecx: c039df74 edx: 00000282 esi: c9bbb04c edi: c9bbc708 ebp: caf55e24 esp: caf55dfc ds: 007b es: 007b ss: 0068 Process dlm_recvd (pid: 2109, threadinfo=caf54000 task=cad33360) Stack: d09ad257 00000148 d09ad23f d09ae8a0 0020deef c1398200 000200ca c9bbc708 c1398200 caf55ee0 caf55eac d099de86 c9bbc708 caf55ee0 00000001 c03b94c0 caf55f88 caf55e90 caf55e74 c030a4d4 caf55e90 00000000 00000000 00000fc4 Call Trace: [] show_stack+0x7f/0xa0 [] show_registers+0x15e/0x1c0 [] die+0xa2/0x120 [] do_invalid_op+0xb5/0xc0 [] error_code+0x2d/0x38 [] process_cluster_request+0x746/0xde0 [dlm] [] midcomms_process_incoming_buffer+0x167/0x250 [dlm] [] receive_from_sock+0x189/0x360 [dlm] [] process_sockets+0xd8/0x110 [dlm] [] dlm_recvd+0xad/0x110 [dlm] [] kernel_thread_helper+0x5/0x10 Code: 0f 0b 48 01 3f d2 9a d0 e9 0d 01 00 00 e8 e8 f0 ff ff e8 33 From danderso at redhat.com Mon Aug 9 20:53:01 2004 From: danderso at redhat.com (Derek Anderson) Date: Mon, 9 Aug 2004 15:53:01 -0500 Subject: [Linux-cluster] Kernel oops In-Reply-To: <1092082589.20439.2.camel@atlantis.boston.redhat.com> References: <1092082589.20439.2.camel@atlantis.boston.redhat.com> Message-ID: <200408091553.01433.danderso@redhat.com> Lon, May be the same thing as this bug: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128679 On Monday 09 August 2004 15:16, Lon Hohberger wrote: > While doing a bunch of 'while [ 0 ]; relocate_resource_group foo; done' > simultaneously, I triggered this in the DLM: > > I haven't updated since last week; will do so and attempt to reproduce. > This is just a heads-up. > > -- Lon > > > DLM: Assertion failed on line 328 of file cluster/dlm/lockqueue.c > DLM: assertion: "rsb->res_nodeid == -1 || rsb->res_nodeid == 0" > DLM: time = 2154223 > dlm: lkb > id 200ca > remid 0 > flags 0 > status 0 > rqmode 5 > grmode -1 > nodeid 4294967295 > lqstate 0 > lqflags 0 > dlm: rsb > name "usrm::vf" > nodeid 1 > ref 2 > dlm: reply > rh_cmd 5 > rh_lkid 200ca > lockstate 0 > nodeid 1 > status 0 > lkid c02bf515 > > ------------[ cut here ]------------ > kernel BUG at cluster/dlm/lockqueue.c:328! 
> invalid operand: 0000 [#1] > PREEMPT SMP > Modules linked in: dlm cman ipv6 > CPU: 0 > EIP: 0060:[] Not tainted > EFLAGS: 00010286 (2.6.7cman20040804) > EIP is at process_lockqueue_reply+0x5e6/0x720 [dlm] > eax: 00000001 ebx: 00000001 ecx: c039df74 edx: 00000282 > esi: c9bbb04c edi: c9bbc708 ebp: caf55e24 esp: caf55dfc > ds: 007b es: 007b ss: 0068 > Process dlm_recvd (pid: 2109, threadinfo=caf54000 task=cad33360) > Stack: d09ad257 00000148 d09ad23f d09ae8a0 0020deef c1398200 000200ca > c9bbc708 > c1398200 caf55ee0 caf55eac d099de86 c9bbc708 caf55ee0 00000001 > c03b94c0 > caf55f88 caf55e90 caf55e74 c030a4d4 caf55e90 00000000 00000000 > 00000fc4 > Call Trace: > [] show_stack+0x7f/0xa0 > [] show_registers+0x15e/0x1c0 > [] die+0xa2/0x120 > [] do_invalid_op+0xb5/0xc0 > [] error_code+0x2d/0x38 > [] process_cluster_request+0x746/0xde0 [dlm] > [] midcomms_process_incoming_buffer+0x167/0x250 [dlm] > [] receive_from_sock+0x189/0x360 [dlm] > [] process_sockets+0xd8/0x110 [dlm] > [] dlm_recvd+0xad/0x110 [dlm] > [] kernel_thread_helper+0x5/0x10 > > Code: 0f 0b 48 01 3f d2 9a d0 e9 0d 01 00 00 e8 e8 f0 ff ff e8 33 > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From mnerren at paracel.com Mon Aug 9 20:57:00 2004 From: mnerren at paracel.com (micah nerren) Date: Mon, 09 Aug 2004 13:57:00 -0700 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040809151208.GA2189@redhat.com> References: <20040802154605.GC1518@redhat.com> <1091480394.8356.58.camel@angmar> <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> <1091835114.22512.46.camel@angmar> <20040809151208.GA2189@redhat.com> Message-ID: <1092085020.14561.3.camel@angmar> On Mon, 2004-08-09 at 08:12, Michael Conrad Tadpol Tilstra wrote: > On Fri, Aug 06, 2004 at 04:31:55PM -0700, micah nerren wrote: > > So it appears to be specifically related to lock_gulm. > > hrms, so no pushing this off onto someone else. oh well. ;) > > > > Anything else I should try? > well, it still pretty much looks like a stack overflow. And looking at > the calling tree, there is not much left to take out of the stacks. So > I guess we'll have to try making the stack shorter. > > So, another patch. This still works on my intels, give it a go and > lets see how it does on your opterons. > > > I really appreciate all your help in debugging this! > np. > I tried the patch, it still crashes with the same oops. However, I tried something I hadn't tried before which may shed some light on this. I rebooted the system into UP mode, loaded the UP modules, and did the mount of the file system. This time, no oops. It still doesn't work, but the machine lives. The mount process simply hangs. When I go to another terminal and kill the mount process, this appears in the syslog: lock_gulm: ERROR cm_login failed. -512 lock_gulm: ERROR Got a -512 trying to start the threads. lock_gulm: fsid=hopkins:gfs01: Exiting gulm_mount with errors -512 GFS: can't mount proto = lock_gulm, table = hopkins:gfs01, hostdata = So, does that shed some light onto things? Something specific to SMP and lock_gulm. It still doesn't work in UP mode, but it does not oops. 
From john.l.villalovos at intel.com Mon Aug 9 22:46:38 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Mon, 9 Aug 2004 15:46:38 -0700 Subject: [Linux-cluster] GNBD spec file? Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B0194B025@orsmsx410> Is there a spec file available for GNBD? How about RPMS? John From pcaulfie at redhat.com Tue Aug 10 09:29:01 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 10 Aug 2004 10:29:01 +0100 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <1092067121.23273.235.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> Message-ID: <20040810092900.GB13291@tykepenguin.com> On Mon, Aug 09, 2004 at 05:58:41PM +0200, Lazar Obradovic wrote: > > With this, you can also keep the memcpy(mreq.imr_interaface, ...) since > it's working now :) > OK, I've had to do this slightly differently, using gethostname2_r, it has the same effect. It behaves itself on my cluster but let me know if I've missed anything. -- patrick From laza at yu.net Tue Aug 10 12:11:50 2004 From: laza at yu.net (Lazar Obradovic) Date: Tue, 10 Aug 2004 14:11:50 +0200 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <20040810092900.GB13291@tykepenguin.com> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> Message-ID: <1092139910.32187.1098.camel@laza.eunet.yu> On Tue, 2004-08-10 at 11:29, Patrick Caulfield wrote: > OK, I've had to do this slightly differently, using gethostname2_r, it has > the same effect. It behaves itself on my cluster but let me know if I've missed > anything. It's ok, just that I wanted cleaner solution (cleaner = w/o additional vars). As you might notice, TTL is fixed to a value of 10, and it might be interesting to take this out of the code, and place it somewhere in cluster.conf. Do you think this is ok? how about something like: #define MCAST_TTL_PATH "//cluster/cman/multicast/@ttl" Let me know if this is generaly a good idea, I'll work on details if you do agree. I also started to change ccs a bit for mcast support. It turns out that ccs has a lot of definitions hardcoded. Can I take 'em out and put into separate header file (comm_header.h looks nice :)? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From pcaulfie at redhat.com Tue Aug 10 12:20:43 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 10 Aug 2004 13:20:43 +0100 Subject: [Linux-cluster] Multicast for GFS? 
In-Reply-To: <1092139910.32187.1098.camel@laza.eunet.yu> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> <1092139910.32187.1098.camel@laza.eunet.yu> Message-ID: <20040810122043.GE13291@tykepenguin.com> On Tue, Aug 10, 2004 at 02:11:50PM +0200, Lazar Obradovic wrote: > On Tue, 2004-08-10 at 11:29, Patrick Caulfield wrote: > > OK, I've had to do this slightly differently, using gethostname2_r, it has > > the same effect. It behaves itself on my cluster but let me know if I've missed > > anything. > > It's ok, just that I wanted cleaner solution (cleaner = w/o additional > vars). > > As you might notice, TTL is fixed to a value of 10, and it might be > interesting to take this out of the code, and place it somewhere in > cluster.conf. Do you think this is ok? > > how about something like: > > #define MCAST_TTL_PATH "//cluster/cman/multicast/@ttl" > > Let me know if this is generaly a good idea, I'll work on details if you > do agree. Certainly. I'm all in favour of moving hard-coded values into configuration files - so long as the defaults are reasonable ! > I also started to change ccs a bit for mcast support. It turns out that > ccs has a lot of definitions hardcoded. Can I take 'em out and put into > separate header file (comm_header.h looks nice :)? I think ccs_join.h would be reasonable, then it's obvious which .c file it holds the defaults for. -- patrick From john.l.villalovos at intel.com Tue Aug 10 15:19:48 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Tue, 10 Aug 2004 08:19:48 -0700 Subject: [Linux-cluster] Where to get RPMS for GFS components Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B019A78AA@orsmsx410> Where can I find the RPMS for the various GFS components? I didn't see it on the website: http://sources.redhat.com/cluster/ Hopefully I'm not blind. I did check out from CVS the Cluster code: cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster And I didn't find any SPEC files. Thanks, John From patrick.seinguerlet at e-asc.com Tue Aug 10 15:26:35 2004 From: patrick.seinguerlet at e-asc.com (SEINGUERLET Patrick) Date: Tue, 10 Aug 2004 17:26:35 +0200 Subject: [Linux-cluster] Where to get RPMS for GFS components In-Reply-To: <60C14C611F1DDD4198D53F2F43D8CA3B019A78AA@orsmsx410> Message-ID: <000001c47eee$70f001b0$8000a8c0@porpat> You can use http://sources.redhat.com/cluster/releases/cvs_snapshots/ to have a snapshot of cvs files. And when you compile files, you have got GFS components. For more information see doc/usage.txt Good luck. Patrick. -----Message d'origine----- De : linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Villalovos, John L Envoy? : mardi 10 ao?t 2004 17:20 ? : Discussion of clustering software components including GFS Objet : [Linux-cluster] Where to get RPMS for GFS components Where can I find the RPMS for the various GFS components? I didn't see it on the website: http://sources.redhat.com/cluster/ Hopefully I'm not blind. I did check out from CVS the Cluster code: cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster And I didn't find any SPEC files. 
Thanks, John -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From mtilstra at redhat.com Tue Aug 10 15:56:47 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 10 Aug 2004 10:56:47 -0500 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <1092085020.14561.3.camel@angmar> References: <20040803144019.GA4365@redhat.com> <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> <1091835114.22512.46.camel@angmar> <20040809151208.GA2189@redhat.com> <1092085020.14561.3.camel@angmar> Message-ID: <20040810155647.GA10149@redhat.com> On Mon, Aug 09, 2004 at 01:57:00PM -0700, micah nerren wrote: > I tried the patch, it still crashes with the same oops. evil butterscotch. > However, I tried something I hadn't tried before which may shed some > light on this. I rebooted the system into UP mode, loaded the UP > modules, and did the mount of the file system. This time, no oops. It > still doesn't work, but the machine lives. The mount process simply > hangs. When I go to another terminal and kill the mount process, this > appears in the syslog: > > lock_gulm: ERROR cm_login failed. -512 > lock_gulm: ERROR Got a -512 trying to start the threads. > lock_gulm: fsid=hopkins:gfs01: Exiting gulm_mount with errors -512 > GFS: can't mount proto = lock_gulm, table = hopkins:gfs01, hostdata = yeah, the -512s are just how signal interrupts get moved in the kernel space. > So, does that shed some light onto things? Something specific to SMP and > lock_gulm. It still doesn't work in UP mode, but it does not oops. um. possibly means I've been looking in the wrong place for the solution. I'll dig in some more, but if anyone else reading this has ideas, please share. -- Michael Conrad Tadpol Tilstra It's never too late to have a happy childhood. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From tru at pasteur.fr Tue Aug 10 16:26:55 2004 From: tru at pasteur.fr (Tru Huynh) Date: Tue, 10 Aug 2004 18:26:55 +0200 Subject: [Linux-cluster] GFS 6.0 crashing x86_64 machine In-Reply-To: <20040810155647.GA10149@redhat.com>; from mtilstra@redhat.com on Tue, Aug 10, 2004 at 10:56:47AM -0500 References: <1091581920.8356.257.camel@angmar> <20040804153338.GA10091@redhat.com> <1091749973.18842.70.camel@angmar> <20040806164510.GA20479@redhat.com> <1091829819.22512.14.camel@angmar> <20040806223557.GA21731@redhat.com> <1091835114.22512.46.camel@angmar> <20040809151208.GA2189@redhat.com> <1092085020.14561.3.camel@angmar> <20040810155647.GA10149@redhat.com> Message-ID: <20040810182655.A22653@xiii.bis.pasteur.fr> On Tue, Aug 10, 2004 at 10:56:47AM -0500, Michael Conrad Tadpol Tilstra wrote: ... > > um. possibly means I've been looking in the wrong place for the > solution. I'll dig in some more, but if anyone else reading this has > ideas, please share. newbie (not kernel hacker) idea (taken from the XFS mailing list) On Mon, Aug 09, 2004 at 08:52:40AM -0500, Eric Sandeen wrote: > what does your ffs() look like? I added this patch to our kernels, but > it may not be in Dan's kernels (hm, I need to update the 1.3.3 packages > we have on oss...) 
> > --- linux/include/asm-x86_64/bitops.h.orig 2004-07-26 > 12:33:54.000000000 -0500 > +++ linux/include/asm-x86_64/bitops.h 2004-07-26 12:35:23.000000000 -0500 > @@ -473,7 +473,7 @@ static __inline__ int ffs(int x) > > __asm__("bsfl %1,%0\n\t" > "cmovzl %2,%0" > - : "=r" (r) : "g" (x), "r" (32)); > + : "=r" (r) : "rm" (x), "r" (-1)); > return r+1; > } just .02 cents (no flame please) Tru From laza at yu.net Wed Aug 11 00:04:51 2004 From: laza at yu.net (Lazar Obradovic) Date: Wed, 11 Aug 2004 02:04:51 +0200 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1091403655.6495.17.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> <1091383159.32177.14.camel@laza.eunet.yu> <1091403655.6495.17.camel@laza.eunet.yu> Message-ID: <1092182691.26185.58.camel@laza.eunet.yu> is this ok? will it be a part of cvs tree or it needs additional work? On Mon, 2004-08-02 at 01:40, Lazar Obradovic wrote: > both things in one patch... > > On Sun, 2004-08-01 at 19:59, Lazar Obradovic wrote: > > ok, here's the patch for ibm blade fencing agent... > > qlogic sanbox2, comming up next :) > > > > On Mon, 2004-07-26 at 19:08, Lazar Obradovic wrote: > > > Hello all, > > > > > > I'd like to develop my own fencing agents (for IBM BladeCenter and > > > QLogic SANBox2 switches), but they will require SNMP bindings. > > > > > > Is that ok with general development philosophy, since I'd like to > > > contribude them? net-snmp-5.x.x-based API? -- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From rbrown at metservice.com Wed Aug 11 03:14:44 2004 From: rbrown at metservice.com (Royce Brown) Date: Wed, 11 Aug 2004 15:14:44 +1200 Subject: [Linux-cluster] Clumembd heartbeat problem. Message-ID: <200408111514542.SM01912@rbrown> I am not sure if this is the correct place to post this. If not and you know where I should, could you please tell me. (This effects RedHat ES 3.0 clumanager software) I have found a problem with the clumembd daemon where the heartbeat message is rejected by other nodes causing the node to be powered off. If you have a Ethernet interface with an alias and are using multicast the source address may contain the main IP address or the alias address. If it contains the alias address the message is then rejected by all other nodes as it now contains the wrong IP address. The software correctly creates a socket on the main interface and at first the correct IP address is send. Some time later on the same socket the alias address seems to get into the packets. I have extract the relevant parts from my log file showing the output from the debugging lines I inserted into the code. 
Computer has Interfaces:
bond0 addr 10.10.197.11
bond0:0 addr 10.10.197.6

Multicast set up:
clumembd[2]: add_interface fd:4 name:bond0
clumembd[2]: Interface IP is 10.10.197.11
clumembd[2]: Setting up multicast 225.0.0.11 on 10.10.197.11
clumembd[2]: Multicast send fd:5 (10.10.197.11)
clumembd[2]: Multicast receive fd:6

Sending and receiving message (correct behaviour):
clumembd[2]: sending multicast message fd:5 ,nodeid:1 ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: update_seen new msg nodeid:1 token:0x0002881d4119638e

After a while you get the following (sinp = source address, nsp = expected address):
clumembd[2]: sending multicast message fd:5 ,nodeid:1 ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: update_seen new msg nodeid:1 token:0x0002881d4119638e
clumembd[2]: IP/NodeID mismatch: Probably another cluster on our subnet... msg from nodeid:1 sinp:10.10.197.6 nsp:10.10.197.11

The source address now carries the bond0:0 address where it previously had bond0's address. The socket has not changed. This looks to me like a bug in the sending routine (it uses sendto() from the standard library). Has anyone else noticed this sort of behaviour when sending multicast messages to an Ethernet device with multiple addresses?

Cheers Royce

From lhh at redhat.com Wed Aug 11 13:08:16 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 11 Aug 2004 09:08:16 -0400 Subject: [Linux-cluster] Clumembd heartbeat problem. In-Reply-To: <200408111514542.SM01912@rbrown> References: <200408111514542.SM01912@rbrown> Message-ID: <1092229696.20439.37.camel@atlantis.boston.redhat.com> On Wed, 2004-08-11 at 15:14 +1200, Royce Brown wrote: > I am not sure if this is the correct place to post this. If not > and you know where I should, could you please tell me. Better = taroon-list Best = bugzilla! > Sending and receiving message (Correct behaviour) > clumembd[2]: sending multicast message fd:5 ,nodeid:1 > ,addr:225.0.0.11,token:0x0002881d4119638e > clumembd[2]: update_seen new msg nodeid:1 token:0x0002881d4119638e > > After a while you get. sinp = source address, nsp = expected address > > clumembd[2]: sending multicast message fd:5 ,nodeid:1 > ,addr:225.0.0.11,token:0x0002881d4119638e > clumembd[2]: update_seen new msg nodeid:1 token:0x0002881d4119638e > clumembd[2]: IP/NodeID mismatch: Probably another cluster on our > subnet... msg from nodeid:1 sinp:10.10.197.6 nsp:10.10.197.11 Hmm... this doesn't make a lot of sense; clumembd doesn't rebind anything. It's possible to work around it. > Has anyone else noticed this sort of behaviour on sending multicast messages > to an Ethernet device with multiple addresses. It's news; probably worthy of a bugzilla. -- Lon From laza at yu.net Wed Aug 11 13:22:52 2004 From: laza at yu.net (Lazar Obradovic) Date: Wed, 11 Aug 2004 15:22:52 +0200 Subject: [Linux-cluster] Clumembd heartbeat problem. In-Reply-To: <1092229696.20439.37.camel@atlantis.boston.redhat.com> References: <200408111514542.SM01912@rbrown> <1092229696.20439.37.camel@atlantis.boston.redhat.com> Message-ID: <1092230572.19386.69.camel@laza.eunet.yu> On Wed, 2004-08-11 at 15:08, Lon Hohberger wrote: > > Has anyone else noticed this sort of behaviour on sending multicast messages > > to an Ethernet device with multiple addresses. btw, you shouldn't use 224.0.0.0/24 since it's assigned to various mcast-related things (all hosts, all routers, routing protocols and the like). Check http://www.iana.org/assignments/multicast-addresses for the complete list of assignments.
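To make the two suggestions in this thread concrete, here is a small user-space sketch; it is not clumembd or cman source, and the group address, port and TTL are invented example values. It binds the sending socket to the primary interface address so an alias such as bond0:0 cannot become the source address, pins the outgoing interface with IP_MULTICAST_IF, sets the TTL explicitly, and uses an administratively scoped group in 239.0.0.0/8 rather than anything in the reserved 224.0.0.0/24 block.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in local, group;
        struct in_addr ifaddr;
        unsigned char ttl = 10;          /* could be read from cluster.conf */
        const char *msg = "heartbeat";

        if (fd < 0) {
                perror("socket");
                return 1;
        }

        /* primary address of bond0 in the example above */
        inet_aton("10.10.197.11", &ifaddr);

        /* Binding to the primary address fixes the source address, so an
         * alias (bond0:0) can no longer leak into outgoing packets. */
        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr = ifaddr;
        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
                perror("bind");

        /* Send multicast out of that interface, with an explicit TTL. */
        setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF, &ifaddr, sizeof(ifaddr));
        setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

        memset(&group, 0, sizeof(group));
        group.sin_family = AF_INET;
        group.sin_port = htons(50007);               /* example port */
        inet_aton("239.0.0.11", &group.sin_addr);    /* example group, 239/8 */

        if (sendto(fd, msg, strlen(msg), 0,
                   (struct sockaddr *)&group, sizeof(group)) < 0)
                perror("sendto");

        close(fd);
        return 0;
}

Whether binding like this is the right fix inside clumembd is of course a separate question; the sketch is only meant to show where the alias address can sneak in and how to pin it down.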
-- Lazar Obradovic, System Engineer ----- laza at YU.net YUnet International http://www.EUnet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 3119901; Fax: +381 11 3119901 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 3119901. ----- From amanthei at redhat.com Wed Aug 11 13:58:42 2004 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 11 Aug 2004 08:58:42 -0500 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1092182691.26185.58.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> <1091383159.32177.14.camel@laza.eunet.yu> <1091403655.6495.17.camel@laza.eunet.yu> <1092182691.26185.58.camel@laza.eunet.yu> Message-ID: <20040811135842.GD26705@redhat.com> On Wed, Aug 11, 2004 at 02:04:51AM +0200, Lazar Obradovic wrote: > is this ok? > will it be a part of cvs tree or it needs additional work? Thanks for the reminder. I'll take a look at them and let you know. > On Mon, 2004-08-02 at 01:40, Lazar Obradovic wrote: > > both things in one patch... > > > > On Sun, 2004-08-01 at 19:59, Lazar Obradovic wrote: > > > ok, here's the patch for ibm blade fencing agent... > > > qlogic sanbox2, comming up next :) > > > > > > On Mon, 2004-07-26 at 19:08, Lazar Obradovic wrote: > > > > Hello all, > > > > > > > > I'd like to develop my own fencing agents (for IBM BladeCenter and > > > > QLogic SANBox2 switches), but they will require SNMP bindings. > > > > > > > > Is that ok with general development philosophy, since I'd like to > > > > contribude them? net-snmp-5.x.x-based API? > -- > Lazar Obradovic, System Engineer > ----- > laza at YU.net > YUnet International http://www.EUnet.yu > Dubrovacka 35/III, 11000 Belgrade > Tel: +381 11 3119901; Fax: +381 11 3119901 > ----- > This e-mail is confidential and intended only for the recipient. > Unauthorized distribution, modification or disclosure of its > contents is prohibited. If you have received this e-mail in error, > please notify the sender by telephone +381 11 3119901. > ----- > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From anton at hq.310.ru Wed Aug 11 15:19:30 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Wed, 11 Aug 2004 19:19:30 +0400 Subject: [Linux-cluster] make problem Message-ID: <19610389547.20040811191930@hq.310.ru> Hi all, after "cvs up -r HEAD -Pd ." for cluster folder i have problem with make # make cd cman-kernel && make all make[1]: Entering directory `/usr/src/cluster/cman-kernel' cd src && make all make[2]: Entering directory `/usr/src/cluster/cman-kernel/src' rm -f cluster ln -s . 
cluster make -C /usr/src/linux-2.6 M=/usr/src/cluster/cman-kernel/src modules USING_KBUILD=yes make[3]: Entering directory `/usr/src/linux-2.6.7' CC [M] /usr/src/cluster/cman-kernel/src/cnxman.o /usr/src/cluster/cman-kernel/src/cnxman.c: In function `do_ioctl_pass_socket': /usr/src/cluster/cman-kernel/src/cnxman.c:1504: error: storage size of `sock_info' isn't known /usr/src/cluster/cman-kernel/src/cnxman.c:1504: warning: unused variable `sock_info' /usr/src/cluster/cman-kernel/src/cnxman.c: In function `cl_ioctl': /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: `SIOCCLUSTER_PASS_SOCKET' undeclared (first u se in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: (Each undeclared identifier is reported only once /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: for each function it appears in.) /usr/src/cluster/cman-kernel/src/cnxman.c:1747: error: `SIOCCLUSTER_SET_NODENAME' undeclared (first use in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1754: error: `SIOCCLUSTER_SET_NODEID' undeclared (first us e in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1761: error: `SIOCCLUSTER_JOIN_CLUSTER' undeclared (first use in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1768: error: `SIOCCLUSTER_LEAVE_CLUSTER' undeclared (first use in this function) make[4]: *** [/usr/src/cluster/cman-kernel/src/cnxman.o] Error 1 make[3]: *** [_module_/usr/src/cluster/cman-kernel/src] Error 2 make[3]: Leaving directory `/usr/src/linux-2.6.7' make[2]: *** [all] Error 2 make[2]: Leaving directory `/usr/src/cluster/cman-kernel/src' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/cluster/cman-kernel' make: *** [all] Error 2 # uname -a Linux c5.310.ru 2.6.7 #17 SMP Wed Jul 21 19:34:27 MSD 2004 i686 i686 i386 GNU/Linux -- e-mail: anton at hq.310.ru http://www.310.ru From pcaulfie at redhat.com Wed Aug 11 15:28:58 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 11 Aug 2004 16:28:58 +0100 Subject: [Linux-cluster] make problem In-Reply-To: <19610389547.20040811191930@hq.310.ru> References: <19610389547.20040811191930@hq.310.ru> Message-ID: <20040811152858.GH24727@tykepenguin.com> On Wed, Aug 11, 2004 at 07:19:30PM +0400, ????? ????????? wrote: > Hi all, > > after "cvs up -r HEAD -Pd ." for cluster folder > i have problem with make > It's including an old version of cnxman-socket.h. Check you have the updated one in /usr/include/cluster. -- patrick From patrick.seinguerlet at e-asc.com Wed Aug 11 15:25:06 2004 From: patrick.seinguerlet at e-asc.com (Seinguerlet Patrick) Date: Wed, 11 Aug 2004 17:25:06 +0200 Subject: [Linux-cluster] make problem References: <19610389547.20040811191930@hq.310.ru> Message-ID: <000b01c47fb7$6635a170$0100a8c0@amdk6> see instruction in doc/usage.txt for installation. Patrick ----- Original Message ----- From: "????? ?????????" To: Sent: Wednesday, August 11, 2004 5:19 PM Subject: [Linux-cluster] make problem Hi all, after "cvs up -r HEAD -Pd ." for cluster folder i have problem with make # make cd cman-kernel && make all make[1]: Entering directory `/usr/src/cluster/cman-kernel' cd src && make all make[2]: Entering directory `/usr/src/cluster/cman-kernel/src' rm -f cluster ln -s . 
cluster make -C /usr/src/linux-2.6 M=/usr/src/cluster/cman-kernel/src modules USING_KBUILD=yes make[3]: Entering directory `/usr/src/linux-2.6.7' CC [M] /usr/src/cluster/cman-kernel/src/cnxman.o /usr/src/cluster/cman-kernel/src/cnxman.c: In function `do_ioctl_pass_socket': /usr/src/cluster/cman-kernel/src/cnxman.c:1504: error: storage size of `sock_info' isn't known /usr/src/cluster/cman-kernel/src/cnxman.c:1504: warning: unused variable `sock_info' /usr/src/cluster/cman-kernel/src/cnxman.c: In function `cl_ioctl': /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: `SIOCCLUSTER_PASS_SOCKET' undeclared (first u se in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: (Each undeclared identifier is reported only once /usr/src/cluster/cman-kernel/src/cnxman.c:1740: error: for each function it appears in.) /usr/src/cluster/cman-kernel/src/cnxman.c:1747: error: `SIOCCLUSTER_SET_NODENAME' undeclared (first use in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1754: error: `SIOCCLUSTER_SET_NODEID' undeclared (first us e in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1761: error: `SIOCCLUSTER_JOIN_CLUSTER' undeclared (first use in this function) /usr/src/cluster/cman-kernel/src/cnxman.c:1768: error: `SIOCCLUSTER_LEAVE_CLUSTER' undeclared (first use in this function) make[4]: *** [/usr/src/cluster/cman-kernel/src/cnxman.o] Error 1 make[3]: *** [_module_/usr/src/cluster/cman-kernel/src] Error 2 make[3]: Leaving directory `/usr/src/linux-2.6.7' make[2]: *** [all] Error 2 make[2]: Leaving directory `/usr/src/cluster/cman-kernel/src' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/cluster/cman-kernel' make: *** [all] Error 2 # uname -a Linux c5.310.ru 2.6.7 #17 SMP Wed Jul 21 19:34:27 MSD 2004 i686 i686 i386 GNU/Linux -- e-mail: anton at hq.310.ru http://www.310.ru -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From jbrassow at redhat.com Wed Aug 11 16:03:46 2004 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Wed, 11 Aug 2004 11:03:46 -0500 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <20040810122043.GE13291@tykepenguin.com> References: <1091553471.16747.165.camel@laza.eunet.yu> <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> <1092139910.32187.1098.camel@laza.eunet.yu> <20040810122043.GE13291@tykepenguin.com> Message-ID: <068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com> >> I also started to change ccs a bit for mcast support. It turns out >> that >> ccs has a lot of definitions hardcoded. Can I take 'em out and put >> into >> separate header file (comm_header.h looks nice :)? > > I think ccs_join.h would be reasonable, then it's obvious which .c file > it holds the defaults for. > i don't think there is a ccs_join.c (you're thinking of cman_tool (?)). comm_header.h would be fine. I'll take a look at it when your ready. brassow From anton at hq.310.ru Wed Aug 11 16:13:10 2004 From: anton at hq.310.ru (=?ISO-8859-15?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Wed, 11 Aug 2004 20:13:10 +0400 Subject: [Linux-cluster] make problem In-Reply-To: <20040811152858.GH24727@tykepenguin.com> References: <19610389547.20040811191930@hq.310.ru> <20040811152858.GH24727@tykepenguin.com> Message-ID: <1182024470.20040811201310@hq.310.ru> ?????? ???? 
Patrick, Wednesday, August 11, 2004, 7:28:58 PM, you wrote: Patrick Caulfield> On Wed, Aug 11, 2004 at Patrick Caulfield> 07:19:30PM +0400, Anton Nekhoroshikh wrote: >> Hi all, >> >> after "cvs up -r HEAD -Pd ." for cluster folder >> i have problem with make >> Patrick Caulfield> It's including an old version Patrick Caulfield> of cnxman-socket.h. Check you have the updated one Patrick Caulfield> in /usr/include/cluster. It would be worth adding to usage.txt that you need to update not only /usr/local/cluster but also /path/to/kernel/include/cluster. Even better would be to automate this for the first install method. -- e-mail: anton at hq.310.ru http://www.310.ru From cherry at osdl.org Wed Aug 11 16:18:50 2004 From: cherry at osdl.org (John Cherry) Date: Wed, 11 Aug 2004 09:18:50 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <20040810205817.GB18086@marowsky-bree.de> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> Message-ID: <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> On Tue, 2004-08-10 at 13:58, Lars Marowsky-Bree wrote: > On 2004-08-10T13:49:26, > John Cherry said: > > Hi John, minor correction here... > > This is a work in progress since Daniel Phillips is continuing to add > > The time was right to consider common cluster components. While we > > expected a fair amount of contention at the meetings, it was good to see > > a fairly unanimous desire to identify common components that could be > > leveraged over the various cluster implementations and to drive these > > common components to mainline acceptance. The common cluster components > > identified at the summit were... > > > > cman - cluster manager (membership/quorum/heartbeat, recovery > > control) > > fence - userland daemon which decides which nodes need fencing > > dlm - fully distributed, fully symmetrical lock manager > > gfs - clustered filesystem > > > > While these common components all have RHAT/Sistina roots, these > > components are in the best position for mainline acceptance. As APIs > > are defined for these services, other implementations could also be used > > (the vfs model). > > > > This isn't quite true. cman as a whole is not quite in the best position > for mainline acceptance; actually, most isn't. I realize that cman will probably be at "alpha" level maturity in October, but we did not discuss any other possibilities for kernel level membership/communication. linux-ha and openais have user level components. I suppose SSI membership could be considered as a candidate implementation for the initial merge, but the consensus was that we would focus on cman, define the APIs, and use cman as the initial membership/communication module. Multiple implementations would be good and if we do a good job defining the APIs (membership, communication, fencing), other membership services could be used down the road. Was I at a different summit than you attended, or is that your understanding of the strategic direction of moving Linux to be a "clusterable kernel"? > > However, what was identified was that the following components > > - membership How can we have membership without some form of communication service?
(communication-based membership or connectivity-based membership) The low level cluster communication mechanism is one of those services that I believe we need an API definition for since it will also be leveraged by higher level services such as group messaging or an event service. So you can call the core service "membership", but what we really need is membership/communication, which is what cman provides. Do you have another suggestion for this? TIPC + membership? > - DLM > - Fencing At the summit I attended, we also talked about using GFS as the initial "consumer" of the cluster infrastructure. The cluster infrastructure doesn't stand a chance of mainline acceptance without a consumer that both validates the interfaces and hardens the services. I am not being as subtile as RHAT was at the summit. If we are going to start the process to mainline the components needed to make linux a "clusterable kernel" this year, we will need to get behind the core services that we discussed at the summit. John > > would be the best ones to work on merging first, but it was acknowledged > that there's quite some work left for these to be done, in particular on > the API and the conceptual model behind it. > > > Sincerely, > Lars Marowsky-Br?e From cherry at osdl.org Wed Aug 11 17:22:15 2004 From: cherry at osdl.org (John Cherry) Date: Wed, 11 Aug 2004 10:22:15 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <20040811101104.F1924@build.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> <20040811101104.F1924@build.pdx.osdl.net> Message-ID: <1092244934.5685.59.camel@cherrybomb.pdx.osdl.net> On Wed, 2004-08-11 at 10:11, Chris Wright wrote: > * John Cherry (cherry at osdl.org) wrote: > > At the summit I attended, we also talked about using GFS as the initial > > "consumer" of the cluster infrastructure. The cluster infrastructure > > doesn't stand a chance of mainline acceptance without a consumer that > > both validates the interfaces and hardens the services. > > > > I am not being as subtile as RHAT was at the summit. If we are going to > > start the process to mainline the components needed to make linux a > > "clusterable kernel" this year, we will need to get behind the core > > services that we discussed at the summit. > > I read Lars' comments as something like: > There's still a lot of work to do, and it's not a foregone conclusion > that any of this would hit mainline. Agreed. There are no guarentees of mainline acceptance. We just need to line up against the unwritten "criteria" for mainline acceptance of this kind of code. These include infrastructure (common services) that would support multiple cluster implementations, not invasive to the core kernel, provide real value (i.e. infrastructure for a clustered filesystem), maintainable, active development community behind the code, etc. > > Maybe I extrapolated too far. However, the kernel summit included > a reasonable bit of pushback on placing this in the kernel without > convincing arguments to the contrary. So I think it's reasonable to > consider part of the work is still clearly defining that need. There were some user vs kernel discussions on the list prior to the summit, but the consensus at the summit was that the core common services would be in the kernel. 
After all, the initial consumer of the cluster infrastructure (clustered filesystem) is in the kernel. John From bruce.walker at hp.com Wed Aug 11 18:19:27 2004 From: bruce.walker at hp.com (Walker, Bruce J) Date: Wed, 11 Aug 2004 11:19:27 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials Message-ID: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> > * John Cherry (cherry at osdl.org) wrote: > At the summit I attended, we also talked about using GFS as the initial > "consumer" of the cluster infrastructure. The cluster infrastructure > doesn't stand a chance of mainline acceptance without a consumer that > both validates the interfaces and hardens the services. Given cman etc. was written for GFS, it doesn't prove much that it works with GFS. Having an independent cluster effort (like OpenSSI) use the underlying infrastructure presents a much more compelling case. The OpenSSI project has started to look into this but help from OSDL, Intel and/or RedHat wouldn't be discouraged. Also, having SAF layered and/or ha-linux layered would also bolster the case as a general infrastructure. Bruce walker OpenSSI project lead From phillips at redhat.com Wed Aug 11 18:54:08 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 11 Aug 2004 14:54:08 -0400 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> Message-ID: <200408111454.08677.phillips@redhat.com> On Wednesday 11 August 2004 12:18, John Cherry wrote: > On Tue, 2004-08-10 at 13:58, Lars Marowsky-Bree wrote: > > On 2004-08-10T13:49:26, John Cherry said: > > > While these common components all have RHAT/Sistina roots, these > > > components are in the best position for mainline acceptance. As > > > APIs are defined for these services, other implementations could > > > also be used (the vfs model). > > > > This isn't quite true. cman as a whole is not quite in the best > > position for mainline acceptance; actually, most isn't. That's accurate, that's why I keep beating on the 'read the code' issue, not to mention trying it, and hacking it. > I realize that cman will probably be at "alpha" level maturity in > October, but we did not discuss any other possibilities for kernel > level membership/communication. I believe it was briefly mentioned that we mainly use bog-standard tcp socket streams for communication. I'll add that various subsystems incorporate their own reliability logic, and maybe one day far from now, we'll be able to unify all of that. For now, it's a little ambitions, not to mention unnecessary. > linux-ha and openais have user level > components. I suppose SSI membership could be considered as a > candidate implementation for the initial merge, but the consensus was > that we would focus on cman, define the APIs, and use cman as the > initial membership/communication module. Multiple implementations > would be good and if we do a good job defining the APIs (membership, > communication, fencing), other membership services could be used down > the road. IMHO, for the time being only failure detection and failover really has to be unified, and that is CMAN, including interaction with other bits and pieces, i.e., Magma and fencing, and hopefully other systems like Lars' SCRAT. 
As far as CMAN goes, Lars and Alan seem to be the main parties outside Red Hat. Lon and Patrick are most active inside Red Hat. I think we'd advance fastest if they start hacking each other's code (anybody I just overlooked, please bellow). However it goes, this process is going to take time. Two months would be blindingly fast, and that is before we even think about pushing to Andrew. > Was I at a different summit than you attended, or is that your > understanding of the strategic direction of moving Linux to be a > "clusterable kernel"? That seemed to be the concensus at the summit I attended. Note that we've already got the basic changes to the VFS in place, with a few small exceptions. I still think that gdlm can go to Andrew before CMAN, however that is contingent on working out a way to invert the link-level dependency on CMAN so that the OCFS2 guys and people who want to experiment with dlm-style coding can try it without being forced to adopt a lot of other, less stable infrastructure at the same time. This will be going forward in parallel with the CMAN api work. > How can we have membership without some form of communication > service? (communication-based membership or connectivity-based > membership) > > The low level cluster communication mechanism is one of those > services that I believe we need an API definition for since it will > also be leveraged by higher level services such as group messaging or > an event service. > > So you can call the core service "membership", but what we really > need is membership/communication, which is what cman provides. Do > you have another suggestion for this? TIPC + membership? I think you really mean "connection manager", not "communication service" I'll step back from this now and watch you guys sort it out :-) Regards, Daniel From lmb at suse.de Wed Aug 11 20:55:15 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Wed, 11 Aug 2004 22:55:15 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> Message-ID: <20040811205515.GB10855@marowsky-bree.de> On 2004-08-11T09:18:50, John Cherry said: > I realize that cman will probably be at "alpha" level maturity in > October, but we did not discuss any other possibilities for kernel level > membership/communication. linux-ha and openais have user level > components. Let's be a bit more specific, we have so far agreed on defining the membership API in the kernel (and likely starting from the cman one here), but via a vfs-like "Virtual Cluster Switch" with pluggable components right from the start, of which cman may be one, or a module to go out and talk to a user-level membership implementation another. That all these components need to be in the kernel hasn't been quite agreed on, just that their information needs to be available there. > membership/communication module. Multiple implementations would be good > and if we do a good job defining the APIs (membership, communication, > fencing), other membership services could be used down the road. Right. > > However, what was identified was that the following components > > > > - membership > How can we have membership without some form of communication service? 
> (communication-based membership or connectivity-based membership) Communication was specifically excluded because the communication APIs are much more complex to define; how the membership is computed internally is, well, internal to the membership module, and thus is it's communication method... > The low level cluster communication mechanism is one of those services > that I believe we need an API definition for since it will also be > leveraged by higher level services such as group messaging or an event > service. Eventually, but it's also more complex and was thus excluded. We specifically listed those three components I gave, for good reasons... > At the summit I attended, we also talked about using GFS as the initial > "consumer" of the cluster infrastructure. The cluster infrastructure > doesn't stand a chance of mainline acceptance without a consumer that > both validates the interfaces and hardens the services. GFS for one doesn't need any further communication channels beyond the DLM and membership. There's more components which are needed here, ie the recovery coordination provided by their Service Manager and some others, however for very good reasons (both their technical as their political complexity) those were left out of the initial go at this. > I am not being as subtile as RHAT was at the summit. If we are going to > start the process to mainline the components needed to make linux a > "clusterable kernel" this year, we will need to get behind the core > services that we discussed at the summit. You better be as careful as everyone was at the Summit, or you'll quickly be treading very loose ground ;-) Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ Philosophy proclaiming reason to be SUSE Labs, Research and Development | the supreme human virtue is falling SUSE LINUX AG - A Novell company \ prey to self-adulation. From daniel at osdl.org Wed Aug 11 21:24:49 2004 From: daniel at osdl.org (Daniel McNeil) Date: Wed, 11 Aug 2004 14:24:49 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <200408111454.08677.phillips@redhat.com> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> <200408111454.08677.phillips@redhat.com> Message-ID: <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> On Wed, 2004-08-11 at 11:54, Daniel Phillips wrote: > On Wednesday 11 August 2004 12:18, John Cherry wrote: > > On Tue, 2004-08-10 at 13:58, Lars Marowsky-Bree wrote: > > > On 2004-08-10T13:49:26, John Cherry said: > > > > While these common components all have RHAT/Sistina roots, these > > > > components are in the best position for mainline acceptance. As > > > > APIs are defined for these services, other implementations could > > > > also be used (the vfs model). > > > > > > This isn't quite true. cman as a whole is not quite in the best > > > position for mainline acceptance; actually, most isn't. > > That's accurate, that's why I keep beating on the 'read the code' issue, > not to mention trying it, and hacking it. > > > I realize that cman will probably be at "alpha" level maturity in > > October, but we did not discuss any other possibilities for kernel > > level membership/communication. > > I believe it was briefly mentioned that we mainly use bog-standard tcp > socket streams for communication. 
I'll add that various subsystems > incorporate their own reliability logic, and maybe one day far from > now, we'll be able to unify all of that. For now, it's a little > ambitions, not to mention unnecessary. > > > linux-ha and openais have user level > > components. I suppose SSI membership could be considered as a > > candidate implementation for the initial merge, but the consensus was > > that we would focus on cman, define the APIs, and use cman as the > > initial membership/communication module. Multiple implementations > > would be good and if we do a good job defining the APIs (membership, > > communication, fencing), other membership services could be used down > > the road. > > IMHO, for the time being only failure detection and failover really has > to be unified, and that is CMAN, including interaction with other bits > and pieces, i.e., Magma and fencing, and hopefully other systems like > Lars' SCRAT. As far as CMAN goes, Lars and Alan seem to be the main > parties outside Red Hat. Lon and Patrick are most active inside Red > Hat. I think we'd advance fastest if they start hacking each other's > code (anybody I just overlooked, please bellow). I not sure what you mean by "failure detection and failover". Do you mean node failure detection and consensus membership change? I thought Magma is just redhat's backward compatibility layer. What "interaction" are you worried about? How fencing integrates and when it occurs might be issues we will need to think about more. > > However it goes, this process is going to take time. Two months would > be blindingly fast, and that is before we even think about pushing to > Andrew. > > > Was I at a different summit than you attended, or is that your > > understanding of the strategic direction of moving Linux to be a > > "clusterable kernel"? > > That seemed to be the concensus at the summit I attended. Note that > we've already got the basic changes to the VFS in place, with a few > small exceptions. > > I still think that gdlm can go to Andrew before CMAN, however that is > contingent on working out a way to invert the link-level dependency on > CMAN so that the OCFS2 guys and people who want to experiment with > dlm-style coding can try it without being forced to adopt a lot of > other, less stable infrastructure at the same time. This will be going > forward in parallel with the CMAN api work. How can the DLM go to Andrew without a membership layer to provide membership? I would think we need the DLM to actually be working... > > > How can we have membership without some form of communication > > service? (communication-based membership or connectivity-based > > membership) > > > > The low level cluster communication mechanism is one of those > > services that I believe we need an API definition for since it will > > also be leveraged by higher level services such as group messaging or > > an event service. > > > > So you can call the core service "membership", but what we really > > need is membership/communication, which is what cman provides. Do > > you have another suggestion for this? TIPC + membership? > > I think you really mean "connection manager", not "communication > service" I'll step back from this now and watch you guys sort it > out :-) I think John really does mean communication. For high availability, the cluster should have no single point of failure. This usually means multiple ethernet links. (I assume CMAN supports multiple links). 
To determine membership there needs to be a way of sending messages between the nodes to determine membership. Ideally, losing one ethernet link could/would be handle without causing any membership change. This kind of intra-cluster communication would be valuable for other cluster components as well. Example: a cluster snapshot :) or cluster mirror device should be able to send messages to other nodes in the cluster without having to worry about which specific link to use and what to do if a link fails. This would also be valuable for the DLM. Does CMAN provide this kind of functionality? If so, then it really is a communication service. Daniel McNeil > > Regards, > > Daniel > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From cherry at osdl.org Wed Aug 11 21:42:27 2004 From: cherry at osdl.org (John Cherry) Date: Wed, 11 Aug 2004 14:42:27 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <20040811205515.GB10855@marowsky-bree.de> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> <20040811205515.GB10855@marowsky-bree.de> Message-ID: <1092260546.6232.76.camel@cherrybomb.pdx.osdl.net> On Wed, 2004-08-11 at 13:55, Lars Marowsky-Bree wrote: > On 2004-08-11T09:18:50, > John Cherry said: > > > I realize that cman will probably be at "alpha" level maturity in > > October, but we did not discuss any other possibilities for kernel level > > membership/communication. linux-ha and openais have user level > > components. > > Let's be a bit more specific, we have so far agreed on defining the > membership API in the kernel (and likely starting from the cman one > here), but via a vfs-like "Virtual Cluster Switch" with pluggable > components right from the start, of which cman may be one, or a module > to go out and talk to a user-level membership implementation another. > > That all these components need to be in the kernel hasn't been quite > agreed on, just that their information needs to be available there. Quite right. The primary "next step" we defined was to agree on the membership API in the kernel. In the OSS community, this means to provide code that works and define the API based on the working code. You have to start with something, and cman (or whatever it becomes) is a likely first candidate. With a vfs-like cluster switch, any membership service could be plugged in, including one that goes out and talks to a user level membership implementation. I wonder which one you might have in mind? :) > > > membership/communication module. Multiple implementations would be good > > and if we do a good job defining the APIs (membership, communication, > > fencing), other membership services could be used down the road. > > Right. > > > > However, what was identified was that the following components > > > > > > - membership > > How can we have membership without some form of communication service? > > (communication-based membership or connectivity-based membership) > > Communication was specifically excluded because the communication APIs > are much more complex to define; how the membership is computed > internally is, well, internal to the membership module, and thus is it's > communication method... Yes. Perhaps the communication API definitions were excluded in the first go-around. 
However, you have to admit that cluster communication IS required, if for nothing else to provide redundant communication paths, and exposing this communication API would be valuable for higher level services. For instance, you don't want an event service to provide a completely orthogonal communication mechanism in the cluster when it could use the one that also provides the cluster heartbeat mechanism. > > > The low level cluster communication mechanism is one of those services > > that I believe we need an API definition for since it will also be > > leveraged by higher level services such as group messaging or an event > > service. > > Eventually, but it's also more complex and was thus excluded. We > specifically listed those three components I gave, for good reasons... OK. I admit that we were not going to focus on the communication API in the first go-around. > > > At the summit I attended, we also talked about using GFS as the initial > > "consumer" of the cluster infrastructure. The cluster infrastructure > > doesn't stand a chance of mainline acceptance without a consumer that > > both validates the interfaces and hardens the services. > > GFS for one doesn't need any further communication channels beyond the > DLM and membership. I can agree with that. > > There's more components which are needed here, ie the recovery > coordination provided by their Service Manager and some others, however > for very good reasons (both their technical as their political > complexity) those were left out of the initial go at this. Agreed. However, fencing will probably need to be addressed as we define the membership API. > > > I am not being as subtile as RHAT was at the summit. If we are going to > > start the process to mainline the components needed to make linux a > > "clusterable kernel" this year, we will need to get behind the core > > services that we discussed at the summit. > > You better be as careful as everyone was at the Summit, or you'll > quickly be treading very loose ground ;-) Sorry. I didn't mean to put words into anybody's mouth. However, if there was disagreement with the basic strategy moving forward....it was pretty stealthy. :) John From lmb at suse.de Wed Aug 11 22:04:00 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Thu, 12 Aug 2004 00:04:00 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> <200408111454.08677.phillips@redhat.com> <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> Message-ID: <20040811220400.GG10855@marowsky-bree.de> On 2004-08-11T14:24:49, Daniel McNeil said: > How can the DLM go to Andrew without a membership layer to > provide membership? I'd agree with this question. Membership is really the first and foremost question, then the DLM can go in. Fencing turns out to be a more difficult beast, because the way how the GFS stack handles it's recovery (a static priority list) is somewhat fundamentally incompatible with the way how a more powerful dependency based cluster recovery manager might wish to handle things. We've just run into this discussion ourselves, and as soon as we have an idea, will propose that adequately for discussion... > I think John really does mean communication. For high availability, > the cluster should have no single point of failure. 
Exposing the communication APIs begs a ton of questions regarding the semantics; atomic, causal or total ordering?; communication groups; access controls to those; sync or async; broadcast, multicast or pair-wise channels? All of these and some more can/should be supported, however most systems just provide subsets. How to expose that, how to handle it? That's a bit more difficult than answering the question about membership, which is even complex enough - do you get to see membership before or after fencing, with or without quorum etc. Don't rush this. Don't get sidetracked. (And trust me, I've been there at OCF for that one.) Concentrate on the slightly more palatable ones like membership and DLM, and after we've established prior art, then lets tackle the bigger issues. Nobody denies that communication, recovery coordination etc are required and very important, just that we don't wish to start there. > Does CMAN provide this kind of functionality? If so, then it > really is a communication service. It provides a very limitted subset of it which is, for example, not even useable to the low requirements SCRAT (heartbeat's new recovery/resource manager) has, as far as I can see, because it's not performing well enough. And it's not meant to, because they architect their stack differently (around DLM + TCP etc), but it means we'll need to work on this area some more first. Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ Philosophy proclaiming reason to be SUSE Labs, Research and Development | the supreme human virtue is falling SUSE LINUX AG - A Novell company \ prey to self-adulation. From pcaulfie at redhat.com Thu Aug 12 06:57:53 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 12 Aug 2004 07:57:53 +0100 Subject: [Linux-cluster] Multicast for GFS? In-Reply-To: <068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com> References: <1091556279.30938.179.camel@laza.eunet.yu> <1091736172.19762.336.camel@laza.eunet.yu> <20040809075039.GA9240@tykepenguin.com> <1092051143.1114.130.camel@laza.eunet.yu> <20040809133438.GI11723@tykepenguin.com> <1092067121.23273.235.camel@laza.eunet.yu> <20040810092900.GB13291@tykepenguin.com> <1092139910.32187.1098.camel@laza.eunet.yu> <20040810122043.GE13291@tykepenguin.com> <068D22A6-EBB0-11D8-9B62-000A957BB1F6@redhat.com> Message-ID: <20040812065752.GA21565@tykepenguin.com> On Wed, Aug 11, 2004 at 11:03:46AM -0500, Jonathan E Brassow wrote: > >>I also started to change ccs a bit for mcast support. It turns out > >>that > >>ccs has a lot of definitions hardcoded. Can I take 'em out and put > >>into > >>separate header file (comm_header.h looks nice :)? > > > >I think ccs_join.h would be reasonable, then it's obvious which .c file > >it holds the defaults for. > > > > i don't think there is a ccs_join.c (you're thinking of cman_tool (?)). > comm_header.h would be fine. I'll take a look at it when your ready. Sorry, yes I was still in cman_tool mode ! 
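Picking up the comm_header.h idea (and the earlier multicast TTL thread), here is a hypothetical sketch of what such a header might collect; apart from MCAST_TTL_PATH, which is quoted from the proposal above, none of these names or values come from the CVS tree.

/* comm_header.h (hypothetical layout) -- gather the values that are
 * currently hard-coded in the .c files, so cluster.conf can override
 * them while the compiled-in numbers remain sane defaults. */
#ifndef COMM_HEADER_DOT_H
#define COMM_HEADER_DOT_H

/* compiled-in defaults, used when cluster.conf is silent */
#define DEFAULT_MCAST_TTL       10
#define DEFAULT_MCAST_ADDR      "239.0.0.11"    /* example value only */
#define DEFAULT_CLUSTER_PORT    50007           /* example value only */

/* cluster.conf queries, in the xpath style proposed in this thread */
#define MCAST_TTL_PATH          "//cluster/cman/multicast/@ttl"
#define MCAST_ADDR_PATH         "//cluster/cman/multicast/@addr"   /* assumed */

#endif /* COMM_HEADER_DOT_H */

The .c files would then fall back to the DEFAULT_* values whenever the corresponding cluster.conf query comes back empty, which keeps the "reasonable defaults" requirement intact.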
-- patrick From chrisw at osdl.org Wed Aug 11 17:11:04 2004 From: chrisw at osdl.org (Chris Wright) Date: Wed, 11 Aug 2004 10:11:04 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net>; from cherry@osdl.org on Wed, Aug 11, 2004 at 09:18:50AM -0700 References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <20040810205817.GB18086@marowsky-bree.de> <1092241130.5683.48.camel@cherrybomb.pdx.osdl.net> Message-ID: <20040811101104.F1924@build.pdx.osdl.net> * John Cherry (cherry at osdl.org) wrote: > At the summit I attended, we also talked about using GFS as the initial > "consumer" of the cluster infrastructure. The cluster infrastructure > doesn't stand a chance of mainline acceptance without a consumer that > both validates the interfaces and hardens the services. > > I am not being as subtile as RHAT was at the summit. If we are going to > start the process to mainline the components needed to make linux a > "clusterable kernel" this year, we will need to get behind the core > services that we discussed at the summit. I read Lars' comments as something like: There's still a lot of work to do, and it's not a foregone conclusion that any of this would hit mainline. Maybe I extrapolated too far. However, the kernel summit included a reasonable bit of pushback on placing this in the kernel without convincing arguments to the contrary. So I think it's reasonable to consider part of the work is still clearly defining that need. thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net From lmb at suse.de Thu Aug 12 09:57:36 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Thu, 12 Aug 2004 11:57:36 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <1092249962.4717.21.camel@persist.az.mvista.com> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> Message-ID: <20040812095736.GE4096@marowsky-bree.de> On 2004-08-11T11:46:03, Steven Dake said: > If we can't live with the cluster services in userland (although I'm > still not convinced), then atleast the group messaging protocol in the > kernel could be based upon 20 years of research in group messaging and > work properly under _all_ fault scenarios. Right. Another important alternative maybe the Transis group communication suite, which has been released as GPL/LGPL now. This all just highlights that we need to think about communication some more before we can tackle it sensibly, but of course I'll be glad if someone proves me wrong and Just Does It ;-) Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ I allow neither my experience SUSE Labs, Research and Development | nor my cynicism to deter my SUSE LINUX AG - A Novell company \ optimistic outlook on life. From phillips at redhat.com Wed Aug 11 19:58:57 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 11 Aug 2004 15:58:57 -0400 Subject: [Linux-cluster] [ANNOUNCE] Minneapolis Cluster Summit Wrapup Message-ID: <200408111558.57898.phillips@redhat.com> Hi All, The Minneapolis Cluster Summit came and went 10 days ago, with excellent attendance and high-quality interaction all round. 
Over the last few days I've been collecting slide presentations and related material onto this page: http://sources.redhat.com/cluster/events/summit2004/presentations.html Unfortunately, due to manpower limitations and short lead time, we weren't able to arrange for audio recordings, which would have been great since both presentations and discussion were packed full of useful material. I guess this means we have to do it again next year, this time with a tape recorder! As for the results... discussion continues on linux-cluster and other mailing lists, please judge for yourself. https://www.redhat.com/mailman/listinfo/linux-cluster http://lists.osdl.org/mailman/listinfo/cgl_discussion http://lists.osdl.org/mailman/listinfo/dcl_discussion Regards, Daniel From phillips at redhat.com Thu Aug 12 22:47:17 2004 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 12 Aug 2004 18:47:17 -0400 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <200408111454.08677.phillips@redhat.com> <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> Message-ID: <200408121847.17158.phillips@redhat.com> On Wednesday 11 August 2004 17:24, Daniel McNeil wrote: > > IMHO, for the time being only failure detection and failover really > > has to be unified, and that is CMAN, including interaction with > > other bits and pieces, i.e., Magma and fencing, and hopefully other > > systems like Lars' SCRAT. As far as CMAN goes, Lars and Alan seem > > to be the main parties outside Red Hat. Lon and Patrick are most > > active inside Red Hat. I think we'd advance fastest if they start > > hacking each other's code (anybody I just overlooked, please > > bellow). > > I not sure what you mean by "failure detection and failover". > Do you mean node failure detection and consensus membership change? I mean anything in the cluster that can fail and be reinstantiated. This would include server processes for cluster block devices such as the ones I've designed, as well as whole nodes. It would also include communication paths, such as socket connections. But by now you may have detected a bias against trying to deal with the latter in a one-size-fits-all automagic, never-stop-never-give-up cluster communications thingamajig layer. What we really need is just a framework for failure detection, including methods supplied by various cluster components, and methods for re-instantiating failed components. Note note note: while a "cluster component" could conceivably be a whole node, that's a special case and we really need to cater to the case that will eventually be much more common, where cluster nodes may be doing all kinds of other things besides just participating in clusters. So by "cluster component" I really mean something closer to "task". > I thought Magma is just redhat's backward compatibility layer. > What "interaction" are you worried about? You might want to ask Lon about that... > How fencing integrates and when it occurs might be issues we > will need to think about more. Understatement of the day. > How can the DLM go to Andrew without a membership layer to > provide membership? By having a simple registration api that allows one to register a membership layer, in place of what is there now, i.e., function links between modules. 
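To make the registration idea just described concrete, here is a minimal sketch of what such a hook could look like. The names below (member_callbacks, membership_register and so on) are hypothetical illustrations, not the actual CMAN or DLM symbols; the point is only that a consumer such as the DLM registers callbacks with whatever membership provider is loaded, instead of being hard-linked to one particular implementation.

    /* Hypothetical sketch only -- none of these names exist in the real code. */
    struct member_callbacks {
            void (*node_up)(int nodeid);
            void (*node_down)(int nodeid);
            void (*quorum_change)(int quorate);
    };

    /* Implemented by whichever membership provider is loaded (cman, heartbeat, ...). */
    int  membership_register(const char *subsystem, const struct member_callbacks *cb);
    void membership_unregister(const char *subsystem);

A lockspace would then drive its recovery from node_down() events without caring which provider produced them.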
> > > So you can call the core service "membership", but what we really > > > need is membership/communication, which is what cman provides. > > > Do you have another suggestion for this? TIPC + membership? > > > > I think you really mean "connection manager", not "communication > > service" I'll step back from this now and watch you guys sort it > > out :-) > > I think John really does mean communication. For high availability, > the cluster should have no single point of failure. This usually > means multiple ethernet links. But it's not the business of the cluster framework to operate the links, only to know when they have failed and to be able to arrange for new connections. So John really does mean "connection" and not "communication", I hope. > (I assume CMAN supports multiple > links). To determine membership there needs to be a way of sending > messages between the nodes to determine membership. Ideally, losing > one ethernet link could/would be handle without causing any > membership change. "Ideally" is not a strong enough word, imho. > This kind of intra-cluster communication would be valuable for > other cluster components as well. Example: a cluster snapshot :) > or cluster mirror device should be able to send messages to > other nodes in the cluster without having to worry about which > specific link to use and what to do if a link fails. This would > also be valuable for the DLM. OK, we've seen lots of warnings about not getting derailed by trying to invent the perfect cluster communication system, we should heed those warnings. Instead, let's get down to precise specification of the methods we need to have, and compare it to what already exists, for establishing and re-establishing connections. > Does CMAN provide this kind of functionality? If so, then it > really is a communication service. http://people.redhat.com/~teigland/sca.pdf Regards, Daniel From sdake at mvista.com Thu Aug 12 17:42:16 2004 From: sdake at mvista.com (Steven Dake) Date: Thu, 12 Aug 2004 10:42:16 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <20040812095736.GE4096@marowsky-bree.de> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> Message-ID: <1092332536.7315.1.camel@persist.az.mvista.com> On Thu, 2004-08-12 at 02:57, Lars Marowsky-Bree wrote: > On 2004-08-11T11:46:03, > Steven Dake said: > > > If we can't live with the cluster services in userland (although I'm > > still not convinced), then atleast the group messaging protocol in the > > kernel could be based upon 20 years of research in group messaging and > > work properly under _all_ fault scenarios. > > Right. Another important alternative maybe the Transis group > communication suite, which has been released as GPL/LGPL now. > > This all just highlights that we need to think about communication some > more before we can tackle it sensibly, but of course I'll be glad if > someone proves me wrong and Just Does It ;-) > agreed... Transis in kernel would be a fine alternative to openais gmi in kernel. Speaking of transis, is the code posted anywhere? I'd like to have a look. 
Thanks -steve > > Sincerely, > Lars Marowsky-Br?e From lmb at suse.de Thu Aug 12 20:37:38 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Thu, 12 Aug 2004 22:37:38 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <1092332536.7315.1.camel@persist.az.mvista.com> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> Message-ID: <20040812203738.GK9722@marowsky-bree.de> On 2004-08-12T10:42:16, Steven Dake said: > agreed... Transis in kernel would be a fine alternative to openais gmi > in kernel. > > Speaking of transis, is the code posted anywhere? I'd like to have a > look. It's not yet at the final location, but we put up what we got at http://wiki.trick.ca/linux-ha/Transis . Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ This space / SUSE Labs, Research and Development | intentionally | SUSE LINUX AG - A Novell company \ left blank / From sdake at mvista.com Thu Aug 12 22:59:10 2004 From: sdake at mvista.com (Steven Dake) Date: Thu, 12 Aug 2004 15:59:10 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <20040812203738.GK9722@marowsky-bree.de> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> Message-ID: <1092351549.7315.5.camel@persist.az.mvista.com> On Thu, 2004-08-12 at 13:37, Lars Marowsky-Bree wrote: > On 2004-08-12T10:42:16, > Steven Dake said: > > > agreed... Transis in kernel would be a fine alternative to openais gmi > > in kernel. > > > > Speaking of transis, is the code posted anywhere? I'd like to have a > > look. > > It's not yet at the final location, but we put up what we got at > http://wiki.trick.ca/linux-ha/Transis . > > Lars Thanks for posting transis. I had a look at the examples and API. The API is of course different then openais and focused on client/server architecture. I tried a performance test by sending a 64k message, and then receiving it 10 times with two nodes. This operation takes about 5 seconds on my hardware which is 128k/sec. I was expecting more like 8-10MB/sec. Is there anything that can be done to improve the performance? Thanks -steve Certainly a different sort of API then openais... > Sincerely, > Lars Marowsky-Br?e From sdake at mvista.com Thu Aug 12 23:08:08 2004 From: sdake at mvista.com (Steven Dake) Date: Thu, 12 Aug 2004 16:08:08 -0700 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Cluster summit materials In-Reply-To: <200408121847.17158.phillips@redhat.com> References: <1092170965.2468.86.camel@cherrybomb.pdx.osdl.net> <200408111454.08677.phillips@redhat.com> <1092259489.14012.55.camel@ibm-c.pdx.osdl.net> <200408121847.17158.phillips@redhat.com> Message-ID: <1092352087.7315.15.camel@persist.az.mvista.com> Daniel comments below On Thu, 2004-08-12 at 15:47, Daniel Phillips wrote: > On Wednesday 11 August 2004 17:24, Daniel McNeil wrote: > > > IMHO, for the time being only failure detection and failover really > > > has to be unified, and that is CMAN, including interaction with > > > other bits and pieces, i.e., Magma and fencing, and hopefully other > > > systems like Lars' SCRAT. 
As far as CMAN goes, Lars and Alan seem > > > to be the main parties outside Red Hat. Lon and Patrick are most > > > active inside Red Hat. I think we'd advance fastest if they start > > > hacking each other's code (anybody I just overlooked, please > > > bellow). > > > > I not sure what you mean by "failure detection and failover". > > Do you mean node failure detection and consensus membership change? > > I mean anything in the cluster that can fail and be reinstantiated. > This would include server processes for cluster block devices such as > the ones I've designed, as well as whole nodes. It would also include > communication paths, such as socket connections. But by now you may > have detected a bias against trying to deal with the latter in a > one-size-fits-all automagic, never-stop-never-give-up cluster > communications thingamajig layer. What we really need is just a > framework for failure detection, including methods supplied by various > cluster components, and methods for re-instantiating failed components. > There really is no reason to reinvent the wheel here. An API has already been developed in the SA Forum Availability Management Framework, and an implementation already exists (http://developer.osdl.org/dev/openais). I suspect there is some work that linux-ha has done on this topic as well. > Note note note: while a "cluster component" could conceivably be a whole > node, that's a special case and we really need to cater to the case > that will eventually be much more common, where cluster nodes may be > doing all kinds of other things besides just participating in clusters. > So by "cluster component" I really mean something closer to "task". > > > I thought Magma is just redhat's backward compatibility layer. > > What "interaction" are you worried about? > > You might want to ask Lon about that... > > > How fencing integrates and when it occurs might be issues we > > will need to think about more. > > Understatement of the day. > > > How can the DLM go to Andrew without a membership layer to > > provide membership? > > By having a simple registration api that allows one to register a > membership layer, in place of what is there now, i.e., function links > between modules. > I think what you are missing is that membership and messaging are strongly related to one another. When a message is sent, it is sent under a certain membership view. When it is received, it should also be received under that same membership view. Otherwise, the view of the membership cannot be used to make decisions along with the message contents. If the distributed system must make decisions about a message based upon the view of the membership (which obviously DLM must do to be reliable) then integrating these two features is the only approach that works. For this reason, membersihp and messaging are tightly integrated, atleast if a reliable distributed system is desired. > > > > So you can call the core service "membership", but what we really > > > > need is membership/communication, which is what cman provides. > > > > Do you have another suggestion for this? TIPC + membership? > > > > > > I think you really mean "connection manager", not "communication > > > service" I'll step back from this now and watch you guys sort it > > > out :-) > > > > I think John really does mean communication. For high availability, > > the cluster should have no single point of failure. This usually > > means multiple ethernet links. 
> > But it's not the business of the cluster framework to operate the links, > only to know when they have failed and to be able to arrange for new > connections. So John really does mean "connection" and not > "communication", I hope. > > > (I assume CMAN supports multiple > > links). To determine membership there needs to be a way of sending > > messages between the nodes to determine membership. Ideally, losing > > one ethernet link could/would be handle without causing any > > membership change. > > "Ideally" is not a strong enough word, imho. > > > This kind of intra-cluster communication would be valuable for > > other cluster components as well. Example: a cluster snapshot :) > > or cluster mirror device should be able to send messages to > > other nodes in the cluster without having to worry about which > > specific link to use and what to do if a link fails. This would > > also be valuable for the DLM. > > OK, we've seen lots of warnings about not getting derailed by trying to > invent the perfect cluster communication system, we should heed those > warnings. Instead, let's get down to precise specification of the > methods we need to have, and compare it to what already exists, for > establishing and re-establishing connections. > The perfect cluster communication model has already been invented: its called virtual synchrony and backed up by 20 years of research. There are several protocols that implement this model. If there is no need for agreed ordering or group communication in dlm, then maybe an argument could be made that virtual synchrony is not appropriate for dlm. But, DLM benefits strongly from the semantics of virtual synchrony and makes implementing a distributed lock service trivial. Thanks for listening -steve > > Does CMAN provide this kind of functionality? If so, then it > > really is a communication service. > > http://people.redhat.com/~teigland/sca.pdf > > Regards, > > Daniel > _______________________________________________ > cgl_discussion mailing list > cgl_discussion at lists.osdl.org > http://lists.osdl.org/mailman/listinfo/cgl_discussion From lmb at suse.de Fri Aug 13 09:40:24 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Fri, 13 Aug 2004 11:40:24 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <1092351549.7315.5.camel@persist.az.mvista.com> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> <1092351549.7315.5.camel@persist.az.mvista.com> Message-ID: <20040813094024.GH4161@marowsky-bree.de> On 2004-08-12T15:59:10, Steven Dake said: > Thanks for posting transis. I had a look at the examples and API. The > API is of course different then openais and focused on client/server > architecture. Right. > I tried a performance test by sending a 64k message, and then receiving > it 10 times with two nodes. This operation takes about 5 seconds on my > hardware which is 128k/sec. I was expecting more like 8-10MB/sec. Is > there anything that can be done to improve the performance? I've not yet done any real tests with it, so I'm not sure. We were mostly going from the theoretical description ;) But I think 128k/s is really a bit low, so I assume something ain't quite right yet... We'll figure it out. 
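Taking the numbers in the quoted test at face value, the arithmetic behind the 128k/sec figure is simply:

    10 messages x 64 KB = 640 KB transferred
    640 KB / ~5 s       = ~128 KB/s observed
    versus the expected 8-10 MB/s, i.e. roughly 60-80x slower than hoped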
It's possible that maybe it's not the way to go afterall, but before we could go looking we first needed it as GPL/LGPL (for not becoming IP-tainted). Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ This space / SUSE Labs, Research and Development | intentionally | SUSE LINUX AG - A Novell company \ left blank / From jonathan at cnds.jhu.edu Fri Aug 13 15:54:41 2004 From: jonathan at cnds.jhu.edu (Jonathan Stanton) Date: Fri, 13 Aug 2004 11:54:41 -0400 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <1092351549.7315.5.camel@persist.az.mvista.com> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> <1092351549.7315.5.camel@persist.az.mvista.com> Message-ID: <20040813155441.GA16662@cnds.jhu.edu> Hi, I just joined the linux-cluster list after seeing a few of the messages that were cross-posted to linux-kernel. On Thu, Aug 12, 2004 at 03:59:10PM -0700, Steven Dake wrote: > Lars > > Thanks for posting transis. I had a look at the examples and API. The > API is of course different then openais and focused on client/server > architecture. If you havn't looked at it already, you might want to try out the Spread group communication system. http://www.spread.org/ It is, conceptually although not code-wise, a decendant of the Transis work (and the Totem system from UCSB) and is relatively widely used as a production quality group messaging system (Some apache modules use it along with a number of large web-clusters, a few commercial clustered storage systems, and a lot of custom replication apps). It is not under GPL but is open-source under a bsd-style (but not exactly the same) license. Like transis it has a client-server architecture (and a simpler API). > I tried a performance test by sending a 64k message, and then receiving > it 10 times with two nodes. This operation takes about 5 seconds on my > hardware which is 128k/sec. I was expecting more like 8-10MB/sec. Is > there anything that can be done to improve the performance? I would expect transis to definitely do better then 128k/s given tests we ran a number of years ago, but on upto medium sized lan environments the totem/spread protocols are generally faster with less cpu overhead. I know Spread could get 80Mb/s a number of years ago. We recently re-ran a clean set of benchmarks and wrote them up. You can find them at: http://www.cnds.jhu.edu/pub/papers/cnds-2004-1.pdf I admit some bias as I'm one of the lead developers of Spread, and we (the developers) have been building group messaging systems since the early 90's -- so I may look at things a bit differently -- so I would be very intersted in your thoughts on how you could use GCS and whether Spread would be useful. Cheers, Jonathan -- ------------------------------------------------------- Jonathan R. Stanton jonathan at cs.jhu.edu Dept. of Computer Science Johns Hopkins University ------------------------------------------------------- From angel at telvia.it Sat Aug 14 18:57:37 2004 From: angel at telvia.it (Angelo Ovidi) Date: Sat, 14 Aug 2004 20:57:37 +0200 Subject: [Linux-cluster] Error compiling GFS patched kernel Message-ID: <001d01c48230$a4877b80$0a14a8c0@venus.it> Hi. I am trying to compile a 2.6.7 kernel patched with cvs version of cluster package of redhat. 
I have no error applying the patches but the compile give me this error: CC [M] fs/gfs/bmap.o CC [M] fs/gfs/daemon.o CC [M] fs/gfs/dio.o CC [M] fs/gfs/dir.o CC [M] fs/gfs/eattr.o CC [M] fs/gfs/file.o CC [M] fs/gfs/flock.o CC [M] fs/gfs/glock.o CC [M] fs/gfs/glops.o CC [M] fs/gfs/inode.o fs/gfs/inode.c: In function `inode_init_and_link': fs/gfs/inode.c:1214: invalid lvalue in unary `&' fs/gfs/inode.c: In function `inode_alloc_hidden': fs/gfs/inode.c:1933: invalid lvalue in unary `&' make[2]: *** [fs/gfs/inode.o] Error 1 make[1]: *** [fs/gfs] Error 2 make: *** [fs] Error 2 What's the problem? Best regards, Angelo Ovidi Venere Net Spa Rome, Italy -------------- next part -------------- An HTML attachment was scrubbed... URL: From lmb at suse.de Fri Aug 13 20:30:29 2004 From: lmb at suse.de (Lars Marowsky-Bree) Date: Fri, 13 Aug 2004 22:30:29 +0200 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <20040813155441.GA16662@cnds.jhu.edu> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> <1092351549.7315.5.camel@persist.az.mvista.com> <20040813155441.GA16662@cnds.jhu.edu> Message-ID: <20040813203029.GW4161@marowsky-bree.de> On 2004-08-13T11:54:41, Jonathan Stanton said: > If you havn't looked at it already, you might want to try out the Spread > group communication system. > > http://www.spread.org/ The intel lawyers have identified the Spread license to be GPL-incompatible. Otherwise, I agree, Spread is very nice. If those issues could be resolved, that may be an interesting option too. (I think the advertising clause and something else clash with the (L)GPL; I can put you in contact with the Intel folks if you wish to resolve this.) Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering \ This space / SUSE Labs, Research and Development | intentionally | SUSE LINUX AG - A Novell company \ left blank / From jonathan at cnds.jhu.edu Fri Aug 13 22:53:15 2004 From: jonathan at cnds.jhu.edu (Jonathan Stanton) Date: Fri, 13 Aug 2004 18:53:15 -0400 Subject: [Linux-cluster] Re: [cgl_discussion] Re: [dcl_discussion] Clustersummit materials In-Reply-To: <20040813203029.GW4161@marowsky-bree.de> References: <3689AF909D816446BA505D21F1461AE4C75110@cacexc04.americas.cpqcorp.net> <1092249962.4717.21.camel@persist.az.mvista.com> <20040812095736.GE4096@marowsky-bree.de> <1092332536.7315.1.camel@persist.az.mvista.com> <20040812203738.GK9722@marowsky-bree.de> <1092351549.7315.5.camel@persist.az.mvista.com> <20040813155441.GA16662@cnds.jhu.edu> <20040813203029.GW4161@marowsky-bree.de> Message-ID: <20040813225315.GD16662@cnds.jhu.edu> On Fri, Aug 13, 2004 at 10:30:29PM +0200, Lars Marowsky-Bree wrote: > On 2004-08-13T11:54:41, > Jonathan Stanton said: > > > If you havn't looked at it already, you might want to try out the Spread > > group communication system. > > > > http://www.spread.org/ > > The intel lawyers have identified the Spread license to be > GPL-incompatible. > > Otherwise, I agree, Spread is very nice. If those issues could be > resolved, that may be an interesting option too. > > (I think the advertising clause and something else clash with the > (L)GPL; I can put you in contact with the Intel folks if you wish to > resolve this.) I would appreciate that. 
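Returning to the "invalid lvalue in unary `&'" build failure in fs/gfs/inode.c reported earlier in this section: that diagnostic is the classic symptom of compiling code that relies on GCC's old cast-as-lvalue extension with a newer compiler (the extension was deprecated and then removed around the gcc 3.4/4.0 timeframe). The fragment below is a generic illustration of the pattern and the usual style of fix; it is not the actual GFS source.

    /* Generic illustration only -- this is not the code from fs/gfs/inode.c. */
    long raw;

    void broken_style(void)
    {
            /* int *p = &(int)raw;   <-- rejected: a cast produces an rvalue,
             *                           so it has no address to take */
    }

    void fixed_style(void)
    {
            int *p = (int *)&raw;    /* cast the resulting pointer instead */
            (void)p;                 /* silence unused-variable warnings   */
    }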
We did choose our licensing for what I think are good reasons, but we have also worked in the past with outside projects with possible license conflicts and have been able to resolve them. So I would like to understand exactly what the issues are. Cheers, Jonathan -- ------------------------------------------------------- Jonathan R. Stanton jonathan at cs.jhu.edu Dept. of Computer Science Johns Hopkins University ------------------------------------------------------- From jeff at intersystems.com Mon Aug 16 14:02:39 2004 From: jeff at intersystems.com (Jeff) Date: Mon, 16 Aug 2004 10:02:39 -0400 Subject: [Linux-cluster] 'make distclean ' leaves generated files behind Message-ID: <364609576.20040816100239@intersystems.com> 'make distclean' fails to clean up the following directories: fence/bin gndb/bin gfs/bin From ben.m.cahill at intel.com Mon Aug 16 18:26:34 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Mon, 16 Aug 2004 11:26:34 -0700 Subject: [Linux-cluster] trouble trying to get ccs/cman working on onemachine, not the other Message-ID: <0604335B7764D141945E202153105960033E24FD@orsmsx404.amr.corp.intel.com> > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Lennert Buytenhek > Sent: Saturday, June 26, 2004 6:08 PM > To: linux-cluster at redhat.com > Subject: Re: [Linux-cluster] trouble trying to get ccs/cman > working on onemachine, not the other > > On Sat, Jun 26, 2004 at 11:30:57PM +0200, Lennert Buytenhek wrote: > > > OK, found out why they didn't see each other. If your /etc/hosts has > something like this: > > 127.0.0.1 phi localhost.localdomain localhost > > (which might be a remnant from an earlier Red Hat install on this box, > created by the installer if you install without initially > configuring a > network adapter) the port 6809 broadcasts will happily be > sent out over > the loopback interface towards 10.255.255.255, and no wonder that your > machines are not going to see each other. > I ran into similar problem on fresh FC2 install (not upgrade), in which I configured my cluster nodes to have static addresses 192.168.1.110 and 192.168.1.111. (IOW, neither of Lennert's guesses above as to source of problem applied to my situation). I manually changed /etc/hosts to replace 127.0.0.1 with, e.g., 192.168.1.110 and was able to join the cluster. Was this the "right thing to do"? Is this a bug in FC2? What should set up /etc/hosts?? Thanks. -- Ben -- Opinions are mine, not Intel's From vijay at cs.umass.edu Tue Aug 17 13:51:24 2004 From: vijay at cs.umass.edu (Vijay Sundaram) Date: Tue, 17 Aug 2004 09:51:24 -0400 (EDT) Subject: [Linux-cluster] gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server Message-ID: I followed the steps on the following page. (Basically setting up a two node cluster) https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage everything seems to work except that I am unable to import any devices. I get the error gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server gnbd_import -e correctly lists the devices being exported by the server. Is it because the fields in /sys/class/gnbd/gnbd0/server are not getting set. A cat /sys/class/gnbd/gnbd0/server gives 00000000:0 When do these fields get set? What am I doing wrong? 
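One quick check worth doing for the gnbd_import parse error above (and suggested in the reply that follows): make sure only one copy of the gnbd module is installed for the running kernel, since a stale gnbd.ko left over from an earlier build, whose sysfs layout may no longer match the userland tools, is a common culprit. Standard module paths assumed:

    find /lib/modules/$(uname -r) -name 'gnbd*.ko'
    # more than one hit means a stale copy; remove it and run depmod -a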
thanks, -- Vijay From danderso at redhat.com Tue Aug 17 14:07:05 2004 From: danderso at redhat.com (Derek Anderson) Date: Tue, 17 Aug 2004 09:07:05 -0500 Subject: [Linux-cluster] gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server In-Reply-To: References: Message-ID: <200408170907.05077.danderso@redhat.com> Vijay, Please see: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126935 Make sure you have the latest from CVS. Also make sure that there are not duplicate gnbd.ko modules under /lib/modules/`uname -r`/drivers/block. On Tuesday 17 August 2004 08:51, Vijay Sundaram wrote: > I followed the steps on the following page. > (Basically setting up a two node cluster) > > https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage > > everything seems to work except that I am unable to import any devices. > I get the error > > gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server > > gnbd_import -e correctly lists the devices being exported by the > server. > > Is it because the fields in /sys/class/gnbd/gnbd0/server are not getting > set. > > A > cat /sys/class/gnbd/gnbd0/server > gives > 00000000:0 > > When do these fields get set? > What am I doing wrong? > > thanks, > -- Vijay > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From pavel at ucw.cz Mon Aug 16 19:26:02 2004 From: pavel at ucw.cz (Pavel Machek) Date: Mon, 16 Aug 2004 21:26:02 +0200 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <410D2949.20503@backtobasicsmgmt.com> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> <410D2949.20503@backtobasicsmgmt.com> Message-ID: <20040816192602.GA467@openzaurus.ucw.cz> Hi! > >I wonder if device-mapper (slightly hacked) wouldn't be a better > >approach for 2.6+. > > It appeared from the original posting that their "cluster-wide devfs" > actually supported all types of device nodes, not just block devices. > I don't know whether accessing a character device on another node > would ever be useful, but certainly using device-mapper wouldn't help > for that case. Remote character devices seem extremely usefull to me... mpg456 --device /dev/kitchen/dsp cat /dev/roof/dsp > /dev/laptop/dsp cat picture-to-scare-pigeons.raw > /dev/roof/fb0 X --device=/dev/livingroom/fb0 .... Okay, it will probably take a while until SSI cluster is the right tool to network your home :-). Pavel -- 64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms From vijay at cs.umass.edu Tue Aug 17 14:12:15 2004 From: vijay at cs.umass.edu (Vijay Sundaram) Date: Tue, 17 Aug 2004 10:12:15 -0400 (EDT) Subject: [Linux-cluster] gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server In-Reply-To: <200408170907.05077.danderso@redhat.com> Message-ID: Hi Derek, No, I do not have a duplicate modules. Also, I picked the snapshot from this page http://sources.redhat.com/cluster/releases/cvs_snapshots/ Is that good enough? thanks, -- Vijay On Tue, 17 Aug 2004, Derek Anderson wrote: > Vijay, > > Please see: > http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126935 > > Make sure you have the latest from CVS. Also make sure that there are not > duplicate gnbd.ko modules under /lib/modules/`uname -r`/drivers/block. > > On Tuesday 17 August 2004 08:51, Vijay Sundaram wrote: > > I followed the steps on the following page. 
> > (Basically setting up a two node cluster) > > > > https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage > > > > everything seems to work except that I am unable to import any devices. > > I get the error > > > > gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server > > > > gnbd_import -e correctly lists the devices being exported by the > > server. > > > > Is it because the fields in /sys/class/gnbd/gnbd0/server are not getting > > set. > > > > A > > cat /sys/class/gnbd/gnbd0/server > > gives > > 00000000:0 > > > > When do these fields get set? > > What am I doing wrong? > > > > thanks, > > -- Vijay > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > -- -- Vijay From danderso at redhat.com Tue Aug 17 14:28:47 2004 From: danderso at redhat.com (Derek Anderson) Date: Tue, 17 Aug 2004 09:28:47 -0500 Subject: [Linux-cluster] gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server In-Reply-To: References: Message-ID: <200408170928.47731.danderso@redhat.com> Latest snapshot is from June 26; already pretty old. You'll have better luck checking a tree directly out of cvs. Look under the "Source code" section at http://sources.redhat.com/cluster/ s.r.c/cluster maintainers: Time for an updated snapshot already? On Tuesday 17 August 2004 09:12, Vijay Sundaram wrote: > Hi Derek, > > No, I do not have a duplicate modules. > Also, I picked the snapshot from this page > > http://sources.redhat.com/cluster/releases/cvs_snapshots/ > > Is that good enough? > > thanks, > -- Vijay > > On Tue, 17 Aug 2004, Derek Anderson wrote: > > Vijay, > > > > Please see: > > http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126935 > > > > Make sure you have the latest from CVS. Also make sure that there are > > not duplicate gnbd.ko modules under /lib/modules/`uname > > -r`/drivers/block. > > > > On Tuesday 17 August 2004 08:51, Vijay Sundaram wrote: > > > I followed the steps on the following page. > > > (Basically setting up a two node cluster) > > > > > > https://open.datacore.ch/DCwiki.open/Wiki.jsp?page=GFS.GNBD.Usage > > > > > > everything seems to work except that I am unable to import any devices. > > > I get the error > > > > > > gnbd_import: ERROR cannot parse /sys/class/gnbd/gnbd0/server > > > > > > gnbd_import -e correctly lists the devices being exported by > > > the server. > > > > > > Is it because the fields in /sys/class/gnbd/gnbd0/server are not > > > getting set. > > > > > > A > > > cat /sys/class/gnbd/gnbd0/server > > > gives > > > 00000000:0 > > > > > > When do these fields get set? > > > What am I doing wrong? > > > > > > thanks, > > > -- Vijay > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > http://www.redhat.com/mailman/listinfo/linux-cluster From jeff at intersystems.com Tue Aug 17 14:56:53 2004 From: jeff at intersystems.com (Jeff) Date: Tue, 17 Aug 2004 10:56:53 -0400 Subject: [Linux-cluster] DLM patch: owner pid, lock ordering and new expedite flag Message-ID: <619335021.20040817105653@intersystems.com> The attached patch contains the following changes: 1) Track process which owns locks Support is added for tracking the pid of the process which owns a lock. This is returned from a query operation and used in debug_log() messages. 
2) Change rules for granting new locks When the LSFL_NOCONVGRANT flag is specified for a lockspace the rules for granting a new lock are: 1) There must be no locks on the conversion queue 2) There must be no other locks on the grant queue 3) Change rules for granting locks when a lock is released or converted to a lower mode When the LSFL_NOCONVGRANT flag is specified for a lockspace the rules for granting pending locks when a lock is released/converted down are: 1) Only the lock at the head of a queue and any compatible locks which immediately follow it are eligible to be granted. 2) The waiting queue is only processed if the conversion queue is empty 4) Added LKF_GRNLEXPEDITE The current LKF_EXPEDITE flag means that if the lock has to be queued, it is queued at the head of the queue. LKF_GRNLEXPEDITE has meaning when LSFL_NOCONVGRANT is specified. It is only valid on a grant request for a NL lock and it means that the lock is granted regardless of whether there are any locks waiting on a queue. -------------- next part -------------- A non-text attachment was scrubbed... Name: patch.pid-and-lockorder Type: application/octet-stream Size: 12962 bytes Desc: not available URL: From jeff at intersystems.com Wed Aug 18 13:16:05 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 18 Aug 2004 09:16:05 -0400 Subject: [Linux-cluster] Permissions in create_dlm_namespace() call ignored Message-ID: <323211221.20040818091605@intersystems.com> Assuming that the named DLM namespace does not already exist, the following code should create a namespace which any process on the system can open. However it doesn't work and subsequent processes must be root or else the open_lockspace call fails with "Error opening dlm namespace: Permission denied" dlm_lshandle_t dlmnamesp; int i; i = umask(0); dlmnamesp = dlm_create_lockspace("play",0777); /* S_IRWXU|S_IRWXG|S_IRWXO */ if (!dlmnamesp) { dlmnamesp = dlm_open_lockspace("play"); if (!dlmnamesp) { umask(i); perror("Error opening dlm namespace"); exit(1); } } umask(i); if (dlm_ls_pthread_init(dlmnamesp)) { perror("dlm_pthread_init failed"); exit(1); } [jeff at lx3 ~]$ ls -l /dev/misc total 0 ?--------- ? ? ? ? ? dlm-control ?--------- ? ? ? ? ? dlm_play ?--------- ? ? ? ? ? dlm_default ?--------- ? ? ? ? ? dlm_testls [jeff at lx3 ~]$ From pcaulfie at redhat.com Wed Aug 18 13:46:45 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 18 Aug 2004 14:46:45 +0100 Subject: [Linux-cluster] Permissions in create_dlm_namespace() call ignored In-Reply-To: <323211221.20040818091605@intersystems.com> References: <323211221.20040818091605@intersystems.com> Message-ID: <20040818134643.GD31539@tykepenguin.com> On Wed, Aug 18, 2004 at 09:16:05AM -0400, Jeff wrote: > Assuming that the named DLM namespace does not > already exist, the following code should > create a namespace which any process on the system > can open. However it doesn't work and subsequent > processes must be root or else the open_lockspace > call fails with Odd, it works here: dlm_create_lockspace(lsname, 0755); # ls -l /dev/misc/ total 0 crw-r--r-- 1 root root 10, 62 Jun 11 08:20 dlm-control crw------- 1 root root 10, 61 Aug 17 13:39 dlm_default crwxr-xr-x 1 root root 10, 60 Aug 18 14:44 dlm_testls crw-r--r-- 1 root root 10, 62 Feb 19 08:38 gdlm Have you checked the value of umask ? or is SELinux getting in the way ?(eek!) 
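The follow-up that closes this thread traces the problem to directory permissions rather than umask: a device node can only be opened if every directory on the path to it carries the execute (search) bit. A quick way to check and fix, using the path and lockspace name from this report:

    ls -ld /dev/misc           # drw-rw-rw- has no 'x', so nothing inside can be opened
    chmod a+x /dev/misc        # now drwxrwxrwx; per-device modes apply again
    ls -l /dev/misc/dlm_play   # the 0777 passed to dlm_create_lockspace() takes effect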
-- patrick From jeff at intersystems.com Wed Aug 18 14:01:34 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 18 Aug 2004 10:01:34 -0400 Subject: [Linux-cluster] Permissions in create_dlm_namespace() call ignored In-Reply-To: <20040818134643.GD31539@tykepenguin.com> References: <323211221.20040818091605@intersystems.com> <20040818134643.GD31539@tykepenguin.com> Message-ID: <1183043277.20040818100134@intersystems.com> Wednesday, August 18, 2004, 9:46:45 AM, Patrick Caulfield wrote: > On Wed, Aug 18, 2004 at 09:16:05AM -0400, Jeff wrote: >> Assuming that the named DLM namespace does not >> already exist, the following code should >> create a namespace which any process on the system >> can open. However it doesn't work and subsequent >> processes must be root or else the open_lockspace >> call fails with > Odd, it works here: > dlm_create_lockspace(lsname, 0755); > # ls -l /dev/misc/ > total 0 > crw-r--r-- 1 root root 10, 62 Jun 11 08:20 dlm-control > crw------- 1 root root 10, 61 Aug 17 13:39 dlm_default > crwxr-xr-x 1 root root 10, 60 Aug 18 14:44 dlm_testls > crw-r--r-- 1 root root 10, 62 Feb 19 08:38 gdlm > Have you checked the value of umask ? > or is SELinux getting in the way ?(eek!) Apologies for the earlier ls -l output, that was from a user process, not a root job. It really looks like: [root at lx3]# ls -l /dev/misc total 0 crw-r--r-- 1 root root 10, 62 Jul 21 06:24 dlm-control crwxrwxrwx 1 root root 10, 61 Jul 21 06:24 dlm_default crwxrwxrwx 1 root root 10, 59 Aug 18 09:15 dlm_play crwxrwxrwx 1 root root 10, 60 Jul 21 06:26 dlm_testls The problem is that /dev/misc was missing x permission: drw-rw-rw- 2 root root 4096 Aug 18 09:15 /dev/misc/ Changing this to drwxrwxrwx 2 root root 4096 Aug 18 09:55 /dev/misc/ allows non-root jobs to connect to namespaces based on the namespace's permissions. From rmayhew at mweb.com Wed Aug 18 14:17:34 2004 From: rmayhew at mweb.com (Richard Mayhew) Date: Wed, 18 Aug 2004 16:17:34 +0200 Subject: [Linux-cluster] GFS Node Limit? Message-ID: <91C4F1A7C418014D9F88E938C135545881A519@mwjdc2.mweb.com> Hi, I have 4 gulm_lock servers setup and 6 gulm_lock clients. I can mount the GFS file systems on all the lock servers and only 1 client. When I try and add another client (which is specified in the nodes list) it logs in to the master gulm lock server with no problems. When I try mount the gfs file systems it hangs, until I unmount the file system from another client. Its as if there is a max of either 1 Client or a total of 5 servers/clients that can mount the GFS FS;s.......Doesn't make sense... Any ideas? -- Regards Richard Mayhew Unix Specialist From danderso at redhat.com Wed Aug 18 14:30:35 2004 From: danderso at redhat.com (Derek Anderson) Date: Wed, 18 Aug 2004 09:30:35 -0500 Subject: [Linux-cluster] GFS Node Limit? In-Reply-To: <91C4F1A7C418014D9F88E938C135545881A519@mwjdc2.mweb.com> References: <91C4F1A7C418014D9F88E938C135545881A519@mwjdc2.mweb.com> Message-ID: <200408180930.35597.danderso@redhat.com> How many journals on your filesystem? On Wednesday 18 August 2004 09:17, Richard Mayhew wrote: > Hi, > > I have 4 gulm_lock servers setup and 6 gulm_lock clients. > > I can mount the GFS file systems on all the lock servers and only 1 > client. When I try and add another client (which is specified in the > nodes list) it logs in to the master gulm lock server with no problems. > When I try mount the gfs file systems it hangs, until I unmount the file > system from another client. 
Its as if there is a max of either 1 Client > or a total of 5 servers/clients that can mount the GFS > FS;s.......Doesn't make sense... > > Any ideas? > > > -- > > Regards > > Richard Mayhew > Unix Specialist > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From rmayhew at mweb.com Wed Aug 18 14:40:19 2004 From: rmayhew at mweb.com (Richard Mayhew) Date: Wed, 18 Aug 2004 16:40:19 +0200 Subject: [Linux-cluster] GFS Node Limit? Message-ID: <91C4F1A7C418014D9F88E938C135545881A533@mwjdc2.mweb.com> I have 4 mounts of 50GB each. Each mount has 8 Journals...(this amount I found in the manual somewhere) Is this a prob? Do you have a recommended FS layout etc? -- Regards Richard Mayhew Unix Specialist -----Original Message----- From: Derek Anderson [mailto:danderso at redhat.com] Sent: 18 August 2004 04:31 PM To: Discussion of clustering software components including GFS; Richard Mayhew Subject: Re: [Linux-cluster] GFS Node Limit? How many journals on your filesystem? On Wednesday 18 August 2004 09:17, Richard Mayhew wrote: > Hi, > > I have 4 gulm_lock servers setup and 6 gulm_lock clients. > > I can mount the GFS file systems on all the lock servers and only 1 > client. When I try and add another client (which is specified in the > nodes list) it logs in to the master gulm lock server with no problems. > When I try mount the gfs file systems it hangs, until I unmount the > file system from another client. Its as if there is a max of either 1 > Client or a total of 5 servers/clients that can mount the GFS > FS;s.......Doesn't make sense... > > Any ideas? > > > -- > > Regards > > Richard Mayhew > Unix Specialist > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From rmayhew at mweb.com Wed Aug 18 21:06:28 2004 From: rmayhew at mweb.com (Richard Mayhew) Date: Wed, 18 Aug 2004 23:06:28 +0200 Subject: [Linux-cluster] GFS Node Limit? Message-ID: <91C4F1A7C418014D9F88E938C135545881A543@mwjdc2.mweb.com> I have increased the number of journals on each FS from the 8 previously discussed to 16. At present the max number of servers that I intend on using will be 10, leaving 6 journals open for expansion. When trying to mount the 6th server, I end up with the same problem. The mount hangs until I remove a mount from another system before the 6th system is able to complete the GFS mount. After adding on full verbosity on the gulm lock server I still can't see anything of interest. Dmesg offers no clue either. The only message I'm seeing is the following : "GFS Kernel Interface" is logged out. fd:10" etc. After searching the net, the only advice I can find is to increase the number of journals to at least the number of nodes. This I have done with no success.... Any other ideas? -- Regards Richard Mayhew Unix Specialist -----Original Message----- From: Richard Mayhew [mailto:rmayhew at mweb.com] Sent: 18 August 2004 04:40 PM To: Derek Anderson Cc: linux-cluster at redhat.com Subject: RE: [Linux-cluster] GFS Node Limit? I have 4 mounts of 50GB each. Each mount has 8 Journals...(this amount I found in the manual somewhere) Is this a prob? Do you have a recommended FS layout etc? -- Regards Richard Mayhew Unix Specialist -----Original Message----- From: Derek Anderson [mailto:danderso at redhat.com] Sent: 18 August 2004 04:31 PM To: Discussion of clustering software components including GFS; Richard Mayhew Subject: Re: [Linux-cluster] GFS Node Limit? 
How many journals on your filesystem? On Wednesday 18 August 2004 09:17, Richard Mayhew wrote: > Hi, > > I have 4 gulm_lock servers setup and 6 gulm_lock clients. > > I can mount the GFS file systems on all the lock servers and only 1 > client. When I try and add another client (which is specified in the > nodes list) it logs in to the master gulm lock server with no problems. > When I try mount the gfs file systems it hangs, until I unmount the > file system from another client. Its as if there is a max of either 1 > Client or a total of 5 servers/clients that can mount the GFS > FS;s.......Doesn't make sense... > > Any ideas? > > > -- > > Regards > > Richard Mayhew > Unix Specialist > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From ben.m.cahill at intel.com Wed Aug 18 21:08:37 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Wed, 18 Aug 2004 14:08:37 -0700 Subject: [Linux-cluster] man page for gfs_mount? Message-ID: <0604335B7764D141945E202153105960033E2507@orsmsx404.amr.corp.intel.com> I don't see a man page for gfs_mount in CVS anywhere. There's one in OpenGFS, originated by Sistina ... do you want to put this (after being properly updated) in current GFS man page suite? If so, I can take a shot at updating it, and submit as a patch. Or maybe you've got a current one that just didn't get into CVS? -- Ben -- From phillips at redhat.com Wed Aug 18 23:12:55 2004 From: phillips at redhat.com (Daniel Phillips) Date: Wed, 18 Aug 2004 19:12:55 -0400 Subject: [Linux-cluster] man page for gfs_mount? In-Reply-To: <0604335B7764D141945E202153105960033E2507@orsmsx404.amr.corp.intel.com> References: <0604335B7764D141945E202153105960033E2507@orsmsx404.amr.corp.intel.com> Message-ID: <200408181912.55732.phillips@redhat.com> Hi Ben, On Wednesday 18 August 2004 17:08, Cahill, Ben M wrote: > I don't see a man page for gfs_mount in CVS anywhere. > > There's one in OpenGFS, originated by Sistina ... do you want to put > this (after being properly updated) in current GFS man page suite? > If so, I can take a shot at updating it, and submit as a patch. That sounds great. Regards, Daniel From amanthei at redhat.com Wed Aug 18 23:29:23 2004 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 18 Aug 2004 18:29:23 -0500 Subject: [Linux-cluster] GFS Node Limit? In-Reply-To: <91C4F1A7C418014D9F88E938C135545881A543@mwjdc2.mweb.com> References: <91C4F1A7C418014D9F88E938C135545881A543@mwjdc2.mweb.com> Message-ID: <20040818232923.GG8038@redhat.com> On Wed, Aug 18, 2004 at 11:06:28PM +0200, Richard Mayhew wrote: > > I have increased the number of journals on each FS from the 8 previously > discussed to 16. At present the max number of servers that I intend on > using will be 10, leaving 6 journals open for expansion. When trying to > mount the 6th server, I end up with the same problem. The mount hangs > until I remove a mount from another system before the 6th system is able > to complete the GFS mount. > > After adding on full verbosity on the gulm lock server I still can't see > anything of interest. Dmesg offers no clue either. The only message I'm > seeing is the following : "GFS Kernel Interface" is logged out. fd:10" > etc. > > After searching the net, the only advice I can find is to increase the > number of journals to at least the number of nodes. This I have done > with no success.... > > Any other ideas? 
Gulm can only use 5 nodes in the servers list in cluster.ccs/cluster/lock_gulmd/servers. This could be why you are having difficulties. Try trimming the list and see if that yields better results. You may also want to make sure that your node names are uniquely identifiable by their first 8 characters (this was a bug in version 6.0.0-1.2 of the RPMs, but since you didn't post the version of the code you are using, I can only offer this as a suggestion ;) There would be failure messages to the console stating that you didn't have enough journals if that was the problem. good luck -- Adam Manthei From rmayhew at mweb.com Thu Aug 19 09:35:40 2004 From: rmayhew at mweb.com (Richard Mayhew) Date: Thu, 19 Aug 2004 11:35:40 +0200 Subject: [Linux-cluster] GFS Node Limit? Message-ID: <91C4F1A7C418014D9F88E938C135545881A60A@mwjdc2.mweb.com> Hi Brilliant! All sorted now...I had no idea about this bug. Is it documented anywhere noticeable? I had all my service servers named services-01, services-02. Renamed them to serv-01 ... And it worked! Thanks for all your help! -- Regards Richard Mayhew Unix Specialist -----Original Message----- From: Adam Manthei [mailto:amanthei at redhat.com] Sent: 19 August 2004 01:29 AM To: Discussion of clustering software components including GFS Subject: Re: [Linux-cluster] GFS Node Limit? On Wed, Aug 18, 2004 at 11:06:28PM +0200, Richard Mayhew wrote: > > I have increased the number of journals on each FS from the 8 > previously discussed to 16. At present the max number of servers that > I intend on using will be 10, leaving 6 journals open for expansion. > When trying to mount the 6th server, I end up with the same problem. > The mount hangs until I remove a mount from another system before the > 6th system is able to complete the GFS mount. > > After adding on full verbosity on the gulm lock server I still can't > see anything of interest. Dmesg offers no clue either. The only > message I'm seeing is the following : "GFS Kernel Interface" is logged out. fd:10" > etc. > > After searching the net, the only advice I can find is to increase the > number of journals to at least the number of nodes. This I have done > with no success.... > > Any other ideas? Gulm can only use 5 nodes in the servers list in cluster.ccs/cluster/lock_gulmd/servers. This could be why you are having difficulties. Try trimming the list and see if that yields better results. You may also want to make sure that your node names are uniquely identifiable by their first 8 characters (this was a bug in version 6.0.0-1.2 of the RPMs, but since you didn't post the version of the code you are using, I can only offer this as a suggestion ;) There would be failure messages to the console stating that you didn't have enough journals if that was the problem. good luck -- Adam Manthei -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From Axel.Thimm at ATrpms.net Thu Aug 19 10:24:25 2004 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Thu, 19 Aug 2004 12:24:25 +0200 Subject: [Linux-cluster] RHEL3/kernel 2.4/GFS 6.0 and 2TB limit Message-ID: <20040819102425.GF7626@neu.physik.fu-berlin.de> Within the next weeks/months I'd like to setup a GFS cluster on > 2TB storage backends. There are 2TB (or 1TB) limits for fs (and/or block devices?) for 2.4/32bits. What is the best way to proceed/plan? I understand kernel 2.6 and GFS/cvs lift the limits, should I replace the RHEL3's 2.4 kernel with 2.6 and go GFS/cvs? 
What about the cluster suite, would it still play nice with GFS/cvs and kernel 2.6? Are there any plans for pushing out a GFS release within the next 1-2 months, or any known RHEL plans wrt GFS? If I know that RHEL3 or some other setup will support > 2TB storage backends in the near future I can start setting up and testing the cluster. Thanks! -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From Paul_Besett at raytheon.com Thu Aug 19 12:40:56 2004 From: Paul_Besett at raytheon.com (Paul Besett) Date: Thu, 19 Aug 2004 08:40:56 -0400 Subject: [Linux-cluster] GFS Performance Best Practices Message-ID: I've got GFS 6.0 successfully loaded and running with two nodes. Now that I'm satisfied with that, I want to expand this and move into an operational environment. Something like 10 nodes with 4-5 partitions. There will be a mix of nodes and partitions, like node 1 mounting partitions 2, 3, and 4; node 2 mounting partitions 3, 4, and 5; and so on. The administration guide is pretty weak on best practices, and I want to maximize performance. Can anyone provide pointers to where I might find this or offer some tips on set up or things to avoid? Thanks, Paul From anton at hq.310.ru Thu Aug 19 12:37:25 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 19 Aug 2004 16:37:25 +0400 Subject: [Linux-cluster] gfs_eattr tool ? Message-ID: <13410176857.20040819163725@hq.310.ru> Hi all, i found manual for gfs_eattr but not found this tool ? -- e-mail: anton at hq.310.ru From macfisherman at gmail.com Wed Aug 18 20:38:51 2004 From: macfisherman at gmail.com (Jeff Macdonald) Date: Wed, 18 Aug 2004 16:38:51 -0400 Subject: [Linux-cluster] Re: [ANNOUNCE] OpenSSI 1.0.0 released!! In-Reply-To: <20040816192602.GA467@openzaurus.ucw.cz> References: <3689AF909D816446BA505D21F1461AE4C750E6@cacexc04.americas.cpqcorp.net> <200408011330.01848.phillips@istop.com> <410D2949.20503@backtobasicsmgmt.com> <20040816192602.GA467@openzaurus.ucw.cz> Message-ID: <45ae90370408181338680f71bd@mail.gmail.com> On Mon, 16 Aug 2004 21:26:02 +0200, Pavel Machek wrote: > Remote character devices seem extremely usefull to me... > > mpg456 --device /dev/kitchen/dsp > > cat /dev/roof/dsp > /dev/laptop/dsp > > cat picture-to-scare-pigeons.raw > /dev/roof/fb0 > > X --device=/dev/livingroom/fb0 > > ..... Okay, it will probably take a while until SSI cluster is the > right tool to network your home :-). Isn't that what Inferno is suppose to be able to do? -- Jeff Macdonald Ayer, MA From Paul_Besett at raytheon.com Wed Aug 18 22:22:04 2004 From: Paul_Besett at raytheon.com (Paul Besett) Date: Wed, 18 Aug 2004 18:22:04 -0400 Subject: [Linux-cluster] GFS Performance Best Practices Message-ID: I've got GFS 6.0 successfully loaded and running with two nodes. Now that I'm satisfied with that, I want to expand this and move into an operational environment. Something like 10 nodes with 4-5 partitions. There will be a mix of nodes and partitions, like node 1 mounting partitions 2, 3, and 4; node 2 mounting partitions 3, 4, and 5; and so on. The documentation is pretty weak on best practices, and I want to maximize performance. Can anyone provide pointers to where I might find this or offer some tips on set up or things to avoid? 
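One concrete sizing rule that matters for a layout like this (and that the GFS Node Limit thread elsewhere in this section runs into): each GFS filesystem needs at least one journal per node that will ever mount it, and the journal count is fixed when the filesystem is made. A sketch of what that looks like at mkfs time with GULM locking; the cluster name, filesystem name and pool device below are placeholders:

    gfs_mkfs -p lock_gulm -t mycluster:gfs01 -j 12 /dev/pool/pool0
    # -j 12 leaves a couple of spare journals beyond the ten planned nodes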
Thanks, Paul From danderso at redhat.com Thu Aug 19 13:36:16 2004 From: danderso at redhat.com (Derek Anderson) Date: Thu, 19 Aug 2004 08:36:16 -0500 Subject: [Linux-cluster] gfs_eattr tool ? In-Reply-To: <13410176857.20040819163725@hq.310.ru> References: <13410176857.20040819163725@hq.310.ru> Message-ID: <200408190836.16708.danderso@redhat.com> Hmm. This man page should not have been included. The standard setfattr(1) and getfattr(1) should be used to set and get extended attributes on GFS. On Thursday 19 August 2004 07:37, Anton Nekhoroshikh wrote: > Hi all, > > i found manual for gfs_eattr but not found this tool ? From amanthei at redhat.com Thu Aug 19 13:40:01 2004 From: amanthei at redhat.com (Adam Manthei) Date: Thu, 19 Aug 2004 08:40:01 -0500 Subject: [Linux-cluster] GFS Node Limit? In-Reply-To: <91C4F1A7C418014D9F88E938C135545881A60A@mwjdc2.mweb.com> References: <91C4F1A7C418014D9F88E938C135545881A60A@mwjdc2.mweb.com> Message-ID: <20040819134001.GJ8038@redhat.com> On Thu, Aug 19, 2004 at 11:35:40AM +0200, Richard Mayhew wrote: > Hi > Brilliant! > > All sorted now...I had no idea about this bug. Is it documented anywhere > noticeable? http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127828 I would have posted the bugzilla number last night, but was just too lazy to do the search ;) > I had all my service servers named services-01, services-02. Renamed > them to serv-01 ... And it worked! > Thanks for all your help! > -----Original Message----- > From: Adam Manthei [mailto:amanthei at redhat.com] > Sent: 19 August 2004 01:29 AM > To: Discussion of clustering software components including GFS > Subject: Re: [Linux-cluster] GFS Node Limit? > > On Wed, Aug 18, 2004 at 11:06:28PM +0200, Richard Mayhew wrote: > > > > I have increased the number of journals on each FS from the 8 > > previously discussed to 16. At present the max number of servers that > > I intend on using will be 10, leaving 6 journals open for expansion. > > When trying to mount the 6th server, I end up with the same problem. > > The mount hangs until I remove a mount from another system before the > > 6th system is able to complete the GFS mount. > > > > After adding on full verbosity on the gulm lock server I still can't > > see anything of interest. Dmesg offers no clue either. The only > > message I'm seeing is the following : "GFS Kernel Interface" is logged > out. fd:10" > > etc. > > > > After searching the net, the only advice I can find is to increase the > > > number of journals to at least the number of nodes. This I have done > > with no success.... > > > > Any other ideas? > > Gulm can only use 5 nodes in the servers list in > cluster.ccs/cluster/lock_gulmd/servers. This could be why you are > having difficulties. Try trimming the list and see if that yields > better results. > > You may also want to make sure that your node names are uniquely > identifiable by their first 8 characters (this was a bug in version > 6.0.0-1.2 of the RPMs, but since you didn't post the version of the code > you are using, I can only offer this as a suggestion ;) > > There would be failure messages to the console stating that you didn't > have enough journals if that was the problem.
> > good luck > -- > Adam Manthei > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From anton at hq.310.ru Thu Aug 19 13:45:39 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 19 Aug 2004 17:45:39 +0400 Subject: [Linux-cluster] gfs_eattr tool ? In-Reply-To: <200408190836.16708.danderso@redhat.com> References: <13410176857.20040819163725@hq.310.ru> <200408190836.16708.danderso@redhat.com> Message-ID: <426365758.20040819174539@hq.310.ru> Hi Derek, Thursday, August 19, 2004, 5:36:16 PM, you wrote: Derek Anderson> Hmm. This man page should have been Derek Anderson> not be included. The standard Derek Anderson> setfattr(1) and getfattr(1) should be Derek Anderson> used to set and get extended attributes Derek Anderson> on GFS. On ext2(3) i using chattr +i file for set immutable flag. I can set this flag thru setfattr on GFS? chattr on GFS: # chattr +i sh chattr: Inappropriate ioctl for device while reading flags on sh Derek Anderson> On Thursday 19 August 2004 07:37, Derek Anderson> ????? ????????? wrote: >> Hi all, >> >> i found manual for gfs_eattr but not found this tool ? -- e-mail: anton at hq.310.ru From anton at hq.310.ru Thu Aug 19 13:51:14 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 19 Aug 2004 17:51:14 +0400 Subject: [Linux-cluster] gfs_eattr tool ? In-Reply-To: <200408190836.16708.danderso@redhat.com> References: <13410176857.20040819163725@hq.310.ru> <200408190836.16708.danderso@redhat.com> Message-ID: <16310644182.20040819175114@hq.310.ru> Hi Derek, Thursday, August 19, 2004, 5:36:16 PM, you wrote: Derek Anderson> Hmm. This man page should have been Derek Anderson> not be included. The standard Derek Anderson> setfattr(1) and getfattr(1) should be Derek Anderson> used to set and get extended attributes Derek Anderson> on GFS. On ext2(3) i using chattr +i file for set immutable flag. I can set this flag thru setfattr on GFS? chattr on GFS: # chattr +i sh chattr: Inappropriate ioctl for device while reading flags on sh Derek Anderson> On Thursday 19 August 2004 07:37, Derek Anderson> ????? ????????? wrote: >> Hi all, >> >> i found manual for gfs_eattr but not found this tool ? -- e-mail: anton at hq.310.ru From ritesh.a at net4india.net Thu Aug 19 03:24:19 2004 From: ritesh.a at net4india.net (Ritesh Agrawal) Date: Thu, 19 Aug 2004 08:54:19 +0530 Subject: [Linux-cluster] Active Active Configuration Message-ID: <41241D63.5090102@net4india.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi All, ~ I am using Redhat Linux Cluster suit and EL for High avalilblity server with active/passive configuration. In active /passive configuration , one server(master) act as load balancer,and another one comes on the scene as fail over server , It means only one server's computation power (as load balancer) is used at a time .But i want to use both server's computational power (active/ active) configuration as well as take over the responsiblities of each others in case of one's failure. any suggestion or tutorials regarding this, you have ? one more thing , how to implement GFS in cluster with better optimization. - -- Regards Ritesh Agrawal Senior Engineer-Systems Net 4 India Ltd, B-4/47, Safdarjung Enclave, New Delhi- 110 029, India +----------------------------------------------------+ ~ I think, therefore I am. 
...................................................... Public Key Server: http://keyserver.veridis.com/en/ GPG Key Fingerprint : D017 1B21 A699 BDF8 CFDD 2D78 168C FE3F DE63 9D32 +----------------------------------------------------+ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBJB1jFoz+P95jnTIRAvwKAKCtQuXXmt3xsQBX480kZophr02engCfY4rU oBAij0iqwZiiL09ySkWIHXQ= =IBGm -----END PGP SIGNATURE----- From bmarzins at redhat.com Thu Aug 19 15:21:22 2004 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Thu, 19 Aug 2004 10:21:22 -0500 Subject: [Linux-cluster] gfs_eattr tool ? In-Reply-To: <16310644182.20040819175114@hq.310.ru> References: <13410176857.20040819163725@hq.310.ru> <200408190836.16708.danderso@redhat.com> <16310644182.20040819175114@hq.310.ru> Message-ID: <20040819152122.GA12234@phlogiston.msp.redhat.com> On Thu, Aug 19, 2004 at 05:51:14PM +0400, ????? ????????? wrote: > Hi Derek, > > Thursday, August 19, 2004, 5:36:16 PM, you wrote: > > Derek Anderson> Hmm. This man page should have been > Derek Anderson> not be included. The standard > Derek Anderson> setfattr(1) and getfattr(1) should be > Derek Anderson> used to set and get extended attributes > Derek Anderson> on GFS. > > On ext2(3) i using chattr +i file for set immutable flag. > > I can set this flag thru setfattr on GFS? > > chattr on GFS: > # chattr +i sh > chattr: Inappropriate ioctl for device while reading flags on sh I might be wrong, but I don't think chattr has anything to do with extended attributes. gfs_eattr was an old tool for setting extended attributes on GFS, used before there was a syscall for it. I don't know offhand of any way to set all the file attributes in GFS, that chattr can set in ext2/3 "gfs_tool setflag" allows you to set some file attributes, but not make the inode immutable. If someone else knows another way, speak up. -Ben > Derek Anderson> On Thursday 19 August 2004 07:37, > Derek Anderson> ????? ????????? wrote: > >> Hi all, > >> > >> i found manual for gfs_eattr but not found this tool ? > > > > > > -- > e-mail: anton at hq.310.ru > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From mauelshagen at redhat.com Wed Aug 18 19:06:10 2004 From: mauelshagen at redhat.com (Heinz Mauelshagen) Date: Wed, 18 Aug 2004 21:06:10 +0200 Subject: [Linux-cluster] *** Announcement: dmraid 1.0.0-rc3 *** Message-ID: <20040818190610.GA6259@redhat.com> *** Announcement: dmraid 1.0.0-rc3 *** dmraid 1.0.0-rc3 is available at http://people.redhat.com:/~heinzm/sw/dmraid/ in source, source rpm and i386 rpm. dmraid (Device-Mapper Raid tool) discovers, [de]activates and displays properties of software RAID sets (ie. ATARAID) and contained DOS partitions using the device-mapper runtime of the 2.6 kernel. The following ATARAID types are supported on Linux 2.6: Highpoint HPT37X Highpoint HPT45X Intel Software RAID Promise FastTrack Silicon Image Medley This ATARAID type is only basically supported in this version (I need better metadata format specs; please help): LSI Logic MegaRAID Please provide insight to support those metadata formats completely. Thanks. See files README and CHANGELOG, which come with the source tarball for prerequisites to run this software, further instructions on installing and using dmraid! CHANGELOG is contained below for your convenience as well. 
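(A hedged usage sketch, inserted for illustration and not part of the original announcement; the option summaries below are drawn from the behaviour described in this thread, and 'man dmraid' should be treated as authoritative:

    dmraid -r        # list block devices carrying recognized ATARAID metadata
    dmraid -s        # show the RAID sets assembled from that metadata
    dmraid -t -ay    # print the device-mapper tables activation would use, without activating
    dmraid -ay       # activate all discovered RAID sets through device-mapper
)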
Call for testers: ----------------- I need testers with the above ATARAID types, to check that the mapping created by this tool is correct (see options "-t -ay") and access to the ATARAID data is proper. You can activate your ATARAID sets without danger of overwriting your metadata, because dmraid accesses it read-only unless you use option -E with -r in order to erase ATARAID metadata (see 'man dmraid')! This is a release candidate version, so you want to have backups of your valuable data *and* you want to test accessing your data read-only first in order to make sure that the mapping is correct before you go for read-write access. The author is reachable at . For test results, mapping information, discussions, questions, patches, enhancement requests and the like, please subscribe and mail to . -- Regards, Heinz -- The LVM Guy -- CHANGELOG: --------- Changelog from dmraid 1.0.0-rc2 to 1.0.0-rc3 2004.08.18 FIXES: ------ o HPT37X mapping on first disk of set o dietlibc sscanf() use prevented activation o le*_to_cpu() for certain glibc environments (Luca Berra) o sysfs discovery (Luca Berra) o permissions to write on binary, which is needed by newer strip versions (Luca Berra) o SCSI serial number string length bug o valgrinded memory leaks o updated design document o comments FEATURES: --------- o added basic support for activation of LSI Logic MegaRAID/MegaIDE; more reengineering of the metadata needed! o root check using certain options (eg, activation of RAID sets) o implemented locking abstraction o implemented writing device metadata offsets with "-r[D/E]" for ease of manual restore o file based locking to avoid parallel tool runs competing with each other for the same resources o streamlined library context o implemented access functions for library context o streamlined RAID set consistency checks o implemented log function and removed macros to shrink binary size further o removed superfluous disk geometry code o cleaned up metadata.c collapsing free_*() functions o slimmed down minimal binary (configure option DMRAID_MINI for early boot environment) =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Heinz Mauelshagen Red Hat GmbH Consulting Development Engineer Am Sonnenhang 11 56242 Marienrachdorf Germany Mauelshagen at RedHat.com +49 2626 141200 FAX 924446 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- From andriy at druzhba.lviv.ua Fri Aug 20 12:57:38 2004 From: andriy at druzhba.lviv.ua (Andriy Galetski) Date: Fri, 20 Aug 2004 15:57:38 +0300 Subject: [Linux-cluster] GFS configuration for 2 node Cluster References: <41241D63.5090102@net4india.net> Message-ID: <002001c486b5$46ab23c0$f13cc90a@druzhba.com> Hi All, I want to build an RH Cluster with GFS, but I have only 2 nodes connected to shared storage. GFS-6.0.0-7 is now installed on both nodes (RHEL3 2.4.21-15.0.3.ELsmp). When both nodes operate normally I have no problems mounting/unmounting and reading/writing the GFS filesystem. When one node fails / shuts down / loses communication, all operations on the GFS filesystem stop (because lock_gulmd loses its quorum). To return the GFS filesystem to a normal state I need to bring the failed node back up and give the fence_ack_manual command. Q: Is there any trick in a 2-node GFS configuration to keep one node fully operational when the other node is disconnected from the cluster? Thanks for all suggestions.
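(A hedged sketch of the usual answer for GFS 6.0 with GULM, rather than a quote from the thread: lock_gulmd needs a majority of its configured lock servers alive, so with exactly two servers the loss of either one freezes the filesystem. The common workarounds are to run a single dedicated lock server outside the two GFS nodes, or to configure three lock servers, for example the two GFS nodes plus a small third machine that runs lock_gulmd but never mounts GFS, so that quorum (2 of 3) survives one failure. The node names below are hypothetical and the exact key layout should match your existing cluster.ccs:

    cluster {
        name = "testcluster"
        lock_gulm {
            servers = ["node-a", "node-b", "tiebreaker"]
        }
    }
)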
From amir at datacore.ch Fri Aug 20 21:13:45 2004 From: amir at datacore.ch (Amir Guindehi) Date: Fri, 20 Aug 2004 23:13:45 +0200 Subject: [Linux-cluster] GFS configuration for 2 node Cluster In-Reply-To: <002001c486b5$46ab23c0$f13cc90a@druzhba.com> References: <41241D63.5090102@net4india.net> <002001c486b5$46ab23c0$f13cc90a@druzhba.com> Message-ID: <41266989.2070101@datacore.ch> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Andriy, | Q: Is it any trick in 2 node GFS configuration to get one node full operate | when other node disconnected from cluster ?? Dunno if it's the same in RH Cluster as in Linux Cluster, but for Linux Cluster I've described how to do it at: https://open.datacore.ch/page/GFS.Install#section-GFS.Install-ConfigurationOfGFSOnASystemRunningTheGenTooLinuxDistributionhttpgentoo.org Regards - - Amir - -- Amir Guindehi, nospam.amir at datacore.ch DataCore GmbH, Witikonerstrasse 289, 8053 Zurich, Switzerland -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (MingW32) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBJlzObycOjskSVCwRAriYAJwMQHKKlhYQZrnGGxzgH3ZG9seM9QCgvsmi 8eqyyVFk+Cfn12iIHvDDytQ= =MW3I -----END PGP SIGNATURE----- From jopet at staff.spray.se Mon Aug 23 12:43:29 2004 From: jopet at staff.spray.se (Johan Pettersson) Date: Mon, 23 Aug 2004 14:43:29 +0200 Subject: [Linux-cluster] Modules for kernel 2.4 Message-ID: <1093265009.8980.42.camel@zombie.i.spray.se> Hello! I can't find any tarball at http://sources.redhat.com/cluster/gfs/ for kernel 2.4. Is only 2.6.7 supported? http://sources.redhat.com/cluster/releases/GFS-kernel/gfs-kernel-2.6.7-2.tar.gz Thx /J -- In disk space, nobody can hear your files scream. From lhh at redhat.com Mon Aug 23 14:11:34 2004 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 23 Aug 2004 10:11:34 -0400 Subject: [Linux-cluster] GFS configuration for 2 node Cluster In-Reply-To: <41266989.2070101@datacore.ch> References: <41241D63.5090102@net4india.net> <002001c486b5$46ab23c0$f13cc90a@druzhba.com> <41266989.2070101@datacore.ch> Message-ID: <1093270294.3467.26.camel@atlantis.boston.redhat.com> On Fri, 2004-08-20 at 23:13 +0200, Amir Guindehi wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi Andriy, > > | Q: Is it any trick in 2 node GFS configuration to get one node full > operate > | when other node disconnected from cluster ?? > > Dunno if it's the same in RH Cluster as in Linux Cluster, but for Linux > Cluster I've described how to do it at: > > https://open.datacore.ch/page/GFS.Install#section-GFS.Install-ConfigurationOfGFSOnASystemRunningTheGenTooLinuxDistributionhttpgentoo.org Hi Amir, I think he meant for 6.0.0, which is the pappy of linux-cluster. I don't think you can do it with 6.0.0. -- Lon From notiggy at gmail.com Mon Aug 23 14:15:23 2004 From: notiggy at gmail.com (Brian Jackson) Date: Mon, 23 Aug 2004 09:15:23 -0500 Subject: [Linux-cluster] Modules for kernel 2.4 In-Reply-To: <1093265009.8980.42.camel@zombie.i.spray.se> References: <1093265009.8980.42.camel@zombie.i.spray.se> Message-ID: On Mon, 23 Aug 2004 14:43:29 +0200, Johan Pettersson wrote: > Hello! > > I can't find any tarball at http://sources.redhat.com/cluster/gfs/ for > kernel 2.4. Is only 2.6.7 supported? Yes, only 2.6 is supported, for 2.4, look around for the gfs-6 src.rpms. 
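(Hedged sketch only; the exact source RPM names vary by release, so treat these as placeholders. On a RHEL3-era 2.4 kernel the usual route is to rebuild the GFS 6.0 source RPMs against the installed kernel and kernel-source, roughly:

    rpmbuild --rebuild GFS-6.0.0-7.src.rpm
    rpmbuild --rebuild GFS-modules-6.0.0-7.src.rpm
    # resulting binary packages land under /usr/src/redhat/RPMS/<arch>/
)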
> > http://sources.redhat.com/cluster/releases/GFS-kernel/gfs-kernel-2.6.7-2.tar.gz > > Thx > > /J From phillips at redhat.com Mon Aug 23 15:43:11 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 23 Aug 2004 11:43:11 -0400 Subject: [Linux-cluster] Subversion? Message-ID: <200408231143.11372.phillips@redhat.com> Hi everybody, I was just taking a look at this article and I thought, maybe this would be a good time to show some leadership as a project, and take the Subversion plunge: http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html Subversion is basically CVS as it should have been. It's mature now. The number of complaints I have noticed from users out there is roughly zero. Subversion _versions directories_. Etc. Etc. The only negative I can think of is that some folks may not have Subversion installed. But that is what tarballs are for. Our project development is not highly parallel at this point, so our repository serves more as a place for maintainers of the individual subprojects to post current code. So there isn't a great need for a distributed VCS like Bitkeeper or Arch. Thoughts? Regards, Daniel From kpfleming at backtobasicsmgmt.com Mon Aug 23 15:49:26 2004 From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming) Date: Mon, 23 Aug 2004 08:49:26 -0700 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408231143.11372.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> Message-ID: <412A1206.5040103@backtobasicsmgmt.com> Daniel Phillips wrote: > I was just taking a look at this article and I thought, maybe this would > be a good time to show some leadership as a project, and take the > Subversion plunge: > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html I am part of another project that recently switched to Subversion, and we like it quite a bit. It's a major improvement over CVS. > Subversion is basically CVS as it should have been. It's mature now. > The number of complaints I have noticed from users out there is roughly > zero. Subversion _versions directories_. Etc. Etc. Yes, and if you haven't already, read the first few chapters of the "redbean" Subversion book to get a feel for how it works. Branching/tagging is painless and low-cost (including dropping dead branches/tags, something that CVS can't do well at all), there are cool methods to include parts of the repository inside other parts at checkout time, etc. > The only negative I can think of is that some folks may not have > Subversion installed. But that is what tarballs are for. Subversion is an easy install. There is another negative, though: the current release of Subversion uses Berkeley DB as its storage means, and we've had problems with it getting randomly locked and causing issues. We don't know if this is due to running ViewCVS against the repo as well, or what else it may be. Given the problems we've had, we are anxiously awaiting the 1.1 release of Subversion that will have a filesystem-based backend, rather than bdb. > Our project development is not highly parallel at this point, so our > repository serves more as a place for maintainers of the individual > subprojects to post current code. So there isn't a great need for a > distributed VCS like Bitkeeper or Arch. I also like BK quite a bit, and it has one major advantage over CVS/Subversion: you can have local trees and actually _commit_ to them, including changeset comments and everything else. 
This is very nice when you are working on multiple bits of a project and are not ready to commit them to the "real" repositories. From aneesh.kumar at hp.com Mon Aug 23 15:55:23 2004 From: aneesh.kumar at hp.com (Aneesh Kumar K.V) Date: Mon, 23 Aug 2004 21:25:23 +0530 Subject: [Linux-cluster] Subversion? In-Reply-To: <412A1206.5040103@backtobasicsmgmt.com> References: <200408231143.11372.phillips@redhat.com> <412A1206.5040103@backtobasicsmgmt.com> Message-ID: <412A136B.4000004@hp.com> Kevin P. Fleming wrote: > > Subversion is an easy install. There is another negative, though: the > current release of Subversion uses Berkeley DB as its storage means, and > we've had problems with it getting randomly locked and causing issues. > We don't know if this is due to running ViewCVS against the repo as > well, or what else it may be. Given the problems we've had, we are > anxiously awaiting the 1.1 release of Subversion that will have a > filesystem-based backend, rather than bdb. > I hit this last weekend. My subversion crashed/locked completely. I am not sure whether it is a lockup or a crash. But there is nothing you can do by going to the repo directory. BTW I didn't attempt to recover it. -aneesh From jeff at intersystems.com Mon Aug 23 16:37:01 2004 From: jeff at intersystems.com (Jeff) Date: Mon, 23 Aug 2004 12:37:01 -0400 Subject: [Linux-cluster] patch to return lkid on new lock requests as part of initial lock request processing Message-ID: <1206740724.20040823123701@intersystems.com> In order to allow a new lock request to be canceled, the lock id of the lock must be returned to the user as part of the initial write() call which queues the new lock request. -------------- next part -------------- A non-text attachment was scrubbed... Name: patch.newlock-lockid Type: application/octet-stream Size: 701 bytes Desc: not available URL: From lhh at redhat.com Mon Aug 23 16:55:21 2004 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 23 Aug 2004 12:55:21 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408231143.11372.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> Message-ID: <1093280121.3467.80.camel@atlantis.boston.redhat.com> On Mon, 2004-08-23 at 11:43 -0400, Daniel Phillips wrote: > Hi everybody, > > I was just taking a look at this article and I thought, maybe this would > be a good time to show some leadership as a project, and take the > Subversion plunge: > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html > > Subversion is basically CVS as it should have been. It's mature now. > The number of complaints I have noticed from users out there is roughly > zero. Subversion _versions directories_. Etc. Etc. Disagree. We should use GNU arch. Here's a comparison from someone you know: http://wiki.gnuarch.org/moin.cgi/SubVersionAndCvsComparison http://better-scm.berlios.de/comparison/comparison.html Arch supports repeated merging (incl. renames) and digitally signed changesets (which may or may not be helpful in our case). Mirroring and replication are part of the core architecture. Arch applies versions to directories too ;) > Our project development is not highly parallel at this point, so our > repository serves more as a place for maintainers of the individual > subprojects to post current code. True. For now. Switching again in the future (if needed) will be more painful as we attract more developers. > So there isn't a great need for a > distributed VCS like Bitkeeper or Arch. The more users of arch, the more mature it will become. 
Someday, perhaps, it will replace BK for some major open source projects near and dear to our hearts. Perhaps this is a pipe dream. ;) For projects that don't need the parallel features of arch, nothing requires that the parallelism be used. -- Lon From lhh at redhat.com Mon Aug 23 16:56:55 2004 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 23 Aug 2004 12:56:55 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093280121.3467.80.camel@atlantis.boston.redhat.com> References: <200408231143.11372.phillips@redhat.com> <1093280121.3467.80.camel@atlantis.boston.redhat.com> Message-ID: <1093280215.3467.82.camel@atlantis.boston.redhat.com> On Mon, 2004-08-23 at 12:55 -0400, Lon Hohberger wrote: > Disagree. We should use GNU arch. Here's a comparison from someone you > know: > > http://wiki.gnuarch.org/moin.cgi/SubVersionAndCvsComparison ;) I think, anyway. The lower one is by someone else. > http://better-scm.berlios.de/comparison/comparison.html -- Lon From cherry at osdl.org Mon Aug 23 17:07:22 2004 From: cherry at osdl.org (John Cherry) Date: Mon, 23 Aug 2004 10:07:22 -0700 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408231143.11372.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> Message-ID: <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> I understand that subversion is quite nice, but kernel developers have adopted bitkeeper (at least Linus and several of his maintainers). While you may not need all the distributed capabilities of bitkeeper now, it is sure nice to have a tool that allows for non-local repositories and change set tracking outside of the main repository (as Kevin so clearly stated). Since mainline kernel acceptance of the core services is one of the objectives here, I would certainly recommend that you consider bitkeeper for source control as well. Regards, John On Mon, 2004-08-23 at 08:43, Daniel Phillips wrote: > Hi everybody, > > I was just taking a look at this article and I thought, maybe this would > be a good time to show some leadership as a project, and take the > Subversion plunge: > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html > > Subversion is basically CVS as it should have been. It's mature now. > The number of complaints I have noticed from users out there is roughly > zero. Subversion _versions directories_. Etc. Etc. > > The only negative I can think of is that some folks may not have > Subversion installed. But that is what tarballs are for. > > Our project development is not highly parallel at this point, so our > repository serves more as a place for maintainers of the individual > subprojects to post current code. So there isn't a great need for a > distributed VCS like Bitkeeper or Arch. > > Thoughts? > > Regards, > > Daniel > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From crh at ubiqx.mn.org Mon Aug 23 17:48:37 2004 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Mon, 23 Aug 2004 12:48:37 -0500 Subject: [Linux-cluster] Subversion? 
In-Reply-To: <200408231143.11372.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> Message-ID: <20040823174837.GC22622@Favog.ubiqx.mn.org> On Mon, Aug 23, 2004 at 11:43:11AM -0400, Daniel Phillips wrote: > Hi everybody, > > I was just taking a look at this article and I thought, maybe this would > be a good time to show some leadership as a project, and take the > Subversion plunge: > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html We're using SVN to maintain Samba now. There have been glitches, but most have been fixed. The biggest problems (currently) are with the web front-ends. Chris -)----- -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From phillips at redhat.com Mon Aug 23 18:02:22 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 23 Aug 2004 14:02:22 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> References: <200408231143.11372.phillips@redhat.com> <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> Message-ID: <200408231402.22863.phillips@redhat.com> Hi John, On Monday 23 August 2004 13:07, John Cherry wrote: > I understand that subversion is quite nice, but kernel developers > have adopted bitkeeper (at least Linus and several of his > maintainers). While you may not need all the distributed capabilities > of bitkeeper now, it is sure nice to have a tool that allows for > non-local repositories and change set tracking outside of the main > repository (as Kevin so clearly stated). In my humble opinion, Bitkeeper does not have a snowball's chance in hell of getting established on sources.redhat.com. > Since mainline kernel acceptance of the core services is one of the > objectives here, I would certainly recommend that you consider > bitkeeper for source control as well. Just read the license. http://www.taniwha.org/bitkeeper.html "Sometimes it is tempting to sacrifice our rights and freedoms for convinience, but we should not do so... with the increasing popularity of alternative licenses, it is important [to] determine whether they preserve the minimum acceptable amount of freedom and be responsible about choosing software that that meets these minimum criteria and advances our goals as a community" This is 3 years old, however there has been no improvement, quite the contrary. Regards, Daniel From kpfleming at backtobasicsmgmt.com Mon Aug 23 18:14:11 2004 From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming) Date: Mon, 23 Aug 2004 11:14:11 -0700 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408231402.22863.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> <200408231402.22863.phillips@redhat.com> Message-ID: <412A33F3.6020808@backtobasicsmgmt.com> Daniel Phillips wrote: > Just read the license. > > http://www.taniwha.org/bitkeeper.html Wow, an entire treatise predicated on proving that BitKeeper is not Free Software, when noone from BitMover ever claimed it was. From cherry at osdl.org Mon Aug 23 18:17:18 2004 From: cherry at osdl.org (John Cherry) Date: Mon, 23 Aug 2004 11:17:18 -0700 Subject: [Linux-cluster] Subversion? 
In-Reply-To: <200408231402.22863.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <1093280841.12873.43.camel@cherrybomb.pdx.osdl.net> <200408231402.22863.phillips@redhat.com> Message-ID: <1093285037.12874.70.camel@cherrybomb.pdx.osdl.net> On Mon, 2004-08-23 at 11:02, Daniel Phillips wrote: > Hi John, > > On Monday 23 August 2004 13:07, John Cherry wrote: > > I understand that subversion is quite nice, but kernel developers > > have adopted bitkeeper (at least Linus and several of his > > maintainers). While you may not need all the distributed capabilities > > of bitkeeper now, it is sure nice to have a tool that allows for > > non-local repositories and change set tracking outside of the main > > repository (as Kevin so clearly stated). > > In my humble opinion, Bitkeeper does not have a snowball's chance in > hell of getting established on sources.redhat.com. I kinda figured it didn't have a chance at sources.redhat.com. But what about bkbits.net? > > > Since mainline kernel acceptance of the core services is one of the > > objectives here, I would certainly recommend that you consider > > bitkeeper for source control as well. > > Just read the license. > > http://www.taniwha.org/bitkeeper.html > > "Sometimes it is tempting to sacrifice our rights and freedoms for > convinience, but we should not do so... with the increasing popularity > of alternative licenses, it is important [to] determine whether they > preserve the minimum acceptable amount of freedom and be responsible > about choosing software that that meets these minimum criteria and > advances our goals as a community" > > This is 3 years old, however there has been no improvement, quite the > contrary. I understand the concerns about the license. It is one of the strangest licenses I have ever read and it sounds like the licensees are at the mercy of the licenser in many respects (the rights and freedoms arguements). However, bk is being used across the kernel development community and this does not appear to be changing anytime soon. BTW, most developers do just fine with up to date tarballs, so source control is not a huge issue for most of them. John From phillips at redhat.com Mon Aug 23 18:23:41 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 23 Aug 2004 14:23:41 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <412A33F3.6020808@backtobasicsmgmt.com> References: <200408231143.11372.phillips@redhat.com> <200408231402.22863.phillips@redhat.com> <412A33F3.6020808@backtobasicsmgmt.com> Message-ID: <200408231423.41542.phillips@redhat.com> On Monday 23 August 2004 14:14, Kevin P. Fleming wrote: > Daniel Phillips wrote: > > Just read the license. > > > > http://www.taniwha.org/bitkeeper.html > > Wow, an entire treatise predicated on proving that BitKeeper is not > Free Software, when noone from BitMover ever claimed it was. Sources.redhat.com not only consists entirely of free software, but shows leadership to the free software community. We[1] are interested in advancing not only our own projects, but other open source projects such as Subversion and Arch. [1] Presumptively speaking for what I presume is the majority. Regards, Daniel From kpfleming at backtobasicsmgmt.com Mon Aug 23 18:36:41 2004 From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming) Date: Mon, 23 Aug 2004 11:36:41 -0700 Subject: [Linux-cluster] Subversion? 
In-Reply-To: <200408231423.41542.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <200408231402.22863.phillips@redhat.com> <412A33F3.6020808@backtobasicsmgmt.com> <200408231423.41542.phillips@redhat.com> Message-ID: <412A3939.9050301@backtobasicsmgmt.com> Daniel Phillips wrote: > Sources.redhat.com not only consists entirely of free software, but > shows leadership to the free software community. We[1] are interested > in advancing not only our own projects, but other open source projects > such as Subversion and Arch. > > [1] Presumptively speaking for what I presume is the majority. I wholeheartedly agree with these statements, and if using Free Software projects to advance your own is the right decision then I fully support it. I just don't like to see decisions made using inaccurate, politicized arguments. In this case, you are far better off (IMO) to say "We won't use BitKeeper because it is not open source", rather than to rely on arguments about its licensing model. It's likely that even if the binary-only free use license for BitKeeper came with _no_ restrictions whatsoever, it still would not be your choice for an SCM, because it is not open source. From ananth at osc.edu Mon Aug 23 18:47:12 2004 From: ananth at osc.edu (Ananth Devulapalli) Date: Mon, 23 Aug 2004 14:47:12 -0400 (EDT) Subject: [Linux-cluster] Problem compiling dlm module. Message-ID: Hello: I am following instructions for compilation of dlm module described at http://opendlm.sourceforge.net/doc.php and i am through with installation of libnet and heartbeat modules. but opendlm breaks. my m/c is currently running 2.6.8-1.521smp on a dual xeon opendlm was configured using --with-heartbeat_include=/usr/include/heartbeat I am pasting output of make at the end of this mail. All the errors seem to be in files included in cccp_deliver.c. It appears to me like I am missing some files and have configured the tree incorrectly. For e.g. I dont have include/linux/modversions.h. Another error is MOD_INC_USE_COUNT is defined in /usr/include/linux/module.h but not in /lib/modules/2.6.8-1.521smp/build. configure is assigning linux_src variable to /lib/modules/2.6.8-1.521smp/build instead of /usr/include/linux/ hence its not able to locate that symbol. Is it a known problem? any pointers will be of great help. thanks, -Ananth if gcc -DHAVE_CONFIG_H -I. -I. -I../../.. 
-D__KERNEL__ -DMODULE -DMODVERSIONS -I/lib/modules/2.6.8-1.521smp/build/include -I../../../src/include -include /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h -I../../../src/api -DOLD_MARSHAL -I/lib/modules/2.6.8-1.521smp/build/include -I/usr/include/glib-1.2 -I/usr/lib/glib/include -pipe -O2 -Wall -g -O2 -MT cccp_deliver.o -MD -MP -MF ".deps/cccp_deliver.Tpo" -c -o cccp_deliver.o cccp_deliver.c; \ then mv -f ".deps/cccp_deliver.Tpo" ".deps/cccp_deliver.Po"; else rm -f ".deps/cccp_deliver.Tpo"; exit 1; fi :167113902:62704: /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h: No such file or directory In file included from /lib/modules/2.6.8-1.521smp/build/include/asm/processor.h:18, from /lib/modules/2.6.8-1.521smp/build/include/asm/thread_info.h:16, from /lib/modules/2.6.8-1.521smp/build/include/linux/thread_info.h:21, from /lib/modules/2.6.8-1.521smp/build/include/linux/spinlock.h:12, from /lib/modules/2.6.8-1.521smp/build/include/linux/capability.h:45, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:7, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/system.h: In function `__set_64bit_var': /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: dereferencing type-punned pointer will break strict-aliasing rules /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: dereferencing type-punned pointer will break strict-aliasing rules In file included from /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:6:25: mach_mpspec.h: No such file or directory In file included from /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h: At top level: /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:9: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:10: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: conflicting types for `mp_bus_id_to_type' /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: previous declaration of `mp_bus_id_to_type' /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: `MAX_IRQ_SOURCES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: `MAX_MP_BUSSES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: conflicting types for `mp_bus_id_to_pci_bus' /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: previous declaration of `mp_bus_id_to_pci_bus' In file 
included from /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:20, from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: `MAX_IRQ_SOURCES' undeclared here (not in a function) /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: conflicting types for `mp_irqs' /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: previous declaration of `mp_irqs' In file included from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:71:26: mach_apicdef.h: No such file or directory In file included from /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, from /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, from /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, from cccp_deliver.c:47: /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h: In function `hard_smp_processor_id': /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:75: warning: implicit declaration of function `GET_APIC_ID' In file included from cccp_private.h:57, from cccp_deliver.c:73: ../../../src/include/dlm_kernel.h:227:2: warning: #warning Untested signal handlers for Linux-2.6!! cccp_deliver.c: In function `cccp_msg_delivery_loop': cccp_deliver.c:246: error: `MOD_INC_USE_COUNT' undeclared (first use in this function) cccp_deliver.c:246: error: (Each undeclared identifier is reported only once cccp_deliver.c:246: error: for each function it appears in.) cccp_deliver.c:294: error: `MOD_DEC_USE_COUNT' undeclared (first use in this function) make: *** [cccp_deliver.o] Error 1 From erik at debian.franken.de Mon Aug 23 18:40:33 2004 From: erik at debian.franken.de (Erik Tews) Date: Mon, 23 Aug 2004 20:40:33 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <20040823174837.GC22622@Favog.ubiqx.mn.org> References: <200408231143.11372.phillips@redhat.com> <20040823174837.GC22622@Favog.ubiqx.mn.org> Message-ID: <1093286433.11335.0.camel@localhost.localdomain> Am Mo, den 23.08.2004 schrieb Christopher R. Hertel um 19:48: > We're using SVN to maintain Samba now. There have been glitches, but most > have been fixed. The biggest problems (currently) are with the web > front-ends. I use viewcvs here, it works fine and seems to have all features I need. From crh at ubiqx.mn.org Mon Aug 23 19:06:56 2004 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Mon, 23 Aug 2004 14:06:56 -0500 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093286433.11335.0.camel@localhost.localdomain> References: <200408231143.11372.phillips@redhat.com> <20040823174837.GC22622@Favog.ubiqx.mn.org> <1093286433.11335.0.camel@localhost.localdomain> Message-ID: <20040823190656.GG22622@Favog.ubiqx.mn.org> On Mon, Aug 23, 2004 at 08:40:33PM +0200, Erik Tews wrote: > Am Mo, den 23.08.2004 schrieb Christopher R. Hertel um 19:48: > > We're using SVN to maintain Samba now. There have been glitches, but most > > have been fixed. The biggest problems (currently) are with the web > > front-ends. > > I use viewcvs here, it works fine and seems to have all features I need. It's my understanding that Samba source web access will be moving (has already been moved) to viewcvs. 
I've passed along the caution, given earlier, about database locking. There's some talk of sharing the single database amonst several mirrors using Samba and CIFS-VFS. :) We'll see what flys. Chris -)----- -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From erik at debian.franken.de Mon Aug 23 19:12:06 2004 From: erik at debian.franken.de (Erik Tews) Date: Mon, 23 Aug 2004 21:12:06 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <20040823190656.GG22622@Favog.ubiqx.mn.org> References: <200408231143.11372.phillips@redhat.com> <20040823174837.GC22622@Favog.ubiqx.mn.org> <1093286433.11335.0.camel@localhost.localdomain> <20040823190656.GG22622@Favog.ubiqx.mn.org> Message-ID: <1093288326.11335.3.camel@localhost.localdomain> Am Mo, den 23.08.2004 schrieb Christopher R. Hertel um 21:06: > On Mon, Aug 23, 2004 at 08:40:33PM +0200, Erik Tews wrote: > > Am Mo, den 23.08.2004 schrieb Christopher R. Hertel um 19:48: > > > We're using SVN to maintain Samba now. There have been glitches, but most > > > have been fixed. The biggest problems (currently) are with the web > > > front-ends. > > > > I use viewcvs here, it works fine and seems to have all features I need. > > It's my understanding that Samba source web access will be moving (has > already been moved) to viewcvs. I've passed along the caution, given > earlier, about database locking. There's some talk of sharing the single > database amonst several mirrors using Samba and CIFS-VFS. :) Well, there is a package called libsvn-mirror-perl which should make it possible to mirror a subversion server on the subversion protocol level, so there should be no problem with any kind of locking. But your approach could work too. From arekm at pld-linux.org Mon Aug 23 20:27:26 2004 From: arekm at pld-linux.org (Arkadiusz Miskiewicz) Date: Mon, 23 Aug 2004 22:27:26 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <412A1206.5040103@backtobasicsmgmt.com> References: <200408231143.11372.phillips@redhat.com> <412A1206.5040103@backtobasicsmgmt.com> Message-ID: <200408232227.26770.arekm@pld-linux.org> On Monday 23 of August 2004 17:49, Kevin P. Fleming wrote: > I also like BK quite a bit, and it has one major advantage over > CVS/Subversion: you can have local trees and actually _commit_ to them, > including changeset comments and everything else. This is very nice when > you are working on multiple bits of a project and are not ready to > commit them to the "real" repositories. Try http://svk.elixus.org/. It uses subversion lower layers and it's able to merge from/to normal subversion repository. BK main problem is licence. For example I'm not allowed to use it since I've sent few small patches to subversion people :/ -- Arkadiusz Mi?kiewicz CS at FoE, Wroclaw University of Technology arekm.pld-linux.org, 1024/3DB19BBD, JID: arekm.jabber.org, PLD/Linux From erik at debian.franken.de Mon Aug 23 21:18:05 2004 From: erik at debian.franken.de (Erik Tews) Date: Mon, 23 Aug 2004 23:18:05 +0200 Subject: [Linux-cluster] Subversion? 
In-Reply-To: <200408232227.26770.arekm@pld-linux.org> References: <200408231143.11372.phillips@redhat.com> <412A1206.5040103@backtobasicsmgmt.com> <200408232227.26770.arekm@pld-linux.org> Message-ID: <1093295884.3373.3.camel@localhost.localdomain> Am Mo, den 23.08.2004 schrieb Arkadiusz Miskiewicz um 22:27: > On Monday 23 of August 2004 17:49, Kevin P. Fleming wrote: > > > I also like BK quite a bit, and it has one major advantage over > > CVS/Subversion: you can have local trees and actually _commit_ to them, > > including changeset comments and everything else. This is very nice when > > you are working on multiple bits of a project and are not ready to > > commit them to the "real" repositories. > Try http://svk.elixus.org/. It uses subversion lower layers and it's able to > merge from/to normal subversion repository. Is this for one time merging, like converting a repository once from cvs to svn, or can this be done on every commit, could I setup a local svk server, and merge all my changes to a upstream svn server, which has no special modifications? From notiggy at gmail.com Mon Aug 23 21:25:53 2004 From: notiggy at gmail.com (Brian Jackson) Date: Mon, 23 Aug 2004 16:25:53 -0500 Subject: [Linux-cluster] Problem compiling dlm module. In-Reply-To: References: Message-ID: opendlm and the dlm that is used with gfs/linux-cluster are 2 different things. You shouldn't use the docs from one to build the other. The usage.txt file linked off of http://sources.redhat.com/cluster will get you up and running with the proper dlm. --Brian Jackson On Mon, 23 Aug 2004 14:47:12 -0400 (EDT), Ananth Devulapalli wrote: > Hello: > > I am following instructions for compilation of dlm module > described at http://opendlm.sourceforge.net/doc.php and i am through with > installation of libnet and heartbeat modules. but opendlm breaks. > > my m/c is currently running 2.6.8-1.521smp on a dual xeon > > opendlm was configured using > --with-heartbeat_include=/usr/include/heartbeat > > I am pasting output of make at the end of this mail. All the > errors seem to be in files included in cccp_deliver.c. It appears to me > like I am missing some files and have configured the tree incorrectly. For > e.g. I dont have include/linux/modversions.h. Another error is > MOD_INC_USE_COUNT is defined in /usr/include/linux/module.h but not in > /lib/modules/2.6.8-1.521smp/build. configure is assigning linux_src > variable to /lib/modules/2.6.8-1.521smp/build instead of > /usr/include/linux/ hence its not able to locate that symbol. Is it a > known problem? any pointers will be of great help. > > thanks, > -Ananth > > > > if gcc -DHAVE_CONFIG_H -I. -I. -I../../.. 
-D__KERNEL__ -DMODULE > -DMODVERSIONS -I/lib/modules/2.6.8-1.521smp/build/include > -I../../../src/include -include > /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h > -I../../../src/api -DOLD_MARSHAL > -I/lib/modules/2.6.8-1.521smp/build/include -I/usr/include/glib-1.2 > -I/usr/lib/glib/include -pipe -O2 -Wall -g -O2 -MT cccp_deliver.o -MD -MP > -MF ".deps/cccp_deliver.Tpo" -c -o cccp_deliver.o cccp_deliver.c; \ > then mv -f ".deps/cccp_deliver.Tpo" ".deps/cccp_deliver.Po"; else rm -f > ".deps/cccp_deliver.Tpo"; exit 1; fi > :167113902:62704: > /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h: No such > file or directory > In file included from > /lib/modules/2.6.8-1.521smp/build/include/asm/processor.h:18, > from > /lib/modules/2.6.8-1.521smp/build/include/asm/thread_info.h:16, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/thread_info.h:21, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/spinlock.h:12, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/capability.h:45, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:7, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h: In function > `__set_64bit_var': > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: > dereferencing type-punned pointer will break strict-aliasing rules > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: > dereferencing type-punned pointer will break strict-aliasing rules > In file included from > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:6:25: > mach_mpspec.h: No such file or directory > In file included from > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h: At top level: > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:9: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:10: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: > conflicting types for `mp_bus_id_to_type' > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: previous > declaration of `mp_bus_id_to_type' > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: > `MAX_IRQ_SOURCES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: > `MAX_MP_BUSSES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: > conflicting types for 
`mp_bus_id_to_pci_bus' > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: previous > declaration of `mp_bus_id_to_pci_bus' > In file included from > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:20, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: > `MAX_IRQ_SOURCES' undeclared here (not in a function) > /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: > conflicting types for `mp_irqs' > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: previous > declaration of `mp_irqs' > In file included from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:71:26: mach_apicdef.h: > No such file or directory > In file included from > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > from > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > from cccp_deliver.c:47: > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h: In function > `hard_smp_processor_id': > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:75: warning: implicit > declaration of function `GET_APIC_ID' > In file included from cccp_private.h:57, > from cccp_deliver.c:73: > .../../../src/include/dlm_kernel.h:227:2: warning: #warning Untested signal > handlers for Linux-2.6!! > cccp_deliver.c: In function `cccp_msg_delivery_loop': > cccp_deliver.c:246: error: `MOD_INC_USE_COUNT' undeclared (first use in > this function) > cccp_deliver.c:246: error: (Each undeclared identifier is reported only > once > cccp_deliver.c:246: error: for each function it appears in.) > cccp_deliver.c:294: error: `MOD_DEC_USE_COUNT' undeclared (first use in > this function) > make: *** [cccp_deliver.o] Error 1 > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From ananth at osc.edu Mon Aug 23 21:44:41 2004 From: ananth at osc.edu (Ananth Devulapalli) Date: Mon, 23 Aug 2004 17:44:41 -0400 (EDT) Subject: [Linux-cluster] Problem compiling dlm module. In-Reply-To: References: Message-ID: It was my bad. Thanks for pointing my mistake. I got thought both were same since opendlm links to redhat cluster's page. regards, -Ananth On Mon, 23 Aug 2004, Brian Jackson wrote: > opendlm and the dlm that is used with gfs/linux-cluster are 2 > different things. You shouldn't use the docs from one to build the > other. The usage.txt file linked off of > http://sources.redhat.com/cluster will get you up and running with the > proper dlm. > > --Brian Jackson > > On Mon, 23 Aug 2004 14:47:12 -0400 (EDT), Ananth Devulapalli > wrote: > > Hello: > > > > I am following instructions for compilation of dlm module > > described at http://opendlm.sourceforge.net/doc.php and i am through with > > installation of libnet and heartbeat modules. but opendlm breaks. > > > > my m/c is currently running 2.6.8-1.521smp on a dual xeon > > > > opendlm was configured using > > --with-heartbeat_include=/usr/include/heartbeat > > > > I am pasting output of make at the end of this mail. 
All the > > errors seem to be in files included in cccp_deliver.c. It appears to me > > like I am missing some files and have configured the tree incorrectly. For > > e.g. I dont have include/linux/modversions.h. Another error is > > MOD_INC_USE_COUNT is defined in /usr/include/linux/module.h but not in > > /lib/modules/2.6.8-1.521smp/build. configure is assigning linux_src > > variable to /lib/modules/2.6.8-1.521smp/build instead of > > /usr/include/linux/ hence its not able to locate that symbol. Is it a > > known problem? any pointers will be of great help. > > > > thanks, > > -Ananth > > > > > > > > if gcc -DHAVE_CONFIG_H -I. -I. -I../../.. -D__KERNEL__ -DMODULE > > -DMODVERSIONS -I/lib/modules/2.6.8-1.521smp/build/include > > -I../../../src/include -include > > /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h > > -I../../../src/api -DOLD_MARSHAL > > -I/lib/modules/2.6.8-1.521smp/build/include -I/usr/include/glib-1.2 > > -I/usr/lib/glib/include -pipe -O2 -Wall -g -O2 -MT cccp_deliver.o -MD -MP > > -MF ".deps/cccp_deliver.Tpo" -c -o cccp_deliver.o cccp_deliver.c; \ > > then mv -f ".deps/cccp_deliver.Tpo" ".deps/cccp_deliver.Po"; else rm -f > > ".deps/cccp_deliver.Tpo"; exit 1; fi > > :167113902:62704: > > /lib/modules/2.6.8-1.521smp/build/include/linux/modversions.h: No such > > file or directory > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/asm/processor.h:18, > > from > > /lib/modules/2.6.8-1.521smp/build/include/asm/thread_info.h:16, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/thread_info.h:21, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/spinlock.h:12, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/capability.h:45, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:7, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h: In function > > `__set_64bit_var': > > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: > > dereferencing type-punned pointer will break strict-aliasing rules > > /lib/modules/2.6.8-1.521smp/build/include/asm/system.h:193: warning: > > dereferencing type-punned pointer will break strict-aliasing rules > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:6:25: > > mach_mpspec.h: No such file or directory > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:18, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h: At top level: > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:9: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:10: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > 
/lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:20: error: > > conflicting types for `mp_bus_id_to_type' > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:8: error: previous > > declaration of `mp_bus_id_to_type' > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: > > `MAX_IRQ_SOURCES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: > > `MAX_MP_BUSSES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:24: error: > > conflicting types for `mp_bus_id_to_pci_bus' > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:12: error: previous > > declaration of `mp_bus_id_to_pci_bus' > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:20, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: > > `MAX_IRQ_SOURCES' undeclared here (not in a function) > > /lib/modules/2.6.8-1.521smp/build/include/asm/io_apic.h:160: error: > > conflicting types for `mp_irqs' > > /lib/modules/2.6.8-1.521smp/build/include/asm/mpspec.h:22: error: previous > > declaration of `mp_irqs' > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:71:26: mach_apicdef.h: > > No such file or directory > > In file included from > > /lib/modules/2.6.8-1.521smp/build/include/linux/smp.h:17, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/sched.h:23, > > from > > /lib/modules/2.6.8-1.521smp/build/include/linux/module.h:10, > > from cccp_deliver.c:47: > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h: In function > > `hard_smp_processor_id': > > /lib/modules/2.6.8-1.521smp/build/include/asm/smp.h:75: warning: implicit > > declaration of function `GET_APIC_ID' > > In file included from cccp_private.h:57, > > from cccp_deliver.c:73: > > .../../../src/include/dlm_kernel.h:227:2: warning: #warning Untested signal > > handlers for Linux-2.6!! > > cccp_deliver.c: In function `cccp_msg_delivery_loop': > > cccp_deliver.c:246: error: `MOD_INC_USE_COUNT' undeclared (first use in > > this function) > > cccp_deliver.c:246: error: (Each undeclared identifier is reported only > > once > > cccp_deliver.c:246: error: for each function it appears in.) 
> > cccp_deliver.c:294: error: `MOD_DEC_USE_COUNT' undeclared (first use in > > this function) > > make: *** [cccp_deliver.o] Error 1 > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From phillips at redhat.com Mon Aug 23 22:30:26 2004 From: phillips at redhat.com (Daniel Phillips) Date: Mon, 23 Aug 2004 18:30:26 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093285037.12874.70.camel@cherrybomb.pdx.osdl.net> References: <200408231143.11372.phillips@redhat.com> <200408231402.22863.phillips@redhat.com> <1093285037.12874.70.camel@cherrybomb.pdx.osdl.net> Message-ID: <200408231830.26171.phillips@redhat.com> On Monday 23 August 2004 14:17, John Cherry wrote: > On Mon, 2004-08-23 at 11:02, Daniel Phillips wrote: > > In my humble opinion, Bitkeeper does not have a snowball's chance > > in hell of getting established on sources.redhat.com. > > I kinda figured it didn't have a chance at sources.redhat.com. But > what about bkbits.net? Why don't you take a poll? ;) > However, bk is being used across the kernel development community and > this does not appear to be changing anytime soon. Regardless of its effect on Linus's scalability, the kernel development community is deeply fractured over Bitkeeper. Please don't be fooled by the apparent low profile of this subject on lkml. We do not need a self-inflicted wound like that in the cluster community. > BTW, most developers do just fine with up to date tarballs, so source > control is not a huge issue for most of them. Yes, I personally prefer tarballs when I'm checking out a project for the first time. However we still need a repository somewhere. Regards, Daniel From arekm at pld-linux.org Mon Aug 23 23:24:46 2004 From: arekm at pld-linux.org (Arkadiusz Miskiewicz) Date: Tue, 24 Aug 2004 01:24:46 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093295884.3373.3.camel@localhost.localdomain> References: <200408231143.11372.phillips@redhat.com> <200408232227.26770.arekm@pld-linux.org> <1093295884.3373.3.camel@localhost.localdomain> Message-ID: <200408240124.46669.arekm@pld-linux.org> On Monday 23 of August 2004 23:18, Erik Tews wrote: > > Try http://svk.elixus.org/. It uses subversion lower layers and it's able > > to merge from/to normal subversion repository. > > Is this for one time merging, like converting a repository once from cvs > to svn, or can this be done on every commit, could I setup a local svk > server, and merge all my changes to a upstream svn server, which has no > special modifications? You can merge all your local changes upstream then fetch new changes from subversion repo and so on. It's for making soft of decentralized subversion. http://svk.elixus.org/index.cgi?SVKTutorial -- Arkadiusz Mi?kiewicz CS at FoE, Wroclaw University of Technology arekm.pld-linux.org, 1024/3DB19BBD, JID: arekm.jabber.org, PLD/Linux From yjcho at cs.hongik.ac.kr Tue Aug 24 06:29:22 2004 From: yjcho at cs.hongik.ac.kr (Cho Yool Je) Date: Tue, 24 Aug 2004 15:29:22 +0900 Subject: [Linux-cluster] i wanna use gfs with firewire.... Message-ID: <412AE042.9060203@cs.hongik.ac.kr> i wanna use gfs with firewire.... but i can't search for document about it... anybody has docs? 
From walters at redhat.com Mon Aug 23 17:23:06 2004 From: walters at redhat.com (Colin Walters) Date: Mon, 23 Aug 2004 13:23:06 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093280121.3467.80.camel@atlantis.boston.redhat.com> References: <200408231143.11372.phillips@redhat.com> <1093280121.3467.80.camel@atlantis.boston.redhat.com> Message-ID: <1093281786.20301.34.camel@nexus.verbum.private> On Mon, 2004-08-23 at 12:55 -0400, Lon Hohberger wrote: > On Mon, 2004-08-23 at 11:43 -0400, Daniel Phillips wrote: > > Hi everybody, > > > > I was just taking a look at this article and I thought, maybe this would > > be a good time to show some leadership as a project, and take the > > Subversion plunge: > > > > http://www.onlamp.com/pub/a/onlamp/2004/08/19/subversiontips.html > > > > Subversion is basically CVS as it should have been. It's mature now. > > The number of complaints I have noticed from users out there is roughly > > zero. Subversion _versions directories_. Etc. Etc. > > Disagree. We should use GNU arch. Here's a comparison from someone you > know: > > http://wiki.gnuarch.org/moin.cgi/SubVersionAndCvsComparison > http://better-scm.berlios.de/comparison/comparison.html Here also is a presentation giving an introduction to Arch from the "bottom up", which gives you a much better idea I think of why it is the best architecture, rather than just comparing checkboxes on some list. http://web.verbum.org/tla/grokking-arch/img0.html > True. For now. Switching again in the future (if needed) will be more > painful as we attract more developers. Right - switching revision control systems is always painful. You want to make the choice once. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From hanafim at asc.hpc.mil Tue Aug 24 14:01:12 2004 From: hanafim at asc.hpc.mil (MAHMOUD HANAFI) Date: Tue, 24 Aug 2004 10:01:12 -0400 Subject: [Linux-cluster] Re: update 1-11456576 Re: LSF job pends... In-Reply-To: <9CD190F4AD92EC499A9397F49C8B0E4E69B752@catoexm04.noam.corp.platform.com> References: <9CD190F4AD92EC499A9397F49C8B0E4E69B752@catoexm04.noam.corp.platform.com> Message-ID: <412B4A28.7060308@asc.hpc.mil> You may close this issue. This was only applied to jobs that had been submitted before installing the job weight plugin. New jobs do not have the same pending condition. thanks, -Mahmoud Mohammad Asim Khan wrote: > Hi Mahmoud, > Can you please send me the output of the following commands: > bhosts > bhpart -r > lshosts > > Your lsb.hosts file, lsf.cluster and lsf.shared file. > > > Regards > ______________________________________________________________________ > TECHNICAL SUPPORT > Mohammad Asim Khan FTP : ftp.platform.com > Technical Support Engineer > Platform Computing Corporation WWW : www.platform.com > 3760, 14th Avenue Support : support at platform.com > Markham Ontario L3R 3T7 Canada License : license at platform.com > Phone : (905) 948-4325 Inquiries : info at platform.com > Fax : (905) 948-9975 Sales : sales at platform.com > E-mail : mkhan at platform.com Phone : 1-905-948-4297 > > Note : Please cc all emails to support at platform.com > _____________________________________________________________________ > Platform. Accelerating Intelligence > "Unleash the Power" of LSF by attending a Platform LSF Administration Training Class. 
> distributed and Grid Computing > > To receive periodic Patch Update information, critical bug notification and general support Notification from platform support email supportnotice-request at platform.com with the subject line containing the word "subscribe". > > To receive security related issue notification from Platform support email > securenotice-request at platform.com with the subject line containing the word "subscribe". > > From lhh at redhat.com Tue Aug 24 14:48:04 2004 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 24 Aug 2004 10:48:04 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408240124.46669.arekm@pld-linux.org> References: <200408231143.11372.phillips@redhat.com> <200408232227.26770.arekm@pld-linux.org> <1093295884.3373.3.camel@localhost.localdomain> <200408240124.46669.arekm@pld-linux.org> Message-ID: <1093358884.3467.130.camel@atlantis.boston.redhat.com> On Tue, 2004-08-24 at 01:24 +0200, Arkadiusz Miskiewicz wrote: > On Monday 23 of August 2004 23:18, Erik Tews wrote: > > > > Try http://svk.elixus.org/. It uses subversion lower layers and it's able > > > to merge from/to normal subversion repository. > > > > Is this for one time merging, like converting a repository once from cvs > > to svn, or can this be done on every commit, could I setup a local svk > > server, and merge all my changes to a upstream svn server, which has no > > special modifications? > You can merge all your local changes upstream then fetch new changes from > subversion repo and so on. It's for making soft of decentralized subversion. > > http://svk.elixus.org/index.cgi?SVKTutorial > http://www.gnuarch.org It was _designed_ to handle distributed repositories (like BK). -- Lon From tomc at teamics.com Tue Aug 24 14:51:58 2004 From: tomc at teamics.com (tomc at teamics.com) Date: Tue, 24 Aug 2004 09:51:58 -0500 Subject: [Linux-cluster] unusual GFS problem Message-ID: Looking for some direction on this, please. What is this message telling me? This node was the master in a three node setup: Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold the lock Shr, and someone else is queued before me. Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold the lock Shr, and someone else is queued before me. This repeated for about 3 hours, then one of the other nodes had a GFS panic and had to be rebooted. Any suggestions would be appreciated. tc From notiggy at gmail.com Tue Aug 24 15:18:10 2004 From: notiggy at gmail.com (Brian Jackson) Date: Tue, 24 Aug 2004 10:18:10 -0500 Subject: [Linux-cluster] i wanna use gfs with firewire.... In-Reply-To: <412AE042.9060203@cs.hongik.ac.kr> References: <412AE042.9060203@cs.hongik.ac.kr> Message-ID: There's a doc on the opengfs site about it, you can read the firewire specific parts out of it, then use the linux-cluster docs to do the filesystem setup. --Brian Jackson On Tue, 24 Aug 2004 15:29:22 +0900, Cho Yool Je wrote: > i wanna use gfs with firewire.... > but i can't search for document about it... > anybody has docs? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From amanthei at redhat.com Tue Aug 24 15:36:54 2004 From: amanthei at redhat.com (Adam Manthei) Date: Tue, 24 Aug 2004 10:36:54 -0500 Subject: [Linux-cluster] unusual GFS problem In-Reply-To: References: Message-ID: <20040824153654.GC27527@redhat.com> On Tue, Aug 24, 2004 at 09:51:58AM -0500, tomc at teamics.com wrote: > Looking for some direction on this, please. 
What is this message telling > me? This node was the master in a three node setup: The message is telling you that you turned on the "Locking" gulm verbosity flag :) The short answer is that it's just letting you know you have lock contention. These messages are rather common and can be ignored (especially if your applications are modifying common files or directories from more than one node). > > > Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold > the lock Shr, and someone else is > queued before me. > Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold > the lock Shr, and someone else is > queued before me. > > This repeated for about 3 hours, then one of the other nodes had a GFS > panic and had to be rebooted. Any suggestions would be appreciated. Without the panic message and relevant syslog messages, we can't really help you. -- Adam Manthei From phillips at redhat.com Tue Aug 24 16:12:53 2004 From: phillips at redhat.com (Daniel Phillips) Date: Tue, 24 Aug 2004 12:12:53 -0400 Subject: [Linux-cluster] Subversion? In-Reply-To: <1093358884.3467.130.camel@atlantis.boston.redhat.com> References: <200408231143.11372.phillips@redhat.com> <200408240124.46669.arekm@pld-linux.org> <1093358884.3467.130.camel@atlantis.boston.redhat.com> Message-ID: <200408241212.53357.phillips@redhat.com> Hi Lon, On Tuesday 24 August 2004 10:48, Lon Hohberger wrote: > It was _designed_ to handle distributed repositories (like BK). Well, what wind is blowing, seems to be blowing in the direction of Arch. I'd be equally happy with either, and in any case, much happier than with CVS. Does anybody else have a strong opinion? Regards, Daniel From amanthei at redhat.com Tue Aug 24 16:15:45 2004 From: amanthei at redhat.com (Adam Manthei) Date: Tue, 24 Aug 2004 11:15:45 -0500 Subject: [Linux-cluster] SNMP modules? In-Reply-To: <1090861715.13809.3.camel@laza.eunet.yu> References: <1090861715.13809.3.camel@laza.eunet.yu> Message-ID: <20040824161545.GA31079@redhat.com> On Mon, Jul 26, 2004 at 07:08:35PM +0200, Lazar Obradovic wrote: > Hello all, > > I'd like to develop my own fencing agents (for IBM BladeCenter and > QLogic SANBox2 switches), but they will require SNMP bindings. > > Is that ok with general development philosophy, since I'd like to > contribude them? net-snmp-5.x.x-based API? I've added these to the repository. I also made the following adjustment: o removed the deprecated "fm" and "name" stdin parameters that were residue of the old GFS-5.1.x fencing system o changed a couple errors to warnings. If a powered off the blade, the fencing agent would detect that it was powered off and fail. If it knows that the blade is off, it should still succeed. Patch is attached (but already checked into CVS) -- Adam Manthei -------------- next part -------------- diff -urNp ibmblade/fence_ibmblade.pl ibmblade.mantis/fence_ibmblade.pl --- ibmblade/fence_ibmblade.pl 2004-08-24 11:09:55.680240183 -0500 +++ ibmblade.mantis/fence_ibmblade.pl 2004-08-24 10:58:57.558573326 -0500 @@ -112,13 +112,6 @@ sub get_options_stdin # DO NOTHING -- this field is used by fenced elsif ($name eq "agent" ) { } - # FIXME -- depricated. use "port" instead. - elsif ($name eq "fm" ) - { - (my $dummy,$opt_n) = split /\s+/,$val; - print STDERR "Depricated \"fm\" entry detected. 
refer to man page.\n"; - } - elsif ($name eq "ipaddr" ) { $opt_a = $val; @@ -127,8 +120,6 @@ sub get_options_stdin { $opt_c = $val; } - # FIXME -- depreicated residue of old fencing system - elsif ($name eq "name" ) { } elsif ($name eq "option" ) { @@ -204,15 +195,15 @@ if (defined ($opt_t)) { if ($opt_o =~ /^(reboot|off)$/i) { if ($result->{$oid} == "0") { - printf ("$FENCE_RELEASE_NAME ERROR: Port %d on %s already down.\n", $opt_n, $opt_a); + printf ("$FENCE_RELEASE_NAME WARNING: Port %d on %s already down.\n", $opt_n, $opt_a); $snmpsess->close; - exit 1; + exit 0; }; } else { if ($result->{$oid} == "1") { - printf ("$FENCE_RELEASE_NAME ERROR: Port %d on %s already up.\n", $opt_n, $opt_a); + printf ("$FENCE_RELEASE_NAME WARNING: Port %d on %s already up.\n", $opt_n, $opt_a); $snmpsess->close; - exit 1; + exit 0; }; }; From tomc at teamics.com Tue Aug 24 16:22:31 2004 From: tomc at teamics.com (tomc at teamics.com) Date: Tue, 24 Aug 2004 11:22:31 -0500 Subject: [Linux-cluster] unusual GFS problem Message-ID: I think the problem is actually a FC-SAN problem, but I just wanted to follow up on this particular message. GFS panics when Linux loses the SCSI device. The SCSI device disapepars because of a SAN communications failure. I don't think it is a GFS problem. Thanks for the info on that message. tc Adam Manthei To: Discussion of clustering software components including GFS Sent by: linux-cluster-bounces cc: (bcc: Tom Currie/teamics) @redhat.com Subject: Re: [Linux-cluster] unusual GFS problem 08/24/04 10:36 AM Please respond to Discussion of clustering software components including GFS On Tue, Aug 24, 2004 at 09:51:58AM -0500, tomc at teamics.com wrote: > Looking for some direction on this, please. What is this message telling > me? This node was the master in a three node setup: The message is telling you that you turned on the "Locking" gulm verbosity flag :) The short answer is that it's just letting you know you have lock contention. These messages are rather common and can be ignored (especially if your applications are modifying common files or directories from more than one node). > > > Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold > the lock Shr, and someone else is > queued before me. > Aug 24 03:11:21 lvs2 lock_gulmd_LT000[1332]: Asking for exl, where I hold > the lock Shr, and someone else is > queued before me. > > This repeated for about 3 hours, then one of the other nodes had a GFS > panic and had to be rebooted. Any suggestions would be appreciated. Without the panic message and relevant syslog messages, we can't really help you. -- Adam Manthei -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Tue Aug 24 17:11:04 2004 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 24 Aug 2004 13:11:04 -0400 Subject: [Linux-cluster] Re: Arch? In-Reply-To: <200408241212.53357.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <200408240124.46669.arekm@pld-linux.org> <1093358884.3467.130.camel@atlantis.boston.redhat.com> <200408241212.53357.phillips@redhat.com> Message-ID: <1093367464.3467.133.camel@atlantis.boston.redhat.com> On Tue, 2004-08-24 at 12:12 -0400, Daniel Phillips wrote: > Hi Lon, > > On Tuesday 24 August 2004 10:48, Lon Hohberger wrote: > > It was _designed_ to handle distributed repositories (like BK). > > Well, what wind is blowing, seems to be blowing in the direction of > Arch. 
I'd be equally happy with either, and in any case, much happier > than with CVS. Does anybody else have a strong opinion? We still have to get the current maintainers to agree, which might prove more of a problem than deciding on what new software to use in the first place. -- Lon From erik at debian.franken.de Tue Aug 24 17:27:42 2004 From: erik at debian.franken.de (Erik Tews) Date: Tue, 24 Aug 2004 19:27:42 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <200408241212.53357.phillips@redhat.com> References: <200408231143.11372.phillips@redhat.com> <200408240124.46669.arekm@pld-linux.org> <1093358884.3467.130.camel@atlantis.boston.redhat.com> <200408241212.53357.phillips@redhat.com> Message-ID: <1093368462.20290.7.camel@localhost.localdomain> Am Di, den 24.08.2004 schrieb Daniel Phillips um 18:12: > Well, what wind is blowing, seems to be blowing in the direction of > Arch. I'd be equally happy with either, and in any case, much happier > than with CVS. Does anybody else have a strong opinion? My opinion when I have to choose one is that 3. party support should be good too. CVS is supported on many systems and there are plugins for ides and guis everywhere. SVN is very good in this point too. And I usually need clients for the most common operating systems. But this is less important on sources.redhat.com, all people accessing this site will be running linux at least (not only redhat but linux) and will use no special ides because most of the software makes use of the gnu tools for building and testing. From xiaofeng.ling at intel.com Wed Aug 25 01:59:31 2004 From: xiaofeng.ling at intel.com (Ling, Xiaofeng) Date: Wed, 25 Aug 2004 09:59:31 +0800 Subject: [Linux-cluster] bug? mount hangs. Message-ID: <3ACA40606221794F80A5670F0AF15F840547798E@pdsmsx403> Hi, When I trying to setup GFS on two node. some times it triggers the kdb and the mount hangs. Follow is the dmesg and config file. I use kernel 2.6.6up no preemption with kdb patch. two nodes are both DELL desktop with Intel P3 and P4 CPU. Is this a know issue? ------------------------------------------------------------------------ --------------------------------------------------- Unable to handle kernel NULL pointer dereference at virtual address 00000046 printing eip: d087c916 *pde = 00000000 Oops: 0000 [#1] CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010286 (2.6.6kdb) EIP is at send_to_sock+0x41/0x20a [dlm] eax: 00000002 ebx: c763c060 ecx: 00000000 edx: 00000000 esi: d089124c edi: c763c060 ebp: 00000000 esp: c7007f88 ds: 007b es: 007b ss: 0068 Process dlm_sendd (pid: 19895, threadinfo=c7006000 task=caeaa7b0) Stack: d0887851 0000002b c7007fb0 c011475f c994bc68 c12a31b0 c763c068 00000286 c763c060 d089124c c7006000 00000002 d087cce0 c763c060 c7006000 00000000 00000000 00000000 d087cfda d0887905 00000000 00000000 0000007b 0000007b Call Trace: [] __wake_up_common+0x31/0x50 [] process_output_queue+0x55/0x75 [dlm] [] dlm_sendd+0x95/0xe9 [dlm] [] dlm_sendd+0x0/0xe9 [dlm] [] kernel_thread_helper+0x5/0xb Code: 8b 40 44 89 44 24 1c 8d 47 30 89 44 24 14 8b 5f 30 3b 5c 24 <6>CMAN: Being told to leave the cluster by node 2 CMAN: we are leaving the cluster SM: 00000001 sm_stop: SG still joined SM: 01000002 sm_stop: SG still joined input: AT Translated Set 2 keyboard on isa0060/serio0 my config file. ---------------------------------------------------------------- ------------------- Ling Xiaofeng(Daniel) Intel China Software Lab. 
iNet: 8-752-1243 8621-52574545-1243(O) xfling at users.sourceforge.net Opinions are my own and don't represent those of my employer From adam.cassar at netregistry.com.au Wed Aug 25 05:07:28 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Wed, 25 Aug 2004 15:07:28 +1000 Subject: [Linux-cluster] what does this mean? Message-ID: <1093410448.17936.232.camel@akira2.nro.au.com> (against latest gfs in cvs) scenario: - 3 machines in cluster - one importing gnbd, two directly mounted to shared fc raid * all 3 performing io * reboot one machine, all of the machines hung on io attempted to get rebooted machine to join the cluster, one machine spat out the following: kernel BUG at /usr/src/GFS/cluster/cman-kernel/src/membership.c:611! invalid operand: 0000 [#1] SMP Modules linked in: gnbd gfs lock_dlm dlm cman lock_harness 8250 serial_core dm_mod CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010246 (2.6.8.1) EIP is at send_joinconf+0x10/0x75 [cman] eax: 00000000 ebx: 00000003 ecx: 018a60be edx: c180e08c esi: f7ff6fc0 edi: 00000000 ebp: 00000001 esp: f324fe74 ds: 007b es: 007b ss: 0068 Process cman_memb (pid: 822, threadinfo=f324e000 task=f3242c70) Stack: f8907380 00000000 00000003 f7ff6fc0 00000000 00000003 f88f15ec f8907380 018a7446 c01193bf c0331140 c0401a8d f7ff6fc0 0000824d 0000824d 0000824d 00000000 f7ff6fc0 00000000 00000007 f88f087f 00000000 00000000 00000038 Call Trace: [] do_process_startack+0x14d/0x38e [cman] [] __call_console_drivers+0x55/0x57 [] start_transition+0x20f/0x2c1 [cman] [] cman_callback+0x35/0x38 [dlm] [] notify_kernel_listeners+0x41/0x68 [cman] [] a_node_just_died+0x163/0x181 [cman] [] do_process_leave+0x6b/0x7d [cman] [] do_membership_packet+0x98/0x1f0 [cman] [] dispatch_messages+0xe3/0x104 [cman] [] membership_kthread+0x216/0x3e6 [cman] [] ret_from_fork+0x6/0x14 [] default_wake_function+0x0/0x12 [] membership_kthread+0x0/0x3e6 [cman] [] kernel_thread_helper+0x5/0xb Code: 0f 0b 63 02 40 bd 8f f8 89 44 24 10 c7 05 f0 73 90 f8 02 00 kernel BUG at /usr/src/GFS/cluster/cman-kernel/src/membership.c:611! 
invalid operand: 0000 [#1] CPU: 0 EIP: 0060:[] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010246 (2.6.8.1) eax: 00000000 ebx: 00000003 ecx: 018a60be edx: c180e08c esi: f7ff6fc0 edi: 00000000 ebp: 00000001 esp: f324fe74 ds: 007b es: 007b ss: 0068 Stack: f8907380 00000000 00000003 f7ff6fc0 00000000 00000003 f88f15ec f8907380 018a7446 c01193bf c0331140 c0401a8d f7ff6fc0 0000824d 0000824d 0000824d 00000000 f7ff6fc0 00000000 00000007 f88f087f 00000000 00000000 00000038 [] do_process_startack+0x14d/0x38e [cman] [] __call_console_drivers+0x55/0x57 [] start_transition+0x20f/0x2c1 [cman] [] cman_callback+0x35/0x38 [dlm] [] notify_kernel_listeners+0x41/0x68 [cman] [] a_node_just_died+0x163/0x181 [cman] [] do_process_leave+0x6b/0x7d [cman] [] do_membership_packet+0x98/0x1f0 [cman] [] dispatch_messages+0xe3/0x104 [cman] [] membership_kthread+0x216/0x3e6 [cman] [] ret_from_fork+0x6/0x14 [] default_wake_function+0x0/0x12 [] membership_kthread+0x0/0x3e6 [cman] [] kernel_thread_helper+0x5/0xb Code: 0f 0b 63 02 40 bd 8f f8 89 44 24 10 c7 05 f0 73 90 f8 02 00 >>EIP; f88efc18 <===== >>ecx; 018a60be Before first symbol >>edx; c180e08c >>esi; f7ff6fc0 >>esp; f324fe74 Code; f88efc18 00000000 <_EIP>: Code; f88efc18 <===== 0: 0f 0b ud2a <===== Code; f88efc1a 2: 63 02 arpl %ax,(%edx) Code; f88efc1c 4: 40 inc %eax Code; f88efc1d 5: bd 8f f8 89 44 mov $0x4489f88f,%ebp Code; f88efc22 a: 24 10 and $0x10,%al Code; f88efc24 c: c7 05 f0 73 90 f8 02 movl $0x2,0xf89073f0 Code; f88efc2b 13: 00 00 00 From pcaulfie at redhat.com Wed Aug 25 07:23:32 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 25 Aug 2004 08:23:32 +0100 Subject: [Linux-cluster] bug? mount hangs. In-Reply-To: <3ACA40606221794F80A5670F0AF15F840547798E@pdsmsx403> References: <3ACA40606221794F80A5670F0AF15F840547798E@pdsmsx403> Message-ID: <20040825072331.GB11961@tykepenguin.com> On Wed, Aug 25, 2004 at 09:59:31AM +0800, Ling, Xiaofeng wrote: > Hi, > When I trying to setup GFS on two node. some times it triggers the > kdb and the mount hangs. > Follow is the dmesg and config file. > I use kernel 2.6.6up no preemption with kdb patch. two nodes are both > DELL desktop with Intel P3 and P4 CPU. > Is this a know issue? If it was preceded by a "can't bind to port 21064" message then yes. It has been fixed in CVS. It certainly looks like that bug. -- patrick From adam.cassar at netregistry.com.au Wed Aug 25 07:33:34 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Wed, 25 Aug 2004 17:33:34 +1000 Subject: [Linux-cluster] 2 node vs 3 node cluster Message-ID: <1093419214.17936.313.camel@akira2.nro.au.com> Hi Guys, What are the benefits of running a 3 node cluster, as only one node can fail before bringing the entire cluster down? It appears that a single node cannot be a member of a cluster if the other hosts are missing. I take it that is to prevent single nodes splitting and making themselves independent clusters? From njd at ndietsch.com Wed Aug 25 08:16:07 2004 From: njd at ndietsch.com (Nathan Dietsch) Date: Wed, 25 Aug 2004 18:16:07 +1000 Subject: [Linux-cluster] 2 node vs 3 node cluster In-Reply-To: <1093419214.17936.313.camel@akira2.nro.au.com> References: <1093419214.17936.313.camel@akira2.nro.au.com> Message-ID: <412C4AC7.5050801@ndietsch.com> Hello Adam, Adam Cassar wrote: >Hi Guys, > >What are the benefits of running a 3 node cluster, as only one node can >fail before bringing the entire cluster down? 
> >It appears that a single node cannot be a member of a cluster if the >other hosts are missing. I take it that is to prevent single nodes >splitting and making themselves independent clusters? > > I think something is missing in your understanding of cluster concepts in general and someone please correct me if I am wrong in my explanation. I am sure others can answer the linux-cluster specific attributes , however this is a matter of quorum (finding a majority view of the cluster) in general. It is important that in the case of failure, split-brain scenarios (as you pointed out) are avoided. If the number of nodes is even and each node has one vote, you face a problem. How this is resolved is implementation dependent, but the explanation below might help; In other clusters this is handled by allocating a device which both machines have access to. (Each machine has a vote, plus the device has a vote making an odd number of votes). When the machines lose sight of each other, they race to grab hold of the device and whoever gets it (using SCSI-3 reservations usually) gets to remain " in the cluster". The other node is "fenced off" from the disks containing the data, usually panics and then reboots, only being allowed back into the cluster once it can communicate with its peers. Quorum can also be handled by allocating a higher-number of votes to a specific node (I believe linux-cluster handles things this way from what I have read). So to answer your question. Having a three-node (or any odd number) cluster is ideal because it reduces the complexity of quorum issues. However, if all you need is the power of two nodes, properly configuring quorum (implementation dependent) can alleviate your problems. FYI, the notion of quorum is used in other scenarios such as the meta-databases in Solaris Volume Manager (formerly Sun Disksuite). I never really understood this one completely, but it does provide an example. I hope this helps, I am sure others will have different and better explanations for the linux-cluster specifics. For more general cluster information, I recommend Gregory Pfister's book "In Search of Clusters". Regards, Nathan Dietsch From xiaofeng.ling at intel.com Wed Aug 25 08:46:57 2004 From: xiaofeng.ling at intel.com (Ling, Xiaofeng) Date: Wed, 25 Aug 2004 16:46:57 +0800 Subject: [Linux-cluster] bug? mount hangs. Message-ID: <3ACA40606221794F80A5670F0AF15F84054B2BFD@pdsmsx403> >-----Original Message----- >From: linux-cluster-bounces at redhat.com >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Patrick >Caulfield >Sent: 2004?8?25? 15:24 >To: Discussion of clustering software components including GFS >Subject: Re: [Linux-cluster] bug? mount hangs. > >On Wed, Aug 25, 2004 at 09:59:31AM +0800, Ling, Xiaofeng wrote: >> Hi, >> When I trying to setup GFS on two node. some times it >triggers the >> kdb and the mount hangs. >> Follow is the dmesg and config file. >> I use kernel 2.6.6up no preemption with kdb patch. two nodes are both >> DELL desktop with Intel P3 and P4 CPU. >> Is this a know issue? > >If it was preceded by a "can't bind to port 21064" message >then yes. It has been >fixed in CVS. It certainly looks like that bug. Yes, it is. Thanks. 
From lhh at redhat.com Wed Aug 25 13:19:53 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 25 Aug 2004 09:19:53 -0400 Subject: [Linux-cluster] 2 node vs 3 node cluster In-Reply-To: <412C4AC7.5050801@ndietsch.com> References: <1093419214.17936.313.camel@akira2.nro.au.com> <412C4AC7.5050801@ndietsch.com> Message-ID: <1093439993.17698.37.camel@atlantis.boston.redhat.com> On Wed, 2004-08-25 at 18:16 +1000, Nathan Dietsch wrote: > Quorum can also be handled by allocating a higher-number of votes to a > specific node (I believe linux-cluster handles things this way from what > I have read). This is currently one way you can do it. I think you can also just put 'cman' into a '2-node' mode where it races to fence the other (no SCSI device needed). -- Lon From lhh at redhat.com Wed Aug 25 13:20:02 2004 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 25 Aug 2004 09:20:02 -0400 Subject: [Linux-cluster] 2 node vs 3 node cluster In-Reply-To: <412C4AC7.5050801@ndietsch.com> References: <1093419214.17936.313.camel@akira2.nro.au.com> <412C4AC7.5050801@ndietsch.com> Message-ID: <1093440002.17698.39.camel@atlantis.boston.redhat.com> On Wed, 2004-08-25 at 18:16 +1000, Nathan Dietsch wrote: > In other clusters this is handled by allocating a device which both > machines have access to. (Each machine has a vote, plus the device has a > vote making an odd number of votes). > When the machines lose sight of each other, they race to grab hold of > the device and whoever gets it (using SCSI-3 reservations usually) gets > to remain " in the cluster". The other node is "fenced off" from the > disks containing the data, usually panics and then reboots, only being > allowed back into the cluster once it can communicate with its peers. Similar to the above is the use of a disk-based membership+quorum model ("it which is writing to the disk is a member and is in the quorum"). This works well in the 2-node case, but doesn't ensure network connectivity, and isn't terribly scalable. One can also use a disk-based membership as a backup to network membership (e.g. membership determined over network; only in the event of a potential split brain is the disk checked), but again, this requires that each node be accessing the disk. Both of the above allow continued concurrent access from all nodes to shared partitions on a single device - but require allocation of space on shared devices for the membership/quorum data. Another popular method of fixing the split-brain in even-node cases is adding a dummy vote to a router or something which responds to ICMP_ECHO ;) Again, similar to Nathan's example, these models require fencing to ensure data integrity. To be precise, "split brain" in data-sharing clusters is typically equated to "data corruption". -- Lon From teigland at redhat.com Wed Aug 25 13:46:27 2004 From: teigland at redhat.com (David Teigland) Date: Wed, 25 Aug 2004 21:46:27 +0800 Subject: [Linux-cluster] 2 node vs 3 node cluster In-Reply-To: <1093419214.17936.313.camel@akira2.nro.au.com> References: <1093419214.17936.313.camel@akira2.nro.au.com> Message-ID: <20040825134627.GB16586@redhat.com> On Wed, Aug 25, 2004 at 05:33:34PM +1000, Adam Cassar wrote: > Hi Guys, > > What are the benefits of running a 3 node cluster, as only one node can > fail before bringing the entire cluster down? > > It appears that a single node cannot be a member of a cluster if the > other hosts are missing. I take it that is to prevent single nodes > splitting and making themselves independent clusters? You're right. 
Both 2 and 3 node clusters can tolerate the failure of 1 node. If 2 nodes fail, a 2 node cluster would obviously be out of commission, while the single remaining node in a 3 node cluster would be stalled. So, there's no advantage to a 3 node cluster in that sense. This is assuming all nodes have the default 1 vote -- probably the most sensible configuration. In another sense, having one remaining node in a 3 node cluster would make bringing things back up nicer after the failures. All you'd need is one failed node to join the cluster again to make the cluster quorate and allow the stalled node to continue running. Another option for the expert user: if you know the two failed nodes have been reset (or detached from storage) you could manually reduce expected votes to allow the stalled node to continue running. This is dangerous, of course, unless you know what you're doing. -- Dave Teigland From pcaulfie at redhat.com Wed Aug 25 14:46:51 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 25 Aug 2004 15:46:51 +0100 Subject: [Linux-cluster] what does this mean? In-Reply-To: <1093410448.17936.232.camel@akira2.nro.au.com> References: <1093410448.17936.232.camel@akira2.nro.au.com> Message-ID: <20040825144651.GA20829@tykepenguin.com> On Wed, Aug 25, 2004 at 03:07:28PM +1000, Adam Cassar wrote: > (against latest gfs in cvs) > > scenario: > > - 3 machines in cluster > - one importing gnbd, two directly mounted to shared fc raid > > * all 3 performing io > * reboot one machine, all of the machines hung on io > > attempted to get rebooted machine to join the cluster, one machine spat > out the following: > > > kernel BUG at /usr/src/GFS/cluster/cman-kernel/src/membership.c:611! Hmmm, it means a bug I was sure had been fixed, hasn't :-( Was there anything interesting on the other two nodes ? -- patrick From anton at hq.310.ru Wed Aug 25 16:18:47 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Wed, 25 Aug 2004 20:18:47 +0400 Subject: [Linux-cluster] can't open cluster socket: Socket type not supported Message-ID: <1114844837.20040825201847@hq.310.ru> Hi all, After update from cvs i have problem with gfs # cman_tool join can't open cluster socket: Socket type not supported # uname -a Linux c5.310.ru 2.6.8.1 #19 SMP Wed Aug 25 20:15:23 MSD 2004 i686 i686 i386 GNU/Linux What could the problem be? -- e-mail: anton at hq.310.ru From jens.dreger at physik.fu-berlin.de Wed Aug 25 16:33:00 2004 From: jens.dreger at physik.fu-berlin.de (Jens Dreger) Date: Wed, 25 Aug 2004 18:33:00 +0200 Subject: [Linux-cluster] can't open cluster socket: Socket type not supported In-Reply-To: <1114844837.20040825201847@hq.310.ru> References: <1114844837.20040825201847@hq.310.ru> Message-ID: <20040825163300.GC12982@smart.physik.fu-berlin.de> On Wed, Aug 25, 2004 at 08:18:47PM +0400, Anton Nekhoroshikh wrote: > Hi all, > > After update from cvs i have problem with gfs > > # cman_tool join > can't open cluster socket: Socket type not supported I had a similiar problem when trying to start clvmd and could track it back to AF_CLUSTER being defined differently in cluser/cman-kernel/src/cnxman-socket.h (30) and LVM2/daemons/clvmd/cnxman-socket.h (31) This might be related. HTH, Jens.
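To illustrate the point, here is a minimal sketch of why that kind of header skew shows up as a failed cluster socket. The 30/31 values are the ones from the two headers named above; the SOCK_DGRAM type and 0 protocol are just placeholders, not the real cman constants:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Userspace has to ask for the same address family number that the
     * cman kernel module registered: the cman-kernel header says 30,
     * while the stale copy shipped in the LVM2 tree still said 31. */
    #define AF_CLUSTER 30

    int main(void)
    {
            /* placeholder type/protocol; the point is only that a wrong
             * AF_CLUSTER number makes socket() fail straight away */
            int fd = socket(AF_CLUSTER, SOCK_DGRAM, 0);
            if (fd < 0)
                    perror("can't open cluster socket");
            return fd < 0 ? 1 : 0;
    }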
From jeff at intersystems.com Wed Aug 25 16:35:10 2004 From: jeff at intersystems.com (Jeff) Date: Wed, 25 Aug 2004 12:35:10 -0400 Subject: [Linux-cluster] can't open cluster socket: Socket type not supported In-Reply-To: <20040825163300.GC12982@smart.physik.fu-berlin.de> References: <1114844837.20040825201847@hq.310.ru> <20040825163300.GC12982@smart.physik.fu-berlin.de> Message-ID: <2910736404.20040825123510@intersystems.com> Wednesday, August 25, 2004, 12:33:00 PM, Jens Dreger wrote: > On Wed, Aug 25, 2004 at 08:18:47PM +0400, ????? ????????? wrote: >> Hi all, >> >> After update from cvs i have problem with gfs >> >> # cman_tool join >> can't open cluster socket: Socket type not supported > I had a similiar problem when trying to start clvmd and could track it > back to AF_CLUSTER being defined differently in > cluser/cman-kernel/src/cnxman-socket.h (30) > and > LVM2/daemons/clvmd/cnxman-socket.h (31) > This might be related. > HTH, > Jens. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster Try 'modprobe dlm' after ccsd, before cman_tool join. From danderso at redhat.com Wed Aug 25 17:53:29 2004 From: danderso at redhat.com (Derek Anderson) Date: Wed, 25 Aug 2004 12:53:29 -0500 Subject: [Linux-cluster] can't open cluster socket: Socket type not supported In-Reply-To: <20040825163300.GC12982@smart.physik.fu-berlin.de> References: <1114844837.20040825201847@hq.310.ru> <20040825163300.GC12982@smart.physik.fu-berlin.de> Message-ID: <200408251253.29390.danderso@redhat.com> http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127019 AF_CLUSTER should be 30 in the latest versions of the cluster and LVM2 tree. On Wednesday 25 August 2004 11:33, Jens Dreger wrote: > On Wed, Aug 25, 2004 at 08:18:47PM +0400, ????? ????????? wrote: > > Hi all, > > > > After update from cvs i have problem with gfs > > > > # cman_tool join > > can't open cluster socket: Socket type not supported > > I had a similiar problem when trying to start clvmd and could track it > back to AF_CLUSTER being defined differently in > > cluser/cman-kernel/src/cnxman-socket.h (30) > > and > > LVM2/daemons/clvmd/cnxman-socket.h (31) > > This might be related. > > HTH, > > Jens. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From adam.cassar at netregistry.com.au Wed Aug 25 22:40:49 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Thu, 26 Aug 2004 08:40:49 +1000 Subject: [Linux-cluster] what does this mean? In-Reply-To: <20040825144651.GA20829@tykepenguin.com> References: <1093410448.17936.232.camel@akira2.nro.au.com> <20040825144651.GA20829@tykepenguin.com> Message-ID: <1093473649.2330.2.camel@akira2.nro.au.com> If by interesting you mean one being a NFS server, then yes. Would this matter? On Thu, 2004-08-26 at 00:46, Patrick Caulfield wrote: > On Wed, Aug 25, 2004 at 03:07:28PM +1000, Adam Cassar wrote: > > (against latest gfs in cvs) > > > > scenario: > > > > - 3 machines in cluster > > - one importing gnbd, two directly mounted to shared fc raid > > > > * all 3 performing io > > * reboot one machine, all of the machines hung on io > > > > attempted to get rebooted machine to join the cluster, one machine spat > > out the following: > > > > > > kernel BUG at /usr/src/GFS/cluster/cman-kernel/src/membership.c:611! > > Hmmm, it means a bug I was sure had been fixed, hasn't :-( > > Was there anything interesting on the other two nodes ? 
From adam.cassar at netregistry.com.au Wed Aug 25 22:42:42 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Thu, 26 Aug 2004 08:42:42 +1000 Subject: [Linux-cluster] fsck time Message-ID: <1093473762.2330.5.camel@akira2.nro.au.com> Using the latest CVS code I attempted to run fsck on an 800G GFS partition (of which about 200 meg was used). The fsck took: Pass 7: done (3:53:57) gfs_fsck Complete (3:54:41). Is this time to be expected? From jbrassow at redhat.com Thu Aug 26 01:18:37 2004 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Wed, 25 Aug 2004 20:18:37 -0500 Subject: [Linux-cluster] fsck time In-Reply-To: <1093473762.2330.5.camel@akira2.nro.au.com> References: <1093473762.2330.5.camel@akira2.nro.au.com> Message-ID: Seems a bit high... but it's possible. The fsck, while very thorough and functionally correct, is not very well optimized. Pass 7 checks for block conflicts. That is, if a file (or dir) has a block in common with another file (or dir). Each pass is designed to check something different. So, depending on the access patterns, some passes will take longer than others. Your numbers seem high - especially for the amount of space you are actually using. Part of that time is chewed up looking over the portions of the file system that are not used... If you're just playing around, you may wish to see how the fsck does if the fs is smaller but the contents are the same. brassow On Aug 25, 2004, at 5:42 PM, Adam Cassar wrote: > > Using the latest CVS code I attempted to run fsck on an 800G GFS > partition (of which about 200 meg was used). The fsck took: > > Pass 7: done (3:53:57) > gfs_fsck Complete (3:54:41). > > Is this time to be expected? > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From adam.cassar at netregistry.com.au Thu Aug 26 07:09:37 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Thu, 26 Aug 2004 17:09:37 +1000 Subject: [Linux-cluster] kernel oops Message-ID: <1093504177.2330.167.camel@akira2.nro.au.com> I received the following trying to unmount a GFS partition. I tried to unmount a GFS partition shared between three nodes and it hung. I discovered that one of the nodes had become unresponsive so I manually ACKED the fence request and attempted to unmount. 
The following occurred: Unable to handle kernel paging request at virtual address 001dae44 printing eip: f88cda59 *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: lock_dlm dlm cman gfs lock_harness 8250 serial_core dm_mod CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010286 (2.6.8.1) EIP is at name_to_directory_nodeid+0x15/0xf9 [dlm] eax: 001dae00 ebx: e8dc304c ecx: c1b5ae3c edx: e8dc304c esi: 00000000 edi: 001dae00 ebp: e8dc304c esp: f7297ec0 ds: 007b es: 007b ss: 0068 Process dlm_recoverd (pid: 859, threadinfo=f7296000 task=f7125930) Stack: f706c000 f706c000 c1b5ae00 00000000 f88db235 00000000 e8dc304c c1b5ae00 c1b5aef0 e8dc304c f88cdb5e 001dae00 e8dc30c5 00000018 f88dc727 e8dc304c 00000003 00000003 f706c000 00000000 001dae00 e8dc304c e8dc304c c1b5ae00 Call Trace: [] rcom_send_message+0xe1/0x217 [dlm] [] get_directory_nodeid+0x21/0x25 [dlm] [] rsb_master_lookup+0x1a/0x126 [dlm] [] restbl_rsb_update+0x142/0x165 [dlm] [] ls_reconfig+0xd5/0x220 [dlm] [] dlm_recoverd+0x0/0x66 [dlm] [] do_ls_recovery+0x16c/0x444 [dlm] [] dlm_recoverd+0x4c/0x66 [dlm] [] kthread+0xb7/0xbd [] kthread+0x0/0xbd [] kernel_thread_helper+0x5/0xb Code: 83 7f 44 01 74 65 8b 44 24 34 89 44 24 04 8b 44 24 30 89 04 Unable to handle kernel paging request at virtual address 001dae44 f88cda59 *pde = 00000000 Oops: 0000 [#1] CPU: 0 EIP: 0060:[] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010286 (2.6.8.1) eax: 001dae00 ebx: e8dc304c ecx: c1b5ae3c edx: e8dc304c esi: 00000000 edi: 001dae00 ebp: e8dc304c esp: f7297ec0 ds: 007b es: 007b ss: 0068 Stack: f706c000 f706c000 c1b5ae00 00000000 f88db235 00000000 e8dc304c c1b5ae00 c1b5aef0 e8dc304c f88cdb5e 001dae00 e8dc30c5 00000018 f88dc727 e8dc304c 00000003 00000003 f706c000 00000000 001dae00 e8dc304c e8dc304c c1b5ae00 [] rcom_send_message+0xe1/0x217 [dlm] [] get_directory_nodeid+0x21/0x25 [dlm] [] rsb_master_lookup+0x1a/0x126 [dlm] [] restbl_rsb_update+0x142/0x165 [dlm] [] ls_reconfig+0xd5/0x220 [dlm] [] dlm_recoverd+0x0/0x66 [dlm] [] do_ls_recovery+0x16c/0x444 [dlm] [] dlm_recoverd+0x4c/0x66 [dlm] [] kthread+0xb7/0xbd [] kthread+0x0/0xbd [] kernel_thread_helper+0x5/0xb Code: 83 7f 44 01 74 65 8b 44 24 34 89 44 24 04 8b 44 24 30 89 04 >>EIP; f88cda59 <===== >>eax; 001dae00 Before first symbol >>ebx; e8dc304c >>ecx; c1b5ae3c >>edx; e8dc304c >>edi; 001dae00 Before first symbol >>ebp; e8dc304c >>esp; f7297ec0 Code; f88cda59 00000000 <_EIP>: Code; f88cda59 <===== 0: 83 7f 44 01 cmpl $0x1,0x44(%edi) <===== Code; f88cda5d 4: 74 65 je 6b <_EIP+0x6b> f88cdac4 Code; f88cda5f 6: 8b 44 24 34 mov 0x34(%esp,1),%eax Code; f88cda63 a: 89 44 24 04 mov %eax,0x4(%esp,1) Code; f88cda67 e: 8b 44 24 30 mov 0x30(%esp,1),%eax Code; f88cda6b 12: 89 04 00 mov %eax,(%eax,%eax,1) From adam.cassar at netregistry.com.au Thu Aug 26 07:19:21 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Thu, 26 Aug 2004 17:19:21 +1000 Subject: [Linux-cluster] kernel oops In-Reply-To: <1093504177.2330.167.camel@akira2.nro.au.com> References: <1093504177.2330.167.camel@akira2.nro.au.com> Message-ID: <1093504760.12947.0.camel@akira2.nro.au.com> I also got quite a few of these: dlm: dude: restbl_rsb_update_recv rsb not found 2447 dlm: dude: restbl_rsb_update_recv rsb not found 2448 dlm: dude: restbl_rsb_update_recv rsb not found 2449 dlm: dude: restbl_rsb_update_recv rsb not found 2450 dlm: dude: restbl_rsb_update_recv rsb not found 2451 dlm: dude: restbl_rsb_update_recv rsb not found 2452 dlm: dude: restbl_rsb_update_recv rsb not found 2453 dlm: dude: restbl_rsb_update_recv rsb not 
found 2454 dlm: dude: restbl_rsb_update_recv rsb not found 2455 dlm: dude: restbl_rsb_update_recv rsb not found 2456 dlm: dude: restbl_rsb_update_recv rsb not found 2457 dlm: dude: restbl_rsb_update_recv rsb not found 2458 dlm: dude: restbl_rsb_update_recv rsb not found 2459 dlm: dude: restbl_rsb_update_recv rsb not found 2460 dlm: dude: restbl_rsb_update_recv rsb not found 2461 dlm: dude: restbl_rsb_update_recv rsb not found 2462 dlm: dude: restbl_rsb_update_recv rsb not found 2463 dlm: dude: restbl_rsb_update_recv rsb not found 2464 dlm: dude: restbl_rsb_update_recv rsb not found 2465 dlm: dude: restbl_rsb_update_recv rsb not found 2466 On Thu, 2004-08-26 at 17:09, Adam Cassar wrote: > I received the following trying to unmount a GFS partition. > > I tried to unmount a GFS partition shared between three nodes and it > hung. > > I discovered that one of the nodes had become unresponsive so I manually > ACKED the fence request and attempted to unmount. The following > occurred: > > Unable to handle kernel paging request at virtual address 001dae44 > printing eip: > f88cda59 > *pde = 00000000 > Oops: 0000 [#1] > SMP > Modules linked in: lock_dlm dlm cman gfs lock_harness 8250 serial_core > dm_mod > CPU: 0 > EIP: 0060:[] Not tainted > EFLAGS: 00010286 (2.6.8.1) > EIP is at name_to_directory_nodeid+0x15/0xf9 [dlm] > eax: 001dae00 ebx: e8dc304c ecx: c1b5ae3c edx: e8dc304c > esi: 00000000 edi: 001dae00 ebp: e8dc304c esp: f7297ec0 > ds: 007b es: 007b ss: 0068 > Process dlm_recoverd (pid: 859, threadinfo=f7296000 task=f7125930) > Stack: f706c000 f706c000 c1b5ae00 00000000 f88db235 00000000 e8dc304c > c1b5ae00 > c1b5aef0 e8dc304c f88cdb5e 001dae00 e8dc30c5 00000018 f88dc727 > e8dc304c > 00000003 00000003 f706c000 00000000 001dae00 e8dc304c e8dc304c > c1b5ae00 > Call Trace: > [] rcom_send_message+0xe1/0x217 [dlm] > [] get_directory_nodeid+0x21/0x25 [dlm] > [] rsb_master_lookup+0x1a/0x126 [dlm] > [] restbl_rsb_update+0x142/0x165 [dlm] > [] ls_reconfig+0xd5/0x220 [dlm] > [] dlm_recoverd+0x0/0x66 [dlm] > [] do_ls_recovery+0x16c/0x444 [dlm] > [] dlm_recoverd+0x4c/0x66 [dlm] > [] kthread+0xb7/0xbd > [] kthread+0x0/0xbd > [] kernel_thread_helper+0x5/0xb > Code: 83 7f 44 01 74 65 8b 44 24 34 89 44 24 04 8b 44 24 30 89 04 > Unable to handle kernel paging request at virtual address 001dae44 > f88cda59 > *pde = 00000000 > Oops: 0000 [#1] > CPU: 0 > EIP: 0060:[] Not tainted > Using defaults from ksymoops -t elf32-i386 -a i386 > EFLAGS: 00010286 (2.6.8.1) > eax: 001dae00 ebx: e8dc304c ecx: c1b5ae3c edx: e8dc304c > esi: 00000000 edi: 001dae00 ebp: e8dc304c esp: f7297ec0 > ds: 007b es: 007b ss: 0068 > Stack: f706c000 f706c000 c1b5ae00 00000000 f88db235 00000000 e8dc304c > c1b5ae00 > c1b5aef0 e8dc304c f88cdb5e 001dae00 e8dc30c5 00000018 f88dc727 > e8dc304c > 00000003 00000003 f706c000 00000000 001dae00 e8dc304c e8dc304c > c1b5ae00 > [] rcom_send_message+0xe1/0x217 [dlm] > [] get_directory_nodeid+0x21/0x25 [dlm] > [] rsb_master_lookup+0x1a/0x126 [dlm] > [] restbl_rsb_update+0x142/0x165 [dlm] > [] ls_reconfig+0xd5/0x220 [dlm] > [] dlm_recoverd+0x0/0x66 [dlm] > [] do_ls_recovery+0x16c/0x444 [dlm] > [] dlm_recoverd+0x4c/0x66 [dlm] > [] kthread+0xb7/0xbd > [] kthread+0x0/0xbd > [] kernel_thread_helper+0x5/0xb > Code: 83 7f 44 01 74 65 8b 44 24 34 89 44 24 04 8b 44 24 30 89 04 > > > >>EIP; f88cda59 <===== > > >>eax; 001dae00 Before first symbol > >>ebx; e8dc304c > >>ecx; c1b5ae3c > >>edx; e8dc304c > >>edi; 001dae00 Before first symbol > >>ebp; e8dc304c > >>esp; f7297ec0 > > Code; f88cda59 > 00000000 <_EIP>: > Code; 
f88cda59 <===== > 0: 83 7f 44 01 cmpl $0x1,0x44(%edi) <===== > Code; f88cda5d > 4: 74 65 je 6b <_EIP+0x6b> f88cdac4 > > Code; f88cda5f > 6: 8b 44 24 34 mov 0x34(%esp,1),%eax > Code; f88cda63 > a: 89 44 24 04 mov %eax,0x4(%esp,1) > Code; f88cda67 > e: 8b 44 24 30 mov 0x30(%esp,1),%eax > Code; f88cda6b > 12: 89 04 00 mov %eax,(%eax,%eax,1) > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Cassar IT Manager NetRegistry Pty Ltd ______________________________________________ http://www.netregistry.com.au Tel: 02 9699 6099 Fax: 02 9699 6088 PO Box 270 Broadway NSW 2007 Domains |Business Email|Web Hosting|E-Commerce Trusted by 10,000s of businesses since 1997 ______________________________________________ From jopet at staff.spray.se Thu Aug 26 08:25:59 2004 From: jopet at staff.spray.se (Johan Pettersson) Date: Thu, 26 Aug 2004 10:25:59 +0200 Subject: [Linux-cluster] Installation problem Message-ID: <1093508759.31675.35.camel@zombie.i.spray.se> Hello! I'm trying to install (http://gfs.wikidev.net/Installation) gfs and clvm, but have some problem. Checked out sources for `device-mapper', `lvm2' and `cluster' yesterday. And have patched a vanilla-2.6.7 kernel with following patches: /cluster/cman-kernel/patches/2.6.8.1/*patch /cluster/dlm-kernel/patches/2.6.8.1/*patch /cluster/gfs-kernel/patches/2.6.8.1/*patch /cluster/gnbd-kernel/patches/2.6.7/*patch I would like to use a vanilla-2.6.8.1 kernel, but I guess `gndb.kernel' wouldn't work then!? cd cluster; ./configure --kernel_src=/build/linux-2.6.7; make cd cman-kernel && make install make[1]: Entering directory `/build/cluster/cman-kernel' cd src && make install make[2]: Entering directory `/build/cluster/cman-kernel/src' rm -f cluster ln -s . cluster make -C /build/linux-2.6.7 M=/home/jopet/build/cluster/cman-kernel/src modules USING_KBUILD=yes make[3]: Entering directory `/build/linux-2.6.7' CC [M] /build/cluster/cman-kernel/src/cnxman.o /build/cluster/cman-kernel/src/cnxman.c: In function `send_to_userport': /build/cluster/cman-kernel/src/cnxman.c:796: error: `MSG_REPLYEXP' undeclared (first use in this function) /build/cluster/cman-kernel/src/cnxman.c:796: error: (Each undeclared identifier is reported only once /build/cluster/cman-kernel/src/cnxman.c:796: error: for each function it appears in.) /build/cluster/cman-kernel/src/cnxman.c: In function `__sendmsg': /build/cluster/cman-kernel/src/cnxman.c:2242: error: `MSG_REPLYEXP' undeclared (first use in this function) /build/cluster/cman-kernel/src/cnxman.c: In function `send_listen_request': /build/cluster/cman-kernel/src/cnxman.c:2464: error: `MSG_REPLYEXP' undeclared (first use in this function) make[4]: *** [/build/cluster/cman-kernel/src/cnxman.o] Error 1 make[3]: *** [_module_/build/cluster/cman-kernel/src] Error 2 make[3]: Leaving directory `/build/linux-2.6.7' make[2]: *** [all] Error 2 make[2]: Leaving directory `/build/cluster/cman-kernel/src' make[1]: *** [install] Error 2 make[1]: Leaving directory `/build/cluster/cman-kernel' make: *** [install] Error 2 Thx /Johan -- In disk space, nobody can hear your files scream. 
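A quick way to check which copies of the cman headers a build is actually picking up -- the paths below are only the ones that appear in the output above plus the usual installed location /usr/include/cluster, so adjust them to your tree:

    # copies of the cluster headers that do not define MSG_REPLYEXP are stale
    grep -rl MSG_REPLYEXP /usr/include/cluster \
        /build/linux-2.6.7/include/cluster \
        /build/cluster/cman-kernel/src 2>/dev/null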
From pcaulfie at redhat.com Thu Aug 26 08:34:26 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 26 Aug 2004 09:34:26 +0100 Subject: [Linux-cluster] Installation problem In-Reply-To: <1093508759.31675.35.camel@zombie.i.spray.se> References: <1093508759.31675.35.camel@zombie.i.spray.se> Message-ID: <20040826083426.GB6682@tykepenguin.com> On Thu, Aug 26, 2004 at 10:25:59AM +0200, Johan Pettersson wrote: > Hello! > > make[3]: Entering directory `/build/linux-2.6.7' > CC [M] /build/cluster/cman-kernel/src/cnxman.o > /build/cluster/cman-kernel/src/cnxman.c: In function `send_to_userport': > /build/cluster/cman-kernel/src/cnxman.c:796: error: `MSG_REPLYEXP' > undeclared (first use in this function) > /build/cluster/cman-kernel/src/cnxman.c:796: error: (Each undeclared > identifier is reported only once > /build/cluster/cman-kernel/src/cnxman.c:796: error: for each function it > appears in.) > /build/cluster/cman-kernel/src/cnxman.c: In function `__sendmsg': > /build/cluster/cman-kernel/src/cnxman.c:2242: error: `MSG_REPLYEXP' > undeclared (first use in this function) > /build/cluster/cman-kernel/src/cnxman.c: In function > `send_listen_request': > /build/cluster/cman-kernel/src/cnxman.c:2464: error: `MSG_REPLYEXP' > undeclared (first use in this function) it looks like you might have an old version of cnxman.h lying around, maybe in /usr/include/cluster -- patrick From anton at hq.310.ru Thu Aug 26 08:40:09 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 26 Aug 2004 12:40:09 +0400 Subject: [Linux-cluster] kernel: lock_dlm: init_fence error -1 Message-ID: <05345590.20040826124009@hq.310.ru> Hi all, After update gfs from cvs i have problem with mount gfs i see in messages kernel: lock_dlm: init_fence error -1 kernel: GFS: can't mount proto = lock_dlm, table = 310farm:gfs01, hostdata = before mount "cman_tool join" and "fence_tool join" loaded without error In what there can be a problem? -- e-mail: anton at hq.310.ru From anton at hq.310.ru Thu Aug 26 09:22:50 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Thu, 26 Aug 2004 13:22:50 +0400 Subject: [Linux-cluster] fence_sanbox2 Message-ID: <1445867274.20040826132250@hq.310.ru> Hi all, Have probably forgotten :) *** cluster/fence/bin/Makefile.orig 2004-08-26 13:23:07.084421744 +0400 --- cluster/fence/bin/Makefile 2004-08-26 13:23:02.482121400 +0400 *************** *** 31,36 **** --- 31,37 ---- fence_wti \ fence_xcat \ fence_zvm \ + fence_sanbox2 \ fenced -- e-mail: anton at hq.310.ru From jopet at staff.spray.se Thu Aug 26 09:58:19 2004 From: jopet at staff.spray.se (Johan Pettersson) Date: Thu, 26 Aug 2004 11:58:19 +0200 Subject: [Linux-cluster] Installation problem In-Reply-To: <20040826083426.GB6682@tykepenguin.com> References: <1093508759.31675.35.camel@zombie.i.spray.se> <20040826083426.GB6682@tykepenguin.com> Message-ID: <1093514299.31675.44.camel@zombie.i.spray.se> On Thu, 2004-08-26 at 10:34, Patrick Caulfield wrote: > On Thu, Aug 26, 2004 at 10:25:59AM +0200, Johan Pettersson wrote: > > Hello! 
> > > > make[3]: Entering directory `/build/linux-2.6.7' > > CC [M] /build/cluster/cman-kernel/src/cnxman.o > > /build/cluster/cman-kernel/src/cnxman.c: In function `send_to_userport': > > /build/cluster/cman-kernel/src/cnxman.c:796: error: `MSG_REPLYEXP' > > undeclared (first use in this function) > > /build/cluster/cman-kernel/src/cnxman.c:796: error: (Each undeclared > > identifier is reported only once > > /build/cluster/cman-kernel/src/cnxman.c:796: error: for each function it > > appears in.) > > /build/cluster/cman-kernel/src/cnxman.c: In function `__sendmsg': > > /build/cluster/cman-kernel/src/cnxman.c:2242: error: `MSG_REPLYEXP' > > undeclared (first use in this function) > > /build/cluster/cman-kernel/src/cnxman.c: In function > > `send_listen_request': > > /build/cluster/cman-kernel/src/cnxman.c:2464: error: `MSG_REPLYEXP' > > undeclared (first use in this function) > > > it looks like you might have an old version of cnxman.h lying around, maybe in > /usr/include/cluster I have only 2 cnxman.h in the system and they do not differ =/ build/cluster/cman-kernel/src/cnxman.h build/linux-2.6.7/include/cluster/cnxman.h /J -- In disk space, nobody can hear your files scream. From pcaulfie at redhat.com Thu Aug 26 10:21:19 2004 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 26 Aug 2004 11:21:19 +0100 Subject: [Linux-cluster] Installation problem In-Reply-To: <1093514299.31675.44.camel@zombie.i.spray.se> References: <1093508759.31675.35.camel@zombie.i.spray.se> <20040826083426.GB6682@tykepenguin.com> <1093514299.31675.44.camel@zombie.i.spray.se> Message-ID: <20040826102119.GA9523@tykepenguin.com> On Thu, Aug 26, 2004 at 11:58:19AM +0200, Johan Pettersson wrote: > > > I have only 2 cnxman.h in the system and they do not differ =/ > > build/cluster/cman-kernel/src/cnxman.h > build/linux-2.6.7/include/cluster/cnxman.h Sorry that should have been cnxman-socket.h patrick From yjcho at cs.hongik.ac.kr Thu Aug 26 10:32:01 2004 From: yjcho at cs.hongik.ac.kr (Cho Yool Je) Date: Thu, 26 Aug 2004 19:32:01 +0900 Subject: [Linux-cluster] cluster.conf Message-ID: <412DBC21.8080502@cs.hongik.ac.kr> hi..everybody.. i have three machines.... i will use one server & two client2 with firewire for GFS but...i can't creat a cluster.conf.. plz show me related cluster.conf... thx... From amanthei at redhat.com Thu Aug 26 13:27:22 2004 From: amanthei at redhat.com (Adam Manthei) Date: Thu, 26 Aug 2004 08:27:22 -0500 Subject: [Linux-cluster] fence_sanbox2 In-Reply-To: <1445867274.20040826132250@hq.310.ru> References: <1445867274.20040826132250@hq.310.ru> Message-ID: <20040826132722.GB20552@redhat.com> On Thu, Aug 26, 2004 at 01:22:50PM +0400, Anton Nekhoroshikh wrote: > Hi all, > > > Have probably forgotten :) Indeed. I also forgot to add the fence_ibmblade agent too. Thanks for the catch.
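For reference, the ibmblade agent needs the same kind of one-line addition to fence/bin/Makefile as the sanbox2 hunk quoted below. This is only a sketch of the idea, not the actual commit:

    --- cluster/fence/bin/Makefile
    +++ cluster/fence/bin/Makefile
    @@ ... @@
         fence_xcat \
         fence_zvm \
    +    fence_ibmblade \
    +    fence_sanbox2 \
         fenced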
> > *** cluster/fence/bin/Makefile.orig 2004-08-26 13:23:07.084421744 +0400 > --- cluster/fence/bin/Makefile 2004-08-26 13:23:02.482121400 +0400 > *************** > *** 31,36 **** > --- 31,37 ---- > fence_wti \ > fence_xcat \ > fence_zvm \ > + fence_sanbox2 \ > fenced > > -- > e-mail: anton at hq.310.ru > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From jbrassow at redhat.com Thu Aug 26 14:26:24 2004 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Thu, 26 Aug 2004 09:26:24 -0500 Subject: [Linux-cluster] cluster.conf In-Reply-To: <412DBC21.8080502@cs.hongik.ac.kr> References: <412DBC21.8080502@cs.hongik.ac.kr> Message-ID: man 5 cluster.conf if using dlm, also man 5 cman if using gulm in place of dlm, read man 5 lock_gulmd brassow On Aug 26, 2004, at 5:32 AM, Cho Yool Je wrote: > hi..everybody.. > > i have three machines.... > i will use one server & two client2 with firewire for GFS > > but...i can't creat a cluster.conf.. > plz show me related cluster.conf... > > thx... > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From iisaman at citi.umich.edu Thu Aug 26 15:42:30 2004 From: iisaman at citi.umich.edu (Fredric Isaman) Date: Thu, 26 Aug 2004 11:42:30 -0400 (EDT) Subject: [Linux-cluster] Compile problem - kernel patches updated? In-Reply-To: <20040824160026.6BFC873D58@hormel.redhat.com> References: <20040824160026.6BFC873D58@hormel.redhat.com> Message-ID: I am trying to compile from a CVS download taken on Aug 22. (Using linux kernel 2.6.7) After patching the kernel, I try to compile in the /cluster using 'configure --kernel_src=/path/to/kernel; make install'. I get the following error: CC [M] /nfs/iisaman/gfs/cvs20040822/cluster/dlm-kernel/src/queries.o /nfs/iisaman/gfs/cvs20040822/cluster/dlm-kernel/src/queries.c: In function `remote_query': /nfs/iisaman/gfs/cvs20040822/cluster/dlm-kernel/src/queries.c:338: error: structure has no member named `lki_ownpid' The compile seems to be using the kernel source includes, which do not match those in the cluster directories. Do I need to do a diff and create my own patches, or am I doing something wrong? For example: > diff -u dlm-kernel/src/dlm.h $KERNELSRC/include/cluster/dlm.h @@ -241,7 +241,6 @@ int lki_mstlkid; /* Lock ID on master node */ int lki_parent; int lki_node; /* Originating node (not master) */ - int lki_ownpid; /* Owner pid on originating node */ uint8_t lki_state; /* Queue the lock is on */ uint8_t lki_grmode; /* Granted mode */ uint8_t lki_rqmode; /* Requested mode */ Thanks, Fred From tomc at teamics.com Thu Aug 26 16:32:37 2004 From: tomc at teamics.com (tomc at teamics.com) Date: Thu, 26 Aug 2004 10:32:37 -0600 Subject: [Linux-cluster] What is this GFS pipe doing here: Message-ID: in /tmp I found this pipe. It appears quite old. Any idea what it's for? prw------- 1 root root 0 May 22 06:18 fence.manual.fifo There is one on each of the GFS nodes except the current master. (Can I)/(Should I) delete it? tc From yjcho at cs.hongik.ac.kr Thu Aug 26 17:29:15 2004 From: yjcho at cs.hongik.ac.kr (Cho Yool Je) Date: Fri, 27 Aug 2004 02:29:15 +0900 Subject: [Linux-cluster] cluster.conf Message-ID: <412E1DEB.2030600@cs.hongik.ac.kr> thx a lot...but...i refereced "man 5 cluster.conf" & "man 5 lock_gulmd" some ago... (i'm using gulm...) when i excuted lock_gulmd, result is... 
============================================================================= [root at client1 root]# lock_gulmd I cannot find the name for ip "servers=client1". gf->node_cnt = 0 In src/config_main.c:332 (DEVEL.1093502033) death by: ASSERTION FAILED: gf->node_cnt > 0 && gf->node_cnt < 5 && gf->node_cnt != 2 I cannot find the name for ip "servers=client1". gf->node_cnt = 0 In src/config_main.c:332 (DEVEL.1093502033) death by: ASSERTION FAILED: gf->node_cnt > 0 && gf->node_cnt < 5 && gf->node_cnt != 2 [root at client1 root]# I cannot find the name for ip "servers=client1". gf->node_cnt = 0 In src/config_main.c:332 (DEVEL.1093502033) death by: ASSERTION FAILED: gf->node_cnt > 0 && gf->node_cnt < 5 && gf->node_cnt != 2 ============================================================================= my cluster.conf is ... ============================================================================= [root at client1 root]# cat /etc/cluster/cluster.conf client1 client2 ============================================================================ i registerd ip of client1 & client2 in /etc/hosts instead of DNS (and testing with only two nodes..) plz give me a advice... thx... -------------------------------------------- man 5 cluster.conf if using dlm, also man 5 cman if using gulm in place of dlm, read man 5 lock_gulmd brassow On Aug 26, 2004, at 5:32 AM, Cho Yool Je wrote: hi..everybody.. i have three machines.... i will use one server & two client2 with firewire for GFS but...i can't creat a cluster.conf.. plz show me related cluster.conf... thx... -- Linux-cluster mailing list Linux-cluster redhat com http://www.redhat.com/mailman/listinfo/linux-cluster From alewis at redhat.com Thu Aug 26 17:35:25 2004 From: alewis at redhat.com (AJ Lewis) Date: Thu, 26 Aug 2004 12:35:25 -0500 Subject: [Linux-cluster] cluster.conf In-Reply-To: <412E1DEB.2030600@cs.hongik.ac.kr> References: <412E1DEB.2030600@cs.hongik.ac.kr> Message-ID: <20040826173525.GC13272@null.msp.redhat.com> On Fri, Aug 27, 2004 at 02:29:15AM +0900, Cho Yool Je wrote: > thx a lot...but...i refereced "man 5 cluster.conf" & "man 5 lock_gulmd" > some ago... > (i'm using gulm...) > > when i excuted lock_gulmd, result is... > ============================================================================= > [root at client1 root]# lock_gulmd > I cannot find the name for ip "servers=client1". There was a bug introduced in ccs that has since been fixed - grab the latest cvs code and recompile, and it should work. > my cluster.conf is ... > ============================================================================= > [root at client1 root]# cat /etc/cluster/cluster.conf > > > > > client1 client2 > > > > > > > > > > > > > > > > > > > > > > > > ============================================================================ > > i registerd ip of client1 & client2 in /etc/hosts instead of DNS > (and testing with only two nodes..) > > plz give me a advice... > > thx... > > > > > > > -------------------------------------------- > man 5 cluster.conf > > if using dlm, also > > man 5 cman > > if using gulm in place of dlm, read > > man 5 lock_gulmd > > brassow > > On Aug 26, 2004, at 5:32 AM, Cho Yool Je wrote: > > hi..everybody.. > > i have three machines.... > i will use one server & two client2 with firewire for GFS > > > but...i can't creat a cluster.conf.. > plz show me related cluster.conf... > > > thx... 
> > -- > Linux-cluster mailing list > Linux-cluster redhat com > http://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- AJ Lewis Voice: 612-638-0500 Red Hat Inc. E-Mail: alewis at redhat.com 720 Washington Ave. SE, Suite 200 Minneapolis, MN 55414 Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715 Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the many keyservers out there... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From ben.m.cahill at intel.com Thu Aug 26 17:55:29 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Thu, 26 Aug 2004 10:55:29 -0700 Subject: [Linux-cluster] Can anyone (Ken?) explain why num_glockd mount option is there? TIA. EOM. Message-ID: <0604335B7764D141945E202153105960033E2520@orsmsx404.amr.corp.intel.com> From john.l.villalovos at intel.com Thu Aug 26 18:20:04 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Thu, 26 Aug 2004 11:20:04 -0700 Subject: [Linux-cluster] Subversion? Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B01C7F989@orsmsx410> linux-cluster-bounces at redhat.com wrote: > On Tuesday 24 August 2004 10:48, Lon Hohberger wrote: >> It was _designed_ to handle distributed repositories (like BK). > > Well, what wind is blowing, seems to be blowing in the direction of > Arch. I'd be equally happy with either, and in any case, > much happier > than with CVS. Does anybody else have a strong opinion? I'd prefer to use Subversion. It works through our proxy servers. We already use it for some projects we connect to. I guess it depends on what you think the development methodology will be. If you think it will be this great big distributed development with tons of merging of people's patches from all over the place then probably something like Bitkeeper or GNU Arch. If you are going to stick with your centralized development model then CVS or Subversion is probably the way to go. Plus Subversion comes with Fedora Core 2 by default. Not sure about GNU Arch. The change from CVS to SVN (Subversion) is very very easy. I am not sure that we can say the same about going to GNU Arch. (Note: I have never used GNU Arch). Here is some articles on Arch versus Subversion: http://web.mit.edu/ghudson/thoughts/undiagnosing http://web.mit.edu/ghudson/thoughts/diagnosing http://www.reverberate.org/computers/ArchAndSVN.html John From kpreslan at redhat.com Thu Aug 26 18:48:39 2004 From: kpreslan at redhat.com (Ken Preslan) Date: Thu, 26 Aug 2004 13:48:39 -0500 Subject: [Linux-cluster] Can anyone (Ken?) explain why num_glockd mount option is there? TIA. EOM. In-Reply-To: <0604335B7764D141945E202153105960033E2520@orsmsx404.amr.corp.intel.com> References: <0604335B7764D141945E202153105960033E2520@orsmsx404.amr.corp.intel.com> Message-ID: <20040826184839.GA18435@potassium.msp.redhat.com> On Thu, Aug 26, 2004 at 10:55:29AM -0700, Cahill, Ben M wrote: > Can anyone (Ken?) explain why num_glockd mount > option is there? TIA. EOM. One of the things (probably the major thing) that GFS uses memory for that's different from other filesystems is locks. That introduces the interesting property that in order for GFS to free that memory, it has to do network I/O to unlock locks. (The same is true for memory that contains dirty disk blocks, but the VM knows about that.) 
This means that freeing memory can only happen so quickly. In the past, GFS had one thread (gfs_glockd) that would scan through the glock hash table looking for cached locks that were no longer needed. It would then unlock those locks and free the memory associated with them. But as it turned out, there were memory problems, if you had a few processes that were scanning through huge directory trees. You could get into a situation where you were acquiring new locks much faster than gfs_glockd could release old ones. (There were many threads acquiring, but only one thread releasing.) If this kept up, you'd soon run out of memory and bad things would happen. My first quick-n-dirty solution to this was to add the num_glockd mount option. It would create many many threads that would look for unused locks and unlock them. You could balance out acquiring and releasing processes if you knew what workload you were running. So, that's why the option was there originally. Multiple gfs_glockd processes didn't completely solve the problem, though. You could still get into the situation where GFS wasn't responding quickly enough to memory pressure. So I made a bunch of changes that made things a lot better: 1) I broke gfs_glockd into two threads: A) gfs_scand - scans the glock hash table looking for glocks to demote. When it finds one, it puts it onto a reclaim list of unneeded locks, and wakes up gfs_glockd. B) gfs_glockd - looks at that the glocks on that reclaim list and starts demoting them. 2) When the number of locks on the reclaim list becomes too great, threads that want to acquire new locks will pitch in and release a couple of locks before acquiring a new one. 3) GFS is more proactive about putting locks that it knows it won't need onto the reclaim list. This reduces the need for gfs_scand and actually walking the hash table. There's more work that could be done in this area. So, I left the num_glockd option there to in case it's still needed for some reason. But because of #2 above, I don't think it will be. The option may go away in the future. -- Ken Preslan From anton at hq.310.ru Thu Aug 26 21:01:15 2004 From: anton at hq.310.ru (anton at hq.310.ru) Date: Fri, 27 Aug 2004 01:01:15 +0400 Subject: [Linux-cluster] fence init problem Message-ID: <20040827010115.spihyrcdxc8880wc@mail.310.ru> hi all gfs from cvs after run fence_tool join in /var/log/messages i see ccsd[10248]: Error while processing connect: Connection refused last message repeated 31 times last message repeated 61 time .... # strace -p 10248 (ccsd) [pid 10248] select(1024, [6 9], NULL, NULL, NULL [pid 10249] futex(0x4bfcab28, FUTEX_WAIT, 2, NULL [pid 10248] <... 
select resumed> ) = 1 (in [6]) [pid 10248] accept(6, {sa_family=AF_INET, sin_port=htons(855), sin_addr=inet_addr("127.0.0.1")}, [16]) = 11 [pid 10248] read(11, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20 [pid 10248] time([1093554303]) = 1093554303 [pid 10248] rt_sigaction(SIGPIPE, {0x298450, [], 0}, {SIG_DFL}, 8) = 0 [pid 10248] send(12, "<27>Aug 27 01:05:03 ccsd[10248]:"..., 84, 0) = 84 [pid 10248] rt_sigaction(SIGPIPE, {SIG_DFL}, NULL, 8) = 0 [pid 10248] write(11, "\1\0\0\0\0\0\0\0\0\0\0\0\221\377\377\377\0\0\0\0", 20) = 20 [pid 10248] close(11) = 0 # strace -p 10269 (fenced) Process 10269 attached - interrupt to quit setup() = 0 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 1 bind(1, {sa_family=AF_INET, sin_port=htons(902), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 connect(1, {sa_family=AF_INET, sin_port=htons(50006), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 write(1, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20 read(1, "\1\0\0\0\0\0\0\0\0\0\0\0\221\377\377\377\0\0\0\0", 20) = 20 close(1) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 nanosleep({1, 0}, {1, 0}) = 0 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 1 bind(1, {sa_family=AF_INET, sin_port=htons(903), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 connect(1, {sa_family=AF_INET, sin_port=htons(50006), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 write(1, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20 read(1, "\1\0\0\0\0\0\0\0\0\0\0\0\221\377\377\377\0\0\0\0", 20) = 20 close(1) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 nanosleep({1, 0}, Process 10269 detached In what a problem? ---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From amir at datacore.ch Thu Aug 26 23:34:44 2004 From: amir at datacore.ch (Amir Guindehi) Date: Fri, 27 Aug 2004 01:34:44 +0200 Subject: [Linux-cluster] GFS configuration for 2 node Cluster In-Reply-To: <1093270294.3467.26.camel@atlantis.boston.redhat.com> References: <41241D63.5090102@net4india.net> <002001c486b5$46ab23c0$f13cc90a@druzhba.com> <41266989.2070101@datacore.ch> <1093270294.3467.26.camel@atlantis.boston.redhat.com> Message-ID: <412E7394.60505@datacore.ch> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Lon, | I think he meant for 6.0.0, which is the pappy of linux-cluster. I | don't think you can do it with 6.0.0. Uops. I'm sorry, seems I missed that part. - - Amir - -- Amir Guindehi, nospam.amir at datacore.ch DataCore GmbH, Witikonerstrasse 289, 8053 Zurich, Switzerland -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (MingW32) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBLnOSbycOjskSVCwRAveaAJ4xbUWvk9O/QFl3dvHNOHxWVy1lowCdFTda x5UCHyfV9pNG3DENBleMPZM= =21ec -----END PGP SIGNATURE----- From erik at debian.franken.de Fri Aug 27 00:10:04 2004 From: erik at debian.franken.de (Erik Tews) Date: Fri, 27 Aug 2004 02:10:04 +0200 Subject: [Linux-cluster] Subversion? In-Reply-To: <60C14C611F1DDD4198D53F2F43D8CA3B01C7F989@orsmsx410> References: <60C14C611F1DDD4198D53F2F43D8CA3B01C7F989@orsmsx410> Message-ID: <1093565404.13004.42.camel@localhost.localdomain> Am Do, den 26.08.2004 schrieb Villalovos, John L um 20:20: > linux-cluster-bounces at redhat.com wrote: > > On Tuesday 24 August 2004 10:48, Lon Hohberger wrote: > >> It was _designed_ to handle distributed repositories (like BK). 
> > > > Well, what wind is blowing, seems to be blowing in the direction of > > Arch. I'd be equally happy with either, and in any case, > > much happier > > than with CVS. Does anybody else have a strong opinion? > > I'd prefer to use Subversion. It works through our proxy servers. We > already use it for some projects we connect to. Wait, I had a problem here, my university seems to have any kind of cisco transparent proxy which somehow has eaten my subversion-requests (some strange errors in the client, usually only on commit, update worked fine), after I moved my server away from port 80 the problem disappeard. I don't know what they were doing (I am only a student, not an administrator). > If you are going to stick with your centralized development model then > CVS or Subversion is probably the way to go. Subversion > Plus Subversion comes with Fedora Core 2 by default. Not sure about GNU > Arch. > > The change from CVS to SVN (Subversion) is very very easy. I am not > sure that we can say the same about going to GNU Arch. (Note: I have > never used GNU Arch). Thats really true, if you have used cvs before, you need round about 5-10 minutes untill you can do all the things an average cvs user does day by day. From teigland at redhat.com Fri Aug 27 03:13:22 2004 From: teigland at redhat.com (David Teigland) Date: Fri, 27 Aug 2004 11:13:22 +0800 Subject: [Linux-cluster] fence init problem In-Reply-To: <20040827010115.spihyrcdxc8880wc@mail.310.ru> References: <20040827010115.spihyrcdxc8880wc@mail.310.ru> Message-ID: <20040827031322.GC18381@redhat.com> On Fri, Aug 27, 2004 at 01:01:15AM +0400, anton at hq.310.ru wrote: > hi all > > gfs from cvs > > after run fence_tool join > in /var/log/messages i see > ccsd[10248]: Error while processing connect: Connection refused This is probably the same old problem everyone else ran into where ccsd is finding bad/old magma libs that were left behind in /lib instead of the new ones installed to /usr/lib. You may need to go through /lib and remove anything that was previously installed. -- Dave Teigland From teigland at redhat.com Fri Aug 27 03:01:30 2004 From: teigland at redhat.com (David Teigland) Date: Fri, 27 Aug 2004 11:01:30 +0800 Subject: [Linux-cluster] What is this GFS pipe doing here: In-Reply-To: References: Message-ID: <20040827030130.GB18381@redhat.com> On Thu, Aug 26, 2004 at 10:32:37AM -0600, tomc at teamics.com wrote: > in /tmp I found this pipe. It appears quite old. Any idea what it's > for? > > prw------- 1 root root 0 May 22 06:18 > fence.manual.fifo > > There is one on each of the GFS nodes except the current master. (Can > I)/(Should I) delete it? It's left over from a fence_manual that was never completed. You can delete it, but it won't harm anything if you don't. -- Dave Teigland From ben.m.cahill at intel.com Fri Aug 27 04:42:23 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Thu, 26 Aug 2004 21:42:23 -0700 Subject: [Linux-cluster] fence init problem Message-ID: <0604335B7764D141945E202153105960033E252B@orsmsx404.amr.corp.intel.com> Just as a suggestion ... It would probably save everyone some time and frustration if you could add a few hints in usage.txt about "well-known" (but not well enough) gotchas like this ... I tried to do that as much as possible with "NOTE:", "HINT:" and "Check for success" verbiage in the HOWTOs in OpenGFS (http://opengfs.sourceforge.net/docs.php (e.g. "HOWTO Build and Install OpenGFS (nopool, new)", but I don't have enough experience with the RH stack to contribute much yet. 
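In the same spirit, here is a minimal sketch of a pre-flight check for the stale-library gotcha Dave describes above; it only lists candidate leftovers rather than removing them, and the paths are the ones that come up in this thread (a current build is assumed to install into /usr/lib):

  # stale cluster libraries left behind in /lib by an older install
  ls -l /lib/libmagma* /lib/libdlm* 2>/dev/null
  ls -ld /lib/magma /lib/magma/plugins 2>/dev/null
  # the freshly installed copies, for comparison
  ls -l /usr/lib/libmagma* /usr/lib/libdlm* 2>/dev/null

Anything present under /lib but not under /usr/lib is a likely leftover.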
-- Ben -- Opinions are mine, not Intel's > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Teigland > Sent: Thursday, August 26, 2004 11:13 PM > To: anton at hq.310.ru > Cc: linux-cluster at redhat.com > Subject: Re: [Linux-cluster] fence init problem > > > On Fri, Aug 27, 2004 at 01:01:15AM +0400, anton at hq.310.ru wrote: > > hi all > > > > gfs from cvs > > > > after run fence_tool join > > in /var/log/messages i see > > ccsd[10248]: Error while processing connect: Connection refused > > This is probably the same old problem everyone else ran into where > ccsd is finding bad/old magma libs that were left behind in /lib > instead of the new ones installed to /usr/lib. You may need to go > through /lib and remove anything that was previously installed. > > -- > Dave Teigland > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From andriy at druzhba.lviv.ua Fri Aug 27 09:03:32 2004 From: andriy at druzhba.lviv.ua (Andriy Galetski) Date: Fri, 27 Aug 2004 12:03:32 +0300 Subject: [Linux-cluster] Linux cluster startup Message-ID: <010001c48c14$baae0da0$f13cc90a@druzhba.com> Hi ! Can anyone tell me how to get mount GFS partition when GFS (latest CVS version with CMAN and DLM) instaled to only one from 2 nodes future cluster system. Do I need to install GFS on both nodes ? Now when I do fence_tool join .... Geting ccsd[3164]: Cluster is not quorate. Refusing connection. Thanks From lhh at redhat.com Fri Aug 27 13:10:50 2004 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 27 Aug 2004 09:10:50 -0400 Subject: [Linux-cluster] fence init problem In-Reply-To: <20040827031322.GC18381@redhat.com> References: <20040827010115.spihyrcdxc8880wc@mail.310.ru> <20040827031322.GC18381@redhat.com> Message-ID: <1093612250.17473.85.camel@atlantis.boston.redhat.com> On Fri, 2004-08-27 at 11:13 +0800, David Teigland wrote: > This is probably the same old problem everyone else ran into where > ccsd is finding bad/old magma libs that were left behind in /lib > instead of the new ones installed to /usr/lib. You may need to go > through /lib and remove anything that was previously installed. Specficially: rm -f /lib/libmagma* /lib/magma/plugins/* /lib/magma/plugins /lib/magma Should do the trick. -- Lon From lhh at redhat.com Fri Aug 27 13:44:04 2004 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 27 Aug 2004 09:44:04 -0400 Subject: [Linux-cluster] fence init problem In-Reply-To: <105310906.20040827171824@hq.310.ru> References: <20040827010115.spihyrcdxc8880wc@mail.310.ru> <20040827031322.GC18381@redhat.com> <1093612250.17473.85.camel@atlantis.boston.redhat.com> <105310906.20040827171824@hq.310.ru> Message-ID: <1093614244.17473.88.camel@atlantis.boston.redhat.com> On Fri, 2004-08-27 at 17:18 +0400, ????? ????????? wrote: > Hi Lon, > Lon Hohberger> Specficially: > > Lon Hohberger> rm -f /lib/libmagma* > Lon Hohberger> /lib/magma/plugins/* /lib/magma/plugins /lib/magma > > and /lib/libdlm* :) Yes, and that! -- Lon From andriy at druzhba.lviv.ua Fri Aug 27 15:08:27 2004 From: andriy at druzhba.lviv.ua (Andriy Galetski) Date: Fri, 27 Aug 2004 18:08:27 +0300 Subject: [Linux-cluster] System can not join to fance domain (Quorum is Ok !) References: <010001c48c14$baae0da0$f13cc90a@druzhba.com> Message-ID: <014d01c48c47$b71fae90$f13cc90a@druzhba.com> Hi again ! 
Now I setup GFS for 2 nodes in exactly the same way like in http://sources.redhat.com/cluster/doc/usage.txt # cat /proc/cluster/status /proc/cluster/nodes Version: 2.0.1 Config version: 1 Cluster name: alpha Cluster ID: 3169 Membership state: Cluster-Member Nodes: 2 Expected_votes: 1 Total_votes: 2 Quorum: 1 Active subsystems: 0 Node addresses: 10.201.60.12 192.168.0.10 Node Votes Exp Sts Name 1 1 1 M cl10 2 1 1 M cl20 But when I try fence_tool join get errors: Aug 27 17:50:33 cl10 ccsd[5031]: Error while processing connect: Connection refused Aug 27 17:50:34 cl10 ccsd[5031]: Cluster is not quorate. Refusing connection. Why System can not join to fance domain ? The Quorum is Ok ! My /etc/cluster/cluster.conf : Thanks for any Help. From anton at hq.310.ru Fri Aug 27 15:25:32 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Fri, 27 Aug 2004 19:25:32 +0400 Subject: [Linux-cluster] System can not join to fance domain (Quorum is Ok !) In-Reply-To: <014d01c48c47$b71fae90$f13cc90a@druzhba.com> References: <010001c48c14$baae0da0$f13cc90a@druzhba.com> <014d01c48c47$b71fae90$f13cc90a@druzhba.com> Message-ID: <1507579153.20040827192532@hq.310.ru> ?????? ???? Andriy, Friday, August 27, 2004, 7:08:27 PM, you wrote: rm -f /lib/libmagma* /lib/magma/plugins/* /lib/magma/plugins /lib/magma /lib/libdlm* and try again :) Andriy Galetski> Hi again ! Andriy Galetski> Now I setup GFS for 2 nodes in Andriy Galetski> exactly the same way like in Andriy Galetski> http://sources.redhat.com/cluster/doc/usage.txt Andriy Galetski> # cat /proc/cluster/status /proc/cluster/nodes Andriy Galetski> Version: 2.0.1 Andriy Galetski> Config version: 1 Andriy Galetski> Cluster name: alpha Andriy Galetski> Cluster ID: 3169 Andriy Galetski> Membership state: Cluster-Member Andriy Galetski> Nodes: 2 Andriy Galetski> Expected_votes: 1 Andriy Galetski> Total_votes: 2 Andriy Galetski> Quorum: 1 Andriy Galetski> Active subsystems: 0 Andriy Galetski> Node addresses: 10.201.60.12 192.168.0.10 Andriy Galetski> Node Votes Exp Sts Name Andriy Galetski> 1 1 1 M cl10 Andriy Galetski> 2 1 1 M cl20 Andriy Galetski> But when I try fence_tool join Andriy Galetski> get errors: Andriy Galetski> Aug 27 17:50:33 cl10 ccsd[5031]: Andriy Galetski> Error while processing connect: Connection Andriy Galetski> refused Andriy Galetski> Aug 27 17:50:34 cl10 ccsd[5031]: Andriy Galetski> Cluster is not quorate. Refusing Andriy Galetski> connection. Andriy Galetski> Why System can not join to fance domain ? Andriy Galetski> The Quorum is Ok ! Andriy Galetski> My /etc/cluster/cluster.conf : Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Andriy Galetski> Thanks for any Help. Andriy Galetski> -- Andriy Galetski> Linux-cluster mailing list Andriy Galetski> Linux-cluster at redhat.com Andriy Galetski> http://www.redhat.com/mailman/listinfo/linux-cluster -- ? ?????????, ????? ????????? ???????????? ??????? ?????? ??????? ???????????????????? ?????? ???. (095) 363 3 310 ???? 
(095) 363 3 310 e-mail: anton at hq.310.ru http://www.310.ru From yjcho at cs.hongik.ac.kr Fri Aug 27 21:27:16 2004 From: yjcho at cs.hongik.ac.kr (Cho Yool Je) Date: Sat, 28 Aug 2004 06:27:16 +0900 Subject: [Linux-cluster] erro log... Message-ID: <412FA734.6090701@cs.hongik.ac.kr> hi~ when i excute "mount -t gfs /dev/sda1 /mnt/gfs", my log is written Aug 28 06:22:04 gfs lock_gulmd_core[1894]: ERROR [src/core_io.c:1317] Node (client1.cs.xxxx.ac.kr ::ffff:xxx.xxx.xxx.xxx) has been denied from connecting here. what does that mean? thx... From adam.cassar at netregistry.com.au Mon Aug 30 02:25:20 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Mon, 30 Aug 2004 12:25:20 +1000 Subject: [Linux-cluster] Error after upgrading from CVS Message-ID: <1093832720.28391.26.camel@akira2.nro.au.com> What does the following mean? CMAN: Waiting to join or form a Linux-cluster CMAN: sending membership request CMAN: sending membership request CMAN: got node cluster3 CMAN: got node cluster1 CMAN: quorum regained, resuming activity CMAN: killed by STARTTRANS or NOMINATE CMAN: we are leaving the cluster SM: 00000000 sm_stop: SG still joined SM: send_nodeid_message error -107 to 2 SM: send_broadcast_message error -107 From adam.cassar at netregistry.com.au Mon Aug 30 03:19:29 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Mon, 30 Aug 2004 13:19:29 +1000 Subject: [Linux-cluster] Error after upgrading from CVS In-Reply-To: <1093832720.28391.26.camel@akira2.nro.au.com> References: <1093832720.28391.26.camel@akira2.nro.au.com> Message-ID: <1093835969.28391.37.camel@akira2.nro.au.com> I found the problem. One of the hosts was still using the old kernel modules. On Mon, 2004-08-30 at 12:25, Adam Cassar wrote: > What does the following mean? > > CMAN: Waiting to join or form a Linux-cluster > CMAN: sending membership request > CMAN: sending membership request > CMAN: got node cluster3 > CMAN: got node cluster1 > CMAN: quorum regained, resuming activity > CMAN: killed by STARTTRANS or NOMINATE > CMAN: we are leaving the cluster > SM: 00000000 sm_stop: SG still joined > SM: send_nodeid_message error -107 to 2 > SM: send_broadcast_message error -107 > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Cassar IT Manager NetRegistry Pty Ltd ______________________________________________ http://www.netregistry.com.au Tel: 02 9699 6099 Fax: 02 9699 6088 PO Box 270 Broadway NSW 2007 Domains |Business Email|Web Hosting|E-Commerce Trusted by 10,000s of businesses since 1997 ______________________________________________ From tomc at teamics.com Mon Aug 30 03:59:05 2004 From: tomc at teamics.com (tomc at teamics.com) Date: Sun, 29 Aug 2004 22:59:05 -0500 Subject: [Linux-cluster] tunables question Message-ID: I am using an IBM FastT 200 and QLA2200 adapters with Sistina GFS. Performance varies wildly between very good to abysmal. Any suggestions on tuning (queue depth, buffering, etc)? Any good docs available on tuning, tweaking and troublehsooting? (Other than the Admin guide, I already read that.) tc From adam.cassar at netregistry.com.au Mon Aug 30 04:13:11 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Mon, 30 Aug 2004 14:13:11 +1000 Subject: [Linux-cluster] tunables question In-Reply-To: References: Message-ID: <1093839191.28391.62.camel@akira2.nro.au.com> Check out the IBM redbooks for FASTt performance tuning. comp.unix.aix and comp.arch.storage are where all the FASTt people hang out. 
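Before digging into the controller itself, a few host-side knobs are worth a look. A minimal sketch, with /dev/sda and /mnt/gfs standing in for the actual pool device and mount point (both names are placeholders; substitute your own):

  # current read-ahead on the block device, in 512-byte sectors
  blockdev --getra /dev/sda
  # raise it for large sequential transfers
  blockdev --setra 1024 /dev/sda
  # list GFS's own tunables on a mounted filesystem
  gfs_tool gettune /mnt/gfs
  # and change one with: gfs_tool settune /mnt/gfs <parameter> <value>

The controller-side settings (cache block size, segment size, read-ahead) are the ones covered in the redbooks mentioned above.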
On Mon, 2004-08-30 at 13:59, tomc at teamics.com wrote: > I am using an IBM FastT 200 and QLA2200 adapters with Sistina GFS. > Performance varies wildly between very good to abysmal. Any suggestions > on tuning (queue depth, buffering, etc)? Any good docs available on > tuning, tweaking and troublehsooting? (Other than the Admin guide, I > already read that.) > > tc > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From anton at hq.310.ru Mon Aug 30 09:30:11 2004 From: anton at hq.310.ru (=?Windows-1251?B?wO3y7u0gzeX17vDu+Oj1?=) Date: Mon, 30 Aug 2004 13:30:11 +0400 Subject: [Linux-cluster] immutable flag on gfs Message-ID: <12410128065.20040830133011@hq.310.ru> Hi all, Guys, it is very necessary to set immutable a flag on GFS, how? -- e-mail: anton at hq.310.ru From mtilstra at redhat.com Mon Aug 30 16:23:04 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Mon, 30 Aug 2004 11:23:04 -0500 Subject: [Linux-cluster] erro log... In-Reply-To: <412FA734.6090701@cs.hongik.ac.kr> References: <412FA734.6090701@cs.hongik.ac.kr> Message-ID: <20040830162304.GB6777@redhat.com> On Sat, Aug 28, 2004 at 06:27:16AM +0900, Cho Yool Je wrote: > when i excute "mount -t gfs /dev/sda1 /mnt/gfs", my log is written > > Aug 28 06:22:04 gfs lock_gulmd_core[1894]: ERROR [src/core_io.c:1317] > Node (client1.cs.xxxx.ac.kr ::ffff:xxx.xxx.xxx.xxx) has been denied from > connecting here. > > what does that mean? it means that: 1) gulm failed to match the name and ip from /etc/resolv.conf (probably not the reason here) 2) tcpwrappers was configured not to allow that node to connect. (unless you fiddled with tcpwrappers, not the problem either.) 3) Node entry in ccs doesn't correctly match or is missing. (this is probably your problem.) An other slightly realated problem is if you have a nodes host name mapped to 127.0.0.1 in the /etc/hosts file. -- Michael Conrad Tadpol Tilstra At night as I lay in bed looking at the stars I thought 'Where the hell is the ceiling?' -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From john.l.villalovos at intel.com Mon Aug 30 18:51:22 2004 From: john.l.villalovos at intel.com (Villalovos, John L) Date: Mon, 30 Aug 2004 11:51:22 -0700 Subject: [Linux-cluster] Build & Installation instructions for GNBD? Message-ID: <60C14C611F1DDD4198D53F2F43D8CA3B01CC9E77@orsmsx410> Are there any build and installation instructions for GNBD? The documentation file ( http://sources.redhat.com/cluster/doc/usage.txt ) does not mention a word about GNBD. Thanks, John From ecashin at coraid.com Mon Aug 30 21:30:29 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Mon, 30 Aug 2004 17:30:29 -0400 Subject: [Linux-cluster] Re: Error compiling GFS patched kernel References: <001d01c48230$a4877b80$0a14a8c0@venus.it> Message-ID: <87u0ukl3ey.fsf@coraid.com> "Angelo Ovidi" writes: > Hi. > > I am trying to compile a 2.6.7 kernel patched with cvs version of > cluster package of redhat. > > I have no error applying the patches but the compile give me this > error: I get the same error using the "Method 2: using kernel patches" procedure from http://sources.redhat.com/cluster/doc/usage.txt with this tarball: cluster_0406282100.tgz ... 
> CC [M] fs/gfs/inode.o > fs/gfs/inode.c: In function `inode_init_and_link': > fs/gfs/inode.c:1214: invalid lvalue in unary `&' > fs/gfs/inode.c: In function `inode_alloc_hidden': > fs/gfs/inode.c:1933: invalid lvalue in unary `&' > make[2]: *** [fs/gfs/inode.o] Error 1 > make[1]: *** [fs/gfs] Error 2 > make: *** [fs] Error 2 -- Ed L Cashin From jens.dreger at physik.fu-berlin.de Mon Aug 30 21:54:02 2004 From: jens.dreger at physik.fu-berlin.de (Jens Dreger) Date: Mon, 30 Aug 2004 23:54:02 +0200 Subject: [Linux-cluster] Re: Error compiling GFS patched kernel In-Reply-To: <87u0ukl3ey.fsf@coraid.com> References: <001d01c48230$a4877b80$0a14a8c0@venus.it> <87u0ukl3ey.fsf@coraid.com> Message-ID: <20040830215402.GT3794@smart.physik.fu-berlin.de> On Mon, Aug 30, 2004 at 05:30:29PM -0400, Ed L Cashin wrote: > "Angelo Ovidi" writes: > > > Hi. > > > > I am trying to compile a 2.6.7 kernel patched with cvs version of > > cluster package of redhat. > > > > I have no error applying the patches but the compile give me this > > error: > > I get the same error using the "Method 2: using kernel patches" > procedure from http://sources.redhat.com/cluster/doc/usage.txt > with this tarball: > > cluster_0406282100.tgz Try upgrading gcc. I got that error with gcc 2.95. Upgrading to gcc >3 solved the problem. HTH, Jens. From arekm at pld-linux.org Mon Aug 30 23:35:32 2004 From: arekm at pld-linux.org (Arkadiusz Miskiewicz) Date: Tue, 31 Aug 2004 01:35:32 +0200 Subject: [Linux-cluster] [PATCH]: avoid local_nodeid conflict with ia64/numa define Message-ID: <200408310135.32411.arekm@pld-linux.org> Little patch by qboosh at pld-linux.org: - avoid local_nodeid conflict with ia64/numa define http://cvs.pld-linux.org/cgi-bin/cvsweb/SOURCES/linux-cluster-dlm.patch?r1=1.1.2.3&r2=1.1.2.4 -- Arkadiusz Mi?kiewicz CS at FoE, Wroclaw University of Technology arekm.pld-linux.org, 1024/3DB19BBD, JID: arekm.jabber.org, PLD/Linux From adam.cassar at netregistry.com.au Tue Aug 31 03:50:58 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Tue, 31 Aug 2004 13:50:58 +1000 Subject: [Linux-cluster] repeatable assertion failure extending file system while active Message-ID: <1093924258.28391.200.camel@akira2.nro.au.com> Hi Guys, I have a repeatable assertion failure. machine1:/mnt# bonnie++ -u 0 Using uid:0, gid:0. Writing with putc()... ------------- machine2# /usr/local/lvm/sbin/lvextend -L +1G /dev/GIPETE/lv0 Using stripesize of last segment 64KB Extending logical volume lv0 to 24.00 GB Logical volume lv0 successfully resized machine2# /usr/local/gfs/sbin/gfs_jadd -j 2 /mnt FS: Mount Point: /mnt FS: Device: /dev/GIPETE/lv0 FS: Options: rw,noatime,nodiratime FS: Size: 5242880 DEV: Size: 6291456 Preparing to write new FS information... Done. ------------ machine2 # /usr/local/gfs/sbin/gfs_grow /mnt FS: Mount Point: /mnt FS: Device: /dev/GIPETE/lv0 FS: Options: rw,noatime,nodiratime FS: Size: 5308416 DEV: Size: 6291456 Preparing to write new FS information... Done. 
------------- machine1# attempt to access beyond end of device dm-0: rw=0, want=48380304, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380312, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380320, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380328, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380336, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380344, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380352, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380360, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380368, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380376, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380384, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380392, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380400, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380408, limit=44040192 attempt to access beyond end of device dm-0: rw=0, want=48380416, limit=44040192 GFS: fsid=cluster:donkey.0: I/O error on block 6047551 GFS: Assertion failed on line 307 of file /usr/src/GFS/cluster/gfs-kernel/src/gfs/util.c GFS: assertion: "FALSE" GFS: time = 1093923913 GFS: fsid=cluster:donkey.0 Kernel panic: GFS: Record message above and reboot. From adam.cassar at netregistry.com.au Tue Aug 31 05:12:40 2004 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Tue, 31 Aug 2004 15:12:40 +1000 Subject: [Linux-cluster] fencing behaviour In-Reply-To: <1093924258.28391.200.camel@akira2.nro.au.com> References: <1093924258.28391.200.camel@akira2.nro.au.com> Message-ID: <1093929160.28391.230.camel@akira2.nro.au.com> The below node in question does not get fenced by the other nodes! It only occurs after a reboot and it joins the cluster. Only then does the other node want to fence it. Is this normal? On Tue, 2004-08-31 at 13:50, Adam Cassar wrote: > Hi Guys, > > I have a repeatable assertion failure. > > machine1:/mnt# bonnie++ -u 0 > Using uid:0, gid:0. > Writing with putc()... > > ------------- > > machine2# /usr/local/lvm/sbin/lvextend -L +1G /dev/GIPETE/lv0 > Using stripesize of last segment 64KB > Extending logical volume lv0 to 24.00 GB > Logical volume lv0 successfully resized > > machine2# /usr/local/gfs/sbin/gfs_jadd -j 2 /mnt > FS: Mount Point: /mnt > FS: Device: /dev/GIPETE/lv0 > FS: Options: rw,noatime,nodiratime > FS: Size: 5242880 > DEV: Size: 6291456 > Preparing to write new FS information... > Done. > > ------------ > > machine2 # /usr/local/gfs/sbin/gfs_grow /mnt > FS: Mount Point: /mnt > FS: Device: /dev/GIPETE/lv0 > FS: Options: rw,noatime,nodiratime > FS: Size: 5308416 > DEV: Size: 6291456 > Preparing to write new FS information... > Done. 
> > ------------- > > machine1# > > attempt to access beyond end of device > dm-0: rw=0, want=48380304, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380312, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380320, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380328, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380336, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380344, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380352, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380360, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380368, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380376, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380384, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380392, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380400, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380408, limit=44040192 > attempt to access beyond end of device > dm-0: rw=0, want=48380416, limit=44040192 > GFS: fsid=cluster:donkey.0: I/O error on block 6047551 > > GFS: Assertion failed on line 307 of file > /usr/src/GFS/cluster/gfs-kernel/src/gfs/util.c > GFS: assertion: "FALSE" > GFS: time = 1093923913 > GFS: fsid=cluster:donkey.0 > > Kernel panic: GFS: Record message above and reboot. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Cassar IT Manager NetRegistry Pty Ltd ______________________________________________ http://www.netregistry.com.au Tel: 02 9699 6099 Fax: 02 9699 6088 PO Box 270 Broadway NSW 2007 Domains |Business Email|Web Hosting|E-Commerce Trusted by 10,000s of businesses since 1997 ______________________________________________ From teigland at redhat.com Tue Aug 31 05:30:06 2004 From: teigland at redhat.com (David Teigland) Date: Tue, 31 Aug 2004 13:30:06 +0800 Subject: [Linux-cluster] fencing behaviour In-Reply-To: <1093929160.28391.230.camel@akira2.nro.au.com> References: <1093924258.28391.200.camel@akira2.nro.au.com> <1093929160.28391.230.camel@akira2.nro.au.com> Message-ID: <20040831053006.GA15784@redhat.com> On Tue, Aug 31, 2004 at 03:12:40PM +1000, Adam Cassar wrote: > The below node in question does not get fenced by the other nodes! > > It only occurs after a reboot and it joins the cluster. Only then does > the other node want to fence it. Is this normal? Yes. When the node dies, services (fencing, dlm, gfs) are suspended. If the cluster still has quorum, these services are re-enabled immediately. If the cluster has lost quorum, the services are not re-enabled until the cluster regains quorum, which is what you're seeing. Fencing occurs when the fencing service is re-enabled and performs recovery. This is happening when your failed node rejoins the cluster, giving it quorum again. When doing recovery, the fencing daemon is smart enough to see that the failed node has rejoined the cluster and will bypass the now useless fencing operation for it. 
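You can watch this from a surviving node; the fields below are the same ones shown in the /proc/cluster/status output quoted earlier in the thread:

  # quorum state before and after the failed node rejoins
  grep -E 'Membership state|Expected_votes|Total_votes|Quorum' /proc/cluster/status
  # known nodes and their current status
  cat /proc/cluster/nodes

While the cluster is inquorate the vote count stays below the quorum value and fencing, along with DLM and GFS recovery, remains suspended; once the node rejoins and quorum is regained, recovery proceeds as described above.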
-- Dave Teigland From ecashin at coraid.com Tue Aug 31 14:45:42 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Tue, 31 Aug 2004 10:45:42 -0400 Subject: [Linux-cluster] Re: Error compiling GFS patched kernel References: <001d01c48230$a4877b80$0a14a8c0@venus.it> <87u0ukl3ey.fsf@coraid.com> <20040830215402.GT3794@smart.physik.fu-berlin.de> Message-ID: <87656zl621.fsf@coraid.com> Jens Dreger writes: > On Mon, Aug 30, 2004 at 05:30:29PM -0400, Ed L Cashin wrote: >> "Angelo Ovidi" writes: >> >> > Hi. >> > >> > I am trying to compile a 2.6.7 kernel patched with cvs version of >> > cluster package of redhat. >> > >> > I have no error applying the patches but the compile give me this >> > error: >> >> I get the same error using the "Method 2: using kernel patches" >> procedure from http://sources.redhat.com/cluster/doc/usage.txt >> with this tarball: >> >> cluster_0406282100.tgz > > Try upgrading gcc. I got that error with gcc 2.95. Upgrading to gcc >3 > solved the problem. Thanks. Yes, I noticed that it works with gcc 3.3, but the snapshot has been up for a while, and it doesn't work with gcc 2, so it seems like gcc 2 is not supported by gfs, which merits a big warning in any usage.txt-type docs. -- Ed L Cashin From eoey at shopping.com Mon Aug 30 19:03:27 2004 From: eoey at shopping.com (Edy Oey) Date: Mon, 30 Aug 2004 12:03:27 -0700 Subject: [Linux-cluster] QLogic QLA2342 Drivers for 2.6.x kernel? Message-ID: <0C99873E269DF9449701C2CA6D9782C902057B3A@mail-na.shopping.com> Hi, Anybody knows where I can get QLogic QLA2342 Drivers for 2.6.x kernel? Thanks. -edy -------------- next part -------------- An HTML attachment was scrubbed... URL: From ecashin at coraid.com Tue Aug 31 15:56:04 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Tue, 31 Aug 2004 11:56:04 -0400 Subject: [Linux-cluster] Re: Installation problem References: <1093508759.31675.35.camel@zombie.i.spray.se> <20040826083426.GB6682@tykepenguin.com> <1093514299.31675.44.camel@zombie.i.spray.se> <20040826102119.GA9523@tykepenguin.com> Message-ID: <87y8jvjo8b.fsf@coraid.com> Patrick Caulfield writes: > On Thu, Aug 26, 2004 at 11:58:19AM +0200, Johan Pettersson wrote: >> >> >> I have only 2 cnxman.h in the system and they do not differ =/ >> >> build/cluster/cman-kernel/src/cnxman.h >> build/linux-2.6.7/include/cluster/cnxman.h > > Sorry that should have been cnxman-socket.h In today's cvs cluster sources, the cnxman-socket.h in the kernel patches is different from the one in the cluster source tree. If you build the cluster sources according to method 2 of usage.txt on a machine that has never had GFS installed before, I think you'll also find that there are some problems with the include paths. The different parts of the cluster software can't see one anothers headers. The problem is masked when there are headers in /usr/include, but shouldn't system-wide headers be ignored when building from newer sources? -- Ed L Cashin From coughlan at redhat.com Tue Aug 31 16:19:25 2004 From: coughlan at redhat.com (Tom Coughlan) Date: Tue, 31 Aug 2004 12:19:25 -0400 Subject: [Linux-cluster] QLogic QLA2342 Drivers for 2.6.x kernel? In-Reply-To: <0C99873E269DF9449701C2CA6D9782C902057B3A@mail-na.shopping.com> References: <0C99873E269DF9449701C2CA6D9782C902057B3A@mail-na.shopping.com> Message-ID: <1093969165.7121.2617.camel@bianchi.boston.redhat.com> On Mon, 2004-08-30 at 15:03, Edy Oey wrote: > Hi, > Anybody knows where I can get QLogic QLA2342 Drivers for 2.6.x kernel? They are built-in. See scsi/qla2xxx/. 
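A quick way to confirm that on a given 2.6 build is sketched below. The module split has changed between 2.6.x releases, so treat the qla2300 name as an assumption and check drivers/scsi/qla2xxx/ in your own tree; the 2342 is a 23xx-family board:

  # was the driver enabled in this kernel's config?
  grep -i qla2 /boot/config-$(uname -r)   # or grep your kernel source .config
  # which qla modules were actually built
  find /lib/modules/$(uname -r) -name 'qla2*'
  # load the core driver and the 23xx-family module, if modular
  modprobe qla2xxx
  modprobe qla2300

If the driver was built in rather than modular, the HBA should simply show up in dmesg and /proc/scsi/scsi at boot.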
From ecashin at coraid.com Tue Aug 31 16:44:49 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Tue, 31 Aug 2004 12:44:49 -0400 Subject: [Linux-cluster] cluster depends on tcp_wrappers? Message-ID: <87u0ujjlz2.fsf@coraid.com> Hi. Does cluster, and gulm/src/utils_ip.c from today's CVS specifically, depend on tcp_wrappers? If so, it deserves mentioning in usage.txt. -- Ed L Cashin From mtilstra at redhat.com Tue Aug 31 16:52:53 2004 From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra) Date: Tue, 31 Aug 2004 11:52:53 -0500 Subject: [Linux-cluster] cluster depends on tcp_wrappers? In-Reply-To: <87u0ujjlz2.fsf@coraid.com> References: <87u0ujjlz2.fsf@coraid.com> Message-ID: <20040831165253.GA14574@redhat.com> On Tue, Aug 31, 2004 at 12:44:49PM -0400, Ed L Cashin wrote: > Hi. Does cluster, and gulm/src/utils_ip.c from today's CVS > specifically, depend on tcp_wrappers? gulm does use tcpwrappers, it always has. -- Michael Conrad Tadpol Tilstra Don't look back, the lemmings are gaining on you. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From bmarzins at redhat.com Tue Aug 31 17:00:07 2004 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Tue, 31 Aug 2004 12:00:07 -0500 Subject: [Linux-cluster] Build & Installation instructions for GNBD? In-Reply-To: <60C14C611F1DDD4198D53F2F43D8CA3B01CC9E77@orsmsx410> References: <60C14C611F1DDD4198D53F2F43D8CA3B01CC9E77@orsmsx410> Message-ID: <20040831170007.GL12234@phlogiston.msp.redhat.com> On Mon, Aug 30, 2004 at 11:51:22AM -0700, Villalovos, John L wrote: > Are there any build and installation instructions for GNBD? > > The documentation file ( http://sources.redhat.com/cluster/doc/usage.txt > ) does not mention a word about GNBD. http://sources.redhat.com/cluster/gnbd/gnbd_usage.txt There is now a link to this from the gnbd page. If you have any questions or comments about it, just let me know. -Ben > Thanks, > John > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From ecashin at coraid.com Tue Aug 31 18:27:03 2004 From: ecashin at coraid.com (Ed L Cashin) Date: Tue, 31 Aug 2004 14:27:03 -0400 Subject: [Linux-cluster] Re: cluster depends on tcp_wrappers? References: <87u0ujjlz2.fsf@coraid.com> <20040831165253.GA14574@redhat.com> Message-ID: <87r7pnjh8o.fsf@coraid.com> Michael Conrad Tadpol Tilstra writes: > On Tue, Aug 31, 2004 at 12:44:49PM -0400, Ed L Cashin wrote: >> Hi. Does cluster, and gulm/src/utils_ip.c from today's CVS >> specifically, depend on tcp_wrappers? > > gulm does use tcpwrappers, it always has. OK, here's a patch. Without tcp wrappers already installed, following the directions in usage.txt results in a cryptic message about tcpd.h being missing, so either a check in the configure script or some documentation is necessary. 
--- cluster-cvs/doc/usage.txt.20040831 Tue Aug 31 14:21:57 2004 +++ cluster-cvs/doc/usage.txt Tue Aug 31 14:22:39 2004 @@ -25,6 +25,10 @@ cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2 checkout LVM2 cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster +- satisfy dependencies + + gulm requires tcp_wrappers + Build and install ----------------- -- Ed L Cashin From iisaman at citi.umich.edu Tue Aug 31 18:49:13 2004 From: iisaman at citi.umich.edu (Fredric Isaman) Date: Tue, 31 Aug 2004 14:49:13 -0400 (EDT) Subject: [Linux-cluster] cluster send request failed: Bad address Message-ID: I am trying to set up a simple 3-node cluster (containing iota6-8). I get up to running clvmd on each node. At this point, iota8 works fine, all lvm commands work (although with some error messages about lock failures on the other nodes). However, any attempt to use a lvm command on the other nodes gives some sort of locking error. For example: [root at iota8g LVM2]# pvremove /baddev /baddev: Couldn't find device. [root at iota6g LVM2]# pvremove /baddev cluster send request failed: Bad address Can't get lock for orphan PVs I have tracked the failure down to the fact that the call to dlm_ls_lock() from sync_lock() in LVM2/daemons/clvmd/clvmd-cman.c is failing, but I can not figure out why. In particular, I am perplexed that it works on the one machine and not the others. Any hints about what might be causing this would be appreciated. Thanks, Fred The failure in more detail: [root at iota6g cluster]# pvremove -vvv /baddev Setting global/locking_type to 2 Setting global/locking_library to liblvm2clusterlock.so Setting global/library_dir to /lib Opening shared locking library /lib/liblvm2clusterlock.so Loaded external locking library liblvm2clusterlock.so External locking enabled. FRED - called lock_resource(cmd, , 0x24) Locking P_orphans at 0x4 FRED - called _lock_for_cluster(51, 0x4, P_orphans) FRED - _cluster_request(51, ., data='\x04\x00P_orphans\x00', len=12) FRED - in _send_request: outheader = cmd=1, flags=0x0, xid=0, cid=134912944, status=-14, arglen=1, node= cluster send request failed: Bad address Can't get lock for orphan PVs [root at iota6g root]# clvmd -d CLVMD[13066]: 1093975495 CLVMD started CLVMD[13066]: 1093975495 FRED - init_cluster CLVMD[13066]: 1093975496 Cluster ready, doing some more initialisation CLVMD[13066]: 1093975496 starting LVM thread CLVMD[13066]: 1093975496 LVM thread function started CLVMD[13066]: 1093975496 clvmd ready for work CLVMD[13066]: 1093975496 Using timeout of 60 seconds No volume groups found CLVMD[13066]: 1093975496 LVM thread waiting for work CLVMD[13066]: 1093975500 Got new connection on fd 7 CLVMD[13066]: 1093975500 Read on local socket 7, len = 30 CLVMD[13066]: 1093975500 creating pipe, [8, 9] CLVMD[13066]: 1093975500 in sub thread: client = 0x80a8b60 CLVMD[13066]: 1093975500 doing PRE command LOCK_VG P_orphans at 4 CLVMD[13066]: 1093975500 FRED - sync_lock(P_orphans, 4, 0x0) CLVMD[13066]: 1093975500 FRED - sync_lock status = -1 CLVMD[13066]: 1093975500 hold_lock. 
lock at 4 failed: Bad address CLVMD[13066]: 1093975500 Writing status 14 down pipe 9 CLVMD[13066]: 1093975500 Waiting to do post command - state = 0 CLVMD[13066]: 1093975500 read on PIPE 8: 4 bytes: status: 14 CLVMD[13066]: 1093975500 background routine status was 14, sock_client=0x80a8b60CLVMD[13066]: 1093975500 Send local reply CLVMD[13066]: 1093975500 Read on local socket 7, len = -1 CLVMD[13066]: 1093975500 EOF on local socket: inprogress=0 CLVMD[13066]: 1093975500 Waiting for child thread CLVMD[13066]: 1093975500 SIGUSR2 received CLVMD[13066]: 1093975500 Joined child thread CLVMD[13066]: 1093975500 ret == 0, errno = 104. removing client From ben.m.cahill at intel.com Tue Aug 31 19:01:07 2004 From: ben.m.cahill at intel.com (Cahill, Ben M) Date: Tue, 31 Aug 2004 12:01:07 -0700 Subject: [Linux-cluster] man page for gfs_mount Message-ID: <0604335B7764D141945E202153105960033E2541@orsmsx404.amr.corp.intel.com> Hi all, Attached please find a new man page for gfs_mount, as a submission to be included in cluster/gfs/man. I tried to write it so it would be useful for newbies as well as veterans. Anyone who can, please review and let me know about any problems you see. Thanks! -- Ben -- Opinions are mine, not Intel's -------------- next part -------------- A non-text attachment was scrubbed... Name: gfs_mount.8 Type: application/octet-stream Size: 8813 bytes Desc: gfs_mount.8 URL: From sdake at mvista.com Tue Aug 31 19:50:43 2004 From: sdake at mvista.com (Steven Dake) Date: Tue, 31 Aug 2004 12:50:43 -0700 Subject: [Linux-cluster] New virtual synchrony API for the kernel: was Re: [Openais] New API in openais In-Reply-To: <1093973757.5933.56.camel@cherrybomb.pdx.osdl.net> References: <1093941076.3613.14.camel@persist.az.mvista.com> <1093973757.5933.56.camel@cherrybomb.pdx.osdl.net> Message-ID: <1093981842.3613.42.camel@persist.az.mvista.com> John, As it appears the redhat clusters project is interested in a kernel implementation of cluster messaging, this interface would have to be available to both the kernel and user applications. It possible to provide EVS services to both kernel and user space applications. There currently is no kernel implementation of group messaging, though only a user space interface. TIPC could probably export this sort of interface, or openais's gmi could be ported to the kernel. Then openais, redhat's cluster technologies, linux ha, or other group messaging applications (and there are quite a few) could use that technology and standardize on the EVS API. It would be useful for linux cluster developers for a common low level group communication API to be agreed upon by relevant clusters projects. Without this approach, we may end up with several systems all using different cluster communication & membership mechanisms that are incompatible. Thanks -steve On Tue, 2004-08-31 at 10:35, John Cherry wrote: > Steve, > > This sounds like a low level cluster communication service which would > be potentially leveraged by other services, such as the event service or > a group messaging service. Are you envisioning this to be a public > interface for applications? > > We discussed a low level cluster communication interface at the cluster > summit. The rhat/sistina interface would be used by the cluster manager > (CMAN) and the lock manager (GDLM), but there was no real momentum to > make this a public application interface. It would be great if we could > derive a common cluster communication interface with the rhat/sistina > project as well as the TIPC project. 
What do you think? > > John > > > On Tue, 2004-08-31 at 01:31, Steven Dake wrote: > > Folks > > > > Its with alot of pleasure that I announce a new API that I implemented > > over the weekend. > > > > The api is called the "EVS" API and is provided by a seperate library > > libevs.so/.a. The standard openais executive is used. There are two > > test programs testevs and evsbench which demonstrate the API. evsbench > > will benchmark throughput rates. I get about 9MB/sec on my hardware, > > however, flow control in the group messaging protocol is slowing this > > down. I've gotten 10MB/sec with tweaking the algorithm some. > > > > The API name EVS means "Extended Virtual Syncrhony". This API provides > > EVS semantics for those that require the guarantees provided in the face > > of partitions and merges. > > > > The API provides the following > > multiple instances may exist at one time > > group keys of 32 bytes > > an instance may join one or more groups at one time > > an instance may leave one or more groups at one time > > an instance may multicast to the currently joined groups > > an instance may multicast to unjoined groups > > any message for a joined group will be delivered via callback > > configuration changes are delivered via callback > > > > Your comments welcome > > > > Thanks > > -steve > > > > > > ______________________________________________________________________ > > _______________________________________________ > > Openais mailing list > > Openais at lists.osdl.org > > http://lists.osdl.org/mailman/listinfo/openais >