From nattaponv at hotmail.com Thu Sep 1 06:47:51 2005 From: nattaponv at hotmail.com (nattapon viroonsri) Date: Thu, 01 Sep 2005 06:47:51 +0000 Subject: [Linux-cluster] Rhcs4 split brain prolem with single heartbeat cable ? Message-ID: >From config menu in Rhcs4 look like dont require share storage to store quorum partion and no ip tie-breaker to config. And i use only 1 network cable both for client access resource and heartbeat channel. So if one node have problem with network connection. the backup node can provide service with uninterrupt. But Can split brain will occur ? Can both node issue i/o to share at the same time ? So hould i use 2 heartbeat channel ? If i use 2 heartbeat connection how can i detect fail of network connection for client access resource ( no ip tie-breaker to config in rhcs4) ? Regard, Nattapon _________________________________________________________________ Don't just search. Find. Check out the new MSN Search! http://search.msn.click-url.com/go/onm00200636ave/direct/01/ From teigland at redhat.com Thu Sep 1 10:46:20 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 1 Sep 2005 18:46:20 +0800 Subject: [Linux-cluster] GFS, what's remaining Message-ID: <20050901104620.GA22482@redhat.com> Hi, this is the latest set of gfs patches, it includes some minor munging since the previous set. Andrew, could this be added to -mm? there's not much in the way of pending changes. http://redhat.com/~teigland/gfs2/20050901/gfs2-full.patch http://redhat.com/~teigland/gfs2/20050901/broken-out/ I'd like to get a list of specific things remaining for merging. I believe we've responded to everything from earlier reviews, they were very helpful and more would be excellent. The list begins with one item from before that's still pending: - Adapt the vfs so gfs (and other cfs's) don't need to walk vma lists. [cf. ops_file.c:walk_vm(), gfs works fine as is, but some don't like it.] ... Thanks Dave From arjan at infradead.org Thu Sep 1 10:42:49 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Thu, 01 Sep 2005 12:42:49 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901104620.GA22482@redhat.com> References: <20050901104620.GA22482@redhat.com> Message-ID: <1125571369.5025.0.camel@laptopd505.fenrus.org> On Thu, 2005-09-01 at 18:46 +0800, David Teigland wrote: > Hi, this is the latest set of gfs patches, it includes some minor munging > since the previous set. Andrew, could this be added to -mm? there's not > much in the way of pending changes. can you post them here instead so that they can be actually reviewed? From akpm at osdl.org Thu Sep 1 10:59:39 2005 From: akpm at osdl.org (Andrew Morton) Date: Thu, 1 Sep 2005 03:59:39 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901104620.GA22482@redhat.com> References: <20050901104620.GA22482@redhat.com> Message-ID: <20050901035939.435768f3.akpm@osdl.org> David Teigland wrote: > > Hi, this is the latest set of gfs patches, it includes some minor munging > since the previous set. Andrew, could this be added to -mm? Dumb question: why? Maybe I was asleep, but I don't recall seeing much discussion or exposition of - Why the kernel needs two clustered fileystems - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot possibly gain (or vice versa) - Relative merits of the two offerings etc. Maybe this has all been thrashed out and agreed to. If so, please remind me. 
From yanj at brainaire.com Thu Sep 1 10:02:45 2005 From: yanj at brainaire.com (yanj) Date: Thu, 1 Sep 2005 18:02:45 +0800 Subject: [Linux-cluster] Re: GFS: Assertion failed + GFS on NO-SMP system Message-ID: <001e01c5aedc$501c5780$2e00a8c0@yanzijie> Forget to say: I have tried the following combination of GFS on Redhat system: 1. GFS version GFS-6.0.2.25 + Kernel 2.4.21-27.EL 2. GFS version GFS-6.0.2.20-1 + Kernel 2.4.21-32.EL In both cases, the system keeps ending in ?Kernel Panic? -----Original Message----- From: yanj [mailto:yanj at brainaire.com] Sent: 2005?9?1? 17:58 To: 'linux-cluster at redhat.com' Subject: GFS: Assertion failed + GFS on NO-SMP system Hi, Could GFS run based on NO-SMP system ? (As what is said on Redhat GFS manuals) I am working on GFS+iSCSI+RHEL3 based on a non-SMP machine. I have setup the GFS system, however, if I run IO tests on two nodes simultaneously. The system keeps crashing. (Kernel panic.) The error message is sth like following Sep 1 17:18:48 test2 kernel: Bad metadata at 24, should be 4 Sep 1 17:18:48 test2 kernel: mh_magic = 0xA5A5A5A5 Sep 1 17:18:48 test2 kernel: mh_type = 2779096485 Sep 1 17:18:48 test2 kernel: mh_generation = 0 Sep 1 17:18:48 test2 kernel: mh_format = 0 Sep 1 17:18:48 test2 kernel: mh_incarn = 0 Sep 1 17:18:48 test2 kernel: cd433ce4 d0918292 00000010 00000000 c0121992 0000000a 00000400 d0937b13 Sep 1 17:18:48 test2 kernel: cd433d30 00000018 00000000 d09273ad cd433d4c 00000018 00000000 00000000 Sep 1 17:18:48 test2 kernel: d08fe34d d0938b2c d093673a 000004e5 00000013 d0958000 cd433d48 cd455660 Sep 1 17:18:48 test2 kernel: Call Trace: [] gfs_asserti [gfs] 0x32 (0xcd433ce8) Sep 1 17:18:48 test2 kernel: [] printk [kernel] 0x122 (0xcd433cf4) Sep 1 17:18:48 test2 kernel: [] .rodata.str1.1 [gfs] 0x14a7 (0xcd433d00) Sep 1 17:18:48 test2 kernel: [] gfs_meta_header_print [gfs] 0x5d (0xcd433d10) Sep 1 17:18:48 test2 kernel: [] gfs_get_meta_buffer [gfs] 0x2ad (0xcd433d24) Sep 1 17:18:48 test2 kernel: [] .rodata.str1.4 [gfs] 0x3bc (0xcd433d28) Sep 1 17:18:48 test2 kernel: [] .rodata.str1.1 [gfs] 0xce (0xcd433d2c) Sep 1 17:18:48 test2 kernel: [] gfs_copyin_dinode [gfs] 0x39 (0xcd433d7c) Sep 1 17:18:48 test2 kernel: [] lock_inode [gfs] 0x8d (0xcd433dcc) Sep 1 17:18:48 test2 kernel: [] glock_wait_internal [gfs] 0x18f (0xcd433de8) Sep 1 17:18:48 test2 kernel: [] run_queue [gfs] 0xac (0xcd433df8) Sep 1 17:18:48 test2 kernel: [] gfs_inode_glops [gfs] 0x0 (0xcd433e04) Sep 1 17:18:48 test2 kernel: [] gfs_glock_nq [gfs] 0x8f (0xcd433e18) Sep 1 17:18:48 test2 kernel: [] gfs_glock_nq_init [gfs] 0x37 (0xcd433e3c) Sep 1 17:18:48 test2 kernel: [] do_quota_sync [gfs] 0x108 (0xcd433e58) Sep 1 17:18:48 test2 kernel: [] gfs_copy_from_mem [gfs] 0x0 (0xcd433e70) Sep 1 17:18:48 test2 kernel: [] context_switch [kernel] 0x7b (0xcd433f54) Sep 1 17:18:48 test2 kernel: [] gfs_quota_sync [gfs] 0xcc (0xcd433f9c) Sep 1 17:18:48 test2 kernel: [] process_timeout [kernel] 0x0 (0xcd433fb0) Sep 1 17:18:48 test2 kernel: [] gfs_quotad [gfs] 0x67 (0xcd433fc8) Sep 1 17:18:48 test2 kernel: [] gfs_quotad_bounce [gfs] 0x0 (0xcd433fdc) Sep 1 17:18:48 test2 kernel: [] gfs_quotad_bounce [gfs] 0xf (0xcd433fe8) Sep 1 17:18:48 test2 kernel: [] kernel_thread_helper [kernel] 0x5 (0xcd433ff0) End with sth like: GFS: Assertion failed on line 318 of file trans.c GFS: assertion:?metatype_check_magic==GFS_magic&&metatype_check_type == ?. Thanks, Jeffrey Yan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arjan at infradead.org Thu Sep 1 11:35:23 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Thu, 01 Sep 2005 13:35:23 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901104620.GA22482@redhat.com> References: <20050901104620.GA22482@redhat.com> Message-ID: <1125574523.5025.10.camel@laptopd505.fenrus.org> On Thu, 2005-09-01 at 18:46 +0800, David Teigland wrote: > Hi, this is the latest set of gfs patches, it includes some minor munging > since the previous set. Andrew, could this be added to -mm? there's not > much in the way of pending changes. > > http://redhat.com/~teigland/gfs2/20050901/gfs2-full.patch > http://redhat.com/~teigland/gfs2/20050901/broken-out/ +static inline void glock_put(struct gfs2_glock *gl) +{ + if (atomic_read(&gl->gl_count) == 1) + gfs2_glock_schedule_for_reclaim(gl); + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,); + atomic_dec(&gl->gl_count); +} this code has a race what is gfs2_assert() about anyway? please just use BUG_ON directly everywhere +static inline int queue_empty(struct gfs2_glock *gl, struct list_head *head) +{ + int empty; + spin_lock(&gl->gl_spin); + empty = list_empty(head); + spin_unlock(&gl->gl_spin); + return empty; +} that looks like a racey interface to me... if so.. why bother locking at all? +void gfs2_glock_hold(struct gfs2_glock *gl) +{ + glock_hold(gl); +} eh why? +struct gfs2_holder *gfs2_holder_get(struct gfs2_glock *gl, unsigned int state, + int flags, int gfp_flags) +{ + struct gfs2_holder *gh; + + gh = kmalloc(sizeof(struct gfs2_holder), GFP_KERNEL | gfp_flags); this looks odd. Either you take flags or you don't.. this looks really half arsed and thus is really surprising to all callers static int gi_skeleton(struct gfs2_inode *ip, struct gfs2_ioctl *gi, + gi_filler_t filler) +{ + unsigned int size = gfs2_tune_get(ip->i_sbd, gt_lockdump_size); + char *buf; + unsigned int count = 0; + int error; + + if (size > gi->gi_size) + size = gi->gi_size; + + buf = kmalloc(size, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + error = filler(ip, gi, buf, size, &count); + if (error) + goto out; + + if (copy_to_user(gi->gi_data, buf, count + 1)) + error = -EFAULT; where does count get a sensible value? +static unsigned int handle_roll(atomic_t *a) +{ + int x = atomic_read(a); + if (x < 0) { + atomic_set(a, 0); + return 0; + } + return (unsigned int)x; +} this is just plain scary. you'll have to post the rest of your patches if you want anyone to look at them... From penberg at cs.helsinki.fi Thu Sep 1 12:33:24 2005 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 1 Sep 2005 15:33:24 +0300 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901104620.GA22482@redhat.com> References: <20050901104620.GA22482@redhat.com> Message-ID: <84144f020509010533f5f2440@mail.gmail.com> On 9/1/05, David Teigland wrote: > - Adapt the vfs so gfs (and other cfs's) don't need to walk vma lists. > [cf. ops_file.c:walk_vm(), gfs works fine as is, but some don't like it.] It works fine only if you don't care about playing well with other clustered filesystems. 
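For reference, the standard way to close the window Arjan points out in glock_put() above is to let the atomic operation itself decide which caller saw the last reference, instead of reading the counter and decrementing it in two separate steps. A minimal sketch of that pattern, reusing the names from the posted hunk and assuming the surrounding GFS2 definitions; it is an illustration, not the actual GFS2 fix, and it schedules reclaim when the count reaches zero rather than just before:

static inline void glock_put(struct gfs2_glock *gl)
{
	/* atomic_dec_and_test() returns true for exactly one caller,
	 * so the test and the decrement can no longer race. */
	if (atomic_dec_and_test(&gl->gl_count))
		gfs2_glock_schedule_for_reclaim(gl);
}

Whether reclaim should run on the last put or just before it is a design choice for the GFS developers; the point is only that the check and the decrement have to be a single atomic operation.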
Pekka From alan at lxorguk.ukuu.org.uk Thu Sep 1 14:49:18 2005 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Thu, 01 Sep 2005 15:49:18 +0100 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901035939.435768f3.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> Message-ID: <1125586158.15768.42.camel@localhost.localdomain> On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote: > - Why the kernel needs two clustered fileystems So delete reiserfs4, FAT, VFAT, ext2, and all the other "junk". > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > possibly gain (or vice versa) > > - Relative merits of the two offerings You missed the important one - people actively use it and have been for some years. Same reason with have NTFS, HPFS, and all the others. On that alone it makes sense to include. Alan From hch at infradead.org Thu Sep 1 14:27:08 2005 From: hch at infradead.org (Christoph Hellwig) Date: Thu, 1 Sep 2005 15:27:08 +0100 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125586158.15768.42.camel@localhost.localdomain> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> Message-ID: <20050901142708.GA24933@infradead.org> On Thu, Sep 01, 2005 at 03:49:18PM +0100, Alan Cox wrote: > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > possibly gain (or vice versa) > > > > - Relative merits of the two offerings > > You missed the important one - people actively use it and have been for > some years. Same reason with have NTFS, HPFS, and all the others. On > that alone it makes sense to include. That's GFS. The submission is about a GFS2 that's on-disk incompatible to GFS. From alan at lxorguk.ukuu.org.uk Thu Sep 1 15:28:30 2005 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Thu, 01 Sep 2005 16:28:30 +0100 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901142708.GA24933@infradead.org> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901142708.GA24933@infradead.org> Message-ID: <1125588511.15768.52.camel@localhost.localdomain> > That's GFS. The submission is about a GFS2 that's on-disk incompatible > to GFS. Just like say reiserfs3 and reiserfs4 or ext and ext2 or ext2 and ext3 then. I think the main point still stands - we have always taken multiple file systems on board and we have benefitted enormously from having the competition between them instead of a dictat from the kernel kremlin that 'foofs is the one true way' Competition will decide if OCFS or GFS is better, or indeed if someone comes along with another contender that is better still. And competition will probably get the answer right. The only thing that is important is we don't end up with each cluster fs wanting different core VFS interfaces added. 
Alan From lmb at suse.de Thu Sep 1 15:11:18 2005 From: lmb at suse.de (Lars Marowsky-Bree) Date: Thu, 1 Sep 2005 17:11:18 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125588511.15768.52.camel@localhost.localdomain> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901142708.GA24933@infradead.org> <1125588511.15768.52.camel@localhost.localdomain> Message-ID: <20050901151118.GV28276@marowsky-bree.de> On 2005-09-01T16:28:30, Alan Cox wrote: > Competition will decide if OCFS or GFS is better, or indeed if someone > comes along with another contender that is better still. And competition > will probably get the answer right. Competition will come up with the same situation like reiserfs and ext3 and XFS, namely that they'll all be maintained going forward because of, uhm, political constraints ;-) But then, as long as they _are_ maintained and play along nicely with eachother (which, btw, is needed already so that at least data can be migrated...), I don't really see a problem of having two or three. > The only thing that is important is we don't end up with each cluster fs > wanting different core VFS interfaces added. Indeed. Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering SUSE Labs, Research and Development SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin "Ignorance more frequently begets confidence than does knowledge" From phillips at istop.com Thu Sep 1 17:23:07 2005 From: phillips at istop.com (Daniel Phillips) Date: Thu, 1 Sep 2005 13:23:07 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125586158.15768.42.camel@localhost.localdomain> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> Message-ID: <200509011323.08217.phillips@istop.com> On Thursday 01 September 2005 10:49, Alan Cox wrote: > On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote: > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > possibly gain (or vice versa) > > > > - Relative merits of the two offerings > > You missed the important one - people actively use it and have been for > some years. Same reason with have NTFS, HPFS, and all the others. On > that alone it makes sense to include. I thought that gfs2 just appeared last month. Or is it really still just gfs? If there are substantive changes from gfs to gfs2 then obviously they have had practically zero testing, let alone posted benchmarks, testimonials, etc. If it is really still just gfs then the silly-rename should be undone. Regards, Daniel From phillips at istop.com Thu Sep 1 17:27:42 2005 From: phillips at istop.com (Daniel Phillips) Date: Thu, 1 Sep 2005 13:27:42 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901104620.GA22482@redhat.com> References: <20050901104620.GA22482@redhat.com> Message-ID: <200509011327.42660.phillips@istop.com> On Thursday 01 September 2005 06:46, David Teigland wrote: > I'd like to get a list of specific things remaining for merging. Where are the benchmarks and stability analysis? How many hours does it survive cerberos running on all nodes simultaneously? Where are the testimonials from users? How long has there been a gfs2 filesystem? Note that Reiser4 is still not in mainline a year after it was first offered, why do you think gfs2 should be in mainline after one month? 
So far, all catches are surface things like bogus spinlocks. Substantive issues have not even begun to be addressed. Patience please, this is going to take a while. Regards, Daniel From hch at infradead.org Thu Sep 1 17:56:03 2005 From: hch at infradead.org (Christoph Hellwig) Date: Thu, 1 Sep 2005 18:56:03 +0100 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125588511.15768.52.camel@localhost.localdomain> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901142708.GA24933@infradead.org> <1125588511.15768.52.camel@localhost.localdomain> Message-ID: <20050901175603.GA6218@infradead.org> On Thu, Sep 01, 2005 at 04:28:30PM +0100, Alan Cox wrote: > > That's GFS. The submission is about a GFS2 that's on-disk incompatible > > to GFS. > > Just like say reiserfs3 and reiserfs4 or ext and ext2 or ext2 and ext3 > then. I think the main point still stands - we have always taken > multiple file systems on board and we have benefitted enormously from > having the competition between them instead of a dictat from the kernel > kremlin that 'foofs is the one true way' I didn't say anything agains a particular fs, just that your previous arguments where utter nonsense. In fact I think having two or more cluster filesystems in the tree is a good thing. Whether the gfs2 code is mergeable is a completely different question, and it seems at least debatable to submit a filesystem for inclusion that's still pretty new. While we're at it I can't find anything describing what gfs2 is about, what is lacking in gfs, what structual changes did you make, etc.. p.s. why is gfs2 in fs/gfs in the kernel tree? From lhh at redhat.com Thu Sep 1 18:11:23 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 01 Sep 2005 14:11:23 -0400 Subject: [Linux-cluster] Fencing Module for VMware ESX and GSX In-Reply-To: References: Message-ID: <1125598283.14500.34.camel@ayanami.boston.redhat.com> On Wed, 2005-08-31 at 10:30 -0500, Zach Lowry wrote: > Hello! > > I recently deployed GFS between 2 virtual machines on a VMware ESX > server and had a problem because there was no fencing module that > could handle this architecture. So, using the fence_apc module as a > template, I wrote a compatible module, fence_vmware. Now, since I > didn't want to rewrite GFS and recompile to use this new module, I am > currently using it as a drop-in replacement for the fence_apc module, > and using the APC configuration syntax. However, this code seems to > work just right when a machine misses sync, it will log into the > VMware ESX server and attempt to do a soft reboot of the VM, then a > hard reboot if necessary. I hope to see this incorporated into the > GFS tree, but if not it will be available on my website. > > Attached is a copy of the source, also available at http:// > www.zachlowry.net/software.php, along with a sample configuration. Is the on/off/reboot case sensitive? -- Lon From lhh at redhat.com Thu Sep 1 18:13:02 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 01 Sep 2005 14:13:02 -0400 Subject: [Linux-cluster] Rhcs4 split brain prolem with single heartbeat cable ? In-Reply-To: References: Message-ID: <1125598382.14500.37.camel@ayanami.boston.redhat.com> On Thu, 2005-09-01 at 06:47 +0000, nattapon viroonsri wrote: > >From config menu in Rhcs4 look like dont require share storage to store > quorum partion and no ip tie-breaker to config. 
> And i use only 1 network cable both for client access resource and > heartbeat channel. > So if one node have problem with network connection. the backup node can > provide service with uninterrupt. > But Can split brain will occur ? > Can both node issue i/o to share at the same time ? > > So hould i use 2 heartbeat channel ? > If i use 2 heartbeat connection how can i detect fail of network connection > for client access resource ( no ip tie-breaker to config in rhcs4) ? CMAN (the cluster manager) ensures no split brain via fencing: If you have two nodes and disconnect the cable on one of them, the remaining (connected) node will fence the node which was disconnected. To use it, you need two_node=1 and expected_votes=1 in cluster.conf (which system-config-cluster correctly sets up for you). As stated previously: Fencing hardware is *required* for RHCS4. -- Lon From lhh at redhat.com Thu Sep 1 18:17:03 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 01 Sep 2005 14:17:03 -0400 Subject: [Linux-cluster] Fencing Module for VMware ESX and GSX In-Reply-To: <1125598283.14500.34.camel@ayanami.boston.redhat.com> References: <1125598283.14500.34.camel@ayanami.boston.redhat.com> Message-ID: <1125598623.14500.39.camel@ayanami.boston.redhat.com> On Thu, 2005-09-01 at 14:11 -0400, Lon Hohberger wrote: /me reads code > Is the on/off/reboot case sensitive? No. -- Lon From lhh at redhat.com Thu Sep 1 18:26:05 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 01 Sep 2005 14:26:05 -0400 Subject: [Linux-cluster] Fencing Module for VMware ESX and GSX In-Reply-To: <1125598623.14500.39.camel@ayanami.boston.redhat.com> References: <1125598283.14500.34.camel@ayanami.boston.redhat.com> <1125598623.14500.39.camel@ayanami.boston.redhat.com> Message-ID: <1125599165.14500.42.camel@ayanami.boston.redhat.com> On Thu, 2005-09-01 at 14:17 -0400, Lon Hohberger wrote: > On Thu, 2005-09-01 at 14:11 -0400, Lon Hohberger wrote: > > /me reads code > > > Is the on/off/reboot case sensitive? > > No. http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/vmware/fence_vmware.pl.diff?cvsroot=cluster&r1=NONE&r2=1.1 Please keep me updated if you have changes or you want your name / email removed from the header. -- Lon From pegasus at nerv.eu.org Thu Sep 1 19:58:36 2005 From: pegasus at nerv.eu.org (Jure =?ISO-8859-2?Q?Pe=E8ar?=) Date: Thu, 1 Sep 2005 21:58:36 +0200 Subject: [Linux-cluster] partly OT: failover <500ms Message-ID: <20050901215836.634334a1.pegasus@nerv.eu.org> Hi all, Sorry if this is somewhat offtopic here ... Our telco is looking into linux HA solutions for their VoIP needs. Their main requirement is that the failover happens in the order of a few 100ms. Can redhat cluster be tweaked to work reliably with such short time periods? This would mean heartbeat on the level of few ms and status probes on the level of 10ms. Is this even feasible? Since VoIP is IP anyway, I'm looking into UCARP and stuff like that. Anything else I should check? 
Thanks for answers, -- Jure Pe?ar http://jure.pecar.org/ From akpm at osdl.org Thu Sep 1 20:21:04 2005 From: akpm at osdl.org (Andrew Morton) Date: Thu, 1 Sep 2005 13:21:04 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125586158.15768.42.camel@localhost.localdomain> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> Message-ID: <20050901132104.2d643ccd.akpm@osdl.org> Alan Cox wrote: > > On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote: > > - Why the kernel needs two clustered fileystems > > So delete reiserfs4, FAT, VFAT, ext2, and all the other "junk". Well, we did delete intermezzo. I was looking for technical reasons, please. > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > possibly gain (or vice versa) > > > > - Relative merits of the two offerings > > You missed the important one - people actively use it and have been for > some years. Same reason with have NTFS, HPFS, and all the others. On > that alone it makes sense to include. Again, that's not a technical reason. It's _a_ reason, sure. But what are the technical reasons for merging gfs[2], ocfs2, both or neither? If one can be grown to encompass the capabilities of the other then we're left with a bunch of legacy code and wasted effort. I'm not saying it's wrong. But I'd like to hear the proponents explain why it's right, please. From lhh at redhat.com Thu Sep 1 21:39:59 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 01 Sep 2005 17:39:59 -0400 Subject: [Linux-cluster] partly OT: failover <500ms In-Reply-To: <20050901215836.634334a1.pegasus@nerv.eu.org> References: <20050901215836.634334a1.pegasus@nerv.eu.org> Message-ID: <1125610799.14500.105.camel@ayanami.boston.redhat.com> On Thu, 2005-09-01 at 21:58 +0200, Jure Pe?ar wrote: > Hi all, > > Sorry if this is somewhat offtopic here ... > > Our telco is looking into linux HA solutions for their VoIP needs. Their > main requirement is that the failover happens in the order of a few 100ms. > > Can redhat cluster be tweaked to work reliably with such short time > periods? This would mean heartbeat on the level of few ms and status probes > on the level of 10ms. Is this even feasible? Possibly, I don't think it can do it right now. A couple of things to remember: * For such a fast requirement, you'll want a dedicated network for cluster traffic and a real-time kernel. * Also, "detection and initiation of recovery" is all the cluster software can do for you; your application - by itself - may take longer than this to recover. * It's practically impossible to guarantee completion of I/O fencing in this amount of time, so your application must be able to do without, or you need to create a new specialized fencing mechanism which is guaranteed to complete within a very fast time. * I *think* CMAN is currently at the whole-second granularity, so some changes would need to be made to give it finer granularity. This shouldn't be difficult (but I'll let the developers of CMAN answer this definitively, though... ;) ) * Clumanager 1.2.x (RHCS3) can theoretically operate at sub-second failure detection, but not at the levels you require (also, doing so is not tested nor supported anyway). 
-- Lon From treddy at rallydev.com Thu Sep 1 23:03:23 2005 From: treddy at rallydev.com (Tarun Reddy) Date: Thu, 1 Sep 2005 17:03:23 -0600 Subject: [Linux-cluster] RHEL/RHCS3: /usr/lib/clumanager/services/service status # stays up In-Reply-To: <1125505731.21943.82.camel@ayanami.boston.redhat.com> References: <437DFE3B-80E4-4D14-A4A1-DDE56BD2ED5B@rallydev.com> <1125505731.21943.82.camel@ayanami.boston.redhat.com> Message-ID: <06A0AC65-147C-465B-833B-B6D818D693B3@rallydev.com> I believe it may have been a deadlocking issue when my status check was at 1 second. I had thought it was 1 minute. When I moved it to 60 seconds, the issue disappeared, mostly. There were occasionally a few /usr/lib/clumanager/service/status scripts running for days, and they generally are bunched together. Tarun On Aug 31, 2005, at 10:28 AM, Lon Hohberger wrote: > On Thu, 2005-08-18 at 14:12 -0600, Tarun Reddy wrote: > > >> Anybody venture a guess as to why this might be occurring? And are my >> check intervals too low? >> > > What are they set to? > > You get multiple running at the same time if you have multiple > services. > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster > From gshi at ncsa.uiuc.edu Thu Sep 1 23:47:08 2005 From: gshi at ncsa.uiuc.edu (Guochun Shi) Date: Thu, 01 Sep 2005 18:47:08 -0500 Subject: [Linux-cluster] compiling CVS head failed with kernel 2.6.13-mm1 Message-ID: <5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu> Hi, I tried to compile the CVS head with kernel_src=2.6.13-mm1 CC [M] /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.o /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c: In function `make_flags': /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: `DLM_LKF_NOORDER' undeclared (first use in this function) /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: (Each undeclared identifier is reported only once /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: for each function it appears in.) /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:124: error: `DLM_LKF_HEADQUE' undeclared (first use in this function) /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:129: error: `DLM_LKF_ALTCW' undeclared (first use in this function) /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:131: error: `DLM_LKF_ALTPR' undeclared (first use in this function) What I do is ./configure kernel_src=; make Am I missing anything? 
Thanks -Guochun From teigland at redhat.com Fri Sep 2 05:15:13 2005 From: teigland at redhat.com (David Teigland) Date: Fri, 2 Sep 2005 13:15:13 +0800 Subject: [Linux-cluster] compiling CVS head failed with kernel 2.6.13-mm1 In-Reply-To: <5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu> References: <5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu> Message-ID: <20050902051512.GC12084@redhat.com> On Thu, Sep 01, 2005 at 06:47:08PM -0500, Guochun Shi wrote: > I tried to compile the CVS head with kernel_src=2.6.13-mm1 > > CC [M] /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.o > /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c: In function > `make_flags': > /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: > `DLM_LKF_NOORDER' undeclared (first use in this function) > /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: > (Each undeclared identifier is reported only once > /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: > for each function it appears in.) > /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:124: error: > `DLM_LKF_HEADQUE' undeclared (first use in this function) > /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:129: error: > `DLM_LKF_ALTCW' undeclared (first use in this function) > /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:131: error: > `DLM_LKF_ALTPR' undeclared (first use in this function) > > What I do is ./configure kernel_src=; make It works for me, could you send the whole output of the make like below? Dave [gfs-kernel/src/dlm]% make if [ ! -e linux ]; then ln -s . linux; fi if [ ! -e lm_interface.h ]; then ln -s ../../src/harness/lm_interface.h .; fi if [ ! -e dlm.h ]; then cp ../../../dlm-kernel/src2/dlm.h .; fi make -C /opt/kernels/linux-2.6.13-mm1-build/ M=/opt/tmp/cluster-HEAD/gfs-kernel/src/dlm modules USING_KBUILD=yes make[1]: Entering directory `/opt/kernels/linux-2.6.13-mm1-build' CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock.o CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/main.o CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/mount.o CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/thread.o CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/sysfs.o CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/plock.o LD [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.o Building modules, stage 2. MODPOST CC /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.mod.o LD [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.ko make[1]: Leaving directory `/opt/kernels/linux-2.6.13-mm1-build' From teigland at redhat.com Fri Sep 2 07:04:49 2005 From: teigland at redhat.com (David Teigland) Date: Fri, 2 Sep 2005 15:04:49 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901175603.GA6218@infradead.org> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901142708.GA24933@infradead.org> <1125588511.15768.52.camel@localhost.localdomain> <20050901175603.GA6218@infradead.org> Message-ID: <20050902070449.GA16595@redhat.com> On Thu, Sep 01, 2005 at 06:56:03PM +0100, Christoph Hellwig wrote: > Whether the gfs2 code is mergeable is a completely different question, > and it seems at least debatable to submit a filesystem for inclusion I actually asked what needs to be done for merging. We appreciate the feedback and are carefully studying and working on all of it as usual. We'd also appreciate help, of course, if that sounds interesting to anyone. 
Thanks Dave From pcaulfie at redhat.com Fri Sep 2 07:03:33 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 02 Sep 2005 08:03:33 +0100 Subject: [Linux-cluster] partly OT: failover <500ms In-Reply-To: <1125610799.14500.105.camel@ayanami.boston.redhat.com> References: <20050901215836.634334a1.pegasus@nerv.eu.org> <1125610799.14500.105.camel@ayanami.boston.redhat.com> Message-ID: <4317F945.5050307@redhat.com> Lon Hohberger wrote: > On Thu, 2005-09-01 at 21:58 +0200, Jure Pe?ar wrote: > >>Hi all, >> >>Sorry if this is somewhat offtopic here ... >> >>Our telco is looking into linux HA solutions for their VoIP needs. Their >>main requirement is that the failover happens in the order of a few 100ms. >> >>Can redhat cluster be tweaked to work reliably with such short time >>periods? This would mean heartbeat on the level of few ms and status probes >>on the level of 10ms. Is this even feasible? > > > Possibly, I don't think it can do it right now. A couple of things to > remember: > > * For such a fast requirement, you'll want a dedicated network for > cluster traffic and a real-time kernel. > > * Also, "detection and initiation of recovery" is all the cluster > software can do for you; your application - by itself - may take longer > than this to recover. > > * It's practically impossible to guarantee completion of I/O fencing in > this amount of time, so your application must be able to do without, or > you need to create a new specialized fencing mechanism which is > guaranteed to complete within a very fast time. > > * I *think* CMAN is currently at the whole-second granularity, so some > changes would need to be made to give it finer granularity. This > shouldn't be difficult (but I'll let the developers of CMAN answer this > definitively, though... ;) ) > All true :) All cman timers are calibrated in seconds. I did run some tests a while ago with them in milliseconds and 100ms timeouts and it worked /reasonably/ well. However, without an RT kernel I wouldn't like to put this into a production system - we've had several instances of the cman kernel thread (which runs at the top RT priority) being stalled for up to 5 seconds and that node being fenced. Smaller stalls may be more common so with timeouts set that low you may well get nodes fenced for small delays. To be quite honest I'm not really sure what causes these stalls, as they generally happen under heavy IO load I assume (possibly wrongly) that they are related to disk flushes but someone who knows the VM better may out me right on this. -- patrick From teigland at redhat.com Fri Sep 2 09:44:03 2005 From: teigland at redhat.com (David Teigland) Date: Fri, 2 Sep 2005 17:44:03 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125574523.5025.10.camel@laptopd505.fenrus.org> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> Message-ID: <20050902094403.GD16595@redhat.com> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote: > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,); > what is gfs2_assert() about anyway? please just use BUG_ON directly > everywhere When a machine has many gfs file systems mounted at once it can be useful to know which one failed. Does the following look ok? 
#define gfs2_assert(sdp, assertion) \ do { \ if (unlikely(!(assertion))) { \ printk(KERN_ERR \ "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \ "GFS2: fsid=%s: function = %s\n" \ "GFS2: fsid=%s: file = %s, line = %u\n" \ "GFS2: fsid=%s: time = %lu\n", \ sdp->sd_fsname, # assertion, \ sdp->sd_fsname, __FUNCTION__, \ sdp->sd_fsname, __FILE__, __LINE__, \ sdp->sd_fsname, get_seconds()); \ BUG(); \ } \ } while (0) From joern at wohnheim.fh-wedel.de Fri Sep 2 11:46:09 2005 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Fri, 2 Sep 2005 13:46:09 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050902094403.GD16595@redhat.com> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> <20050902094403.GD16595@redhat.com> Message-ID: <20050902114609.GA11059@wohnheim.fh-wedel.de> On Fri, 2 September 2005 17:44:03 +0800, David Teigland wrote: > On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote: > > > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,); > > > what is gfs2_assert() about anyway? please just use BUG_ON directly > > everywhere > > When a machine has many gfs file systems mounted at once it can be useful > to know which one failed. Does the following look ok? > > #define gfs2_assert(sdp, assertion) \ > do { \ > if (unlikely(!(assertion))) { \ > printk(KERN_ERR \ > "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \ > "GFS2: fsid=%s: function = %s\n" \ > "GFS2: fsid=%s: file = %s, line = %u\n" \ > "GFS2: fsid=%s: time = %lu\n", \ > sdp->sd_fsname, # assertion, \ > sdp->sd_fsname, __FUNCTION__, \ > sdp->sd_fsname, __FILE__, __LINE__, \ > sdp->sd_fsname, get_seconds()); \ > BUG(); \ > } \ > } while (0) That's a lot of string constants. I'm not sure how smart current versions of gcc are, but older ones created a new constant for each invocation of such a macro, iirc. So you might want to move the code out of line. J?rn -- There's nothing better for promoting creativity in a medium than making an audience feel "Hmm ? I could do better than that!" -- Douglas Adams in a slashdot interview From hzhong at cisco.com Thu Sep 1 18:47:54 2005 From: hzhong at cisco.com (Hua Zhong (hzhong)) Date: Thu, 1 Sep 2005 11:47:54 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining Message-ID: <75D9B5F4E50C8B4BB27622BD06C2B82B7B8E04@xmb-sjc-235.amer.cisco.com> I just started looking at gfs. To understand it you'd need to look at it from the entire cluster solution point of view. This is a good document from David. It's not about GFS in particular but about the architecture of the cluster. http://people.redhat.com/~teigland/sca.pdf Hua > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > Christoph Hellwig > Sent: Thursday, September 01, 2005 10:56 AM > To: Alan Cox > Cc: Christoph Hellwig; Andrew Morton; > linux-fsdevel at vger.kernel.org; linux-cluster at redhat.com; > linux-kernel at vger.kernel.org > Subject: [Linux-cluster] Re: GFS, what's remaining > > On Thu, Sep 01, 2005 at 04:28:30PM +0100, Alan Cox wrote: > > > That's GFS. The submission is about a GFS2 that's > on-disk incompatible > > > to GFS. > > > > Just like say reiserfs3 and reiserfs4 or ext and ext2 or > ext2 and ext3 > > then. 
I think the main point still stands - we have always taken > > multiple file systems on board and we have benefitted > enormously from > > having the competition between them instead of a dictat > from the kernel > > kremlin that 'foofs is the one true way' > > I didn't say anything agains a particular fs, just that your previous > arguments where utter nonsense. In fact I think having two > or more cluster > filesystems in the tree is a good thing. Whether the gfs2 > code is mergeable > is a completely different question, and it seems at least debatable to > submit a filesystem for inclusion that's still pretty new. > > While we're at it I can't find anything describing what gfs2 is about, > what is lacking in gfs, what structual changes did you make, etc.. > > p.s. why is gfs2 in fs/gfs in the kernel tree? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Fri Sep 2 14:52:14 2005 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 02 Sep 2005 10:52:14 -0400 Subject: [Linux-cluster] RHEL/RHCS3: /usr/lib/clumanager/services/service status # stays up In-Reply-To: <06A0AC65-147C-465B-833B-B6D818D693B3@rallydev.com> References: <437DFE3B-80E4-4D14-A4A1-DDE56BD2ED5B@rallydev.com> <1125505731.21943.82.camel@ayanami.boston.redhat.com> <06A0AC65-147C-465B-833B-B6D818D693B3@rallydev.com> Message-ID: <1125672734.14500.127.camel@ayanami.boston.redhat.com> On Thu, 2005-09-01 at 17:03 -0600, Tarun Reddy wrote: > I believe it may have been a deadlocking issue when my status check > was at 1 second. I had thought it was 1 minute. When I moved it to 60 > seconds, the issue disappeared, mostly. There were occasionally a > few /usr/lib/clumanager/service/status scripts running for days, and > they generally are bunched together. How long does your status script take to complete normally? There is a known issue where if more than one service operation is requested for a given service, then the service manager will block... This might be what you are seeing, but only if (under some situations) the status check is taking longer than the status check interval. Can you bugzilla this? -- Lon From gshi at ncsa.uiuc.edu Fri Sep 2 18:00:33 2005 From: gshi at ncsa.uiuc.edu (Guochun Shi) Date: Fri, 02 Sep 2005 13:00:33 -0500 Subject: [Linux-cluster] compiling CVS head failed with kernel 2.6.13-mm1 In-Reply-To: <20050902051512.GC12084@redhat.com> References: <5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu> <5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu> Message-ID: <5.1.0.14.2.20050902130003.04207c68@pop.ncsa.uiuc.edu> David, It works for me now. It turns out that there is an old copy of dlm.h in cluster/gfs-kernel/src/dlm. After I deleted it, it compiles fine. BTW, I need to add -lpthread ccs_tool/Makefile to make it compile, I have seen other people have the same problem in the mailing list. 
Index: Makefile =================================================================== RCS file: /cvs/cluster/cluster/ccs/ccs_tool/Makefile,v retrieving revision 1.7 diff -u -r1.7 Makefile --- Makefile 19 May 2005 19:50:55 -0000 1.7 +++ Makefile 2 Sep 2005 17:37:55 -0000 @@ -25,7 +25,7 @@ `xml2-config --cflags` -DCCS_RELEASE_NAME=\"${RELEASE}\" endif -LDFLAGS+= -L${ccs_libdir} `xml2-config --libs` -L${magmalibdir} -L${libdir} +LDFLAGS+= -L${ccs_libdir} `xml2-config --libs` -L${magmalibdir} -L${libdir} -lpthread LOADLIBES+= -lccs -lmagma -lmagmamsg -ldl -Guochun At 01:15 PM 9/2/2005 +0800, you wrote: >On Thu, Sep 01, 2005 at 06:47:08PM -0500, Guochun Shi wrote: >> I tried to compile the CVS head with kernel_src=2.6.13-mm1 >> >> CC [M] /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.o >> /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c: In function >> `make_flags': >> /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: >> `DLM_LKF_NOORDER' undeclared (first use in this function) >> /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: >> (Each undeclared identifier is reported only once >> /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: >> for each function it appears in.) >> /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:124: error: >> `DLM_LKF_HEADQUE' undeclared (first use in this function) >> /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:129: error: >> `DLM_LKF_ALTCW' undeclared (first use in this function) >> /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:131: error: >> `DLM_LKF_ALTPR' undeclared (first use in this function) >> >> What I do is ./configure kernel_src=; make > >It works for me, could you send the whole output of the make like below? >Dave > >[gfs-kernel/src/dlm]% make >if [ ! -e linux ]; then ln -s . linux; fi >if [ ! -e lm_interface.h ]; then ln -s ../../src/harness/lm_interface.h .; fi >if [ ! -e dlm.h ]; then cp ../../../dlm-kernel/src2/dlm.h .; fi >make -C /opt/kernels/linux-2.6.13-mm1-build/ M=/opt/tmp/cluster-HEAD/gfs-kernel/src/dlm modules USING_KBUILD=yes >make[1]: Entering directory `/opt/kernels/linux-2.6.13-mm1-build' > > CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock.o > CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/main.o > CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/mount.o > CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/thread.o > CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/sysfs.o > CC [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/plock.o > LD [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.o > Building modules, stage 2. > MODPOST > CC /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.mod.o > LD [M] /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.ko >make[1]: Leaving directory `/opt/kernels/linux-2.6.13-mm1-build' From bmarzins at redhat.com Fri Sep 2 22:27:58 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Fri, 2 Sep 2005 17:27:58 -0500 Subject: [Linux-cluster] Re: If I have 5 GNBD server? In-Reply-To: <43151FAD.20803@telkom.co.id> References: <20050827004859.D66965A8690@mail.silvercash.com> <43126996.8000606@telkom.co.id> <20050829205951.GK12333@phlogiston.msp.redhat.com> <4313B938.2090409@telkom.co.id> <20050830195240.GM12333@phlogiston.msp.redhat.com> <43151FAD.20803@telkom.co.id> Message-ID: <20050902222758.GH6671@phlogiston.msp.redhat.com> On Wed, Aug 31, 2005 at 10:10:37AM +0700, Fajar A. Nugraha wrote: > Benjamin Marzinski wrote: > > >On Tue, Aug 30, 2005 at 08:41:12AM +0700, Fajar A. 
Nugraha wrote: > > > > > >>Benjamin Marzinski wrote: > >> > >> > >> > >>>If the gnbds are exported uncached (the default), the client will fail > >>>back IO > >>>if it can no longer talk to the server after a specified timeout. > >>> > >>> > >>> > >>What is the default timeout anyway, and how can I set it? > >>Last time I test gnbd-import timeout was on a development version > >>(DEVEL.1104982050) and after more than 30 minutes, the client still > >>tries to reconnect. > >> > >> > > > >The default timeout is 1 minute. It is tuneable with the -t option (see the > >gnbd man page). However you only timeout if you export the device in > >uncached > >mode. > > > > > > > I find something interesting : > gnbd_import man page (no mention of timeout): > -t server > Fence from Server. > Specify a server for the IO fence (only used with the -s > option). > > gnbd_export man page : > -t [seconds] > Timeout. > Set the exported GNBD to timeout mode > This option is used with -p. > This is the default for uncached GNBDs > > Isn't the client the one that has to determine whether it's in wait mode > or timeout mode? How does the parameter from gnbd_export passed to > gnbd_import? No, the server determines it. This information is passed to the client when it imports the device. > I tested it today with gnbd 1.00.00, by adding an extra ip address to > the server -> gnbd_export on the server (IP address 192.168.17.193, > cluster member, no extra parameter, so it should be exported as uncached > gnbd in timeout mode) -> gnbd_import on the client (member of a > different cluster) -> mount the gnbd_import -> remove the IP addresss > 192.168.17.193 from the server -> do df -k on the client, and I got > these on the client's syslog Gnbd won't fail the requests back until it can fence the server. Since the server is in another cluster, you cannot fence it. For uncached mode to work, the gnbd client and server MUST be in the same cluster. > Aug 31 09:55:58 node1 gnbd_recvd[9792]: client lost connection with > 192.168.17.193 : Interrupted system call > Aug 31 09:55:58 node1 gnbd_recvd[9792]: reconnecting > Aug 31 09:55:58 node1 kernel: gnbd (pid 9792: gnbd_recvd) got signal 1 > Aug 31 09:55:58 node1 kernel: gnbd2: Receive control failed (result -4) > Aug 31 09:55:58 node1 kernel: gnbd2: shutting down socket > Aug 31 09:55:58 node1 kernel: exitting GNBD_DO_IT ioctl > Aug 31 09:56:03 node1 gnbd_monitor[9781]: ERROR [gnbd_monitor.c:486] > server D?? is not a cluster member, cannot fence. > Aug 31 09:56:08 node1 gnbd_monitor[9781]: ERROR [gnbd_monitor.c:486] > server D?? is not a cluster member, cannot fence. > Aug 31 09:56:08 node1 gnbd_recvd[9792]: ERROR [gnbd_recvd.c:213] cannot > connect to server 192.168.17.193 (-1) : Interrupted system call > Aug 31 09:56:08 node1 gnbd_recvd[9792]: reconnecting > Aug 31 09:56:13 node1 gnbd_monitor[9781]: ERROR [gnbd_monitor.c:486] > server D?? is not a cluster member, cannot fence. > Aug 31 09:56:13 node1 gnbd_recvd[9792]: ERROR [gnbd_recvd.c:213] cannot > connect to server 192.168.17.193 (-1) : Interrupted system call > Aug 31 09:56:13 node1 gnbd_recvd[9792]: reconnecting > > And it goes on, and on, and on :) After ten minutes, I add the IP > address back to the server and these appear on syslog : > Aug 31 10:06:13 node1 gnbd_recvd[9792]: reconnecting > Aug 31 10:06:16 node1 kernel: resending requests > > So it looks like by default gnbd runs in wait mode, and after it > reconnects the kernel automatically resends the request without the need > of dm-multipath. 
> > Is my setup incorrect, or is this how it's supposed to work? Unfortunately, your setup allows the possibility of data corruption if you actually faile over between servers. Here's why. GNBD must fence the server before it fails over. Otherwise you run into the following situation: You have a gnbd client, and two servers (serverA and serverB). The client writes data to a block on serverA, but serverA becomes unresponsive before the data is written out to disk. The client fails over to serverB and writes out the data to that block. Later the client writes new data to the same block. After this, serverA suddenly wakes back up, and completes writing the old data from the original request to that block. You have now corrupted your block device. I have seen this happen multiple times. In your setup, since the client and server are in different clusters, gnbd cannot fence the server. This keeps the requests from failing out. If you switch the ip, gnbd has no way of knowing that this is no longer the same physical machine (Which should be fixed.. In future releases, I will probably make gnbd make sure that this is the same machine. Not just the same IP, otherwise, people could do just this sort of thing, and accidentally corrupt their data. If you switched IP addresses like this with cached devices, the chance of corrupting your data would become disturbingly likely). When you gnbd can connect to a server on the same ip, it assumes that the old server came back before it could be fenced, and resends the requests. -Ben > Regards, > > Fajar > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From mark.fasheh at oracle.com Sat Sep 3 00:16:28 2005 From: mark.fasheh at oracle.com (Mark Fasheh) Date: Fri, 2 Sep 2005 17:16:28 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> Message-ID: <20050903001628.GH21228@ca-server1.us.oracle.com> On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote: > The only thing that should be probably resolved is a common API > for at least the clustered lock manager. Having multiple > incompatible user space APIs for that would be sad. As far as userspace dlm apis go, dlmfs already abstracts away a large part of the dlm interaction, so writing a module against another dlm looks like it wouldn't be too bad (startup of a lockspace is probably the most difficult part there). --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh at oracle.com From gshi at ncsa.uiuc.edu Sat Sep 3 00:30:50 2005 From: gshi at ncsa.uiuc.edu (Guochun Shi) Date: Fri, 02 Sep 2005 19:30:50 -0500 Subject: [Linux-cluster] Where to go with cman ? In-Reply-To: <42FB2EDA.4010300@redhat.com> References: <42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> <42F77AA3.80000@redhat.com> Message-ID: <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> Patrick, can you describe the steps changed for CVS version compared to those in usage.txt in order to make gfs2 work? Thanks -Guochun At 11:56 AM 8/11/2005 +0100, you wrote: >For those not reading the commit list the ais-based cman is now in CVS - be >careful with it... 
> >For the moment it downloads a prepackaged/patched version of the openais source >from my people.redhat.com web site. This /will/ change. In fact the only >additional patch in there is one Steven posted to the openais mailing list so >don't think I'm hiding anything! > >There's still a lot of work to do on this code but is basically works with a few >caveats: > >1. Barriers are completely untested and may not work at all. >2. Don't start several nodes up at the same time, they might get the same > node ID(!) unless you used static node IDs. >3. The exec path for cmand is hard coded (in the Makefile) to ../daemon/cmand > so you must currently always run cman_tool from the dev directory unless > you change it. >4. Broadcast is no longer supported. If you fail to specify a multicast address > cman_tool will provide one. >5. IPv6 is unsupported, I'm going to start on that next! >6. Error reporting is probably rubbish. > >Generally it seems to work. I can certainly get the DLM up with it now. >-- > >patrick > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >http://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Sat Sep 3 05:18:41 2005 From: teigland at redhat.com (David Teigland) Date: Sat, 3 Sep 2005 13:18:41 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901132104.2d643ccd.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> Message-ID: <20050903051841.GA13211@redhat.com> On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote: > Alan Cox wrote: > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > > possibly gain (or vice versa) > > > > > > - Relative merits of the two offerings > > > > You missed the important one - people actively use it and have been for > > some years. Same reason with have NTFS, HPFS, and all the others. On > > that alone it makes sense to include. > > Again, that's not a technical reason. It's _a_ reason, sure. But what are > the technical reasons for merging gfs[2], ocfs2, both or neither? > > If one can be grown to encompass the capabilities of the other then we're > left with a bunch of legacy code and wasted effort. GFS is an established fs, it's not going away, you'd be hard pressed to find a more widely used cluster fs on Linux. GFS is about 10 years old and has been in use by customers in production environments for about 5 years. It is a mature, stable file system with many features that have been technically refined over years of experience and customer/user feedback. The latest development cycle (GFS2) has focussed on improving performance, it's not a new file system -- the "2" indicates that it's not ondisk compatible with earlier versions. OCFS2 is a new file system. I expect they'll want to optimize for their own unique goals. When OCFS appeared everyone I know accepted it would coexist with GFS, each in their niche like every other fs. That's good, OCFS and GFS help each other technically even though they may eventually compete in some areas (which can also be good.) 
Dave Here's a random summary of technical features: - cluster infrastructure: a lot of work, perhaps as much as gfs itself, has gone into the infrastructure surrounding and supporting gfs - cluster infrastructure allows for easy cooperation with CLVM - interchangable lock/cluster modules: gfs interacts with the external infrastructure, including lock manager, through an interchangable module allowing the fs to be adapted to different environments. - a "nolock" module can be plugged in to use gfs as a local fs (can be selected at mount time, so any fs can be mounted locally) - quotas, acls, cluster flocks, direct io, data journaling, ordered/writeback journaling modes -- all supported - gfs transparently switches to a different locking scheme for direct io allowing parallel non-allocating writes with no lock contention - posix locks -- supported, although it's being reworked for better performance right now - asynchronous locking, lock prefetching + read-ahead - coherent shared-writeable memory mappings across the cluster - nfs3 support (multiple nfs servers exporting one gfs is very common) - extend fs online, add journals online - full fs quiesce to allow for block level snapshot below gfs - read-only mount - "specatator" mount (like ro but no journal allocated for the mount, no fencing needed for failed node that was mounted as specatator) - infrastructure in place for live ondisk inode migration, fs shrink - stuffed dinodes, small files are stored in the disk inode block - tunable (fuzzy) atime updates - fast, nondisruptive stat on files during non-allocating direct-io - fast, nondisruptive statfs (df) even during heavy fs usage - friendly handling of io errors: shut down fs and withdraw from cluster - largest GFS cluster deployed was around 200 nodes, most are much smaller - use many GFS file systems at once on a node and in a cluster - customers use GFS for: scientific apps, HA, NFS serving, database, others I'm sure - graphical management tools for gfs, clvm, and the cluster infrastruture exist and are improving quickly From phillips at istop.com Sat Sep 3 05:57:31 2005 From: phillips at istop.com (Daniel Phillips) Date: Sat, 3 Sep 2005 01:57:31 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: References: <20050901104620.GA22482@redhat.com> <20050901132104.2d643ccd.akpm@osdl.org> Message-ID: <200509030157.31581.phillips@istop.com> On Friday 02 September 2005 17:17, Andi Kleen wrote: > The only thing that should be probably resolved is a common API > for at least the clustered lock manager. Having multiple > incompatible user space APIs for that would be sad. The only current users of dlms are cluster filesystems. There are zero users of the userspace dlm api. Therefore, the (g)dlm userspace interface actually has nothing to do with the needs of gfs. It should be taken out the gfs patch and merged later, when or if user space applications emerge that need it. Maybe in the meantime it will be possible to come up with a userspace dlm api that isn't completely repulsive. Also, note that the only reason the two current dlms are in-kernel is because it supposedly cuts down on userspace-kernel communication with the cluster filesystems. Then why should a userspace application bother with a an awkward interface to an in-kernel dlm? This is obviously suboptimal. 
Why not have a userspace dlm for userspace apps, if indeed there are any userspace apps that would need to use dlm-style synchronization instead of more typical socket-based synchronization, or Posix locking, which is already exposed via a standard api? There is actually nothing wrong with having multiple, completely different dlms active at the same time. There is no urgent need to merge them into the one true dlm. It would be a lot better to let them evolve separately and pick the winner a year or two from now. Just think of the dlm as part of the cfs until then. What does have to be resolved is a common API for node management. It is not just cluster filesystems and their lock managers that have to interface to node management. Below the filesystem layer, cluster block devices and cluster volume management need to be coordinated by the same system, and above the filesystem layer, applications also need to be hooked into it. This work is, in a word, incomplete. Regards, Daniel From arjan at infradead.org Sat Sep 3 06:14:00 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Sat, 03 Sep 2005 08:14:00 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903051841.GA13211@redhat.com> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> Message-ID: <1125728040.3223.2.camel@laptopd505.fenrus.org> On Sat, 2005-09-03 at 13:18 +0800, David Teigland wrote: > On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote: > > Alan Cox wrote: > > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > > > possibly gain (or vice versa) > > > > > > > > - Relative merits of the two offerings > > > > > > You missed the important one - people actively use it and have been for > > > some years. Same reason with have NTFS, HPFS, and all the others. On > > > that alone it makes sense to include. > > > > Again, that's not a technical reason. It's _a_ reason, sure. But what are > > the technical reasons for merging gfs[2], ocfs2, both or neither? > > > > If one can be grown to encompass the capabilities of the other then we're > > left with a bunch of legacy code and wasted effort. > > GFS is an established fs, it's not going away, you'd be hard pressed to > find a more widely used cluster fs on Linux. GFS is about 10 years old > and has been in use by customers in production environments for about 5 > years. but you submitted GFS2 not GFS. From yanj at brainaire.com Sat Sep 3 05:22:28 2005 From: yanj at brainaire.com (yanj) Date: Sat, 3 Sep 2005 13:22:28 +0800 Subject: [Linux-cluster] Does GFS working STABELly on no-smp platform? Message-ID: <000e01c5b047$7d642990$2e00a8c0@yanzijie> Hi, all Does GFS working on no-smp platform? I could not find kernel-modules of GFS for no-smp system. I am trying to build up a GFS+iSCSI cluster based on NO-SMP machines. A two nodes GFS system is setup. However, it keeps crash (kernel panic), while I run IO test on both nodes for a while. I have tried the following combination of GFS on Redhat system.: 1. GFS version GFS-6.0.2.25 + Kernel 2.4.21-27.EL 2. GFS version GFS-6.0.2.20-1 + Kernel 2.4.21-32.EL All are not stable. Thanks, Jeffrey Yan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From phillips at istop.com Sat Sep 3 06:42:36 2005 From: phillips at istop.com (Daniel Phillips) Date: Sat, 3 Sep 2005 02:42:36 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903001628.GH21228@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <20050903001628.GH21228@ca-server1.us.oracle.com> Message-ID: <200509030242.36506.phillips@istop.com> On Friday 02 September 2005 20:16, Mark Fasheh wrote: > As far as userspace dlm apis go, dlmfs already abstracts away a large part > of the dlm interaction... Dumb question, why can't you use sysfs for this instead of rolling your own? Side note: you seem to have deleted all the 2.6.12-rc4 patches. Perhaps you forgot that there are dozens of lkml archives pointing at them? Regards, Daniel From wim.coekaerts at oracle.com Sat Sep 3 06:46:34 2005 From: wim.coekaerts at oracle.com (Wim Coekaerts) Date: Fri, 2 Sep 2005 23:46:34 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509030242.36506.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <20050903001628.GH21228@ca-server1.us.oracle.com> <200509030242.36506.phillips@istop.com> Message-ID: <20050903064633.GB4593@ca-server1.us.oracle.com> On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote: > On Friday 02 September 2005 20:16, Mark Fasheh wrote: > > As far as userspace dlm apis go, dlmfs already abstracts away a large part > > of the dlm interaction... > > Dumb question, why can't you use sysfs for this instead of rolling your own? because it's totally different. have a look at what it does. From wim.coekaerts at oracle.com Sat Sep 3 07:06:39 2005 From: wim.coekaerts at oracle.com (Wim Coekaerts) Date: Sat, 3 Sep 2005 00:06:39 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> Message-ID: <20050903070639.GC4593@ca-server1.us.oracle.com> On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote: > Andrew Morton writes: > > > > > Again, that's not a technical reason. It's _a_ reason, sure. But what are > > the technical reasons for merging gfs[2], ocfs2, both or neither? clusterfilesystems are very common, there are companies that had/have a whole business around it, veritas, polyserve, ex-sistina, thus now redhat, ibm, tons of companies out there sell this, big bucks. as someone said, it's different than nfs because for certian things there is less overhead but there are many other reasons, it makes it a lot easier to create a clustered nfs server so you create a cfs on a set of disks with a number of nodes and export that fs from all those, you can easily do loadbalancing for applications, you have a lot of infrastructure where people have invested in that allows for shared storage... for ocfs we have tons of production customers running many terabyte databases on a cfs. why ? because dealing with the raw disk froma number of nodes sucks. because nfs is pretty broken for a lot of stuff, there is no consistency across nodes when each machine nfs mounts a server partition. yes nfs can be used for things but cfs's are very useful for many things nfs just can't do. want a list ? companies building failover for services like to use things like this, it creates a non single point of failure kind of setup much more easily. 
and so on and so on, yes there are alternatives out there but the fact is that a lot of folks like to use it, have been using it for ages, and want to be using it. from an implementation point of view, as folks here have already said, we've tried our best to implement things as a real linux filesystem, no abstractions to have something generic, it's clean and as tight as can be for a lot of stuff. and compared to other cfs's it's pretty darned nice, however I think it's silly to have competition between ocfs2 and gfs2. they are different just like the ton of local filesystems are different and people like to use one over the other. david said gfs is popular and has been around, well, I can list you tons of folks that have been using our stuff 24/7 for years (for free) just as well. it's different. that's that. it'd be really nice if the mainline kernel had it/them included. it would be a good start to get more folks involved, and instead of years of talk on mailing lists that ends up in nothing, to actually end up with folks participating and contributing. From teigland at redhat.com Sat Sep 3 10:35:03 2005 From: teigland at redhat.com (David Teigland) Date: Sat, 3 Sep 2005 18:35:03 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125728040.3223.2.camel@laptopd505.fenrus.org> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> <1125728040.3223.2.camel@laptopd505.fenrus.org> Message-ID: <20050903103503.GB15239@redhat.com> On Sat, Sep 03, 2005 at 08:14:00AM +0200, Arjan van de Ven wrote: > On Sat, 2005-09-03 at 13:18 +0800, David Teigland wrote: > > On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote: > > > Alan Cox wrote: > > > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > > > > possibly gain (or vice versa) > > > > > > > > > > - Relative merits of the two offerings > > > > > > > > You missed the important one - people actively use it and have been for > > > > some years. Same reason with have NTFS, HPFS, and all the others. On > > > > that alone it makes sense to include. > > > > > > Again, that's not a technical reason. It's _a_ reason, sure. But what are > > > the technical reasons for merging gfs[2], ocfs2, both or neither? > > > > > > If one can be grown to encompass the capabilities of the other then we're > > > left with a bunch of legacy code and wasted effort. > > > > GFS is an established fs, it's not going away, you'd be hard pressed to > > find a more widely used cluster fs on Linux. GFS is about 10 years old > > and has been in use by customers in production environments for about 5 > > years. > > but you submitted GFS2 not GFS. Just a new version, not a big difference. The ondisk format changed a little, making it incompatible with the previous versions. We'd been holding out on the format change for a long time and thought now would be a sensible time to finally do it. This is also about timing things conveniently. Each GFS version coincides with a development cycle and we decided to wait for this version/cycle to move code upstream. So, we have a new version, a format change, and code upstream all together, but it's still the same GFS to us. As with _any_ new version (involving ondisk formats or not) we need to thoroughly test everything to fix the inevitable bugs and regressions that are introduced; there's nothing new or surprising about that.
About the name -- we need to support customers running both versions for a long time. The "2" was added to make that process a little easier and clearer for people, that's all. If the 2 is really distressing we could rip it off, but there seems to be as many file systems ending in digits than not these days... Dave From phillips at istop.com Sat Sep 3 20:56:02 2005 From: phillips at istop.com (Daniel Phillips) Date: Sat, 3 Sep 2005 16:56:02 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903103503.GB15239@redhat.com> References: <20050901104620.GA22482@redhat.com> <1125728040.3223.2.camel@laptopd505.fenrus.org> <20050903103503.GB15239@redhat.com> Message-ID: <200509031656.03418.phillips@istop.com> On Saturday 03 September 2005 06:35, David Teigland wrote: > Just a new version, not a big difference. The ondisk format changed a > little making it incompatible with the previous versions. We'd been > holding out on the format change for a long time and thought now would be > a sensible time to finally do it. What exactly was the format change, and for what purpose? From phillips at istop.com Sat Sep 3 22:21:26 2005 From: phillips at istop.com (Daniel Phillips) Date: Sat, 3 Sep 2005 18:21:26 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903064633.GB4593@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <200509030242.36506.phillips@istop.com> <20050903064633.GB4593@ca-server1.us.oracle.com> Message-ID: <200509031821.27070.phillips@istop.com> On Saturday 03 September 2005 02:46, Wim Coekaerts wrote: > On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote: > > On Friday 02 September 2005 20:16, Mark Fasheh wrote: > > > As far as userspace dlm apis go, dlmfs already abstracts away a large > > > part of the dlm interaction... > > > > Dumb question, why can't you use sysfs for this instead of rolling your > > own? > > because it's totally different. have a look at what it does. You create a dlm domain when a directory is created. You create a lock resource when a file of that name is opened. You lock the resource when the file is opened. You access the lvb by read/writing the file. Why doesn't that fit the configfs-nee-sysfs model? If it does, the payoff will be about 500 lines saved. This little dlm fs is very slick, but grossly inefficient. Maybe efficiency doesn't matter here since it is just your slow-path userspace tools taking these locks. Please do not even think of proposing this as a way to export a kernel-based dlm for general purpose use! Your userdlm.c file has some hidden gold in it. You have factored the dlm calls far more attractively than the bad old bazillion-parameter Vaxcluster legacy. You are almost in system call zone there. (But note my earlier comment on dlms in general: until there are dlm-based applications, merging a general-purpose dlm API is pointless and has nothing to do with getting your filesystem merged.) 
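(To make the slow-path-tools point concrete: as I understand the scheme, passing a value through the lvb is nothing but ordinary file io. The mount point and lock name below are invented and error handling is omitted -- this is only a sketch of the idea, not anybody's actual tool.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Invented path: a lock file inside a dlmfs-style domain directory. */
#define MYLOCK "/dlm/mydomain/mylock"

/* Writer: opening read-write takes the lock at the exclusive level, and
 * whatever we write becomes the lock value block. */
void publish(const char *val)
{
	int fd = open(MYLOCK, O_RDWR);

	if (fd < 0)
		return;
	write(fd, val, strlen(val));
	close(fd);			/* close drops the lock */
}

/* Reader: opening read-only takes a shared lock and reads back whatever
 * the last exclusive holder left in the lvb. */
ssize_t peek(char *buf, size_t len)
{
	int fd = open(MYLOCK, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, len);
	close(fd);
	return n;
}

int main(void)
{
	char buf[64] = "";

	publish("hello world");
	if (peek(buf, sizeof(buf) - 1) > 0)
		printf("%s\n", buf);
	return 0;
}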
Regards, Daniel From Joel.Becker at oracle.com Sun Sep 4 01:09:12 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sat, 3 Sep 2005 18:09:12 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509031821.27070.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509030242.36506.phillips@istop.com> <20050903064633.GB4593@ca-server1.us.oracle.com> <200509031821.27070.phillips@istop.com> Message-ID: <20050904010912.GJ8684@ca-server1.us.oracle.com> On Sat, Sep 03, 2005 at 06:21:26PM -0400, Daniel Phillips wrote: > that fit the configfs-nee-sysfs model? If it does, the payoff will be about > 500 lines saved. I'm still awaiting your merge of ext3 and reiserfs, because you can save probably 500 lines having a filesystem that can create reiser and ext3 files at the same time. Joel -- Life's Little Instruction Book #267 "Lie on your back and look at the stars." Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From akpm at osdl.org Sun Sep 4 01:32:41 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 3 Sep 2005 18:32:41 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904010912.GJ8684@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <200509030242.36506.phillips@istop.com> <20050903064633.GB4593@ca-server1.us.oracle.com> <200509031821.27070.phillips@istop.com> <20050904010912.GJ8684@ca-server1.us.oracle.com> Message-ID: <20050903183241.1acca6c9.akpm@osdl.org> Joel Becker wrote: > > On Sat, Sep 03, 2005 at 06:21:26PM -0400, Daniel Phillips wrote: > > that fit the configfs-nee-sysfs model? If it does, the payoff will be about > > 500 lines saved. > > I'm still awaiting your merge of ext3 and reiserfs, because you > can save probably 500 lines having a filesystem that can create reiser > and ext3 files at the same time. oy. Daniel is asking a legitimate question. If there's duplicated code in there then we should seek to either make the code multi-purpose or place the common or reusable parts into a library somewhere. If neither approach is applicable or practical for *every single function* then fine, please explain why. AFAIR that has not been done. From Joel.Becker at oracle.com Sun Sep 4 03:06:40 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sat, 3 Sep 2005 20:06:40 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903183241.1acca6c9.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <200509030242.36506.phillips@istop.com> <20050903064633.GB4593@ca-server1.us.oracle.com> <200509031821.27070.phillips@istop.com> <20050904010912.GJ8684@ca-server1.us.oracle.com> <20050903183241.1acca6c9.akpm@osdl.org> Message-ID: <20050904030640.GL8684@ca-server1.us.oracle.com> On Sat, Sep 03, 2005 at 06:32:41PM -0700, Andrew Morton wrote: > If there's duplicated code in there then we should seek to either make the > code multi-purpose or place the common or reusable parts into a library > somewhere. Regarding sysfs and configfs, that's a whole 'nother conversation. I've not yet come up with a function involved that is identical, but that's a response here for another email. Understanding that Daniel is talking about dlmfs, dlmfs is far more similar to devptsfs, tmpfs, and even sockfs and pipefs than it is to sysfs. I don't see him proposing that sockfs and devptsfs be folded into sysfs. dlmfs is *tiny*. The VFS interface is less than his claimed 500 lines of savings. 
The few VFS callbacks do nothing but call DLM functions. You'd have to replace this VFS glue with sysfs glue, and probably save very few lines of code. In addition, sysfs cannot support the dlmfs model. In dlmfs, mkdir(2) creates a directory representing a DLM domain and mknod(2) creates the user representation of a lock. sysfs doesn't support mkdir(2) or mknod(2) at all. More than mkdir() and mknod(), however, dlmfs uses open(2) to acquire locks from userspace. O_RDONLY acquires a shared read lock (PR in VMS parlance). O_RDWR gets an exclusive lock (X). O_NONBLOCK is a trylock. Here, dlmfs is using the VFS for complete lifetiming. A lock is released via close(2). If a process dies, close(2) happens. In other words, ->release() handles all the cleanup for normal and abnormal termination. sysfs does not allow hooking into ->open() or ->release(). So this model, and the inherent lifetiming that comes with it, cannot be used. If dlmfs was changed to use a less intuitive model that fits sysfs, all the handling of lifetimes and cleanup would have to be added. This would make it more complex, not less complex. It would give it a larger code size, not a smaller one. In the end, it would be harder to maintian, less intuitive to use, and larger. Joel -- "Anything that is too stupid to be spoken is sung." - Voltaire Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From phillips at istop.com Sun Sep 4 04:22:36 2005 From: phillips at istop.com (Daniel Phillips) Date: Sun, 4 Sep 2005 00:22:36 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904030640.GL8684@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> Message-ID: <200509040022.37102.phillips@istop.com> On Saturday 03 September 2005 23:06, Joel Becker wrote: > dlmfs is *tiny*. The VFS interface is less than his claimed 500 > lines of savings. It is 640 lines. > The few VFS callbacks do nothing but call DLM > functions. You'd have to replace this VFS glue with sysfs glue, and > probably save very few lines of code. > In addition, sysfs cannot support the dlmfs model. In dlmfs, > mkdir(2) creates a directory representing a DLM domain and mknod(2) > creates the user representation of a lock. sysfs doesn't support > mkdir(2) or mknod(2) at all. I said "configfs" in the email to which you are replying. > More than mkdir() and mknod(), however, dlmfs uses open(2) to > acquire locks from userspace. O_RDONLY acquires a shared read lock (PR > in VMS parlance). O_RDWR gets an exclusive lock (X). O_NONBLOCK is a > trylock. Here, dlmfs is using the VFS for complete lifetiming. A lock > is released via close(2). If a process dies, close(2) happens. In > other words, ->release() handles all the cleanup for normal and abnormal > termination. > > sysfs does not allow hooking into ->open() or ->release(). So > this model, and the inherent lifetiming that comes with it, cannot be > used. Configfs has a per-item release method. Configfs has a group open method. What is it that configfs can't do, or can't be made to do trivially? > If dlmfs was changed to use a less intuitive model that fits > sysfs, all the handling of lifetimes and cleanup would have to be added. The model you came up with for dlmfs is beyond cute, it's downright clever. Why mar that achievement by then failing to capitalize on the framework you already have in configfs? 
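Spelled out, the way I read your description, grabbing a lock from a tool is just this -- the lock name is invented and error handling is trimmed to the bone, so treat it as a sketch rather than the real interface:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MYLOCK "/dlm/mydomain/mylock"	/* invented path */

int main(void)
{
	int fd;

	/* O_RDWR asks for the exclusive level, O_RDONLY for shared read.
	 * Without O_NONBLOCK the open simply sleeps until the lock is
	 * granted; with it, a busy lock fails immediately instead. */
	fd = open(MYLOCK, O_RDWR | O_NONBLOCK);
	if (fd < 0) {
		fprintf(stderr, "trylock failed: %s\n", strerror(errno));
		return 1;
	}

	/* ... critical section ... */

	/* close() releases the lock.  If this process is killed first,
	 * ->release() runs anyway, so the lock cannot leak. */
	close(fd);
	return 0;
}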
By the way, do you agree that dlmfs is too inefficient to be an effective way of exporting your dlm api to user space, except for slow-path applications like you have here? Regards, Daniel From Joel.Becker at oracle.com Sun Sep 4 04:30:00 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sat, 3 Sep 2005 21:30:00 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509040022.37102.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> Message-ID: <20050904043000.GQ8684@ca-server1.us.oracle.com> On Sun, Sep 04, 2005 at 12:22:36AM -0400, Daniel Phillips wrote: > It is 640 lines. It's 450 without comments and blank lines. Please, don't tell me that comments to help understanding are bloat. > I said "configfs" in the email to which you are replying. To wit: > Daniel Phillips said: > > Mark Fasheh said: > > > as far as userspace dlm apis go, dlmfs already abstracts away a > > > large > > > part of the dlm interaction... > > > > Dumb question, why can't you use sysfs for this instead of rolling > > your > > own? You asked why dlmfs can't go into sysfs, and I responded. Joel -- "I don't want to achieve immortality through my work; I want to achieve immortality through not dying." - Woody Allen Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From phillips at istop.com Sun Sep 4 04:51:10 2005 From: phillips at istop.com (Daniel Phillips) Date: Sun, 4 Sep 2005 00:51:10 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904043000.GQ8684@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050904043000.GQ8684@ca-server1.us.oracle.com> Message-ID: <200509040051.11095.phillips@istop.com> On Sunday 04 September 2005 00:30, Joel Becker wrote: > You asked why dlmfs can't go into sysfs, and I responded. And you got me! In the heat of the moment I overlooked the fact that you and Greg haven't agreed to the merge yet ;-) Clearly, I ought to have asked why dlmfs can't be done by configfs. It is the same paradigm: drive the kernel logic from user-initiated vfs methods. You already have nearly all the right methods in nearly all the right places. Regards, Daniel From akpm at osdl.org Sun Sep 4 04:46:53 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 3 Sep 2005 21:46:53 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509040022.37102.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> Message-ID: <20050903214653.1b8a8cb7.akpm@osdl.org> Daniel Phillips wrote: > > The model you came up with for dlmfs is beyond cute, it's downright clever. Actually I think it's rather sick. Taking O_NONBLOCK and making it a lock-manager trylock because they're kinda-sorta-similar-sounding? Spare me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to acquire a clustered filesystem lock". Not even close. It would be much better to do something which explicitly and directly expresses what you're trying to do rather than this strange "lets do this because the names sound the same" thing. What happens when we want to add some new primitive which has no posix-file analog? Waaaay too cute. Oh well, whatever. 
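To illustrate what "explicitly and directly" might look like: something of this shape, where the operation gets a name of its own instead of being inferred from open() flags. The syscall below is entirely made up -- nothing like it exists today -- it's only here to show the difference.

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical, for illustration only: no such syscall exists. */
extern int sys_dlm_trylock(int fd);

int grab_lock(const char *resource)
{
	int fd = open(resource, O_RDWR);

	if (fd < 0)
		return -1;
	/* The trylock is spelled out rather than smuggled in via
	 * O_NONBLOCK, and new primitives can get calls of their own. */
	if (sys_dlm_trylock(fd) < 0) {
		close(fd);
		return -1;
	}
	return fd;	/* releasing on close/exit could still work as before */
}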
From Joel.Becker at oracle.com Sun Sep 4 04:58:21 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sat, 3 Sep 2005 21:58:21 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> Message-ID: <20050904045821.GT8684@ca-server1.us.oracle.com> On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote: > It would be much better to do something which explicitly and directly > expresses what you're trying to do rather than this strange "lets do this > because the names sound the same" thing. So, you'd like a new flag name? That can be done. > What happens when we want to add some new primitive which has no posix-file > analog? The point of dlmfs is not to express every primitive that the DLM has. dlmfs cannot express the CR, CW, and PW levels of the VMS locking scheme. Nor should it. The point isn't to use a filesystem interface for programs that need all the flexibility and power of the VMS DLM. The point is a simple system that programs needing the basic operations can use. Even shell scripts. Joel -- "You must remember this: A kiss is just a kiss, A sigh is just a sigh. The fundamental rules apply As time goes by." Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From Joel.Becker at oracle.com Sun Sep 4 05:00:26 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sat, 3 Sep 2005 22:00:26 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509040051.11095.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050904043000.GQ8684@ca-server1.us.oracle.com> <200509040051.11095.phillips@istop.com> Message-ID: <20050904050026.GU8684@ca-server1.us.oracle.com> On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote: > Clearly, I ought to have asked why dlmfs can't be done by configfs. It is the > same paradigm: drive the kernel logic from user-initiated vfs methods. You > already have nearly all the right methods in nearly all the right places. configfs, like sysfs, does not support ->open() or ->release() callbacks. And it shouldn't. The point is to hide the complexity and make it easier to plug into. A client object should not ever have to know or care that it is being controlled by a filesystem. It only knows that it has a tree of items with attributes that can be set or shown. Joel -- "In a crisis, don't hide behind anything or anybody. They're going to find you anyway." - Paul "Bear" Bryant Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From akpm at osdl.org Sun Sep 4 05:41:40 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 3 Sep 2005 22:41:40 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904045821.GT8684@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> Message-ID: <20050903224140.0442fac4.akpm@osdl.org> Joel Becker wrote: > > > What happens when we want to add some new primitive which has no posix-file > > analog? 
> > The point of dlmfs is not to express every primitive that the > DLM has. dlmfs cannot express the CR, CW, and PW levels of the VMS > locking scheme. Nor should it. The point isn't to use a filesystem > interface for programs that need all the flexibility and power of the > VMS DLM. The point is a simple system that programs needing the basic > operations can use. Even shell scripts. Are you saying that the posix-file lookalike interface provides access to part of the functionality, but there are other APIs which are used to access the rest of the functionality? If so, what is that interface, and why cannot that interface offer access to 100% of the functionality, thus making the posix-file tricks unnecessary? From Joel.Becker at oracle.com Sun Sep 4 05:49:37 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sat, 3 Sep 2005 22:49:37 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903224140.0442fac4.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> Message-ID: <20050904054936.GW8684@ca-server1.us.oracle.com> On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote: > Are you saying that the posix-file lookalike interface provides access to > part of the functionality, but there are other APIs which are used to > access the rest of the functionality? If so, what is that interface, and > why cannot that interface offer access to 100% of the functionality, thus > making the posix-file tricks unnecessary? Currently, this is all the interface that the OCFS2 DLM provides. But yes, if you wanted to provide the rest of the VMS functionality (something that GFS2's DLM does), you'd need to use a more concrete interface. IMHO, it's worthwhile to have a simple interface, one already used by mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc. This is an interface that can and is used by shell scripts even (we do this to test the DLM). If you make it a C-library-only interface, you've just restricted the subset of folks that can use it, while adding programming complexity. I think that a simple fs-based interface can coexist with a more complex one. FILE* doesn't give you the flexibility of read()/write(), but I wouldn't remove it :-) Joel -- "In the beginning, the universe was created. This has made a lot of people very angry, and is generally considered to have been a bad move." - Douglas Adams Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From phillips at istop.com Sun Sep 4 05:52:29 2005 From: phillips at istop.com (Daniel Phillips) Date: Sun, 4 Sep 2005 01:52:29 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904050026.GU8684@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <200509040051.11095.phillips@istop.com> <20050904050026.GU8684@ca-server1.us.oracle.com> Message-ID: <200509040152.30027.phillips@istop.com> On Sunday 04 September 2005 01:00, Joel Becker wrote: > On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote: > > Clearly, I ought to have asked why dlmfs can't be done by configfs. It > > is the same paradigm: drive the kernel logic from user-initiated vfs > > methods. You already have nearly all the right methods in nearly all the > > right places. 
> > configfs, like sysfs, does not support ->open() or ->release() > callbacks. struct configfs_item_operations { void (*release)(struct config_item *); ssize_t (*show)(struct config_item *, struct attribute *,char *); ssize_t (*store)(struct config_item *,struct attribute *,const char *, size_t); int (*allow_link)(struct config_item *src, struct config_item *target); int (*drop_link)(struct config_item *src, struct config_item *target); }; struct configfs_group_operations { struct config_item *(*make_item)(struct config_group *group, const char *name); struct config_group *(*make_group)(struct config_group *group, const char *name); int (*commit_item)(struct config_item *item); void (*drop_item)(struct config_group *group, struct config_item *item); }; You do have ->release and ->make_item/group. If I may hand you a more substantive argument: you don't support user-driven creation of files in configfs, only directories. Dlmfs supports user-created files. But you know, there isn't actually a good reason not to support user-created files in configfs, as dlmfs demonstrates. Anyway, goodnight. Regards, Daniel From Joel.Becker at oracle.com Sun Sep 4 05:56:51 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sat, 3 Sep 2005 22:56:51 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509040152.30027.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509040051.11095.phillips@istop.com> <20050904050026.GU8684@ca-server1.us.oracle.com> <200509040152.30027.phillips@istop.com> Message-ID: <20050904055650.GX8684@ca-server1.us.oracle.com> On Sun, Sep 04, 2005 at 01:52:29AM -0400, Daniel Phillips wrote: > You do have ->release and ->make_item/group. ->release is like kobject release. It's a free callback, not a callback from close. > If I may hand you a more substantive argument: you don't support user-driven > creation of files in configfs, only directories. Dlmfs supports user-created > files. But you know, there isn't actually a good reason not to support > user-created files in configfs, as dlmfs demonstrates. It is outside the domain of configfs. Just because it can be done does not mean it should be. configfs isn't a "thing to create files". It's an interface to creating kernel items. The actual filesystem representation isn't the end, it's just the means. Joel -- "In the room the women come and go Talking of Michaelangelo." Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From mark.fasheh at oracle.com Sun Sep 4 06:10:45 2005 From: mark.fasheh at oracle.com (Mark Fasheh) Date: Sat, 3 Sep 2005 23:10:45 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> Message-ID: <20050904061045.GI21228@ca-server1.us.oracle.com> On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote: > Actually I think it's rather sick. Taking O_NONBLOCK and making it a > lock-manager trylock because they're kinda-sorta-similar-sounding? Spare > me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to > acquire a clustered filesystem lock". Not even close. What would be an acceptable replacement? 
I admit that O_NONBLOCK -> trylock is a bit unfortunate, but really it just needs a bit to express that - nobody over here cares what it's called. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh at oracle.com From phillips at istop.com Sun Sep 4 06:40:08 2005 From: phillips at istop.com (Daniel Phillips) Date: Sun, 4 Sep 2005 02:40:08 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> Message-ID: <200509040240.08467.phillips@istop.com> On Sunday 04 September 2005 00:46, Andrew Morton wrote: > Daniel Phillips wrote: > > The model you came up with for dlmfs is beyond cute, it's downright > > clever. > > Actually I think it's rather sick. Taking O_NONBLOCK and making it a > lock-manager trylock because they're kinda-sorta-similar-sounding? Spare > me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to > acquire a clustered filesystem lock". Not even close. Now, I see the ocfs2 guys are all ready to back down on this one, but I will at least argue weakly in favor. Sick is a nice word for it, but it is actually not that far off. Normally, this fs will acquire a lock whenever the user creates a virtual file and the create will block until the global lock arrives. With O_NONBLOCK, it will return, erm... ETXTBSY (!) immediately. Is that not what O_NONBLOCK is supposed to accomplish? > It would be much better to do something which explicitly and directly > expresses what you're trying to do rather than this strange "lets do this > because the names sound the same" thing. > > What happens when we want to add some new primitive which has no posix-file > analog? > > Waaaay too cute. Oh well, whatever. The explicit way is syscalls or a set of ioctls, which he already has the makings of. If there is going to be a userspace api, I would hope it looks more like the contents of userdlm.c than the traditional Vaxcluster API, which sucks beyond belief. Another explicit way is to do it with a whole set of virtual attributes instead of just a single file trying to capture the whole model. That is really unappealing, but I am afraid that is exactly what a whole lot of sysfs/configfs usage is going to end up looking like. But more to the point: we have no urgent need for a userspace dlm api at the moment. Nothing will break if we just put that issue off for a few months, quite the contrary. If the only user is their tools I would say let it go ahead and be cute, even sickeningly so. It is not supposed to be a general dlm api, at least that is my understanding. It is just supposed to be an interface for their tools. Of course it would help to know exactly how those tools use it. Too sleepy to find out tonight... Regards, Daniel From hzhong at gmail.com Sun Sep 4 07:12:52 2005 From: hzhong at gmail.com (Hua Zhong) Date: Sun, 4 Sep 2005 00:12:52 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> Message-ID: <924c28830509040012ce7a0ce@mail.gmail.com> On 9/3/05, Andrew Morton wrote: > > Daniel Phillips wrote: > > > > The model you came up with for dlmfs is beyond cute, it's downright > clever. 
> > Actually I think it's rather sick. Taking O_NONBLOCK and making it a > lock-manager trylock because they're kinda-sorta-similar-sounding? Spare > me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to > acquire a clustered filesystem lock". Not even close. No, it's "open this file in nonblocking mode" vs "attempt to acquire a lock in nonblocking mode". I think it makes perfect sense to use this flag. Of course, whether or not to use open as a means to acquire a lock (in either blocking or nonblocking mode) is efficient is another matter. -------------- next part -------------- An HTML attachment was scrubbed... URL: From akpm at osdl.org Sun Sep 4 07:23:43 2005 From: akpm at osdl.org (Andrew Morton) Date: Sun, 4 Sep 2005 00:23:43 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904061045.GI21228@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904061045.GI21228@ca-server1.us.oracle.com> Message-ID: <20050904002343.079daa85.akpm@osdl.org> Mark Fasheh wrote: > > On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote: > > Actually I think it's rather sick. Taking O_NONBLOCK and making it a > > lock-manager trylock because they're kinda-sorta-similar-sounding? Spare > > me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to > > acquire a clustered filesystem lock". Not even close. > > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock > is a bit unfortunate, but really it just needs a bit to express that - > nobody over here cares what it's called. The whole idea of reinterpreting file operations to mean something utterly different just seems inappropriate to me. You get a lot of goodies when using a filesystem - the ability for unrelated processes to look things up, resource release on exit(), etc. If those features are valuable in the ocfs2 context then fine. But I'd have thought that it would be saner and more extensible to add new syscalls (perhaps taking fd's) rather than overloading the open() mode in this manner. From akpm at osdl.org Sun Sep 4 07:28:28 2005 From: akpm at osdl.org (Andrew Morton) Date: Sun, 4 Sep 2005 00:28:28 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509040240.08467.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <200509040240.08467.phillips@istop.com> Message-ID: <20050904002828.3d26f64c.akpm@osdl.org> Daniel Phillips wrote: > > If the only user is their tools I would say let it go ahead and be cute, even > sickeningly so. It is not supposed to be a general dlm api, at least that is > my understanding. It is just supposed to be an interface for their tools. > Of course it would help to know exactly how those tools use it. Well I'm not saying "don't do this". I'm saying "eww" and "why?". If there is already a richer interface into all this code (such as a syscall one) and it's feasible to migrate the open() tricksies to that API in the future if it all comes unstuck then OK. That's why I asked (thus far unsuccessfully): Are you saying that the posix-file lookalike interface provides access to part of the functionality, but there are other APIs which are used to access the rest of the functionality? 
If so, what is that interface, and why cannot that interface offer access to 100% of the functionality, thus making the posix-file tricks unnecessary? From Joel.Becker at oracle.com Sun Sep 4 08:01:02 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sun, 4 Sep 2005 01:01:02 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904002828.3d26f64c.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <200509040240.08467.phillips@istop.com> <20050904002828.3d26f64c.akpm@osdl.org> Message-ID: <20050904080102.GY8684@ca-server1.us.oracle.com> On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote: > If there is already a richer interface into all this code (such as a > syscall one) and it's feasible to migrate the open() tricksies to that API > in the future if it all comes unstuck then OK. > That's why I asked (thus far unsuccessfully): I personally was under the impression that "syscalls are not to be added". I'm also wary of the effort required to hook into process exit. Not to mention all the lifetiming that has to be written again. On top of that, we lose our cute ability to shell script it. We find this very useful in testing, and think others would in practice. > Are you saying that the posix-file lookalike interface provides > access to part of the functionality, but there are other APIs which are > used to access the rest of the functionality? If so, what is that > interface, and why cannot that interface offer access to 100% of the > functionality, thus making the posix-file tricks unnecessary? I thought I stated this in my other email. We're not intending to extend dlmfs. It pretty much covers the simple DLM usage required of a simple interface. The OCFS2 DLM does not provide any other functionality. If the OCFS2 DLM grew more functionality, or you consider the GFS2 DLM that already has it (and a less intuitive interface via sysfs IIRC), I would contend that dlmfs still has a place. It's simple to use and understand, and it's usable from shell scripts and other simple code. Joel -- "The first thing we do, let's kill all the lawyers." -Henry VI, IV:ii Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From mark.fasheh at oracle.com Sun Sep 4 08:17:48 2005 From: mark.fasheh at oracle.com (Mark Fasheh) Date: Sun, 4 Sep 2005 01:17:48 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904002343.079daa85.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904061045.GI21228@ca-server1.us.oracle.com> <20050904002343.079daa85.akpm@osdl.org> Message-ID: <20050904081748.GJ21228@ca-server1.us.oracle.com> On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote: > > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock > > is a bit unfortunate, but really it just needs a bit to express that - > > nobody over here cares what it's called. > > The whole idea of reinterpreting file operations to mean something utterly > different just seems inappropriate to me. Putting aside trylock for a minute, I'm not sure how utterly different the operations are. You create a lock resource by creating a file named after it. 
You get a lock (fd) at read or write level on the resource by calling open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR). Now that we've got an fd, lock value blocks are naturally represented as file data which can be read(2) or written(2). Close(2) drops the lock. A really trivial usage example from shell: node1$ echo "hello world" > mylock node2$ cat mylock hello world I could always give a more useful one after I get some sleep :) > You get a lot of goodies when using a filesystem - the ability for > unrelated processes to look things up, resource release on exit(), etc. If > those features are valuable in the ocfs2 context then fine. Right, they certainly are and I think Joel, in another e-mail on this thread, explained well the advantages of using a filesystem. > But I'd have thought that it would be saner and more extensible to add new > syscalls (perhaps taking fd's) rather than overloading the open() mode in > this manner. The idea behind dlmfs was to very simply export a small set of cluster dlm operations to userspace. Given that goal, I felt that a whole set of system calls would have been overkill. That said, I think perhaps I should clarify that I don't intend dlmfs to become _the_ userspace dlm api, just a simple and (imho) intuitive one which could be trivially accessed from any software which just knows how to read and write files. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh at oracle.com From akpm at osdl.org Sun Sep 4 08:18:05 2005 From: akpm at osdl.org (Andrew Morton) Date: Sun, 4 Sep 2005 01:18:05 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904080102.GY8684@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <200509040240.08467.phillips@istop.com> <20050904002828.3d26f64c.akpm@osdl.org> <20050904080102.GY8684@ca-server1.us.oracle.com> Message-ID: <20050904011805.68df8dde.akpm@osdl.org> Joel Becker wrote: > > On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote: > > If there is already a richer interface into all this code (such as a > > syscall one) and it's feasible to migrate the open() tricksies to that API > > in the future if it all comes unstuck then OK. > > That's why I asked (thus far unsuccessfully): > > I personally was under the impression that "syscalls are not > to be added". We add syscalls all the time. Whichever user<->kernel API is considered to be most appropriate, use it. > I'm also wary of the effort required to hook into process > exit. I'm not questioning the use of a filesystem. I'm questioning this overloading of normal filesystem system calls. For example (and this is just an example! there's also mknod, mkdir, O_RDWR, O_EXCL...) it would be more usual to do fd = open("/sys/whatever", ...); err = sys_dlm_trylock(fd); I guess your current implementation prevents /sys/whatever from ever appearing if the trylock failed. Dunno if that's valuable. > Not to mention all the lifetiming that has to be written again. > On top of that, we lose our cute ability to shell script it. We > find this very useful in testing, and think others would in practice. > > > Are you saying that the posix-file lookalike interface provides > > access to part of the functionality, but there are other APIs which are > > used to access the rest of the functionality? 
If so, what is that > > interface, and why cannot that interface offer access to 100% of the > > functionality, thus making the posix-file tricks unnecessary? > > I thought I stated this in my other email. We're not intending > to extend dlmfs. Famous last words ;) > It pretty much covers the simple DLM usage required of > a simple interface. The OCFS2 DLM does not provide any other > functionality. > If the OCFS2 DLM grew more functionality, or you consider the > GFS2 DLM that already has it (and a less intuitive interface via sysfs > IIRC), I would contend that dlmfs still has a place. It's simple to use > and understand, and it's usable from shell scripts and other simple > code. (wonders how to do O_NONBLOCK from a script) I don't buy the general "fs is nice because we can script it" argument, really. You can just write a few simple applications which provide access to the syscalls (or the fs!) and then write scripts around those. Yes, you suddenly need to get a little tarball into users' hands and that's a hassle. And I sometimes think we let this hassle guide kernel interfaces (mutters something about /sbin/hotplug), and that's sad. From akpm at osdl.org Sun Sep 4 08:37:04 2005 From: akpm at osdl.org (Andrew Morton) Date: Sun, 4 Sep 2005 01:37:04 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904081748.GJ21228@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904061045.GI21228@ca-server1.us.oracle.com> <20050904002343.079daa85.akpm@osdl.org> <20050904081748.GJ21228@ca-server1.us.oracle.com> Message-ID: <20050904013704.55c2d9f5.akpm@osdl.org> Mark Fasheh wrote: > > On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote: > > > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock > > > is a bit unfortunate, but really it just needs a bit to express that - > > > nobody over here cares what it's called. > > > > The whole idea of reinterpreting file operations to mean something utterly > > different just seems inappropriate to me. > Putting aside trylock for a minute, I'm not sure how utterly different the > operations are. You create a lock resource by creating a file named after > it. You get a lock (fd) at read or write level on the resource by calling > open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR). > Now that we've got an fd, lock value blocks are naturally represented as > file data which can be read(2) or written(2). > Close(2) drops the lock. > > A really trivial usage example from shell: > > node1$ echo "hello world" > mylock > node2$ cat mylock > hello world > > I could always give a more useful one after I get some sleep :) It isn't extensible though. One couldn't retain this approach while adding (random cfs ignorance exposure) upgrade-read, downgrade-write, query-for-various-runtime-stats, priority modification, whatever. > > You get a lot of goodies when using a filesystem - the ability for > > unrelated processes to look things up, resource release on exit(), etc. If > > those features are valuable in the ocfs2 context then fine. > Right, they certainly are and I think Joel, in another e-mail on this > thread, explained well the advantages of using a filesystem. 
> > > But I'd have thought that it would be saner and more extensible to add new > > syscalls (perhaps taking fd's) rather than overloading the open() mode in > > this manner. > The idea behind dlmfs was to very simply export a small set of cluster dlm > operations to userspace. Given that goal, I felt that a whole set of system > calls would have been overkill. That said, I think perhaps I should clarify > that I don't intend dlmfs to become _the_ userspace dlm api, just a simple > and (imho) intuitive one which could be trivially accessed from any software > which just knows how to read and write files. Well, as I say. Making it a filesystem is superficially attractive, but once you've build a super-dooper enterprise-grade infrastructure on top of it all, nobody's going to touch the fs interface by hand and you end up wondering why it's there, adding baggage. Not that I'm questioning the fs interface! It has useful permission management, monitoring and resource releasing characteristics. I'm questioning the open() tricks. I guess from Joel's tiny description, the filesystem's interpretation of mknod and mkdir look sensible enough. From Joel.Becker at oracle.com Sun Sep 4 09:11:18 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sun, 4 Sep 2005 02:11:18 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904011805.68df8dde.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <200509040240.08467.phillips@istop.com> <20050904002828.3d26f64c.akpm@osdl.org> <20050904080102.GY8684@ca-server1.us.oracle.com> <20050904011805.68df8dde.akpm@osdl.org> Message-ID: <20050904091118.GZ8684@ca-server1.us.oracle.com> On Sun, Sep 04, 2005 at 01:18:05AM -0700, Andrew Morton wrote: > > I thought I stated this in my other email. We're not intending > > to extend dlmfs. > > Famous last words ;) Heh, of course :-) > I don't buy the general "fs is nice because we can script it" argument, > really. You can just write a few simple applications which provide access > to the syscalls (or the fs!) and then write scripts around those. I can't see how that works easily. I'm not worried about a tarball (eventually Red Hat and SuSE and Debian would have it). I'm thinking about this shell: exec 7 References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <200509040240.08467.phillips@istop.com> <20050904002828.3d26f64c.akpm@osdl.org> <20050904080102.GY8684@ca-server1.us.oracle.com> <20050904011805.68df8dde.akpm@osdl.org> <20050904091118.GZ8684@ca-server1.us.oracle.com> Message-ID: <20050904021836.4d4560a5.akpm@osdl.org> Joel Becker wrote: > > I can't see how that works easily. I'm not worried about a > tarball (eventually Red Hat and SuSE and Debian would have it). I'm > thinking about this shell: > > exec 7 do stuff > exec 7 > If someone kills the shell while stuff is doing, the lock is unlocked > because fd 7 is closed. However, if you have an application to do the > locking: > > takelock domainxxx lock1 > do sutff > droplock domainxxx lock1 > > When someone kills the shell, the lock is leaked, becuase droplock isn't > called. And SEGV/QUIT/-9 (especially -9, folks love it too much) are > handled by the first example but not by the second. 
take-and-drop-lock -d domainxxx -l lock1 -e "do stuff" From Joel.Becker at oracle.com Sun Sep 4 09:39:10 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sun, 4 Sep 2005 02:39:10 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904021836.4d4560a5.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <200509040240.08467.phillips@istop.com> <20050904002828.3d26f64c.akpm@osdl.org> <20050904080102.GY8684@ca-server1.us.oracle.com> <20050904011805.68df8dde.akpm@osdl.org> <20050904091118.GZ8684@ca-server1.us.oracle.com> <20050904021836.4d4560a5.akpm@osdl.org> Message-ID: <20050904093910.GA8684@ca-server1.us.oracle.com> On Sun, Sep 04, 2005 at 02:18:36AM -0700, Andrew Morton wrote: > take-and-drop-lock -d domainxxx -l lock1 -e "do stuff" Ahh, but then you have to have lots of scripts somewhere in path, or do massive inline scripts. especially if you want to take another lock in there somewhere. It's doable, but it's nowhere near as easy. :-) Joel -- "I always thought the hardest questions were those I could not answer. Now I know they are the ones I can never ask." - Charlie Watkins Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From hzhong at gmail.com Sun Sep 4 18:03:21 2005 From: hzhong at gmail.com (Hua Zhong) Date: Sun, 4 Sep 2005 11:03:21 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904091118.GZ8684@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <200509040240.08467.phillips@istop.com> <20050904002828.3d26f64c.akpm@osdl.org> <20050904080102.GY8684@ca-server1.us.oracle.com> <20050904011805.68df8dde.akpm@osdl.org> <20050904091118.GZ8684@ca-server1.us.oracle.com> Message-ID: <924c288305090411038aa80f8@mail.gmail.com> > takelock domainxxx lock1 > do sutff > droplock domainxxx lock1 > > When someone kills the shell, the lock is leaked, becuase droplock isn't > called. Why not open the lock resource (or the lock space) instead of individual locks as file? It then looks like this: open lock space file takelock lockresource lock1 do stuff droplock lockresource lock1 close lock space file Then if you are killed the ->release of lock space file should take care of cleaning up all the locks From phillips at istop.com Sun Sep 4 19:51:56 2005 From: phillips at istop.com (Daniel Phillips) Date: Sun, 4 Sep 2005 15:51:56 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904002828.3d26f64c.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <200509040240.08467.phillips@istop.com> <20050904002828.3d26f64c.akpm@osdl.org> Message-ID: <200509041551.56614.phillips@istop.com> On Sunday 04 September 2005 03:28, Andrew Morton wrote: > If there is already a richer interface into all this code (such as a > syscall one) and it's feasible to migrate the open() tricksies to that API > in the future if it all comes unstuck then OK. That's why I asked (thus > far unsuccessfully): > > Are you saying that the posix-file lookalike interface provides > access to part of the functionality, but there are other APIs which are > used to access the rest of the functionality? If so, what is that > interface, and why cannot that interface offer access to 100% of the > functionality, thus making the posix-file tricks unnecessary? 
There is no such interface at the moment, nor is one needed in the immediate future. Let's look at the arguments for exporting a dlm to userspace: 1) Since we already have a dlm in kernel, why not just export that and save 100K of userspace library? Answer: because we don't want userspace-only dlm features bulking up the kernel. Answer #2: the extra syscalls and interface baggage serve no useful purpose. 2) But we need to take locks in the same lockspaces as the kernel dlm(s)! Answer: only support tools need to do that. A cut-down locking api is entirely appropriate for this. 3) But the kernel dlm is the only one we have! Answer: easily fixed, a simple matter of coding. But please bear in mind that dlm-style synchronization is probably a bad idea for most cluster applications, particularly ones that already do their synchronization via sockets. In other words, exporting the full dlm api is a red herring. It has nothing to do with getting cluster filesystems up and running. It is really just marketing: it sounds like a great thing for userspace to get a dlm "for free", but it isn't free, it contributes to kernel bloat and it isn't even the most efficient way to do it. If after considering that, we _still_ want to export a dlm api from kernel, then can we please take the necessary time and get it right? The full api requires not only syscall-style elements, but asynchronous events as well, similar to aio. I do not think anybody has a good answer to this today, nor do we even need it to begin porting applications to cluster filesystems. Oracle guys: what is the distributed locking API for RAC? Is the RAC team waiting with bated breath to adopt your kernel-based dlm? If not, why not? Regards, Daniel From pavel at ucw.cz Sun Sep 4 20:33:44 2005 From: pavel at ucw.cz (Pavel Machek) Date: Sun, 4 Sep 2005 22:33:44 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903051841.GA13211@redhat.com> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> Message-ID: <20050904203344.GA1987@elf.ucw.cz> Hi! > - read-only mount > - "specatator" mount (like ro but no journal allocated for the mount, > no fencing needed for failed node that was mounted as specatator) I'd call it "real-read-only", and yes, that's very usefull mount. Could we get it for ext3, too? Pavel -- if you have sharp zaurus hardware you don't need... you know my address From Joel.Becker at oracle.com Sun Sep 4 22:18:20 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Sun, 4 Sep 2005 15:18:20 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904203344.GA1987@elf.ucw.cz> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> <20050904203344.GA1987@elf.ucw.cz> Message-ID: <20050904221820.GB8684@ca-server1.us.oracle.com> On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote: > > - read-only mount > > - "specatator" mount (like ro but no journal allocated for the mount, > > no fencing needed for failed node that was mounted as specatator) > > I'd call it "real-read-only", and yes, that's very usefull > mount. Could we get it for ext3, too? In OCFS2 we call readonly+journal+connected-to-cluster "soft readonly". 
We're a live node, other nodes know we exist, and we can flush pending transactions during the rw->ro transition. In addition, we can allow a ro->rw transition. The no-journal+no-cluster-connection mode we call "hard readonly". This is the mode you get when a device itself is readonly, because you can't do *anything*. Joel -- "Lately I've been talking in my sleep. Can't imagine what I'd have to say. Except my world will be right When love comes back my way." Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From teigland at redhat.com Mon Sep 5 03:47:39 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 5 Sep 2005 11:47:39 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903052821.GA23711@kroah.com> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> <20050902094403.GD16595@redhat.com> <20050903052821.GA23711@kroah.com> Message-ID: <20050905034739.GA11337@redhat.com> On Fri, Sep 02, 2005 at 10:28:21PM -0700, Greg KH wrote: > On Fri, Sep 02, 2005 at 05:44:03PM +0800, David Teigland wrote: > > On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote: > > > > > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,); > > > > > what is gfs2_assert() about anyway? please just use BUG_ON directly > > > everywhere > > > > When a machine has many gfs file systems mounted at once it can be useful > > to know which one failed. Does the following look ok? > > > > #define gfs2_assert(sdp, assertion) \ > > do { \ > > if (unlikely(!(assertion))) { \ > > printk(KERN_ERR \ > > "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \ > > "GFS2: fsid=%s: function = %s\n" \ > > "GFS2: fsid=%s: file = %s, line = %u\n" \ > > "GFS2: fsid=%s: time = %lu\n", \ > > sdp->sd_fsname, # assertion, \ > > sdp->sd_fsname, __FUNCTION__, \ > > sdp->sd_fsname, __FILE__, __LINE__, \ > > sdp->sd_fsname, get_seconds()); \ > > BUG(); \ > > You will already get the __FUNCTION__ (and hence the __FILE__ info) > directly from the BUG() dump, as well as the time from the syslog > message (turn on the printk timestamps if you want a more fine grain > timestamp), so the majority of this macro is redundant with the BUG() > macro... Joern already suggested moving this out of line and into a function (as it was before) to avoid repeating string constants. In that case the function, file and line from BUG aren't useful. We now have this, does it look ok? 
void gfs2_assert_i(struct gfs2_sbd *sdp, char *assertion, const char *function, char *file, unsigned int line) { panic("GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" "GFS2: fsid=%s: function = %s, file = %s, line = %u\n", sdp->sd_fsname, assertion, sdp->sd_fsname, function, file, line); } #define gfs2_assert(sdp, assertion) \ do { \ if (unlikely(!(assertion))) { \ gfs2_assert_i((sdp), #assertion, \ __FUNCTION__, __FILE__, __LINE__); \ } \ } while (0) From teigland at redhat.com Mon Sep 5 04:30:33 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 5 Sep 2005 12:30:33 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903224140.0442fac4.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> Message-ID: <20050905043033.GB11337@redhat.com> On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote: > Joel Becker wrote: > > > > > What happens when we want to add some new primitive which has no > > > posix-file analog? > > > > The point of dlmfs is not to express every primitive that the > > DLM has. dlmfs cannot express the CR, CW, and PW levels of the VMS > > locking scheme. Nor should it. The point isn't to use a filesystem > > interface for programs that need all the flexibility and power of the > > VMS DLM. The point is a simple system that programs needing the basic > > operations can use. Even shell scripts. > > Are you saying that the posix-file lookalike interface provides access to > part of the functionality, but there are other APIs which are used to > access the rest of the functionality? If so, what is that interface, and > why cannot that interface offer access to 100% of the functionality, thus > making the posix-file tricks unnecessary? We're using our dlm quite a bit in user space and require the full dlm API. It's difficult to export the full API through a pseudo fs like dlmfs, so we've not found it a very practical approach. That said, it's a nice idea and I'd be happy if someone could map a more complete dlm API onto it. We export our full dlm API through read/write/poll on a misc device. All user space apps use the dlm through a library as you'd expect. The library communicates with the dlm_device kernel module through read/write/poll and the dlm_device module talks with the actual dlm: linux/drivers/dlm/device.c If there's a better way to do this, via a pseudo fs or not, we'd be pleased to try it. Dave From teigland at redhat.com Mon Sep 5 05:43:48 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 5 Sep 2005 13:43:48 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125574523.5025.10.camel@laptopd505.fenrus.org> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> Message-ID: <20050905054348.GC11337@redhat.com> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote: > +void gfs2_glock_hold(struct gfs2_glock *gl) > +{ > + glock_hold(gl); > +} > > eh why? You removed the comment stating exactly why, see below. If that's not a accepted technique in the kernel, say so and I'll be happy to change it here and elsewhere. 
Thanks, Dave static inline void glock_hold(struct gfs2_glock *gl) { gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0); atomic_inc(&gl->gl_count); } /** * gfs2_glock_hold() - As glock_hold(), but suitable for exporting * @gl: The glock to hold * */ void gfs2_glock_hold(struct gfs2_glock *gl) { glock_hold(gl); } From teigland at redhat.com Mon Sep 5 06:29:16 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 5 Sep 2005 14:29:16 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125574523.5025.10.camel@laptopd505.fenrus.org> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> Message-ID: <20050905062916.GA17607@redhat.com> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote: > +static unsigned int handle_roll(atomic_t *a) > +{ > + int x = atomic_read(a); > + if (x < 0) { > + atomic_set(a, 0); > + return 0; > + } > + return (unsigned int)x; > +} > > this is just plain scary. Not really, it was just resetting atomic statistics counters when they became negative. Unecessary, though, so removed. Dave From penberg at cs.helsinki.fi Mon Sep 5 06:32:59 2005 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Mon, 5 Sep 2005 09:32:59 +0300 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905054348.GC11337@redhat.com> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> <20050905054348.GC11337@redhat.com> Message-ID: <84144f02050904233274d45230@mail.gmail.com> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote: > > +void gfs2_glock_hold(struct gfs2_glock *gl) > > +{ > > + glock_hold(gl); > > +} > > > > eh why? On 9/5/05, David Teigland wrote: > You removed the comment stating exactly why, see below. If that's not a > accepted technique in the kernel, say so and I'll be happy to change it > here and elsewhere. Is there a reason why users of gfs2_glock_hold() cannot use glock_hold() directly? Pekka From mark.fasheh at oracle.com Mon Sep 5 07:09:23 2005 From: mark.fasheh at oracle.com (Mark Fasheh) Date: Mon, 5 Sep 2005 00:09:23 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905055428.GA29158@thunk.org> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> <20050904203344.GA1987@elf.ucw.cz> <20050905055428.GA29158@thunk.org> Message-ID: <20050905070922.GK21228@ca-server1.us.oracle.com> On Mon, Sep 05, 2005 at 01:54:28AM -0400, Theodore Ts'o wrote: > In the ext3 case, the only time when read-only isn't quite read-only > is when the filesystem was unmounted uncleanly and the journal needs > to be replayed in order for the filesystem to be consistent. Right, and OCFS2 is going to try to keep the behavior of only using the journal for recovery in normal (soft) read-only operation. Unfortunately other cluster nodes could die at any moment which can complicate things as we are now required to do recovery on them to ensure file system consistency. Recovery of course includes things like orphan dir cleanup, etc so we need a journal around for those transactions. To simplify all this, I'm just going to have it load the journal as it normally does (as opposed to only when the local node has a dirty journal) because it could be used at any moment. Btw, I'm curious to know how useful folks find the ext3 mount options errors=continue and errors=panic. 
I'm extremely likely to implement the errors=read-only behavior as default in OCFS2 and I'm wondering whether the other two are worth looking into. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh at oracle.com From teigland at redhat.com Mon Sep 5 07:55:28 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 5 Sep 2005 15:55:28 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <84144f02050904233274d45230@mail.gmail.com> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> <20050905054348.GC11337@redhat.com> <84144f02050904233274d45230@mail.gmail.com> Message-ID: <20050905075528.GB17607@redhat.com> On Mon, Sep 05, 2005 at 09:32:59AM +0300, Pekka Enberg wrote: > On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote: > > > +void gfs2_glock_hold(struct gfs2_glock *gl) > > > +{ > > > + glock_hold(gl); > > > +} > > > > > > eh why? > > On 9/5/05, David Teigland wrote: > > You removed the comment stating exactly why, see below. If that's not a > > accepted technique in the kernel, say so and I'll be happy to change it > > here and elsewhere. > > Is there a reason why users of gfs2_glock_hold() cannot use > glock_hold() directly? Either set could be trivially removed. It's such an insignificant issue that I've removed glock_hold and put. For the record, within glock.c we consistently paired inlined versions of: glock_hold() glock_put() we wanted external versions to be appropriately named so we had: gfs2_glock_hold() gfs2_glock_put() still not sure if that technique is acceptable in this crowd or not. Dave From penberg at cs.helsinki.fi Mon Sep 5 08:00:51 2005 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Mon, 5 Sep 2005 11:00:51 +0300 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905075528.GB17607@redhat.com> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> <20050905054348.GC11337@redhat.com> <84144f02050904233274d45230@mail.gmail.com> <20050905075528.GB17607@redhat.com> Message-ID: <84144f02050905010066bc516d@mail.gmail.com> On 9/5/05, David Teigland wrote: > Either set could be trivially removed. It's such an insignificant issue > that I've removed glock_hold and put. For the record, > > within glock.c we consistently paired inlined versions of: > glock_hold() > glock_put() > > we wanted external versions to be appropriately named so we had: > gfs2_glock_hold() > gfs2_glock_put() > > still not sure if that technique is acceptable in this crowd or not. You still didn't answer my question why you needed two versions, though. AFAIK you didn't which makes the other one an redundant wrapper which are discouraged in kernel code. Pekka From pavel at ucw.cz Mon Sep 5 08:27:35 2005 From: pavel at ucw.cz (Pavel Machek) Date: Mon, 5 Sep 2005 10:27:35 +0200 Subject: [Linux-cluster] real read-only [was Re: GFS, what's remaining] In-Reply-To: <20050905055428.GA29158@thunk.org> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> <20050904203344.GA1987@elf.ucw.cz> <20050905055428.GA29158@thunk.org> Message-ID: <20050905082735.GA2662@elf.ucw.cz> Hi! 
> > > - read-only mount > > > - "specatator" mount (like ro but no journal allocated for the mount, > > > no fencing needed for failed node that was mounted as specatator) > > > > I'd call it "real-read-only", and yes, that's very usefull > > mount. Could we get it for ext3, too? > > This is a bit of a degression, but it's quite a bit different from > what ocfs2 is doing, where it is not necessary to replay the journal > in order to assure filesystem consistency. > > In the ext3 case, the only time when read-only isn't quite read-only > is when the filesystem was unmounted uncleanly and the journal needs > to be replayed in order for the filesystem to be consistent. Yes, I know... And that is going to be a disaster when you are attempting to recover data from failing harddrive (and absolutely do not want to write there). There's a better reason, too. I do swsusp. Then I'd like to boot with / mounted read-only (so that I can read my config files, some binaries, and maybe suspended image), but I absolutely may not write to disk at this point, because I still want to resume. Currently distros do that using initrd, but that does not allow you to store suspended image into file, and is slightly hard to setup. Pavel -- if you have sharp zaurus hardware you don't need... you know my address From akpm at osdl.org Mon Sep 5 08:54:08 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 5 Sep 2005 01:54:08 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905043033.GB11337@redhat.com> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> Message-ID: <20050905015408.21455e56.akpm@osdl.org> David Teigland wrote: > > We export our full dlm API through read/write/poll on a misc device. > inotify did that for a while, but we ended up going with a straight syscall interface. How fat is the dlm interface? ie: how many syscalls would it take? From joern at wohnheim.fh-wedel.de Mon Sep 5 08:58:08 2005 From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel) Date: Mon, 5 Sep 2005 10:58:08 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905034739.GA11337@redhat.com> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> <20050902094403.GD16595@redhat.com> <20050903052821.GA23711@kroah.com> <20050905034739.GA11337@redhat.com> Message-ID: <20050905085808.GA22802@wohnheim.fh-wedel.de> On Mon, 5 September 2005 11:47:39 +0800, David Teigland wrote: > > Joern already suggested moving this out of line and into a function (as it > was before) to avoid repeating string constants. In that case the > function, file and line from BUG aren't useful. We now have this, does it > look ok? Ok wrt. my concerns, but not with Greg's. BUG() still gives you everything that you need, except: o fsid Notice how this list is just one entry long? 
;)

So how about

#define gfs2_assert(sdp, assertion) do { \
	if (unlikely(!(assertion))) { \
		printk(KERN_ERR "GFS2: fsid=%s\n", (sdp)->sd_fsname); \
		BUG(); \
	} \
} while (0)

Or, to move the constant out of line again

void __gfs2_assert(struct gfs2_sbd *sdp)
{
	printk(KERN_ERR "GFS2: fsid=%s\n", sdp->sd_fsname);
}

#define gfs2_assert(sdp, assertion) do { \
	if (unlikely(!(assertion))) { \
		__gfs2_assert(sdp); \
		BUG(); \
	} \
} while (0)

Jörn

--
Admonish your friends privately, but praise them openly.
-- Publilius Syrus

From teigland at redhat.com Mon Sep 5 09:18:47 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 17:18:47 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905085808.GA22802@wohnheim.fh-wedel.de>
References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> <20050902094403.GD16595@redhat.com> <20050903052821.GA23711@kroah.com> <20050905034739.GA11337@redhat.com> <20050905085808.GA22802@wohnheim.fh-wedel.de>
Message-ID: <20050905091847.GD17607@redhat.com>

On Mon, Sep 05, 2005 at 10:58:08AM +0200, Jörn Engel wrote:

> #define gfs2_assert(sdp, assertion) do { \
> 	if (unlikely(!(assertion))) { \
> 		printk(KERN_ERR "GFS2: fsid=%s\n", (sdp)->sd_fsname); \
> 		BUG(); \
> 	} \
> } while (0)

OK thanks,
Dave

From teigland at redhat.com Mon Sep 5 09:24:33 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 17:24:33 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905015408.21455e56.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> <20050905015408.21455e56.akpm@osdl.org>
Message-ID: <20050905092433.GE17607@redhat.com>

On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> David Teigland wrote:
> >
> > We export our full dlm API through read/write/poll on a misc device.
> >
>
> inotify did that for a while, but we ended up going with a straight syscall
> interface.
>
> How fat is the dlm interface?  ie: how many syscalls would it take?

Four functions:
create_lockspace()
release_lockspace()
lock()
unlock()

Dave

From akpm at osdl.org Mon Sep 5 09:19:48 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 5 Sep 2005 02:19:48 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905092433.GE17607@redhat.com>
References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> <20050905015408.21455e56.akpm@osdl.org> <20050905092433.GE17607@redhat.com>
Message-ID: <20050905021948.6241f1e0.akpm@osdl.org>

David Teigland wrote:
>
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland wrote:
> > >
> > > We export our full dlm API through read/write/poll on a misc device.
> > >
> >
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> >
> > How fat is the dlm interface?  ie: how many syscalls would it take?
>
> Four functions:
> create_lockspace()
> release_lockspace()
> lock()
> unlock()

Neat.
I'd be inclined to make them syscalls then. I don't suppose anyone is likely to object if we reserve those slots. From phillips at istop.com Mon Sep 5 09:30:56 2005 From: phillips at istop.com (Daniel Phillips) Date: Mon, 5 Sep 2005 05:30:56 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905021948.6241f1e0.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050905092433.GE17607@redhat.com> <20050905021948.6241f1e0.akpm@osdl.org> Message-ID: <200509050530.56787.phillips@istop.com> On Monday 05 September 2005 05:19, Andrew Morton wrote: > David Teigland wrote: > > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote: > > > David Teigland wrote: > > > > We export our full dlm API through read/write/poll on a misc device. > > > > > > inotify did that for a while, but we ended up going with a straight > > > syscall interface. > > > > > > How fat is the dlm interface? ie: how many syscalls would it take? > > > > Four functions: > > create_lockspace() > > release_lockspace() > > lock() > > unlock() > > Neat. I'd be inclined to make them syscalls then. I don't suppose anyone > is likely to object if we reserve those slots. Better take a look at the actual parameter lists to those calls before jumping to conclusions... Regards, Daniel From teigland at redhat.com Mon Sep 5 09:48:07 2005 From: teigland at redhat.com (David Teigland) Date: Mon, 5 Sep 2005 17:48:07 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905021948.6241f1e0.akpm@osdl.org> References: <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> <20050905015408.21455e56.akpm@osdl.org> <20050905092433.GE17607@redhat.com> <20050905021948.6241f1e0.akpm@osdl.org> Message-ID: <20050905094807.GG17607@redhat.com> On Mon, Sep 05, 2005 at 02:19:48AM -0700, Andrew Morton wrote: > David Teigland wrote: > > Four functions: > > create_lockspace() > > release_lockspace() > > lock() > > unlock() > > Neat. I'd be inclined to make them syscalls then. I don't suppose anyone > is likely to object if we reserve those slots. Patrick is really the expert in this area and he's off this week, but based on what he's done with the misc device I don't see why there'd be more than two or three parameters for any of these. Dave From sct at redhat.com Mon Sep 5 10:44:08 2005 From: sct at redhat.com (Stephen C. Tweedie) Date: Mon, 05 Sep 2005 11:44:08 +0100 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904203344.GA1987@elf.ucw.cz> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> <20050904203344.GA1987@elf.ucw.cz> Message-ID: <1125917048.1910.9.camel@sisko.sctweedie.blueyonder.co.uk> Hi, On Sun, 2005-09-04 at 21:33, Pavel Machek wrote: > > - read-only mount > > - "specatator" mount (like ro but no journal allocated for the mount, > > no fencing needed for failed node that was mounted as specatator) > > I'd call it "real-read-only", and yes, that's very usefull > mount. Could we get it for ext3, too? I don't want to pollute the ext3 paths with extra checks for the case when there's no journal struct at all. 
But a dummy journal struct that isn't associated with an on-disk journal and that can never, ever go writable would certainly be pretty easy to do. But mount -o readonly gives you most of what you want already. An always-readonly option would be different in some key ways --- for a start, it would be impossible to perform journal recovery if that's needed, as that still needs journal and superblock write access. That's not necessarily a good thing. And you *still* wouldn't get something that could act as a spectator to a filesystem mounted writable elsewhere on a SAN, because updates on the other node wouldn't invalidate cached data on the readonly node. So is this really a useful combination? About the only combination I can think of that really makes sense in this context is if you have a busted filesystem that somehow can't be recovered --- either the journal is broken or the underlying device is truly readonly --- and you want to mount without recovery in order to attempt to see what you can find. That's asking for data corruption, but that may be better than getting no data at all. But that is something that could be done with a "-o skip-recovery" mount option, which would necessarily imply always-readonly behaviour. --Stephen From lmb at suse.de Mon Sep 5 14:14:32 2005 From: lmb at suse.de (Lars Marowsky-Bree) Date: Mon, 5 Sep 2005 16:14:32 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509030157.31581.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <20050901132104.2d643ccd.akpm@osdl.org> <200509030157.31581.phillips@istop.com> Message-ID: <20050905141432.GF5498@marowsky-bree.de> On 2005-09-03T01:57:31, Daniel Phillips wrote: > The only current users of dlms are cluster filesystems. There are zero users > of the userspace dlm api. That is incorrect, and you're contradicting yourself here: > What does have to be resolved is a common API for node management. It is not > just cluster filesystems and their lock managers that have to interface to > node management. Below the filesystem layer, cluster block devices and > cluster volume management need to be coordinated by the same system, and > above the filesystem layer, applications also need to be hooked into it. > This work is, in a word, incomplete. The Cluster Volume Management of LVM2 for example _does_ use simple cluster-wide locks, and some OCFS2 scripts, I seem to recall, do too. (EVMS2 in cluster-mode uses a verrry simple locking scheme which is basically operated by the failover software and thus uses a different model.) 
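[To give the userspace usage Teigland and Lars describe a concrete shape (a library wrapping the dlm_device misc device, mirroring the create_lockspace/release_lockspace/lock/unlock operations listed earlier, used by tools such as clvmd), here is a rough sketch.  Every identifier below is invented for illustration; the real library interface is not spelled out in this thread.]

/*
 * Illustrative only: ex_* stand in for whatever the userspace dlm library
 * exposes on top of the misc device; names and signatures are assumptions.
 */
struct dlm_lockspace;                                   /* opaque handle */
struct dlm_lockspace *ex_open_lockspace(const char *name);
int  ex_lock(struct dlm_lockspace *ls, const char *resource);   /* EX lock */
int  ex_unlock(struct dlm_lockspace *ls, const char *resource);
void ex_close_lockspace(struct dlm_lockspace *ls);

int update_shared_metadata(void)
{
	/* The library opens the misc device behind this call. */
	struct dlm_lockspace *ls = ex_open_lockspace("clvmd");
	if (!ls)
		return -1;

	/* Blocks until the cluster-wide exclusive lock is granted; the
	 * library writes a request to the device and poll()s for the
	 * completion. */
	if (ex_lock(ls, "VG_metadata") != 0) {
		ex_close_lockspace(ls);
		return -1;
	}

	/* ... critical section: update state shared across the cluster ... */

	ex_unlock(ls, "VG_metadata");

	/* Closing the lockspace handle closes the underlying fd, so the
	 * kernel can release anything still held; that cleanup property is
	 * what the fd-versus-syscall argument later in the thread is about. */
	ex_close_lockspace(ls);
	return 0;
}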
Sincerely, Lars Marowsky-Br?e -- High Availability & Clustering SUSE Labs, Research and Development SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin "Ignorance more frequently begets confidence than does knowledge" From Axel.Thimm at ATrpms.net Mon Sep 5 15:36:49 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Mon, 5 Sep 2005 17:36:49 +0200 Subject: [Linux-cluster] NFS relocate: old TCP/IP connection resulting in DUP/ACK storms and largish timeouts (was: iptables protection wrapper; nfsexport.sh vs ip.sh racing) In-Reply-To: <1125425012.21943.1.camel@ayanami.boston.redhat.com> References: <20050822225227.GJ24127@neu.nirvana> <1125340879.24205.30.camel@ayanami.boston.redhat.com> <20050829233523.GD5908@neu.nirvana> <1125425012.21943.1.camel@ayanami.boston.redhat.com> Message-ID: <20050905153649.GE17096@neu.nirvana> On Tue, Aug 30, 2005 at 02:03:32PM -0400, Lon Hohberger wrote: > On Tue, 2005-08-30 at 01:35 +0200, Axel Thimm wrote: > > > > It's really an attempt at a workaround a configuration problem -- and > > > nothing more. > > > > The above is with nfs running on all nodes already. The racing seems > > to be with the exportfs commands and ip setup/teardown. > > > > It is easy to reproduce (>=50%) if the client connects over Gigabit > > and is in write transaction while the service is moved. We saw this in > > two different setups. If you throttle the network bandwidth to <= > > 20MB/sec you don't trigger the bug, so it really seems like a racing > > problem. > > ewww... Can you bugzilla this so we can track it? =) will do so, we are currently still trying to figure it out properly, so we can provide a better bug report (and separate different bugs). One bug that has critalled out is that upon relocation the old server keeps his TCP connections to the NFS client. When this server later on gets to become the NFS server again, he FIN/ACKs that old connection to the client (that had this connection torn down by now), which creates a DUP/ACK storm. A workaround is to shutdown nfs instead of simply unexporting like nfsexport.sh does, so that the pending TCP connections get fried, too. Is there a way to have ip.sh fry all open TCP/IP connections to a service IP that is to be abandoned? I guess that would be the better solution (that would also apply to non-NFS services). Of course the true bug is the DUP/ACK storm that is triggered by the old open TCP connection. -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From phillips at istop.com Mon Sep 5 15:49:49 2005 From: phillips at istop.com (Daniel Phillips) Date: Mon, 5 Sep 2005 11:49:49 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905141432.GF5498@marowsky-bree.de> References: <20050901104620.GA22482@redhat.com> <200509030157.31581.phillips@istop.com> <20050905141432.GF5498@marowsky-bree.de> Message-ID: <200509051149.49929.phillips@istop.com> On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > On 2005-09-03T01:57:31, Daniel Phillips wrote: > > The only current users of dlms are cluster filesystems. There are zero > > users of the userspace dlm api. > > That is incorrect... Application users Lars, sorry if I did not make that clear. 
The issue is whether we need to export an all-singing-all-dancing dlm api from kernel to userspace today, or whether we can afford to take the necessary time to get it right while application writers take their time to have a good think about whether they even need it. > ...and you're contradicting yourself here: How so? Above talks about dlm, below talks about cluster membership. > > What does have to be resolved is a common API for node management. It is > > not just cluster filesystems and their lock managers that have to > > interface to node management. Below the filesystem layer, cluster block > > devices and cluster volume management need to be coordinated by the same > > system, and above the filesystem layer, applications also need to be > > hooked into it. This work is, in a word, incomplete. Regards, Daniel From alan at lxorguk.ukuu.org.uk Mon Sep 5 12:21:34 2005 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Mon, 05 Sep 2005 13:21:34 +0100 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905021948.6241f1e0.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> <20050905015408.21455e56.akpm@osdl.org> <20050905092433.GE17607@redhat.com> <20050905021948.6241f1e0.akpm@osdl.org> Message-ID: <1125922894.8714.14.camel@localhost.localdomain> On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote: > > create_lockspace() > > release_lockspace() > > lock() > > unlock() > > Neat. I'd be inclined to make them syscalls then. I don't suppose anyone > is likely to object if we reserve those slots. If the locks are not file descriptors then answer the following: - How are they ref counted - What are the cleanup semantics - How do I pass a lock between processes (AF_UNIX sockets wont work now) - How do I poll on a lock coming free. - What are the semantics of lock ownership - What rules apply for inheritance - How do I access a lock across threads. - What is the permission model. - How do I attach audit to it - How do I write SELinux rules for it - How do I use mount to make namespaces appear in multiple vservers and thats for starters... Every so often someone decides that a deeply un-unix interface with new syscalls is a good idea. Every time history proves them totally bonkers. There are cases for new system calls but this doesn't seem one of them. Look at system 5 shared memory, look at system 5 ipc, and so on. You can't use common interfaces on them, you can't select on them, you can't sanely pass them by fd passing. All our existing locking uses the following behaviour fd = open(namespace, options) fcntl(.. lock ...) blah flush fcntl(.. unlock ...) close Unfortunately some people here seem to have forgotten WHY we do things this way. 1. The semantics of file descriptors are well understood by users and by programs. That makes programming easier and keeps code size down 2. Everyone knows how close() works including across fork 3. FD passing is an obscure art but understood and just works 4. Poll() is a standard understood interface 5. Ownership of files is a standard model 6. FD passing across fork/exec is controlled in a standard way 7. The semantics for threaded applications are defined 8. Permissions are a standard model 9. Audit just works with the same tools 9. 
SELinux just works with the same tools 10. I don't need specialist applications to see the system state (the whole point of sysfs yet someone wants to break it all again) 11. fcntl fd locking is a posix standard interface with precisely defined semantics. Our extensions including leases are very powerful 12. And yes - fcntl fd locking supports mandatory locking too. That also is standards based with precise semantics. Everyone understands how to use the existing locking operations. So if you use the existing interfaces with some small extensions if neccessary everyone understands how to use cluster locks. Isn't that neat.... From alan at lxorguk.ukuu.org.uk Sun Sep 4 08:37:15 2005 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Sun, 04 Sep 2005 09:37:15 +0100 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> Message-ID: <1125823035.23858.10.camel@localhost.localdomain> On Sad, 2005-09-03 at 21:46 -0700, Andrew Morton wrote: > Actually I think it's rather sick. Taking O_NONBLOCK and making it a > lock-manager trylock because they're kinda-sorta-similar-sounding? Spare > me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to > acquire a clustered filesystem lock". Not even close. The semantics of O_NONBLOCK on many other devices are "trylock" semantics. OSS audio has those semantics for example, as do regular files in the presence of SYS5 mandatory locks. While the latter is "try lock , do operation and then drop lock" the drivers using O_NDELAY are very definitely providing trylock semantics. I am curious why a lock manager uses open to implement its locking semantics rather than using the locking API (POSIX locks etc) however. Alan From greg.freemyer at gmail.com Mon Sep 5 16:41:03 2005 From: greg.freemyer at gmail.com (Greg Freemyer) Date: Mon, 5 Sep 2005 12:41:03 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125917048.1910.9.camel@sisko.sctweedie.blueyonder.co.uk> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> <20050904203344.GA1987@elf.ucw.cz> <1125917048.1910.9.camel@sisko.sctweedie.blueyonder.co.uk> Message-ID: <87f94c3705090509411af94019@mail.gmail.com> On 9/5/05, Stephen C. Tweedie wrote: > > Hi, > > On Sun, 2005-09-04 at 21:33, Pavel Machek wrote: > > > > - read-only mount > > > - "specatator" mount (like ro but no journal allocated for the mount, > > > no fencing needed for failed node that was mounted as specatator) > > > > I'd call it "real-read-only", and yes, that's very usefull > > mount. Could we get it for ext3, too? > > I don't want to pollute the ext3 paths with extra checks for the case > when there's no journal struct at all. But a dummy journal struct that > isn't associated with an on-disk journal and that can never, ever go > writable would certainly be pretty easy to do. > > But mount -o readonly gives you most of what you want already. An > always-readonly option would be different in some key ways --- for a > start, it would be impossible to perform journal recovery if that's > needed, as that still needs journal and superblock write access. That's > not necessarily a good thing. 
> > And you *still* wouldn't get something that could act as a spectator to > a filesystem mounted writable elsewhere on a SAN, because updates on the > other node wouldn't invalidate cached data on the readonly node. So is > this really a useful combination? > > About the only combination I can think of that really makes sense in > this context is if you have a busted filesystem that somehow can't be > recovered --- either the journal is broken or the underlying device is > truly readonly --- and you want to mount without recovery in order to > attempt to see what you can find. That's asking for data corruption, > but that may be better than getting no data at all. > > But that is something that could be done with a "-o skip-recovery" mount > option, which would necessarily imply always-readonly behaviour. > > --Stephen This is getting way off-thread, but xfs does not do journal replay on read-only mount. This was required due to filesystem snapshots which are often truly read-only. i.e. All LVM1 snapshots are truly read-only. Also many FC arrays support read-only snapshots as well. I'm not sure how ext3 supports those environments (I use XFS when I need snapshot capability). The above -skip-recovery option might be required? Greg -- Greg Freemyer The Norcross Group Forensics for the 21st Century -------------- next part -------------- An HTML attachment was scrubbed... URL: From Axel.Thimm at ATrpms.net Mon Sep 5 18:21:43 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Mon, 5 Sep 2005 20:21:43 +0200 Subject: [Linux-cluster] Re: NFS relocate: old TCP/IP connection resulting in DUP/ACK storms and largish timeouts In-Reply-To: <20050905153649.GE17096@neu.nirvana> References: <20050822225227.GJ24127@neu.nirvana> <1125340879.24205.30.camel@ayanami.boston.redhat.com> <20050829233523.GD5908@neu.nirvana> <1125425012.21943.1.camel@ayanami.boston.redhat.com> <20050905153649.GE17096@neu.nirvana> Message-ID: <20050905182143.GA2099@neu.nirvana> On Mon, Sep 05, 2005 at 05:36:49PM +0200, Axel Thimm wrote: > On Tue, Aug 30, 2005 at 02:03:32PM -0400, Lon Hohberger wrote: > > On Tue, 2005-08-30 at 01:35 +0200, Axel Thimm wrote: > > > > > > It's really an attempt at a workaround a configuration problem -- and > > > > nothing more. > > > > > > The above is with nfs running on all nodes already. The racing seems > > > to be with the exportfs commands and ip setup/teardown. > > ewww... Can you bugzilla this so we can track it? =) > One bug that has critalled out is that upon relocation the old server > keeps his TCP connections to the NFS client. When this server later on > gets to become the NFS server again, he FIN/ACKs that old connection > to the client (that had this connection torn down by now), which > creates a DUP/ACK storm. > > A workaround is to shutdown nfs instead of simply unexporting like > nfsexport.sh does, so that the pending TCP connections get fried, too. > > Is there a way to have ip.sh fry all open TCP/IP connections to a > service IP that is to be abandoned? I guess that would be the better > solution (that would also apply to non-NFS services). > > Of course the true bug is the DUP/ACK storm that is triggered by the > old open TCP connection. Both bugs have been filed in bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167571 https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167572 I guess the latter will move to another component like "kernel", if it really turns out to be neither cluster nor even nfs specific. 
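[Picking the read-only sub-thread back up: the behaviour Stephen sketches as a hypothetical "-o skip-recovery" and Greg describes for XFS snapshots exists in XFS as the norecovery mount option, which as far as I know must be combined with ro.  A minimal sketch; the device and mount point are placeholders.]

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * "norecovery" tells XFS not to touch the log at all, which is what
	 * a truly read-only device (an LVM1 snapshot, a read-only FC LUN)
	 * requires; MS_RDONLY is the ro half of "ro,norecovery".
	 */
	if (mount("/dev/vg0/snap0", "/mnt/snap", "xfs",
		  MS_RDONLY, "norecovery") != 0) {
		perror("mount");
		return 1;
	}
	return 0;
}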
-- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From akpm at osdl.org Mon Sep 5 19:53:09 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 5 Sep 2005 12:53:09 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125922894.8714.14.camel@localhost.localdomain> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> <20050905015408.21455e56.akpm@osdl.org> <20050905092433.GE17607@redhat.com> <20050905021948.6241f1e0.akpm@osdl.org> <1125922894.8714.14.camel@localhost.localdomain> Message-ID: <20050905125309.4b657b08.akpm@osdl.org> Alan Cox wrote: > > On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote: > > > create_lockspace() > > > release_lockspace() > > > lock() > > > unlock() > > > > Neat. I'd be inclined to make them syscalls then. I don't suppose anyone > > is likely to object if we reserve those slots. > > If the locks are not file descriptors then answer the following: > > - How are they ref counted > - What are the cleanup semantics > - How do I pass a lock between processes (AF_UNIX sockets wont work now) > - How do I poll on a lock coming free. > - What are the semantics of lock ownership > - What rules apply for inheritance > - How do I access a lock across threads. > - What is the permission model. > - How do I attach audit to it > - How do I write SELinux rules for it > - How do I use mount to make namespaces appear in multiple vservers > > and thats for starters... Return an fd from create_lockspace(). From karon at gmx.net Mon Sep 5 20:52:57 2005 From: karon at gmx.net (Andreas Brosche) Date: Mon, 5 Sep 2005 22:52:57 +0200 (MEST) Subject: [Linux-cluster] Using GFS without a network? Message-ID: <16102.1125953577@www46.gmx.net> Hello *, we have two networks which require to be topologically separated. Nevertheless, data exchange shall be possible. We're thinking about implementing two servers (one for each network) with a shared storage (one or two SCSI disks on a shared bus between the two servers). As local filesystems do not make sense on shared SCSI busses, we're thinking about implementing a solution based on GFS. The point is: Synchronisation of the file system access must be handled either via the SCSI bus or via the disks themselves, since an ethernet link between the two machines does not fit into our security concept. I've read about the concept of quorum disks which are based on raw devices. Raw devices skip the Linux read/write cache, so the necessary heartbeat channel and filesystem cache sync could be done via a shared raw partition. Is this possible? All the docs I have read imply setting up a cluster, and thus a network, which is not an option. I want to avoid complexity in this one. We do not have real time requirements, so a periodic update of the shared raw data should suffice. Long story cut short, we want - GFS on a shared SCSI disk (Performance is not important) - dlm without network access (theoretically possible... but how dependant is GFS on the cluster services?) 
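[A rough sketch of the disk heartbeat the poster has in mind: each node periodically rewrites its own sector on the shared device, using O_DIRECT to get the cache-bypass behaviour he wants from a raw partition.  The device path, sector assignment and record layout are invented for illustration and do not match any particular quorum-disk implementation.]

#define _GNU_SOURCE             /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define SECTOR 512

int main(void)
{
	int node_id = 0;                /* this node's slot on the disk */
	void *buf;

	/* O_DIRECT requires an aligned buffer and aligned I/O sizes. */
	if (posix_memalign(&buf, SECTOR, SECTOR) != 0)
		return 1;

	int fd = open("/dev/sdb1", O_RDWR | O_DIRECT | O_SYNC);
	if (fd < 0) {
		perror("open shared disk");
		return 1;
	}

	for (;;) {
		memset(buf, 0, SECTOR);
		/* Record: node id plus timestamp, rewritten every second. */
		snprintf(buf, SECTOR, "node=%d ts=%ld",
			 node_id, (long)time(NULL));
		if (pwrite(fd, buf, SECTOR, (off_t)node_id * SECTOR) != SECTOR)
			perror("pwrite heartbeat");
		/* A peer node would pread() the other slots and declare a
		 * node dead once its timestamp stops advancing. */
		sleep(1);
	}
}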
Regards Andreas Brosche -- 5 GB Mailbox, 50 FreeSMS http://www.gmx.net/de/go/promail +++ GMX - die erste Adresse f?r Mail, Message, More +++ From alan at lxorguk.ukuu.org.uk Mon Sep 5 23:20:11 2005 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Tue, 06 Sep 2005 00:20:11 +0100 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905125309.4b657b08.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> <20050905015408.21455e56.akpm@osdl.org> <20050905092433.GE17607@redhat.com> <20050905021948.6241f1e0.akpm@osdl.org> <1125922894.8714.14.camel@localhost.localdomain> <20050905125309.4b657b08.akpm@osdl.org> Message-ID: <1125962411.8714.46.camel@localhost.localdomain> On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote: > > - How are they ref counted > > - What are the cleanup semantics > > - How do I pass a lock between processes (AF_UNIX sockets wont work now) > > - How do I poll on a lock coming free. > > - What are the semantics of lock ownership > > - What rules apply for inheritance > > - How do I access a lock across threads. > > - What is the permission model. > > - How do I attach audit to it > > - How do I write SELinux rules for it > > - How do I use mount to make namespaces appear in multiple vservers > > > > and thats for starters... > > Return an fd from create_lockspace(). That only answers about four of the questions. The rest only come out if create_lockspace behaves like a file system - in other words create_lockspace is better known as either mkdir or mount. Its certainly viable to make the lock/unlock functions taken a fd, it's just not clear why the current lock/unlock functions we have won't do the job. Being able to extend the functionality to leases later on may be very powerful indeed and will fit the existing API From akpm at osdl.org Mon Sep 5 23:06:13 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 5 Sep 2005 16:06:13 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125962411.8714.46.camel@localhost.localdomain> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> <20050905015408.21455e56.akpm@osdl.org> <20050905092433.GE17607@redhat.com> <20050905021948.6241f1e0.akpm@osdl.org> <1125922894.8714.14.camel@localhost.localdomain> <20050905125309.4b657b08.akpm@osdl.org> <1125962411.8714.46.camel@localhost.localdomain> Message-ID: <20050905160613.7b0ee7fc.akpm@osdl.org> Alan Cox wrote: > > On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote: > > > - How are they ref counted > > > - What are the cleanup semantics > > > - How do I pass a lock between processes (AF_UNIX sockets wont work now) > > > - How do I poll on a lock coming free. > > > - What are the semantics of lock ownership > > > - What rules apply for inheritance > > > - How do I access a lock across threads. > > > - What is the permission model. 
> > > - How do I attach audit to it > > > - How do I write SELinux rules for it > > > - How do I use mount to make namespaces appear in multiple vservers > > > > > > and thats for starters... > > > > Return an fd from create_lockspace(). > > That only answers about four of the questions. The rest only come out if > create_lockspace behaves like a file system - in other words > create_lockspace is better known as either mkdir or mount. But David said that "We export our full dlm API through read/write/poll on a misc device.". That miscdevice will simply give us an fd. Hence my suggestion that the miscdevice be done away with in favour of a dedicated syscall which returns an fd. What does a filesystem have to do with this? From Joel.Becker at oracle.com Mon Sep 5 23:32:36 2005 From: Joel.Becker at oracle.com (Joel Becker) Date: Mon, 5 Sep 2005 16:32:36 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125823035.23858.10.camel@localhost.localdomain> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <1125823035.23858.10.camel@localhost.localdomain> Message-ID: <20050905233236.GF8684@ca-server1.us.oracle.com> On Sun, Sep 04, 2005 at 09:37:15AM +0100, Alan Cox wrote: > I am curious why a lock manager uses open to implement its locking > semantics rather than using the locking API (POSIX locks etc) however. Because it is simple (how do you fcntl(2) from a shell fd?), has no ranges (what do you do with ranges passed in to fcntl(2) and you don't support them?), and has a well-known fork(2)/exec(2) pattern. fcntl(2) has a known but less intuitive fork(2) pattern. The real reason, though, is that we never considered fcntl(2). We could never think of a case when a process wanted a lock fd open but not locked. At least, that's my recollection. Mark might have more to comment. Joel -- "In the room the women come and go Talking of Michaelangelo." Joel Becker Senior Member of Technical Staff Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 From phillips at istop.com Tue Sep 6 00:57:23 2005 From: phillips at istop.com (Daniel Phillips) Date: Mon, 5 Sep 2005 20:57:23 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509051118.45792.dtor_core@ameritech.net> References: <20050901104620.GA22482@redhat.com> <200509051149.49929.phillips@istop.com> <200509051118.45792.dtor_core@ameritech.net> Message-ID: <200509052057.23807.phillips@istop.com> On Monday 05 September 2005 12:18, Dmitry Torokhov wrote: > On Monday 05 September 2005 10:49, Daniel Phillips wrote: > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > > On 2005-09-03T01:57:31, Daniel Phillips wrote: > > > > The only current users of dlms are cluster filesystems. There are > > > > zero users of the userspace dlm api. > > > > > > That is incorrect... > > > > Application users Lars, sorry if I did not make that clear. The issue is > > whether we need to export an all-singing-all-dancing dlm api from kernel > > to userspace today, or whether we can afford to take the necessary time > > to get it right while application writers take their time to have a good > > think about whether they even need it. > > If Linux fully supported OpenVMS DLM semantics we could start thinking > asbout moving our application onto a Linux box because our alpha server is > aging. > > That's just my user application writer $0.02. 
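[For readers following the fcntl(2)-versus-open() exchange between Alan and Joel above, this is the concrete shape of the existing interface Alan sketched (open, fcntl lock, work, fcntl unlock, close) on an ordinary local file.  Whether a cluster lock manager could sit behind these same calls is exactly what is being debated; the lock file path is just an example.]

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/var/lock/example", O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct flock fl = {
		.l_type   = F_WRLCK,    /* exclusive */
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,          /* whole file */
	};

	/* fcntl(.. lock ..): F_SETLKW blocks until the lock is granted. */
	if (fcntl(fd, F_SETLKW, &fl) != 0) {
		perror("fcntl lock");
		return 1;
	}

	/* ... do stuff, flush ... */

	/* fcntl(.. unlock ..) */
	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);

	/* close() would also drop the lock, and the kernel cleans up if the
	 * process dies, the same property the dlmfs open()/close() trick
	 * provides. */
	close(fd);
	return 0;
}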
What stops you from trying it with the patch? That kind of feedback would be worth way more than $0.02. Regards, Daniel From phillips at istop.com Tue Sep 6 04:02:40 2005 From: phillips at istop.com (Daniel Phillips) Date: Tue, 6 Sep 2005 00:02:40 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509052103.20519.dtor_core@ameritech.net> References: <20050901104620.GA22482@redhat.com> <200509052057.23807.phillips@istop.com> <200509052103.20519.dtor_core@ameritech.net> Message-ID: <200509060002.40823.phillips@istop.com> On Monday 05 September 2005 22:03, Dmitry Torokhov wrote: > On Monday 05 September 2005 19:57, Daniel Phillips wrote: > > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote: > > > On Monday 05 September 2005 10:49, Daniel Phillips wrote: > > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > > > > On 2005-09-03T01:57:31, Daniel Phillips wrote: > > > > > > The only current users of dlms are cluster filesystems. There > > > > > > are zero users of the userspace dlm api. > > > > > > > > > > That is incorrect... > > > > > > > > Application users Lars, sorry if I did not make that clear. The > > > > issue is whether we need to export an all-singing-all-dancing dlm api > > > > from kernel to userspace today, or whether we can afford to take the > > > > necessary time to get it right while application writers take their > > > > time to have a good think about whether they even need it. > > > > > > If Linux fully supported OpenVMS DLM semantics we could start thinking > > > asbout moving our application onto a Linux box because our alpha server > > > is aging. > > > > > > That's just my user application writer $0.02. > > > > What stops you from trying it with the patch? That kind of feedback > > would be worth way more than $0.02. > > We do not have such plans at the moment and I prefer spending my free > time on tinkering with kernel, not rewriting some in-house application. > Besides, DLM is not the only thing that does not have a drop-in > replacement in Linux. > > You just said you did not know if there are any potential users for the > full DLM and I said there are some. I did not say "potential", I said there are zero dlm applications at the moment. Nobody has picked up the prototype (g)dlm api, used it in an application and said "gee this works great, look what it does". I also claim that most developers who think that using a dlm for application synchronization would be really cool are probably wrong. Use sockets for synchronization exactly as for a single-node, multi-tasking application and you will end up with less code, more obviously correct code, probably more efficient and... you get an optimal, single-node version for free. And I also claim that there is precious little reason to have a full-featured dlm in-kernel. Being in-kernel has no benefit for a userspace application. But being in-kernel does add kernel bloat, because there will be extra features lathered on that are not needed by the only in-kernel user, the cluster filesystem. In the case of your port, you'd be better off hacking up a userspace library to provide OpenVMS dlm semantics exactly, not almost. By the way, you said "alpha server" not "alpha servers", was that just a slip? Because if you don't have a cluster then why are you using a dlm? 
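[A sketch of the socket alternative being advocated here: a trivial client asks a lock daemon for a named lock over a UNIX socket and holds it for the lifetime of the connection, so a killed client is cleaned up when the daemon sees the connection close.  The daemon, its socket path and its one-line protocol are all made up for illustration.]

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	char reply[16] = "";

	strncpy(addr.sun_path, "/var/run/lockd.sock",
		sizeof(addr.sun_path) - 1);

	int fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
		perror("connect to lock daemon");
		return 1;
	}

	/* Ask for the lock; the daemon replies once it is granted. */
	write(fd, "LOCK resource1\n", 15);
	if (read(fd, reply, sizeof(reply) - 1) <= 0 ||
	    strncmp(reply, "GRANTED", 7) != 0) {
		fprintf(stderr, "lock not granted\n");
		return 1;
	}

	/* ... do stuff while connected ... */

	/*
	 * Explicit release, or just exit: the daemon drops the lock as soon
	 * as the connection closes, which gives this scheme the same kill -9
	 * safety as the fd-based approaches discussed earlier.
	 */
	write(fd, "UNLOCK resource1\n", 17);
	close(fd);
	return 0;
}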
Regards, Daniel From phillips at istop.com Tue Sep 6 04:58:44 2005 From: phillips at istop.com (Daniel Phillips) Date: Tue, 6 Sep 2005 00:58:44 -0400 Subject: [Linux-cluster] Re: GFS, what's remainingh In-Reply-To: <200509052307.27417.dtor_core@ameritech.net> References: <20050901104620.GA22482@redhat.com> <200509060002.40823.phillips@istop.com> <200509052307.27417.dtor_core@ameritech.net> Message-ID: <200509060058.44934.phillips@istop.com> On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote: > On Monday 05 September 2005 23:02, Daniel Phillips wrote: > > By the way, you said "alpha server" not "alpha servers", was that just a > > slip? Because if you don't have a cluster then why are you using a dlm? > > No, it is not a slip. The application is running on just one node, so we > do not really use "distributed" part. However we make heavy use of the > rest of lock manager features, especially lock value blocks. Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature without even having the excuse you were forced to use it. Why don't you just have a daemon that sends your values over a socket? That should be all of a day's coding. Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. But you nicely supported my claim that most who think they should be using a dlm, really shouldn't. Regards, Daniel From phillips at istop.com Tue Sep 6 06:48:47 2005 From: phillips at istop.com (Daniel Phillips) Date: Tue, 6 Sep 2005 02:48:47 -0400 Subject: [Linux-cluster] Re: GFS, what's remainingh In-Reply-To: <200509060005.59578.dtor_core@ameritech.net> References: <20050901104620.GA22482@redhat.com> <200509060058.44934.phillips@istop.com> <200509060005.59578.dtor_core@ameritech.net> Message-ID: <200509060248.47433.phillips@istop.com> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > do you think it is a bit premature to dismiss something even without > ever seeing the code? You told me you are using a dlm for a single-node application, is there anything more I need to know? Regards, Daniel From phillips at istop.com Tue Sep 6 07:18:24 2005 From: phillips at istop.com (Daniel Phillips) Date: Tue, 6 Sep 2005 03:18:24 -0400 Subject: [Linux-cluster] Re: GFS, what's remainingh In-Reply-To: <200509060155.04685.dtor_core@ameritech.net> References: <20050901104620.GA22482@redhat.com> <200509060248.47433.phillips@istop.com> <200509060155.04685.dtor_core@ameritech.net> Message-ID: <200509060318.25260.phillips@istop.com> On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote: > On Tuesday 06 September 2005 01:48, Daniel Phillips wrote: > > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > > > do you think it is a bit premature to dismiss something even without > > > ever seeing the code? > > > > You told me you are using a dlm for a single-node application, is there > > anything more I need to know? > > I would still like to know why you consider it a "sin". On OpenVMS it is > fast, provides a way of cleaning up... There is something hard about handling EPIPE? > and does not introduce single point > of failure as it is the case with a daemon. And if we ever want to spread > the load between 2 boxes we easily can do it. But you said it runs on an aging Alpha, surely you do not intend to expand it to two aging Alphas? And what makes you think that socket-based synchronization keeps you from spreading out the load over multiple boxes? > Why would I not want to use it? It is not the right tool for the job from what you have told me. 
You want to get a few bytes of information from one task to another? Use a socket, as God intended. Regards, Daniel From alan at lxorguk.ukuu.org.uk Tue Sep 6 13:42:29 2005 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Tue, 06 Sep 2005 14:42:29 +0100 Subject: [Linux-cluster] Re: GFS, what's remainingh In-Reply-To: <200509060248.47433.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509060058.44934.phillips@istop.com> <200509060005.59578.dtor_core@ameritech.net> <200509060248.47433.phillips@istop.com> Message-ID: <1126014150.22131.51.camel@localhost.localdomain> On Maw, 2005-09-06 at 02:48 -0400, Daniel Phillips wrote: > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > > do you think it is a bit premature to dismiss something even without > > ever seeing the code? > > You told me you are using a dlm for a single-node application, is there > anything more I need to know? That's standard practice for many non-Unix operating systems. It means your code supports failover without much additional work and it provides all the functionality for locks on a single node too From sdake at mvista.com Fri Sep 2 18:45:22 2005 From: sdake at mvista.com (Steven Dake) Date: Fri, 02 Sep 2005 11:45:22 -0700 Subject: [Linux-cluster] partly OT: failover <500ms In-Reply-To: <4317F945.5050307@redhat.com> References: <20050901215836.634334a1.pegasus@nerv.eu.org> <1125610799.14500.105.camel@ayanami.boston.redhat.com> <4317F945.5050307@redhat.com> Message-ID: <1125686722.28688.26.camel@unnamed.az.mvista.com> On Fri, 2005-09-02 at 08:03 +0100, Patrick Caulfield wrote: > Lon Hohberger wrote: > > On Thu, 2005-09-01 at 21:58 +0200, Jure Pe?ar wrote: > > > >>Hi all, > >> > >>Sorry if this is somewhat offtopic here ... > >> > >>Our telco is looking into linux HA solutions for their VoIP needs. Their > >>main requirement is that the failover happens in the order of a few 100ms. > >> > >>Can redhat cluster be tweaked to work reliably with such short time > >>periods? This would mean heartbeat on the level of few ms and status probes > >>on the level of 10ms. Is this even feasible? > > > > > > Possibly, I don't think it can do it right now. A couple of things to > > remember: > > > > * For such a fast requirement, you'll want a dedicated network for > > cluster traffic and a real-time kernel. > > > > * Also, "detection and initiation of recovery" is all the cluster > > software can do for you; your application - by itself - may take longer > > than this to recover. > > > > * It's practically impossible to guarantee completion of I/O fencing in > > this amount of time, so your application must be able to do without, or > > you need to create a new specialized fencing mechanism which is > > guaranteed to complete within a very fast time. > > > > * I *think* CMAN is currently at the whole-second granularity, so some > > changes would need to be made to give it finer granularity. This > > shouldn't be difficult (but I'll let the developers of CMAN answer this > > definitively, though... ;) ) > > > > All true :) All cman timers are calibrated in seconds. I did run some tests a > while ago with them in milliseconds and 100ms timeouts and it worked > /reasonably/ well. However, without an RT kernel I wouldn't like to put this > into a production system - we've had several instances of the cman kernel thread > (which runs at the top RT priority) being stalled for up to 5 seconds and that > node being fenced. 
Smaller stalls may be more common so with timeouts set that > low you may well get nodes fenced for small delays. > > To be quite honest I'm not really sure what causes these stalls, as they > generally happen under heavy IO load I assume (possibly wrongly) that they are > related to disk flushes but someone who knows the VM better may out me right on > this. > > These systems could have swap.. Swap doesn't work because it is possible for a swapped page to take 1-10 seconds to be swapped into memory. The mlockall() system call resolves this particular problem. The poll sendmsg and recvmsg (and some others that require memory) system calls can block when allocating memory in low memory conditions. This unfortunately results in longer timeouts necessary when the system is overloaded. One solution is to change these system calls via some kind of socket option to allocate memory ahead of time for their operation. But I don't know of anything like this yet. I have measured failover with openais at 3 msec from detection to direction of new CSIs within components. Application failures are detected in 100 msec. Node failures are detected in 100 msec. It is possible on a system that meets the above scenario for a processor to be excluded from the membership during low memory. This is a reasonable choice, because the processor is having difficulty responding to requests in a timely fashion, and should be removed until overload control software on the processor cleans up the processor memory. Regards -steve From ak at suse.de Fri Sep 2 21:17:08 2005 From: ak at suse.de (Andi Kleen) Date: 02 Sep 2005 23:17:08 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050901132104.2d643ccd.akpm@osdl.org> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> Message-ID: Andrew Morton writes: > > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > > possibly gain (or vice versa) > > > > > > - Relative merits of the two offerings > > > > You missed the important one - people actively use it and have been for > > some years. Same reason with have NTFS, HPFS, and all the others. On > > that alone it makes sense to include. > > Again, that's not a technical reason. It's _a_ reason, sure. But what are > the technical reasons for merging gfs[2], ocfs2, both or neither? There seems to be clearly a need for a shared-storage fs of some sort for HA clusters and virtualized usage (multiple guests sharing a partition). Shared storage can be more efficient than network file systems like NFS because the storage access is often more efficient than network access and it is more reliable because it doesn't have a single point of failure in form of the NFS server. It's also a logical extension of the "failover on failure" clusters many people run now - instead of only failing over the shared fs at failure and keeping one machine idle the load can be balanced between multiple machines at any time. One argument to merge both might be that nobody really knows yet which shared-storage file system (GFS or OCFS2) is better. The only way to find out would be to let the user base try out both, and that's most practical when they're merged. Personally I think ocfs2 has nicer&cleaner code than GFS. It seems to be more or less a 64bit ext3 with cluster support, while GFS seems to reinvent a lot more things and has somewhat uglier code. 
On the other hand GFS' cluster support seems to be more aimed at being a universal cluster service open for other usages too, which might be a good thing. OCFS2's cluster seems to be more aimed at only serving the file system. But which one works better in practice is really an open question. The only thing that should probably be resolved is a common API for at least the clustered lock manager. Having multiple incompatible user space APIs for that would be sad. -Andi From hbryan at us.ibm.com Fri Sep 2 23:03:33 2005 From: hbryan at us.ibm.com (Bryan Henderson) Date: Fri, 2 Sep 2005 16:03:33 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: Message-ID: I have to correct an error in perspective, or at least in the wording of it, in the following, because it affects how people see the big picture in trying to decide how the filesystem types in question fit into the world: >Shared storage can be more efficient than network file >systems like NFS because the storage access is often more efficient >than network access The shared storage access _is_ network access. In most cases, it's a fibre channel/FCP network. Nowadays, it's more and more common for it to be a TCP/IP network just like the one folks use for NFS (but carrying iSCSI instead of NFS). It's also been done with a handful of other TCP/IP-based block storage protocols. The reason the storage access is expected to be more efficient than the NFS access is that the block access network protocols are supposed to be more efficient than the file access network protocols. In reality, I'm not sure there really is such a difference in efficiency between the protocols. The demonstrated differences in efficiency, or at least in speed, are due to other things that are different between a given new shared block implementation and a given old shared file implementation. But there's another advantage to shared block over shared file that hasn't been mentioned yet: some people find it easier to manage a pool of blocks than a pool of filesystems. >it is more reliable because it doesn't have a >single point of failure in form of the NFS server. This advantage isn't because it's shared (block) storage, but because it's a distributed filesystem. There are shared storage filesystems (e.g. IBM SANFS, ADIC StorNext) that have a centralized metadata or locking server that makes them unreliable (or unscalable) in the same ways as an NFS server. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems From michaelc at cs.wisc.edu Sat Sep 3 05:18:56 2005 From: michaelc at cs.wisc.edu (Mike Christie) Date: Sat, 03 Sep 2005 00:18:56 -0500 Subject: [Linux-cluster] [PATCH] rm PRIX64 and friends Message-ID: <1125724736.26239.1.camel@max> On EM64T the macros spit out the warning: format '%lu' expects type 'long unsigned int', but argument 5 has type 'uint64_t'. Since most of the places these macros are used involve uint64_t or int64_t, the patch just has gfs2 use %llu instead of trying to define its own macros.
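For context, a small userspace sketch of the format-string issue involved (userspace can use <inttypes.h>; the kernel has no such header, which is why gfs2 carried its own macros and why the patch below simply switches the format strings):

/*
 * uint64_t is "unsigned long" on some ABIs and "unsigned long long" on
 * others, so a hard-coded "%lu" (or "%llu" without a cast) will warn
 * somewhere.  The portable userspace idioms are an explicit cast plus
 * "%llu", or the <inttypes.h> PRIu64 macro.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    uint64_t blkno = 123456789012345ULL;

    /* printf("blkno = %lu\n", blkno);  warns wherever uint64_t is not unsigned long */
    printf("blkno = %llu\n", (unsigned long long)blkno);  /* cast + %llu */
    printf("blkno = %" PRIu64 "\n", blkno);               /* inttypes.h  */

    return 0;
}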
Index: gfs2-kernel/src/gfs2/gfs2.h =================================================================== RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/gfs2.h,v retrieving revision 1.13 diff -a -u -p -r1.13 gfs2.h --- gfs2-kernel/src/gfs2/gfs2.h 22 Aug 2005 07:25:29 -0000 1.13 +++ gfs2-kernel/src/gfs2/gfs2.h 3 Sep 2005 05:07:31 -0000 @@ -34,18 +34,6 @@ #define NO_FORCE 0 #define FORCE 1 -#if (BITS_PER_LONG == 64) -#define PRIu64 "lu" -#define PRId64 "ld" -#define PRIx64 "lx" -#define PRIX64 "lX" -#else -#define PRIu64 "Lu" -#define PRId64 "Ld" -#define PRIx64 "Lx" -#define PRIX64 "LX" -#endif - /* Divide num by den. Round up if there is a remainder. */ #define DIV_RU(num, den) (((num) + (den) - 1) / (den)) #define MAKE_MULT8(x) (((x) + 7) & ~7) Index: gfs2-kernel/src/gfs2/gfs2_ondisk.h =================================================================== RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/gfs2_ondisk.h,v retrieving revision 1.13 diff -a -u -p -r1.13 gfs2_ondisk.h --- gfs2-kernel/src/gfs2/gfs2_ondisk.h 2 Sep 2005 09:06:54 -0000 1.13 +++ gfs2-kernel/src/gfs2/gfs2_ondisk.h 3 Sep 2005 05:07:31 -0000 @@ -509,8 +509,8 @@ void gfs2_inum_out(struct gfs2_inum *no, void gfs2_inum_print(struct gfs2_inum *no) { - pv(no, no_formal_ino, "%"PRIu64); - pv(no, no_addr, "%"PRIu64); + pv(no, no_formal_ino, "%llu"); + pv(no, no_addr, "%llu"); } void gfs2_meta_header_in(struct gfs2_meta_header *mh, char *buf) @@ -538,7 +538,7 @@ void gfs2_meta_header_print(struct gfs2_ pv(mh, mh_magic, "0x%.8X"); pv(mh, mh_type, "%u"); pv(mh, mh_format, "%u"); - pv(mh, mh_blkno, "%"PRIu64); + pv(mh, mh_blkno, "%llu"); } void gfs2_sb_in(struct gfs2_sb *sb, char *buf) @@ -627,11 +627,11 @@ void gfs2_rindex_out(struct gfs2_rindex void gfs2_rindex_print(struct gfs2_rindex *ri) { - pv(ri, ri_addr, "%"PRIu64); + pv(ri, ri_addr, "%llu"); pv(ri, ri_length, "%u"); pv(ri, ri_pad, "%u"); - pv(ri, ri_data0, "%"PRIu64); + pv(ri, ri_data0, "%llu"); pv(ri, ri_data, "%u"); pv(ri, ri_bitbytes, "%u"); @@ -693,9 +693,9 @@ void gfs2_quota_out(struct gfs2_quota *q void gfs2_quota_print(struct gfs2_quota *qu) { - pv(qu, qu_limit, "%"PRIu64); - pv(qu, qu_warn, "%"PRIu64); - pv(qu, qu_value, "%"PRId64); + pv(qu, qu_limit, "%llu"); + pv(qu, qu_warn, "%llu"); + pv(qu, qu_value, "%lld"); } void gfs2_dinode_in(struct gfs2_dinode *di, char *buf) @@ -775,16 +775,16 @@ void gfs2_dinode_print(struct gfs2_dinod pv(di, di_uid, "%u"); pv(di, di_gid, "%u"); pv(di, di_nlink, "%u"); - pv(di, di_size, "%"PRIu64); - pv(di, di_blocks, "%"PRIu64); - pv(di, di_atime, "%"PRId64); - pv(di, di_mtime, "%"PRId64); - pv(di, di_ctime, "%"PRId64); + pv(di, di_size, "%llu"); + pv(di, di_blocks, "%llu"); + pv(di, di_atime, "%lld"); + pv(di, di_mtime, "%lld"); + pv(di, di_ctime, "%lld"); pv(di, di_major, "%u"); pv(di, di_minor, "%u"); - pv(di, di_goal_meta, "%"PRIu64); - pv(di, di_goal_data, "%"PRIu64); + pv(di, di_goal_meta, "%llu"); + pv(di, di_goal_data, "%llu"); pv(di, di_flags, "0x%.8X"); pv(di, di_payload_format, "%u"); @@ -793,7 +793,7 @@ void gfs2_dinode_print(struct gfs2_dinod pv(di, di_depth, "%u"); pv(di, di_entries, "%u"); - pv(di, di_eattr, "%"PRIu64); + pv(di, di_eattr, "%llu"); pa(di, di_reserved, 32); } @@ -873,7 +873,7 @@ void gfs2_leaf_print(struct gfs2_leaf *l pv(lf, lf_depth, "%u"); pv(lf, lf_entries, "%u"); pv(lf, lf_dirent_format, "%u"); - pv(lf, lf_next, "%"PRIu64); + pv(lf, lf_next, "%llu"); pa(lf, lf_reserved, 32); } @@ -948,7 +948,7 @@ void gfs2_log_header_out(struct gfs2_log void gfs2_log_header_print(struct gfs2_log_header *lh) { 
gfs2_meta_header_print(&lh->lh_header); - pv(lh, lh_sequence, "%"PRIu64); + pv(lh, lh_sequence, "%llu"); pv(lh, lh_flags, "0x%.8X"); pv(lh, lh_tail, "%u"); pv(lh, lh_blkno, "%u"); @@ -1010,8 +1010,8 @@ void gfs2_inum_range_out(struct gfs2_inu void gfs2_inum_range_print(struct gfs2_inum_range *ir) { - pv(ir, ir_start, "%"PRIu64); - pv(ir, ir_length, "%"PRIu64); + pv(ir, ir_start, "%llu"); + pv(ir, ir_length, "%llu"); } void gfs2_statfs_change_in(struct gfs2_statfs_change *sc, char *buf) @@ -1034,9 +1034,9 @@ void gfs2_statfs_change_out(struct gfs2_ void gfs2_statfs_change_print(struct gfs2_statfs_change *sc) { - pv(sc, sc_total, "%"PRId64); - pv(sc, sc_free, "%"PRId64); - pv(sc, sc_dinodes, "%"PRId64); + pv(sc, sc_total, "%lld"); + pv(sc, sc_free, "%lld"); + pv(sc, sc_dinodes, "%lld"); } void gfs2_unlinked_tag_in(struct gfs2_unlinked_tag *ut, char *buf) @@ -1084,7 +1084,7 @@ void gfs2_quota_change_out(struct gfs2_q void gfs2_quota_change_print(struct gfs2_quota_change *qc) { - pv(qc, qc_change, "%"PRId64); + pv(qc, qc_change, "%lld"); pv(qc, qc_flags, "0x%.8X"); pv(qc, qc_id, "%u"); } Index: gfs2-kernel/src/gfs2/glock.c =================================================================== RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/glock.c,v retrieving revision 1.30 diff -a -u -p -r1.30 glock.c --- gfs2-kernel/src/gfs2/glock.c 19 Aug 2005 07:52:14 -0000 1.30 +++ gfs2-kernel/src/gfs2/glock.c 3 Sep 2005 05:07:31 -0000 @@ -2370,7 +2370,7 @@ static int dump_inode(struct gfs2_inode int error = -ENOBUFS; gfs2_printf(" Inode:\n"); - gfs2_printf(" num = %"PRIu64"/%"PRIu64"\n", + gfs2_printf(" num = %llu %llu\n", ip->i_num.no_formal_ino, ip->i_num.no_addr); gfs2_printf(" type = %u\n", IF2DT(ip->i_di.di_mode)); gfs2_printf(" i_count = %d\n", atomic_read(&ip->i_count)); @@ -2406,7 +2406,7 @@ static int dump_glock(struct gfs2_glock spin_lock(&gl->gl_spin); - gfs2_printf("Glock (%u, %"PRIu64")\n", + gfs2_printf("Glock (%u, %llu)\n", gl->gl_name.ln_type, gl->gl_name.ln_number); gfs2_printf(" gl_flags ="); Index: gfs2-kernel/src/gfs2/ioctl.c =================================================================== RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/ioctl.c,v retrieving revision 1.18 diff -a -u -p -r1.18 ioctl.c --- gfs2-kernel/src/gfs2/ioctl.c 11 Aug 2005 07:23:43 -0000 1.18 +++ gfs2-kernel/src/gfs2/ioctl.c 3 Sep 2005 05:07:31 -0000 @@ -275,9 +275,9 @@ static int gi_get_statfs(struct gfs2_ino gfs2_printf("version 0\n"); gfs2_printf("bsize %u\n", sdp->sd_sb.sb_bsize); - gfs2_printf("total %"PRIu64"\n", sc.sc_total); - gfs2_printf("free %"PRIu64"\n", sc.sc_free); - gfs2_printf("dinodes %"PRIu64"\n", sc.sc_dinodes); + gfs2_printf("total %lld\n", sc.sc_total); + gfs2_printf("free %lld\n", sc.sc_free); + gfs2_printf("dinodes %lld\n", sc.sc_dinodes); error = 0; @@ -353,7 +353,7 @@ static int gi_get_counters(struct gfs2_i sdp->sd_jdesc->jd_blocks); gfs2_printf("sd_reclaim_count:glocks on reclaim list::%d\n", atomic_read(&sdp->sd_reclaim_count)); - gfs2_printf("sd_log_wraps:log wraps::%"PRIu64"\n", + gfs2_printf("sd_log_wraps:log wraps::%llu\n", sdp->sd_log_wraps); gfs2_printf("sd_bio_outstanding:outstanding BIO calls::%u\n", atomic_read(&sdp->sd_bio_outstanding)); Index: gfs2-kernel/src/gfs2/lvb.c =================================================================== RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/lvb.c,v retrieving revision 1.9 diff -a -u -p -r1.9 lvb.c --- gfs2-kernel/src/gfs2/lvb.c 2 Sep 2005 09:06:54 -0000 1.9 +++ gfs2-kernel/src/gfs2/lvb.c 3 Sep 2005 05:07:31 -0000 @@ 
-54,8 +54,8 @@ void gfs2_quota_lvb_print(struct gfs2_qu { pv(qb, qb_magic, "%u"); pv(qb, qb_pad, "%u"); - pv(qb, qb_limit, "%"PRIu64); - pv(qb, qb_warn, "%"PRIu64); - pv(qb, qb_value, "%"PRId64); + pv(qb, qb_limit, "%llu"); + pv(qb, qb_warn, "%llu"); + pv(qb, qb_value, "%lld"); } Index: gfs2-kernel/src/gfs2/meta_io.c =================================================================== RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/meta_io.c,v retrieving revision 1.25 diff -a -u -p -r1.25 meta_io.c --- gfs2-kernel/src/gfs2/meta_io.c 2 Sep 2005 09:06:54 -0000 1.25 +++ gfs2-kernel/src/gfs2/meta_io.c 3 Sep 2005 05:07:31 -0000 @@ -61,7 +61,7 @@ static void stuck_releasepage(struct buf struct gfs2_glock *gl; fs_warn(sdp, "stuck in gfs2_releasepage()\n"); - fs_warn(sdp, "blkno = %"PRIu64", bh->b_count = %d\n", + fs_warn(sdp, "blkno = %llu, bh->b_count = %d\n", (uint64_t)bh->b_blocknr, atomic_read(&bh->b_count)); fs_warn(sdp, "pinned = %u\n", buffer_pinned(bh)); fs_warn(sdp, "get_v2bd(bh) = %s\n", (bd) ? "!NULL" : "NULL"); @@ -71,7 +71,7 @@ static void stuck_releasepage(struct buf gl = bd->bd_gl; - fs_warn(sdp, "gl = (%u, %"PRIu64")\n", + fs_warn(sdp, "gl = (%u, %llu)\n", gl->gl_name.ln_type, gl->gl_name.ln_number); fs_warn(sdp, "bd_list_tr = %s, bd_le.le_list = %s\n", @@ -85,7 +85,7 @@ static void stuck_releasepage(struct buf if (!ip) return; - fs_warn(sdp, "ip = %"PRIu64"/%"PRIu64"\n", + fs_warn(sdp, "ip = %llu %llu\n", ip->i_num.no_formal_ino, ip->i_num.no_addr); fs_warn(sdp, "ip->i_count = %d, ip->i_vnode = %s\n", atomic_read(&ip->i_count), Index: gfs2-kernel/src/gfs2/rgrp.c =================================================================== RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/rgrp.c,v retrieving revision 1.25 diff -a -u -p -r1.25 rgrp.c --- gfs2-kernel/src/gfs2/rgrp.c 19 Aug 2005 07:52:15 -0000 1.25 +++ gfs2-kernel/src/gfs2/rgrp.c 3 Sep 2005 05:07:32 -0000 @@ -1013,7 +1013,7 @@ static struct gfs2_rgrpd *rgblk_free(str rgd = gfs2_blk2rgrpd(sdp, bstart); if (!rgd) { if (gfs2_consist(sdp)) - fs_err(sdp, "block = %"PRIu64"\n", bstart); + fs_err(sdp, "block = %llu\n", bstart); return NULL; } @@ -1302,7 +1302,7 @@ void gfs2_rlist_add(struct gfs2_sbd *sdp rgd = gfs2_blk2rgrpd(sdp, block); if (!rgd) { if (gfs2_consist(sdp)) - fs_err(sdp, "block = %"PRIu64"\n", block); + fs_err(sdp, "block = %llu\n", block); return; } Index: gfs2-kernel/src/gfs2/util.c =================================================================== RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/util.c,v retrieving revision 1.17 diff -a -u -p -r1.17 util.c --- gfs2-kernel/src/gfs2/util.c 19 Aug 2005 07:52:15 -0000 1.17 +++ gfs2-kernel/src/gfs2/util.c 3 Sep 2005 05:07:32 -0000 @@ -147,7 +147,7 @@ int gfs2_consist_inode_i(struct gfs2_ino struct gfs2_sbd *sdp = ip->i_sbd; return gfs2_lm_withdraw(sdp, "GFS2: fsid=%s: fatal: filesystem consistency error\n" - "GFS2: fsid=%s: inode = %"PRIu64"/%"PRIu64"\n" + "GFS2: fsid=%s: inode = %llu %llu\n" "GFS2: fsid=%s: function = %s\n" "GFS2: fsid=%s: file = %s, line = %u\n" "GFS2: fsid=%s: time = %lu\n", @@ -171,7 +171,7 @@ int gfs2_consist_rgrpd_i(struct gfs2_rgr struct gfs2_sbd *sdp = rgd->rd_sbd; return gfs2_lm_withdraw(sdp, "GFS2: fsid=%s: fatal: filesystem consistency error\n" - "GFS2: fsid=%s: RG = %"PRIu64"\n" + "GFS2: fsid=%s: RG = %llu\n" "GFS2: fsid=%s: function = %s\n" "GFS2: fsid=%s: file = %s, line = %u\n" "GFS2: fsid=%s: time = %lu\n", @@ -195,7 +195,7 @@ int gfs2_meta_check_ii(struct gfs2_sbd * int me; me = gfs2_lm_withdraw(sdp, "GFS2: fsid=%s: fatal: 
invalid metadata block\n" - "GFS2: fsid=%s: bh = %"PRIu64" (%s)\n" + "GFS2: fsid=%s: bh = %llu (%s)\n" "GFS2: fsid=%s: function = %s\n" "GFS2: fsid=%s: file = %s, line = %u\n" "GFS2: fsid=%s: time = %lu\n", @@ -220,7 +220,7 @@ int gfs2_metatype_check_ii(struct gfs2_s int me; me = gfs2_lm_withdraw(sdp, "GFS2: fsid=%s: fatal: invalid metadata block\n" - "GFS2: fsid=%s: bh = %"PRIu64" (type: exp=%u, found=%u)\n" + "GFS2: fsid=%s: bh = %llu (type: exp=%u, found=%u)\n" "GFS2: fsid=%s: function = %s\n" "GFS2: fsid=%s: file = %s, line = %u\n" "GFS2: fsid=%s: time = %lu\n", @@ -263,7 +263,7 @@ int gfs2_io_error_bh_i(struct gfs2_sbd * { return gfs2_lm_withdraw(sdp, "GFS2: fsid=%s: fatal: I/O error\n" - "GFS2: fsid=%s: block = %"PRIu64"\n" + "GFS2: fsid=%s: block = %llu\n" "GFS2: fsid=%s: function = %s\n" "GFS2: fsid=%s: file = %s, line = %u\n" "GFS2: fsid=%s: time = %lu\n", From greg at kroah.com Sat Sep 3 05:28:21 2005 From: greg at kroah.com (Greg KH) Date: Fri, 2 Sep 2005 22:28:21 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050902094403.GD16595@redhat.com> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> <20050902094403.GD16595@redhat.com> Message-ID: <20050903052821.GA23711@kroah.com> On Fri, Sep 02, 2005 at 05:44:03PM +0800, David Teigland wrote: > On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote: > > > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,); > > > what is gfs2_assert() about anyway? please just use BUG_ON directly > > everywhere > > When a machine has many gfs file systems mounted at once it can be useful > to know which one failed. Does the following look ok? > > #define gfs2_assert(sdp, assertion) \ > do { \ > if (unlikely(!(assertion))) { \ > printk(KERN_ERR \ > "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \ > "GFS2: fsid=%s: function = %s\n" \ > "GFS2: fsid=%s: file = %s, line = %u\n" \ > "GFS2: fsid=%s: time = %lu\n", \ > sdp->sd_fsname, # assertion, \ > sdp->sd_fsname, __FUNCTION__, \ > sdp->sd_fsname, __FILE__, __LINE__, \ > sdp->sd_fsname, get_seconds()); \ > BUG(); \ You will already get the __FUNCTION__ (and hence the __FILE__ info) directly from the BUG() dump, as well as the time from the syslog message (turn on the printk timestamps if you want a more fine grain timestamp), so the majority of this macro is redundant with the BUG() macro... thanks, greg k-h From dhazelton at enter.net Sat Sep 3 06:42:31 2005 From: dhazelton at enter.net (D. Hazelton) Date: Sat, 3 Sep 2005 02:42:31 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125728040.3223.2.camel@laptopd505.fenrus.org> References: <20050901104620.GA22482@redhat.com> <20050903051841.GA13211@redhat.com> <1125728040.3223.2.camel@laptopd505.fenrus.org> Message-ID: <200509030242.37536.dhazelton@enter.net> On Saturday 03 September 2005 02:14, Arjan van de Ven wrote: > On Sat, 2005-09-03 at 13:18 +0800, David Teigland wrote: > > On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote: > > > Alan Cox wrote: > > > > > - Why GFS is better than OCFS2, or has functionality which > > > > > OCFS2 cannot possibly gain (or vice versa) > > > > > > > > > > - Relative merits of the two offerings > > > > > > > > You missed the important one - people actively use it and > > > > have been for some years. Same reason with have NTFS, HPFS, > > > > and all the others. On that alone it makes sense to include. > > > > > > Again, that's not a technical reason. It's _a_ reason, sure. 
> > > But what are the technical reasons for merging gfs[2], ocfs2, > > > both or neither? > > > > > > If one can be grown to encompass the capabilities of the other > > > then we're left with a bunch of legacy code and wasted effort. > > > > GFS is an established fs, it's not going away, you'd be hard > > pressed to find a more widely used cluster fs on Linux. GFS is > > about 10 years old and has been in use by customers in production > > environments for about 5 years. > > but you submitted GFS2 not GFS. I'd rather not step into the middle of this mess, but you clipped out a good portion that explains why he talks about GFS when he submitted GFS2. Let me quote the post you've pulled that partial paragraph from: "The latest development cycle (GFS2) has focused on improving performance, it's not a new file system -- the "2" indicates that it's not ondisk compatible with earlier versions." In other words he didn't submit the original, but the new version of it that is not compatable with the original GFS on disk format. While it is clear that GFS2 cannot claim the large installed user base or the proven capacity of the original (it is, after all, a new version that has incompatabilities) it can claim that as it's heritage and what it's aiming towards, the same as ext3 can (and does) claim the power and reliability of ext2. In this case I've been following this thread just for the hell of it and I've noticed that there are some people who seem to not want to even think of having GFS2 included in a mainline kernel for personal and not technical reasons. That does not describe most of the people on this list, many of whom have helped debug the code (among other things), but it does describe a few. I'll go back to being quiet now... DRH -------------- next part -------------- A non-text attachment was scrubbed... Name: 0xA6992F96300F159086FF28208F8280BB8B00C32A.asc Type: application/pgp-keys Size: 1365 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From tytso at mit.edu Mon Sep 5 05:54:28 2005 From: tytso at mit.edu (Theodore Ts'o) Date: Mon, 5 Sep 2005 01:54:28 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050904203344.GA1987@elf.ucw.cz> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> <20050904203344.GA1987@elf.ucw.cz> Message-ID: <20050905055428.GA29158@thunk.org> On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote: > Hi! > > > - read-only mount > > - "specatator" mount (like ro but no journal allocated for the mount, > > no fencing needed for failed node that was mounted as specatator) > > I'd call it "real-read-only", and yes, that's very usefull > mount. Could we get it for ext3, too? This is a bit of a degression, but it's quite a bit different from what ocfs2 is doing, where it is not necessary to replay the journal in order to assure filesystem consistency. In the ext3 case, the only time when read-only isn't quite read-only is when the filesystem was unmounted uncleanly and the journal needs to be replayed in order for the filesystem to be consistent. 
Mounting the filesystem read-only without replaying the journal could and very likely would result in the filesystem reporting filesystem consistency problems, and if the filesystem is mounted with the reboot-on-errors option, well.... - Ted From tytso at mit.edu Mon Sep 5 14:03:19 2005 From: tytso at mit.edu (Theodore Ts'o) Date: Mon, 5 Sep 2005 10:03:19 -0400 Subject: [Linux-cluster] Re: real read-only [was Re: GFS, what's remaining] In-Reply-To: <20050905082735.GA2662@elf.ucw.cz> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> <20050904203344.GA1987@elf.ucw.cz> <20050905055428.GA29158@thunk.org> <20050905082735.GA2662@elf.ucw.cz> Message-ID: <20050905140318.GA10751@thunk.org> On Mon, Sep 05, 2005 at 10:27:35AM +0200, Pavel Machek wrote: > > There's a better reason, too. I do swsusp. Then I'd like to boot with > / mounted read-only (so that I can read my config files, some > binaries, and maybe suspended image), but I absolutely may not write > to disk at this point, because I still want to resume. > You could _hope_ that the filesystem is consistent enough that it is safe to try to read config files, binaries, etc. without running the journal, but there is absolutely no guarantee that this is the case. I'm not sure you want to depend on that for swsusp. One potential solution that would probably meet your needs is a dm hack which reads in the blocks in the journal, and then uses the most recent block in the journal in preference to the version on disk. - Ted From tytso at mit.edu Mon Sep 5 14:07:47 2005 From: tytso at mit.edu (Theodore Ts'o) Date: Mon, 5 Sep 2005 10:07:47 -0400 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905070922.GK21228@ca-server1.us.oracle.com> References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> <20050903051841.GA13211@redhat.com> <20050904203344.GA1987@elf.ucw.cz> <20050905055428.GA29158@thunk.org> <20050905070922.GK21228@ca-server1.us.oracle.com> Message-ID: <20050905140747.GB10751@thunk.org> On Mon, Sep 05, 2005 at 12:09:23AM -0700, Mark Fasheh wrote: > Btw, I'm curious to know how useful folks find the ext3 mount options > errors=continue and errors=panic. I'm extremely likely to implement the > errors=read-only behavior as default in OCFS2 and I'm wondering whether the > other two are worth looking into. For a single-user system errors=panic is definitely very useful on the system disk, since that's the only way that we can force an fsck, and also abort a server that might be failing and returning erroneous information to its clients. Think of it is as i/o fencing when you're not sure that the system is going to be performing correctly. Whether or not this is useful for ocfs2 is a different matter. If it's only for data volumes, and if the only way to fix filesystem inconsistencies on a cluster filesystem is to request all nodes in the cluster to unmount the filesystem and then arrange to run ocfs2's fsck on the filesystem, then forcing every single cluster in the node to panic is probably counterproductive. 
:-) - Ted From dtor_core at ameritech.net Mon Sep 5 16:18:45 2005 From: dtor_core at ameritech.net (Dmitry Torokhov) Date: Mon, 5 Sep 2005 11:18:45 -0500 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509051149.49929.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <20050905141432.GF5498@marowsky-bree.de> <200509051149.49929.phillips@istop.com> Message-ID: <200509051118.45792.dtor_core@ameritech.net> On Monday 05 September 2005 10:49, Daniel Phillips wrote: > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > On 2005-09-03T01:57:31, Daniel Phillips wrote: > > > The only current users of dlms are cluster filesystems. There are zero > > > users of the userspace dlm api. > > > > That is incorrect... > > Application users Lars, sorry if I did not make that clear. The issue is > whether we need to export an all-singing-all-dancing dlm api from kernel to > userspace today, or whether we can afford to take the necessary time to get > it right while application writers take their time to have a good think about > whether they even need it. > If Linux fully supported OpenVMS DLM semantics we could start thinking asbout moving our application onto a Linux box because our alpha server is aging. That's just my user application writer $0.02. -- Dmitry From kurt.hackel at oracle.com Mon Sep 5 19:11:59 2005 From: kurt.hackel at oracle.com (kurt.hackel at oracle.com) Date: Mon, 5 Sep 2005 12:11:59 -0700 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905092433.GE17607@redhat.com> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> <20050905015408.21455e56.akpm@osdl.org> <20050905092433.GE17607@redhat.com> Message-ID: <20050905191159.GA21169@gimp.org> On Mon, Sep 05, 2005 at 05:24:33PM +0800, David Teigland wrote: > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote: > > David Teigland wrote: > > > > > > We export our full dlm API through read/write/poll on a misc device. > > > > > > > inotify did that for a while, but we ended up going with a straight syscall > > interface. > > > > How fat is the dlm interface? ie: how many syscalls would it take? > > Four functions: > create_lockspace() > release_lockspace() > lock() > unlock() FWIW, it looks like we can agree on the core interface. ocfs2_dlm exports essentially the same functions: dlm_register_domain() dlm_unregister_domain() dlmlock() dlmunlock() I also implemented dlm_migrate_lockres() to explicitly remaster a lock on another node, but this isn't used by any callers today (except for debugging purposes). There is also some wiring between the fs and the dlm (eviction callbacks) to deal with some ordering issues between the two layers, but these could go if we get stronger membership. There are quite a few other functions in the "full" spec(1) that we didn't even attempt, either because we didn't require direct user<->kernel access or we just didn't need the function. As for the rather thick set of parameters expected in dlm calls, we managed to get dlmlock down to *ahem* eight, and the rest are fairly slim. 
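To make the shape of that minimal interface concrete, here is a toy, compilable sketch; every type, prototype and stub body below is invented for illustration and is not the actual gfs dlm or ocfs2_dlm API (the real dlmlock(), as noted above, takes eight parameters):

/*
 * The calling pattern shared by the two four-call interfaces discussed
 * above: join a lockspace, request a lock with a completion (AST)
 * callback, drop the lock, leave the lockspace.  The stub "lock manager"
 * below grants everything immediately so the example actually runs.
 */
#include <stdio.h>

typedef int lockspace_t;
typedef int lkid_t;
typedef void (*ast_t)(void *arg, int status);

#define MODE_EXCLUSIVE 5                      /* hypothetical lock mode */

/* --- stub lock manager, standing in for the real thing --- */
static lockspace_t create_lockspace(const char *name)
{
    printf("joined lockspace %s\n", name);
    return 1;
}
static int release_lockspace(lockspace_t ls) { (void)ls; return 0; }
static int lock(lockspace_t ls, const char *res, int mode,
                ast_t ast, void *arg, lkid_t *lkid)
{
    (void)ls; (void)mode;
    *lkid = 42;                 /* pretend the lock was granted at once */
    ast(arg, 0);
    printf("granted %s\n", res);
    return 0;
}
static int unlock(lockspace_t ls, lkid_t lkid)
{
    (void)ls;
    printf("released lkid %d\n", lkid);
    return 0;
}

/* --- what a caller of such an interface looks like --- */
static void granted(void *arg, int status)
{
    printf("AST for %s, status %d\n", (const char *)arg, status);
}

int main(void)
{
    lockspace_t ls = create_lockspace("myfs");
    lkid_t lkid;

    if (lock(ls, "resource42", MODE_EXCLUSIVE, granted, "resource42", &lkid) == 0) {
        /* ... the shared resource is safe to touch once the AST fires ... */
        unlock(ls, lkid);
    }

    release_lockspace(ls);
    return 0;
}

Whatever the transport (syscalls, a read/write misc device, or dlmfs), it is this join/lock/unlock/leave pattern that ends up being exposed.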
Looking at the misc device that gfs uses, it seems like there is pretty much complete interface to the same calls you have in kernel, validated on the write() calls to the misc device. With dlmfs, we were seeking to lock down and simplify user access by using standard ast/bast/unlockast calls, using a file descriptor as an opaque token for a single lock, letting the vfs lifetime on this fd help with abnormal termination, etc. I think both the misc device and dlmfs are helpful and not necessarily mutually exclusive, and probably both are better approaches than exporting everything via loads of syscalls (which seems to be the VMS/opendlm model). -kurt 1. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf Kurt C. Hackel Oracle kurt.hackel at oracle.com From dtor_core at ameritech.net Tue Sep 6 02:03:19 2005 From: dtor_core at ameritech.net (Dmitry Torokhov) Date: Mon, 5 Sep 2005 21:03:19 -0500 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <200509052057.23807.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509051118.45792.dtor_core@ameritech.net> <200509052057.23807.phillips@istop.com> Message-ID: <200509052103.20519.dtor_core@ameritech.net> On Monday 05 September 2005 19:57, Daniel Phillips wrote: > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote: > > On Monday 05 September 2005 10:49, Daniel Phillips wrote: > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > > > On 2005-09-03T01:57:31, Daniel Phillips wrote: > > > > > The only current users of dlms are cluster filesystems. There are > > > > > zero users of the userspace dlm api. > > > > > > > > That is incorrect... > > > > > > Application users Lars, sorry if I did not make that clear. The issue is > > > whether we need to export an all-singing-all-dancing dlm api from kernel > > > to userspace today, or whether we can afford to take the necessary time > > > to get it right while application writers take their time to have a good > > > think about whether they even need it. > > > > If Linux fully supported OpenVMS DLM semantics we could start thinking > > asbout moving our application onto a Linux box because our alpha server is > > aging. > > > > That's just my user application writer $0.02. > > What stops you from trying it with the patch? That kind of feedback would be > worth way more than $0.02. > We do not have such plans at the moment and I prefer spending my free time on tinkering with kernel, not rewriting some in-house application. Besides, DLM is not the only thing that does not have a drop-in replacement in Linux. You just said you did not know if there are any potential users for the full DLM and I said there are some. -- Dmitry From dtor_core at ameritech.net Tue Sep 6 04:07:26 2005 From: dtor_core at ameritech.net (Dmitry Torokhov) Date: Mon, 5 Sep 2005 23:07:26 -0500 Subject: [Linux-cluster] Re: GFS, what's remainingh In-Reply-To: <200509060002.40823.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509052103.20519.dtor_core@ameritech.net> <200509060002.40823.phillips@istop.com> Message-ID: <200509052307.27417.dtor_core@ameritech.net> On Monday 05 September 2005 23:02, Daniel Phillips wrote: > > By the way, you said "alpha server" not "alpha servers", was that just a slip? > Because if you don't have a cluster then why are you using a dlm? > No, it is not a slip. The application is running on just one node, so we do not really use "distributed" part. 
However we make heavy use of the rest of lock manager features, especially lock value blocks. -- Dmitry From dtor_core at ameritech.net Tue Sep 6 05:05:58 2005 From: dtor_core at ameritech.net (Dmitry Torokhov) Date: Tue, 6 Sep 2005 00:05:58 -0500 Subject: [Linux-cluster] Re: GFS, what's remainingh In-Reply-To: <200509060058.44934.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509052307.27417.dtor_core@ameritech.net> <200509060058.44934.phillips@istop.com> Message-ID: <200509060005.59578.dtor_core@ameritech.net> On Monday 05 September 2005 23:58, Daniel Phillips wrote: > On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote: > > On Monday 05 September 2005 23:02, Daniel Phillips wrote: > > > By the way, you said "alpha server" not "alpha servers", was that just a > > > slip? Because if you don't have a cluster then why are you using a dlm? > > > > No, it is not a slip. The application is running on just one node, so we > > do not really use "distributed" part. However we make heavy use of the > > rest of lock manager features, especially lock value blocks. > > Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature > without even having the excuse you were forced to use it. Why don't you just > have a daemon that sends your values over a socket? That should be all of a > day's coding. > Umm, because when most of the code was written TCP and the rest was the clunkiest code out there? Plus, having a daemon introduces problems with cleanup (say process dies for one reason or another) whereas having it in OS takes care of that. > Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. > But you nicely supported my claim that most who think they should be using a > dlm, really shouldn't. Heh, do you think it is a bit premature to dismiss something even without ever seeing the code? -- Dmitry From dtor_core at ameritech.net Tue Sep 6 06:55:03 2005 From: dtor_core at ameritech.net (Dmitry Torokhov) Date: Tue, 6 Sep 2005 01:55:03 -0500 Subject: [Linux-cluster] Re: GFS, what's remainingh In-Reply-To: <200509060248.47433.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509060005.59578.dtor_core@ameritech.net> <200509060248.47433.phillips@istop.com> Message-ID: <200509060155.04685.dtor_core@ameritech.net> On Tuesday 06 September 2005 01:48, Daniel Phillips wrote: > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > > do you think it is a bit premature to dismiss something even without > > ever seeing the code? > > You told me you are using a dlm for a single-node application, is there > anything more I need to know? > I would still like to know why you consider it a "sin". On OpenVMS it is fast, provides a way of cleaning up and does not introduce single point of failure as it is the case with a daemon. And if we ever want to spread the load between 2 boxes we easily can do it. Why would I not want to use it? 
-- Dmitry From suparna at in.ibm.com Tue Sep 6 12:55:18 2005 From: suparna at in.ibm.com (Suparna Bhattacharya) Date: Tue, 6 Sep 2005 18:25:18 +0530 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: References: <20050901104620.GA22482@redhat.com> <20050901035939.435768f3.akpm@osdl.org> <1125586158.15768.42.camel@localhost.localdomain> <20050901132104.2d643ccd.akpm@osdl.org> Message-ID: <20050906125517.GA7531@in.ibm.com> On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote: > Andrew Morton writes: > > > > > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > > > possibly gain (or vice versa) > > > > > > > > - Relative merits of the two offerings > > > > > > You missed the important one - people actively use it and have been for > > > some years. Same reason with have NTFS, HPFS, and all the others. On > > > that alone it makes sense to include. > > > > Again, that's not a technical reason. It's _a_ reason, sure. But what are > > the technical reasons for merging gfs[2], ocfs2, both or neither? > > There seems to be clearly a need for a shared-storage fs of some sort > for HA clusters and virtualized usage (multiple guests sharing a > partition). Shared storage can be more efficient than network file > systems like NFS because the storage access is often more efficient > than network access and it is more reliable because it doesn't have a > single point of failure in form of the NFS server. > > It's also a logical extension of the "failover on failure" clusters > many people run now - instead of only failing over the shared fs at > failure and keeping one machine idle the load can be balanced between > multiple machines at any time. > > One argument to merge both might be that nobody really knows yet which > shared-storage file system (GFS or OCFS2) is better. The only way to > find out would be to let the user base try out both, and that's most > practical when they're merged. > > Personally I think ocfs2 has nicer&cleaner code than GFS. > It seems to be more or less a 64bit ext3 with cluster support, while The "more or less" is what bothers me here - the first time I heard this, it sounded a little misleading, as I expected to find some kind of a patch to ext3 to make it 64 bit with extents and cluster support. Now I understand it a little better (thanks to Joel and Mark) And herein lies the issue where I tend to agree with Andrew on -- its really nice to have multiple filesystems innovating freely in their niches and eventually proving themselves in practice, without being bogged down by legacy etc. But at the same time, is there enough thought and discussion about where the fragmentation/diversification is really warranted, vs improving what is already there, or say incorporating the best of one into another, maybe over a period of time ? The number of filesystems seems to just keep growing, and supporting all of them isn't easy -- for users it isn't really easy to switch from one to another, and the justifications for choosing between them is sometimes confusing and burdensome from an administrator standpoint - one filesystem is good in certain conditions, another in others, stability levels may vary etc, and its not always possible to predict which aspect to prioritize. Now, with filesystems that have been around in production for a long time, the on-disk format becomes a major constraining factor, and the reason for having various legacy support around. Likewise, for some special purpose filesystems there really is a niche usage. 
But for new and sufficiently general purpose filesystems, with new on-disk structure, isn't it worth thinking this through and trying to get it right ? Yeah, it is a lot of work upfront ... but with double the people working on something, it just might get much better than what they individually can. Sometimes. BTW, I don't know if it is worth it in this particular case, but just something that worries me in general. > GFS seems to reinvent a lot more things and has somewhat uglier code. > On the other hand GFS' cluster support seems to be more aimed > at being a universal cluster service open for other usages too, > which might be a good thing. OCFS2s cluster seems to be more > aimed at only serving the file system. > > But which one works better in practice is really an open question. True, but what usually ends up happening is that this question can never quite be answered in black and white. So both just continue to exist and apps need to support both ... convergence becomes impossible and long term duplication inevitable. So at least having a clear demarcation/guideline of what situations each is suitable for upfront would be a good thing. That might also get some cross ocfs-gfs and ocfs-ext3 reviews in the process :) Regards Suparna -- Suparna Bhattacharya (suparna at in.ibm.com) Linux Technology Center IBM Software Lab, India From dmitry.torokhov at gmail.com Tue Sep 6 14:31:34 2005 From: dmitry.torokhov at gmail.com (Dmitry Torokhov) Date: Tue, 6 Sep 2005 09:31:34 -0500 Subject: [Linux-cluster] Re: GFS, what's remainingh In-Reply-To: <200509060318.25260.phillips@istop.com> References: <20050901104620.GA22482@redhat.com> <200509060248.47433.phillips@istop.com> <200509060155.04685.dtor_core@ameritech.net> <200509060318.25260.phillips@istop.com> Message-ID: On 9/6/05, Daniel Phillips wrote: > On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote: > > On Tuesday 06 September 2005 01:48, Daniel Phillips wrote: > > > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > > > > do you think it is a bit premature to dismiss something even without > > > > ever seeing the code? > > > > > > You told me you are using a dlm for a single-node application, is there > > > anything more I need to know? > > > > I would still like to know why you consider it a "sin". On OpenVMS it is > > fast, provides a way of cleaning up... > > There is something hard about handling EPIPE? > Just the fact that you want me to handle it ;) > > and does not introduce single point > > of failure as it is the case with a daemon. And if we ever want to spread > > the load between 2 boxes we easily can do it. > > But you said it runs on an aging Alpha, surely you do not intend to expand it > to two aging Alphas? You would be right if I was designing this right now. Now roll 10 - 12 years back and now I have a shiny new alpha. Would you criticize me then for using a mechanism that allowed easily spread application across several nodes with minimal changes if needed? What you fail to realize that there applications that run and will continue to run for a long time. > And what makes you think that socket-based > synchronization keeps you from spreading out the load over multiple boxes? > > > Why would I not want to use it? > > It is not the right tool for the job from what you have told me. You want to > get a few bytes of information from one task to another? Use a socket, as > God intended. 
> Again, when TCPIP is not a native network stack, when libc socket routines are not readily available - DLM starts looking much more viable. -- Dmitry From lhh at redhat.com Tue Sep 6 15:25:11 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 06 Sep 2005 11:25:11 -0400 Subject: [Linux-cluster] NFS relocate: old TCP/IP connection resulting in DUP/ACK storms and largish timeouts (was: iptables protection wrapper; nfsexport.sh vs ip.sh racing) In-Reply-To: <20050905153649.GE17096@neu.nirvana> References: <20050822225227.GJ24127@neu.nirvana> <1125340879.24205.30.camel@ayanami.boston.redhat.com> <20050829233523.GD5908@neu.nirvana> <1125425012.21943.1.camel@ayanami.boston.redhat.com> <20050905153649.GE17096@neu.nirvana> Message-ID: <1126020311.3344.15.camel@ayanami.boston.redhat.com> On Mon, 2005-09-05 at 17:36 +0200, Axel Thimm wrote: > Is there a way to have ip.sh fry all open TCP/IP connections to a > service IP that is to be abandoned? I guess that would be the better > solution (that would also apply to non-NFS services). Not aware of one -- I will look into it, though! -- Lon From lhh at redhat.com Tue Sep 6 15:26:48 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 06 Sep 2005 11:26:48 -0400 Subject: [Linux-cluster] Re: NFS relocate: old TCP/IP connection resulting in DUP/ACK storms and largish timeouts In-Reply-To: <20050905182143.GA2099@neu.nirvana> References: <20050822225227.GJ24127@neu.nirvana> <1125340879.24205.30.camel@ayanami.boston.redhat.com> <20050829233523.GD5908@neu.nirvana> <1125425012.21943.1.camel@ayanami.boston.redhat.com> <20050905153649.GE17096@neu.nirvana> <20050905182143.GA2099@neu.nirvana> Message-ID: <1126020408.3344.18.camel@ayanami.boston.redhat.com> On Mon, 2005-09-05 at 20:21 +0200, Axel Thimm wrote: > Both bugs have been filed in bugzilla: > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167571 > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167572 > > I guess the latter will move to another component like "kernel", if it > really turns out to be neither cluster nor even nfs specific. It should be pretty easy to make that determination, thanks for the report! -- Lon From lhh at redhat.com Tue Sep 6 16:02:47 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 06 Sep 2005 12:02:47 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <16102.1125953577@www46.gmx.net> References: <16102.1125953577@www46.gmx.net> Message-ID: <1126022567.3344.53.camel@ayanami.boston.redhat.com> On Mon, 2005-09-05 at 22:52 +0200, Andreas Brosche wrote: > Long story cut short, we want > - GFS on a shared SCSI disk (Performance is not important) Using GFS on shared SCSI will work in *some* cases: + Shared SCSI RAID arrays with multiple buses work well. Mid to high end here, not JBODs with host-RAID controllers. The biggest discernable difference between one of these and a FC SAN array is the fact that it has SCSI ports instead of fiber-channel ports. ? Host-RAID *might* work, but only if the JBODs behind it has multiple buses, and the host controllers are all in "clustered" and/or "cache disabled" mode. - Multi-initator SCSI buses do not work with GFS in any meaningful way, regardless of what the host controller is. Ex: Two machines with different SCSI IDs on their initiator connected to the same physical SCSI bus. > - dlm without network access (theoretically possible... > but how dependant is GFS on the cluster services?) The DLM runs over IP, as does the cluster manager. 
Additionally, please remember that GFS requires fencing, and that most fence-devices are IP-enabled. It may be possible to work around the need for actual ethernet by using something like PPP over high speed serial, but I don't see how that's better than a crossover ethernet cable. Also, I don't know if it will work ;) Many users choose to separate cluster communication from other forms by using a fully self-contained private network. There is currently no way for GFS to use only a quorum disk for all the lock information, and even if it could, performance would be abysmal. -- Lon From moya at infomed.sld.cu Tue Sep 6 17:15:26 2005 From: moya at infomed.sld.cu (Maykel Moya) Date: Tue, 06 Sep 2005 13:15:26 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <16102.1125953577@www46.gmx.net> References: <16102.1125953577@www46.gmx.net> Message-ID: <1126026926.11885.7.camel@julia.sld.cu> El lun, 05-09-2005 a las 22:52 +0200, Andreas Brosche escribi?: > - GFS on a shared SCSI disk (Performance is not important) I recently set up something like that. We use a external HP Smart Array Cluster Storage. It has a separate connection (SCSI cable) to both hosts. The servers communicates over a crossover cable. I'm using DLM. Regards, maykel From tabmowzo at us.ibm.com Tue Sep 6 17:35:42 2005 From: tabmowzo at us.ibm.com (Peter R. Badovinatz) Date: Tue, 06 Sep 2005 10:35:42 -0700 Subject: [Linux-cluster] partly OT: failover <500ms In-Reply-To: <20050901215836.634334a1.pegasus@nerv.eu.org> References: <20050901215836.634334a1.pegasus@nerv.eu.org> Message-ID: <431DD36E.9080607@us.ibm.com> Jure Pe?ar wrote: > Hi all, > > Sorry if this is somewhat offtopic here ... > > Our telco is looking into linux HA solutions for their VoIP needs. Their > main requirement is that the failover happens in the order of a few 100ms. > > Can redhat cluster be tweaked to work reliably with such short time > periods? This would mean heartbeat on the level of few ms and status probes > on the level of 10ms. Is this even feasible? > > Since VoIP is IP anyway, I'm looking into UCARP and stuff like that. > Anything else I should check? > Check out Linux-HA (http://linux-ha.org/). It's been used in sub-second environments and has some documentation on the fact sheets and other pages of the web site. If you don't find enough to help you on the web site the mailing lists are quite active. > > Thanks for answers, > Peter -- Peter R. Badovinatz aka 'Wombat' -- IBM Linux Technology Center preferred: tabmowzo at us.ibm.com / alternate: wombat at us.ibm.com These are my opinions and absolutely not official opinions of IBM, Corp. From karon at gmx.net Tue Sep 6 22:57:27 2005 From: karon at gmx.net (Andreas Brosche) Date: Wed, 07 Sep 2005 00:57:27 +0200 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <1126022567.3344.53.camel@ayanami.boston.redhat.com> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> Message-ID: <431E1ED7.7010909@gmx.net> Hi again, thank you for your replies. Lon Hohberger wrote: > On Mon, 2005-09-05 at 22:52 +0200, Andreas Brosche wrote: > >>Long story cut short, we want >>- GFS on a shared SCSI disk (Performance is not important) > > Using GFS on shared SCSI will work in *some* cases: > > + Shared SCSI RAID arrays with multiple buses work well. Mid to high > end here, Mid to high end indeed, what we found was in the range of about $5000. 
> - Multi-initator SCSI buses do not work with GFS in any meaningful way, > regardless of what the host controller is. > Ex: Two machines with different SCSI IDs on their initiator connected to > the same physical SCSI bus. Hmm... don't laugh at me, but in fact that's what we're about to set up. I've read in Red Hat's docs that it is "not supported" because of performance issues. Multi-initiator buses should comply to SCSI standards, and any SCSI-compliant disk should be able to communicate with the correct controller, if I've interpreted the specs correctly. Of course, you get arbitrary results when using non-compliant hardware... What are other issues with multi-initiator buses, other than performance loss? > The DLM runs over IP, as does the cluster manager. Additionally, please > remember that GFS requires fencing, and that most fence-devices are > IP-enabled. Hmm. The whole setup is supposed to physically divide two networks, and nevertheless provide some kind of shared storage for moving data from one network to another. Establishing an ethernet link between the two servers would sort of disrupt the whole concept, which is to prevent *any* network access from outside into the secure part of the network. This is the (strongly simplified) topology: mid-secure network -- Server1 -- Storage -- Server2 -- secure Network A potential attacker could use a possible security flaw in the dlm service (which is bound to the network interface) to gain access to the server on the "secure" side *instantly* when he was able to compromise the server on the mid-secure side (hey, it CAN happen). If any sort of shared storage can be installed *without* any ethernet link or - ideally - any sort of inter-server communication, there is a way to *prove* that an attacker cannot establish any kind of connection into the secure net (some risks remain, but they have nothing to do with the physical connection). So far, I only see two ways: either sync the filesystems via ethernet (maybe via a firewall, which is pointless when the service has a security leak; it *is* technically possible to set up a tunnel that way) or some solution with administrator interaction (the administrator would have to manually "flip a switch" to remount a *local* flie system rw on one side, and rw on the other), which is impractical (manpower, availability...), but would do the job. > There is currently no way for GFS to use only a quorum disk for all the > lock information, and even if it could, performance would be abysmal. Like I said... performance is not an issue. As an invariant, the filesystems could be mounted "cross over", ie. each server has a partition only it writes to, and the other only reads from that disk. This *can* be done with local filesystems; you *can* disable write caching. You cannot, however, disable *read* caching (which seems to be buried quite deeply into the kernel), which means you actually have to umount and then re-mount (ie, not "mount -o remount") the fs. This means that long transfers could block other users for a long time. And mounting and umounting the same fs over and over again doesn't exactly sound like a good idea... even if it's only mounted ro. Maykel Moya wrote: > El lun, 05-09-2005 a las 22:52 +0200, Andreas Brosche escribi?: > I recently set up something like that. We use a external HP Smart > Array Cluster Storage. It has a separate connection (SCSI cable) to > both hosts. So it is not really a shared bus, but a dual bus configuration. 
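One footnote to the read-caching point above, not something the thread itself proposes: a single file's data can be read around the page cache with O_DIRECT, sketched below under the assumption that 4096-byte alignment satisfies the underlying device. It does nothing for cached filesystem metadata on the reading node, which is why a plain local filesystem shared this way still needs the unmount/remount dance, or a cluster filesystem such as GFS:

/*
 * Read the first block of a file with O_DIRECT, bypassing the page
 * cache for the file data.  O_DIRECT needs the buffer, offset and size
 * aligned; 4096 bytes is assumed to be acceptable here.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define ALIGN 4096

int main(int argc, char **argv)
{
    void *buf;
    ssize_t n;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    if (posix_memalign(&buf, ALIGN, ALIGN)) {
        perror("posix_memalign");
        return 1;
    }

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    n = read(fd, buf, ALIGN);      /* bypasses the page cache for data */
    if (n < 0)
        perror("read");
    else
        printf("read %zd uncached bytes from %s\n", n, argv[1]);

    close(fd);
    free(buf);
    return 0;
}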
> Regards, > maykel Regards, and thanks for the replies, Andreas From spwilcox at att.com Wed Sep 7 00:06:58 2005 From: spwilcox at att.com (Steve Wilcox) Date: Tue, 06 Sep 2005 20:06:58 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <431E1ED7.7010909@gmx.net> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> Message-ID: <1126051624.3694.26.camel@aptis101.cqtel.com> On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote: > > - Multi-initator SCSI buses do not work with GFS in any meaningful way, > > regardless of what the host controller is. > > Ex: Two machines with different SCSI IDs on their initiator connected to > > the same physical SCSI bus. > > Hmm... don't laugh at me, but in fact that's what we're about to set up. > > I've read in Red Hat's docs that it is "not supported" because of > performance issues. Multi-initiator buses should comply to SCSI > standards, and any SCSI-compliant disk should be able to communicate > with the correct controller, if I've interpreted the specs correctly. Of > course, you get arbitrary results when using non-compliant hardware... > What are other issues with multi-initiator buses, other than performance > loss? I set up a small 2 node cluster this way a while back, just as a testbed for myself. Much as I suspected, it was severely unstable because of the storage configuration, even occasionally causing both nodes to crash when one was rebooted due to SCSI bus resets. I tore it down and rebuilt it several times, configuring it as a simple failover cluster with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an openSSI cluster using Fedora3. All tested configurations were equally crash-happy due to the bus resets. My configuration consisted of a couple of old Compaq deskpro PC's, each with a single ended Symbiosis card (set to different SCSI ID's obviously) and an external DEC BA360 jbod shelf with 6 drives. The bus resets might be mitigated somewhat by using HVD SCSI and Y-cables with external terminators, but from my previous experience with other clusters that used this technique (DEC ASE and HP-ux service guard), bus resets will always be a thorn in your side without a separate, independent raid controller to act as a go-between. Calling these configurations simply "not supported" is an understatement - this type of config is guaranteed trouble. I'd never set up a cluster this way unless I'm the only one using it, and only then if I don't care one little bit about crashes and data corruption. My two cents. -steve From spwilcox at att.com Wed Sep 7 03:03:57 2005 From: spwilcox at att.com (Steve Wilcox) Date: Tue, 06 Sep 2005 23:03:57 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <1126051624.3694.26.camel@aptis101.cqtel.com> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> <1126051624.3694.26.camel@aptis101.cqtel.com> Message-ID: <1126062237.12381.4.camel@aptis101.cqtel.com> On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote: > On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote: > > > > - Multi-initator SCSI buses do not work with GFS in any meaningful way, > > > regardless of what the host controller is. > > > Ex: Two machines with different SCSI IDs on their initiator connected to > > > the same physical SCSI bus. > > > > Hmm... don't laugh at me, but in fact that's what we're about to set up. 
> > > > I've read in Red Hat's docs that it is "not supported" because of > > performance issues. Multi-initiator buses should comply to SCSI > > standards, and any SCSI-compliant disk should be able to communicate > > with the correct controller, if I've interpreted the specs correctly. Of > > course, you get arbitrary results when using non-compliant hardware... > > What are other issues with multi-initiator buses, other than performance > > loss? > > I set up a small 2 node cluster this way a while back, just as a testbed > for myself. Much as I suspected, it was severely unstable because of > the storage configuration, even occasionally causing both nodes to crash > when one was rebooted due to SCSI bus resets. I tore it down and > rebuilt it several times, configuring it as a simple failover cluster > with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an > openSSI cluster using Fedora3. All tested configurations were equally > crash-happy due to the bus resets. > > My configuration consisted of a couple of old Compaq deskpro PC's, each > with a single ended Symbiosis card (set to different SCSI ID's > obviously) and an external DEC BA360 jbod shelf with 6 drives. The bus > resets might be mitigated somewhat by using HVD SCSI and Y-cables with > external terminators, but from my previous experience with other > clusters that used this technique (DEC ASE and HP-ux service guard), bus > resets will always be a thorn in your side without a separate, > independent raid controller to act as a go-between. Calling these > configurations simply "not supported" is an understatement - this type > of config is guaranteed trouble. I'd never set up a cluster this way > unless I'm the only one using it, and only then if I don't care one > little bit about crashes and data corruption. My two cents. > > -steve Small clarification - Although clusters from DEC, HP, and even DigiComWho?Paq's TruCluster can be made to work (sort of) on multi- initiator SCSI busses, IIRC it was never a supported option for any of them (much like RedHat's offering). I doubt any sane company would ever support that type of config. -steve From vlad at nkmz.donetsk.ua Wed Sep 7 08:51:36 2005 From: vlad at nkmz.donetsk.ua (Vlad) Date: Wed, 7 Sep 2005 11:51:36 +0300 Subject: [Linux-cluster] connect server to fiber channel storage (HP MSA1500) Message-ID: <968123011.20050907115136@nkmz.donetsk.ua> Hello linux-cluster, How can I connect server to fiber channel storage (HP MSA1500) ??? Which is device name for disk drives via FC connections ??? 
On server I have installed RHEL 2.1 U6 with qla2300.o driver: ----------------------------------------------------------------- [root at dl585cl1 proc]# cat /proc/modules e1000 78204 0 (unused) bcm5700 109996 2 mptscsih 41072 0 mptbase 43200 3 [mptscsih] qla2300 705888 0 ----------------------------------------------------------------- ----------------------------------------------------------------- [root at dl585cl1 qla2300]# cat /proc/scsi/qla2300/0 QLogic PCI to Fibre Channel Host Adapter for QLA2340 : Firmware version: 3.03.01, Driver version 7.01.01-RH1 Entry address = f880a060 HBA: QLA2312 , Serial# T79863 Request Queue = 0x362d0000, Response Queue = 0x362c0000 Request Queue count= 512, Response Queue count= 512 Total number of active commands = 0 Total number of interrupts = 6028 Total number of IOCBs (used/max) = (0/600) Total number of queued commands = 0 Device queue depth = 0x20 Number of free request entries = 63 Number of mailbox timeouts = 0 Number of ISP aborts = 0 Number of loop resyncs = 8 Number of retries for empty slots = 0 Number of reqs in pending_q= 0, retry_q= 0, done_q= 0, scsi_retry_q= 0 Host adapter:loop state= , flags= 0x860813 Dpc flags = 0x0 MBX flags = 0x0 SRB Free Count = 4096 Link down Timeout = 008 Port down retry = 030 Login retry count = 030 Commands retried with dropped frame(s) = 0 Configured characteristic impedence: 50 ohms Configured data rate: 1-2 Gb/sec auto-negotiate SCSI Device Information: scsi-qla0-adapter-node=200000e08b1ed735; scsi-qla0-adapter-port=210000e08b1ed735; scsi-qla0-target-0=500508b30090e751; SCSI LUN Information: (Id:Lun) * - indicates lun is not registered with the OS. ( 0: 0): Total reqs 9, Pending reqs 0, flags 0x0, 0:0:81, ( 0: 1): Total reqs 7784, Pending reqs 0, flags 0x0, 0:0:81, ----------------------------------------------------------------- -- Best regards, Vlad mailto:vlad at nkmz.donetsk.ua From Axel.Thimm at ATrpms.net Wed Sep 7 09:24:18 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Wed, 7 Sep 2005 11:24:18 +0200 Subject: [Linux-cluster] Re: Using GFS without a network? In-Reply-To: <431E1ED7.7010909@gmx.net> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> Message-ID: <20050907092418.GA4014@neu.nirvana> On Wed, Sep 07, 2005 at 12:57:27AM +0200, Andreas Brosche wrote: > >The DLM runs over IP, as does the cluster manager. Additionally, please > >remember that GFS requires fencing, and that most fence-devices are > >IP-enabled. > > Hmm. The whole setup is supposed to physically divide two networks, and > nevertheless provide some kind of shared storage for moving data from > one network to another. Establishing an ethernet link between the two > servers would sort of disrupt the whole concept, which is to prevent > *any* network access from outside into the secure part of the network. > This is the (strongly simplified) topology: > > mid-secure network -- Server1 -- Storage -- Server2 -- secure Network > > A potential attacker could use a possible security flaw in the dlm > service (which is bound to the network interface) to gain access to the > server on the "secure" side *instantly* when he was able to compromise > the server on the mid-secure side (hey, it CAN happen). 
If any sort of
> shared storage can be installed *without* any ethernet link or - ideally
> - any sort of inter-server communication, there is a way to *prove* that
> an attacker cannot establish any kind of connection into the secure net
> (some risks remain, but they have nothing to do with the physical
> connection).

If you are that paranoid, consider that even if you could do away with
dlm and IP connectivity:

o an attacker on the mid-secure network could alter files that the
  secure network accesses and gain privileges that way.

o an attacker can exploit potential bugs in GFS's code, just as well
  as in dlm's, and having physical access to Server 2's journals
  is probably more harmful than trying to hack through dlm's API
  calls.

There is no way to "prove" what you want. Just go for the second best
to the ideal. You probably don't want GFS, but a hardened NFS
connection to the storage allocated within the secure network only.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 

From karasiov at infobox.ru Wed Sep 7 11:45:49 2005
From: karasiov at infobox.ru (karasiov at infobox.ru)
Date: Wed, 7 Sep 2005 15:45:49 +0400
Subject: [Linux-cluster] File size limitation on GFS
In-Reply-To: <42EDEBC8.7070402@histor.fr>
References: <42EDEBC8.7070402@histor.fr>
Message-ID: <872666588.20050907154549@infobox.ru>

Hello, Ion.

You wrote on 1 September 2005, 13:30:48:

IA> Hi everybody,
IA> is there a maximum file size that GFS can handle?
IA> I tried to do some tests with big files, and I couldn't open (open(2))
IA> files that were >= 2 GB. (It works with 1 GB files, I didn't try sizes
IA> between 1 and 2 GB).

IA> I would like to know if this limitation comes from my configuration or
IA> from the GFS file system.

IA> I searched for an answer on the web and in the mailing list but I
IA> didn't find anything. If I missed something I'd be very sorry, and a
IA> URL to the article I missed would be a great answer :).

IA> Thanks in advance!

Hi,

I need to start GFS in single mode on Debian Sarge,
but ccsd does not start - what's wrong with single mode?

SK

From karon at gmx.net Wed Sep 7 13:13:48 2005
From: karon at gmx.net (Andreas Brosche)
Date: Wed, 7 Sep 2005 15:13:48 +0200 (MEST)
Subject: [Linux-cluster] Using GFS without a network?
References: <1126051624.3694.26.camel@aptis101.cqtel.com>
Message-ID: <11675.1126098828@www67.gmx.net>

From: Steve Wilcox
> On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:

[multi-initiator SCSI issues]

> All tested configurations were equally crash-happy due to the bus
> resets.
[...]
> Calling these
> configurations simply "not supported" is an understatement - this type
> of config is guaranteed trouble.

OK, thank you for sharing your experiences. It definitely sounds like
we're not going to use this setup. Maybe these issues should find their
way into the GFS documentation, as multi-initiator buses *should* be
standards-compliant. A simple "it don't work", as it basically says now,
is not enough, IMHO.

From: Axel Thimm
> > Hmm. The whole setup is supposed to physically divide two networks, and
> > nevertheless provide some kind of shared storage for moving data from
> > one network to another. [...]
> > This is the (strongly simplified) topology:
> >
> > mid-secure network -- Server1 -- Storage -- Server2 -- secure Network
> >
> > A potential attacker could use a possible security flaw in the dlm
> > service (which is bound to the network interface) to gain access to the
> > server on the "secure" side *instantly* when he was able to compromise
> > the server on the mid-secure side (hey, it CAN happen). If any sort of
> > shared storage can be installed *without* any ethernet link or -
> > ideally - any sort of inter-server communication, there is a way to
> > *prove* that an attacker cannot establish any kind of connection into
> > the secure net (some risks remain, but they have nothing to do with the
> > physical connection).
>
> If you are that paranoid, consider that even if you could do away with
> dlm and IP connectivity:
>
> o an attacker on the mid-secure network could alter files that the
>   secure network accesses and gain privileges that way.

Data corruption is not really an issue - the only way to gain privileges
that way on any system would be on the system where the actual data is
being processed (which is, in fact, possible; think of viruses in
multimedia files, or MS Word macro viruses). The only way the data is
processed by Server2 is by transferring it into the secure network. As
most files in the secure network will be documents, we'll have to keep
our word processing software up to date. But attacks embedded into the
actual data are an issue we'd have to deal with, no matter what the
transport medium is.

> o an attacker can exploit potential bugs in GFS's code, just as well
>   as in dlm's, and having physical access to Server 2's journals
>   is probably more harmful than trying to hack through dlm's API
>   calls.

Sure, the possibility of bugs in GFS was also among my considerations.
Injection of harmful code could be possible either way, if there is in
fact a security flaw in the sync code, granted... it wouldn't make much
of a difference whether the code is injected via disk or via service...

> There is no way to "prove" what you want. Just go for the second best
> to the ideal. You probably don't want GFS, but a hardened NFS
> connection to the storage allocated within the secure network only.

So you would set up only one (hardened) server between the two networks?
I'd really rather have a solution without the technical ability to set
up any kind of tunnel which allows data to be read *from* the secure
network. IP over storage might be possible, but the counterpart in the
secure network has to interpret it, so any kind of trojan must be
injected into the data. For an attacker, the situation is the same, no
matter how the data gets into the network. With a single server
connected to two networks, however, the situation is far easier for the
attacker, as it introduces a far more elegant way of setting up a
tunnel. What the whole setup is supposed to ensure is that an attacker
who manages to get into Server1 has no immediate connection to the
secure network (which would be the case with a shared NFS server with,
say, two ethernet devices).

> Axel.Thimm at ATrpms.net

Thank you both for your ideas and experiences. I'll look into the
possibilities of hardening network filesystems. Looks like I'll discard
the shared bus idea completely; I'm going to fiddle a bit with it though
and test when the data gets corrupted. I'm not going to waste too much
time on it though.
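One concrete form of Axel's "hardened NFS" suggestion is to export a single
drop directory read-only, to exactly one client, with root squashed, and to
mount it with equally restrictive options on the other side. A minimal
sketch with placeholder host and path names; which side exports and which
side mounts is a design choice this thread leaves open:

    # /etc/exports on the exporting server: one directory, one client,
    # read-only, root mapped to nobody
    /export/drop    otherhost.example.net(ro,root_squash,sync)

    # on the client: read-only, no device files, no setuid binaries
    mount -t nfs -o ro,nosuid,nodev exporthost.example.net:/export/drop /mnt/drop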
As GFS is supposed to be a file system which is shared between equal nodes of a cluster, I guess it really is not the file system of choice for our needs. An NFS solution sounds less insane. I'll think about the whole thing again. Regards, Andreas -- 5 GB Mailbox, 50 FreeSMS http://www.gmx.net/de/go/promail +++ GMX - die erste Adresse f?r Mail, Message, More +++ From moya at infomed.sld.cu Wed Sep 7 11:29:47 2005 From: moya at infomed.sld.cu (Maykel Moya) Date: Wed, 07 Sep 2005 07:29:47 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <1126051624.3694.26.camel@aptis101.cqtel.com> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> <1126051624.3694.26.camel@aptis101.cqtel.com> Message-ID: <1126092587.11885.54.camel@julia.sld.cu> El mar, 06-09-2005 a las 20:06 -0400, Steve Wilcox escribi?: > On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote: > I set up a small 2 node cluster this way a while back, just as a testbed > for myself. Much as I suspected, it was severely unstable because of > the storage configuration, even occasionally causing both nodes to crash > when one was rebooted due to SCSI bus resets. I tore it down and > rebuilt it several times, configuring it as a simple failover cluster > with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an > openSSI cluster using Fedora3. All tested configurations were equally > crash-happy due to the bus resets. Could you share your cluster components config files ? Regards maykel From hne at hopnet.net Wed Sep 7 09:43:13 2005 From: hne at hopnet.net (Keith Hopkins) Date: Wed, 07 Sep 2005 19:43:13 +1000 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <1126062237.12381.4.camel@aptis101.cqtel.com> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> <1126051624.3694.26.camel@aptis101.cqtel.com> <1126062237.12381.4.camel@aptis101.cqtel.com> Message-ID: <431EB631.1020300@hopnet.net> Steve Wilcox wrote: > On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote: > >>On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote: >> >> >>>>- Multi-initator SCSI buses do not work with GFS in any meaningful way, >>>>regardless of what the host controller is. >>>>Ex: Two machines with different SCSI IDs on their initiator connected to >>>>the same physical SCSI bus. >>> >>>Hmm... don't laugh at me, but in fact that's what we're about to set up. >>> >>>I've read in Red Hat's docs that it is "not supported" because of >>>performance issues. Multi-initiator buses should comply to SCSI >>>standards, and any SCSI-compliant disk should be able to communicate >>>with the correct controller, if I've interpreted the specs correctly. Of >>>course, you get arbitrary results when using non-compliant hardware... >>>What are other issues with multi-initiator buses, other than performance >>>loss? >> >>I set up a small 2 node cluster this way a while back, just as a testbed >>for myself. Much as I suspected, it was severely unstable because of >>the storage configuration, even occasionally causing both nodes to crash >>when one was rebooted due to SCSI bus resets. I tore it down and >>rebuilt it several times, configuring it as a simple failover cluster >>with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an >>openSSI cluster using Fedora3. All tested configurations were equally >>crash-happy due to the bus resets. 
>> >>My configuration consisted of a couple of old Compaq deskpro PC's, each >>with a single ended Symbiosis card (set to different SCSI ID's >>obviously) and an external DEC BA360 jbod shelf with 6 drives. The bus >>resets might be mitigated somewhat by using HVD SCSI and Y-cables with >>external terminators, but from my previous experience with other >>clusters that used this technique (DEC ASE and HP-ux service guard), bus >>resets will always be a thorn in your side without a separate, >>independent raid controller to act as a go-between. Calling these >>configurations simply "not supported" is an understatement - this type >>of config is guaranteed trouble. I'd never set up a cluster this way >>unless I'm the only one using it, and only then if I don't care one >>little bit about crashes and data corruption. My two cents. >> >>-steve > > > > Small clarification - Although clusters from DEC, HP, and even > DigiComWho?Paq's TruCluster can be made to work (sort of) on multi- > initiator SCSI busses, IIRC it was never a supported option for any of > them (much like RedHat's offering). I doubt any sane company would ever > support that type of config. > > -steve > HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP. It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years. Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard. --Keith -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 3487 bytes Desc: S/MIME Cryptographic Signature URL: From linux-cluster at redhat.com Wed Sep 7 12:48:02 2005 From: linux-cluster at redhat.com (Cluster Boston) Date: Wed, 7 Sep 2005 08:48:02 -0400 Subject: [Linux-cluster] Cluster 2005 Boston Early Bird Deadline approaching Message-ID: <20050907124754.CC66140977@villi.rcsnetworks.com> see http://www.cluster2005.org. From karasiov at infobox.ru Wed Sep 7 14:21:48 2005 From: karasiov at infobox.ru (karasiov at infobox.ru) Date: Wed, 7 Sep 2005 18:21:48 +0400 Subject: [Linux-cluster] single mode In-Reply-To: <872666588.20050907154549@infobox.ru> References: <42EDEBC8.7070402@histor.fr> <872666588.20050907154549@infobox.ru> Message-ID: <897776205.20050907182148@infobox.ru> Hi, I need to start GFS in a single mode on Debian Sarge, but ccsd does not start - whats wrong with single mode? SK From spwilcox at att.com Wed Sep 7 15:03:58 2005 From: spwilcox at att.com (Steve Wilcox) Date: Wed, 07 Sep 2005 11:03:58 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <1126092587.11885.54.camel@julia.sld.cu> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> <1126051624.3694.26.camel@aptis101.cqtel.com> <1126092587.11885.54.camel@julia.sld.cu> Message-ID: <1126105438.25415.4.camel@aptis101.cqtel.com> On Wed, 2005-09-07 at 07:29 -0400, Maykel Moya wrote: > El mar, 06-09-2005 a las 20:06 -0400, Steve Wilcox escribi?: > > On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote: > > I set up a small 2 node cluster this way a while back, just as a testbed > > for myself. Much as I suspected, it was severely unstable because of > > the storage configuration, even occasionally causing both nodes to crash > > when one was rebooted due to SCSI bus resets. 
I tore it down and > > rebuilt it several times, configuring it as a simple failover cluster > > with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an > > openSSI cluster using Fedora3. All tested configurations were equally > > crash-happy due to the bus resets. > > Could you share your cluster components config files ? > > Regards > maykel > I didn't save any of my old config files - as I said, this was just a small cluster for me to toy with. Everything was fairly vanilla generally speaking though. Here's the config from the current setup, a RHEL4 simple failover cluster without GFS (and currently with no meaningful services). From spwilcox at att.com Wed Sep 7 15:19:52 2005 From: spwilcox at att.com (Steve Wilcox) Date: Wed, 07 Sep 2005 11:19:52 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <431EB631.1020300@hopnet.net> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> <1126051624.3694.26.camel@aptis101.cqtel.com> <1126062237.12381.4.camel@aptis101.cqtel.com> <431EB631.1020300@hopnet.net> Message-ID: <1126106393.25415.18.camel@aptis101.cqtel.com> On Wed, 2005-09-07 at 19:43 +1000, Keith Hopkins wrote: > Steve Wilcox wrote: > > On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote: > > > >>On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote: > >> > >> > >>>>- Multi-initator SCSI buses do not work with GFS in any meaningful way, > >>>>regardless of what the host controller is. > >>>>Ex: Two machines with different SCSI IDs on their initiator connected to > >>>>the same physical SCSI bus. > >>> > >>>Hmm... don't laugh at me, but in fact that's what we're about to set up. > >>> > >>>I've read in Red Hat's docs that it is "not supported" because of > >>>performance issues. Multi-initiator buses should comply to SCSI > >>>standards, and any SCSI-compliant disk should be able to communicate > >>>with the correct controller, if I've interpreted the specs correctly. Of > >>>course, you get arbitrary results when using non-compliant hardware... > >>>What are other issues with multi-initiator buses, other than performance > >>>loss? > >> > >>I set up a small 2 node cluster this way a while back, just as a testbed > >>for myself. Much as I suspected, it was severely unstable because of > >>the storage configuration, even occasionally causing both nodes to crash > >>when one was rebooted due to SCSI bus resets. I tore it down and > >>rebuilt it several times, configuring it as a simple failover cluster > >>with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an > >>openSSI cluster using Fedora3. All tested configurations were equally > >>crash-happy due to the bus resets. > >> > >>My configuration consisted of a couple of old Compaq deskpro PC's, each > >>with a single ended Symbiosis card (set to different SCSI ID's > >>obviously) and an external DEC BA360 jbod shelf with 6 drives. The bus > >>resets might be mitigated somewhat by using HVD SCSI and Y-cables with > >>external terminators, but from my previous experience with other > >>clusters that used this technique (DEC ASE and HP-ux service guard), bus > >>resets will always be a thorn in your side without a separate, > >>independent raid controller to act as a go-between. Calling these > >>configurations simply "not supported" is an understatement - this type > >>of config is guaranteed trouble. 
I'd never set up a cluster this way > >>unless I'm the only one using it, and only then if I don't care one > >>little bit about crashes and data corruption. My two cents. > >> > >>-steve > > > > > > > > Small clarification - Although clusters from DEC, HP, and even > > DigiComWho?Paq's TruCluster can be made to work (sort of) on multi- > > initiator SCSI busses, IIRC it was never a supported option for any of > > them (much like RedHat's offering). I doubt any sane company would ever > > support that type of config. > > > > -steve > > > > HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP. It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years. Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard. > > --Keith Hmmm... Are you sure you're thinking of a multi-initiator _bus_ and not something like an external SCSI array (i.e. nike arrays or some such thing)? I know that multi-port SCSI hubs are available, and more than one HBA per node is obviously supported for multipathing, but generally any multi-initiator SCSI setup will be talking to an external raid array, not a simple SCSI bus, and even then bus resets can cause grief. Admittedly, I'm much more familiar with the Alpha server side of things (multi-initiator buses were definitely never supported under DEC unix / Tru64) , so I could be wrong about HP-ux. I just can't imagine that a multi-initiator bus wouldn't be a nightmare. -steve From lhh at redhat.com Wed Sep 7 15:19:45 2005 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 07 Sep 2005 11:19:45 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <431E1ED7.7010909@gmx.net> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> Message-ID: <1126106385.30592.11.camel@ayanami.boston.redhat.com> On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote: > I've read in Red Hat's docs that it is "not supported" because of > performance issues. Multi-initiator buses should comply to SCSI > standards, and any SCSI-compliant disk should be able to communicate > with the correct controller, if I've interpreted the specs correctly. Of > course, you get arbitrary results when using non-compliant hardware... > What are other issues with multi-initiator buses, other than performance > loss? Dueling resets. Some drivers will reset the bus when loaded (some cards do this when the machine boots, too). Then, the other initator's driver detects a reset, and goes ahead and issues a reset. So, the first initiator's driver detects the reset, and goes ahead and issues a reset. I'm sure you see where this is going. The important thing is that (IIRC) the number of resets is unbounded. It could be 1, it could be 20,000. During this time, none of the devices on the bus can be accessed. > > The DLM runs over IP, as does the cluster manager. Additionally, please > > remember that GFS requires fencing, and that most fence-devices are > > IP-enabled. > > Hmm. The whole setup is supposed to physically divide two networks, and > nevertheless provide some kind of shared storage for moving data from > one network to another. Establishing an ethernet link between the two > servers would sort of disrupt the whole concept, which is to prevent > *any* network access from outside into the secure part of the network. 
> This is the (strongly simplified) topology: > > mid-secure network -- Server1 -- Storage -- Server2 -- secure Network Ok, GFS will not work for this. However, you *can* still use, for example, a raw device to lock the data, then write out the data directly to the partition (as long as you didn't need file I/O). You can use a disk-based locking scheme similar to the one found in Cluster Manager 1.0.x and/or Kimberlite 1.1.x to synchronize access to the shared partition. If you're using a multi-initator bus, you can certainly also use SCSI reservations to synchronize access as well. > A potential attacker could use a possible security flaw in the dlm > service (which is bound to the network interface) to gain access to the > server on the "secure" side *instantly* when he was able to compromise > the server on the mid-secure side (hey, it CAN happen). Fair enough. > You cannot, however, disable *read* caching (which seems to be > buried quite deeply into the kernel), which means you actually have to > umount and then re-mount (ie, not "mount -o remount") the fs. This means > that long transfers could block other users for a long time. And > mounting and umounting the same fs over and over again doesn't exactly > sound like a good idea... even if it's only mounted ro. Yup. > > Maykel Moya wrote: > > El lun, 05-09-2005 a las 22:52 +0200, Andreas Brosche escribi?: > > I recently set up something like that. We use a external HP Smart > > Array Cluster Storage. It has a separate connection (SCSI cable) to > > both hosts. > > So it is not really a shared bus, but a dual bus configuration. Ah, that's much better =) -- Lon From lhh at redhat.com Wed Sep 7 15:25:09 2005 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 07 Sep 2005 11:25:09 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <431EB631.1020300@hopnet.net> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> <1126051624.3694.26.camel@aptis101.cqtel.com> <1126062237.12381.4.camel@aptis101.cqtel.com> <431EB631.1020300@hopnet.net> Message-ID: <1126106709.30592.16.camel@ayanami.boston.redhat.com> On Wed, 2005-09-07 at 19:43 +1000, Keith Hopkins wrote: > > > > HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP. It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years. Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard. > The key there is that HP-UX is what's different, not specifically ServiceGuard. ServiceGuard on Linux will have the same pitfalls, and won't work well, even if HP *does* support it. :( -- Lon From lhh at redhat.com Wed Sep 7 15:29:53 2005 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 07 Sep 2005 11:29:53 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <11675.1126098828@www67.gmx.net> References: <1126051624.3694.26.camel@aptis101.cqtel.com> <11675.1126098828@www67.gmx.net> Message-ID: <1126106993.30592.21.camel@ayanami.boston.redhat.com> On Wed, 2005-09-07 at 15:13 +0200, Andreas Brosche wrote: > > o an attacker can exploit potential bugs in GFS's code, just as well > > as in dlm's, and having physical access to the Server 2's journals > > is probably more harmful than trying to hack through dlm's API > > calls. > > Sure, the possibility of potential bugs in GFS was also under my > considerations. 
Injection of harmful code could be possible either way, if
> there is in fact a security flaw in the sync code, granted... it wouldn't
> make much of a difference whether the code is injected via disk or via
> service...

Note: If anyone breaks into the world-facing server, you will need a way
to detect it and notify the other server. Once this happens, it's safe
(perhaps paranoid) to assume all data on the shared disk is corrupt, and
possibly dangerous, and so should not be used.

-- Lon

From lhh at redhat.com Wed Sep 7 15:48:46 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 07 Sep 2005 11:48:46 -0400
Subject: [Linux-cluster] Re: Using GFS without a network?
In-Reply-To: <20050907092418.GA4014@neu.nirvana>
References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> <20050907092418.GA4014@neu.nirvana>
Message-ID: <1126108126.30592.41.camel@ayanami.boston.redhat.com>

On Wed, 2005-09-07 at 11:24 +0200, Axel Thimm wrote:
> There is no way to "prove" what you want. Just go for the second best
> to the ideal. You probably don't want GFS, but a hardened NFS
> connection to the storage allocated within the secure network only.

I'd do shared raw. If we know the computer on the secure network *never*
writes to the disk and it has no possible way to establish a network
connection to the outside world (via any means) then we only have to
worry about the attacker somehow corrupting data to crash the
application on the secure server.

Make sure your reader application has a reliable way to verify the
integrity of the data (possibly using some form of encryption like gpg)
and you're golden.

So, the would-be attacker would have to do the following to get data off
the secure network:

(a) Break in to the world-facing server,
(b) Create data which will cause a malfunction in the secret application
    on the secure server (without having access to said application;
    this is based on an outside job, not an inside job),
(c) encrypt or sign the data so that the secure server trusts it, and
(d) write the data out to the right offset on the raw device...

In the "overflow code", the attacker would have to know where the data
is stored, retrieve it, and write it out to the shared SCSI disk.

Note that the above becomes much more difficult if you change the SCSI
block device driver on the secure server to completely disable
writes. ;) It also becomes more difficult if the secret application is
audited for security flaws before being put into production.

Just random ideas... *shrug*

-- Lon

From moya at infomed.sld.cu Wed Sep 7 14:51:23 2005
From: moya at infomed.sld.cu (Maykel Moya)
Date: Wed, 07 Sep 2005 10:51:23 -0400
Subject: [Linux-cluster] Filesystem (GFS) availability
Message-ID: <1126104683.15223.7.camel@julia.sld.cu>

I have a two node GFS setup. When one of the nodes (B) goes down, the
other one (A) is unable to access the fs. A, nevertheless, "notes" that
B went down and removes it from the cluster, but any access to the GFS
filesystem locks up.

Any clues?
My cluster.conf is:

Regards,
maykel

From moya at infomed.sld.cu Wed Sep 7 14:45:37 2005
From: moya at infomed.sld.cu (Maykel Moya)
Date: Wed, 07 Sep 2005 10:45:37 -0400
Subject: [Linux-cluster] File size limitation on GFS
In-Reply-To: <872666588.20050907154549@infobox.ru>
References: <42EDEBC8.7070402@histor.fr> <872666588.20050907154549@infobox.ru>
Message-ID: <1126104337.15223.1.camel@julia.sld.cu>

> I need to start GFS in single mode on Debian Sarge,
> but ccsd does not start - what's wrong with single mode?

Do you have a /etc/cluster/cluster.conf ?

Regards,
maykel

PS: Though you backported packages from unstable

From jacobl at ccbill.com Wed Sep 7 17:16:42 2005
From: jacobl at ccbill.com (Jacob Liff)
Date: Wed, 7 Sep 2005 10:16:42 -0700
Subject: [Linux-cluster] Filesystem (GFS) availability
Message-ID: <63DFFDD742B5E54389C891AB1DFFE9A20576FF@Exchange.ccbill-hq.local>

Morning,

When using manual fencing, you will have to ack the manual fence on the
remaining machine. Once this is complete, the remaining node will grab
the failed node's journal and replay it. You will then regain access to
the file system.

Jacob L.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Maykel Moya
Sent: Wednesday, September 07, 2005 7:51 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Filesystem (GFS) availability

I have a two node GFS setup. When one of the nodes (B) goes down, the
other one (A) is unable to access the fs. A, nevertheless, "notes" that
B went down and removes it from the cluster, but any access to the GFS
filesystem locks up.

Any clues?

My cluster.conf is:

Regards,
maykel

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster

From lhh at redhat.com Wed Sep 7 17:20:20 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 07 Sep 2005 13:20:20 -0400
Subject: [Linux-cluster] Filesystem (GFS) availability
In-Reply-To: <1126104683.15223.7.camel@julia.sld.cu>
References: <1126104683.15223.7.camel@julia.sld.cu>
Message-ID: <1126113620.30592.49.camel@ayanami.boston.redhat.com>

On Wed, 2005-09-07 at 10:51 -0400, Maykel Moya wrote:
>
>
>

Run fence_ack_manual on the surviving node.

Better yet, stop using manual fencing and buy a supported power switch
off of eBay. It will save you a lot of frustration :)

-- Lon

From amanthei at redhat.com Wed Sep 7 18:50:24 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Wed, 7 Sep 2005 13:50:24 -0500
Subject: [Linux-cluster] File size limitation on GFS
In-Reply-To: <872666588.20050907154549@infobox.ru>
References: <42EDEBC8.7070402@histor.fr> <872666588.20050907154549@infobox.ru>
Message-ID: <20050907185024.GF26769@redhat.com>

If you want GFS in single user mode (no networking) then you can mount
using the lock_nolock protocol. Be very careful because if two or more
machines mount the filesystem using this option, you will cause
corruption.

On Wed, Sep 07, 2005 at 03:45:49PM +0400, karasiov at infobox.ru wrote:
> Hello, Ion.
> 
> You wrote on 1 September 2005, 13:30:48:
> 
> IA> Hi everybody,
> IA> is there a maximum file size that GFS can handle?
> IA> I tried to do some tests with big files, and I couldn't open (open(2))
> IA> files that were >= 2 GB. (It works with 1 GB files, I didn't try sizes
> IA> between 1 and 2 GB).
> 
> IA> I would like to know if this limitation comes from my configuration or
> IA> from the GFS file system.
> > IA> I searched an answer in the web and in the mailing list but I didn't > IA> found anything, > IA> If I missed something I'd be very sorry and an url to the article > IA> I missed would be a great answer :). > > IA> Thanks in advance! > > Hi, > > I need to start GFS in a single mode on Debian Sarge, > but ccsd does not start - whats wrong with single mode? > > SK > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster -- Adam Manthei From Axel.Thimm at ATrpms.net Wed Sep 7 20:15:37 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Wed, 7 Sep 2005 22:15:37 +0200 Subject: [Linux-cluster] Samba failover "impossible" due to missing cifs client reconnect? Message-ID: <20050907201537.GB3455@neu.nirvana> After having setup our workarounds for NFS we are very happy with how it's working. Now we're looking at Samba. But we have quite a showstopper right at the beginning. The smb/cifs clients, be it smbclient or Windows XP, don't like their TCP stream being resetted and don't retry/reconnect (contrary to NFS). It looks like the protocol has no considerations for retries above the TCP/IP level. So when the TCP stream is torn on the server's side due to relocation (either due to crash/fencing or soft) any client smb/cifs activity is broken at that time. This means that any data transfer via smb/cifs shares during the relocation will fail, and there is nothing we can do on the server's side. Or is there? -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From rainer at ultra-secure.de Wed Sep 7 20:37:07 2005 From: rainer at ultra-secure.de (Rainer Duffner) Date: Wed, 07 Sep 2005 22:37:07 +0200 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <431E1ED7.7010909@gmx.net> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> Message-ID: <431F4F73.1030603@ultra-secure.de> Andreas Brosche wrote: > Hmm. The whole setup is supposed to physically divide two networks, > and nevertheless provide some kind of shared storage for moving data > from one network to another. Establishing an ethernet link between the > two servers would sort of disrupt the whole concept, which is to > prevent *any* network access from outside into the secure part of the > network. This is the (strongly simplified) topology: > This is really more of a security-question than clustering (IMO). Have you thought about a so called "air-gap" device, like Whale (www.whalecommunications.com) makes them? It uses a SCSI-switch to "shuttle" data between two networks. cheers, Rainer From crh at ubiqx.mn.org Wed Sep 7 20:43:21 2005 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Wed, 7 Sep 2005 15:43:21 -0500 Subject: [Linux-cluster] Samba failover "impossible" due to missing cifs client reconnect? In-Reply-To: <20050907201537.GB3455@neu.nirvana> References: <20050907201537.GB3455@neu.nirvana> Message-ID: <20050907204321.GB5677@Favog.ubiqx.mn.org> On Wed, Sep 07, 2005 at 10:15:37PM +0200, Axel Thimm wrote: > After having setup our workarounds for NFS we are very happy with how > it's working. Now we're looking at Samba. > > But we have quite a showstopper right at the beginning. The smb/cifs > clients, be it smbclient or Windows XP, don't like their TCP stream > being resetted and don't retry/reconnect (contrary to NFS). 
> > It looks like the protocol has no considerations for retries above the > TCP/IP level. So when the TCP stream is torn on the server's side due > to relocation (either due to crash/fencing or soft) any client > smb/cifs activity is broken at that time. > > This means that any data transfer via smb/cifs shares during the > relocation will fail, and there is nothing we can do on the server's > side. Or is there? Windows clients will reconnect to the same server, and so will smbfs and cifs-vfs. I just tested this. On a W/XP box I browsed through some directories on a share served by Samba. I then shut Samba down, and tried viewing some different subdirectories of the same share. Windows coughed up an error dialog. I then restarted Samba and Windows got happy again. I could browse through all of the subdirectories in the share. We've talked about Samba on GFS within the Samba Team, and various members have done some digging into the problem (Volker most recently, if I'm not mistaken). Samba must maintain a certain amount of state information internally--including name mangling, locking, and sharing information that--is peculiar to Windows+DOS+OS2 semantics. The problem is ensuring that Samba's state information is also shared across the GFS nodes. I've not had time to keep up with this development thread, but I know that the folks working on Samba-4 are aware of the issues involved. Chris -)----- -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From Axel.Thimm at ATrpms.net Wed Sep 7 20:51:16 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Wed, 7 Sep 2005 22:51:16 +0200 Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing cifs client reconnect? In-Reply-To: <20050907204321.GB5677@Favog.ubiqx.mn.org> References: <20050907201537.GB3455@neu.nirvana> <20050907204321.GB5677@Favog.ubiqx.mn.org> Message-ID: <20050907205116.GA7459@neu.nirvana> On Wed, Sep 07, 2005 at 03:43:21PM -0500, Christopher R. Hertel wrote: > On Wed, Sep 07, 2005 at 10:15:37PM +0200, Axel Thimm wrote: > > After having setup our workarounds for NFS we are very happy with how > > it's working. Now we're looking at Samba. > > > > But we have quite a showstopper right at the beginning. The smb/cifs > > clients, be it smbclient or Windows XP, don't like their TCP stream > > being resetted and don't retry/reconnect (contrary to NFS). > > > > It looks like the protocol has no considerations for retries above the > > TCP/IP level. So when the TCP stream is torn on the server's side due > > to relocation (either due to crash/fencing or soft) any client > > smb/cifs activity is broken at that time. > > > > This means that any data transfer via smb/cifs shares during the > > relocation will fail, and there is nothing we can do on the server's > > side. Or is there? > > Windows clients will reconnect to the same server, and so will smbfs and > cifs-vfs. > > I just tested this. On a W/XP box I browsed through some directories on a > share served by Samba. I then shut Samba down, and tried viewing some > different subdirectories of the same share. Windows coughed up an error > dialog. I then restarted Samba and Windows got happy again. I could > browse through all of the subdirectories in the share. 
Yes, that does work, but what I wanted to setup is a transparent failover, so that network I/O recovers w/o any manual interaction. I.e. I don't want to (soft) relocate the samba shares onto another node due to load ballancing considerations and generate user visible I/O errors and failures on a dozen clients. > We've talked about Samba on GFS within the Samba Team, and various members > have done some digging into the problem (Volker most recently, if I'm not > mistaken). Samba must maintain a certain amount of state information > internally--including name mangling, locking, and sharing information > that--is peculiar to Windows+DOS+OS2 semantics. The problem is ensuring > that Samba's state information is also shared across the GFS nodes. > > I've not had time to keep up with this development thread, but I know that > the folks working on Samba-4 are aware of the issues involved. > > Chris -)----- > -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From crh at ubiqx.mn.org Wed Sep 7 21:12:52 2005 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Wed, 7 Sep 2005 16:12:52 -0500 Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing cifs client reconnect? In-Reply-To: <20050907205116.GA7459@neu.nirvana> References: <20050907201537.GB3455@neu.nirvana> <20050907204321.GB5677@Favog.ubiqx.mn.org> <20050907205116.GA7459@neu.nirvana> Message-ID: <20050907211252.GC5677@Favog.ubiqx.mn.org> On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote: : : > > I just tested this. On a W/XP box I browsed through some directories on a > > share served by Samba. I then shut Samba down, and tried viewing some > > different subdirectories of the same share. Windows coughed up an error > > dialog. I then restarted Samba and Windows got happy again. I could > > browse through all of the subdirectories in the share. > > Yes, that does work, but what I wanted to setup is a transparent > failover, so that network I/O recovers w/o any manual interaction. > > I.e. I don't want to (soft) relocate the samba shares onto another > node due to load ballancing considerations and generate user visible > I/O errors and failures on a dozen clients. I guess I'm not really clear on what it is you're trying to accomplish. Can you provide a little more description of what you'd like to see happen, and what kinds of environments you expect? Chris -)----- -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. 
ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From teigland at redhat.com Thu Sep 8 05:41:28 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 8 Sep 2005 13:41:28 +0800 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125574523.5025.10.camel@laptopd505.fenrus.org> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> Message-ID: <20050908054128.GD12220@redhat.com> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote: > +static inline void glock_put(struct gfs2_glock *gl) > +{ > + if (atomic_read(&gl->gl_count) == 1) > + gfs2_glock_schedule_for_reclaim(gl); > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,); > + atomic_dec(&gl->gl_count); > +} > > this code has a race The first two lines of the function with the race are non-essential and could be removed. In the common case where there's no race, they just add efficiency by moving the glock to the reclaim list immediately. Otherwise, the scand thread would do it later when actively trying to reclaim glocks. > +static inline int queue_empty(struct gfs2_glock *gl, struct list_head *head) > +{ > + int empty; > + spin_lock(&gl->gl_spin); > + empty = list_empty(head); > + spin_unlock(&gl->gl_spin); > + return empty; > +} > > that looks like a racey interface to me... if so.. why bother locking at > all? The spinlock protects the list but is not the primary method of synchronizing processes that are working with a glock. When the list is in fact empty, there will be no race, and the locking wouldn't be necessary. In this case, the "glmutex" in the code fragment below is preventing any change in the list, so we can safely release the spinlock immediately. When the list is not empty, then a process could be adding another entry to the list without "glmutex" locked [1], making the spinlock necessary. In this case we quit after queue_empty() returns and don't do anything else, so releasing the spinlock immediately was still safe. [1] A process that already holds a glock (i.e. has a "holder" struct on the gl_holders list) is allowed to hold it again by adding another holder struct to the same list. It adds the second hold without locking glmutex. if (gfs2_glmutex_trylock(gl)) { if (gl->gl_ops == &gfs2_inode_glops) { struct gfs2_inode *ip = get_gl2ip(gl); if (ip && !atomic_read(&ip->i_count)) gfs2_inode_destroy(ip); } if (queue_empty(gl, &gl->gl_holders) && gl->gl_state != LM_ST_UNLOCKED) handle_callback(gl, LM_ST_UNLOCKED); gfs2_glmutex_unlock(gl); } There is a second way that queue_empty() is used, and that's within assertions that the list is empty. If the assertion is correct, locking isn't necessary; locking is only needed if there's already another bug causing the list to not be empty and the assertion to fail. > static int gi_skeleton(struct gfs2_inode *ip, struct gfs2_ioctl *gi, > + gi_filler_t filler) > +{ > + unsigned int size = gfs2_tune_get(ip->i_sbd, gt_lockdump_size); > + char *buf; > + unsigned int count = 0; > + int error; > + > + if (size > gi->gi_size) > + size = gi->gi_size; > + > + buf = kmalloc(size, GFP_KERNEL); > + if (!buf) > + return -ENOMEM; > + > + error = filler(ip, gi, buf, size, &count); > + if (error) > + goto out; > + > + if (copy_to_user(gi->gi_data, buf, count + 1)) > + error = -EFAULT; > > where does count get a sensible value? from filler() We'll add comments in the code to document the things above. 
Thanks, Dave From Axel.Thimm at ATrpms.net Thu Sep 8 07:15:12 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Thu, 8 Sep 2005 09:15:12 +0200 Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing cifs client reconnect? In-Reply-To: <20050907211252.GC5677@Favog.ubiqx.mn.org> References: <20050907201537.GB3455@neu.nirvana> <20050907204321.GB5677@Favog.ubiqx.mn.org> <20050907205116.GA7459@neu.nirvana> <20050907211252.GC5677@Favog.ubiqx.mn.org> Message-ID: <20050908071512.GC9222@neu.nirvana> On Wed, Sep 07, 2005 at 04:12:52PM -0500, Christopher R. Hertel wrote: > On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote: > : : > > > I just tested this. On a W/XP box I browsed through some directories on a > > > share served by Samba. I then shut Samba down, and tried viewing some > > > different subdirectories of the same share. Windows coughed up an error > > > dialog. I then restarted Samba and Windows got happy again. I could > > > browse through all of the subdirectories in the share. > > > > Yes, that does work, but what I wanted to setup is a transparent > > failover, so that network I/O recovers w/o any manual interaction. > > > > I.e. I don't want to (soft) relocate the samba shares onto another > > node due to load ballancing considerations and generate user visible > > I/O errors and failures on a dozen clients. > > I guess I'm not really clear on what it is you're trying to accomplish. > Can you provide a little more description of what you'd like to see > happen, and what kinds of environments you expect? A cifs client performs a largish copy operation. During that the share is relocated to a different node. The copy operations should stall during the relocation and resume after 10-20 seconds. But if the cifs client does not perform a retry on smb/cifs protocol level (on TCP level it will get a RST, it's the next level protocol that needs to decide on retransmit the read/write request), then there is nothing you can do server-side. Perhaps there are magic registry keys that can persuade Windows clients to do otherwise. -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From bob at interstudio.homeunix.net Thu Sep 8 11:16:06 2005 From: bob at interstudio.homeunix.net (Bob Marcan) Date: Thu, 08 Sep 2005 13:16:06 +0200 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <1126106393.25415.18.camel@aptis101.cqtel.com> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> <1126051624.3694.26.camel@aptis101.cqtel.com> <1126062237.12381.4.camel@aptis101.cqtel.com> <431EB631.1020300@hopnet.net> <1126106393.25415.18.camel@aptis101.cqtel.com> Message-ID: <43201D76.6040000@interstudio.homeunix.net> Steve Wilcox wrote: > On Wed, 2005-09-07 at 19:43 +1000, Keith Hopkins wrote: > >>Steve Wilcox wrote: >> >>>On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote: >>> >>> >>>>On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote: >>>> >>>> >>>> >>>>>>- Multi-initator SCSI buses do not work with GFS in any meaningful way, >>>>>>regardless of what the host controller is. >>>>>>Ex: Two machines with different SCSI IDs on their initiator connected to >>>>>>the same physical SCSI bus. >>>>> >>>>>Hmm... don't laugh at me, but in fact that's what we're about to set up. 
>>>>> >>>>>I've read in Red Hat's docs that it is "not supported" because of >>>>>performance issues. Multi-initiator buses should comply to SCSI >>>>>standards, and any SCSI-compliant disk should be able to communicate >>>>>with the correct controller, if I've interpreted the specs correctly. Of >>>>>course, you get arbitrary results when using non-compliant hardware... >>>>>What are other issues with multi-initiator buses, other than performance >>>>>loss? >>>> >>>>I set up a small 2 node cluster this way a while back, just as a testbed >>>>for myself. Much as I suspected, it was severely unstable because of >>>>the storage configuration, even occasionally causing both nodes to crash >>>>when one was rebooted due to SCSI bus resets. I tore it down and >>>>rebuilt it several times, configuring it as a simple failover cluster >>>>with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an >>>>openSSI cluster using Fedora3. All tested configurations were equally >>>>crash-happy due to the bus resets. >>>> >>>>My configuration consisted of a couple of old Compaq deskpro PC's, each >>>>with a single ended Symbiosis card (set to different SCSI ID's >>>>obviously) and an external DEC BA360 jbod shelf with 6 drives. The bus >>>>resets might be mitigated somewhat by using HVD SCSI and Y-cables with >>>>external terminators, but from my previous experience with other >>>>clusters that used this technique (DEC ASE and HP-ux service guard), bus >>>>resets will always be a thorn in your side without a separate, >>>>independent raid controller to act as a go-between. Calling these >>>>configurations simply "not supported" is an understatement - this type >>>>of config is guaranteed trouble. I'd never set up a cluster this way >>>>unless I'm the only one using it, and only then if I don't care one >>>>little bit about crashes and data corruption. My two cents. >>>> >>>>-steve >>> >>> >>> >>>Small clarification - Although clusters from DEC, HP, and even >>>DigiComWho?Paq's TruCluster can be made to work (sort of) on multi- >>>initiator SCSI busses, IIRC it was never a supported option for any of >>>them (much like RedHat's offering). I doubt any sane company would ever >>>support that type of config. >>> >>>-steve >>> >> >>HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP. It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years. Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard. >> >>--Keith > > > Hmmm... Are you sure you're thinking of a multi-initiator _bus_ and > not something like an external SCSI array (i.e. nike arrays or some such > thing)? I know that multi-port SCSI hubs are available, and more than > one HBA per node is obviously supported for multipathing, but generally > any multi-initiator SCSI setup will be talking to an external raid > array, not a simple SCSI bus, and even then bus resets can cause grief. > Admittedly, I'm much more familiar with the Alpha server side of things ==========================> should be unfamiliar > (multi-initiator buses were definitely never supported under DEC unix / > Tru64) , so I could be wrong about HP-ux. I just can't imagine that a > multi-initiator bus wouldn't be a nightmare. 
> > -steve > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster In the past i was using the SCSI cluster on OpenVMS(AXP,VAX) and Tru64. At home i have 2 DS10 with memmory channel, shared SCSI Tru64 cluster. Memmory channel was prerequisite in early days, now you can use ethernet as CI. I still have some customers using the SCSI cluster on Tru64. Two of this are banks, running this configuration a few years. Without any problems. Using host based shadowing. Tru64 has single point of failure in this configuration. Quorum disk can't be shadowed. OpenVMS doesn't have this limitation. It is supported. OpenVMS http://h71000.www7.hp.com/doc/82FINAL/6318/6318pro_002.html#interc_sys_table Tru64 http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_HTML/ARHGWETE/CHNTRXXX.HTM#sec-generic-cluster ... Best regards, Bob -- Bob Marcan, Consultant mailto:bob.marcan at snt.si S&T Hermes Plus d.d. tel: +386 (1) 5895-300 Slandrova ul. 2 fax: +386 (1) 5895-202 1231 Ljubljana - Crnuce, Slovenia url: http://www.snt.si From spwilcox at att.com Thu Sep 8 17:22:34 2005 From: spwilcox at att.com (Steve Wilcox) Date: Thu, 08 Sep 2005 13:22:34 -0400 Subject: [Linux-cluster] Using GFS without a network? In-Reply-To: <43201D76.6040000@interstudio.homeunix.net> References: <16102.1125953577@www46.gmx.net> <1126022567.3344.53.camel@ayanami.boston.redhat.com> <431E1ED7.7010909@gmx.net> <1126051624.3694.26.camel@aptis101.cqtel.com> <1126062237.12381.4.camel@aptis101.cqtel.com> <431EB631.1020300@hopnet.net> <1126106393.25415.18.camel@aptis101.cqtel.com> <43201D76.6040000@interstudio.homeunix.net> Message-ID: <1126200154.17706.25.camel@aptis101.cqtel.com> On Thu, 2005-09-08 at 13:16 +0200, Bob Marcan wrote: > Steve Wilcox wrote: > > On Wed, 2005-09-07 at 19:43 +1000, Keith Hopkins wrote: > > > >>Steve Wilcox wrote: > >> > >>>On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote: > >>> > >>> > >>>>On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote: > >>>> > >>>> > >>>> > >>>>>>- Multi-initator SCSI buses do not work with GFS in any meaningful way, > >>>>>>regardless of what the host controller is. > >>>>>>Ex: Two machines with different SCSI IDs on their initiator connected to > >>>>>>the same physical SCSI bus. > >>>>> > >>>>>Hmm... don't laugh at me, but in fact that's what we're about to set up. > >>>>> > >>>>>I've read in Red Hat's docs that it is "not supported" because of > >>>>>performance issues. Multi-initiator buses should comply to SCSI > >>>>>standards, and any SCSI-compliant disk should be able to communicate > >>>>>with the correct controller, if I've interpreted the specs correctly. Of > >>>>>course, you get arbitrary results when using non-compliant hardware... > >>>>>What are other issues with multi-initiator buses, other than performance > >>>>>loss? > >>>> > >>>>I set up a small 2 node cluster this way a while back, just as a testbed > >>>>for myself. Much as I suspected, it was severely unstable because of > >>>>the storage configuration, even occasionally causing both nodes to crash > >>>>when one was rebooted due to SCSI bus resets. I tore it down and > >>>>rebuilt it several times, configuring it as a simple failover cluster > >>>>with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an > >>>>openSSI cluster using Fedora3. All tested configurations were equally > >>>>crash-happy due to the bus resets. 
> >>>> > >>>>My configuration consisted of a couple of old Compaq deskpro PC's, each > >>>>with a single ended Symbiosis card (set to different SCSI ID's > >>>>obviously) and an external DEC BA360 jbod shelf with 6 drives. The bus > >>>>resets might be mitigated somewhat by using HVD SCSI and Y-cables with > >>>>external terminators, but from my previous experience with other > >>>>clusters that used this technique (DEC ASE and HP-ux service guard), bus > >>>>resets will always be a thorn in your side without a separate, > >>>>independent raid controller to act as a go-between. Calling these > >>>>configurations simply "not supported" is an understatement - this type > >>>>of config is guaranteed trouble. I'd never set up a cluster this way > >>>>unless I'm the only one using it, and only then if I don't care one > >>>>little bit about crashes and data corruption. My two cents. > >>>> > >>>>-steve > >>> > >>> > >>> > >>>Small clarification - Although clusters from DEC, HP, and even > >>>DigiComWho?Paq's TruCluster can be made to work (sort of) on multi- > >>>initiator SCSI busses, IIRC it was never a supported option for any of > >>>them (much like RedHat's offering). I doubt any sane company would ever > >>>support that type of config. > >>> > >>>-steve > >>> > >> > >>HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP. It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years. Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard. > >> > >>--Keith > > > > > > Hmmm... Are you sure you're thinking of a multi-initiator _bus_ and > > not something like an external SCSI array (i.e. nike arrays or some such > > thing)? I know that multi-port SCSI hubs are available, and more than > > one HBA per node is obviously supported for multipathing, but generally > > any multi-initiator SCSI setup will be talking to an external raid > > array, not a simple SCSI bus, and even then bus resets can cause grief. > > Admittedly, I'm much more familiar with the Alpha server side of things > > ==========================> should be unfamiliar > > > (multi-initiator buses were definitely never supported under DEC unix / > > Tru64) , so I could be wrong about HP-ux. I just can't imagine that a > > multi-initiator bus wouldn't be a nightmare. > > > > -steve > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > http://www.redhat.com/mailman/listinfo/linux-cluster > > In the past i was using the SCSI cluster on OpenVMS(AXP,VAX) and Tru64. > At home i have 2 DS10 with memmory channel, shared SCSI Tru64 cluster. > Memmory channel was prerequisite in early days, now you can use ethernet > as CI. > I still have some customers using the SCSI cluster on Tru64. > Two of this are banks, running this configuration a few years. > Without any problems. Using host based shadowing. > Tru64 has single point of failure in this configuration. > Quorum disk can't be shadowed. > OpenVMS doesn't have this limitation. > > It is supported. > > OpenVMS > http://h71000.www7.hp.com/doc/82FINAL/6318/6318pro_002.html#interc_sys_table > > Tru64 > http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_HTML/ARHGWETE/CHNTRXXX.HTM#sec-generic-cluster > ... > > Best regards, Bob > I'll ignore the insult and get to the meat of the matter... I'm well aware of that doc. I got burned by it a few years ago when I set up a dev cluster of ES40's based on it. 
Everything was humming along just fine until I load tested our Oracle database - at something like 700,000 transactions per hour, guess what happened? I had a flurry of bus resets, followed by a flurry of advfs domain panics, resulting in a crashed cluster. When I called my gold support TAM for help in debugging the issue, I was told that "yeah, shared buses will do that. That's why we don't provide technical support for that configuration". When I pointed out that their own documentation claimed it was a "supported" configuration I was told that "supported" only meant technically possible as far as that doc goes - not that HP would provide break-fix support. Maybe that's changed in the last couple years, but I'd doubt it - as my TAM said, shared SCSI buses WILL do that, no way around it really. If you're not having problems with resets, you're simply not loading the bus that heavily. It's all a moot point though - like Lon said, this is a Linux mailing list, not a Tru64, HP-ux, or (god forbid) VMS list, so unless this discussion is going somewhere productive we should probably stop wasting bandwidth. If you want to have a Unix pissing contest, we should do it off list. -steve From RAWIPFEL at novell.com Thu Sep 8 18:18:55 2005 From: RAWIPFEL at novell.com (Robert Wipfel) Date: Thu, 08 Sep 2005 12:18:55 -0600 Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing cifs client reconnect? In-Reply-To: <20050908071512.GC9222@neu.nirvana> References: <20050907201537.GB3455@neu.nirvana> <20050907204321.GB5677@Favog.ubiqx.mn.org> <20050907205116.GA7459@neu.nirvana> <20050907211252.GC5677@Favog.ubiqx.mn.org> <20050908071512.GC9222@neu.nirvana> Message-ID: <43202C52.9092.00CF.0@novell.com> > A cifs client performs a largish copy operation. During that the share > is relocated to a different node. The copy operations should stall > during the relocation and resume after 10-20 seconds. Microsoft can't do this even with their own cluster server product and CIFS client. Recent versions of some applications like office have masked the drive-letter reconnect internal to the application, but in general, any client side open file handles are lost and have to be re-opened by the client application (involving human intervention, e.g. save the file again, or under the covers in a reconnect aware application). Consider the problem for the client, after transport level reconnect to the virtual IP address associated with the Samba service. Suppose the client had an exclusive lock on a file. How can it be sure some other client didn't gain the lock in the meantime? What should the application do when it discovers the lock it once had on a connection is no longer valid. The protocol and client side APIs weren't designed for dealing with session level failover issues. > Perhaps there are magic registry keys that can persuade Windows > clients to do otherwise. Fwiw, some (e.g. Novell) clients are designed to detect they've connected to a clustered file server and optimize transport level drive-letter reconnect (under the assumption the virtual IP will back soon). Newer protocols like NFSv4 have provision for dealing with these kinds of situations. >>> Axel.Thimm at ATrpms.net 9/8/2005 1:15 am >>> On Wed, Sep 07, 2005 at 04:12:52PM -0500, Christopher R. Hertel wrote: > On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote: > : : > > > I just tested this. On a W/XP box I browsed through some directories on a > > > share served by Samba. 
I then shut Samba down, and tried viewing some > > > different subdirectories of the same share. Windows coughed up an error > > > dialog. I then restarted Samba and Windows got happy again. I could > > > browse through all of the subdirectories in the share. > > > > Yes, that does work, but what I wanted to setup is a transparent > > failover, so that network I/O recovers w/o any manual interaction. > > > > I.e. I don't want to (soft) relocate the samba shares onto another > > node due to load ballancing considerations and generate user visible > > I/O errors and failures on a dozen clients. > > I guess I'm not really clear on what it is you're trying to accomplish. > Can you provide a little more description of what you'd like to see > happen, and what kinds of environments you expect? A cifs client performs a largish copy operation. During that the share is relocated to a different node. The copy operations should stall during the relocation and resume after 10-20 seconds. But if the cifs client does not perform a retry on smb/cifs protocol level (on TCP level it will get a RST, it's the next level protocol that needs to decide on retransmit the read/write request), then there is nothing you can do server-side. Perhaps there are magic registry keys that can persuade Windows clients to do otherwise. -- Axel.Thimm at ATrpms.net -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Thu Sep 8 18:24:28 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 08 Sep 2005 14:24:28 -0400 Subject: [Linux-cluster] Samba failover "impossible" due to missing cifs client reconnect? In-Reply-To: <20050907201537.GB3455@neu.nirvana> References: <20050907201537.GB3455@neu.nirvana> Message-ID: <1126203868.30592.132.camel@ayanami.boston.redhat.com> On Wed, 2005-09-07 at 22:15 +0200, Axel Thimm wrote: > After having setup our workarounds for NFS we are very happy with how > it's working. Now we're looking at Samba. > > But we have quite a showstopper right at the beginning. The smb/cifs > clients, be it smbclient or Windows XP, don't like their TCP stream > being resetted and don't retry/reconnect (contrary to NFS). > > It looks like the protocol has no considerations for retries above the > TCP/IP level. So when the TCP stream is torn on the server's side due > to relocation (either due to crash/fencing or soft) any client > smb/cifs activity is broken at that time. > > This means that any data transfer via smb/cifs shares during the > relocation will fail, and there is nothing we can do on the server's > side. Or is there? Potentially not, in the past, we've always had clients reconnect. -- Lon From lhh at redhat.com Thu Sep 8 18:25:34 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 08 Sep 2005 14:25:34 -0400 Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing cifs client reconnect? In-Reply-To: <20050908071512.GC9222@neu.nirvana> References: <20050907201537.GB3455@neu.nirvana> <20050907204321.GB5677@Favog.ubiqx.mn.org> <20050907205116.GA7459@neu.nirvana> <20050907211252.GC5677@Favog.ubiqx.mn.org> <20050908071512.GC9222@neu.nirvana> Message-ID: <1126203934.30592.134.camel@ayanami.boston.redhat.com> On Thu, 2005-09-08 at 09:15 +0200, Axel Thimm wrote: > On Wed, Sep 07, 2005 at 04:12:52PM -0500, Christopher R. Hertel wrote: > > On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote: > > : : > > > > I just tested this. On a W/XP box I browsed through some directories on a > > > > share served by Samba. 
I then shut Samba down, and tried viewing some > > > > different subdirectories of the same share. Windows coughed up an error > > > > dialog. I then restarted Samba and Windows got happy again. I could > > > > browse through all of the subdirectories in the share. > > > > > > Yes, that does work, but what I wanted to setup is a transparent > > > failover, so that network I/O recovers w/o any manual interaction. > > > > > > I.e. I don't want to (soft) relocate the samba shares onto another > > > node due to load ballancing considerations and generate user visible > > > I/O errors and failures on a dozen clients. > > > > I guess I'm not really clear on what it is you're trying to accomplish. > > Can you provide a little more description of what you'd like to see > > happen, and what kinds of environments you expect? > > A cifs client performs a largish copy operation. During that the share > is relocated to a different node. The copy operations should stall > during the relocation and resume after 10-20 seconds. Can't do it, as far as I know... SMB has too much internal state information. -- Lon From crh at ubiqx.mn.org Fri Sep 9 02:59:36 2005 From: crh at ubiqx.mn.org (Christopher R. Hertel) Date: Thu, 08 Sep 2005 21:59:36 -0500 Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing cifs client reconnect? In-Reply-To: <20050908071512.GC9222@neu.nirvana> References: <20050907201537.GB3455@neu.nirvana> <20050907204321.GB5677@Favog.ubiqx.mn.org> <20050907205116.GA7459@neu.nirvana> <20050907211252.GC5677@Favog.ubiqx.mn.org> <20050908071512.GC9222@neu.nirvana> Message-ID: <4320FA98.8080403@ubiqx.mn.org> Axel Thimm wrote: > On Wed, Sep 07, 2005 at 04:12:52PM -0500, Christopher R. Hertel wrote: > >>On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote: >>: : >> >>>>I just tested this. On a W/XP box I browsed through some directories on a >>>>share served by Samba. I then shut Samba down, and tried viewing some >>>>different subdirectories of the same share. Windows coughed up an error >>>>dialog. I then restarted Samba and Windows got happy again. I could >>>>browse through all of the subdirectories in the share. >>> >>>Yes, that does work, but what I wanted to setup is a transparent >>>failover, so that network I/O recovers w/o any manual interaction. >>> >>>I.e. I don't want to (soft) relocate the samba shares onto another >>>node due to load ballancing considerations and generate user visible >>>I/O errors and failures on a dozen clients. >> >>I guess I'm not really clear on what it is you're trying to accomplish. >>Can you provide a little more description of what you'd like to see >>happen, and what kinds of environments you expect? > > > A cifs client performs a largish copy operation. During that the share > is relocated to a different node. The copy operations should stall > during the relocation and resume after 10-20 seconds. Okay, now I have a clearer picture. > But if the cifs client does not perform a retry on smb/cifs protocol > level (on TCP level it will get a RST, it's the next level protocol > that needs to decide on retransmit the read/write request), then there > is nothing you can do server-side. Yep... > Perhaps there are magic registry keys that can persuade Windows > clients to do otherwise. Not likely. Others on the list have already done a better job than I at working this through. I can only add that I am not aware of anything in the protocol itself that would handle retransmission. 
Two things to condsider: - The core of SMB is quite old and was not written to run on top of TCP. SMB had to deal with a variety of transport semantics. - SMB was designed, originally, as a request/response protocol (client sends a request, server responds). In theory, the client could re-send the original request if the TCP connection drops and is restarted...but how does the SMB client know that the first request did or didn't succeed? The server might have finished the operation as the connection failed. The solution, in general, is for SMB to report a failure and let the user decide how to handle it. (Eg. Try saving your MS-Word doc to a different drive or something.) Chris -)----- -- "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org From skulkin at mosinfo.ru Fri Sep 9 07:21:27 2005 From: skulkin at mosinfo.ru (Skulkin Dmitry) Date: Fri, 09 Sep 2005 11:21:27 +0400 Subject: [Linux-cluster] nodes don't see each other Message-ID: Hi, I'm tring to make a simple two-node cluster with RHCS4, but nodes don't see each other. clustat on node alpha1 shows that alpha1 is online, clustat on node alpha2 shows that alpha2 is online, but no information about other node. ping alpha1 and ping alpha2 is ok, hostnames are alpha1 and alpha2. For testing purposes I'm using manual fencing, but with unchecked "clean start" on "cluster properties" starting fenced is hang and in /var/log/messages: Sep 9 10:30:57 alpha1 fenced[12327]: fencing node "alpha2" Sep 9 10:30:57 alpha1 fenced[12327]: fence "alpha2" failed Sep 9 10:31:02 alpha1 fenced[12327]: fencing node "alpha2" Sep 9 10:31:02 alpha1 fenced[12327]: fence "alpha2" failed Sep 9 10:31:07 alpha1 fenced[12327]: fencing node "alpha2" Sep 9 10:31:07 alpha1 fenced[12327]: fence "alpha2" failed Sep 9 10:31:12 alpha1 fenced[12327]: fencing node "alpha2" Sep 9 10:31:12 alpha1 fenced[12327]: fence "alpha2" failed and so on. I tried fence_ack_manual -n alpha1 and fence_ack_manual -n alpha2 on both nodes, but no result: Warning: If the node "alpha1" has not been manually fenced (i.e. power cycled or disconnected from shared storage devices) the GFS file system may become corrupted and all its data unrecoverable! Please verify that the node shown above has been reset or disconnected from storage. Are you certain you want to continue? [yN] y can't open /tmp/fence_manual.fifo: No such file or directory With checked "clean start" starting fenced is ok, but nodes don't see each other at all. cluster.conf: Thanks for any help, -- Best regards, Dmitry Skulkin From baesso at ksolutions.it Fri Sep 9 11:37:16 2005 From: baesso at ksolutions.it (Baesso Mirko) Date: Fri, 9 Sep 2005 13:37:16 +0200 Subject: [Linux-cluster] NFS load balancing on REDHAT cluster Message-ID: Hi we have to setup a Redhat cluster with two node on an attached shared storadge system (Fibre Channel connection). We would like to know if is possible to setup an NFS service clustered with load balancing We have to use GFS file system for sharing storadge data on both node, but have to setup GNBD also for exporting same file system to client network? We would like to see only one File server IP with only one file shared list from client side.(see attch) Thanks <> Baesso Mirko - System Engineer KSolutions.S.p.A. 
Via Lenin 132/26 56017 S.Martino Ulmiano (PI) - Italy tel.+ 39 0 50 898369 fax. + 39 0 50 861200 baesso at ksolutions.it http//www.ksolutions.it -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Cluster_NFS_GFS.gif Type: image/gif Size: 31750 bytes Desc: Cluster_NFS_GFS.gif URL: From lhh at redhat.com Fri Sep 9 14:32:45 2005 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 09 Sep 2005 10:32:45 -0400 Subject: [Linux-cluster] NFS load balancing on REDHAT cluster In-Reply-To: References: Message-ID: <1126276365.16345.25.camel@ayanami.boston.redhat.com> On Fri, 2005-09-09 at 13:37 +0200, Baesso Mirko wrote: > Hi > > we have to setup a Redhat cluster with two node on an attached shared > storadge system (Fibre Channel connection). > > We would like to know if is possible to setup an NFS service clustered > with load balancing Personally, I haven't tried this, but here's my pseudo-educated guess... It should be possible, but there may be some interesting issues with NFS synchronization across the cluster (WRT statd, mountd synchronization). I do not know if NFS will behave well as a load-balanced service. In theory, locking should work (because the NFS locks translate to GFS locks, which would be cluster wide). You'll probably want to pre-populate /var/lib/nfs/rmtab with all the possible client entries on each node. You'll probably need to set up IPVS on a machine to do the load balancing. You can use piranha, one of the many other front ends to IPVS, or just do it by hand. You'll want to make sure that you group mountd+lockd+nfs(+portmap?) ports together on the IPVS director so that client A always requests everything from server B once the initial communication is established (which would typically happen via portmapper or mountd). Also, you should probably use well-known ports for everything NFS/RPC related instead of the portmapper, because there's a good chance that when using the portmapper, the ports in use by mountd/lockd/nfsd/etc will be different on each server - which would make it really difficult for IPVS to correctly load balance it ;) > We have to use GFS file system for sharing storadge data on both node, > but have to setup GNBD also for exporting same file system to client > network? This shouldn't be necessary. If you do a GNBD import on the clients, the clients will need to be running GFS. If you're doing an NFS export from the servers, they can simply be running NFS. -- Lon From baesso at ksolutions.it Fri Sep 9 15:10:46 2005 From: baesso at ksolutions.it (Baesso Mirko) Date: Fri, 9 Sep 2005 17:10:46 +0200 Subject: R: [Linux-cluster] NFS load balancing on REDHAT cluster Message-ID: Thanks for suggests Lon Baesso -----Messaggio originale----- Da: Lon Hohberger [mailto:lhh at redhat.com] Inviato: venerd? 9 settembre 2005 16.33 A: linux clustering Oggetto: Re: [Linux-cluster] NFS load balancing on REDHAT cluster On Fri, 2005-09-09 at 13:37 +0200, Baesso Mirko wrote: > Hi > > we have to setup a Redhat cluster with two node on an attached shared > storadge system (Fibre Channel connection). > > We would like to know if is possible to setup an NFS service clustered > with load balancing Personally, I haven't tried this, but here's my pseudo-educated guess... It should be possible, but there may be some interesting issues with NFS synchronization across the cluster (WRT statd, mountd synchronization). I do not know if NFS will behave well as a load-balanced service. 
In theory, locking should work (because the NFS locks translate to GFS locks, which would be cluster wide). You'll probably want to pre-populate /var/lib/nfs/rmtab with all the possible client entries on each node. You'll probably need to set up IPVS on a machine to do the load balancing. You can use piranha, one of the many other front ends to IPVS, or just do it by hand. You'll want to make sure that you group mountd+lockd+nfs(+portmap?) ports together on the IPVS director so that client A always requests everything from server B once the initial communication is established (which would typically happen via portmapper or mountd). Also, you should probably use well-known ports for everything NFS/RPC related instead of the portmapper, because there's a good chance that when using the portmapper, the ports in use by mountd/lockd/nfsd/etc will be different on each server - which would make it really difficult for IPVS to correctly load balance it ;) > We have to use GFS file system for sharing storadge data on both node, > but have to setup GNBD also for exporting same file system to client > network? This shouldn't be necessary. If you do a GNBD import on the clients, the clients will need to be running GFS. If you're doing an NFS export from the servers, they can simply be running NFS. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com http://www.redhat.com/mailman/listinfo/linux-cluster From cjk at techma.com Fri Sep 9 15:16:28 2005 From: cjk at techma.com (Kovacs, Corey J.) Date: Fri, 9 Sep 2005 11:16:28 -0400 Subject: [Linux-cluster] Debugging Fencing?? Message-ID: I have a 3 node cluster (RHEL3 + GFS 6.0.2.20-1) running but fencing will not function correctly. I can call fence_ilo manually and fence a reboot a node by hand but calling fence_node fails complaining about connection errors which come from perl-Crypt-SSLeay. I've looked over my configs and things look ok to me. I've got other clusters using similar configs that work fine. I've looked through the fence_node code to find out whats going on and it looks like I can somehow increase the verbosity of the fencing operation so that it reports what the exact arguments are getting passed to fence_ilo. I've tried setting verbosity in lock_gulmd to ReallyALL but there seems to be no effect with respect to additional logging of fenceing operations. I'll keep looking at the code to find out how to enable extra logging but if someone could point me in the right direction, that'd be great. Any suggestions? Thanks Corey From moya at infomed.sld.cu Fri Sep 9 17:31:30 2005 From: moya at infomed.sld.cu (Maykel Moya) Date: Fri, 09 Sep 2005 13:31:30 -0400 Subject: [Linux-cluster] Filesystem (GFS) availability In-Reply-To: <1126113620.30592.49.camel@ayanami.boston.redhat.com> References: <1126104683.15223.7.camel@julia.sld.cu> <1126113620.30592.49.camel@ayanami.boston.redhat.com> Message-ID: <1126287090.30563.3.camel@julia.sld.cu> El mi?, 07-09-2005 a las 13:20 -0400, Lon Hohberger escribi?: > On Wed, 2005-09-07 at 10:51 -0400, Maykel Moya wrote: > > > > > > > > > Run fence_ack_manual on the surviving node. Can it be automated? > Better yet, stop using manual fencing Don't know what to put in cluster.conf to make it anything but manual. > and buy a supported power switch > off of eBay. 
It will save you a lot of frustration :) Well, servers in rack are connected to central UPS, so, there is no option here :| Regards, maykel From fsetinsek at techniscanmedical.com Fri Sep 9 18:19:27 2005 From: fsetinsek at techniscanmedical.com (Frank L. Setinsek) Date: Fri, 9 Sep 2005 12:19:27 -0600 Subject: [Linux-cluster] GFS Performance Problem Message-ID: <200509091917.j89JHch5004403@mx1.redhat.com> We have a 6 node cluster (RHEL3 + GFS 6.0.2-25), we are running with a SLM Embedded. One on the nodes streams data to the GFS and the other nodes process the data. The problem is that it now takes more than twice as long to acquire the data than when the node was standalone. If we unmount the GFS from all the nodes except the one acquiring the data and the node running the lock manager--the acquisition takes the same amount of time as the standalone configuration. Sometimes there are seconds of inactivity on the RAID when before there was none. Any suggestions would be greatly appreciated. Frank L. Setinsek -------------- next part -------------- An HTML attachment was scrubbed... URL: From arjan at infradead.org Sat Sep 10 10:11:29 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Sat, 10 Sep 2005 12:11:29 +0200 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <20050905054348.GC11337@redhat.com> References: <20050901104620.GA22482@redhat.com> <1125574523.5025.10.camel@laptopd505.fenrus.org> <20050905054348.GC11337@redhat.com> Message-ID: <1126347089.3222.138.camel@laptopd505.fenrus.org> > > You removed the comment stating exactly why, see below. If that's not a > accepted technique in the kernel, say so and I'll be happy to change it > here and elsewhere. > Thanks, > Dave entirely useless wrapping is not acceptable indeed. From sanelson at gmail.com Sat Sep 10 13:52:22 2005 From: sanelson at gmail.com (Steve Nelson) Date: Sat, 10 Sep 2005 14:52:22 +0100 Subject: [Linux-cluster] CCA Partition Invisible from 2nd Node Message-ID: Hello All, I'm in the process of building an Oracle cluster on RHAS 3.0 // GFS 6.0 using 2 x GL580s, an MSA1000 and a DL380 for the quorum server, but have encountered what seems to be a problem with the secondary node seeing the CCA partition, and thus not being able to read the ccs files, preventing me from starting ccsd on that node. Here's the setup: In the MSA 1000 I have created 4 LUNs: => controller serialnumber=P56350GX3R004S logicaldrive all show MSA1000 at SGM0450039 array A logicaldrive 1 (16.9 GB, RAID 1+0, OK) array B logicaldrive 2 (16.9 GB, RAID 1+0, OK) array C logicaldrive 3 (16.9 GB, RAID 1+0, OK) array D logicaldrive 4 (50.8 GB, RAID 5, OK) My partitioning scheme is as follows: /dev/sda1 - 100M (raw partitions used by cluster) /dev/sdb1 - likewise /dev/sda2 - 100M (CCA partition for GFS) /dev/sdb2 - likewise /dev/sda3 /dev/sdb3 - the rest - data partition /dev/sdc /dev/sdd - all - data partition. 
I can see these partitions with fdisk on both nodes: D isk /dev/sda: 18.2 GB, 18207375360 bytes /dev/sda1 1 13 104391 83 Linux /dev/sda2 14 26 104422+ 83 Linux /dev/sda3 27 2213 17567077+ 83 Linux Disk /dev/sdb: 18.2 GB, 18207375360 bytes /dev/sdb1 1 13 104391 83 Linux /dev/sdb2 14 26 104422+ 83 Linux /dev/sdb3 27 2213 17567077+ 83 Linux Disk /dev/sdc: 18.2 GB, 18207375360 bytes Disk /dev/sdd: 54.6 GB, 54622126080 bytes My pool config files are as below: # more *cfg :::::::::::::: digex_cca.cfg :::::::::::::: poolname digex_cca subpools 1 subpool 0 0 2 pooldevice 0 0 /dev/sda2 pooldevice 0 1 /dev/sdb2 :::::::::::::: gfs0.cfg :::::::::::::: poolname gfs0 subpools 1 subpool 0 128 2 pooldevice 0 0 /dev/sda3 pooldevice 0 1 /dev/sdb3 :::::::::::::: gfs1.cfg :::::::::::::: poolname gfs1 subpools 1 subpool 0 128 2 pooldevice 0 0 /dev/sdc pooldevice 0 1 /dev/sdd Having run pool_assemble -a on both nodes, I wrote my ccs files, and created the cluster archive. On node 1 I see: [root at primary]/etc/gfs# pool_info Major Minor Name Alias Capacity In use MP Type MP Stripe 254 65 digex_cca /dev/poolbn 417632 YES none 254 66 gfs0 /dev/poolbo 70268160 NO none 254 67 gfs1 /dev/poolbp 71122432 NO none [root at primary]/etc/gfs# ls -l /dev/pool total 0 brw------- 2 root root 254, 65 Sep 9 16:53 digex_cca brw------- 2 root root 254, 66 Sep 9 16:53 gfs0 brw------- 2 root root 254, 67 Sep 9 16:53 gfs1 But on node 2 I see: [root at secondary]~# pool_info Major Minor Name Alias Capacity In use MP Type MP Stripe 254 65 gfs1 /dev/poolbn 71122432 NO none [root at secondary]~# ls -l /dev/pool total 0 brw------- 2 root root 254, 65 Sep 9 16:53 gfs1 Consequently when I try to restart ccsd on the secondary node, it looks for the ccs files in the location specified in /etc/sysconfig/gfs (which doesn't exist). Notwithstanding the oddness of sdc and sdd being different sizes - this can be re-organised - I am concerned that the second node can't see the CCA partition, and am loath to simply copy the ccs files to the local machine. I also note that gfs1 as seen on node 2 has the same alias and major/minor as cca, but the same dimensions as the gfs1 seen by node 1. This suggests to me either a multipathing problem, or a configuration error. I am not happy to continue with gfs_mkfs on /dev/pool/gfs[01] at this stage, and would like some advice on why I can't see/access the cca partition. I'd appreciate your thoughts and advice on how to continue! Thanks a lot! Steve Nelson From wcheng at redhat.com Sat Sep 10 15:18:06 2005 From: wcheng at redhat.com (Wendy Cheng) Date: Sat, 10 Sep 2005 11:18:06 -0400 Subject: [Linux-cluster] CCA Partition Invisible from 2nd Node In-Reply-To: References: Message-ID: <1126365487.3406.12.camel@localhost.localdomain> On Sat, 2005-09-10 at 14:52 +0100, Steve Nelson wrote: > Hello All, > > I'm in the process of building an Oracle cluster on RHAS 3.0 // GFS > 6.0 using 2 x GL580s, an MSA1000 and a DL380 for the quorum server, > but have encountered what seems to be a problem with the secondary > node seeing the CCA partition, and thus not being able to read the ccs > files, preventing me from starting ccsd on that node. > .... 
> I can see these partitions with fdisk on both nodes: > > D > isk /dev/sda: 18.2 GB, 18207375360 bytes > /dev/sda1 1 13 104391 83 Linux > /dev/sda2 14 26 104422+ 83 Linux > /dev/sda3 27 2213 17567077+ 83 Linux > Disk /dev/sdb: 18.2 GB, 18207375360 bytes > /dev/sdb1 1 13 104391 83 Linux > /dev/sdb2 14 26 104422+ 83 Linux > /dev/sdb3 27 2213 17567077+ 83 Linux > Disk /dev/sdc: 18.2 GB, 18207375360 bytes > Disk /dev/sdd: 54.6 GB, 54622126080 bytes > GFS obtains its disk information from /proc/partitions file - check this proc file from your "secondary" node to see whether array A and B are seen by Linux at all (and/or check the /var/log/dmesg file to be sure). Did you reboot between checking with fdisk and pool_info command on "secondary" ? -- Wendy From sanelson at gmail.com Sat Sep 10 15:26:33 2005 From: sanelson at gmail.com (Steve Nelson) Date: Sat, 10 Sep 2005 16:26:33 +0100 Subject: [Linux-cluster] CCA Partition Invisible from 2nd Node In-Reply-To: <1126365487.3406.12.camel@localhost.localdomain> References: <1126365487.3406.12.camel@localhost.localdomain> Message-ID: On 9/10/05, Wendy Cheng wrote: > GFS obtains its disk information from /proc/partitions file - check this > proc file from your "secondary" node to see whether array A and B are > seen by Linux at all (and/or check the /var/log/dmesg file to be sure). Yes - they are visible. > Did you reboot between checking with fdisk and pool_info command on > "secondary" ? Nope :-) However, I have subeseqently re-run pool_assemble -a on the secondary node, and now I see the partition. My question now is why this was necessary. In order, these were my steps: 1) Write pool configs 2) On primary: pool_tool -c 3) On primary: pool_assemble -a 4) On secondary: pool_assemble -a 5) Write ccs files 6) On primary: ccs_tool create 7) On primary: restart ccsd - fine. 8) On secondary: restart ccsd - can't see cca partition. 9) Think and ask for help 10) On secondary: Re-run pool_assemble -a - now can see cca partition. 11) On secondary: Restart ccsd - fine. Is this abberant behaviour? I am surprised I would need to run pool_assemble twice, after creating the cluster archive. Thanks for your help! > -- Wendy Steve From wcheng at redhat.com Sun Sep 11 03:52:40 2005 From: wcheng at redhat.com (Wendy Cheng) Date: Sat, 10 Sep 2005 23:52:40 -0400 Subject: [Linux-cluster] CCA Partition Invisible from 2nd Node In-Reply-To: References: <1126365487.3406.12.camel@localhost.localdomain> Message-ID: <1126410760.3406.43.camel@localhost.localdomain> On Sat, 2005-09-10 at 16:26 +0100, Steve Nelson wrote: > However, I have subeseqently re-run pool_assemble -a on the secondary > node, and now I see the partition. > > My question now is why this was necessary. In order, these were my steps: > > 1) Write pool configs > 2) On primary: pool_tool -c > 3) On primary: pool_assemble -a > 4) On secondary: pool_assemble -a > 5) Write ccs files > 6) On primary: ccs_tool create > 7) On primary: restart ccsd - fine. > 8) On secondary: restart ccsd - can't see cca partition. > 9) Think and ask for help > 10) On secondary: Re-run pool_assemble -a - now can see cca partition. > 11) On secondary: Restart ccsd - fine. > > Is this abberant behaviour? I am surprised I would need to run > pool_assemble twice, after creating the cluster archive. Just quickly browsed thru the pool code. Look to me that "pool_tool - c" (create) does its write (to disk) without a sync flag (O_SYNC). 
So my guess is that the primary node didn't have a chance to flush its data into the disk before you issued "pool_assemble -a" in step 4. The disk scan missed the pool info in your first try. By the time you did step 10, the flush in primary node (linux io is write-behind) had happened. Just a guess but I'll talk to our developer to confirm. If it is true, may be we can add a sync (code) to avoid this problem in the future. BTW, Red Hat manual encourages people doing a "pool_tool -s" + "pool_info" to make sure the node can see the pool before starting ccsd. -- Wendy From pcaulfie at redhat.com Tue Sep 13 09:32:41 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 13 Sep 2005 10:32:41 +0100 Subject: [Linux-cluster] Where to go with cman ? In-Reply-To: <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> References: <42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> <42F77AA3.80000@redhat.com> <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> Message-ID: <43269CB9.3060005@redhat.com> Guochun Shi wrote: > Patrick, > can you describe the steps changed for CVS version compared to those in usage.txt in order to make gfs2 work? > Very briefly... More doc should be made available soon I hope. ccsd cman_tool join modprobe dlm.ko modprobe dlm_device.ko modprobe lock_harness.ko modprobe lock_dlm.ko modprobe gfs.ko modprobe sctp groupd dlm_controld lock_dlmd fenced fence_tool join Note that if you want to use clvmd you will need a patch to make it use libcman rather than calling directly into the (now non-existant) kernel cman. See attached. -- patrick -------------- next part -------------- A non-text attachment was scrubbed... Name: clvmd-libcman.patch Type: text/x-patch Size: 13494 bytes Desc: not available URL: From gshi at ncsa.uiuc.edu Tue Sep 13 19:59:04 2005 From: gshi at ncsa.uiuc.edu (Guochun Shi) Date: Tue, 13 Sep 2005 14:59:04 -0500 Subject: [Linux-cluster] Where to go with cman ? In-Reply-To: <43269CB9.3060005@redhat.com> References: <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> <42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> <42F77AA3.80000@redhat.com> <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> Message-ID: <5.1.0.14.2.20050913145610.038735d8@pop.ncsa.uiuc.edu> At 10:32 AM 9/13/2005 +0100, you wrote: >Guochun Shi wrote: >> Patrick, >> can you describe the steps changed for CVS version compared to those in usage.txt in order to make gfs2 work? >> > >Very briefly... More doc should be made available soon I hope. > >ccsd >cman_tool join >modprobe dlm.ko >modprobe dlm_device.ko I don't find dlm_device module >modprobe lock_harness.ko >modprobe lock_dlm.ko >modprobe gfs.ko >modprobe sctp > >groupd >dlm_controld >lock_dlmd After I ran lock_dlmd, "ps -aef|grep lock_dlmd" shows nothing >fenced >fence_tool join It hangs thanks -Guochun From gshi at ncsa.uiuc.edu Tue Sep 13 22:46:01 2005 From: gshi at ncsa.uiuc.edu (Guochun Shi) Date: Tue, 13 Sep 2005 17:46:01 -0500 Subject: [Linux-cluster] Where to go with cman ? 
In-Reply-To: <5.1.0.14.2.20050913145610.038735d8@pop.ncsa.uiuc.edu> References: <43269CB9.3060005@redhat.com> <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> <42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> <42F77AA3.80000@redhat.com> <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> Message-ID: <5.1.0.14.2.20050913174251.0457be18@pop.ncsa.uiuc.edu> At 02:59 PM 9/13/2005 -0500, you wrote: >At 10:32 AM 9/13/2005 +0100, you wrote: >>Guochun Shi wrote: >>> Patrick, >>> can you describe the steps changed for CVS version compared to those in usage.txt in order to make gfs2 work? >>> >> >>Very briefly... More doc should be made available soon I hope. >> >>ccsd >>cman_tool join >>modprobe dlm.ko >>modprobe dlm_device.ko >I don't find dlm_device module > > >>modprobe lock_harness.ko >>modprobe lock_dlm.ko >>modprobe gfs.ko >>modprobe sctp >> >>groupd >>dlm_controld >>lock_dlmd >After I ran lock_dlmd, "ps -aef|grep lock_dlmd" shows nothing > > >>fenced >>fence_tool join >It hangs after talking with lon, it turned out I did not mount configfs in /config. After that, I can go through all steps except the last one mount -t gfs /dev/sdb1 /mnt it hangs out, /var/log/messages show Sep 13 16:34:11 posic066 kernel: dlm: testfs: recover 1 Sep 13 16:34:11 posic066 kernel: dlm: testfs: add member 1 Sep 13 16:34:11 posic066 kernel: dlm: testfs: total members 1 Sep 13 16:34:11 posic066 kernel: dlm: testfs: dlm_recover_directory Sep 13 16:34:11 posic066 kernel: dlm: testfs: dlm_recover_directory 0 entries Sep 13 16:34:11 posic066 kernel: dlm: testfs: recover 1 done: 0 ms any hint what I can do to diagnosis the problem? thanks -Guochun From teigland at redhat.com Wed Sep 14 19:38:52 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 15 Sep 2005 03:38:52 +0800 Subject: [Linux-cluster] Where to go with cman ? In-Reply-To: <5.1.0.14.2.20050913174251.0457be18@pop.ncsa.uiuc.edu> References: <43269CB9.3060005@redhat.com> <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> <42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> <42F77AA3.80000@redhat.com> <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> <5.1.0.14.2.20050913174251.0457be18@pop.ncsa.uiuc.edu> Message-ID: <20050914193852.GD922@redhat.com> On Tue, Sep 13, 2005 at 05:46:01PM -0500, Guochun Shi wrote: > >>> can you describe the steps changed for CVS version compared to those > >>> in usage.txt in order to make gfs2 work? > any hint what I can do to diagnosis the problem? For gfs2 it's gfs2.ko, gfs2_mkfs, and mount -t gfs2 ... Dave From dawson at fnal.gov Wed Sep 14 19:39:15 2005 From: dawson at fnal.gov (Troy Dawson) Date: Wed, 14 Sep 2005 14:39:15 -0500 Subject: [Linux-cluster] Switching Master and Slave using lock_gulm Message-ID: <43287C63.10909@fnal.gov> Hi, I can't seem to find this answered anywhere, so if I've overlooked something, please point me in the right direction. I have my GFS cluster running using lock_gulm. I have 3 masters. The one that I REALLY want to be the master came up last, so it is a slave. 
Both the machine that is currently the master, and the machine I want to be the master should not be rebooted, or actually, loose their GFS file system ability (one is read/write, one is read only) So the question is, is there a command in gulm_tools (or some other program) that will allow me to switch the master from one machine to the other without restarting any of the lock_gulm's. Thanks Troy -- __________________________________________________ Troy Dawson dawson at fnal.gov (630)840-6468 Fermilab ComputingDivision/CSS CSI Group __________________________________________________ From pcaulfie at redhat.com Wed Sep 14 09:01:23 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 14 Sep 2005 10:01:23 +0100 Subject: [Linux-cluster] Re: GFS, what's remaining In-Reply-To: <1125922894.8714.14.camel@localhost.localdomain> References: <20050901104620.GA22482@redhat.com> <20050903183241.1acca6c9.akpm@osdl.org> <20050904030640.GL8684@ca-server1.us.oracle.com> <200509040022.37102.phillips@istop.com> <20050903214653.1b8a8cb7.akpm@osdl.org> <20050904045821.GT8684@ca-server1.us.oracle.com> <20050903224140.0442fac4.akpm@osdl.org> <20050905043033.GB11337@redhat.com> <20050905015408.21455e56.akpm@osdl.org> <20050905092433.GE17607@redhat.com> <20050905021948.6241f1e0.akpm@osdl.org> <1125922894.8714.14.camel@localhost.localdomain> Message-ID: <4327E6E3.3050501@redhat.com> I've just returned from holiday so I'm late to this discussion so let me tell you what we do now and why and lets see what's wrong with it. Currently the library create_lockspace() call returns an FD upon which all lock operations happen. The FD is onto a misc device, one per lockspace, so if you want lockspace protection it can happen at that level. There is no protection applied to locks within a lockspace nor do I think it's helpful to do so to be honest. Using a misc device limits you to <255 lockspaces depending on the other uses of misc but this is just for userland-visible lockspace - it does not affect GFS filesystems for instance. Lock/convert/unlock operations are done using write calls on that lockspace FD. Callbacks are implemented using poll and read on the FD, read will return data blocks (one per callback) as long as there are active callbacks to process. The current read functionality behaves more like a SOCK_PACKET than a data stream which some may not like but then you're going to need to know what you're reading from the device anyway. ioctl/fcntl isn't really useful for DLM locks because you can't do asynchronous operations on them - the lock has to succeed or fail in the one operation - if you want a callback for completion (or blocking notification) you have to poll the lockspace FD anyway and then you might as well go back to using read and write because at least they are something of a matched pair. Something similar applies, I think, to a syscall interface. Another reason the existing fcntl interface isn't appropriate is that it's not locking the same kind of thing. Current Unix fcntl calls lock byte ranges. DLM locks arbitrary names and has a much richer list of lock modes. Adding another fcntl just runs in the problems mentioned above. The other reason we use read for callbacks is that there is information to be passed back: lock status, value block and (possibly) query information. While having an FD per lock sounds like a nice unixy idea I don't think it would work very well in practice. 
Applications with hundreds or thousands of locks (such as databases) would end up with huge pollfd structs to manage, and it while it helps the refcounting (currently the nastiest bit of the current dlm_device code) removes the possibility of having persistent locks that exist after the process exits - a handy feature that some people do use, though I don't think it's in the currently submitted DLM code. One FD per lock also gives each lock two handles, the lock ID used internally by the DLM and the FD used externally by the application which I think is a little confusing. I don't think a dlmfs is useful, personally. The features you can export from it are either minimal compared to the full DLM functionality (so you have to export the rest by some other means anyway) or are going to be so un-filesystemlike as to be very awkward to use. Doing lock operations in shell scripts is all very cool but how often do you /really/ need to do that? I'm not saying that what we have is perfect - far from it - but we have thought about how this works and what we came up with seems like a good compromise between providing full DLM functionality to userspace using unix features. But we're very happy to listen to other ideas - and have been doing I hope. -- patrick From gshi at ncsa.uiuc.edu Wed Sep 14 23:40:45 2005 From: gshi at ncsa.uiuc.edu (Guochun Shi) Date: Wed, 14 Sep 2005 18:40:45 -0500 Subject: [Linux-cluster] Where to go with cman ? In-Reply-To: <43269CB9.3060005@redhat.com> References: <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> <42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com> <1122318870.12824.29.camel@localhost.localdomain> <42EF4AD1.6010809@redhat.com> <1123263949.16923.23.camel@localhost.localdomain> <42F77AA3.80000@redhat.com> <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu> Message-ID: <5.1.0.14.2.20050914183154.0463c700@pop.ncsa.uiuc.edu> At 10:32 AM 9/13/2005 +0100, you wrote: >Guochun Shi wrote: >> Patrick, >> can you describe the steps changed for CVS version compared to those in usage.txt in order to make gfs2 work? >> > >Very briefly... More doc should be made available soon I hope. > >ccsd >cman_tool join >modprobe dlm.ko >modprobe dlm_device.ko >modprobe lock_harness.ko >modprobe lock_dlm.ko >modprobe gfs.ko >modprobe sctp > >groupd >dlm_controld >lock_dlmd >fenced >fence_tool join > >Note that if you want to use clvmd you will need a patch to make it use libcman >rather than calling directly into the (now non-existant) kernel cman. See attached. thanks for the info, I still cannot get it work in simple one node lock_dlm case. It hanged when I tried to mount. (but lock_nolock works for me) I attached all steps I did, the cluster.conf file and the log from /var/log/messages. thanks a lot -Guochun ------------------------------------------------------------------------------------------------------------------ [root at posic066 cman_tool]# mount -t configfs configfs /config [root at posic066 cman_tool]# ccsd [root at posic066 cman_tool]# cman_tool join -N 1 command line options may override cluster.conf values [root at posic066 cman_tool]# modprobe dlm [root at posic066 cman_tool]# modprobe lock_dlml FATAL: Module lock_dlml not found. 
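# note: "lock_dlml" above is a typo for lock_dlm, hence the FATAL; the module is loaded under its correct name on the next line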
[root at posic066 cman_tool]# modprobe lock_dlm [root at posic066 cman_tool]# modprobe gfs [root at posic066 cman_tool]# modprobe sctp [root at posic066 cman_tool]# lsmod Module Size Used by sctp 163164 2 [unsafe] ipv6 263904 7 sctp gfs 296708 0 lock_dlm 23544 0 lock_harness 5544 2 gfs,lock_dlm dlm 100036 1 lock_dlm configfs 26892 2 dlm nfs 218856 2 lockd 66056 2 nfs sunrpc 155964 3 nfs,lockd autofs 16384 0 e100 41476 0 mii 5888 1 e100 qla2300 124800 0 qla2xxx 120792 1 qla2300 scsi_transport_fc 29184 1 qla2xxx parport_pc 28612 0 parport 37448 1 parport_pc [root at posic066 cman_tool]# groupd [root at posic066 cman_tool]# dlm_controld [root at posic066 cman_tool]# lock_dlmd [root at posic066 cman_tool]# fenced [root at posic066 cman_tool]# fence_tool join [root at posic066 cman_tool]# gfs_mkfs -p lock_dlm -t alpha:testfs -j 1 /dev/sdb1 This will destroy any data on /dev/sdb1. It appears to contain a GFS filesystem. Are you sure you want to proceed? [y/n] yes Device: /dev/sdb1 Blocksize: 4096 Filesystem Size: 1975184 Journals: 1 Resource Groups: 32 Locking Protocol: lock_dlm Lock Table: alpha:testfs Syncing... All Done [root at posic066 cman_tool]# mount -t gfs /dev/sdb1 /mnt ----------------------------------------------------------------------------------------------------------------------------------------------- The cluster.conf file -------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: gfstest.log.gz Type: application/octet-stream Size: 3924 bytes Desc: not available URL: From amanthei at redhat.com Thu Sep 15 03:39:08 2005 From: amanthei at redhat.com (Adam Manthei) Date: Wed, 14 Sep 2005 22:39:08 -0500 Subject: [Linux-cluster] Switching Master and Slave using lock_gulm In-Reply-To: <43287C63.10909@fnal.gov> References: <43287C63.10909@fnal.gov> Message-ID: <20050915033908.GL2190@redhat.com> On Wed, Sep 14, 2005 at 02:39:15PM -0500, Troy Dawson wrote: > So the question is, is there a command in gulm_tools (or some other > program) that will allow me to switch the master from one machine to the > other without restarting any of the lock_gulm's. Not really. You have to bring down the lock_gulmd processes in order to control the order. You can try running `gulm_tool switchpending` on the master server. That will force the master into the pending state, the only problem is that it might be the first one to recognize that there is no master and may come back as master again. On the other hand, you may get lucky and have the master switch to the machine that you want to be master. -- Adam Manthei From piotr.kral at coig.katowice.pl Thu Sep 15 16:29:14 2005 From: piotr.kral at coig.katowice.pl (Piotr Kral) Date: Thu, 15 Sep 2005 18:29:14 +0200 Subject: [Linux-cluster] Cluster dilemas... Message-ID: <4329A15A.50500@coig.katowice.pl> Hi I have IBM Blade Center connected to IBM DS 4300 (former FASTt 600) disk system. The blades are diksless machines and they are connected to storage via fc switches, so I have 4 physical patchs to each disk. My dilemma is as follows: I wont to run Red Hat cluster suite on at least two servers and since I'm building systems from the beginning I'd rather use RHEL4 since it's newer, faster, more reliable (add more marketing bullshit here) etc. 
But HBA drivers that support multipath fail over are only in RHEL3 on RHEL4 there is something called dm + device-mapper-multipath but it is in beta stage :( So I'd like to ask You three things: 1. Is Red Hat cluster suite much different in RHEL3 then in RHEL4. Because if there is no big difference maybay it's not worth to fight with RHEL4 HBA drivers and just install RHEL3, and when RHEL4 will have proper drivers just upgrade? 2. Do You have any experience with "dm + device-mapper-multipath". To be sure this "beta stage" worries me a lot. But maybay You have some information when it will be in official release of RHEL4 (U2??), and most important: is it stable solution, and can it work in "production sewers" 3. What do You think of solution RHEL4 without multipath failover (so I'll see 4 paths, and in case of primary path failure at least one node will crush) + Red Hat cluster suite? I'm thinking off it in case problems with HBA drivers and some new functionality in Red Hat cluster suite in RHEL4 Kind Regards Piotr From danwest at comcast.net Fri Sep 16 00:11:55 2005 From: danwest at comcast.net (danwest) Date: Thu, 15 Sep 2005 20:11:55 -0400 Subject: [Linux-cluster] cluster.conf configuration command line Message-ID: <1126829515.11500.8.camel@linux.site> I was wondering if anyone knows if there are command line tools to configure cluster.conf? I see that redhat has a python gui as part of their RHEL4 system-config-cluster package but I can't seem to find a command line equivalent like the "redhat-config-cluster-cmd" that was part of the RHEL3 RHCS. Thanks - Daniel From pcaulfie at redhat.com Fri Sep 16 06:51:38 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 16 Sep 2005 07:51:38 +0100 Subject: [Linux-cluster] cluster.conf configuration command line In-Reply-To: <1126829515.11500.8.camel@linux.site> References: <1126829515.11500.8.camel@linux.site> Message-ID: <432A6B7A.3000208@redhat.com> danwest wrote: > I was wondering if anyone knows if there are command line tools to > configure cluster.conf? I see that redhat has a python gui as part of > their RHEL4 system-config-cluster package but I can't seem to find a > command line equivalent like the "redhat-config-cluster-cmd" that was > part of the RHEL3 RHCS. > ccs_tool has some commands for manipulating the cluster.conf file. I'm not sure if any packaging has picked them up yet but it's in CVS HEAD & STABLE. -- patrick From dawson at fnal.gov Fri Sep 16 13:16:50 2005 From: dawson at fnal.gov (Troy Dawson) Date: Fri, 16 Sep 2005 08:16:50 -0500 Subject: [Linux-cluster] Switching Master and Slave using lock_gulm In-Reply-To: <20050915033908.GL2190@redhat.com> References: <43287C63.10909@fnal.gov> <20050915033908.GL2190@redhat.com> Message-ID: <432AC5C2.1090800@fnal.gov> Adam Manthei wrote: > On Wed, Sep 14, 2005 at 02:39:15PM -0500, Troy Dawson wrote: > >>So the question is, is there a command in gulm_tools (or some other >>program) that will allow me to switch the master from one machine to the >>other without restarting any of the lock_gulm's. > > > Not really. You have to bring down the lock_gulmd processes in order to > control the order. You can try running `gulm_tool switchpending` on the > master server. That will force the master into the pending state, the only > problem is that it might be the first one to recognize that there is no > master and may come back as master again. On the other hand, you may get > lucky and have the master switch to the machine that you want to be master. > Thanks, that worked. 
The one I want to be master has a vote of 5 while the other only have a vote of 1, so that might have been it. Then again, I might have just been lucky and the right one grabbed the master slot. Either way, it worked. Thanks Again, Troy From ocrete at max-t.com Fri Sep 16 16:08:38 2005 From: ocrete at max-t.com (Olivier Crete) Date: Fri, 16 Sep 2005 12:08:38 -0400 Subject: [Linux-cluster] zero vote node with cman In-Reply-To: <1124466684.12024.58.camel@cocagne.max-t.internal> References: <1124400750.12024.52.camel@cocagne.max-t.internal> <4305881E.5000106@redhat.com> <1124466684.12024.58.camel@cocagne.max-t.internal> Message-ID: <1126886918.25404.6.camel@cocagne.max-t.internal> On Fri, 2005-19-08 at 11:51 -0400, Olivier Crete wrote: > On Fri, 2005-19-08 at 08:19 +0100, Patrick Caulfield wrote: > > Olivier Crete wrote: > > > I tried setting the votes to 0, but it seems that it wont let me do it.. > > > Is there another solution? > > > > It seems to be a bug in cman_tool that's overriding the votes rather > > over-enthusiastically. > > > > This patch should fix: > > Actually it doesnt.. it sets the default to 0... the attached patch > seems to work better. The cluster.ng relaxng schema included in the system-config-cluster package refuses zero votes, I've attached a patch that fixes that. -- Olivier Cr?te ocrete at max-t.com Maximum Throughput Inc. -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster-relaxng-zerovotes.patch Type: text/x-patch Size: 337 bytes Desc: not available URL: From jss at ast.cam.ac.uk Fri Sep 16 09:38:46 2005 From: jss at ast.cam.ac.uk (Jeremy Sanders) Date: Fri, 16 Sep 2005 10:38:46 +0100 (BST) Subject: [Linux-cluster] gnbd root device Message-ID: Hi - We've been trying to use gnbd for the root devices of diskless linux systems booted from the network. We're not using gfs. gnbd is used as a plain block device. We use an initrd to start up the networking and connect to the gnbd server. This seems to work fairly well, except the reconnection doesn't appear to work if the server is rebooted. The client hangs indefinitely. gnbd is run from a ram disk. Does anyone know any fundamental reason why this should be a problem? Does anyone have a working setup like this? Thanks very much! Jeremy PS please CC me as I'm not on the mailing list -- Jeremy Sanders http://www-xray.ast.cam.ac.uk/~jss/ X-Ray Group, Institute of Astronomy, University of Cambridge, UK. Public Key Server PGP Key ID: E1AAE053 From sanelson at gmail.com Sat Sep 17 19:55:29 2005 From: sanelson at gmail.com (Steve Nelson) Date: Sat, 17 Sep 2005 20:55:29 +0100 Subject: [Linux-cluster] Dodgy Mounting Message-ID: Hello, GFS6.0 // RHEL 3.0 Perfectly normal set-up - assembled pools and cluster archives, got lock_gulmd working, and made mountpoints and entries in /etc/fstab Mount /archive # mount /archive mount: wrong fs type, bad option, bad superblock on /dev/pool/gfs0, or too many mounted file systems What's wrong?! # for i in pool ccsd lock_gulmd; do service $i status; done digex_cca is assembled gfs0 is assembled gfs1 is assembled gfs2 is assembled gfs3 is assembled ccsd (pid 5587) is running... lock_gulmd (pid 5632 5629 5626) is running... 
gulm_master: bundlesmanagment is the master Services: LTPX LT000 /etc/fstab looks like this: /dev/pool/gfs0 /archive gfs defaults 1 2 /dev/pool/gfs1 /redo gfs defaults 1 2 /dev/pool/gfs2 /data gfs defaults 1 2 /dev/pool/gfs3 /backups gfs defaults 1 2 and mount shows this: # mount /dev/cciss/c0d0p3 on / type ext3 (rw) none on /proc type proc (rw) none on /dev/pts type devpts (rw,gid=5,mode=620) /dev/cciss/c0d0p1 on /boot type ext3 (rw) /dev/cciss/c0d0p7 on /local type ext3 (rw) none on /dev/shm type tmpfs (rw) /dev/cciss/c0d0p6 on /tmp type ext3 (rw) /dev/cciss/c0d0p5 on /var type ext3 (rw) Any ideas? What's going on? Thanks! Steve From sanelson at gmail.com Sat Sep 17 20:09:33 2005 From: sanelson at gmail.com (Steve Nelson) Date: Sat, 17 Sep 2005 21:09:33 +0100 Subject: [Linux-cluster] Re: Dodgy Mounting In-Reply-To: References: Message-ID: On 9/17/05, Steve Nelson wrote: > Hello, > > GFS6.0 // RHEL 3.0 > > Perfectly normal set-up - assembled pools and cluster archives, got > lock_gulmd working, and made mountpoints and entries in /etc/fstab > > Mount /archive > > # mount /archive > mount: wrong fs type, bad option, bad superblock on /dev/pool/gfs0, > or too many mounted file systems Ok - so this looks like a generic error - looking further, dmesg shows: GFS: can't mount proto = lock_gulm, table = digex:gfs3, hostdata = lock_gulm: ERROR Core returned error 1003:Bad Cluster ID. lock_gulm: ERROR cm_login failed. 1003 lock_gulm: ERROR Got a 1003 trying to start the threads. lock_gulm: fsid=digex:gfs0: Exiting gulm_mount with errors 1003 etc for various tables. > Steve S. From hardyjm at potsdam.edu Mon Sep 19 13:26:01 2005 From: hardyjm at potsdam.edu (Jeff Hardy) Date: Mon, 19 Sep 2005 09:26:01 -0400 Subject: [Linux-cluster] Basic Scalability Questions Message-ID: <1127136362.23044.138.camel@fritzdesk.potsdam.edu> Hello, I have noted that LVM2 on a 2.6 kernel (in this case Fedora Core 4), has no limit to the maximum number of logical volumes, physical volumes, or physical extents in a particular volume group. Is this also then the case for CLVM? Also, I have a couple of boxes sharing ATAoE storage right now in a two-node cluster configuration. Everything looks good. We are not using GFS as of yet, and do not have immediate plans to do so, so I have been working with XFS and other filesystems in a couple of LVs with each box only mounting one LV at a time (obviously). In the interest of seeing what could happen, I have purposely shutdown all the cluster services on both boxes, resized volumes, done write tests and not seen any filesystem corruption. Have I just been incredibly lucky to avoid particular race conditions? Thanks very much. -- Jeff Hardy Systems Analyst hardyjm at potsdam.edu From mwill at penguincomputing.com Mon Sep 19 17:47:04 2005 From: mwill at penguincomputing.com (Michael Will) Date: Mon, 19 Sep 2005 10:47:04 -0700 Subject: [Linux-cluster] Basic Scalability Questions In-Reply-To: <1127136362.23044.138.camel@fritzdesk.potsdam.edu> References: <1127136362.23044.138.camel@fritzdesk.potsdam.edu> Message-ID: <432EF998.7080301@penguincomputing.com> If you get any personal replies that don't go back to the list, I would be very interested in the result. How far along are those projects like CLVM and others listed on http://sources.redhat.com/cluster/ ? Michael Jeff Hardy wrote: >Hello, > >I have noted that LVM2 on a 2.6 kernel (in this case Fedora Core 4), has >no limit to the maximum number of logical volumes, physical volumes, or >physical extents in a particular volume group. 
Is this also then the >case for CLVM? > >Also, I have a couple of boxes sharing ATAoE storage right now in a >two-node cluster configuration. Everything looks good. We are not >using GFS as of yet, and do not have immediate plans to do so, so I have >been working with XFS and other filesystems in a couple of LVs with each >box only mounting one LV at a time (obviously). In the interest of >seeing what could happen, I have purposely shutdown all the cluster >services on both boxes, resized volumes, done write tests and not seen >any filesystem corruption. Have I just been incredibly lucky to avoid >particular race conditions? > >Thanks very much. > > > -- Michael Will Penguin Computing Corp. Sales Engineer 415-954-2822 415-954-2899 fx mwill at penguincomputing.com From lhh at redhat.com Mon Sep 19 19:27:51 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 19 Sep 2005 15:27:51 -0400 Subject: [Linux-cluster] Basic Scalability Questions In-Reply-To: <432EF998.7080301@penguincomputing.com> References: <1127136362.23044.138.camel@fritzdesk.potsdam.edu> <432EF998.7080301@penguincomputing.com> Message-ID: <1127158071.3696.88.camel@ayanami.boston.redhat.com> On Mon, 2005-09-19 at 10:47 -0700, Michael Will wrote: > If you get any personal replies that don't go back to the list, > I would be very interested in the result. > > How far along are those projects like CLVM and others listed > on http://sources.redhat.com/cluster/ ? Most of them are productized by Red Hat already. Get the STABLE branch, not head though =) -- Lon From hardyjm at potsdam.edu Tue Sep 20 12:18:59 2005 From: hardyjm at potsdam.edu (Jeff Hardy) Date: Tue, 20 Sep 2005 08:18:59 -0400 Subject: [Linux-cluster] Basic Scalability Questions In-Reply-To: <1127158071.3696.88.camel@ayanami.boston.redhat.com> References: <1127136362.23044.138.camel@fritzdesk.potsdam.edu> <432EF998.7080301@penguincomputing.com> <1127158071.3696.88.camel@ayanami.boston.redhat.com> Message-ID: <1127218739.23044.180.camel@fritzdesk.potsdam.edu> I've been using the FC4 RPMs and all looks good. On Mon, 2005-09-19 at 15:27 -0400, Lon Hohberger wrote: > On Mon, 2005-09-19 at 10:47 -0700, Michael Will wrote: > > If you get any personal replies that don't go back to the list, > > I would be very interested in the result. > > > > How far along are those projects like CLVM and others listed > > on http://sources.redhat.com/cluster/ ? > > Most of them are productized by Red Hat already. > > Get the STABLE branch, not head though =) > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ggilyeat at jhsph.edu Tue Sep 20 18:15:39 2005 From: ggilyeat at jhsph.edu (Gerald G. Gilyeat) Date: Tue, 20 Sep 2005 14:15:39 -0400 Subject: [Linux-cluster] Question... Message-ID: We just had a GFS client node crash (and it took one of my compute clusters with it, but I can deal with that) with the following message in /var/log/messages: Sep 20 13:50:12 front-0 kernel: Sep 20 13:50:12 front-0 kernel: GFS: Assertion failed on line 200 of file trans.c Sep 20 13:50:12 front-0 kernel: GFS: assertion: "!gfs_get_transaction(sdp)" Sep 20 13:50:12 front-0 kernel: GFS: time = 1127238612 Sep 20 13:50:12 front-0 kernel: GFS: fsid=hopkins:bst.2 Sep 20 13:50:12 front-0 kernel: Sep 20 13:50:12 front-0 kernel: Kernel panic: GFS: Record message above and reboot. The other GFS clients and the server are fine. 
Any chance someone could give me an idea on why there'd be a failure here, so I can have a better idea on what to tune on the system? We recently (ie. yesterday) bumped the number of NFS server processes on this machine from 8 to 24, if that will help... Thanks. -- Jerry Gilyeat, RHCE Systems Administrator Molecular Microbiology and Immunology Johns Hopkins Bloomberg School of Public Health -------------- next part -------------- An HTML attachment was scrubbed... URL: From bmarzins at redhat.com Tue Sep 20 20:26:35 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Tue, 20 Sep 2005 15:26:35 -0500 Subject: [Linux-cluster] Question... In-Reply-To: References: Message-ID: <20050920202634.GB25146@phlogiston.msp.redhat.com> You didn't perhaps get a stack trace along with that message, did you? That would go a long way in figuring out what exactly went wrong. But here's a wild stab in the dark. Do you know if a suid root file was being copied to your gfs file system? That has caused a similar error on other versions of gfs (although not with nfs). -Ben On Tue, Sep 20, 2005 at 02:15:39PM -0400, Gerald G. Gilyeat wrote: > We just had a GFS client node crash (and it took one of my compute > clusters with it, but I can deal with that) with the following message in > /var/log/messages: > Sep 20 13:50:12 front-0 kernel: > Sep 20 13:50:12 front-0 kernel: GFS: Assertion failed on line 200 of file > trans.c > Sep 20 13:50:12 front-0 kernel: GFS: assertion: > "!gfs_get_transaction(sdp)" > Sep 20 13:50:12 front-0 kernel: GFS: time = 1127238612 > Sep 20 13:50:12 front-0 kernel: GFS: fsid=hopkins:bst.2 > Sep 20 13:50:12 front-0 kernel: > Sep 20 13:50:12 front-0 kernel: Kernel panic: GFS: Record message above > and reboot. > > The other GFS clients and the server are fine. > Any chance someone could give me an idea on why there'd be a failure here, > so I can have a better idea on what to tune on the system? > > We recently (ie. yesterday) bumped the number of NFS server processes on > this machine from 8 to 24, if that will help... > > Thanks. > > -- > Jerry Gilyeat, RHCE > Systems Administrator > Molecular Microbiology and Immunology > Johns Hopkins Bloomberg School of Public Health > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ggilyeat at jhsph.edu Wed Sep 21 13:35:18 2005 From: ggilyeat at jhsph.edu (Gerald G. Gilyeat) Date: Wed, 21 Sep 2005 09:35:18 -0400 Subject: [Linux-cluster] Question... Message-ID: Nope, no stack trace was produced that I was able to see. And yeah, I would have preferred to have one, myself. I'll check into the suid thing. It's entirely possible one of my users did something like that, since many of them have root on their desktop linux machines. -- Jerry Gilyeat, RHCE Systems Administrator Molecular Microbiology and Immunology Johns Hopkins Bloomberg School of Public Health -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Benjamin Marzinski Sent: Tue 9/20/2005 4:26 PM To: linux clustering Subject: Re: [Linux-cluster] Question... You didn't perhaps get a stack trace along with that message, did you? That would go a long way in figuring out what exactly went wrong. But here's a wild stab in the dark. Do you know if a suid root file was being copied to your gfs file system? That has caused a similar error on other versions of gfs (although not with nfs). -Ben On Tue, Sep 20, 2005 at 02:15:39PM -0400, Gerald G. 
Gilyeat wrote: > We just had a GFS client node crash (and it took one of my compute > clusters with it, but I can deal with that) with the following message in > /var/log/messages: > Sep 20 13:50:12 front-0 kernel: > Sep 20 13:50:12 front-0 kernel: GFS: Assertion failed on line 200 of file > trans.c > Sep 20 13:50:12 front-0 kernel: GFS: assertion: > "!gfs_get_transaction(sdp)" > Sep 20 13:50:12 front-0 kernel: GFS: time = 1127238612 > Sep 20 13:50:12 front-0 kernel: GFS: fsid=hopkins:bst.2 > Sep 20 13:50:12 front-0 kernel: > Sep 20 13:50:12 front-0 kernel: Kernel panic: GFS: Record message above > and reboot. > > The other GFS clients and the server are fine. > Any chance someone could give me an idea on why there'd be a failure here, > so I can have a better idea on what to tune on the system? > > We recently (ie. yesterday) bumped the number of NFS server processes on > this machine from 8 to 24, if that will help... > > Thanks. > > -- > Jerry Gilyeat, RHCE > Systems Administrator > Molecular Microbiology and Immunology > Johns Hopkins Bloomberg School of Public Health > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3692 bytes Desc: not available URL: From haydar2906 at hotmail.com Wed Sep 21 15:03:57 2005 From: haydar2906 at hotmail.com (Abbes Bettahar) Date: Wed, 21 Sep 2005 11:03:57 -0400 Subject: [Linux-cluster] Clustered NFS problem Message-ID: Hi, We have 2 servers HP Proliant 380 G3 (RedHat Advanced Server 3) attached by fiber optic to the storage area network SAN HP MSA1000 and we want to install and configure The RedHat Cluster Suite. I setuped and configured a clustered NFS on the 2 servers RAC1 and RACGFS. clumanager-1.2.26.1-1 redhat-config-cluster-1.0.7-1 I have created 2 quorum partitions /dev/sdd2 and /dev/sdd3 (100MB each). I created another huge partition /dev/sdd4 (over 600GB) and formatted it in ext3 file system. I installed the cluster suite on the 1st node (RAC1) and 2nd node RACGFS and I started the rawdevices on the two nodes RAC1 and RACGFS (it's OK). This the hosts file /etc/host on the node1 (RAC1) and node2 RACGFS Do not remove the following line, or various programs # that require network functionality will fail. #127.0.0.1 rac1 localhost.localdomain localhost 127.0.0.1 localhost.localdomain localhost # # Private hostnames # 192.168.253.3 rac1.project.net rac1 192.168.253.4 rac2.project.net rac2 192.168.253.10 racgfs.project.net racgfs 192.168.253.20 raclu_nfs.project.net raclu_nfs # # Hostnames used for Interconnect # 1.1.1.1 rac1i.project.net rac1i 1.1.1.2 rac2i.project.net rac2i 1.1.1.3 racgfsi.project.net racgfsi # 192.168.253.5 infra.project.net infra 192.168.253.7 ractest.project.net ractest # I generated a /etc/cluster.xml on the 1st node RAC1 and the 2nd node RACGFS. I created a NFS share on /u04 (mount on /dev/sdd4) using the Cluster GUI manager on RAC1. 
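For readers reproducing a setup like this: on RHEL 3 the two quorum partitions are made available to clumanager through raw device bindings declared in /etc/sysconfig/rawdevices on both nodes. A minimal sketch, with the block device names taken from the description above and the raw1/raw2 numbering assumed:

   # /etc/sysconfig/rawdevices -- raw device first, then the block device
   /dev/raw/raw1 /dev/sdd2
   /dev/raw/raw2 /dev/sdd3

   service rawdevices restart
   raw -qa    # confirm both bindings are active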
I launched on the 2 nodes Rac1 and RACgfs the following command: service clumanager start I checked the result on the 2 nodes, on RAC1: clustat results : Cluster Status - project 09:04:34 Cluster Quorum Incarnation #1 Shared State: Shared Raw Device Driver v1.2 Member Status ------------------ ---------- 192.168.253.3 Active <-- You are here 192.168.253.10 Active Service Status Owner (Last) Last Transition Chk Restarts -------------- -------- ---------------- --------------- --- -------- nfs_cisn started 192.168.253.3 09:07:59 Sep 21 5 0 on RacGfs: clustat results : Cluster Status - cisn 09:07:39 Cluster Quorum Incarnation #3 Shared State: Shared Raw Device Driver v1.2 Member Status ------------------ ---------- 192.168.253.3 Active 192.168.253.10 Active <-- You are here Service Status Owner (Last) Last Transition Chk Restarts -------------- -------- ---------------- --------------- --- -------- nfs_cisn started 192.168.253.3 09:07:59 Sep 21 5 0 When I launched ifconfig on RAC1, we saw that the service IP address 192.168.253.20 is generated on eth2:0. And I launched on other servers the following command: mount ?t nfs 192.168.253.20:/u04 /u04 And all are OK, I can list the /u04 content from any server. But my only problem is: When I want to try a test if the clustered NFS will work fine, I rebooted RAC1 frequently and RACGFS continue to work as the failover server and when I launched ifconfig on RACGFS, we saw that the service IP address 192.168.253.20 is generated on eth0:0 . We can list /u04 content (clustered NFS mount) on the other servers after few seconds of RAC1 rebooting: But after many reboots, I expect a big problem, the both cluster node servers cannot obtain the service IP address 192.168.253.20 when I launch ifconfig on the both nodes. 
On Rac1: eth0 Link encap:Ethernet HWaddr 00:0B:CD:EF:2B:C1 inet addr:1.1.1.1 Bcast:1.1.1.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:89170 errors:0 dropped:0 overruns:0 frame:0 TX packets:87405 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:17288193 (16.4 Mb) TX bytes:14452757 (13.7 Mb) Interrupt:15 eth2 Link encap:Ethernet HWaddr 00:0B:CD:FF:44:02 inet addr:192.168.253.3 Bcast:192.168.253.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:1349991 errors:0 dropped:0 overruns:0 frame:0 TX packets:435450 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:1592635536 (1518.8 Mb) TX bytes:162026101 (154.5 Mb) Interrupt:7 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:1001181 errors:0 dropped:0 overruns:0 frame:0 TX packets:1001181 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:76097441 (72.5 Mb) TX bytes:76097441 (72.5 Mb) On RACGFS: eth0 Link encap:Ethernet HWaddr 00:14:38:50:D3:E4 inet addr:192.168.253.10 Bcast:192.168.253.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:211223 errors:0 dropped:0 overruns:0 frame:0 TX packets:160026 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:14917480 (14.2 Mb) TX bytes:13886063 (13.2 Mb) Interrupt:25 eth1 Link encap:Ethernet HWaddr 00:14:38:50:D3:E3 inet addr:1.1.1.3 Bcast:1.1.1.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:4 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 b) TX bytes:256 (256.0 b) Interrupt:26 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:184529 errors:0 dropped:0 overruns:0 frame:0 TX packets:184529 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:10971489 (10.4 Mb) TX bytes:10971489 (10.4 Mb) I tried many commands, I stopped the cluster services On both nodes and restart it but unfortunately it doesn?t work and we cannot obtain the clustered NFS mount. Have you any idea to fix this problem? Thanks for your replies and help Abbes Bettahar 514-296-0756 From Alain.Moulle at bull.net Thu Sep 22 06:15:42 2005 From: Alain.Moulle at bull.net (Alain Moulle) Date: Thu, 22 Sep 2005 08:15:42 +0200 Subject: [Linux-cluster] Some questions about heart-beat under Cluster Suite 4 Message-ID: <43324C0E.8010005@bull.net> Hi everybody Some questions about heart-beat under Cluster Suite 4 : 1. how is choosen the eth interface under CS4 ? if we have eth0 and eth1, it seems that HearBeat goes through eth0 and we don't have the possibility to configure this in CS4 , right ? 2. does that mean also that if eth0 fails, the CS4 automatically goes through eth1 ? 3. do I miss the way to configure this via GUI ? Thanks a lot Alain -- mailto:Alain.Moulle at bull.net From andreseso at gmail.com Thu Sep 22 08:17:13 2005 From: andreseso at gmail.com (Andreso) Date: Thu, 22 Sep 2005 10:17:13 +0200 Subject: [Linux-cluster] How do I use a cross over cable to set up quorum? 
Message-ID: <1a9f416905092201175c6eb118@mail.gmail.com> Hello, I am trying to set up a cluster with clumanager-1.2.3-1 and clumanager-1.2.3-1 and my boss desires a working quorum where the two members of the cluster use a cross over cable to interchange the necessary information for quorum between the two cluster members. The reason is that the company uses faulty switches so he considers using the main network interfaces of the servers not acceptable. I would like to know how I can obtain cluster quorum using the cross over cable. I have succesfully set up the cluster using the main network interface but I have failed miserably setting up the cluster quorum over the cross over cable network interface. I believe I have a routing problem. The route -n information for the cross over cable interface is Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 10.0.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1 I am far from being a linux expert and in some things like clusters I am a newbie and I am not particularily strong on networking. For example until I started setting up the cluster I had never heard of the concepts tiebreaker IP or Multicast IP address. Any help would be appreciated. If somebody has managed to make this work could they please post their cluster.xml file? From Axel.Thimm at ATrpms.net Thu Sep 22 08:32:37 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Thu, 22 Sep 2005 10:32:37 +0200 Subject: [Linux-cluster] GFS breaking POSIX exhibited by Samba? Message-ID: <20050922083237.GD3466@neu.nirvana> Hi, I've been stuggling with a strange bug in Samba which required me to have some of the tdb files with permissions 0666 to allow Samba to work. The Samba metadata (locking and connection tables etc) are placed on GFS to allow for easier relocation of the Samba services ("poor man's clustered samba"). The problem is that Samba opens some files as root, then drops priviledges and finally accesses these files assuming that the root access rights are still in order. This does not work under GFS, but under any other local fs. The Samba developers claim that this is POSIX compliant and that GFS is not following POSIX in this matter. Is this true? Does POSIX require the fds to not change access priviledges even when setuiding to another user? If so, why doesn't GFS respect this? A bug or a feature? If the former I'll go and bugzilla it. If the latter, can there be a fix for the RHEL4 branch? Thanks! -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From andreseso at gmail.com Thu Sep 22 09:44:17 2005 From: andreseso at gmail.com (Andreso) Date: Thu, 22 Sep 2005 11:44:17 +0200 Subject: [Linux-cluster] Re: How do I use a cross over cable to set up quorum? In-Reply-To: <1a9f416905092201175c6eb118@mail.gmail.com> References: <1a9f416905092201175c6eb118@mail.gmail.com> Message-ID: <1a9f41690509220244da6fed5@mail.gmail.com> Sorry, I forgot to mention that I am running Centos 3.5 which is equivalent to RHEL AS 3.5. A working cluster.xml using the main network interface is attached Andres On 9/22/05, Andreso wrote: > Hello, > > I am trying to set up a cluster with clumanager-1.2.3-1 and > clumanager-1.2.3-1 and my boss desires a working quorum where the two > members of the cluster use a cross over cable to interchange the > necessary information for quorum between the two cluster members. 
The > reason is that the company uses faulty switches so he considers using > the main network interfaces of the servers not acceptable. > > I would like to know how I can obtain cluster quorum using the cross > over cable. I have succesfully set up the cluster using the main > network interface but I have failed miserably setting up the cluster > quorum over the cross over cable network interface. I believe I have > a routing problem. > > The route -n information for the cross over cable interface is > Kernel IP routing table > Destination Gateway Genmask Flags Metric Ref Use Iface > 10.0.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1 > > I am far from being a linux expert and in some things like clusters I > am a newbie and I am not particularily strong on networking. For > example until I started setting up the cluster I had never heard of > the concepts tiebreaker IP or Multicast IP address. > > Any help would be appreciated. > > If somebody has managed to make this work could they please post their > cluster.xml file? > -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.xml-funciona Type: application/octet-stream Size: 1916 bytes Desc: not available URL: From hernando.garcia at gmail.com Thu Sep 22 10:08:36 2005 From: hernando.garcia at gmail.com (Hernando Garcia) Date: Thu, 22 Sep 2005 11:08:36 +0100 Subject: [Linux-cluster] Clustered NFS problem In-Reply-To: References: Message-ID: <1127383717.4275.11.camel@hgarcia.surrey.redhat.com> It would be better for you to officially open a call with Red Hat Support directly. They will be able to help you with the issue. When the call is open, make sure you provide with both sysreport of the cluster nodes. On Wed, 2005-09-21 at 11:03 -0400, Abbes Bettahar wrote: > Hi, > > We have 2 servers HP Proliant 380 G3 (RedHat Advanced Server 3) attached > by fiber optic to the storage area network SAN HP MSA1000 and we want to > install and configure The RedHat Cluster Suite. > > I setuped and configured a clustered NFS on the 2 servers RAC1 and RACGFS. > > clumanager-1.2.26.1-1 > redhat-config-cluster-1.0.7-1 > > I have created 2 quorum partitions /dev/sdd2 and /dev/sdd3 (100MB each). > > I created another huge partition /dev/sdd4 (over 600GB) and formatted it in > ext3 file system. > > I installed the cluster suite on the 1st node (RAC1) and 2nd node RACGFS and > I started the rawdevices on the two nodes RAC1 and RACGFS (it's OK). > > This the hosts file /etc/host on the node1 (RAC1) and node2 RACGFS > > Do not remove the following line, or various programs > # that require network functionality will fail. > #127.0.0.1 rac1 localhost.localdomain localhost > 127.0.0.1 localhost.localdomain localhost > # > # Private hostnames > # > 192.168.253.3 rac1.project.net rac1 > 192.168.253.4 rac2.project.net rac2 > 192.168.253.10 racgfs.project.net racgfs > 192.168.253.20 raclu_nfs.project.net raclu_nfs > # > # Hostnames used for Interconnect > # > 1.1.1.1 rac1i.project.net rac1i > 1.1.1.2 rac2i.project.net rac2i > 1.1.1.3 racgfsi.project.net racgfsi > # > 192.168.253.5 infra.project.net infra > 192.168.253.7 ractest.project.net ractest > # > > I generated a /etc/cluster.xml on the 1st node RAC1 and the 2nd node RACGFS. 
> > > > multicast_ipaddress="225.0.0.11" thread="yes" tko_count="20"/> > > > > > name="cisn"/> > rawshadow="/dev/raw/raw2" type="raw"/> > > > > > > maxfalsestarts="0" maxrestarts="0" name="nfs_cisn" userscript="None"> > > ipaddress="192.168.253.20" monitor_link="0" netmask="255.255.255.0"/> > > > > > > > > > > > > > > > > > > I created a NFS share on /u04 (mount on /dev/sdd4) using the Cluster GUI > manager on RAC1. > I launched on the 2 nodes Rac1 and RACgfs the following command: > service clumanager start > > I checked the result on the 2 nodes, on RAC1: > > clustat results : > > Cluster Status - project > 09:04:34 > Cluster Quorum Incarnation #1 > Shared State: Shared Raw Device Driver v1.2 > > Member Status > ------------------ ---------- > 192.168.253.3 Active <-- You are here > 192.168.253.10 Active > > Service Status Owner (Last) Last Transition Chk Restarts > -------------- -------- ---------------- --------------- --- -------- > nfs_cisn started 192.168.253.3 09:07:59 Sep 21 5 0 > > > on RacGfs: clustat results : > > Cluster Status - cisn > 09:07:39 > Cluster Quorum Incarnation #3 > Shared State: Shared Raw Device Driver v1.2 > > Member Status > ------------------ ---------- > 192.168.253.3 Active > 192.168.253.10 Active <-- You are here > > Service Status Owner (Last) Last Transition Chk Restarts > -------------- -------- ---------------- --------------- --- -------- > nfs_cisn started 192.168.253.3 09:07:59 Sep 21 5 0 > > > > When I launched ifconfig on RAC1, we saw that the service IP address > 192.168.253.20 is generated on eth2:0. > > And I launched on other servers the following command: > mount t nfs 192.168.253.20:/u04 /u04 > > And all are OK, I can list the /u04 content from any server. > > But my only problem is: > > When I want to try a test if the clustered NFS will work fine, I rebooted > RAC1 frequently and RACGFS continue to work as the failover server and when > I launched ifconfig on RACGFS, we saw that the service IP address > 192.168.253.20 is generated on eth0:0 . > We can list /u04 content (clustered NFS mount) on the other servers after > few seconds of RAC1 rebooting: > > But after many reboots, I expect a big problem, the both cluster node > servers cannot obtain the service IP address 192.168.253.20 when I launch > ifconfig on the both nodes. 
> > On Rac1: > > eth0 Link encap:Ethernet HWaddr 00:0B:CD:EF:2B:C1 > inet addr:1.1.1.1 Bcast:1.1.1.255 Mask:255.255.255.0 > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:89170 errors:0 dropped:0 overruns:0 frame:0 > TX packets:87405 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:17288193 (16.4 Mb) TX bytes:14452757 (13.7 Mb) > Interrupt:15 > > eth2 Link encap:Ethernet HWaddr 00:0B:CD:FF:44:02 > inet addr:192.168.253.3 Bcast:192.168.253.255 Mask:255.255.255.0 > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:1349991 errors:0 dropped:0 overruns:0 frame:0 > TX packets:435450 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:1592635536 (1518.8 Mb) TX bytes:162026101 (154.5 Mb) > Interrupt:7 > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:1001181 errors:0 dropped:0 overruns:0 frame:0 > TX packets:1001181 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:76097441 (72.5 Mb) TX bytes:76097441 (72.5 Mb) > > On RACGFS: > > eth0 Link encap:Ethernet HWaddr 00:14:38:50:D3:E4 > inet addr:192.168.253.10 Bcast:192.168.253.255 > Mask:255.255.255.0 > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:211223 errors:0 dropped:0 overruns:0 frame:0 > TX packets:160026 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:14917480 (14.2 Mb) TX bytes:13886063 (13.2 Mb) > Interrupt:25 > > eth1 Link encap:Ethernet HWaddr 00:14:38:50:D3:E3 > inet addr:1.1.1.3 Bcast:1.1.1.255 Mask:255.255.255.0 > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:4 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:0 (0.0 b) TX bytes:256 (256.0 b) > Interrupt:26 > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:184529 errors:0 dropped:0 overruns:0 frame:0 > TX packets:184529 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:10971489 (10.4 Mb) TX bytes:10971489 (10.4 Mb) > > I tried many commands, I stopped the cluster services On both nodes and > restart it but unfortunately it doesnt work and we cannot obtain the > clustered NFS mount. > > > Have you any idea to fix this problem? > > Thanks for your replies and help > > Abbes Bettahar > 514-296-0756 > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From bmarzins at redhat.com Thu Sep 22 13:22:21 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Thu, 22 Sep 2005 08:22:21 -0500 Subject: [Linux-cluster] GFS breaking POSIX exhibited by Samba? In-Reply-To: <20050922083237.GD3466@neu.nirvana> References: <20050922083237.GD3466@neu.nirvana> Message-ID: <20050922132221.GA2123@phlogiston.msp.redhat.com> On Thu, Sep 22, 2005 at 10:32:37AM +0200, Axel Thimm wrote: > Hi, > > I've been stuggling with a strange bug in Samba which required me to > have some of the tdb files with permissions 0666 to allow Samba to > work. > > The Samba metadata (locking and connection tables etc) are placed on > GFS to allow for easier relocation of the Samba services ("poor man's > clustered samba"). > > The problem is that Samba opens some files as root, then drops > priviledges and finally accesses these files assuming that the root > access rights are still in order. 
This does not work under GFS, but > under any other local fs. > > The Samba developers claim that this is POSIX compliant and that GFS > is not following POSIX in this matter. > > Is this true? Does POSIX require the fds to not change access > priviledges even when setuiding to another user? Yes, apparently it does. > If so, why doesn't GFS respect this? A bug or a feature? If the former > I'll go and bugzilla it. If the latter, can there be a fix for the > RHEL4 branch? Since GFS is not complying with POSIX, I'd call this a bug. Please, go bugzilla it. -Ben > Thanks! > -- > Axel.Thimm at ATrpms.net > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From Axel.Thimm at ATrpms.net Thu Sep 22 13:38:14 2005 From: Axel.Thimm at ATrpms.net (Axel Thimm) Date: Thu, 22 Sep 2005 15:38:14 +0200 Subject: [Linux-cluster] Re: GFS breaking POSIX exhibited by Samba? In-Reply-To: <20050922132221.GA2123@phlogiston.msp.redhat.com> References: <20050922083237.GD3466@neu.nirvana> <20050922132221.GA2123@phlogiston.msp.redhat.com> Message-ID: <20050922133814.GB25543@neu.nirvana> On Thu, Sep 22, 2005 at 08:22:21AM -0500, Benjamin Marzinski wrote: > On Thu, Sep 22, 2005 at 10:32:37AM +0200, Axel Thimm wrote: > > I've been stuggling with a strange bug in Samba which required me to > > have some of the tdb files with permissions 0666 to allow Samba to > > work. > > > > The Samba metadata (locking and connection tables etc) are placed on > > GFS to allow for easier relocation of the Samba services ("poor man's > > clustered samba"). > > > > The problem is that Samba opens some files as root, then drops > > priviledges and finally accesses these files assuming that the root > > access rights are still in order. This does not work under GFS, but > > under any other local fs. > > > > The Samba developers claim that this is POSIX compliant and that GFS > > is not following POSIX in this matter. > > > > Is this true? Does POSIX require the fds to not change access > > priviledges even when setuiding to another user? > > Yes, apparently it does. > > > If so, why doesn't GFS respect this? A bug or a feature? If the former > > I'll go and bugzilla it. If the latter, can there be a fix for the > > RHEL4 branch? > > Since GFS is not complying with POSIX, I'd call this a bug. Please, go > bugzilla it. OK, here it is https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169039 -- Axel.Thimm at ATrpms.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From lhh at redhat.com Thu Sep 22 14:20:25 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 22 Sep 2005 10:20:25 -0400 Subject: [Linux-cluster] Some questions about heart-beat under Cluster Suite 4 In-Reply-To: <43324C0E.8010005@bull.net> References: <43324C0E.8010005@bull.net> Message-ID: <1127398825.22106.153.camel@ayanami.boston.redhat.com> On Thu, 2005-09-22 at 08:15 +0200, Alain Moulle wrote: > Hi everybody > > Some questions about heart-beat under Cluster Suite 4 : > > 1. how is choosen the eth interface under CS4 ? > if we have eth0 and eth1, it seems that HearBeat > goes through eth0 and we don't have the possibility > to configure this in CS4 , right ? > > 2. does that mean also that if eth0 fails, the CS4 > automatically goes through eth1 ? > > 3. do I miss the way to configure this via GUI ? 
One easy way to do this is to use bonded interfaces, which will provide failover transparent to the cluster. -- Lon From lhh at redhat.com Thu Sep 22 14:28:17 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 22 Sep 2005 10:28:17 -0400 Subject: [Linux-cluster] GFS breaking POSIX exhibited by Samba? In-Reply-To: <20050922083237.GD3466@neu.nirvana> References: <20050922083237.GD3466@neu.nirvana> Message-ID: <1127399297.22106.162.camel@ayanami.boston.redhat.com> On Thu, 2005-09-22 at 10:32 +0200, Axel Thimm wrote: > Hi, > > I've been stuggling with a strange bug in Samba which required me to > have some of the tdb files with permissions 0666 to allow Samba to > work. > > The Samba metadata (locking and connection tables etc) are placed on > GFS to allow for easier relocation of the Samba services ("poor man's > clustered samba"). > > The problem is that Samba opens some files as root, then drops > priviledges and finally accesses these files assuming that the root > access rights are still in order. This does not work under GFS, but > under any other local fs. > > The Samba developers claim that this is POSIX compliant and that GFS > is not following POSIX in this matter. > > Is this true? Does POSIX require the fds to not change access > priviledges even when setuiding to another user? > > If so, why doesn't GFS respect this? A bug or a feature? If the former > I'll go and bugzilla it. If the latter, can there be a fix for the > RHEL4 branch? Definitely file a bugzilla. It's something that we definitely need to look into. Off the top of my head, it sounds like GFS is, indeed, not respecting access rights (and therefore, POSIX) -- but oddly enough, it may be intentional. Cluster semantics don't always line up with POSIX semantics. -- Lon From lhh at redhat.com Thu Sep 22 14:29:46 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 22 Sep 2005 10:29:46 -0400 Subject: [Linux-cluster] GFS breaking POSIX exhibited by Samba? In-Reply-To: <1127399297.22106.162.camel@ayanami.boston.redhat.com> References: <20050922083237.GD3466@neu.nirvana> <1127399297.22106.162.camel@ayanami.boston.redhat.com> Message-ID: <1127399386.22106.165.camel@ayanami.boston.redhat.com> On Thu, 2005-09-22 at 10:28 -0400, Lon Hohberger wrote: > It's something that we definitely need to look into. Off the top of my > head, it sounds like GFS is, indeed, not respecting access rights (and > therefore, POSIX) -- but oddly enough, it may be intentional. Cluster > semantics don't always line up with POSIX semantics. Ben beat me to it. I'm just not on my game today. -- Lon From lhh at redhat.com Thu Sep 22 14:32:35 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 22 Sep 2005 10:32:35 -0400 Subject: [Linux-cluster] Re: How do I use a cross over cable to set up quorum? In-Reply-To: <1a9f41690509220244da6fed5@mail.gmail.com> References: <1a9f416905092201175c6eb118@mail.gmail.com> <1a9f41690509220244da6fed5@mail.gmail.com> Message-ID: <1127399555.22106.169.camel@ayanami.boston.redhat.com> On Thu, 2005-09-22 at 11:44 +0200, Andreso wrote: > Sorry, I forgot to mention that I am running Centos 3.5 which is > equivalent to RHEL AS 3.5. > > A working cluster.xml using the main network interface is attached * Use 10.0.0.x (the IPs assigned to the interfaces using the crossover cable) as your cluster member names. * Use broadcast heartbeating. * Set broadcast-primary-only (see man cludb) * Use the disk based tiebreaker. DO NOT use the IP tiebreaker. 
-- Lon From Robert.Olsson at mobeon.com Thu Sep 22 14:52:29 2005 From: Robert.Olsson at mobeon.com (Robert Olsson) Date: Thu, 22 Sep 2005 16:52:29 +0200 Subject: [Linux-cluster] High availability mail system Message-ID: <9B488A5E8C00084C82DB19AF2E713C22F37BA0@vale.MOBEON.COM> Im trying to put up a high availability mail system with high performance who support up to 300 000 mailboxes using a linux cluster. I have looked around for open source cluster solution, but so far only found solution like SAN for shared filesystem with high performance. Any suggestion how to solve the issues using open source software? The system should support - Mailbox replication *- Mailbox synchronization *- Redundancy both in hardware and software - The system is to be build with low cost computers - One shared filesystem without external storage like SAN - Scalable /Robert Olsson -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Thu Sep 22 15:19:27 2005 From: Alain.Moulle at bull.net (Alain Moulle) Date: Thu, 22 Sep 2005 17:19:27 +0200 Subject: [Linux-cluster] Some questions about heart-beat under Cluster Suite 4 In-Reply-To: <1127398825.22106.153.camel@ayanami.boston.redhat.com> References: <43324C0E.8010005@bull.net> <1127398825.22106.153.camel@ayanami.boston.redhat.com> Message-ID: <4332CB7F.3010304@bull.net> Hi Thanks. Effectively,I saw this in documentation but in fact, my question was more precisely : if we have two eth interfaces eth0 and eth1 , and we don't want the heart beat goes through eth1 in any case, even if the CS4 has to failover in case of eth0 failure, is there a way in CS4 configuration to avoid eth1 ? or do we have to disable Broadcast (or Multicast if choosen in CS4 configuration) at the eth interface configuration ? Thanks Alain Lon Hohberger wrote: > On Thu, 2005-09-22 at 08:15 +0200, Alain Moulle wrote: > >>Hi everybody >> >>Some questions about heart-beat under Cluster Suite 4 : >> >>1. how is choosen the eth interface under CS4 ? >> if we have eth0 and eth1, it seems that HearBeat >> goes through eth0 and we don't have the possibility >> to configure this in CS4 , right ? >> >>2. does that mean also that if eth0 fails, the CS4 >> automatically goes through eth1 ? >> >>3. do I miss the way to configure this via GUI ? > > > One easy way to do this is to use bonded interfaces, which will provide > failover transparent to the cluster. > > -- Lon > > -- mailto:Alain.Moulle at bull.net +------------------------------+--------------------------------+ | Alain Moull? | from France : 04 76 29 75 99 | | | FAX number : 04 76 29 72 49 | | Bull SA | | | 1, Rue de Provence | Adr : FREC B1-041 | | B.P. 208 | | | 38432 Echirolles - CEDEX | Email: Alain.Moulle at bull.net | | France | BCOM : 229 7599 | +-------------------------------+-------------------------------+ From yfttyfs at gmail.com Thu Sep 22 16:40:07 2005 From: yfttyfs at gmail.com (y f) Date: Fri, 23 Sep 2005 00:40:07 +0800 Subject: [Linux-cluster] High availability mail system In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F37BA0@vale.MOBEON.COM> References: <9B488A5E8C00084C82DB19AF2E713C22F37BA0@vale.MOBEON.COM> Message-ID: <78fcc84a050922094012a1b27d@mail.gmail.com> On 9/22/05, Robert Olsson wrote: > > Im trying to put up a high availability mail system with high performance > who support up to 300 000 mailboxes using a linux cluster. 
> Recently we wrote a Cluster FS dedicated to Email service based on ideas from GoogleFS, LogFS, GlobalFS, etc., which also gave me some thoughts on it. I have looked around for open source cluster solution, but so far only found > solution like SAN for shared filesystem with high performance. > www.Lustre.org ? Any suggestion how to solve the issues using open source software? > The system should support > - Mailbox replication > Replication in mail account level, file level, or GoogleFS[1]' chunk level, I wonder which one will simplify the complication ? ?- Mailbox synchronization > Based on above line. ? > - Redundancy both in hardware and software > - The system is to be build with low cost computers > - One shared filesystem without external storage like SAN > - Scalable > GoogleFS gave a good example on it ;) [1] http://labs.google.com/papers/gfs.html /Robert Olsson > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jason_wilk at stircrazy.net Thu Sep 22 17:53:16 2005 From: jason_wilk at stircrazy.net (Jason Wilkinson) Date: Thu, 22 Sep 2005 12:53:16 -0500 Subject: [Linux-cluster] High availability mail system In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F37BA0@vale.MOBEON.COM> Message-ID: Robert Olsson wrote: > Im trying to put up a high availability mail system with high > performance who support up to 300 000 mailboxes using a linux > cluster. > > I have looked around for open source cluster solution, but so far > only found solution like SAN for shared filesystem with high > performance. > > Any suggestion how to solve the issues using open source software? > > The system should support > - Mailbox replication > .- Mailbox synchronization Why are you replicating the mailbox. Why don't you put the mailboxes on NFS and just have all of the servers dump into the same mailbox. The POP3 frontends can all pull from the same store as well. Mail servers are one of the odd services that I've seen where it isn't necessary to implement a cluster to scale well. http://shupp.org/maps/ispcluster.html > .- Redundancy both in hardware and software > - The system is to be build with low cost computers > - One shared filesystem without external storage like SAN > - Scalable > > /Robert Olsson From lgodoy at atichile.com Thu Sep 22 19:02:16 2005 From: lgodoy at atichile.com (Luis Godoy Gonzalez) Date: Thu, 22 Sep 2005 15:02:16 -0400 Subject: [Linux-cluster] RedHat EN4U1 AMD64 In-Reply-To: <432AC5C2.1090800@fnal.gov> References: <43287C63.10909@fnal.gov> <20050915033908.GL2190@redhat.com> <432AC5C2.1090800@fnal.gov> Message-ID: <4332FFB8.1070203@atichile.com> Hi I am trying to run a basic config of "cluster suite" software on "Hp cluster Hardware", Using: "*Red Hat Enterprise Linux ES 4 Update 1 (AMD64/Intel EM64T)" "*rhel-4-rhcs-x86_64.iso" Mi first problem was with "cman-kernel-hugemem" and "dlm-kernel-hugemem" pakages, the instalation process failed for funsolved dependencies. So, for testing we ommited these pakages and continued with the instalation proccess. I created a basic "service" ( only script ) and this works, but when I added a IP addreess, the GUI showed services status OK but the IP address whas not added to the machine :| , I think the problem is with rgmanager. Has someone installed This version of SO and Cluster Suite ? You have any idea to solve this problem ? Thanks in advance Luis G. 
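For reference, a script-plus-IP service of the kind described above is defined in the rgmanager (<rm>) section of /etc/cluster/cluster.conf on RHEL 4; a minimal sketch, with placeholder names and addresses rather than the actual configuration:

   <rm>
     <service name="testsvc" autostart="1">
       <ip address="192.168.0.100" monitor_link="1"/>
       <script name="testsvc_script" file="/etc/init.d/testsvc"/>
     </service>
   </rm>

rgmanager brings the service address up as a secondary address with no interface alias, so ifconfig will not list it even when it is active; check with iproute instead:

   /sbin/ip -4 addr list dev eth0    # the floating address appears as an extra "inet" line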
From lhh at redhat.com Thu Sep 22 21:50:17 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 22 Sep 2005 17:50:17 -0400 Subject: [Linux-cluster] RedHat EN4U1 AMD64 In-Reply-To: <4332FFB8.1070203@atichile.com> References: <43287C63.10909@fnal.gov> <20050915033908.GL2190@redhat.com> <432AC5C2.1090800@fnal.gov> <4332FFB8.1070203@atichile.com> Message-ID: <1127425817.22106.200.camel@ayanami.boston.redhat.com> On Thu, 2005-09-22 at 15:02 -0400, Luis Godoy Gonzalez wrote: > I created a basic "service" ( only script ) and this works, but when I > added a IP addreess, the GUI showed services status OK but the IP > address whas not added to the machine :| , I think the problem is with > rgmanager. Use "ip addr list", not "ifconfig". -- Lon From baesso at ksolutions.it Fri Sep 23 07:41:49 2005 From: baesso at ksolutions.it (Baesso Mirko) Date: Fri, 23 Sep 2005 09:41:49 +0200 Subject: R: [Linux-cluster] Re: How do I use a cross over cable to set upquorum? Message-ID: Hi, i need to setup a 2 node cluster and i would like to use the disk based tiebreaker or ip tiebreaker, but I see there is no cluquorumd command. I'm using RHCS 4 (NO GFS) on kernel 2.6.9.11 Could you help me Thanks Baesso Mirko - System Engineer KSolutions.S.p.A. Via Lenin 132/26 56017 S.Martino Ulmiano (PI) - Italy tel.+ 39 0 50 898369 fax. + 39 0 50 861200 baesso at ksolutions.it http//www.ksolutions.it -----Messaggio originale----- Da: Andreso [mailto:andreseso at gmail.com] Inviato: gioved? 22 settembre 2005 11.44 A: linux-cluster at redhat.com Oggetto: [Linux-cluster] Re: How do I use a cross over cable to set upquorum? Sorry, I forgot to mention that I am running Centos 3.5 which is equivalent to RHEL AS 3.5. A working cluster.xml using the main network interface is attached Andres On 9/22/05, Andreso wrote: > Hello, > > I am trying to set up a cluster with clumanager-1.2.3-1 and > clumanager-1.2.3-1 and my boss desires a working quorum where the two > members of the cluster use a cross over cable to interchange the > necessary information for quorum between the two cluster members. The > reason is that the company uses faulty switches so he considers using > the main network interfaces of the servers not acceptable. > > I would like to know how I can obtain cluster quorum using the cross > over cable. I have succesfully set up the cluster using the main > network interface but I have failed miserably setting up the cluster > quorum over the cross over cable network interface. I believe I have > a routing problem. > > The route -n information for the cross over cable interface is > Kernel IP routing table > Destination Gateway Genmask Flags Metric Ref Use Iface > 10.0.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1 > > I am far from being a linux expert and in some things like clusters I > am a newbie and I am not particularily strong on networking. For > example until I started setting up the cluster I had never heard of > the concepts tiebreaker IP or Multicast IP address. > > Any help would be appreciated. > > If somebody has managed to make this work could they please post their > cluster.xml file? > From Robert.Olsson at mobeon.com Fri Sep 23 08:38:53 2005 From: Robert.Olsson at mobeon.com (Robert Olsson) Date: Fri, 23 Sep 2005 10:38:53 +0200 Subject: [Linux-cluster] High availability mail system Message-ID: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM> Ok, how about performance when using NFS? I?m thinking about the overhead when accesing NFS filesystems. 
Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data? -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jason Wilkinson Sent: den 22 september 2005 19:53 To: 'linux clustering' Subject: RE: [Linux-cluster] High availability mail system Robert Olsson wrote: > Im trying to put up a high availability mail system with high > performance who support up to 300 000 mailboxes using a linux cluster. > > I have looked around for open source cluster solution, but so far only > found solution like SAN for shared filesystem with high performance. > > Any suggestion how to solve the issues using open source software? > > The system should support > - Mailbox replication > .- Mailbox synchronization Why are you replicating the mailbox. Why don't you put the mailboxes on NFS and just have all of the servers dump into the same mailbox. The POP3 frontends can all pull from the same store as well. Mail servers are one of the odd services that I've seen where it isn't necessary to implement a cluster to scale well. http://shupp.org/maps/ispcluster.html > .- Redundancy both in hardware and software > - The system is to be build with low cost computers > - One shared filesystem without external storage like SAN > - Scalable > > /Robert Olsson -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From adam.cassar at netregistry.com.au Fri Sep 23 08:53:20 2005 From: adam.cassar at netregistry.com.au (Adam Cassar) Date: Fri, 23 Sep 2005 18:53:20 +1000 Subject: [Linux-cluster] High availability mail system In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM> References: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM> Message-ID: <4333C280.6040901@netregistry.com.au> use maildir format and you will be fine exim and courier support this Robert Olsson wrote: >Ok, how about performance when using NFS? I?m thinking about the overhead when accesing NFS filesystems. Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data? > >-----Original Message----- >From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jason Wilkinson >Sent: den 22 september 2005 19:53 >To: 'linux clustering' >Subject: RE: [Linux-cluster] High availability mail system > >Robert Olsson wrote: > > >>Im trying to put up a high availability mail system with high >>performance who support up to 300 000 mailboxes using a linux cluster. >> >>I have looked around for open source cluster solution, but so far only >>found solution like SAN for shared filesystem with high performance. >> >> >> > > > >>Any suggestion how to solve the issues using open source software? >> >>The system should support >>- Mailbox replication >>.- Mailbox synchronization >> >> > >Why are you replicating the mailbox. Why don't you put the mailboxes on NFS and just have all of the servers dump into the same mailbox. The POP3 frontends can all pull from the same store as well. > >Mail servers are one of the odd services that I've seen where it isn't necessary to implement a cluster to scale well. 
> >http://shupp.org/maps/ispcluster.html > > > > >>.- Redundancy both in hardware and software >>- The system is to be build with low cost computers >>- One shared filesystem without external storage like SAN >>- Scalable >> >>/Robert Olsson >> >> > > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > From andreseso at gmail.com Fri Sep 23 10:22:19 2005 From: andreseso at gmail.com (Andreso) Date: Fri, 23 Sep 2005 12:22:19 +0200 Subject: [Linux-cluster] Four partitions mounted for one server -- samba fails Message-ID: <1a9f4169050923032278a31093@mail.gmail.com> I have set up the cluster using CENTOS 3.5 using the main network interface, the one with the gateway. I have two services running. httpd and mysql httpd had only one partition mounted: /usr/local/htdocs on /dev/sdb7 which is shared by samba For backups and other stuff I am not privy to I have mounted three more partitions: /dev/sdb9, /dev/sdb10 and /dev/sdc1 None of these partitions are shared by samba The service would not start stating that /etc/samba/smb.conf.not was not found I did touch /etc/samba/smb.conf.not and I was able to make the service start. Unfortunately /etc/samba/smb.conf.not seems to be a samba file and it is not sufficient for it to exist. Every time the httpd service is checked it fails, the devices are unmounted forcefully and they are remounted and the service is restarted. I believe this is caused by samba What stuff should I put into this file for /etc/samba.conf.not for the service checks to function correctly? I am loath to use the word urgent but I consider this to be urgent. I am getting paid for the work done. I budgeted six days on this project and I have already spent 10. I would hate having to come back to this place. I am working standing infront of a LVKM getting disconnected constantly and one day due to enterprise restrictions I did not get access to Internet. Its noon and I might have to leave at 14:30 and I would hate to spend another day I am not getting paid for on this project. Andres From rainer at ultra-secure.de Fri Sep 23 10:58:19 2005 From: rainer at ultra-secure.de (Rainer Duffner) Date: Fri, 23 Sep 2005 12:58:19 +0200 Subject: [Linux-cluster] High availability mail system In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM> References: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM> Message-ID: <4333DFCB.6020600@ultra-secure.de> Robert Olsson wrote: >Ok, how about performance when using NFS? I?m thinking about the overhead when accesing NFS filesystems. > The trouble is that GFS also has an overhead - especially for Qmail. In fact, we have what I would call a "long-time evaluation" of qmail + GFS running. While some features (no SPOF) are nice, others (way too many concurrent directory-accesses to gain any performance gain in comparison to NFS) are not nice at all. I'm really not a GFS-expert at all, but the way I see it (and was told) is that everytime a directory-access occurs, GFS must synchronize this to the other cluster-members. Now, when Qmail delivers a mail, it already takes great care not to produce conflicts on (NFS-)shared filesystems, by copying the message first to "tmp", then to "new", with timestamp as part of filename etc. 
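Roughly, that delivery sequence looks like this (a shell sketch of the Maildir scheme, not qmail's actual code; the maildir path is only an example):

MAILDIR=/var/spool/maildirs/user1            # example location of one user's maildir
UNIQUE="$(date +%s).$$.$(hostname)"          # time.pid.host, so no two deliveries collide
cat > "$MAILDIR/tmp/$UNIQUE"                 # message arrives on stdin in this sketch
mv "$MAILDIR/tmp/$UNIQUE" "$MAILDIR/new/$UNIQUE"   # rename within one filesystem is atomic

Because the final step is a rename inside one filesystem, readers never see a half-written message.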
For every file created on the shared directory, though, GFS creates locks and lockfiles - which easily doubles or triples the load on the SAN. Even though, no two files of the same name are ever created by different hosts. In addition, GFS doesn't seem to have "directory hashing" like FreeBSD UFS and others have, as a result access to large directories with many files is slow. Running find(1) on the mail-store can bring the cluster to halt, so does du(1). Due to the fact that we also will have to move from cdb-backend to mysql, there will be a SPOF anyway and we will be actively evaluating going to NFS (more or less back). But only on a "sane" NFS-platform, most likely Solaris, or FreeBSD - don't waste your time with Linux-NFS... >Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data? > > The shared-storage mail-systems only scale to a certain point. The numbers are in the 300k-500k ballpark. After that, you have to go distributed. If you have millions of users, go qmail-ldap. These things are very difficult to benchmark, every site has an individual use-profile.. cheers, Rainer From andreseso at gmail.com Fri Sep 23 11:10:21 2005 From: andreseso at gmail.com (Andreso) Date: Fri, 23 Sep 2005 13:10:21 +0200 Subject: [Linux-cluster] Re: Four partitions mounted for one server -- samba fails In-Reply-To: <1a9f4169050923032278a31093@mail.gmail.com> References: <1a9f4169050923032278a31093@mail.gmail.com> Message-ID: <1a9f41690509230410239f1432@mail.gmail.com> I have looked at the /var/log/cluster and here I will reproduce all the messages. I have set the log level on all the cluster daemons to INFO In effect it is a samba issue Sep 23 12:53:46 hercules clusvcmgrd: [16076]: service warning: share_s tart_stop: Samba configuration file /etc/samba/smb.conf.not found does not exist . Sep 23 12:53:46 hercules clusvcmgrd: [16076]: service error: share_start_s top: nmbd for service httpd died! Sep 23 12:53:46 hercules clusvcmgrd: [16076]: service error: /usr/lib/clum anager/services/service: line 220: [: status: integer expression expected Sep 23 12:53:46 hercules clusvcmgrd: [16076]: service error: grep: found: No such file or directory Sep 23 12:53:46 hercules clusvcmgrd: [16076]: service error: Check status failed on Samba for httpd Sep 23 12:53:46 hercules clusvcmgrd[16075]: Restarting locally failed service httpd Sep 23 12:53:46 hercules clusvcmgrd: [16228]: service notice: Stopping service httpd ... Sep 23 12:53:46 hercules clusvcmgrd: [16228]: service notice: Running u ser script '/etc/init.d/apache2 stop' Sep 23 12:53:46 hercules clusvcmgrd: [16228]: service info: Stopping IP a ddress 10.64.34.141 Sep 23 12:53:46 hercules clusvcmgrd: [16228]: service warning: share_s tart_stop: Samba configuration file /etc/samba/smb.conf.not found does not exist . 
Sep 23 12:53:47 hercules last message repeated 2 times
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: /usr/lib/clumanager/services/service: line 220: [: stop: integer expression expected
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: grep: found: No such file or directory
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: /usr/lib/clumanager/services/service: line 220: [: stop: integer expression expected
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: grep: found: No such file or directory
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: /usr/lib/clumanager/services/service: line 220: [: stop: integer expression expected
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: grep: found: No such file or directory
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: grep: found: No such file or directory
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: unmounting /dev/sdb7 (/usr/local/htdocs)
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: unmounting /dev/sdb9 (/plantillas)
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: unmounting /dev/sdb10 (/etse)
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service info: unmounting /dev/sdc1 (/backup)
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: service notice: Stopped service httpd ...
Sep 23 12:53:48 hercules clusvcmgrd[16075]: Starting stopped service httpd
Sep 23 12:53:48 hercules clusvcmgrd: [16727]: service notice: Starting service httpd ...
Sep 23 12:53:48 hercules clusvcmgrd: [16727]: service info: Starting IP address 10.64.34.141
Sep 23 12:53:48 hercules clusvcmgrd: [16727]: service info: Sending Gratuitous arp for 10.64.34.141 (00:11:43:E7:BA:A5)
Sep 23 12:53:49 hercules clusvcmgrd: [16727]: service warning: share_start_stop: Samba configuration file /etc/samba/smb.conf.not found does not exist.
Sep 23 12:53:49 hercules last message repeated 2 times Sep 23 12:53:49 hercules clusvcmgrd: [16727]: service info: /usr/lib/clum anager/services/service: line 220: [: start: integer expression expected Sep 23 12:53:49 hercules clusvcmgrd: [16727]: service info: grep: found: No such file or directory Sep 23 12:53:49 hercules clusvcmgrd: [16727]: service info: /usr/lib/clum anager/services/service: line 220: [: start: integer expression expected Sep 23 12:53:49 hercules clusvcmgrd: [16727]: service info: grep: found: No such file or directory Sep 23 12:53:49 hercules clusvcmgrd: [16727]: service info: /usr/lib/clum anager/services/service: line 220: [: start: integer expression expected Sep 23 12:53:49 hercules clusvcmgrd: [16727]: service info: grep: found: No such file or directory Sep 23 12:53:49 hercules clusvcmgrd: [16727]: service notice: Running u ser script '/etc/init.d/apache2 start' I set up a /etc/samba/smb.conf.not file with just a global section [global] workgroup = RHCLUSTER pid directory = /var/run/samba/not lock directory = /var/cache/samba/not log file = /var/log/samba/%m.log encrypt passwords = yes bind interfaces only = yes interfaces = 10.64.34.141/255.255.255.0 To the best of my knowledge /etc/samba/smb.conf.not is not documented From Robert.Olsson at mobeon.com Fri Sep 23 11:15:07 2005 From: Robert.Olsson at mobeon.com (Robert Olsson) Date: Fri, 23 Sep 2005 13:15:07 +0200 Subject: [Linux-cluster] High availability mail system Message-ID: <9B488A5E8C00084C82DB19AF2E713C22F6487F@vale.MOBEON.COM> -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rainer Duffner Sent: den 23 september 2005 12:58 To: linux clustering Subject: Re: [Linux-cluster] High availability mail system >Robert Olsson wrote: >>Ok, how about performance when using NFS? I?m thinking about the overhead when accesing NFS filesystems. >> >The trouble is that GFS also has an overhead - especially for Qmail. >In fact, we have what I would call a "long-time evaluation" of qmail + GFS running. >While some features (no SPOF) are nice, others (way too many concurrent directory-accesses to gain any performance gain in comparison to NFS) are not nice at all. >I'm really not a GFS-expert at all, but the way I see it (and was told) is that everytime a directory-access occurs, GFS >must synchronize this to the other cluster-members. >Now, when Qmail delivers a mail, it already takes great care not to produce conflicts on (NFS-)shared filesystems, by copying the message first to "tmp", then to "new", with timestamp as part of filename etc. >For every file created on the shared directory, though, GFS creates locks and lockfiles - which easily doubles or triples >the load on the SAN. Even though, no two files of the same name are ever created by different hosts. >In addition, GFS doesn't seem to have "directory hashing" like FreeBSD UFS and others have, as a result access to large directories with many files is slow. Running find(1) on the mail-store can bring the cluster to halt, so does du(1). >Due to the fact that we also will have to move from cdb-backend to mysql, there will be a SPOF anyway and we will be actively evaluating going to NFS (more or less back). >But only on a "sane" NFS-platform, most likely Solaris, or FreeBSD - don't waste your time with Linux-NFS... >>Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data? 
>> >> >The shared-storage mail-systems only scale to a certain point. The numbers are in the 300k-500k ballpark. After that, you have to go distributed. >If you have millions of users, go qmail-ldap. How do you mean with "go distributed"? >These things are very difficult to benchmark, every site has an individual use-profile.. cheers, Rainer -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From merlin at mwob.org.uk Fri Sep 23 11:31:33 2005 From: merlin at mwob.org.uk (Howard Johnson) Date: Fri, 23 Sep 2005 12:31:33 +0100 Subject: [Linux-cluster] High availability mail system In-Reply-To: <4333DFCB.6020600@ultra-secure.de> References: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM> <4333DFCB.6020600@ultra-secure.de> Message-ID: <1127475093.30722.23.camel@thunderbolt.localnet> On Fri, 2005-09-23 at 11:58, Rainer Duffner wrote: > But only on a "sane" NFS-platform, most likely Solaris, or FreeBSD - > don't waste your time with Linux-NFS... > > > >Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data? > > > > > > The shared-storage mail-systems only scale to a certain point. The > numbers are in the 300k-500k ballpark. After that, you have to go > distributed. > If you have millions of users, go qmail-ldap. Linux-based NFS shared-storage mail systems are capable of scaling well beyond that. I've seen such a system handling millions of mailboxes. -- Howard Johnson From rainer at ultra-secure.de Fri Sep 23 12:30:10 2005 From: rainer at ultra-secure.de (Rainer Duffner) Date: Fri, 23 Sep 2005 14:30:10 +0200 Subject: [Linux-cluster] High availability mail system In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F6487F@vale.MOBEON.COM> References: <9B488A5E8C00084C82DB19AF2E713C22F6487F@vale.MOBEON.COM> Message-ID: <4333F552.3080706@ultra-secure.de> Robert Olsson wrote: >How do you mean with "go distributed"? > > Qmail-LDAP-patches. You've no longer got a shared mail-storage, so you don't have to scale-up that, as you add users. A LDAP-directory stores where (on which server) each user is located. See the qmail-ldap pages and accompanying documentation. Rainer From Robert.Olsson at mobeon.com Fri Sep 23 12:42:25 2005 From: Robert.Olsson at mobeon.com (Robert Olsson) Date: Fri, 23 Sep 2005 14:42:25 +0200 Subject: [Linux-cluster] High availability mail system Message-ID: <9B488A5E8C00084C82DB19AF2E713C22F64912@vale.MOBEON.COM> > The system should support > - Mailbox replication > .- Mailbox synchronization >Why are you replicating the mailbox. Why don't you put the mailboxes on NFS and just have all of the servers dump into >the same mailbox. The POP3 frontends can all pull from the same store as well. I want to have redundancy on the mailboxes not just on one node using raid. I want the mailbox at least on two nodes if one node fail. Do you have any suggestion to that? /Robert Olsson -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Mon Sep 26 15:28:58 2005 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 26 Sep 2005 11:28:58 -0400 Subject: [Linux-cluster] Re: How do I use a cross over cable to set up quorum? 
In-Reply-To: <1a9f416905092304522adfc2b1@mail.gmail.com> References: <1a9f416905092201175c6eb118@mail.gmail.com> <1a9f41690509220244da6fed5@mail.gmail.com> <1127399555.22106.169.camel@ayanami.boston.redhat.com> <1a9f416905092304522adfc2b1@mail.gmail.com> Message-ID: <1127748538.22106.242.camel@ayanami.boston.redhat.com> On Fri, 2005-09-23 at 13:52 +0200, Andreso wrote: > On 9/22/05, Lon Hohberger wrote: > > > > * Set broadcast-primary-only (see man cludb) > > cludb does not have a man page. broadcast-primary-only does not > appear in google Upgrade to the latest package from RHN or wherever you got your software. Anyway, as it turns out, the man page is wrong anyway (it references primary_only instead of broadcast_primary_only). Oops. > > * Use the disk based tiebreaker. DO NOT use the IP tiebreaker. > I use the ping interval instad of tiebreaker Ip. I guess that is what you mean. That's equivalent, yes. > Member Status > ------------------ ---------- > 10.0.0.2 Inactive > 10.0.0.3 Active <-- You are here > > Service Status Owner (Last) Last Transition Chk Restarts > -------------- -------- ---------------- --------------- --- -------- > httpd started 10.0.0.2 13:39:06 Sep 23 30 0 > mysql started 10.0.0.2 13:39:06 Sep 23 30 0 > Member Status > ------------------ ---------- > 10.0.0.2 Active <-- You are here > 10.0.0.3 Inactive > > Service Status Owner (Last) Last Transition Chk Restarts > -------------- -------- ---------------- --------------- --- -------- > httpd started 10.0.0.2 13:39:06 Sep 23 30 0 > mysql started 10.0.0.2 13:39:06 Sep 23 30 0 The disk tiebreaker is working correctly. Your nodes aren't communicating over the private network (crossover cable, in your case), though. The cluster software doesn't do anything arcane. You can try double-checking UDP ping-ability (which is basically what the cluster does, except it's one way instead of bidirectional) using this: http://people.redhat.com/udping-1.0.tar.gz Don't set it up as a cluster service, just start the server on one and try to use udping to ping the other using the private IP. Also try obvious things like normal ping, broadcast ping, and ssh. If these don't work, you probably have a bad cable, incorrect routing rules, or incorrect firewall rules. Your configuration looks okay. After you get the cluster working, stop the cluster on both nodes. Run this on one of them, and copy the cluster configuration to the other node: # cludb -p clumembd%broadcast_primary_only 1 You can also do this from one of the nodes: # shutil -s /etc/cluster.xml This will prevent the cluster from using public interfaces for heartbeats, but is not critical in any way to get the cluster software working. -- Lon From carlopmart at gmail.com Mon Sep 26 15:33:38 2005 From: carlopmart at gmail.com (carlopmart at gmail.com) Date: Mon, 26 Sep 2005 17:33:38 +0200 Subject: [Linux-cluster] Installing RHCS on RHEL 4 Message-ID: <433814D2.1060907@gmail.com> Hi all, I would like to do some tests with RHCS (Cluster Suite) and RHEL 4 under two virtual machines on GSX Server 3.2. At this point, I have some questions to configure two virtual nodes to do the tests: - Is it necessary to have a fence device?? Can I configure custer suite without it?? - How many network interfaces I need for each virtual machine??? Thank you very much. 
-- CL Martinez carlopmart {at} gmail {d0t} com From david.sullivan at activant.com Mon Sep 19 00:40:23 2005 From: david.sullivan at activant.com (David.Sullivan) Date: Sun, 18 Sep 2005 19:40:23 -0500 Subject: [Linux-cluster] RHCS 4.0 HowTo? Message-ID: I'm completely new to clustering and am trying to set up a "proof of concept" configuration in-house. Our customers run a back-office POS server and must currently do a manual failover that involves moving hard drives if the server fails. Using VMware Workstation 5, I have created several VM's configured with dual NICs and RHEL 4.0/RHCS 4.0. System-config-cluster seems to abstract so much from the Administrator that I'm having trouble getting things working. Some specific things I don't understand: * How do you tell it which NIC to use as a heartbeat, and which to use for the services offered? I have a private LAN set up that I want to use for heartbeat, but don't think it's being used at all. Both the public and private IP's are set up in /etc/hosts. * Building on the above, is it wise for other cluster services (e.g. dlm) to communicate across the private heartbeat link? If so, how do I configure that? * How does one add members to an existing cluster? I found lots online (including Red Hat's Knowledgebase) about how to do it with previous versions, but I'm very unclear about RHCS 4.0. I cloned my first VM, reset all it's networking data, reset the hostname, and deleted the /etc/cluster/cluster.conf file, but the "master" node won't push the cluster configuration to it, so it sits there brain-dead. * Insofar as hardware fencing is required with RHCS 4.0, will I be able to demonstrate failover functionality to management at all? I'm basically looking for a means to automagically fail over to a "hot standby" server. TIA! Notice: This transmission is for the sole use of the intended recipient(s) and may contain information that is confidential and/or privileged. If you are not the intended recipient, please delete this transmission and any attachments and notify the sender by return email immediately. Any unauthorized review, use, disclosure or distribution is prohibited. From ren at teamware-gmbh.de Mon Sep 19 14:52:33 2005 From: ren at teamware-gmbh.de (=?iso-8859-1?B?UmVu6SBFbnNrYXQgW1RlYW13YXJlIEdtYkhd?=) Date: Mon, 19 Sep 2005 16:52:33 +0200 Subject: [Linux-cluster] Nanny bad load average failure Message-ID: Hi list, I still have this strange error. I updated the clustersuite to the newest versions and i still get this errors in my /var/log/messages but the servers are up with the old ruptime version i get the loadaverage but the error in th elogfile was the same "bad load average": Sep 19 11:12:25 telemach nanny[24850]: bad load average returned: telemach down 0:32 telemach3 down 0:31 telemach4 down 0:31 Sep 19 11:12:25 telemach nanny[25253]: bad load average returned: telemach down 0:32 telemach3 down 0:31 telemach4 down 0:31 Sep 19 11:12:32 telemach nanny[24822]: bad load average returned: telemach down 0:32 telemach3 down 0:31 telemach4 down 0:31 Sep 19 11:12:38 telemach nanny[25225]: bad load average returned: telemach down 0:32 telemach3 down 0:31 telemach4 down 0:31 Sep 19 11:12:43 telemach nanny[24850]: bad load average returned: telemach down 0:33 telemach3 down 0:32 telemach4 down 0:31 Sep 19 11:12:43 telemach nanny[25253]: bad load average returned: telemach down 0:33 telemach3 down 0:32 telemach4 down 0:31 ipvsadm-1.24-6 piranha-0.8.0-1 How can i solve this? Thx for HELP! 
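The "down" entries in those messages suggest the rwhod daemons that feed ruptime are not exchanging their broadcasts, so nanny never gets a load figure to parse. A quick check on each real server (a sketch only; it assumes nanny's load monitor is the rup/ruptime mechanism shown in the log, and the package and init script names may differ on your build):

rpm -q rwho                 # rwhod normally ships in the rwho package
service rwhod status        # the daemon has to be running on every real server
ruptime                     # each node should show "up" plus a load average

If ruptime shows every host up with a load figure and nanny still logs "bad load average", the problem is more likely the output format nanny expects rather than the network.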
From Alain.Moulle at bull.net Tue Sep 20 13:02:13 2005 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 20 Sep 2005 15:02:13 +0200 Subject: [Linux-cluster] Question about heart-beat with CS4 Message-ID: <43300855.3050709@bull.net> Hi I wonder how the CS4 heart beat is managed : 1. suppose we have two interfaces eth0 and eth1, which one will be used ? or will the CS4 use both ? 2. is-it configurable somewhere ? 3. is the time period between each ping on heart beat configurable somewhere ? 4. Is there a risk to split the cluster when using only one ETH interface ? Thanks Alain Moull? From pcaulfie at redhat.com Tue Sep 27 07:00:29 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 27 Sep 2005 08:00:29 +0100 Subject: [Linux-cluster] Question about heart-beat with CS4 In-Reply-To: <43300855.3050709@bull.net> References: <43300855.3050709@bull.net> Message-ID: <4338EE0D.6090408@redhat.com> Alain Moulle wrote: > Hi For a CMAN/DLM based cluster: > I wonder how the CS4 heart beat is managed : > > 1. suppose we have two interfaces eth0 and eth1, > which one will be used ? > or will the CS4 use both ? It will use the interface bound to the IP address of the hostname by default. If you want to use the otehr interface then specify the name associated with the inerface on the command-line to "cman_tool join" > 2. is-it configurable somewhere ? > > 3. is the time period between each ping on heart beat > configurable somewhere ? in /proc/cluster/config/cman/ there are files you can poke values into. This must be done beteen loading the cman module and running cman_tool join. > 4. Is there a risk to split the cluster when using > only one ETH interface ? Yes. If you want to use both interfaces then join them using the Linux bonding driver. -- patrick From htfrontier at gmail.com Tue Sep 27 07:02:32 2005 From: htfrontier at gmail.com (Hanny Tidore) Date: Tue, 27 Sep 2005 15:02:32 +0800 Subject: [Linux-cluster] Cluster cannot failover Message-ID: <2fa0bfca05092700022fddb849@mail.gmail.com> Hi, I am installing Redhat Cluster Suite in 2 HP Proliant DL380G4 with HP StorageWorks MSA500 G1. I have 3 ethernet cards for each server. 1 card is used for heartbeat and 2 cards are configured as bond0 (bonding). I have setup a service and the service runs on both node: node1 and node2. I can swing the service from node1 to node2. However, when I shutdown node1 (using shutdown -h now), the service which was running in node1 is not restarted on node2. I got the following error message in node2: Sep 27 10:34:54 ppba-papp2 clusvcmgrd[1868]: Couldn't connect to member #0: Connection timed out Sep 27 10:34:54 ppba-papp2 clusvcmgrd[1868]: Unable to obtain cluster lock: No locks available Sep 27 10:35:01 ppba-papp2 cluquorumd[1835]: Membership reports #0 as down, but disk reports as up: State uncertain! Sep 27 10:35:05 ppba-papp2 clusvcmgrd[1868]: Member ppba-papp1's state is uncertain: Some services may be unavailable! Is this test scenario valid ? Is it ok to test the Redhat Cluster by shutting down the server ? What could have gone wrong ? Thanks. Hanny -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From andreseso at gmail.com Tue Sep 27 15:08:59 2005 From: andreseso at gmail.com (Andreso) Date: Tue, 27 Sep 2005 17:08:59 +0200 Subject: [Linux-cluster] Cluster cannot failover In-Reply-To: <2fa0bfca05092700022fddb849@mail.gmail.com> References: <2fa0bfca05092700022fddb849@mail.gmail.com> Message-ID: <1a9f4169050927080875bfde68@mail.gmail.com> On 9/27/05, Hanny Tidore wrote: > I am installing Redhat Cluster Suite in 2 HP Proliant DL380G4 with HP > StorageWorks MSA500 G1. I sure hope that that is not the cluster suite that comes with Redhat Enterprise Linux Advanced Server 2.1 The reason I say so is that on a cluster with shared storage over SCSI when I upgraded the RHEL 2.1 kernel the machines became unbootable. Something about the RHEL 2.1 upgrade kernels not containing the megaraid2 kernel module. Anyways the machines would not boot and I commited the newbie mistake of uninstalling the old kernel in the hope that the new kernel would work. I had to reinstall one of the cluster members. You know the saying: person that repeats the same steps hoping to get different results -> windows user. If you do not want to pay for RHEL 3 or 4 you can go with CENTOS. If you go with RHEL I recommend paying for support from RedHat as time can be lost debugging the cluster. Andres From carlopmart at gmail.com Tue Sep 27 16:22:54 2005 From: carlopmart at gmail.com (carlopmart at gmail.com) Date: Tue, 27 Sep 2005 18:22:54 +0200 Subject: [Linux-cluster] This configuration could work?? Message-ID: <433971DE.4020302@gmail.com> Hi all, I would like to test RHGS with RHCS, but I have got only one server. My idea is to use VMWare GSX Server. My explain: - VMWare GSX host server acts as GFS server. - Two virtual machines with RHCS using fence_gnbd as fence device connected to GFS server. is it possible to do this configuration with only one server using GFS??? Thank you very much for your help and sorry my bad english. P.D: I will use CentOS 4.1 for host and virtual machines. -- CL Martinez carlopmart {at} gmail {d0t} com From lhh at redhat.com Tue Sep 27 17:12:55 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 27 Sep 2005 13:12:55 -0400 Subject: [Linux-cluster] Installing RHCS on RHEL 4 In-Reply-To: <433814D2.1060907@gmail.com> References: <433814D2.1060907@gmail.com> Message-ID: <1127841175.26042.76.camel@ayanami.boston.redhat.com> On Mon, 2005-09-26 at 17:33 +0200, carlopmart at gmail.com wrote: > Hi all, > > I would like to do some tests with RHCS (Cluster Suite) and RHEL 4 > under two virtual machines on GSX Server 3.2. At this point, I have some > questions to configure two virtual nodes to do the tests: > > - Is it necessary to have a fence device?? Can I configure custer > suite without it?? Sort of. Fencing is required. You may just want to write one for VMWare GSX server which tells the server to power-off the machine. > - How many network interfaces I need for each virtual machine??? 1 is fine, it depends on your needs. -- Lon From lhh at redhat.com Tue Sep 27 17:17:03 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 27 Sep 2005 13:17:03 -0400 Subject: [Linux-cluster] Re: Four partitions mounted for one server -- samba fails In-Reply-To: <1a9f41690509230410239f1432@mail.gmail.com> References: <1a9f4169050923032278a31093@mail.gmail.com> <1a9f41690509230410239f1432@mail.gmail.com> Message-ID: <1127841423.26042.81.camel@ayanami.boston.redhat.com> On Fri, 2005-09-23 at 13:10 +0200, Andreso wrote: > I have looked at the /var/log/cluster and here I will reproduce all > the messages. 
I have set the log level on all the cluster daemons to > INFO > > In effect it is a samba issue > > Sep 23 12:53:46 hercules clusvcmgrd: [16076]: service warning: share_s > tart_stop: Samba configuration file /etc/samba/smb.conf.not found does not exist > . ! It's looking for "/etc/samba/smb.conf.not found". Strange... Your configuration makes it think it's a samba service, but when we try to get the share name, it's reporting "not found". That's odd. > service: line 220: [: status: integer expression expected That's a bug. > I set up a /etc/samba/smb.conf.not file with just a global section > [global] > workgroup = RHCLUSTER > pid directory = /var/run/samba/not > lock directory = /var/cache/samba/not > log file = /var/log/samba/%m.log > encrypt passwords = yes > bind interfaces only = yes > interfaces = 10.64.34.141/255.255.255.0 > > > To the best of my knowledge /etc/samba/smb.conf.not is not documented There should be a share name with the device; that's what it's looking for. Please file a bugzilla and paste in or attach your cluster.xml. -- Lon From lhh at redhat.com Tue Sep 27 17:18:29 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 27 Sep 2005 13:18:29 -0400 Subject: R: [Linux-cluster] Re: How do I use a cross over cable to set upquorum? In-Reply-To: References: Message-ID: <1127841509.26042.84.camel@ayanami.boston.redhat.com> On Fri, 2005-09-23 at 09:41 +0200, Baesso Mirko wrote: > Hi, i need to setup a 2 node cluster and i would like to use the disk based tiebreaker or ip tiebreaker, but I see there is no cluquorumd command. I'm using RHCS 4 (NO GFS) on kernel 2.6.9.11 > Could you help me > Thanks No such thing in linux-cluster (or RHCS4). Instead, there's a special "two node" mode for CMAN to run in, which uses fencing to ensure that a split brain doesn't occur. If you use the GUI to configure the cluster, this mode is set automatically when only two nodes are present in the configuration file. -- Lon From lhh at redhat.com Tue Sep 27 17:19:58 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 27 Sep 2005 13:19:58 -0400 Subject: [Linux-cluster] This configuration could work?? In-Reply-To: <433971DE.4020302@gmail.com> References: <433971DE.4020302@gmail.com> Message-ID: <1127841598.26042.87.camel@ayanami.boston.redhat.com> On Tue, 2005-09-27 at 18:22 +0200, carlopmart at gmail.com wrote: > Hi all, > > I would like to test RHGS with RHCS, but I have got only one server. > My idea is to use VMWare GSX Server. My explain: > > - VMWare GSX host server acts as GFS server. > - Two virtual machines with RHCS using fence_gnbd as fence device > connected to GFS server. > > is it possible to do this configuration with only one server using GFS??? > > Thank you very much for your help and sorry my bad english. > > P.D: I will use CentOS 4.1 for host and virtual machines. > It should work, but typically multiple physical machines are used to achieve HA, as HA works around hardware failures (or should). If your host server fails, your entire cluster fails. -- Lon From lhh at redhat.com Tue Sep 27 17:51:26 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 27 Sep 2005 13:51:26 -0400 Subject: [Linux-cluster] RHCS 4.0 HowTo? In-Reply-To: References: Message-ID: <1127843486.26042.104.camel@ayanami.boston.redhat.com> On Sun, 2005-09-18 at 19:40 -0500, David.Sullivan wrote: > I'm completely new to clustering and am trying to set up a "proof of > concept" configuration in-house. 
Our customers run a back-office POS server > and must currently do a manual failover that involves moving hard drives if > the server fails. Using VMware Workstation 5, I have created several VM's > configured with dual NICs and RHEL 4.0/RHCS 4.0. System-config-cluster > seems to abstract so much from the Administrator that I'm having trouble > getting things working. Some specific things I don't understand: > > * How do you tell it which NIC to use as a heartbeat, and which to use for > the services offered? I have a private LAN set up that I want to use for > heartbeat, but don't think it's being used at all. Both the public and > private IP's are set up in /etc/hosts. With RHCS4, it goes by "uname -n". An easy thing to do is to set up dummy hostnames matching the IP on the private network and set your hostnames to them. e.g. 10.1.1.1 node1 10.1.1.2 node2 ...then set hostnames to node1 and node2. Service IPs are (and have always been) selected based on matching specified IPs to already existing IPs on NICs. E.g. 192.168.2.10/24 would go on the same NIC as 192.168.2.1/24, even if it's a different interface from what the cluster is using for internal communication. > * Building on the above, is it wise for other cluster services (e.g. dlm) > to communicate across the private heartbeat link? If so, how do I configure > that? I think DLM will use the private network, as will rgmanager, and everything (except services). > * Insofar as hardware fencing is required with RHCS 4.0, will I be able to > demonstrate failover functionality to management at all? I'm basically > looking for a means to automagically fail over to a "hot standby" server. Automagically won't work (and is *dangerous*). You'll have to use manual fencing. However, you could probably put together a fencing agent which asked the VMWare server to power off a guest... -- Lon From lhh at redhat.com Tue Sep 27 17:52:39 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 27 Sep 2005 13:52:39 -0400 Subject: [Linux-cluster] Question about heart-beat with CS4 In-Reply-To: <4338EE0D.6090408@redhat.com> References: <43300855.3050709@bull.net> <4338EE0D.6090408@redhat.com> Message-ID: <1127843559.26042.106.camel@ayanami.boston.redhat.com> On Tue, 2005-09-27 at 08:00 +0100, Patrick Caulfield wrote: > > It will use the interface bound to the IP address of the hostname by default. If > you want to use the otehr interface then specify the name associated with the > inerface on the command-line to "cman_tool join" Just a thought -- would it be useful to add that as an option to /etc/sysconfig/cman ? I don't think it's currently done (not sure though) -- Lon From eric at bootseg.com Tue Sep 27 18:27:07 2005 From: eric at bootseg.com (Eric Kerin) Date: Tue, 27 Sep 2005 14:27:07 -0400 Subject: [Linux-cluster] RHCS 4.0 HowTo? In-Reply-To: <1127843486.26042.104.camel@ayanami.boston.redhat.com> References: <1127843486.26042.104.camel@ayanami.boston.redhat.com> Message-ID: <1127845627.4501.40.camel@auh5-0479.corp.jabil.org> On Tue, 2005-09-27 at 13:51 -0400, Lon Hohberger wrote: > On Sun, 2005-09-18 at 19:40 -0500, David.Sullivan wrote: > > * Insofar as hardware fencing is required with RHCS 4.0, will I be able to > > demonstrate failover functionality to management at all? I'm basically > > looking for a means to automagically fail over to a "hot standby" server. > > Automagically won't work (and is *dangerous*). You'll have to use > manual fencing. 
However, you could probably put together a fencing > agent which asked the VMWare server to power off a guest... > There's already a fence agent for VMWare guests in CVS HEAD, just download the file, and place in your /sbin directory. Then you should be able to use fence_vmware as your agent in your /etc/cluster/cluster.conf file. http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/vmware/?cvsroot=cluster Looks like it requires the VMWare tools on the cluster nodes, but that should be no big deal. Thanks, Eric Kerin eric at bootseg.com From lhh at redhat.com Tue Sep 27 18:48:57 2005 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 27 Sep 2005 14:48:57 -0400 Subject: [Linux-cluster] RHCS 4.0 HowTo? In-Reply-To: <1127845627.4501.40.camel@auh5-0479.corp.jabil.org> References: <1127843486.26042.104.camel@ayanami.boston.redhat.com> <1127845627.4501.40.camel@auh5-0479.corp.jabil.org> Message-ID: <1127846937.26042.108.camel@ayanami.boston.redhat.com> On Tue, 2005-09-27 at 14:27 -0400, Eric Kerin wrote: > On Tue, 2005-09-27 at 13:51 -0400, Lon Hohberger wrote: > > On Sun, 2005-09-18 at 19:40 -0500, David.Sullivan wrote: > > > * Insofar as hardware fencing is required with RHCS 4.0, will I be able to > > > demonstrate failover functionality to management at all? I'm basically > > > looking for a means to automagically fail over to a "hot standby" server. > > > > Automagically won't work (and is *dangerous*). You'll have to use > > manual fencing. However, you could probably put together a fencing > > agent which asked the VMWare server to power off a guest... > > > > There's already a fence agent for VMWare guests in CVS HEAD, just > download the file, and place in your /sbin directory. Then you should > be able to use fence_vmware as your agent in > your /etc/cluster/cluster.conf file. > > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/vmware/?cvsroot=cluster > > Looks like it requires the VMWare tools on the cluster nodes, but that > should be no big deal. This is what I get for not looking first ;) -- Lon From tom-fedora at kofler.eu.org Tue Sep 27 21:16:57 2005 From: tom-fedora at kofler.eu.org (tom-fedora at kofler.eu.org) Date: Tue, 27 Sep 2005 23:16:57 +0200 Subject: [Linux-cluster] GFS LogVol00cluster.1: withdrawn / rejecting I/O to dead device Message-ID: <000301c5c3a8$ce55d1e0$2c01380a@TheCenter> Hi, we are building a HA cluster with GFS6.1 and Fedora Core 4 Our SAN box had an outage and was then reconnected. Now, we are unable to mount the clusterfilesystem gfs. 
Sep 27 20:05:19 www5 kernel: scsi2 (0:0): rejecting I/O to dead device Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: fatal: I/O error Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: block = 9498835 Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: function = gfs_logbh_wait Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: file = /usr/src/build/607778-i686/BUILD/smp/src/gfs/dio.c, line = 923 Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: time = 1127844319 Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: about to withdraw from the cluster Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: waiting for outstanding I/O Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: telling LM to withdraw Sep 27 20:05:19 www5 kernel: lock_dlm: withdraw abandoned memory Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: withdrawn Sep 27 20:05:43 www5 kernel: scsi2 (0:0): rejecting I/O to dead device Sep 27 20:05:43 www5 kernel: Buffer I/O error on device dm-3, logical block 20971504 Sep 27 20:05:43 www5 kernel: scsi2 (0:0): rejecting I/O to dead device Sep 27 20:05:43 www5 kernel: Buffer I/O error on device dm-3, logical block 20971504 Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block 20971504 Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block 20971504 Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block 0 Rejecting/lm withdraw did not appear on the third node, also lm withdraw did not appear on www3 [root at www4 ~]# mount /mnt/ /dev/VolGroupDaten01/LogVol00cluster -t gfs mount: /mnt/ is not a block device We need to avoid restarting the server nodes - the volume groups so far are visible and access with eg. fisk is possible. Another single server which only uses a non-cluster LVM2 volume mount worked without reboot. Any help would be really welcome, Thanks Thomas [root at www3 ~]# vgscan Reading all physical volumes. This may take a while... Found volume group "VolGroupDaten02" using metadata type lvm2 Found volume group "VolGroupDaten01" using metadata type lvm2 [root at www3 ~]# lvdisplay VolGroupDaten01 --- Logical volume --- LV Name /dev/VolGroupDaten01/LogVol00cluster VG Name VolGroupDaten01 LV UUID o38bnG-sLSi-WhUJ-47Bs-3u6g-qSUm-5yBkNr LV Write Access read/write LV Status available # open 0 LV Size 80.00 GB Current LE 20480 Segments 1 Allocation inherit Read ahead sectors 0 Block device 253:1 [root at www3 ~]# pvdisplay ... ... ... 
--- Physical volume --- PV Name /dev/sde VG Name VolGroupDaten01 PV Size 540.00 GB / not usable 0 Allocatable yes PE Size (KByte) 4096 Total PE 138239 Free PE 117759 Allocated PE 20480 PV UUID oVeByo-8IoA-qFlt-fsN9-ULAR-xUju-niLTEO [root at www3 ~]# cman_tool status Protocol version: 5.0.1 Config version: 2 Cluster name: xxxcluster Cluster ID: 57396 Cluster Member: Yes Membership state: Cluster-Member Nodes: 3 Expected_votes: 3 Total_votes: 3 Quorum: 2 Active subsystems: 3 Node name: www3.xxx.cc Node addresses: 192.168.2.23 [root at www3 ~]# cman_tool nodes Node Votes Exp Sts Name 1 1 3 M www5.xxx.cc 2 1 3 M www4.xxx.cc 3 1 3 M www3.xxx.cc [root at www3 ~]# cat /etc/cluster/cluster.conf From sgray at bluestarinc.com Wed Sep 28 02:20:55 2005 From: sgray at bluestarinc.com (Sean Gray) Date: Tue, 27 Sep 2005 22:20:55 -0400 Subject: [Linux-cluster] GFS LogVol00cluster.1: withdrawn / rejecting I/O to dead device In-Reply-To: <000301c5c3a8$ce55d1e0$2c01380a@TheCenter> References: <000301c5c3a8$ce55d1e0$2c01380a@TheCenter> Message-ID: <1127874055.3736.250.camel@localhost.localdomain> Thomas, Double check your mount command it should read "mount -t gfs . Boot the bad node and check it with clustat, if OK try restarting fenced an clvmd. # clustat # /etc/init.d/fenced restart # /etc/init.d/clvmd restart # mount -t gfs For some reason it may require a few tries. Sean On Tue, 2005-09-27 at 23:16 +0200, tom-fedora at kofler.eu.org wrote: > Hi, > > we are building a HA cluster with GFS6.1 and Fedora Core 4 > > Our SAN box had an outage and was then reconnected. > > Now, we are unable to mount the clusterfilesystem gfs. > > Sep 27 20:05:19 www5 kernel: scsi2 (0:0): rejecting I/O to dead device > Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: fatal: > I/O error > Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: block > = 9498835 > Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: > function = gfs_logbh_wait > Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: file > = /usr/src/build/607778-i686/BUILD/smp/src/gfs/dio.c, line = 923 > Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: time > = 1127844319 > Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: about > to withdraw from the cluster > Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: waiting > for outstanding I/O > Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: telling > LM to withdraw > Sep 27 20:05:19 www5 kernel: lock_dlm: withdraw abandoned memory > Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: > withdrawn > Sep 27 20:05:43 www5 kernel: scsi2 (0:0): rejecting I/O to dead device > Sep 27 20:05:43 www5 kernel: Buffer I/O error on device dm-3, logical block > 20971504 > Sep 27 20:05:43 www5 kernel: scsi2 (0:0): rejecting I/O to dead device > Sep 27 20:05:43 www5 kernel: Buffer I/O error on device dm-3, logical block > 20971504 > > Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device > Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block > 20971504 > Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device > Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block > 20971504 > Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device > Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block > 0 > > Rejecting/lm withdraw did not appear on the third node, also lm 
withdraw did > not appear on www3 > > [root at www4 ~]# mount /mnt/ /dev/VolGroupDaten01/LogVol00cluster -t gfs > mount: /mnt/ is not a block device > > We need to avoid restarting the server nodes - the volume groups so far are > visible and access with eg. fisk is possible. > Another single server which only uses a non-cluster LVM2 volume mount worked > without reboot. > > Any help would be really welcome, > > Thanks > Thomas > > [root at www3 ~]# vgscan > Reading all physical volumes. This may take a while... > Found volume group "VolGroupDaten02" using metadata type lvm2 > Found volume group "VolGroupDaten01" using metadata type lvm2 > > [root at www3 ~]# lvdisplay VolGroupDaten01 > --- Logical volume --- > LV Name /dev/VolGroupDaten01/LogVol00cluster > VG Name VolGroupDaten01 > LV UUID o38bnG-sLSi-WhUJ-47Bs-3u6g-qSUm-5yBkNr > LV Write Access read/write > LV Status available > # open 0 > LV Size 80.00 GB > Current LE 20480 > Segments 1 > Allocation inherit > Read ahead sectors 0 > Block device 253:1 > > [root at www3 ~]# pvdisplay > > ... > ... > ... > > --- Physical volume --- > PV Name /dev/sde > VG Name VolGroupDaten01 > PV Size 540.00 GB / not usable 0 > Allocatable yes > PE Size (KByte) 4096 > Total PE 138239 > Free PE 117759 > Allocated PE 20480 > PV UUID oVeByo-8IoA-qFlt-fsN9-ULAR-xUju-niLTEO > > > > > [root at www3 ~]# cman_tool status > Protocol version: 5.0.1 > Config version: 2 > Cluster name: xxxcluster > Cluster ID: 57396 > Cluster Member: Yes > Membership state: Cluster-Member > Nodes: 3 > Expected_votes: 3 > Total_votes: 3 > Quorum: 2 > Active subsystems: 3 > Node name: www3.xxx.cc > Node addresses: 192.168.2.23 > > [root at www3 ~]# cman_tool nodes > Node Votes Exp Sts Name > 1 1 3 M www5.xxx.cc > 2 1 3 M www4.xxx.cc > 3 1 3 M www3.xxx.cc > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [root at www3 ~]# cat /etc/cluster/cluster.conf > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Sean N. Gray Director of Information Technology United Radio Incorporated, DBA BlueStar 24 Spiral Drive Florence, Kentucky 41042 office: 859.371.4423 x263 toll free: 800.371.4423 x263 fax: 859.371.4425 mobile: 513.616.3379 -------------- next part -------------- An HTML attachment was scrubbed... URL: From pcaulfie at redhat.com Wed Sep 28 06:51:45 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 28 Sep 2005 07:51:45 +0100 Subject: [Linux-cluster] Question about heart-beat with CS4 In-Reply-To: <1127843559.26042.106.camel@ayanami.boston.redhat.com> References: <43300855.3050709@bull.net> <4338EE0D.6090408@redhat.com> <1127843559.26042.106.camel@ayanami.boston.redhat.com> Message-ID: <433A3D81.4040307@redhat.com> Lon Hohberger wrote: > On Tue, 2005-09-27 at 08:00 +0100, Patrick Caulfield wrote: > > >>It will use the interface bound to the IP address of the hostname by default. If >>you want to use the otehr interface then specify the name associated with the >>inerface on the command-line to "cman_tool join" > > > Just a thought -- would it be useful to add that as an option > to /etc/sysconfig/cman ? I don't think it's currently done (not sure > though) > Yes I think it would be very useful. 
-- patrick From carlopmart at gmail.com Wed Sep 28 08:14:03 2005 From: carlopmart at gmail.com (carlopmart at gmail.com) Date: Wed, 28 Sep 2005 10:14:03 +0200 Subject: [Linux-cluster] Question aout fence_vmware.pl Message-ID: <433A50CB.50509@gmail.com> Hi all, Searching mailing list I have found tis interesting thread: https://www.redhat.com/archives/linux-cluster/2005-September/msg00014.html. Is it possible to use this fence module under GSX ??? Where can I find examples to use?? I visited Zach's web without success. -- CL Martinez carlopmart {at} gmail {d0t} com From tom-fedora at kofler.eu.org Wed Sep 28 11:00:30 2005 From: tom-fedora at kofler.eu.org (Thomas Kofler) Date: Wed, 28 Sep 2005 13:00:30 +0200 Subject: [Linux-cluster] Question aout fence_vmware.pl Message-ID: <1127905230.433a77cef24b0@mail.devcon.cc> Hi, the file itself can be found at: http://sources.redhat.com/cgi- bin/cvsweb.cgi/cluster/fence/agents/vmware/fence_vmware.pl?cvsroot=cluster The usage/parameters are documented in the file. I think it is using the perl API from Vmware - its also available for GSX and should be compatible. "VMware GSX Server provides an easy to use API for control and management. Perl and COM interfaces and command line tools ..." http://www.vmware.com/support/developer/ Would be worth a try - best luck, we are waiting for feedback Regards, Thomas Quoting "carlopmart at gmail.com" : > Hi all, > > Searching mailing list I have found tis interesting thread: > https://www.redhat.com/archives/linux-cluster/2005-September/msg00014.html. > > Is it possible to use this fence module under GSX ??? Where can I > find examples to use?? I visited Zach's web without success. > > > -- > CL Martinez > carlopmart {at} gmail {d0t} com > From fseoane at intelsis.com Fri Sep 30 11:08:46 2005 From: fseoane at intelsis.com (Felipe Seoane) Date: Fri, 30 Sep 2005 13:08:46 +0200 Subject: [Linux-cluster] A question about file locks on node fails Message-ID: <8C5D52BA5F40014CA753D0C04957BF73625781@icorreo.correo2003.com> Hi all, Suppouse that we have 3 nodes (All of them are gulm servers) that mounts a GFS shared filesystem in /raid. The directory structure is the next: /raid |-/node1dir |-/node2dir And suppouse that the node1 writes files only in node1dir and node2 only in node2dir while node3 reads files from both directories. My question is: If node1 fails (then it will fenced) while it was writting a file in node1dir, could be able node3 to read a file in node1dir (not the file that node1 was just writting when had failed)? From pcaulfie at redhat.com Fri Sep 30 13:44:20 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 30 Sep 2005 14:44:20 +0100 Subject: [Linux-cluster] new userland cman Message-ID: <433D4134.6080608@redhat.com> This has got to the stage where I'd be grateful for any testing other people can do, though obviously don't endanger a production system! You should be able to run the DLM and GFS on this, see https://www.redhat.com/archives/linux-cluster/2005-September/msg00177.html for (very) brief instructions. There is a new clvm patch available in the cluster CVS at cman/lib/clvmd-libcman.diff Here's a list of the user-visible changes, please feel free to ask questions on the list. 
good ---- - (optional) encryption & authentication of communications - Multiple interface support (unfinished, needs AIS and cman work) - Automatic re-reading of CCS if a new node joins with an updated config file bad --- - Always uses CCS (cman_tool join -X removed)* - Compulsory static node IDs (easily enforced by GUI or command-line) - Can't have multiple clusters using the same port number unless they use a different encryption key. Currently cluster name is ignored.** - Hard limit to size of cluster (set at compile time to 32 currently)*** neutral ------- - Always uses multicast (no broadcast). A default multicast address is supplied if none is given - libcman is the only API ( a compatible libcman is available for the kernel version) - Simplified CCS schema, but will read old one if it has nodeids in it.**** internal -------- - Usable messaging API - Robust membership algorithm - Community involvement, multiple developers. * I very much doubt that anyone will notice apart from maybe Dave & me ** Could fix this in AIS, but I'm not sure the patch would be popular upstream. It's much more efficient to run them on different ports or multicast addresses anyway. Incidentally: DON'T run an encrypted and a non-encrypted cluster on the same port & multicast address (not that you would!) - the non-encrypted ones will crash. *** I doubt that the old cman worked well above 30 nodes anyway. I intend to do some AIS hacking to improve this situation by drastically reducing the network packet size. **** The main difference here is that the multicast address need only be specified once, in the section of cluster.conf. The interface used will be the one that is bound to the hostname mentioned. patrick -- patrick From Kevin.Ketchum at McKesson.com Fri Sep 30 14:51:42 2005 From: Kevin.Ketchum at McKesson.com (Ketchum, Kevin) Date: Fri, 30 Sep 2005 10:51:42 -0400 Subject: [Linux-cluster] GFS Performance observations/questions Message-ID: We are evaluating GFS as a solution for a clustered application environment. We have set up a 3 node cluster. Each machine is connected to an EMC SAN, all point to the same pool. During some benchmark testing we have observed the following performance: On the local drive, it takes about 10 microseconds to lock a file, and 2 microseconds to unlock the file. On the gfs drive, it takes over 10,000 microseconds (0.01 sec) to lock a file, and over 6000 microseconds (0.006 sec) to unlock the file. We have seen it take considerably longer ... Is this the expected performance? Are there any tuning options available to us? If you need more information to help answer this question or provide guidance, please ask. Thanks Kevin Ketchum -------------- next part -------------- An HTML attachment was scrubbed... URL: From sgray at bluestarinc.com Fri Sep 30 20:44:55 2005 From: sgray at bluestarinc.com (Sean Gray) Date: Fri, 30 Sep 2005 16:44:55 -0400 Subject: [Linux-cluster] script to enable, disable, start, stop and query status of cluster services Message-ID: <1128113095.3539.1148.camel@localhost.localdomain> I found the below useful for my cluster testing, enjoy! #!/bin/bash # Name: cluster # Authors: Sean Gray # Copyright 2005 under the GPL # Version 0.1 # Enable, disable, start, stop and query status of cluster services # on RHEL4. 
# SERVICES="ccsd cman lock_gulmd fenced clvmd rgmanager gfs" STARTORDER="ccsd cman lock_gulmd fenced clvmd gfs rgmanager" STOPORDER="rgmanager gfs clvmd fenced lock_gulmd cman ccsd" enableStuff (){ for SERVICE in `echo $SERVICES`; do chkconfig --level 2345 $SERVICE on; done; for SERVICE in `echo $SERVICES`; do chkconfig --list $SERVICE; done; } disableStuff (){ for SERVICE in `echo $SERVICES`; do chkconfig --level 2345 $SERVICE off; done; for SERVICE in `echo $SERVICES`; do chkconfig --list $SERVICE; done; } startStuff (){ for SERVICE in `echo $STARTORDER`; do service $SERVICE start; done; } stopStuff (){ for SERVICE in `echo $STOPORDER`; do service $SERVICE stop; done; } serviceStatus (){ for SERVICE in `echo $SERVICES`; do echo -e "\033[36m $SERVICE \033[0m" service $SERVICE status; echo -e "\n" done; } case $1 in "enable" ) enableStuff ;; "disable" ) disableStuff ;; "start" ) startStuff ;; "stop" ) stopStuff ;; "status" ) serviceStatus ;; * ) echo -e "Usage: `basename $0` {enable|disable|start|stop| status}" ;; esac Sean N. Gray Director of Information Technology United Radio Incorporated, DBA BlueStar 24 Spiral Drive Florence, Kentucky 41042 office: 859.371.4423 x263 toll free: 800.371.4423 x263 fax: 859.371.4425 mobile: 513.616.3379 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jharr at opsource.net Fri Sep 30 22:12:03 2005 From: jharr at opsource.net (Jeff Harr) Date: Fri, 30 Sep 2005 23:12:03 +0100 Subject: [Linux-cluster] script to enable, disable, start, stop and query status of cluster services Message-ID: <38A48FA2F0103444906AD22E14F1B5A3015F4306@mailxchg01.corp.opsource.net> Hey, that is cool -thanks man :-) ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sean Gray Sent: Friday, September 30, 2005 4:45 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] script to enable, disable, start,stop and query status of cluster services I found the below useful for my cluster testing, enjoy! #!/bin/bash # Name: cluster # Authors: Sean Gray # Copyright 2005 under the GPL # Version 0.1 # Enable, disable, start, stop and query status of cluster services # on RHEL4. # SERVICES="ccsd cman lock_gulmd fenced clvmd rgmanager gfs" STARTORDER="ccsd cman lock_gulmd fenced clvmd gfs rgmanager" STOPORDER="rgmanager gfs clvmd fenced lock_gulmd cman ccsd" enableStuff (){ for SERVICE in `echo $SERVICES`; do chkconfig --level 2345 $SERVICE on; done; for SERVICE in `echo $SERVICES`; do chkconfig --list $SERVICE; done; } disableStuff (){ for SERVICE in `echo $SERVICES`; do chkconfig --level 2345 $SERVICE off; done; for SERVICE in `echo $SERVICES`; do chkconfig --list $SERVICE; done; } startStuff (){ for SERVICE in `echo $STARTORDER`; do service $SERVICE start; done; } stopStuff (){ for SERVICE in `echo $STOPORDER`; do service $SERVICE stop; done; } serviceStatus (){ for SERVICE in `echo $SERVICES`; do echo -e "\033[36m $SERVICE \033[0m" service $SERVICE status; echo -e "\n" done; } case $1 in "enable" ) enableStuff ;; "disable" ) disableStuff ;; "start" ) startStuff ;; "stop" ) stopStuff ;; "status" ) serviceStatus ;; * ) echo -e "Usage: `basename $0` {enable|disable|start|stop|status}" ;; esac Sean N. 
From fedora-tom at kofler.eu.org Wed Sep 28 08:52:35 2005
From: fedora-tom at kofler.eu.org (Thomas Kofler)
Date: Wed, 28 Sep 2005 08:52:35 -0000
Subject: [Linux-cluster] Question about fence_vmware.pl
In-Reply-To: <433A50CB.50509@gmail.com>
References: <433A50CB.50509@gmail.com>
Message-ID: <1127897533.433a59bda54ae@mail.devcon.cc>

Hi,

the file itself can be found at:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/vmware/fence_vmware.pl?cvsroot=cluster

The usage/parameters are documented in the file. I think it is using the Perl
API from VMware - it's also available for GSX and should be compatible.

"VMware GSX Server provides an easy to use API for control and management.
Perl and COM interfaces and command line tools ..."
http://www.vmware.com/support/developer/

It would be worth a try - good luck, we are waiting for feedback.

Regards,
Thomas

Quoting "carlopmart at gmail.com" :

> Hi all,
>
> Searching the mailing list I have found this interesting thread:
> https://www.redhat.com/archives/linux-cluster/2005-September/msg00014.html.
>
> Is it possible to use this fence module under GSX? Where can I
> find examples of its use? I visited Zach's site without success.
>
>
> --
> CL Martinez
> carlopmart {at} gmail {d0t} com
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

From sdake at mvista.com Fri Sep 30 19:40:00 2005
From: sdake at mvista.com (Steven Dake)
Date: Fri, 30 Sep 2005 12:40:00 -0700
Subject: [Linux-cluster] new userland cman
In-Reply-To: <433D4134.6080608@redhat.com>
References: <433D4134.6080608@redhat.com>
Message-ID: <1128109200.8440.14.camel@unnamed.az.mvista.com>

Patrick,

Thanks for the work. I have a few comments inline.

On Fri, 2005-09-30 at 14:44 +0100, Patrick Caulfield wrote:
> This has got to the stage where I'd be grateful for any testing other people
> can do, though obviously don't endanger a production system!
>
> You should be able to run the DLM and GFS on this, see
>
> https://www.redhat.com/archives/linux-cluster/2005-September/msg00177.html
>
> for (very) brief instructions. There is a new clvm patch available in the
> cluster CVS at cman/lib/clvmd-libcman.diff
>
> Here's a list of the user-visible changes, please feel free to ask questions
> on the list.
>
> good
> ----
> - (optional) encryption & authentication of communications
> - Multiple interface support (unfinished, needs AIS and cman work)
> - Automatic re-reading of CCS if a new node joins with an updated config file
>
> bad
> ---
> - Always uses CCS (cman_tool join -X removed)*
> - Compulsory static node IDs (easily enforced by GUI or command-line)
> - Can't have multiple clusters using the same port number unless they use a
> different encryption key. Currently cluster name is ignored.**
> - Hard limit to size of cluster (set at compile time to 32 currently)***
>

I hope to have multiring in 2006; then we should scale to hundreds of
processors...

> neutral
> -------
> - Always uses multicast (no broadcast). A default multicast address is supplied
> if none is given

If broadcast is important, which I guess it may be, we can pretty easily add
this support...
> - libcman is the only API (a compatible libcman is available for the kernel
> version)
> - Simplified CCS schema, but will read old one if it has nodeids in it.****
>
> internal
> --------
> - Usable messaging API
> - Robust membership algorithm
> - Community involvement, multiple developers.
>
> * I very much doubt that anyone will notice apart from maybe Dave & me
>
> ** Could fix this in AIS, but I'm not sure the patch would be popular upstream.
> It's much more efficient to run them on different ports or multicast addresses
> anyway. Incidentally: DON'T run an encrypted and a non-encrypted cluster on the
> same port & multicast address (not that you would!) - the non-encrypted ones
> will crash.
>

On this point: you mention you could fix "this" - do you mean having two
clusters use the same port and IPs? I have also considered this, and I do want
it, by having each "cluster" join a specific group at startup to serve as the
cluster membership view. Unfortunately that would require process group
membership, and the process groups interface is unfinished (totempg.c), so it
isn't possible today. Note I'd take a patch from someone who finished the job
on this interface :)

I, for example, would like communication for a specific checkpoint to go over
a specific named group instead of to everyone connected to totem. Then the clm
could join a group and get membership events, and the checkpoint service for a
specific checkpoint could join a group, communicate on that group, and get
membership events for that group, etc.

What did you have in mind here?

regards
-steve

> *** I doubt that the old cman worked well above 30 nodes anyway. I intend to
> do some AIS hacking to improve this situation by drastically reducing the
> network packet size.
>
> **** The main difference here is that the multicast address need only be
> specified once, in the section of cluster.conf. The interface used will
> be the one that is bound to the hostname mentioned.
>
>
> patrick
>
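As a concrete illustration of the transport being discussed - every node
exchanging cluster messages over a single multicast group - the sketch below
shows the basic socket setup involved, and it doubles as a crude check that
multicast traffic actually reaches a given node. It is not cman or openais
code: the group address and port are example values standing in for whatever
cluster.conf specifies, and the real stack adds interface binding, encryption
and the totem protocol on top of this.

/* mcast_peek.c - join a multicast group and report each datagram seen.
 * The address and port below are example values, not anything cman mandates.
 * Build: gcc -O2 -o mcast_peek mcast_peek.c
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define MCAST_ADDR "239.192.0.1"   /* example group */
#define MCAST_PORT 5405            /* example port  */

int main(void)
{
	int sock;
	struct sockaddr_in local;
	struct ip_mreq mreq;
	char buf[1500];
	ssize_t len;

	sock = socket(AF_INET, SOCK_DGRAM, 0);
	if (sock < 0) {
		perror("socket");
		return 1;
	}

	/* Bind to the port the cluster traffic uses. */
	memset(&local, 0, sizeof(local));
	local.sin_family = AF_INET;
	local.sin_addr.s_addr = htonl(INADDR_ANY);
	local.sin_port = htons(MCAST_PORT);
	if (bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0) {
		perror("bind");
		return 1;
	}

	/* Ask the kernel to join the multicast group on the default interface. */
	mreq.imr_multiaddr.s_addr = inet_addr(MCAST_ADDR);
	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
	if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
		       &mreq, sizeof(mreq)) < 0) {
		perror("IP_ADD_MEMBERSHIP");
		return 1;
	}

	/* Print the size of each datagram that arrives on the group. */
	while ((len = recv(sock, buf, sizeof(buf), 0)) >= 0)
		printf("received %zd bytes\n", len);

	perror("recv");
	close(sock);
	return 0;
}

Pointing something like this at the group and port a cluster is actually
configured to use is a quick way to confirm that switches and firewalls are
passing the multicast traffic between nodes before digging into cman itself.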