From nattaponv at hotmail.com  Thu Sep  1 06:47:51 2005
From: nattaponv at hotmail.com (nattapon viroonsri)
Date: Thu, 01 Sep 2005 06:47:51 +0000
Subject: [Linux-cluster] Rhcs4 split brain prolem with single heartbeat
	cable ?
Message-ID: <BAY22-F14C36A4E3C419E2A6B4BA3A6A00@phx.gbl>

>From config menu in Rhcs4 look like dont require share storage to store 
quorum partion and no ip tie-breaker to config.
And  i use only 1 network cable both for client access resource and 
heartbeat channel.
So  if one node have problem with network connection. the backup node can 
provide service with uninterrupt.
But Can split brain will occur ?
Can both node issue i/o to share at the same time ?

So hould i use 2 heartbeat channel ?
If i use 2 heartbeat connection how can i detect fail of network connection 
for client access resource ( no  ip tie-breaker to config in rhcs4) ?

Regard,
Nattapon

_________________________________________________________________
Don't just search. Find. Check out the new MSN Search! 
http://search.msn.click-url.com/go/onm00200636ave/direct/01/



From teigland at redhat.com  Thu Sep  1 10:46:20 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 1 Sep 2005 18:46:20 +0800
Subject: [Linux-cluster] GFS, what's remaining
Message-ID: <20050901104620.GA22482@redhat.com>

Hi, this is the latest set of gfs patches, it includes some minor munging
since the previous set.  Andrew, could this be added to -mm? there's not
much in the way of pending changes.

http://redhat.com/~teigland/gfs2/20050901/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050901/broken-out/

I'd like to get a list of specific things remaining for merging.  I
believe we've responded to everything from earlier reviews, they were very
helpful and more would be excellent.  The list begins with one item from
before that's still pending:

- Adapt the vfs so gfs (and other cfs's) don't need to walk vma lists.
  [cf. ops_file.c:walk_vm(), gfs works fine as is, but some don't like it.]
...

Thanks
Dave



From arjan at infradead.org  Thu Sep  1 10:42:49 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Thu, 01 Sep 2005 12:42:49 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901104620.GA22482@redhat.com>
References: <20050901104620.GA22482@redhat.com>
Message-ID: <1125571369.5025.0.camel@laptopd505.fenrus.org>

On Thu, 2005-09-01 at 18:46 +0800, David Teigland wrote:
> Hi, this is the latest set of gfs patches, it includes some minor munging
> since the previous set.  Andrew, could this be added to -mm? there's not
> much in the way of pending changes.

can you post them here instead so that they can be actually reviewed?





From akpm at osdl.org  Thu Sep  1 10:59:39 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 1 Sep 2005 03:59:39 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901104620.GA22482@redhat.com>
References: <20050901104620.GA22482@redhat.com>
Message-ID: <20050901035939.435768f3.akpm@osdl.org>

David Teigland <teigland at redhat.com> wrote:
>
> Hi, this is the latest set of gfs patches, it includes some minor munging
>  since the previous set.  Andrew, could this be added to -mm?

Dumb question: why?

Maybe I was asleep, but I don't recall seeing much discussion or exposition
of

- Why the kernel needs two clustered fileystems

- Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
  possibly gain (or vice versa)

- Relative merits of the two offerings

etc.

Maybe this has all been thrashed out and agreed to.  If so, please remind me.



From yanj at brainaire.com  Thu Sep  1 10:02:45 2005
From: yanj at brainaire.com (yanj)
Date: Thu, 1 Sep 2005 18:02:45 +0800
Subject: [Linux-cluster] Re: GFS: Assertion failed + GFS on NO-SMP system
Message-ID: <001e01c5aedc$501c5780$2e00a8c0@yanzijie>

Forget to say:
I have tried the following combination of GFS on Redhat system:
1.  GFS version GFS-6.0.2.25 + Kernel 2.4.21-27.EL
2.  GFS version GFS-6.0.2.20-1 + Kernel 2.4.21-32.EL
 
In both cases, the system keeps ending in ?Kernel Panic?
 
-----Original Message-----
From: yanj [mailto:yanj at brainaire.com] 
Sent: 2005?9?1? 17:58
To: 'linux-cluster at redhat.com'
Subject: GFS: Assertion failed + GFS on NO-SMP system
 
Hi,
 
Could GFS run based on NO-SMP system ? (As what is said on Redhat GFS
manuals)
I am working on GFS+iSCSI+RHEL3 based on a non-SMP machine.
 
I have setup the GFS system, however, if I run IO tests on two nodes
simultaneously. The system keeps crashing. (Kernel panic.)
 
The error message is sth like following
Sep  1 17:18:48 test2 kernel: Bad metadata at 24, should be 4
Sep  1 17:18:48 test2 kernel:   mh_magic = 0xA5A5A5A5
Sep  1 17:18:48 test2 kernel:   mh_type = 2779096485
Sep  1 17:18:48 test2 kernel:   mh_generation = 0
Sep  1 17:18:48 test2 kernel:   mh_format = 0
Sep  1 17:18:48 test2 kernel:   mh_incarn = 0
Sep  1 17:18:48 test2 kernel: cd433ce4 d0918292 00000010 00000000
c0121992 0000000a 00000400 d0937b13 
Sep  1 17:18:48 test2 kernel:        cd433d30 00000018 00000000 d09273ad
cd433d4c 00000018 00000000 00000000 
Sep  1 17:18:48 test2 kernel:        d08fe34d d0938b2c d093673a 000004e5
00000013 d0958000 cd433d48 cd455660 
Sep  1 17:18:48 test2 kernel: Call Trace:   [<d0918292>] gfs_asserti
[gfs] 0x32 (0xcd433ce8)
Sep  1 17:18:48 test2 kernel: [<c0121992>] printk [kernel] 0x122
(0xcd433cf4)
Sep  1 17:18:48 test2 kernel: [<d0937b13>] .rodata.str1.1 [gfs] 0x14a7
(0xcd433d00)
Sep  1 17:18:48 test2 kernel: [<d09273ad>] gfs_meta_header_print [gfs]
0x5d (0xcd433d10)
Sep  1 17:18:48 test2 kernel: [<d08fe34d>] gfs_get_meta_buffer [gfs]
0x2ad (0xcd433d24)
Sep  1 17:18:48 test2 kernel: [<d0938b2c>] .rodata.str1.4 [gfs] 0x3bc
(0xcd433d28)
Sep  1 17:18:48 test2 kernel: [<d093673a>] .rodata.str1.1 [gfs] 0xce
(0xcd433d2c)
Sep  1 17:18:48 test2 kernel: [<d091d0a9>] gfs_copyin_dinode [gfs] 0x39
(0xcd433d7c)
Sep  1 17:18:48 test2 kernel: [<d0902acd>] lock_inode [gfs] 0x8d
(0xcd433dcc)
Sep  1 17:18:48 test2 kernel: [<d0919f8f>] glock_wait_internal [gfs]
0x18f (0xcd433de8)
Sep  1 17:18:48 test2 kernel: [<d091926c>] run_queue [gfs] 0xac
(0xcd433df8)
Sep  1 17:18:48 test2 kernel: [<d0941540>] gfs_inode_glops [gfs] 0x0
(0xcd433e04)
Sep  1 17:18:48 test2 kernel: [<d091a3cf>] gfs_glock_nq [gfs] 0x8f
(0xcd433e18)
Sep  1 17:18:48 test2 kernel: [<d091a9a7>] gfs_glock_nq_init [gfs] 0x37
(0xcd433e3c)
Sep  1 17:18:48 test2 kernel: [<d092c608>] do_quota_sync [gfs] 0x108
(0xcd433e58)
Sep  1 17:18:48 test2 kernel: [<d0917560>] gfs_copy_from_mem [gfs] 0x0
(0xcd433e70)
Sep  1 17:18:48 test2 kernel: [<c011ec9b>] context_switch [kernel] 0x7b
(0xcd433f54)
Sep  1 17:18:48 test2 kernel: [<d092cc9c>] gfs_quota_sync [gfs] 0xcc
(0xcd433f9c)
Sep  1 17:18:48 test2 kernel: [<c012bca0>] process_timeout [kernel] 0x0
(0xcd433fb0)
Sep  1 17:18:48 test2 kernel: [<d090e197>] gfs_quotad [gfs] 0x67
(0xcd433fc8)
Sep  1 17:18:48 test2 kernel: [<d0908810>] gfs_quotad_bounce [gfs] 0x0
(0xcd433fdc)
Sep  1 17:18:48 test2 kernel: [<d090881f>] gfs_quotad_bounce [gfs] 0xf
(0xcd433fe8)
Sep  1 17:18:48 test2 kernel: [<c010945d>] kernel_thread_helper [kernel]
0x5 (0xcd433ff0)
 
End with sth like:
GFS: Assertion failed on line 318 of file trans.c
GFS: assertion:?metatype_check_magic==GFS_magic&&metatype_check_type ==
?.
 
Thanks,
Jeffrey Yan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050901/60172fa9/attachment.htm>

From arjan at infradead.org  Thu Sep  1 11:35:23 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Thu, 01 Sep 2005 13:35:23 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901104620.GA22482@redhat.com>
References: <20050901104620.GA22482@redhat.com>
Message-ID: <1125574523.5025.10.camel@laptopd505.fenrus.org>

On Thu, 2005-09-01 at 18:46 +0800, David Teigland wrote:
> Hi, this is the latest set of gfs patches, it includes some minor munging
> since the previous set.  Andrew, could this be added to -mm? there's not
> much in the way of pending changes.
> 
> http://redhat.com/~teigland/gfs2/20050901/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050901/broken-out/

+static inline void glock_put(struct gfs2_glock *gl)
+{
+	if (atomic_read(&gl->gl_count) == 1)
+		gfs2_glock_schedule_for_reclaim(gl);
+	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
+	atomic_dec(&gl->gl_count);
+}

this code has a race

what is gfs2_assert() about anyway? please just use BUG_ON directly everywhere

+static inline int queue_empty(struct gfs2_glock *gl, struct list_head *head)
+{
+	int empty;
+	spin_lock(&gl->gl_spin);
+	empty = list_empty(head);
+	spin_unlock(&gl->gl_spin);
+	return empty;
+}

that looks like a racey interface to me... if so.. why bother locking at all?
+void gfs2_glock_hold(struct gfs2_glock *gl)
+{
+	glock_hold(gl);
+}

eh why?

+struct gfs2_holder *gfs2_holder_get(struct gfs2_glock *gl, unsigned int state,
+				    int flags, int gfp_flags)
+{
+	struct gfs2_holder *gh;
+
+	gh = kmalloc(sizeof(struct gfs2_holder), GFP_KERNEL | gfp_flags);

this looks odd. Either you take flags or you don't.. this looks really half arsed and thus is really surprising 
to all callers


static int gi_skeleton(struct gfs2_inode *ip, struct gfs2_ioctl *gi,
+		       gi_filler_t filler)
+{
+	unsigned int size = gfs2_tune_get(ip->i_sbd, gt_lockdump_size);
+	char *buf;
+	unsigned int count = 0;
+	int error;
+
+	if (size > gi->gi_size)
+		size = gi->gi_size;
+
+	buf = kmalloc(size, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	error = filler(ip, gi, buf, size, &count);
+	if (error)
+		goto out;
+
+	if (copy_to_user(gi->gi_data, buf, count + 1))
+		error = -EFAULT;

where does count get a sensible value?

+static unsigned int handle_roll(atomic_t *a)
+{
+	int x = atomic_read(a);
+	if (x < 0) {
+		atomic_set(a, 0);
+		return 0;
+	}
+	return (unsigned int)x;
+}

this is just plain scary.


you'll have to post the rest of your patches if you want anyone to look at them...





From penberg at cs.helsinki.fi  Thu Sep  1 12:33:24 2005
From: penberg at cs.helsinki.fi (Pekka Enberg)
Date: Thu, 1 Sep 2005 15:33:24 +0300
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901104620.GA22482@redhat.com>
References: <20050901104620.GA22482@redhat.com>
Message-ID: <84144f020509010533f5f2440@mail.gmail.com>

On 9/1/05, David Teigland <teigland at redhat.com> wrote:
> - Adapt the vfs so gfs (and other cfs's) don't need to walk vma lists.
>   [cf. ops_file.c:walk_vm(), gfs works fine as is, but some don't like it.]

It works fine only if you don't care about playing well with other
clustered filesystems.

                                  Pekka



From alan at lxorguk.ukuu.org.uk  Thu Sep  1 14:49:18 2005
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Thu, 01 Sep 2005 15:49:18 +0100
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901035939.435768f3.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
Message-ID: <1125586158.15768.42.camel@localhost.localdomain>

On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote:
> - Why the kernel needs two clustered fileystems

So delete reiserfs4, FAT, VFAT, ext2, and all the other "junk". 

> - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
>   possibly gain (or vice versa)
> 
> - Relative merits of the two offerings

You missed the important one - people actively use it and have been for
some years. Same reason with have NTFS, HPFS, and all the others. On
that alone it makes sense to include.

Alan



From hch at infradead.org  Thu Sep  1 14:27:08 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Thu, 1 Sep 2005 15:27:08 +0100
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125586158.15768.42.camel@localhost.localdomain>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
Message-ID: <20050901142708.GA24933@infradead.org>

On Thu, Sep 01, 2005 at 03:49:18PM +0100, Alan Cox wrote:
> > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> >   possibly gain (or vice versa)
> > 
> > - Relative merits of the two offerings
> 
> You missed the important one - people actively use it and have been for
> some years. Same reason with have NTFS, HPFS, and all the others. On
> that alone it makes sense to include.

That's GFS.  The submission is about a GFS2 that's on-disk incompatible
to GFS.



From alan at lxorguk.ukuu.org.uk  Thu Sep  1 15:28:30 2005
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Thu, 01 Sep 2005 16:28:30 +0100
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901142708.GA24933@infradead.org>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901142708.GA24933@infradead.org>
Message-ID: <1125588511.15768.52.camel@localhost.localdomain>

> That's GFS.  The submission is about a GFS2 that's on-disk incompatible
> to GFS.

Just like say reiserfs3 and reiserfs4 or ext and ext2 or ext2 and ext3
then. I think the main point still stands - we have always taken
multiple file systems on board and we have benefitted enormously from
having the competition between them instead of a dictat from the kernel
kremlin that 'foofs is the one true way'

Competition will decide if OCFS or GFS is better, or indeed if someone
comes along with another contender that is better still. And competition
will probably get the answer right.

The only thing that is important is we don't end up with each cluster fs
wanting different core VFS interfaces added.

Alan



From lmb at suse.de  Thu Sep  1 15:11:18 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Thu, 1 Sep 2005 17:11:18 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125588511.15768.52.camel@localhost.localdomain>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901142708.GA24933@infradead.org>
	<1125588511.15768.52.camel@localhost.localdomain>
Message-ID: <20050901151118.GV28276@marowsky-bree.de>

On 2005-09-01T16:28:30, Alan Cox <alan at lxorguk.ukuu.org.uk> wrote:

> Competition will decide if OCFS or GFS is better, or indeed if someone
> comes along with another contender that is better still. And competition
> will probably get the answer right.

Competition will come up with the same situation like reiserfs and ext3
and XFS, namely that they'll all be maintained going forward because of,
uhm, political constraints ;-)

But then, as long as they _are_ maintained and play along nicely with
eachother (which, btw, is needed already so that at least data can be
migrated...), I don't really see a problem of having two or three.

> The only thing that is important is we don't end up with each cluster fs
> wanting different core VFS interfaces added.

Indeed.


Sincerely,
    Lars Marowsky-Br?e <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From phillips at istop.com  Thu Sep  1 17:23:07 2005
From: phillips at istop.com (Daniel Phillips)
Date: Thu, 1 Sep 2005 13:23:07 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125586158.15768.42.camel@localhost.localdomain>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
Message-ID: <200509011323.08217.phillips@istop.com>

On Thursday 01 September 2005 10:49, Alan Cox wrote:
> On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote:
> > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> >   possibly gain (or vice versa)
> >
> > - Relative merits of the two offerings
>
> You missed the important one - people actively use it and have been for
> some years. Same reason with have NTFS, HPFS, and all the others. On
> that alone it makes sense to include.

I thought that gfs2 just appeared last month.  Or is it really still just gfs?  
If there are substantive changes from gfs to gfs2 then obviously they have 
had practically zero testing, let alone posted benchmarks, testimonials, etc.  
If it is really still just gfs then the silly-rename should be undone.

Regards,

Daniel



From phillips at istop.com  Thu Sep  1 17:27:42 2005
From: phillips at istop.com (Daniel Phillips)
Date: Thu, 1 Sep 2005 13:27:42 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901104620.GA22482@redhat.com>
References: <20050901104620.GA22482@redhat.com>
Message-ID: <200509011327.42660.phillips@istop.com>

On Thursday 01 September 2005 06:46, David Teigland wrote:
> I'd like to get a list of specific things remaining for merging.

Where are the benchmarks and stability analysis?  How many hours does it 
survive cerberos running on all nodes simultaneously?  Where are the 
testimonials from users?  How long has there been a gfs2 filesystem?  Note 
that Reiser4 is still not in mainline a year after it was first offered, why 
do you think gfs2 should be in mainline after one month?

So far, all catches are surface things like bogus spinlocks.  Substantive 
issues have not even begun to be addressed.  Patience please, this is going 
to take a while.

Regards,

Daniel



From hch at infradead.org  Thu Sep  1 17:56:03 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Thu, 1 Sep 2005 18:56:03 +0100
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125588511.15768.52.camel@localhost.localdomain>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901142708.GA24933@infradead.org>
	<1125588511.15768.52.camel@localhost.localdomain>
Message-ID: <20050901175603.GA6218@infradead.org>

On Thu, Sep 01, 2005 at 04:28:30PM +0100, Alan Cox wrote:
> > That's GFS.  The submission is about a GFS2 that's on-disk incompatible
> > to GFS.
> 
> Just like say reiserfs3 and reiserfs4 or ext and ext2 or ext2 and ext3
> then. I think the main point still stands - we have always taken
> multiple file systems on board and we have benefitted enormously from
> having the competition between them instead of a dictat from the kernel
> kremlin that 'foofs is the one true way'

I didn't say anything agains a particular fs, just that your previous
arguments where utter nonsense.  In fact I think having two or more cluster
filesystems in the tree is a good thing.  Whether the gfs2 code is mergeable
is a completely different question, and it seems at least debatable to
submit a filesystem for inclusion that's still pretty new.

While we're at it I can't find anything describing what gfs2 is about,
what is lacking in gfs, what structual changes did you make, etc..

p.s. why is gfs2 in fs/gfs in the kernel tree?



From lhh at redhat.com  Thu Sep  1 18:11:23 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 01 Sep 2005 14:11:23 -0400
Subject: [Linux-cluster] Fencing Module for VMware ESX and GSX
In-Reply-To: <D0167E4E-813A-4C63-A8B8-48C51316CC9E@zachlowry.net>
References: <D0167E4E-813A-4C63-A8B8-48C51316CC9E@zachlowry.net>
Message-ID: <1125598283.14500.34.camel@ayanami.boston.redhat.com>

On Wed, 2005-08-31 at 10:30 -0500, Zach Lowry wrote:
> Hello!
> 
> I recently deployed GFS between 2 virtual machines on a VMware ESX  
> server and had a problem because there was no fencing module that  
> could handle this architecture. So, using the fence_apc module as a  
> template, I wrote a compatible module, fence_vmware. Now, since I  
> didn't want to rewrite GFS and recompile to use this new module, I am  
> currently using it as a drop-in replacement for the fence_apc module,  
> and using the APC configuration syntax. However, this code seems to  
> work just right when a machine misses sync, it will log into the  
> VMware ESX server and attempt to do a soft reboot of the VM, then a  
> hard reboot if necessary. I hope to see this incorporated into the  
> GFS tree, but if not it will be available on my website.
> 
> Attached is a copy of the source, also available at http:// 
> www.zachlowry.net/software.php, along with a sample configuration.

Is the on/off/reboot case sensitive?

-- Lon



From lhh at redhat.com  Thu Sep  1 18:13:02 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 01 Sep 2005 14:13:02 -0400
Subject: [Linux-cluster] Rhcs4 split brain prolem with single heartbeat
	cable ?
In-Reply-To: <BAY22-F14C36A4E3C419E2A6B4BA3A6A00@phx.gbl>
References: <BAY22-F14C36A4E3C419E2A6B4BA3A6A00@phx.gbl>
Message-ID: <1125598382.14500.37.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-01 at 06:47 +0000, nattapon viroonsri wrote:
> >From config menu in Rhcs4 look like dont require share storage to store 
> quorum partion and no ip tie-breaker to config.
> And  i use only 1 network cable both for client access resource and 
> heartbeat channel.
> So  if one node have problem with network connection. the backup node can 
> provide service with uninterrupt.
> But Can split brain will occur ?
> Can both node issue i/o to share at the same time ?
> 
> So hould i use 2 heartbeat channel ?
> If i use 2 heartbeat connection how can i detect fail of network connection 
> for client access resource ( no  ip tie-breaker to config in rhcs4) ?

CMAN (the cluster manager) ensures no split brain via fencing:  If you
have two nodes and disconnect the cable on one of them, the remaining
(connected) node will fence the node which was disconnected.

To use it, you need two_node=1 and expected_votes=1 in cluster.conf
(which system-config-cluster correctly sets up for you).

As stated previously: Fencing hardware is *required* for RHCS4.

-- Lon



From lhh at redhat.com  Thu Sep  1 18:17:03 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 01 Sep 2005 14:17:03 -0400
Subject: [Linux-cluster] Fencing Module for VMware ESX and GSX
In-Reply-To: <1125598283.14500.34.camel@ayanami.boston.redhat.com>
References: <D0167E4E-813A-4C63-A8B8-48C51316CC9E@zachlowry.net>
	<1125598283.14500.34.camel@ayanami.boston.redhat.com>
Message-ID: <1125598623.14500.39.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-01 at 14:11 -0400, Lon Hohberger wrote:

/me reads code

> Is the on/off/reboot case sensitive?

No.

-- Lon




From lhh at redhat.com  Thu Sep  1 18:26:05 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 01 Sep 2005 14:26:05 -0400
Subject: [Linux-cluster] Fencing Module for VMware ESX and GSX
In-Reply-To: <1125598623.14500.39.camel@ayanami.boston.redhat.com>
References: <D0167E4E-813A-4C63-A8B8-48C51316CC9E@zachlowry.net>
	<1125598283.14500.34.camel@ayanami.boston.redhat.com>
	<1125598623.14500.39.camel@ayanami.boston.redhat.com>
Message-ID: <1125599165.14500.42.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-01 at 14:17 -0400, Lon Hohberger wrote:
> On Thu, 2005-09-01 at 14:11 -0400, Lon Hohberger wrote:
> 
> /me reads code
> 
> > Is the on/off/reboot case sensitive?
> 
> No.

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/vmware/fence_vmware.pl.diff?cvsroot=cluster&r1=NONE&r2=1.1

Please keep me updated if you have changes or you want your name / email
removed from the header.

-- Lon



From pegasus at nerv.eu.org  Thu Sep  1 19:58:36 2005
From: pegasus at nerv.eu.org (Jure =?ISO-8859-2?Q?Pe=E8ar?=)
Date: Thu, 1 Sep 2005 21:58:36 +0200
Subject: [Linux-cluster] partly OT: failover <500ms
Message-ID: <20050901215836.634334a1.pegasus@nerv.eu.org>


Hi all,

Sorry if this is somewhat offtopic here ...

Our telco is looking into linux HA solutions for their VoIP needs. Their
main requirement is that the failover happens in the order of a few 100ms. 

Can redhat cluster be tweaked to work reliably with such short time
periods? This would mean heartbeat on the level of few ms and status probes
on the level of 10ms. Is this even feasible?

Since VoIP is IP anyway, I'm looking into UCARP and stuff like that.
Anything else I should check?


Thanks for answers,

-- 

Jure Pe?ar
http://jure.pecar.org/



From akpm at osdl.org  Thu Sep  1 20:21:04 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 1 Sep 2005 13:21:04 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125586158.15768.42.camel@localhost.localdomain>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
Message-ID: <20050901132104.2d643ccd.akpm@osdl.org>

Alan Cox <alan at lxorguk.ukuu.org.uk> wrote:
>
> On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote:
> > - Why the kernel needs two clustered fileystems
> 
> So delete reiserfs4, FAT, VFAT, ext2, and all the other "junk". 

Well, we did delete intermezzo.

I was looking for technical reasons, please.

> > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> >   possibly gain (or vice versa)
> > 
> > - Relative merits of the two offerings
> 
> You missed the important one - people actively use it and have been for
> some years. Same reason with have NTFS, HPFS, and all the others. On
> that alone it makes sense to include.

Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
the technical reasons for merging gfs[2], ocfs2, both or neither?

If one can be grown to encompass the capabilities of the other then we're
left with a bunch of legacy code and wasted effort.

I'm not saying it's wrong.  But I'd like to hear the proponents explain why
it's right, please.



From lhh at redhat.com  Thu Sep  1 21:39:59 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 01 Sep 2005 17:39:59 -0400
Subject: [Linux-cluster] partly OT: failover <500ms
In-Reply-To: <20050901215836.634334a1.pegasus@nerv.eu.org>
References: <20050901215836.634334a1.pegasus@nerv.eu.org>
Message-ID: <1125610799.14500.105.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-01 at 21:58 +0200, Jure Pe?ar wrote:
> Hi all,
> 
> Sorry if this is somewhat offtopic here ...
> 
> Our telco is looking into linux HA solutions for their VoIP needs. Their
> main requirement is that the failover happens in the order of a few 100ms. 
> 
> Can redhat cluster be tweaked to work reliably with such short time
> periods? This would mean heartbeat on the level of few ms and status probes
> on the level of 10ms. Is this even feasible?

Possibly, I don't think it can do it right now.  A couple of things to
remember:

* For such a fast requirement, you'll want a dedicated network for
cluster traffic and a real-time kernel.

* Also, "detection and initiation of recovery" is all the cluster
software can do for you; your application - by itself - may take longer
than this to recover.

* It's practically impossible to guarantee completion of I/O fencing in
this amount of time, so your application must be able to do without, or
you need to create a new specialized fencing mechanism which is
guaranteed to complete within a very fast time.

* I *think* CMAN is currently at the whole-second granularity, so some
changes would need to be made to give it finer granularity.  This
shouldn't be difficult (but I'll let the developers of CMAN answer this
definitively, though... ;) )

* Clumanager 1.2.x (RHCS3) can theoretically operate at sub-second
failure detection, but not at the levels you require (also, doing so is
not tested nor supported anyway). 

-- Lon



From treddy at rallydev.com  Thu Sep  1 23:03:23 2005
From: treddy at rallydev.com (Tarun Reddy)
Date: Thu, 1 Sep 2005 17:03:23 -0600
Subject: [Linux-cluster] RHEL/RHCS3: /usr/lib/clumanager/services/service
	status # stays up
In-Reply-To: <1125505731.21943.82.camel@ayanami.boston.redhat.com>
References: <437DFE3B-80E4-4D14-A4A1-DDE56BD2ED5B@rallydev.com>
	<1125505731.21943.82.camel@ayanami.boston.redhat.com>
Message-ID: <06A0AC65-147C-465B-833B-B6D818D693B3@rallydev.com>

I believe it may have been a deadlocking issue when my status check  
was at 1 second. I had thought it was 1 minute. When I moved it to 60  
seconds, the issue disappeared, mostly. There were occasionally a  
few /usr/lib/clumanager/service/status scripts running for days, and  
they generally are bunched together.

Tarun

On Aug 31, 2005, at 10:28 AM, Lon Hohberger wrote:

> On Thu, 2005-08-18 at 14:12 -0600, Tarun Reddy wrote:
>
>
>> Anybody venture a guess as to why this might be occurring? And are my
>> check intervals too low?
>>
>
> What are they set to?
>
> You get multiple running at the same time if you have multiple  
> services.
>
> -- Lon
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>



From gshi at ncsa.uiuc.edu  Thu Sep  1 23:47:08 2005
From: gshi at ncsa.uiuc.edu (Guochun Shi)
Date: Thu, 01 Sep 2005 18:47:08 -0500
Subject: [Linux-cluster] compiling CVS head failed with kernel 2.6.13-mm1
Message-ID: <5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu>

Hi,

I tried to compile the CVS head with kernel_src=2.6.13-mm1

  CC [M]  /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.o
/home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c: In function `make_flags':
/home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: `DLM_LKF_NOORDER' undeclared (first use in this function)
/home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: (Each undeclared identifier is reported only once
/home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error: for each function it appears in.)
/home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:124: error: `DLM_LKF_HEADQUE' undeclared (first use in this function)
/home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:129: error: `DLM_LKF_ALTCW' undeclared (first use in this function)
/home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:131: error: `DLM_LKF_ALTPR' undeclared (first use in this function)


What I do is ./configure kernel_src=<kernel src path>; make 

Am I missing anything?
Thanks
-Guochun



From teigland at redhat.com  Fri Sep  2 05:15:13 2005
From: teigland at redhat.com (David Teigland)
Date: Fri, 2 Sep 2005 13:15:13 +0800
Subject: [Linux-cluster] compiling CVS head failed with kernel 2.6.13-mm1
In-Reply-To: <5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu>
References: <5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu>
Message-ID: <20050902051512.GC12084@redhat.com>

On Thu, Sep 01, 2005 at 06:47:08PM -0500, Guochun Shi wrote:
> I tried to compile the CVS head with kernel_src=2.6.13-mm1
>
>   CC [M]  /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.o
>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c: In function
>   `make_flags':
>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error:
>   `DLM_LKF_NOORDER' undeclared (first use in this function)
>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error:
>   (Each undeclared identifier is reported only once
>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error:
>   for each function it appears in.)
>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:124: error:
>   `DLM_LKF_HEADQUE' undeclared (first use in this function)
>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:129: error:
>   `DLM_LKF_ALTCW' undeclared (first use in this function)
>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:131: error:
>   `DLM_LKF_ALTPR' undeclared (first use in this function)
> 
> What I do is ./configure kernel_src=<kernel src path>; make 

It works for me, could you send the whole output of the make like below?
Dave

[gfs-kernel/src/dlm]% make
if [ ! -e linux ]; then ln -s . linux; fi
if [ ! -e lm_interface.h ]; then ln -s ../../src/harness/lm_interface.h .; fi
if [ ! -e dlm.h ]; then cp ../../../dlm-kernel/src2/dlm.h .; fi
make -C /opt/kernels/linux-2.6.13-mm1-build/ M=/opt/tmp/cluster-HEAD/gfs-kernel/src/dlm modules USING_KBUILD=yes
make[1]: Entering directory `/opt/kernels/linux-2.6.13-mm1-build'

  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock.o
  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/main.o
  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/mount.o
  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/thread.o
  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/sysfs.o
  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/plock.o
  LD [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.o
  Building modules, stage 2.
  MODPOST
  CC      /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.mod.o
  LD [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.ko
make[1]: Leaving directory `/opt/kernels/linux-2.6.13-mm1-build'



From teigland at redhat.com  Fri Sep  2 07:04:49 2005
From: teigland at redhat.com (David Teigland)
Date: Fri, 2 Sep 2005 15:04:49 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901175603.GA6218@infradead.org>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901142708.GA24933@infradead.org>
	<1125588511.15768.52.camel@localhost.localdomain>
	<20050901175603.GA6218@infradead.org>
Message-ID: <20050902070449.GA16595@redhat.com>

On Thu, Sep 01, 2005 at 06:56:03PM +0100, Christoph Hellwig wrote:

> Whether the gfs2 code is mergeable is a completely different question,
> and it seems at least debatable to submit a filesystem for inclusion

I actually asked what needs to be done for merging.  We appreciate the
feedback and are carefully studying and working on all of it as usual.
We'd also appreciate help, of course, if that sounds interesting to
anyone.

Thanks
Dave



From pcaulfie at redhat.com  Fri Sep  2 07:03:33 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 02 Sep 2005 08:03:33 +0100
Subject: [Linux-cluster] partly OT: failover <500ms
In-Reply-To: <1125610799.14500.105.camel@ayanami.boston.redhat.com>
References: <20050901215836.634334a1.pegasus@nerv.eu.org>
	<1125610799.14500.105.camel@ayanami.boston.redhat.com>
Message-ID: <4317F945.5050307@redhat.com>

Lon Hohberger wrote:
> On Thu, 2005-09-01 at 21:58 +0200, Jure Pe?ar wrote:
> 
>>Hi all,
>>
>>Sorry if this is somewhat offtopic here ...
>>
>>Our telco is looking into linux HA solutions for their VoIP needs. Their
>>main requirement is that the failover happens in the order of a few 100ms. 
>>
>>Can redhat cluster be tweaked to work reliably with such short time
>>periods? This would mean heartbeat on the level of few ms and status probes
>>on the level of 10ms. Is this even feasible?
> 
> 
> Possibly, I don't think it can do it right now.  A couple of things to
> remember:
> 
> * For such a fast requirement, you'll want a dedicated network for
> cluster traffic and a real-time kernel.
> 
> * Also, "detection and initiation of recovery" is all the cluster
> software can do for you; your application - by itself - may take longer
> than this to recover.
> 
> * It's practically impossible to guarantee completion of I/O fencing in
> this amount of time, so your application must be able to do without, or
> you need to create a new specialized fencing mechanism which is
> guaranteed to complete within a very fast time.
> 
> * I *think* CMAN is currently at the whole-second granularity, so some
> changes would need to be made to give it finer granularity.  This
> shouldn't be difficult (but I'll let the developers of CMAN answer this
> definitively, though... ;) )
> 

All true :) All cman timers are calibrated in seconds. I did run some tests a
while ago with them in milliseconds and 100ms timeouts and it worked
/reasonably/ well. However, without an RT kernel I wouldn't like to put this
into a production system - we've had several instances of the cman kernel thread
(which runs at the top RT priority) being stalled for up to 5 seconds and that
node being fenced. Smaller stalls may be more common so with timeouts set that
low you may well get nodes fenced for small delays.

To be quite honest I'm not really sure what causes these stalls, as they
generally happen under heavy IO load I assume (possibly wrongly) that they are
related to disk flushes but someone who knows the VM better may out me right on
this.


-- 

patrick



From teigland at redhat.com  Fri Sep  2 09:44:03 2005
From: teigland at redhat.com (David Teigland)
Date: Fri, 2 Sep 2005 17:44:03 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125574523.5025.10.camel@laptopd505.fenrus.org>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
Message-ID: <20050902094403.GD16595@redhat.com>

On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:

> +	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);

> what is gfs2_assert() about anyway? please just use BUG_ON directly
> everywhere

When a machine has many gfs file systems mounted at once it can be useful
to know which one failed.  Does the following look ok?

#define gfs2_assert(sdp, assertion)                                       \
do {                                                                      \
        if (unlikely(!(assertion))) {                                     \
                printk(KERN_ERR                                           \
                        "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
                        "GFS2: fsid=%s:   function = %s\n"                \
                        "GFS2: fsid=%s:   file = %s, line = %u\n"         \
                        "GFS2: fsid=%s:   time = %lu\n",                  \
                        sdp->sd_fsname, # assertion,                      \
                        sdp->sd_fsname,  __FUNCTION__,                    \
                        sdp->sd_fsname, __FILE__, __LINE__,               \
                        sdp->sd_fsname, get_seconds());                   \
                BUG();                                                    \
        }                                                                 \
} while (0)



From joern at wohnheim.fh-wedel.de  Fri Sep  2 11:46:09 2005
From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel)
Date: Fri, 2 Sep 2005 13:46:09 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050902094403.GD16595@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
	<20050902094403.GD16595@redhat.com>
Message-ID: <20050902114609.GA11059@wohnheim.fh-wedel.de>

On Fri, 2 September 2005 17:44:03 +0800, David Teigland wrote:
> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> 
> > +	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> 
> > what is gfs2_assert() about anyway? please just use BUG_ON directly
> > everywhere
> 
> When a machine has many gfs file systems mounted at once it can be useful
> to know which one failed.  Does the following look ok?
> 
> #define gfs2_assert(sdp, assertion)                                       \
> do {                                                                      \
>         if (unlikely(!(assertion))) {                                     \
>                 printk(KERN_ERR                                           \
>                         "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
>                         "GFS2: fsid=%s:   function = %s\n"                \
>                         "GFS2: fsid=%s:   file = %s, line = %u\n"         \
>                         "GFS2: fsid=%s:   time = %lu\n",                  \
>                         sdp->sd_fsname, # assertion,                      \
>                         sdp->sd_fsname,  __FUNCTION__,                    \
>                         sdp->sd_fsname, __FILE__, __LINE__,               \
>                         sdp->sd_fsname, get_seconds());                   \
>                 BUG();                                                    \
>         }                                                                 \
> } while (0)

That's a lot of string constants.  I'm not sure how smart current
versions of gcc are, but older ones created a new constant for each
invocation of such a macro, iirc.  So you might want to move the code
out of line.

J?rn

-- 
There's nothing better for promoting creativity in a medium than
making an audience feel "Hmm ? I could do better than that!"
-- Douglas Adams in a slashdot interview



From hzhong at cisco.com  Thu Sep  1 18:47:54 2005
From: hzhong at cisco.com (Hua Zhong (hzhong))
Date: Thu, 1 Sep 2005 11:47:54 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
Message-ID: <75D9B5F4E50C8B4BB27622BD06C2B82B7B8E04@xmb-sjc-235.amer.cisco.com>

I just started looking at gfs. To understand it you'd need to look at it
from the entire cluster solution point of view.

This is a good document from David. It's not about GFS in particular but
about the architecture of the cluster.

http://people.redhat.com/~teigland/sca.pdf 

Hua

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Christoph Hellwig
> Sent: Thursday, September 01, 2005 10:56 AM
> To: Alan Cox
> Cc: Christoph Hellwig; Andrew Morton; 
> linux-fsdevel at vger.kernel.org; linux-cluster at redhat.com; 
> linux-kernel at vger.kernel.org
> Subject: [Linux-cluster] Re: GFS, what's remaining
> 
> On Thu, Sep 01, 2005 at 04:28:30PM +0100, Alan Cox wrote:
> > > That's GFS.  The submission is about a GFS2 that's 
> on-disk incompatible
> > > to GFS.
> > 
> > Just like say reiserfs3 and reiserfs4 or ext and ext2 or 
> ext2 and ext3
> > then. I think the main point still stands - we have always taken
> > multiple file systems on board and we have benefitted 
> enormously from
> > having the competition between them instead of a dictat 
> from the kernel
> > kremlin that 'foofs is the one true way'
> 
> I didn't say anything agains a particular fs, just that your previous
> arguments where utter nonsense.  In fact I think having two 
> or more cluster
> filesystems in the tree is a good thing.  Whether the gfs2 
> code is mergeable
> is a completely different question, and it seems at least debatable to
> submit a filesystem for inclusion that's still pretty new.
> 
> While we're at it I can't find anything describing what gfs2 is about,
> what is lacking in gfs, what structual changes did you make, etc..
> 
> p.s. why is gfs2 in fs/gfs in the kernel tree?
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Fri Sep  2 14:52:14 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 02 Sep 2005 10:52:14 -0400
Subject: [Linux-cluster] RHEL/RHCS3:
	/usr/lib/clumanager/services/service status # stays up
In-Reply-To: <06A0AC65-147C-465B-833B-B6D818D693B3@rallydev.com>
References: <437DFE3B-80E4-4D14-A4A1-DDE56BD2ED5B@rallydev.com>
	<1125505731.21943.82.camel@ayanami.boston.redhat.com>
	<06A0AC65-147C-465B-833B-B6D818D693B3@rallydev.com>
Message-ID: <1125672734.14500.127.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-01 at 17:03 -0600, Tarun Reddy wrote:
> I believe it may have been a deadlocking issue when my status check  
> was at 1 second. I had thought it was 1 minute. When I moved it to 60  
> seconds, the issue disappeared, mostly. There were occasionally a  
> few /usr/lib/clumanager/service/status scripts running for days, and  
> they generally are bunched together.

How long does your status script take to complete normally?

There is a known issue where if more than one service operation is
requested for a given service, then the service manager will block... 

This might be what you are seeing, but only if (under some situations)
the status check is taking longer than the status check interval.

Can you bugzilla this?

-- Lon



From gshi at ncsa.uiuc.edu  Fri Sep  2 18:00:33 2005
From: gshi at ncsa.uiuc.edu (Guochun Shi)
Date: Fri, 02 Sep 2005 13:00:33 -0500
Subject: [Linux-cluster] compiling CVS head failed with kernel 2.6.13-mm1
In-Reply-To: <20050902051512.GC12084@redhat.com>
References: <5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu>
	<5.1.0.14.2.20050901184428.04211f60@pop.ncsa.uiuc.edu>
Message-ID: <5.1.0.14.2.20050902130003.04207c68@pop.ncsa.uiuc.edu>

David, 

It works for me now. It turns out that there is an old copy of dlm.h in cluster/gfs-kernel/src/dlm. After I deleted it, it compiles fine.
BTW, I need to add -lpthread ccs_tool/Makefile to make it compile, I have seen other people have the same problem in the mailing list.

Index: Makefile
===================================================================
RCS file: /cvs/cluster/cluster/ccs/ccs_tool/Makefile,v
retrieving revision 1.7
diff -u -r1.7 Makefile
--- Makefile 19 May 2005 19:50:55 -0000 1.7
+++ Makefile 2 Sep 2005 17:37:55 -0000
@@ -25,7 +25,7 @@
`xml2-config --cflags` -DCCS_RELEASE_NAME=\"${RELEASE}\"
endif

-LDFLAGS+= -L${ccs_libdir} `xml2-config --libs` -L${magmalibdir} -L${libdir}
+LDFLAGS+= -L${ccs_libdir} `xml2-config --libs` -L${magmalibdir} -L${libdir} -lpthread
LOADLIBES+= -lccs -lmagma -lmagmamsg -ldl



-Guochun


At 01:15 PM 9/2/2005 +0800, you wrote:
>On Thu, Sep 01, 2005 at 06:47:08PM -0500, Guochun Shi wrote:
>> I tried to compile the CVS head with kernel_src=2.6.13-mm1
>>
>>   CC [M]  /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.o
>>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c: In function
>>   `make_flags':
>>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error:
>>   `DLM_LKF_NOORDER' undeclared (first use in this function)
>>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error:
>>   (Each undeclared identifier is reported only once
>>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:123: error:
>>   for each function it appears in.)
>>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:124: error:
>>   `DLM_LKF_HEADQUE' undeclared (first use in this function)
>>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:129: error:
>>   `DLM_LKF_ALTCW' undeclared (first use in this function)
>>   /home/gshi/gfs/gfs-cvs/cluster/gfs-kernel/src/dlm/lock.c:131: error:
>>   `DLM_LKF_ALTPR' undeclared (first use in this function)
>> 
>> What I do is ./configure kernel_src=<kernel src path>; make 
>
>It works for me, could you send the whole output of the make like below?
>Dave
>
>[gfs-kernel/src/dlm]% make
>if [ ! -e linux ]; then ln -s . linux; fi
>if [ ! -e lm_interface.h ]; then ln -s ../../src/harness/lm_interface.h .; fi
>if [ ! -e dlm.h ]; then cp ../../../dlm-kernel/src2/dlm.h .; fi
>make -C /opt/kernels/linux-2.6.13-mm1-build/ M=/opt/tmp/cluster-HEAD/gfs-kernel/src/dlm modules USING_KBUILD=yes
>make[1]: Entering directory `/opt/kernels/linux-2.6.13-mm1-build'
>
>  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock.o
>  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/main.o
>  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/mount.o
>  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/thread.o
>  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/sysfs.o
>  CC [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/plock.o
>  LD [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.o
>  Building modules, stage 2.
>  MODPOST
>  CC      /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.mod.o
>  LD [M]  /opt/tmp/cluster-HEAD/gfs-kernel/src/dlm/lock_dlm.ko
>make[1]: Leaving directory `/opt/kernels/linux-2.6.13-mm1-build'



From bmarzins at redhat.com  Fri Sep  2 22:27:58 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Fri, 2 Sep 2005 17:27:58 -0500
Subject: [Linux-cluster] Re: If I have 5 GNBD server?
In-Reply-To: <43151FAD.20803@telkom.co.id>
References: <20050827004859.D66965A8690@mail.silvercash.com>
	<43126996.8000606@telkom.co.id>
	<20050829205951.GK12333@phlogiston.msp.redhat.com>
	<4313B938.2090409@telkom.co.id>
	<20050830195240.GM12333@phlogiston.msp.redhat.com>
	<43151FAD.20803@telkom.co.id>
Message-ID: <20050902222758.GH6671@phlogiston.msp.redhat.com>

On Wed, Aug 31, 2005 at 10:10:37AM +0700, Fajar A. Nugraha wrote:
> Benjamin Marzinski wrote:
> 
> >On Tue, Aug 30, 2005 at 08:41:12AM +0700, Fajar A. Nugraha wrote:
> > 
> >
> >>Benjamin Marzinski wrote:
> >>
> >>   
> >>
> >>>If the gnbds are exported uncached (the default), the client will fail 
> >>>back IO
> >>>if it can no longer talk to the server after a specified timeout.  
> >>>
> >>>     
> >>>
> >>What is the default timeout anyway, and how can I set it?
> >>Last time I test gnbd-import timeout was on a development version 
> >>(DEVEL.1104982050) and after more than 30 minutes, the client still 
> >>tries to reconnect.
> >>   
> >>
> >
> >The default timeout is 1 minute. It is tuneable with the -t option (see the
> >gnbd man page). However you only timeout if you export the device in 
> >uncached
> >mode.
> >
> > 
> >
> I find something interesting :
> gnbd_import man page (no mention of timeout):
>       -t server
>              Fence from Server.
>              Specify a server for the IO fence (only used with the -s 
> option).
> 
> gnbd_export man page :
>       -t [seconds]
>              Timeout.
>              Set the exported GNBD to timeout mode 
>              This option is used with -p.
>              This  is  the  default  for uncached  GNBDs
> 
> Isn't the client the one that has to determine whether it's in wait mode 
> or timeout mode? How does the parameter from gnbd_export passed to 
> gnbd_import?

No, the server determines it. This information is passed to the client when
it imports the device.
 
> I tested it today with gnbd 1.00.00, by adding an extra ip address to 
> the server -> gnbd_export on the server (IP address 192.168.17.193, 
> cluster member, no extra parameter, so it should be exported as uncached 
> gnbd in timeout mode) -> gnbd_import on the client (member of a 
> different cluster) -> mount the gnbd_import -> remove the IP addresss 
> 192.168.17.193 from the server -> do df -k on the client, and I got 
> these on the client's syslog

Gnbd won't fail the requests back until it can fence the server.  Since the
server is in another cluster, you cannot fence it. For uncached mode to work,
the gnbd client and server MUST be in the same cluster.

> Aug 31 09:55:58 node1 gnbd_recvd[9792]: client lost connection with 
> 192.168.17.193 : Interrupted system call
> Aug 31 09:55:58 node1 gnbd_recvd[9792]: reconnecting
> Aug 31 09:55:58 node1 kernel: gnbd (pid 9792: gnbd_recvd) got signal 1
> Aug 31 09:55:58 node1 kernel: gnbd2: Receive control failed (result -4)
> Aug 31 09:55:58 node1 kernel: gnbd2: shutting down socket
> Aug 31 09:55:58 node1 kernel: exitting GNBD_DO_IT ioctl
> Aug 31 09:56:03 node1 gnbd_monitor[9781]: ERROR [gnbd_monitor.c:486] 
> server D?? is not a cluster member, cannot fence.
> Aug 31 09:56:08 node1 gnbd_monitor[9781]: ERROR [gnbd_monitor.c:486] 
> server D?? is not a cluster member, cannot fence.
> Aug 31 09:56:08 node1 gnbd_recvd[9792]: ERROR [gnbd_recvd.c:213] cannot 
> connect to server 192.168.17.193 (-1) : Interrupted system call
> Aug 31 09:56:08 node1 gnbd_recvd[9792]: reconnecting
> Aug 31 09:56:13 node1 gnbd_monitor[9781]: ERROR [gnbd_monitor.c:486] 
> server D?? is not a cluster member, cannot fence.
> Aug 31 09:56:13 node1 gnbd_recvd[9792]: ERROR [gnbd_recvd.c:213] cannot 
> connect to server 192.168.17.193 (-1) : Interrupted system call
> Aug 31 09:56:13 node1 gnbd_recvd[9792]: reconnecting
> 
> And it goes on, and on, and on :) After ten minutes, I add the IP 
> address back to the server and these appear on syslog :
> Aug 31 10:06:13 node1 gnbd_recvd[9792]: reconnecting
> Aug 31 10:06:16 node1 kernel: resending requests
> 
> So it looks like by default gnbd runs in wait mode, and after it 
> reconnects the kernel automatically resends the request without the need 
> of dm-multipath.
> 
> Is my setup incorrect, or is this how it's supposed to work?

Unfortunately, your setup allows the possibility of data corruption if you
actually faile over between servers.  Here's why.  GNBD must fence the server
before it fails over.  Otherwise you run into the following situation:
You have a gnbd client, and two servers (serverA and serverB).  The client
writes data to a block on serverA, but serverA becomes unresponsive before the
data is written out to disk. The client fails over to serverB and writes out
the data to that block. Later the client writes new data to the same block.
After this, serverA suddenly wakes back up, and completes writing the old data
from the original request to that block.  You have now corrupted your block
device.  I have seen this happen multiple times.

In your setup, since the client and server are in different clusters, gnbd
cannot fence the server. This keeps the requests from failing out.  If you
switch the ip, gnbd has no way of knowing that this is no longer the same
physical machine (Which should be fixed.. In future releases, I will probably
make gnbd make sure that this is the same machine. Not just the same IP,
otherwise, people could do just this sort of thing, and accidentally corrupt
their data. If you switched IP addresses like this with cached devices, the
chance of corrupting your data would become disturbingly likely). When you
gnbd can connect to a server on the same ip, it assumes that the old server
came back before it could be fenced, and resends the requests.

-Ben

> Regards,
> 
> Fajar
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From mark.fasheh at oracle.com  Sat Sep  3 00:16:28 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Fri, 2 Sep 2005 17:16:28 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <p73fysnqiej.fsf@verdi.suse.de>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<p73fysnqiej.fsf@verdi.suse.de>
Message-ID: <20050903001628.GH21228@ca-server1.us.oracle.com>

On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> The only thing that should be probably resolved is a common API
> for at least the clustered lock manager. Having multiple
> incompatible user space APIs for that would be sad.
As far as userspace dlm apis go, dlmfs already abstracts away a large part
of the dlm interaction, so writing a module against another dlm looks like
it wouldn't be too bad (startup of a lockspace is probably the most
difficult part there).
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From gshi at ncsa.uiuc.edu  Sat Sep  3 00:30:50 2005
From: gshi at ncsa.uiuc.edu (Guochun Shi)
Date: Fri, 02 Sep 2005 19:30:50 -0500
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <42FB2EDA.4010300@redhat.com>
References: <42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com>
	<1122318870.12824.29.camel@localhost.localdomain>
	<42EF4AD1.6010809@redhat.com>
	<1123263949.16923.23.camel@localhost.localdomain>
	<42F77AA3.80000@redhat.com>
Message-ID: <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>

Patrick,
can you describe the steps changed for CVS version compared to those in usage.txt  in order to make gfs2 work? 

Thanks
-Guochun

At 11:56 AM 8/11/2005 +0100, you wrote:
>For those not reading the commit list the ais-based cman is now in CVS - be
>careful with it...
>
>For the moment it downloads a prepackaged/patched version of the openais source
>from my people.redhat.com web site. This /will/ change. In fact the only
>additional patch in there is one Steven posted to the openais mailing list so
>don't think I'm hiding anything!
>        
>There's still a lot of work to do on this code but is basically works with a few
>caveats:
>        
>1. Barriers are completely untested and may not work at all.
>2. Don't start several nodes up at the same time, they might get the same
>   node ID(!) unless you used static node IDs.
>3. The exec path for cmand is hard coded (in the Makefile) to ../daemon/cmand
>   so you must currently always run cman_tool from the dev directory unless
>   you change it.
>4. Broadcast is no longer supported. If you fail to specify a multicast address
>   cman_tool will provide one.
>5. IPv6 is unsupported, I'm going to start on that next!
>6. Error reporting is probably rubbish.
>
>Generally it seems to work. I can certainly get the DLM up with it now.
>-- 
>
>patrick
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster



From teigland at redhat.com  Sat Sep  3 05:18:41 2005
From: teigland at redhat.com (David Teigland)
Date: Sat, 3 Sep 2005 13:18:41 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901132104.2d643ccd.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
Message-ID: <20050903051841.GA13211@redhat.com>

On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote:
> Alan Cox <alan at lxorguk.ukuu.org.uk> wrote:
> > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > >   possibly gain (or vice versa)
> > > 
> > > - Relative merits of the two offerings
> > 
> > You missed the important one - people actively use it and have been for
> > some years. Same reason with have NTFS, HPFS, and all the others. On
> > that alone it makes sense to include.
> 
> Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> the technical reasons for merging gfs[2], ocfs2, both or neither?
> 
> If one can be grown to encompass the capabilities of the other then we're
> left with a bunch of legacy code and wasted effort.

GFS is an established fs, it's not going away, you'd be hard pressed to
find a more widely used cluster fs on Linux.  GFS is about 10 years old
and has been in use by customers in production environments for about 5
years.  It is a mature, stable file system with many features that have
been technically refined over years of experience and customer/user
feedback.  The latest development cycle (GFS2) has focussed on improving
performance, it's not a new file system -- the "2" indicates that it's not
ondisk compatible with earlier versions.

OCFS2 is a new file system.  I expect they'll want to optimize for their
own unique goals.  When OCFS appeared everyone I know accepted it would
coexist with GFS, each in their niche like every other fs.  That's good,
OCFS and GFS help each other technically even though they may eventually
compete in some areas (which can also be good.)

Dave

Here's a random summary of technical features:

- cluster infrastructure: a lot of work, perhaps as much as gfs itself,
  has gone into the infrastructure surrounding and supporting gfs
- cluster infrastructure allows for easy cooperation with CLVM
- interchangable lock/cluster modules:  gfs interacts with the external
  infrastructure, including lock manager, through an interchangable
  module allowing the fs to be adapted to different environments.
- a "nolock" module can be plugged in to use gfs as a local fs
  (can be selected at mount time, so any fs can be mounted locally)
- quotas, acls, cluster flocks, direct io, data journaling,
  ordered/writeback journaling modes -- all supported
- gfs transparently switches to a different locking scheme for direct io
  allowing parallel non-allocating writes with no lock contention
- posix locks -- supported, although it's being reworked for better
  performance right now
- asynchronous locking, lock prefetching + read-ahead
- coherent shared-writeable memory mappings across the cluster
- nfs3 support (multiple nfs servers exporting one gfs is very common)
- extend fs online, add journals online
- full fs quiesce to allow for block level snapshot below gfs
- read-only mount
- "specatator" mount (like ro but no journal allocated for the mount,
  no fencing needed for failed node that was mounted as specatator)
- infrastructure in place for live ondisk inode migration, fs shrink
- stuffed dinodes, small files are stored in the disk inode block
- tunable (fuzzy) atime updates
- fast, nondisruptive stat on files during non-allocating direct-io
- fast, nondisruptive statfs (df) even during heavy fs usage
- friendly handling of io errors: shut down fs and withdraw from cluster
- largest GFS cluster deployed was around 200 nodes, most are much smaller
- use many GFS file systems at once on a node and in a cluster
- customers use GFS for: scientific apps, HA, NFS serving, database,
  others I'm sure
- graphical management tools for gfs, clvm, and the cluster infrastruture
  exist and are improving quickly



From phillips at istop.com  Sat Sep  3 05:57:31 2005
From: phillips at istop.com (Daniel Phillips)
Date: Sat, 3 Sep 2005 01:57:31 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <p73fysnqiej.fsf@verdi.suse.de>
References: <20050901104620.GA22482@redhat.com>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<p73fysnqiej.fsf@verdi.suse.de>
Message-ID: <200509030157.31581.phillips@istop.com>

On Friday 02 September 2005 17:17, Andi Kleen wrote:
> The only thing that should be probably resolved is a common API
> for at least the clustered lock manager. Having multiple
> incompatible user space APIs for that would be sad.

The only current users of dlms are cluster filesystems.  There are zero users 
of the userspace dlm api.  Therefore, the (g)dlm userspace interface actually 
has nothing to do with the needs of gfs.  It should be taken out the gfs 
patch and merged later, when or if user space applications emerge that need 
it.  Maybe in the meantime it will be possible to come up with a userspace 
dlm api that isn't completely repulsive.

Also, note that the only reason the two current dlms are in-kernel is because 
it supposedly cuts down on userspace-kernel communication with the cluster 
filesystems.  Then why should a userspace application bother with a an 
awkward interface to an in-kernel dlm?  This is obviously suboptimal.  Why 
not have a userspace dlm for userspace apps, if indeed there are any 
userspace apps that would need to use dlm-style synchronization instead of 
more typical socket-based synchronization, or Posix locking, which is already 
exposed via a standard api?

There is actually nothing wrong with having multiple, completely different 
dlms active at the same time.  There is no urgent need to merge them into the 
one true dlm.  It would be a lot better to let them evolve separately and 
pick the winner a year or two from now.  Just think of the dlm as part of the 
cfs until then.

What does have to be resolved is a common API for node management.  It is not 
just cluster filesystems and their lock managers that have to interface to 
node management.  Below the filesystem layer, cluster block devices and 
cluster volume management need to be coordinated by the same system, and 
above the filesystem layer, applications also need to be hooked into it.  
This work is, in a word, incomplete.

Regards,

Daniel



From arjan at infradead.org  Sat Sep  3 06:14:00 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Sat, 03 Sep 2005 08:14:00 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903051841.GA13211@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
Message-ID: <1125728040.3223.2.camel@laptopd505.fenrus.org>

On Sat, 2005-09-03 at 13:18 +0800, David Teigland wrote:
> On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote:
> > Alan Cox <alan at lxorguk.ukuu.org.uk> wrote:
> > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > >   possibly gain (or vice versa)
> > > > 
> > > > - Relative merits of the two offerings
> > > 
> > > You missed the important one - people actively use it and have been for
> > > some years. Same reason with have NTFS, HPFS, and all the others. On
> > > that alone it makes sense to include.
> > 
> > Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
> > 
> > If one can be grown to encompass the capabilities of the other then we're
> > left with a bunch of legacy code and wasted effort.
> 
> GFS is an established fs, it's not going away, you'd be hard pressed to
> find a more widely used cluster fs on Linux.  GFS is about 10 years old
> and has been in use by customers in production environments for about 5
> years.

but you submitted GFS2 not GFS.




From yanj at brainaire.com  Sat Sep  3 05:22:28 2005
From: yanj at brainaire.com (yanj)
Date: Sat, 3 Sep 2005 13:22:28 +0800
Subject: [Linux-cluster] Does GFS working STABELly on no-smp platform?
Message-ID: <000e01c5b047$7d642990$2e00a8c0@yanzijie>

Hi, all
 
Does GFS working on no-smp platform? I could not find kernel-modules of
GFS for no-smp system.
 
I am trying to build up a GFS+iSCSI cluster based on NO-SMP machines. A
two nodes GFS system is setup. However, it keeps crash (kernel panic),
while I run IO test on both nodes for a while.
I have tried the following combination of GFS on Redhat system.:
1.  GFS version GFS-6.0.2.25 + Kernel 2.4.21-27.EL
2.  GFS version GFS-6.0.2.20-1 + Kernel 2.4.21-32.EL
All are not stable.
Thanks,
Jeffrey Yan
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050903/ac5f31ad/attachment.htm>

From phillips at istop.com  Sat Sep  3 06:42:36 2005
From: phillips at istop.com (Daniel Phillips)
Date: Sat, 3 Sep 2005 02:42:36 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903001628.GH21228@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com> <p73fysnqiej.fsf@verdi.suse.de>
	<20050903001628.GH21228@ca-server1.us.oracle.com>
Message-ID: <200509030242.36506.phillips@istop.com>

On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> As far as userspace dlm apis go, dlmfs already abstracts away a large part
> of the dlm interaction...

Dumb question, why can't you use sysfs for this instead of rolling your own?

Side note: you seem to have deleted all the 2.6.12-rc4 patches.  Perhaps you 
forgot that there are dozens of lkml archives pointing at them?

Regards,

Daniel



From wim.coekaerts at oracle.com  Sat Sep  3 06:46:34 2005
From: wim.coekaerts at oracle.com (Wim Coekaerts)
Date: Fri, 2 Sep 2005 23:46:34 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509030242.36506.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com> <p73fysnqiej.fsf@verdi.suse.de>
	<20050903001628.GH21228@ca-server1.us.oracle.com>
	<200509030242.36506.phillips@istop.com>
Message-ID: <20050903064633.GB4593@ca-server1.us.oracle.com>

On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote:
> On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> > As far as userspace dlm apis go, dlmfs already abstracts away a large part
> > of the dlm interaction...
> 
> Dumb question, why can't you use sysfs for this instead of rolling your own?

because it's totally different. have a look at what it does.



From wim.coekaerts at oracle.com  Sat Sep  3 07:06:39 2005
From: wim.coekaerts at oracle.com (Wim Coekaerts)
Date: Sat, 3 Sep 2005 00:06:39 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <p73fysnqiej.fsf@verdi.suse.de>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<p73fysnqiej.fsf@verdi.suse.de>
Message-ID: <20050903070639.GC4593@ca-server1.us.oracle.com>

On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> Andrew Morton <akpm at osdl.org> writes:
> 
> > 
> > Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?

clusterfilesystems are very common, there are companies that had/have a
whole business around it, veritas, polyserve, ex-sistina, thus now
redhat, ibm, tons of companies out there sell this, big bucks. as
someone said, it's different than nfs because for certian things there
is less overhead  but there are many other reasons, it makes it a lot
easier to create a clustered nfs server so you create a cfs on a set of
disks with a number of nodes and export that fs from all those, you can
easily do loadbalancing for applications, you have a lot of
infrastructure where people have invested in that allows for shared
storage...   

for ocfs we have tons of production customers running many terabyte
databases on a cfs. why ? because dealing with the raw disk froma number
of nodes sucks. because nfs is pretty broken for a lot of stuff, there
is no consistency across nodes when each machine nfs mounts a server
partition. yes nfs can be used for things but cfs's are very useful for
many things nfs just can't do. want a list ? 

companies building failover for services like to use things like this,
it creates a non single point of failure kind of setup much more easily.
andso on and so on, yes there are alternatives out there but fact is
that a lot of folks like to use it, have been using it for ages, and
want to be using it.

from an implementation point of view, as folks here have already said,
we 've tried our best to implement things as a real linux filesystem, no
abstractions to have something generic, it's clean and as tight as can
be for a lot of stuff. and compared to other cfs's it's pretty darned
nice, however I think it's silly to have competition between ocfs2 and
gfs2. they are different just like the ton of local filesystems are
different and people like to use one or/over the other. david said gfs
is popular and has been around, well, I can list you tons of folks that
have been using our stuff 24/7 for years (for free) just as well. it's
different. that's that. 

it'd be really nice if mainline kernel had it/them included. it would be
a good start to get more folks involved and instead of years of talk on
maillists that end up in nothing actually end up with folks
participating and contributing. 



From teigland at redhat.com  Sat Sep  3 10:35:03 2005
From: teigland at redhat.com (David Teigland)
Date: Sat, 3 Sep 2005 18:35:03 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125728040.3223.2.camel@laptopd505.fenrus.org>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
	<1125728040.3223.2.camel@laptopd505.fenrus.org>
Message-ID: <20050903103503.GB15239@redhat.com>

On Sat, Sep 03, 2005 at 08:14:00AM +0200, Arjan van de Ven wrote:
> On Sat, 2005-09-03 at 13:18 +0800, David Teigland wrote:
> > On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote:
> > > Alan Cox <alan at lxorguk.ukuu.org.uk> wrote:
> > > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > > >   possibly gain (or vice versa)
> > > > > 
> > > > > - Relative merits of the two offerings
> > > > 
> > > > You missed the important one - people actively use it and have been for
> > > > some years. Same reason with have NTFS, HPFS, and all the others. On
> > > > that alone it makes sense to include.
> > > 
> > > Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> > > the technical reasons for merging gfs[2], ocfs2, both or neither?
> > > 
> > > If one can be grown to encompass the capabilities of the other then we're
> > > left with a bunch of legacy code and wasted effort.
> > 
> > GFS is an established fs, it's not going away, you'd be hard pressed to
> > find a more widely used cluster fs on Linux.  GFS is about 10 years old
> > and has been in use by customers in production environments for about 5
> > years.
> 
> but you submitted GFS2 not GFS.

Just a new version, not a big difference.  The ondisk format changed a
little making it incompatible with the previous versions.  We'd been
holding out on the format change for a long time and thought now would be
a sensible time to finally do it.

This is also about timing things conveniently.  Each GFS version coincides
with a development cycle and we decided to wait for this version/cycle to
move code upstream.  So, we have new version, format change, and code
upstream all together, but it's still the same GFS to us.

As with _any_ new version (involving ondisk formats or not) we need to
thoroughly test everything to fix the inevitible bugs and regresssions
that are introduced, there's nothing new or surprising about that.

About the name -- we need to support customers running both versions for a
long time.  The "2" was added to make that process a little easier and
clearer for people, that's all.  If the 2 is really distressing we could
rip it off, but there seems to be as many file systems ending in digits
than not these days...

Dave



From phillips at istop.com  Sat Sep  3 20:56:02 2005
From: phillips at istop.com (Daniel Phillips)
Date: Sat, 3 Sep 2005 16:56:02 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903103503.GB15239@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<1125728040.3223.2.camel@laptopd505.fenrus.org>
	<20050903103503.GB15239@redhat.com>
Message-ID: <200509031656.03418.phillips@istop.com>

On Saturday 03 September 2005 06:35, David Teigland wrote:
> Just a new version, not a big difference.  The ondisk format changed a
> little making it incompatible with the previous versions.  We'd been
> holding out on the format change for a long time and thought now would be
> a sensible time to finally do it.

What exactly was the format change, and for what purpose?



From phillips at istop.com  Sat Sep  3 22:21:26 2005
From: phillips at istop.com (Daniel Phillips)
Date: Sat, 3 Sep 2005 18:21:26 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903064633.GB4593@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<200509030242.36506.phillips@istop.com>
	<20050903064633.GB4593@ca-server1.us.oracle.com>
Message-ID: <200509031821.27070.phillips@istop.com>

On Saturday 03 September 2005 02:46, Wim Coekaerts wrote:
> On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote:
> > On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> > > As far as userspace dlm apis go, dlmfs already abstracts away a large
> > > part of the dlm interaction...
> >
> > Dumb question, why can't you use sysfs for this instead of rolling your
> > own?
>
> because it's totally different. have a look at what it does.

You create a dlm domain when a directory is created.  You create a lock 
resource when a file of that name is opened.  You lock the resource when the 
file is opened.  You access the lvb by read/writing the file.  Why doesn't 
that fit the configfs-nee-sysfs model?  If it does, the payoff will be about 
500 lines saved.

This little dlm fs is very slick, but grossly inefficient.  Maybe efficiency 
doesn't matter here since it is just your slow-path userspace tools taking 
these locks.  Please do not even think of proposing this as a way to export a 
kernel-based dlm for general purpose use!

Your userdlm.c file has some hidden gold in it.  You have factored the dlm 
calls far more attractively than the bad old bazillion-parameter Vaxcluster 
legacy.  You are almost in system call zone there.  (But note my earlier 
comment on dlms in general: until there are dlm-based applications, merging a 
general-purpose dlm API is pointless and has nothing to do with getting your 
filesystem merged.)

Regards,

Daniel



From Joel.Becker at oracle.com  Sun Sep  4 01:09:12 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sat, 3 Sep 2005 18:09:12 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509031821.27070.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509030242.36506.phillips@istop.com>
	<20050903064633.GB4593@ca-server1.us.oracle.com>
	<200509031821.27070.phillips@istop.com>
Message-ID: <20050904010912.GJ8684@ca-server1.us.oracle.com>

On Sat, Sep 03, 2005 at 06:21:26PM -0400, Daniel Phillips wrote:
> that fit the configfs-nee-sysfs model?  If it does, the payoff will be about 
> 500 lines saved.

	I'm still awaiting your merge of ext3 and reiserfs, because you
can save probably 500 lines having a filesystem that can create reiser
and ext3 files at the same time.

Joel

-- 

Life's Little Instruction Book #267

	"Lie on your back and look at the stars."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From akpm at osdl.org  Sun Sep  4 01:32:41 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 3 Sep 2005 18:32:41 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904010912.GJ8684@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<200509030242.36506.phillips@istop.com>
	<20050903064633.GB4593@ca-server1.us.oracle.com>
	<200509031821.27070.phillips@istop.com>
	<20050904010912.GJ8684@ca-server1.us.oracle.com>
Message-ID: <20050903183241.1acca6c9.akpm@osdl.org>

Joel Becker <Joel.Becker at oracle.com> wrote:
>
> On Sat, Sep 03, 2005 at 06:21:26PM -0400, Daniel Phillips wrote:
>  > that fit the configfs-nee-sysfs model?  If it does, the payoff will be about 
>  > 500 lines saved.
> 
>  	I'm still awaiting your merge of ext3 and reiserfs, because you
>  can save probably 500 lines having a filesystem that can create reiser
>  and ext3 files at the same time.

oy.   Daniel is asking a legitimate question.

If there's duplicated code in there then we should seek to either make the
code multi-purpose or place the common or reusable parts into a library
somewhere.

If neither approach is applicable or practical for *every single function*
then fine, please explain why.  AFAIR that has not been done.



From Joel.Becker at oracle.com  Sun Sep  4 03:06:40 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sat, 3 Sep 2005 20:06:40 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903183241.1acca6c9.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<200509030242.36506.phillips@istop.com>
	<20050903064633.GB4593@ca-server1.us.oracle.com>
	<200509031821.27070.phillips@istop.com>
	<20050904010912.GJ8684@ca-server1.us.oracle.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
Message-ID: <20050904030640.GL8684@ca-server1.us.oracle.com>

On Sat, Sep 03, 2005 at 06:32:41PM -0700, Andrew Morton wrote:
> If there's duplicated code in there then we should seek to either make the
> code multi-purpose or place the common or reusable parts into a library
> somewhere.

	Regarding sysfs and configfs, that's a whole 'nother
conversation.  I've not yet come up with a function involved that is
identical, but that's a response here for another email.
	Understanding that Daniel is talking about dlmfs, dlmfs is far
more similar to devptsfs, tmpfs, and even sockfs and pipefs than it is
to sysfs.  I don't see him proposing that sockfs and devptsfs be folded
into sysfs.
	dlmfs is *tiny*.  The VFS interface is less than his claimed 500
lines of savings.  The few VFS callbacks do nothing but call DLM
functions.  You'd have to replace this VFS glue with sysfs glue, and
probably save very few lines of code.
	In addition, sysfs cannot support the dlmfs model.  In dlmfs,
mkdir(2) creates a directory representing a DLM domain and mknod(2)
creates the user representation of a lock.  sysfs doesn't support
mkdir(2) or mknod(2) at all.
	More than mkdir() and mknod(), however, dlmfs uses open(2) to
acquire locks from userspace.  O_RDONLY acquires a shared read lock (PR
in VMS parlance).  O_RDWR gets an exclusive lock (X).  O_NONBLOCK is a
trylock.  Here, dlmfs is using the VFS for complete lifetiming.  A lock
is released via close(2).  If a process dies, close(2) happens.  In
other words, ->release() handles all the cleanup for normal and abnormal
termination.
	sysfs does not allow hooking into ->open() or ->release().  So
this model, and the inherent lifetiming that comes with it, cannot be
used.  If dlmfs was changed to use a less intuitive model that fits
sysfs, all the handling of lifetimes and cleanup would have to be added.
This would make it more complex, not less complex.  It would give it a
larger code size, not a smaller one.  In the end, it would be harder to
maintian, less intuitive to use, and larger.

Joel


-- 

"Anything that is too stupid to be spoken is sung."  
        - Voltaire

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From phillips at istop.com  Sun Sep  4 04:22:36 2005
From: phillips at istop.com (Daniel Phillips)
Date: Sun, 4 Sep 2005 00:22:36 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904030640.GL8684@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
Message-ID: <200509040022.37102.phillips@istop.com>

On Saturday 03 September 2005 23:06, Joel Becker wrote:
>  dlmfs is *tiny*.  The VFS interface is less than his claimed 500
> lines of savings.

It is 640 lines.

> The few VFS callbacks do nothing but call DLM 
> functions.  You'd have to replace this VFS glue with sysfs glue, and
> probably save very few lines of code.
>  In addition, sysfs cannot support the dlmfs model.  In dlmfs,
> mkdir(2) creates a directory representing a DLM domain and mknod(2)
> creates the user representation of a lock.  sysfs doesn't support
> mkdir(2) or mknod(2) at all.

I said "configfs" in the email to which you are replying.

>  More than mkdir() and mknod(), however, dlmfs uses open(2) to
> acquire locks from userspace.  O_RDONLY acquires a shared read lock (PR
> in VMS parlance).  O_RDWR gets an exclusive lock (X).  O_NONBLOCK is a
> trylock.  Here, dlmfs is using the VFS for complete lifetiming.  A lock
> is released via close(2).  If a process dies, close(2) happens.  In
> other words, ->release() handles all the cleanup for normal and abnormal
> termination.
>
>  sysfs does not allow hooking into ->open() or ->release().  So
> this model, and the inherent lifetiming that comes with it, cannot be 
> used.

Configfs has a per-item release method.  Configfs has a group open method.  
What is it that configfs can't do, or can't be made to do trivially?

> If dlmfs was changed to use a less intuitive model that fits 
> sysfs, all the handling of lifetimes and cleanup would have to be added.

The model you came up with for dlmfs is beyond cute, it's downright clever.  
Why mar that achievement by then failing to capitalize on the framework you 
already have in configfs?

By the way, do you agree that dlmfs is too inefficient to be an effective way 
of exporting your dlm api to user space, except for slow-path applications 
like you have here?

Regards,

Daniel



From Joel.Becker at oracle.com  Sun Sep  4 04:30:00 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sat, 3 Sep 2005 21:30:00 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509040022.37102.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
Message-ID: <20050904043000.GQ8684@ca-server1.us.oracle.com>

On Sun, Sep 04, 2005 at 12:22:36AM -0400, Daniel Phillips wrote:
> It is 640 lines.

	It's 450 without comments and blank lines.  Please, don't tell
me that comments to help understanding are bloat.

> I said "configfs" in the email to which you are replying.

To wit:

> Daniel Phillips said:
> > Mark Fasheh said:
> > > as far as userspace dlm apis go, dlmfs already abstracts away a
> > > large
> > > part of the dlm interaction...
> >
> > Dumb question, why can't you use sysfs for this instead of rolling
> > your
> > own?

	You asked why dlmfs can't go into sysfs, and I responded.

Joel

-- 

"I don't want to achieve immortality through my work; I want to
 achieve immortality through not dying."
        - Woody Allen

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From phillips at istop.com  Sun Sep  4 04:51:10 2005
From: phillips at istop.com (Daniel Phillips)
Date: Sun, 4 Sep 2005 00:51:10 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904043000.GQ8684@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050904043000.GQ8684@ca-server1.us.oracle.com>
Message-ID: <200509040051.11095.phillips@istop.com>

On Sunday 04 September 2005 00:30, Joel Becker wrote:
> You asked why dlmfs can't go into sysfs, and I responded.

And you got me!  In the heat of the moment I overlooked the fact that you and 
Greg haven't agreed to the merge yet ;-)

Clearly, I ought to have asked why dlmfs can't be done by configfs.  It is the 
same paradigm: drive the kernel logic from user-initiated vfs methods.  You 
already have nearly all the right methods in nearly all the right places.

Regards,

Daniel






From akpm at osdl.org  Sun Sep  4 04:46:53 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 3 Sep 2005 21:46:53 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509040022.37102.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
Message-ID: <20050903214653.1b8a8cb7.akpm@osdl.org>

Daniel Phillips <phillips at istop.com> wrote:
>
> The model you came up with for dlmfs is beyond cute, it's downright clever.  

Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
acquire a clustered filesystem lock".  Not even close.

It would be much better to do something which explicitly and directly
expresses what you're trying to do rather than this strange "lets do this
because the names sound the same" thing.

What happens when we want to add some new primitive which has no posix-file
analog?

Waaaay too cute.  Oh well, whatever.



From Joel.Becker at oracle.com  Sun Sep  4 04:58:21 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sat, 3 Sep 2005 21:58:21 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
Message-ID: <20050904045821.GT8684@ca-server1.us.oracle.com>

On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.

	So, you'd like a new flag name?  That can be done.

> What happens when we want to add some new primitive which has no posix-file
> analog?

	The point of dlmfs is not to express every primitive that the
DLM has.  dlmfs cannot express the CR, CW, and PW levels of the VMS
locking scheme.  Nor should it.  The point isn't to use a filesystem
interface for programs that need all the flexibility and power of the
VMS DLM.  The point is a simple system that programs needing the basic
operations can use.  Even shell scripts.

Joel

-- 

"You must remember this:
 A kiss is just a kiss,
 A sigh is just a sigh.
 The fundamental rules apply
 As time goes by."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From Joel.Becker at oracle.com  Sun Sep  4 05:00:26 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sat, 3 Sep 2005 22:00:26 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509040051.11095.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050904043000.GQ8684@ca-server1.us.oracle.com>
	<200509040051.11095.phillips@istop.com>
Message-ID: <20050904050026.GU8684@ca-server1.us.oracle.com>

On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote:
> Clearly, I ought to have asked why dlmfs can't be done by configfs.  It is the 
> same paradigm: drive the kernel logic from user-initiated vfs methods.  You 
> already have nearly all the right methods in nearly all the right places.

	configfs, like sysfs, does not support ->open() or ->release()
callbacks.  And it shouldn't.  The point is to hide the complexity and
make it easier to plug into.  
	A client object should not ever have to know or care that it is
being controlled by a filesystem.  It only knows that it has a tree of
items with attributes that can be set or shown.

Joel


-- 

"In a crisis, don't hide behind anything or anybody. They're going
 to find you anyway."
	- Paul "Bear" Bryant

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From akpm at osdl.org  Sun Sep  4 05:41:40 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 3 Sep 2005 22:41:40 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904045821.GT8684@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
Message-ID: <20050903224140.0442fac4.akpm@osdl.org>

Joel Becker <Joel.Becker at oracle.com> wrote:
>
>  > What happens when we want to add some new primitive which has no posix-file
>  > analog?
> 
>  	The point of dlmfs is not to express every primitive that the
>  DLM has.  dlmfs cannot express the CR, CW, and PW levels of the VMS
>  locking scheme.  Nor should it.  The point isn't to use a filesystem
>  interface for programs that need all the flexibility and power of the
>  VMS DLM.  The point is a simple system that programs needing the basic
>  operations can use.  Even shell scripts.

Are you saying that the posix-file lookalike interface provides access to
part of the functionality, but there are other APIs which are used to
access the rest of the functionality?  If so, what is that interface, and
why cannot that interface offer access to 100% of the functionality, thus
making the posix-file tricks unnecessary?



From Joel.Becker at oracle.com  Sun Sep  4 05:49:37 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sat, 3 Sep 2005 22:49:37 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903224140.0442fac4.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
Message-ID: <20050904054936.GW8684@ca-server1.us.oracle.com>

On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote:
> Are you saying that the posix-file lookalike interface provides access to
> part of the functionality, but there are other APIs which are used to
> access the rest of the functionality?  If so, what is that interface, and
> why cannot that interface offer access to 100% of the functionality, thus
> making the posix-file tricks unnecessary?

	Currently, this is all the interface that the OCFS2 DLM
provides.  But yes, if you wanted to provide the rest of the VMS
functionality (something that GFS2's DLM does), you'd need to use a more
concrete interface.
	IMHO, it's worthwhile to have a simple interface, one already
used by mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc.  This is an interface
that can and is used by shell scripts even (we do this to test the DLM).
If you make it a C-library-only interface, you've just restricted the
subset of folks that can use it, while adding programming complexity.
	I think that a simple fs-based interface can coexist with a more
complex one.  FILE* doesn't give you the flexibility of read()/write(),
but I wouldn't remove it :-)

Joel

-- 

"In the beginning, the universe was created. This has made a lot 
 of people very angry, and is generally considered to have been a 
 bad move."
        - Douglas Adams

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From phillips at istop.com  Sun Sep  4 05:52:29 2005
From: phillips at istop.com (Daniel Phillips)
Date: Sun, 4 Sep 2005 01:52:29 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904050026.GU8684@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<200509040051.11095.phillips@istop.com>
	<20050904050026.GU8684@ca-server1.us.oracle.com>
Message-ID: <200509040152.30027.phillips@istop.com>

On Sunday 04 September 2005 01:00, Joel Becker wrote:
> On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote:
> > Clearly, I ought to have asked why dlmfs can't be done by configfs.  It
> > is the same paradigm: drive the kernel logic from user-initiated vfs
> > methods.  You already have nearly all the right methods in nearly all the
> > right places.
>
>  configfs, like sysfs, does not support ->open() or ->release()
> callbacks.

struct configfs_item_operations {
 void (*release)(struct config_item *);
 ssize_t (*show)(struct config_item *, struct attribute *,char *);
 ssize_t (*store)(struct config_item *,struct attribute *,const char *, size_t);
 int (*allow_link)(struct config_item *src, struct config_item *target);
 int (*drop_link)(struct config_item *src, struct config_item *target);
};

struct configfs_group_operations {
 struct config_item *(*make_item)(struct config_group *group, const char *name);
 struct config_group *(*make_group)(struct config_group *group, const char *name);
 int (*commit_item)(struct config_item *item);
 void (*drop_item)(struct config_group *group, struct config_item *item);
};

You do have ->release and ->make_item/group.

If I may hand you a more substantive argument: you don't support user-driven
creation of files in configfs, only directories.  Dlmfs supports user-created
files.  But you know, there isn't actually a good reason not to support
user-created files in configfs, as dlmfs demonstrates.

Anyway, goodnight.

Regards,

Daniel



From Joel.Becker at oracle.com  Sun Sep  4 05:56:51 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sat, 3 Sep 2005 22:56:51 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509040152.30027.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509040051.11095.phillips@istop.com>
	<20050904050026.GU8684@ca-server1.us.oracle.com>
	<200509040152.30027.phillips@istop.com>
Message-ID: <20050904055650.GX8684@ca-server1.us.oracle.com>

On Sun, Sep 04, 2005 at 01:52:29AM -0400, Daniel Phillips wrote:
> You do have ->release and ->make_item/group.

	->release is like kobject release.  It's a free callback, not a
callback from close.

> If I may hand you a more substantive argument: you don't support user-driven
> creation of files in configfs, only directories.  Dlmfs supports user-created
> files.  But you know, there isn't actually a good reason not to support
> user-created files in configfs, as dlmfs demonstrates.

	It is outside the domain of configfs.  Just because it can be
done does not mean it should be.  configfs isn't a "thing to create
files".  It's an interface to creating kernel items.  The actual
filesystem representation isn't the end, it's just the means.

Joel

-- 

"In the room the women come and go
 Talking of Michaelangelo."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From mark.fasheh at oracle.com  Sun Sep  4 06:10:45 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Sat, 3 Sep 2005 23:10:45 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
Message-ID: <20050904061045.GI21228@ca-server1.us.oracle.com>

On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.
What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
is a bit unfortunate, but really it just needs a bit to express that -
nobody over here cares what it's called.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From phillips at istop.com  Sun Sep  4 06:40:08 2005
From: phillips at istop.com (Daniel Phillips)
Date: Sun, 4 Sep 2005 02:40:08 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
Message-ID: <200509040240.08467.phillips@istop.com>

On Sunday 04 September 2005 00:46, Andrew Morton wrote:
> Daniel Phillips <phillips at istop.com> wrote:
> > The model you came up with for dlmfs is beyond cute, it's downright
> > clever.
>
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.

Now, I see the ocfs2 guys are all ready to back down on this one, but I will 
at least argue weakly in favor.

Sick is a nice word for it, but it is actually not that far off.  Normally, 
this fs will acquire a lock whenever the user creates a virtual file and the 
create will block until the global lock arrives.  With O_NONBLOCK, it will 
return, erm... ETXTBSY (!) immediately.  Is that not what O_NONBLOCK is 
supposed to accomplish?

> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.
>
> What happens when we want to add some new primitive which has no posix-file
> analog?
>
> Waaaay too cute.  Oh well, whatever.

The explicit way is syscalls or a set of ioctls, which he already has the 
makings of.  If there is going to be a userspace api, I would hope it looks 
more like the contents of userdlm.c than the traditional Vaxcluster API, 
which sucks beyond belief.

Another explicit way is to do it with a whole set of virtual attributes 
instead of just a single file trying to capture the whole model.  That is 
really unappealing, but I am afraid that is exactly what a whole lot of 
sysfs/configfs usage is going to end up looking like.

But more to the point: we have no urgent need for a userspace dlm api at the 
moment.  Nothing will break if we just put that issue off for a few months, 
quite the contrary.

If the only user is their tools I would say let it go ahead and be cute, even 
sickeningly so.  It is not supposed to be a general dlm api, at least that is 
my understanding.  It is just supposed to be an interface for their tools.  
Of course it would help to know exactly how those tools use it.  Too sleepy 
to find out tonight...

Regards,

Daniel



From hzhong at gmail.com  Sun Sep  4 07:12:52 2005
From: hzhong at gmail.com (Hua Zhong)
Date: Sun, 4 Sep 2005 00:12:52 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
Message-ID: <924c28830509040012ce7a0ce@mail.gmail.com>

On 9/3/05, Andrew Morton <akpm at osdl.org> wrote: 
> 
> Daniel Phillips <phillips at istop.com> wrote:
> >
> > The model you came up with for dlmfs is beyond cute, it's downright 
> clever.
> 
> Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock". Not even close.

 No, it's "open this file in nonblocking mode" vs "attempt to acquire a lock 
in nonblocking mode". I think it makes perfect sense to use this flag.
 Of course, whether or not to use open as a means to acquire a lock (in 
either blocking or nonblocking mode) is efficient is another matter.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050904/cd5778a5/attachment.htm>

From akpm at osdl.org  Sun Sep  4 07:23:43 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sun, 4 Sep 2005 00:23:43 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904061045.GI21228@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904061045.GI21228@ca-server1.us.oracle.com>
Message-ID: <20050904002343.079daa85.akpm@osdl.org>

Mark Fasheh <mark.fasheh at oracle.com> wrote:
>
> On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> > Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> > lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> > me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> > acquire a clustered filesystem lock".  Not even close.
>
> What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> is a bit unfortunate, but really it just needs a bit to express that -
> nobody over here cares what it's called.

The whole idea of reinterpreting file operations to mean something utterly
different just seems inappropriate to me.

You get a lot of goodies when using a filesystem - the ability for
unrelated processes to look things up, resource release on exit(), etc.  If
those features are valuable in the ocfs2 context then fine.  But I'd have
thought that it would be saner and more extensible to add new syscalls
(perhaps taking fd's) rather than overloading the open() mode in this
manner.



From akpm at osdl.org  Sun Sep  4 07:28:28 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sun, 4 Sep 2005 00:28:28 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509040240.08467.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<200509040240.08467.phillips@istop.com>
Message-ID: <20050904002828.3d26f64c.akpm@osdl.org>

Daniel Phillips <phillips at istop.com> wrote:
>
> If the only user is their tools I would say let it go ahead and be cute, even 
>  sickeningly so.  It is not supposed to be a general dlm api, at least that is 
>  my understanding.  It is just supposed to be an interface for their tools.  
>  Of course it would help to know exactly how those tools use it.

Well I'm not saying "don't do this".   I'm saying "eww" and "why?".

If there is already a richer interface into all this code (such as a
syscall one) and it's feasible to migrate the open() tricksies to that API
in the future if it all comes unstuck then OK.  That's why I asked (thus
far unsuccessfully):

   Are you saying that the posix-file lookalike interface provides
   access to part of the functionality, but there are other APIs which are
   used to access the rest of the functionality?  If so, what is that
   interface, and why cannot that interface offer access to 100% of the
   functionality, thus making the posix-file tricks unnecessary?



From Joel.Becker at oracle.com  Sun Sep  4 08:01:02 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sun, 4 Sep 2005 01:01:02 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904002828.3d26f64c.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<200509040240.08467.phillips@istop.com>
	<20050904002828.3d26f64c.akpm@osdl.org>
Message-ID: <20050904080102.GY8684@ca-server1.us.oracle.com>

On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK.
> That's why I asked (thus far unsuccessfully):

	I personally was under the impression that "syscalls are not
to be added".  I'm also wary of the effort required to hook into process
exit.  Not to mention all the lifetiming that has to be written again.
	On top of that, we lose our cute ability to shell script it.  We
find this very useful in testing, and think others would in practice.

>    Are you saying that the posix-file lookalike interface provides
>    access to part of the functionality, but there are other APIs which are
>    used to access the rest of the functionality?  If so, what is that
>    interface, and why cannot that interface offer access to 100% of the
>    functionality, thus making the posix-file tricks unnecessary?

	I thought I stated this in my other email.  We're not intending
to extend dlmfs.  It pretty much covers the simple DLM usage required of
a simple interface.  The OCFS2 DLM does not provide any other
functionality.
	If the OCFS2 DLM grew more functionality, or you consider the
GFS2 DLM that already has it (and a less intuitive interface via sysfs
IIRC), I would contend that dlmfs still has a place.  It's simple to use
and understand, and it's usable from shell scripts and other simple
code.

Joel

-- 

"The first thing we do, let's kill all the lawyers."
                                        -Henry VI, IV:ii

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From mark.fasheh at oracle.com  Sun Sep  4 08:17:48 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Sun, 4 Sep 2005 01:17:48 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904002343.079daa85.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904061045.GI21228@ca-server1.us.oracle.com>
	<20050904002343.079daa85.akpm@osdl.org>
Message-ID: <20050904081748.GJ21228@ca-server1.us.oracle.com>

On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote:
> > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> > is a bit unfortunate, but really it just needs a bit to express that -
> > nobody over here cares what it's called.
> 
> The whole idea of reinterpreting file operations to mean something utterly
> different just seems inappropriate to me.
Putting aside trylock for a minute, I'm not sure how utterly different the
operations are. You create a lock resource by creating a file named after
it. You get a lock (fd) at read or write level on the resource by calling 
open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR).
Now that we've got an fd, lock value blocks are naturally represented as
file data which can be read(2) or written(2).
Close(2) drops the lock.

A really trivial usage example from shell:

node1$ echo "hello world" > mylock
node2$ cat mylock
hello world

I could always give a more useful one after I get some sleep :)

> You get a lot of goodies when using a filesystem - the ability for
> unrelated processes to look things up, resource release on exit(), etc.  If
> those features are valuable in the ocfs2 context then fine.
Right, they certainly are and I think Joel, in another e-mail on this
thread, explained well the advantages of using a filesystem.

> But I'd have thought that it would be saner and more extensible to add new
> syscalls (perhaps taking fd's) rather than overloading the open() mode in
> this manner.
The idea behind dlmfs was to very simply export a small set of cluster dlm
operations to userspace. Given that goal, I felt that a whole set of system
calls would have been overkill. That said, I think perhaps I should clarify
that I don't intend dlmfs to become _the_ userspace dlm api, just a simple
and (imho) intuitive one which could be trivially accessed from any software
which just knows how to read and write files.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From akpm at osdl.org  Sun Sep  4 08:18:05 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sun, 4 Sep 2005 01:18:05 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904080102.GY8684@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<200509040240.08467.phillips@istop.com>
	<20050904002828.3d26f64c.akpm@osdl.org>
	<20050904080102.GY8684@ca-server1.us.oracle.com>
Message-ID: <20050904011805.68df8dde.akpm@osdl.org>

Joel Becker <Joel.Becker at oracle.com> wrote:
>
> On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote:
> > If there is already a richer interface into all this code (such as a
> > syscall one) and it's feasible to migrate the open() tricksies to that API
> > in the future if it all comes unstuck then OK.
> > That's why I asked (thus far unsuccessfully):
> 
> 	I personally was under the impression that "syscalls are not
> to be added".

We add syscalls all the time.  Whichever user<->kernel API is considered to
be most appropriate, use it.

>  I'm also wary of the effort required to hook into process
> exit.

I'm not questioning the use of a filesystem.  I'm questioning this
overloading of normal filesystem system calls.  For example (and this is
just an example!  there's also mknod, mkdir, O_RDWR, O_EXCL...) it would be
more usual to do

	fd = open("/sys/whatever", ...);
	err = sys_dlm_trylock(fd);

I guess your current implementation prevents /sys/whatever from ever
appearing if the trylock failed.  Dunno if that's valuable.

>  Not to mention all the lifetiming that has to be written again.
> 	On top of that, we lose our cute ability to shell script it.  We
> find this very useful in testing, and think others would in practice.
> 
> >    Are you saying that the posix-file lookalike interface provides
> >    access to part of the functionality, but there are other APIs which are
> >    used to access the rest of the functionality?  If so, what is that
> >    interface, and why cannot that interface offer access to 100% of the
> >    functionality, thus making the posix-file tricks unnecessary?
> 
> 	I thought I stated this in my other email.  We're not intending
> to extend dlmfs.

Famous last words ;)

>  It pretty much covers the simple DLM usage required of
> a simple interface.  The OCFS2 DLM does not provide any other
> functionality.
> 	If the OCFS2 DLM grew more functionality, or you consider the
> GFS2 DLM that already has it (and a less intuitive interface via sysfs
> IIRC), I would contend that dlmfs still has a place.  It's simple to use
> and understand, and it's usable from shell scripts and other simple
> code.

(wonders how to do O_NONBLOCK from a script)




I don't buy the general "fs is nice because we can script it" argument,
really.  You can just write a few simple applications which provide access
to the syscalls (or the fs!) and then write scripts around those.

Yes, you suddenly need to get a little tarball into users' hands and that's
a hassle.  And I sometimes think we let this hassle guide kernel interfaces
(mutters something about /sbin/hotplug), and that's sad.  



From akpm at osdl.org  Sun Sep  4 08:37:04 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sun, 4 Sep 2005 01:37:04 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904081748.GJ21228@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904061045.GI21228@ca-server1.us.oracle.com>
	<20050904002343.079daa85.akpm@osdl.org>
	<20050904081748.GJ21228@ca-server1.us.oracle.com>
Message-ID: <20050904013704.55c2d9f5.akpm@osdl.org>

Mark Fasheh <mark.fasheh at oracle.com> wrote:
>
> On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote:
> > > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> > > is a bit unfortunate, but really it just needs a bit to express that -
> > > nobody over here cares what it's called.
> > 
> > The whole idea of reinterpreting file operations to mean something utterly
> > different just seems inappropriate to me.
> Putting aside trylock for a minute, I'm not sure how utterly different the
> operations are. You create a lock resource by creating a file named after
> it. You get a lock (fd) at read or write level on the resource by calling 
> open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR).
> Now that we've got an fd, lock value blocks are naturally represented as
> file data which can be read(2) or written(2).
> Close(2) drops the lock.
> 
> A really trivial usage example from shell:
> 
> node1$ echo "hello world" > mylock
> node2$ cat mylock
> hello world
> 
> I could always give a more useful one after I get some sleep :)

It isn't extensible though.  One couldn't retain this approach while adding
(random cfs ignorance exposure) upgrade-read, downgrade-write,
query-for-various-runtime-stats, priority modification, whatever.

> > You get a lot of goodies when using a filesystem - the ability for
> > unrelated processes to look things up, resource release on exit(), etc.  If
> > those features are valuable in the ocfs2 context then fine.
> Right, they certainly are and I think Joel, in another e-mail on this
> thread, explained well the advantages of using a filesystem.
> 
> > But I'd have thought that it would be saner and more extensible to add new
> > syscalls (perhaps taking fd's) rather than overloading the open() mode in
> > this manner.
> The idea behind dlmfs was to very simply export a small set of cluster dlm
> operations to userspace. Given that goal, I felt that a whole set of system
> calls would have been overkill. That said, I think perhaps I should clarify
> that I don't intend dlmfs to become _the_ userspace dlm api, just a simple
> and (imho) intuitive one which could be trivially accessed from any software
> which just knows how to read and write files.

Well, as I say.  Making it a filesystem is superficially attractive, but
once you've build a super-dooper enterprise-grade infrastructure on top of
it all, nobody's going to touch the fs interface by hand and you end up
wondering why it's there, adding baggage.

Not that I'm questioning the fs interface!  It has useful permission
management, monitoring and resource releasing characteristics.  I'm
questioning the open() tricks.  I guess from Joel's tiny description, the
filesystem's interpretation of mknod and mkdir look sensible enough.



From Joel.Becker at oracle.com  Sun Sep  4 09:11:18 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sun, 4 Sep 2005 02:11:18 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904011805.68df8dde.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<200509040240.08467.phillips@istop.com>
	<20050904002828.3d26f64c.akpm@osdl.org>
	<20050904080102.GY8684@ca-server1.us.oracle.com>
	<20050904011805.68df8dde.akpm@osdl.org>
Message-ID: <20050904091118.GZ8684@ca-server1.us.oracle.com>

On Sun, Sep 04, 2005 at 01:18:05AM -0700, Andrew Morton wrote:
> > 	I thought I stated this in my other email.  We're not intending
> > to extend dlmfs.
> 
> Famous last words ;)

	Heh, of course :-)

> I don't buy the general "fs is nice because we can script it" argument,
> really.  You can just write a few simple applications which provide access
> to the syscalls (or the fs!) and then write scripts around those.

	I can't see how that works easily.  I'm not worried about a
tarball (eventually Red Hat and SuSE and Debian would have it).  I'm
thinking about this shell:

	exec 7</dlm/domainxxxx/lock1
	do stuff
	exec 7</dev/null

If someone kills the shell while stuff is doing, the lock is unlocked
because fd 7 is closed.  However, if you have an application to do the
locking:

	takelock domainxxx lock1
	do sutff
	droplock domainxxx lock1

When someone kills the shell, the lock is leaked, becuase droplock isn't
called.  And SEGV/QUIT/-9 (especially -9, folks love it too much) are
handled by the first example but not by the second.
	
Joel

-- 

"Same dancers in the same old shoes.
 You get too careful with the steps you choose.
 You don't care about winning but you don't want to lose
 After the thrill is gone."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From akpm at osdl.org  Sun Sep  4 09:18:36 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sun, 4 Sep 2005 02:18:36 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904091118.GZ8684@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<200509040240.08467.phillips@istop.com>
	<20050904002828.3d26f64c.akpm@osdl.org>
	<20050904080102.GY8684@ca-server1.us.oracle.com>
	<20050904011805.68df8dde.akpm@osdl.org>
	<20050904091118.GZ8684@ca-server1.us.oracle.com>
Message-ID: <20050904021836.4d4560a5.akpm@osdl.org>

Joel Becker <Joel.Becker at oracle.com> wrote:
>
> 	I can't see how that works easily.  I'm not worried about a
>  tarball (eventually Red Hat and SuSE and Debian would have it).  I'm
>  thinking about this shell:
> 
>  	exec 7</dlm/domainxxxx/lock1
>  	do stuff
>  	exec 7</dev/null
> 
>  If someone kills the shell while stuff is doing, the lock is unlocked
>  because fd 7 is closed.  However, if you have an application to do the
>  locking:
> 
>  	takelock domainxxx lock1
>  	do sutff
>  	droplock domainxxx lock1
> 
>  When someone kills the shell, the lock is leaked, becuase droplock isn't
>  called.  And SEGV/QUIT/-9 (especially -9, folks love it too much) are
>  handled by the first example but not by the second.


	take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"



From Joel.Becker at oracle.com  Sun Sep  4 09:39:10 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sun, 4 Sep 2005 02:39:10 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904021836.4d4560a5.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<200509040240.08467.phillips@istop.com>
	<20050904002828.3d26f64c.akpm@osdl.org>
	<20050904080102.GY8684@ca-server1.us.oracle.com>
	<20050904011805.68df8dde.akpm@osdl.org>
	<20050904091118.GZ8684@ca-server1.us.oracle.com>
	<20050904021836.4d4560a5.akpm@osdl.org>
Message-ID: <20050904093910.GA8684@ca-server1.us.oracle.com>

On Sun, Sep 04, 2005 at 02:18:36AM -0700, Andrew Morton wrote:
> 	take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"

	Ahh, but then you have to have lots of scripts somewhere in
path, or do massive inline scripts.  especially if you want to take
another lock in there somewhere.
	It's doable, but it's nowhere near as easy. :-)

Joel

-- 

"I always thought the hardest questions were those I could not answer.
 Now I know they are the ones I can never ask."
			- Charlie Watkins

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From hzhong at gmail.com  Sun Sep  4 18:03:21 2005
From: hzhong at gmail.com (Hua Zhong)
Date: Sun, 4 Sep 2005 11:03:21 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904091118.GZ8684@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<200509040240.08467.phillips@istop.com>
	<20050904002828.3d26f64c.akpm@osdl.org>
	<20050904080102.GY8684@ca-server1.us.oracle.com>
	<20050904011805.68df8dde.akpm@osdl.org>
	<20050904091118.GZ8684@ca-server1.us.oracle.com>
Message-ID: <924c288305090411038aa80f8@mail.gmail.com>

>        takelock domainxxx lock1
>        do sutff
>        droplock domainxxx lock1
> 
> When someone kills the shell, the lock is leaked, becuase droplock isn't
> called.

Why not open the lock resource (or the lock space) instead of
individual locks as file? It then looks like this:

open lock space file
takelock lockresource lock1
do stuff
droplock lockresource lock1
close lock space file

Then if you are killed the ->release of lock space file should take
care of cleaning up all the locks



From phillips at istop.com  Sun Sep  4 19:51:56 2005
From: phillips at istop.com (Daniel Phillips)
Date: Sun, 4 Sep 2005 15:51:56 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904002828.3d26f64c.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<200509040240.08467.phillips@istop.com>
	<20050904002828.3d26f64c.akpm@osdl.org>
Message-ID: <200509041551.56614.phillips@istop.com>

On Sunday 04 September 2005 03:28, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK.  That's why I asked (thus
> far unsuccessfully):
>
>    Are you saying that the posix-file lookalike interface provides
>    access to part of the functionality, but there are other APIs which are
>    used to access the rest of the functionality?  If so, what is that
>    interface, and why cannot that interface offer access to 100% of the
>    functionality, thus making the posix-file tricks unnecessary?

There is no such interface at the moment, nor is one needed in the immediate 
future.  Let's look at the arguments for exporting a dlm to userspace:

  1) Since we already have a dlm in kernel, why not just export that and save
     100K of userspace library?  Answer: because we don't want userspace-only
     dlm features bulking up the kernel.  Answer #2: the extra syscalls and
     interface baggage serve no useful purpose.

  2) But we need to take locks in the same lockspaces as the kernel dlm(s)!
     Answer: only support tools need to do that.  A cut-down locking api is
     entirely appropriate for this.

  3) But the kernel dlm is the only one we have!  Answer: easily fixed, a
     simple matter of coding.  But please bear in mind that dlm-style
     synchronization is probably a bad idea for most cluster applications,
     particularly ones that already do their synchronization via sockets.

In other words, exporting the full dlm api is a red herring.  It has nothing 
to do with getting cluster filesystems up and running.  It is really just 
marketing: it sounds like a great thing for userspace to get a dlm "for 
free", but it isn't free, it contributes to kernel bloat and it isn't even 
the most efficient way to do it.

If after considering that, we _still_ want to export a dlm api from kernel, 
then can we please take the necessary time and get it right?  The full api 
requires not only syscall-style elements, but asynchronous events as well, 
similar to aio.  I do not think anybody has a good answer to this today, nor 
do we even need it to begin porting applications to cluster filesystems.

Oracle guys: what is the distributed locking API for RAC?  Is the RAC team 
waiting with bated breath to adopt your kernel-based dlm?  If not, why not?

Regards,

Daniel



From pavel at ucw.cz  Sun Sep  4 20:33:44 2005
From: pavel at ucw.cz (Pavel Machek)
Date: Sun, 4 Sep 2005 22:33:44 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903051841.GA13211@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
Message-ID: <20050904203344.GA1987@elf.ucw.cz>

Hi!

> - read-only mount
> - "specatator" mount (like ro but no journal allocated for the mount,
>   no fencing needed for failed node that was mounted as specatator)

I'd call it "real-read-only", and yes, that's very usefull
mount. Could we get it for ext3, too?
								Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address



From Joel.Becker at oracle.com  Sun Sep  4 22:18:20 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Sun, 4 Sep 2005 15:18:20 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904203344.GA1987@elf.ucw.cz>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
	<20050904203344.GA1987@elf.ucw.cz>
Message-ID: <20050904221820.GB8684@ca-server1.us.oracle.com>

On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
> 
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?

	In OCFS2 we call readonly+journal+connected-to-cluster "soft
readonly".  We're a live node, other nodes know we exist, and we can
flush pending transactions during the rw->ro transition.  In addition,
we can allow a ro->rw transition.
	The no-journal+no-cluster-connection mode we call "hard
readonly".  This is the mode you get when a device itself is readonly,
because you can't do *anything*.

Joel

-- 

"Lately I've been talking in my sleep.
 Can't imagine what I'd have to say.
 Except my world will be right
 When love comes back my way."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From teigland at redhat.com  Mon Sep  5 03:47:39 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 11:47:39 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903052821.GA23711@kroah.com>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
	<20050902094403.GD16595@redhat.com>
	<20050903052821.GA23711@kroah.com>
Message-ID: <20050905034739.GA11337@redhat.com>

On Fri, Sep 02, 2005 at 10:28:21PM -0700, Greg KH wrote:
> On Fri, Sep 02, 2005 at 05:44:03PM +0800, David Teigland wrote:
> > On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > 
> > > +	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> > 
> > > what is gfs2_assert() about anyway? please just use BUG_ON directly
> > > everywhere
> > 
> > When a machine has many gfs file systems mounted at once it can be useful
> > to know which one failed.  Does the following look ok?
> > 
> > #define gfs2_assert(sdp, assertion)                                       \
> > do {                                                                      \
> >         if (unlikely(!(assertion))) {                                     \
> >                 printk(KERN_ERR                                           \
> >                         "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
> >                         "GFS2: fsid=%s:   function = %s\n"                \
> >                         "GFS2: fsid=%s:   file = %s, line = %u\n"         \
> >                         "GFS2: fsid=%s:   time = %lu\n",                  \
> >                         sdp->sd_fsname, # assertion,                      \
> >                         sdp->sd_fsname,  __FUNCTION__,                    \
> >                         sdp->sd_fsname, __FILE__, __LINE__,               \
> >                         sdp->sd_fsname, get_seconds());                   \
> >                 BUG();                                                    \
> 
> You will already get the __FUNCTION__ (and hence the __FILE__ info)
> directly from the BUG() dump, as well as the time from the syslog
> message (turn on the printk timestamps if you want a more fine grain
> timestamp), so the majority of this macro is redundant with the BUG()
> macro...

Joern already suggested moving this out of line and into a function (as it
was before) to avoid repeating string constants.  In that case the
function, file and line from BUG aren't useful.  We now have this, does it
look ok?

void gfs2_assert_i(struct gfs2_sbd *sdp, char *assertion, const char *function,
                   char *file, unsigned int line)
{
        panic("GFS2: fsid=%s: fatal: assertion \"%s\" failed\n"
              "GFS2: fsid=%s:   function = %s, file = %s, line = %u\n",
              sdp->sd_fsname, assertion,
              sdp->sd_fsname, function, file, line);
}

#define gfs2_assert(sdp, assertion) \
do { \
        if (unlikely(!(assertion))) { \
                gfs2_assert_i((sdp), #assertion, \
                              __FUNCTION__, __FILE__, __LINE__); \
        } \
} while (0)



From teigland at redhat.com  Mon Sep  5 04:30:33 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 12:30:33 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903224140.0442fac4.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
Message-ID: <20050905043033.GB11337@redhat.com>

On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote:
> Joel Becker <Joel.Becker at oracle.com> wrote:
> >
> >  > What happens when we want to add some new primitive which has no
> >  > posix-file analog?
> > 
> >  	The point of dlmfs is not to express every primitive that the
> >  DLM has.  dlmfs cannot express the CR, CW, and PW levels of the VMS
> >  locking scheme.  Nor should it.  The point isn't to use a filesystem
> >  interface for programs that need all the flexibility and power of the
> >  VMS DLM.  The point is a simple system that programs needing the basic
> >  operations can use.  Even shell scripts.
> 
> Are you saying that the posix-file lookalike interface provides access to
> part of the functionality, but there are other APIs which are used to
> access the rest of the functionality?  If so, what is that interface, and
> why cannot that interface offer access to 100% of the functionality, thus
> making the posix-file tricks unnecessary?

We're using our dlm quite a bit in user space and require the full dlm
API.  It's difficult to export the full API through a pseudo fs like
dlmfs, so we've not found it a very practical approach.  That said, it's a
nice idea and I'd be happy if someone could map a more complete dlm API
onto it.

We export our full dlm API through read/write/poll on a misc device.  All
user space apps use the dlm through a library as you'd expect.  The
library communicates with the dlm_device kernel module through
read/write/poll and the dlm_device module talks with the actual dlm:
linux/drivers/dlm/device.c  If there's a better way to do this, via a
pseudo fs or not, we'd be pleased to try it.

Dave



From teigland at redhat.com  Mon Sep  5 05:43:48 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 13:43:48 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125574523.5025.10.camel@laptopd505.fenrus.org>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
Message-ID: <20050905054348.GC11337@redhat.com>

On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:

> +void gfs2_glock_hold(struct gfs2_glock *gl)
> +{
> +	glock_hold(gl);
> +}
> 
> eh why?

You removed the comment stating exactly why, see below.  If that's not a
accepted technique in the kernel, say so and I'll be happy to change it
here and elsewhere.
Thanks,
Dave

static inline void glock_hold(struct gfs2_glock *gl)
{
        gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0);
        atomic_inc(&gl->gl_count);
}

/**
 * gfs2_glock_hold() - As glock_hold(), but suitable for exporting
 * @gl: The glock to hold
 *
 */

void gfs2_glock_hold(struct gfs2_glock *gl)
{
        glock_hold(gl);
}



From teigland at redhat.com  Mon Sep  5 06:29:16 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 14:29:16 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125574523.5025.10.camel@laptopd505.fenrus.org>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
Message-ID: <20050905062916.GA17607@redhat.com>

On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:

> +static unsigned int handle_roll(atomic_t *a)
> +{
> +	int x = atomic_read(a);
> +	if (x < 0) {
> +		atomic_set(a, 0);
> +		return 0;
> +	}
> +	return (unsigned int)x;
> +}
> 
> this is just plain scary.

Not really, it was just resetting atomic statistics counters when they
became negative.  Unecessary, though, so removed.

Dave



From penberg at cs.helsinki.fi  Mon Sep  5 06:32:59 2005
From: penberg at cs.helsinki.fi (Pekka Enberg)
Date: Mon, 5 Sep 2005 09:32:59 +0300
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905054348.GC11337@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
	<20050905054348.GC11337@redhat.com>
Message-ID: <84144f02050904233274d45230@mail.gmail.com>

On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > +void gfs2_glock_hold(struct gfs2_glock *gl)
> > +{
> > +     glock_hold(gl);
> > +}
> >
> > eh why?

On 9/5/05, David Teigland <teigland at redhat.com> wrote:
> You removed the comment stating exactly why, see below.  If that's not a
> accepted technique in the kernel, say so and I'll be happy to change it
> here and elsewhere.

Is there a reason why users of gfs2_glock_hold() cannot use
glock_hold() directly?

                            Pekka



From mark.fasheh at oracle.com  Mon Sep  5 07:09:23 2005
From: mark.fasheh at oracle.com (Mark Fasheh)
Date: Mon, 5 Sep 2005 00:09:23 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905055428.GA29158@thunk.org>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
	<20050904203344.GA1987@elf.ucw.cz>
	<20050905055428.GA29158@thunk.org>
Message-ID: <20050905070922.GK21228@ca-server1.us.oracle.com>

On Mon, Sep 05, 2005 at 01:54:28AM -0400, Theodore Ts'o wrote:
> In the ext3 case, the only time when read-only isn't quite read-only
> is when the filesystem was unmounted uncleanly and the journal needs
> to be replayed in order for the filesystem to be consistent.
Right, and OCFS2 is going to try to keep the behavior of only using the
journal for recovery in normal (soft) read-only operation.
Unfortunately other cluster nodes could die at any moment which can
complicate things as we are now required to do recovery on them to ensure
file system consistency.

Recovery of course includes things like orphan dir cleanup, etc so we need a
journal around for those transactions. To simplify all this, I'm just going
to have it load the journal as it normally does (as opposed to only when the
local node has a dirty journal) because it could be used at any moment.

Btw, I'm curious to know how useful folks find the ext3 mount options
errors=continue and errors=panic. I'm extremely likely to implement the
errors=read-only behavior as default in OCFS2 and I'm wondering whether the
other two are worth looking into.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com



From teigland at redhat.com  Mon Sep  5 07:55:28 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 15:55:28 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <84144f02050904233274d45230@mail.gmail.com>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
	<20050905054348.GC11337@redhat.com>
	<84144f02050904233274d45230@mail.gmail.com>
Message-ID: <20050905075528.GB17607@redhat.com>

On Mon, Sep 05, 2005 at 09:32:59AM +0300, Pekka Enberg wrote:
> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > > +void gfs2_glock_hold(struct gfs2_glock *gl)
> > > +{
> > > +     glock_hold(gl);
> > > +}
> > >
> > > eh why?
> 
> On 9/5/05, David Teigland <teigland at redhat.com> wrote:
> > You removed the comment stating exactly why, see below.  If that's not a
> > accepted technique in the kernel, say so and I'll be happy to change it
> > here and elsewhere.
> 
> Is there a reason why users of gfs2_glock_hold() cannot use
> glock_hold() directly?

Either set could be trivially removed.  It's such an insignificant issue
that I've removed glock_hold and put.  For the record,

within glock.c we consistently paired inlined versions of:
	glock_hold()
	glock_put()

we wanted external versions to be appropriately named so we had:
	gfs2_glock_hold()
	gfs2_glock_put()

still not sure if that technique is acceptable in this crowd or not.
Dave



From penberg at cs.helsinki.fi  Mon Sep  5 08:00:51 2005
From: penberg at cs.helsinki.fi (Pekka Enberg)
Date: Mon, 5 Sep 2005 11:00:51 +0300
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905075528.GB17607@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
	<20050905054348.GC11337@redhat.com>
	<84144f02050904233274d45230@mail.gmail.com>
	<20050905075528.GB17607@redhat.com>
Message-ID: <84144f02050905010066bc516d@mail.gmail.com>

On 9/5/05, David Teigland <teigland at redhat.com> wrote:
> Either set could be trivially removed.  It's such an insignificant issue
> that I've removed glock_hold and put.  For the record,
> 
> within glock.c we consistently paired inlined versions of:
>         glock_hold()
>         glock_put()
> 
> we wanted external versions to be appropriately named so we had:
>         gfs2_glock_hold()
>         gfs2_glock_put()
> 
> still not sure if that technique is acceptable in this crowd or not.

You still didn't answer my question why you needed two versions,
though. AFAIK you didn't which makes the other one an redundant
wrapper which are discouraged in kernel code.

                               Pekka



From pavel at ucw.cz  Mon Sep  5 08:27:35 2005
From: pavel at ucw.cz (Pavel Machek)
Date: Mon, 5 Sep 2005 10:27:35 +0200
Subject: [Linux-cluster] real read-only [was Re: GFS, what's remaining]
In-Reply-To: <20050905055428.GA29158@thunk.org>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
	<20050904203344.GA1987@elf.ucw.cz>
	<20050905055428.GA29158@thunk.org>
Message-ID: <20050905082735.GA2662@elf.ucw.cz>

Hi!

> > > - read-only mount
> > > - "specatator" mount (like ro but no journal allocated for the mount,
> > >   no fencing needed for failed node that was mounted as specatator)
> > 
> > I'd call it "real-read-only", and yes, that's very usefull
> > mount. Could we get it for ext3, too?
> 
> This is a bit of a degression, but it's quite a bit different from
> what ocfs2 is doing, where it is not necessary to replay the journal
> in order to assure filesystem consistency.  
> 
> In the ext3 case, the only time when read-only isn't quite read-only
> is when the filesystem was unmounted uncleanly and the journal needs
> to be replayed in order for the filesystem to be consistent.

Yes, I know... And that is going to be a disaster when you are
attempting to recover data from failing harddrive (and absolutely do
not want to write there).

There's a better reason, too. I do swsusp. Then I'd like to boot with
/ mounted read-only (so that I can read my config files, some
binaries, and maybe suspended image), but I absolutely may not write
to disk at this point, because I still want to resume.

Currently distros do that using initrd, but that does not allow you to
store suspended image into file, and is slightly hard to setup.

								Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address



From akpm at osdl.org  Mon Sep  5 08:54:08 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 5 Sep 2005 01:54:08 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905043033.GB11337@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
	<20050905043033.GB11337@redhat.com>
Message-ID: <20050905015408.21455e56.akpm@osdl.org>

David Teigland <teigland at redhat.com> wrote:
>
>  We export our full dlm API through read/write/poll on a misc device.
>

inotify did that for a while, but we ended up going with a straight syscall
interface.

How fat is the dlm interface?   ie: how many syscalls would it take?



From joern at wohnheim.fh-wedel.de  Mon Sep  5 08:58:08 2005
From: joern at wohnheim.fh-wedel.de (=?iso-8859-1?Q?J=F6rn?= Engel)
Date: Mon, 5 Sep 2005 10:58:08 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905034739.GA11337@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
	<20050902094403.GD16595@redhat.com>
	<20050903052821.GA23711@kroah.com>
	<20050905034739.GA11337@redhat.com>
Message-ID: <20050905085808.GA22802@wohnheim.fh-wedel.de>

On Mon, 5 September 2005 11:47:39 +0800, David Teigland wrote:
> 
> Joern already suggested moving this out of line and into a function (as it
> was before) to avoid repeating string constants.  In that case the
> function, file and line from BUG aren't useful.  We now have this, does it
> look ok?

Ok wrt. my concerns, but not with Greg's.  BUG() still gives you
everything that you need, except:
o fsid

Notice how this list is just one entry long? ;)

So how about


#define gfs2_assert(sdp, assertion) do {			\
	if (unlikely(!(assertion))) {				\
	printk(KERN_ERR "GFS2: fsid=\n", (sdp)->sd_fsname);	\
	BUG();							\
} while (0)


Or, to move the constant out of line again


void __gfs2_assert(struct gfs2_sbd *sdp) {
	printk(KERN_ERR "GFS2: fsid=\n", sdp->sd_fsname);
}

#define gfs2_assert(sdp, assertion) do {\
	if (unlikely(!(assertion))) {	\
	__gfs2_assert(sdp);		\
	BUG();				\
} while (0)


J?rn

-- 
Admonish your friends privately, but praise them openly.
-- Publilius Syrus 



From teigland at redhat.com  Mon Sep  5 09:18:47 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 17:18:47 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905085808.GA22802@wohnheim.fh-wedel.de>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
	<20050902094403.GD16595@redhat.com>
	<20050903052821.GA23711@kroah.com>
	<20050905034739.GA11337@redhat.com>
	<20050905085808.GA22802@wohnheim.fh-wedel.de>
Message-ID: <20050905091847.GD17607@redhat.com>

On Mon, Sep 05, 2005 at 10:58:08AM +0200, J?rn Engel wrote:

> #define gfs2_assert(sdp, assertion) do {			\
> 	if (unlikely(!(assertion))) {				\
> 	printk(KERN_ERR "GFS2: fsid=\n", (sdp)->sd_fsname);	\
> 	BUG();							\
> } while (0)

OK thanks,
Dave



From teigland at redhat.com  Mon Sep  5 09:24:33 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 17:24:33 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905015408.21455e56.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
	<20050905043033.GB11337@redhat.com>
	<20050905015408.21455e56.akpm@osdl.org>
Message-ID: <20050905092433.GE17607@redhat.com>

On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> David Teigland <teigland at redhat.com> wrote:
> >
> >  We export our full dlm API through read/write/poll on a misc device.
> >
> 
> inotify did that for a while, but we ended up going with a straight syscall
> interface.
> 
> How fat is the dlm interface?   ie: how many syscalls would it take?

Four functions:
  create_lockspace()
  release_lockspace()
  lock()
  unlock()

Dave



From akpm at osdl.org  Mon Sep  5 09:19:48 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 5 Sep 2005 02:19:48 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905092433.GE17607@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
	<20050905043033.GB11337@redhat.com>
	<20050905015408.21455e56.akpm@osdl.org>
	<20050905092433.GE17607@redhat.com>
Message-ID: <20050905021948.6241f1e0.akpm@osdl.org>

David Teigland <teigland at redhat.com> wrote:
>
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <teigland at redhat.com> wrote:
> > >
> > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > 
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> > 
> > How fat is the dlm interface?   ie: how many syscalls would it take?
> 
> Four functions:
>   create_lockspace()
>   release_lockspace()
>   lock()
>   unlock()

Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
is likely to object if we reserve those slots.



From phillips at istop.com  Mon Sep  5 09:30:56 2005
From: phillips at istop.com (Daniel Phillips)
Date: Mon, 5 Sep 2005 05:30:56 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905021948.6241f1e0.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050905092433.GE17607@redhat.com>
	<20050905021948.6241f1e0.akpm@osdl.org>
Message-ID: <200509050530.56787.phillips@istop.com>

On Monday 05 September 2005 05:19, Andrew Morton wrote:
> David Teigland <teigland at redhat.com> wrote:
> > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > > David Teigland <teigland at redhat.com> wrote:
> > > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > > inotify did that for a while, but we ended up going with a straight
> > > syscall interface.
> > >
> > > How fat is the dlm interface?   ie: how many syscalls would it take?
> >
> > Four functions:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
>
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

Better take a look at the actual parameter lists to those calls before jumping 
to conclusions...

Regards,

Daniel



From teigland at redhat.com  Mon Sep  5 09:48:07 2005
From: teigland at redhat.com (David Teigland)
Date: Mon, 5 Sep 2005 17:48:07 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905021948.6241f1e0.akpm@osdl.org>
References: <20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
	<20050905043033.GB11337@redhat.com>
	<20050905015408.21455e56.akpm@osdl.org>
	<20050905092433.GE17607@redhat.com>
	<20050905021948.6241f1e0.akpm@osdl.org>
Message-ID: <20050905094807.GG17607@redhat.com>

On Mon, Sep 05, 2005 at 02:19:48AM -0700, Andrew Morton wrote:
> David Teigland <teigland at redhat.com> wrote:
> > Four functions:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
> 
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

Patrick is really the expert in this area and he's off this week, but
based on what he's done with the misc device I don't see why there'd be
more than two or three parameters for any of these.

Dave



From sct at redhat.com  Mon Sep  5 10:44:08 2005
From: sct at redhat.com (Stephen C. Tweedie)
Date: Mon, 05 Sep 2005 11:44:08 +0100
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904203344.GA1987@elf.ucw.cz>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
	<20050904203344.GA1987@elf.ucw.cz>
Message-ID: <1125917048.1910.9.camel@sisko.sctweedie.blueyonder.co.uk>

Hi,

On Sun, 2005-09-04 at 21:33, Pavel Machek wrote:

> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
> 
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?

I don't want to pollute the ext3 paths with extra checks for the case
when there's no journal struct at all.  But a dummy journal struct that
isn't associated with an on-disk journal and that can never, ever go
writable would certainly be pretty easy to do.

But mount -o readonly gives you most of what you want already.  An
always-readonly option would be different in some key ways --- for a
start, it would be impossible to perform journal recovery if that's
needed, as that still needs journal and superblock write access.  That's
not necessarily a good thing.

And you *still* wouldn't get something that could act as a spectator to
a filesystem mounted writable elsewhere on a SAN, because updates on the
other node wouldn't invalidate cached data on the readonly node.  So is
this really a useful combination?

About the only combination I can think of that really makes sense in
this context is if you have a busted filesystem that somehow can't be
recovered --- either the journal is broken or the underlying device is
truly readonly --- and you want to mount without recovery in order to
attempt to see what you can find.  That's asking for data corruption,
but that may be better than getting no data at all.  

But that is something that could be done with a "-o skip-recovery" mount
option, which would necessarily imply always-readonly behaviour.

--Stephen




From lmb at suse.de  Mon Sep  5 14:14:32 2005
From: lmb at suse.de (Lars Marowsky-Bree)
Date: Mon, 5 Sep 2005 16:14:32 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509030157.31581.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<p73fysnqiej.fsf@verdi.suse.de>
	<200509030157.31581.phillips@istop.com>
Message-ID: <20050905141432.GF5498@marowsky-bree.de>

On 2005-09-03T01:57:31, Daniel Phillips <phillips at istop.com> wrote:

> The only current users of dlms are cluster filesystems.  There are zero users 
> of the userspace dlm api. 

That is incorrect, and you're contradicting yourself here:

> What does have to be resolved is a common API for node management.  It is not 
> just cluster filesystems and their lock managers that have to interface to 
> node management.  Below the filesystem layer, cluster block devices and 
> cluster volume management need to be coordinated by the same system, and 
> above the filesystem layer, applications also need to be hooked into it.  
> This work is, in a word, incomplete.

The Cluster Volume Management of LVM2 for example _does_ use simple
cluster-wide locks, and some OCFS2 scripts, I seem to recall, do too.

(EVMS2 in cluster-mode uses a verrry simple locking scheme which is
basically operated by the failover software and thus uses a different
model.)


Sincerely,
    Lars Marowsky-Br?e <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



From Axel.Thimm at ATrpms.net  Mon Sep  5 15:36:49 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Mon, 5 Sep 2005 17:36:49 +0200
Subject: [Linux-cluster] NFS relocate: old TCP/IP connection resulting in
	DUP/ACK storms and largish timeouts (was: iptables protection
	wrapper; nfsexport.sh vs ip.sh racing)
In-Reply-To: <1125425012.21943.1.camel@ayanami.boston.redhat.com>
References: <20050822225227.GJ24127@neu.nirvana>
	<1125340879.24205.30.camel@ayanami.boston.redhat.com>
	<20050829233523.GD5908@neu.nirvana>
	<1125425012.21943.1.camel@ayanami.boston.redhat.com>
Message-ID: <20050905153649.GE17096@neu.nirvana>

On Tue, Aug 30, 2005 at 02:03:32PM -0400, Lon Hohberger wrote:
> On Tue, 2005-08-30 at 01:35 +0200, Axel Thimm wrote:
> 
> > > It's really an attempt at a workaround a configuration problem -- and
> > > nothing more.
> > 
> > The above is with nfs running on all nodes already. The racing seems
> > to be with the exportfs commands and ip setup/teardown.
> > 
> > It is easy to reproduce (>=50%) if the client connects over Gigabit
> > and is in write transaction while the service is moved. We saw this in
> > two different setups. If you throttle the network bandwidth to <=
> > 20MB/sec you don't trigger the bug, so it really seems like a racing
> > problem.
> 
> ewww...  Can you bugzilla this so we can track it?  =)

will do so, we are currently still trying to figure it out properly,
so we can provide a better bug report (and separate different bugs).

One bug that has critalled out is that upon relocation the old server
keeps his TCP connections to the NFS client. When this server later on
gets to become the NFS server again, he FIN/ACKs that old connection
to the client (that had this connection torn down by now), which
creates a DUP/ACK storm.

A workaround is to shutdown nfs instead of simply unexporting like
nfsexport.sh does, so that the pending TCP connections get fried, too.

Is there a way to have ip.sh fry all open TCP/IP connections to a
service IP that is to be abandoned? I guess that would be the better
solution (that would also apply to non-NFS services).

Of course the true bug is the DUP/ACK storm that is triggered by the
old open TCP connection.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050905/2a938942/attachment.sig>

From phillips at istop.com  Mon Sep  5 15:49:49 2005
From: phillips at istop.com (Daniel Phillips)
Date: Mon, 5 Sep 2005 11:49:49 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905141432.GF5498@marowsky-bree.de>
References: <20050901104620.GA22482@redhat.com>
	<200509030157.31581.phillips@istop.com>
	<20050905141432.GF5498@marowsky-bree.de>
Message-ID: <200509051149.49929.phillips@istop.com>

On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> On 2005-09-03T01:57:31, Daniel Phillips <phillips at istop.com> wrote:
> > The only current users of dlms are cluster filesystems.  There are zero
> > users of the userspace dlm api.
>
> That is incorrect...

Application users Lars, sorry if I did not make that clear.  The issue is 
whether we need to export an all-singing-all-dancing dlm api from kernel to 
userspace today, or whether we can afford to take the necessary time to get 
it right while application writers take their time to have a good think about 
whether they even need it.

> ...and you're contradicting yourself here:

How so?  Above talks about dlm, below talks about cluster membership.

> > What does have to be resolved is a common API for node management.  It is
> > not just cluster filesystems and their lock managers that have to
> > interface to node management.  Below the filesystem layer, cluster block
> > devices and cluster volume management need to be coordinated by the same
> > system, and above the filesystem layer, applications also need to be
> > hooked into it. This work is, in a word, incomplete.

Regards,

Daniel



From alan at lxorguk.ukuu.org.uk  Mon Sep  5 12:21:34 2005
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Mon, 05 Sep 2005 13:21:34 +0100
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905021948.6241f1e0.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
	<20050905043033.GB11337@redhat.com>
	<20050905015408.21455e56.akpm@osdl.org>
	<20050905092433.GE17607@redhat.com>
	<20050905021948.6241f1e0.akpm@osdl.org>
Message-ID: <1125922894.8714.14.camel@localhost.localdomain>

On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
> 
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

If the locks are not file descriptors then answer the following:

- How are they ref counted
- What are the cleanup semantics
- How do I pass a lock between processes (AF_UNIX sockets wont work now)
- How do I poll on a lock coming free. 
- What are the semantics of lock ownership
- What rules apply for inheritance
- How do I access a lock across threads.
- What is the permission model. 
- How do I attach audit to it
- How do I write SELinux rules for it
- How do I use mount to make namespaces appear in multiple vservers

and thats for starters...

Every so often someone decides that a deeply un-unix interface with new
syscalls is a good idea. Every time history proves them totally bonkers.
There are cases for new system calls but this doesn't seem one of them.

Look at system 5 shared memory, look at system 5 ipc, and so on. You
can't use common interfaces on them, you can't select on them, you can't
sanely pass them by fd passing.

All our existing locking uses the following behaviour

	fd = open(namespace, options)
	fcntl(.. lock ...)
	blah
	flush
	fcntl(.. unlock ...)
	close

Unfortunately some people here seem to have forgotten WHY we do things
this way.

1.	The semantics of file descriptors are well understood by users and by
programs. That makes programming easier and keeps code size down
2.	Everyone knows how close() works including across fork
3.	FD passing is an obscure art but understood and just works
4.	Poll() is a standard understood interface
5.	Ownership of files is a standard model
6.	FD passing across fork/exec is controlled in a standard way
7.	The semantics for threaded applications are defined
8.	Permissions are a standard model
9.	Audit just works with the same tools
9.	SELinux just works with the same tools
10.	I don't need specialist applications to see the system state (the
whole point of sysfs yet someone wants to break it all again)
11.	fcntl fd locking is a posix standard interface with precisely
defined semantics. Our extensions including leases are very powerful
12.	And yes - fcntl fd locking supports mandatory locking too. That also
is standards based with precise semantics.


Everyone understands how to use the existing locking operations. So if
you use the existing interfaces with some small extensions if neccessary
everyone understands how to use cluster locks. Isn't that neat....




From alan at lxorguk.ukuu.org.uk  Sun Sep  4 08:37:15 2005
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Sun, 04 Sep 2005 09:37:15 +0100
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050903214653.1b8a8cb7.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
Message-ID: <1125823035.23858.10.camel@localhost.localdomain>

On Sad, 2005-09-03 at 21:46 -0700, Andrew Morton wrote:
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.

The semantics of O_NONBLOCK on many other devices are "trylock"
semantics. OSS audio has those semantics for example, as do regular
files in the presence of SYS5 mandatory locks. While the latter is "try
lock , do operation and then drop lock" the drivers using O_NDELAY are
very definitely providing trylock semantics.

I am curious why a lock manager uses open to implement its locking
semantics rather than using the locking API (POSIX locks etc) however.

Alan



From greg.freemyer at gmail.com  Mon Sep  5 16:41:03 2005
From: greg.freemyer at gmail.com (Greg Freemyer)
Date: Mon, 5 Sep 2005 12:41:03 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125917048.1910.9.camel@sisko.sctweedie.blueyonder.co.uk>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com> <20050904203344.GA1987@elf.ucw.cz>
	<1125917048.1910.9.camel@sisko.sctweedie.blueyonder.co.uk>
Message-ID: <87f94c3705090509411af94019@mail.gmail.com>

On 9/5/05, Stephen C. Tweedie <sct at redhat.com> wrote:
> 
> Hi,
> 
> On Sun, 2005-09-04 at 21:33, Pavel Machek wrote:
> 
> > > - read-only mount
> > > - "specatator" mount (like ro but no journal allocated for the mount,
> > > no fencing needed for failed node that was mounted as specatator)
> >
> > I'd call it "real-read-only", and yes, that's very usefull
> > mount. Could we get it for ext3, too?
> 
> I don't want to pollute the ext3 paths with extra checks for the case
> when there's no journal struct at all. But a dummy journal struct that
> isn't associated with an on-disk journal and that can never, ever go
> writable would certainly be pretty easy to do.
> 
> But mount -o readonly gives you most of what you want already. An
> always-readonly option would be different in some key ways --- for a
> start, it would be impossible to perform journal recovery if that's
> needed, as that still needs journal and superblock write access. That's
> not necessarily a good thing.
> 
> And you *still* wouldn't get something that could act as a spectator to
> a filesystem mounted writable elsewhere on a SAN, because updates on the
> other node wouldn't invalidate cached data on the readonly node. So is
> this really a useful combination?
> 
> About the only combination I can think of that really makes sense in
> this context is if you have a busted filesystem that somehow can't be
> recovered --- either the journal is broken or the underlying device is
> truly readonly --- and you want to mount without recovery in order to
> attempt to see what you can find. That's asking for data corruption,
> but that may be better than getting no data at all.
> 
> But that is something that could be done with a "-o skip-recovery" mount
> option, which would necessarily imply always-readonly behaviour.
> 
> --Stephen


This is getting way off-thread, but xfs does not do journal replay on 
read-only mount. This was required due to filesystem snapshots which are 
often truly read-only. i.e. All LVM1 snapshots are truly read-only. Also 
many FC arrays support read-only snapshots as well.

I'm not sure how ext3 supports those environments (I use XFS when I need 
snapshot capability).

The above -skip-recovery option might be required?

Greg
-- 
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050905/1bf83d9a/attachment.htm>

From Axel.Thimm at ATrpms.net  Mon Sep  5 18:21:43 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Mon, 5 Sep 2005 20:21:43 +0200
Subject: [Linux-cluster] Re: NFS relocate: old TCP/IP connection resulting
	in DUP/ACK storms and largish timeouts
In-Reply-To: <20050905153649.GE17096@neu.nirvana>
References: <20050822225227.GJ24127@neu.nirvana>
	<1125340879.24205.30.camel@ayanami.boston.redhat.com>
	<20050829233523.GD5908@neu.nirvana>
	<1125425012.21943.1.camel@ayanami.boston.redhat.com>
	<20050905153649.GE17096@neu.nirvana>
Message-ID: <20050905182143.GA2099@neu.nirvana>

On Mon, Sep 05, 2005 at 05:36:49PM +0200, Axel Thimm wrote:
> On Tue, Aug 30, 2005 at 02:03:32PM -0400, Lon Hohberger wrote:
> > On Tue, 2005-08-30 at 01:35 +0200, Axel Thimm wrote:
> > 
> > > > It's really an attempt at a workaround a configuration problem -- and
> > > > nothing more.
> > > 
> > > The above is with nfs running on all nodes already. The racing seems
> > > to be with the exportfs commands and ip setup/teardown.

> > ewww...  Can you bugzilla this so we can track it?  =)

> One bug that has critalled out is that upon relocation the old server
> keeps his TCP connections to the NFS client. When this server later on
> gets to become the NFS server again, he FIN/ACKs that old connection
> to the client (that had this connection torn down by now), which
> creates a DUP/ACK storm.
> 
> A workaround is to shutdown nfs instead of simply unexporting like
> nfsexport.sh does, so that the pending TCP connections get fried, too.
> 
> Is there a way to have ip.sh fry all open TCP/IP connections to a
> service IP that is to be abandoned? I guess that would be the better
> solution (that would also apply to non-NFS services).
> 
> Of course the true bug is the DUP/ACK storm that is triggered by the
> old open TCP connection.

Both bugs have been filed in bugzilla:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167571
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167572

I guess the latter will move to another component like "kernel", if it
really turns out to be neither cluster nor even nfs specific.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050905/e93a924a/attachment.sig>

From akpm at osdl.org  Mon Sep  5 19:53:09 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 5 Sep 2005 12:53:09 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125922894.8714.14.camel@localhost.localdomain>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
	<20050905043033.GB11337@redhat.com>
	<20050905015408.21455e56.akpm@osdl.org>
	<20050905092433.GE17607@redhat.com>
	<20050905021948.6241f1e0.akpm@osdl.org>
	<1125922894.8714.14.camel@localhost.localdomain>
Message-ID: <20050905125309.4b657b08.akpm@osdl.org>

Alan Cox <alan at lxorguk.ukuu.org.uk> wrote:
>
> On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
>  > >   create_lockspace()
>  > >   release_lockspace()
>  > >   lock()
>  > >   unlock()
>  > 
>  > Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
>  > is likely to object if we reserve those slots.
> 
>  If the locks are not file descriptors then answer the following:
> 
>  - How are they ref counted
>  - What are the cleanup semantics
>  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
>  - How do I poll on a lock coming free. 
>  - What are the semantics of lock ownership
>  - What rules apply for inheritance
>  - How do I access a lock across threads.
>  - What is the permission model. 
>  - How do I attach audit to it
>  - How do I write SELinux rules for it
>  - How do I use mount to make namespaces appear in multiple vservers
> 
>  and thats for starters...

Return an fd from create_lockspace().



From karon at gmx.net  Mon Sep  5 20:52:57 2005
From: karon at gmx.net (Andreas Brosche)
Date: Mon, 5 Sep 2005 22:52:57 +0200 (MEST)
Subject: [Linux-cluster] Using GFS without a network?
Message-ID: <16102.1125953577@www46.gmx.net>

Hello *, 

we have two networks which require to be topologically separated.
Nevertheless, data exchange shall be possible. We're thinking about
implementing two servers (one for each network) with a shared storage (one
or two SCSI disks on a shared bus between the two servers). As local
filesystems do not make sense on shared SCSI busses, we're thinking about
implementing a solution based on GFS. 

The point is: Synchronisation of the file system access must be handled
either via the SCSI bus or via the disks themselves, since an ethernet link
between the two machines does not fit into our security concept. 
I've read about the concept of quorum disks which are based on raw devices.
Raw devices skip the Linux read/write cache, so the necessary heartbeat
channel and filesystem cache sync could be done via a shared raw partition.
Is this possible? All the docs I have read imply setting up a cluster, and
thus a network, which is not an option. I want to avoid complexity in this
one. We do not have real time requirements, so a periodic update of the
shared raw data should suffice.

Long story cut short, we want
- GFS on a shared SCSI disk (Performance is not important)
- dlm without network access (theoretically possible...
  but how dependant is GFS on the cluster services?)

Regards

Andreas Brosche

-- 
5 GB Mailbox, 50 FreeSMS http://www.gmx.net/de/go/promail
+++ GMX - die erste Adresse f?r Mail, Message, More +++



From alan at lxorguk.ukuu.org.uk  Mon Sep  5 23:20:11 2005
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Tue, 06 Sep 2005 00:20:11 +0100
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905125309.4b657b08.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
	<20050905043033.GB11337@redhat.com>
	<20050905015408.21455e56.akpm@osdl.org>
	<20050905092433.GE17607@redhat.com>
	<20050905021948.6241f1e0.akpm@osdl.org>
	<1125922894.8714.14.camel@localhost.localdomain>
	<20050905125309.4b657b08.akpm@osdl.org>
Message-ID: <1125962411.8714.46.camel@localhost.localdomain>

On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
> >  - How are they ref counted
> >  - What are the cleanup semantics
> >  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
> >  - How do I poll on a lock coming free. 
> >  - What are the semantics of lock ownership
> >  - What rules apply for inheritance
> >  - How do I access a lock across threads.
> >  - What is the permission model. 
> >  - How do I attach audit to it
> >  - How do I write SELinux rules for it
> >  - How do I use mount to make namespaces appear in multiple vservers
> > 
> >  and thats for starters...
> 
> Return an fd from create_lockspace().

That only answers about four of the questions. The rest only come out if
create_lockspace behaves like a file system - in other words
create_lockspace is better known as either mkdir or mount.

Its certainly viable to make the lock/unlock functions taken a fd, it's
just not clear why the current lock/unlock functions we have won't do
the job. Being able to extend the functionality to leases later on may
be very powerful indeed and will fit the existing API



From akpm at osdl.org  Mon Sep  5 23:06:13 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 5 Sep 2005 16:06:13 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125962411.8714.46.camel@localhost.localdomain>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
	<20050905043033.GB11337@redhat.com>
	<20050905015408.21455e56.akpm@osdl.org>
	<20050905092433.GE17607@redhat.com>
	<20050905021948.6241f1e0.akpm@osdl.org>
	<1125922894.8714.14.camel@localhost.localdomain>
	<20050905125309.4b657b08.akpm@osdl.org>
	<1125962411.8714.46.camel@localhost.localdomain>
Message-ID: <20050905160613.7b0ee7fc.akpm@osdl.org>

Alan Cox <alan at lxorguk.ukuu.org.uk> wrote:
>
> On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
>  > >  - How are they ref counted
>  > >  - What are the cleanup semantics
>  > >  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
>  > >  - How do I poll on a lock coming free. 
>  > >  - What are the semantics of lock ownership
>  > >  - What rules apply for inheritance
>  > >  - How do I access a lock across threads.
>  > >  - What is the permission model. 
>  > >  - How do I attach audit to it
>  > >  - How do I write SELinux rules for it
>  > >  - How do I use mount to make namespaces appear in multiple vservers
>  > > 
>  > >  and thats for starters...
>  > 
>  > Return an fd from create_lockspace().
> 
>  That only answers about four of the questions. The rest only come out if
>  create_lockspace behaves like a file system - in other words
>  create_lockspace is better known as either mkdir or mount.

But David said that "We export our full dlm API through read/write/poll on
a misc device.".  That miscdevice will simply give us an fd.  Hence my
suggestion that the miscdevice be done away with in favour of a dedicated
syscall which returns an fd.

What does a filesystem have to do with this?



From Joel.Becker at oracle.com  Mon Sep  5 23:32:36 2005
From: Joel.Becker at oracle.com (Joel Becker)
Date: Mon, 5 Sep 2005 16:32:36 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125823035.23858.10.camel@localhost.localdomain>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<1125823035.23858.10.camel@localhost.localdomain>
Message-ID: <20050905233236.GF8684@ca-server1.us.oracle.com>

On Sun, Sep 04, 2005 at 09:37:15AM +0100, Alan Cox wrote:
> I am curious why a lock manager uses open to implement its locking
> semantics rather than using the locking API (POSIX locks etc) however.

	Because it is simple (how do you fcntl(2) from a shell fd?), has no
ranges (what do you do with ranges passed in to fcntl(2) and you don't
support them?), and has a well-known fork(2)/exec(2) pattern.  fcntl(2)
has a known but less intuitive fork(2) pattern.
	The real reason, though, is that we never considered fcntl(2).
We could never think of a case when a process wanted a lock fd open but
not locked.  At least, that's my recollection.  Mark might have more to
comment.

Joel

-- 

"In the room the women come and go
 Talking of Michaelangelo."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127



From phillips at istop.com  Tue Sep  6 00:57:23 2005
From: phillips at istop.com (Daniel Phillips)
Date: Mon, 5 Sep 2005 20:57:23 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509051118.45792.dtor_core@ameritech.net>
References: <20050901104620.GA22482@redhat.com>
	<200509051149.49929.phillips@istop.com>
	<200509051118.45792.dtor_core@ameritech.net>
Message-ID: <200509052057.23807.phillips@istop.com>

On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > On 2005-09-03T01:57:31, Daniel Phillips <phillips at istop.com> wrote:
> > > > The only current users of dlms are cluster filesystems.  There are
> > > > zero users of the userspace dlm api.
> > >
> > > That is incorrect...
> >
> > Application users Lars, sorry if I did not make that clear.  The issue is
> > whether we need to export an all-singing-all-dancing dlm api from kernel
> > to userspace today, or whether we can afford to take the necessary time
> > to get it right while application writers take their time to have a good
> > think about whether they even need it.
>
> If Linux fully supported OpenVMS DLM semantics we could start thinking
> asbout moving our application onto a Linux box because our alpha server is
> aging.
>
> That's just my user application writer $0.02.

What stops you from trying it with the patch?  That kind of feedback would be 
worth way more than $0.02.

Regards,

Daniel



From phillips at istop.com  Tue Sep  6 04:02:40 2005
From: phillips at istop.com (Daniel Phillips)
Date: Tue, 6 Sep 2005 00:02:40 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509052103.20519.dtor_core@ameritech.net>
References: <20050901104620.GA22482@redhat.com>
	<200509052057.23807.phillips@istop.com>
	<200509052103.20519.dtor_core@ameritech.net>
Message-ID: <200509060002.40823.phillips@istop.com>

On Monday 05 September 2005 22:03, Dmitry Torokhov wrote:
> On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > > On 2005-09-03T01:57:31, Daniel Phillips <phillips at istop.com> wrote:
> > > > > > The only current users of dlms are cluster filesystems.  There
> > > > > > are zero users of the userspace dlm api.
> > > > >
> > > > > That is incorrect...
> > > >
> > > > Application users Lars, sorry if I did not make that clear.  The
> > > > issue is whether we need to export an all-singing-all-dancing dlm api
> > > > from kernel to userspace today, or whether we can afford to take the
> > > > necessary time to get it right while application writers take their
> > > > time to have a good think about whether they even need it.
> > >
> > > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > > asbout moving our application onto a Linux box because our alpha server
> > > is aging.
> > >
> > > That's just my user application writer $0.02.
> >
> > What stops you from trying it with the patch?  That kind of feedback
> > would be worth way more than $0.02.
>
> We do not have such plans at the moment and I prefer spending my free
> time on tinkering with kernel, not rewriting some in-house application.
> Besides, DLM is not the only thing that does not have a drop-in
> replacement in Linux.
>
> You just said you did not know if there are any potential users for the
> full DLM and I said there are some.

I did not say "potential", I said there are zero dlm applications at the 
moment.  Nobody has picked up the prototype (g)dlm api, used it in an 
application and said "gee this works great, look what it does".

I also claim that most developers who think that using a dlm for application 
synchronization would be really cool are probably wrong.  Use sockets for 
synchronization exactly as for a single-node, multi-tasking application and 
you will end up with less code, more obviously correct code, probably more 
efficient and... you get an optimal, single-node version for free.

And I also claim that there is precious little reason to have a full-featured 
dlm in-kernel.  Being in-kernel has no benefit for a userspace application.  
But being in-kernel does add kernel bloat, because there will be extra 
features lathered on that are not needed by the only in-kernel user, the 
cluster filesystem.

In the case of your port, you'd be better off hacking up a userspace library 
to provide OpenVMS dlm semantics exactly, not almost.

By the way, you said "alpha server" not "alpha servers", was that just a slip?  
Because if you don't have a cluster then why are you using a dlm?

Regards,

Daniel



From phillips at istop.com  Tue Sep  6 04:58:44 2005
From: phillips at istop.com (Daniel Phillips)
Date: Tue, 6 Sep 2005 00:58:44 -0400
Subject: [Linux-cluster] Re: GFS, what's remainingh
In-Reply-To: <200509052307.27417.dtor_core@ameritech.net>
References: <20050901104620.GA22482@redhat.com>
	<200509060002.40823.phillips@istop.com>
	<200509052307.27417.dtor_core@ameritech.net>
Message-ID: <200509060058.44934.phillips@istop.com>

On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
> On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> > By the way, you said "alpha server" not "alpha servers", was that just a
> > slip? Because if you don't have a cluster then why are you using a dlm?
>
> No, it is not a slip. The application is running on just one node, so we
> do not really use "distributed" part. However we make heavy use of the
> rest of lock manager features, especially lock value blocks.

Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature 
without even having the excuse you were forced to use it.  Why don't you just 
have a daemon that sends your values over a socket?  That should be all of a 
day's coding.

Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. 
But you nicely supported my claim that most who think they should be using a 
dlm, really shouldn't.

Regards,

Daniel



From phillips at istop.com  Tue Sep  6 06:48:47 2005
From: phillips at istop.com (Daniel Phillips)
Date: Tue, 6 Sep 2005 02:48:47 -0400
Subject: [Linux-cluster] Re: GFS, what's remainingh
In-Reply-To: <200509060005.59578.dtor_core@ameritech.net>
References: <20050901104620.GA22482@redhat.com>
	<200509060058.44934.phillips@istop.com>
	<200509060005.59578.dtor_core@ameritech.net>
Message-ID: <200509060248.47433.phillips@istop.com>

On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> do you think it is a bit premature to dismiss something even without
> ever seeing the code?

You told me you are using a dlm for a single-node application, is there 
anything more I need to know?

Regards,

Daniel



From phillips at istop.com  Tue Sep  6 07:18:24 2005
From: phillips at istop.com (Daniel Phillips)
Date: Tue, 6 Sep 2005 03:18:24 -0400
Subject: [Linux-cluster] Re: GFS, what's remainingh
In-Reply-To: <200509060155.04685.dtor_core@ameritech.net>
References: <20050901104620.GA22482@redhat.com>
	<200509060248.47433.phillips@istop.com>
	<200509060155.04685.dtor_core@ameritech.net>
Message-ID: <200509060318.25260.phillips@istop.com>

On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote:
> On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > > do you think it is a bit premature to dismiss something even without
> > > ever seeing the code?
> >
> > You told me you are using a dlm for a single-node application, is there
> > anything more I need to know?
>
> I would still like to know why you consider it a "sin". On OpenVMS it is
> fast, provides a way of cleaning up...

There is something hard about handling EPIPE?

> and does not introduce single point 
> of failure as it is the case with a daemon. And if we ever want to spread
> the load between 2 boxes we easily can do it.

But you said it runs on an aging Alpha, surely you do not intend to expand it 
to two aging Alphas?  And what makes you think that socket-based 
synchronization keeps you from spreading out the load over multiple boxes?

> Why would I not want to use it?

It is not the right tool for the job from what you have told me.  You want to 
get a few bytes of information from one task to another?  Use a socket, as 
God intended.

Regards,

Daniel



From alan at lxorguk.ukuu.org.uk  Tue Sep  6 13:42:29 2005
From: alan at lxorguk.ukuu.org.uk (Alan Cox)
Date: Tue, 06 Sep 2005 14:42:29 +0100
Subject: [Linux-cluster] Re: GFS, what's remainingh
In-Reply-To: <200509060248.47433.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509060058.44934.phillips@istop.com>
	<200509060005.59578.dtor_core@ameritech.net>
	<200509060248.47433.phillips@istop.com>
Message-ID: <1126014150.22131.51.camel@localhost.localdomain>

On Maw, 2005-09-06 at 02:48 -0400, Daniel Phillips wrote:
> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > do you think it is a bit premature to dismiss something even without
> > ever seeing the code?
> 
> You told me you are using a dlm for a single-node application, is there 
> anything more I need to know?

That's standard practice for many non-Unix operating systems. It means
your code supports failover without much additional work and it provides
all the functionality for locks on a single node too



From sdake at mvista.com  Fri Sep  2 18:45:22 2005
From: sdake at mvista.com (Steven Dake)
Date: Fri, 02 Sep 2005 11:45:22 -0700
Subject: [Linux-cluster] partly OT: failover <500ms
In-Reply-To: <4317F945.5050307@redhat.com>
References: <20050901215836.634334a1.pegasus@nerv.eu.org>
	<1125610799.14500.105.camel@ayanami.boston.redhat.com>
	<4317F945.5050307@redhat.com>
Message-ID: <1125686722.28688.26.camel@unnamed.az.mvista.com>

On Fri, 2005-09-02 at 08:03 +0100, Patrick Caulfield wrote:
> Lon Hohberger wrote:
> > On Thu, 2005-09-01 at 21:58 +0200, Jure Pe?ar wrote:
> > 
> >>Hi all,
> >>
> >>Sorry if this is somewhat offtopic here ...
> >>
> >>Our telco is looking into linux HA solutions for their VoIP needs. Their
> >>main requirement is that the failover happens in the order of a few 100ms. 
> >>
> >>Can redhat cluster be tweaked to work reliably with such short time
> >>periods? This would mean heartbeat on the level of few ms and status probes
> >>on the level of 10ms. Is this even feasible?
> > 
> > 
> > Possibly, I don't think it can do it right now.  A couple of things to
> > remember:
> > 
> > * For such a fast requirement, you'll want a dedicated network for
> > cluster traffic and a real-time kernel.
> > 
> > * Also, "detection and initiation of recovery" is all the cluster
> > software can do for you; your application - by itself - may take longer
> > than this to recover.
> > 
> > * It's practically impossible to guarantee completion of I/O fencing in
> > this amount of time, so your application must be able to do without, or
> > you need to create a new specialized fencing mechanism which is
> > guaranteed to complete within a very fast time.
> > 
> > * I *think* CMAN is currently at the whole-second granularity, so some
> > changes would need to be made to give it finer granularity.  This
> > shouldn't be difficult (but I'll let the developers of CMAN answer this
> > definitively, though... ;) )
> > 
> 
> All true :) All cman timers are calibrated in seconds. I did run some tests a
> while ago with them in milliseconds and 100ms timeouts and it worked
> /reasonably/ well. However, without an RT kernel I wouldn't like to put this
> into a production system - we've had several instances of the cman kernel thread
> (which runs at the top RT priority) being stalled for up to 5 seconds and that
> node being fenced. Smaller stalls may be more common so with timeouts set that
> low you may well get nodes fenced for small delays.
> 
> To be quite honest I'm not really sure what causes these stalls, as they
> generally happen under heavy IO load I assume (possibly wrongly) that they are
> related to disk flushes but someone who knows the VM better may out me right on
> this.
> 
> 

These systems could have swap..  Swap doesn't work because it is
possible for a swapped page to take 1-10 seconds to be swapped into
memory.  The mlockall() system call resolves this particular problem.

The poll sendmsg and recvmsg (and some others that require memory)
system calls can block when allocating memory in low memory conditions.
This unfortunately results in longer timeouts necessary when the system
is overloaded.  One solution is to change these system calls via some
kind of socket option to allocate memory ahead of time for their
operation.  But I don't know of anything like this yet.

I have measured failover with openais at 3 msec from detection to
direction of new CSIs within components.  Application failures are
detected in 100 msec.  Node failures are detected in 100 msec.  It is
possible on a system that meets the above scenario for a processor to be
excluded from the membership during low memory.

This is a reasonable choice, because the processor is having difficulty
responding to requests in a timely fashion, and should be removed until
overload control software on the processor cleans up the processor
memory.

Regards
-steve




From ak at suse.de  Fri Sep  2 21:17:08 2005
From: ak at suse.de (Andi Kleen)
Date: 02 Sep 2005 23:17:08 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050901132104.2d643ccd.akpm@osdl.org>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
Message-ID: <p73fysnqiej.fsf@verdi.suse.de>

Andrew Morton <akpm at osdl.org> writes:

> 
> > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > >   possibly gain (or vice versa)
> > > 
> > > - Relative merits of the two offerings
> > 
> > You missed the important one - people actively use it and have been for
> > some years. Same reason with have NTFS, HPFS, and all the others. On
> > that alone it makes sense to include.
>  
> Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> the technical reasons for merging gfs[2], ocfs2, both or neither?

There seems to be clearly a need for a shared-storage fs of some sort
for HA clusters and virtualized usage (multiple guests sharing a
partition).  Shared storage can be more efficient than network file
systems like NFS because the storage access is often more efficient
than network access  and it is more reliable because it doesn't have a
single point of failure in form of the NFS server.

It's also a logical extension of the "failover on failure" clusters
many people run now - instead of only failing over the shared fs at
failure and keeping one machine idle the load can be balanced between
multiple machines at any time.

One argument to merge both might be that nobody really knows yet which
shared-storage file system (GFS or OCFS2) is better. The only way to
find out would be to let the user base try out both, and that's most
practical when they're merged.

Personally I think ocfs2 has nicer&cleaner code than GFS.
It seems to be more or less a 64bit ext3 with cluster support, while
GFS seems to reinvent a lot more things and has somewhat uglier code.
On the other hand GFS' cluster support seems to be more aimed
at being a universal cluster service open for other usages too,
which might be a good thing. OCFS2s cluster seems to be more 
aimed at only serving the file system.

But which one works better in practice is really an open question.

The only thing that should be probably resolved is a common API
for at least the clustered lock manager. Having multiple
incompatible user space APIs for that would be sad.

-Andi



From hbryan at us.ibm.com  Fri Sep  2 23:03:33 2005
From: hbryan at us.ibm.com (Bryan Henderson)
Date: Fri, 2 Sep 2005 16:03:33 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <p73fysnqiej.fsf@verdi.suse.de>
Message-ID: <OFC7B937B5.B0B2BC62-ON88257070.007B3980-88257070.007EAB21@us.ibm.com>

I have to correct an error in perspective, or at least in the wording of 
it, in the following, because it affects how people see the big picture in 
trying to decide how the filesystem types in question fit into the world:

>Shared storage can be more efficient than network file
>systems like NFS because the storage access is often more efficient
>than network access

The shared storage access _is_ network access.  In most cases, it's a 
fibre channel/FCP network.  Nowadays, it's more and more common for it to 
be a TCP/IP network just like the one folks use for NFS (but carrying 
ISCSI instead of NFS).  It's also been done with a handful of other 
TCP/IP-based block storage protocols.

The reason the storage access is expected to be more efficient than the 
NFS access is because the block access network protocols are supposed to 
be more efficient than the file access network protocols.

In reality, I'm not sure there really is such a difference in efficiency 
between the protocols.  The demonstrated differences in efficiency, or at 
least in speed, are due to other things that are different between a given 
new shared block implementation and a given old shared file 
implementation.

But there's another advantage to shared block over shared file that hasn't 
been mentioned yet:  some people find it easier to manage a pool of blocks 
than a pool of filesystems.

>it is more reliable because it doesn't have a
>single point of failure in form of the NFS server.

This advantage isn't because it's shared (block) storage, but because it's 
a distributed filesystem.  There are shared storage filesystems (e.g. IBM 
SANFS, ADIC StorNext) that have a centralized metadata or locking server 
that makes them unreliable (or unscalable) in the same ways as an NFS 
server.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems



From michaelc at cs.wisc.edu  Sat Sep  3 05:18:56 2005
From: michaelc at cs.wisc.edu (Mike Christie)
Date: Sat, 03 Sep 2005 00:18:56 -0500
Subject: [Linux-cluster] [PATCH] rm PRIX64 and friends
Message-ID: <1125724736.26239.1.camel@max>

On EMT64 the macros spits out

warning: format ?%lu? expects type ?long unsigned int?, but argument 5 has type ?uint64_t?

Since most of places these macros are used are in places where we
use uint64_t or int64_t the patch just has gfs2 use llu instead
of trying to define some macros.

Index: gfs2-kernel/src/gfs2/gfs2.h
===================================================================
RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/gfs2.h,v
retrieving revision 1.13
diff -a -u -p -r1.13 gfs2.h
--- gfs2-kernel/src/gfs2/gfs2.h	22 Aug 2005 07:25:29 -0000	1.13
+++ gfs2-kernel/src/gfs2/gfs2.h	3 Sep 2005 05:07:31 -0000
@@ -34,18 +34,6 @@
 #define NO_FORCE 0
 #define FORCE 1
 
-#if (BITS_PER_LONG == 64)
-#define PRIu64 "lu"
-#define PRId64 "ld"
-#define PRIx64 "lx"
-#define PRIX64 "lX"
-#else
-#define PRIu64 "Lu"
-#define PRId64 "Ld"
-#define PRIx64 "Lx"
-#define PRIX64 "LX"
-#endif
-
 /*  Divide num by den.  Round up if there is a remainder.  */
 #define DIV_RU(num, den) (((num) + (den) - 1) / (den))
 #define MAKE_MULT8(x) (((x) + 7) & ~7)
Index: gfs2-kernel/src/gfs2/gfs2_ondisk.h
===================================================================
RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/gfs2_ondisk.h,v
retrieving revision 1.13
diff -a -u -p -r1.13 gfs2_ondisk.h
--- gfs2-kernel/src/gfs2/gfs2_ondisk.h	2 Sep 2005 09:06:54 -0000	1.13
+++ gfs2-kernel/src/gfs2/gfs2_ondisk.h	3 Sep 2005 05:07:31 -0000
@@ -509,8 +509,8 @@ void gfs2_inum_out(struct gfs2_inum *no,
 
 void gfs2_inum_print(struct gfs2_inum *no)
 {
-	pv(no, no_formal_ino, "%"PRIu64);
-	pv(no, no_addr, "%"PRIu64);
+	pv(no, no_formal_ino, "%llu");
+	pv(no, no_addr, "%llu");
 }
 
 void gfs2_meta_header_in(struct gfs2_meta_header *mh, char *buf)
@@ -538,7 +538,7 @@ void gfs2_meta_header_print(struct gfs2_
 	pv(mh, mh_magic, "0x%.8X");
 	pv(mh, mh_type, "%u");
 	pv(mh, mh_format, "%u");
-	pv(mh, mh_blkno, "%"PRIu64);
+	pv(mh, mh_blkno, "%llu");
 }
 
 void gfs2_sb_in(struct gfs2_sb *sb, char *buf)
@@ -627,11 +627,11 @@ void gfs2_rindex_out(struct gfs2_rindex 
 
 void gfs2_rindex_print(struct gfs2_rindex *ri)
 {
-	pv(ri, ri_addr, "%"PRIu64);
+	pv(ri, ri_addr, "%llu");
 	pv(ri, ri_length, "%u");
 	pv(ri, ri_pad, "%u");
 
-	pv(ri, ri_data0, "%"PRIu64);
+	pv(ri, ri_data0, "%llu");
 	pv(ri, ri_data, "%u");
 
 	pv(ri, ri_bitbytes, "%u");
@@ -693,9 +693,9 @@ void gfs2_quota_out(struct gfs2_quota *q
 
 void gfs2_quota_print(struct gfs2_quota *qu)
 {
-	pv(qu, qu_limit, "%"PRIu64);
-	pv(qu, qu_warn, "%"PRIu64);
-	pv(qu, qu_value, "%"PRId64);
+	pv(qu, qu_limit, "%llu");
+	pv(qu, qu_warn, "%llu");
+	pv(qu, qu_value, "%lld");
 }
 
 void gfs2_dinode_in(struct gfs2_dinode *di, char *buf)
@@ -775,16 +775,16 @@ void gfs2_dinode_print(struct gfs2_dinod
 	pv(di, di_uid, "%u");
 	pv(di, di_gid, "%u");
 	pv(di, di_nlink, "%u");
-	pv(di, di_size, "%"PRIu64);
-	pv(di, di_blocks, "%"PRIu64);
-	pv(di, di_atime, "%"PRId64);
-	pv(di, di_mtime, "%"PRId64);
-	pv(di, di_ctime, "%"PRId64);
+	pv(di, di_size, "%llu");
+	pv(di, di_blocks, "%llu");
+	pv(di, di_atime, "%lld");
+	pv(di, di_mtime, "%lld");
+	pv(di, di_ctime, "%lld");
 	pv(di, di_major, "%u");
 	pv(di, di_minor, "%u");
 
-	pv(di, di_goal_meta, "%"PRIu64);
-	pv(di, di_goal_data, "%"PRIu64);
+	pv(di, di_goal_meta, "%llu");
+	pv(di, di_goal_data, "%llu");
 
 	pv(di, di_flags, "0x%.8X");
 	pv(di, di_payload_format, "%u");
@@ -793,7 +793,7 @@ void gfs2_dinode_print(struct gfs2_dinod
 	pv(di, di_depth, "%u");
 	pv(di, di_entries, "%u");
 
-	pv(di, di_eattr, "%"PRIu64);
+	pv(di, di_eattr, "%llu");
 
 	pa(di, di_reserved, 32);
 }
@@ -873,7 +873,7 @@ void gfs2_leaf_print(struct gfs2_leaf *l
 	pv(lf, lf_depth, "%u");
 	pv(lf, lf_entries, "%u");
 	pv(lf, lf_dirent_format, "%u");
-	pv(lf, lf_next, "%"PRIu64);
+	pv(lf, lf_next, "%llu");
 
 	pa(lf, lf_reserved, 32);
 }
@@ -948,7 +948,7 @@ void gfs2_log_header_out(struct gfs2_log
 void gfs2_log_header_print(struct gfs2_log_header *lh)
 {
 	gfs2_meta_header_print(&lh->lh_header);
-	pv(lh, lh_sequence, "%"PRIu64);
+	pv(lh, lh_sequence, "%llu");
 	pv(lh, lh_flags, "0x%.8X");
 	pv(lh, lh_tail, "%u");
 	pv(lh, lh_blkno, "%u");
@@ -1010,8 +1010,8 @@ void gfs2_inum_range_out(struct gfs2_inu
 
 void gfs2_inum_range_print(struct gfs2_inum_range *ir)
 {
-	pv(ir, ir_start, "%"PRIu64);
-	pv(ir, ir_length, "%"PRIu64);
+	pv(ir, ir_start, "%llu");
+	pv(ir, ir_length, "%llu");
 }
 
 void gfs2_statfs_change_in(struct gfs2_statfs_change *sc, char *buf)
@@ -1034,9 +1034,9 @@ void gfs2_statfs_change_out(struct gfs2_
 
 void gfs2_statfs_change_print(struct gfs2_statfs_change *sc)
 {
-	pv(sc, sc_total, "%"PRId64);
-	pv(sc, sc_free, "%"PRId64);
-	pv(sc, sc_dinodes, "%"PRId64);
+	pv(sc, sc_total, "%lld");
+	pv(sc, sc_free, "%lld");
+	pv(sc, sc_dinodes, "%lld");
 }
 
 void gfs2_unlinked_tag_in(struct gfs2_unlinked_tag *ut, char *buf)
@@ -1084,7 +1084,7 @@ void gfs2_quota_change_out(struct gfs2_q
 
 void gfs2_quota_change_print(struct gfs2_quota_change *qc)
 {
-	pv(qc, qc_change, "%"PRId64);
+	pv(qc, qc_change, "%lld");
 	pv(qc, qc_flags, "0x%.8X");
 	pv(qc, qc_id, "%u");
 }
Index: gfs2-kernel/src/gfs2/glock.c
===================================================================
RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/glock.c,v
retrieving revision 1.30
diff -a -u -p -r1.30 glock.c
--- gfs2-kernel/src/gfs2/glock.c	19 Aug 2005 07:52:14 -0000	1.30
+++ gfs2-kernel/src/gfs2/glock.c	3 Sep 2005 05:07:31 -0000
@@ -2370,7 +2370,7 @@ static int dump_inode(struct gfs2_inode 
 	int error = -ENOBUFS;
 
 	gfs2_printf("  Inode:\n");
-	gfs2_printf("    num = %"PRIu64"/%"PRIu64"\n",
+	gfs2_printf("    num = %llu %llu\n",
 		    ip->i_num.no_formal_ino, ip->i_num.no_addr);
 	gfs2_printf("    type = %u\n", IF2DT(ip->i_di.di_mode));
 	gfs2_printf("    i_count = %d\n", atomic_read(&ip->i_count));
@@ -2406,7 +2406,7 @@ static int dump_glock(struct gfs2_glock 
 
 	spin_lock(&gl->gl_spin);
 
-	gfs2_printf("Glock (%u, %"PRIu64")\n",
+	gfs2_printf("Glock (%u, %llu)\n",
 		    gl->gl_name.ln_type,
 		    gl->gl_name.ln_number);
 	gfs2_printf("  gl_flags =");
Index: gfs2-kernel/src/gfs2/ioctl.c
===================================================================
RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/ioctl.c,v
retrieving revision 1.18
diff -a -u -p -r1.18 ioctl.c
--- gfs2-kernel/src/gfs2/ioctl.c	11 Aug 2005 07:23:43 -0000	1.18
+++ gfs2-kernel/src/gfs2/ioctl.c	3 Sep 2005 05:07:31 -0000
@@ -275,9 +275,9 @@ static int gi_get_statfs(struct gfs2_ino
 
 	gfs2_printf("version 0\n");
 	gfs2_printf("bsize %u\n", sdp->sd_sb.sb_bsize);
-	gfs2_printf("total %"PRIu64"\n", sc.sc_total);
-	gfs2_printf("free %"PRIu64"\n", sc.sc_free);
-	gfs2_printf("dinodes %"PRIu64"\n", sc.sc_dinodes);
+	gfs2_printf("total %lld\n", sc.sc_total);
+	gfs2_printf("free %lld\n", sc.sc_free);
+	gfs2_printf("dinodes %lld\n", sc.sc_dinodes);
 
 	error = 0;
 
@@ -353,7 +353,7 @@ static int gi_get_counters(struct gfs2_i
 		    sdp->sd_jdesc->jd_blocks);
 	gfs2_printf("sd_reclaim_count:glocks on reclaim list::%d\n",
 		    atomic_read(&sdp->sd_reclaim_count));
-	gfs2_printf("sd_log_wraps:log wraps::%"PRIu64"\n",
+	gfs2_printf("sd_log_wraps:log wraps::%llu\n",
 		    sdp->sd_log_wraps);
 	gfs2_printf("sd_bio_outstanding:outstanding BIO calls::%u\n",
 		    atomic_read(&sdp->sd_bio_outstanding));
Index: gfs2-kernel/src/gfs2/lvb.c
===================================================================
RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/lvb.c,v
retrieving revision 1.9
diff -a -u -p -r1.9 lvb.c
--- gfs2-kernel/src/gfs2/lvb.c	2 Sep 2005 09:06:54 -0000	1.9
+++ gfs2-kernel/src/gfs2/lvb.c	3 Sep 2005 05:07:31 -0000
@@ -54,8 +54,8 @@ void gfs2_quota_lvb_print(struct gfs2_qu
 {
 	pv(qb, qb_magic, "%u");
 	pv(qb, qb_pad, "%u");
-	pv(qb, qb_limit, "%"PRIu64);
-	pv(qb, qb_warn, "%"PRIu64);
-	pv(qb, qb_value, "%"PRId64);
+	pv(qb, qb_limit, "%llu");
+	pv(qb, qb_warn, "%llu");
+	pv(qb, qb_value, "%lld");
 }
 
Index: gfs2-kernel/src/gfs2/meta_io.c
===================================================================
RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/meta_io.c,v
retrieving revision 1.25
diff -a -u -p -r1.25 meta_io.c
--- gfs2-kernel/src/gfs2/meta_io.c	2 Sep 2005 09:06:54 -0000	1.25
+++ gfs2-kernel/src/gfs2/meta_io.c	3 Sep 2005 05:07:31 -0000
@@ -61,7 +61,7 @@ static void stuck_releasepage(struct buf
 	struct gfs2_glock *gl;
 
 	fs_warn(sdp, "stuck in gfs2_releasepage()\n");
-	fs_warn(sdp, "blkno = %"PRIu64", bh->b_count = %d\n",
+	fs_warn(sdp, "blkno = %llu, bh->b_count = %d\n",
 		(uint64_t)bh->b_blocknr, atomic_read(&bh->b_count));
 	fs_warn(sdp, "pinned = %u\n", buffer_pinned(bh));
 	fs_warn(sdp, "get_v2bd(bh) = %s\n", (bd) ? "!NULL" : "NULL");
@@ -71,7 +71,7 @@ static void stuck_releasepage(struct buf
 
 	gl = bd->bd_gl;
 
-	fs_warn(sdp, "gl = (%u, %"PRIu64")\n", 
+	fs_warn(sdp, "gl = (%u, %llu)\n", 
 		gl->gl_name.ln_type, gl->gl_name.ln_number);
 
 	fs_warn(sdp, "bd_list_tr = %s, bd_le.le_list = %s\n",
@@ -85,7 +85,7 @@ static void stuck_releasepage(struct buf
 		if (!ip)
 			return;
 
-		fs_warn(sdp, "ip = %"PRIu64"/%"PRIu64"\n",
+		fs_warn(sdp, "ip = %llu %llu\n",
 			ip->i_num.no_formal_ino, ip->i_num.no_addr);
 		fs_warn(sdp, "ip->i_count = %d, ip->i_vnode = %s\n",
 			atomic_read(&ip->i_count),
Index: gfs2-kernel/src/gfs2/rgrp.c
===================================================================
RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/rgrp.c,v
retrieving revision 1.25
diff -a -u -p -r1.25 rgrp.c
--- gfs2-kernel/src/gfs2/rgrp.c	19 Aug 2005 07:52:15 -0000	1.25
+++ gfs2-kernel/src/gfs2/rgrp.c	3 Sep 2005 05:07:32 -0000
@@ -1013,7 +1013,7 @@ static struct gfs2_rgrpd *rgblk_free(str
 	rgd = gfs2_blk2rgrpd(sdp, bstart);
 	if (!rgd) {
 		if (gfs2_consist(sdp))
-			fs_err(sdp, "block = %"PRIu64"\n", bstart);
+			fs_err(sdp, "block = %llu\n", bstart);
 		return NULL;
 	}
 
@@ -1302,7 +1302,7 @@ void gfs2_rlist_add(struct gfs2_sbd *sdp
 	rgd = gfs2_blk2rgrpd(sdp, block);
 	if (!rgd) {
 		if (gfs2_consist(sdp))
-			fs_err(sdp, "block = %"PRIu64"\n", block);
+			fs_err(sdp, "block = %llu\n", block);
 		return;
 	}
 
Index: gfs2-kernel/src/gfs2/util.c
===================================================================
RCS file: /cvs/cluster/cluster/gfs2-kernel/src/gfs2/util.c,v
retrieving revision 1.17
diff -a -u -p -r1.17 util.c
--- gfs2-kernel/src/gfs2/util.c	19 Aug 2005 07:52:15 -0000	1.17
+++ gfs2-kernel/src/gfs2/util.c	3 Sep 2005 05:07:32 -0000
@@ -147,7 +147,7 @@ int gfs2_consist_inode_i(struct gfs2_ino
 	struct gfs2_sbd *sdp = ip->i_sbd;
 	return gfs2_lm_withdraw(sdp,
 			"GFS2: fsid=%s: fatal: filesystem consistency error\n"
-			"GFS2: fsid=%s:   inode = %"PRIu64"/%"PRIu64"\n"
+			"GFS2: fsid=%s:   inode = %llu %llu\n"
 			"GFS2: fsid=%s:   function = %s\n"
 			"GFS2: fsid=%s:   file = %s, line = %u\n"
 			"GFS2: fsid=%s:   time = %lu\n",
@@ -171,7 +171,7 @@ int gfs2_consist_rgrpd_i(struct gfs2_rgr
 	struct gfs2_sbd *sdp = rgd->rd_sbd;
 	return gfs2_lm_withdraw(sdp,
 			"GFS2: fsid=%s: fatal: filesystem consistency error\n"
-			"GFS2: fsid=%s:   RG = %"PRIu64"\n"
+			"GFS2: fsid=%s:   RG = %llu\n"
 			"GFS2: fsid=%s:   function = %s\n"
 			"GFS2: fsid=%s:   file = %s, line = %u\n"
 			"GFS2: fsid=%s:   time = %lu\n",
@@ -195,7 +195,7 @@ int gfs2_meta_check_ii(struct gfs2_sbd *
 	int me;
 	me = gfs2_lm_withdraw(sdp,
 			     "GFS2: fsid=%s: fatal: invalid metadata block\n"
-			     "GFS2: fsid=%s:   bh = %"PRIu64" (%s)\n"
+			     "GFS2: fsid=%s:   bh = %llu (%s)\n"
 			     "GFS2: fsid=%s:   function = %s\n"
 			     "GFS2: fsid=%s:   file = %s, line = %u\n"
 			     "GFS2: fsid=%s:   time = %lu\n",
@@ -220,7 +220,7 @@ int gfs2_metatype_check_ii(struct gfs2_s
 	int me;
 	me = gfs2_lm_withdraw(sdp,
 		"GFS2: fsid=%s: fatal: invalid metadata block\n"
-		"GFS2: fsid=%s:   bh = %"PRIu64" (type: exp=%u, found=%u)\n"
+		"GFS2: fsid=%s:   bh = %llu (type: exp=%u, found=%u)\n"
 		"GFS2: fsid=%s:   function = %s\n"
 		"GFS2: fsid=%s:   file = %s, line = %u\n"
 		"GFS2: fsid=%s:   time = %lu\n",
@@ -263,7 +263,7 @@ int gfs2_io_error_bh_i(struct gfs2_sbd *
 {
 	return gfs2_lm_withdraw(sdp,
 			       "GFS2: fsid=%s: fatal: I/O error\n"
-			       "GFS2: fsid=%s:   block = %"PRIu64"\n"
+			       "GFS2: fsid=%s:   block = %llu\n"
 			       "GFS2: fsid=%s:   function = %s\n"
 			       "GFS2: fsid=%s:   file = %s, line = %u\n"
 			       "GFS2: fsid=%s:   time = %lu\n",




From greg at kroah.com  Sat Sep  3 05:28:21 2005
From: greg at kroah.com (Greg KH)
Date: Fri, 2 Sep 2005 22:28:21 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050902094403.GD16595@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
	<20050902094403.GD16595@redhat.com>
Message-ID: <20050903052821.GA23711@kroah.com>

On Fri, Sep 02, 2005 at 05:44:03PM +0800, David Teigland wrote:
> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> 
> > +	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> 
> > what is gfs2_assert() about anyway? please just use BUG_ON directly
> > everywhere
> 
> When a machine has many gfs file systems mounted at once it can be useful
> to know which one failed.  Does the following look ok?
> 
> #define gfs2_assert(sdp, assertion)                                       \
> do {                                                                      \
>         if (unlikely(!(assertion))) {                                     \
>                 printk(KERN_ERR                                           \
>                         "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
>                         "GFS2: fsid=%s:   function = %s\n"                \
>                         "GFS2: fsid=%s:   file = %s, line = %u\n"         \
>                         "GFS2: fsid=%s:   time = %lu\n",                  \
>                         sdp->sd_fsname, # assertion,                      \
>                         sdp->sd_fsname,  __FUNCTION__,                    \
>                         sdp->sd_fsname, __FILE__, __LINE__,               \
>                         sdp->sd_fsname, get_seconds());                   \
>                 BUG();                                                    \

You will already get the __FUNCTION__ (and hence the __FILE__ info)
directly from the BUG() dump, as well as the time from the syslog
message (turn on the printk timestamps if you want a more fine grain
timestamp), so the majority of this macro is redundant with the BUG()
macro...

thanks,

greg k-h



From dhazelton at enter.net  Sat Sep  3 06:42:31 2005
From: dhazelton at enter.net (D. Hazelton)
Date: Sat, 3 Sep 2005 02:42:31 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125728040.3223.2.camel@laptopd505.fenrus.org>
References: <20050901104620.GA22482@redhat.com>
	<20050903051841.GA13211@redhat.com>
	<1125728040.3223.2.camel@laptopd505.fenrus.org>
Message-ID: <200509030242.37536.dhazelton@enter.net>

On Saturday 03 September 2005 02:14, Arjan van de Ven wrote:
> On Sat, 2005-09-03 at 13:18 +0800, David Teigland wrote:
> > On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote:
> > > Alan Cox <alan at lxorguk.ukuu.org.uk> wrote:
> > > > > - Why GFS is better than OCFS2, or has functionality which
> > > > > OCFS2 cannot possibly gain (or vice versa)
> > > > >
> > > > > - Relative merits of the two offerings
> > > >
> > > > You missed the important one - people actively use it and
> > > > have been for some years. Same reason with have NTFS, HPFS,
> > > > and all the others. On that alone it makes sense to include.
> > >
> > > Again, that's not a technical reason.  It's _a_ reason, sure. 
> > > But what are the technical reasons for merging gfs[2], ocfs2,
> > > both or neither?
> > >
> > > If one can be grown to encompass the capabilities of the other
> > > then we're left with a bunch of legacy code and wasted effort.
> >
> > GFS is an established fs, it's not going away, you'd be hard
> > pressed to find a more widely used cluster fs on Linux.  GFS is
> > about 10 years old and has been in use by customers in production
> > environments for about 5 years.
>
> but you submitted GFS2 not GFS.

I'd rather not step into the middle of this mess, but you clipped out 
a good portion that explains why he talks about GFS when he submitted 
GFS2.  Let me quote the post you've pulled that partial paragraph 
from: "The latest development cycle (GFS2) has focused on improving
performance, it's not a new file system -- the "2" indicates that it's 
not ondisk compatible with earlier versions."

In other words he didn't submit the original, but the new version of 
it that is not compatable with the original GFS on disk format.  
While it is clear that GFS2 cannot claim the large installed user 
base or the proven capacity of the original (it is, after all, a new 
version that has incompatabilities) it can claim that as it's 
heritage and what it's aiming towards, the same as ext3 can (and 
does) claim the power and reliability of ext2.

In this case I've been following this thread just for the hell of it 
and I've noticed that there are some people who seem to not want to 
even think of having GFS2 included in a mainline kernel for personal 
and not technical reasons. That does not describe most of the people 
on this list, many of whom have helped debug the code (among other 
things), but it does describe a few.

I'll go back to being quiet now... 

DRH
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0xA6992F96300F159086FF28208F8280BB8B00C32A.asc
Type: application/pgp-keys
Size: 1365 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050903/aacf69e6/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050903/aacf69e6/attachment.sig>

From tytso at mit.edu  Mon Sep  5 05:54:28 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Mon, 5 Sep 2005 01:54:28 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050904203344.GA1987@elf.ucw.cz>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
	<20050904203344.GA1987@elf.ucw.cz>
Message-ID: <20050905055428.GA29158@thunk.org>

On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
> Hi!
> 
> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
> 
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?

This is a bit of a degression, but it's quite a bit different from
what ocfs2 is doing, where it is not necessary to replay the journal
in order to assure filesystem consistency.  

In the ext3 case, the only time when read-only isn't quite read-only
is when the filesystem was unmounted uncleanly and the journal needs
to be replayed in order for the filesystem to be consistent.  Mounting
the filesystem read-only without replaying the journal could and very
likely would result in the filesystem reporting filesystem consistency
problems, and if the filesystem is mounted with the reboot-on-errors
option, well....

							- Ted



From tytso at mit.edu  Mon Sep  5 14:03:19 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Mon, 5 Sep 2005 10:03:19 -0400
Subject: [Linux-cluster] Re: real read-only [was Re: GFS, what's remaining]
In-Reply-To: <20050905082735.GA2662@elf.ucw.cz>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
	<20050904203344.GA1987@elf.ucw.cz>
	<20050905055428.GA29158@thunk.org>
	<20050905082735.GA2662@elf.ucw.cz>
Message-ID: <20050905140318.GA10751@thunk.org>

On Mon, Sep 05, 2005 at 10:27:35AM +0200, Pavel Machek wrote:
> 
> There's a better reason, too. I do swsusp. Then I'd like to boot with
> / mounted read-only (so that I can read my config files, some
> binaries, and maybe suspended image), but I absolutely may not write
> to disk at this point, because I still want to resume.
> 

You could _hope_ that the filesystem is consistent enough that it is
safe to try to read config files, binaries, etc. without running the
journal, but there is absolutely no guarantee that this is the case.
I'm not sure you want to depend on that for swsusp.

One potential solution that would probably meet your needs is a dm
hack which reads in the blocks in the journal, and then uses the most
recent block in the journal in preference to the version on disk.

						- Ted



From tytso at mit.edu  Mon Sep  5 14:07:47 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Mon, 5 Sep 2005 10:07:47 -0400
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905070922.GK21228@ca-server1.us.oracle.com>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<20050903051841.GA13211@redhat.com>
	<20050904203344.GA1987@elf.ucw.cz>
	<20050905055428.GA29158@thunk.org>
	<20050905070922.GK21228@ca-server1.us.oracle.com>
Message-ID: <20050905140747.GB10751@thunk.org>

On Mon, Sep 05, 2005 at 12:09:23AM -0700, Mark Fasheh wrote:
> Btw, I'm curious to know how useful folks find the ext3 mount options
> errors=continue and errors=panic. I'm extremely likely to implement the
> errors=read-only behavior as default in OCFS2 and I'm wondering whether the
> other two are worth looking into.

For a single-user system errors=panic is definitely very useful on the
system disk, since that's the only way that we can force an fsck, and
also abort a server that might be failing and returning erroneous
information to its clients.  Think of it is as i/o fencing when you're
not sure that the system is going to be performing correctly.

Whether or not this is useful for ocfs2 is a different matter.  If
it's only for data volumes, and if the only way to fix filesystem
inconsistencies on a cluster filesystem is to request all nodes in the
cluster to unmount the filesystem and then arrange to run ocfs2's fsck
on the filesystem, then forcing every single cluster in the node to
panic is probably counterproductive.  :-)

						- Ted



From dtor_core at ameritech.net  Mon Sep  5 16:18:45 2005
From: dtor_core at ameritech.net (Dmitry Torokhov)
Date: Mon, 5 Sep 2005 11:18:45 -0500
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509051149.49929.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<20050905141432.GF5498@marowsky-bree.de>
	<200509051149.49929.phillips@istop.com>
Message-ID: <200509051118.45792.dtor_core@ameritech.net>

On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > On 2005-09-03T01:57:31, Daniel Phillips <phillips at istop.com> wrote:
> > > The only current users of dlms are cluster filesystems.  There are zero
> > > users of the userspace dlm api.
> >
> > That is incorrect...
> 
> Application users Lars, sorry if I did not make that clear.  The issue is 
> whether we need to export an all-singing-all-dancing dlm api from kernel to 
> userspace today, or whether we can afford to take the necessary time to get 
> it right while application writers take their time to have a good think about 
> whether they even need it.
>

If Linux fully supported OpenVMS DLM semantics we could start thinking asbout
moving our application onto a Linux box because our alpha server is aging.

That's just my user application writer $0.02.

-- 
Dmitry



From kurt.hackel at oracle.com  Mon Sep  5 19:11:59 2005
From: kurt.hackel at oracle.com (kurt.hackel at oracle.com)
Date: Mon, 5 Sep 2005 12:11:59 -0700
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905092433.GE17607@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<20050903183241.1acca6c9.akpm@osdl.org>
	<20050904030640.GL8684@ca-server1.us.oracle.com>
	<200509040022.37102.phillips@istop.com>
	<20050903214653.1b8a8cb7.akpm@osdl.org>
	<20050904045821.GT8684@ca-server1.us.oracle.com>
	<20050903224140.0442fac4.akpm@osdl.org>
	<20050905043033.GB11337@redhat.com>
	<20050905015408.21455e56.akpm@osdl.org>
	<20050905092433.GE17607@redhat.com>
Message-ID: <20050905191159.GA21169@gimp.org>

On Mon, Sep 05, 2005 at 05:24:33PM +0800, David Teigland wrote:
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <teigland at redhat.com> wrote:
> > >
> > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > 
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> > 
> > How fat is the dlm interface?   ie: how many syscalls would it take?
> 
> Four functions:
>   create_lockspace()
>   release_lockspace()
>   lock()
>   unlock()

FWIW, it looks like we can agree on the core interface.  ocfs2_dlm
exports essentially the same functions:
    dlm_register_domain()
    dlm_unregister_domain()
    dlmlock()
    dlmunlock()

I also implemented dlm_migrate_lockres() to explicitly remaster a lock
on another node, but this isn't used by any callers today (except for
debugging purposes).  There is also some wiring between the fs and the
dlm (eviction callbacks) to deal with some ordering issues between the
two layers, but these could go if we get stronger membership.

There are quite a few other functions in the "full" spec(1) that we
didn't even attempt, either because we didn't require direct 
user<->kernel access or we just didn't need the function.  As for the
rather thick set of parameters expected in dlm calls, we managed to get
dlmlock down to *ahem* eight, and the rest are fairly slim.

Looking at the misc device that gfs uses, it seems like there is pretty
much complete interface to the same calls you have in kernel, validated
on the write() calls to the misc device.  With dlmfs, we were seeking to
lock down and simplify user access by using standard ast/bast/unlockast
calls, using a file descriptor as an opaque token for a single lock,
letting the vfs lifetime on this fd help with abnormal termination, etc.
I think both the misc device and dlmfs are helpful and not necessarily
mutually exclusive, and probably both are better approaches than
exporting everything via loads of syscalls (which seems to be the 
VMS/opendlm model).

-kurt

1. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf


Kurt C. Hackel
Oracle
kurt.hackel at oracle.com



From dtor_core at ameritech.net  Tue Sep  6 02:03:19 2005
From: dtor_core at ameritech.net (Dmitry Torokhov)
Date: Mon, 5 Sep 2005 21:03:19 -0500
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <200509052057.23807.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509051118.45792.dtor_core@ameritech.net>
	<200509052057.23807.phillips@istop.com>
Message-ID: <200509052103.20519.dtor_core@ameritech.net>

On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > On 2005-09-03T01:57:31, Daniel Phillips <phillips at istop.com> wrote:
> > > > > The only current users of dlms are cluster filesystems.  There are
> > > > > zero users of the userspace dlm api.
> > > >
> > > > That is incorrect...
> > >
> > > Application users Lars, sorry if I did not make that clear.  The issue is
> > > whether we need to export an all-singing-all-dancing dlm api from kernel
> > > to userspace today, or whether we can afford to take the necessary time
> > > to get it right while application writers take their time to have a good
> > > think about whether they even need it.
> >
> > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > asbout moving our application onto a Linux box because our alpha server is
> > aging.
> >
> > That's just my user application writer $0.02.
> 
> What stops you from trying it with the patch?  That kind of feedback would be 
> worth way more than $0.02.
>

We do not have such plans at the moment and I prefer spending my free
time on tinkering with kernel, not rewriting some in-house application.
Besides, DLM is not the only thing that does not have a drop-in
replacement in Linux.

You just said you did not know if there are any potential users for the
full DLM and I said there are some.

-- 
Dmitry



From dtor_core at ameritech.net  Tue Sep  6 04:07:26 2005
From: dtor_core at ameritech.net (Dmitry Torokhov)
Date: Mon, 5 Sep 2005 23:07:26 -0500
Subject: [Linux-cluster] Re: GFS, what's remainingh
In-Reply-To: <200509060002.40823.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509052103.20519.dtor_core@ameritech.net>
	<200509060002.40823.phillips@istop.com>
Message-ID: <200509052307.27417.dtor_core@ameritech.net>

On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> 
> By the way, you said "alpha server" not "alpha servers", was that just a slip?  
> Because if you don't have a cluster then why are you using a dlm?
>

No, it is not a slip. The application is running on just one node, so we
do not really use "distributed" part. However we make heavy use of the
rest of lock manager features, especially lock value blocks.

-- 
Dmitry



From dtor_core at ameritech.net  Tue Sep  6 05:05:58 2005
From: dtor_core at ameritech.net (Dmitry Torokhov)
Date: Tue, 6 Sep 2005 00:05:58 -0500
Subject: [Linux-cluster] Re: GFS, what's remainingh
In-Reply-To: <200509060058.44934.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509052307.27417.dtor_core@ameritech.net>
	<200509060058.44934.phillips@istop.com>
Message-ID: <200509060005.59578.dtor_core@ameritech.net>

On Monday 05 September 2005 23:58, Daniel Phillips wrote:
> On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
> > On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> > > By the way, you said "alpha server" not "alpha servers", was that just a
> > > slip? Because if you don't have a cluster then why are you using a dlm?
> >
> > No, it is not a slip. The application is running on just one node, so we
> > do not really use "distributed" part. However we make heavy use of the
> > rest of lock manager features, especially lock value blocks.
> 
> Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature 
> without even having the excuse you were forced to use it.  Why don't you just 
> have a daemon that sends your values over a socket?  That should be all of a 
> day's coding.
>

Umm, because when most of the code was written TCP and the rest was the
clunkiest code out there? Plus, having a daemon introduces problems with
cleanup (say process dies for one reason or another) whereas having it in
OS takes care of that.
 
> Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. 
> But you nicely supported my claim that most who think they should be using a 
> dlm, really shouldn't.

Heh, do you think it is a bit premature to dismiss something even without
ever seeing the code?

-- 
Dmitry



From dtor_core at ameritech.net  Tue Sep  6 06:55:03 2005
From: dtor_core at ameritech.net (Dmitry Torokhov)
Date: Tue, 6 Sep 2005 01:55:03 -0500
Subject: [Linux-cluster] Re: GFS, what's remainingh
In-Reply-To: <200509060248.47433.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509060005.59578.dtor_core@ameritech.net>
	<200509060248.47433.phillips@istop.com>
Message-ID: <200509060155.04685.dtor_core@ameritech.net>

On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > do you think it is a bit premature to dismiss something even without
> > ever seeing the code?
> 
> You told me you are using a dlm for a single-node application, is there 
> anything more I need to know?
>

I would still like to know why you consider it a "sin". On OpenVMS it is
fast, provides a way of cleaning up and does not introduce single point
of failure as it is the case with a daemon. And if we ever want to spread
the load between 2 boxes we easily can do it. Why would I not want to use
it?

-- 
Dmitry



From suparna at in.ibm.com  Tue Sep  6 12:55:18 2005
From: suparna at in.ibm.com (Suparna Bhattacharya)
Date: Tue, 6 Sep 2005 18:25:18 +0530
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <p73fysnqiej.fsf@verdi.suse.de>
References: <20050901104620.GA22482@redhat.com>
	<20050901035939.435768f3.akpm@osdl.org>
	<1125586158.15768.42.camel@localhost.localdomain>
	<20050901132104.2d643ccd.akpm@osdl.org>
	<p73fysnqiej.fsf@verdi.suse.de>
Message-ID: <20050906125517.GA7531@in.ibm.com>

On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> Andrew Morton <akpm at osdl.org> writes:
> 
> > 
> > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > >   possibly gain (or vice versa)
> > > > 
> > > > - Relative merits of the two offerings
> > > 
> > > You missed the important one - people actively use it and have been for
> > > some years. Same reason with have NTFS, HPFS, and all the others. On
> > > that alone it makes sense to include.
> >  
> > Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
> 
> There seems to be clearly a need for a shared-storage fs of some sort
> for HA clusters and virtualized usage (multiple guests sharing a
> partition).  Shared storage can be more efficient than network file
> systems like NFS because the storage access is often more efficient
> than network access  and it is more reliable because it doesn't have a
> single point of failure in form of the NFS server.
> 
> It's also a logical extension of the "failover on failure" clusters
> many people run now - instead of only failing over the shared fs at
> failure and keeping one machine idle the load can be balanced between
> multiple machines at any time.
> 
> One argument to merge both might be that nobody really knows yet which
> shared-storage file system (GFS or OCFS2) is better. The only way to
> find out would be to let the user base try out both, and that's most
> practical when they're merged.
> 
> Personally I think ocfs2 has nicer&cleaner code than GFS.
> It seems to be more or less a 64bit ext3 with cluster support, while

The "more or less" is what bothers me here - the first time I heard this,
it sounded a little misleading, as I expected to find some kind of a
patch to ext3 to make it 64 bit with extents and cluster support.
Now I understand it a little better (thanks to Joel and Mark)

And herein lies the issue where I tend to agree with Andrew on
-- its really nice to have multiple filesystems innovating freely in
their niches and eventually proving themselves in practice, without
being bogged down by legacy etc. But at the same time, is there enough
thought and discussion about where the fragmentation/diversification is really
warranted, vs improving what is already there, or say incorporating
the best of one into another, maybe over a period of time ?

The number of filesystems seems to just keep growing, and supporting
all of them isn't easy -- for users it isn't really easy to switch from
one to another, and the justifications for choosing between them is
sometimes confusing and burdensome from an administrator standpoint
- one filesystem is good in certain conditions, another in others,
stability levels may vary etc, and its not always possible to predict
which aspect to prioritize.

Now, with filesystems that have been around in production for a long
time, the on-disk format becomes a major constraining factor, and the
reason for having various legacy support around. Likewise, for some
special purpose filesystems there really is a niche usage. But for new
and sufficiently general purpose filesystems, with new on-disk structure,
isn't it worth thinking this through and trying to get it right ? 

Yeah, it is a lot of work upfront ... but with double the people working
on something, it just might get much better than what they individually
can. Sometimes.

BTW, I don't know if it is worth it in this particular case, but just
something that worries me in general.

> GFS seems to reinvent a lot more things and has somewhat uglier code.
> On the other hand GFS' cluster support seems to be more aimed
> at being a universal cluster service open for other usages too,
> which might be a good thing. OCFS2s cluster seems to be more 
> aimed at only serving the file system.
> 
> But which one works better in practice is really an open question.

True, but what usually ends up happening is that this question can
never quite be answered in black and white. So both just continue
to exist and apps need to support both ... convergence becomes impossible
and long term duplication inevitable.

So at least having a clear demarcation/guideline of what situations
each is suitable for upfront would be a good thing. That might also
get some cross ocfs-gfs and ocfs-ext3 reviews in the process :)

Regards
Suparna

-- 
Suparna Bhattacharya (suparna at in.ibm.com)
Linux Technology Center
IBM Software Lab, India



From dmitry.torokhov at gmail.com  Tue Sep  6 14:31:34 2005
From: dmitry.torokhov at gmail.com (Dmitry Torokhov)
Date: Tue, 6 Sep 2005 09:31:34 -0500
Subject: [Linux-cluster] Re: GFS, what's remainingh
In-Reply-To: <200509060318.25260.phillips@istop.com>
References: <20050901104620.GA22482@redhat.com>
	<200509060248.47433.phillips@istop.com>
	<200509060155.04685.dtor_core@ameritech.net>
	<200509060318.25260.phillips@istop.com>
Message-ID: <d120d50005090607315168e479@mail.gmail.com>

On 9/6/05, Daniel Phillips <phillips at istop.com> wrote:
> On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote:
> > On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> > > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > > > do you think it is a bit premature to dismiss something even without
> > > > ever seeing the code?
> > >
> > > You told me you are using a dlm for a single-node application, is there
> > > anything more I need to know?
> >
> > I would still like to know why you consider it a "sin". On OpenVMS it is
> > fast, provides a way of cleaning up...
> 
> There is something hard about handling EPIPE?
> 

Just the fact that you want me to handle it ;)

> > and does not introduce single point
> > of failure as it is the case with a daemon. And if we ever want to spread
> > the load between 2 boxes we easily can do it.
> 
> But you said it runs on an aging Alpha, surely you do not intend to expand it
> to two aging Alphas?

You would be right if I was designing this right now. Now roll 10 - 12
years back and now I have a shiny new alpha. Would you criticize me
then for using a mechanism that allowed easily spread application
across several nodes with minimal changes if needed?

What you fail to realize that there applications that run and will
continue to run for a long time.

>  And what makes you think that socket-based
> synchronization keeps you from spreading out the load over multiple boxes?
> 
> > Why would I not want to use it?
> 
> It is not the right tool for the job from what you have told me.  You want to
> get a few bytes of information from one task to another?  Use a socket, as
> God intended.
>

Again, when TCPIP is not a native network stack, when libc socket
routines are not readily available - DLM starts looking much more
viable.

-- 
Dmitry



From lhh at redhat.com  Tue Sep  6 15:25:11 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 06 Sep 2005 11:25:11 -0400
Subject: [Linux-cluster] NFS relocate: old TCP/IP connection resulting
	in DUP/ACK storms and largish timeouts (was: iptables protection
	wrapper; nfsexport.sh vs ip.sh racing)
In-Reply-To: <20050905153649.GE17096@neu.nirvana>
References: <20050822225227.GJ24127@neu.nirvana>
	<1125340879.24205.30.camel@ayanami.boston.redhat.com>
	<20050829233523.GD5908@neu.nirvana>
	<1125425012.21943.1.camel@ayanami.boston.redhat.com>
	<20050905153649.GE17096@neu.nirvana>
Message-ID: <1126020311.3344.15.camel@ayanami.boston.redhat.com>

On Mon, 2005-09-05 at 17:36 +0200, Axel Thimm wrote:

> Is there a way to have ip.sh fry all open TCP/IP connections to a
> service IP that is to be abandoned? I guess that would be the better
> solution (that would also apply to non-NFS services).

Not aware of one -- I will look into it, though!

-- Lon



From lhh at redhat.com  Tue Sep  6 15:26:48 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 06 Sep 2005 11:26:48 -0400
Subject: [Linux-cluster] Re: NFS relocate: old TCP/IP connection
	resulting in DUP/ACK storms and largish timeouts
In-Reply-To: <20050905182143.GA2099@neu.nirvana>
References: <20050822225227.GJ24127@neu.nirvana>
	<1125340879.24205.30.camel@ayanami.boston.redhat.com>
	<20050829233523.GD5908@neu.nirvana>
	<1125425012.21943.1.camel@ayanami.boston.redhat.com>
	<20050905153649.GE17096@neu.nirvana>
	<20050905182143.GA2099@neu.nirvana>
Message-ID: <1126020408.3344.18.camel@ayanami.boston.redhat.com>

On Mon, 2005-09-05 at 20:21 +0200, Axel Thimm wrote:

> Both bugs have been filed in bugzilla:
> 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167571
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167572
> 
> I guess the latter will move to another component like "kernel", if it
> really turns out to be neither cluster nor even nfs specific.

It should be pretty easy to make that determination, thanks for the
report!

-- Lon



From lhh at redhat.com  Tue Sep  6 16:02:47 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 06 Sep 2005 12:02:47 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <16102.1125953577@www46.gmx.net>
References: <16102.1125953577@www46.gmx.net>
Message-ID: <1126022567.3344.53.camel@ayanami.boston.redhat.com>

On Mon, 2005-09-05 at 22:52 +0200, Andreas Brosche wrote:

> Long story cut short, we want
> - GFS on a shared SCSI disk (Performance is not important)

Using GFS on shared SCSI will work in *some* cases:

+ Shared SCSI RAID arrays with multiple buses work well.  Mid to high
end here, not JBODs with host-RAID controllers.  The biggest discernable
difference between one of these and a FC SAN array is the fact that it
has SCSI ports instead of fiber-channel ports.


? Host-RAID *might* work, but only if the JBODs behind it has multiple
buses, and the host controllers are all in "clustered" and/or "cache
disabled" mode.


- Multi-initator SCSI buses do not work with GFS in any meaningful way,
regardless of what the host controller is.

Ex: Two machines with different SCSI IDs on their initiator connected to
the same physical SCSI bus.


> - dlm without network access (theoretically possible...
>   but how dependant is GFS on the cluster services?)

The DLM runs over IP, as does the cluster manager.  Additionally, please
remember that GFS requires fencing, and that most fence-devices are
IP-enabled.

It may be possible to work around the need for actual ethernet by using
something like PPP over high speed serial, but I don't see how that's
better than a crossover ethernet cable.  Also, I don't know if it will
work ;)

Many users choose to separate cluster communication from other forms by
using a fully self-contained private network.

There is currently no way for GFS to use only a quorum disk for all the
lock information, and even if it could, performance would be abysmal.

-- Lon



From moya at infomed.sld.cu  Tue Sep  6 17:15:26 2005
From: moya at infomed.sld.cu (Maykel Moya)
Date: Tue, 06 Sep 2005 13:15:26 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <16102.1125953577@www46.gmx.net>
References: <16102.1125953577@www46.gmx.net>
Message-ID: <1126026926.11885.7.camel@julia.sld.cu>

El lun, 05-09-2005 a las 22:52 +0200, Andreas Brosche escribi?:
> - GFS on a shared SCSI disk (Performance is not important)

I recently set up something like that. We use a external HP Smart Array
Cluster Storage. It has a separate connection (SCSI cable) to both
hosts.

The servers communicates over a crossover cable. I'm using DLM.

Regards,
maykel





From tabmowzo at us.ibm.com  Tue Sep  6 17:35:42 2005
From: tabmowzo at us.ibm.com (Peter R. Badovinatz)
Date: Tue, 06 Sep 2005 10:35:42 -0700
Subject: [Linux-cluster] partly OT: failover <500ms
In-Reply-To: <20050901215836.634334a1.pegasus@nerv.eu.org>
References: <20050901215836.634334a1.pegasus@nerv.eu.org>
Message-ID: <431DD36E.9080607@us.ibm.com>

Jure Pe?ar wrote:
> Hi all,
> 
> Sorry if this is somewhat offtopic here ...
> 
> Our telco is looking into linux HA solutions for their VoIP needs. Their
> main requirement is that the failover happens in the order of a few 100ms. 
> 
> Can redhat cluster be tweaked to work reliably with such short time
> periods? This would mean heartbeat on the level of few ms and status probes
> on the level of 10ms. Is this even feasible?
> 
> Since VoIP is IP anyway, I'm looking into UCARP and stuff like that.
> Anything else I should check?
> 

Check out Linux-HA (http://linux-ha.org/). It's been used in sub-second 
environments and has some documentation on the fact sheets and other 
pages of the web site.  If you don't find enough to help you on the web 
site the mailing lists are quite active.

> 
> Thanks for answers,
> 

Peter
-- 
Peter R. Badovinatz aka 'Wombat' -- IBM Linux Technology Center
preferred: tabmowzo at us.ibm.com / alternate: wombat at us.ibm.com
These are my opinions and absolutely not official opinions of IBM, Corp.



From karon at gmx.net  Tue Sep  6 22:57:27 2005
From: karon at gmx.net (Andreas Brosche)
Date: Wed, 07 Sep 2005 00:57:27 +0200
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <1126022567.3344.53.camel@ayanami.boston.redhat.com>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
Message-ID: <431E1ED7.7010909@gmx.net>

Hi again,

thank you for your replies.

Lon Hohberger wrote:
> On Mon, 2005-09-05 at 22:52 +0200, Andreas Brosche wrote:
> 
>>Long story cut short, we want
>>- GFS on a shared SCSI disk (Performance is not important)
> 
> Using GFS on shared SCSI will work in *some* cases:
> 
> + Shared SCSI RAID arrays with multiple buses work well.  Mid to high
> end here, 

Mid to high end indeed, what we found was in the range of about $5000.

> - Multi-initator SCSI buses do not work with GFS in any meaningful way,
> regardless of what the host controller is.
> Ex: Two machines with different SCSI IDs on their initiator connected to
> the same physical SCSI bus.

Hmm... don't laugh at me, but in fact that's what we're about to set up.

I've read in Red Hat's docs that it is "not supported" because of 
performance issues. Multi-initiator buses should comply to SCSI 
standards, and any SCSI-compliant disk should be able to communicate 
with the correct controller, if I've interpreted the specs correctly. Of 
course, you get arbitrary results when using non-compliant hardware... 
What are other issues with multi-initiator buses, other than performance 
loss?

> The DLM runs over IP, as does the cluster manager.  Additionally, please
> remember that GFS requires fencing, and that most fence-devices are
> IP-enabled.

Hmm. The whole setup is supposed to physically divide two networks, and 
nevertheless provide some kind of shared storage for moving data from 
one network to another. Establishing an ethernet link between the two 
servers would sort of disrupt the whole concept, which is to prevent 
*any* network access from outside into the secure part of the network. 
This is the (strongly simplified) topology:

mid-secure network -- Server1 -- Storage -- Server2 -- secure Network

A potential attacker could use a possible security flaw in the dlm 
service (which is bound to the network interface) to gain access to the 
server on the "secure" side *instantly* when he was able to compromise 
the server on the mid-secure side (hey, it CAN happen). If any sort of 
shared storage can be installed *without* any ethernet link or - ideally 
- any sort of inter-server communication, there is a way to *prove* that 
an attacker cannot establish any kind of connection into the secure net 
(some risks remain, but they have nothing to do with the physical 
connection).

So far, I only see two ways: either sync the filesystems via ethernet 
(maybe via a firewall, which is pointless when the service has a 
security leak; it *is* technically possible to set up a tunnel that way) 
or some solution with administrator interaction (the administrator would 
have to manually "flip a switch" to remount a *local* flie system rw on 
one side, and rw on the other), which is impractical (manpower, 
availability...), but would do the job.

> There is currently no way for GFS to use only a quorum disk for all the
> lock information, and even if it could, performance would be abysmal.

Like I said... performance is not an issue. As an invariant, the 
filesystems could be mounted "cross over", ie. each server has a 
partition only it writes to, and the other only reads from that disk. 
This *can* be done with local filesystems; you *can* disable write 
caching. You cannot, however, disable *read* caching (which seems to be 
buried quite deeply into the kernel), which means you actually have to 
umount and then re-mount (ie, not "mount -o remount") the fs. This means 
that long transfers could block other users for a long time. And 
mounting and umounting the same fs over and over again doesn't exactly 
sound like a good idea... even if it's only mounted ro.

Maykel Moya wrote:
 > El lun, 05-09-2005 a las 22:52 +0200, Andreas Brosche escribi?:
 > I recently set up something like that. We use a external HP Smart
 > Array Cluster Storage. It has a separate connection (SCSI cable) to
 > both hosts.

So it is not really a shared bus, but a dual bus configuration.

 > Regards,
 > maykel

Regards, and thanks for the replies,

Andreas



From spwilcox at att.com  Wed Sep  7 00:06:58 2005
From: spwilcox at att.com (Steve Wilcox)
Date: Tue, 06 Sep 2005 20:06:58 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <431E1ED7.7010909@gmx.net>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>
Message-ID: <1126051624.3694.26.camel@aptis101.cqtel.com>

On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:

> > - Multi-initator SCSI buses do not work with GFS in any meaningful way,
> > regardless of what the host controller is.
> > Ex: Two machines with different SCSI IDs on their initiator connected to
> > the same physical SCSI bus.
> 
> Hmm... don't laugh at me, but in fact that's what we're about to set up.
> 
> I've read in Red Hat's docs that it is "not supported" because of 
> performance issues. Multi-initiator buses should comply to SCSI 
> standards, and any SCSI-compliant disk should be able to communicate 
> with the correct controller, if I've interpreted the specs correctly. Of 
> course, you get arbitrary results when using non-compliant hardware... 
> What are other issues with multi-initiator buses, other than performance 
> loss?

I set up a small 2 node cluster this way a while back, just as a testbed
for myself.  Much as I suspected, it was severely unstable because of
the storage configuration, even occasionally causing both nodes to crash
when one was rebooted due to SCSI bus resets.  I tore it down and
rebuilt it several times, configuring it as a simple failover cluster
with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an
openSSI cluster using Fedora3.  All tested configurations were equally
crash-happy due to the bus resets.  

My configuration consisted of a couple of old Compaq deskpro PC's, each
with a single ended Symbiosis card (set to different SCSI ID's
obviously) and an external DEC BA360 jbod shelf with 6 drives.  The bus
resets might be mitigated somewhat by using HVD SCSI and Y-cables with
external terminators, but from my previous experience with other
clusters that used this technique (DEC ASE and HP-ux service guard), bus
resets will always be a thorn in your side without a separate,
independent raid controller to act as a go-between.  Calling these
configurations simply "not supported" is an understatement - this type
of config is guaranteed trouble.  I'd never set up a cluster this way
unless I'm the only one using it, and only then if I don't care one
little bit about crashes and data corruption.  My two cents.

-steve



From spwilcox at att.com  Wed Sep  7 03:03:57 2005
From: spwilcox at att.com (Steve Wilcox)
Date: Tue, 06 Sep 2005 23:03:57 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <1126051624.3694.26.camel@aptis101.cqtel.com>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>
	<1126051624.3694.26.camel@aptis101.cqtel.com>
Message-ID: <1126062237.12381.4.camel@aptis101.cqtel.com>

On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote:
> On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:
> 
> > > - Multi-initator SCSI buses do not work with GFS in any meaningful way,
> > > regardless of what the host controller is.
> > > Ex: Two machines with different SCSI IDs on their initiator connected to
> > > the same physical SCSI bus.
> > 
> > Hmm... don't laugh at me, but in fact that's what we're about to set up.
> > 
> > I've read in Red Hat's docs that it is "not supported" because of 
> > performance issues. Multi-initiator buses should comply to SCSI 
> > standards, and any SCSI-compliant disk should be able to communicate 
> > with the correct controller, if I've interpreted the specs correctly. Of 
> > course, you get arbitrary results when using non-compliant hardware... 
> > What are other issues with multi-initiator buses, other than performance 
> > loss?
> 
> I set up a small 2 node cluster this way a while back, just as a testbed
> for myself.  Much as I suspected, it was severely unstable because of
> the storage configuration, even occasionally causing both nodes to crash
> when one was rebooted due to SCSI bus resets.  I tore it down and
> rebuilt it several times, configuring it as a simple failover cluster
> with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an
> openSSI cluster using Fedora3.  All tested configurations were equally
> crash-happy due to the bus resets.  
> 
> My configuration consisted of a couple of old Compaq deskpro PC's, each
> with a single ended Symbiosis card (set to different SCSI ID's
> obviously) and an external DEC BA360 jbod shelf with 6 drives.  The bus
> resets might be mitigated somewhat by using HVD SCSI and Y-cables with
> external terminators, but from my previous experience with other
> clusters that used this technique (DEC ASE and HP-ux service guard), bus
> resets will always be a thorn in your side without a separate,
> independent raid controller to act as a go-between.  Calling these
> configurations simply "not supported" is an understatement - this type
> of config is guaranteed trouble.  I'd never set up a cluster this way
> unless I'm the only one using it, and only then if I don't care one
> little bit about crashes and data corruption.  My two cents.
> 
> -steve


Small clarification - Although clusters from DEC, HP, and even
DigiComWho?Paq's TruCluster can be made to work (sort of) on multi-
initiator SCSI busses, IIRC it was never a supported option for any of
them (much like RedHat's offering).  I doubt any sane company would ever
support that type of config.

-steve   



From vlad at nkmz.donetsk.ua  Wed Sep  7 08:51:36 2005
From: vlad at nkmz.donetsk.ua (Vlad)
Date: Wed, 7 Sep 2005 11:51:36 +0300
Subject: [Linux-cluster] connect server to fiber channel storage (HP MSA1500)
Message-ID: <968123011.20050907115136@nkmz.donetsk.ua>

Hello linux-cluster,

  How can I connect server to fiber channel storage (HP MSA1500) ???
  Which is device name for disk drives via FC connections ???

  On server I have installed RHEL 2.1 U6 with qla2300.o driver:

-----------------------------------------------------------------
[root at dl585cl1 proc]# cat /proc/modules
e1000                  78204   0 (unused)
bcm5700               109996   2
mptscsih               41072   0
mptbase                43200   3 [mptscsih]
qla2300               705888   0
-----------------------------------------------------------------

-----------------------------------------------------------------
[root at dl585cl1 qla2300]# cat /proc/scsi/qla2300/0
QLogic PCI to Fibre Channel Host Adapter for QLA2340 :
        Firmware version:  3.03.01, Driver version 7.01.01-RH1
Entry address = f880a060
HBA: QLA2312 , Serial# T79863
Request Queue = 0x362d0000, Response Queue = 0x362c0000
Request Queue count= 512, Response Queue count= 512
Total number of active commands = 0
Total number of interrupts = 6028
Total number of IOCBs (used/max) = (0/600)
Total number of queued commands = 0
    Device queue depth = 0x20
Number of free request entries = 63
Number of mailbox timeouts = 0
Number of ISP aborts = 0
Number of loop resyncs = 8
Number of retries for empty slots = 0
Number of reqs in pending_q= 0, retry_q= 0, done_q= 0, scsi_retry_q= 0
Host adapter:loop state= <READY>, flags= 0x860813
Dpc flags = 0x0
MBX flags = 0x0
SRB Free Count = 4096
Link down Timeout = 008
Port down retry = 030
Login retry count = 030
Commands retried with dropped frame(s) = 0
Configured characteristic impedence: 50 ohms
Configured data rate: 1-2 Gb/sec auto-negotiate

SCSI Device Information:
scsi-qla0-adapter-node=200000e08b1ed735;
scsi-qla0-adapter-port=210000e08b1ed735;
scsi-qla0-target-0=500508b30090e751;

SCSI LUN Information:
(Id:Lun)  * - indicates lun is not registered with the OS.
( 0: 0): Total reqs 9, Pending reqs 0, flags 0x0, 0:0:81,
( 0: 1): Total reqs 7784, Pending reqs 0, flags 0x0, 0:0:81,
-----------------------------------------------------------------


-- 
Best regards,
 Vlad                          mailto:vlad at nkmz.donetsk.ua




From Axel.Thimm at ATrpms.net  Wed Sep  7 09:24:18 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Wed, 7 Sep 2005 11:24:18 +0200
Subject: [Linux-cluster] Re: Using GFS without a network?
In-Reply-To: <431E1ED7.7010909@gmx.net>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>
Message-ID: <20050907092418.GA4014@neu.nirvana>

On Wed, Sep 07, 2005 at 12:57:27AM +0200, Andreas Brosche wrote:
> >The DLM runs over IP, as does the cluster manager.  Additionally, please
> >remember that GFS requires fencing, and that most fence-devices are
> >IP-enabled.
> 
> Hmm. The whole setup is supposed to physically divide two networks, and 
> nevertheless provide some kind of shared storage for moving data from 
> one network to another. Establishing an ethernet link between the two 
> servers would sort of disrupt the whole concept, which is to prevent 
> *any* network access from outside into the secure part of the network. 
> This is the (strongly simplified) topology:
> 
> mid-secure network -- Server1 -- Storage -- Server2 -- secure Network
> 
> A potential attacker could use a possible security flaw in the dlm 
> service (which is bound to the network interface) to gain access to the 
> server on the "secure" side *instantly* when he was able to compromise 
> the server on the mid-secure side (hey, it CAN happen). If any sort of 
> shared storage can be installed *without* any ethernet link or - ideally 
> - any sort of inter-server communication, there is a way to *prove* that 
> an attacker cannot establish any kind of connection into the secure net 
> (some risks remain, but they have nothing to do with the physical 
> connection).

If you are paranoid like that and consider that even if you could do
away with dlm and IP connectivity, then

o an attacker on the mid-secure network could alter files that the
  secure network accesses and gain privileges that way.
o an attacker can exploit potential bugs in GFS's code, just as well
  as in dlm's, and having physical access to the Server 2's journals
  is probably more harmful than trying to hack through dlm's API
  calls.

There is no way to "prove" what you want. Just go for second best to
the ideal theorem. You probably don't want GFS, but a hardened NFS
connection to the storage allocated within the secure network only.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050907/01883a9c/attachment.sig>

From karasiov at infobox.ru  Wed Sep  7 11:45:49 2005
From: karasiov at infobox.ru (karasiov at infobox.ru)
Date: Wed, 7 Sep 2005 15:45:49 +0400
Subject: [Linux-cluster] File size limitation on GFS
In-Reply-To: <42EDEBC8.7070402@histor.fr>
References: <42EDEBC8.7070402@histor.fr>
Message-ID: <872666588.20050907154549@infobox.ru>

????????????, Ion.

?? ?????? 1 ??????? 2005 ?., 13:30:48:

IA> Hi everybody, 
IA> is there is a maximum file size the GFS can handle?
IA> I tried to do some tests with big files, and I couldn't open (open(2)) 
IA> files that
were  >>= 2Go. (It works with 1Go files, I didn't try sizes between 1 and 
IA> 2 Go).

IA> I would like to know if this limitation comes from my configuration or 
IA> from the GFS
IA> file system.

IA> I searched an answer in the web and in the mailing list but I didn't 
IA> found anything,
IA> If I missed something I'd be very sorry and an url to the article
IA> I missed would be a great answer :).

IA> Thanks in advance!

Hi,

I need to start GFS in a single mode on Debian Sarge,
but ccsd does not start - whats wrong with single mode?

SK



From karon at gmx.net  Wed Sep  7 13:13:48 2005
From: karon at gmx.net (Andreas Brosche)
Date: Wed, 7 Sep 2005 15:13:48 +0200 (MEST)
Subject: [Linux-cluster] Using GFS without a network?
References: <1126051624.3694.26.camel@aptis101.cqtel.com>
Message-ID: <11675.1126098828@www67.gmx.net>

From: Steve Wilcox <spwilcox at att.com>
> On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:
[multi-initiated SCSI issues]
> All tested configurations were equally crash-happy due to the bus 
> resets.  
[...]
> Calling these
> configurations simply "not supported" is an understatement - this type
> of config is guaranteed trouble. 

OK, thank you for sharing your experiences. It definitely sounds like we're
not going to use this setup. Maybe these issues should find the way into the
GFS documentation, as multi-initiated busses *should* be standard compliant.
A simple "it don't work", as it basically says now, is not enough, IMHO. 

From: Axel Thimm <Axel.Thimm at ATrpms.net>
> > Hmm. The whole setup is supposed to physically divide two networks, and
> > nevertheless provide some kind of shared storage for moving data from 
> > one network to another.
[...]
> > This is the (strongly simplified) topology:
> > 
> > mid-secure network -- Server1 -- Storage -- Server2 -- secure Network
> > 
> > A potential attacker could use a possible security flaw in the dlm 
> > service (which is bound to the network interface) to gain access to the 
> > server on the "secure" side *instantly* when he was able to compromise 
> > the server on the mid-secure side (hey, it CAN happen). If any sort of 
> > shared storage can be installed *without* any ethernet link or - 
> > ideally - any sort of inter-server communication, there is a way to 
> > *prove* that an attacker cannot establish any kind of connection into 
> > the secure net (some risks remain, but they have nothing to do with the
> > physical connection).
> 
> If you are paranoid like that and consider that even if you could do
> away with dlm and IP connectivity, then
> 
> o an attacker on the mid-secure network could alter files that the
>   secure network accesses and gain privileges that way.

Data corruption is not really an issue - the only way to gain privileges
that way on any system would be on the system where the actual data is being
processed (which is, in fact, possible, think of viruses in multimedia
files, or MS Word macro viruses). The only way the data is processed by
Server2 is by transferring it into the secure network. As most files in the
secure network will be documents, we'll have to keep our word processing
software up to date. But attacks embedded into the actual data are an issue
we'd have to deal with, no matter what the transport medium is. 

> o an attacker can exploit potential bugs in GFS's code, just as well
>   as in dlm's, and having physical access to the Server 2's journals
>   is probably more harmful than trying to hack through dlm's API
>   calls.

Sure, the possibility of potential bugs in GFS was also under my
considerations. Injection of harmful code could be possible either way, if
there is in fact a security flaw in the sync code, granted... it wouldn't
make much of a difference if the code is injected via disk or via service...

> There is no way to "prove" what you want. Just go for second best to
> the ideal theorem. You probably don't want GFS, but a hardened NFS
> connection to the storage allocated within the secure network only.

So you would set up one (hardened) server only between the two networks? I'd
really rather have a solution without the technical ability to set up any
kind of tunnel which allows data to be read *from* the secure network. IP
over storage might be possible, but the counterpart in the secure network
has to interpret it, so any kind of trojan must be injected into the data.
For an attacker, the situation is the same, no matter how the data gets into
the network. With a single server connected to two networks, however, the
situation is by far easier for the attacker, as it introduces a far more
elegant way of setting up a tunnel. 

What the whole setup is supposed to prevent is that an attacker who manages
to get into Server1 has no immediate connection to the secure network (which
would be the case with a shared NFS server with, say, two ethernet devices).

> Axel.Thimm at ATrpms.net

Thank you both for your ideas and exeriences, I'll look into the
possibilities of hardening network filesystems. Looks like I'll discard the
shared bus idea completely; I'm going to fiddle a bit with it though and
test when the data gets corrupted. I'm not going to waste too much time on
it though. 

As GFS is supposed to be a file system which is shared between equal nodes
of a cluster, I guess it really is not the file system of choice for our
needs. An NFS solution sounds less insane. I'll think about the whole thing
again. 

Regards,

Andreas

-- 
5 GB Mailbox, 50 FreeSMS http://www.gmx.net/de/go/promail
+++ GMX - die erste Adresse f?r Mail, Message, More +++



From moya at infomed.sld.cu  Wed Sep  7 11:29:47 2005
From: moya at infomed.sld.cu (Maykel Moya)
Date: Wed, 07 Sep 2005 07:29:47 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <1126051624.3694.26.camel@aptis101.cqtel.com>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>
	<1126051624.3694.26.camel@aptis101.cqtel.com>
Message-ID: <1126092587.11885.54.camel@julia.sld.cu>

El mar, 06-09-2005 a las 20:06 -0400, Steve Wilcox escribi?:
> On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:
> I set up a small 2 node cluster this way a while back, just as a testbed
> for myself.  Much as I suspected, it was severely unstable because of
> the storage configuration, even occasionally causing both nodes to crash
> when one was rebooted due to SCSI bus resets.  I tore it down and
> rebuilt it several times, configuring it as a simple failover cluster
> with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an
> openSSI cluster using Fedora3.  All tested configurations were equally
> crash-happy due to the bus resets.  

Could you share your cluster components config files ?

Regards
maykel





From hne at hopnet.net  Wed Sep  7 09:43:13 2005
From: hne at hopnet.net (Keith Hopkins)
Date: Wed, 07 Sep 2005 19:43:13 +1000
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <1126062237.12381.4.camel@aptis101.cqtel.com>
References: <16102.1125953577@www46.gmx.net>	<1126022567.3344.53.camel@ayanami.boston.redhat.com>	<431E1ED7.7010909@gmx.net>	<1126051624.3694.26.camel@aptis101.cqtel.com>
	<1126062237.12381.4.camel@aptis101.cqtel.com>
Message-ID: <431EB631.1020300@hopnet.net>

Steve Wilcox wrote:
> On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote:
> 
>>On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:
>>
>>
>>>>- Multi-initator SCSI buses do not work with GFS in any meaningful way,
>>>>regardless of what the host controller is.
>>>>Ex: Two machines with different SCSI IDs on their initiator connected to
>>>>the same physical SCSI bus.
>>>
>>>Hmm... don't laugh at me, but in fact that's what we're about to set up.
>>>
>>>I've read in Red Hat's docs that it is "not supported" because of 
>>>performance issues. Multi-initiator buses should comply to SCSI 
>>>standards, and any SCSI-compliant disk should be able to communicate 
>>>with the correct controller, if I've interpreted the specs correctly. Of 
>>>course, you get arbitrary results when using non-compliant hardware... 
>>>What are other issues with multi-initiator buses, other than performance 
>>>loss?
>>
>>I set up a small 2 node cluster this way a while back, just as a testbed
>>for myself.  Much as I suspected, it was severely unstable because of
>>the storage configuration, even occasionally causing both nodes to crash
>>when one was rebooted due to SCSI bus resets.  I tore it down and
>>rebuilt it several times, configuring it as a simple failover cluster
>>with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an
>>openSSI cluster using Fedora3.  All tested configurations were equally
>>crash-happy due to the bus resets.  
>>
>>My configuration consisted of a couple of old Compaq deskpro PC's, each
>>with a single ended Symbiosis card (set to different SCSI ID's
>>obviously) and an external DEC BA360 jbod shelf with 6 drives.  The bus
>>resets might be mitigated somewhat by using HVD SCSI and Y-cables with
>>external terminators, but from my previous experience with other
>>clusters that used this technique (DEC ASE and HP-ux service guard), bus
>>resets will always be a thorn in your side without a separate,
>>independent raid controller to act as a go-between.  Calling these
>>configurations simply "not supported" is an understatement - this type
>>of config is guaranteed trouble.  I'd never set up a cluster this way
>>unless I'm the only one using it, and only then if I don't care one
>>little bit about crashes and data corruption.  My two cents.
>>
>>-steve
> 
> 
> 
> Small clarification - Although clusters from DEC, HP, and even
> DigiComWho?Paq's TruCluster can be made to work (sort of) on multi-
> initiator SCSI busses, IIRC it was never a supported option for any of
> them (much like RedHat's offering).  I doubt any sane company would ever
> support that type of config.
> 
> -steve   
> 

HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP.  It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years.  Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard.

--Keith

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3487 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050907/aec01558/attachment.bin>

From linux-cluster at redhat.com  Wed Sep  7 12:48:02 2005
From: linux-cluster at redhat.com (Cluster Boston)
Date: Wed, 7 Sep 2005 08:48:02 -0400
Subject: [Linux-cluster] Cluster 2005 Boston Early Bird Deadline approaching
Message-ID: <20050907124754.CC66140977@villi.rcsnetworks.com>

see http://www.cluster2005.org.



From karasiov at infobox.ru  Wed Sep  7 14:21:48 2005
From: karasiov at infobox.ru (karasiov at infobox.ru)
Date: Wed, 7 Sep 2005 18:21:48 +0400
Subject: [Linux-cluster] single mode
In-Reply-To: <872666588.20050907154549@infobox.ru>
References: <42EDEBC8.7070402@histor.fr> <872666588.20050907154549@infobox.ru>
Message-ID: <897776205.20050907182148@infobox.ru>

Hi,

I need to start GFS in a single mode on Debian Sarge,
but ccsd does not start - whats wrong with single mode?

SK



From spwilcox at att.com  Wed Sep  7 15:03:58 2005
From: spwilcox at att.com (Steve Wilcox)
Date: Wed, 07 Sep 2005 11:03:58 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <1126092587.11885.54.camel@julia.sld.cu>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>
	<1126051624.3694.26.camel@aptis101.cqtel.com>
	<1126092587.11885.54.camel@julia.sld.cu>
Message-ID: <1126105438.25415.4.camel@aptis101.cqtel.com>

On Wed, 2005-09-07 at 07:29 -0400, Maykel Moya wrote:
> El mar, 06-09-2005 a las 20:06 -0400, Steve Wilcox escribi?:
> > On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:
> > I set up a small 2 node cluster this way a while back, just as a testbed
> > for myself.  Much as I suspected, it was severely unstable because of
> > the storage configuration, even occasionally causing both nodes to crash
> > when one was rebooted due to SCSI bus resets.  I tore it down and
> > rebuilt it several times, configuring it as a simple failover cluster
> > with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an
> > openSSI cluster using Fedora3.  All tested configurations were equally
> > crash-happy due to the bus resets.  
> 
> Could you share your cluster components config files ?
> 
> Regards
> maykel
> 

I didn't save any of my old config files - as I said, this was just a
small cluster for me to toy with.  Everything was fairly vanilla
generally speaking though.  Here's the config from the current setup, a
RHEL4 simple failover cluster without GFS (and currently with no
meaningful services).

<?xml version="1.0"?>
<cluster config_version="10" name="rh-clu">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="rh-clu01-ics" votes="1"/>
                <clusternode name="rh-clu02-ics" votes="1"/>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_manual" name="human"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="stuff" ordered="0"
restricted="0">
                                <failoverdomainnode name="rh-clu01-ics"
priority="1"/>
                                <failoverdomainnode name="rh-clu02-ics"
priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="data" ordered="0"
restricted="0">
                                <failoverdomainnode name="rh-clu01-ics"
priority="1"/>
                                <failoverdomainnode name="rh-clu02-ics"
priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <fs device="/dev/sdf1" fstype="ext3"
mountpoint="/stuff" name="stuff-fs"/>
                        <fs device="/dev/sde1" fstype="ext3"
mountpoint="/data" name="data-fs"/>
                        <ip address="10.8.204.105" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="stuff" name="stuff">
                        <fs ref="stuff-fs"/>
                </service>
                <service autostart="1" domain="data" name="data">
                        <fs ref="data-fs"/>
                        <ip ref="10.8.204.105"/>
                </service>
        </rm>
</cluster>




From spwilcox at att.com  Wed Sep  7 15:19:52 2005
From: spwilcox at att.com (Steve Wilcox)
Date: Wed, 07 Sep 2005 11:19:52 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <431EB631.1020300@hopnet.net>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>	<1126051624.3694.26.camel@aptis101.cqtel.com>
	<1126062237.12381.4.camel@aptis101.cqtel.com>
	<431EB631.1020300@hopnet.net>
Message-ID: <1126106393.25415.18.camel@aptis101.cqtel.com>

On Wed, 2005-09-07 at 19:43 +1000, Keith Hopkins wrote:
> Steve Wilcox wrote:
> > On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote:
> > 
> >>On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:
> >>
> >>
> >>>>- Multi-initator SCSI buses do not work with GFS in any meaningful way,
> >>>>regardless of what the host controller is.
> >>>>Ex: Two machines with different SCSI IDs on their initiator connected to
> >>>>the same physical SCSI bus.
> >>>
> >>>Hmm... don't laugh at me, but in fact that's what we're about to set up.
> >>>
> >>>I've read in Red Hat's docs that it is "not supported" because of 
> >>>performance issues. Multi-initiator buses should comply to SCSI 
> >>>standards, and any SCSI-compliant disk should be able to communicate 
> >>>with the correct controller, if I've interpreted the specs correctly. Of 
> >>>course, you get arbitrary results when using non-compliant hardware... 
> >>>What are other issues with multi-initiator buses, other than performance 
> >>>loss?
> >>
> >>I set up a small 2 node cluster this way a while back, just as a testbed
> >>for myself.  Much as I suspected, it was severely unstable because of
> >>the storage configuration, even occasionally causing both nodes to crash
> >>when one was rebooted due to SCSI bus resets.  I tore it down and
> >>rebuilt it several times, configuring it as a simple failover cluster
> >>with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an
> >>openSSI cluster using Fedora3.  All tested configurations were equally
> >>crash-happy due to the bus resets.  
> >>
> >>My configuration consisted of a couple of old Compaq deskpro PC's, each
> >>with a single ended Symbiosis card (set to different SCSI ID's
> >>obviously) and an external DEC BA360 jbod shelf with 6 drives.  The bus
> >>resets might be mitigated somewhat by using HVD SCSI and Y-cables with
> >>external terminators, but from my previous experience with other
> >>clusters that used this technique (DEC ASE and HP-ux service guard), bus
> >>resets will always be a thorn in your side without a separate,
> >>independent raid controller to act as a go-between.  Calling these
> >>configurations simply "not supported" is an understatement - this type
> >>of config is guaranteed trouble.  I'd never set up a cluster this way
> >>unless I'm the only one using it, and only then if I don't care one
> >>little bit about crashes and data corruption.  My two cents.
> >>
> >>-steve
> > 
> > 
> > 
> > Small clarification - Although clusters from DEC, HP, and even
> > DigiComWho?Paq's TruCluster can be made to work (sort of) on multi-
> > initiator SCSI busses, IIRC it was never a supported option for any of
> > them (much like RedHat's offering).  I doubt any sane company would ever
> > support that type of config.
> > 
> > -steve   
> > 
> 
> HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP.  It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years.  Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard.
> 
> --Keith

Hmmm...   Are you sure you're thinking of a multi-initiator _bus_ and
not something like an external SCSI array (i.e. nike arrays or some such
thing)?  I know that multi-port SCSI hubs are available, and more than
one HBA per node is obviously supported for multipathing, but generally
any multi-initiator SCSI setup will be talking to an external raid
array, not a simple SCSI bus, and even then bus resets can cause grief.
Admittedly, I'm much more familiar with the Alpha server side of things
(multi-initiator buses were definitely never supported under DEC unix /
Tru64) , so I could be wrong about HP-ux.  I just can't imagine that a
multi-initiator bus wouldn't be a nightmare.   

-steve



From lhh at redhat.com  Wed Sep  7 15:19:45 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 07 Sep 2005 11:19:45 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <431E1ED7.7010909@gmx.net>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>
Message-ID: <1126106385.30592.11.camel@ayanami.boston.redhat.com>

On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:

> I've read in Red Hat's docs that it is "not supported" because of 
> performance issues. Multi-initiator buses should comply to SCSI 
> standards, and any SCSI-compliant disk should be able to communicate 
> with the correct controller, if I've interpreted the specs correctly. Of 
> course, you get arbitrary results when using non-compliant hardware... 
> What are other issues with multi-initiator buses, other than performance 
> loss?

Dueling resets.  Some drivers will reset the bus when loaded (some cards
do this when the machine boots, too).  Then, the other initator's driver
detects a reset, and goes ahead and issues a reset.  So, the first
initiator's driver detects the reset, and goes ahead and issues a reset.

I'm sure you see where this is going.

The important thing is that (IIRC) the number of resets is unbounded.
It could be 1, it could be 20,000.  During this time, none of the
devices on the bus can be accessed.

> > The DLM runs over IP, as does the cluster manager.  Additionally, please
> > remember that GFS requires fencing, and that most fence-devices are
> > IP-enabled.
> 
> Hmm. The whole setup is supposed to physically divide two networks, and 
> nevertheless provide some kind of shared storage for moving data from 
> one network to another. Establishing an ethernet link between the two 
> servers would sort of disrupt the whole concept, which is to prevent 
> *any* network access from outside into the secure part of the network. 
> This is the (strongly simplified) topology:
> 
> mid-secure network -- Server1 -- Storage -- Server2 -- secure Network

Ok, GFS will not work for this.  However, you *can* still use, for
example, a raw device to lock the data, then write out the data directly
to the partition (as long as you didn't need file I/O).

You can use a disk-based locking scheme similar to the one found in
Cluster Manager 1.0.x and/or Kimberlite 1.1.x to synchronize access to
the shared partition.

If you're using a multi-initator bus, you can certainly also use SCSI
reservations to synchronize access as well.

> A potential attacker could use a possible security flaw in the dlm 
> service (which is bound to the network interface) to gain access to the 
> server on the "secure" side *instantly* when he was able to compromise 
> the server on the mid-secure side (hey, it CAN happen). 

Fair enough.

> You cannot, however, disable *read* caching (which seems to be 
> buried quite deeply into the kernel), which means you actually have to 
> umount and then re-mount (ie, not "mount -o remount") the fs. This means 
> that long transfers could block other users for a long time. And 
> mounting and umounting the same fs over and over again doesn't exactly 
> sound like a good idea... even if it's only mounted ro.

Yup.

> 
> Maykel Moya wrote:
>  > El lun, 05-09-2005 a las 22:52 +0200, Andreas Brosche escribi?:
>  > I recently set up something like that. We use a external HP Smart
>  > Array Cluster Storage. It has a separate connection (SCSI cable) to
>  > both hosts.
> 
> So it is not really a shared bus, but a dual bus configuration.

Ah, that's much better =)

-- Lon



From lhh at redhat.com  Wed Sep  7 15:25:09 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 07 Sep 2005 11:25:09 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <431EB631.1020300@hopnet.net>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>	<1126051624.3694.26.camel@aptis101.cqtel.com>
	<1126062237.12381.4.camel@aptis101.cqtel.com>
	<431EB631.1020300@hopnet.net>
Message-ID: <1126106709.30592.16.camel@ayanami.boston.redhat.com>

On Wed, 2005-09-07 at 19:43 +1000, Keith Hopkins wrote:

> > 
> 
> HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP.  It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years.  Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard.
> 

The key there is that HP-UX is what's different, not specifically
ServiceGuard.  ServiceGuard on Linux will have the same pitfalls, and
won't work well, even if HP *does* support it. :(

-- Lon



From lhh at redhat.com  Wed Sep  7 15:29:53 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 07 Sep 2005 11:29:53 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <11675.1126098828@www67.gmx.net>
References: <1126051624.3694.26.camel@aptis101.cqtel.com>
	<11675.1126098828@www67.gmx.net>
Message-ID: <1126106993.30592.21.camel@ayanami.boston.redhat.com>

On Wed, 2005-09-07 at 15:13 +0200, Andreas Brosche wrote:

> > o an attacker can exploit potential bugs in GFS's code, just as well
> >   as in dlm's, and having physical access to the Server 2's journals
> >   is probably more harmful than trying to hack through dlm's API
> >   calls.
> 
> Sure, the possibility of potential bugs in GFS was also under my
> considerations. Injection of harmful code could be possible either way, if
> there is in fact a security flaw in the sync code, granted... it wouldn't
> make much of a difference if the code is injected via disk or via service...

Note: If anyone breaks in to the world-facing server, you will need a
way to detect it and notify the other server.

Once this happens, it's safe (perhaps paranoid) to assume all data on
the shared disk is corrupt, and possibly dangerous, and so should not be
used.

-- Lon




From lhh at redhat.com  Wed Sep  7 15:48:46 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 07 Sep 2005 11:48:46 -0400
Subject: [Linux-cluster] Re: Using GFS without a network?
In-Reply-To: <20050907092418.GA4014@neu.nirvana>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>  <20050907092418.GA4014@neu.nirvana>
Message-ID: <1126108126.30592.41.camel@ayanami.boston.redhat.com>

On Wed, 2005-09-07 at 11:24 +0200, Axel Thimm wrote:

> There is no way to "prove" what you want. Just go for second best to
> the ideal theorem. You probably don't want GFS, but a hardened NFS
> connection to the storage allocated within the secure network only.

I'd do shared raw.

If we know the computer on the secure network *never* writes to the disk
and it has no possible way to establish a network connection to the
outside world (via any means) then we only have to worry about the
attacker somehow corrupting data to crash the application on the secure
server.

Make sure your reader application has a reliable way to verify the
integrity of the data (possibly using some form of encryption like gpg)
and you're golden.

So, the would-be attacker would have to do the following to get data off
the secure network:

(a) Break in to world-facing server

(b) Create data which will cause a malfunction in to the secret
application on the secure server (without having access to said
application; this is based on an outside job, not an inside job),

(c) encrypt or sign the data so that the secure server trusts it, and

(d) write the data out to the right offset on the raw device...

In the "overflow code", the attacker would have to know where the data
is stored, retrieve it, and write it out to the shared SCSI disk.

Note that the above becomes much more difficult if you change the SCSI
block device driver on the secure server to completely disable
writes. ;)

It also becomes more difficult if the secret application is audited for
security flaws before being put into production.

Just random ideas... *shrug*

-- Lon



From moya at infomed.sld.cu  Wed Sep  7 14:51:23 2005
From: moya at infomed.sld.cu (Maykel Moya)
Date: Wed, 07 Sep 2005 10:51:23 -0400
Subject: [Linux-cluster] Filesystem (GFS) availability
Message-ID: <1126104683.15223.7.camel@julia.sld.cu>

I have a two node GFS setup. When one of the nodes (B) goes down, the
other one (A) is unable to access the fs.

A, nevertheless, "notes" that B went down and removes it from the
cluster, but any access to the GFS locks.

Any clues?

My cluster.conf is:

<?xml version="1.0"?>
<cluster name="cluster1" config_version="1">
  <cman two_node="1" expected_votes="1">
  </cman>
  <clusternodes>
    <clusternode name="varela1" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="x.j.h.b"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="varela2" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="x.y.z.h"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fence_devices>
    <device name="human" agent="fence_manual"/>
  </fence_devices>
</cluster>

Regards,
maykel




From moya at infomed.sld.cu  Wed Sep  7 14:45:37 2005
From: moya at infomed.sld.cu (Maykel Moya)
Date: Wed, 07 Sep 2005 10:45:37 -0400
Subject: [Linux-cluster] File size limitation on GFS
In-Reply-To: <872666588.20050907154549@infobox.ru>
References: <42EDEBC8.7070402@histor.fr> <872666588.20050907154549@infobox.ru>
Message-ID: <1126104337.15223.1.camel@julia.sld.cu>

> I need to start GFS in a single mode on Debian Sarge,
> but ccsd does not start - whats wrong with single mode?

Do you have a /etc/cluster/cluster.conf ?

Regards,
maykel

PD: Though you backported packages from unstable



From jacobl at ccbill.com  Wed Sep  7 17:16:42 2005
From: jacobl at ccbill.com (Jacob Liff)
Date: Wed, 7 Sep 2005 10:16:42 -0700
Subject: [Linux-cluster] Filesystem (GFS) availability
Message-ID: <63DFFDD742B5E54389C891AB1DFFE9A20576FF@Exchange.ccbill-hq.local>

Morning,

	When using manual fencing you will have to ack the manual fence
on the remaining machine. Once this is complete, the remaining node will
grab the failed nodes journal and play it back. You will then regain
access to the file system.

Jacob L.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Maykel Moya
Sent: Wednesday, September 07, 2005 7:51 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Filesystem (GFS) availability

I have a two node GFS setup. When one of the nodes (B) goes down, the
other one (A) is unable to access the fs.

A, nevertheless, "notes" that B went down and removes it from the
cluster, but any access to the GFS locks.

Any clues?

My cluster.conf is:

<?xml version="1.0"?>
<cluster name="cluster1" config_version="1">
  <cman two_node="1" expected_votes="1">
  </cman>
  <clusternodes>
    <clusternode name="varela1" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="x.j.h.b"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="varela2" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="x.y.z.h"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fence_devices>
    <device name="human" agent="fence_manual"/>
  </fence_devices>
</cluster>

Regards,
maykel


--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Wed Sep  7 17:20:20 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 07 Sep 2005 13:20:20 -0400
Subject: [Linux-cluster] Filesystem (GFS) availability
In-Reply-To: <1126104683.15223.7.camel@julia.sld.cu>
References: <1126104683.15223.7.camel@julia.sld.cu>
Message-ID: <1126113620.30592.49.camel@ayanami.boston.redhat.com>

On Wed, 2005-09-07 at 10:51 -0400, Maykel Moya wrote:

>   <fence_devices>
>     <device name="human" agent="fence_manual"/>
>   </fence_devices>

Run fence_ack_manual on the surviving node.

Better yet, stop using manual fencing and buy a supported power switch
off of eBay.  It will save you a lot of frustration :)

-- Lon



From amanthei at redhat.com  Wed Sep  7 18:50:24 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Wed, 7 Sep 2005 13:50:24 -0500
Subject: [Linux-cluster] File size limitation on GFS
In-Reply-To: <872666588.20050907154549@infobox.ru>
References: <42EDEBC8.7070402@histor.fr> <872666588.20050907154549@infobox.ru>
Message-ID: <20050907185024.GF26769@redhat.com>

If you want GFS in single user mode (no networking) then you can mount using
the lock_nolock protocol.  Be very careful because if two or more machines
mount the filesystem using this option, you will cause corruption.


On Wed, Sep 07, 2005 at 03:45:49PM +0400, karasiov at infobox.ru wrote:
> ????????????, Ion.
> 
> ?? ?????? 1 ??????? 2005 ?., 13:30:48:
> 
> IA> Hi everybody, 
> IA> is there is a maximum file size the GFS can handle?
> IA> I tried to do some tests with big files, and I couldn't open (open(2)) 
> IA> files that
> were  >>= 2Go. (It works with 1Go files, I didn't try sizes between 1 and 
> IA> 2 Go).
> 
> IA> I would like to know if this limitation comes from my configuration or 
> IA> from the GFS
> IA> file system.
> 
> IA> I searched an answer in the web and in the mailing list but I didn't 
> IA> found anything,
> IA> If I missed something I'd be very sorry and an url to the article
> IA> I missed would be a great answer :).
> 
> IA> Thanks in advance!
> 
> Hi,
> 
> I need to start GFS in a single mode on Debian Sarge,
> but ccsd does not start - whats wrong with single mode?
> 
> SK
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Adam Manthei  <amanthei at redhat.com>



From Axel.Thimm at ATrpms.net  Wed Sep  7 20:15:37 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Wed, 7 Sep 2005 22:15:37 +0200
Subject: [Linux-cluster] Samba failover "impossible" due to missing cifs
	client reconnect?
Message-ID: <20050907201537.GB3455@neu.nirvana>

After having setup our workarounds for NFS we are very happy with how
it's working. Now we're looking at Samba.

But we have quite a showstopper right at the beginning. The smb/cifs
clients, be it smbclient or Windows XP, don't like their TCP stream
being resetted and don't retry/reconnect (contrary to NFS).

It looks like the protocol has no considerations for retries above the
TCP/IP level. So when the TCP stream is torn on the server's side due
to relocation (either due to crash/fencing or soft) any client
smb/cifs activity is broken at that time.

This means that any data transfer via smb/cifs shares during the
relocation will fail, and there is nothing we can do on the server's
side. Or is there?
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050907/10ef2542/attachment.sig>

From rainer at ultra-secure.de  Wed Sep  7 20:37:07 2005
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Wed, 07 Sep 2005 22:37:07 +0200
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <431E1ED7.7010909@gmx.net>
References: <16102.1125953577@www46.gmx.net>	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>
Message-ID: <431F4F73.1030603@ultra-secure.de>

Andreas Brosche wrote:

> Hmm. The whole setup is supposed to physically divide two networks, 
> and nevertheless provide some kind of shared storage for moving data 
> from one network to another. Establishing an ethernet link between the 
> two servers would sort of disrupt the whole concept, which is to 
> prevent *any* network access from outside into the secure part of the 
> network. This is the (strongly simplified) topology:
>

This is really more of a security-question than clustering (IMO).
Have you thought about a so called "air-gap" device, like Whale 
(www.whalecommunications.com) makes them?

It uses a SCSI-switch to "shuttle" data between two networks.



cheers,
Rainer



From crh at ubiqx.mn.org  Wed Sep  7 20:43:21 2005
From: crh at ubiqx.mn.org (Christopher R. Hertel)
Date: Wed, 7 Sep 2005 15:43:21 -0500
Subject: [Linux-cluster] Samba failover "impossible" due to missing cifs
	client reconnect?
In-Reply-To: <20050907201537.GB3455@neu.nirvana>
References: <20050907201537.GB3455@neu.nirvana>
Message-ID: <20050907204321.GB5677@Favog.ubiqx.mn.org>

On Wed, Sep 07, 2005 at 10:15:37PM +0200, Axel Thimm wrote:
> After having setup our workarounds for NFS we are very happy with how
> it's working. Now we're looking at Samba.
> 
> But we have quite a showstopper right at the beginning. The smb/cifs
> clients, be it smbclient or Windows XP, don't like their TCP stream
> being resetted and don't retry/reconnect (contrary to NFS).
>
> It looks like the protocol has no considerations for retries above the
> TCP/IP level. So when the TCP stream is torn on the server's side due
> to relocation (either due to crash/fencing or soft) any client
> smb/cifs activity is broken at that time.
>
> This means that any data transfer via smb/cifs shares during the
> relocation will fail, and there is nothing we can do on the server's
> side. Or is there?

Windows clients will reconnect to the same server, and so will smbfs and 
cifs-vfs.

I just tested this.  On a W/XP box I browsed through some directories on a 
share served by Samba.  I then shut Samba down, and tried viewing some 
different subdirectories of the same share.  Windows coughed up an error 
dialog.  I then restarted Samba and Windows got happy again.  I could 
browse through all of the subdirectories in the share.

We've talked about Samba on GFS within the Samba Team, and various members
have done some digging into the problem (Volker most recently, if I'm not
mistaken).  Samba must maintain a certain amount of state information
internally--including name mangling, locking, and sharing information
that--is peculiar to Windows+DOS+OS2 semantics.  The problem is ensuring 
that Samba's state information is also shared across the GFS nodes.

I've not had time to keep up with this development thread, but I know that 
the folks working on Samba-4 are aware of the issues involved.

Chris -)-----

-- 
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org



From Axel.Thimm at ATrpms.net  Wed Sep  7 20:51:16 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Wed, 7 Sep 2005 22:51:16 +0200
Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing cifs
	client reconnect?
In-Reply-To: <20050907204321.GB5677@Favog.ubiqx.mn.org>
References: <20050907201537.GB3455@neu.nirvana>
	<20050907204321.GB5677@Favog.ubiqx.mn.org>
Message-ID: <20050907205116.GA7459@neu.nirvana>

On Wed, Sep 07, 2005 at 03:43:21PM -0500, Christopher R. Hertel wrote:
> On Wed, Sep 07, 2005 at 10:15:37PM +0200, Axel Thimm wrote:
> > After having setup our workarounds for NFS we are very happy with how
> > it's working. Now we're looking at Samba.
> > 
> > But we have quite a showstopper right at the beginning. The smb/cifs
> > clients, be it smbclient or Windows XP, don't like their TCP stream
> > being resetted and don't retry/reconnect (contrary to NFS).
> >
> > It looks like the protocol has no considerations for retries above the
> > TCP/IP level. So when the TCP stream is torn on the server's side due
> > to relocation (either due to crash/fencing or soft) any client
> > smb/cifs activity is broken at that time.
> >
> > This means that any data transfer via smb/cifs shares during the
> > relocation will fail, and there is nothing we can do on the server's
> > side. Or is there?
> 
> Windows clients will reconnect to the same server, and so will smbfs and 
> cifs-vfs.
> 
> I just tested this.  On a W/XP box I browsed through some directories on a 
> share served by Samba.  I then shut Samba down, and tried viewing some 
> different subdirectories of the same share.  Windows coughed up an error 
> dialog.  I then restarted Samba and Windows got happy again.  I could 
> browse through all of the subdirectories in the share.

Yes, that does work, but what I wanted to setup is a transparent
failover, so that network I/O recovers w/o any manual interaction.

I.e. I don't want to (soft) relocate the samba shares onto another
node due to load ballancing considerations and generate user visible
I/O errors and failures on a dozen clients.

> We've talked about Samba on GFS within the Samba Team, and various members
> have done some digging into the problem (Volker most recently, if I'm not
> mistaken).  Samba must maintain a certain amount of state information
> internally--including name mangling, locking, and sharing information
> that--is peculiar to Windows+DOS+OS2 semantics.  The problem is ensuring 
> that Samba's state information is also shared across the GFS nodes.
> 
> I've not had time to keep up with this development thread, but I know that 
> the folks working on Samba-4 are aware of the issues involved.
> 
> Chris -)-----
> 

-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050907/c3355f81/attachment.sig>

From crh at ubiqx.mn.org  Wed Sep  7 21:12:52 2005
From: crh at ubiqx.mn.org (Christopher R. Hertel)
Date: Wed, 7 Sep 2005 16:12:52 -0500
Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing
	cifs client reconnect?
In-Reply-To: <20050907205116.GA7459@neu.nirvana>
References: <20050907201537.GB3455@neu.nirvana>
	<20050907204321.GB5677@Favog.ubiqx.mn.org>
	<20050907205116.GA7459@neu.nirvana>
Message-ID: <20050907211252.GC5677@Favog.ubiqx.mn.org>

On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote:
: :
> > I just tested this.  On a W/XP box I browsed through some directories on a 
> > share served by Samba.  I then shut Samba down, and tried viewing some 
> > different subdirectories of the same share.  Windows coughed up an error 
> > dialog.  I then restarted Samba and Windows got happy again.  I could 
> > browse through all of the subdirectories in the share.
> 
> Yes, that does work, but what I wanted to setup is a transparent
> failover, so that network I/O recovers w/o any manual interaction.
>
> I.e. I don't want to (soft) relocate the samba shares onto another
> node due to load ballancing considerations and generate user visible
> I/O errors and failures on a dozen clients.

I guess I'm not really clear on what it is you're trying to accomplish.
Can you provide a little more description of what you'd like to see 
happen, and what kinds of environments you expect?

Chris -)-----

-- 
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org



From teigland at redhat.com  Thu Sep  8 05:41:28 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 8 Sep 2005 13:41:28 +0800
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125574523.5025.10.camel@laptopd505.fenrus.org>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
Message-ID: <20050908054128.GD12220@redhat.com>

On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> +static inline void glock_put(struct gfs2_glock *gl)
> +{
> +	if (atomic_read(&gl->gl_count) == 1)
> +		gfs2_glock_schedule_for_reclaim(gl);
> +	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> +	atomic_dec(&gl->gl_count);
> +}
> 
> this code has a race

The first two lines of the function with the race are non-essential and
could be removed.  In the common case where there's no race, they just add
efficiency by moving the glock to the reclaim list immediately.
Otherwise, the scand thread would do it later when actively trying to
reclaim glocks.

> +static inline int queue_empty(struct gfs2_glock *gl, struct list_head *head)
> +{
> +	int empty;
> +	spin_lock(&gl->gl_spin);
> +	empty = list_empty(head);
> +	spin_unlock(&gl->gl_spin);
> +	return empty;
> +}
> 
> that looks like a racey interface to me... if so.. why bother locking at
> all?

The spinlock protects the list but is not the primary method of
synchronizing processes that are working with a glock.

When the list is in fact empty, there will be no race, and the locking
wouldn't be necessary.  In this case, the "glmutex" in the code fragment
below is preventing any change in the list, so we can safely release the
spinlock immediately.

When the list is not empty, then a process could be adding another entry
to the list without "glmutex" locked [1], making the spinlock necessary.
In this case we quit after queue_empty() returns and don't do anything
else, so releasing the spinlock immediately was still safe.

[1] A process that already holds a glock (i.e. has a "holder" struct on
the gl_holders list) is allowed to hold it again by adding another holder
struct to the same list.  It adds the second hold without locking glmutex.

        if (gfs2_glmutex_trylock(gl)) {
                if (gl->gl_ops == &gfs2_inode_glops) {
                        struct gfs2_inode *ip = get_gl2ip(gl);
                        if (ip && !atomic_read(&ip->i_count))
                                gfs2_inode_destroy(ip);
                }
                if (queue_empty(gl, &gl->gl_holders) &&
                    gl->gl_state != LM_ST_UNLOCKED)
                        handle_callback(gl, LM_ST_UNLOCKED);

                gfs2_glmutex_unlock(gl);
        }

There is a second way that queue_empty() is used, and that's within
assertions that the list is empty.  If the assertion is correct, locking
isn't necessary; locking is only needed if there's already another bug
causing the list to not be empty and the assertion to fail.

> static int gi_skeleton(struct gfs2_inode *ip, struct gfs2_ioctl *gi,
> +		       gi_filler_t filler)
> +{
> +	unsigned int size = gfs2_tune_get(ip->i_sbd, gt_lockdump_size);
> +	char *buf;
> +	unsigned int count = 0;
> +	int error;
> +
> +	if (size > gi->gi_size)
> +		size = gi->gi_size;
> +
> +	buf = kmalloc(size, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	error = filler(ip, gi, buf, size, &count);
> +	if (error)
> +		goto out;
> +
> +	if (copy_to_user(gi->gi_data, buf, count + 1))
> +		error = -EFAULT;
> 
> where does count get a sensible value?

from filler()

We'll add comments in the code to document the things above.
Thanks,
Dave



From Axel.Thimm at ATrpms.net  Thu Sep  8 07:15:12 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Thu, 8 Sep 2005 09:15:12 +0200
Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing cifs
	client reconnect?
In-Reply-To: <20050907211252.GC5677@Favog.ubiqx.mn.org>
References: <20050907201537.GB3455@neu.nirvana>
	<20050907204321.GB5677@Favog.ubiqx.mn.org>
	<20050907205116.GA7459@neu.nirvana>
	<20050907211252.GC5677@Favog.ubiqx.mn.org>
Message-ID: <20050908071512.GC9222@neu.nirvana>

On Wed, Sep 07, 2005 at 04:12:52PM -0500, Christopher R. Hertel wrote:
> On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote:
> : :
> > > I just tested this.  On a W/XP box I browsed through some directories on a 
> > > share served by Samba.  I then shut Samba down, and tried viewing some 
> > > different subdirectories of the same share.  Windows coughed up an error 
> > > dialog.  I then restarted Samba and Windows got happy again.  I could 
> > > browse through all of the subdirectories in the share.
> > 
> > Yes, that does work, but what I wanted to setup is a transparent
> > failover, so that network I/O recovers w/o any manual interaction.
> >
> > I.e. I don't want to (soft) relocate the samba shares onto another
> > node due to load ballancing considerations and generate user visible
> > I/O errors and failures on a dozen clients.
> 
> I guess I'm not really clear on what it is you're trying to accomplish.
> Can you provide a little more description of what you'd like to see 
> happen, and what kinds of environments you expect?

A cifs client performs a largish copy operation. During that the share
is relocated to a different node. The copy operations should stall
during the relocation and resume after 10-20 seconds.

But if the cifs client does not perform a retry on smb/cifs protocol
level (on TCP level it will get a RST, it's the next level protocol
that needs to decide on retransmit the read/write request), then there
is nothing you can do server-side.

Perhaps there are magic registry keys that can persuade Windows
clients to do otherwise.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050908/43727206/attachment.sig>

From bob at interstudio.homeunix.net  Thu Sep  8 11:16:06 2005
From: bob at interstudio.homeunix.net (Bob Marcan)
Date: Thu, 08 Sep 2005 13:16:06 +0200
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <1126106393.25415.18.camel@aptis101.cqtel.com>
References: <16102.1125953577@www46.gmx.net>	<1126022567.3344.53.camel@ayanami.boston.redhat.com>	<431E1ED7.7010909@gmx.net>	<1126051624.3694.26.camel@aptis101.cqtel.com>	<1126062237.12381.4.camel@aptis101.cqtel.com>	<431EB631.1020300@hopnet.net>
	<1126106393.25415.18.camel@aptis101.cqtel.com>
Message-ID: <43201D76.6040000@interstudio.homeunix.net>

Steve Wilcox wrote:
> On Wed, 2005-09-07 at 19:43 +1000, Keith Hopkins wrote:
> 
>>Steve Wilcox wrote:
>>
>>>On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote:
>>>
>>>
>>>>On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:
>>>>
>>>>
>>>>
>>>>>>- Multi-initator SCSI buses do not work with GFS in any meaningful way,
>>>>>>regardless of what the host controller is.
>>>>>>Ex: Two machines with different SCSI IDs on their initiator connected to
>>>>>>the same physical SCSI bus.
>>>>>
>>>>>Hmm... don't laugh at me, but in fact that's what we're about to set up.
>>>>>
>>>>>I've read in Red Hat's docs that it is "not supported" because of 
>>>>>performance issues. Multi-initiator buses should comply to SCSI 
>>>>>standards, and any SCSI-compliant disk should be able to communicate 
>>>>>with the correct controller, if I've interpreted the specs correctly. Of 
>>>>>course, you get arbitrary results when using non-compliant hardware... 
>>>>>What are other issues with multi-initiator buses, other than performance 
>>>>>loss?
>>>>
>>>>I set up a small 2 node cluster this way a while back, just as a testbed
>>>>for myself.  Much as I suspected, it was severely unstable because of
>>>>the storage configuration, even occasionally causing both nodes to crash
>>>>when one was rebooted due to SCSI bus resets.  I tore it down and
>>>>rebuilt it several times, configuring it as a simple failover cluster
>>>>with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an
>>>>openSSI cluster using Fedora3.  All tested configurations were equally
>>>>crash-happy due to the bus resets.  
>>>>
>>>>My configuration consisted of a couple of old Compaq deskpro PC's, each
>>>>with a single ended Symbiosis card (set to different SCSI ID's
>>>>obviously) and an external DEC BA360 jbod shelf with 6 drives.  The bus
>>>>resets might be mitigated somewhat by using HVD SCSI and Y-cables with
>>>>external terminators, but from my previous experience with other
>>>>clusters that used this technique (DEC ASE and HP-ux service guard), bus
>>>>resets will always be a thorn in your side without a separate,
>>>>independent raid controller to act as a go-between.  Calling these
>>>>configurations simply "not supported" is an understatement - this type
>>>>of config is guaranteed trouble.  I'd never set up a cluster this way
>>>>unless I'm the only one using it, and only then if I don't care one
>>>>little bit about crashes and data corruption.  My two cents.
>>>>
>>>>-steve
>>>
>>>
>>>
>>>Small clarification - Although clusters from DEC, HP, and even
>>>DigiComWho?Paq's TruCluster can be made to work (sort of) on multi-
>>>initiator SCSI busses, IIRC it was never a supported option for any of
>>>them (much like RedHat's offering).  I doubt any sane company would ever
>>>support that type of config.
>>>
>>>-steve   
>>>
>>
>>HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP.  It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years.  Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard.
>>
>>--Keith
> 
> 
> Hmmm...   Are you sure you're thinking of a multi-initiator _bus_ and
> not something like an external SCSI array (i.e. nike arrays or some such
> thing)?  I know that multi-port SCSI hubs are available, and more than
> one HBA per node is obviously supported for multipathing, but generally
> any multi-initiator SCSI setup will be talking to an external raid
> array, not a simple SCSI bus, and even then bus resets can cause grief.
> Admittedly, I'm much more familiar with the Alpha server side of things

==========================> should be unfamiliar

> (multi-initiator buses were definitely never supported under DEC unix /
> Tru64) , so I could be wrong about HP-ux.  I just can't imagine that a
> multi-initiator bus wouldn't be a nightmare.   
> 
> -steve
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster

In the past i was using the SCSI cluster on OpenVMS(AXP,VAX) and Tru64.
At home i have 2 DS10 with memmory channel, shared SCSI Tru64 cluster.
Memmory channel was prerequisite in early days, now you can use ethernet 
as CI.
I still have some customers using the SCSI cluster on Tru64.
Two of this are banks, running this configuration a few years.
Without any problems. Using host based shadowing.
Tru64 has single point of failure in this configuration.
Quorum disk can't be shadowed.
OpenVMS doesn't have this limitation.

It is supported.

OpenVMS 
http://h71000.www7.hp.com/doc/82FINAL/6318/6318pro_002.html#interc_sys_table

Tru64
http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_HTML/ARHGWETE/CHNTRXXX.HTM#sec-generic-cluster
...

Best regards, Bob

-- 
  Bob Marcan, Consultant                mailto:bob.marcan at snt.si
  S&T Hermes Plus d.d.                  tel:   +386 (1) 5895-300
  Slandrova ul. 2                       fax:   +386 (1) 5895-202
  1231 Ljubljana - Crnuce, Slovenia     url:   http://www.snt.si



From spwilcox at att.com  Thu Sep  8 17:22:34 2005
From: spwilcox at att.com (Steve Wilcox)
Date: Thu, 08 Sep 2005 13:22:34 -0400
Subject: [Linux-cluster] Using GFS without a network?
In-Reply-To: <43201D76.6040000@interstudio.homeunix.net>
References: <16102.1125953577@www46.gmx.net>
	<1126022567.3344.53.camel@ayanami.boston.redhat.com>
	<431E1ED7.7010909@gmx.net>	<1126051624.3694.26.camel@aptis101.cqtel.com>
	<1126062237.12381.4.camel@aptis101.cqtel.com>	<431EB631.1020300@hopnet.net>
	<1126106393.25415.18.camel@aptis101.cqtel.com>
	<43201D76.6040000@interstudio.homeunix.net>
Message-ID: <1126200154.17706.25.camel@aptis101.cqtel.com>

On Thu, 2005-09-08 at 13:16 +0200, Bob Marcan wrote:
> Steve Wilcox wrote:
> > On Wed, 2005-09-07 at 19:43 +1000, Keith Hopkins wrote:
> > 
> >>Steve Wilcox wrote:
> >>
> >>>On Tue, 2005-09-06 at 20:06 -0400, Steve Wilcox wrote:
> >>>
> >>>
> >>>>On Wed, 2005-09-07 at 00:57 +0200, Andreas Brosche wrote:
> >>>>
> >>>>
> >>>>
> >>>>>>- Multi-initator SCSI buses do not work with GFS in any meaningful way,
> >>>>>>regardless of what the host controller is.
> >>>>>>Ex: Two machines with different SCSI IDs on their initiator connected to
> >>>>>>the same physical SCSI bus.
> >>>>>
> >>>>>Hmm... don't laugh at me, but in fact that's what we're about to set up.
> >>>>>
> >>>>>I've read in Red Hat's docs that it is "not supported" because of 
> >>>>>performance issues. Multi-initiator buses should comply to SCSI 
> >>>>>standards, and any SCSI-compliant disk should be able to communicate 
> >>>>>with the correct controller, if I've interpreted the specs correctly. Of 
> >>>>>course, you get arbitrary results when using non-compliant hardware... 
> >>>>>What are other issues with multi-initiator buses, other than performance 
> >>>>>loss?
> >>>>
> >>>>I set up a small 2 node cluster this way a while back, just as a testbed
> >>>>for myself.  Much as I suspected, it was severely unstable because of
> >>>>the storage configuration, even occasionally causing both nodes to crash
> >>>>when one was rebooted due to SCSI bus resets.  I tore it down and
> >>>>rebuilt it several times, configuring it as a simple failover cluster
> >>>>with RHEL3 and RHEL4, a GFS cluster under RHEL4 and Fedora4, and as an
> >>>>openSSI cluster using Fedora3.  All tested configurations were equally
> >>>>crash-happy due to the bus resets.  
> >>>>
> >>>>My configuration consisted of a couple of old Compaq deskpro PC's, each
> >>>>with a single ended Symbiosis card (set to different SCSI ID's
> >>>>obviously) and an external DEC BA360 jbod shelf with 6 drives.  The bus
> >>>>resets might be mitigated somewhat by using HVD SCSI and Y-cables with
> >>>>external terminators, but from my previous experience with other
> >>>>clusters that used this technique (DEC ASE and HP-ux service guard), bus
> >>>>resets will always be a thorn in your side without a separate,
> >>>>independent raid controller to act as a go-between.  Calling these
> >>>>configurations simply "not supported" is an understatement - this type
> >>>>of config is guaranteed trouble.  I'd never set up a cluster this way
> >>>>unless I'm the only one using it, and only then if I don't care one
> >>>>little bit about crashes and data corruption.  My two cents.
> >>>>
> >>>>-steve
> >>>
> >>>
> >>>
> >>>Small clarification - Although clusters from DEC, HP, and even
> >>>DigiComWho?Paq's TruCluster can be made to work (sort of) on multi-
> >>>initiator SCSI busses, IIRC it was never a supported option for any of
> >>>them (much like RedHat's offering).  I doubt any sane company would ever
> >>>support that type of config.
> >>>
> >>>-steve   
> >>>
> >>
> >>HP-UX ServiceGuard words well with multi-initiator SCSI configurations, and is fully supported by HP.  It is sold that way for small 2-4 node clusters when cost is an issue, although FC has become a big favorite (um...money maker) in recent years.  Yes, SCSI bus resets are a pain, but they are handled by HP-UX, not ServiceGuard.
> >>
> >>--Keith
> > 
> > 
> > Hmmm...   Are you sure you're thinking of a multi-initiator _bus_ and
> > not something like an external SCSI array (i.e. nike arrays or some such
> > thing)?  I know that multi-port SCSI hubs are available, and more than
> > one HBA per node is obviously supported for multipathing, but generally
> > any multi-initiator SCSI setup will be talking to an external raid
> > array, not a simple SCSI bus, and even then bus resets can cause grief.
> > Admittedly, I'm much more familiar with the Alpha server side of things
> 
> ==========================> should be unfamiliar
> 
> > (multi-initiator buses were definitely never supported under DEC unix /
> > Tru64) , so I could be wrong about HP-ux.  I just can't imagine that a
> > multi-initiator bus wouldn't be a nightmare.   
> > 
> > -steve
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > http://www.redhat.com/mailman/listinfo/linux-cluster
> 
> In the past i was using the SCSI cluster on OpenVMS(AXP,VAX) and Tru64.
> At home i have 2 DS10 with memmory channel, shared SCSI Tru64 cluster.
> Memmory channel was prerequisite in early days, now you can use ethernet 
> as CI.
> I still have some customers using the SCSI cluster on Tru64.
> Two of this are banks, running this configuration a few years.
> Without any problems. Using host based shadowing.
> Tru64 has single point of failure in this configuration.
> Quorum disk can't be shadowed.
> OpenVMS doesn't have this limitation.
> 
> It is supported.
> 
> OpenVMS 
> http://h71000.www7.hp.com/doc/82FINAL/6318/6318pro_002.html#interc_sys_table
> 
> Tru64
> http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_HTML/ARHGWETE/CHNTRXXX.HTM#sec-generic-cluster
> ...
> 
> Best regards, Bob
> 

I'll ignore the insult and get to the meat of the matter...

I'm well aware of that doc.  I got burned by it a few years ago when I
set up a dev cluster of ES40's based on it.  Everything was humming
along just fine until I load tested our Oracle database - at something
like 700,000 transactions per hour, guess what happened?  I had a flurry
of bus resets, followed by a flurry of advfs domain panics, resulting in
a crashed cluster.  When I called my gold support TAM for help in
debugging the issue, I was told that "yeah, shared buses will do that.
That's why we don't provide technical support for that configuration".
When I pointed out that their own documentation claimed it was a
"supported" configuration I was told that "supported" only meant
technically possible as far as that doc goes - not that HP would provide
break-fix support.  Maybe that's changed in the last couple years, but
I'd doubt it - as my TAM said, shared SCSI buses WILL do that, no way
around it really.  If you're not having problems with resets, you're
simply not loading the bus that heavily.

It's all a moot point though - like Lon said, this is a Linux mailing
list, not a Tru64, HP-ux, or (god forbid) VMS list, so unless this
discussion is going somewhere productive we should probably stop wasting
bandwidth.  If you want to have a Unix pissing contest, we should do it
off list.

-steve



From RAWIPFEL at novell.com  Thu Sep  8 18:18:55 2005
From: RAWIPFEL at novell.com (Robert Wipfel)
Date: Thu, 08 Sep 2005 12:18:55 -0600
Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing
	cifs client reconnect?
In-Reply-To: <20050908071512.GC9222@neu.nirvana>
References: <20050907201537.GB3455@neu.nirvana>
	<20050907204321.GB5677@Favog.ubiqx.mn.org>
	<20050907205116.GA7459@neu.nirvana>
	<20050907211252.GC5677@Favog.ubiqx.mn.org>
	<20050908071512.GC9222@neu.nirvana>
Message-ID: <43202C52.9092.00CF.0@novell.com>

> A cifs client performs a largish copy operation. During that the
share
> is relocated to a different node. The copy operations should stall
> during the relocation and resume after 10-20 seconds.
 
Microsoft can't do this even with their own cluster server product and
CIFS client.
 
Recent versions of some applications like office have masked the
drive-letter reconnect internal to the application, but in general, any
client side open file handles are lost and have to be re-opened by the
client application (involving human intervention, e.g. save the file
again, or under the covers in a reconnect aware application). Consider
the problem for the client, after transport level reconnect to the
virtual IP address associated with the Samba service. Suppose the client
had an exclusive lock on a file. How can it be sure some other client
didn't gain the lock in the meantime? What should the application do
when it discovers the lock it once had on a connection is no longer
valid. The protocol and client side APIs weren't designed for dealing
with session level failover issues.
 
> Perhaps there are magic registry keys that can persuade Windows
> clients to do otherwise.
 
Fwiw, some (e.g. Novell) clients are designed to detect they've
connected to a clustered file server and optimize transport level
drive-letter reconnect (under the assumption the virtual IP will back
soon). Newer protocols like NFSv4 have provision for dealing with these
kinds of situations.


>>> Axel.Thimm at ATrpms.net 9/8/2005 1:15 am >>>
On Wed, Sep 07, 2005 at 04:12:52PM -0500, Christopher R. Hertel wrote:
> On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote:
> : :
> > > I just tested this.  On a W/XP box I browsed through some
directories on a 
> > > share served by Samba.  I then shut Samba down, and tried viewing
some 
> > > different subdirectories of the same share.  Windows coughed up
an error 
> > > dialog.  I then restarted Samba and Windows got happy again.  I
could 
> > > browse through all of the subdirectories in the share.
> > 
> > Yes, that does work, but what I wanted to setup is a transparent
> > failover, so that network I/O recovers w/o any manual interaction.
> >
> > I.e. I don't want to (soft) relocate the samba shares onto another
> > node due to load ballancing considerations and generate user
visible
> > I/O errors and failures on a dozen clients.
> 
> I guess I'm not really clear on what it is you're trying to
accomplish.
> Can you provide a little more description of what you'd like to see 
> happen, and what kinds of environments you expect?

A cifs client performs a largish copy operation. During that the share
is relocated to a different node. The copy operations should stall
during the relocation and resume after 10-20 seconds.

But if the cifs client does not perform a retry on smb/cifs protocol
level (on TCP level it will get a RST, it's the next level protocol
that needs to decide on retransmit the read/write request), then there
is nothing you can do server-side.

Perhaps there are magic registry keys that can persuade Windows
clients to do otherwise.
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050908/6684721f/attachment.htm>

From lhh at redhat.com  Thu Sep  8 18:24:28 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 08 Sep 2005 14:24:28 -0400
Subject: [Linux-cluster] Samba failover "impossible" due to missing
	cifs client reconnect?
In-Reply-To: <20050907201537.GB3455@neu.nirvana>
References: <20050907201537.GB3455@neu.nirvana>
Message-ID: <1126203868.30592.132.camel@ayanami.boston.redhat.com>

On Wed, 2005-09-07 at 22:15 +0200, Axel Thimm wrote:
> After having setup our workarounds for NFS we are very happy with how
> it's working. Now we're looking at Samba.
> 
> But we have quite a showstopper right at the beginning. The smb/cifs
> clients, be it smbclient or Windows XP, don't like their TCP stream
> being resetted and don't retry/reconnect (contrary to NFS).
> 
> It looks like the protocol has no considerations for retries above the
> TCP/IP level. So when the TCP stream is torn on the server's side due
> to relocation (either due to crash/fencing or soft) any client
> smb/cifs activity is broken at that time.
> 
> This means that any data transfer via smb/cifs shares during the
> relocation will fail, and there is nothing we can do on the server's
> side. Or is there?

Potentially not, in the past, we've always had clients reconnect.

-- Lon




From lhh at redhat.com  Thu Sep  8 18:25:34 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 08 Sep 2005 14:25:34 -0400
Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing
	cifs client reconnect?
In-Reply-To: <20050908071512.GC9222@neu.nirvana>
References: <20050907201537.GB3455@neu.nirvana>
	<20050907204321.GB5677@Favog.ubiqx.mn.org>
	<20050907205116.GA7459@neu.nirvana>
	<20050907211252.GC5677@Favog.ubiqx.mn.org>
	<20050908071512.GC9222@neu.nirvana>
Message-ID: <1126203934.30592.134.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-08 at 09:15 +0200, Axel Thimm wrote:
> On Wed, Sep 07, 2005 at 04:12:52PM -0500, Christopher R. Hertel wrote:
> > On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote:
> > : :
> > > > I just tested this.  On a W/XP box I browsed through some directories on a 
> > > > share served by Samba.  I then shut Samba down, and tried viewing some 
> > > > different subdirectories of the same share.  Windows coughed up an error 
> > > > dialog.  I then restarted Samba and Windows got happy again.  I could 
> > > > browse through all of the subdirectories in the share.
> > > 
> > > Yes, that does work, but what I wanted to setup is a transparent
> > > failover, so that network I/O recovers w/o any manual interaction.
> > >
> > > I.e. I don't want to (soft) relocate the samba shares onto another
> > > node due to load ballancing considerations and generate user visible
> > > I/O errors and failures on a dozen clients.
> > 
> > I guess I'm not really clear on what it is you're trying to accomplish.
> > Can you provide a little more description of what you'd like to see 
> > happen, and what kinds of environments you expect?
> 
> A cifs client performs a largish copy operation. During that the share
> is relocated to a different node. The copy operations should stall
> during the relocation and resume after 10-20 seconds.

Can't do it, as far as I know... SMB has too much internal state
information.

-- Lon




From crh at ubiqx.mn.org  Fri Sep  9 02:59:36 2005
From: crh at ubiqx.mn.org (Christopher R. Hertel)
Date: Thu, 08 Sep 2005 21:59:36 -0500
Subject: [Linux-cluster] Re: Samba failover "impossible" due to missing
	cifs	client reconnect?
In-Reply-To: <20050908071512.GC9222@neu.nirvana>
References: <20050907201537.GB3455@neu.nirvana>	<20050907204321.GB5677@Favog.ubiqx.mn.org>	<20050907205116.GA7459@neu.nirvana>	<20050907211252.GC5677@Favog.ubiqx.mn.org>
	<20050908071512.GC9222@neu.nirvana>
Message-ID: <4320FA98.8080403@ubiqx.mn.org>

Axel Thimm wrote:
> On Wed, Sep 07, 2005 at 04:12:52PM -0500, Christopher R. Hertel wrote:
> 
>>On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote:
>>: :
>>
>>>>I just tested this.  On a W/XP box I browsed through some directories on a 
>>>>share served by Samba.  I then shut Samba down, and tried viewing some 
>>>>different subdirectories of the same share.  Windows coughed up an error 
>>>>dialog.  I then restarted Samba and Windows got happy again.  I could 
>>>>browse through all of the subdirectories in the share.
>>>
>>>Yes, that does work, but what I wanted to setup is a transparent
>>>failover, so that network I/O recovers w/o any manual interaction.
>>>
>>>I.e. I don't want to (soft) relocate the samba shares onto another
>>>node due to load ballancing considerations and generate user visible
>>>I/O errors and failures on a dozen clients.
>>
>>I guess I'm not really clear on what it is you're trying to accomplish.
>>Can you provide a little more description of what you'd like to see 
>>happen, and what kinds of environments you expect?
> 
> 
> A cifs client performs a largish copy operation. During that the share
> is relocated to a different node. The copy operations should stall
> during the relocation and resume after 10-20 seconds.

Okay, now I have a clearer picture.

> But if the cifs client does not perform a retry on smb/cifs protocol
> level (on TCP level it will get a RST, it's the next level protocol
> that needs to decide on retransmit the read/write request), then there
> is nothing you can do server-side.

Yep...

> Perhaps there are magic registry keys that can persuade Windows
> clients to do otherwise.

Not likely.

Others on the list have already done a better job than I at working this 
through.  I can only add that I am not aware of anything in the protocol
itself that would handle retransmission.

Two things to condsider:

- The core of SMB is quite old and was not written to run on top of TCP.
   SMB had to deal with a variety of transport semantics.

- SMB was designed, originally, as a request/response protocol (client
   sends a request, server responds).  In theory, the client could re-send
   the original request if the TCP connection drops and is restarted...but
   how does the SMB client know that the first request did or didn't
   succeed?  The server might have finished the operation as the connection
   failed.

The solution, in general, is for SMB to report a failure and let the user
decide how to handle it.  (Eg.  Try saving your MS-Word doc to a different
drive or something.)

Chris -)-----

-- 
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org



From skulkin at mosinfo.ru  Fri Sep  9 07:21:27 2005
From: skulkin at mosinfo.ru (Skulkin Dmitry)
Date: Fri, 09 Sep 2005 11:21:27 +0400
Subject: [Linux-cluster] nodes don't see each other
Message-ID: <opswtqp1ea5s7lg4@scul1.msk.evp>

Hi,

I'm tring to make a simple two-node cluster with RHCS4, but nodes don't  
see each other. clustat on node alpha1 shows that alpha1 is online,  
clustat on node alpha2 shows that alpha2 is online, but no information  
about other node. ping alpha1 and ping alpha2 is ok, hostnames are alpha1  
and alpha2. For testing purposes I'm using manual fencing, but with  
unchecked "clean start" on "cluster properties" starting fenced is hang  
and in /var/log/messages:
Sep  9 10:30:57 alpha1 fenced[12327]: fencing node "alpha2"
Sep  9 10:30:57 alpha1 fenced[12327]: fence "alpha2" failed
Sep  9 10:31:02 alpha1 fenced[12327]: fencing node "alpha2"
Sep  9 10:31:02 alpha1 fenced[12327]: fence "alpha2" failed
Sep  9 10:31:07 alpha1 fenced[12327]: fencing node "alpha2"
Sep  9 10:31:07 alpha1 fenced[12327]: fence "alpha2" failed
Sep  9 10:31:12 alpha1 fenced[12327]: fencing node "alpha2"
Sep  9 10:31:12 alpha1 fenced[12327]: fence "alpha2" failed

and so on. I tried fence_ack_manual -n alpha1 and fence_ack_manual -n  
alpha2 on both nodes, but no result:

Warning:  If the node "alpha1" has not been manually fenced
(i.e. power cycled or disconnected from shared storage devices)
the GFS file system may become corrupted and all its data
unrecoverable!  Please verify that the node shown above has
been reset or disconnected from storage.

Are you certain you want to continue? [yN] y
can't open /tmp/fence_manual.fifo: No such file or directory

With checked "clean start" starting fenced is ok, but nodes don't see each  
other at all.
cluster.conf:
<?xml version="1.0"?>
<cluster config_version="5" name="alpha_cluster">
         <fence_daemon clean_start="1" post_fail_delay="0"  
post_join_delay="3"/>
         <clusternodes>
                 <clusternode name="alpha1" votes="1"/>
                 <clusternode name="alpha2" votes="1"/>
         </clusternodes>
         <cman expected_votes="1" two_node="1"/>
         <fencedevices>
                 <fencedevice agent="fence_manual" name="mf"/>
         </fencedevices>
         <rm>
                 <failoverdomains/>
                 <resources/>
         </rm>
</cluster>

Thanks for any help,
-- 
Best regards,
Dmitry Skulkin



From baesso at ksolutions.it  Fri Sep  9 11:37:16 2005
From: baesso at ksolutions.it (Baesso Mirko)
Date: Fri, 9 Sep 2005 13:37:16 +0200
Subject: [Linux-cluster] NFS load balancing on REDHAT cluster
Message-ID: <F63C63702B77F94CB4664B4D88660C4D4BAC3E@kmail.ksolutions.it>

Hi
we have to setup a Redhat cluster with two node on an attached shared
storadge system (Fibre Channel connection).
We would like to know if is possible to setup an NFS service clustered
with load balancing 
We have to use GFS file system for sharing storadge data on both node,
but have to setup GNBD also for exporting same file system to client
network?
We would like to see only one File server IP with only one file shared
list from client side.(see attch)
Thanks

 <<Cluster_NFS_GFS.gif>> 



Baesso Mirko - System Engineer
KSolutions.S.p.A.
Via Lenin 132/26
56017  S.Martino Ulmiano (PI) - Italy
tel.+ 39 0 50 898369 fax. + 39 0 50 861200
baesso at ksolutions.it   http//www.ksolutions.it


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050909/d31a5a33/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Cluster_NFS_GFS.gif
Type: image/gif
Size: 31750 bytes
Desc: Cluster_NFS_GFS.gif
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050909/d31a5a33/attachment.gif>

From lhh at redhat.com  Fri Sep  9 14:32:45 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 09 Sep 2005 10:32:45 -0400
Subject: [Linux-cluster] NFS load balancing on REDHAT cluster
In-Reply-To: <F63C63702B77F94CB4664B4D88660C4D4BAC3E@kmail.ksolutions.it>
References: <F63C63702B77F94CB4664B4D88660C4D4BAC3E@kmail.ksolutions.it>
Message-ID: <1126276365.16345.25.camel@ayanami.boston.redhat.com>

On Fri, 2005-09-09 at 13:37 +0200, Baesso Mirko wrote:
> Hi
> 
> we have to setup a Redhat cluster with two node on an attached shared
> storadge system (Fibre Channel connection).
> 
> We would like to know if is possible to setup an NFS service clustered
> with load balancing 

Personally, I haven't tried this, but here's my pseudo-educated guess...

It should be possible, but there may be some interesting issues with NFS
synchronization across the cluster (WRT statd, mountd synchronization).
I do not know if NFS will behave well as a load-balanced service.  In
theory, locking should work (because the NFS locks translate to GFS
locks, which would be cluster wide).  You'll probably want to
pre-populate /var/lib/nfs/rmtab with all the possible client entries on
each node.

You'll probably need to set up IPVS on a machine to do the load
balancing.  You can use piranha, one of the many other front ends to
IPVS, or just do it by hand.

You'll want to make sure that you group mountd+lockd+nfs(+portmap?)
ports together on the IPVS director so that client A always requests
everything from server B once the initial communication is established
(which would typically happen via portmapper or mountd).  Also, you
should probably use well-known ports for everything NFS/RPC related
instead of the portmapper, because there's a good chance that when using
the portmapper, the ports in use by mountd/lockd/nfsd/etc will be
different on each server - which would make it really difficult for IPVS
to correctly load balance it ;)


> We have to use GFS file system for sharing storadge data on both node,
> but have to setup GNBD also for exporting same file system to client
> network?

This shouldn't be necessary.  If you do a GNBD import on the clients,
the clients will need to be running GFS.  If you're doing an NFS export
from the servers, they can simply be running NFS.

-- Lon



From baesso at ksolutions.it  Fri Sep  9 15:10:46 2005
From: baesso at ksolutions.it (Baesso Mirko)
Date: Fri, 9 Sep 2005 17:10:46 +0200
Subject: R: [Linux-cluster] NFS load balancing on REDHAT cluster
Message-ID: <F63C63702B77F94CB4664B4D88660C4D4BAC45@kmail.ksolutions.it>

Thanks for suggests Lon 

Baesso 


-----Messaggio originale-----
Da: Lon Hohberger [mailto:lhh at redhat.com] 
Inviato: venerd? 9 settembre 2005 16.33
A: linux clustering
Oggetto: Re: [Linux-cluster] NFS load balancing on REDHAT cluster

On Fri, 2005-09-09 at 13:37 +0200, Baesso Mirko wrote:
> Hi
> 
> we have to setup a Redhat cluster with two node on an attached shared
> storadge system (Fibre Channel connection).
> 
> We would like to know if is possible to setup an NFS service clustered
> with load balancing 

Personally, I haven't tried this, but here's my pseudo-educated guess...

It should be possible, but there may be some interesting issues with NFS
synchronization across the cluster (WRT statd, mountd synchronization).
I do not know if NFS will behave well as a load-balanced service.  In
theory, locking should work (because the NFS locks translate to GFS
locks, which would be cluster wide).  You'll probably want to
pre-populate /var/lib/nfs/rmtab with all the possible client entries on
each node.

You'll probably need to set up IPVS on a machine to do the load
balancing.  You can use piranha, one of the many other front ends to
IPVS, or just do it by hand.

You'll want to make sure that you group mountd+lockd+nfs(+portmap?)
ports together on the IPVS director so that client A always requests
everything from server B once the initial communication is established
(which would typically happen via portmapper or mountd).  Also, you
should probably use well-known ports for everything NFS/RPC related
instead of the portmapper, because there's a good chance that when using
the portmapper, the ports in use by mountd/lockd/nfsd/etc will be
different on each server - which would make it really difficult for IPVS
to correctly load balance it ;)


> We have to use GFS file system for sharing storadge data on both node,
> but have to setup GNBD also for exporting same file system to client
> network?

This shouldn't be necessary.  If you do a GNBD import on the clients,
the clients will need to be running GFS.  If you're doing an NFS export
from the servers, they can simply be running NFS.

-- Lon

--
Linux-cluster mailing list
Linux-cluster at redhat.com
http://www.redhat.com/mailman/listinfo/linux-cluster





From cjk at techma.com  Fri Sep  9 15:16:28 2005
From: cjk at techma.com (Kovacs, Corey J.)
Date: Fri, 9 Sep 2005 11:16:28 -0400
Subject: [Linux-cluster] Debugging Fencing??
Message-ID: <EE32D921D7601547AD9CA5C87906C566095192@tmaemail.techma.com>

I have a 3 node cluster (RHEL3 + GFS 6.0.2.20-1) running but fencing will
not function correctly. I can call fence_ilo manually and fence a reboot
a node by hand but calling fence_node fails complaining about connection
errors which come from perl-Crypt-SSLeay. I've looked over my configs and 
things look ok to me. I've got other clusters using similar configs that 
work fine. I've looked through the fence_node code to find out whats going
on and it looks like I can somehow increase the verbosity of the fencing
operation so that it reports what the exact arguments are getting passed 
to fence_ilo. I've tried setting verbosity in lock_gulmd to ReallyALL but
there seems to be no effect with respect to additional logging of fenceing
operations. 

I'll keep looking at the code to find out how to enable extra logging but
if someone could point me in the right direction, that'd be great.

Any suggestions?


Thanks


Corey



From moya at infomed.sld.cu  Fri Sep  9 17:31:30 2005
From: moya at infomed.sld.cu (Maykel Moya)
Date: Fri, 09 Sep 2005 13:31:30 -0400
Subject: [Linux-cluster] Filesystem (GFS) availability
In-Reply-To: <1126113620.30592.49.camel@ayanami.boston.redhat.com>
References: <1126104683.15223.7.camel@julia.sld.cu>
	<1126113620.30592.49.camel@ayanami.boston.redhat.com>
Message-ID: <1126287090.30563.3.camel@julia.sld.cu>

El mi?, 07-09-2005 a las 13:20 -0400, Lon Hohberger escribi?:
> On Wed, 2005-09-07 at 10:51 -0400, Maykel Moya wrote:
> 
> >   <fence_devices>
> >     <device name="human" agent="fence_manual"/>
> >   </fence_devices>
> 
> Run fence_ack_manual on the surviving node.

Can it be automated?

> Better yet, stop using manual fencing

Don't know what to put in cluster.conf to make it anything but manual.

>  and buy a supported power switch
> off of eBay.  It will save you a lot of frustration :)

Well, servers in rack are connected to central UPS, so, there is no
option here :|

Regards,
maykel





From fsetinsek at techniscanmedical.com  Fri Sep  9 18:19:27 2005
From: fsetinsek at techniscanmedical.com (Frank L. Setinsek)
Date: Fri, 9 Sep 2005 12:19:27 -0600
Subject: [Linux-cluster] GFS Performance Problem
Message-ID: <200509091917.j89JHch5004403@mx1.redhat.com>

We have a 6 node cluster (RHEL3 + GFS 6.0.2-25),  we are running with a SLM
Embedded.  One on the nodes streams data to the GFS and the other nodes
process the data.  The problem is that it now takes more than twice as long
to acquire the data than when the node was standalone.  If we unmount the
GFS from all the nodes except the one acquiring the data and the node
running the lock manager--the acquisition takes the same amount of time as
the standalone configuration.  Sometimes there are seconds of inactivity on
the RAID when before there was none.
Any suggestions would be greatly appreciated.
 
Frank L. Setinsek
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050909/2110f11e/attachment.htm>

From arjan at infradead.org  Sat Sep 10 10:11:29 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Sat, 10 Sep 2005 12:11:29 +0200
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <20050905054348.GC11337@redhat.com>
References: <20050901104620.GA22482@redhat.com>
	<1125574523.5025.10.camel@laptopd505.fenrus.org>
	<20050905054348.GC11337@redhat.com>
Message-ID: <1126347089.3222.138.camel@laptopd505.fenrus.org>


> 
> You removed the comment stating exactly why, see below.  If that's not a
> accepted technique in the kernel, say so and I'll be happy to change it
> here and elsewhere.
> Thanks,
> Dave

entirely useless wrapping is not acceptable indeed.




From sanelson at gmail.com  Sat Sep 10 13:52:22 2005
From: sanelson at gmail.com (Steve Nelson)
Date: Sat, 10 Sep 2005 14:52:22 +0100
Subject: [Linux-cluster] CCA Partition Invisible from 2nd Node
Message-ID: <b6131fdc05091006525f04a171@mail.gmail.com>

Hello All,

I'm in the process of building an Oracle cluster on RHAS 3.0 // GFS
6.0 using 2 x GL580s, an MSA1000 and a DL380 for the quorum server,
but have encountered what seems to be a problem with the secondary
node seeing the CCA partition, and thus not being able to read the ccs
files, preventing me from starting ccsd on that node.

Here's the setup:

In the MSA 1000 I have created 4 LUNs:

=> controller serialnumber=P56350GX3R004S logicaldrive all show

MSA1000 at SGM0450039

  array A
   logicaldrive 1 (16.9 GB, RAID 1+0, OK)

  array B
   logicaldrive 2 (16.9 GB, RAID 1+0, OK)

  array C
   logicaldrive 3 (16.9 GB, RAID 1+0, OK)

  array D
   logicaldrive 4 (50.8 GB, RAID 5, OK)

My partitioning scheme is as follows:

/dev/sda1 - 100M (raw partitions used by cluster)
/dev/sdb1 - likewise

/dev/sda2 - 100M (CCA partition for GFS)
/dev/sdb2 - likewise

/dev/sda3
/dev/sdb3 - the rest - data partition

/dev/sdc
/dev/sdd - all - data partition.

I can see these partitions with fdisk on both nodes:

D
isk /dev/sda: 18.2 GB, 18207375360 bytes
/dev/sda1             1        13    104391   83  Linux
/dev/sda2            14        26    104422+  83  Linux
/dev/sda3            27      2213  17567077+  83  Linux
Disk /dev/sdb: 18.2 GB, 18207375360 bytes
/dev/sdb1             1        13    104391   83  Linux
/dev/sdb2            14        26    104422+  83  Linux
/dev/sdb3            27      2213  17567077+  83  Linux
Disk /dev/sdc: 18.2 GB, 18207375360 bytes
Disk /dev/sdd: 54.6 GB, 54622126080 bytes

My pool config files are as below:

# more *cfg
::::::::::::::
digex_cca.cfg
::::::::::::::
poolname digex_cca
subpools 1
subpool 0 0 2
pooldevice 0 0 /dev/sda2
pooldevice 0 1 /dev/sdb2
::::::::::::::
gfs0.cfg
::::::::::::::
poolname gfs0
subpools 1
subpool 0 128 2
pooldevice 0 0 /dev/sda3
pooldevice 0 1 /dev/sdb3
::::::::::::::
gfs1.cfg
::::::::::::::
poolname gfs1
subpools 1
subpool 0 128 2
pooldevice 0 0 /dev/sdc
pooldevice 0 1 /dev/sdd

Having run pool_assemble -a on both nodes, I wrote my ccs files, and
created the cluster archive.

On node 1 I see:

[root at primary]/etc/gfs#  pool_info 
Major Minor Name      Alias          Capacity  In use   MP Type   MP Stripe
254      65 digex_cca /dev/poolbn      417632     YES        none      
254      66 gfs0      /dev/poolbo    70268160      NO        none      
254      67 gfs1      /dev/poolbp    71122432      NO        none      

[root at primary]/etc/gfs# ls -l /dev/pool
total 0
brw-------    2 root     root     254,  65 Sep  9 16:53 digex_cca
brw-------    2 root     root     254,  66 Sep  9 16:53 gfs0
brw-------    2 root     root     254,  67 Sep  9 16:53 gfs1

But on node 2 I see:

[root at secondary]~# pool_info
Major Minor Name Alias          Capacity  In use   MP Type   MP Stripe
254      65 gfs1 /dev/poolbn    71122432      NO        none      

[root at secondary]~# ls -l /dev/pool
total 0
brw-------    2 root     root     254,  65 Sep  9 16:53 gfs1

Consequently when I try to restart ccsd on the secondary node, it
looks for the ccs files in the location specified in
/etc/sysconfig/gfs (which doesn't exist).

Notwithstanding the oddness of sdc and sdd being different sizes -
this can be re-organised - I am concerned that the second node can't
see the CCA partition, and am loath to simply copy the ccs files to
the local machine.

I also note that gfs1 as seen on node 2 has the same alias and
major/minor as cca, but the same dimensions as the gfs1 seen by node
1.  This suggests to me either a multipathing problem, or a
configuration error.

I am not happy to continue with gfs_mkfs on /dev/pool/gfs[01] at this
stage, and would like some advice on why I can't see/access the cca
partition.

I'd appreciate your thoughts and advice on how to continue!

Thanks a lot!

Steve Nelson



From wcheng at redhat.com  Sat Sep 10 15:18:06 2005
From: wcheng at redhat.com (Wendy Cheng)
Date: Sat, 10 Sep 2005 11:18:06 -0400
Subject: [Linux-cluster] CCA Partition Invisible from 2nd Node
In-Reply-To: <b6131fdc05091006525f04a171@mail.gmail.com>
References: <b6131fdc05091006525f04a171@mail.gmail.com>
Message-ID: <1126365487.3406.12.camel@localhost.localdomain>

On Sat, 2005-09-10 at 14:52 +0100, Steve Nelson wrote:
> Hello All,
> 
> I'm in the process of building an Oracle cluster on RHAS 3.0 // GFS
> 6.0 using 2 x GL580s, an MSA1000 and a DL380 for the quorum server,
> but have encountered what seems to be a problem with the secondary
> node seeing the CCA partition, and thus not being able to read the ccs
> files, preventing me from starting ccsd on that node.
> 
....

> I can see these partitions with fdisk on both nodes:
> 
> D
> isk /dev/sda: 18.2 GB, 18207375360 bytes
> /dev/sda1             1        13    104391   83  Linux
> /dev/sda2            14        26    104422+  83  Linux
> /dev/sda3            27      2213  17567077+  83  Linux
> Disk /dev/sdb: 18.2 GB, 18207375360 bytes
> /dev/sdb1             1        13    104391   83  Linux
> /dev/sdb2            14        26    104422+  83  Linux
> /dev/sdb3            27      2213  17567077+  83  Linux
> Disk /dev/sdc: 18.2 GB, 18207375360 bytes
> Disk /dev/sdd: 54.6 GB, 54622126080 bytes
> 

GFS obtains its disk information from /proc/partitions file - check this
proc file from your "secondary" node to see whether array A and B are
seen by Linux at all (and/or check the /var/log/dmesg file to be sure).
Did you reboot between checking with fdisk and pool_info command on
"secondary" ? 

-- Wendy




From sanelson at gmail.com  Sat Sep 10 15:26:33 2005
From: sanelson at gmail.com (Steve Nelson)
Date: Sat, 10 Sep 2005 16:26:33 +0100
Subject: [Linux-cluster] CCA Partition Invisible from 2nd Node
In-Reply-To: <1126365487.3406.12.camel@localhost.localdomain>
References: <b6131fdc05091006525f04a171@mail.gmail.com>
	<1126365487.3406.12.camel@localhost.localdomain>
Message-ID: <b6131fdc05091008266e813556@mail.gmail.com>

On 9/10/05, Wendy Cheng <wcheng at redhat.com> wrote:

> GFS obtains its disk information from /proc/partitions file - check this
> proc file from your "secondary" node to see whether array A and B are
> seen by Linux at all (and/or check the /var/log/dmesg file to be sure).

Yes - they are visible.

> Did you reboot between checking with fdisk and pool_info command on
> "secondary" ?

Nope :-)

However, I have subeseqently re-run pool_assemble -a on the secondary
node, and now I see the partition.

My question now is why this was necessary.  In order, these were my steps:

1) Write pool configs
2) On primary: pool_tool -c
3) On primary: pool_assemble -a
4) On secondary: pool_assemble -a
5) Write ccs files
6) On primary: ccs_tool create
7) On primary: restart ccsd - fine.
8) On secondary: restart ccsd - can't see cca partition.
9) <fx> Think and ask for help </fx>
10) On secondary: Re-run pool_assemble -a - now can see cca partition.
11) On secondary: Restart ccsd  - fine.

Is this abberant behaviour?  I am surprised I would need to run
pool_assemble twice, after creating the cluster archive.

Thanks for your help!

> -- Wendy

Steve



From wcheng at redhat.com  Sun Sep 11 03:52:40 2005
From: wcheng at redhat.com (Wendy Cheng)
Date: Sat, 10 Sep 2005 23:52:40 -0400
Subject: [Linux-cluster] CCA Partition Invisible from 2nd Node
In-Reply-To: <b6131fdc05091008266e813556@mail.gmail.com>
References: <b6131fdc05091006525f04a171@mail.gmail.com>
	<1126365487.3406.12.camel@localhost.localdomain>
	<b6131fdc05091008266e813556@mail.gmail.com>
Message-ID: <1126410760.3406.43.camel@localhost.localdomain>

On Sat, 2005-09-10 at 16:26 +0100, Steve Nelson wrote:

> However, I have subeseqently re-run pool_assemble -a on the secondary
> node, and now I see the partition.
> 
> My question now is why this was necessary.  In order, these were my steps:
> 
> 1) Write pool configs
> 2) On primary: pool_tool -c
> 3) On primary: pool_assemble -a
> 4) On secondary: pool_assemble -a
> 5) Write ccs files
> 6) On primary: ccs_tool create
> 7) On primary: restart ccsd - fine.
> 8) On secondary: restart ccsd - can't see cca partition.
> 9) <fx> Think and ask for help </fx>
> 10) On secondary: Re-run pool_assemble -a - now can see cca partition.
> 11) On secondary: Restart ccsd  - fine.
> 
> Is this abberant behaviour?  I am surprised I would need to run
> pool_assemble twice, after creating the cluster archive.

Just quickly browsed thru the pool code. Look to me that "pool_tool -
c" (create) does its write (to disk) without a sync flag (O_SYNC). So my
guess is that the primary node didn't have a chance to flush its data
into the disk before you issued "pool_assemble -a" in step 4. The disk
scan missed the pool info in your first try. By the time you did step
10, the flush in primary node (linux io is write-behind) had happened.

Just a guess but I'll talk to our developer to confirm. If it is true,
may be we can add a sync (code) to avoid this problem in the future.

BTW, Red Hat manual encourages people doing a "pool_tool -s" +
"pool_info" to make sure the node can see the pool before starting
ccsd.  

-- Wendy



From pcaulfie at redhat.com  Tue Sep 13 09:32:41 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 13 Sep 2005 10:32:41 +0100
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
References: <42F77AA3.80000@redhat.com>
	<42DB63F6.5070600@redhat.com>	<1122318870.12824.29.camel@localhost.localdomain>	<42EF4AD1.6010809@redhat.com>	<1123263949.16923.23.camel@localhost.localdomain>	<42F77AA3.80000@redhat.com>
	<5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
Message-ID: <43269CB9.3060005@redhat.com>

Guochun Shi wrote:
> Patrick,
> can you describe the steps changed for CVS version compared to those in usage.txt  in order to make gfs2 work? 
> 

Very briefly... More doc should be made available soon I hope.

ccsd
cman_tool join
modprobe dlm.ko
modprobe dlm_device.ko
modprobe lock_harness.ko
modprobe lock_dlm.ko
modprobe gfs.ko
modprobe sctp

groupd
dlm_controld
lock_dlmd
fenced
fence_tool join

Note that if you want to use clvmd you will need a patch to make it use libcman
rather than calling directly into the (now non-existant) kernel cman. See attached.

-- 

patrick
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clvmd-libcman.patch
Type: text/x-patch
Size: 13494 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050913/887d27e6/attachment.bin>

From gshi at ncsa.uiuc.edu  Tue Sep 13 19:59:04 2005
From: gshi at ncsa.uiuc.edu (Guochun Shi)
Date: Tue, 13 Sep 2005 14:59:04 -0500
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <43269CB9.3060005@redhat.com>
References: <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
	<42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com>
	<1122318870.12824.29.camel@localhost.localdomain>
	<42EF4AD1.6010809@redhat.com>
	<1123263949.16923.23.camel@localhost.localdomain>
	<42F77AA3.80000@redhat.com>
	<5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
Message-ID: <5.1.0.14.2.20050913145610.038735d8@pop.ncsa.uiuc.edu>

At 10:32 AM 9/13/2005 +0100, you wrote:
>Guochun Shi wrote:
>> Patrick,
>> can you describe the steps changed for CVS version compared to those in usage.txt  in order to make gfs2 work? 
>> 
>
>Very briefly... More doc should be made available soon I hope.
>
>ccsd
>cman_tool join
>modprobe dlm.ko
>modprobe dlm_device.ko
I don't find dlm_device module 


>modprobe lock_harness.ko
>modprobe lock_dlm.ko
>modprobe gfs.ko
>modprobe sctp
>
>groupd
>dlm_controld
>lock_dlmd
After I ran lock_dlmd,  "ps -aef|grep lock_dlmd" shows nothing 


>fenced
>fence_tool join
It hangs 

thanks
-Guochun



From gshi at ncsa.uiuc.edu  Tue Sep 13 22:46:01 2005
From: gshi at ncsa.uiuc.edu (Guochun Shi)
Date: Tue, 13 Sep 2005 17:46:01 -0500
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <5.1.0.14.2.20050913145610.038735d8@pop.ncsa.uiuc.edu>
References: <43269CB9.3060005@redhat.com>
	<5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
	<42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com>
	<1122318870.12824.29.camel@localhost.localdomain>
	<42EF4AD1.6010809@redhat.com>
	<1123263949.16923.23.camel@localhost.localdomain>
	<42F77AA3.80000@redhat.com>
	<5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
Message-ID: <5.1.0.14.2.20050913174251.0457be18@pop.ncsa.uiuc.edu>

At 02:59 PM 9/13/2005 -0500, you wrote:
>At 10:32 AM 9/13/2005 +0100, you wrote:
>>Guochun Shi wrote:
>>> Patrick,
>>> can you describe the steps changed for CVS version compared to those in usage.txt  in order to make gfs2 work? 
>>> 
>>
>>Very briefly... More doc should be made available soon I hope.
>>
>>ccsd
>>cman_tool join
>>modprobe dlm.ko
>>modprobe dlm_device.ko
>I don't find dlm_device module 
>
>
>>modprobe lock_harness.ko
>>modprobe lock_dlm.ko
>>modprobe gfs.ko
>>modprobe sctp
>>
>>groupd
>>dlm_controld
>>lock_dlmd
>After I ran lock_dlmd,  "ps -aef|grep lock_dlmd" shows nothing 
>
>
>>fenced
>>fence_tool join
>It hangs 

after talking with lon, it turned out I did not mount configfs in /config. After that, I can go through all steps except the last one

mount -t gfs /dev/sdb1 /mnt

it hangs out, /var/log/messages show

Sep 13 16:34:11 posic066 kernel: dlm: testfs: recover 1
Sep 13 16:34:11 posic066 kernel: dlm: testfs: add member 1
Sep 13 16:34:11 posic066 kernel: dlm: testfs: total members 1
Sep 13 16:34:11 posic066 kernel: dlm: testfs: dlm_recover_directory
Sep 13 16:34:11 posic066 kernel: dlm: testfs: dlm_recover_directory 0 entries
Sep 13 16:34:11 posic066 kernel: dlm: testfs: recover 1 done: 0 ms

any hint what I can do to diagnosis the problem? 

thanks
-Guochun





From teigland at redhat.com  Wed Sep 14 19:38:52 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 15 Sep 2005 03:38:52 +0800
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <5.1.0.14.2.20050913174251.0457be18@pop.ncsa.uiuc.edu>
References: <43269CB9.3060005@redhat.com>
	<5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
	<42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com>
	<1122318870.12824.29.camel@localhost.localdomain>
	<42EF4AD1.6010809@redhat.com>
	<1123263949.16923.23.camel@localhost.localdomain>
	<42F77AA3.80000@redhat.com>
	<5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
	<5.1.0.14.2.20050913174251.0457be18@pop.ncsa.uiuc.edu>
Message-ID: <20050914193852.GD922@redhat.com>

On Tue, Sep 13, 2005 at 05:46:01PM -0500, Guochun Shi wrote:

> >>> can you describe the steps changed for CVS version compared to those
> >>> in usage.txt  in order to make gfs2 work? 

> any hint what I can do to diagnosis the problem? 

For gfs2 it's gfs2.ko, gfs2_mkfs, and mount -t gfs2 ...

Dave



From dawson at fnal.gov  Wed Sep 14 19:39:15 2005
From: dawson at fnal.gov (Troy Dawson)
Date: Wed, 14 Sep 2005 14:39:15 -0500
Subject: [Linux-cluster] Switching Master and Slave using lock_gulm
Message-ID: <43287C63.10909@fnal.gov>

Hi,
I can't seem to find this answered anywhere, so if I've overlooked 
something, please point me in the right direction.

I have my GFS cluster running using lock_gulm.  I have 3 masters.  The 
one that I REALLY want to be the master came up last, so it is a slave. 
  Both the machine that is currently the master, and the machine I want 
to be the master should not be rebooted, or actually, loose their GFS 
file system ability (one is read/write, one is read only)

So the question is, is there a command in gulm_tools (or some other 
program) that will allow me to switch the master from one machine to the 
other without restarting any of the lock_gulm's.

Thanks
Troy
-- 
__________________________________________________
Troy Dawson  dawson at fnal.gov  (630)840-6468
Fermilab  ComputingDivision/CSS  CSI Group
__________________________________________________



From pcaulfie at redhat.com  Wed Sep 14 09:01:23 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 14 Sep 2005 10:01:23 +0100
Subject: [Linux-cluster] Re: GFS, what's remaining
In-Reply-To: <1125922894.8714.14.camel@localhost.localdomain>
References: <20050901104620.GA22482@redhat.com>	<20050903183241.1acca6c9.akpm@osdl.org>	<20050904030640.GL8684@ca-server1.us.oracle.com>	<200509040022.37102.phillips@istop.com>	<20050903214653.1b8a8cb7.akpm@osdl.org>	<20050904045821.GT8684@ca-server1.us.oracle.com>	<20050903224140.0442fac4.akpm@osdl.org>	<20050905043033.GB11337@redhat.com>	<20050905015408.21455e56.akpm@osdl.org>	<20050905092433.GE17607@redhat.com>	<20050905021948.6241f1e0.akpm@osdl.org>
	<1125922894.8714.14.camel@localhost.localdomain>
Message-ID: <4327E6E3.3050501@redhat.com>

I've just returned from holiday so I'm late to this discussion so let me tell
you what we do now and why and lets see what's wrong with it.

Currently the library create_lockspace() call returns an FD upon which all lock
operations happen. The FD is onto a misc device, one per lockspace, so if you
want lockspace protection it can happen at that level. There is no protection
applied to locks within a lockspace nor do I think it's helpful to do so to be
honest. Using a misc device limits you to <255 lockspaces depending on the other
uses of misc but this is just for userland-visible lockspace - it does not
affect GFS filesystems for instance.

Lock/convert/unlock operations are done using write calls on that lockspace FD.
Callbacks are implemented using poll and read on the FD, read will return data
blocks (one per callback) as long as there are active callbacks to process. The
current read functionality behaves more like a SOCK_PACKET than a data stream
which some may not like but then you're going to need to know what you're
reading from the device anyway.

ioctl/fcntl isn't really useful for DLM locks because you can't do asynchronous
operations on them - the lock has to succeed or fail in the one operation - if
you want a callback for completion (or blocking notification) you have to poll
the lockspace FD anyway and then you might as well go back to using read and
write because at least they are something of a matched pair. Something similar
applies, I think, to a syscall interface.

Another reason the existing fcntl interface isn't appropriate is that it's not
locking the same kind of thing. Current Unix fcntl calls lock byte ranges. DLM
locks arbitrary names and has a much richer list of lock modes. Adding another
fcntl just runs in the problems mentioned above.

The other reason we use read for callbacks is that there is information to be
passed back: lock status, value block and (possibly) query information.

While having an FD per lock sounds like a nice unixy idea I don't think it would
work very well in practice. Applications with hundreds or thousands of locks
(such as databases) would end up with huge pollfd structs to manage, and it
while it helps the refcounting (currently the nastiest bit of the current
dlm_device code) removes the possibility of having persistent locks that exist
after the process exits - a handy feature that some people do use, though I
don't think it's in the currently submitted DLM code. One FD per lock also gives
each lock two handles, the lock ID used internally by the DLM and the FD used
externally by the application which I think is a little confusing.

I don't think a dlmfs is useful, personally. The features you can export from it
are either minimal compared to the full DLM functionality (so you have to export
the rest by some other means anyway) or are going to be so un-filesystemlike as
to be very awkward to use. Doing lock operations in shell scripts is all very
cool but how often do you /really/ need to do that?

I'm not saying that what we have is perfect - far from it - but we have thought
about how this works and what we came up with seems like a good compromise
between providing full DLM functionality to userspace using unix features. But
we're very happy to listen to other ideas - and have been doing I hope.

-- 

patrick




From gshi at ncsa.uiuc.edu  Wed Sep 14 23:40:45 2005
From: gshi at ncsa.uiuc.edu (Guochun Shi)
Date: Wed, 14 Sep 2005 18:40:45 -0500
Subject: [Linux-cluster] Where to go with cman ?
In-Reply-To: <43269CB9.3060005@redhat.com>
References: <5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
	<42F77AA3.80000@redhat.com> <42DB63F6.5070600@redhat.com>
	<1122318870.12824.29.camel@localhost.localdomain>
	<42EF4AD1.6010809@redhat.com>
	<1123263949.16923.23.camel@localhost.localdomain>
	<42F77AA3.80000@redhat.com>
	<5.1.0.14.2.20050902192903.0431b638@pop.ncsa.uiuc.edu>
Message-ID: <5.1.0.14.2.20050914183154.0463c700@pop.ncsa.uiuc.edu>

At 10:32 AM 9/13/2005 +0100, you wrote:
>Guochun Shi wrote:
>> Patrick,
>> can you describe the steps changed for CVS version compared to those in usage.txt  in order to make gfs2 work? 
>> 
>
>Very briefly... More doc should be made available soon I hope.
>
>ccsd
>cman_tool join
>modprobe dlm.ko
>modprobe dlm_device.ko
>modprobe lock_harness.ko
>modprobe lock_dlm.ko
>modprobe gfs.ko
>modprobe sctp
>
>groupd
>dlm_controld
>lock_dlmd
>fenced
>fence_tool join
>
>Note that if you want to use clvmd you will need a patch to make it use libcman
>rather than calling directly into the (now non-existant) kernel cman. See attached.

thanks for the info, I  still cannot get it work in simple one node lock_dlm case.
It hanged when I tried to mount.
(but lock_nolock works for me)

I attached all steps I did, the cluster.conf file and the log from /var/log/messages.

thanks a lot
-Guochun

------------------------------------------------------------------------------------------------------------------
[root at posic066 cman_tool]# mount -t configfs configfs /config 
[root at posic066 cman_tool]# ccsd
[root at posic066 cman_tool]# cman_tool join -N 1
command line options may override cluster.conf values
[root at posic066 cman_tool]# modprobe dlm
[root at posic066 cman_tool]# modprobe lock_dlml 
FATAL: Module lock_dlml not found.
[root at posic066 cman_tool]# modprobe lock_dlm
[root at posic066 cman_tool]# modprobe gfs
[root at posic066 cman_tool]# modprobe sctp
[root at posic066 cman_tool]# lsmod 
Module                  Size  Used by
sctp                  163164  2 [unsafe]
ipv6                  263904  7 sctp
gfs                   296708  0 
lock_dlm               23544  0 
lock_harness            5544  2 gfs,lock_dlm
dlm                   100036  1 lock_dlm
configfs               26892  2 dlm
nfs                   218856  2 
lockd                  66056  2 nfs
sunrpc                155964  3 nfs,lockd
autofs                 16384  0 
e100                   41476  0 
mii                     5888  1 e100
qla2300               124800  0 
qla2xxx               120792  1 qla2300
scsi_transport_fc      29184  1 qla2xxx
parport_pc             28612  0 
parport                37448  1 parport_pc
[root at posic066 cman_tool]# groupd
[root at posic066 cman_tool]# dlm_controld 
[root at posic066 cman_tool]# lock_dlmd
[root at posic066 cman_tool]# fenced
[root at posic066 cman_tool]# fence_tool join
[root at posic066 cman_tool]# gfs_mkfs -p lock_dlm -t alpha:testfs -j 1 /dev/sdb1
This will destroy any data on /dev/sdb1.
  It appears to contain a GFS filesystem.

Are you sure you want to proceed? [y/n] yes

Device:                    /dev/sdb1
Blocksize:                 4096
Filesystem Size:           1975184
Journals:                  1
Resource Groups:           32
Locking Protocol:          lock_dlm
Lock Table:                alpha:testfs

Syncing...
All Done
[root at posic066 cman_tool]# mount -t gfs /dev/sdb1 /mnt

-----------------------------------------------------------------------------------------------------------------------------------------------



The cluster.conf file
--------------------------------------------------------------------------------------------------
<?xml version="1.0"?>
<cluster name="alpha" config_version="1">

<cman>
</cman>

<clusternodes>
<clusternode name="posic066">
        <fence>
                <method name="single">
                        <device name="human" nodename="posic066"/>
                </method>
        </fence>
</clusternode>

</clusternodes>

<fencedevices>
        <fencedevice name="human" agent="fence_manual"/>
</fencedevices>

</cluster>
--------------------------------------------------------------------------------------------------------------------------





-------------- next part --------------
A non-text attachment was scrubbed...
Name: gfstest.log.gz
Type: application/octet-stream
Size: 3924 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050914/ef820bd7/attachment.obj>

From amanthei at redhat.com  Thu Sep 15 03:39:08 2005
From: amanthei at redhat.com (Adam Manthei)
Date: Wed, 14 Sep 2005 22:39:08 -0500
Subject: [Linux-cluster] Switching Master and Slave using lock_gulm
In-Reply-To: <43287C63.10909@fnal.gov>
References: <43287C63.10909@fnal.gov>
Message-ID: <20050915033908.GL2190@redhat.com>

On Wed, Sep 14, 2005 at 02:39:15PM -0500, Troy Dawson wrote:
> So the question is, is there a command in gulm_tools (or some other 
> program) that will allow me to switch the master from one machine to the 
> other without restarting any of the lock_gulm's.

Not really.  You have to bring down the lock_gulmd processes in order to 
control the order.  You can try running `gulm_tool switchpending` on the 
master server.  That will force the master into the pending state, the only 
problem is that it might be the first one to recognize that there is no 
master and may come back as master again.  On the other hand, you may get
lucky and have the master switch to the machine that you want to be master.

-- 
Adam Manthei  <amanthei at redhat.com>



From piotr.kral at coig.katowice.pl  Thu Sep 15 16:29:14 2005
From: piotr.kral at coig.katowice.pl (Piotr Kral)
Date: Thu, 15 Sep 2005 18:29:14 +0200
Subject: [Linux-cluster] Cluster dilemas...
Message-ID: <4329A15A.50500@coig.katowice.pl>

	Hi
I have IBM Blade Center connected to IBM DS 4300 (former FASTt 600) disk 
system. The blades are diksless machines and they are connected to 
storage via fc switches, so I have 4 physical patchs to each disk. My 
dilemma is as follows: I wont to run Red Hat cluster suite on at least 
two servers and since I'm building systems from the beginning I'd rather 
use RHEL4 since it's newer, faster, more reliable (add more marketing 
bullshit here) etc. But HBA drivers that support multipath fail over are 
only in RHEL3 on RHEL4 there is something called dm + 
device-mapper-multipath but it is in beta stage :( So I'd like to ask 
You three things:
1. Is Red Hat cluster suite much different in RHEL3 then in RHEL4. 
Because if there is no big difference maybay it's not worth to fight 
with RHEL4 HBA drivers and just install RHEL3, and when RHEL4 will have 
proper drivers just upgrade?
2. Do You have any experience with "dm + device-mapper-multipath". To be 
sure this "beta stage" worries me a lot. But maybay You have some 
information when it will be in official release of RHEL4 (U2??), and 
most important: is it stable solution, and can it work in "production 
sewers"
3. What do You think of solution RHEL4 without multipath failover (so 
I'll see 4 paths, and in case of primary path failure at least one node 
will crush) + Red Hat cluster suite? I'm thinking off it in case 
problems with HBA drivers and some new functionality in Red Hat cluster 
suite in RHEL4

Kind Regards
Piotr



From danwest at comcast.net  Fri Sep 16 00:11:55 2005
From: danwest at comcast.net (danwest)
Date: Thu, 15 Sep 2005 20:11:55 -0400
Subject: [Linux-cluster] cluster.conf configuration command line
Message-ID: <1126829515.11500.8.camel@linux.site>

I was wondering if anyone knows if there are command line tools to
configure cluster.conf?  I see that redhat has a python gui as part of
their RHEL4 system-config-cluster package but I can't seem to find a
command line equivalent like the "redhat-config-cluster-cmd" that was
part of the RHEL3 RHCS.

Thanks
 - Daniel



From pcaulfie at redhat.com  Fri Sep 16 06:51:38 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 16 Sep 2005 07:51:38 +0100
Subject: [Linux-cluster] cluster.conf configuration command line
In-Reply-To: <1126829515.11500.8.camel@linux.site>
References: <1126829515.11500.8.camel@linux.site>
Message-ID: <432A6B7A.3000208@redhat.com>

danwest wrote:
> I was wondering if anyone knows if there are command line tools to
> configure cluster.conf?  I see that redhat has a python gui as part of
> their RHEL4 system-config-cluster package but I can't seem to find a
> command line equivalent like the "redhat-config-cluster-cmd" that was
> part of the RHEL3 RHCS.
> 

ccs_tool has some commands for manipulating the cluster.conf file. I'm not sure
if any packaging has picked them up yet but it's in CVS HEAD & STABLE.

-- 

patrick



From dawson at fnal.gov  Fri Sep 16 13:16:50 2005
From: dawson at fnal.gov (Troy Dawson)
Date: Fri, 16 Sep 2005 08:16:50 -0500
Subject: [Linux-cluster] Switching Master and Slave using lock_gulm
In-Reply-To: <20050915033908.GL2190@redhat.com>
References: <43287C63.10909@fnal.gov> <20050915033908.GL2190@redhat.com>
Message-ID: <432AC5C2.1090800@fnal.gov>

Adam Manthei wrote:
> On Wed, Sep 14, 2005 at 02:39:15PM -0500, Troy Dawson wrote:
> 
>>So the question is, is there a command in gulm_tools (or some other 
>>program) that will allow me to switch the master from one machine to the 
>>other without restarting any of the lock_gulm's.
> 
> 
> Not really.  You have to bring down the lock_gulmd processes in order to 
> control the order.  You can try running `gulm_tool switchpending` on the 
> master server.  That will force the master into the pending state, the only 
> problem is that it might be the first one to recognize that there is no 
> master and may come back as master again.  On the other hand, you may get
> lucky and have the master switch to the machine that you want to be master.
> 

Thanks, that worked.

The one I want to be master has a vote of 5 while the other only have a 
vote of 1, so that might have been it.
Then again, I might have just been lucky and the right one grabbed the 
master slot.
Either way, it worked.

Thanks Again,
Troy



From ocrete at max-t.com  Fri Sep 16 16:08:38 2005
From: ocrete at max-t.com (Olivier Crete)
Date: Fri, 16 Sep 2005 12:08:38 -0400
Subject: [Linux-cluster] zero vote node with cman
In-Reply-To: <1124466684.12024.58.camel@cocagne.max-t.internal>
References: <1124400750.12024.52.camel@cocagne.max-t.internal>
	<4305881E.5000106@redhat.com>
	<1124466684.12024.58.camel@cocagne.max-t.internal>
Message-ID: <1126886918.25404.6.camel@cocagne.max-t.internal>

On Fri, 2005-19-08 at 11:51 -0400, Olivier Crete wrote:
> On Fri, 2005-19-08 at 08:19 +0100, Patrick Caulfield wrote:
> > Olivier Crete wrote:
> > > I tried setting the votes to 0, but it seems that it wont let me do it..
> > > Is there another solution?
> > 
> > It seems to be a bug in cman_tool that's overriding the votes rather
> > over-enthusiastically.
> > 
> > This patch should fix:
> 
> Actually it doesnt.. it sets the default to 0... the attached patch
> seems to work better. 

The cluster.ng relaxng schema included in the system-config-cluster
package refuses zero votes, I've attached a patch that fixes that.


-- 
Olivier Cr?te
ocrete at max-t.com
Maximum Throughput Inc.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster-relaxng-zerovotes.patch
Type: text/x-patch
Size: 337 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050916/b5876a10/attachment.bin>

From jss at ast.cam.ac.uk  Fri Sep 16 09:38:46 2005
From: jss at ast.cam.ac.uk (Jeremy Sanders)
Date: Fri, 16 Sep 2005 10:38:46 +0100 (BST)
Subject: [Linux-cluster] gnbd root device
Message-ID: <Pine.LNX.4.63.0509161034180.13583@xpc20.ast.cam.ac.uk>

Hi -

We've been trying to use gnbd for the root devices of diskless linux 
systems booted from the network. We're not using gfs. gnbd is used as a 
plain block device.

We use an initrd to start up the networking and connect to the gnbd 
server.

This seems to work fairly well, except the reconnection doesn't appear to 
work if the server is rebooted. The client hangs indefinitely. gnbd is run 
from a ram disk.

Does anyone know any fundamental reason why this should be a problem?

Does anyone have a working setup like this?

Thanks very much!

Jeremy

PS please CC me as I'm not on the mailing list

-- 
Jeremy Sanders <jss at ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053



From sanelson at gmail.com  Sat Sep 17 19:55:29 2005
From: sanelson at gmail.com (Steve Nelson)
Date: Sat, 17 Sep 2005 20:55:29 +0100
Subject: [Linux-cluster] Dodgy Mounting
Message-ID: <b6131fdc050917125559a4fd9@mail.gmail.com>

Hello,

GFS6.0 // RHEL 3.0

Perfectly normal set-up - assembled pools and cluster archives, got
lock_gulmd working, and made mountpoints and entries in /etc/fstab

Mount /archive

# mount /archive
mount: wrong fs type, bad option, bad superblock on /dev/pool/gfs0,
       or too many mounted file systems

What's wrong?!

# for i in pool ccsd lock_gulmd; do service $i status; done
digex_cca is assembled
gfs0 is assembled
gfs1 is assembled
gfs2 is assembled
gfs3 is assembled
ccsd (pid 5587) is running...
lock_gulmd (pid 5632 5629 5626) is running...
gulm_master: bundlesmanagment is the master
Services:
LTPX
LT000

/etc/fstab looks like this:

/dev/pool/gfs0          /archive                gfs     defaults        1 2
/dev/pool/gfs1          /redo                  gfs     defaults        1 2
/dev/pool/gfs2          /data                  gfs     defaults        1 2
/dev/pool/gfs3          /backups               gfs     defaults        1 2

and mount shows this:
# mount
/dev/cciss/c0d0p3 on / type ext3 (rw)
none on /proc type proc (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/cciss/c0d0p1 on /boot type ext3 (rw)
/dev/cciss/c0d0p7 on /local type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/cciss/c0d0p6 on /tmp type ext3 (rw)
/dev/cciss/c0d0p5 on /var type ext3 (rw)

Any ideas?  What's going on?

Thanks!

Steve



From sanelson at gmail.com  Sat Sep 17 20:09:33 2005
From: sanelson at gmail.com (Steve Nelson)
Date: Sat, 17 Sep 2005 21:09:33 +0100
Subject: [Linux-cluster] Re: Dodgy Mounting
In-Reply-To: <b6131fdc050917125559a4fd9@mail.gmail.com>
References: <b6131fdc050917125559a4fd9@mail.gmail.com>
Message-ID: <b6131fdc05091713095d4f8879@mail.gmail.com>

On 9/17/05, Steve Nelson <sanelson at gmail.com> wrote:
> Hello,
> 
> GFS6.0 // RHEL 3.0
> 
> Perfectly normal set-up - assembled pools and cluster archives, got
> lock_gulmd working, and made mountpoints and entries in /etc/fstab
> 
> Mount /archive
> 
> # mount /archive
> mount: wrong fs type, bad option, bad superblock on /dev/pool/gfs0,
>        or too many mounted file systems

Ok - so this looks like a generic error - looking further, dmesg shows:

GFS: can't mount proto = lock_gulm, table = digex:gfs3, hostdata = 
lock_gulm: ERROR Core returned error 1003:Bad Cluster ID.
lock_gulm: ERROR cm_login failed. 1003
lock_gulm: ERROR Got a 1003 trying to start the threads.
lock_gulm: fsid=digex:gfs0: Exiting gulm_mount with errors 1003

etc for various tables.

> Steve

S.



From hardyjm at potsdam.edu  Mon Sep 19 13:26:01 2005
From: hardyjm at potsdam.edu (Jeff Hardy)
Date: Mon, 19 Sep 2005 09:26:01 -0400
Subject: [Linux-cluster] Basic Scalability Questions
Message-ID: <1127136362.23044.138.camel@fritzdesk.potsdam.edu>

Hello,

I have noted that LVM2 on a 2.6 kernel (in this case Fedora Core 4), has
no limit to the maximum number of logical volumes, physical volumes, or
physical extents in a particular volume group.  Is this also then the
case for CLVM?

Also, I have a couple of boxes sharing ATAoE storage right now in a
two-node cluster configuration.  Everything looks good.  We are not
using GFS as of yet, and do not have immediate plans to do so, so I have
been working with XFS and other filesystems in a couple of LVs with each
box only mounting one LV at a time (obviously).  In the interest of
seeing what could happen, I have purposely shutdown all the cluster
services on both boxes, resized volumes, done write tests and not seen
any filesystem corruption.  Have I just been incredibly lucky to avoid
particular race conditions?

Thanks very much.

-- 
Jeff Hardy
Systems Analyst
hardyjm at potsdam.edu



From mwill at penguincomputing.com  Mon Sep 19 17:47:04 2005
From: mwill at penguincomputing.com (Michael Will)
Date: Mon, 19 Sep 2005 10:47:04 -0700
Subject: [Linux-cluster] Basic Scalability Questions
In-Reply-To: <1127136362.23044.138.camel@fritzdesk.potsdam.edu>
References: <1127136362.23044.138.camel@fritzdesk.potsdam.edu>
Message-ID: <432EF998.7080301@penguincomputing.com>

If you get any personal replies that don't go back to the list,
I would be very interested in the result.

How far along are those projects like CLVM and others listed
on http://sources.redhat.com/cluster/ ?

Michael

Jeff Hardy wrote:

>Hello,
>
>I have noted that LVM2 on a 2.6 kernel (in this case Fedora Core 4), has
>no limit to the maximum number of logical volumes, physical volumes, or
>physical extents in a particular volume group.  Is this also then the
>case for CLVM?
>
>Also, I have a couple of boxes sharing ATAoE storage right now in a
>two-node cluster configuration.  Everything looks good.  We are not
>using GFS as of yet, and do not have immediate plans to do so, so I have
>been working with XFS and other filesystems in a couple of LVs with each
>box only mounting one LV at a time (obviously).  In the interest of
>seeing what could happen, I have purposely shutdown all the cluster
>services on both boxes, resized volumes, done write tests and not seen
>any filesystem corruption.  Have I just been incredibly lucky to avoid
>particular race conditions?
>
>Thanks very much.
>
>  
>


-- 
Michael Will
Penguin Computing Corp.
Sales Engineer
415-954-2822
415-954-2899 fx
mwill at penguincomputing.com 



From lhh at redhat.com  Mon Sep 19 19:27:51 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 19 Sep 2005 15:27:51 -0400
Subject: [Linux-cluster] Basic Scalability Questions
In-Reply-To: <432EF998.7080301@penguincomputing.com>
References: <1127136362.23044.138.camel@fritzdesk.potsdam.edu>
	<432EF998.7080301@penguincomputing.com>
Message-ID: <1127158071.3696.88.camel@ayanami.boston.redhat.com>

On Mon, 2005-09-19 at 10:47 -0700, Michael Will wrote:
> If you get any personal replies that don't go back to the list,
> I would be very interested in the result.
> 
> How far along are those projects like CLVM and others listed
> on http://sources.redhat.com/cluster/ ?

Most of them are productized by Red Hat already.

Get the STABLE branch, not head though =)

-- Lon




From hardyjm at potsdam.edu  Tue Sep 20 12:18:59 2005
From: hardyjm at potsdam.edu (Jeff Hardy)
Date: Tue, 20 Sep 2005 08:18:59 -0400
Subject: [Linux-cluster] Basic Scalability Questions
In-Reply-To: <1127158071.3696.88.camel@ayanami.boston.redhat.com>
References: <1127136362.23044.138.camel@fritzdesk.potsdam.edu>
	<432EF998.7080301@penguincomputing.com>
	<1127158071.3696.88.camel@ayanami.boston.redhat.com>
Message-ID: <1127218739.23044.180.camel@fritzdesk.potsdam.edu>

I've been using the FC4 RPMs and all looks good.



On Mon, 2005-09-19 at 15:27 -0400, Lon Hohberger wrote:
> On Mon, 2005-09-19 at 10:47 -0700, Michael Will wrote:
> > If you get any personal replies that don't go back to the list,
> > I would be very interested in the result.
> > 
> > How far along are those projects like CLVM and others listed
> > on http://sources.redhat.com/cluster/ ?
> 
> Most of them are productized by Red Hat already.
> 
> Get the STABLE branch, not head though =)
> 
> -- Lon
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From ggilyeat at jhsph.edu  Tue Sep 20 18:15:39 2005
From: ggilyeat at jhsph.edu (Gerald G. Gilyeat)
Date: Tue, 20 Sep 2005 14:15:39 -0400
Subject: [Linux-cluster] Question...
Message-ID: <DF33F4DAC09B3048AA095F1B5C368915A57FEF@XCH-VN02.sph.ad.jhsph.edu>

We just had a GFS client node crash (and it took one of my compute clusters with it, but I can deal with that) with the following message in /var/log/messages:
Sep 20 13:50:12 front-0 kernel: 
Sep 20 13:50:12 front-0 kernel: GFS: Assertion failed on line 200 of file trans.c
Sep 20 13:50:12 front-0 kernel: GFS: assertion: "!gfs_get_transaction(sdp)"
Sep 20 13:50:12 front-0 kernel: GFS: time = 1127238612
Sep 20 13:50:12 front-0 kernel: GFS: fsid=hopkins:bst.2
Sep 20 13:50:12 front-0 kernel: 
Sep 20 13:50:12 front-0 kernel: Kernel panic: GFS: Record message above and reboot.

The other GFS clients and the server are fine.
Any chance someone could give me an idea on why there'd be a failure here, so I can have a better idea on what to tune on the system?

We recently (ie. yesterday) bumped the number of NFS server processes on this machine from 8 to 24, if that will help...

Thanks.

--
Jerry Gilyeat, RHCE
Systems Administrator
Molecular Microbiology and Immunology
Johns Hopkins Bloomberg School of Public Health
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050920/d905d0ae/attachment.htm>

From bmarzins at redhat.com  Tue Sep 20 20:26:35 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Tue, 20 Sep 2005 15:26:35 -0500
Subject: [Linux-cluster] Question...
In-Reply-To: <DF33F4DAC09B3048AA095F1B5C368915A57FEF@XCH-VN02.sph.ad.jhsph.edu>
References: <DF33F4DAC09B3048AA095F1B5C368915A57FEF@XCH-VN02.sph.ad.jhsph.edu>
Message-ID: <20050920202634.GB25146@phlogiston.msp.redhat.com>

You didn't perhaps get a stack trace along with that message, did you?
That would go a long way in figuring out what exactly went wrong. But here's
a wild stab in the dark.  Do you know if a suid root file was being copied to
your gfs file system?  That has caused a similar error on other versions of
gfs (although not with nfs).

-Ben

On Tue, Sep 20, 2005 at 02:15:39PM -0400, Gerald G. Gilyeat wrote:
>    We just had a GFS client node crash (and it took one of my compute
>    clusters with it, but I can deal with that) with the following message in
>    /var/log/messages:
>    Sep 20 13:50:12 front-0 kernel:
>    Sep 20 13:50:12 front-0 kernel: GFS: Assertion failed on line 200 of file
>    trans.c
>    Sep 20 13:50:12 front-0 kernel: GFS: assertion:
>    "!gfs_get_transaction(sdp)"
>    Sep 20 13:50:12 front-0 kernel: GFS: time = 1127238612
>    Sep 20 13:50:12 front-0 kernel: GFS: fsid=hopkins:bst.2
>    Sep 20 13:50:12 front-0 kernel:
>    Sep 20 13:50:12 front-0 kernel: Kernel panic: GFS: Record message above
>    and reboot.
> 
>    The other GFS clients and the server are fine.
>    Any chance someone could give me an idea on why there'd be a failure here,
>    so I can have a better idea on what to tune on the system?
> 
>    We recently (ie. yesterday) bumped the number of NFS server processes on
>    this machine from 8 to 24, if that will help...
> 
>    Thanks.
> 
>    --
>    Jerry Gilyeat, RHCE
>    Systems Administrator
>    Molecular Microbiology and Immunology
>    Johns Hopkins Bloomberg School of Public Health

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From ggilyeat at jhsph.edu  Wed Sep 21 13:35:18 2005
From: ggilyeat at jhsph.edu (Gerald G. Gilyeat)
Date: Wed, 21 Sep 2005 09:35:18 -0400
Subject: [Linux-cluster] Question...
Message-ID: <DF33F4DAC09B3048AA095F1B5C368915A57FF6@XCH-VN02.sph.ad.jhsph.edu>


Nope, no stack trace was produced that I was able to see.
And yeah, I would have preferred to have one, myself.

I'll check into the suid thing. It's entirely possible one of my users did something like that, since many of them have root on their desktop linux machines.

--
Jerry Gilyeat, RHCE
Systems Administrator
Molecular Microbiology and Immunology
Johns Hopkins Bloomberg School of Public Health



-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of Benjamin Marzinski
Sent: Tue 9/20/2005 4:26 PM
To: linux clustering
Subject: Re: [Linux-cluster] Question...
 
You didn't perhaps get a stack trace along with that message, did you?
That would go a long way in figuring out what exactly went wrong. But here's
a wild stab in the dark.  Do you know if a suid root file was being copied to
your gfs file system?  That has caused a similar error on other versions of
gfs (although not with nfs).

-Ben

On Tue, Sep 20, 2005 at 02:15:39PM -0400, Gerald G. Gilyeat wrote:
>    We just had a GFS client node crash (and it took one of my compute
>    clusters with it, but I can deal with that) with the following message in
>    /var/log/messages:
>    Sep 20 13:50:12 front-0 kernel:
>    Sep 20 13:50:12 front-0 kernel: GFS: Assertion failed on line 200 of file
>    trans.c
>    Sep 20 13:50:12 front-0 kernel: GFS: assertion:
>    "!gfs_get_transaction(sdp)"
>    Sep 20 13:50:12 front-0 kernel: GFS: time = 1127238612
>    Sep 20 13:50:12 front-0 kernel: GFS: fsid=hopkins:bst.2
>    Sep 20 13:50:12 front-0 kernel:
>    Sep 20 13:50:12 front-0 kernel: Kernel panic: GFS: Record message above
>    and reboot.
> 
>    The other GFS clients and the server are fine.
>    Any chance someone could give me an idea on why there'd be a failure here,
>    so I can have a better idea on what to tune on the system?
> 
>    We recently (ie. yesterday) bumped the number of NFS server processes on
>    this machine from 8 to 24, if that will help...
> 
>    Thanks.
> 
>    --
>    Jerry Gilyeat, RHCE
>    Systems Administrator
>    Molecular Microbiology and Immunology
>    Johns Hopkins Bloomberg School of Public Health

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 3692 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050921/6d59b6ff/attachment.bin>

From haydar2906 at hotmail.com  Wed Sep 21 15:03:57 2005
From: haydar2906 at hotmail.com (Abbes Bettahar)
Date: Wed, 21 Sep 2005 11:03:57 -0400
Subject: [Linux-cluster] Clustered NFS problem 
Message-ID: <BAY104-F15BA538C238798FC4B6253C5940@phx.gbl>

Hi,

We have 2 servers HP Proliant 380 G3 (RedHat Advanced Server 3) attached
by fiber optic to the storage area network SAN HP MSA1000 and we want to  
install and configure The RedHat Cluster Suite.

I setuped and configured a clustered NFS on the 2 servers RAC1 and RACGFS.

clumanager-1.2.26.1-1
redhat-config-cluster-1.0.7-1

I have created 2 quorum partitions /dev/sdd2 and /dev/sdd3  (100MB each).

I created another huge partition /dev/sdd4 (over 600GB) and formatted it in 
ext3 file system.

I installed the cluster suite on the 1st node (RAC1) and 2nd node RACGFS and 
I started the rawdevices on the two nodes RAC1 and RACGFS (it's OK).

This the hosts file /etc/host on the node1 (RAC1) and node2 RACGFS

Do not remove the following line, or various programs
# that require network functionality will fail.
#127.0.0.1 rac1 localhost.localdomain localhost
127.0.0.1              localhost.localdomain localhost
#
# Private hostnames
#
192.168.253.3           rac1.project.net     rac1
192.168.253.4           rac2.project.net     rac2
192.168.253.10          racgfs.project.net     racgfs
192.168.253.20          raclu_nfs.project.net   raclu_nfs
#
# Hostnames used for Interconnect
#
1.1.1.1                 rac1i.project.net    rac1i
1.1.1.2                 rac2i.project.net    rac2i
1.1.1.3                 racgfsi.project.net    racgfsi
#
192.168.253.5           infra.project.net       infra
192.168.253.7 ractest.project.net     ractest
#

I generated a /etc/cluster.xml on the 1st node RAC1 and the 2nd node RACGFS.

<?xml version="1.0"?>
<cluconfig version="3.0">
  <clumembd broadcast="no" interval="750000" loglevel="5" multicast="yes" 
multicast_ipaddress="225.0.0.11" thread="yes" tko_count="20"/>
  <cluquorumd loglevel="5" pinginterval="1" tiebreaker_ip=""/>
  <clurmtabd loglevel="5" pollinterval="4"/>
  <clusvcmgrd loglevel="5" use_netlink="yes"/>
  <clulockd loglevel="5"/>
  <cluster config_viewnumber="24" key="978dcd78e05c5961cf1aaaa03b41209b" 
name="cisn"/>
  <sharedstate driver="libsharedraw.so" rawprimary="/dev/raw/raw1" 
rawshadow="/dev/raw/raw2" type="raw"/>
  <members>
    <member id="0" name="192.168.253.3" watchdog="no"/>
    <member id="1" name="192.168.253.10" watchdog="no"/>
  </members>
  <services>
    <service checkinterval="5" failoverdomain="cisncluster" id="0" 
maxfalsestarts="0" maxrestarts="0" name="nfs_cisn" userscript="None">
      <service_ipaddresses>
        <service_ipaddress broadcast="None" id="0" 
ipaddress="192.168.253.20" monitor_link="0" netmask="255.255.255.0"/>
      </service_ipaddresses>
      <device id="0" name="/dev/sdd4">
        <mount forceunmount="yes" mountpoint="/u04"/>
        <nfsexport id="0" name="/u04">
          <client id="0" name="*" options="rw"/>
        </nfsexport>
      </device>
    </service>
  </services>
  <failoverdomains>
    <failoverdomain id="0" name="cisncluster" ordered="yes" restricted="no">
      <failoverdomainnode id="0" name="192.168.253.3"/>
      <failoverdomainnode id="1" name="192.168.253.10"/>
    </failoverdomain>
  </failoverdomains>
</cluconfig>

I created a NFS share on /u04 (mount on /dev/sdd4) using the Cluster GUI 
manager on RAC1.
I launched on the 2 nodes Rac1 and RACgfs the following command:
service clumanager start

I checked the result on the 2 nodes, on RAC1:

clustat  results :

Cluster Status - project                                                  
09:04:34
Cluster Quorum Incarnation #1
Shared State: Shared Raw Device Driver v1.2

  Member             Status
  ------------------ ----------
  192.168.253.3      Active     <-- You are here
  192.168.253.10     Active

  Service        Status   Owner (Last)     Last Transition Chk Restarts
  -------------- -------- ---------------- --------------- --- --------
  nfs_cisn       started  192.168.253.3    09:07:59 Sep 21   5        0


on RacGfs: clustat  results :

Cluster Status - cisn                                                  
09:07:39
Cluster Quorum Incarnation #3
Shared State: Shared Raw Device Driver v1.2

  Member             Status
  ------------------ ----------
  192.168.253.3      Active
  192.168.253.10     Active     <-- You are here

  Service        Status   Owner (Last)     Last Transition Chk Restarts
  -------------- -------- ---------------- --------------- --- --------
  nfs_cisn       started  192.168.253.3    09:07:59 Sep 21   5        0



When I launched ifconfig on RAC1, we saw that the service IP address 
192.168.253.20 is generated on eth2:0.

And I launched on  other servers the following command:
mount ?t nfs 192.168.253.20:/u04 /u04

And all are OK, I can list the /u04 content from any server.

But my only problem is:

When I want to try a test if the clustered NFS will work fine, I rebooted 
RAC1 frequently and RACGFS continue to work as the failover server and when 
I launched ifconfig on RACGFS, we saw that the service IP address 
192.168.253.20 is generated on eth0:0 .
We can list /u04 content (clustered NFS mount) on the other servers after 
few seconds of RAC1 rebooting:

But after many reboots, I expect a big problem, the both cluster node 
servers  cannot obtain the service IP address 192.168.253.20 when I launch 
ifconfig on the both nodes.

On Rac1:

eth0      Link encap:Ethernet  HWaddr 00:0B:CD:EF:2B:C1
          inet addr:1.1.1.1  Bcast:1.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:89170 errors:0 dropped:0 overruns:0 frame:0
          TX packets:87405 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17288193 (16.4 Mb)  TX bytes:14452757 (13.7 Mb)
          Interrupt:15

eth2      Link encap:Ethernet  HWaddr 00:0B:CD:FF:44:02
          inet addr:192.168.253.3  Bcast:192.168.253.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1349991 errors:0 dropped:0 overruns:0 frame:0
          TX packets:435450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1592635536 (1518.8 Mb)  TX bytes:162026101 (154.5 Mb)
          Interrupt:7

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1001181 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1001181 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:76097441 (72.5 Mb)  TX bytes:76097441 (72.5 Mb)

On RACGFS:

eth0      Link encap:Ethernet  HWaddr 00:14:38:50:D3:E4
          inet addr:192.168.253.10  Bcast:192.168.253.255  
Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:211223 errors:0 dropped:0 overruns:0 frame:0
          TX packets:160026 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:14917480 (14.2 Mb)  TX bytes:13886063 (13.2 Mb)
          Interrupt:25

eth1      Link encap:Ethernet  HWaddr 00:14:38:50:D3:E3
          inet addr:1.1.1.3  Bcast:1.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
          Interrupt:26

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:184529 errors:0 dropped:0 overruns:0 frame:0
          TX packets:184529 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:10971489 (10.4 Mb)  TX bytes:10971489 (10.4 Mb)

I tried many commands, I stopped the cluster services On both nodes and 
restart it but unfortunately it doesn?t work and we cannot obtain the 
clustered NFS mount.


Have you any idea to fix this problem?

Thanks for your replies and help

Abbes Bettahar
514-296-0756




From Alain.Moulle at bull.net  Thu Sep 22 06:15:42 2005
From: Alain.Moulle at bull.net (Alain Moulle)
Date: Thu, 22 Sep 2005 08:15:42 +0200
Subject: [Linux-cluster] Some questions about heart-beat under Cluster Suite
	4
Message-ID: <43324C0E.8010005@bull.net>

Hi everybody

Some questions about heart-beat under Cluster Suite 4 :

1. how is choosen the eth interface under CS4 ?
   if we have eth0 and eth1, it seems that HearBeat
   goes through eth0 and we don't have the possibility
   to configure this in CS4 , right ?

2. does that mean also that if eth0 fails, the CS4
   automatically goes through eth1 ?

3. do I miss the way to configure this via GUI ?

Thanks a lot
Alain
-- 
mailto:Alain.Moulle at bull.net




From andreseso at gmail.com  Thu Sep 22 08:17:13 2005
From: andreseso at gmail.com (Andreso)
Date: Thu, 22 Sep 2005 10:17:13 +0200
Subject: [Linux-cluster] How do I use a cross over cable to set up quorum?
Message-ID: <1a9f416905092201175c6eb118@mail.gmail.com>

Hello,

I am trying to set up a cluster with clumanager-1.2.3-1 and
clumanager-1.2.3-1 and my boss desires a working quorum where the two
members of the cluster use a cross over cable to interchange the
necessary information for quorum between the two cluster members.  The
reason is that the company uses faulty switches so he considers using
the main network interfaces of the servers not acceptable.

I would like to know how I can obtain cluster quorum using the cross
over cable.  I have succesfully set up the cluster using the main
network interface but I have failed miserably setting up the cluster
quorum over the cross over cable network interface.  I believe I have
a routing problem.

The route -n information for the cross over cable interface is
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth1

I am far from being a linux expert and in some things like clusters I
am a newbie and I am not particularily strong on networking.  For
example until I started setting up the cluster I had never heard of
the concepts tiebreaker IP or Multicast IP address.

Any help would be appreciated.

If somebody has managed to make this work could they please post their
cluster.xml file?



From Axel.Thimm at ATrpms.net  Thu Sep 22 08:32:37 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Thu, 22 Sep 2005 10:32:37 +0200
Subject: [Linux-cluster] GFS breaking POSIX exhibited by Samba?
Message-ID: <20050922083237.GD3466@neu.nirvana>

Hi,

I've been stuggling with a strange bug in Samba which required me to
have some of the tdb files with permissions 0666 to allow Samba to
work.

The Samba metadata (locking and connection tables etc) are placed on
GFS to allow for easier relocation of the Samba services ("poor man's
clustered samba").

The problem is that Samba opens some files as root, then drops
priviledges and finally accesses these files assuming that the root
access rights are still in order. This does not work under GFS, but
under any other local fs.

The Samba developers claim that this is POSIX compliant and that GFS
is not following POSIX in this matter.

Is this true? Does POSIX require the fds to not change access
priviledges even when setuiding to another user?

If so, why doesn't GFS respect this? A bug or a feature? If the former
I'll go and bugzilla it. If the latter, can there be a fix for the
RHEL4 branch?

Thanks!
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050922/ccf84a2c/attachment.sig>

From andreseso at gmail.com  Thu Sep 22 09:44:17 2005
From: andreseso at gmail.com (Andreso)
Date: Thu, 22 Sep 2005 11:44:17 +0200
Subject: [Linux-cluster] Re: How do I use a cross over cable to set up
	quorum?
In-Reply-To: <1a9f416905092201175c6eb118@mail.gmail.com>
References: <1a9f416905092201175c6eb118@mail.gmail.com>
Message-ID: <1a9f41690509220244da6fed5@mail.gmail.com>

Sorry, I forgot to mention that I am running Centos 3.5 which is
equivalent to RHEL AS 3.5.

A working cluster.xml using the main network interface is attached

Andres

On 9/22/05, Andreso <andreseso at gmail.com> wrote:
> Hello,
>
> I am trying to set up a cluster with clumanager-1.2.3-1 and
> clumanager-1.2.3-1 and my boss desires a working quorum where the two
> members of the cluster use a cross over cable to interchange the
> necessary information for quorum between the two cluster members.  The
> reason is that the company uses faulty switches so he considers using
> the main network interfaces of the servers not acceptable.
>
> I would like to know how I can obtain cluster quorum using the cross
> over cable.  I have succesfully set up the cluster using the main
> network interface but I have failed miserably setting up the cluster
> quorum over the cross over cable network interface.  I believe I have
> a routing problem.
>
> The route -n information for the cross over cable interface is
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth1
>
> I am far from being a linux expert and in some things like clusters I
> am a newbie and I am not particularily strong on networking.  For
> example until I started setting up the cluster I had never heard of
> the concepts tiebreaker IP or Multicast IP address.
>
> Any help would be appreciated.
>
> If somebody has managed to make this work could they please post their
> cluster.xml file?
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.xml-funciona
Type: application/octet-stream
Size: 1916 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050922/d2829768/attachment.obj>

From hernando.garcia at gmail.com  Thu Sep 22 10:08:36 2005
From: hernando.garcia at gmail.com (Hernando Garcia)
Date: Thu, 22 Sep 2005 11:08:36 +0100
Subject: [Linux-cluster] Clustered NFS problem
In-Reply-To: <BAY104-F15BA538C238798FC4B6253C5940@phx.gbl>
References: <BAY104-F15BA538C238798FC4B6253C5940@phx.gbl>
Message-ID: <1127383717.4275.11.camel@hgarcia.surrey.redhat.com>

It would be better for you to officially open a call with Red Hat
Support directly. They will be able to help you with the issue.

When the call is open, make sure you provide with both sysreport of the
cluster nodes.

On Wed, 2005-09-21 at 11:03 -0400, Abbes Bettahar wrote:
> Hi,
> 
> We have 2 servers HP Proliant 380 G3 (RedHat Advanced Server 3) attached
> by fiber optic to the storage area network SAN HP MSA1000 and we want to  
> install and configure The RedHat Cluster Suite.
> 
> I setuped and configured a clustered NFS on the 2 servers RAC1 and RACGFS.
> 
> clumanager-1.2.26.1-1
> redhat-config-cluster-1.0.7-1
> 
> I have created 2 quorum partitions /dev/sdd2 and /dev/sdd3  (100MB each).
> 
> I created another huge partition /dev/sdd4 (over 600GB) and formatted it in 
> ext3 file system.
> 
> I installed the cluster suite on the 1st node (RAC1) and 2nd node RACGFS and 
> I started the rawdevices on the two nodes RAC1 and RACGFS (it's OK).
> 
> This the hosts file /etc/host on the node1 (RAC1) and node2 RACGFS
> 
> Do not remove the following line, or various programs
> # that require network functionality will fail.
> #127.0.0.1 rac1 localhost.localdomain localhost
> 127.0.0.1              localhost.localdomain localhost
> #
> # Private hostnames
> #
> 192.168.253.3           rac1.project.net     rac1
> 192.168.253.4           rac2.project.net     rac2
> 192.168.253.10          racgfs.project.net     racgfs
> 192.168.253.20          raclu_nfs.project.net   raclu_nfs
> #
> # Hostnames used for Interconnect
> #
> 1.1.1.1                 rac1i.project.net    rac1i
> 1.1.1.2                 rac2i.project.net    rac2i
> 1.1.1.3                 racgfsi.project.net    racgfsi
> #
> 192.168.253.5           infra.project.net       infra
> 192.168.253.7 ractest.project.net     ractest
> #
> 
> I generated a /etc/cluster.xml on the 1st node RAC1 and the 2nd node RACGFS.
> 
> <?xml version="1.0"?>
> <cluconfig version="3.0">
>   <clumembd broadcast="no" interval="750000" loglevel="5" multicast="yes" 
> multicast_ipaddress="225.0.0.11" thread="yes" tko_count="20"/>
>   <cluquorumd loglevel="5" pinginterval="1" tiebreaker_ip=""/>
>   <clurmtabd loglevel="5" pollinterval="4"/>
>   <clusvcmgrd loglevel="5" use_netlink="yes"/>
>   <clulockd loglevel="5"/>
>   <cluster config_viewnumber="24" key="978dcd78e05c5961cf1aaaa03b41209b" 
> name="cisn"/>
>   <sharedstate driver="libsharedraw.so" rawprimary="/dev/raw/raw1" 
> rawshadow="/dev/raw/raw2" type="raw"/>
>   <members>
>     <member id="0" name="192.168.253.3" watchdog="no"/>
>     <member id="1" name="192.168.253.10" watchdog="no"/>
>   </members>
>   <services>
>     <service checkinterval="5" failoverdomain="cisncluster" id="0" 
> maxfalsestarts="0" maxrestarts="0" name="nfs_cisn" userscript="None">
>       <service_ipaddresses>
>         <service_ipaddress broadcast="None" id="0" 
> ipaddress="192.168.253.20" monitor_link="0" netmask="255.255.255.0"/>
>       </service_ipaddresses>
>       <device id="0" name="/dev/sdd4">
>         <mount forceunmount="yes" mountpoint="/u04"/>
>         <nfsexport id="0" name="/u04">
>           <client id="0" name="*" options="rw"/>
>         </nfsexport>
>       </device>
>     </service>
>   </services>
>   <failoverdomains>
>     <failoverdomain id="0" name="cisncluster" ordered="yes" restricted="no">
>       <failoverdomainnode id="0" name="192.168.253.3"/>
>       <failoverdomainnode id="1" name="192.168.253.10"/>
>     </failoverdomain>
>   </failoverdomains>
> </cluconfig>
> 
> I created a NFS share on /u04 (mount on /dev/sdd4) using the Cluster GUI 
> manager on RAC1.
> I launched on the 2 nodes Rac1 and RACgfs the following command:
> service clumanager start
> 
> I checked the result on the 2 nodes, on RAC1:
> 
> clustat  results :
> 
> Cluster Status - project                                                  
> 09:04:34
> Cluster Quorum Incarnation #1
> Shared State: Shared Raw Device Driver v1.2
> 
>   Member             Status
>   ------------------ ----------
>   192.168.253.3      Active     <-- You are here
>   192.168.253.10     Active
> 
>   Service        Status   Owner (Last)     Last Transition Chk Restarts
>   -------------- -------- ---------------- --------------- --- --------
>   nfs_cisn       started  192.168.253.3    09:07:59 Sep 21   5        0
> 
> 
> on RacGfs: clustat  results :
> 
> Cluster Status - cisn                                                  
> 09:07:39
> Cluster Quorum Incarnation #3
> Shared State: Shared Raw Device Driver v1.2
> 
>   Member             Status
>   ------------------ ----------
>   192.168.253.3      Active
>   192.168.253.10     Active     <-- You are here
> 
>   Service        Status   Owner (Last)     Last Transition Chk Restarts
>   -------------- -------- ---------------- --------------- --- --------
>   nfs_cisn       started  192.168.253.3    09:07:59 Sep 21   5        0
> 
> 
> 
> When I launched ifconfig on RAC1, we saw that the service IP address 
> 192.168.253.20 is generated on eth2:0.
> 
> And I launched on  other servers the following command:
> mount t nfs 192.168.253.20:/u04 /u04
> 
> And all are OK, I can list the /u04 content from any server.
> 
> But my only problem is:
> 
> When I want to try a test if the clustered NFS will work fine, I rebooted 
> RAC1 frequently and RACGFS continue to work as the failover server and when 
> I launched ifconfig on RACGFS, we saw that the service IP address 
> 192.168.253.20 is generated on eth0:0 .
> We can list /u04 content (clustered NFS mount) on the other servers after 
> few seconds of RAC1 rebooting:
> 
> But after many reboots, I expect a big problem, the both cluster node 
> servers  cannot obtain the service IP address 192.168.253.20 when I launch 
> ifconfig on the both nodes.
> 
> On Rac1:
> 
> eth0      Link encap:Ethernet  HWaddr 00:0B:CD:EF:2B:C1
>           inet addr:1.1.1.1  Bcast:1.1.1.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:89170 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:87405 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:17288193 (16.4 Mb)  TX bytes:14452757 (13.7 Mb)
>           Interrupt:15
> 
> eth2      Link encap:Ethernet  HWaddr 00:0B:CD:FF:44:02
>           inet addr:192.168.253.3  Bcast:192.168.253.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1349991 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:435450 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:1592635536 (1518.8 Mb)  TX bytes:162026101 (154.5 Mb)
>           Interrupt:7
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:1001181 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1001181 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:76097441 (72.5 Mb)  TX bytes:76097441 (72.5 Mb)
> 
> On RACGFS:
> 
> eth0      Link encap:Ethernet  HWaddr 00:14:38:50:D3:E4
>           inet addr:192.168.253.10  Bcast:192.168.253.255  
> Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:211223 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:160026 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:14917480 (14.2 Mb)  TX bytes:13886063 (13.2 Mb)
>           Interrupt:25
> 
> eth1      Link encap:Ethernet  HWaddr 00:14:38:50:D3:E3
>           inet addr:1.1.1.3  Bcast:1.1.1.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
>           Interrupt:26
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:184529 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:184529 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:10971489 (10.4 Mb)  TX bytes:10971489 (10.4 Mb)
> 
> I tried many commands, I stopped the cluster services On both nodes and 
> restart it but unfortunately it doesnt work and we cannot obtain the 
> clustered NFS mount.
> 
> 
> Have you any idea to fix this problem?
> 
> Thanks for your replies and help
> 
> Abbes Bettahar
> 514-296-0756
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From bmarzins at redhat.com  Thu Sep 22 13:22:21 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Thu, 22 Sep 2005 08:22:21 -0500
Subject: [Linux-cluster] GFS breaking POSIX exhibited by Samba?
In-Reply-To: <20050922083237.GD3466@neu.nirvana>
References: <20050922083237.GD3466@neu.nirvana>
Message-ID: <20050922132221.GA2123@phlogiston.msp.redhat.com>

On Thu, Sep 22, 2005 at 10:32:37AM +0200, Axel Thimm wrote:
> Hi,
> 
> I've been stuggling with a strange bug in Samba which required me to
> have some of the tdb files with permissions 0666 to allow Samba to
> work.
> 
> The Samba metadata (locking and connection tables etc) are placed on
> GFS to allow for easier relocation of the Samba services ("poor man's
> clustered samba").
> 
> The problem is that Samba opens some files as root, then drops
> priviledges and finally accesses these files assuming that the root
> access rights are still in order. This does not work under GFS, but
> under any other local fs.
> 
> The Samba developers claim that this is POSIX compliant and that GFS
> is not following POSIX in this matter.
> 
> Is this true? Does POSIX require the fds to not change access
> priviledges even when setuiding to another user?

Yes, apparently it does.
 
> If so, why doesn't GFS respect this? A bug or a feature? If the former
> I'll go and bugzilla it. If the latter, can there be a fix for the
> RHEL4 branch?

Since GFS is not complying with POSIX, I'd call this a bug. Please, go
bugzilla it.

-Ben

> Thanks!
> -- 
> Axel.Thimm at ATrpms.net



> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From Axel.Thimm at ATrpms.net  Thu Sep 22 13:38:14 2005
From: Axel.Thimm at ATrpms.net (Axel Thimm)
Date: Thu, 22 Sep 2005 15:38:14 +0200
Subject: [Linux-cluster] Re: GFS breaking POSIX exhibited by Samba?
In-Reply-To: <20050922132221.GA2123@phlogiston.msp.redhat.com>
References: <20050922083237.GD3466@neu.nirvana>
	<20050922132221.GA2123@phlogiston.msp.redhat.com>
Message-ID: <20050922133814.GB25543@neu.nirvana>

On Thu, Sep 22, 2005 at 08:22:21AM -0500, Benjamin Marzinski wrote:
> On Thu, Sep 22, 2005 at 10:32:37AM +0200, Axel Thimm wrote:
> > I've been stuggling with a strange bug in Samba which required me to
> > have some of the tdb files with permissions 0666 to allow Samba to
> > work.
> > 
> > The Samba metadata (locking and connection tables etc) are placed on
> > GFS to allow for easier relocation of the Samba services ("poor man's
> > clustered samba").
> > 
> > The problem is that Samba opens some files as root, then drops
> > priviledges and finally accesses these files assuming that the root
> > access rights are still in order. This does not work under GFS, but
> > under any other local fs.
> > 
> > The Samba developers claim that this is POSIX compliant and that GFS
> > is not following POSIX in this matter.
> > 
> > Is this true? Does POSIX require the fds to not change access
> > priviledges even when setuiding to another user?
> 
> Yes, apparently it does.
>  
> > If so, why doesn't GFS respect this? A bug or a feature? If the former
> > I'll go and bugzilla it. If the latter, can there be a fix for the
> > RHEL4 branch?
> 
> Since GFS is not complying with POSIX, I'd call this a bug. Please, go
> bugzilla it.

OK, here it is https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169039
-- 
Axel.Thimm at ATrpms.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050922/610889f4/attachment.sig>

From lhh at redhat.com  Thu Sep 22 14:20:25 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 22 Sep 2005 10:20:25 -0400
Subject: [Linux-cluster] Some questions about heart-beat under Cluster
	Suite 4
In-Reply-To: <43324C0E.8010005@bull.net>
References: <43324C0E.8010005@bull.net>
Message-ID: <1127398825.22106.153.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-22 at 08:15 +0200, Alain Moulle wrote:
> Hi everybody
> 
> Some questions about heart-beat under Cluster Suite 4 :
> 
> 1. how is choosen the eth interface under CS4 ?
>    if we have eth0 and eth1, it seems that HearBeat
>    goes through eth0 and we don't have the possibility
>    to configure this in CS4 , right ?
> 
> 2. does that mean also that if eth0 fails, the CS4
>    automatically goes through eth1 ?
> 
> 3. do I miss the way to configure this via GUI ?

One easy way to do this is to use bonded interfaces, which will provide
failover transparent to the cluster.

-- Lon



From lhh at redhat.com  Thu Sep 22 14:28:17 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 22 Sep 2005 10:28:17 -0400
Subject: [Linux-cluster] GFS breaking POSIX exhibited by Samba?
In-Reply-To: <20050922083237.GD3466@neu.nirvana>
References: <20050922083237.GD3466@neu.nirvana>
Message-ID: <1127399297.22106.162.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-22 at 10:32 +0200, Axel Thimm wrote:
> Hi,
> 
> I've been stuggling with a strange bug in Samba which required me to
> have some of the tdb files with permissions 0666 to allow Samba to
> work.
> 
> The Samba metadata (locking and connection tables etc) are placed on
> GFS to allow for easier relocation of the Samba services ("poor man's
> clustered samba").
> 
> The problem is that Samba opens some files as root, then drops
> priviledges and finally accesses these files assuming that the root
> access rights are still in order. This does not work under GFS, but
> under any other local fs.
> 
> The Samba developers claim that this is POSIX compliant and that GFS
> is not following POSIX in this matter.
> 
> Is this true? Does POSIX require the fds to not change access
> priviledges even when setuiding to another user?
> 
> If so, why doesn't GFS respect this? A bug or a feature? If the former
> I'll go and bugzilla it. If the latter, can there be a fix for the
> RHEL4 branch?

Definitely file a bugzilla.

It's something that we definitely need to look into.  Off the top of my
head, it sounds like GFS is, indeed, not respecting access rights (and
therefore, POSIX) -- but oddly enough, it may be intentional.  Cluster
semantics don't always line up with POSIX semantics.



-- Lon





From lhh at redhat.com  Thu Sep 22 14:29:46 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 22 Sep 2005 10:29:46 -0400
Subject: [Linux-cluster] GFS breaking POSIX exhibited by Samba?
In-Reply-To: <1127399297.22106.162.camel@ayanami.boston.redhat.com>
References: <20050922083237.GD3466@neu.nirvana>
	<1127399297.22106.162.camel@ayanami.boston.redhat.com>
Message-ID: <1127399386.22106.165.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-22 at 10:28 -0400, Lon Hohberger wrote:

> It's something that we definitely need to look into.  Off the top of my
> head, it sounds like GFS is, indeed, not respecting access rights (and
> therefore, POSIX) -- but oddly enough, it may be intentional.  Cluster
> semantics don't always line up with POSIX semantics.

Ben beat me to it.  I'm just not on my game today.

-- Lon




From lhh at redhat.com  Thu Sep 22 14:32:35 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 22 Sep 2005 10:32:35 -0400
Subject: [Linux-cluster] Re: How do I use a cross over cable to set up
	quorum?
In-Reply-To: <1a9f41690509220244da6fed5@mail.gmail.com>
References: <1a9f416905092201175c6eb118@mail.gmail.com>
	<1a9f41690509220244da6fed5@mail.gmail.com>
Message-ID: <1127399555.22106.169.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-22 at 11:44 +0200, Andreso wrote:
> Sorry, I forgot to mention that I am running Centos 3.5 which is
> equivalent to RHEL AS 3.5.
> 
> A working cluster.xml using the main network interface is attached

* Use 10.0.0.x (the IPs assigned to the interfaces using the crossover
cable) as your cluster member names.

* Use broadcast heartbeating.

* Set broadcast-primary-only (see man cludb)

* Use the disk based tiebreaker.  DO NOT use the IP tiebreaker.

-- Lon



From Robert.Olsson at mobeon.com  Thu Sep 22 14:52:29 2005
From: Robert.Olsson at mobeon.com (Robert Olsson)
Date: Thu, 22 Sep 2005 16:52:29 +0200
Subject: [Linux-cluster] High availability mail system
Message-ID: <9B488A5E8C00084C82DB19AF2E713C22F37BA0@vale.MOBEON.COM>

Im trying to put up a high availability mail system with high
performance who support up to 300 000 mailboxes using a linux cluster.
 
I have looked around for open source cluster solution, but so far only
found solution like SAN for shared filesystem with high performance.
 
Any suggestion how to solve the issues using open source software?
 
The system should support  
- Mailbox replication 
*- Mailbox synchronization 
*- Redundancy both in hardware and software
- The system is to be build with low cost computers
- One shared filesystem without external storage like SAN
- Scalable
 
/Robert Olsson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050922/69b1929b/attachment.htm>

From Alain.Moulle at bull.net  Thu Sep 22 15:19:27 2005
From: Alain.Moulle at bull.net (Alain Moulle)
Date: Thu, 22 Sep 2005 17:19:27 +0200
Subject: [Linux-cluster] Some questions about heart-beat under Cluster
	Suite 4
In-Reply-To: <1127398825.22106.153.camel@ayanami.boston.redhat.com>
References: <43324C0E.8010005@bull.net>
	<1127398825.22106.153.camel@ayanami.boston.redhat.com>
Message-ID: <4332CB7F.3010304@bull.net>

Hi
Thanks. Effectively,I saw this in documentation but
in fact, my question was more precisely :
if we have two eth interfaces eth0 and eth1 , and we don't want the
heart beat goes through eth1 in any case, even if the
CS4 has to failover in case of eth0 failure, is there
a way in CS4 configuration to avoid eth1 ? or do we have
to disable Broadcast (or Multicast if choosen in CS4
configuration) at the eth interface configuration ?

Thanks
Alain

Lon Hohberger wrote:
> On Thu, 2005-09-22 at 08:15 +0200, Alain Moulle wrote:
> 
>>Hi everybody
>>
>>Some questions about heart-beat under Cluster Suite 4 :
>>
>>1. how is choosen the eth interface under CS4 ?
>>   if we have eth0 and eth1, it seems that HearBeat
>>   goes through eth0 and we don't have the possibility
>>   to configure this in CS4 , right ?
>>
>>2. does that mean also that if eth0 fails, the CS4
>>   automatically goes through eth1 ?
>>
>>3. do I miss the way to configure this via GUI ?
> 
> 
> One easy way to do this is to use bonded interfaces, which will provide
> failover transparent to the cluster.
> 
> -- Lon
> 
> 


-- 



mailto:Alain.Moulle at bull.net
+------------------------------+--------------------------------+
|	Alain Moull?	       	| from France :	04 76 29 75 99  |
|                              	| FAX number  : 04 76 29 72 49  |
| Bull SA		       	|				|
| 1, Rue de Provence  		| Adr  : FREC B1-041            |
| B.P. 208			|				|
| 38432 Echirolles - CEDEX     	| Email: Alain.Moulle at bull.net  |
| France                       	| BCOM : 229 7599               |
+-------------------------------+-------------------------------+




From yfttyfs at gmail.com  Thu Sep 22 16:40:07 2005
From: yfttyfs at gmail.com (y f)
Date: Fri, 23 Sep 2005 00:40:07 +0800
Subject: [Linux-cluster] High availability mail system
In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F37BA0@vale.MOBEON.COM>
References: <9B488A5E8C00084C82DB19AF2E713C22F37BA0@vale.MOBEON.COM>
Message-ID: <78fcc84a050922094012a1b27d@mail.gmail.com>

On 9/22/05, Robert Olsson <Robert.Olsson at mobeon.com> wrote:
>
> Im trying to put up a high availability mail system with high performance
> who support up to 300 000 mailboxes using a linux cluster.
>

Recently we wrote a Cluster FS dedicated to Email service based on ideas
from GoogleFS, LogFS, GlobalFS, etc., which also gave me some thoughts on
it.

I have looked around for open source cluster solution, but so far only found
> solution like SAN for shared filesystem with high performance.
>

www.Lustre.org <http://www.Lustre.org> ?

Any suggestion how to solve the issues using open source software?
>  The system should support
> - Mailbox replication
>

Replication in mail account level, file level, or GoogleFS[1]' chunk level,
I wonder which one will simplify the complication ?

?- Mailbox synchronization
>

Based on above line.

?
> - Redundancy both in hardware and software
> - The system is to be build with low cost computers
> - One shared filesystem without external storage like SAN
> - Scalable
>

GoogleFS gave a good example on it ;)

[1] http://labs.google.com/papers/gfs.html

/Robert Olsson
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050923/6c84248d/attachment.htm>

From jason_wilk at stircrazy.net  Thu Sep 22 17:53:16 2005
From: jason_wilk at stircrazy.net (Jason Wilkinson)
Date: Thu, 22 Sep 2005 12:53:16 -0500
Subject: [Linux-cluster] High availability mail system
In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F37BA0@vale.MOBEON.COM>
Message-ID: <A1752420D10F65429F82C257C05AFAA3037742F5@godzilla.corp.digitalims.com>

Robert Olsson wrote:
> Im trying to put up a high availability mail system with high
> performance who support up to 300 000 mailboxes using a linux
> cluster.  
> 
> I have looked around for open source cluster solution, but so far
> only found solution like SAN for shared filesystem with high
> performance.  
> 

> Any suggestion how to solve the issues using open source software?
> 
> The system should support
> - Mailbox replication
> .- Mailbox synchronization

Why are you replicating the mailbox. Why don't you put the mailboxes on NFS
and just have all of the servers dump into the same mailbox. The POP3
frontends can all pull from the same store as well.

Mail servers are one of the odd services that I've seen where it isn't
necessary to implement a cluster to scale well.

http://shupp.org/maps/ispcluster.html


> .- Redundancy both in hardware and software
> - The system is to be build with low cost computers
> - One shared filesystem without external storage like SAN
> - Scalable
> 
> /Robert Olsson





From lgodoy at atichile.com  Thu Sep 22 19:02:16 2005
From: lgodoy at atichile.com (Luis Godoy Gonzalez)
Date: Thu, 22 Sep 2005 15:02:16 -0400
Subject: [Linux-cluster] RedHat EN4U1 AMD64 
In-Reply-To: <432AC5C2.1090800@fnal.gov>
References: <43287C63.10909@fnal.gov> <20050915033908.GL2190@redhat.com>
	<432AC5C2.1090800@fnal.gov>
Message-ID: <4332FFB8.1070203@atichile.com>

Hi

I am trying to run a basic config of  "cluster suite" software on "Hp 
cluster Hardware", Using:

"*Red Hat Enterprise Linux ES 4 Update 1 (AMD64/Intel EM64T)"
"*rhel-4-rhcs-x86_64.iso"

Mi first problem was with "cman-kernel-hugemem" and "dlm-kernel-hugemem" 
pakages, the instalation process failed for funsolved dependencies. So, 
for testing we ommited these pakages and continued with the instalation 
proccess.

I created a basic "service" ( only script ) and this works, but  when I 
added a IP addreess, the GUI showed services status OK but the IP 
address whas not added to the machine :| , I think the problem is with 
rgmanager.

Has someone installed This version of SO and Cluster Suite ?
You have any idea to solve this problem ?

Thanks in advance
Luis G.





From lhh at redhat.com  Thu Sep 22 21:50:17 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 22 Sep 2005 17:50:17 -0400
Subject: [Linux-cluster] RedHat EN4U1 AMD64
In-Reply-To: <4332FFB8.1070203@atichile.com>
References: <43287C63.10909@fnal.gov> <20050915033908.GL2190@redhat.com>
	<432AC5C2.1090800@fnal.gov>  <4332FFB8.1070203@atichile.com>
Message-ID: <1127425817.22106.200.camel@ayanami.boston.redhat.com>

On Thu, 2005-09-22 at 15:02 -0400, Luis Godoy Gonzalez wrote:

> I created a basic "service" ( only script ) and this works, but  when I 
> added a IP addreess, the GUI showed services status OK but the IP 
> address whas not added to the machine :| , I think the problem is with 
> rgmanager.

Use "ip addr list", not "ifconfig".

-- Lon



From baesso at ksolutions.it  Fri Sep 23 07:41:49 2005
From: baesso at ksolutions.it (Baesso Mirko)
Date: Fri, 23 Sep 2005 09:41:49 +0200
Subject: R: [Linux-cluster] Re: How do I use a cross over cable to set
	upquorum?
Message-ID: <F63C63702B77F94CB4664B4D88660C4D4BAC9A@kmail.ksolutions.it>

Hi, i need to setup a 2 node cluster and i would like to use the disk based tiebreaker or ip tiebreaker, but I see there is no cluquorumd command. I'm using RHCS 4 (NO GFS) on kernel 2.6.9.11
Could you help me
Thanks

Baesso Mirko - System Engineer
KSolutions.S.p.A.
Via Lenin 132/26
56017  S.Martino Ulmiano (PI) - Italy
tel.+ 39 0 50 898369 fax. + 39 0 50 861200
baesso at ksolutions.it   http//www.ksolutions.it

-----Messaggio originale-----
Da: Andreso [mailto:andreseso at gmail.com] 
Inviato: gioved? 22 settembre 2005 11.44
A: linux-cluster at redhat.com
Oggetto: [Linux-cluster] Re: How do I use a cross over cable to set upquorum?

Sorry, I forgot to mention that I am running Centos 3.5 which is
equivalent to RHEL AS 3.5.

A working cluster.xml using the main network interface is attached

Andres

On 9/22/05, Andreso <andreseso at gmail.com> wrote:
> Hello,
>
> I am trying to set up a cluster with clumanager-1.2.3-1 and
> clumanager-1.2.3-1 and my boss desires a working quorum where the two
> members of the cluster use a cross over cable to interchange the
> necessary information for quorum between the two cluster members.  The
> reason is that the company uses faulty switches so he considers using
> the main network interfaces of the servers not acceptable.
>
> I would like to know how I can obtain cluster quorum using the cross
> over cable.  I have succesfully set up the cluster using the main
> network interface but I have failed miserably setting up the cluster
> quorum over the cross over cable network interface.  I believe I have
> a routing problem.
>
> The route -n information for the cross over cable interface is
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth1
>
> I am far from being a linux expert and in some things like clusters I
> am a newbie and I am not particularily strong on networking.  For
> example until I started setting up the cluster I had never heard of
> the concepts tiebreaker IP or Multicast IP address.
>
> Any help would be appreciated.
>
> If somebody has managed to make this work could they please post their
> cluster.xml file?
>




From Robert.Olsson at mobeon.com  Fri Sep 23 08:38:53 2005
From: Robert.Olsson at mobeon.com (Robert Olsson)
Date: Fri, 23 Sep 2005 10:38:53 +0200
Subject: [Linux-cluster] High availability mail system
Message-ID: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM>

Ok, how about performance when using NFS? I?m thinking about the overhead when accesing NFS filesystems. Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jason Wilkinson
Sent: den 22 september 2005 19:53
To: 'linux clustering'
Subject: RE: [Linux-cluster] High availability mail system

Robert Olsson wrote:
> Im trying to put up a high availability mail system with high 
> performance who support up to 300 000 mailboxes using a linux cluster.
> 
> I have looked around for open source cluster solution, but so far only 
> found solution like SAN for shared filesystem with high performance.
> 

> Any suggestion how to solve the issues using open source software?
> 
> The system should support
> - Mailbox replication
> .- Mailbox synchronization

Why are you replicating the mailbox. Why don't you put the mailboxes on NFS and just have all of the servers dump into the same mailbox. The POP3 frontends can all pull from the same store as well.

Mail servers are one of the odd services that I've seen where it isn't necessary to implement a cluster to scale well.

http://shupp.org/maps/ispcluster.html


> .- Redundancy both in hardware and software
> - The system is to be build with low cost computers
> - One shared filesystem without external storage like SAN
> - Scalable
> 
> /Robert Olsson



--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From adam.cassar at netregistry.com.au  Fri Sep 23 08:53:20 2005
From: adam.cassar at netregistry.com.au (Adam Cassar)
Date: Fri, 23 Sep 2005 18:53:20 +1000
Subject: [Linux-cluster] High availability mail system
In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM>
References: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM>
Message-ID: <4333C280.6040901@netregistry.com.au>

use maildir format and you will be fine

exim and courier support this

Robert Olsson wrote:

>Ok, how about performance when using NFS? I?m thinking about the overhead when accesing NFS filesystems. Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data?
>
>-----Original Message-----
>From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jason Wilkinson
>Sent: den 22 september 2005 19:53
>To: 'linux clustering'
>Subject: RE: [Linux-cluster] High availability mail system
>
>Robert Olsson wrote:
>  
>
>>Im trying to put up a high availability mail system with high 
>>performance who support up to 300 000 mailboxes using a linux cluster.
>>
>>I have looked around for open source cluster solution, but so far only 
>>found solution like SAN for shared filesystem with high performance.
>>
>>    
>>
>
>  
>
>>Any suggestion how to solve the issues using open source software?
>>
>>The system should support
>>- Mailbox replication
>>.- Mailbox synchronization
>>    
>>
>
>Why are you replicating the mailbox. Why don't you put the mailboxes on NFS and just have all of the servers dump into the same mailbox. The POP3 frontends can all pull from the same store as well.
>
>Mail servers are one of the odd services that I've seen where it isn't necessary to implement a cluster to scale well.
>
>http://shupp.org/maps/ispcluster.html
>
>
>  
>
>>.- Redundancy both in hardware and software
>>- The system is to be build with low cost computers
>>- One shared filesystem without external storage like SAN
>>- Scalable
>>
>>/Robert Olsson
>>    
>>
>
>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>  
>



From andreseso at gmail.com  Fri Sep 23 10:22:19 2005
From: andreseso at gmail.com (Andreso)
Date: Fri, 23 Sep 2005 12:22:19 +0200
Subject: [Linux-cluster] Four partitions mounted for one server -- samba
	fails
Message-ID: <1a9f4169050923032278a31093@mail.gmail.com>

I have set up the cluster using CENTOS 3.5 using the main network
interface, the one with the gateway.

I have two services running.  httpd and mysql

httpd had only one partition mounted:  /usr/local/htdocs on /dev/sdb7
which is shared by samba

For backups and other stuff I am not privy to I have mounted three
more partitions:  /dev/sdb9, /dev/sdb10 and /dev/sdc1

None of these partitions are shared by samba

The service would not start stating that /etc/samba/smb.conf.not was not found

I did touch /etc/samba/smb.conf.not and I was able to make the service
start.  Unfortunately /etc/samba/smb.conf.not seems to be a samba file
and it is not sufficient for it to exist.   Every time the httpd
service is checked it fails, the devices are unmounted forcefully and
they are remounted and the service is restarted.  I believe this is
caused by samba

What stuff should I put into this file for /etc/samba.conf.not for the
service checks to function correctly?


I am loath to use the word urgent but I consider this to be urgent.  I
am getting paid for the work done.  I budgeted six days on this
project and I have already spent 10.  I would hate having to come back
to this place.  I am working standing infront of a LVKM getting
disconnected constantly and one day due to enterprise restrictions I
did not get access to Internet.  Its noon and I might have to leave at
14:30 and I would hate to spend another day I am not getting paid for
on this project.

Andres



From rainer at ultra-secure.de  Fri Sep 23 10:58:19 2005
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Fri, 23 Sep 2005 12:58:19 +0200
Subject: [Linux-cluster] High availability mail system
In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM>
References: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM>
Message-ID: <4333DFCB.6020600@ultra-secure.de>

Robert Olsson wrote:

>Ok, how about performance when using NFS? I?m thinking about the overhead when accesing NFS filesystems. 
>


The trouble is that GFS also has an overhead - especially for Qmail.
In fact, we have what I would call a "long-time evaluation" of qmail + 
GFS running.
While some features (no SPOF) are nice, others (way too many concurrent 
directory-accesses to gain any performance gain in comparison to NFS) 
are not nice at all.
I'm really not a GFS-expert at all, but the way I see it (and was told) 
is that everytime a directory-access occurs, GFS must synchronize this 
to the other cluster-members.
Now, when Qmail delivers a mail, it already takes great care not to 
produce conflicts on (NFS-)shared filesystems, by copying the message 
first to "tmp", then to "new", with timestamp as part of filename etc.
For every file created on the shared directory, though, GFS creates 
locks and lockfiles - which easily doubles or triples the load on the 
SAN. Even though, no two files of the same name are ever created by 
different hosts.
In addition, GFS doesn't seem to have "directory hashing" like FreeBSD 
UFS and others have, as a result access to large directories with many 
files is slow. Running find(1) on the mail-store can bring the cluster 
to halt, so does du(1).

Due to the fact that we also will have to move from cdb-backend to 
mysql, there will be a SPOF anyway and we will be actively evaluating 
going to NFS (more or less back).
But only on a "sane" NFS-platform, most likely Solaris, or FreeBSD - 
don't waste your time with Linux-NFS...


>Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data?
>  
>

The shared-storage mail-systems only scale to a certain point. The 
numbers are in the 300k-500k ballpark. After that, you have to go 
distributed.
If you have millions of users, go qmail-ldap.

These things are very difficult to benchmark, every site has an 
individual use-profile..




cheers,
Rainer



From andreseso at gmail.com  Fri Sep 23 11:10:21 2005
From: andreseso at gmail.com (Andreso)
Date: Fri, 23 Sep 2005 13:10:21 +0200
Subject: [Linux-cluster] Re: Four partitions mounted for one server -- samba
	fails
In-Reply-To: <1a9f4169050923032278a31093@mail.gmail.com>
References: <1a9f4169050923032278a31093@mail.gmail.com>
Message-ID: <1a9f41690509230410239f1432@mail.gmail.com>

I have looked at the /var/log/cluster and here I will reproduce all
the messages.  I have set the log level on all the cluster daemons to
INFO

In effect it is a samba issue

Sep 23 12:53:46 hercules clusvcmgrd: [16076]: <warning> service warning: share_s
tart_stop: Samba configuration file /etc/samba/smb.conf.not found does not exist
.
Sep 23 12:53:46 hercules clusvcmgrd: [16076]: <err> service error: share_start_s
top: nmbd for service httpd died!
Sep 23 12:53:46 hercules clusvcmgrd: [16076]: <err> service error: /usr/lib/clum
anager/services/service: line 220: [: status: integer expression expected
Sep 23 12:53:46 hercules clusvcmgrd: [16076]: <err> service error: grep: found:
No such file or directory
Sep 23 12:53:46 hercules clusvcmgrd: [16076]: <err> service error: Check status
failed on Samba for httpd
Sep 23 12:53:46 hercules clusvcmgrd[16075]: <warning> Restarting locally failed
service httpd
Sep 23 12:53:46 hercules clusvcmgrd: [16228]: <notice> service notice: Stopping
service httpd ...
Sep 23 12:53:46 hercules clusvcmgrd: [16228]: <notice> service notice: Running u
ser script '/etc/init.d/apache2 stop'
Sep 23 12:53:46 hercules clusvcmgrd: [16228]: <info> service info: Stopping IP a
ddress 10.64.34.141
Sep 23 12:53:46 hercules clusvcmgrd: [16228]: <warning> service warning: share_s
tart_stop: Samba configuration file /etc/samba/smb.conf.not found does not exist
.
Sep 23 12:53:47 hercules last message repeated 2 times
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: /usr/lib/clum
anager/services/service: line 220: [: stop: integer expression expected
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: grep: found:
No such file or directory
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: /usr/lib/clum
anager/services/service: line 220: [: stop: integer expression expected
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: grep: found:
No such file or directory
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: /usr/lib/clum
anager/services/service: line 220: [: stop: integer expression expected
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: grep: found:
No such file or directory
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: grep: found:
No such file or directory
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: unmounting /d
ev/sdb7 (/usr/local/htdocs)
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: unmounting /d
ev/sdb9 (/plantillas)
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: unmounting /d
ev/sdb10 (/etse)
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <info> service info: unmounting /d
ev/sdc1 (/backup)
Sep 23 12:53:47 hercules clusvcmgrd: [16228]: <notice> service notice: Stopped s
ervice httpd ...
Sep 23 12:53:48 hercules clusvcmgrd[16075]: <notice> Starting stopped service ht
tpd
Sep 23 12:53:48 hercules clusvcmgrd: [16727]: <notice> service notice: Starting
service httpd ...
Sep 23 12:53:48 hercules clusvcmgrd: [16727]: <info> service info: Starting IP a
ddress 10.64.34.141
Sep 23 12:53:48 hercules clusvcmgrd: [16727]: <info> service info: Sending Gratu
itous arp for 10.64.34.141 (00:11:43:E7:BA:A5)
Sep 23 12:53:49 hercules clusvcmgrd: [16727]: <warning> service warning: share_s
tart_stop: Samba configuration file /etc/samba/smb.conf.not found does not exist
.
Sep 23 12:53:49 hercules last message repeated 2 times
Sep 23 12:53:49 hercules clusvcmgrd: [16727]: <info> service info: /usr/lib/clum
anager/services/service: line 220: [: start: integer expression expected
Sep 23 12:53:49 hercules clusvcmgrd: [16727]: <info> service info: grep: found:
No such file or directory
Sep 23 12:53:49 hercules clusvcmgrd: [16727]: <info> service info: /usr/lib/clum
anager/services/service: line 220: [: start: integer expression expected
Sep 23 12:53:49 hercules clusvcmgrd: [16727]: <info> service info: grep: found:
No such file or directory
Sep 23 12:53:49 hercules clusvcmgrd: [16727]: <info> service info: /usr/lib/clum
anager/services/service: line 220: [: start: integer expression expected
Sep 23 12:53:49 hercules clusvcmgrd: [16727]: <info> service info: grep: found:
No such file or directory
Sep 23 12:53:49 hercules clusvcmgrd: [16727]: <notice> service notice: Running u
ser script '/etc/init.d/apache2 start'


I set up a /etc/samba/smb.conf.not file with just a global section
 [global]
        workgroup = RHCLUSTER
        pid directory = /var/run/samba/not
        lock directory = /var/cache/samba/not
        log file = /var/log/samba/%m.log
        encrypt passwords = yes
        bind interfaces only = yes
        interfaces = 10.64.34.141/255.255.255.0


To the best of my knowledge /etc/samba/smb.conf.not is not documented



From Robert.Olsson at mobeon.com  Fri Sep 23 11:15:07 2005
From: Robert.Olsson at mobeon.com (Robert Olsson)
Date: Fri, 23 Sep 2005 13:15:07 +0200
Subject: [Linux-cluster] High availability mail system
Message-ID: <9B488A5E8C00084C82DB19AF2E713C22F6487F@vale.MOBEON.COM>

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rainer Duffner
Sent: den 23 september 2005 12:58
To: linux clustering
Subject: Re: [Linux-cluster] High availability mail system

>Robert Olsson wrote:

>>Ok, how about performance when using NFS? I?m thinking about the overhead when accesing NFS filesystems. 
>>


>The trouble is that GFS also has an overhead - especially for Qmail.
>In fact, we have what I would call a "long-time evaluation" of qmail + GFS running.
>While some features (no SPOF) are nice, others (way too many concurrent directory-accesses to gain any performance gain in comparison to NFS) are not nice at all.
>I'm really not a GFS-expert at all, but the way I see it (and was told) is that everytime a directory-access occurs, GFS >must synchronize this to the other cluster-members.
>Now, when Qmail delivers a mail, it already takes great care not to produce conflicts on (NFS-)shared filesystems, by copying the message first to "tmp", then to "new", with timestamp as part of filename etc.
>For every file created on the shared directory, though, GFS creates locks and lockfiles - which easily doubles or triples >the load on the SAN. Even though, no two files of the same name are ever created by different hosts.
>In addition, GFS doesn't seem to have "directory hashing" like FreeBSD UFS and others have, as a result access to large directories with many files is slow. Running find(1) on the mail-store can bring the cluster to halt, so does du(1).

>Due to the fact that we also will have to move from cdb-backend to mysql, there will be a SPOF anyway and we will be actively evaluating going to NFS (more or less back).
>But only on a "sane" NFS-platform, most likely Solaris, or FreeBSD - don't waste your time with Linux-NFS...


>>Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data?
>>  
>>

>The shared-storage mail-systems only scale to a certain point. The numbers are in the 300k-500k ballpark. After that, you have to go distributed.
>If you have millions of users, go qmail-ldap.

How do you mean with "go distributed"?

>These things are very difficult to benchmark, every site has an individual use-profile..




cheers,
Rainer

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From merlin at mwob.org.uk  Fri Sep 23 11:31:33 2005
From: merlin at mwob.org.uk (Howard Johnson)
Date: Fri, 23 Sep 2005 12:31:33 +0100
Subject: [Linux-cluster] High availability mail system
In-Reply-To: <4333DFCB.6020600@ultra-secure.de>
References: <9B488A5E8C00084C82DB19AF2E713C22F37D8E@vale.MOBEON.COM>
	<4333DFCB.6020600@ultra-secure.de>
Message-ID: <1127475093.30722.23.camel@thunderbolt.localnet>

On Fri, 2005-09-23 at 11:58, Rainer Duffner wrote:

> But only on a "sane" NFS-platform, most likely Solaris, or FreeBSD - 
> don't waste your time with Linux-NFS...
> 
> 
> >Do you know about any mailsystem that distribute mail over NFS and do you have any links to performance data?
> >  
> >
> 
> The shared-storage mail-systems only scale to a certain point. The 
> numbers are in the 300k-500k ballpark. After that, you have to go 
> distributed.
> If you have millions of users, go qmail-ldap.

Linux-based NFS shared-storage mail systems are capable of scaling well
beyond that. I've seen such a system handling millions of mailboxes.

-- 
Howard Johnson



From rainer at ultra-secure.de  Fri Sep 23 12:30:10 2005
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Fri, 23 Sep 2005 14:30:10 +0200
Subject: [Linux-cluster] High availability mail system
In-Reply-To: <9B488A5E8C00084C82DB19AF2E713C22F6487F@vale.MOBEON.COM>
References: <9B488A5E8C00084C82DB19AF2E713C22F6487F@vale.MOBEON.COM>
Message-ID: <4333F552.3080706@ultra-secure.de>

Robert Olsson wrote:

>How do you mean with "go distributed"?
>  
>


Qmail-LDAP-patches.
You've no longer got a shared mail-storage, so you don't have to 
scale-up that, as you add users.
A LDAP-directory stores where (on which server) each user is located.

See the qmail-ldap pages and accompanying documentation.




Rainer



From Robert.Olsson at mobeon.com  Fri Sep 23 12:42:25 2005
From: Robert.Olsson at mobeon.com (Robert Olsson)
Date: Fri, 23 Sep 2005 14:42:25 +0200
Subject: [Linux-cluster] High availability mail system
Message-ID: <9B488A5E8C00084C82DB19AF2E713C22F64912@vale.MOBEON.COM>


> The system should support
> - Mailbox replication
> .- Mailbox synchronization

>Why are you replicating the mailbox. Why don't you put the mailboxes on
NFS and just have all of the servers dump into 
>the same mailbox. The POP3 frontends can all pull from the same store
as well.

I want to have redundancy on the mailboxes not just on one node using
raid. I want the mailbox at least on two nodes if one node fail.

Do you have any suggestion to that?

/Robert Olsson

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Mon Sep 26 15:28:58 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 26 Sep 2005 11:28:58 -0400
Subject: [Linux-cluster] Re: How do I use a cross over cable to set up
	quorum?
In-Reply-To: <1a9f416905092304522adfc2b1@mail.gmail.com>
References: <1a9f416905092201175c6eb118@mail.gmail.com>
	<1a9f41690509220244da6fed5@mail.gmail.com>
	<1127399555.22106.169.camel@ayanami.boston.redhat.com>
	<1a9f416905092304522adfc2b1@mail.gmail.com>
Message-ID: <1127748538.22106.242.camel@ayanami.boston.redhat.com>

On Fri, 2005-09-23 at 13:52 +0200, Andreso wrote:


> On 9/22/05, Lon Hohberger <lhh at redhat.com> wrote:
> >
> > * Set broadcast-primary-only (see man cludb)
> 
> cludb does not have a man page.  broadcast-primary-only does not
> appear in google

Upgrade to the latest package from RHN or wherever you got your
software.  Anyway, as it turns out, the man page is wrong anyway (it
references primary_only instead of broadcast_primary_only).  Oops.


> > * Use the disk based tiebreaker.  DO NOT use the IP tiebreaker.
> I use the ping interval instad of tiebreaker Ip.  I guess that is what you mean.

That's equivalent, yes.


>   Member             Status
>   ------------------ ----------
>   10.0.0.2           Inactive
>   10.0.0.3           Active     <-- You are here
> 
>   Service        Status   Owner (Last)     Last Transition Chk Restarts
>   -------------- -------- ---------------- --------------- --- --------
>   httpd          started  10.0.0.2         13:39:06 Sep 23  30        0
>   mysql          started  10.0.0.2         13:39:06 Sep 23  30        0


>   Member             Status
>   ------------------ ----------
>   10.0.0.2           Active     <-- You are here
>   10.0.0.3           Inactive
> 
>   Service        Status   Owner (Last)     Last Transition Chk Restarts
>   -------------- -------- ---------------- --------------- --- --------
>   httpd          started  10.0.0.2         13:39:06 Sep 23  30        0
>   mysql          started  10.0.0.2         13:39:06 Sep 23  30        0

The disk tiebreaker is working correctly.  Your nodes aren't
communicating over the private network (crossover cable, in your case),
though.

The cluster software doesn't do anything arcane.  You can try
double-checking UDP ping-ability (which is basically what the cluster
does, except it's one way instead of bidirectional) using this:

http://people.redhat.com/udping-1.0.tar.gz

Don't set it up as a cluster service, just start the server on one and
try to use udping to ping the other using the private IP.  Also try
obvious things like normal ping, broadcast ping, and ssh.  If these
don't work, you probably have a bad cable, incorrect routing rules, or
incorrect firewall rules.


Your configuration looks okay.  After you get the cluster working, stop
the cluster on both nodes.  Run this on one of them, and copy the
cluster configuration to the other node:

# cludb -p clumembd%broadcast_primary_only 1

You can also do this from one of the nodes:

# shutil -s /etc/cluster.xml

This will prevent the cluster from using public interfaces for
heartbeats, but is not critical in any way to get the cluster software
working.

-- Lon




From carlopmart at gmail.com  Mon Sep 26 15:33:38 2005
From: carlopmart at gmail.com (carlopmart at gmail.com)
Date: Mon, 26 Sep 2005 17:33:38 +0200
Subject: [Linux-cluster] Installing RHCS on RHEL 4
Message-ID: <433814D2.1060907@gmail.com>

Hi all,

  I would like to do some tests with RHCS (Cluster Suite) and RHEL 4 
under two virtual machines on GSX Server 3.2. At this point, I have some 
questions to configure two virtual nodes to do the tests:

  - Is it necessary to have a fence device?? Can I configure custer 
suite without it??

  - How many network interfaces I need for each virtual machine???

Thank you very much.

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From david.sullivan at activant.com  Mon Sep 19 00:40:23 2005
From: david.sullivan at activant.com (David.Sullivan)
Date: Sun, 18 Sep 2005 19:40:23 -0500
Subject: [Linux-cluster] RHCS 4.0 HowTo?
Message-ID: <BF537327.1BF2%david.sullivan@activant.com>


I'm completely new to clustering and am trying to set up a "proof of
concept" configuration in-house.  Our customers run a back-office POS server
and must currently do a manual failover that involves moving hard drives if
the server fails.  Using VMware Workstation 5, I have created several VM's
configured with dual NICs and RHEL 4.0/RHCS 4.0.  System-config-cluster
seems to abstract so much from the Administrator that I'm having trouble
getting things working.  Some specific things I don't understand:

*  How do you tell it which NIC to use as a heartbeat, and which to use for
the services offered?  I have a private LAN set up that I want to use for
heartbeat, but don't think it's being used at all.  Both the public and
private IP's are set up in /etc/hosts.

*  Building on the above, is it wise for other cluster services (e.g. dlm)
to communicate across the private heartbeat link?  If so, how do I configure
that?

*  How does one add members to an existing cluster?  I found lots online
(including Red Hat's Knowledgebase) about how to do it with previous
versions, but I'm very unclear about RHCS 4.0.  I cloned my first VM, reset
all it's networking data, reset the hostname, and deleted the
/etc/cluster/cluster.conf file, but the "master" node won't push the cluster
configuration to it, so it sits there brain-dead.

*  Insofar as hardware fencing is required with RHCS 4.0, will I be able to
demonstrate failover functionality to management at all?  I'm basically
looking for a means to automagically fail over to a "hot standby" server.

TIA!


Notice: This transmission is for the sole use of the intended recipient(s) and may contain information that is confidential and/or privileged.  If you are not the intended recipient, please delete this transmission and any attachments and notify the sender by return email immediately.  Any unauthorized review, use, disclosure or distribution is prohibited.




From ren at teamware-gmbh.de  Mon Sep 19 14:52:33 2005
From: ren at teamware-gmbh.de (=?iso-8859-1?B?UmVu6SBFbnNrYXQgW1RlYW13YXJlIEdtYkhd?=)
Date: Mon, 19 Sep 2005 16:52:33 +0200
Subject: [Linux-cluster] Nanny bad load average failure
Message-ID: <e67efa23abfc6b4f9c8cd74fc2dc2142@teamware-gmbh.de>

Hi list,

I still have this strange error.
I updated the clustersuite to the newest versions and i still get this
errors in my /var/log/messages but the servers are up with the old
ruptime version i get the loadaverage but the error in th elogfile was
the same "bad load average":

Sep 19 11:12:25 telemach nanny[24850]: bad load average returned:
telemach    down      0:32 telemach3   down      0:31 telemach4   down
0:31
Sep 19 11:12:25 telemach nanny[25253]: bad load average returned:
telemach    down      0:32 telemach3   down      0:31 telemach4   down
0:31
Sep 19 11:12:32 telemach nanny[24822]: bad load average returned:
telemach    down      0:32 telemach3   down      0:31 telemach4   down
0:31
Sep 19 11:12:38 telemach nanny[25225]: bad load average returned:
telemach    down      0:32 telemach3   down      0:31 telemach4   down
0:31
Sep 19 11:12:43 telemach nanny[24850]: bad load average returned:
telemach    down      0:33 telemach3   down      0:32 telemach4   down
0:31
Sep 19 11:12:43 telemach nanny[25253]: bad load average returned:
telemach    down      0:33 telemach3   down      0:32 telemach4   down
0:31

ipvsadm-1.24-6
piranha-0.8.0-1

How can i solve this?

Thx for HELP!






From Alain.Moulle at bull.net  Tue Sep 20 13:02:13 2005
From: Alain.Moulle at bull.net (Alain Moulle)
Date: Tue, 20 Sep 2005 15:02:13 +0200
Subject: [Linux-cluster] Question about heart-beat with CS4
Message-ID: <43300855.3050709@bull.net>

Hi

I wonder how the CS4 heart beat is managed :

1. suppose we have two interfaces eth0 and eth1,
   which one will be used ?
   or will the CS4 use both ?

2. is-it configurable somewhere ?

3. is the time period between each ping on heart beat
   configurable somewhere ?

4. Is there a risk to split the cluster when using
   only one ETH interface ?

Thanks
Alain Moull?




From pcaulfie at redhat.com  Tue Sep 27 07:00:29 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 27 Sep 2005 08:00:29 +0100
Subject: [Linux-cluster] Question about heart-beat with CS4
In-Reply-To: <43300855.3050709@bull.net>
References: <43300855.3050709@bull.net>
Message-ID: <4338EE0D.6090408@redhat.com>

Alain Moulle wrote:
> Hi

For a CMAN/DLM based cluster:

> I wonder how the CS4 heart beat is managed :
> 
> 1. suppose we have two interfaces eth0 and eth1,
>    which one will be used ?
>    or will the CS4 use both ?

It will use the interface bound to the IP address of the hostname by default. If
you want to use the otehr interface then specify the name associated with the
inerface on the command-line to "cman_tool join"

> 2. is-it configurable somewhere ?
> 
> 3. is the time period between each ping on heart beat
>    configurable somewhere ?

in /proc/cluster/config/cman/ there are files you can poke values into. This
must be done beteen loading the cman module and running cman_tool join.

> 4. Is there a risk to split the cluster when using
>    only one ETH interface ?

Yes. If you want to use both interfaces then join them using the Linux bonding
driver.

-- 

patrick



From htfrontier at gmail.com  Tue Sep 27 07:02:32 2005
From: htfrontier at gmail.com (Hanny Tidore)
Date: Tue, 27 Sep 2005 15:02:32 +0800
Subject: [Linux-cluster] Cluster cannot failover
Message-ID: <2fa0bfca05092700022fddb849@mail.gmail.com>

Hi,

I am installing Redhat Cluster Suite in 2 HP Proliant DL380G4 with HP
StorageWorks MSA500 G1.

I have 3 ethernet cards for each server. 1 card is used for heartbeat and 2
cards are configured as bond0 (bonding).

I have setup a service and the service runs on both node: node1 and node2. I
can swing the service from node1 to node2.

However, when I shutdown node1 (using shutdown -h now), the service which
was running in node1 is not restarted on node2.
I got the following error message in node2:

Sep 27 10:34:54 ppba-papp2 clusvcmgrd[1868]: <crit> Couldn't connect to
member #0: Connection timed out
Sep 27 10:34:54 ppba-papp2 clusvcmgrd[1868]: <err> Unable to obtain cluster
lock: No locks available
Sep 27 10:35:01 ppba-papp2 cluquorumd[1835]: <warning> Membership reports #0
as down, but disk reports as up: State uncertain!
Sep 27 10:35:05 ppba-papp2 clusvcmgrd[1868]: <warning> Member ppba-papp1's
state is uncertain: Some services may be unavailable!

Is this test scenario valid ? Is it ok to test the Redhat Cluster by
shutting down the server ? What could have gone wrong ?

Thanks.
Hanny
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050927/58cf3ed5/attachment.htm>

From andreseso at gmail.com  Tue Sep 27 15:08:59 2005
From: andreseso at gmail.com (Andreso)
Date: Tue, 27 Sep 2005 17:08:59 +0200
Subject: [Linux-cluster] Cluster cannot failover
In-Reply-To: <2fa0bfca05092700022fddb849@mail.gmail.com>
References: <2fa0bfca05092700022fddb849@mail.gmail.com>
Message-ID: <1a9f4169050927080875bfde68@mail.gmail.com>

On 9/27/05, Hanny Tidore <htfrontier at gmail.com> wrote:
>  I am installing Redhat Cluster Suite in 2 HP Proliant DL380G4 with HP
> StorageWorks MSA500 G1.

I sure hope that that is not the cluster suite that comes with Redhat
Enterprise Linux Advanced Server 2.1

The reason I say so is that on a cluster with shared storage over SCSI
when I upgraded the RHEL 2.1 kernel the machines became unbootable. 
Something about the RHEL 2.1 upgrade kernels not containing the
megaraid2 kernel module.  Anyways the machines would not boot and I
commited the newbie mistake of uninstalling the old kernel in the hope
that the new kernel would work.  I had to reinstall one of the cluster
members.   You know the saying:  person that repeats the same steps
hoping to get different results -> windows user.

If you do not want to pay for RHEL 3 or 4 you can go with CENTOS.  If
you go with RHEL I recommend paying for support from RedHat as time
can be lost debugging the cluster.

Andres



From carlopmart at gmail.com  Tue Sep 27 16:22:54 2005
From: carlopmart at gmail.com (carlopmart at gmail.com)
Date: Tue, 27 Sep 2005 18:22:54 +0200
Subject: [Linux-cluster] This configuration could work??
Message-ID: <433971DE.4020302@gmail.com>

Hi all,

  I would like to test RHGS with RHCS, but I have got only one server. 
My idea is to use VMWare GSX Server. My explain:

  - VMWare GSX host server acts as GFS server.
  - Two virtual machines with RHCS using fence_gnbd as fence device 
connected to GFS server.

  is it possible to do this configuration with only one server using GFS???

Thank you very much for your help and sorry my bad english.

P.D: I will use CentOS 4.1 for host and virtual machines.

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From lhh at redhat.com  Tue Sep 27 17:12:55 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 27 Sep 2005 13:12:55 -0400
Subject: [Linux-cluster] Installing RHCS on RHEL 4
In-Reply-To: <433814D2.1060907@gmail.com>
References: <433814D2.1060907@gmail.com>
Message-ID: <1127841175.26042.76.camel@ayanami.boston.redhat.com>

On Mon, 2005-09-26 at 17:33 +0200, carlopmart at gmail.com wrote:
> Hi all,
> 
>   I would like to do some tests with RHCS (Cluster Suite) and RHEL 4 
> under two virtual machines on GSX Server 3.2. At this point, I have some 
> questions to configure two virtual nodes to do the tests:
> 
>   - Is it necessary to have a fence device?? Can I configure custer 
> suite without it??

Sort of.  Fencing is required.  You may just want to write one for
VMWare GSX server which tells the server to power-off the machine.

>   - How many network interfaces I need for each virtual machine???

1 is fine, it depends on your needs.

-- Lon




From lhh at redhat.com  Tue Sep 27 17:17:03 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 27 Sep 2005 13:17:03 -0400
Subject: [Linux-cluster] Re: Four partitions mounted for one server --
	samba fails
In-Reply-To: <1a9f41690509230410239f1432@mail.gmail.com>
References: <1a9f4169050923032278a31093@mail.gmail.com>
	<1a9f41690509230410239f1432@mail.gmail.com>
Message-ID: <1127841423.26042.81.camel@ayanami.boston.redhat.com>

On Fri, 2005-09-23 at 13:10 +0200, Andreso wrote:
> I have looked at the /var/log/cluster and here I will reproduce all
> the messages.  I have set the log level on all the cluster daemons to
> INFO
> 
> In effect it is a samba issue
> 
> Sep 23 12:53:46 hercules clusvcmgrd: [16076]: <warning> service warning: share_s
> tart_stop: Samba configuration file /etc/samba/smb.conf.not found does not exist
> .

!

It's looking for "/etc/samba/smb.conf.not found".  Strange...  Your
configuration makes it think it's a samba service, but when we try to
get the share name, it's reporting "not found".

That's odd.

> service: line 220: [: status: integer expression expected

That's a bug.


> I set up a /etc/samba/smb.conf.not file with just a global section
>  [global]
>         workgroup = RHCLUSTER
>         pid directory = /var/run/samba/not
>         lock directory = /var/cache/samba/not
>         log file = /var/log/samba/%m.log
>         encrypt passwords = yes
>         bind interfaces only = yes
>         interfaces = 10.64.34.141/255.255.255.0
> 
> 
> To the best of my knowledge /etc/samba/smb.conf.not is not documented

There should be a share name with the device; that's what it's looking
for.

Please file a bugzilla and paste in or attach your cluster.xml.

-- Lon




From lhh at redhat.com  Tue Sep 27 17:18:29 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 27 Sep 2005 13:18:29 -0400
Subject: R: [Linux-cluster] Re: How do I use a cross over cable to set
	upquorum?
In-Reply-To: <F63C63702B77F94CB4664B4D88660C4D4BAC9A@kmail.ksolutions.it>
References: <F63C63702B77F94CB4664B4D88660C4D4BAC9A@kmail.ksolutions.it>
Message-ID: <1127841509.26042.84.camel@ayanami.boston.redhat.com>

On Fri, 2005-09-23 at 09:41 +0200, Baesso Mirko wrote:
> Hi, i need to setup a 2 node cluster and i would like to use the disk based tiebreaker or ip tiebreaker, but I see there is no cluquorumd command. I'm using RHCS 4 (NO GFS) on kernel 2.6.9.11
> Could you help me
> Thanks

No such thing in linux-cluster (or RHCS4).  Instead, there's a special
"two node" mode for CMAN to run in, which uses fencing to ensure that a
split brain doesn't occur.  If you use the GUI to configure the cluster,
this mode is set automatically when only two nodes are present in the
configuration file.

-- Lon




From lhh at redhat.com  Tue Sep 27 17:19:58 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 27 Sep 2005 13:19:58 -0400
Subject: [Linux-cluster] This configuration could work??
In-Reply-To: <433971DE.4020302@gmail.com>
References: <433971DE.4020302@gmail.com>
Message-ID: <1127841598.26042.87.camel@ayanami.boston.redhat.com>

On Tue, 2005-09-27 at 18:22 +0200, carlopmart at gmail.com wrote:
> Hi all,
> 
>   I would like to test RHGS with RHCS, but I have got only one server. 
> My idea is to use VMWare GSX Server. My explain:
> 
>   - VMWare GSX host server acts as GFS server.
>   - Two virtual machines with RHCS using fence_gnbd as fence device 
> connected to GFS server.
> 
>   is it possible to do this configuration with only one server using GFS???
> 
> Thank you very much for your help and sorry my bad english.
> 
> P.D: I will use CentOS 4.1 for host and virtual machines.
> 

It should work, but typically multiple physical machines are used to
achieve HA, as HA works around hardware failures (or should).  If your
host server fails, your entire cluster fails.

-- Lon



From lhh at redhat.com  Tue Sep 27 17:51:26 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 27 Sep 2005 13:51:26 -0400
Subject: [Linux-cluster] RHCS 4.0 HowTo?
In-Reply-To: <BF537327.1BF2%david.sullivan@activant.com>
References: <BF537327.1BF2%david.sullivan@activant.com>
Message-ID: <1127843486.26042.104.camel@ayanami.boston.redhat.com>

On Sun, 2005-09-18 at 19:40 -0500, David.Sullivan wrote:
> I'm completely new to clustering and am trying to set up a "proof of
> concept" configuration in-house.  Our customers run a back-office POS server
> and must currently do a manual failover that involves moving hard drives if
> the server fails.  Using VMware Workstation 5, I have created several VM's
> configured with dual NICs and RHEL 4.0/RHCS 4.0.  System-config-cluster
> seems to abstract so much from the Administrator that I'm having trouble
> getting things working.  Some specific things I don't understand:
> 
> *  How do you tell it which NIC to use as a heartbeat, and which to use for
> the services offered?  I have a private LAN set up that I want to use for
> heartbeat, but don't think it's being used at all.  Both the public and
> private IP's are set up in /etc/hosts.

With RHCS4, it goes by "uname -n".  An easy thing to do is to set up
dummy hostnames matching the IP on the private network and set your
hostnames to them.

e.g. 10.1.1.1 node1
     10.1.1.2 node2

...then set hostnames to node1 and node2.

Service IPs are (and have always been) selected based on matching
specified IPs to already existing IPs on NICs.  E.g. 192.168.2.10/24
would go on the same NIC as 192.168.2.1/24, even if it's a different
interface from what the cluster is using for internal communication.


> *  Building on the above, is it wise for other cluster services (e.g. dlm)
> to communicate across the private heartbeat link?  If so, how do I configure
> that?

I think DLM will use the private network, as will rgmanager, and
everything (except services).


> *  Insofar as hardware fencing is required with RHCS 4.0, will I be able to
> demonstrate failover functionality to management at all?  I'm basically
> looking for a means to automagically fail over to a "hot standby" server.

Automagically won't work (and is *dangerous*).  You'll have to use
manual fencing.  However, you could probably put together a fencing
agent which asked the VMWare server to power off a guest...

-- Lon



From lhh at redhat.com  Tue Sep 27 17:52:39 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 27 Sep 2005 13:52:39 -0400
Subject: [Linux-cluster] Question about heart-beat with CS4
In-Reply-To: <4338EE0D.6090408@redhat.com>
References: <43300855.3050709@bull.net>  <4338EE0D.6090408@redhat.com>
Message-ID: <1127843559.26042.106.camel@ayanami.boston.redhat.com>

On Tue, 2005-09-27 at 08:00 +0100, Patrick Caulfield wrote:

> 
> It will use the interface bound to the IP address of the hostname by default. If
> you want to use the otehr interface then specify the name associated with the
> inerface on the command-line to "cman_tool join"

Just a thought -- would it be useful to add that as an option
to /etc/sysconfig/cman ?  I don't think it's currently done (not sure
though)

-- Lon



From eric at bootseg.com  Tue Sep 27 18:27:07 2005
From: eric at bootseg.com (Eric Kerin)
Date: Tue, 27 Sep 2005 14:27:07 -0400
Subject: [Linux-cluster] RHCS 4.0 HowTo?
In-Reply-To: <1127843486.26042.104.camel@ayanami.boston.redhat.com>
References: <BF537327.1BF2%david.sullivan@activant.com>
	<1127843486.26042.104.camel@ayanami.boston.redhat.com>
Message-ID: <1127845627.4501.40.camel@auh5-0479.corp.jabil.org>

On Tue, 2005-09-27 at 13:51 -0400, Lon Hohberger wrote:
> On Sun, 2005-09-18 at 19:40 -0500, David.Sullivan wrote:
> > *  Insofar as hardware fencing is required with RHCS 4.0, will I be able to
> > demonstrate failover functionality to management at all?  I'm basically
> > looking for a means to automagically fail over to a "hot standby" server.
> 
> Automagically won't work (and is *dangerous*).  You'll have to use
> manual fencing.  However, you could probably put together a fencing
> agent which asked the VMWare server to power off a guest...
> 

There's already a fence agent for VMWare guests in CVS HEAD, just
download the file, and place in your /sbin directory.  Then you should
be able to use fence_vmware as your agent in
your /etc/cluster/cluster.conf file.

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/vmware/?cvsroot=cluster

Looks like it requires the VMWare tools on the cluster nodes, but that
should be no big deal.

Thanks,
Eric Kerin
eric at bootseg.com



From lhh at redhat.com  Tue Sep 27 18:48:57 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 27 Sep 2005 14:48:57 -0400
Subject: [Linux-cluster] RHCS 4.0 HowTo?
In-Reply-To: <1127845627.4501.40.camel@auh5-0479.corp.jabil.org>
References: <BF537327.1BF2%david.sullivan@activant.com>
	<1127843486.26042.104.camel@ayanami.boston.redhat.com>
	<1127845627.4501.40.camel@auh5-0479.corp.jabil.org>
Message-ID: <1127846937.26042.108.camel@ayanami.boston.redhat.com>

On Tue, 2005-09-27 at 14:27 -0400, Eric Kerin wrote:
> On Tue, 2005-09-27 at 13:51 -0400, Lon Hohberger wrote:
> > On Sun, 2005-09-18 at 19:40 -0500, David.Sullivan wrote:
> > > *  Insofar as hardware fencing is required with RHCS 4.0, will I be able to
> > > demonstrate failover functionality to management at all?  I'm basically
> > > looking for a means to automagically fail over to a "hot standby" server.
> > 
> > Automagically won't work (and is *dangerous*).  You'll have to use
> > manual fencing.  However, you could probably put together a fencing
> > agent which asked the VMWare server to power off a guest...
> > 
> 
> There's already a fence agent for VMWare guests in CVS HEAD, just
> download the file, and place in your /sbin directory.  Then you should
> be able to use fence_vmware as your agent in
> your /etc/cluster/cluster.conf file.
> 
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/agents/vmware/?cvsroot=cluster
> 
> Looks like it requires the VMWare tools on the cluster nodes, but that
> should be no big deal.

This is what I get for not looking first ;)

-- Lon



From tom-fedora at kofler.eu.org  Tue Sep 27 21:16:57 2005
From: tom-fedora at kofler.eu.org (tom-fedora at kofler.eu.org)
Date: Tue, 27 Sep 2005 23:16:57 +0200
Subject: [Linux-cluster] GFS LogVol00cluster.1: withdrawn / rejecting I/O to
	dead device
Message-ID: <000301c5c3a8$ce55d1e0$2c01380a@TheCenter>

Hi,

we are building a HA cluster with GFS6.1 and Fedora Core 4

Our SAN box had an outage and was then reconnected.

Now, we are unable to mount the clusterfilesystem gfs.

Sep 27 20:05:19 www5 kernel: scsi2 (0:0): rejecting I/O to dead device
Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: fatal:
I/O error
Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:   block
= 9498835
Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:
function = gfs_logbh_wait
Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:   file
= /usr/src/build/607778-i686/BUILD/smp/src/gfs/dio.c, line = 923
Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:   time
= 1127844319
Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: about
to withdraw from the cluster
Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: waiting
for outstanding I/O
Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: telling
LM to withdraw
Sep 27 20:05:19 www5 kernel: lock_dlm: withdraw abandoned memory
Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:
withdrawn
Sep 27 20:05:43 www5 kernel: scsi2 (0:0): rejecting I/O to dead device
Sep 27 20:05:43 www5 kernel: Buffer I/O error on device dm-3, logical block
20971504
Sep 27 20:05:43 www5 kernel: scsi2 (0:0): rejecting I/O to dead device
Sep 27 20:05:43 www5 kernel: Buffer I/O error on device dm-3, logical block
20971504

Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device
Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block
20971504
Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device
Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block
20971504
Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device
Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block
0

Rejecting/lm withdraw did not appear on the third node, also lm withdraw did
not appear on www3

[root at www4 ~]# mount /mnt/ /dev/VolGroupDaten01/LogVol00cluster -t gfs
mount: /mnt/ is not a block device

We need to avoid restarting the server nodes - the volume groups so far are
visible and access with eg. fisk is possible.
Another single server which only uses a non-cluster LVM2 volume mount worked
without reboot.

Any help would be really welcome,

Thanks
Thomas

[root at www3 ~]# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "VolGroupDaten02" using metadata type lvm2
  Found volume group "VolGroupDaten01" using metadata type lvm2

[root at www3 ~]# lvdisplay VolGroupDaten01
  --- Logical volume ---
  LV Name                /dev/VolGroupDaten01/LogVol00cluster
  VG Name                VolGroupDaten01
  LV UUID                o38bnG-sLSi-WhUJ-47Bs-3u6g-qSUm-5yBkNr
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                80.00 GB
  Current LE             20480
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:1

[root at www3 ~]# pvdisplay

...
...
...

  --- Physical volume ---
  PV Name               /dev/sde
  VG Name               VolGroupDaten01
  PV Size               540.00 GB / not usable 0
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              138239
  Free PE               117759
  Allocated PE          20480
  PV UUID               oVeByo-8IoA-qFlt-fsN9-ULAR-xUju-niLTEO




[root at www3 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 2
Cluster name: xxxcluster
Cluster ID: 57396
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 3
Total_votes: 3
Quorum: 2
Active subsystems: 3
Node name: www3.xxx.cc
Node addresses: 192.168.2.23

[root at www3 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    3   M   www5.xxx.cc
   2    1    3   M   www4.xxx.cc
   3    1    3   M   www3.xxx.cc


<?xml version="1.0"?>
<cluster name="xxxcluster" config_version="3">
  <clusternodes>
    <clusternode name="www5.xxx.cc" votes="1">
     <fence>
      <method name="single">
       <device name="human" ipaddr="192.168.2.25"/>
     </method>
    </fence>
   </clusternode>
   <clusternode name="www3.xxx.cc" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.168.2.23"/>
     </method>
    </fence>
    </clusternode>
   <clusternode name="www4.xxx.cc" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.168.2.24"/>
     </method>
    </fence>
  </clusternode>
 </clusternodes>
<fence_devices>
 <fence_device name="human" agent="fence_manual"/>
</fence_devices>
</cluster>
[root at www3 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="xxxcluster" config_version="3">
  <clusternodes>
    <clusternode name="www5.xxx.cc" votes="1">
     <fence>
      <method name="single">
       <device name="human" ipaddr="192.168.2.25"/>
     </method>
    </fence>
   </clusternode>
   <clusternode name="www3.xxx.cc" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.168.2.23"/>
     </method>
    </fence>
    </clusternode>
   <clusternode name="www4.xxx.cc" votes="1">
    <fence>
     <method name="single">
       <device name="human" ipaddr="192.168.2.24"/>
     </method>
    </fence>
  </clusternode>
 </clusternodes>
<fence_devices>
 <fence_device name="human" agent="fence_manual"/>
</fence_devices>
</cluster>



From sgray at bluestarinc.com  Wed Sep 28 02:20:55 2005
From: sgray at bluestarinc.com (Sean Gray)
Date: Tue, 27 Sep 2005 22:20:55 -0400
Subject: [Linux-cluster] GFS LogVol00cluster.1: withdrawn / rejecting
	I/O to dead device
In-Reply-To: <000301c5c3a8$ce55d1e0$2c01380a@TheCenter>
References: <000301c5c3a8$ce55d1e0$2c01380a@TheCenter>
Message-ID: <1127874055.3736.250.camel@localhost.localdomain>

Thomas,

Double check your mount command it should read "mount -t gfs
<devicename> <mountpoint>. 

Boot the bad node and check it with clustat, if OK try restarting fenced
an clvmd.

# clustat
# /etc/init.d/fenced restart
# /etc/init.d/clvmd restart
# mount -t gfs <devicename> <mountpoint>

For some reason it may require a few tries.

Sean

On Tue, 2005-09-27 at 23:16 +0200, tom-fedora at kofler.eu.org wrote:

> Hi,
> 
> we are building a HA cluster with GFS6.1 and Fedora Core 4
> 
> Our SAN box had an outage and was then reconnected.
> 
> Now, we are unable to mount the clusterfilesystem gfs.
> 
> Sep 27 20:05:19 www5 kernel: scsi2 (0:0): rejecting I/O to dead device
> Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: fatal:
> I/O error
> Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:   block
> = 9498835
> Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:
> function = gfs_logbh_wait
> Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:   file
> = /usr/src/build/607778-i686/BUILD/smp/src/gfs/dio.c, line = 923
> Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:   time
> = 1127844319
> Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: about
> to withdraw from the cluster
> Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: waiting
> for outstanding I/O
> Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1: telling
> LM to withdraw
> Sep 27 20:05:19 www5 kernel: lock_dlm: withdraw abandoned memory
> Sep 27 20:05:19 www5 kernel: GFS: fsid=xxxcluster:LogVol00cluster.1:
> withdrawn
> Sep 27 20:05:43 www5 kernel: scsi2 (0:0): rejecting I/O to dead device
> Sep 27 20:05:43 www5 kernel: Buffer I/O error on device dm-3, logical block
> 20971504
> Sep 27 20:05:43 www5 kernel: scsi2 (0:0): rejecting I/O to dead device
> Sep 27 20:05:43 www5 kernel: Buffer I/O error on device dm-3, logical block
> 20971504
> 
> Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device
> Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block
> 20971504
> Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device
> Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block
> 20971504
> Sep 27 20:52:17 www3 kernel: scsi2 (0:0): rejecting I/O to dead device
> Sep 27 20:52:17 www3 kernel: Buffer I/O error on device dm-1, logical block
> 0
> 
> Rejecting/lm withdraw did not appear on the third node, also lm withdraw did
> not appear on www3
> 
> [root at www4 ~]# mount /mnt/ /dev/VolGroupDaten01/LogVol00cluster -t gfs
> mount: /mnt/ is not a block device
> 
> We need to avoid restarting the server nodes - the volume groups so far are
> visible and access with eg. fisk is possible.
> Another single server which only uses a non-cluster LVM2 volume mount worked
> without reboot.
> 
> Any help would be really welcome,
> 
> Thanks
> Thomas
> 
> [root at www3 ~]# vgscan
>   Reading all physical volumes.  This may take a while...
>   Found volume group "VolGroupDaten02" using metadata type lvm2
>   Found volume group "VolGroupDaten01" using metadata type lvm2
> 
> [root at www3 ~]# lvdisplay VolGroupDaten01
>   --- Logical volume ---
>   LV Name                /dev/VolGroupDaten01/LogVol00cluster
>   VG Name                VolGroupDaten01
>   LV UUID                o38bnG-sLSi-WhUJ-47Bs-3u6g-qSUm-5yBkNr
>   LV Write Access        read/write
>   LV Status              available
>   # open                 0
>   LV Size                80.00 GB
>   Current LE             20480
>   Segments               1
>   Allocation             inherit
>   Read ahead sectors     0
>   Block device           253:1
> 
> [root at www3 ~]# pvdisplay
> 
> ...
> ...
> ...
> 
>   --- Physical volume ---
>   PV Name               /dev/sde
>   VG Name               VolGroupDaten01
>   PV Size               540.00 GB / not usable 0
>   Allocatable           yes
>   PE Size (KByte)       4096
>   Total PE              138239
>   Free PE               117759
>   Allocated PE          20480
>   PV UUID               oVeByo-8IoA-qFlt-fsN9-ULAR-xUju-niLTEO
> 
> 
> 
> 
> [root at www3 ~]# cman_tool status
> Protocol version: 5.0.1
> Config version: 2
> Cluster name: xxxcluster
> Cluster ID: 57396
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 3
> Expected_votes: 3
> Total_votes: 3
> Quorum: 2
> Active subsystems: 3
> Node name: www3.xxx.cc
> Node addresses: 192.168.2.23
> 
> [root at www3 ~]# cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    3   M   www5.xxx.cc
>    2    1    3   M   www4.xxx.cc
>    3    1    3   M   www3.xxx.cc
> 
> 
> <?xml version="1.0"?>
> <cluster name="xxxcluster" config_version="3">
>   <clusternodes>
>     <clusternode name="www5.xxx.cc" votes="1">
>      <fence>
>       <method name="single">
>        <device name="human" ipaddr="192.168.2.25"/>
>      </method>
>     </fence>
>    </clusternode>
>    <clusternode name="www3.xxx.cc" votes="1">
>     <fence>
>      <method name="single">
>        <device name="human" ipaddr="192.168.2.23"/>
>      </method>
>     </fence>
>     </clusternode>
>    <clusternode name="www4.xxx.cc" votes="1">
>     <fence>
>      <method name="single">
>        <device name="human" ipaddr="192.168.2.24"/>
>      </method>
>     </fence>
>   </clusternode>
>  </clusternodes>
> <fence_devices>
>  <fence_device name="human" agent="fence_manual"/>
> </fence_devices>
> </cluster>
> [root at www3 ~]# cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster name="xxxcluster" config_version="3">
>   <clusternodes>
>     <clusternode name="www5.xxx.cc" votes="1">
>      <fence>
>       <method name="single">
>        <device name="human" ipaddr="192.168.2.25"/>
>      </method>
>     </fence>
>    </clusternode>
>    <clusternode name="www3.xxx.cc" votes="1">
>     <fence>
>      <method name="single">
>        <device name="human" ipaddr="192.168.2.23"/>
>      </method>
>     </fence>
>     </clusternode>
>    <clusternode name="www4.xxx.cc" votes="1">
>     <fence>
>      <method name="single">
>        <device name="human" ipaddr="192.168.2.24"/>
>      </method>
>     </fence>
>   </clusternode>
>  </clusternodes>
> <fence_devices>
>  <fence_device name="human" agent="fence_manual"/>
> </fence_devices>
> </cluster>
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

Sean N. Gray
Director of Information Technology
United Radio Incorporated, DBA BlueStar
24 Spiral Drive
Florence, Kentucky 41042
office: 859.371.4423 x263
toll free: 800.371.4423 x263
fax: 859.371.4425
mobile: 513.616.3379
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050927/84ac12b2/attachment.htm>

From pcaulfie at redhat.com  Wed Sep 28 06:51:45 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 28 Sep 2005 07:51:45 +0100
Subject: [Linux-cluster] Question about heart-beat with CS4
In-Reply-To: <1127843559.26042.106.camel@ayanami.boston.redhat.com>
References: <43300855.3050709@bull.net> <4338EE0D.6090408@redhat.com>
	<1127843559.26042.106.camel@ayanami.boston.redhat.com>
Message-ID: <433A3D81.4040307@redhat.com>

Lon Hohberger wrote:
> On Tue, 2005-09-27 at 08:00 +0100, Patrick Caulfield wrote:
> 
> 
>>It will use the interface bound to the IP address of the hostname by default. If
>>you want to use the otehr interface then specify the name associated with the
>>inerface on the command-line to "cman_tool join"
> 
> 
> Just a thought -- would it be useful to add that as an option
> to /etc/sysconfig/cman ?  I don't think it's currently done (not sure
> though)
> 

Yes I think it would be very useful.
-- 

patrick



From carlopmart at gmail.com  Wed Sep 28 08:14:03 2005
From: carlopmart at gmail.com (carlopmart at gmail.com)
Date: Wed, 28 Sep 2005 10:14:03 +0200
Subject: [Linux-cluster] Question aout fence_vmware.pl
Message-ID: <433A50CB.50509@gmail.com>

Hi all,

  Searching mailing list I have found tis interesting thread: 
https://www.redhat.com/archives/linux-cluster/2005-September/msg00014.html. 
   Is it possible to use this fence module under GSX ??? Where can I 
find examples to use?? I visited Zach's web without success.


-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From tom-fedora at kofler.eu.org  Wed Sep 28 11:00:30 2005
From: tom-fedora at kofler.eu.org (Thomas Kofler)
Date: Wed, 28 Sep 2005 13:00:30 +0200
Subject: [Linux-cluster] Question aout fence_vmware.pl
Message-ID: <1127905230.433a77cef24b0@mail.devcon.cc>

Hi, 

the file itself can be found at: 

http://sources.redhat.com/cgi- 
bin/cvsweb.cgi/cluster/fence/agents/vmware/fence_vmware.pl?cvsroot=cluster 

The usage/parameters are documented in the file. I think it is using the perl 
API from Vmware - its also available for GSX and should be compatible. 

"VMware GSX Server provides an easy to use API for control and management. 
Perl and COM interfaces and command line tools ..." 
http://www.vmware.com/support/developer/ 

Would be worth a try - best luck, we are waiting for feedback 

Regards, 
Thomas 

Quoting "carlopmart at gmail.com" <carlopmart at gmail.com>: 

> Hi all, 
> 
>   Searching mailing list I have found tis interesting thread: 
> https://www.redhat.com/archives/linux-cluster/2005-September/msg00014.html. 
> 
>    Is it possible to use this fence module under GSX ??? Where can I 
> find examples to use?? I visited Zach's web without success. 
> 
> 
> -- 
> CL Martinez 
> carlopmart {at} gmail {d0t} com 
> 



 



From fseoane at intelsis.com  Fri Sep 30 11:08:46 2005
From: fseoane at intelsis.com (Felipe Seoane)
Date: Fri, 30 Sep 2005 13:08:46 +0200
Subject: [Linux-cluster] A question about file locks on node fails
Message-ID: <8C5D52BA5F40014CA753D0C04957BF73625781@icorreo.correo2003.com>

Hi all,
Suppouse that we have 3 nodes (All of them are gulm servers) that mounts a GFS shared filesystem in /raid. The directory structure is the next:
         /raid
         |-/node1dir
         |-/node2dir
And suppouse that the node1 writes files only in node1dir and node2 only in node2dir while node3 reads files from both directories.
My question is:
If node1 fails (then it will fenced) while it was writting a file in node1dir, could be able node3 to read a file in node1dir (not the file that node1 was just writting when had failed)?



From pcaulfie at redhat.com  Fri Sep 30 13:44:20 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 30 Sep 2005 14:44:20 +0100
Subject: [Linux-cluster] new userland cman
Message-ID: <433D4134.6080608@redhat.com>

This has got to the stage where I'd be grateful for any testing other people can
 do, though obviously don't endanger a production system!

You should be able to run the DLM and GFS on this, see

https://www.redhat.com/archives/linux-cluster/2005-September/msg00177.html

for (very) brief instructions. There is a new clvm patch available in the
cluster CVS at cman/lib/clvmd-libcman.diff

Here's a list of the user-visible changes, please feel free to ask questions on
the list.

good
----
- (optional) encryption & authentication of communications
- Multiple interface support (unfinished, needs AIS and cman work)
- Automatic re-reading of CCS if a new node joins with an updated config file

bad
---
- Always uses CCS (cman_tool join -X removed)*
- Compulsory static node IDs (easily enforced by GUI or command-line)
- Can't have multiple clusters using the same port number unless they use a
different encryption key. Currently cluster name is ignored.**
- Hard limit to size of cluster (set at compile time to 32 currently)***

neutral
-------
- Always uses multicast (no broadcast). A default multicast address is supplied
if none is given
- libcman is the only API ( a compatible libcman is available for the kernel
version)
- Simplified CCS schema, but will read old one if it has nodeids in it.****

internal
--------
- Usable messaging API
- Robust membership algorithm
- Community involvement, multiple developers.


* I very much doubt that anyone will notice apart from maybe Dave & me

** Could fix this in AIS, but I'm not sure the patch would be popular upstream.
It's much more efficient to run them on different ports or multicast addresses
anyway. Incidentally: DON'T run an encrypted and a non-encrypted cluster on the
same port & multicast address (not that you would!) - the non-encrypted ones
will crash.

*** I doubt that the old cman worked well above 30 nodes anyway. I intend to do
some AIS hacking to improve this situation by drastically reducing the network
packet size.

**** The main difference here is that the multicast address need only be
specified once, in the <cman> section of cluster.conf. The interface used will
be the one that is bound to the hostname mentioned.


patrick

-- 

patrick



From Kevin.Ketchum at McKesson.com  Fri Sep 30 14:51:42 2005
From: Kevin.Ketchum at McKesson.com (Ketchum, Kevin)
Date: Fri, 30 Sep 2005 10:51:42 -0400
Subject: [Linux-cluster] GFS Performance observations/questions
Message-ID: <B1F6F2DA362BD2118EA30008C75618E30BA8247E@eugexc01ntms.eugalg.hboc.com>

We are evaluating GFS as a solution for a clustered application
environment. 

We have set up a 3 node cluster.  Each machine is connected to an EMC
SAN, all point to the same pool. 

During some benchmark testing we have observed the following
performance: 

On the local drive, it takes about 10 microseconds to lock a file, and 2
microseconds to unlock the file. 
On the gfs drive, it takes over 10,000 microseconds (0.01 sec) to lock a
file, and over 6000 microseconds (0.006 sec) to unlock the file. We have
seen it take considerably longer ...

Is this the expected performance?  Are there any tuning options
available to us? 

If you need more information to help answer this question or provide
guidance, please ask. 

Thanks 
Kevin Ketchum 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050930/105f777c/attachment.htm>

From sgray at bluestarinc.com  Fri Sep 30 20:44:55 2005
From: sgray at bluestarinc.com (Sean Gray)
Date: Fri, 30 Sep 2005 16:44:55 -0400
Subject: [Linux-cluster] script to enable, disable, start,
	stop and query status of cluster services
Message-ID: <1128113095.3539.1148.camel@localhost.localdomain>

I found the below useful for my cluster testing, enjoy!

#!/bin/bash
# Name: cluster
# Authors: Sean Gray <me @ seangray.com, sgray @ bluestarinc.com>
# Copyright 2005 under the GPL
# Version 0.1
# Enable, disable, start, stop and query status of cluster services
# on RHEL4.
#
SERVICES="ccsd cman lock_gulmd fenced clvmd rgmanager gfs"
STARTORDER="ccsd cman lock_gulmd fenced clvmd gfs rgmanager"
STOPORDER="rgmanager gfs clvmd fenced lock_gulmd cman ccsd"
enableStuff (){
        for SERVICE in `echo $SERVICES`;
        do
                chkconfig --level 2345 $SERVICE on;
        done;
        for SERVICE in `echo $SERVICES`;
        do
                chkconfig --list $SERVICE;
        done;

}

disableStuff (){
        for SERVICE in `echo $SERVICES`;
        do
                chkconfig --level 2345 $SERVICE off;
        done;
        for SERVICE in `echo $SERVICES`;
        do
                chkconfig --list $SERVICE;
        done;
}

startStuff (){
        for SERVICE in `echo $STARTORDER`;
        do
                service $SERVICE start;
        done;
}

stopStuff (){
        for SERVICE in `echo $STOPORDER`;
        do
                service $SERVICE stop;
        done;
}

serviceStatus (){
        for SERVICE in `echo $SERVICES`;
        do

                echo -e "\033[36m $SERVICE \033[0m"
                service $SERVICE status;
                echo -e "\n"
        done;
}


case $1 in
        "enable" )
        enableStuff
        ;;

        "disable" )
        disableStuff
        ;;

        "start" )
        startStuff
        ;;

        "stop" )
        stopStuff
        ;;

        "status" )
        serviceStatus
        ;;

        * )
        echo -e "Usage: `basename $0` {enable|disable|start|stop|
status}"
        ;;
esac

Sean N. Gray
Director of Information Technology
United Radio Incorporated, DBA BlueStar
24 Spiral Drive
Florence, Kentucky 41042
office: 859.371.4423 x263
toll free: 800.371.4423 x263
fax: 859.371.4425
mobile: 513.616.3379
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050930/eca79193/attachment.htm>

From jharr at opsource.net  Fri Sep 30 22:12:03 2005
From: jharr at opsource.net (Jeff Harr)
Date: Fri, 30 Sep 2005 23:12:03 +0100
Subject: [Linux-cluster] script to enable, disable, start,
	stop and query status of cluster services
Message-ID: <38A48FA2F0103444906AD22E14F1B5A3015F4306@mailxchg01.corp.opsource.net>

Hey, that is cool -thanks man :-)

 

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sean Gray
Sent: Friday, September 30, 2005 4:45 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] script to enable, disable, start,stop and query
status of cluster services

 

I found the below useful for my cluster testing, enjoy!

#!/bin/bash
# Name: cluster
# Authors: Sean Gray <me @ seangray.com, sgray @ bluestarinc.com>
# Copyright 2005 under the GPL
# Version 0.1
# Enable, disable, start, stop and query status of cluster services
# on RHEL4.
#
SERVICES="ccsd cman lock_gulmd fenced clvmd rgmanager gfs"
STARTORDER="ccsd cman lock_gulmd fenced clvmd gfs rgmanager"
STOPORDER="rgmanager gfs clvmd fenced lock_gulmd cman ccsd"
enableStuff (){
        for SERVICE in `echo $SERVICES`;
        do
                chkconfig --level 2345 $SERVICE on;
        done;
        for SERVICE in `echo $SERVICES`;
        do
                chkconfig --list $SERVICE;
        done;

}

disableStuff (){
        for SERVICE in `echo $SERVICES`;
        do
                chkconfig --level 2345 $SERVICE off;
        done;
        for SERVICE in `echo $SERVICES`;
        do
                chkconfig --list $SERVICE;
        done;
}

startStuff (){
        for SERVICE in `echo $STARTORDER`;
        do
                service $SERVICE start;
        done;
}

stopStuff (){
        for SERVICE in `echo $STOPORDER`;
        do
                service $SERVICE stop;
        done;
}

serviceStatus (){
        for SERVICE in `echo $SERVICES`;
        do

                echo -e "\033[36m $SERVICE \033[0m"
                service $SERVICE status;
                echo -e "\n"
        done;
}


case $1 in
        "enable" )
        enableStuff
        ;;

        "disable" )
        disableStuff
        ;;

        "start" )
        startStuff
        ;;

        "stop" )
        stopStuff
        ;;

        "status" )
        serviceStatus
        ;;

        * )
        echo -e "Usage: `basename $0`
{enable|disable|start|stop|status}"
        ;;
esac

Sean N. Gray <mailto:sgray at bluestarinc.com> 
Director of Information Technology
United Radio Incorporated, DBA BlueStar <http://www.bluestarinc.com> 
24 Spiral Drive
Florence, Kentucky 41042
office: 859.371.4423 x263
toll free: 800.371.4423 x263
fax: 859.371.4425
mobile: 513.616.3379 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050930/dff8c944/attachment.htm>

From fedora-tom at kofler.eu.org  Wed Sep 28 08:52:35 2005
From: fedora-tom at kofler.eu.org (Thomas Kofler)
Date: Wed, 28 Sep 2005 08:52:35 -0000
Subject: [Linux-cluster] Question aout fence_vmware.pl
In-Reply-To: <433A50CB.50509@gmail.com>
References: <433A50CB.50509@gmail.com>
Message-ID: <1127897533.433a59bda54ae@mail.devcon.cc>

Hi,

the file itself can be found at:

http://sources.redhat.com/cgi-
bin/cvsweb.cgi/cluster/fence/agents/vmware/fence_vmware.pl?cvsroot=cluster

The usage/parameters are documented in the file. I think it is using the perl 
API from Vmware - its also available for GSX and should be compatible.

"VMware GSX Server provides an easy to use API for control and management. 
Perl and COM interfaces and command line tools ..."
http://www.vmware.com/support/developer/

Would be worth a try - best luck, we are waiting for feedback

Regards,
Thomas

Quoting "carlopmart at gmail.com" <carlopmart at gmail.com>:

> Hi all,
> 
>   Searching mailing list I have found tis interesting thread: 
> https://www.redhat.com/archives/linux-cluster/2005-September/msg00014.html.
> 
>    Is it possible to use this fence module under GSX ??? Where can I 
> find examples to use?? I visited Zach's web without success.
> 
> 
> -- 
> CL Martinez
> carlopmart {at} gmail {d0t} com
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



 



From sdake at mvista.com  Fri Sep 30 19:40:00 2005
From: sdake at mvista.com (Steven Dake)
Date: Fri, 30 Sep 2005 12:40:00 -0700
Subject: [Linux-cluster] new userland cman
In-Reply-To: <433D4134.6080608@redhat.com>
References: <433D4134.6080608@redhat.com>
Message-ID: <1128109200.8440.14.camel@unnamed.az.mvista.com>

Patrick

Thanks for the work

I have a few comments inline

On Fri, 2005-09-30 at 14:44 +0100, Patrick Caulfield wrote:
> This has got to the stage where I'd be grateful for any testing other people can
>  do, though obviously don't endanger a production system!
> 
> You should be able to run the DLM and GFS on this, see
> 
> https://www.redhat.com/archives/linux-cluster/2005-September/msg00177.html
> 
> for (very) brief instructions. There is a new clvm patch available in the
> cluster CVS at cman/lib/clvmd-libcman.diff
> 
> Here's a list of the user-visible changes, please feel free to ask questions on
> the list.
> 
> good
> ----
> - (optional) encryption & authentication of communications
> - Multiple interface support (unfinished, needs AIS and cman work)
> - Automatic re-reading of CCS if a new node joins with an updated config file
> 
> bad
> ---
> - Always uses CCS (cman_tool join -X removed)*
> - Compulsory static node IDs (easily enforced by GUI or command-line)
> - Can't have multiple clusters using the same port number unless they use a
> different encryption key. Currently cluster name is ignored.**
> - Hard limit to size of cluster (set at compile time to 32 currently)***
> 

I hope to have multiring in 2006; then we should scale to hundreds of
processors...

> neutral
> -------
> - Always uses multicast (no broadcast). A default multicast address is supplied
> if none is given

If broadcast is important, which I guess it may be, we can pretty easily
add this support...

> - libcman is the only API ( a compatible libcman is available for the kernel
> version)
> - Simplified CCS schema, but will read old one if it has nodeids in it.****
> 
> internal
> --------
> - Usable messaging API
> - Robust membership algorithm
> - Community involvement, multiple developers.
> 
> 
> * I very much doubt that anyone will notice apart from maybe Dave & me
> 
> ** Could fix this in AIS, but I'm not sure the patch would be popular upstream.
> It's much more efficient to run them on different ports or multicast addresses
> anyway. Incidentally: DON'T run an encrypted and a non-encrypted cluster on the
> same port & multicast address (not that you would!) - the non-encrypted ones
> will crash.
> 

On this point, you mention you could fix "this", do you mean having two
clusters use the same port and ips?  I have also considered and do want
this by having each "cluster" join a specific group at startup to serve
as the cluster membership view.  Unfortunately this would require
process group membership, and the process groups interface is unfinished
(totempg.c) so this isn't possible today.  Note I'd take a patch from
someone that finished the job on this interface :)  I for example, would
like communication for a specific checkpoint to go over a specific named
group, instead of to everyone connected to totem.  Then the clm could
join a group and get membership events, the checkpoint service for a
specific checkpoint could join a group, and communicate on that group,
and get membership events for that group etc.

What did you have in mind here?

regards
-steve

> *** I doubt that the old cman worked well above 30 nodes anyway. I intend to do
> some AIS hacking to improve this situation by drastically reducing the network
> packet size.
> 
> **** The main difference here is that the multicast address need only be
> specified once, in the <cman> section of cluster.conf. The interface used will
> be the one that is bound to the hostname mentioned.
> 
> 
> patrick
>