From Alain.Moulle at bull.net Tue Aug 1 06:24:29 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 01 Aug 2006 08:24:29 +0200 Subject: [Linux-cluster] 2-node fencing question Message-ID: <44CEF39D.8000001@bull.net> > Also is there a way to configure fence_ipmilan in cluster.xml to reboot > rather than stop the server? fence_ipmilan by itself takes the -o > option (on,off,reboot) I use fence_ipmilan (with CS4 Update 2) it does at first poweroff AND then poweron ... except if it does not get the off status after the poweroff. (check agent ipmilan.c) Alain Moullé From zachacker at ibh.de Tue Aug 1 06:38:18 2006 From: zachacker at ibh.de (Zachacker, Maik) Date: Tue, 1 Aug 2006 08:38:18 +0200 Subject: [Linux-cluster] 2-node fencing question Message-ID: <0DDD325898FC3C4C88393E540965B37F1E59DB@dcdwa.ibh> >> Also is there a way to configure fence_ipmilan in cluster.xml to reboot >> rather than stop the server? fence_ipmilan by itself takes the -o >> option (on,off,reboot) > > I use fence_ipmilan (with CS4 Update 2) it does at > first poweroff AND then poweron ... except if it does not get > the off status after the poweroff. (check agent ipmilan.c) I use fence_ilo and fence_apc (CS4U3) - both first poweroff and then poweron too. This is only a problem in a two node configuration because both nodes send the poweroff command and none of them can send the poweron command because both are down. Most fence devices have an option or action tag that is not available via the cluster configuration tool. They can be used to force a reboot (default) or a poweroff. Maik Zachacker -- Maik Zachacker IBH Prof. Dr. Horn GmbH, Dresden, Germany From zhiwei at linuxone.myftp.org Tue Aug 1 08:02:17 2006 From: zhiwei at linuxone.myftp.org (zhiwei) Date: Tue, 01 Aug 2006 16:02:17 +0800 Subject: [Linux-cluster] Re: lvm2 liblvm2clusterlock.so on fc5 (Jeff Hardy) In-Reply-To: <20060728141302.E2A9573A0E@hormel.redhat.com> References: <20060728141302.E2A9573A0E@hormel.redhat.com> Message-ID: <1154419337.6987.13.camel@alpha01.mcs.com> > Message: 4 > Date: Thu, 27 Jul 2006 13:52:43 -0400 > From: Jeff Hardy > Subject: [Linux-cluster] lvm2 liblvm2clusterlock.so on fc5 > To: linux clustering > Message-ID: <1154022763.2789.120.camel at fritzdesk.potsdam.edu> > Content-Type: text/plain > > I apologize if this has been answered already or appeared in release > notes somewhere, but I cannot find it. FC4 had the lvm2-cluster package > to provide the clvm locking library. This was removed in FC5 (as > indicated in the release notes). > > Is this still necessary for a clvm setup: > > In /etc/lvm/lvm.conf: > locking_type = 2 > locking_library = "/lib/liblvm2clusterlock.so" > > And if so, where does one find this now? > You can obtain the lvm2 source code from RHCS and recompile it to enable the clvmd option. Clvmd is needed to manage the shared storage and share the LVM information among the cluster members. Zhiwei From stephen.willey at framestore-cfc.com Tue Aug 1 10:40:20 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Tue, 01 Aug 2006 11:40:20 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem Message-ID: <44CF2F94.4000003@framestore-cfc.com> We fscked the filesystem because we'd started seeing the following errors following a power failure.
GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block GFS: fsid=nearlineA:gfs1.0: bh = 2644310219 (type: exp=4, found=5) GFS: fsid=nearlineA:gfs1.0: function = gfs_get_meta_buffer GFS: fsid=nearlineA:gfs1.0: file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dio.c, line = 1223 GFS: fsid=nearlineA:gfs1.0: time = 1154425344 GFS: fsid=nearlineA:gfs1.0: about to withdraw from the cluster GFS: fsid=nearlineA:gfs1.0: waiting for outstanding I/O GFS: fsid=nearlineA:gfs1.0: telling LM to withdraw lock_dlm: withdraw abandoned memory GFS: fsid=nearlineA:gfs1.0: withdrawn And another instance: GFS: fsid=nearlineA:gfs1.1: fatal: filesystem consistency error GFS: fsid=nearlineA:gfs1.1: inode = 2384574146/2384574146 GFS: fsid=nearlineA:gfs1.1: function = dir_e_del GFS: fsid=nearlineA:gfs1.1: file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dir.c, line = 1495 GFS: fsid=nearlineA:gfs1.1: time = 1154393717 GFS: fsid=nearlineA:gfs1.1: about to withdraw from the cluster GFS: fsid=nearlineA:gfs1.1: waiting for outstanding I/O GFS: fsid=nearlineA:gfs1.1: telling LM to withdraw lock_dlm: withdraw abandoned memory GFS: fsid=nearlineA:gfs1.1: withdrawn Running gfs_fsck -vvv -y /dev/gfs1_vg/gfs1_lv Returns the following after chewing all the physical and swap RAM. The machines have 4Gb of RAM and 2Gb of swap. We can increase the swap size, but is this just gonna keep running out of RAM? We're running on x86_64 so it can use as much memory as it likes. The filesystem is roughly 45Tb. Initializing fsck Initializing lists... Initializing special inodes... Setting block ranges... Creating a block list of size 11105160192... Unable to allocate bitmap of size 1388145025 Segmentation fault [root at ns1a ~]# gfs_fsck -vvv -y /dev/gfs1_vg/gfs1_lv Initializing fsck Initializing lists... (bio.c:140) Writing to 65536 - 16 4096 Initializing special inodes... (file.c:45) readi: Offset (640) is >= the file size (640). (super.c:208) 8 journals found. (file.c:45) readi: Offset (7116576) is >= the file size (7116576). (super.c:265) 74131 resource groups found. Setting block ranges... Creating a block list of size 11105160192... (bitmap.c:68) Allocated bitmap of size 5552580097 with 2 chunks per byte Unable to allocate bitmap of size 1388145025 (block_list.c:72) - block_list_create() Segmentation fault -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From Alain.Moulle at bull.net Tue Aug 1 11:06:54 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 01 Aug 2006 13:06:54 +0200 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? Message-ID: <44CF35CE.1060700@bull.net> Hi We are facing a big problem of split-brain, due to the fact that the clurgmgrd daemon from RedHat Cluster-Suite unexpectedly disappeared (still for an unknown reason ...) on one node of the HA pair. This caused the other clurgmgrd on the other node to become aware of this and then simply re-start the application service without effective fencing/migration. It seems to be an abnormal behavior, isn't it ? Is there already a fix available in a more recent Update ? Have you any suggestion about this ? Thanks a lot Alain Moullé From kent2004 at gmail.com Tue Aug 1 13:23:47 2006 From: kent2004 at gmail.com (Kent Chen) Date: Tue, 1 Aug 2006 21:23:47 +0800 Subject: [Linux-cluster] hung when 3rd nodes mounting the gfs using dlm Message-ID: I connect 4 SUN x4100 (2 AMD dual core, 2G RAM ) to a SUN Storage 3510 with a silkworm 200e FC switch.
the OS is RHEL 4 U3 for X86_64. I make 2 GFS FS, one called Alpha:gfs1, another called Alpha:gfs2 All things seems good when only 2 nodes mount GFS. Once 3rd node mount the GFS, the command hang. Is there anyone who encounted the similar problem? Is it a bug of GFS? -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Tue Aug 1 16:38:27 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Tue, 01 Aug 2006 11:38:27 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF2F94.4000003@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> Message-ID: <44CF8383.3040208@redhat.com> Stephen Willey wrote: > We fscked the filesystem because we'd started seeing the following > errors following a power failure. > (snip) > We're running on x86_64 so it can use as much memory as it likes. The > filesystem is roughly 45Tb. > Hi Stephen, Yes, this is a problem with gfs_fsck. The problem is, it tries to allocate memory for bitmaps based on the size of the file system. The bitmap structures are used throughout the code, so they're not optional. I'll have to figure out how to do this a better way. Thanks for opening the bugzilla (200883). I'll work on it. Regards, Bob Peterson Red Hat Cluster Suite From stephen.willey at framestore-cfc.com Tue Aug 1 16:32:45 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Tue, 01 Aug 2006 17:32:45 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF8383.3040208@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> Message-ID: <44CF822D.7070705@framestore-cfc.com> Robert Peterson wrote: > > Hi Stephen, > > Yes, this is a problem with gfs_fsck. The problem is, it tries to > allocate memory > for bitmaps based on the size of the file system. The bitmap structures > are used > throughout the code, so they're not optional. I'll have to figure out > how to > do this a better way. Thanks for opening the bugzilla (200883). I'll > work on it. > > Regards, > > Bob Peterson > Red Hat Cluster Suite The fsck is now running after we added the 137Gb swap drive. It appears to consistently chew about 4Gb of RAM (sometimes higher) but it is working (for now). Any ballpark idea of how long it'll take to fsck a 45Tb FS? I know that's a "how long is a piece of string" question, but are we talking hours/days/weeks?
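As a rough back-of-the-envelope check on the memory numbers reported in this thread (assuming the default 4 KB GFS block size; the figures below are derived only from the fsck output quoted above, not from the gfs_fsck source):

   ~45 TB / 4 KB per block                   = about 11.1 billion blocks (the reported block list size of 11105160192)
   11105160192 blocks at 4 bits per block    = about 5.5 GB (the 5552580097-byte bitmap, "2 chunks per byte")
   11105160192 blocks at 1 bit per block     = about 1.4 GB (the 1388145025-byte allocation that failed)

so the in-memory block maps alone need roughly 7 GB, more than the 4 GB of RAM plus 2 GB of swap originally available on these machines.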
Stephen From teigland at redhat.com Tue Aug 1 16:35:57 2006 From: teigland at redhat.com (David Teigland) Date: Tue, 1 Aug 2006 11:35:57 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF2F94.4000003@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> Message-ID: <20060801163557.GD5976@redhat.com> On Tue, Aug 01, 2006 at 11:40:20AM +0100, Stephen Willey wrote: > We fscked the filesystem because we'd started seeing the following > errors following a power failure. > > GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block > GFS: fsid=nearlineA:gfs1.0: bh = 2644310219 (type: exp=4, found=5) > GFS: fsid=nearlineA:gfs1.0: function = gfs_get_meta_buffer > GFS: fsid=nearlineA:gfs1.0: file = > /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dio.c, line = 1223 > GFS: fsid=nearlineA:gfs1.0: time = 1154425344 > GFS: fsid=nearlineA:gfs1.0: about to withdraw from the cluster > GFS: fsid=nearlineA:gfs1.0: waiting for outstanding I/O > GFS: fsid=nearlineA:gfs1.0: telling LM to withdraw > lock_dlm: withdraw abandoned memory > GFS: fsid=nearlineA:gfs1.0: withdrawn > > And another instance: > > GFS: fsid=nearlineA:gfs1.1: fatal: filesystem consistency error > GFS: fsid=nearlineA:gfs1.1: inode = 2384574146/2384574146 > GFS: fsid=nearlineA:gfs1.1: function = dir_e_del > GFS: fsid=nearlineA:gfs1.1: file = > /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dir.c, line = 1495 > GFS: fsid=nearlineA:gfs1.1: time = 1154393717 > GFS: fsid=nearlineA:gfs1.1: about to withdraw from the cluster > GFS: fsid=nearlineA:gfs1.1: waiting for outstanding I/O > GFS: fsid=nearlineA:gfs1.1: telling LM to withdraw > lock_dlm: withdraw abandoned memory > GFS: fsid=nearlineA:gfs1.1: withdrawn What kind of fencing are you using in the cluster? Trying to understand how this might have happened. Dave From stephen.willey at framestore-cfc.com Tue Aug 1 16:40:48 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Tue, 01 Aug 2006 17:40:48 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <20060801163557.GD5976@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <20060801163557.GD5976@redhat.com> Message-ID: <44CF8410.7040507@framestore-cfc.com> David Teigland wrote: > > What kind of fencing are you using in the cluster? Trying to understand > how this might have happened. > > Dave > We're using STONITH through HP/Compaq ILO. We believe that the corruption was almost certainly caused during a building-wide power failure though. That'll teach us to double-check the UPS setup. -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From rpeterso at redhat.com Tue Aug 1 17:53:02 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Tue, 01 Aug 2006 12:53:02 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF822D.7070705@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> Message-ID: <44CF94FE.3070407@redhat.com> Stephen Willey wrote: > The fsck is now running after we added the 137Gb swap drive. It appears > to consistently chew about 4Gb of RAM (sometimes higher) but it is > working (for now). > > Any ballpark idea of how long it'll take to fsck a 45Tb FS? I know > that's a "how long is a piece of string" question, but are we talking > hours/days/weeks? 
> > Stephen > Hi Stephen, I don't know how long it will take to fsck a 45TB fs, but it wouldn't surprise me if it took several days. It also varies because of hardware differences, and of course if you're going to swap, that might slow it down too. Any way you look at it, 45TB is a lot of data to go through with a fine-tooth comb like gfs_fsck does. The latest RHEL4 U3 version (and up) and recent STABLE and HEAD versions (in CVS) now give you a percent complete number every second during the more lengthy passes, such as pass5. When it finishes, can you post something on the list to let us know? We've tried to kick around ideas on how to improve the speed, such as (1) adding an option to only focus on areas where the journals are dirty, (2) introducing multiple threads to process the different RGs, and even (3) trying to get multiple nodes in the cluster to team up and do different areas of the file system. None of these have been implemented yet because of higher priorities. Since this is an open-source project, anyone could step in and do these. Volunteers? Regards, Bob Peterson Red Hat Cluster Suite From Leonardo.Mello at planejamento.gov.br Tue Aug 1 16:24:01 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Tue, 1 Aug 2006 13:24:01 -0300 Subject: [Linux-cluster] hung when 3rd nodes mounting the gfs using dlm Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B55@corp-bsa-mp01.planejamento.gov.br> What command had you used to create the gfs filesystem ? GFS need one journal for each server that mount the filesystem. If you have created the filesystem only with 2 journals, you won't be able to use more than two machines. The is used among other things to restore the filesystem in case of server failure. If gfs in the current architeture permits you to mount in more servers than the number of journals you have, will damage the filesystem, maybe this is one of the reasons for gfs block access from more servers than the number of journals. 01 - To specify the number of journals at filesystem creation: with the option -j number at mkfs.gfs where number is the number of machines. for 4 machines the option will be: -j 4 02 - To increase the number of journals in filesystem that has been already created: (possibly that is your case) for this task exist the tool gfs_jadd. see it manpage to use this tool, you need to mount the gfs filesystem in the machine that will increase the number of journal. gfs_jadd -j number_to_increase /gfs/filesystem/mount/point number_to_increase must be how many journals you want to add to that filesystem, by default this number is 1. in your case with four servers: (you already have 2 journals, could be like: gfs_jadd -j 2 /gfs/filesystem/mount/point some times gfs_jadd doesnt work, because there isnt space in the disk for the journal creation. in that case the best solution i know is to format the filesystem specifying the correct number of journals. Best Regards Leonardo Rodrigues de Mello -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Kent Chen Sent: ter 1/8/2006 10:23 To: linux-cluster at redhat.com Cc: Subject: [Linux-cluster] hung when 3rd nodes mounting the gfs using dlm I connect 4 SUN x4100 (2 AMD dual core, 2G RAM ) to a SUN Storage 3510 with a silkworm 200e FC switch. the OS is RHEL 4 U3 for X86_64. I make 2 GFS FS,one called Alpha:gfs1, another called Alpha:gfs2 All things seems good when only 2 nodes mount GFS. Once 3rd node mount the GFS, the command hang. 
Is there anyone who encounted the similar problem? Is it a bug of GFS? -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3755 bytes Desc: not available URL: From mykleb at no.ibm.com Tue Aug 1 20:16:42 2006 From: mykleb at no.ibm.com (Jan-Frode Myklebust) Date: Tue, 1 Aug 2006 22:16:42 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster References: <44CE15B1.9010603@fiocruz.br> Message-ID: On 2006-07-31, Nicholas Anderson wrote: > I'm new to clustering and was wondering what would be the best solution > when clustering an email server. > > Today we've 1 server with a storage where all mailboxes (mbox format) For clustering, I think it would be better to use Maildir-format for the mailboxes. Then you'll avoid any locking problems on the mailboxes. New messages can be delivered on one machine while other messages in the same mail-folder is being deleted on another machine. If your users are only accessing their email by pop/imap, moving to Maildir shouldn't be any issue. > and home dirs are stored. > I'm planning to use 3 or 4 nodes running imap, pop and smtp, all of them > sharing users' data. > > Should I use NFS or GFS? NFS is very single-point-of-failure.. so definately a clusterfs/GFS. If you can move to Maildir, you should be able to run any number of servers where each server is running all services (imap, pop and smtp), and incoming traffic is routed to a random server trough f.ex. round robin dns. To handle single-node downtime/crash, you'll just need to move the ip-address to an available node. Easily achivable trough f.ex. heartbeat from linux-ha.org, and probably also RH Cluster Suite.. -jf From hyperbaba at neobee.net Wed Aug 2 06:49:24 2006 From: hyperbaba at neobee.net (Vladimir Grujic) Date: Wed, 2 Aug 2006 08:49:24 +0200 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF822D.7070705@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> Message-ID: <200608020849.25035.hyperbaba@neobee.net> On Tuesday 01 August 2006 18:32, Stephen Willey wrote: > Robert Peterson wrote: > > Hi Stephen, > > > > Yes, this is a problem with gfs_fsck. The problem is, it tries to > > allocate memory > > for bitmaps based on the size of the file system. The bitmap structures > > are used > > throughout the code, so they're not optional. I'll have to figure out > > how to > > do this a better way. Thanks for opening the bugzilla (200883). I'll > > work on it. > > > > Regards, > > > > Bob Peterson > > Red Hat Cluster Suite > > The fsck is now running after we added the 137Gb swap drive. It appears > to consistently chew about 4Gb of RAM (sometimes higher) but it is > working (for now). > > Any ballpark idea of how long it'll take to fsck a 45Tb FS? I know > that's a "how long is a piece of string" question, but are we talking > hours/days/weeks? > > Stephen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster It took 55 hours for all 7 passes on my 1TB partition (with alot of files on it) . partition resided on raid 10 sata storage. Does anyone else have execution times for gfs_gsck ? 
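To make the journal-count procedure Leonardo described a few messages up concrete, here is a minimal command sketch (the device path and mount point are hypothetical; adjust them to your own volume layout):

   # create the filesystem with one journal per node that will mount it (4 nodes here)
   mkfs.gfs -p lock_dlm -t Alpha:gfs1 -j 4 /dev/vg01/gfs1_lv

   # or add two more journals to an existing filesystem, run on a node where it is mounted
   gfs_jadd -j 2 /mnt/gfs1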
From stephen.willey at framestore-cfc.com Wed Aug 2 09:51:05 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Wed, 02 Aug 2006 10:51:05 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <200608020849.25035.hyperbaba@neobee.net> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <200608020849.25035.hyperbaba@neobee.net> Message-ID: <44D07589.7090507@framestore-cfc.com> > On Tuesday 01 August 2006 18:32, Stephen Willey wrote: >> The fsck is now running after we added the 137Gb swap drive. It appears >> to consistently chew about 4Gb of RAM (sometimes higher) but it is >> working (for now). >> >> Any ballpark idea of how long it'll take to fsck a 45Tb FS? I know >> that's a "how long is a piece of string" question, but are we talking >> hours/days/weeks? >> >> Stephen >> Is there any way we can determine the progress during all passes? At the moment all we're seeing is lines like the following: (pass1.c:213) Setting 557096777 to data block Is this representative of simply the numbers of blocks in the filesystem? If so, how do we get the numbers of blocks in the filesystem while the fsck is running? We use this FS for backups and we're currently determining whether we'd be better off just wiping it and re-syncing all our data (which would take a couple of days, not several) so unless we can get a reliable indication of how long this will take, we probably won't finish it. -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From f.hackenberger at mediatransfer.com Wed Aug 2 11:03:15 2006 From: f.hackenberger at mediatransfer.com (Falk Hackenberger - MediaTransfer AG Netresearch & Consulting) Date: Wed, 02 Aug 2006 13:03:15 +0200 Subject: [Linux-cluster] clurgmgrd stops service without reason Message-ID: <44D08673.3010207@mediatransfer.com> Hello, we have running cs4 config with 2 nodes. for debuging one node is offline. so it is running only on 1 node. Now we have the problem that clurgmgrd stops the services wich he provides without recognizable reason. we have log_level 7 on the cman and on rm -lines in cluster.conf but the reason of stoping the service is not recognizable. I see in the logfile entries as: --snip-- Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing /exports/imap/checkimapstartup.sh status Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing /exports/subversion/etc/rc.d/init.d/svnserver status Aug 1 17:31:28 kain clurgmgrd: [4780]: Checking 192.168.0.223, Level 0 Aug 1 17:31:28 kain clurgmgrd: [4780]: 192.168.0.223 present on eth0 Aug 1 17:31:28 kain clurgmgrd: [4780]: Link for eth0: Detected Aug 1 17:31:28 kain clurgmgrd: [4780]: Link detected on eth0 Aug 1 17:31:37 kain clurgmgrd[4780]: Stopping service storage --snap-- how to say to clurgmgrd, that he should log the reason for stoping the service? any other hints ? thanks falk From Michael.Roethlein at ri-solution.com Wed Aug 2 12:09:29 2006 From: Michael.Roethlein at ri-solution.com (=?iso-8859-1?Q?R=F6thlein_Michael_=28RI-Solution=29?=) Date: Wed, 2 Aug 2006 14:09:29 +0200 Subject: [Linux-cluster] Tracing gfs problems Message-ID: <992633B6A0E42B49BC5A41C10A8C841B030E29B8@MUCEX004.root.local> Hello, In the past there occured hangs resulting in reboots of our 4 node cluster. The real problem is: there aren't any traces in the log files of the nodes. Is there a possibilty to raise the verbosity of gfs? 
Thanks Michael From rpeterso at redhat.com Wed Aug 2 14:27:32 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 02 Aug 2006 09:27:32 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44D07589.7090507@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <200608020849.25035.hyperbaba@neobee.net> <44D07589.7090507@framestore-cfc.com> Message-ID: <44D0B654.5010508@redhat.com> Stephen Willey wrote: > Is there any way we can determine the progress during all passes? At > the moment all we're seeing is lines like the following: > > (pass1.c:213) Setting 557096777 to data block > > Is this representative of simply the numbers of blocks in the > filesystem? If so, how do we get the numbers of blocks in the > filesystem while the fsck is running? > > We use this FS for backups and we're currently determining whether we'd > be better off just wiping it and re-syncing all our data (which would > take a couple of days, not several) so unless we can get a reliable > indication of how long this will take, we probably won't finish it. Hi Stephen, The latest gfs_fsck will report the percent complete for passes 1 and 5, which take the longest. It sounds like you're running it in verbose mode (i.e. with -v) which is going to do a lot of unnecessary I/O to stdout and will slow it down considerably. If you're redirecting stdout, you can do a 'grep "percent complete" /your/stdout | tail' or something similar to figure out how far along it is with that pass. Only passes 1 and 5 go block-by-block and therefore it's easy to figure out how far they've gotten. For the other passes, it would be difficult to estimate their progress, and probably not worth the overhead in terms of time the computer would have to spend figuring it out. You can get it to go faster by restarting it without the -v, but then it will have to re-do all the work it's already done to this point. Based on what you've told me, it probably will take longer to fsck than you're willing to wait. Regards, Bob Peterson Red Hat Cluster Suite From stephen.willey at framestore-cfc.com Wed Aug 2 14:19:56 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Wed, 02 Aug 2006 15:19:56 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44D0B654.5010508@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <200608020849.25035.hyperbaba@neobee.net> <44D07589.7090507@framestore-cfc.com> <44D0B654.5010508@redhat.com> Message-ID: <44D0B48C.4080004@framestore-cfc.com> Robert Peterson wrote: > Hi Stephen, > > The latest gfs_fsck will report the percent complete for passes 1 and 5, > which take the longest. It sounds like you're running it in verbose mode > (i.e. with -v) which is going to do a lot of unnecessary I/O to stdout and > will slow it down considerably. If you're redirecting stdout, you can > do a 'grep "percent complete" /your/stdout | tail' or something similar to > figure out how far along it is with that pass. > > Only passes 1 and 5 go block-by-block and therefore it's easy to figure > out how far they've gotten. For the other passes, it would be difficult to > estimate their progress, and probably not worth the overhead in terms > of time the computer would have to spend figuring it out. 
> > You can get it to go faster by restarting it without the -v, but then it > will > have to re-do all the work it's already done to this point. > > Based on what you've told me, it probably will take longer to fsck than > you're willing to wait. > Regards, We have restarted it without the -vs and it does appear to be progressing much faster. We'll give it a while... -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From danwest at comcast.net Wed Aug 2 15:50:16 2006 From: danwest at comcast.net (danwest at comcast.net) Date: Wed, 02 Aug 2006 15:50:16 +0000 Subject: [Linux-cluster] 2-node fencing question Message-ID: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> It seems like a significant problem to have fence_ipmilan issue a power-off followed by a power-on with a 2 node cluster. As described both nodes power-off and are then unable to issue the required power-on. Does anyone know a solution to this? This seems to make a 2-node cluster with ipmi fencing pointless. It looks like fence_ipmilan needs to support sending a cycle instead of a poweroff than a poweron? According to fence_ipmilan.c it looks like cycle is not an option although it is an option for ipmitool. (ipmitool -H -U -P chassis power cycle) >From fence_ipmilan.c: switch(op) { case ST_POWERON: snprintf(arg, sizeof(arg), "%s chassis power on", cmd); break; case ST_POWEROFF: snprintf(arg, sizeof(arg), "%s chassis power off", cmd); break; case ST_STATUS: snprintf(arg, sizeof(arg), "%s chassis power status", cmd); break; } Thanks, Dan -------------- Original message ---------------------- From: "Zachacker, Maik" > >> Also is there a way to configure fence_ipmilan in cluster.xml to > reboot > >> rather than stop the server? fence_ipmilan by itself takes the -o > >> option (on,off,reboot) > > > > I use fence_ipmilan (with CS4 Update 2) it does at > > first poweroff AND then poweron ... except if it does not get > > the off status after the poweroff. (check agent ipmilan.c) > > I use fence_ilo and fence_apc (CS4U3) - both first poweroff and then > poweron too. This is only a problem in a two node configuration because > both nodes send the poweroff command and non of them can send the > poweron command because both are down. > > The most fence-devices have an option or action tag, that is not > available via the cluster configuration tool. They can be used to force > a reboot (default) or an poweroff. > > > > > Maik Zachacker > -- > Maik Zachacker > IBH Prof. Dr. Horn GmbH, Dresden, Germany > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From nicholas at fiocruz.br Wed Aug 2 18:43:47 2006 From: nicholas at fiocruz.br (Nicholas Anderson) Date: Wed, 02 Aug 2006 15:43:47 -0300 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: References: <44CE15B1.9010603@fiocruz.br> Message-ID: <44D0F263.509@fiocruz.br> Hi Jan, I'm searching in google how to convert from mbox to maildir using sendmail/procmail .... i have 3000+ users and something like 70GB of emails and I'll have to test it very well before doing in the production server .... As soon as i get this things working fine, i'll try gfs and the other cluster stuff ..... I'm thinking on doing something like you said ... 3 nodes running imap/pop/smtp sharing one filesystem probably with gfs where user data will be stored..... 
I was running slackware but now I'm thinking about something like redhat or centos (will depend on our budget :-) ) to the nodes .... It'll be easier to keep them up2dated :-) Any new tips are welcome :-) thanks, Nick Jan-Frode Myklebust wrote: > On 2006-07-31, Nicholas Anderson wrote: > > For clustering, I think it would be better to use Maildir-format > for the mailboxes. Then you'll avoid any locking problems on the > mailboxes. New messages can be delivered on one machine while other > messages in the same mail-folder is being deleted on another machine. > > If your users are only accessing their email by pop/imap, moving to > Maildir shouldn't be any issue. > > NFS is very single-point-of-failure.. so definately a clusterfs/GFS. > If you can move to Maildir, you should be able to run any number of > servers where each server is running all services (imap, pop and smtp), > and incoming traffic is routed to a random server trough f.ex. round > robin dns. > > To handle single-node downtime/crash, you'll just need to move the > ip-address to an available node. Easily achivable trough f.ex. > heartbeat from linux-ha.org, and probably also RH Cluster Suite.. > > > -jf > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Nicholas Anderson Administrador de Sistemas Unix LPIC-1 Certified Rede Fiocruz e-mail: nicholas at fiocruz.br From mykleb at no.ibm.com Wed Aug 2 19:43:33 2006 From: mykleb at no.ibm.com (Jan-Frode Myklebust) Date: Wed, 2 Aug 2006 21:43:33 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> Message-ID: On 2006-08-02, Nicholas Anderson wrote: > > I'm searching in google how to convert from mbox to maildir using > sendmail/procmail .... At my previous job we changed from exim/uw-imap on mbox, to exim/docevot on maildir a couple of years ago. Didn't use a cluster-fs, only SCSI-based disk failover. For about 500-users. Right now I'm setting up a similar solution to your... trying to support up to 200.000 users on a 5 node cluster, using IBM GPFS. If sendmail is using procmail to do final mailbox-delivery, I think the configuration change should be primarily putting a '/' at the end of the path, as that should instruct procmail to do maildir-style delivery. At least that's how I've been doing it in my ~/.procmailrc. Ref. 'man procmailrc'. > i have 3000+ users and something like 70GB of > emails and I'll have to test it very well before doing in the > production server .... Sure.. There are a few mbox2maildir converters.. You should probably try a few of them and verify that they all give the same result. Another thing to check is that your cluster-fs handles your load well. My main consern would be how well GFS performs on maildir-style folders, as most cluster-fs's I've seen are optimized for large file streaming I/O. If possible, try to keep a lot of file-metadata in cache so that you don't have to go to disk every time someone check their maildir for new messages. -jf From riaan at obsidian.co.za Wed Aug 2 22:27:32 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Thu, 03 Aug 2006 00:27:32 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> Message-ID: <44D126D4.2080303@obsidian.co.za> Jan-Frode Myklebust wrote: > On 2006-08-02, Nicholas Anderson wrote: >> I'm searching in google how to convert from mbox to maildir using >> sendmail/procmail .... 
> > At my previous job we changed from exim/uw-imap on mbox, > to exim/docevot on maildir a couple of years ago. Didn't use > a cluster-fs, only SCSI-based disk failover. For about 500-users. > > Right now I'm setting up a similar solution to your... trying > to support up to 200.000 users on a 5 node cluster, using IBM GPFS. > > If sendmail is using procmail to do final mailbox-delivery, I > think the configuration change should be primarily putting a '/' > at the end of the path, as that should instruct procmail to > do maildir-style delivery. At least that's how I've been doing > it in my ~/.procmailrc. Ref. 'man procmailrc'. > >> i have 3000+ users and something like 70GB of >> emails and I'll have to test it very well before doing in the >> production server .... > > Sure.. There are a few mbox2maildir converters.. You should probably > try a few of them and verify that they all give the same result. > > Another thing to check is that your cluster-fs handles your load > well. My main consern would be how well GFS performs on > maildir-style folders, as most cluster-fs's I've seen are optimized > for large file streaming I/O. If possible, try to keep a lot of > file-metadata in cache so that you don't have to go to disk every > time someone check their maildir for new messages. > We are running 700 000 users on a 2.5 GFS, 4 nodes, with POP, IMAP (direct access and SquirrelmMail) and SMTP. To make things worse, we use NFS between our GFS nodes and our mail servers. We initially had huge performance problems in our setup, which I wrote in this message: http://www.redhat.com/archives/linux-cluster/2006-July/msg00136.html We ended up bumping the spindle count from 36 to 60 and then to 114, without it making a noticeable difference. Our main killer was Squirrelmail over IMAP (the solution is primarily a webmail-based one) Our performance problems were solved by the following: - removing the folder-size plugin (built-in) and mail quota plugin (3rd party) reduced the traffic between IMAP servers and storage backend by 40%. - Implement imap proxy (www.imapproxy.org). This is giving us a 1 to 14 hit ratio. This storage which could not keep up previously, is now humming along fine. Our initial mistake was to try and optimise on the FS layer (there werent any real performance optimizations in our setup to be made) and throw hardware at the problem, instead of suspecting and optimizing our application. Despite GFS not being designed for lots of small files, and not recommended for use with NFS, with the above changes, it performs more than adequately. We hope to see another performance gain once we get rid of the NFS and have our mail servers access the GFS directly. Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From Hansjoerg.Maurer at dlr.de Thu Aug 3 07:17:31 2006 From: Hansjoerg.Maurer at dlr.de (=?ISO-8859-15?Q?Hansj=F6rg_Maurer?=) Date: Thu, 03 Aug 2006 09:17:31 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> Message-ID: <44D1A30B.4060200@dlr.de> Hi we had some problems with cyrus-imap on top of gpfs a year ago concerning mmap files in a simple failover environment. (see the gpfs-mailinglist) But with recent versions it works. Greetings Hansjoerg > >Right now I'm setting up a similar solution to your... trying >to support up to 200.000 users on a 5 node cluster, using IBM GPFS. 
> >If sendmail is using procmail to do final mailbox-delivery, I >think the configuration change should be primarily putting a '/' >at the end of the path, as that should instruct procmail to >do maildir-style delivery. At least that's how I've been doing >it in my ~/.procmailrc. Ref. 'man procmailrc'. > > > >>i have 3000+ users and something like 70GB of >>emails and I'll have to test it very well before doing in the >>production server .... >> >> > >Sure.. There are a few mbox2maildir converters.. You should probably >try a few of them and verify that they all give the same result. > >Another thing to check is that your cluster-fs handles your load >well. My main consern would be how well GFS performs on >maildir-style folders, as most cluster-fs's I've seen are optimized >for large file streaming I/O. If possible, try to keep a lot of >file-metadata in cache so that you don't have to go to disk every >time someone check their maildir for new messages. > > > -jf > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > -- _________________________________________________________________ Dr. Hansjoerg Maurer | LAN- & System-Manager | Deutsches Zentrum | DLR Oberpfaffenhofen f. Luft- und Raumfahrt e.V. | Institut f. Robotik | Postfach 1116 | Muenchner Strasse 20 82230 Wessling | 82234 Wessling Germany | | Tel: 08153/28-2431 | E-mail: Hansjoerg.Maurer at dlr.de Fax: 08153/28-1134 | WWW: http://www.robotic.dlr.de/ __________________________________________________________________ There are 10 types of people in this world, those who understand binary and those who don't. From mykleb at no.ibm.com Thu Aug 3 11:42:45 2006 From: mykleb at no.ibm.com (Jan-Frode Myklebust) Date: Thu, 3 Aug 2006 13:42:45 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> <44D1A30B.4060200@dlr.de> Message-ID: On 2006-08-03, Hansj?rg Maurer wrote: > > we had some problems with cyrus-imap on top of gpfs a year ago > concerning mmap files in a simple failover environment. > (see the gpfs-mailinglist) > But with recent versions it works. Yes, I remember your posting.. and AFAICT you solved it by turning off mmap in Cyrus. https://lists.sdsc.edu/mailman/private.cgi/gpfs-general/2005q4/000040.html Did you consider GFS for this project ? Or are you now looking at GFS for the same project ? We're using courier-imap, which as far as I can tell doesn't use mmap, so we shouldn't hit this problem. Otherwise there is always the mmap-invalidate patch that should solve this... -jf From singh.rajeshwar at gmail.com Thu Aug 3 12:41:08 2006 From: singh.rajeshwar at gmail.com (Rajesh singh) Date: Thu, 3 Aug 2006 18:11:08 +0530 Subject: [Linux-cluster] fencing agent Message-ID: Hi all, We are in process of procurring a fencing device and we have been suggested by our hardware vendor to use fencing device as mentioned in below URL. http://www.supermicro.com/products/accessories/addon/AOC-IPMI20-E.cfm My setup is that I am using 2 node AMD servers on rhel4 u2 in clustered mode. I am not using gfs, but i am putting fencing device. My querry is that, can I use the *AOC-IPMI20-E card as an fencing device. regards * -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From riaan at obsidian.co.za Thu Aug 3 14:21:28 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Thu, 03 Aug 2006 16:21:28 +0200 Subject: [Linux-cluster] fencing agent In-Reply-To: References: Message-ID: <44D20668.6070901@obsidian.co.za> Rajesh singh wrote: > Hi all, > We are in process of procurring a fencing device and we have been > suggested by our hardware vendor to use fencing device as mentioned in > below URL. > http://www.supermicro.com/products/accessories/addon/AOC-IPMI20-E.cfm > My setup is that I am using 2 node AMD servers on rhel4 u2 in > clustered mode. > I am not using gfs, but i am putting fencing device. > My querry is that, can I use the *AOC-IPMI20-E card as an fencing device. > > regards According to the above URL, this card supports IPMI 2, which means that it should work with the fence_ipmilan fencing module in RHCS 4. Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From mwill at penguincomputing.com Thu Aug 3 15:29:36 2006 From: mwill at penguincomputing.com (Michael Will) Date: Thu, 3 Aug 2006 08:29:36 -0700 Subject: [Linux-cluster] fencing agent Message-ID: <433093DF7AD7444DA65EFAFE3987879C125E83@jellyfish.highlyscyld.com> Or you buy systems that come with ipmi on the mainboard. -----Original Message----- From: Rajesh singh [mailto:singh.rajeshwar at gmail.com] Sent: Thu Aug 03 07:01:25 2006 To: linux-cluster at redhat.com Subject: [Linux-cluster] fencing agent Hi all, We are in process of procurring a fencing device and we have been suggested by our hardware vendor to use fencing device as mentioned in below URL. http://www.supermicro.com/products/accessories/addon/AOC-IPMI20-E.cfm My setup is that I am using 2 node AMD servers on rhel4 u2 in clustered mode. I am not using gfs, but i am putting fencing device. My querry is that, can I use the *AOC-IPMI20-E card as an fencing device. regards * -------------- next part -------------- An HTML attachment was scrubbed... URL: From raycharles_man at yahoo.com Thu Aug 3 15:55:04 2006 From: raycharles_man at yahoo.com (Ray Charles) Date: Thu, 3 Aug 2006 08:55:04 -0700 (PDT) Subject: [Linux-cluster] Logging for cluster errors. Message-ID: <20060803155505.26305.qmail@web32108.mail.mud.yahoo.com> Hi, Easy question. When I run system-config-cluster I am able to see the gui. But in the event there's an error while using the gui where does that get logged? I've seen an error from the gui and it says check logging. I didn't see a /var/log/cluster/ -TIA __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From rpeterso at redhat.com Thu Aug 3 16:20:03 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Thu, 03 Aug 2006 11:20:03 -0500 Subject: [Linux-cluster] Logging for cluster errors. In-Reply-To: <20060803155505.26305.qmail@web32108.mail.mud.yahoo.com> References: <20060803155505.26305.qmail@web32108.mail.mud.yahoo.com> Message-ID: <44D22233.5030403@redhat.com> Ray Charles wrote: > > Hi, > > Easy question. > > When I run system-config-cluster I am able to see the > gui. But in the event there's an error while using > the gui where does that get logged? > > I've seen an error from the gui and it says check > logging. I didn't see a /var/log/cluster/ > > -TIA > Hi Ray, Usually the messages go into /var/log/messages. 
Many of the messages can be redirected to other places by changing the cluster.conf file, so the code won't tell you specifically where to look. Regards, Bob peterson Red Hat Cluster Suite From jparsons at redhat.com Thu Aug 3 17:06:26 2006 From: jparsons at redhat.com (James Parsons) Date: Thu, 03 Aug 2006 13:06:26 -0400 Subject: [Linux-cluster] Logging for cluster errors. In-Reply-To: <44D22233.5030403@redhat.com> References: <20060803155505.26305.qmail@web32108.mail.mud.yahoo.com> <44D22233.5030403@redhat.com> Message-ID: <44D22D12.2060501@redhat.com> Robert Peterson wrote: > Ray Charles wrote: > >> >> Hi, >> >> Easy question. >> >> When I run system-config-cluster I am able to see the >> gui. But in the event there's an error while using >> the gui where does that get logged? >> I've seen an error from the gui and it says check >> logging. I didn't see a /var/log/cluster/ >> >> -TIA >> > > Hi Ray, > > Usually the messages go into /var/log/messages. > Many of the messages can be redirected to other places by > changing the cluster.conf file, so the code won't tell you > specifically where to look. > > Regards, > > Bob peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Ray, What is the nature of the error that you are seeing? -J From rpeterso at redhat.com Thu Aug 3 18:44:31 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Thu, 03 Aug 2006 13:44:31 -0500 Subject: [Linux-cluster] Tracing gfs problems In-Reply-To: <992633B6A0E42B49BC5A41C10A8C841B030E29B8@MUCEX004.root.local> References: <992633B6A0E42B49BC5A41C10A8C841B030E29B8@MUCEX004.root.local> Message-ID: <44D2440F.40408@redhat.com> R?thlein Michael (RI-Solution) wrote: > Hello, > > In the past there occured hangs resulting in reboots of our 4 node cluster. The real problem is: there aren't any traces in the log files of the nodes. > > Is there a possibilty to raise the verbosity of gfs? > > Thanks > > Michael > Hi Michael, Right now, there's no way to increase the level of verbosity or logging in the gfs kernel code, but I'm not sure that would help you anyway. The lockup could be in any part of the kernel: GFS, The DLM/Gulm locking infrastructure, or any other part for that matter. It could also be hardware related or running out of memory, etc. Your best bet may be to temporarily disable fencing so that the hung node(s) don't get fenced as soon as it happens, for example by changing it to manual fencing, and then when it hangs, check for dmesgs on the console, syslog messages in /var/log/messages and if you can't get a command prompt, use the "magic sysreq" key to dump out what each module, thread and process is doing. If that doesn't tell you where the problem is, you can send the info to this list or create a bugzilla for the problem and attach the output from the sysrq, along with details on what release of code you're using, your cluster.conf, etc. Here are simple instructions for using the "magic sysrq" in case you're unfamiliar: 1. Turn it on by doing: echo "1" > /proc/sys/kernel/sysrq 2. Recreate your kernel hang 3. 
If you're at the system console with a keyboard, do alt-sysrq t (task list) If you have a telnet console instead, do ctrl-] to get telnet> prompt telnet> send brk (send a break char) t (task list) If you don't have a keyboard or telnet, but do have a shell: echo "t" > /proc/sysrq-trigger If you're doing it from a minicom, use: f followed by t (For other types of serial consoles, you have to get it to send a break, then letter t) 4. The task info will be dumped to the console, so hopefully you have a way to save that off. Regards, Bob Peterson Red Hat Cluster Suite From lhh at redhat.com Thu Aug 3 18:36:39 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 03 Aug 2006 14:36:39 -0400 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <44CF35CE.1060700@bull.net> References: <44CF35CE.1060700@bull.net> Message-ID: <1154630199.28677.18.camel@ayanami.boston.redhat.com> On Tue, 2006-08-01 at 13:06 +0200, Alain Moulle wrote: > Hi > > We are facing a big problem of split-brain, due to the fact > that the process and Clurgmgrd daemon from RedHat Cluster-Suite unexpectedly > disappeared (still for an unknown reason ...) on one of the HA-Nodes pair. This > caused the other Clurgmgrd on the other Node to be aware of this and then simply > to re-start the application service without effective fenceing/migration. > > It seems to be an abnormal behavior, isn't it ? > > Is there a already a fix available in more recent Update ? Fixed in U4 beta; there were two problems: (a) a segfault, and (b) missing inclusion of Stanko Kupcevic's self-monitoring in clurgmgrd. -- Lon From lhh at redhat.com Thu Aug 3 18:38:44 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 03 Aug 2006 14:38:44 -0400 Subject: [Linux-cluster] clurgmgrd stops service without reason In-Reply-To: <44D08673.3010207@mediatransfer.com> References: <44D08673.3010207@mediatransfer.com> Message-ID: <1154630324.28677.21.camel@ayanami.boston.redhat.com> On Wed, 2006-08-02 at 13:03 +0200, Falk Hackenberger - MediaTransfer AG Netresearch & Consulting wrote: > --snip-- > Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing > /exports/imap/checkimapstartup.sh status > Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing > /exports/subversion/etc/rc.d/init.d/svnserver status > Aug 1 17:31:28 kain clurgmgrd: [4780]: Checking 192.168.0.223, > Level 0 > Aug 1 17:31:28 kain clurgmgrd: [4780]: 192.168.0.223 present on > eth0 > Aug 1 17:31:28 kain clurgmgrd: [4780]: Link for eth0: Detected > Aug 1 17:31:28 kain clurgmgrd: [4780]: Link detected on eth0 > Aug 1 17:31:37 kain clurgmgrd[4780]: Stopping service storage > --snap-- > > how to say to clurgmgrd, that he should log the reason for stoping the > service? Something must be returning an error code where it should not be; can you post your service XML blob? -- Lon From lhh at redhat.com Thu Aug 3 18:39:29 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 03 Aug 2006 14:39:29 -0400 Subject: [Linux-cluster] fencing agent In-Reply-To: <44D20668.6070901@obsidian.co.za> References: <44D20668.6070901@obsidian.co.za> Message-ID: <1154630369.28677.23.camel@ayanami.boston.redhat.com> On Thu, 2006-08-03 at 16:21 +0200, Riaan van Niekerk wrote: > According to the above URL, this card supports IPMI 2, which means that > it should work with the fence_ipmilan fencing module in RHCS 4. It should work with RHCS4, since we're just calling ipmitool. 
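For anyone wanting to sanity-check such a card by hand before wiring it into cluster.conf, the calls fence_ipmilan makes boil down to plain ipmitool commands along these lines (the BMC address and credentials here are made up):

   # what the agent does today: off, confirm, then on
   ipmitool -I lan -H 10.0.0.42 -U admin -P secret chassis power off
   ipmitool -I lan -H 10.0.0.42 -U admin -P secret chassis power status
   ipmitool -I lan -H 10.0.0.42 -U admin -P secret chassis power on

   # the single-shot reboot variant discussed in the 2-node fencing thread
   ipmitool -I lan -H 10.0.0.42 -U admin -P secret chassis power cycle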
-- Lon From lhh at redhat.com Thu Aug 3 19:25:46 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 03 Aug 2006 15:25:46 -0400 Subject: [Linux-cluster] 2-node fencing question In-Reply-To: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> References: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> Message-ID: <1154633146.28677.70.camel@ayanami.boston.redhat.com> Sorry I didn't see this earlier! On Wed, 2006-08-02 at 15:50 +0000, danwest at comcast.net wrote: > It seems like a significant problem to have fence_ipmilan issue a power-off followed by a power-on with a 2 node cluster. Generally, the chances of this occurring are very, very small, though not impossible. However, it could very well be that IPMI hardware modules are slow enough at processing requests that this could pose a problem. What hardware has this happened on? Was ACPI disabled on boot in the host OS (it should be; see below)? > This seems to make a 2-node cluster with ipmi fencing pointless. I'm pretty sure that 'both-nodes-off problem' can only occur if all of the following criteria are met: (a) while using a separate NICs for IPMI and cluster traffic (the recommended configuration), (b) in the event of a network partition, such that both nodes can not see each other but can see each other's IPMI port, and (c) if both nodes send their power-off packets at or near the exact same time. The time window for (c) increases significantly (5+ seconds) if the cluster nodes are enabling ACPI power events on boot. This is one of the reasons why booting with acpi=off is required when using IPMI, iLO, or other integrated power management solutions. If booting with acpi=off, does the problem persist? > It looks like fence_ipmilan needs to support sending a cycle instead of a poweroff than a poweron? The reason fence_ipmilan functions this way (off, status, on) is because that we require a confirmation that the node has lost power. I am not sure that it is possible to confirm the node has rebooted using IPMI. Arguably, it also might not be necessary to make such a confirmation in this particular case. > According to fence_ipmilan.c it looks like cycle is not an option although it is an option for ipmitool. (ipmitool -H -U -P chassis power cycle) Looks like you're on the right track. -- Lon From nicholas at fiocruz.br Thu Aug 3 20:27:03 2006 From: nicholas at fiocruz.br (Nicholas Anderson) Date: Thu, 03 Aug 2006 17:27:03 -0300 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: <44D126D4.2080303@obsidian.co.za> References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> <44D126D4.2080303@obsidian.co.za> Message-ID: <44D25C17.8000004@fiocruz.br> Hi again all ..... I guess i'm starting to understand how the things should work .... I was reading about GFS and all the documents that i found suppose that you have a storage with a SAN and 2 or more machines connected through FC to the SAN. Well, it seems to me that in this case the storage or the SAN switch still being one single-point-of-failure right? If the storage or SAN goes down, the whole service will be offline right ? I thought that with GFS i could do something like a "Parallel FS" where 2 (or more) machines would have the same data in their disks, but this data would be synchronized in realtime .... am i totally noob or there really has a way to make FS's work in parallel, synchronizing in realtime? 
I'd like to do this without having a SAN (cause i don't have one :-) and i have only 1 storage ) and without leaving a single-point-of-failure. Let me try to explain exactly what I'm thinking ... 3 servers, each one with a 300GB SCSI disk (local, no FC) to be synchronized with the others (through GFS?? mounted and shared as a /data f.ex.), and one 36GB disk only for the SO. All the servers would have smtp(sendmail with spamassassin and clamav), imap and pop3 services running, and probably a squirrelmail. Is it possible to do this? Is it possible to get this data synchronized in realtime ? Thanks again for your really really important answers, and sorry for asking so much noob questions :-) Nick Riaan van Niekerk wrote: > > We are running 700 000 users on a 2.5 GFS, 4 nodes, with POP, IMAP > (direct access and SquirrelmMail) and SMTP. To make things worse, we > use NFS between our GFS nodes and our mail servers. > > We initially had huge performance problems in our setup, which I wrote > in this message: > http://www.redhat.com/archives/linux-cluster/2006-July/msg00136.html > > We ended up bumping the spindle count from 36 to 60 and then to 114, > without it making a noticeable difference. > > Our main killer was Squirrelmail over IMAP (the solution is primarily > a webmail-based one) > Our performance problems were solved by the following: > - removing the folder-size plugin (built-in) and mail quota plugin > (3rd party) reduced the traffic between IMAP servers and storage > backend by 40%. > - Implement imap proxy (www.imapproxy.org). This is giving us a 1 to > 14 hit ratio. This storage which could not keep up previously, is now > humming along fine. > > Our initial mistake was to try and optimise on the FS layer (there > werent any real performance optimizations in our setup to be made) and > throw hardware at the problem, instead of suspecting and optimizing > our application. Despite GFS not being designed for lots of small > files, and not recommended for use with NFS, with the above changes, > it performs more than adequately. We hope to see another performance > gain once we get rid of the NFS and have our mail servers access the > GFS directly. > > Riaan > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Nicholas Anderson Administrador de Sistemas Unix LPIC-1 Certified Rede Fiocruz e-mail: nicholas at fiocruz.br From rainer at ultra-secure.de Thu Aug 3 22:53:51 2006 From: rainer at ultra-secure.de (Rainer Duffner) Date: Fri, 04 Aug 2006 00:53:51 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: <44D25C17.8000004@fiocruz.br> References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> <44D126D4.2080303@obsidian.co.za> <44D25C17.8000004@fiocruz.br> Message-ID: <44D27E7F.6000706@ultra-secure.de> Nicholas Anderson wrote: > Hi again all ..... > > I guess i'm starting to understand how the things should work .... > > I was reading about GFS and all the documents that i found suppose > that you have a storage with a SAN and 2 or more machines connected > through FC to the SAN. > Well, it seems to me that in this case the storage or the SAN switch > still being one single-point-of-failure right? If the storage or SAN > goes down, the whole service will be offline right ? First of all, you (should) have redundant FC-switches (mulipathing). Then, your storage has (should have) multiple controllers. Eg. HP EVA series. 
If that isn't enough, there are solution to mirror the storage at the hardware-level. Usually, this is in the "if-you-have-to-ask-it's-probably-too-expensive-for-you-anyway"-pricerange and thus only used where the (lack of) downtime is worth the investment. > > I thought that with GFS i could do something like a "Parallel FS" > where 2 (or more) machines would have the same data in their disks, > but this data would be synchronized in realtime .... > am i totally noob or there really has a way to make FS's work in > parallel, synchronizing in realtime? > I'd like to do this without having a SAN (cause i don't have one :-) > and i have only 1 storage ) and without leaving a > single-point-of-failure. > > Let me try to explain exactly what I'm thinking ... > > 3 servers, each one with a 300GB SCSI disk (local, no FC) to be > synchronized with the others (through GFS?? mounted and shared as a > /data f.ex.), and one 36GB disk only for the SO. > All the servers would have smtp(sendmail with spamassassin and > clamav), imap and pop3 services running, and probably a squirrelmail. > You can have a master/slave solution with DRBD. > Is it possible to do this? Is it possible to get this data > synchronized in realtime ? I don't think so. Well, Google has sort-of a solution via their "Google Filesystem". But not for you or me. :-( > > Thanks again for your really really important answers, and sorry for > asking so much noob questions :-) > IMO, hardware is very reliable these days (if you choose wisely). Things like DRBD seem (to me) only useful in very special cases - and I would fear that DRBD might create more problems than it solves. In your special case (email), if you can't afford a SAN, get a used NetApp and store the maildirs there (qmail-style maildirs). Then NFS-mount them on the "cluster-nodes". The NetApp is reliable enough for these scenarios and depending on the exact model, already contains a lot of redundancy in itself. cheers, Rainer From nanfang.xun at sunnexchina.com Fri Aug 4 00:37:21 2006 From: nanfang.xun at sunnexchina.com (Nanfang.Xun) Date: Fri, 04 Aug 2006 08:37:21 +0800 Subject: [Linux-cluster] Linux-cluster mailing list submissions Message-ID: <1154651841.3512.20.camel@ns.xunting.net> From yfttyfs at gmail.com Fri Aug 4 02:32:07 2006 From: yfttyfs at gmail.com (y f) Date: Fri, 4 Aug 2006 10:32:07 +0800 Subject: [Linux-cluster] Linux-cluster mailing list submissions In-Reply-To: <1154651841.3512.20.camel@ns.xunting.net> References: <1154651841.3512.20.camel@ns.xunting.net> Message-ID: <78fcc84a0608031932j7522df50wf6cd36b28a81ff67@mail.gmail.com> Hi, Xun, Do you also like Cluster as a guy of metal products company ? Wish you a nice day ! /yf On 8/4/06, Nanfang.Xun wrote: > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nicholas at fiocruz.br Fri Aug 4 02:34:30 2006 From: nicholas at fiocruz.br (Nicholas Anderson) Date: Thu, 3 Aug 2006 23:34:30 -0300 (BRT) Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: <44D27E7F.6000706@ultra-secure.de> References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> <44D126D4.2080303@obsidian.co.za> <44D25C17.8000004@fiocruz.br> <44D27E7F.6000706@ultra-secure.de> Message-ID: <61582.201.51.123.23.1154658870.squirrel@www.redefiocruz.fiocruz.br> > First of all, you (should) have redundant FC-switches (mulipathing). 
> Then, your storage has (should have) multiple controllers. Eg. HP EVA > series. > If that isn't enough, there are solution to mirror the storage at the > hardware-level. > Usually, this is in the > "if-you-have-to-ask-it's-probably-too-expensive-for-you-anyway"-pricerange > and thus only used where the (lack of) downtime is worth the investment. oops, money is the problem :-P i work for a government institution ..... in Brazil :-P > IMO, hardware is very reliable these days (if you choose wisely). Things > like DRBD seem (to me) only useful in very special cases - and I would > fear that DRBD might create more problems than it solves. > In your special case (email), if you can't afford a SAN, get a used > NetApp and store the maildirs there (qmail-style maildirs). Then > NFS-mount them on the "cluster-nodes". > The NetApp is reliable enough for these scenarios and depending on the > exact model, already contains a lot of redundancy in itself. i already thought about this ..... its a possibility .... thanks for the answer ... cheers Nick -- Nicholas Anderson Administrador de Sistemas Unix LPIC-1 Certified Rede Fiocruz e-mail: nicholas at fiocruz.br From Leonardo.Mello at planejamento.gov.br Fri Aug 4 11:31:01 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Fri, 4 Aug 2006 08:31:01 -0300 Subject: [Linux-cluster] gfs support for extended security attributes Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B5E@corp-bsa-mp01.planejamento.gov.br> The gfs doesn't support SELinux attributes, currently you MUST DISABLE SELinux to use GFS+Cluster Suite. I don't know if there is any plan to support it. maybe one developer or someone from redhat can anwser you better. :-D Best Regards Leonardo Rodrigues de Mello -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of David Caplan Sent: sex 21/7/2006 17:16 To: linux-cluster at redhat.com Cc: Subject: [Linux-cluster] gfs support for extended security attributes Does the current release of GFS support extended security attributes for use with SELinux? If not, are there any plans for support? Thanks, David -- __________________________________ David Caplan 410 290 1411 x105 dac at tresys.com Tresys Technology, LLC 8840 Stanford Blvd., Suite 2100 Columbia, MD 21045 -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3061 bytes Desc: not available URL: From kanderso at redhat.com Fri Aug 4 14:12:49 2006 From: kanderso at redhat.com (Kevin Anderson) Date: Fri, 04 Aug 2006 09:12:49 -0500 Subject: [Linux-cluster] gfs support for extended security attributes In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B5E@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B5E@corp-bsa-mp01.planejamento.gov.br> Message-ID: <1154700769.2783.3.camel@dhcp80-204.msp.redhat.com> The gfs-kernel code in the HEAD of the cvs tree now has SELinux extended attribute support integrated into the code. Also, the upstream gfs2 code that is in the -mm kernel also has the SELinux support as well. The gfs code in HEAD is targeted at the Fedora Core 6 and RHEL5 releases. Kevin On Fri, 2006-08-04 at 08:31 -0300, Leonardo Rodrigues de Mello wrote: > The gfs doesn't support SELinux attributes, currently you MUST DISABLE SELinux to use GFS+Cluster Suite. > > I don't know if there is any plan to support it. 
maybe one developer or someone from redhat can anwser you better. :-D > > Best Regards > Leonardo Rodrigues de Mello > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com on behalf of David Caplan > Sent: sex 21/7/2006 17:16 > To: linux-cluster at redhat.com > Cc: > Subject: [Linux-cluster] gfs support for extended security attributes > > Does the current release of GFS support extended security attributes for > use with SELinux? If not, are there any plans for support? > > Thanks, > David > > -- > __________________________________ > > David Caplan 410 290 1411 x105 > dac at tresys.com > Tresys Technology, LLC > 8840 Stanford Blvd., Suite 2100 > Columbia, MD 21045 > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gforte at leopard.us.udel.edu Fri Aug 4 14:32:25 2006 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Fri, 04 Aug 2006 10:32:25 -0400 Subject: [Linux-cluster] what causes "magma send einval to ..."? Message-ID: <44D35A79.700@leopard.us.udel.edu> I had a cluster node chugging along seemingly fine last night, then the following two lines appear in /var/log/messages: Aug 3 22:20:07 hostname kernel: al to 1 Aug 3 22:20:07 hostname kernel: Magma send einval to 1 And about 20 seconds later the other node fenced this one. I'm guessing that that fragmented message means that there's some sort of kernel flakiness going on, or that the box got overloaded (no way to tell, unfortunately - any recommendations on monitoring tools to track and log load level?), but that's just a guess. -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From lhh at redhat.com Fri Aug 4 15:11:12 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 04 Aug 2006 11:11:12 -0400 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <44D3251A.5080001@bull.net> References: <44D3251A.5080001@bull.net> Message-ID: <1154704272.28677.90.camel@ayanami.boston.redhat.com> On Fri, 2006-08-04 at 12:44 +0200, Alain Moulle wrote: > Hi Ron, > > could you provide me the defects numbers and/or linked patches ? Here's the current list of pending fixes: http://bugzilla.redhat.com/bugzilla/buglist.cgi?component=rgmanager&bug_status=MODIFIED&bug_status=FAILS_QA&bug_status=ON_QA The patch for internal self-monitoring was simply a backport from the HEAD branch. I've attached a hand-edited patch which should enable the self-monitoring bit. Additionally, there was a segfault fixed in U3. Here's the errata advisory, which contains links to bugzillas: https://rhn.redhat.com/errata/RHBA-2006-0241.html -- Lon -------------- next part -------------- A non-text attachment was scrubbed... Name: watchdog.diff Type: text/x-patch Size: 4064 bytes Desc: not available URL: From raycharles_man at yahoo.com Fri Aug 4 15:43:41 2006 From: raycharles_man at yahoo.com (Ray Charles) Date: Fri, 4 Aug 2006 08:43:41 -0700 (PDT) Subject: [Linux-cluster] Logging for cluster errors. In-Reply-To: <44D22D12.2060501@redhat.com> Message-ID: <20060804154341.47391.qmail@web32114.mail.mud.yahoo.com> Yes, The error pops up say when I want to disable a service and re-enable. At the moment its a failed service so when I go to disable it in the gui i get the error and it directs me to check the logs. 
The Error is not explicit at all just ERROR and a directive to check the logs. -Ray --- James Parsons wrote: > Robert Peterson wrote: > > > Ray Charles wrote: > > > >> > >> Hi, > >> > >> Easy question. > >> > >> When I run system-config-cluster I am able to see > the > >> gui. But in the event there's an error while > using > >> the gui where does that get logged? > >> I've seen an error from the gui and it says check > >> logging. I didn't see a /var/log/cluster/ > >> > >> -TIA > >> > > > > Hi Ray, > > > > Usually the messages go into /var/log/messages. > > Many of the messages can be redirected to other > places by > > changing the cluster.conf file, so the code won't > tell you > > specifically where to look. > > > > Regards, > > > > Bob peterson > > Red Hat Cluster Suite > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > Ray, > > What is the nature of the error that you are seeing? > > -J > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From Leonardo.Mello at planejamento.gov.br Fri Aug 4 17:20:48 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Fri, 4 Aug 2006 14:20:48 -0300 Subject: [Linux-cluster] what causes "magma send einval to ..."? Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B61@corp-bsa-mp01.planejamento.gov.br> There was some discussions about this on the list in october 2005. http://www.google.com/search?q=%22Magma+send+einval+to%22&hl=en&lr=&filter=0 One entry in bugzilla relative this: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169693 Best Regards Leonardo Rodrigues de Mello -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Greg Forte Sent: sex 4/8/2006 11:32 To: linux clustering Cc: Subject: [Linux-cluster] what causes "magma send einval to ..."? I had a cluster node chugging along seemingly fine last night, then the following two lines appear in /var/log/messages: Aug 3 22:20:07 hostname kernel: al to 1 Aug 3 22:20:07 hostname kernel: Magma send einval to 1 And about 20 seconds later the other node fenced this one. I'm guessing that that fragmented message means that there's some sort of kernel flakiness going on, or that the box got overloaded (no way to tell, unfortunately - any recommendations on monitoring tools to track and log load level?), but that's just a guess. -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3388 bytes Desc: not available URL: From rohara at redhat.com Fri Aug 4 21:57:16 2006 From: rohara at redhat.com (Ryan O'Hara) Date: Fri, 04 Aug 2006 16:57:16 -0500 Subject: [Linux-cluster] gfs support for extended security attributes In-Reply-To: <6FE441CD9F0C0C479F2D88F959B01588298BCA@exchange.columbia.tresys.com> References: <6FE441CD9F0C0C479F2D88F959B01588298BCA@exchange.columbia.tresys.com> Message-ID: <44D3C2BC.2070904@redhat.com> David, Sorry for the delay. The current release of GFS (in RHEL3 and RHEL4) does not support SELinux extended attributes. 
The code for GFS(1) in cvs HEAD does have support of SELinux. I added this code recently. This should make its way into our released version of GFS in the near future. GFS2, which is currently in development and being pushed upstream, also has SELinux extended attribute support. So to answer your questions... No, our current release does not support SELinux. Yes, we do plan to support it and the code is in-place. Note that anyone who wanted to try using GFS/GFS2 with SELinux attributes may need to make relevant changes to the policy. With that said, I do know for certain that the Rawhide packages do have a policy that define gfs and gfs2 as supported filesystems. Ryan David Caplan wrote: > > Does the current release of GFS support extended security attributes for > use with SELinux? If not, are there any plans for support? > > Thanks, > David > > -- > __________________________________ > > David Caplan 410 290 1411 x105 > dac at tresys.com > Tresys Technology, LLC > 8840 Stanford Blvd., Suite 2100 > Columbia, MD 21045 > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From riaan at obsidian.co.za Sat Aug 5 22:19:56 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Sun, 06 Aug 2006 00:19:56 +0200 Subject: [Linux-cluster] 2-node fencing question In-Reply-To: <1154633146.28677.70.camel@ayanami.boston.redhat.com> References: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> <1154633146.28677.70.camel@ayanami.boston.redhat.com> Message-ID: <44D5198C.2090603@obsidian.co.za> > However, it could very well be that IPMI hardware modules are slow > enough at processing requests that this could pose a problem. What > hardware has this happened on? Was ACPI disabled on boot in the host OS > (it should be; see below)? > > snip > > The time window for (c) increases significantly (5+ seconds) if the > cluster nodes are enabling ACPI power events on boot. This is one of > the reasons why booting with acpi=off is required when using IPMI, iLO, > or other integrated power management solutions. > > If booting with acpi=off, does the problem persist? > Lon - is the requirement for disabling acpi when using integrated fence devices documented anywhere? I have searched far and wide on the nature of acpi=off (if it is good or bad, recommended by Red Hat or anyone out there). Yours is the strongest against acpi enabled I have found, but not for reasons I would have expected. My impression of acpi=off is it borders on a magical cure-all for boot/installation problems (in part due to bad acpi by server/firmware vendors), but that it also acts as some kind of safe mode (e.g. ht is disabled, does things to IRQ routing, etc) which may have an adverse effect on system performance. Are you aware of any negative effects, performance or otherwise, which acpi=off will cause. E.g. if the only adverse effect of acpi=off is hyperthreading being disabled, users that want it back, can so using acpi=ht Riaan note: IMHO, a Knowledge Base article on the use of acpi=off (and its variants), for general RHEL installations, and pertaining to RHCS/GFS implementations would be very welcome. -------------- next part -------------- A non-text attachment was scrubbed... 
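For what it's worth, the "acpi=off" being discussed in this thread is just a kernel command-line switch; on a stock RHEL 4 box it would typically be appended to the kernel line in /boot/grub/grub.conf, roughly like the sketch below (kernel version and root device are illustrative, not taken from any poster's setup):

  title Red Hat Enterprise Linux AS (2.6.9-42.ELsmp)
          root (hd0,0)
          kernel /vmlinuz-2.6.9-42.ELsmp ro root=/dev/VolGroup00/LogVol00 acpi=off
          initrd /initrd-2.6.9-42.ELsmp.img

If hyperthreading is the only ACPI-dependent feature you want back, acpi=ht can be substituted for acpi=off, as mentioned above.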
Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From riaan at obsidian.co.za Sat Aug 5 22:53:13 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Sun, 06 Aug 2006 00:53:13 +0200 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <1154704272.28677.90.camel@ayanami.boston.redhat.com> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> Message-ID: <44D52159.8090004@obsidian.co.za> Lon Hohberger wrote: > On Fri, 2006-08-04 at 12:44 +0200, Alain Moulle wrote: >> Hi Ron, >> >> could you provide me the defects numbers and/or linked patches ? > > Here's the current list of pending fixes: > > http://bugzilla.redhat.com/bugzilla/buglist.cgi?component=rgmanager&bug_status=MODIFIED&bug_status=FAILS_QA&bug_status=ON_QA > Lon With RHEL 4 update 4 just around the corner, what is the planned release schedule for RHCS 4 update 4 / GFS 6.1 update 4? Since these are not even in beta yet, does that mean that CS/GFS customers will have to wait for the CS/GFS versions of update 4 before they can go to RHEL 4 update 4? tnx Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From f.hackenberger at mediatransfer.de Mon Aug 7 07:16:56 2006 From: f.hackenberger at mediatransfer.de (Falk Hackenberger - MediaTransfer AG Netresearch & Consulting) Date: Mon, 07 Aug 2006 09:16:56 +0200 Subject: [Linux-cluster] clurgmgrd stops service without reason In-Reply-To: <1154630324.28677.21.camel@ayanami.boston.redhat.com> References: <44D08673.3010207@mediatransfer.com> <1154630324.28677.21.camel@ayanami.boston.redhat.com> Message-ID: <44D6E8E8.4090903@mediatransfer.de> Lon Hohberger wrote: > On Wed, 2006-08-02 at 13:03 +0200, Falk Hackenberger - MediaTransfer AG > Netresearch & Consulting wrote: > >>--snip-- >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing >>/exports/imap/checkimapstartup.sh status >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing >>/exports/subversion/etc/rc.d/init.d/svnserver status >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Checking 192.168.0.223, >>Level 0 >>Aug 1 17:31:28 kain clurgmgrd: [4780]: 192.168.0.223 present on >>eth0 >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Link for eth0: Detected >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Link detected on eth0 >>Aug 1 17:31:37 kain clurgmgrd[4780]: Stopping service storage >>--snap-- >> >>how to say to clurgmgrd, that he should log the reason for stoping the >>service? > > Something must be returning an error code where it should not be; can > you post your service XML blob? it is very long and a little bit complex as i know... 
;-) From f.hackenberger at mediatransfer.com Mon Aug 7 07:17:26 2006 From: f.hackenberger at mediatransfer.com (Falk Hackenberger - MediaTransfer AG Netresearch & Consulting) Date: Mon, 07 Aug 2006 09:17:26 +0200 Subject: [Linux-cluster] clurgmgrd stops service without reason In-Reply-To: <1154630324.28677.21.camel@ayanami.boston.redhat.com> References: <44D08673.3010207@mediatransfer.com> <1154630324.28677.21.camel@ayanami.boston.redhat.com> Message-ID: <44D6E906.5060006@mediatransfer.com> Lon Hohberger wrote: > On Wed, 2006-08-02 at 13:03 +0200, Falk Hackenberger - MediaTransfer AG > Netresearch & Consulting wrote: > >>--snip-- >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing >>/exports/imap/checkimapstartup.sh status >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing >>/exports/subversion/etc/rc.d/init.d/svnserver status >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Checking 192.168.0.223, >>Level 0 >>Aug 1 17:31:28 kain clurgmgrd: [4780]: 192.168.0.223 present on >>eth0 >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Link for eth0: Detected >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Link detected on eth0 >>Aug 1 17:31:37 kain clurgmgrd[4780]: Stopping service storage >>--snap-- >> >>how to say to clurgmgrd, that he should log the reason for stoping the >>service? > > Something must be returning an error code where it should not be; can > you post your service XML blob? it is very long and a little bit complex as i know... ;-) From neohill at gmail.com Mon Aug 7 07:50:07 2006 From: neohill at gmail.com (Neo Hill) Date: Mon, 7 Aug 2006 09:50:07 +0200 Subject: [Linux-cluster] DRBD in Active-active mode Message-ID: Hi everybody, I am still looking on information or documents regarding DRBD in active-active mode. Does anyone could help me ? Thanks a lot. Neo hill -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Mon Aug 7 09:29:33 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Mon, 07 Aug 2006 11:29:33 +0200 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <1154704272.28677.90.camel@ayanami.boston.redhat.com> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> Message-ID: <44D707FD.6080602@bull.net> Hi Lon I've tried to patch the U2 version with this patch, but it requires a nodeevent.c which apparently did not exist in CS4 U2 (that Makefile patch adds a nodeevent.o as well as the watchdog.o) . Does that mean that this patch can definetly not be applied on rgmanager (1.9.38) from CS4 U2 ? Thanks Alain Moull? Lon Hohberger wrote: > On Fri, 2006-08-04 at 12:44 +0200, Alain Moulle wrote: > >>Hi Ron, >> >>could you provide me the defects numbers and/or linked patches ? > > > Here's the current list of pending fixes: > > http://bugzilla.redhat.com/bugzilla/buglist.cgi?component=rgmanager&bug_status=MODIFIED&bug_status=FAILS_QA&bug_status=ON_QA > > The patch for internal self-monitoring was simply a backport from the > HEAD branch. I've attached a hand-edited patch which should enable the > self-monitoring bit. > > Additionally, there was a segfault fixed in U3. 
Here's the errata > advisory, which contains links to bugzillas: > > https://rhn.redhat.com/errata/RHBA-2006-0241.html > > -- Lon From joe.devman at yahoo.fr Mon Aug 7 12:08:00 2006 From: joe.devman at yahoo.fr (Joe) Date: Mon, 07 Aug 2006 14:08:00 +0200 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF94FE.3070407@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <44CF94FE.3070407@redhat.com> Message-ID: <44D72D20.2060705@yahoo.fr> Robert Peterson wrote: > We've tried to kick around ideas on how to improve the speed, such as > (1) adding an option to only focus on areas where the journals are dirty, > (2) introducing multiple threads to process the different RGs, and even > (3) trying to get multiple nodes in the cluster to team up and do > different > areas of the file system. None of these have been implemented yet > because of higher priorities. Since this is an open-source project, > anyone > could step in and do these. Volunteers? > I've tried to look at the code many times. But, as a clustered file system is a complex thing, it gets hard to figure out what it's all about. I tried to find a "big picture" documentation, at least for on-disk layout. The only nearest thing i've found is : http://opengfs.sourceforge.net/docs.php , which is the documentation written at the time OpenGFS forked from Cistina's code. Although principles may still be the sames (or not ?), the code has obviously changed and on-disk layout may not be the same, too. So, is there some sort of documentation about the principles found in GFS (not a design doc, i've read /usr/src/linux/Documentation/stable_api_nonsense.txt) ? This would much help anybody who wishes to enter the code to do it more efficientely... Thanks ! From lhh at redhat.com Mon Aug 7 14:27:27 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 07 Aug 2006 10:27:27 -0400 Subject: [Linux-cluster] DRBD in Active-active mode In-Reply-To: References: Message-ID: <1154960847.21204.35.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: > Hi everybody, > > I am still looking on information or documents regarding DRBD in > active-active mode. > > Does anyone could help me ? > > Thanks a lot. Fairly certain this is not possible, unless something has changed recently. That is, you cannot use DRBD as a distributed concurrently writable mirror; only one node can be the master of a DRBD device at a time. You can do this with GNBD + Cluster Mirroring, though. -- Lon From lhh at redhat.com Mon Aug 7 14:30:41 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 07 Aug 2006 10:30:41 -0400 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <44D707FD.6080602@bull.net> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> <44D707FD.6080602@bull.net> Message-ID: <1154961041.21204.40.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 11:29 +0200, Alain Moulle wrote: > Hi Lon > > I've tried to patch the U2 version with this patch, but it requires > a nodeevent.c which apparently did not exist in CS4 U2 (that Makefile patch > adds a nodeevent.o as well as the watchdog.o) . > Does that mean that this patch can definetly not be applied > on rgmanager (1.9.38) from CS4 U2 ? Take it out of the patched Makefile. Nodeevent.c shouldn't be required to make the watchdog work. 
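In other words, when applying the hand-edited patch on top of rgmanager 1.9.38, keep the hunk that adds watchdog.o to the object list and drop the one that adds nodeevent.o. The resulting object list would look roughly like this (purely illustrative -- the real variable name and file list in the rgmanager Makefile differ):

  OBJS = main.o rg_state.o ... \
         watchdog.o
  # nodeevent.o deliberately left out -- nodeevent.c is not present in 1.9.38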
-- Lon From chawkins at bplinux.com Mon Aug 7 14:33:31 2006 From: chawkins at bplinux.com (Christopher Hawkins) Date: Mon, 7 Aug 2006 10:33:31 -0400 Subject: [Linux-cluster] DRBD in Active-active mode In-Reply-To: <1154960847.21204.35.camel@ayanami.boston.redhat.com> Message-ID: <200608071418.k77EIq1X000664@mail2.ontariocreditcorp.com> On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: >> Hi everybody, >> >> I am still looking on information or documents regarding DRBD in >> active-active mode. >> >> Does anyone could help me ? >> >> Thanks a lot. >Fairly certain this is not possible, unless something has changed recently. >That is, you cannot use DRBD as adistributed concurrently writable mirror; >only one node can be the master of a DRBD device at a time. >You can do this with GNBD + Cluster Mirroring, though. >-- Lon Lon, GNBD + Cluster Mirroring? Are you referring to clvm2, or is there another package out there I haven't heard of? Thanks, Chris From riaan at obsidian.co.za Mon Aug 7 15:03:32 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Mon, 07 Aug 2006 17:03:32 +0200 Subject: [Linux-cluster] DRBD in Active-active mode In-Reply-To: <1154960847.21204.35.camel@ayanami.boston.redhat.com> References: <1154960847.21204.35.camel@ayanami.boston.redhat.com> Message-ID: <44D75644.8000901@obsidian.co.za> Lon Hohberger wrote: > On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: >> Hi everybody, >> >> I am still looking on information or documents regarding DRBD in >> active-active mode. >> >> Does anyone could help me ? >> >> Thanks a lot. > > Fairly certain this is not possible, unless something has changed > recently. That is, you cannot use DRBD as a distributed concurrently > writable mirror; only one node can be the master of a DRBD device at a > time. > > You can do this with GNBD + Cluster Mirroring, though. > > -- Lon According to the DRBD FAQ: http://www.linux-ha.org/DRBD/FAQ#head-ec4ab5a57e15232e9ac4e12775de5a1b328aeff5 Why does DRBD not allow concurrent access from all nodes? I'd like to use it with GFS/OCFS2... Actually, DRBD v8 (which is still in pre-release state at the time of this writing) supports this. You need to net { allow-two-primaries; } ... I have not tried this myself though, but would like to hear the experiences of anyone who has tried this. Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From Alain.Moulle at bull.net Mon Aug 7 15:07:40 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Mon, 07 Aug 2006 17:07:40 +0200 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <1154961041.21204.40.camel@ayanami.boston.redhat.com> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> <44D707FD.6080602@bull.net> <1154961041.21204.40.camel@ayanami.boston.redhat.com> Message-ID: <44D7573C.5020209@bull.net> Lon Hohberger wrote: > On Mon, 2006-08-07 at 11:29 +0200, Alain Moulle wrote: > >>Hi Lon >> >>I've tried to patch the U2 version with this patch, but it requires >>a nodeevent.c which apparently did not exist in CS4 U2 (that Makefile patch >>adds a nodeevent.o as well as the watchdog.o) . >>Does that mean that this patch can definetly not be applied >>on rgmanager (1.9.38) from CS4 U2 ? > > > Take it out of the patched Makefile. Nodeevent.c shouldn't be required > to make the watchdog work. > > -- Lon Build ok. Thanks. 
Could you explain exactly the benefit of this watchdog work ? Thanks Alain > > From rpeterso at redhat.com Mon Aug 7 15:55:10 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 07 Aug 2006 10:55:10 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44D72D20.2060705@yahoo.fr> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <44CF94FE.3070407@redhat.com> <44D72D20.2060705@yahoo.fr> Message-ID: <44D7625E.9090305@redhat.com> Joe wrote: > I've tried to look at the code many times. But, as a clustered file > system is a complex thing, it gets hard to figure out what it's all > about. I tried to find a "big picture" documentation, at least for > on-disk layout. The only nearest thing i've found is : > http://opengfs.sourceforge.net/docs.php , which is the documentation > written at the time OpenGFS forked from Cistina's code. Although > principles may still be the sames (or not ?), the code has obviously > changed and on-disk layout may not be the same, too. > So, is there some sort of documentation about the principles found in > GFS (not a design doc, i've read > /usr/src/linux/Documentation/stable_api_nonsense.txt) ? This would > much help anybody who wishes to enter the code to do it more > efficientely... > > Thanks ! Hi Joe, I agree that there isn't much good design information out there regarding GFS. That might be because it started out as a proprietary product before Red Hat open-sourced it. There are some comments in the kernel's gfs_ondisk.h include: http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/gfs/gfs_ondisk.h?cvsroot=cluster Perhaps I'll start working on a GFS1/2 design whitepaper based on some of the information I've gathered. Regards, Bob Peterson Red Hat Cluster Suite From hardyjm at potsdam.edu Mon Aug 7 18:11:06 2006 From: hardyjm at potsdam.edu (Jeff Hardy) Date: Mon, 07 Aug 2006 14:11:06 -0400 Subject: [Linux-cluster] lvm2 liblvm2clusterlock.so on fc5 In-Reply-To: <1154022763.2789.120.camel@fritzdesk.potsdam.edu> References: <1154022763.2789.120.camel@fritzdesk.potsdam.edu> Message-ID: <1154974266.1797.124.camel@fritzdesk.potsdam.edu> On Thu, 2006-07-27 at 13:52 -0400, Jeff Hardy wrote: > I apologize if this has been answered already or appeared in release > notes somewhere, but I cannot find it. FC4 had the lvm2-cluster package > to provide the clvm locking library. This was removed in FC5 (as > indicated in the release notes). > > Is this still necessary for a clvm setup: > > In /etc/lvm/lvm.conf: > locking_type = 2 > locking_library = "/lib/liblvm2clusterlock.so" > > And if so, where does one find this now? > > Thank you. > > Well, though absent in FC5, I just recently saw a message somewhere indicating the lvm2-cluster package was back in FC6 testing. Anyone have any idea why this was dropped for FC5? I built off of the lvm2 source rpm, using a modified lvm2-cluster spec file from FC4. Looks ok. If anyone has reason to believe this is a really bad idea, or wants the srpm or rpm, feel free to drop me a line. 
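For anyone wanting to reproduce the rebuild, a rough sketch of the steps (package version, architecture and paths are illustrative, and the FC4 lvm2-cluster spec file will need its version/BuildRequires fields adjusted for FC5):

  # install the FC5 lvm2 source rpm and the modified spec
  rpm -ivh lvm2-2.02.xx-x.src.rpm
  cp lvm2-cluster.spec /usr/src/redhat/SPECS/

  # build and install the cluster locking library package
  rpmbuild -bb /usr/src/redhat/SPECS/lvm2-cluster.spec
  rpm -Uvh /usr/src/redhat/RPMS/x86_64/lvm2-cluster-*.rpm

Then point /etc/lvm/lvm.conf at the library as shown above (locking_type = 2, locking_library = "/lib/liblvm2clusterlock.so") and restart clvmd.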
-- Jeff Hardy Systems Analyst hardyjm at potsdam.edu From agk at redhat.com Mon Aug 7 18:28:24 2006 From: agk at redhat.com (Alasdair G Kergon) Date: Mon, 7 Aug 2006 19:28:24 +0100 Subject: [Linux-cluster] lvm2 liblvm2clusterlock.so on fc5 In-Reply-To: <1154974266.1797.124.camel@fritzdesk.potsdam.edu> References: <1154022763.2789.120.camel@fritzdesk.potsdam.edu> <1154974266.1797.124.camel@fritzdesk.potsdam.edu> Message-ID: <20060807182824.GP18633@agk.surrey.redhat.com> On Mon, Aug 07, 2006 at 02:11:06PM -0400, Jeff Hardy wrote: > have any idea why this was dropped for FC5? It got disabled early on because it wouldn't build (depended on cluster infrastructure that wasn't there) and when that got resolved, nobody remembered to reenable it. As you noticed, we've got it back into fc6/devel and we're trying to to get it approved for fc5 updates. Alasdair -- agk at redhat.com From Leonardo.Mello at planejamento.gov.br Mon Aug 7 19:09:00 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Mon, 7 Aug 2006 16:09:00 -0300 Subject: RES: [Linux-cluster] DRBD in Active-active mode Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B6E@corp-bsa-mp01.planejamento.gov.br> Hi everyone, Lon you are right for the stable version of DRBD the version 0.7. But DRBD actualy has support for active-active setup in the development version 0.8. There is significant changes between this versions, the entire roadmap can be read at: http://svn.drbd.org/drbd/trunk/ROADMAP I have done some investigations and tests with DRBD 0.8 in active-active setup with two nodes and OCFS2 and with GFS. This was for one project i was doing related to oracle Rac 10g. I have produced one documentation in portuguese that shows how to setup and use drbd in active-active with ocfs2. the link is: http://guialivre.governoeletronico.gov.br/seminario/index.php/DocumentacaoTecnologiasDRBDOCFS2 I have discovered in my investigations that ocfs2 is more unstable that GFS. I have a several kernel panics with ocfs2 under high loads on the machine, but no one with GFS. I have the instalation of GFS documented at, one performance test i have done some time ago: http://guialivre.governoeletronico.gov.br/mediawiki/index.php/TestesGFS (here i use clvm, and gnbd) The problem of drbd is that actualy you can use just two machines, if you want to use more you need to use the commercial version drbd+. Best Regards Leonardo Rodrigues de Mello -----Mensagem original----- De: linux-cluster-bounces at redhat.com em nome de Lon Hohberger Enviada: seg 7/8/2006 11:27 Para: linux clustering Cc: Assunto: Re: [Linux-cluster] DRBD in Active-active mode On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: > Hi everybody, > > I am still looking on information or documents regarding DRBD in > active-active mode. > > Does anyone could help me ? > > Thanks a lot. Fairly certain this is not possible, unless something has changed recently. That is, you cannot use DRBD as a distributed concurrently writable mirror; only one node can be the master of a DRBD device at a time. You can do this with GNBD + Cluster Mirroring, though. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... 
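To make the active-active DRBD 0.8 setup described above more concrete, here is a minimal sketch of a resource definition (host names, backing devices and addresses are placeholders; the GFS/cluster configuration still has to be layered on top of /dev/drbd0 as with any other shared block device):

  resource r0 {
    protocol C;
    net {
      allow-two-primaries;   # DRBD 0.8+ only
    }
    on node1 {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   192.168.0.1:7788;
      meta-disk internal;
    }
    on node2 {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   192.168.0.2:7788;
      meta-disk internal;
    }
  }

After the initial sync, both nodes can be promoted with "drbdadm primary r0" and the GFS filesystem created on /dev/drbd0.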
Name: winmail.dat Type: application/ms-tnef Size: 3708 bytes Desc: not available URL: From Leonardo.Mello at planejamento.gov.br Mon Aug 7 19:14:19 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Mon, 7 Aug 2006 16:14:19 -0300 Subject: RES: [Linux-cluster] DRBD in Active-active mode Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B6F@corp-bsa-mp01.planejamento.gov.br> sorry for the typos and english errors in the last message. I believe need more coffee. Leonardo Rodrigues de Mello -----Mensagem original----- De: Leonardo Rodrigues de Mello em nome de Leonardo Rodrigues de Mello Enviada: seg 7/8/2006 16:09 Para: linux clustering Cc: Assunto: RES: [Linux-cluster] DRBD in Active-active mode Hi everyone, Lon you are right for the stable version of DRBD the version 0.7. But DRBD actualy has support for active-active setup in the development version 0.8. There is significant changes between this versions, the entire roadmap can be read at: http://svn.drbd.org/drbd/trunk/ROADMAP I have done some investigations and tests with DRBD 0.8 in active-active setup with two nodes and OCFS2 and with GFS. This was for one project i was doing related to oracle Rac 10g. I have produced one documentation in portuguese that shows how to setup and use drbd in active-active with ocfs2. the link is: http://guialivre.governoeletronico.gov.br/seminario/index.php/DocumentacaoTecnologiasDRBDOCFS2 I have discovered in my investigations that ocfs2 is more unstable that GFS. I have a several kernel panics with ocfs2 under high loads on the machine, but no one with GFS. I have the instalation of GFS documented at, one performance test i have done some time ago: http://guialivre.governoeletronico.gov.br/mediawiki/index.php/TestesGFS (here i use clvm, and gnbd) The problem of drbd is that actualy you can use just two machines, if you want to use more you need to use the commercial version drbd+. Best Regards Leonardo Rodrigues de Mello -----Mensagem original----- De: linux-cluster-bounces at redhat.com em nome de Lon Hohberger Enviada: seg 7/8/2006 11:27 Para: linux clustering Cc: Assunto: Re: [Linux-cluster] DRBD in Active-active mode On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: > Hi everybody, > > I am still looking on information or documents regarding DRBD in > active-active mode. > > Does anyone could help me ? > > Thanks a lot. Fairly certain this is not possible, unless something has changed recently. That is, you cannot use DRBD as a distributed concurrently writable mirror; only one node can be the master of a DRBD device at a time. You can do this with GNBD + Cluster Mirroring, though. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3932 bytes Desc: not available URL: From haiwu.us at gmail.com Mon Aug 7 20:21:39 2006 From: haiwu.us at gmail.com (hai wu) Date: Mon, 7 Aug 2006 15:21:39 -0500 Subject: [Linux-cluster] 2-node cluster and fence_drac Message-ID: Hi, For a 2-node cluster (RHEL4), does it require the use of power switch or fence_drac would be good enough for the setup? Would fence_drac work properly in a 2-node cluster? Thanks, Hai -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From brad at seatab.com Mon Aug 7 22:07:29 2006 From: brad at seatab.com (Brad Dameron) Date: Mon, 07 Aug 2006 15:07:29 -0700 Subject: [Linux-cluster] GFS 6.1 kernel warning. Message-ID: <1154988449.19157.20.camel@serpent.office.seatab.com> We have two rather large server's running GFS in a production environment and have been getting errors since the start. First our configuration. 2 - Quad 880 Opteron Servers with 64GB RAM 1 - Infortrend 2GB SAN OS - SuSe 10.0 Professional (Kernel 2.6.13-15.8-smp x86_64) Cluster network is on GigE connection. This link is shared and used for other purposes but not much traffic. Here is the error message: Aug 7 14:40:45 CServer01 kernel: GFS: fsid=Cluster01:gfs1.1: warning: assertion "gfs_glock_is_locked_by_me(ip->i_gl)" failed Aug 7 14:40:45 CServer01 kernel: GFS: fsid=Cluster01:gfs1.1: function = gfs_readpage Aug 7 14:40:45 CServer01 kernel: GFS: fsid=Cluster01:gfs1.1: file = /usr/src/gfs/src/cluster-1.02.00/gfs-kernel/src/gfs/ops_address.c, line = 283 Aug 7 14:40:45 CServer01 kernel: GFS: fsid=Cluster01:gfs1.1: time = 1154986845 This appears to occur when both machines try to access the same files/directory. They happen at a rate of about 10-15 an hour. Anyone know if this is critical or a way to turn these off if they are not an issue? There is definitely a big performance issue when using GFS on very CPU intense applications. When the first server is using all 8 CPU core's doing processing the second server's IO response slows to a crawl. Any sysctl tweaks to help improve the performance appreciated. Thanks, Brad Dameron SeaTab Software www.seatab.com From Alain.Moulle at bull.net Tue Aug 8 07:53:27 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 08 Aug 2006 09:53:27 +0200 Subject: [Linux-cluster] CS4 Update 4 / two questions Message-ID: <44D842F7.7080805@bull.net> Hi 1/ About the return of quorum disk functionnality : is it mandatory to configure it, or is it possible to run the CS4 U4 without it in a first step ? (this question only to know how to manage eventual update from U2 (currently in production without quorum disk configured) to U4 ) 2/ is there a beta documentation on CS4 U4 download-able somewhere ? Thanks Alain From pcaulfie at redhat.com Tue Aug 8 08:04:37 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 08 Aug 2006 09:04:37 +0100 Subject: [Linux-cluster] CS4 Update 4 / two questions In-Reply-To: <44D842F7.7080805@bull.net> References: <44D842F7.7080805@bull.net> Message-ID: <44D84595.3000304@redhat.com> Alain Moulle wrote: > Hi > > 1/ About the return of quorum disk functionnality : is it mandatory > to configure it, or is it possible to run the CS4 U4 without it in > a first step ? > > (this question only to know how to manage eventual update from U2 (currently in > production without quorum disk configured) to U4 ) > Quorum disk is completely optional, even in a two-node system. 
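For reference, a two-node cluster runs without a quorum disk by using CMAN's special two-node mode. A minimal sketch of the relevant cluster.conf fragment (node names and fencing are placeholders, not from this thread):

  <cluster name="mycluster" config_version="1">
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="node1" votes="1">
        <fence/>
      </clusternode>
      <clusternode name="node2" votes="1">
        <fence/>
      </clusternode>
    </clusternodes>
    <fencedevices/>
    <rm/>
  </cluster>

With two_node="1" and expected_votes="1", either node alone keeps quorum; a <quorumd> section only needs to be added later if and when the quorum disk feature is wanted.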
-- patrick From Alain.Moulle at bull.net Tue Aug 8 13:08:38 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 08 Aug 2006 15:08:38 +0200 Subject: [Linux-cluster] CS4 Update 4/ about __NR_gettid and syscall Message-ID: <44D88CD6.7030800@bull.net> Hi In CS4 Update 4 , there are several places where a syscall call is dependant on NR_gettid set or not , for example in qdisk/gettid.c : #include #include #include #include /* Patch from Adam Conrad / Ubuntu: Don't use _syscall macro */ #ifdef __NR_gettid pid_t gettid (void) { return syscall(__NR_gettid); } #else #warn "gettid not available -- substituting with pthread_self()" #include pid_t gettid (void) { return (pid_t)pthread_self(); } #endif and also in : magma-plugins-1.0.9/gulm/gulm.c rgmanager-1.9.52/src/clulib/gettid And in fact, I have compilation error if the syscall is choosen by the ifdef , so I wonder what to do about that , what does __NR_gettid means ,etc. Any piece of advise ? Thanks Alain From Leonardo.Mello at planejamento.gov.br Wed Aug 9 15:04:58 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Wed, 9 Aug 2006 12:04:58 -0300 Subject: [Linux-cluster] cs-deploy-gfs Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B78@corp-bsa-mp01.planejamento.gov.br> Hi everyone, Does anyone know what happened with the development of cs-deploy-gfs ? The version in cvs is the lastest version ? There is any improvements since the initial version ? This software was abandoned ? I don't have much time but I want help in the development of it, do the things in TODO, and others like porting it to systems debian-like or better, port it to use smartpm (http://labix.org/smart), help with the internacionalization, and translation to portuguese brazil. Best Regards Leonardo Rodrigues de Mello -------------- next part -------------- An HTML attachment was scrubbed... URL: From Leonardo.Mello at planejamento.gov.br Wed Aug 9 15:16:16 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Wed, 9 Aug 2006 12:16:16 -0300 Subject: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B79@corp-bsa-mp01.planejamento.gov.br> sorry, the application name is cs-deploy-tool, not cs-deploy-gfs. Best Regards Leonardo Rodrigues de Mello -----Mensagem original----- De: linux-cluster-bounces at redhat.com em nome de Leonardo Rodrigues de Mello Enviada: qua 9/8/2006 12:04 Para: linux-cluster at redhat.com Cc: Assunto: [Linux-cluster] cs-deploy-gfs Hi everyone, Does anyone know what happened with the development of cs-deploy-gfs ? The version in cvs is the lastest version ? There is any improvements since the initial version ? This software was abandoned ? I don't have much time but I want help in the development of it, do the things in TODO, and others like porting it to systems debian-like or better, port it to use smartpm (http://labix.org/smart), help with the internacionalization, and translation to portuguese brazil. Best Regards Leonardo Rodrigues de Mello -------------- next part -------------- A non-text attachment was scrubbed... 
Name: winmail.dat Type: application/ms-tnef Size: 3004 bytes Desc: not available URL: From jparsons at redhat.com Wed Aug 9 15:16:41 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 09 Aug 2006 11:16:41 -0400 Subject: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B79@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B79@corp-bsa-mp01.planejamento.gov.br> Message-ID: <44D9FC59.5050408@redhat.com> Leonardo Rodrigues de Mello wrote: >sorry, >the application name is cs-deploy-tool, not cs-deploy-gfs. > >Best Regards >Leonardo Rodrigues de Mello > > >-----Mensagem original----- >De: linux-cluster-bounces at redhat.com em nome de Leonardo Rodrigues de Mello >Enviada: qua 9/8/2006 12:04 >Para: linux-cluster at redhat.com >Cc: >Assunto: [Linux-cluster] cs-deploy-gfs > >Hi everyone, > >Does anyone know what happened with the development of cs-deploy-gfs ? > >The version in cvs is the lastest version ? > >There is any improvements since the initial version ? > >This software was abandoned ? > > The functionality available in cs-deploy-tool will be available in a new management interface for clusters and storage called Conga, targetted for RHEL5 and (hopefully) RHEL4.5 -J From stephen.willey at framestore-cfc.com Wed Aug 9 15:33:29 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Wed, 09 Aug 2006 16:33:29 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44D7625E.9090305@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <44CF94FE.3070407@redhat.com> <44D72D20.2060705@yahoo.fr> <44D7625E.9090305@redhat.com> Message-ID: <44DA0049.8030505@framestore-cfc.com> So ya know... Once we'd added a 137Gb swap drive, it took 48 hours to run all stages of the gfs_fsck on a 42Tb filesystem without any -v options That was on a dual Opteron 275 (4Gb RAM) with 4Gb FC to 6 SATA RAIDs in CLVM. -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From lhh at redhat.com Wed Aug 9 15:34:39 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:34:39 -0400 Subject: [Linux-cluster] CS4 Update 4 / two questions In-Reply-To: <44D842F7.7080805@bull.net> References: <44D842F7.7080805@bull.net> Message-ID: <1155137679.21204.144.camel@ayanami.boston.redhat.com> On Tue, 2006-08-08 at 09:53 +0200, Alain Moulle wrote: > Hi > > 1/ About the return of quorum disk functionnality : is it mandatory > to configure it, Not required in the least. -- Lon From lhh at redhat.com Wed Aug 9 15:35:16 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:35:16 -0400 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <44D7573C.5020209@bull.net> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> <44D707FD.6080602@bull.net> <1154961041.21204.40.camel@ayanami.boston.redhat.com> <44D7573C.5020209@bull.net> Message-ID: <1155137716.21204.146.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 17:07 +0200, Alain Moulle wrote: > Build ok. Thanks. > Could you explain exactly the benefit of this watchdog work ? > Thanks > Alain If rgmanager crashes, the node gets rebooted. 
-- Lon From lhh at redhat.com Wed Aug 9 15:37:07 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:37:07 -0400 Subject: RES: [Linux-cluster] DRBD in Active-active mode In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B6E@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B6E@corp-bsa-mp01.planejamento.gov.br> Message-ID: <1155137827.21204.149.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 16:09 -0300, Leonardo Rodrigues de Mello wrote: > Hi everyone, > > Lon you are right for the stable version of DRBD the version 0.7. But DRBD actualy has support for active-active setup in the development version 0.8. There is significant changes between this versions, the entire roadmap can be read at: > http://svn.drbd.org/drbd/trunk/ROADMAP Awesome. :) > > I have done some investigations and tests with DRBD 0.8 in active-active setup with two nodes and OCFS2 and with GFS. This was for one project i was doing related to oracle Rac 10g. > > I have produced one documentation in portuguese that shows how to setup and use drbd in active-active with ocfs2. the link is: > http://guialivre.governoeletronico.gov.br/seminario/index.php/DocumentacaoTecnologiasDRBDOCFS2 > > I have discovered in my investigations that ocfs2 is more unstable that GFS. I have a several kernel panics with ocfs2 under high loads on the machine, but no one with GFS. > > I have the instalation of GFS documented at, one performance test i have done some time ago: > http://guialivre.governoeletronico.gov.br/mediawiki/index.php/TestesGFS > (here i use clvm, and gnbd) > > > The problem of drbd is that actualy you can use just two machines, if you want to use more you need to use the commercial version drbd+. Wow, great information. Thanks! -- Lon From lhh at redhat.com Wed Aug 9 15:46:12 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:46:12 -0400 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: References: Message-ID: <1155138372.21204.158.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 15:21 -0500, hai wu wrote: > Hi, > For a 2-node cluster (RHEL4), does it require the use of power switch > or fence_drac would be good enough for the setup? Would fence_drac > work properly in a 2-node cluster? > Thanks, > Hai fence_drac would be fine, but you need to understand that with DRAC (or any integrated power management which receives power from the machine) that if host power is completely lost, fencing will fail - causing the cluster to stop. This failure is indistinguishable from DRAC + host losing network at the same time (ex: the ethernet switch fails). Generally, these machines have redundant power, so losing power all at once is less likely. So, DRAC is fine, but there are failure cases where it is less than optimal, particularly in machines without redundant power supplies. -- Lon From lhh at redhat.com Wed Aug 9 15:46:38 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:46:38 -0400 Subject: [Linux-cluster] cs-deploy-gfs In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B78@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B78@corp-bsa-mp01.planejamento.gov.br> Message-ID: <1155138398.21204.160.camel@ayanami.boston.redhat.com> On Wed, 2006-08-09 at 12:04 -0300, Leonardo Rodrigues de Mello wrote: > Hi everyone, > > Does anyone know what happened with the development of cs-deploy-gfs ? I think that it's being replaced with Conga. 
-- Lon From Leonardo.Mello at planejamento.gov.br Wed Aug 9 16:54:03 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Wed, 9 Aug 2006 13:54:03 -0300 Subject: RES: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B7A@corp-bsa-mp01.planejamento.gov.br> Thanks for the anwsers :-D But I believe beside the fact that conga and cs-deploy-tool share some things in common. cs-deploy-tool has it place for simple and no pain instalation of cluster suite with basic services in one network environment. The point that count for me is that i can get anywhere with my laptop and just with the knowledge of IP numbers and root passwords setup one cluster in 10 minutes or less. To use conga i need go configure and install zope in one machine, install the agents in the servers that will be in the cluster, configure the zope to see the agents, its more complicated and demands more work for the simple task of cluster instalation and basic initial configuration. Conga is one great initiative and complex initiative for managing, deploy, administration, and others thinks for production cluster enviroments. if i need to choose one tool to just deploy cluster suite. i will choose cs-deploy-tool. If i need to manage, and be the administrator of a cluster, of course i will need the power of conga. :-D this long message is just to ask: Can I implement the changes i had proposed in the first message ? if, yes to who i will send they ? best regards Leonardo Rodrigues de Mello -----Mensagem original----- De: linux-cluster-bounces at redhat.com em nome de James Parsons Enviada: qua 9/8/2006 12:16 Para: linux clustering Cc: Assunto: Re: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) Leonardo Rodrigues de Mello wrote: >sorry, >the application name is cs-deploy-tool, not cs-deploy-gfs. > >Best Regards >Leonardo Rodrigues de Mello > > >-----Mensagem original----- >De: linux-cluster-bounces at redhat.com em nome de Leonardo Rodrigues de Mello >Enviada: qua 9/8/2006 12:04 >Para: linux-cluster at redhat.com >Cc: >Assunto: [Linux-cluster] cs-deploy-gfs > >Hi everyone, > >Does anyone know what happened with the development of cs-deploy-gfs ? > >The version in cvs is the lastest version ? > >There is any improvements since the initial version ? > >This software was abandoned ? > > The functionality available in cs-deploy-tool will be available in a new management interface for clusters and storage called Conga, targetted for RHEL5 and (hopefully) RHEL4.5 -J -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3761 bytes Desc: not available URL: From jparsons at redhat.com Wed Aug 9 18:17:44 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 09 Aug 2006 14:17:44 -0400 Subject: RES: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B7A@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B7A@corp-bsa-mp01.planejamento.gov.br> Message-ID: <44DA26C8.6020701@redhat.com> Leonardo Rodrigues de Mello wrote: >Thanks for the anwsers :-D > >But I believe beside the fact that conga and cs-deploy-tool share some things in common. cs-deploy-tool has it place for simple and no pain instalation of cluster suite with basic services in one network environment. 
> >The point that count for me is that i can get anywhere with my laptop and just with the knowledge of IP numbers and root passwords setup one cluster in 10 minutes or less. > >To use conga i need go configure and install zope in one machine, install the agents in the servers that will be in the cluster, configure the zope to see the agents, its more complicated and demands more work for the simple task of cluster instalation and basic initial configuration. > HOLD IT HOLD IT! I have to make an urgent correction to your response above :) Conga requires absolutely NO configuration of zope. In fact, zope is so far beneath the sheets that you will be able to install your OWN default instance of zope, and there will be no interaction with Conga. After installing the Conga server component, the admin enters the IP addresses of the machines/cluster nodes to be managed just like you do with cs-deploy-tool. There is no other configuration work necessary. It is true that you will need the agent installed on the machines that you wish to manage. cs-deploy-tool does not use an agent, but rather logs in through an ssh session with the user-provided root password in order to get things set up. cs-deploy-tool is not going away, and your patches are welcome. You can send them to me and I will forward them to Stan Kupcevic who wrote and maintains cs-deploy-tool. Thanks for your involvement, Leonardo. -J > > > >Conga is one great initiative and complex initiative for managing, deploy, administration, and others thinks for production cluster enviroments. if i need to choose one tool to just deploy cluster suite. i will choose cs-deploy-tool. > >If i need to manage, and be the administrator of a cluster, of course i will need the power of conga. :-D > >this long message is just to ask: Can I implement the changes i had proposed in the first message ? if, yes to who i will send they ? > > >best regards >Leonardo Rodrigues de Mello >-----Mensagem original----- >De: linux-cluster-bounces at redhat.com em nome de James Parsons >Enviada: qua 9/8/2006 12:16 >Para: linux clustering >Cc: >Assunto: Re: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) > >Leonardo Rodrigues de Mello wrote: > > > >>sorry, >>the application name is cs-deploy-tool, not cs-deploy-gfs. >> >>Best Regards >>Leonardo Rodrigues de Mello >> >> >>-----Mensagem original----- >>De: linux-cluster-bounces at redhat.com em nome de Leonardo Rodrigues de Mello >>Enviada: qua 9/8/2006 12:04 >>Para: linux-cluster at redhat.com >>Cc: >>Assunto: [Linux-cluster] cs-deploy-gfs >> >>Hi everyone, >> >>Does anyone know what happened with the development of cs-deploy-gfs ? >> >>The version in cvs is the lastest version ? >> >>There is any improvements since the initial version ? >> >>This software was abandoned ? 
>> >> >> >> >The functionality available in cs-deploy-tool will be available in a new >management interface for clusters and storage called Conga, targetted >for RHEL5 and (hopefully) RHEL4.5 > >-J > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > From haiwu.us at gmail.com Wed Aug 9 18:44:39 2006 From: haiwu.us at gmail.com (hai wu) Date: Wed, 9 Aug 2006 13:44:39 -0500 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: <1155138372.21204.158.camel@ayanami.boston.redhat.com> References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> Message-ID: Thanks Lon. We got redundant power here. How can I test this fence_drac? How to simulate a failure on one node and know for sure that it does kick in and restarts the failed node in the cluster? Thanks, Hai On 8/9/06, Lon Hohberger wrote: > > On Mon, 2006-08-07 at 15:21 -0500, hai wu wrote: > > Hi, > > For a 2-node cluster (RHEL4), does it require the use of power switch > > or fence_drac would be good enough for the setup? Would fence_drac > > work properly in a 2-node cluster? > > Thanks, > > Hai > > fence_drac would be fine, but you need to understand that with DRAC (or > any integrated power management which receives power from the machine) > that if host power is completely lost, fencing will fail - causing the > cluster to stop. > > This failure is indistinguishable from DRAC + host losing network at the > same time (ex: the ethernet switch fails). > > Generally, these machines have redundant power, so losing power all at > once is less likely. > > So, DRAC is fine, but there are failure cases where it is less than > optimal, particularly in machines without redundant power supplies. > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Wed Aug 9 19:05:32 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 15:05:32 -0400 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> Message-ID: <1155150332.21204.202.camel@ayanami.boston.redhat.com> On Wed, 2006-08-09 at 13:44 -0500, hai wu wrote: > Thanks Lon. We got redundant power here. > > How can I test this fence_drac? How to simulate a failure on one node > and know for sure that it does kick in and restarts the failed node in > the cluster? After both nodes join the cluster, try doing 'reboot -fn' on the node. Oh, also, you should be booting with acpi=off when using integrated power management. -- Lon From haiwu.us at gmail.com Wed Aug 9 20:39:50 2006 From: haiwu.us at gmail.com (hai wu) Date: Wed, 9 Aug 2006 15:39:50 -0500 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: <1155150332.21204.202.camel@ayanami.boston.redhat.com> References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> <1155150332.21204.202.camel@ayanami.boston.redhat.com> Message-ID: I got the following errors after "reboot -fn" on erd-tt-eproof1, which script do I need to change? 
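The log below shows what the agent reported, but before editing any script it is usually worth driving the fence agent by hand, outside the cluster, to prove the DRAC is reachable and the credentials work. This is only a sketch: it assumes fence_drac takes the usual -a/-l/-p/-o flags (check fence_drac -h for the real option names), and the address, credentials and grub entry below are placeholders.

    # Fence the peer manually; if this cannot log in or read the power
    # state, automatic fencing will fail the same way.
    fence_drac -a 10.0.0.12 -l root -p calvin -o reboot

    # Boot the nodes with ACPI disabled so a fence power-off is immediate
    # rather than a graceful shutdown request: append acpi=off to the
    # kernel line in /boot/grub/grub.conf, e.g.
    #   kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/VolGroup00/LogVol00 acpi=off

    # Then crash one node and watch /var/log/messages on the survivor for
    # the "fencing node ..." entries:
    reboot -fn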
Aug 9 15:35:40 erd-tt-eproof2 kernel: CMAN: removing node erd-tt-eproof1 from t he cluster : Missed too many heartbeats Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: erd-tt-eproof1 not a cluster member after 0 sec post_fail_delay Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: fencing node "erd-tt-eproof1" Aug 9 15:35:42 erd-tt-eproof2 fenced[3437]: agent "fence_drac" reports: WARNING : unable to detect DRAC version ' Dell Embedded Remote Access Controller (ERA) F irmware Version 3.31 (Build 07.15) ' WARNING: unsupported DRAC version '__unknow n__' failed: unable to determine power state This is DRAC on Dell PE2650. Thanks, Hai On 8/9/06, Lon Hohberger wrote: > > On Wed, 2006-08-09 at 13:44 -0500, hai wu wrote: > > Thanks Lon. We got redundant power here. > > > > How can I test this fence_drac? How to simulate a failure on one node > > and know for sure that it does kick in and restarts the failed node in > > the cluster? > > After both nodes join the cluster, try doing 'reboot -fn' on the node. > > Oh, also, you should be booting with acpi=off when using integrated > power management. > > -- Lon > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jparsons at redhat.com Wed Aug 9 20:54:36 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 09 Aug 2006 16:54:36 -0400 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> <1155150332.21204.202.camel@ayanami.boston.redhat.com> Message-ID: <44DA4B8C.1050602@redhat.com> hai wu wrote: > I got the following errors after "reboot -fn" on erd-tt-eproof1, which > script do I need to change? > > Aug 9 15:35:40 erd-tt-eproof2 kernel: CMAN: removing node > erd-tt-eproof1 from t > he cluster : Missed too many heartbeats > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: erd-tt-eproof1 not a > cluster member > after 0 sec post_fail_delay > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: fencing node "erd-tt-eproof1" > Aug 9 15:35:42 erd-tt-eproof2 fenced[3437]: agent "fence_drac" > reports: WARNING > : unable to detect DRAC version ' Dell Embedded Remote Access > Controller (ERA) F > irmware Version 3.31 (Build 07.15) ' WARNING: unsupported DRAC version > '__unknow > n__' failed: unable to determine power state > > This is DRAC on Dell PE2650. > Thanks, > Hai Do you know what DRAC version you are using? Can you please telnet into the drac port and find out what it says when it starts your session? Thanks, -J From haiwu.us at gmail.com Wed Aug 9 21:04:40 2006 From: haiwu.us at gmail.com (hai wu) Date: Wed, 9 Aug 2006 16:04:40 -0500 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: <44DA4B8C.1050602@redhat.com> References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> <1155150332.21204.202.camel@ayanami.boston.redhat.com> <44DA4B8C.1050602@redhat.com> Message-ID: I got the following prompts after telneting to the drac port, maybe a simple upgrade for the firmware would fix this issue: Dell Embedded Remote Access Controller (ERA) Firmware Version 3.31 (Build 07.15) Login: Thanks, Hai On 8/9/06, James Parsons wrote: > > hai wu wrote: > > > I got the following errors after "reboot -fn" on erd-tt-eproof1, which > > script do I need to change? 
> > > > Aug 9 15:35:40 erd-tt-eproof2 kernel: CMAN: removing node > > erd-tt-eproof1 from t > > he cluster : Missed too many heartbeats > > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: erd-tt-eproof1 not a > > cluster member > > after 0 sec post_fail_delay > > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: fencing node > "erd-tt-eproof1" > > Aug 9 15:35:42 erd-tt-eproof2 fenced[3437]: agent "fence_drac" > > reports: WARNING > > : unable to detect DRAC version ' Dell Embedded Remote Access > > Controller (ERA) F > > irmware Version 3.31 (Build 07.15) ' WARNING: unsupported DRAC version > > '__unknow > > n__' failed: unable to determine power state > > > > This is DRAC on Dell PE2650. > > Thanks, > > Hai > > Do you know what DRAC version you are using? Can you please telnet into > the drac port and find out what it says when it starts your session? > > Thanks, > > -J > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jparsons at redhat.com Wed Aug 9 22:47:45 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 09 Aug 2006 18:47:45 -0400 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> <1155150332.21204.202.camel@ayanami.boston.redhat.com> <44DA4B8C.1050602@redhat.com> Message-ID: <44DA6611.6080602@redhat.com> hai wu wrote: > I got the following prompts after telneting to the drac port, maybe a > simple upgrade for the firmware would fix this issue: > > Dell Embedded Remote Access Controller (ERA) > Firmware Version 3.31 (Build 07.15) > Login: > > Thanks, > Hai Oops. Sorry. Unsupported version. If you want, you could hack the agent script (it is in perl) and get it to accept that version and just *see* if it works -- it might. I tried looking for documentation for that firmware rev and couldn't google any. If you know of some, drop me a line and maybe we can get something working - or at least know if it *will ever* work. :) -J BTW, the agent supports DRAC III/XT, DRAC MC, and DRAC 4/I. > > On 8/9/06, *James Parsons* > wrote: > > hai wu wrote: > > > I got the following errors after "reboot -fn" on erd-tt-eproof1, > which > > script do I need to change? > > > > Aug 9 15:35:40 erd-tt-eproof2 kernel: CMAN: removing node > > erd-tt-eproof1 from t > > he cluster : Missed too many heartbeats > > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: erd-tt-eproof1 not a > > cluster member > > after 0 sec post_fail_delay > > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: fencing node > "erd-tt-eproof1" > > Aug 9 15:35:42 erd-tt-eproof2 fenced[3437]: agent "fence_drac" > > reports: WARNING > > : unable to detect DRAC version ' Dell Embedded Remote Access > > Controller (ERA) F > > irmware Version 3.31 (Build 07.15) ' WARNING: unsupported DRAC > version > > '__unknow > > n__' failed: unable to determine power state > > > > This is DRAC on Dell PE2650. > > Thanks, > > Hai > > Do you know what DRAC version you are using? Can you please telnet > into > the drac port and find out what it says when it starts your session? 
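If you do take the route of hacking the agent, the first step is simply finding where it matches the login banner so the 3.31 firmware string can be added to the accepted patterns. The commands below are a sketch only -- the agent's location and the exact string it searches for are assumptions, so adjust them to your installation before relying on them.

    # Confirm which package owns the agent and where it lives.
    rpm -qf `which fence_drac`

    # Locate the banner-matching code inside the Perl script.
    grep -n "Firmware Version" `which fence_drac`

    # After editing the script to accept the 3.31 banner, prove the change
    # by fencing the peer manually (address and credentials are placeholders).
    fence_drac -a 10.0.0.12 -l root -p calvin -o reboot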
> > Thanks, > > -J > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > From joni at philox.eu Thu Aug 10 07:04:31 2006 From: joni at philox.eu (Jonathan Salomon) Date: Thu, 10 Aug 2006 09:04:31 +0200 Subject: [Linux-cluster] patch 2.6 kernel without modules Message-ID: <44DADA7F.8040505@philox.eu> Hi all, I want to use GFS for a webcluster with shared data through a iSCSI SAN. The cluster nodes are diskless and boot through PXE by downloading a kernel and rootfs that is stored in RAM. I have built a custom minimal Linux system with LFS (http://linuxfromscratch.org) to keep the image as small as possible (the smallest Fedora/RedHat I could get by stripping RPMs was still 650MB). I would like to work without kernel modules and therefore I would like to know whether it is possible to patch the 2.6 kernel to include GFS 'statically' (i.e. no kernel modules). As far as I cann tell the cluster-1.02.00 package I downloaded builds kernel modules. In addition I would like to know what minimal requirements I need to use GFS. The load balancing itself will be done on other machines with a different setup. Hence I would like to refrain from installing any of that functionality on the cluster nodes. From reading the docs I get the impression GFS needs a whole lot of clustering packages. Thanks! Jonathan From sbhagat at redhat.com Thu Aug 10 08:11:21 2006 From: sbhagat at redhat.com (Subodh Bhagat) Date: Thu, 10 Aug 2006 13:41:21 +0530 Subject: [Linux-cluster] Red Hat Cluster Service and Informix with 1.5 GB memory allocation Message-ID: <44DAEA29.1060903@redhat.com> Dear all, This issue is with one of our major customers, IBM Global Services. They are implementing a 3 node cluster and configuring Informix database for failover. The specifications of the three nodes are as follows: ADBM01 2.4.21-40.ELsmp i686 AS release 3 (Taroon Update 8) clumanager-1.2.31-1-i386 ADBM02 2.4.21-40.ELsmp i686 AS release 3 (Taroon Update 8) clumanager-1.2.31-1-i386 ADBM03 2.4.21-40.ELhugemem i686 AS release 3 (Taroon Update 8) clumanager-1.2.26.1-1-i386 Informix version: IBM Informix Dynamic Server 10.00.UC4 On Linux Intel Informix runs with over 1.8GB MEM allocated to it on the server when the clustering agents are turned off. Also it works with Mem allocation of less that 1.5 GB in cluster environment. But when in cluster environment, the node is rebooted if >=1.5 GB is allocated. At Informix end, the SHMBASE parameter would help only if there was a memory allocation issue between Linux and Informix. But as Informix runs with over 1.8GB MEM allocated to it on the server when the clustering agents are turned off, altering SHMBASE may not help resolving this issue. The issue most definitely be between the Red Hat Cluster Service and Informix with a high mem allocation. * We have suggested the customer to setup all the nodes in cluster as identical with respect to the kernel version and cluster suite versions and OS versions. * Any idea, what else can be done here? Please suggest. -- Regards, Subodh Bhagat, Technical Engineer, Red Hat India Pvt. Ltd. 
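One routine check when a service that allocates a large shared-memory segment behaves differently under the cluster than outside it -- and it is only an assumption that this is related to the reboots described above -- is that the kernel shared-memory limits are identical on every node the service can fail over to. A minimal sketch, with placeholder values:

    # Compare these on all three nodes; they should match if the nodes are
    # meant to be interchangeable.
    sysctl kernel.shmmax kernel.shmall

    # Example of raising shmmax to 2 GB and making it persistent (size the
    # value to the Informix segment, not to this placeholder).
    sysctl -w kernel.shmmax=2147483648
    echo "kernel.shmmax = 2147483648" >> /etc/sysctl.conf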
1st Floor, 'C' Wing, Fortune 2000,
Bandra Kurla Complex, Bandra (East),
Mumbai 400051
----------------------------
Mobile: +91-9323968930
Technical Support: +91-9322952612
Tel: +91-22-39878888 (Board Line)
Fax: +91-22-39878899
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at sparkyone.com  Thu Aug 10 11:35:13 2006
From: mark at sparkyone.com (Mark Reynolds)
Date: Thu, 10 Aug 2006 12:35:13 +0100 (BST)
Subject: [Linux-cluster] clurgmgrd stops service without reason
Message-ID: <35159.82.70.162.86.1155209713.squirrel@www.easilymail.co.uk>

Hi,

Have you been able to resolve this issue? I have the exact same symptoms on a
RedHat cluster (rgmanager version 1.9.46). I receive a message "stopping
service fileserver" and the node shuts down and ends up rebooting as it can't
unmount a partition. What worries me is that this has happened 3 times in 2
weeks with no obvious reason, as the server is working fine up until that
point.

The relevant section of my cluster.conf is