From haprapp at gmail.com  Fri Jun  1 02:48:36 2007
From: haprapp at gmail.com (hari hari)
Date: Fri, 1 Jun 2007 08:18:36 +0530
Subject: [Linux-cluster] linix cluster
Message-ID: <bcac39f70705311948u788f60e8x843d59cc6d5bd8b@mail.gmail.com>

I want how configure linux cluster details

-- 

Thanks and Regards

A.HARI PRAKASH
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/9c7f97ac/attachment.htm>

From mij at irwan.name  Fri Jun  1 02:54:07 2007
From: mij at irwan.name (Mohd Irwan Jamaluddin)
Date: Fri, 1 Jun 2007 10:54:07 +0800
Subject: [Linux-cluster] linix cluster
In-Reply-To: <bcac39f70705311948u788f60e8x843d59cc6d5bd8b@mail.gmail.com>
References: <bcac39f70705311948u788f60e8x843d59cc6d5bd8b@mail.gmail.com>
Message-ID: <b3d998b00705311954v600c9b60xf0ff4982fb80b23f@mail.gmail.com>

On 6/1/07, hari hari <haprapp at gmail.com> wrote:
> I want how configure linux cluster details
>

Simple question with simple answer :)
http://www.redhat.com/docs/manuals/enterprise/

Ok seriously, what kind of cluster you want to configure? Is it
high-availability, network load balancing, HPC etc? Which RHEL version
do you use?

-- 
Regards,
Mohd Irwan Jamaluddin
Web: http://www.irwan.name/
Blog: http://blog.irwan.name/



From grimme at atix.de  Fri Jun  1 05:38:31 2007
From: grimme at atix.de (Marc Grimme)
Date: Fri, 1 Jun 2007 07:38:31 +0200
Subject: [Linux-cluster] Active/passive binary files
In-Reply-To: <20070531195655.GO4041@redhat.com>
References: <dc53768e0705240722l751f7fa5l19c540a38a3dd46b@mail.gmail.com>
	<20070531195655.GO4041@redhat.com>
Message-ID: <200706010738.32962.grimme@atix.de>

On Thursday 31 May 2007 21:56:55 Lon Hohberger wrote:
> On Thu, May 24, 2007 at 08:22:33AM -0600, Rodolfo Estrada wrote:
> > Hi!
> >
> > I am transferring a TRU64 cluster with oracle 10gR2 as an active/passive
> > (no RAC) service to a RHEL5 cluster using GFS. The oracle binaries and
> > data files are shared by the nodes in the TRU64 cluster. Can I use the
> > same approach in the Linux cluster using GFS? or the binaries are require
> > to be installed separately on each node?
>
You might also want to take a look at www.open-sharedroot.org. There you can 
build a diskless sharedroot cluster much like the TRUCluster and as a 
consequence have a sharedroot like with TRU64.
Regards Marc.
> You can do either.  You can share the binaries on GFS or install them
> per-node.
>
> -- Lon
>
> --
> Lon Hohberger - Software Engineer - Red Hat, Inc.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From rhurst at bidmc.harvard.edu  Fri Jun  1 11:39:51 2007
From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu)
Date: Fri, 1 Jun 2007 07:39:51 -0400
Subject: [Linux-cluster] Cluster services stopping
In-Reply-To: <20070531200339.GQ4041@redhat.com>
References: <1180615948.5475.71.camel@jarjar.trnswrks.com>
	<20070531200339.GQ4041@redhat.com>
Message-ID: <1180697991.3326.9.camel@WSBID06223>

Any chance that extended attributes, like log_level, be configurable
within luci?  I can not find any references to such things.


On Thu, 2007-05-31 at 16:03 -0400, Lon Hohberger wrote:

> On Thu, May 31, 2007 at 08:52:28AM -0400, Scott McClanahan wrote:
> > Any help is appreciated.  I can provide more information if you think it
> > is helpful.  Also, is there some sort of debugging within rgmanager I
> > can enable to see what is truly failing or timing out and requiring a
> > restart of these services?
> 
> (1) Upgrade at least rgmanager, ccsd, magma, and magma-plugins 
> to 4.5.
> 
> (2) Configure rgmanager to use a different log thing, like local4:
> 
>    <rm log_facility="local4" log_level="6">
>       ...
>    </rm>
> 
>   (don't forget to use ccs_tool update)
> 
> (3) Configure syslog to redirect local4 to something besides
> /var/log/messages:
> 
>    local4.* /var/log/rgmanager
> 
> (4) Restart syslog
> 
> ... and you'll have awesome logging in /var/log/rgmanager.  Probably
> more than you need ;)
> 
> -- Lon


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/3484d9b6/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2178 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/3484d9b6/attachment.p7s>

From james.lapthorn at lapthornconsulting.com  Fri Jun  1 11:51:49 2007
From: james.lapthorn at lapthornconsulting.com (James Lapthorn)
Date: Fri, 1 Jun 2007 12:51:49 +0100
Subject: [Linux-cluster] CMAN: quorum lost, blocking activity
Message-ID: <c5703b1f0706010451g4a6deff9o5cdd580ca937793d@mail.gmail.com>

Can any dody help shed some light on the following log entries.  This is on
a 4 node cluster suite 4 cluster.  Looks like I had problems with my quorum
disk that shutdown my services on node 1.   This was not shown when doing a
'clustat' everything appeared to be fine.  It wasn't until checking the
mounts that I realised the DB disk was not mounted.

Jun  1 10:44:11 leoukldb1 qdiskd[6386]: <notice> Score insufficient for
master operation (0/1; max=1); downgrading
Jun  1 10:44:11 leoukldb1 kernel: CMAN: quorum lost, blocking activity
Jun  1 10:44:11 leoukldb1 clurgmgrd[6436]: <emerg> #1: Quorum Dissolved
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Cluster is not quorate.  Refusing
connection.
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Error while processing connect:
Connection refused
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Invalid descriptor specified (-111).
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Someone may be attempting something
evil.
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Error while processing get: Invalid
request descriptor
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Invalid descriptor specified (-111).
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Someone may be attempting something
evil.
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Error while processing get: Invalid
request descriptor
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Invalid descriptor specified (-21).
Jun  1 10:44:11 leoukldb1 ccsd[6264]: Someone may be attempting something
evil.

I had to do a fored reboot in order to get the services to fail over??

Any help would be appreciated!

James
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/e6742055/attachment.htm>

From lhh at redhat.com  Fri Jun  1 13:06:09 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 1 Jun 2007 09:06:09 -0400
Subject: [Linux-cluster] Cluster services stopping
In-Reply-To: <1180697991.3326.9.camel@WSBID06223>
References: <1180615948.5475.71.camel@jarjar.trnswrks.com>
	<20070531200339.GQ4041@redhat.com>
	<1180697991.3326.9.camel@WSBID06223>
Message-ID: <20070601130609.GS4041@redhat.com>

On Fri, Jun 01, 2007 at 07:39:51AM -0400, rhurst at bidmc.harvard.edu wrote:
> Any chance that extended attributes, like log_level, be configurable
> within luci?  I can not find any references to such things.

Sure.  Though, errors like the ones causing a service restart should
appear in /var/log/messages without reconfiguration. :)

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From scott.mcclanahan at trnswrks.com  Fri Jun  1 13:10:20 2007
From: scott.mcclanahan at trnswrks.com (Scott McClanahan)
Date: Fri, 01 Jun 2007 09:10:20 -0400
Subject: [Linux-cluster] Cluster services stopping
In-Reply-To: <20070601130609.GS4041@redhat.com>
References: <1180615948.5475.71.camel@jarjar.trnswrks.com>
	<20070531200339.GQ4041@redhat.com><1180697991.3326.9.camel@WSBID06223>
	<20070601130609.GS4041@redhat.com>
Message-ID: <1180703420.5475.81.camel@jarjar.trnswrks.com>

On Fri, 2007-06-01 at 09:06 -0400, Lon Hohberger wrote:
> On Fri, Jun 01, 2007 at 07:39:51AM -0400, rhurst at bidmc.harvard.edu
> wrote:
> > Any chance that extended attributes, like log_level, be configurable
> > within luci?  I can not find any references to such things.
> 
> Sure.  Though, errors like the ones causing a service restart should
> appear in /var/log/messages without reconfiguration. :)
> 
> -- Lon
> 
> --
> Lon Hohberger - Software Engineer - Red Hat, Inc.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
I've never simulated a file system failure to the point that a file
system check would fail and cause a service restart but that should be
logged right?  Even with default log levels enabled?  Is the same true
with IP address check failures?  It's shocking that nothing is being
logged except that the service is being stopped.



From Robert.Gil at americanhm.com  Fri Jun  1 14:18:49 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Fri, 1 Jun 2007 10:18:49 -0400
Subject: [Linux-cluster] MySQL Failover / Failback
Message-ID: <F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

I am curious if anyone knows the best practices for this? Several use
cases include
 
Note: We are choosing to use a vip for the two nodes to make the
failover change transparent to the application side.
 
1) Node 1 (master) dies
        -How do we enable "sticky" failover so that it does not then
fail back to Node 1
        -Is Node 2 active all the time or is the service completely shut
off? And if its off, how would replication happen?
        -How do failover domains work in this case?
2) Node 2 (Master) Node 1 recovered
        -How does replication continue again?
        -How does the master slave relationship change? Is this
automated, or does it require manual intervention? Should we be using
DRDB?
3) Node 1 (master) Node 2 (slave) - network connectivity dies on node 1
        -There is an IP resource available, but how does this monitor
and handle failover?
        -How can I move the vip in the event of a failure? Do I need to
manually script this?
 
With the vip failover, do I attach the vip resource to the mysql
resource in the failover domain for those two nodes? What happens if I
do this?
 
Thanks,
 
Robert Gil
Linux Systems Administrator
American Home Mortgage
Phone: 631-622-8410
Cell: 631-827-5775
Fax: 516-495-5861
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/11e4ffa1/attachment.htm>

From srigler at marathonoil.com  Fri Jun  1 14:44:28 2007
From: srigler at marathonoil.com (Steve Rigler)
Date: Fri, 01 Jun 2007 09:44:28 -0500
Subject: [Linux-cluster] gfs_quota returns negative value for usage
In-Reply-To: <1180011906.25316.11.camel@houuc8>
References: <1180011906.25316.11.camel@houuc8>
Message-ID: <1180709068.30624.3.camel@houuc8>

On Thu, 2007-05-24 at 08:05 -0500, Steve Rigler wrote:
> Greetings,
> 
> We are in the process of implementing quotas on home directories that
> reside on a GFS filesystem.  All seems well with the exception of one
> user who's usage is returned as a negative number from gfs_quota:
> 
> user    <snip>:  limit: 102400.0   warn: 0.0        value: -81.4
> 
> The user actually has about 100MB in their home directory.
> 
> This is on RHEL 4 update 3 with "GFS-6.1.5-0".  Any ideas how we can get
> the actual usage to be returned from gfs_quota?
> 
> Thanks,
> Steve
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


Is this something that "gfs_quota check" might fix?

Thanks,
Steve



From lgodoy at atichile.com  Fri Jun  1 16:08:38 2007
From: lgodoy at atichile.com (Luis Godoy Gonzalez)
Date: Fri, 01 Jun 2007 12:08:38 -0400
Subject: [Linux-cluster] Install Cluster in RHE4 U5
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <46604486.1040401@atichile.com>

Hi

I'm trying to install a new machine with RHE4 U5 and Cluster Suite, but 
I have several troubles.
Could any indicate the steps to make this ?

Someone indicated to use up2date, but how ? ( up2date  ???? )
I don't whish to update the whole system, just only install cluster suite.

I download latest sources from 
"ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4ES/en/RHCS/SRPMS" 
, but I can't compile this source on any of my machines (RHE4, U4, U5 or 
RHE5 ).

I got this error when I tried to use rpmbuild
=============
[root at ele install]# rpmbuild --rebuild ccs-1.0.7-0.src.rpm
Installing ccs-1.0.7-0.src.rpm
error: Architecture is not included: i386
[root at ele install]# uname -a
Linux ele.ati-labs.cl 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT 2007 i686 
athlon i386 GNU/Linux
=============


We tested E5 but is not possible install it yet, for oracle 
certification issues.  :(


Thanks in advance for any help.
Luis G.



From Robert.Gil at americanhm.com  Fri Jun  1 16:30:17 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Fri, 1 Jun 2007 12:30:17 -0400
Subject: [Linux-cluster] IP Relocate Error
Message-ID: <F154BEC9D4278D4BA9B94BDEC761934826F4@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

I have an IP address as a resource. I have the ip address in a 2 node
failover domain (total 4 nodes).
 
When i run ifconfig eth0:1 down
 
The service shows as stopped in clustat and the following errors show in
the logs
 
Jun  1 12:25:36 <host> clurgmgrd[5346]: <warning> #71: Relocating failed
service mastervip
Jun  1 12:25:36 <host> clurgmgrd[5346]: <warning> #70: Attempting to
restart service mastervip locally.
Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> Recovering failed
service mastervip
Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> start on
ip:192.168.2.100 returned 1 (generic error)
Jun  1 12:25:37 <host> clurgmgrd[5346]: <warning> #68: Failed to start
mastervip; return value: 1
Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> Stopping service
mastervip
Jun  1 12:25:37 <host> clurgmgrd[5346]: <notice> Service mastervip is
stopped
 
The following is the resources in /etc/cluster.conf
 
                <resources>
                        <clusterfs device="/dev/mapper/mqdata-mqdata"
force_unmount="0" fsid="22567" fstype="gfs" mountpoint="/mqdata"
name="mqdata" options=""/>
                        <ip address="192.168.2.100" interface="eth0"
monitor_link="1"/>
                </resources>

 
The service in /etc/cluster.conf
 
                <service autostart="1" domain="mysql" exclusive="1"
name="mastervip" recovery="relocate">
                        <ip ref="192.168.2.100"/>
                </service>

Any ideas?
 
Thanks,

 
Robert Gil
Linux Systems Administrator
American Home Mortgage
Phone: 631-622-8410
Cell: 631-827-5775
Fax: 516-495-5861
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/d66f4218/attachment.htm>

From jparsons at redhat.com  Fri Jun  1 16:43:35 2007
From: jparsons at redhat.com (James Parsons)
Date: Fri, 01 Jun 2007 12:43:35 -0400
Subject: [Linux-cluster] Install Cluster in RHE4 U5
In-Reply-To: <46604486.1040401@atichile.com>
References: <F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<46604486.1040401@atichile.com>
Message-ID: <46604CB7.1040601@redhat.com>

Luis Godoy Gonzalez wrote:

> Hi
>
> I'm trying to install a new machine with RHE4 U5 and Cluster Suite, 
> but I have several troubles.
> Could any indicate the steps to make this ?

A lot of folks on this list are hands on, command-line types who sneer 
at GUI's :)  HOWEVER:
If you use the conga bits to create a cluster, you just have to enter 
the FQDN's for your cluster nodes and their passwords in a secure form, 
and click create. Then if you want, you never, never, ever have to use 
the GUI again...you can tinker to your hearts content with configuration 
files and prompt commands with six switches in them.
Or, you could
1) download/up2date all the necessary RPMs and kernel pieces yourself to 
all of the nodes
2) create a skel cluster config file
3) copy it around to all of the nodes via scp
4) start the cluster services by hand on each node (careful, order matters!)
5) ssh around and check that every node has joined...if not, run the 
join command...

Know what I mean? ;)

-J

DISCLAIMER: This post was generated by a known GUI DEVELOPER and could 
very likely be prejudiced towards GUIs and the lazy, carefree lifestyle 
they engender.



From weikuan.yu at gmail.com  Fri Jun  1 16:51:02 2007
From: weikuan.yu at gmail.com (Weikuan Yu)
Date: Fri, 01 Jun 2007 12:51:02 -0400
Subject: [Linux-cluster] GNBD configuration
Message-ID: <46604E76.3040708@gmail.com>

Hi,

I recently want to update from GFS6.0 to GFS6.1. I have not yet been 
successful in creating a GNBD-based small configuration.

I used the following cluster.conf on node15,node16,node22.
# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="y2gfs" config_version="1">
<cman two_node="1" expected_votes="1"> </cman>
<clusternodes>

<clusternode name="node15"> <fence> <method name="single"> <device 
name="gnbd" ipaddr="node15"/> </method>
</fence> </clusternode>

<clusternode name="node16"> <fence> <method name="single"> <device 
name="gnbd" ipaddr="node16"/> </method>
</fence> </clusternode>
</clusternodes>

<fencedevices>
         <fencedevice name="gnbd" agent="fence_gnbd" servers="node22"/>
</fencedevices>
</cluster>


After loading all the modules, I try to run these commands:
ccsd
cman_tool join
fence_tool join
clvmd

I have the following problems.
  -- not able to have node22 join the cluster manager.
  -- Not able to complete this command for node16
     # fence_tool join

[root at node22 cluster-RHEL4]# cman_tool join
cman_tool: local node name "node22" not found in cluster.conf

Can anybody share some insights here? I have been trying different steps for 
a while to no successes. Let me know if I need to be more clear on some 
specifics.

Many thanks in advance,

Weikuan



From jleafey at utmem.edu  Fri Jun  1 16:45:14 2007
From: jleafey at utmem.edu (Jay Leafey)
Date: Fri, 01 Jun 2007 11:45:14 -0500
Subject: [Linux-cluster] Install Cluster in RHE4 U5
In-Reply-To: <46604486.1040401@atichile.com>
References: <F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<46604486.1040401@atichile.com>
Message-ID: <46604D1A.2050408@utmem.edu>

Luis Godoy Gonzalez wrote:
> Hi
> 
> I'm trying to install a new machine with RHE4 U5 and Cluster Suite, but 
> I have several troubles.
> Could any indicate the steps to make this ?
> 
> Someone indicated to use up2date, but how ? ( up2date  ???? )
> I don't whish to update the whole system, just only install cluster suite.
> 
> I download latest sources from 
> "ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4ES/en/RHCS/SRPMS" 
> , but I can't compile this source on any of my machines (RHE4, U4, U5 or 
> RHE5 ).
> 
> I got this error when I tried to use rpmbuild
> =============
> [root at ele install]# rpmbuild --rebuild ccs-1.0.7-0.src.rpm
> Installing ccs-1.0.7-0.src.rpm
> error: Architecture is not included: i386
> [root at ele install]# uname -a
> Linux ele.ati-labs.cl 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT 2007 i686 
> athlon i386 GNU/Linux
> =============
> 
> 
> We tested E5 but is not possible install it yet, for oracle 
> certification issues.  :(
> 
> 
> Thanks in advance for any help.
> Luis G.
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Try this:

     rpmbuild --rebuild --target i686 ccs-1.0.7-0.src.rpm

or

     rpmbuild --rebuild --target x86_64 ccs-1.0.7-0.src.rpm

as appropriate.  The 'arch' command should give you the appropriate 
value.  The spec file for ccs only lists i686, ia64 (Itanium), and 
x86_64 (AMD64/EM64T) architectures.  The default for rpmbuild on an ix86 
box is 'i386' unless you specify a target with the '--target' option.

-- 
Jay Leafey - University of Tennessee
E-Mail:  jleafey at utmem.edu  Phone:  901-448-6534  FAX:  901-448-8199

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 5153 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/e685802b/attachment.bin>

From jstoner at opsource.net  Fri Jun  1 17:21:20 2007
From: jstoner at opsource.net (Jeff Stoner)
Date: Fri, 1 Jun 2007 18:21:20 +0100
Subject: [Linux-cluster] MySQL Failover / Failback
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <38A48FA2F0103444906AD22E14F1B5A305DE343C@mailxchg01.corp.opsource.net>

Sounds like you've got several things happening all at once. If you are
not using MySQL Cluster, then you will probably have an active/passive
setup, in which MySQL will be running on only one node. If you are using
MySQL Cluster, why are you using Redhat Cluster?
 
Replication? Are you referring to MySQL Replication? What is replicating
where? Are the slaves a part of the Redhat Cluster? If you simply mean
will replication "break" if MySQL fails over then no. Replication on the
slave will retry connecting to the master (according to the connection
retry settings in MySQL.) Also, you must use the Redhat
Cluster-controlled IP when establishing replication and not the IP of
any particular node (for obvious reasons.)
 
For my MySQL databases built on Redhat Cluster, I specify my service as
follows:
 
<service autostart="1" domain="mysql-fail-domain" name="mysql5-service">
   <ip ref="10.2.2.2"/>
   <fs ref="mysqlfs"/>
   <script ref="mysqld"/>
</service>

The filesystem resource is a slice of SAN accessible by all nodes in the
cluster. The script is a (modified) /etc/init.d/mysqld script.
 

--Jeff

Service Engineer
OpSource Inc.

"The Message of the Day is /etc/motd"


 



________________________________


	I am curious if anyone knows the best practices for this?
Several use cases include
	 
	Note: We are choosing to use a vip for the two nodes to make the
failover change transparent to the application side.
	 
	1) Node 1 (master) dies
	        -How do we enable "sticky" failover so that it does not
then fail back to Node 1
	        -Is Node 2 active all the time or is the service
completely shut off? And if its off, how would replication happen?
	        -How do failover domains work in this case?
	2) Node 2 (Master) Node 1 recovered
	        -How does replication continue again?
	        -How does the master slave relationship change? Is this
automated, or does it require manual intervention? Should we be using
DRDB?
	3) Node 1 (master) Node 2 (slave) - network connectivity dies on
node 1
	        -There is an IP resource available, but how does this
monitor and handle failover?
	        -How can I move the vip in the event of a failure? Do I
need to manually script this?
	 
	With the vip failover, do I attach the vip resource to the mysql
resource in the failover domain for those two nodes? What happens if I
do this?
	 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/254b369d/attachment.htm>

From Robert.Gil at americanhm.com  Fri Jun  1 17:43:22 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Fri, 1 Jun 2007 13:43:22 -0400
Subject: [Linux-cluster] MySQL Failover / Failback
Message-ID: <F154BEC9D4278D4BA9B94BDEC761934826F6@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

 


From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Stoner
Sent: Friday, June 01, 2007 1:21 PM
To: linux clustering
Subject: RE: [Linux-cluster] MySQL Failover / Failback


Sounds like you've got several things happening all at once. If you are
not using MySQL Cluster, then you will probably have an active/passive
setup, in which MySQL will be running on only one node. If you are using
MySQL Cluster, why are you using Redhat Cluster?
 
Yes, it is using mysql replication NOT mysql cluster. It is active /
passive. One node is rw (master) the other is ro (slave).
 
Replication? Are you referring to MySQL Replication? What is replicating
where? Are the slaves a part of the Redhat Cluster? If you simply mean
will replication "break" if MySQL fails over then no. Replication on the
slave will retry connecting to the master (according to the connection
retry settings in MySQL.) Also, you must use the Redhat
Cluster-controlled IP when establishing replication and not the IP of
any particular node (for obvious reasons.)
 
For my MySQL databases built on Redhat Cluster, I specify my service as
follows:
 
<service autostart="1" domain="mysql-fail-domain" name="mysql5-service">
   <ip ref="10.2.2.2"/>
   <fs ref="mysqlfs"/>
   <script ref="mysqld"/>
</service>

The filesystem resource is a slice of SAN accessible by all nodes in the
cluster. The script is a (modified) /etc/init.d/mysqld script.
 
 
What does your resource section look like for the ip address? I keep
getting the following errors (posted in another message)
 

--Jeff

Service Engineer
OpSource Inc.

"The Message of the Day is /etc/motd"


 



________________________________


	I am curious if anyone knows the best practices for this?
Several use cases include
	 
	Note: We are choosing to use a vip for the two nodes to make the
failover change transparent to the application side.
	 
	1) Node 1 (master) dies
	        -How do we enable "sticky" failover so that it does not
then fail back to Node 1
	        -Is Node 2 active all the time or is the service
completely shut off? And if its off, how would replication happen?
	        -How do failover domains work in this case?
	2) Node 2 (Master) Node 1 recovered
	        -How does replication continue again?
	        -How does the master slave relationship change? Is this
automated, or does it require manual intervention? Should we be using
DRDB?
	3) Node 1 (master) Node 2 (slave) - network connectivity dies on
node 1
	        -There is an IP resource available, but how does this
monitor and handle failover?
	        -How can I move the vip in the event of a failure? Do I
need to manually script this?
	 
	With the vip failover, do I attach the vip resource to the mysql
resource in the failover domain for those two nodes? What happens if I
do this?
	 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/2a0b26f3/attachment.htm>

From Robert.Gil at americanhm.com  Fri Jun  1 17:44:45 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Fri, 1 Jun 2007 13:44:45 -0400
Subject: [Linux-cluster] MySQL Failover / Failback
Message-ID: <F154BEC9D4278D4BA9B94BDEC761934826F7@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

Sorry, forgot the error message in here.
 


From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Gil
Sent: Friday, June 01, 2007 1:43 PM
To: linux clustering
Subject: RE: [Linux-cluster] MySQL Failover / Failback


 


From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Stoner
Sent: Friday, June 01, 2007 1:21 PM
To: linux clustering
Subject: RE: [Linux-cluster] MySQL Failover / Failback


Sounds like you've got several things happening all at once. If you are
not using MySQL Cluster, then you will probably have an active/passive
setup, in which MySQL will be running on only one node. If you are using
MySQL Cluster, why are you using Redhat Cluster?
 
Yes, it is using mysql replication NOT mysql cluster. It is active /
passive. One node is rw (master) the other is ro (slave).
 
Replication? Are you referring to MySQL Replication? What is replicating
where? Are the slaves a part of the Redhat Cluster? If you simply mean
will replication "break" if MySQL fails over then no. Replication on the
slave will retry connecting to the master (according to the connection
retry settings in MySQL.) Also, you must use the Redhat
Cluster-controlled IP when establishing replication and not the IP of
any particular node (for obvious reasons.)
 
For my MySQL databases built on Redhat Cluster, I specify my service as
follows:
 
<service autostart="1" domain="mysql-fail-domain" name="mysql5-service">
   <ip ref="10.2.2.2"/>
   <fs ref="mysqlfs"/>
   <script ref="mysqld"/>
</service>

The filesystem resource is a slice of SAN accessible by all nodes in the
cluster. The script is a (modified) /etc/init.d/mysqld script.
 
 
What does your resource section look like for the ip address? I keep
getting the following errors (posted in another message)
 
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <notice> Starting
disabled service mastervip
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <notice> start on
ip:192.168.2.100 returned 1 (generic error)
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <warning> #68: Failed to
start mastervip; return value: 1
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <notice> Stopping service
mastervip
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <notice> Service
mastervip is recovering
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <warning> #71: Relocating
failed service mastervip
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <warning> #70: Attempting
to restart service mastervip locally.
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <notice> Recovering
failed service mastervip
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <notice> start on
ip:192.168.2.100 returned 1 (generic error)
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <warning> #68: Failed to
start mastervip; return value: 1
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <notice> Stopping service
mastervip
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <notice> Service
mastervip is stopped


--Jeff

Service Engineer
OpSource Inc.

"The Message of the Day is /etc/motd"


 



________________________________


	I am curious if anyone knows the best practices for this?
Several use cases include
	 
	Note: We are choosing to use a vip for the two nodes to make the
failover change transparent to the application side.
	 
	1) Node 1 (master) dies
	        -How do we enable "sticky" failover so that it does not
then fail back to Node 1
	        -Is Node 2 active all the time or is the service
completely shut off? And if its off, how would replication happen?
	        -How do failover domains work in this case?
	2) Node 2 (Master) Node 1 recovered
	        -How does replication continue again?
	        -How does the master slave relationship change? Is this
automated, or does it require manual intervention? Should we be using
DRDB?
	3) Node 1 (master) Node 2 (slave) - network connectivity dies on
node 1
	        -There is an IP resource available, but how does this
monitor and handle failover?
	        -How can I move the vip in the event of a failure? Do I
need to manually script this?
	 
	With the vip failover, do I attach the vip resource to the mysql
resource in the failover domain for those two nodes? What happens if I
do this?
	 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/3b648296/attachment.htm>

From jimm at simutronics.com  Fri Jun  1 17:47:20 2007
From: jimm at simutronics.com (James Miller)
Date: Fri, 1 Jun 2007 12:47:20 -0500
Subject: [Linux-cluster] Linux Cluster Questions
Message-ID: <03fa01c7a474$e73d62f0$5dd810d1@e3demo>

Hello everyone,

I'm new to this list so please bear with me.  My previous Linux
clustering/HA has come from LinuxHA and Linux LVS.  Linux HA/LVS are great
for application HA/Load balancing but not for HA storage, before you all
scream I know about DRBD but that seems slow and kludgey.  I'm wanting to
setup two servers with CLVS and I want to create an LVM2 mirror spread
across the two servers (neither server will access the partition at the same
time -- so I'm not too worried about using EXT2/3).  Ideally I want to have
a 'global' Volume Group with PVs from both boxes and from that create the
mirrored LVs.  

If this is crazy or undo-able please let me know.  Heh, I'm not sure if I'm
making things worse for myself or not but I'm running on Debian 4.0 Etch.

>From what I've gathered I need to configure the following daemon and
applications, let me know if I'm missing anything (or have too much =):
1. CCS - Cluster configuration manager
2. CMAN & DLM -- Cluster Manager (or GULM?)
3. Fence I/O
4. CLVM daemon 
5. gnbd -- maybe for fencing purposes?

Some questions I have about Linux Cluster for CLVM are:
1. lock_gulmd vs. cman & dlm  -- If I were using GFS2 I would want to use
glumd right? 

2. fencing -- Ick!!! I don't have any of the fencing agents -- such as an
APC MasterSwitch, IBM Blade center, Brocade McData, Sanbox/Qlogic, or Vixel
fibre switches, Egenera blades, HP ILO ... that leaves me with GNBD and
Manual/Manual_ack.  From the man pages it's suggested I not use
Manual/Manual_ack for production.  I really don't need automatic re-adding
of an node once it comes back up, what issues will I run into using manual
fencing?

3. cman -- What issues are there to getting CMAN running?  I'm having
trouble getting a custom kernel compiled with the CMAN-Kernel module.

4. ccs_tool, cman_tool and version numbers.  I see there's config versions
in /etc/cluster/cluster.conf and that ccs_tool is used to send updates
(changes) to the other nodes.  Soooo I assume when you make a change to a
running cluster you increment the version number each time and you use that
new version number when you run cman_tool?


Thanks,
Jim







James Miller - MCSE RHCE CISSP
Sr Systems & Network Administrator
Simutronics Corp.
www.play.net
636.946.4263 x113 



From lgodoy at atichile.com  Fri Jun  1 20:12:11 2007
From: lgodoy at atichile.com (Luis Godoy Gonzalez)
Date: Fri, 01 Jun 2007 16:12:11 -0400
Subject: [Linux-cluster] Install Cluster in RHE4 U5
In-Reply-To: <46604D1A.2050408@utmem.edu>
References: <F154BEC9D4278D4BA9B94BDEC761934826EE@melpsechmbx15.HQ.CORP.AMERICANHM.COM>	<46604486.1040401@atichile.com>
	<46604D1A.2050408@utmem.edu>
Message-ID: <46607D9B.2060309@atichile.com>

Thanks, that work it !!!   :)

Jay Leafey escribi?:
> Luis Godoy Gonzalez wrote:
>> Hi
>>
>> I'm trying to install a new machine with RHE4 U5 and Cluster Suite, 
>> but I have several troubles.
>> Could any indicate the steps to make this ?
>>
>> Someone indicated to use up2date, but how ? ( up2date  ???? )
>> I don't whish to update the whole system, just only install cluster 
>> suite.
>>
>> I download latest sources from 
>> "ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4ES/en/RHCS/SRPMS" 
>> , but I can't compile this source on any of my machines (RHE4, U4, U5 
>> or RHE5 ).
>>
>> I got this error when I tried to use rpmbuild
>> =============
>> [root at ele install]# rpmbuild --rebuild ccs-1.0.7-0.src.rpm
>> Installing ccs-1.0.7-0.src.rpm
>> error: Architecture is not included: i386
>> [root at ele install]# uname -a
>> Linux ele.ati-labs.cl 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT 2007 
>> i686 athlon i386 GNU/Linux
>> =============
>>
>>
>> We tested E5 but is not possible install it yet, for oracle 
>> certification issues.  :(
>>
>>
>> Thanks in advance for any help.
>> Luis G.
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> Try this:
>
>     rpmbuild --rebuild --target i686 ccs-1.0.7-0.src.rpm
>
> or
>
>     rpmbuild --rebuild --target x86_64 ccs-1.0.7-0.src.rpm
>
> as appropriate.  The 'arch' command should give you the appropriate 
> value.  The spec file for ccs only lists i686, ia64 (Itanium), and 
> x86_64 (AMD64/EM64T) architectures.  The default for rpmbuild on an 
> ix86 box is 'i386' unless you specify a target with the '--target' 
> option.
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From weikuan.yu at gmail.com  Fri Jun  1 20:33:27 2007
From: weikuan.yu at gmail.com (Weikuan Yu)
Date: Fri, 01 Jun 2007 16:33:27 -0400
Subject: [Linux-cluster] problem about RHCS4(RHES4)
In-Reply-To: <AF53B60A7942414C988C5A8AE45DE904@ZSTAR>
References: <BAY120-W14C713096DE53D11C3833FD9290@phx.gbl>
	<AF53B60A7942414C988C5A8AE45DE904@ZSTAR>
Message-ID: <46608297.2060801@gmail.com>

I am having the same problem here.

I can ssh back and forth between several nodes, but cman reports this puzzle:
No functional network interfaces

-- Weikuan

shirai at SystemCreateINC wrote:
> In the log, it seems to have downed eth0.
> Is eth0 used for the network of GFS?
>  
>  >May 25 23:17:37 DB01 clurgmgrd: [4720]: <info> Removing IPv4 address 
> 192.168.100.2 from eth0
>  >May 25 23:17:59 DB01 kernel: CMAN: No functional network interfaces, 
> leaving cluster
> 
> Was the network checked?
> 
> Regards
> 
> ------------------------------------------------------
> Shirai Noriyuki
> Chief Engineer Technical Div. System Create Inc
> Kanda Toyo Bldg, 3-4-2 Kandakajicho
> Chiyodaku Tokyo 101-0045 Japan
> Tel81-3-5296-3775 Fax81-3-5296-3777
> e-mail:shirai at sc-i.co.jp web:http://www.sc-i.co.jp
> ------------------------------------------------------
>  
> 
>     ----- Original Message -----
>     *From:* ?? ?? <mailto:unyata at hotmail.com>
>     *To:* linux-cluster at redhat.com <mailto:linux-cluster at redhat.com>
>     *Sent:* Sunday, May 27, 2007 3:45 PM
>     *Subject:* [Linux-cluster] problem about RHCS4(RHES4)
> 
>      
>      
>     Hi,
>      
>     i have problem about system-config-cluster.
>     this system has 2 node and the service is mySQL.
>     when i start service ccsd,cman,fenced,rgmanager DB01 and DB02 for
>     alternation, suddonry DB01 is starting shutdown.
>     there is a messeage
>     please tell me  what's wrong with my configuration if you know.
>     Thanks
>      
>     May 25 23:16:28 DB01 ccsd[4617]: Starting ccsd 1.0.7:
>     May 25 23:16:28 DB01 ccsd[4617]:  Built: Jun 22 2006 18:15:41
>     May 25 23:16:28 DB01 ccsd[4617]:  Copyright (C) Red Hat, Inc.  2004 
>     All rights reserved.
>     May 25 23:16:28 DB01 ccsd:  succeeded
>     May 25 23:16:41 DB01 kernel: CMAN 2.6.9-45.2 (built Jul 13 2006
>     11:42:36) installed
>     May 25 23:16:41 DB01 kernel: NET: Registered protocol family 30
>     May 25 23:16:41 DB01 kernel: DLM 2.6.9-42.10 (built Jul 13 2006
>     11:48:04) installed
>     May 25 23:16:42 DB01 ccsd[4617]: cluster.conf (cluster name =
>     alpha_cluster, version = 7) found.
>     May 25 23:16:42 DB01 kernel: CMAN: Waiting to join or form a
>     Linux-cluster
>     May 25 23:16:42 DB01 ccsd[4617]: Connected to cluster infrastruture
>     via: CMAN/SM Plugin v1.1.7.1
>     May 25 23:16:42 DB01 ccsd[4617]: Initial status:: Inquorate
>     May 25 23:17:14 DB01 ccsd[4617]: Cluster is quorate.  Allowing
>     connections.
>     May 25 23:17:14 DB01 kernel: C! MAN: forming a new cluster
>     May 25 23:17:14 DB01 kernel: CMAN: quorum regained, resuming activity
>     May 25 23:17:14 DB01 cman: startup succeeded
>     May 25 23:17:20 DB01 fenced: startup succeeded
>     May 25 23:17:37 DB01 clurgmgrd[4720]: <notice> Resource Group
>     Manager Starting
>     May 25 23:17:37 DB01 clurgmgrd[4720]: <info> Loading Service Data
>     May 25 23:17:37 DB01 rgmanager: clurgmgrd service start succeeded
>     May 25 23:17:37 DB01 clurgmgrd[4720]: <info> Initializing Services
>     May 25 23:17:37 DB01 clurgmgrd: [4720]: <info> Removing IPv4 address
>     192.168.100.2 from eth0
>     May 25 23:17:39 DB01 kernel: CMAN: sendmsg failed: -22
>     May 25 23:17:44 DB01 kernel: CMAN: sendmsg failed: -22
>     May 25 23:17:47 DB01 clurgmgrd: [4720]: <info> Executing
>     /etc/init.d/mysqld stop
>     May 25 23:17:47 DB01 mysqld: Stopping MySQL:  failed
>     May 25 23:17:47 DB01 clurgmgrd[4720]: <notice> stop on script
>     "mysqld" returned 1 (generic error) May 25 23:17:47 DB01
>     clurgmgrd[4720]: <info> Services! Initial ized
>     May 25 23:17:47 DB01 kernel: CMAN: sendmsg failed: -22
>     May 25 23:17:47 DB01 kernel: SM: send_broadcast_message error -22
>     May 25 23:17:48 DB01 kernel: CMAN: resend failed: -22
>     May 25 23:17:49 DB01 kernel: CMAN: sendmsg failed: -22
>     May 25 23:17:59 DB01 last message repeated 2 times
>     May 25 23:17:59 DB01 kernel: CMAN: No functional network interfaces,
>     leaving cluster
>     May 25 23:17:59 DB01 kernel: CMAN: sendmsg failed: -22
>     May 25 23:17:59 DB01 kernel: CMAN: we are leaving the cluster.
>     May 25 23:17:59 DB01 kernel: SM: send_broadcast_message error -107
>     May 25 23:17:59 DB01 kernel: WARNING: dlm_emergency_shutdown
>     May 25 23:17:59 DB01 kernel: WARNING: dlm_emergency_shutdown
>     May 25 23:17:59 DB01 kernel: SM: 00000001 sm_stop: SG still joined
>     May 25 23:17:59 DB01 kernel: SM: 00000000 sm_stop: SG still joined
>     May 25 23:17:59 DB01 ccsd[4617]: Cluster manager shutdown. 
>     Attemping to reconnect...
>     May 25 23:18:12 DB01 shutdown: shuttin! g down for system halt
>     May 25 23:18:12 DB01 init: Switching to runlevel: 0
>     May 25 23:18:12 DB01 gconfd (root-4441): recieveing signal 15
>     starting shut down
>     May 25 23:18:12 DB01 htt_server[3292]: Client shut down the
>     connection owned by im_id(1).
>     May 25 23:18:12 DB01 gconfd (root-4441): shut down
>     May 25 23:18:12 DB01 htt_server[3292]: Client shut down the
>     connection owned by im_id(1).
>     May 25 23:18:12 DB01 gdm(pam_unix)[4088]: session closed for user root
>     May 25 23:18:13 DB01 ccsd[4617]: Cluster is not quorate.  Refusing
>     connection.
>     May 25 23:18:13 DB01 ccsd[4617]: Error while processing connect:
>     Connection refused
>     May 25 23:18:13 DB01 ccsd[4617]: Invalid descriptor specified (-111).
>     May 25 23:18:13 DB01 ccsd[4617]: Someone may be attempting something
>     evil.
>     May 25 23:18:13 DB01 ccsd[4617]: Error while processing get: Invalid
>     request descriptor
>     May 25 23:18:13 DB01 ccsd[4617]: Invalid descriptor specified (-111).
>     May 25 23:18:1! 3 DB01 ccsd[4617]: Someone may be attempting
>     something evil. May 2 5 23:18:13 DB01 ccsd[4617]: Error while
>     processing get: Invalid request descriptor
>     May 25 23:18:13 DB01 ccsd[4617]: Invalid descriptor specified (-21).
>     May 25 23:18:13 DB01 ccsd[4617]: Someone may be attempting something
>     evil.
>     May 25 23:18:13 DB01 ccsd[4617]: Error while processing disconnect:
>     Invalid request descriptor
>     May 25 23:18:13 DB01 rgmanager: [5153]: <notice> Shutting down
>     Cluster Service Manager...
>     May 25 23:18:14 DB01 ccsd[4617]: Unable to connect to cluster
>     infrastructure after 30 seconds.
>     May 25 23:18:14 DB01 gpm[3260]: *** info [mice.c(1766)]:
>     May 25 23:18:14 DB01 gpm[3260]: imps2: Auto-detected intellimouse PS/2
> 
>     ------------------------------------------------------------------------
>     Get news, entertainment and everything you care about at Live.com.
>     Check it out! <http://www.live.com/getstarted.aspx >
> 
>     ------------------------------------------------------------------------
> 
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com
>     https://www.redhat.com/mailman/listinfo/linux-cluster
> 
>     ------------------------------------------------------------------------
> 
>     No virus found in this incoming message.
>     Checked by AVG Free Edition.
>     Version: 7.5.472 / Virus Database: 269.8.0/819 - Release Date:
>     2007/05/26 10:47
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From weikuan.yu at gmail.com  Fri Jun  1 21:16:20 2007
From: weikuan.yu at gmail.com (Weikuan Yu)
Date: Fri, 01 Jun 2007 17:16:20 -0400
Subject: [Linux-cluster] Re: GNBD configuration
In-Reply-To: <46604E76.3040708@gmail.com>
References: <46604E76.3040708@gmail.com>
Message-ID: <46608CA4.1030408@gmail.com>

I finally figured out this. It is because of my network configuration errors.

Weikuan

Weikuan Yu wrote:
> Hi,
> 
> I recently want to update from GFS6.0 to GFS6.1. I have not yet been 
> successful in creating a GNBD-based small configuration.
> 
> I used the following cluster.conf on node15,node16,node22.
> # cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster name="y2gfs" config_version="1">
> <cman two_node="1" expected_votes="1"> </cman>
> <clusternodes>
> 
> <clusternode name="node15"> <fence> <method name="single"> <device 
> name="gnbd" ipaddr="node15"/> </method>
> </fence> </clusternode>
> 
> <clusternode name="node16"> <fence> <method name="single"> <device 
> name="gnbd" ipaddr="node16"/> </method>
> </fence> </clusternode>
> </clusternodes>
> 
> <fencedevices>
>         <fencedevice name="gnbd" agent="fence_gnbd" servers="node22"/>
> </fencedevices>
> </cluster>
> 
> 
> After loading all the modules, I try to run these commands:
> ccsd
> cman_tool join
> fence_tool join
> clvmd
> 
> I have the following problems.
>  -- not able to have node22 join the cluster manager.
>  -- Not able to complete this command for node16
>     # fence_tool join
> 
> [root at node22 cluster-RHEL4]# cman_tool join
> cman_tool: local node name "node22" not found in cluster.conf
> 
> Can anybody share some insights here? I have been trying different steps 
> for a while to no successes. Let me know if I need to be more clear on 
> some specifics.
> 
> Many thanks in advance,
> 
> Weikuan
> 



From Robert.Gil at americanhm.com  Fri Jun  1 21:18:27 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Fri, 1 Jun 2007 17:18:27 -0400
Subject: [Linux-cluster] MySQL Failover / Failback
Message-ID: <F154BEC9D4278D4BA9B94BDEC761934826FC@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

I figured this problem out. In order to do a vip, both the host ip and
the vip need to be in the same subnet.
 
Robert Gil
Linux Systems Administrator
American Home Mortgage
Phone: 631-622-8410
Cell: 631-827-5775
Fax: 516-495-5861
 

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Gil
Sent: Friday, June 01, 2007 1:45 PM
To: linux clustering
Subject: RE: [Linux-cluster] MySQL Failover / Failback


Sorry, forgot the error message in here.
 


From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Gil
Sent: Friday, June 01, 2007 1:43 PM
To: linux clustering
Subject: RE: [Linux-cluster] MySQL Failover / Failback


 


From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Stoner
Sent: Friday, June 01, 2007 1:21 PM
To: linux clustering
Subject: RE: [Linux-cluster] MySQL Failover / Failback


Sounds like you've got several things happening all at once. If you are
not using MySQL Cluster, then you will probably have an active/passive
setup, in which MySQL will be running on only one node. If you are using
MySQL Cluster, why are you using Redhat Cluster?
 
Yes, it is using mysql replication NOT mysql cluster. It is active /
passive. One node is rw (master) the other is ro (slave).
 
Replication? Are you referring to MySQL Replication? What is replicating
where? Are the slaves a part of the Redhat Cluster? If you simply mean
will replication "break" if MySQL fails over then no. Replication on the
slave will retry connecting to the master (according to the connection
retry settings in MySQL.) Also, you must use the Redhat
Cluster-controlled IP when establishing replication and not the IP of
any particular node (for obvious reasons.)
 
For my MySQL databases built on Redhat Cluster, I specify my service as
follows:
 
<service autostart="1" domain="mysql-fail-domain" name="mysql5-service">
   <ip ref="10.2.2.2"/>
   <fs ref="mysqlfs"/>
   <script ref="mysqld"/>
</service>

The filesystem resource is a slice of SAN accessible by all nodes in the
cluster. The script is a (modified) /etc/init.d/mysqld script.
 
 
What does your resource section look like for the ip address? I keep
getting the following errors (posted in another message)
 
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <notice> Starting
disabled service mastervip
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <notice> start on
ip:192.168.2.100 returned 1 (generic error)
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <warning> #68: Failed to
start mastervip; return value: 1
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <notice> Stopping service
mastervip
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <notice> Service
mastervip is recovering
Jun  1 13:29:10 melpsjssdbs01 clurgmgrd[5346]: <warning> #71: Relocating
failed service mastervip
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <warning> #70: Attempting
to restart service mastervip locally.
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <notice> Recovering
failed service mastervip
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <notice> start on
ip:192.168.2.100 returned 1 (generic error)
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <warning> #68: Failed to
start mastervip; return value: 1
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <notice> Stopping service
mastervip
Jun  1 13:29:11 melpsjssdbs01 clurgmgrd[5346]: <notice> Service
mastervip is stopped


--Jeff

Service Engineer
OpSource Inc.

"The Message of the Day is /etc/motd"


 



________________________________


	I am curious if anyone knows the best practices for this?
Several use cases include
	 
	Note: We are choosing to use a vip for the two nodes to make the
failover change transparent to the application side.
	 
	1) Node 1 (master) dies
	        -How do we enable "sticky" failover so that it does not
then fail back to Node 1
	        -Is Node 2 active all the time or is the service
completely shut off? And if its off, how would replication happen?
	        -How do failover domains work in this case?
	2) Node 2 (Master) Node 1 recovered
	        -How does replication continue again?
	        -How does the master slave relationship change? Is this
automated, or does it require manual intervention? Should we be using
DRDB?
	3) Node 1 (master) Node 2 (slave) - network connectivity dies on
node 1
	        -There is an IP resource available, but how does this
monitor and handle failover?
	        -How can I move the vip in the event of a failure? Do I
need to manually script this?
	 
	With the vip failover, do I attach the vip resource to the mysql
resource in the failover domain for those two nodes? What happens if I
do this?
	 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070601/58e442a0/attachment.htm>

From bob.marcan at interstudio.homeunix.net  Sun Jun  3 08:31:36 2007
From: bob.marcan at interstudio.homeunix.net (Bob Marcan)
Date: Sun, 03 Jun 2007 10:31:36 +0200
Subject: [Linux-cluster] RHEL cluster upgrade from v4.4 to v4.5
In-Reply-To: <20070517102837.0c8903ad@encijan.interstudio.homeunix.net>
References: <20070516233943.700396f3@encijan.interstudio.homeunix.net>	<9C203D6FD2BF9D49BFF3450201DEDA530D0FA4@LI-OWL.hag.hilti.com>
	<20070517102837.0c8903ad@encijan.interstudio.homeunix.net>
Message-ID: <46627C68.1020903@interstudio.homeunix.net>

Bob Marcan wrote:
> On Thu, 17 May 2007 09:17:56 +0200
> "Hagmann, Michael" <Michael.Hagmann at hilti.com> wrote:
> 
>> Hi Bob
>>
>> I get the response from Support that CS4.5 has a dependencies Problem with the fence package, but I don't have try it out.
>> I would recommend you that you get in touch with the Support Team, and NOT upgrade before.
>>
>> Mike
>>
> 
> Already did. Response was:
> The Cluster Suit and GFS packages for Red Hat Enterprise Linux 4 U5 are
> not released yet however it should be soon. So please do not upgrade as
> of now.
> 
> Best regards, Bob
> 

It is released now.
Is it safe to use "up2date -u" ?
Which kernel?
kernel-smp-2.6.9-55.EL.i686 ?

TIA, Bob
-- 
  Bob Marcan, Consultant                mailto:bob.marcan at snt.si
  S&T Slovenija d.d.                    tel:   +386 (1) 5895-300
  Leskoskova cesta 6                    fax:   +386 (1) 5895-202
  1000 Ljubljana, Slovenia              url:   http://www.snt.si



From satya.daragani at gmail.com  Sun Jun  3 13:45:23 2007
From: satya.daragani at gmail.com (Satya Daragani)
Date: Sun, 3 Jun 2007 19:15:23 +0530
Subject: [Linux-cluster] LVS Real Servers HA
Message-ID: <24fecab20706030645gd7fa430xf50a7b4798be1caf@mail.gmail.com>

Hi Team,

I am trying to build the following cluster, suggest me is there any way?

I want to use totally three nodes in the scnario, one will act as a LVS
router and other two will act as LVS real servers. I will use the Httpd
service for the testing.
Among the real servers i want to use the RHEL Cluster suite (failover
configuration in active/active mode) to have httpd service running on both
the nodes and will failover to other node if one fails.

Waiting for your all replies

-- 
Thanx
Satya Daragani
satya.daragani at gmail.com
+91 98850 58366
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070603/82dc8725/attachment.htm>

From satya.daragani at gmail.com  Sun Jun  3 14:04:25 2007
From: satya.daragani at gmail.com (Satya Daragani)
Date: Sun, 3 Jun 2007 19:34:25 +0530
Subject: [Linux-cluster] LVS + HA
Message-ID: <24fecab20706030704v25584caeh6e1dde10e6151a27@mail.gmail.com>

Hi All,



I am trying to build the following scenario using Redhat cluster suite and
LVS.

I will be using three nodes in the entire scenario, one will be used as LVS
Router and other two will be real servers (running httpd web service).

I want to configure the redhat cluster suite on both the real server nodes
for the failover of the httpd service (in active-active mode), is it
possible to do or is there any other way?



Kindly suggest me.
-- 
Thanx
Satya Daragani
satya.daragani at gmail.com
+91 98850 58366
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070603/5ad71cee/attachment.htm>

From satya.daragani at gmail.com  Sun Jun  3 14:05:24 2007
From: satya.daragani at gmail.com (Satya Daragani)
Date: Sun, 3 Jun 2007 19:35:24 +0530
Subject: [Linux-cluster] LVS+HA
Message-ID: <24fecab20706030705ya21deb7w89fab8eeb7303035@mail.gmail.com>

Hi All,



I am trying to build the following scenario using Redhat cluster suite and
LVS.

I will be using three nodes in the entire scenario, one will be used as LVS
Router and other two will be real servers (running httpd web service).

I want to configure the redhat cluster suite on both the real server nodes
for the failover of the httpd service (in active-active mode), is it
possible to do or is there any other way?



Kindly suggest me.


-- 
Thanx
Satya Daragani
satya.daragani at gmail.com
+91 98850 58366
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070603/f5bef899/attachment.htm>

From shailesh at verismonetworks.com  Mon Jun  4 05:01:33 2007
From: shailesh at verismonetworks.com (Shailesh)
Date: Mon, 04 Jun 2007 10:31:33 +0530
Subject: [Linux-cluster] Not able to stop cman service
In-Reply-To: <BAY105-F24FE01113BA964C8AFCB7FE32D0@phx.gbl>
References: <BAY105-F24FE01113BA964C8AFCB7FE32D0@phx.gbl>
Message-ID: <1180933293.4343.208.camel@linux.site>

I tried the same (cman_tool leave force) and worked.

How and why does rgmanager run without even explicitly starting it ,
I have only been running the ccsd ,cman,gfs services ?

Thanks 
Shailesh

On Thu, 2007-05-31 at 12:03 +0000, mehmet celik wrote:
>       rgmanager is running on your system.  did you check the rgmanager ? if 
> you dont want to check, you should use "cman_tool leave force"  command.
> 
> >From: Shailesh <shailesh at verismonetworks.com>
> >Reply-To: linux clustering <linux-cluster at redhat.com>
> >To: linux clustering <linux-cluster at redhat.com>
> >Subject: [Linux-cluster] Not able to stop cman service
> >Date: Thu, 31 May 2007 13:01:13 +0530
> >
> >Fellow clusterites,
> >
> >                I have 2 node gfs-gnbd RHCS Cluster and I wanted to add
> >third node to it. In the process , I have tried to stop the cman service
> >in the existing nodes , the stopping fails.
> >
> > > service cman stop  fails.
> >  Stopping cman:                                             [FAILED]
> >
> >
> >I tried the 'cman_tool leave' command it has also failed.
> >
> > > cman_tool leave
> >  Error:  Can't leave cluster while there are 1 active subsystems
> >
> >The /proc/cluster/services file dump shows that 'usrm::manager' is
> >running , so how do I stop this service??
> >
> >Service          Name                              GID LID State
> >Code
> >User:            "usrm::manager"                     2   3 run       -
> >[1 2]
> >
> >
> >
> >If you anything about this 'usrm::manager' please let me know.
> >
> >Thanks & Regards
> >Shailesh p Shirali
> >
> >
> >
> >--
> >Linux-cluster mailing list
> >Linux-cluster at redhat.com
> >https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> _________________________________________________________________
> More photos, more messages, more storageget 2GB with Windows Live Hotmail. 
> http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_2G_0507
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 



From kristoffer.lippert at jppol.dk  Mon Jun  4 08:59:20 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Mon, 4 Jun 2007 10:59:20 +0200
Subject: [Linux-cluster] GFS2 problem.
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk>

Hi,

I get a slightly strange problem:

I'm setting up a new 2. node cluster, and one of my nodes gives this in
the message log:

Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 61
Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1: about
to withdraw this file system
Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
telling LM to withdraw

After the error, the san is locked. I can't get the second node off the
lock. I did a ls of /sandata (where the san is mounted), and i can't
kill the process. 
I cannot even reboot the server with reboot. I have to powercycle it to
get it back online.

I found one similar error on google, posted to this list, but i did not
find any replys. 
Have anyone got any clues? 


Mvh / Kind regards

Kristoffer Lippert
Systemansvarlig
JP/Politiken A/S
Online Magasiner

Tlf. +45 8738 3032
Cell. +45 6062 8703

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070604/f05dfb23/attachment.htm>

From swhiteho at redhat.com  Mon Jun  4 09:00:13 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 04 Jun 2007 10:00:13 +0100
Subject: [Linux-cluster] GFS2 problem.
In-Reply-To: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk>
Message-ID: <1180947613.25918.120.camel@quoit>

Hi,

What version of GFS2 are you using? This has been seen before, but not
for a fair time now, so I suspect you might be using old code,

Steve.

On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> Hi,
> 
> I get a slightly strange problem:
> 
> I'm setting up a new 2. node cluster, and one of my nodes gives this
> in the message log:
> 
> Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> 
> Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 61
> 
> Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> about to withdraw this file system 
> Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> telling LM to withdraw
> 
> After the error, the san is locked. I can't get the second node off
> the lock. I did a ls of /sandata (where the san is mounted), and i
> can't kill the process. 
> 
> I cannot even reboot the server with reboot. I have to powercycle it
> to get it back online.
> 
> I found one similar error on google, posted to this list, but i did
> not find any replys.  
> Have anyone got any clues? 
> 
> 
> Mvh / Kind regards
> 
> Kristoffer Lippert 
> Systemansvarlig 
> JP/Politiken A/S 
> Online Magasiner
> 
> Tlf. +45 8738 3032 
> Cell. +45 6062 8703
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From kristoffer.lippert at jppol.dk  Mon Jun  4 09:13:02 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Mon, 4 Jun 2007 11:13:02 +0200
Subject: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <1180947613.25918.120.camel@quoit>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk>
	<1180947613.25918.120.camel@quoit>
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk>

Hi,

I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the actual version of GFS2 is the one included in cman-2.0.64-1.el5
 
According to the message log its:
Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) installed


Is there a newer version?

Mvh / Kind regards

Kristoffer Lippert
Systemansvarlig
JP/Politiken A/S
Online Magasiner

Tlf. +45 8738 3032
Cell. +45 6062 8703

-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
Sendt: 4. juni 2007 11:00
Til: linux clustering
Emne: Re: [Linux-cluster] GFS2 problem.

Hi,

What version of GFS2 are you using? This has been seen before, but not for a fair time now, so I suspect you might be using old code,

Steve.

On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> Hi,
> 
> I get a slightly strange problem:
> 
> I'm setting up a new 2. node cluster, and one of my nodes gives this 
> in the message log:
> 
> Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> 
> Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 61
> 
> Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> about to withdraw this file system
> Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> telling LM to withdraw
> 
> After the error, the san is locked. I can't get the second node off 
> the lock. I did a ls of /sandata (where the san is mounted), and i 
> can't kill the process.
> 
> I cannot even reboot the server with reboot. I have to powercycle it 
> to get it back online.
> 
> I found one similar error on google, posted to this list, but i did 
> not find any replys.
> Have anyone got any clues? 
> 
> 
> Mvh / Kind regards
> 
> Kristoffer Lippert
> Systemansvarlig
> JP/Politiken A/S
> Online Magasiner
> 
> Tlf. +45 8738 3032
> Cell. +45 6062 8703
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From swhiteho at redhat.com  Mon Jun  4 09:13:40 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 04 Jun 2007 10:13:40 +0100
Subject: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk>
	<1180947613.25918.120.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk>
Message-ID: <1180948420.25918.125.camel@quoit>

Hi,

I was thinking more of the kernel version rather than the tools version.
What does uname -a say? The fact that it was built on the 4th May
suggests that it doesn't have all the latest bug fixes in. Genernally
I'd suggest following the upstream Linus' kernels (or the -nmw git tree)
to get the latest GFS2 kernels.

Fedora 7 is fairly uptodate now, and FC5/6 had updates relatively
recently too. There will also be uptodate GFS2 code appearing in the
forthcoming RHEL 5.1,

Steve.

On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> Hi,
> 
> I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the actual version of GFS2 is the one included in cman-2.0.64-1.el5
>  
> According to the message log its:
> Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) installed
> 
> 
> Is there a newer version?
> 
> Mvh / Kind regards
> 
> Kristoffer Lippert
> Systemansvarlig
> JP/Politiken A/S
> Online Magasiner
> 
> Tlf. +45 8738 3032
> Cell. +45 6062 8703
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
> Sendt: 4. juni 2007 11:00
> Til: linux clustering
> Emne: Re: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> What version of GFS2 are you using? This has been seen before, but not for a fair time now, so I suspect you might be using old code,
> 
> Steve.
> 
> On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > Hi,
> > 
> > I get a slightly strange problem:
> > 
> > I'm setting up a new 2. node cluster, and one of my nodes gives this 
> > in the message log:
> > 
> > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > 
> > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 61
> > 
> > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > about to withdraw this file system
> > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > telling LM to withdraw
> > 
> > After the error, the san is locked. I can't get the second node off 
> > the lock. I did a ls of /sandata (where the san is mounted), and i 
> > can't kill the process.
> > 
> > I cannot even reboot the server with reboot. I have to powercycle it 
> > to get it back online.
> > 
> > I found one similar error on google, posted to this list, but i did 
> > not find any replys.
> > Have anyone got any clues? 
> > 
> > 
> > Mvh / Kind regards
> > 
> > Kristoffer Lippert
> > Systemansvarlig
> > JP/Politiken A/S
> > Online Magasiner
> > 
> > Tlf. +45 8738 3032
> > Cell. +45 6062 8703
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From kristoffer.lippert at jppol.dk  Mon Jun  4 09:29:34 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Mon, 4 Jun 2007 11:29:34 +0200
Subject: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <1180948420.25918.125.camel@quoit>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk><1180947613.25918.120.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk>
	<1180948420.25918.125.camel@quoit>
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk>

Ah.

It's RHEL 5. 
Linux app02.onlinemagasiner.com 2.6.18-8.1.4.el5 #1 SMP Fri May 4 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
 
I didn't know there was a 5.1 at the door. I'll try updating the system. 

Thanks a lot for your help.

/Kristoffer


-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
Sendt: 4. juni 2007 11:14
Til: linux clustering
Emne: Re: SV: [Linux-cluster] GFS2 problem.

Hi,

I was thinking more of the kernel version rather than the tools version.
What does uname -a say? The fact that it was built on the 4th May suggests that it doesn't have all the latest bug fixes in. Genernally I'd suggest following the upstream Linus' kernels (or the -nmw git tree) to get the latest GFS2 kernels.

Fedora 7 is fairly uptodate now, and FC5/6 had updates relatively recently too. There will also be uptodate GFS2 code appearing in the forthcoming RHEL 5.1,

Steve.

On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> Hi,
> 
> I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the actual 
> version of GFS2 is the one included in cman-2.0.64-1.el5
>  
> According to the message log its:
> Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) 
> installed
> 
> 
> Is there a newer version?
> 
> Mvh / Kind regards
> 
> Kristoffer Lippert
> Systemansvarlig
> JP/Politiken A/S
> Online Magasiner
> 
> Tlf. +45 8738 3032
> Cell. +45 6062 8703
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> Whitehouse
> Sendt: 4. juni 2007 11:00
> Til: linux clustering
> Emne: Re: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> What version of GFS2 are you using? This has been seen before, but not 
> for a fair time now, so I suspect you might be using old code,
> 
> Steve.
> 
> On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > Hi,
> > 
> > I get a slightly strange problem:
> > 
> > I'm setting up a new 2. node cluster, and one of my nodes gives this 
> > in the message log:
> > 
> > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > 
> > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 61
> > 
> > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > about to withdraw this file system
> > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > telling LM to withdraw
> > 
> > After the error, the san is locked. I can't get the second node off 
> > the lock. I did a ls of /sandata (where the san is mounted), and i 
> > can't kill the process.
> > 
> > I cannot even reboot the server with reboot. I have to powercycle it 
> > to get it back online.
> > 
> > I found one similar error on google, posted to this list, but i did 
> > not find any replys.
> > Have anyone got any clues? 
> > 
> > 
> > Mvh / Kind regards
> > 
> > Kristoffer Lippert
> > Systemansvarlig
> > JP/Politiken A/S
> > Online Magasiner
> > 
> > Tlf. +45 8738 3032
> > Cell. +45 6062 8703
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From swhiteho at redhat.com  Mon Jun  4 09:32:14 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 04 Jun 2007 10:32:14 +0100
Subject: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk>
	<1180947613.25918.120.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk>
	<1180948420.25918.125.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk>
Message-ID: <1180949534.25918.136.camel@quoit>

Hi,

RHEL 5 is quite a long way behind the current development. It will be a
little while before 5.1 is out (I'm not sure of the exact date) but it
should only be a few weeks or so at the most,

Steve.

On Mon, 2007-06-04 at 11:29 +0200, Kristoffer Lippert wrote:
> Ah.
> 
> It's RHEL 5. 
> Linux app02.onlinemagasiner.com 2.6.18-8.1.4.el5 #1 SMP Fri May 4 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>  
> I didn't know there was a 5.1 at the door. I'll try updating the system. 
> 
> Thanks a lot for your help.
> 
> /Kristoffer
> 
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
> Sendt: 4. juni 2007 11:14
> Til: linux clustering
> Emne: Re: SV: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> I was thinking more of the kernel version rather than the tools version.
> What does uname -a say? The fact that it was built on the 4th May suggests that it doesn't have all the latest bug fixes in. Genernally I'd suggest following the upstream Linus' kernels (or the -nmw git tree) to get the latest GFS2 kernels.
> 
> Fedora 7 is fairly uptodate now, and FC5/6 had updates relatively recently too. There will also be uptodate GFS2 code appearing in the forthcoming RHEL 5.1,
> 
> Steve.
> 
> On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> > Hi,
> > 
> > I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the actual 
> > version of GFS2 is the one included in cman-2.0.64-1.el5
> >  
> > According to the message log its:
> > Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) 
> > installed
> > 
> > 
> > Is there a newer version?
> > 
> > Mvh / Kind regards
> > 
> > Kristoffer Lippert
> > Systemansvarlig
> > JP/Politiken A/S
> > Online Magasiner
> > 
> > Tlf. +45 8738 3032
> > Cell. +45 6062 8703
> > 
> > -----Oprindelig meddelelse-----
> > Fra: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > Whitehouse
> > Sendt: 4. juni 2007 11:00
> > Til: linux clustering
> > Emne: Re: [Linux-cluster] GFS2 problem.
> > 
> > Hi,
> > 
> > What version of GFS2 are you using? This has been seen before, but not 
> > for a fair time now, so I suspect you might be using old code,
> > 
> > Steve.
> > 
> > On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > > Hi,
> > > 
> > > I get a slightly strange problem:
> > > 
> > > I'm setting up a new 2. node cluster, and one of my nodes gives this 
> > > in the message log:
> > > 
> > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > > 
> > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 61
> > > 
> > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > about to withdraw this file system
> > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > telling LM to withdraw
> > > 
> > > After the error, the san is locked. I can't get the second node off 
> > > the lock. I did a ls of /sandata (where the san is mounted), and i 
> > > can't kill the process.
> > > 
> > > I cannot even reboot the server with reboot. I have to powercycle it 
> > > to get it back online.
> > > 
> > > I found one similar error on google, posted to this list, but i did 
> > > not find any replys.
> > > Have anyone got any clues? 
> > > 
> > > 
> > > Mvh / Kind regards
> > > 
> > > Kristoffer Lippert
> > > Systemansvarlig
> > > JP/Politiken A/S
> > > Online Magasiner
> > > 
> > > Tlf. +45 8738 3032
> > > Cell. +45 6062 8703
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From kristoffer.lippert at jppol.dk  Mon Jun  4 10:50:31 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Mon, 4 Jun 2007 12:50:31 +0200
Subject: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <1180949534.25918.136.camel@quoit>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk><1180947613.25918.120.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk><1180948420.25918.125.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk>
	<1180949534.25918.136.camel@quoit>
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C5@exchsrv04.rootdom.dk>

Ok, nothing newer than what i'm running now from Red Hat.

If i take the git branch:
GIT tree (recent upstream kernel development for gfs2 and dlm):
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6.git

Am I then required to recompile the kernel, or can i compile a new gfs2 alone?

Regards
Kristoffer


-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
Sendt: 4. juni 2007 11:32
Til: linux clustering
Emne: Re: SV: SV: [Linux-cluster] GFS2 problem.

Hi,

RHEL 5 is quite a long way behind the current development. It will be a little while before 5.1 is out (I'm not sure of the exact date) but it should only be a few weeks or so at the most,

Steve.

On Mon, 2007-06-04 at 11:29 +0200, Kristoffer Lippert wrote:
> Ah.
> 
> It's RHEL 5. 
> Linux app02.onlinemagasiner.com 2.6.18-8.1.4.el5 #1 SMP Fri May 4 
> 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>  
> I didn't know there was a 5.1 at the door. I'll try updating the system. 
> 
> Thanks a lot for your help.
> 
> /Kristoffer
> 
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> Whitehouse
> Sendt: 4. juni 2007 11:14
> Til: linux clustering
> Emne: Re: SV: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> I was thinking more of the kernel version rather than the tools version.
> What does uname -a say? The fact that it was built on the 4th May suggests that it doesn't have all the latest bug fixes in. Genernally I'd suggest following the upstream Linus' kernels (or the -nmw git tree) to get the latest GFS2 kernels.
> 
> Fedora 7 is fairly uptodate now, and FC5/6 had updates relatively 
> recently too. There will also be uptodate GFS2 code appearing in the 
> forthcoming RHEL 5.1,
> 
> Steve.
> 
> On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> > Hi,
> > 
> > I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the actual 
> > version of GFS2 is the one included in cman-2.0.64-1.el5
> >  
> > According to the message log its:
> > Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) 
> > installed
> > 
> > 
> > Is there a newer version?
> > 
> > Mvh / Kind regards
> > 
> > Kristoffer Lippert
> > Systemansvarlig
> > JP/Politiken A/S
> > Online Magasiner
> > 
> > Tlf. +45 8738 3032
> > Cell. +45 6062 8703
> > 
> > -----Oprindelig meddelelse-----
> > Fra: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > Whitehouse
> > Sendt: 4. juni 2007 11:00
> > Til: linux clustering
> > Emne: Re: [Linux-cluster] GFS2 problem.
> > 
> > Hi,
> > 
> > What version of GFS2 are you using? This has been seen before, but 
> > not for a fair time now, so I suspect you might be using old code,
> > 
> > Steve.
> > 
> > On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > > Hi,
> > > 
> > > I get a slightly strange problem:
> > > 
> > > I'm setting up a new 2. node cluster, and one of my nodes gives 
> > > this in the message log:
> > > 
> > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > > 
> > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 61
> > > 
> > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > about to withdraw this file system Jun  4 10:10:33 app01 kernel: 
> > > GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > telling LM to withdraw
> > > 
> > > After the error, the san is locked. I can't get the second node 
> > > off the lock. I did a ls of /sandata (where the san is mounted), 
> > > and i can't kill the process.
> > > 
> > > I cannot even reboot the server with reboot. I have to powercycle 
> > > it to get it back online.
> > > 
> > > I found one similar error on google, posted to this list, but i 
> > > did not find any replys.
> > > Have anyone got any clues? 
> > > 
> > > 
> > > Mvh / Kind regards
> > > 
> > > Kristoffer Lippert
> > > Systemansvarlig
> > > JP/Politiken A/S
> > > Online Magasiner
> > > 
> > > Tlf. +45 8738 3032
> > > Cell. +45 6062 8703
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From swhiteho at redhat.com  Mon Jun  4 10:48:01 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 04 Jun 2007 11:48:01 +0100
Subject: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C5@exchsrv04.rootdom.dk>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk>
	<1180947613.25918.120.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk>
	<1180948420.25918.125.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk>
	<1180949534.25918.136.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C5@exchsrv04.rootdom.dk>
Message-ID: <1180954081.25918.150.camel@quoit>

Hi,

On Mon, 2007-06-04 at 12:50 +0200, Kristoffer Lippert wrote:
> Ok, nothing newer than what i'm running now from Red Hat.
> 
No, not yet I'm afraid as its only a demo feature in RHEL 5.0.

> If i take the git branch:
> GIT tree (recent upstream kernel development for gfs2 and dlm):
> git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6.git
> 
> Am I then required to recompile the kernel, or can i compile a new gfs2 alone?
> 
> Regards
> Kristoffer
> 
You'll almost certainly need to recompile a complete kernel. If you are
based on RHEL 5 that will be the case I'm afraid,

Steve.

> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
> Sendt: 4. juni 2007 11:32
> Til: linux clustering
> Emne: Re: SV: SV: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> RHEL 5 is quite a long way behind the current development. It will be a little while before 5.1 is out (I'm not sure of the exact date) but it should only be a few weeks or so at the most,
> 
> Steve.
> 
> On Mon, 2007-06-04 at 11:29 +0200, Kristoffer Lippert wrote:
> > Ah.
> > 
> > It's RHEL 5. 
> > Linux app02.onlinemagasiner.com 2.6.18-8.1.4.el5 #1 SMP Fri May 4 
> > 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> >  
> > I didn't know there was a 5.1 at the door. I'll try updating the system. 
> > 
> > Thanks a lot for your help.
> > 
> > /Kristoffer
> > 
> > 
> > -----Oprindelig meddelelse-----
> > Fra: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > Whitehouse
> > Sendt: 4. juni 2007 11:14
> > Til: linux clustering
> > Emne: Re: SV: [Linux-cluster] GFS2 problem.
> > 
> > Hi,
> > 
> > I was thinking more of the kernel version rather than the tools version.
> > What does uname -a say? The fact that it was built on the 4th May suggests that it doesn't have all the latest bug fixes in. Genernally I'd suggest following the upstream Linus' kernels (or the -nmw git tree) to get the latest GFS2 kernels.
> > 
> > Fedora 7 is fairly uptodate now, and FC5/6 had updates relatively 
> > recently too. There will also be uptodate GFS2 code appearing in the 
> > forthcoming RHEL 5.1,
> > 
> > Steve.
> > 
> > On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> > > Hi,
> > > 
> > > I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the actual 
> > > version of GFS2 is the one included in cman-2.0.64-1.el5
> > >  
> > > According to the message log its:
> > > Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) 
> > > installed
> > > 
> > > 
> > > Is there a newer version?
> > > 
> > > Mvh / Kind regards
> > > 
> > > Kristoffer Lippert
> > > Systemansvarlig
> > > JP/Politiken A/S
> > > Online Magasiner
> > > 
> > > Tlf. +45 8738 3032
> > > Cell. +45 6062 8703
> > > 
> > > -----Oprindelig meddelelse-----
> > > Fra: linux-cluster-bounces at redhat.com 
> > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > Whitehouse
> > > Sendt: 4. juni 2007 11:00
> > > Til: linux clustering
> > > Emne: Re: [Linux-cluster] GFS2 problem.
> > > 
> > > Hi,
> > > 
> > > What version of GFS2 are you using? This has been seen before, but 
> > > not for a fair time now, so I suspect you might be using old code,
> > > 
> > > Steve.
> > > 
> > > On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > > > Hi,
> > > > 
> > > > I get a slightly strange problem:
> > > > 
> > > > I'm setting up a new 2. node cluster, and one of my nodes gives 
> > > > this in the message log:
> > > > 
> > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > > > 
> > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 61
> > > > 
> > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > about to withdraw this file system Jun  4 10:10:33 app01 kernel: 
> > > > GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > telling LM to withdraw
> > > > 
> > > > After the error, the san is locked. I can't get the second node 
> > > > off the lock. I did a ls of /sandata (where the san is mounted), 
> > > > and i can't kill the process.
> > > > 
> > > > I cannot even reboot the server with reboot. I have to powercycle 
> > > > it to get it back online.
> > > > 
> > > > I found one similar error on google, posted to this list, but i 
> > > > did not find any replys.
> > > > Have anyone got any clues? 
> > > > 
> > > > 
> > > > Mvh / Kind regards
> > > > 
> > > > Kristoffer Lippert
> > > > Systemansvarlig
> > > > JP/Politiken A/S
> > > > Online Magasiner
> > > > 
> > > > Tlf. +45 8738 3032
> > > > Cell. +45 6062 8703
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From kristoffer.lippert at jppol.dk  Mon Jun  4 10:55:41 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Mon, 4 Jun 2007 12:55:41 +0200
Subject: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <1180954081.25918.150.camel@quoit>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk><1180947613.25918.120.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk><1180948420.25918.125.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk><1180949534.25918.136.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C5@exchsrv04.rootdom.dk>
	<1180954081.25918.150.camel@quoit>
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C6@exchsrv04.rootdom.dk>

Hi,

Do you know of any alternatives to gfs2 that'll work with a shared storage?

And again, thanks a lot for your help

Regards
/Kristoffer

-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
Sendt: 4. juni 2007 12:48
Til: linux clustering
Emne: Re: SV: SV: SV: [Linux-cluster] GFS2 problem.

Hi,

On Mon, 2007-06-04 at 12:50 +0200, Kristoffer Lippert wrote:
> Ok, nothing newer than what i'm running now from Red Hat.
> 
No, not yet I'm afraid as its only a demo feature in RHEL 5.0.

> If i take the git branch:
> GIT tree (recent upstream kernel development for gfs2 and dlm):
> git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6.git
> 
> Am I then required to recompile the kernel, or can i compile a new gfs2 alone?
> 
> Regards
> Kristoffer
> 
You'll almost certainly need to recompile a complete kernel. If you are based on RHEL 5 that will be the case I'm afraid,

Steve.

> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> Whitehouse
> Sendt: 4. juni 2007 11:32
> Til: linux clustering
> Emne: Re: SV: SV: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> RHEL 5 is quite a long way behind the current development. It will be 
> a little while before 5.1 is out (I'm not sure of the exact date) but 
> it should only be a few weeks or so at the most,
> 
> Steve.
> 
> On Mon, 2007-06-04 at 11:29 +0200, Kristoffer Lippert wrote:
> > Ah.
> > 
> > It's RHEL 5. 
> > Linux app02.onlinemagasiner.com 2.6.18-8.1.4.el5 #1 SMP Fri May 4
> > 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> >  
> > I didn't know there was a 5.1 at the door. I'll try updating the system. 
> > 
> > Thanks a lot for your help.
> > 
> > /Kristoffer
> > 
> > 
> > -----Oprindelig meddelelse-----
> > Fra: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > Whitehouse
> > Sendt: 4. juni 2007 11:14
> > Til: linux clustering
> > Emne: Re: SV: [Linux-cluster] GFS2 problem.
> > 
> > Hi,
> > 
> > I was thinking more of the kernel version rather than the tools version.
> > What does uname -a say? The fact that it was built on the 4th May suggests that it doesn't have all the latest bug fixes in. Genernally I'd suggest following the upstream Linus' kernels (or the -nmw git tree) to get the latest GFS2 kernels.
> > 
> > Fedora 7 is fairly uptodate now, and FC5/6 had updates relatively 
> > recently too. There will also be uptodate GFS2 code appearing in the 
> > forthcoming RHEL 5.1,
> > 
> > Steve.
> > 
> > On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> > > Hi,
> > > 
> > > I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the 
> > > actual version of GFS2 is the one included in cman-2.0.64-1.el5
> > >  
> > > According to the message log its:
> > > Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) 
> > > installed
> > > 
> > > 
> > > Is there a newer version?
> > > 
> > > Mvh / Kind regards
> > > 
> > > Kristoffer Lippert
> > > Systemansvarlig
> > > JP/Politiken A/S
> > > Online Magasiner
> > > 
> > > Tlf. +45 8738 3032
> > > Cell. +45 6062 8703
> > > 
> > > -----Oprindelig meddelelse-----
> > > Fra: linux-cluster-bounces at redhat.com 
> > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > Whitehouse
> > > Sendt: 4. juni 2007 11:00
> > > Til: linux clustering
> > > Emne: Re: [Linux-cluster] GFS2 problem.
> > > 
> > > Hi,
> > > 
> > > What version of GFS2 are you using? This has been seen before, but 
> > > not for a fair time now, so I suspect you might be using old code,
> > > 
> > > Steve.
> > > 
> > > On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > > > Hi,
> > > > 
> > > > I get a slightly strange problem:
> > > > 
> > > > I'm setting up a new 2. node cluster, and one of my nodes gives 
> > > > this in the message log:
> > > > 
> > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > > > 
> > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 
> > > > 61
> > > > 
> > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > about to withdraw this file system Jun  4 10:10:33 app01 kernel: 
> > > > GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > telling LM to withdraw
> > > > 
> > > > After the error, the san is locked. I can't get the second node 
> > > > off the lock. I did a ls of /sandata (where the san is mounted), 
> > > > and i can't kill the process.
> > > > 
> > > > I cannot even reboot the server with reboot. I have to 
> > > > powercycle it to get it back online.
> > > > 
> > > > I found one similar error on google, posted to this list, but i 
> > > > did not find any replys.
> > > > Have anyone got any clues? 
> > > > 
> > > > 
> > > > Mvh / Kind regards
> > > > 
> > > > Kristoffer Lippert
> > > > Systemansvarlig
> > > > JP/Politiken A/S
> > > > Online Magasiner
> > > > 
> > > > Tlf. +45 8738 3032
> > > > Cell. +45 6062 8703
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From swhiteho at redhat.com  Mon Jun  4 10:53:02 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 04 Jun 2007 11:53:02 +0100
Subject: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C6@exchsrv04.rootdom.dk>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk>
	<1180947613.25918.120.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk>
	<1180948420.25918.125.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk>
	<1180949534.25918.136.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C5@exchsrv04.rootdom.dk>
	<1180954081.25918.150.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C6@exchsrv04.rootdom.dk>
Message-ID: <1180954382.25918.152.camel@quoit>

Hi,

OCFS2 is the only other fs I can think of that is likely to have a
similar feature set,

Steve.

On Mon, 2007-06-04 at 12:55 +0200, Kristoffer Lippert wrote:
> Hi,
> 
> Do you know of any alternatives to gfs2 that'll work with a shared storage?
> 
> And again, thanks a lot for your help
> 
> Regards
> /Kristoffer
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
> Sendt: 4. juni 2007 12:48
> Til: linux clustering
> Emne: Re: SV: SV: SV: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> On Mon, 2007-06-04 at 12:50 +0200, Kristoffer Lippert wrote:
> > Ok, nothing newer than what i'm running now from Red Hat.
> > 
> No, not yet I'm afraid as its only a demo feature in RHEL 5.0.
> 
> > If i take the git branch:
> > GIT tree (recent upstream kernel development for gfs2 and dlm):
> > git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6.git
> > 
> > Am I then required to recompile the kernel, or can i compile a new gfs2 alone?
> > 
> > Regards
> > Kristoffer
> > 
> You'll almost certainly need to recompile a complete kernel. If you are based on RHEL 5 that will be the case I'm afraid,
> 
> Steve.
> 
> > 
> > -----Oprindelig meddelelse-----
> > Fra: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > Whitehouse
> > Sendt: 4. juni 2007 11:32
> > Til: linux clustering
> > Emne: Re: SV: SV: [Linux-cluster] GFS2 problem.
> > 
> > Hi,
> > 
> > RHEL 5 is quite a long way behind the current development. It will be 
> > a little while before 5.1 is out (I'm not sure of the exact date) but 
> > it should only be a few weeks or so at the most,
> > 
> > Steve.
> > 
> > On Mon, 2007-06-04 at 11:29 +0200, Kristoffer Lippert wrote:
> > > Ah.
> > > 
> > > It's RHEL 5. 
> > > Linux app02.onlinemagasiner.com 2.6.18-8.1.4.el5 #1 SMP Fri May 4
> > > 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> > >  
> > > I didn't know there was a 5.1 at the door. I'll try updating the system. 
> > > 
> > > Thanks a lot for your help.
> > > 
> > > /Kristoffer
> > > 
> > > 
> > > -----Oprindelig meddelelse-----
> > > Fra: linux-cluster-bounces at redhat.com 
> > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > Whitehouse
> > > Sendt: 4. juni 2007 11:14
> > > Til: linux clustering
> > > Emne: Re: SV: [Linux-cluster] GFS2 problem.
> > > 
> > > Hi,
> > > 
> > > I was thinking more of the kernel version rather than the tools version.
> > > What does uname -a say? The fact that it was built on the 4th May suggests that it doesn't have all the latest bug fixes in. Genernally I'd suggest following the upstream Linus' kernels (or the -nmw git tree) to get the latest GFS2 kernels.
> > > 
> > > Fedora 7 is fairly uptodate now, and FC5/6 had updates relatively 
> > > recently too. There will also be uptodate GFS2 code appearing in the 
> > > forthcoming RHEL 5.1,
> > > 
> > > Steve.
> > > 
> > > On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> > > > Hi,
> > > > 
> > > > I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the 
> > > > actual version of GFS2 is the one included in cman-2.0.64-1.el5
> > > >  
> > > > According to the message log its:
> > > > Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) 
> > > > installed
> > > > 
> > > > 
> > > > Is there a newer version?
> > > > 
> > > > Mvh / Kind regards
> > > > 
> > > > Kristoffer Lippert
> > > > Systemansvarlig
> > > > JP/Politiken A/S
> > > > Online Magasiner
> > > > 
> > > > Tlf. +45 8738 3032
> > > > Cell. +45 6062 8703
> > > > 
> > > > -----Oprindelig meddelelse-----
> > > > Fra: linux-cluster-bounces at redhat.com 
> > > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > > Whitehouse
> > > > Sendt: 4. juni 2007 11:00
> > > > Til: linux clustering
> > > > Emne: Re: [Linux-cluster] GFS2 problem.
> > > > 
> > > > Hi,
> > > > 
> > > > What version of GFS2 are you using? This has been seen before, but 
> > > > not for a fair time now, so I suspect you might be using old code,
> > > > 
> > > > Steve.
> > > > 
> > > > On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > > > > Hi,
> > > > > 
> > > > > I get a slightly strange problem:
> > > > > 
> > > > > I'm setting up a new 2. node cluster, and one of my nodes gives 
> > > > > this in the message log:
> > > > > 
> > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > > > > 
> > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > function = glock_lo_after_commit, file = fs/gfs2/lops.c, line = 
> > > > > 61
> > > > > 
> > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > about to withdraw this file system Jun  4 10:10:33 app01 kernel: 
> > > > > GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > telling LM to withdraw
> > > > > 
> > > > > After the error, the san is locked. I can't get the second node 
> > > > > off the lock. I did a ls of /sandata (where the san is mounted), 
> > > > > and i can't kill the process.
> > > > > 
> > > > > I cannot even reboot the server with reboot. I have to 
> > > > > powercycle it to get it back online.
> > > > > 
> > > > > I found one similar error on google, posted to this list, but i 
> > > > > did not find any replys.
> > > > > Have anyone got any clues? 
> > > > > 
> > > > > 
> > > > > Mvh / Kind regards
> > > > > 
> > > > > Kristoffer Lippert
> > > > > Systemansvarlig
> > > > > JP/Politiken A/S
> > > > > Online Magasiner
> > > > > 
> > > > > Tlf. +45 8738 3032
> > > > > Cell. +45 6062 8703
> > > > > 
> > > > > --
> > > > > Linux-cluster mailing list
> > > > > Linux-cluster at redhat.com
> > > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From kristoffer.lippert at jppol.dk  Mon Jun  4 11:27:15 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Mon, 4 Jun 2007 13:27:15 +0200
Subject: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <1180954382.25918.152.camel@quoit>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk><1180947613.25918.120.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk><1180948420.25918.125.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk><1180949534.25918.136.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C5@exchsrv04.rootdom.dk><1180954081.25918.150.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C6@exchsrv04.rootdom.dk>
	<1180954382.25918.152.camel@quoit>
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C8@exchsrv04.rootdom.dk>

Hi,

Do you know of any simpler solution to make a san avaliable on two (or more) servers? Is GFS-6.x an option?

Kind Regards
Kristoffer
 

-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
Sendt: 4. juni 2007 12:53
Til: linux clustering
Emne: Re: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.

Hi,

OCFS2 is the only other fs I can think of that is likely to have a similar feature set,

Steve.

On Mon, 2007-06-04 at 12:55 +0200, Kristoffer Lippert wrote:
> Hi,
> 
> Do you know of any alternatives to gfs2 that'll work with a shared storage?
> 
> And again, thanks a lot for your help
> 
> Regards
> /Kristoffer
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> Whitehouse
> Sendt: 4. juni 2007 12:48
> Til: linux clustering
> Emne: Re: SV: SV: SV: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> On Mon, 2007-06-04 at 12:50 +0200, Kristoffer Lippert wrote:
> > Ok, nothing newer than what i'm running now from Red Hat.
> > 
> No, not yet I'm afraid as its only a demo feature in RHEL 5.0.
> 
> > If i take the git branch:
> > GIT tree (recent upstream kernel development for gfs2 and dlm):
> > git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6.git
> > 
> > Am I then required to recompile the kernel, or can i compile a new gfs2 alone?
> > 
> > Regards
> > Kristoffer
> > 
> You'll almost certainly need to recompile a complete kernel. If you 
> are based on RHEL 5 that will be the case I'm afraid,
> 
> Steve.
> 
> > 
> > -----Oprindelig meddelelse-----
> > Fra: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > Whitehouse
> > Sendt: 4. juni 2007 11:32
> > Til: linux clustering
> > Emne: Re: SV: SV: [Linux-cluster] GFS2 problem.
> > 
> > Hi,
> > 
> > RHEL 5 is quite a long way behind the current development. It will 
> > be a little while before 5.1 is out (I'm not sure of the exact date) 
> > but it should only be a few weeks or so at the most,
> > 
> > Steve.
> > 
> > On Mon, 2007-06-04 at 11:29 +0200, Kristoffer Lippert wrote:
> > > Ah.
> > > 
> > > It's RHEL 5. 
> > > Linux app02.onlinemagasiner.com 2.6.18-8.1.4.el5 #1 SMP Fri May 4
> > > 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> > >  
> > > I didn't know there was a 5.1 at the door. I'll try updating the system. 
> > > 
> > > Thanks a lot for your help.
> > > 
> > > /Kristoffer
> > > 
> > > 
> > > -----Oprindelig meddelelse-----
> > > Fra: linux-cluster-bounces at redhat.com 
> > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > Whitehouse
> > > Sendt: 4. juni 2007 11:14
> > > Til: linux clustering
> > > Emne: Re: SV: [Linux-cluster] GFS2 problem.
> > > 
> > > Hi,
> > > 
> > > I was thinking more of the kernel version rather than the tools version.
> > > What does uname -a say? The fact that it was built on the 4th May suggests that it doesn't have all the latest bug fixes in. Genernally I'd suggest following the upstream Linus' kernels (or the -nmw git tree) to get the latest GFS2 kernels.
> > > 
> > > Fedora 7 is fairly uptodate now, and FC5/6 had updates relatively 
> > > recently too. There will also be uptodate GFS2 code appearing in 
> > > the forthcoming RHEL 5.1,
> > > 
> > > Steve.
> > > 
> > > On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> > > > Hi,
> > > > 
> > > > I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the 
> > > > actual version of GFS2 is the one included in cman-2.0.64-1.el5
> > > >  
> > > > According to the message log its:
> > > > Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) 
> > > > installed
> > > > 
> > > > 
> > > > Is there a newer version?
> > > > 
> > > > Mvh / Kind regards
> > > > 
> > > > Kristoffer Lippert
> > > > Systemansvarlig
> > > > JP/Politiken A/S
> > > > Online Magasiner
> > > > 
> > > > Tlf. +45 8738 3032
> > > > Cell. +45 6062 8703
> > > > 
> > > > -----Oprindelig meddelelse-----
> > > > Fra: linux-cluster-bounces at redhat.com 
> > > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > > Whitehouse
> > > > Sendt: 4. juni 2007 11:00
> > > > Til: linux clustering
> > > > Emne: Re: [Linux-cluster] GFS2 problem.
> > > > 
> > > > Hi,
> > > > 
> > > > What version of GFS2 are you using? This has been seen before, 
> > > > but not for a fair time now, so I suspect you might be using old 
> > > > code,
> > > > 
> > > > Steve.
> > > > 
> > > > On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > > > > Hi,
> > > > > 
> > > > > I get a slightly strange problem:
> > > > > 
> > > > > I'm setting up a new 2. node cluster, and one of my nodes 
> > > > > gives this in the message log:
> > > > > 
> > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > > > > 
> > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > function = glock_lo_after_commit, file = fs/gfs2/lops.c, line 
> > > > > =
> > > > > 61
> > > > > 
> > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > about to withdraw this file system Jun  4 10:10:33 app01 kernel: 
> > > > > GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > telling LM to withdraw
> > > > > 
> > > > > After the error, the san is locked. I can't get the second 
> > > > > node off the lock. I did a ls of /sandata (where the san is 
> > > > > mounted), and i can't kill the process.
> > > > > 
> > > > > I cannot even reboot the server with reboot. I have to 
> > > > > powercycle it to get it back online.
> > > > > 
> > > > > I found one similar error on google, posted to this list, but 
> > > > > i did not find any replys.
> > > > > Have anyone got any clues? 
> > > > > 
> > > > > 
> > > > > Mvh / Kind regards
> > > > > 
> > > > > Kristoffer Lippert
> > > > > Systemansvarlig
> > > > > JP/Politiken A/S
> > > > > Online Magasiner
> > > > > 
> > > > > Tlf. +45 8738 3032
> > > > > Cell. +45 6062 8703
> > > > > 
> > > > > --
> > > > > Linux-cluster mailing list
> > > > > Linux-cluster at redhat.com
> > > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From swhiteho at redhat.com  Mon Jun  4 11:38:55 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 04 Jun 2007 12:38:55 +0100
Subject: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C8@exchsrv04.rootdom.dk>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk>
	<1180947613.25918.120.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk>
	<1180948420.25918.125.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk>
	<1180949534.25918.136.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C5@exchsrv04.rootdom.dk>
	<1180954081.25918.150.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C6@exchsrv04.rootdom.dk>
	<1180954382.25918.152.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5C8@exchsrv04.rootdom.dk>
Message-ID: <1180957135.25918.154.camel@quoit>

Hi,

Yes, the older GFS is an option too,

Steve.

On Mon, 2007-06-04 at 13:27 +0200, Kristoffer Lippert wrote:
> Hi,
> 
> Do you know of any simpler solution to make a san avaliable on two (or more) servers? Is GFS-6.x an option?
> 
> Kind Regards
> Kristoffer
>  
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
> Sendt: 4. juni 2007 12:53
> Til: linux clustering
> Emne: Re: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> OCFS2 is the only other fs I can think of that is likely to have a similar feature set,
> 
> Steve.
> 
> On Mon, 2007-06-04 at 12:55 +0200, Kristoffer Lippert wrote:
> > Hi,
> > 
> > Do you know of any alternatives to gfs2 that'll work with a shared storage?
> > 
> > And again, thanks a lot for your help
> > 
> > Regards
> > /Kristoffer
> > 
> > -----Oprindelig meddelelse-----
> > Fra: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > Whitehouse
> > Sendt: 4. juni 2007 12:48
> > Til: linux clustering
> > Emne: Re: SV: SV: SV: [Linux-cluster] GFS2 problem.
> > 
> > Hi,
> > 
> > On Mon, 2007-06-04 at 12:50 +0200, Kristoffer Lippert wrote:
> > > Ok, nothing newer than what i'm running now from Red Hat.
> > > 
> > No, not yet I'm afraid as its only a demo feature in RHEL 5.0.
> > 
> > > If i take the git branch:
> > > GIT tree (recent upstream kernel development for gfs2 and dlm):
> > > git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6.git
> > > 
> > > Am I then required to recompile the kernel, or can i compile a new gfs2 alone?
> > > 
> > > Regards
> > > Kristoffer
> > > 
> > You'll almost certainly need to recompile a complete kernel. If you 
> > are based on RHEL 5 that will be the case I'm afraid,
> > 
> > Steve.
> > 
> > > 
> > > -----Oprindelig meddelelse-----
> > > Fra: linux-cluster-bounces at redhat.com 
> > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > Whitehouse
> > > Sendt: 4. juni 2007 11:32
> > > Til: linux clustering
> > > Emne: Re: SV: SV: [Linux-cluster] GFS2 problem.
> > > 
> > > Hi,
> > > 
> > > RHEL 5 is quite a long way behind the current development. It will 
> > > be a little while before 5.1 is out (I'm not sure of the exact date) 
> > > but it should only be a few weeks or so at the most,
> > > 
> > > Steve.
> > > 
> > > On Mon, 2007-06-04 at 11:29 +0200, Kristoffer Lippert wrote:
> > > > Ah.
> > > > 
> > > > It's RHEL 5. 
> > > > Linux app02.onlinemagasiner.com 2.6.18-8.1.4.el5 #1 SMP Fri May 4
> > > > 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> > > >  
> > > > I didn't know there was a 5.1 at the door. I'll try updating the system. 
> > > > 
> > > > Thanks a lot for your help.
> > > > 
> > > > /Kristoffer
> > > > 
> > > > 
> > > > -----Oprindelig meddelelse-----
> > > > Fra: linux-cluster-bounces at redhat.com 
> > > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > > Whitehouse
> > > > Sendt: 4. juni 2007 11:14
> > > > Til: linux clustering
> > > > Emne: Re: SV: [Linux-cluster] GFS2 problem.
> > > > 
> > > > Hi,
> > > > 
> > > > I was thinking more of the kernel version rather than the tools version.
> > > > What does uname -a say? The fact that it was built on the 4th May suggests that it doesn't have all the latest bug fixes in. Genernally I'd suggest following the upstream Linus' kernels (or the -nmw git tree) to get the latest GFS2 kernels.
> > > > 
> > > > Fedora 7 is fairly uptodate now, and FC5/6 had updates relatively 
> > > > recently too. There will also be uptodate GFS2 code appearing in 
> > > > the forthcoming RHEL 5.1,
> > > > 
> > > > Steve.
> > > > 
> > > > On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> > > > > Hi,
> > > > > 
> > > > > I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the 
> > > > > actual version of GFS2 is the one included in cman-2.0.64-1.el5
> > > > >  
> > > > > According to the message log its:
> > > > > Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 22:16:07) 
> > > > > installed
> > > > > 
> > > > > 
> > > > > Is there a newer version?
> > > > > 
> > > > > Mvh / Kind regards
> > > > > 
> > > > > Kristoffer Lippert
> > > > > Systemansvarlig
> > > > > JP/Politiken A/S
> > > > > Online Magasiner
> > > > > 
> > > > > Tlf. +45 8738 3032
> > > > > Cell. +45 6062 8703
> > > > > 
> > > > > -----Oprindelig meddelelse-----
> > > > > Fra: linux-cluster-bounces at redhat.com 
> > > > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > > > Whitehouse
> > > > > Sendt: 4. juni 2007 11:00
> > > > > Til: linux clustering
> > > > > Emne: Re: [Linux-cluster] GFS2 problem.
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > What version of GFS2 are you using? This has been seen before, 
> > > > > but not for a fair time now, so I suspect you might be using old 
> > > > > code,
> > > > > 
> > > > > Steve.
> > > > > 
> > > > > On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I get a slightly strange problem:
> > > > > > 
> > > > > > I'm setting up a new 2. node cluster, and one of my nodes 
> > > > > > gives this in the message log:
> > > > > > 
> > > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > > > > > 
> > > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > > function = glock_lo_after_commit, file = fs/gfs2/lops.c, line 
> > > > > > =
> > > > > > 61
> > > > > > 
> > > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > > about to withdraw this file system Jun  4 10:10:33 app01 kernel: 
> > > > > > GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > > telling LM to withdraw
> > > > > > 
> > > > > > After the error, the san is locked. I can't get the second 
> > > > > > node off the lock. I did a ls of /sandata (where the san is 
> > > > > > mounted), and i can't kill the process.
> > > > > > 
> > > > > > I cannot even reboot the server with reboot. I have to 
> > > > > > powercycle it to get it back online.
> > > > > > 
> > > > > > I found one similar error on google, posted to this list, but 
> > > > > > i did not find any replys.
> > > > > > Have anyone got any clues? 
> > > > > > 
> > > > > > 
> > > > > > Mvh / Kind regards
> > > > > > 
> > > > > > Kristoffer Lippert
> > > > > > Systemansvarlig
> > > > > > JP/Politiken A/S
> > > > > > Online Magasiner
> > > > > > 
> > > > > > Tlf. +45 8738 3032
> > > > > > Cell. +45 6062 8703
> > > > > > 
> > > > > > --
> > > > > > Linux-cluster mailing list
> > > > > > Linux-cluster at redhat.com
> > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > > 
> > > > > --
> > > > > Linux-cluster mailing list
> > > > > Linux-cluster at redhat.com
> > > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > > 
> > > > > --
> > > > > Linux-cluster mailing list
> > > > > Linux-cluster at redhat.com
> > > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From kristoffer.lippert at jppol.dk  Mon Jun  4 14:50:15 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Mon, 4 Jun 2007 16:50:15 +0200
Subject: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <1180957135.25918.154.camel@quoit>
References: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5C0@exchsrv04.rootdom.dk><1180947613.25918.120.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C1@exchsrv04.rootdom.dk><1180948420.25918.125.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C2@exchsrv04.rootdom.dk><1180949534.25918.136.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C5@exchsrv04.rootdom.dk><1180954081.25918.150.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C6@exchsrv04.rootdom.dk><1180954382.25918.152.camel@quoit><F358B15FDF5F1249AFA3B9C79250A0F403DDB5C8@exchsrv04.rootdom.dk>
	<1180957135.25918.154.camel@quoit>
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5CE@exchsrv04.rootdom.dk>

Hi,

I've looked at the options, and it seems that building a new kernel is the only way to go about it. 

RHEL 5.1 is scheduled for okt/nov, and that long we cannot wait.

Kind regards
Kristoffer

 

-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
Sendt: 4. juni 2007 13:39
Til: linux clustering
Emne: Re: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.

Hi,

Yes, the older GFS is an option too,

Steve.

On Mon, 2007-06-04 at 13:27 +0200, Kristoffer Lippert wrote:
> Hi,
> 
> Do you know of any simpler solution to make a san avaliable on two (or more) servers? Is GFS-6.x an option?
> 
> Kind Regards
> Kristoffer
>  
> 
> -----Oprindelig meddelelse-----
> Fra: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> Whitehouse
> Sendt: 4. juni 2007 12:53
> Til: linux clustering
> Emne: Re: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
> 
> Hi,
> 
> OCFS2 is the only other fs I can think of that is likely to have a 
> similar feature set,
> 
> Steve.
> 
> On Mon, 2007-06-04 at 12:55 +0200, Kristoffer Lippert wrote:
> > Hi,
> > 
> > Do you know of any alternatives to gfs2 that'll work with a shared storage?
> > 
> > And again, thanks a lot for your help
> > 
> > Regards
> > /Kristoffer
> > 
> > -----Oprindelig meddelelse-----
> > Fra: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > Whitehouse
> > Sendt: 4. juni 2007 12:48
> > Til: linux clustering
> > Emne: Re: SV: SV: SV: [Linux-cluster] GFS2 problem.
> > 
> > Hi,
> > 
> > On Mon, 2007-06-04 at 12:50 +0200, Kristoffer Lippert wrote:
> > > Ok, nothing newer than what i'm running now from Red Hat.
> > > 
> > No, not yet I'm afraid as its only a demo feature in RHEL 5.0.
> > 
> > > If i take the git branch:
> > > GIT tree (recent upstream kernel development for gfs2 and dlm):
> > > git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6.git
> > > 
> > > Am I then required to recompile the kernel, or can i compile a new gfs2 alone?
> > > 
> > > Regards
> > > Kristoffer
> > > 
> > You'll almost certainly need to recompile a complete kernel. If you 
> > are based on RHEL 5 that will be the case I'm afraid,
> > 
> > Steve.
> > 
> > > 
> > > -----Oprindelig meddelelse-----
> > > Fra: linux-cluster-bounces at redhat.com 
> > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > Whitehouse
> > > Sendt: 4. juni 2007 11:32
> > > Til: linux clustering
> > > Emne: Re: SV: SV: [Linux-cluster] GFS2 problem.
> > > 
> > > Hi,
> > > 
> > > RHEL 5 is quite a long way behind the current development. It will 
> > > be a little while before 5.1 is out (I'm not sure of the exact 
> > > date) but it should only be a few weeks or so at the most,
> > > 
> > > Steve.
> > > 
> > > On Mon, 2007-06-04 at 11:29 +0200, Kristoffer Lippert wrote:
> > > > Ah.
> > > > 
> > > > It's RHEL 5. 
> > > > Linux app02.onlinemagasiner.com 2.6.18-8.1.4.el5 #1 SMP Fri May 
> > > > 4
> > > > 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> > > >  
> > > > I didn't know there was a 5.1 at the door. I'll try updating the system. 
> > > > 
> > > > Thanks a lot for your help.
> > > > 
> > > > /Kristoffer
> > > > 
> > > > 
> > > > -----Oprindelig meddelelse-----
> > > > Fra: linux-cluster-bounces at redhat.com 
> > > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > > Whitehouse
> > > > Sendt: 4. juni 2007 11:14
> > > > Til: linux clustering
> > > > Emne: Re: SV: [Linux-cluster] GFS2 problem.
> > > > 
> > > > Hi,
> > > > 
> > > > I was thinking more of the kernel version rather than the tools version.
> > > > What does uname -a say? The fact that it was built on the 4th May suggests that it doesn't have all the latest bug fixes in. Genernally I'd suggest following the upstream Linus' kernels (or the -nmw git tree) to get the latest GFS2 kernels.
> > > > 
> > > > Fedora 7 is fairly uptodate now, and FC5/6 had updates 
> > > > relatively recently too. There will also be uptodate GFS2 code 
> > > > appearing in the forthcoming RHEL 5.1,
> > > > 
> > > > Steve.
> > > > 
> > > > On Mon, 2007-06-04 at 11:13 +0200, Kristoffer Lippert wrote:
> > > > > Hi,
> > > > > 
> > > > > I'm using version 0.1.25-1.el5 of gfs2-utils, but i guess the 
> > > > > actual version of GFS2 is the one included in 
> > > > > cman-2.0.64-1.el5
> > > > >  
> > > > > According to the message log its:
> > > > > Jun  4 08:37:23 app02 kernel: GFS2 (built May  4 2007 
> > > > > 22:16:07) installed
> > > > > 
> > > > > 
> > > > > Is there a newer version?
> > > > > 
> > > > > Mvh / Kind regards
> > > > > 
> > > > > Kristoffer Lippert
> > > > > Systemansvarlig
> > > > > JP/Politiken A/S
> > > > > Online Magasiner
> > > > > 
> > > > > Tlf. +45 8738 3032
> > > > > Cell. +45 6062 8703
> > > > > 
> > > > > -----Oprindelig meddelelse-----
> > > > > Fra: linux-cluster-bounces at redhat.com 
> > > > > [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven 
> > > > > Whitehouse
> > > > > Sendt: 4. juni 2007 11:00
> > > > > Til: linux clustering
> > > > > Emne: Re: [Linux-cluster] GFS2 problem.
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > What version of GFS2 are you using? This has been seen before, 
> > > > > but not for a fair time now, so I suspect you might be using 
> > > > > old code,
> > > > > 
> > > > > Steve.
> > > > > 
> > > > > On Mon, 2007-06-04 at 10:59 +0200, Kristoffer Lippert wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I get a slightly strange problem:
> > > > > > 
> > > > > > I'm setting up a new 2. node cluster, and one of my nodes 
> > > > > > gives this in the message log:
> > > > > > 
> > > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > > fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
> > > > > > 
> > > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > > function = glock_lo_after_commit, file = fs/gfs2/lops.c, 
> > > > > > line =
> > > > > > 61
> > > > > > 
> > > > > > Jun  4 10:10:33 app01 kernel: GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > > about to withdraw this file system Jun  4 10:10:33 app01 kernel: 
> > > > > > GFS2: fsid=onmagcluster:onmag_gfs.1:
> > > > > > telling LM to withdraw
> > > > > > 
> > > > > > After the error, the san is locked. I can't get the second 
> > > > > > node off the lock. I did a ls of /sandata (where the san is 
> > > > > > mounted), and i can't kill the process.
> > > > > > 
> > > > > > I cannot even reboot the server with reboot. I have to 
> > > > > > powercycle it to get it back online.
> > > > > > 
> > > > > > I found one similar error on google, posted to this list, 
> > > > > > but i did not find any replys.
> > > > > > Have anyone got any clues? 
> > > > > > 
> > > > > > 
> > > > > > Mvh / Kind regards
> > > > > > 
> > > > > > Kristoffer Lippert
> > > > > > Systemansvarlig
> > > > > > JP/Politiken A/S
> > > > > > Online Magasiner
> > > > > > 
> > > > > > Tlf. +45 8738 3032
> > > > > > Cell. +45 6062 8703
> > > > > > 
> > > > > > --
> > > > > > Linux-cluster mailing list
> > > > > > Linux-cluster at redhat.com
> > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > > 
> > > > > --
> > > > > Linux-cluster mailing list
> > > > > Linux-cluster at redhat.com
> > > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > > 
> > > > > --
> > > > > Linux-cluster mailing list
> > > > > Linux-cluster at redhat.com
> > > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From teigland at redhat.com  Mon Jun  4 14:57:15 2007
From: teigland at redhat.com (David Teigland)
Date: Mon, 4 Jun 2007 09:57:15 -0500
Subject: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5CE@exchsrv04.rootdom.dk>
References: <1180957135.25918.154.camel@quoit>
	<F358B15FDF5F1249AFA3B9C79250A0F403DDB5CE@exchsrv04.rootdom.dk>
Message-ID: <20070604145715.GC5597@redhat.com>

On Mon, Jun 04, 2007 at 04:50:15PM +0200, Kristoffer Lippert wrote:
> Hi,
> 
> I've looked at the options, and it seems that building a new kernel is
> the only way to go about it. 
> 
> RHEL 5.1 is scheduled for okt/nov, and that long we cannot wait.

GFS2 is still in preview/beta mode; you shouldn't use it for production
purposes yet (regardless of the kernel).  Just use GFS(1) instead.

Dave



From Arne.Brieseneck at vodafone.com  Mon Jun  4 15:05:33 2007
From: Arne.Brieseneck at vodafone.com (Brieseneck, Arne, VF-Group)
Date: Mon, 4 Jun 2007 17:05:33 +0200
Subject: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <20070604145715.GC5597@redhat.com>
Message-ID: <E67F1468BF7A4C418D874810215A377E9D0D6C@EITO-MBX01.internal.vodafone.com>

Hi David,

You know, that GFS1 is slow like hell... I truly can understand
Kristoffer when he wants to have GFS2. The throughput is much better.
On the other hand Kristoffer could use professional cluster filesystems
for PROD systems like Veritas FS. Afaik VFS is also available for RHEL.

Arne

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Teigland
Sent: Montag, 4. Juni 2007 16:57
To: Kristoffer Lippert
Cc: linux-cluster at redhat.com
Subject: Re: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.

On Mon, Jun 04, 2007 at 04:50:15PM +0200, Kristoffer Lippert wrote:
> Hi,
> 
> I've looked at the options, and it seems that building a new kernel is

> the only way to go about it.
> 
> RHEL 5.1 is scheduled for okt/nov, and that long we cannot wait.

GFS2 is still in preview/beta mode; you shouldn't use it for production
purposes yet (regardless of the kernel).  Just use GFS(1) instead.

Dave

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jamesm at xandros.com  Mon Jun  4 15:29:21 2007
From: jamesm at xandros.com (James McOrmond)
Date: Mon, 4 Jun 2007 11:29:21 -0400
Subject: [Linux-cluster] LVS+HA
In-Reply-To: <24fecab20706030705ya21deb7w89fab8eeb7303035@mail.gmail.com>
References: <24fecab20706030705ya21deb7w89fab8eeb7303035@mail.gmail.com>
Message-ID: <46642FD1.9020503@xandros.com>

You do not need to run the cluster suite on the two web servers unless 
you're using shared SAN storage...

The LVS router will handle the case where one web server goes down..

Satya Daragani wrote:
> Hi All,
>  
> 
> I am trying to build the following scenario using Redhat cluster suite 
> and LVS.
> 
> I will be using three nodes in the entire scenario, one will be used as 
> LVS Router and other two will be real servers (running httpd web service).
> 
> I want to configure the redhat cluster suite on both the real server 
> nodes for the failover of the httpd service (in active-active mode), is 
> it possible to do or is there any other way?
> 
>  
> 
> Kindly suggest me.
> 
> 
> 
> -- 
> Thanx
> Satya Daragani
> satya.daragani at gmail.com <mailto:satya.daragani at gmail.com>
> +91 98850 58366
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
James A. McOrmond (jamesm at xandros.com)
Hardware QA Lead & Network Administrator
Xandros Corporation, Ottawa, Canada.
Morpheus: ...after a century of war I remember that which matters most:
  *We are still HERE!*



From cdillard at rpstechnology.com  Mon Jun  4 15:53:01 2007
From: cdillard at rpstechnology.com (Clayton Dillard)
Date: Mon, 04 Jun 2007 11:53:01 -0400
Subject: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
Message-ID: <1180972381.6730.27.camel@jetfighter>

Hi Folks,
    I'm hoping for some advice here.  We have a hosting service where we
run a LAMP stack.  Currently our environment is comprised of a pair of
Apache servers which are load balanced via a pair of hardware balancers,
and a pair of MySQL servers that are in a Master/Slave configuration.
The two Apache servers use a common NFS export as shared storage and the
DB servers just use local disks.

I've been reading up on RH cluster services and GFS, and from what I
see, it sounds like a great solution for us, especially since we are
bringing on more and more customers and thus have higher loads and
availability requirements.

I'm wondering if anyone on this list is currently using RH Cluster
Services and GFS for a LAMP environment and if so, are there any
pitfalls to watch out for, or any tips that I should know before getting
started on the engineering phase.

I was reading an article this weekend on the RH site about using a
shared (GFS) root partition on SAN and booting up cluster nodes to this
shared root, the premise being that patches and config changes are done
once, to the clustered root on all servers, as opposed to each server
separately.  Does anyone here run such a configuration and does it work
well?

Also, I've not been able to find any good books that are focused mostly
on Red Hat clustering and GFS...  Does anyone know of any good books on
this subject?

Thank you,
-- 
Clayton Dillard <cdillard at rpstechnology.com>
RPS Technology, LLC
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070604/73ca0d67/attachment.htm>

From satya.daragani at gmail.com  Mon Jun  4 16:25:51 2007
From: satya.daragani at gmail.com (Satya Daragani)
Date: Mon, 4 Jun 2007 21:55:51 +0530
Subject: [Linux-cluster] LVS+HA
In-Reply-To: <46642FD1.9020503@xandros.com>
References: <24fecab20706030705ya21deb7w89fab8eeb7303035@mail.gmail.com>
	<46642FD1.9020503@xandros.com>
Message-ID: <24fecab20706040925u554cb25cifccbe6cf8859ccfe@mail.gmail.com>

Hi

Actually i am using the shared storage but not SAN, IBM DS400 connected to
both the nodes using fiber.

So in that case hw can i use the cluster suite?

Thank you.

Satya


On 6/4/07, James McOrmond <jamesm at xandros.com> wrote:
>
> You do not need to run the cluster suite on the two web servers unless
> you're using shared SAN storage...
>
> The LVS router will handle the case where one web server goes down..
>
> Satya Daragani wrote:
> > Hi All,
> >
> >
> > I am trying to build the following scenario using Redhat cluster suite
> > and LVS.
> >
> > I will be using three nodes in the entire scenario, one will be used as
> > LVS Router and other two will be real servers (running httpd web
> service).
> >
> > I want to configure the redhat cluster suite on both the real server
> > nodes for the failover of the httpd service (in active-active mode), is
> > it possible to do or is there any other way?
> >
> >
> >
> > Kindly suggest me.
> >
> >
> >
> > --
> > Thanx
> > Satya Daragani
> > satya.daragani at gmail.com <mailto:satya.daragani at gmail.com>
> > +91 98850 58366
> >
> >
> > ------------------------------------------------------------------------
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> James A. McOrmond (jamesm at xandros.com)
> Hardware QA Lead & Network Administrator
> Xandros Corporation, Ottawa, Canada.
> Morpheus: ...after a century of war I remember that which matters most:
> *We are still HERE!*
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Thanx
Satya Daragani
satya.daragani at gmail.com
+91 98850 58366
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070604/520ac1be/attachment.htm>

From Christian.Schaefer2 at messe-muenchen.de  Mon Jun  4 16:23:50 2007
From: Christian.Schaefer2 at messe-muenchen.de (Schaefer Christian)
Date: Mon, 4 Jun 2007 18:23:50 +0200
Subject: AW: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
In-Reply-To: <1180972381.6730.27.camel@jetfighter>
References: <1180972381.6730.27.camel@jetfighter>
Message-ID: <56825E6A1E167E41A80CE0F6F6916CAB023E58B7@MNTSVCL10E.mmgmuc.de>

Hallo Clayton,

> Does anyone here run such a configuration and does it work well?
Definitely yes. Munich trade fair center ist hosting the fair's websites itself. Due to availability and performance issues we have decided in 2005 to renew our server pool and migrate it to the diskless shared root concept to to achieve the aim, that by administrating only one server you could manage all servers. The topology mentioned consists of 8 PHP frontend servers responsing on the customers' requests if they were invoked by a pair of LVS/Keepaliced loadbalancers and a couple of backend servers and database clusters. MySQL is currently used in an actice/passive style, but is operated on an two node diskless shared root cluster as well.

 
> I'm wondering if anyone on this list is currently using RH 
> Cluster Services and GFS for a LAMP environment and if so, 
> are there any pitfalls to watch out for, or any tips that I 
> should know before getting started on the engineering phase.
As clusters are more complex than single host systems, a shared root environment is on step more complex. Hence you should precisely plan your environment to run, because there are more components involved, and some piece of software are not designed to run on such cluster environment. Especially PHP must be patched due to some session locking behavior, so that the webservers run more stable. Another example is writing logfiles. Here the applications run on the cluster should be aware of the GFS file locking mechanism. Using a single application log file on all of the cluster nodes can be a performance killer.


> I was reading an article this weekend on the RH site about 
> using a shared (GFS) root partition on SAN and booting up 
> cluster nodes to this shared root, the premise being that 
> patches and config changes are done once, to the clustered 
> root on all servers, as opposed to each server separately. 
This environment is developed and operated for the munich trade fair center by ATIX (http://www.atix.de). If you'd like to have more information about our environment, just reply to my message.


Greetings, 

Christian Sch?fer
Abt. IT-Applications
Messe M?nchen 

Tel.: +49 (0) 89 949 21985 
E-Mail: christian.schaefer2 at messe-muenchen.de 
WWW: www.messe-muenchen.de 

 

> -----Urspr?ngliche Nachricht-----
> Von: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von 
> Clayton Dillard
> Gesendet: Montag, 4. Juni 2007 17:53
> An: linux clustering
> Betreff: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
> 
> Hi Folks,
>     I'm hoping for some advice here.  We have a hosting 
> service where we run a LAMP stack.  Currently our environment 
> is comprised of a pair of Apache servers which are load 
> balanced via a pair of hardware balancers, and a pair of 
> MySQL servers that are in a Master/Slave configuration.  The 
> two Apache servers use a common NFS export as shared storage 
> and the DB servers just use local disks.
> 
> I've been reading up on RH cluster services and GFS, and from 
> what I see, it sounds like a great solution for us, 
> especially since we are bringing on more and more customers 
> and thus have higher loads and availability requirements.
> 
> I'm wondering if anyone on this list is currently using RH 
> Cluster Services and GFS for a LAMP environment and if so, 
> are there any pitfalls to watch out for, or any tips that I 
> should know before getting started on the engineering phase.
> 
> I was reading an article this weekend on the RH site about 
> using a shared (GFS) root partition on SAN and booting up 
> cluster nodes to this shared root, the premise being that 
> patches and config changes are done once, to the clustered 
> root on all servers, as opposed to each server separately.  
> Does anyone here run such a configuration and does it work well?
> 
> Also, I've not been able to find any good books that are 
> focused mostly on Red Hat clustering and GFS...  Does anyone 
> know of any good books on this subject?
> 
> Thank you,
> 
> -- 
> Clayton Dillard <cdillard at rpstechnology.com>
> RPS Technology, LLC 	
> 



From cdillard at rpstechnology.com  Mon Jun  4 16:52:34 2007
From: cdillard at rpstechnology.com (Clayton Dillard)
Date: Mon, 04 Jun 2007 12:52:34 -0400
Subject: AW: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
In-Reply-To: <56825E6A1E167E41A80CE0F6F6916CAB023E58B7@MNTSVCL10E.mmgmuc.de>
References: <1180972381.6730.27.camel@jetfighter>
	<56825E6A1E167E41A80CE0F6F6916CAB023E58B7@MNTSVCL10E.mmgmuc.de>
Message-ID: <1180975954.28815.3.camel@jetfighter>

Christian,
    Thank you for the notes and detail.  This sounds promising.  My only
concern is the level of complexity that this would introduce into our
environment.

There also doesn't seem to be much documentation on building diskless
Linux clusters that I've found.

Are you using specialized tools for the configuration?

Clay


On Mon, 2007-06-04 at 18:23 +0200, Schaefer Christian wrote:

> Hallo Clayton,
> 
> > Does anyone here run such a configuration and does it work well?
> Definitely yes. Munich trade fair center ist hosting the fair's websites itself. Due to availability and performance issues we have decided in 2005 to renew our server pool and migrate it to the diskless shared root concept to to achieve the aim, that by administrating only one server you could manage all servers. The topology mentioned consists of 8 PHP frontend servers responsing on the customers' requests if they were invoked by a pair of LVS/Keepaliced loadbalancers and a couple of backend servers and database clusters. MySQL is currently used in an actice/passive style, but is operated on an two node diskless shared root cluster as well.
> 
>  
> > I'm wondering if anyone on this list is currently using RH 
> > Cluster Services and GFS for a LAMP environment and if so, 
> > are there any pitfalls to watch out for, or any tips that I 
> > should know before getting started on the engineering phase.
> As clusters are more complex than single host systems, a shared root environment is on step more complex. Hence you should precisely plan your environment to run, because there are more components involved, and some piece of software are not designed to run on such cluster environment. Especially PHP must be patched due to some session locking behavior, so that the webservers run more stable. Another example is writing logfiles. Here the applications run on the cluster should be aware of the GFS file locking mechanism. Using a single application log file on all of the cluster nodes can be a performance killer.
> 
> 
> > I was reading an article this weekend on the RH site about 
> > using a shared (GFS) root partition on SAN and booting up 
> > cluster nodes to this shared root, the premise being that 
> > patches and config changes are done once, to the clustered 
> > root on all servers, as opposed to each server separately. 
> This environment is developed and operated for the munich trade fair center by ATIX (http://www.atix.de). If you'd like to have more information about our environment, just reply to my message.
> 
> 
> Greetings, 
> 
> Christian Sch?fer
> Abt. IT-Applications
> Messe M?nchen 
> 
> Tel.: +49 (0) 89 949 21985 
> E-Mail: christian.schaefer2 at messe-muenchen.de 
> WWW: www.messe-muenchen.de 
> 
> 
> 
> > -----Urspr?ngliche Nachricht-----
> > Von: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von 
> > Clayton Dillard
> > Gesendet: Montag, 4. Juni 2007 17:53
> > An: linux clustering
> > Betreff: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
> > 
> > Hi Folks,
> >     I'm hoping for some advice here.  We have a hosting 
> > service where we run a LAMP stack.  Currently our environment 
> > is comprised of a pair of Apache servers which are load 
> > balanced via a pair of hardware balancers, and a pair of 
> > MySQL servers that are in a Master/Slave configuration.  The 
> > two Apache servers use a common NFS export as shared storage 
> > and the DB servers just use local disks.
> > 
> > I've been reading up on RH cluster services and GFS, and from 
> > what I see, it sounds like a great solution for us, 
> > especially since we are bringing on more and more customers 
> > and thus have higher loads and availability requirements.
> > 
> > I'm wondering if anyone on this list is currently using RH 
> > Cluster Services and GFS for a LAMP environment and if so, 
> > are there any pitfalls to watch out for, or any tips that I 
> > should know before getting started on the engineering phase.
> > 
> > I was reading an article this weekend on the RH site about 
> > using a shared (GFS) root partition on SAN and booting up 
> > cluster nodes to this shared root, the premise being that 
> > patches and config changes are done once, to the clustered 
> > root on all servers, as opposed to each server separately.  
> > Does anyone here run such a configuration and does it work well?
> > 
> > Also, I've not been able to find any good books that are 
> > focused mostly on Red Hat clustering and GFS...  Does anyone 
> > know of any good books on this subject?
> > 
> > Thank you,
> > 
> > -- 
> > Clayton Dillard <cdillard at rpstechnology.com>
> > RPS Technology, LLC 	
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Clayton Dillard <cdillard at rpstechnology.com>
RPS Technology, LLC
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070604/048e58cd/attachment.htm>

From dist-list at LEXUM.UMontreal.CA  Mon Jun  4 19:47:33 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Mon, 04 Jun 2007 15:47:33 -0400
Subject: [Linux-cluster] Clusterng with GFS
In-Reply-To: <465D7C28.4050607@hcl.in>
References: <465D7C28.4050607@hcl.in>
Message-ID: <46646C55.3010409@lexum.umontreal.ca>

BTW, I do not think that postgresql on GFS is supported

Regards,


Daniel Fernanduz wrote:
>
> Hello
>
> I am trying to implement a failover cluster with shared storage
> concept (Postgresql Database 7.4).
> RHEL AS 4.0 + RHCS + GFS
>
> I am having two gfs volumes which gets mounted by cluster when any of
> the nodes in
> cluster takes control. The volumes are getting mounted by the cluster
> whenever a node takes
> control, but not getting unmounted when the node leaves the cluster.
> Especially when primary
> node takes control, the volumes mounted on the secondary node (Node
> which gives control to the
> primary node) is not getting unmounted. Anybody plz help me to solve
> this issue. Thanks for the help in advance
>
>
> Regards
> J. DANIEL
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From kanderso at redhat.com  Mon Jun  4 19:55:33 2007
From: kanderso at redhat.com (Kevin Anderson)
Date: Mon, 04 Jun 2007 14:55:33 -0500
Subject: [Linux-cluster] Clusterng with GFS
In-Reply-To: <46646C55.3010409@lexum.umontreal.ca>
References: <465D7C28.4050607@hcl.in>  <46646C55.3010409@lexum.umontreal.ca>
Message-ID: <1180986933.2757.44.camel@localhost.localdomain>

There are failover scripts for postgres in the source base
http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/rgmanager/src/resources/postgres-8.metadata?cvsroot=cluster

and 

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/rgmanager/src/resources/postgres-8.sh?cvsroot=cluster

You can certainly use active/passive postgres in conjunction with gfs as
the storage.  From that perspective, gfs is supported.  However, don't
think postgres will work in an active/active configuration.

Kevin

On Mon, 2007-06-04 at 15:47 -0400, FM wrote:
> BTW, I do not think that postgresql on GFS is supported
> 
> Regards,
> 
> 
> Daniel Fernanduz wrote:
> >
> > Hello
> >
> > I am trying to implement a failover cluster with shared storage
> > concept (Postgresql Database 7.4).
> > RHEL AS 4.0 + RHCS + GFS
> >
> > I am having two gfs volumes which gets mounted by cluster when any of
> > the nodes in
> > cluster takes control. The volumes are getting mounted by the cluster
> > whenever a node takes
> > control, but not getting unmounted when the node leaves the cluster.
> > Especially when primary
> > node takes control, the volumes mounted on the secondary node (Node
> > which gives control to the
> > primary node) is not getting unmounted. Anybody plz help me to solve
> > this issue. Thanks for the help in advance
> >
> >
> > Regards
> > J. DANIEL
> >
> > -- 
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070604/e16a4640/attachment.htm>

From jamesm at xandros.com  Mon Jun  4 21:33:05 2007
From: jamesm at xandros.com (James McOrmond)
Date: Mon, 4 Jun 2007 17:33:05 -0400
Subject: [Linux-cluster] LVS+HA
In-Reply-To: <24fecab20706040925u554cb25cifccbe6cf8859ccfe@mail.gmail.com>
References: <24fecab20706030705ya21deb7w89fab8eeb7303035@mail.gmail.com>
Message-ID: <46648511.5050909@xandros.com>


Satya Daragani wrote:
> Hi
>  
> Actually i am using the shared storage but not SAN, IBM DS400 connected 
> to both the nodes using fiber.

Uhmm, I think i'd consider that a SAN...

> So in that case hw can i use the cluster suite?

I would only setup the cluster suite to manage a shared GFS file system. 
  Alternatively you could use OCFS2 since you're only doing disk storage.

LVS would be on your router system and would handle web server outages..

-- 
James A. McOrmond (jamesm at xandros.com)
Hardware QA Lead & Network Administrator
Xandros Corporation, Ottawa, Canada.
Morpheus: ...after a century of war I remember that which matters most:
  *We are still HERE!*



From Christian.Schaefer2 at messe-muenchen.de  Mon Jun  4 21:54:16 2007
From: Christian.Schaefer2 at messe-muenchen.de (Schaefer Christian)
Date: Mon, 4 Jun 2007 23:54:16 +0200
Subject: AW: AW: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
In-Reply-To: <1180975954.28815.3.camel@jetfighter>
References: <1180972381.6730.27.camel@jetfighter><56825E6A1E167E41A80CE0F6F6916CAB023E58B7@MNTSVCL10E.mmgmuc.de>
	<1180975954.28815.3.camel@jetfighter>
Message-ID: <56825E6A1E167E41A80CE0F6F6916CAB0241A909@MNTSVCL10E.mmgmuc.de>

Hello Clay,

> My only concern is the level of complexity that 
> this would introduce into our environment.
Dealing with cluster infrastructures is never easy. Compared with the NFS server topology you currently use, the webservers will then access a local GFS filesystem for operating system purposes and several other GFS filesystems for common data (web application files or media files). Due to the fact, that each flesystem can be mounted on each of the nodes - provided each of the filesystems has enough journals - you just have to publish data to one single filesystem to be delivered to customers visiting your websites. In other words, the complexity added on the one side lasts in more simplification on the other side. 


> There also doesn't seem to be much documentation on building 
> diskless Linux clusters that I've found.
Please referr to http://sourceforge.net/project/showfiles.php?group_id=156679. There you can find preconfigured boot images and the com.oonics cluster suite, which provides administrative tools to build and maintain a shared root cluster. Documentation is little, but once i have seen a HOWTO to build a shared root cluster node from the bootimage. Please send mail to ATIX, I am shure they have the document laying around somewhere... ;-)


> Are you using specialized tools for the configuration?
Usually we use the standard tools provided by a standard linux installation. Moreover the gfs_tool binary or other programs included in the GFS package assist while handling system maintenance tasks. 

Greetings,
Christian

 

Christian Sch?fer
Abt. IT-Applications
Messe M?nchen 

Tel.: +49 (0) 89 949 21985 
E-Mail: christian.schaefer2 at messe-muenchen.de 
WWW: www.messe-muenchen.de 

 

> -----Urspr?ngliche Nachricht-----
> Von: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von 
> Clayton Dillard
> Gesendet: Montag, 4. Juni 2007 18:53
> An: linux clustering
> Betreff: Re: AW: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
> 
> Christian,
>     Thank you for the notes and detail.  This sounds 
> promising.  My only concern is the level of complexity that 
> this would introduce into our environment.
> 
> There also doesn't seem to be much documentation on building 
> diskless Linux clusters that I've found.
> 
> Are you using specialized tools for the configuration?
> 
> Clay
> 
> 
> On Mon, 2007-06-04 at 18:23 +0200, Schaefer Christian wrote: 
> 
> 	
> 	Hallo Clayton,
> 	
> 	> Does anyone here run such a configuration and does it 
> work well?
> 	Definitely yes. Munich trade fair center ist hosting 
> the fair's websites itself. Due to availability and 
> performance issues we have decided in 2005 to renew our 
> server pool and migrate it to the diskless shared root 
> concept to to achieve the aim, that by administrating only 
> one server you could manage all servers. The topology 
> mentioned consists of 8 PHP frontend servers responsing on 
> the customers' requests if they were invoked by a pair of 
> LVS/Keepaliced loadbalancers and a couple of backend servers 
> and database clusters. MySQL is currently used in an 
> actice/passive style, but is operated on an two node diskless 
> shared root cluster as well.
> 	
> 	 
> 	> I'm wondering if anyone on this list is currently using RH 
> 	> Cluster Services and GFS for a LAMP environment and if so, 
> 	> are there any pitfalls to watch out for, or any tips that I 
> 	> should know before getting started on the engineering phase.
> 	As clusters are more complex than single host systems, 
> a shared root environment is on step more complex. Hence you 
> should precisely plan your environment to run, because there 
> are more components involved, and some piece of software are 
> not designed to run on such cluster environment. Especially 
> PHP must be patched due to some session locking behavior, so 
> that the webservers run more stable. Another example is 
> writing logfiles. Here the applications run on the cluster 
> should be aware of the GFS file locking mechanism. Using a 
> single application log file on all of the cluster nodes can 
> be a performance killer.
> 	
> 	
> 	> I was reading an article this weekend on the RH site about 
> 	> using a shared (GFS) root partition on SAN and booting up 
> 	> cluster nodes to this shared root, the premise being that 
> 	> patches and config changes are done once, to the clustered 
> 	> root on all servers, as opposed to each server separately. 
> 	This environment is developed and operated for the 
> munich trade fair center by ATIX (http://www.atix.de). If 
> you'd like to have more information about our environment, 
> just reply to my message.
> 	
> 	
> 	Greetings, 
> 	
> 	Christian Sch?fer
> 	Abt. IT-Applications
> 	Messe M?nchen 
> 	
> 	Tel.: +49 (0) 89 949 21985 
> 	E-Mail: christian.schaefer2 at messe-muenchen.de 
> 	WWW: www.messe-muenchen.de 
> 	
> 	
> 	
> 	> -----Urspr?ngliche Nachricht-----
> 	> Von: linux-cluster-bounces at redhat.com 
> 	> [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von 
> 	> Clayton Dillard
> 	> Gesendet: Montag, 4. Juni 2007 17:53
> 	> An: linux clustering
> 	> Betreff: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
> 	> 
> 	> Hi Folks,
> 	>     I'm hoping for some advice here.  We have a hosting 
> 	> service where we run a LAMP stack.  Currently our environment 
> 	> is comprised of a pair of Apache servers which are load 
> 	> balanced via a pair of hardware balancers, and a pair of 
> 	> MySQL servers that are in a Master/Slave configuration.  The 
> 	> two Apache servers use a common NFS export as shared storage 
> 	> and the DB servers just use local disks.
> 	> 
> 	> I've been reading up on RH cluster services and GFS, and from 
> 	> what I see, it sounds like a great solution for us, 
> 	> especially since we are bringing on more and more customers 
> 	> and thus have higher loads and availability requirements.
> 	> 
> 	> I'm wondering if anyone on this list is currently using RH 
> 	> Cluster Services and GFS for a LAMP environment and if so, 
> 	> are there any pitfalls to watch out for, or any tips that I 
> 	> should know before getting started on the engineering phase.
> 	> 
> 	> I was reading an article this weekend on the RH site about 
> 	> using a shared (GFS) root partition on SAN and booting up 
> 	> cluster nodes to this shared root, the premise being that 
> 	> patches and config changes are done once, to the clustered 
> 	> root on all servers, as opposed to each server separately.  
> 	> Does anyone here run such a configuration and does it 
> work well?
> 	> 
> 	> Also, I've not been able to find any good books that are 
> 	> focused mostly on Red Hat clustering and GFS...  Does anyone 
> 	> know of any good books on this subject?
> 	> 
> 	> Thank you,
> 	> 
> 	> -- 
> 	> Clayton Dillard <cdillard at rpstechnology.com>
> 	> RPS Technology, LLC 	
> 	> 
> 	
> 	--
> 	Linux-cluster mailing list
> 	Linux-cluster at redhat.com
> 	https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> -- 
> Clayton Dillard <cdillard at rpstechnology.com>
> RPS Technology, LLC 	
> 



From daniel.j at hcl.in  Tue Jun  5 16:10:07 2007
From: daniel.j at hcl.in (Daniel Fernanduz)
Date: Tue, 05 Jun 2007 12:10:07 -0400
Subject: [Linux-cluster] shared storage clustering
Message-ID: <46658ADF.5040803@hcl.in>

Hello

           I am having two shared volumes (GFS) one for postgres and 
other to save files, so whenever i upload some files
           Informations related to the files will be updated in the 
postgres volume and actual file will be cpied to the Storage  
           volumes. Now database and file storage will be accessible 
thorough a shared mechanism.

           I had configured the failover cluster  in such a way that  
either node1 or node2 takes control when one of
           the node fails (Active /Passive clustering).

           My resource and service section of cluster.conf
                
                          <resources>
                        <script file="/etc/init.d/postgresql" 
name="Postgresql"/>
                        <clusterfs device="/dev/pgsqlvg/datalv" 
fstype="gfs" mountpoint="/pgsql" name="Postgres" 
options=""/>                        <script file="/etc/init.d/vsftpd" 
name="Vsftpd"/>
                        <clusterfs device="/dev/pgsqlvg/storage" 
fstype="gfs" mountpoint="/home/ftpuser" name="Storage" options=""/>
                        <ip address="151.8.18.147" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="postgres" name="Postgres">
                        <script ref="Postgresql"/>
                        <clusterfs ref="Postgres"/>
                        <script ref="Vsftpd"/>
                        <clusterfs ref="Storage"/>
                        <ip ref="151.8.18.147"/>
                </service>

            Two shared volumes:
                *device="/dev/pgsqlvg/datalv" for postgres (<script 
file="/etc/init.d/postgresql" name="Postgresql"/>  for
                              database)
                device="/dev/pgsqlvg/storage" for storing files (  
<script file="/etc/init.d/vsftpd" name="Vsftpd"/>) for
                              loading files to storage.*

            When ever a node takes control two volumes had to be mounted 
and appropriate service for corresponding
            volumes had to be  started and  this cluster had to be 
accesible by an virtual ip ( <*ip address="151.8.18.147"
               monitor_link="1"/>.*

             For such situation whether my current conf is enough( 
*adding all resources to a single service*) or is there any
             actual configurations or appropriate procedures to achieve 
these types of settings inorder to avoid single point of
             failure.
            
             Thanks for the help in advance
             

           
 



From satya.daragani at gmail.com  Tue Jun  5 07:31:31 2007
From: satya.daragani at gmail.com (Satya Daragani)
Date: Tue, 5 Jun 2007 13:01:31 +0530
Subject: [Linux-cluster] LVS+HA
In-Reply-To: <46648511.5050909@xandros.com>
References: <24fecab20706040925u554cb25cifccbe6cf8859ccfe@mail.gmail.com>
	<46648511.5050909@xandros.com>
Message-ID: <24fecab20706050031j4415f44dw665aeab977a1248@mail.gmail.com>

Thanks for the reply....

"LVS would be on your router system and would handle web server outages."

In the above statement is it applicable for all the other services like
database (Oracle), FTP or any other services.

Satya


On 6/5/07, James McOrmond <jamesm at xandros.com> wrote:
>
>
> Satya Daragani wrote:
> > Hi
> >
> > Actually i am using the shared storage but not SAN, IBM DS400 connected
> > to both the nodes using fiber.
>
> Uhmm, I think i'd consider that a SAN...
>
> > So in that case hw can i use the cluster suite?
>
> I would only setup the cluster suite to manage a shared GFS file system.
> Alternatively you could use OCFS2 since you're only doing disk storage.
>
> LVS would be on your router system and would handle web server outages..
>
> --
> James A. McOrmond (jamesm at xandros.com)
> Hardware QA Lead & Network Administrator
> Xandros Corporation, Ottawa, Canada.
> Morpheus: ...after a century of war I remember that which matters most:
> *We are still HERE!*
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Thanx
Satya Daragani
satya.daragani at gmail.com
+91 98850 58366
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070605/17c60397/attachment.htm>

From hlawatschek at atix.de  Tue Jun  5 07:44:14 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Tue, 5 Jun 2007 09:44:14 +0200
Subject: AW: AW: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
In-Reply-To: <56825E6A1E167E41A80CE0F6F6916CAB0241A909@MNTSVCL10E.mmgmuc.de>
References: <1180972381.6730.27.camel@jetfighter>
	<1180975954.28815.3.camel@jetfighter>
	<56825E6A1E167E41A80CE0F6F6916CAB0241A909@MNTSVCL10E.mmgmuc.de>
Message-ID: <200706050944.14473.hlawatschek@atix.de>

> Please referr to
> http://sourceforge.net/project/showfiles.php?group_id=156679. There you can
> find preconfigured boot images and the com.oonics cluster suite, which
> provides administrative tools to build and maintain a shared root cluster.
> Documentation is little, but once i have seen a HOWTO to build a shared
> root cluster node from the bootimage. Please send mail to ATIX, I am shure
> they have the document laying around somewhere... ;-)

The main site is http://www.open-sharedroot.org/ now. The sourceforge project 
site is not well maintained anymore.
There's also a lot of documentation online: 
http://www.open-sharedroot.org/documentation

Don't hesitate to ask if you have some questions!

Mark
-- 
Gruss / Regards,

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany
Registergericht: Amtsgericht M?nchen
Registernummer: HR131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From kristoffer.lippert at jppol.dk  Tue Jun  5 07:59:14 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Tue, 5 Jun 2007 09:59:14 +0200
Subject: SV: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <E67F1468BF7A4C418D874810215A377E9D0D6C@EITO-MBX01.internal.vodafone.com>
References: <20070604145715.GC5597@redhat.com>
	<E67F1468BF7A4C418D874810215A377E9D0D6C@EITO-MBX01.internal.vodafone.com>
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5D9@exchsrv04.rootdom.dk>

Hi,

Does anyone have experience with ocfs2 compared to gfs1 (or gfs2). Are
there any news on performance?

Kind Regards
Kristoffer

-----Oprindelig meddelelse-----
Fra: Brieseneck, Arne, VF-Group [mailto:Arne.Brieseneck at vodafone.com] 
Sendt: 4. juni 2007 17:06
Til: linux clustering; Kristoffer Lippert
Emne: RE: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.

Hi David,

You know, that GFS1 is slow like hell... I truly can understand
Kristoffer when he wants to have GFS2. The throughput is much better.
On the other hand Kristoffer could use professional cluster filesystems
for PROD systems like Veritas FS. Afaik VFS is also available for RHEL.

Arne

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Teigland
Sent: Montag, 4. Juni 2007 16:57
To: Kristoffer Lippert
Cc: linux-cluster at redhat.com
Subject: Re: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.

On Mon, Jun 04, 2007 at 04:50:15PM +0200, Kristoffer Lippert wrote:
> Hi,
> 
> I've looked at the options, and it seems that building a new kernel is

> the only way to go about it.
> 
> RHEL 5.1 is scheduled for okt/nov, and that long we cannot wait.

GFS2 is still in preview/beta mode; you shouldn't use it for production
purposes yet (regardless of the kernel).  Just use GFS(1) instead.

Dave

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From kristoffer.lippert at jppol.dk  Tue Jun  5 08:17:46 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Tue, 5 Jun 2007 10:17:46 +0200
Subject: SV: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.
In-Reply-To: <E67F1468BF7A4C418D874810215A377E9D0D6C@EITO-MBX01.internal.vodafone.com>
References: <20070604145715.GC5597@redhat.com>
	<E67F1468BF7A4C418D874810215A377E9D0D6C@EITO-MBX01.internal.vodafone.com>
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5DB@exchsrv04.rootdom.dk>

Hi,

And thanks for all the posts. Not withstanding the fact that GFS2 to is
pre-production, i decided to spend a little more time on it.

I seem to have found another issue. 
The clusternodes (2 nodes alltogether), are connected to the San (FuSi
SX60 w. full redundancy) via. 2 port fibercontrollers. Thus i've set up
multipath, wich work fine. (tried pulling a cable, and it failed over
correctly).
But when i create much diskaccess from and to the san (ie. Copying a
100mb file from san to a new file on same partition) with the san
running GFS2, and the cluster suite running (RHEL5). I get a message in
the errorlog, saying that the lpfc just failed over. 
Why does it suddenly failover on the fiber, and does this smell a bit of
hardware issue to you too? 

If I do as much as a ls on the same volume from the other node, i get
the earlier mentioned error from GFS2, and the san is locked (I think
because of the fencing). Until i boot the faulted node.

Any sugestions or comments will be greatly appreciated,
Kristoffer
 

-----Oprindelig meddelelse-----
Fra: Brieseneck, Arne, VF-Group [mailto:Arne.Brieseneck at vodafone.com] 
Sendt: 4. juni 2007 17:06
Til: linux clustering; Kristoffer Lippert
Emne: RE: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.

Hi David,

You know, that GFS1 is slow like hell... I truly can understand
Kristoffer when he wants to have GFS2. The throughput is much better.
On the other hand Kristoffer could use professional cluster filesystems
for PROD systems like Veritas FS. Afaik VFS is also available for RHEL.

Arne

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Teigland
Sent: Montag, 4. Juni 2007 16:57
To: Kristoffer Lippert
Cc: linux-cluster at redhat.com
Subject: Re: SV: SV: SV: SV: SV: SV: [Linux-cluster] GFS2 problem.

On Mon, Jun 04, 2007 at 04:50:15PM +0200, Kristoffer Lippert wrote:
> Hi,
> 
> I've looked at the options, and it seems that building a new kernel is

> the only way to go about it.
> 
> RHEL 5.1 is scheduled for okt/nov, and that long we cannot wait.

GFS2 is still in preview/beta mode; you shouldn't use it for production
purposes yet (regardless of the kernel).  Just use GFS(1) instead.

Dave

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From benjhenrion.cluster at gmail.com  Tue Jun  5 14:54:23 2007
From: benjhenrion.cluster at gmail.com (ben henrion)
Date: Tue, 5 Jun 2007 16:54:23 +0200
Subject: [Linux-cluster] Journals
Message-ID: <72f4f1c50706050754v3eb7c554vc39f98eb3d7d780c@mail.gmail.com>

Hello,
Could you tell me what is exactly the number of journal in gfs node?

Thanks for your advice/responds...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070605/ed9ad851/attachment.htm>

From cdillard at rpstechnology.com  Tue Jun  5 17:14:52 2007
From: cdillard at rpstechnology.com (Clayton Dillard)
Date: Tue, 05 Jun 2007 13:14:52 -0400
Subject: AW: AW: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
In-Reply-To: <56825E6A1E167E41A80CE0F6F6916CAB0241A909@MNTSVCL10E.mmgmuc.de>
References: <1180972381.6730.27.camel@jetfighter>
	<56825E6A1E167E41A80CE0F6F6916CAB023E58B7@MNTSVCL10E.mmgmuc.de>
	<1180975954.28815.3.camel@jetfighter>
	<56825E6A1E167E41A80CE0F6F6916CAB0241A909@MNTSVCL10E.mmgmuc.de>
Message-ID: <1181063692.22827.14.camel@jetfighter>

Christian and Mark,
    You guys rock.  Thank you so much for the information.  I'll visit
the site you recommended and do some reading.  Maybe I can even mock
this up in a Xen VM pool.

Cheers,
Clay


On Mon, 2007-06-04 at 23:54 +0200, Schaefer Christian wrote:

> Hello Clay,
> 
> > My only concern is the level of complexity that 
> > this would introduce into our environment.
> Dealing with cluster infrastructures is never easy. Compared with the NFS server topology you currently use, the webservers will then access a local GFS filesystem for operating system purposes and several other GFS filesystems for common data (web application files or media files). Due to the fact, that each flesystem can be mounted on each of the nodes - provided each of the filesystems has enough journals - you just have to publish data to one single filesystem to be delivered to customers visiting your websites. In other words, the complexity added on the one side lasts in more simplification on the other side. 
> 
> 
> > There also doesn't seem to be much documentation on building 
> > diskless Linux clusters that I've found.
> Please referr to http://sourceforge.net/project/showfiles.php?group_id=156679. There you can find preconfigured boot images and the com.oonics cluster suite, which provides administrative tools to build and maintain a shared root cluster. Documentation is little, but once i have seen a HOWTO to build a shared root cluster node from the bootimage. Please send mail to ATIX, I am shure they have the document laying around somewhere... ;-)
> 
> 
> > Are you using specialized tools for the configuration?
> Usually we use the standard tools provided by a standard linux installation. Moreover the gfs_tool binary or other programs included in the GFS package assist while handling system maintenance tasks. 
> 
> Greetings,
> Christian
> 
>  
> 
> Christian Sch?fer
> Abt. IT-Applications
> Messe M?nchen 
> 
> Tel.: +49 (0) 89 949 21985 
> E-Mail: christian.schaefer2 at messe-muenchen.de 
> WWW: www.messe-muenchen.de 
> 
> 
> 
> > -----Urspr?ngliche Nachricht-----
> > Von: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von 
> > Clayton Dillard
> > Gesendet: Montag, 4. Juni 2007 18:53
> > An: linux clustering
> > Betreff: Re: AW: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
> > 
> > Christian,
> >     Thank you for the notes and detail.  This sounds 
> > promising.  My only concern is the level of complexity that 
> > this would introduce into our environment.
> > 
> > There also doesn't seem to be much documentation on building 
> > diskless Linux clusters that I've found.
> > 
> > Are you using specialized tools for the configuration?
> > 
> > Clay
> > 
> > 
> > On Mon, 2007-06-04 at 18:23 +0200, Schaefer Christian wrote: 
> > 
> > 	
> > 	Hallo Clayton,
> > 	
> > 	> Does anyone here run such a configuration and does it 
> > work well?
> > 	Definitely yes. Munich trade fair center ist hosting 
> > the fair's websites itself. Due to availability and 
> > performance issues we have decided in 2005 to renew our 
> > server pool and migrate it to the diskless shared root 
> > concept to to achieve the aim, that by administrating only 
> > one server you could manage all servers. The topology 
> > mentioned consists of 8 PHP frontend servers responsing on 
> > the customers' requests if they were invoked by a pair of 
> > LVS/Keepaliced loadbalancers and a couple of backend servers 
> > and database clusters. MySQL is currently used in an 
> > actice/passive style, but is operated on an two node diskless 
> > shared root cluster as well.
> > 	
> > 	 
> > 	> I'm wondering if anyone on this list is currently using RH 
> > 	> Cluster Services and GFS for a LAMP environment and if so, 
> > 	> are there any pitfalls to watch out for, or any tips that I 
> > 	> should know before getting started on the engineering phase.
> > 	As clusters are more complex than single host systems, 
> > a shared root environment is on step more complex. Hence you 
> > should precisely plan your environment to run, because there 
> > are more components involved, and some piece of software are 
> > not designed to run on such cluster environment. Especially 
> > PHP must be patched due to some session locking behavior, so 
> > that the webservers run more stable. Another example is 
> > writing logfiles. Here the applications run on the cluster 
> > should be aware of the GFS file locking mechanism. Using a 
> > single application log file on all of the cluster nodes can 
> > be a performance killer.
> > 	
> > 	
> > 	> I was reading an article this weekend on the RH site about 
> > 	> using a shared (GFS) root partition on SAN and booting up 
> > 	> cluster nodes to this shared root, the premise being that 
> > 	> patches and config changes are done once, to the clustered 
> > 	> root on all servers, as opposed to each server separately. 
> > 	This environment is developed and operated for the 
> > munich trade fair center by ATIX (http://www.atix.de). If 
> > you'd like to have more information about our environment, 
> > just reply to my message.
> > 	
> > 	
> > 	Greetings, 
> > 	
> > 	Christian Sch?fer
> > 	Abt. IT-Applications
> > 	Messe M?nchen 
> > 	
> > 	Tel.: +49 (0) 89 949 21985 
> > 	E-Mail: christian.schaefer2 at messe-muenchen.de 
> > 	WWW: www.messe-muenchen.de 
> > 	
> > 	
> > 	
> > 	> -----Urspr?ngliche Nachricht-----
> > 	> Von: linux-cluster-bounces at redhat.com 
> > 	> [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von 
> > 	> Clayton Dillard
> > 	> Gesendet: Montag, 4. Juni 2007 17:53
> > 	> An: linux clustering
> > 	> Betreff: [Linux-cluster] Red Hat Cluster & GFS for a LAMP
> > 	> 
> > 	> Hi Folks,
> > 	>     I'm hoping for some advice here.  We have a hosting 
> > 	> service where we run a LAMP stack.  Currently our environment 
> > 	> is comprised of a pair of Apache servers which are load 
> > 	> balanced via a pair of hardware balancers, and a pair of 
> > 	> MySQL servers that are in a Master/Slave configuration.  The 
> > 	> two Apache servers use a common NFS export as shared storage 
> > 	> and the DB servers just use local disks.
> > 	> 
> > 	> I've been reading up on RH cluster services and GFS, and from 
> > 	> what I see, it sounds like a great solution for us, 
> > 	> especially since we are bringing on more and more customers 
> > 	> and thus have higher loads and availability requirements.
> > 	> 
> > 	> I'm wondering if anyone on this list is currently using RH 
> > 	> Cluster Services and GFS for a LAMP environment and if so, 
> > 	> are there any pitfalls to watch out for, or any tips that I 
> > 	> should know before getting started on the engineering phase.
> > 	> 
> > 	> I was reading an article this weekend on the RH site about 
> > 	> using a shared (GFS) root partition on SAN and booting up 
> > 	> cluster nodes to this shared root, the premise being that 
> > 	> patches and config changes are done once, to the clustered 
> > 	> root on all servers, as opposed to each server separately.  
> > 	> Does anyone here run such a configuration and does it 
> > work well?
> > 	> 
> > 	> Also, I've not been able to find any good books that are 
> > 	> focused mostly on Red Hat clustering and GFS...  Does anyone 
> > 	> know of any good books on this subject?
> > 	> 
> > 	> Thank you,
> > 	> 
> > 	> -- 
> > 	> Clayton Dillard <cdillard at rpstechnology.com>
> > 	> RPS Technology, LLC 	
> > 	> 
> > 	
> > 	--
> > 	Linux-cluster mailing list
> > 	Linux-cluster at redhat.com
> > 	https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > -- 
> > Clayton Dillard <cdillard at rpstechnology.com>
> > RPS Technology, LLC 	
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Clayton Dillard <cdillard at rpstechnology.com>
RPS Technology, LLC
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070605/1b103227/attachment.htm>

From Robert.Gil at americanhm.com  Tue Jun  5 17:18:33 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Tue, 5 Jun 2007 13:18:33 -0400
Subject: [Linux-cluster] Journals
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482719@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

Could you be more specific?
 
You need a journal for every node accessing the gfs. use gfs_tool to see
how many journals there are.
 
Robert Gil
Linux Systems Administrator
American Home Mortgage
Phone: 631-622-8410
Cell: 631-827-5775
Fax: 516-495-5861
 

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of ben henrion
Sent: Tuesday, June 05, 2007 10:54 AM
To: linux clustering
Subject: [Linux-cluster] Journals


Hello,
Could you tell me what is exactly the number of journal in gfs node?

Thanks for your advice/responds...


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070605/784ce82a/attachment.htm>

From jimm at simutronics.com  Tue Jun  5 18:14:44 2007
From: jimm at simutronics.com (James Miller)
Date: Tue, 5 Jun 2007 13:14:44 -0500
Subject: [Linux-cluster] CLVMD and Locking type 3 initialisation failed.
Message-ID: <074e01c7a79d$64e85400$5dd810d1@e3demo>

Hello everyone,

I'm setting up a cluster and was hoping to get some insight into trouble I'm
having with running clvmd. My cluster seems to be in a good state (2 nodes).
I using Debian Etch, 2.6.18 kernel, and redhat-cluster-modules for 2.6.18.

---------------------------
cman_tool status:
Protocol version: 5.0.1
Config version: 2
Cluster name: alpha1
Cluster ID: 6387
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 2
Total_votes: 2
Quorum: 2
Active subsystems: 3
Node name: hari
Node ID: 1
Node addresses: 209.16.216.121

-------------------------------
cman_tool nodes:
Node  Votes Exp Sts  Name
   1    1    2   M   hari
   2    1    2   M   seldon

------------------------------
cman_tool services:
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2]

---------------------------------
When I run vgchange -aly I get the following message:
vgchange -aly
  Unknown locking type requested.
  Locking type 3 initialisation failed.

---------------------------
/etc/lvm/lvm.conf:
#
devices {

	dir = "/dev"
	scan = [ "/dev" ]
	filter = [ "r|/dev/cdrom|" ]
	write_cache_state = 1
	sysfs_scan = 1
	md_component_detection = 1
}

log {
    verbose = 0
    syslog = 1
    overwrite = 0
    level = 0
    indent = 1
    command_names = 0
    prefix = "  "
}

backup {
    backup = 1
    backup_dir = "/etc/lvm/backup"
    archive = 1
    archive_dir = "/etc/lvm/archive"
    retain_min = 10
    retain_days = 30
}

shell {
    history_size = 100
}

global {
    umask = 077
    test = 0
    activation = 1
    proc = "/proc"
    locking_dir = "/var/lock/lvm"
    locking_library = "liblvm2clusterlock.so"
# ***** I VERIFIED the locking_dir exists, the library_dir is correct and
the locking_library is in the right place
    locking_type = 3
    library_dir = "/lib/lvm2"
}

activation {
    missing_stripe_filler = "/dev/ioerror"
    reserved_stack = 256
    reserved_memory = 8192
    process_priority = -18
    mirror_region_size = 512
    mirror_log_fault_policy = "allocate"
    mirror_device_fault_policy = "remove"
}

-------------------------------------
Important output of lsmod:
lsmod
Module                  Size  Used by
dm_snapshot            20664  0
dm_mirror              25216  0
dm_mod                 62800  2 dm_snapshot,dm_mirror
lock_dlm               44644  0
dlm                   123040  4 lock_dlm
cman                  132800  15 lock_dlm,dlm
lock_harness           10160  1 lock_dlm
----------------------------------


Thank you in advance for your help


--Jim





From jimm at simutronics.com  Tue Jun  5 18:56:45 2007
From: jimm at simutronics.com (James Miller)
Date: Tue, 5 Jun 2007 13:56:45 -0500
Subject: SOLVED: RE: [Linux-cluster] CLVMD and Locking type 3 initialisation
	failed.
In-Reply-To: <074e01c7a79d$64e85400$5dd810d1@e3demo>
References: <074e01c7a79d$64e85400$5dd810d1@e3demo>
Message-ID: <077601c7a7a3$43b03e00$5dd810d1@e3demo>

SORRY FOR THE TOP POSTING!!!

For Debian lvm2 you have to set the locking_type=2.  It's in the lvm2 source
clvm.README.Debian.


--Jim



> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Miller
> Sent: Tuesday, June 05, 2007 1:15 PM
> To: 'linux clustering'
> Subject: [Linux-cluster] CLVMD and Locking type 3 
> initialisation failed.
> 
> Hello everyone,
> 
> I'm setting up a cluster and was hoping to get some insight 
> into trouble I'm having with running clvmd. My cluster seems 
> to be in a good state (2 nodes).
> I using Debian Etch, 2.6.18 kernel, and 
> redhat-cluster-modules for 2.6.18.
> 
> ---------------------------
> cman_tool status:
> Protocol version: 5.0.1
> Config version: 2
> Cluster name: alpha1
> Cluster ID: 6387
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 2
> Expected_votes: 2
> Total_votes: 2
> Quorum: 2
> Active subsystems: 3
> Node name: hari
> Node ID: 1
> Node addresses: 209.16.216.121
> 
> -------------------------------
> cman_tool nodes:
> Node  Votes Exp Sts  Name
>    1    1    2   M   hari
>    2    1    2   M   seldon
> 
> ------------------------------
> cman_tool services:
> Service          Name                              GID LID 
> State     Code
> Fence Domain:    "default"                           1   2 run       -
> [1 2]
> 
> DLM Lock Space:  "clvmd"                             2   3 run       -
> [1 2]
> 
> ---------------------------------
> When I run vgchange -aly I get the following message:
> vgchange -aly
>   Unknown locking type requested.
>   Locking type 3 initialisation failed.
> 
> ---------------------------
> /etc/lvm/lvm.conf:
> #
> devices {
> 
> 	dir = "/dev"
> 	scan = [ "/dev" ]
> 	filter = [ "r|/dev/cdrom|" ]
> 	write_cache_state = 1
> 	sysfs_scan = 1
> 	md_component_detection = 1
> }
> 
> log {
>     verbose = 0
>     syslog = 1
>     overwrite = 0
>     level = 0
>     indent = 1
>     command_names = 0
>     prefix = "  "
> }
> 
> backup {
>     backup = 1
>     backup_dir = "/etc/lvm/backup"
>     archive = 1
>     archive_dir = "/etc/lvm/archive"
>     retain_min = 10
>     retain_days = 30
> }
> 
> shell {
>     history_size = 100
> }
> 
> global {
>     umask = 077
>     test = 0
>     activation = 1
>     proc = "/proc"
>     locking_dir = "/var/lock/lvm"
>     locking_library = "liblvm2clusterlock.so"
> # ***** I VERIFIED the locking_dir exists, the library_dir is 
> correct and the locking_library is in the right place
>     locking_type = 3
>     library_dir = "/lib/lvm2"
> }
> 
> activation {
>     missing_stripe_filler = "/dev/ioerror"
>     reserved_stack = 256
>     reserved_memory = 8192
>     process_priority = -18
>     mirror_region_size = 512
>     mirror_log_fault_policy = "allocate"
>     mirror_device_fault_policy = "remove"
> }
> 
> -------------------------------------
> Important output of lsmod:
> lsmod
> Module                  Size  Used by
> dm_snapshot            20664  0
> dm_mirror              25216  0
> dm_mod                 62800  2 dm_snapshot,dm_mirror
> lock_dlm               44644  0
> dlm                   123040  4 lock_dlm
> cman                  132800  15 lock_dlm,dlm
> lock_harness           10160  1 lock_dlm
> ----------------------------------
> 
> 
> Thank you in advance for your help
> 
> 
> --Jim
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From rpeterso at redhat.com  Tue Jun  5 19:29:29 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 05 Jun 2007 14:29:29 -0500
Subject: [Linux-cluster] Can I control journal size and placement in GFS?
	(need update for GFS 6.1?)
In-Reply-To: <a6d13c780705310738i508df429if6c813cd921eacad@mail.gmail.com>
References: <a6d13c780705310738i508df429if6c813cd921eacad@mail.gmail.com>
Message-ID: <4665B999.9090903@redhat.com>

Filipe Miranda wrote:
> Hello,
> 
> I would like to know if the information below applies to the GFS 6.1?
> I guess that resource groups are the building blocks of the pool file 
> system
> management, but I the GFS 6.1 uses a tottaly different file system
> management (LVM2).
> (please correct me if I am wrong)
> 
> 
>   1. Can I control journal size and placement in GFS?
> 
>   Not really. The gfs_mkfs command decides exactly where everything
>   should go and you have no choice in the matter. The volume is carved into
>   logical "sections." The first and last sections are for multiple resource
>   groups, based roughly on the rg size specified on the gfs_mkfs 
> commandline.
>   The journals are always placed between the first and last section.
>   Specifying a different number of journals will force gfs_mkfs to carve 
> the
>   section size smaller, thus changing where your journals will end up.
> 
> 
> Thank you,
> Filipe Miranda

Hi Filipe,

I think you are confusing several different concepts here.

1. As far as I know, the term "pools" pertains to the older RHEL3 cluster code,
   and GFS 6.1 doesn't know about pools.
2. The RGs (Resource Groups) are building blocks of the GFS file system.
3. The LVM2 layer is completely separate from GFS.  In other words, you can 
   make a GFS file system inside (I think of it as inside anyway) a Logical
   volume of LVM2.  You can also make a GFS file system on a file system
   partition without LVM2.
4. Yes, the FAQ entry you referenced does apply to GFS 6.1 and all other
   releases of GFS as far as I know.  It does not, however, address GFS2 directly.
   GFS2 has a slightly different format.  Even so, you can't really control 
   the placement of a journal in GFS2 either.

Regards,

Bob Peterson
Red Hat Cluster Suite



From rpeterso at redhat.com  Tue Jun  5 20:18:31 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 05 Jun 2007 15:18:31 -0500
Subject: [Linux-cluster] Exceeds Two Node Limit
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482640@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <F154BEC9D4278D4BA9B94BDEC76193482640@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <4665C517.3020607@redhat.com>

Robert Gil wrote:
> Im curious why I would be getting this error?
>  
> May 21 12:57:22 **NODE4** kernel: CMAN: Join request from **NODE1**
> rejected, exceeds two node limit.
>  
> I have added two new nodes to the cluster to make a total of 4. Where
> would this limit exist? I do not see it in cluster.conf. I thought maybe
> part of gfs, however that is not set as a resource right now and I also
> tried adding 2 more journals. Neither seemed to fix the issue.
>  
> Any insight is appreciated
>  
> Thanks,
>  
> Robert Gil

Hi Robert,

Once a cluster is configured as "two_node" that setting is kept in
memory by both running nodes.  If a third (and fourth, in your case)
node is added, the "two_node" option has to be removed from the
cluster.conf file and the cluster must be restarted.  This is a special
case for two-node clusters only (ordinarily, adding a node does not
require a restart of the cluster).

Here's a FAQ entry that gives some info about this:

http://sources.redhat.com/cluster/faq.html#cman_2to3

Regards,

Bob Peterson
Red Hat Cluster Suite



From rpeterso at redhat.com  Tue Jun  5 20:41:08 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 05 Jun 2007 15:41:08 -0500
Subject: [Linux-cluster] Not able to stop cman service
In-Reply-To: <1180933293.4343.208.camel@linux.site>
References: <BAY105-F24FE01113BA964C8AFCB7FE32D0@phx.gbl>
	<1180933293.4343.208.camel@linux.site>
Message-ID: <4665CA64.9010104@redhat.com>

Shailesh wrote:
> I tried the same (cman_tool leave force) and worked.
> 
> How and why does rgmanager run without even explicitly starting it ,
> I have only been running the ccsd ,cman,gfs services ?
> 
> Thanks 
> Shailesh
Hi Shailesh,

Perhaps somehow your rgmanager service got set on.  Do this:

chkconfig --list | grep rgmanager

If it says "on" for 3 4 5, you can disable it if you want by
doing: "chkconfig rgmanager off"

I can't think of any other way it could be started at the moment,
unless you're using the system-config-cluster or conga GUIs,
in which case starting a failover service may automatically start
rgmanager for you (I'm not sure offhand).

If you want rgmanager to start automatically, just do:

chkconfig rgmanager on

but then you should do "service rgmanager stop" before you stop
the other cluster services (like cman).

Regards,

Bob Peterson
Red Hat Cluster Suite



From rpeterso at redhat.com  Tue Jun  5 20:54:46 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 05 Jun 2007 15:54:46 -0500
Subject: [Linux-cluster] resourse order
In-Reply-To: <20070524204940.GM31361@watson-wilson.ca>
References: <20070524204940.GM31361@watson-wilson.ca>
Message-ID: <4665CD96.1030305@redhat.com>

Neil Watson wrote:
> Using the Cluster Suite how can a define resource dependencies?  For
> instance I have an NFS mount as a resource.  I also have the portmapper
> as a resource.  How do I tell the cluster suite to start the portmapper
> before attempting the NFS mount?
> 
Hi Neil,

Lon H is the expert in that area.  But since he's on vacation at the 
moment, I'll direct you to this FAQ entry:

http://sources.redhat.com/cluster/faq.html#rgm_fodomains

Regards,

Bob Peterson
Red Hat Cluster Suite



From shailesh at verismonetworks.com  Wed Jun  6 04:26:35 2007
From: shailesh at verismonetworks.com (Shailesh)
Date: Wed, 06 Jun 2007 09:56:35 +0530
Subject: [Linux-cluster] Journals
In-Reply-To: <72f4f1c50706050754v3eb7c554vc39f98eb3d7d780c@mail.gmail.com>
References: <72f4f1c50706050754v3eb7c554vc39f98eb3d7d780c@mail.gmail.com>
Message-ID: <1181103995.8578.26.camel@linux.site>

If your question is to know how many journals can be created, then you
can create as many in the -j option of the gfs_mkfs.

Each node in the cluster, mounting the gfs should have one journal.

- Shailesh Shirali


On Tue, 2007-06-05 at 16:54 +0200, ben henrion wrote:
> Hello,
> Could you tell me what is exactly the number of journal in gfs node?
> 
> Thanks for your advice/responds...
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From Alain.Moulle at bull.net  Wed Jun  6 05:55:47 2007
From: Alain.Moulle at bull.net (Alain Moulle)
Date: Wed, 06 Jun 2007 07:55:47 +0200
Subject: [Linux-cluster] CS4 U5 / problem ?
Message-ID: <46664C63.7000102@bull.net>

Hi

Trying the new CS4 U5, I've got these messages in syslog :

Jun  6 06:23:24 s_kernel at bali2 kernel: SM: 03000003 sm_stop: SG still joined
Jun  6 06:23:24 s_sys at bali2 ccsd[21387]: Cluster is not quorate.  Refusing
connection.
Jun  6 06:23:24 s_sys at bali2 ccsd[21387]: Error while processing connect:
Connection refused
Jun  6 06:23:24 s_sys at bali2 ccsd[21387]: Invalid descriptor specified (-111).
Jun  6 06:23:24 s_sys at bali2 ccsd[21387]: Someone may be attempting something evil.
Jun  6 06:23:24 s_sys at bali2 ccsd[21387]: Error while processing get: Invalid
request descriptor

Is it a problem ?

Thanks
Alain Moull?




From pgharios at gmail.com  Wed Jun  6 16:30:37 2007
From: pgharios at gmail.com (Patrick Gharios)
Date: Wed, 6 Jun 2007 19:30:37 +0300
Subject: [Linux-cluster] RHCE on SUSE.
Message-ID: <7e0bc7bc0706060930i5169d655k7c27f6c5b3dfd0c3@mail.gmail.com>

Hello,

I am just wondering if RHCS package can be installed on a SUSE 10.2 distro.

Regards, Pat.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070606/5337ae34/attachment.htm>

From mlgeneric at gmail.com  Wed Jun  6 20:23:06 2007
From: mlgeneric at gmail.com (Jean-Marc jmc)
Date: Wed, 6 Jun 2007 22:23:06 +0200
Subject: [Linux-cluster] Problem with gfs2 - waiting for i/o
Message-ID: <56a19b490706061323u468150dy49eaec2f209945c2@mail.gmail.com>

Hi list,
I am new to cluster and gfs so my apologies if I missed something obvious.
I am having problems using gfs2 filesystems where writing processes hang
forever waiting for i/o. It does not seem to be locking (or even cluster)
related since its reproducible on a local gfs2 filesystem with 'lock_nolock'
as locking protocol.

I can create and mount the gfs2 file system without any problems. Creating
and writing to files works as well, but when I write larger amounts of data
the process gets stuck waiting for i/o. It doesn't matter whether I amd
using lvm or not thre results are the same. The hardware is working and I
have no problem writing to an ext3 filesystem on the same device.

The problem can be reproduced by doing the following:

-------------------------------------------------------------------------------------------------
[root at indigo ~]# mkfs.gfs2 -p lock_nolock /dev/cciss/c0d0p3
This will destroy any data on /dev/cciss/c0d0p3.
  It appears to contain a gfs2 filesystem.

Are you sure you want to proceed? [y/n] y

Device:                    /dev/cciss/c0d0p3
Blocksize:                 4096
Device Size                644.44 GB (168935523 blocks)
Filesystem Size:           644.44 GB (168935523 blocks)
Journals:                  1
Resource Groups:           2578
Locking Protocol:          "lock_nolock"
Lock Table:                ""

[root at indigo ~]# mount -o locktable=xyz /dev/cciss/c0d0p3 /export

[root at indigo ~]# mount|grep /export
/dev/cciss/c0d0p3 on /export type gfs2
(rw,locktable=xyz,localflocks,localcaching)

[root at indigo ~]# dd if=/dev/zero of=/export/8GB bs=1024k count=8000

## A partial file is created, but the process never returns, it is forever
waiting for i/o:
[root at indigo ~]# ls -l /export/;ps auxww|grep 8GB
total 1995152
-rw-r--r-- 1 root root 2039009280 Jun  1 07:12 8GB
root      3857 42.8  0.0   4880  1584 pts/0    D    07:12   0:07 dd if
/dev/zero of /export/8GB bs 1024k count 8000
----------------------------------------------------------------------------------------------------

It's still possible to create new files to the file system but these
processes are also waiting for i/o after a while.
After the mounting there is nothing else appearing in dmesg or messages log.
Since the processes are uninterruptible we are forced to reboot the machine
in order to resolve the lockup.

Basic info on the system setup:

HP dl380 G5 2 xeon cpu with dual cores each, 12 GB RAM
local raid controller is p400 but we also reproduced the behaviour on the
san disk (Emulex card)
OS is RH 5 (32 bit), output of uname is:
    Linux indigo 2.6.18-8.1.4.el5xen #1 SMP Fri May 4 22:42:50 EDT 2007 i686
i686 i386 GNU/Linux

The same problem could also be reproduced on RH 4 and Ubuntu 7.04

Thanks in advance for your help!

/jmc
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070606/d9fbbc81/attachment.htm>

From swhiteho at redhat.com  Wed Jun  6 20:25:26 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 06 Jun 2007 21:25:26 +0100
Subject: [Linux-cluster] Problem with gfs2 - waiting for i/o
In-Reply-To: <56a19b490706061323u468150dy49eaec2f209945c2@mail.gmail.com>
References: <56a19b490706061323u468150dy49eaec2f209945c2@mail.gmail.com>
Message-ID: <1181161526.25918.226.camel@quoit>

Hi,

You are using a rather too old version of GFS2. I'd suggest either using
Fedora or the upstream kernel's until GFS2 appears in RHEL 5.1 as the
RHEL 5 version is rather out of date I'm afraid,

Steve.

On Wed, 2007-06-06 at 22:23 +0200, Jean-Marc jmc wrote:
> Hi list,
> I am new to cluster and gfs so my apologies if I missed something
> obvious.
> I am having problems using gfs2 filesystems where writing processes
> hang forever waiting for i/o. It does not seem to be locking (or even
> cluster) related since its reproducible on a local gfs2 filesystem
> with 'lock_nolock' as locking protocol.  
> 
> I can create and mount the gfs2 file system without any problems.
> Creating and writing to files works as well, but when I write larger
> amounts of data the process gets stuck waiting for i/o. It doesn't
> matter whether I amd using lvm or not thre results are the same. The
> hardware is working and I have no problem writing to an ext3
> filesystem on the same device. 
> 
> The problem can be reproduced by doing the following:
> 
> -------------------------------------------------------------------------------------------------
> [root at indigo ~]# mkfs.gfs2 -p lock_nolock /dev/cciss/c0d0p3 
> This will destroy any data on /dev/cciss/c0d0p3. 
>   It appears to contain a gfs2 filesystem.
> 
> Are you sure you want to proceed? [y/n] y
> 
> Device:                    /dev/cciss/c0d0p3
> Blocksize:                 4096
> Device Size                644.44 GB (168935523 blocks) 
> Filesystem Size:           644.44 GB (168935523 blocks)
> Journals:                  1
> Resource Groups:           2578
> Locking Protocol:          "lock_nolock"
> Lock Table:                "" 
> 
> [root at indigo ~]# mount -o locktable=xyz /dev/cciss/c0d0p3 /export
> 
> [root at indigo ~]# mount|grep /export 
> /dev/cciss/c0d0p3 on /export type gfs2
> (rw,locktable=xyz,localflocks,localcaching)
> 
> [root at indigo ~]# dd if=/dev/zero of=/export/8GB bs=1024k count=8000
> 
> ## A partial file is created, but the process never returns, it is
> forever waiting for i/o: 
> [root at indigo ~]# ls -l /export/;ps auxww|grep 8GB
> total 1995152
> -rw-r--r-- 1 root root 2039009280 Jun  1 07:12 8GB
> root      3857 42.8  0.0   4880  1584 pts/0    D    07:12   0:07 dd
> if /dev/zero of /export/8GB bs 1024k count 8000 
> ----------------------------------------------------------------------------------------------------
> 
> It's still possible to create new files to the file system but these
> processes are also waiting for i/o after a while.
> After the mounting there is nothing else appearing in dmesg or
> messages log. Since the processes are uninterruptible we are forced to
> reboot the machine in order to resolve the lockup. 
> 
> Basic info on the system setup:
> 
> HP dl380 G5 2 xeon cpu with dual cores each, 12 GB RAM
> local raid controller is p400 but we also reproduced the behaviour on
> the san disk (Emulex card)
> OS is RH 5 (32 bit), output of uname is: 
>     Linux indigo 2.6.18-8.1.4.el5xen #1 SMP Fri May 4 22:42:50 EDT
> 2007 i686 i686 i386 GNU/Linux
> 
> The same problem could also be reproduced on RH 4 and Ubuntu 7.04
> 
> Thanks in advance for your help!
> 
> /jmc
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From Robert.Gil at americanhm.com  Wed Jun  6 20:33:57 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Wed, 6 Jun 2007 16:33:57 -0400
Subject: [Linux-cluster] GFS Filesystem Scan
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA614@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

How can I scan for gfs filesystems? They dont show on this particular
server anymore for some reason.
 
I have tried to restart clvmd, but the gfs wont show. When i do a
pvscan, it says its being skipped, so I know it does exist.
 
This is RHAS 4.5 all stock redhat.
 
Thanks,
 
Robert Gil
Linux Systems Administrator
American Home Mortgage
Phone: 631-622-8410
Cell: 631-827-5775
Fax: 516-495-5861
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070606/8ca7b34d/attachment.htm>

From nrbwpi at gmail.com  Wed Jun  6 23:27:42 2007
From: nrbwpi at gmail.com (nrbwpi at gmail.com)
Date: Wed, 6 Jun 2007 19:27:42 -0400
Subject: [Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing
Message-ID: <6eee34430706061627p69a8080lf39168322926a769@mail.gmail.com>

Hello,

Installed RHEL5 on a new two node cluster with Shared FC storage.  The two
shared storage boxes are each split into 6.9TB LUNs for a total of 4 -
6.9TBLUNS.  Each machine is connected via a single 100Mb connection to
a switch
and a single FC connection to a FC switch.

The 4 LUNs have LVM on them with GFS2.  The file systems are mountable from
each box.  When performing a script dd write of zeros in 250MB file sizes to
the file system from each box to different LUNS, one of the nodes in the
cluster is fenced by the other one.  File size does not seem to matter.

My first guess at the problem was the heartbeat timeout in openais.  In the
cluster.conf below I added the totem line to hopefully raise the timeout to
10 seconds.  This however did not resolve the problem.  Both boxes are
running the latest updates as of 2 days ago from up2date.

Below is the cluster.conf and what is seen in the logs.  Any suggestions
would be greatly appreciated.

Thanks!

Neal



##########################################

Cluster.conf

##########################################


<?xml version="1.0"?>
<cluster alias="storage1" config_version="4" name="storage1">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="fu1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="apc4" port="1"
switch="1"/>
                                </method>
                        </fence>
                        <multicast addr="224.10.10.10" interface="eth0"/>
                </clusternode>
                <clusternode name="fu2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="apc4" port="2"
switch="1"/>
                                </method>
                        </fence>
                        <multicast addr="224.10.10.10" interface="eth0"/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1">
                <multicast addr="224.10.10.10"/>
                <totem token="10000"/>
        </cman>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.14.193"
login="apc" name="apc4" passwd="apc"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>


#####################################################

/var/log/messages

#####################################################

Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the
OPERATIONAL state.
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket recv
buffer size (262142 bytes).
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket send
buffer size (262142 bytes).
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from 2.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from 0.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token because I
am the rep.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high seq
received 6e
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member
192.168.14.195:
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep
192.168.14.195
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e received
flag 0
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate any
messages in recovery.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for ring
14
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
Jun  5 20:19:34 fu1 kernel: dlm: closing connection to node 2
Jun  5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec
post_fail_delay
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2"
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.197)
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary
component and will provide service.
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary
component and will provide service.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state.
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] got nodejoin message
192.168.14.195
Jun  5 20:19:34 fu1 openais[5351]: [CPG  ] got joinlist message from node 1
Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Trying
to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Trying
to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Looking
at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Trying
to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Trying
to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Looking
at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Looking
at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Looking
at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
Acquiring the transaction lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
Replaying journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replayed
0 of 0 blocks
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Found 0
revoke tags
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Journal
replayed in 1s
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Done
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
Acquiring the transaction lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
Replaying journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replayed
0 of 0 blocks
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Found 0
revoke tags
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Journal
replayed in 1s
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Done
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
Acquiring the transaction lock...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
Acquiring the transaction lock...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
Replaying journal...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replayed
222 of 223 blocks
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Found 1
revoke tags
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Journal
replayed in 1s
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Done
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
Replaying journal...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replayed
438 of 439 blocks
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Found 1
revoke tags
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Journal
replayed in 1s
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Done
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070606/64378584/attachment.htm>

From kristoffer.lippert at jppol.dk  Thu Jun  7 05:30:09 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Thu, 7 Jun 2007 07:30:09 +0200
Subject: SV: [Linux-cluster] Problem with gfs2 - waiting for i/o
In-Reply-To: <1181161526.25918.226.camel@quoit>
References: <56a19b490706061323u468150dy49eaec2f209945c2@mail.gmail.com>
	<1181161526.25918.226.camel@quoit>
Message-ID: <F358B15FDF5F1249AFA3B9C79250A0F403DDB5F9@exchsrv04.rootdom.dk>

Hi,

I have the same RHEL 5 situation, and I can add that RHEL 5.1 is scheduled for Q3 (sep-okt). 
You could download the RHEL 5 Cluster disk (RHEL5 disk 6) on wich the old GFS(6.x) is, and use that. Otherwise the only choice (that I know of) is to start compiling kernels with newer versions of GFS2.

Kind Regards
Kristoffer

 

-----Oprindelig meddelelse-----
Fra: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] P? vegne af Steven Whitehouse
Sendt: 6. juni 2007 22:25
Til: linux clustering
Emne: Re: [Linux-cluster] Problem with gfs2 - waiting for i/o

Hi,

You are using a rather too old version of GFS2. I'd suggest either using Fedora or the upstream kernel's until GFS2 appears in RHEL 5.1 as the RHEL 5 version is rather out of date I'm afraid,

Steve.

On Wed, 2007-06-06 at 22:23 +0200, Jean-Marc jmc wrote:
> Hi list,
> I am new to cluster and gfs so my apologies if I missed something 
> obvious.
> I am having problems using gfs2 filesystems where writing processes 
> hang forever waiting for i/o. It does not seem to be locking (or even
> cluster) related since its reproducible on a local gfs2 filesystem 
> with 'lock_nolock' as locking protocol.
> 
> I can create and mount the gfs2 file system without any problems.
> Creating and writing to files works as well, but when I write larger 
> amounts of data the process gets stuck waiting for i/o. It doesn't 
> matter whether I amd using lvm or not thre results are the same. The 
> hardware is working and I have no problem writing to an ext3 
> filesystem on the same device.
> 
> The problem can be reproduced by doing the following:
> 
> ----------------------------------------------------------------------
> --------------------------- [root at indigo ~]# mkfs.gfs2 -p lock_nolock 
> /dev/cciss/c0d0p3 This will destroy any data on /dev/cciss/c0d0p3.
>   It appears to contain a gfs2 filesystem.
> 
> Are you sure you want to proceed? [y/n] y
> 
> Device:                    /dev/cciss/c0d0p3
> Blocksize:                 4096
> Device Size                644.44 GB (168935523 blocks) 
> Filesystem Size:           644.44 GB (168935523 blocks)
> Journals:                  1
> Resource Groups:           2578
> Locking Protocol:          "lock_nolock"
> Lock Table:                "" 
> 
> [root at indigo ~]# mount -o locktable=xyz /dev/cciss/c0d0p3 /export
> 
> [root at indigo ~]# mount|grep /export
> /dev/cciss/c0d0p3 on /export type gfs2
> (rw,locktable=xyz,localflocks,localcaching)
> 
> [root at indigo ~]# dd if=/dev/zero of=/export/8GB bs=1024k count=8000
> 
> ## A partial file is created, but the process never returns, it is 
> forever waiting for i/o:
> [root at indigo ~]# ls -l /export/;ps auxww|grep 8GB total 1995152
> -rw-r--r-- 1 root root 2039009280 Jun  1 07:12 8GB
> root      3857 42.8  0.0   4880  1584 pts/0    D    07:12   0:07 dd
> if /dev/zero of /export/8GB bs 1024k count 8000
> ----------------------------------------------------------------------
> ------------------------------
> 
> It's still possible to create new files to the file system but these 
> processes are also waiting for i/o after a while.
> After the mounting there is nothing else appearing in dmesg or 
> messages log. Since the processes are uninterruptible we are forced to 
> reboot the machine in order to resolve the lockup.
> 
> Basic info on the system setup:
> 
> HP dl380 G5 2 xeon cpu with dual cores each, 12 GB RAM local raid 
> controller is p400 but we also reproduced the behaviour on the san 
> disk (Emulex card) OS is RH 5 (32 bit), output of uname is:
>     Linux indigo 2.6.18-8.1.4.el5xen #1 SMP Fri May 4 22:42:50 EDT
> 2007 i686 i686 i386 GNU/Linux
> 
> The same problem could also be reproduced on RH 4 and Ubuntu 7.04
> 
> Thanks in advance for your help!
> 
> /jmc
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From swhiteho at redhat.com  Thu Jun  7 07:28:53 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Thu, 07 Jun 2007 08:28:53 +0100
Subject: [Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing
In-Reply-To: <6eee34430706061627p69a8080lf39168322926a769@mail.gmail.com>
References: <6eee34430706061627p69a8080lf39168322926a769@mail.gmail.com>
Message-ID: <1181201333.25918.229.camel@quoit>

Hi,

The version of GFS2 in RHEL5 is rather old. Please use Fedora, the
upstream kernel or wait until RHEL 5.1 is out. This should solve the
problem that you are seeing,

Steve.

On Wed, 2007-06-06 at 19:27 -0400, nrbwpi at gmail.com wrote:
> Hello,
> 
> Installed RHEL5 on a new two node cluster with Shared FC storage.  The
> two shared storage boxes are each split into 6.9TB LUNs for a total of
> 4 - 6.9TB LUNS.  Each machine is connected via a single 100Mb
> connection to a switch and a single FC connection to a FC switch. 
> 
> The 4 LUNs have LVM on them with GFS2.  The file systems are mountable
> from each box.  When performing a script dd write of zeros in 250MB
> file sizes to the file system from each box to different LUNS, one of
> the nodes in the cluster is fenced by the other one.  File size does
> not seem to matter. 
> 
> My first guess at the problem was the heartbeat timeout in openais.
> In the cluster.conf below I added the totem line to hopefully raise
> the timeout to 10 seconds.  This however did not resolve the problem.
> Both boxes are running the latest updates as of 2 days ago from
> up2date. 
> 
> Below is the cluster.conf and what is seen in the logs.  Any
> suggestions would be greatly appreciated.
> 
> Thanks!
> 
> Neal
> 
> 
> 
> ##########################################
> 
> Cluster.conf
> 
> ##########################################
> 
> 
> <?xml version="1.0"?>
> <cluster alias="storage1" config_version="4" name="storage1">
>         <fence_daemon post_fail_delay="0" post_join_delay="3"/> 
>         <clusternodes>
>                 <clusternode name="fu1" nodeid="1" votes="1">
>                         <fence>
>                                 <method name="1"> 
>                                         <device name="apc4" port="1"
> switch="1"/>
>                                 </method>
>                         </fence>
>                         <multicast addr=" 224.10.10.10"
> interface="eth0"/>
>                 </clusternode>
>                 <clusternode name="fu2" nodeid="2" votes="1"> 
>                         <fence>
>                                 <method name="1">
>                                         <device name="apc4" port="2"
> switch="1"/> 
>                                 </method>
>                         </fence>
>                         <multicast addr="224.10.10.10"
> interface="eth0"/> 
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1">
>                 <multicast addr="224.10.10.10"/>
>                 <totem token="10000"/>
>         </cman>
>         <fencedevices>
>                 <fencedevice agent="fence_apc" ipaddr="192.168.14.193"
> login="apc" name="apc4" passwd="apc"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains/>
>                 <resources/> 
>         </rm>
> </cluster>
> 
> 
> #####################################################
> 
> /var/log/messages
> 
> #####################################################
> 
> Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the
> OPERATIONAL state. 
> Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket
> recv buffer size (262142 bytes).
> Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket
> send buffer size (262142 bytes).
> Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from
> 2. 
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from
> 0.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token
> because I am the rep.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high
> seq received 6e 
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member
> 192.168.14.195:
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep
> 192.168.14.195
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e
> received flag 0 
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for
> ring 14
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token 
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
> Jun  5 20:19:34 fu1 kernel: dlm: closing connection to node 2
> Jun  5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec
> post_fail_delay 
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0)
> ip(192.168.14.195)
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
> Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2" 
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0)
> ip(192.168.14.197)
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
> Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the
> primary component and will provide service. 
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0)
> ip(192.168.14.195)
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
> Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the
> primary component and will provide service. 
> Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state.
> Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] got nodejoin message
> 192.168.14.195
> Jun  5 20:19:34 fu1 openais[5351]: [CPG  ] got joinlist message from
> node 1 
> Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Trying to acquire journal lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Trying to acquire journal lock... 
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Looking at journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Trying to acquire journal lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Trying to acquire journal lock... 
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Looking at journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Looking at journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Looking at journal... 
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Acquiring the transaction lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Replaying journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Replayed 0 of 0 blocks 
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Found 0 revoke tags
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Journal replayed in 1s
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Done 
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Acquiring the transaction lock...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Replaying journal...
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Replayed 0 of 0 blocks 
> Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Found 0 revoke tags
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Journal replayed in 1s
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Done 
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Acquiring the transaction lock...
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Acquiring the transaction lock...
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Replaying journal... 
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Replayed 222 of 223 blocks
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Found 1 revoke tags
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Journal replayed in 1s 
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Done
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Replaying journal...
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Replayed 438 of 439 blocks 
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Found 1 revoke tags
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Journal replayed in 1s
> Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Done 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From teigland at redhat.com  Thu Jun  7 13:59:43 2007
From: teigland at redhat.com (David Teigland)
Date: Thu, 7 Jun 2007 08:59:43 -0500
Subject: [Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing
In-Reply-To: <6eee34430706061627p69a8080lf39168322926a769@mail.gmail.com>
References: <6eee34430706061627p69a8080lf39168322926a769@mail.gmail.com>
Message-ID: <20070607135943.GB27026@redhat.com>

On Wed, Jun 06, 2007 at 07:27:42PM -0400, nrbwpi at gmail.com wrote:
> Hello,
> 
> Installed RHEL5 on a new two node cluster with Shared FC storage.  The two
> shared storage boxes are each split into 6.9TB LUNs for a total of 4 -
> 6.9TBLUNS.  Each machine is connected via a single 100Mb connection to
> a switch
> and a single FC connection to a FC switch.
> 
> The 4 LUNs have LVM on them with GFS2.  The file systems are mountable from
> each box.  When performing a script dd write of zeros in 250MB file sizes to
> the file system from each box to different LUNS, one of the nodes in the
> cluster is fenced by the other one.  File size does not seem to matter.
> 
> My first guess at the problem was the heartbeat timeout in openais.  In the
> cluster.conf below I added the totem line to hopefully raise the timeout to
> 10 seconds.  This however did not resolve the problem.  Both boxes are
> running the latest updates as of 2 days ago from up2date.
> 
> Below is the cluster.conf and what is seen in the logs.  Any suggestions
> would be greatly appreciated.

Did you see whether the fenced node panicked or hung before it was
rebooted?  I'd try the same thing with GFS1 (everything stays the same,
just 'mkfs -t gfs ...' and 'mount -t gfs ...', and make sure you have the
gfs kernel module loaded.)

Dave



From pillai at mathstat.dal.ca  Thu Jun  7 14:07:23 2007
From: pillai at mathstat.dal.ca (Balagopal Pillai)
Date: Thu, 07 Jun 2007 11:07:23 -0300
Subject: [Linux-cluster] gnbd problem
Message-ID: <4668111B.1070207@mathstat.dal.ca>

Hi,

               I am running CentOS 4.4 with the cluster suite in a GNBD 
+ GFS solution. The dual onboard nics are bonded in alb on the client nodes
and the gnbd server has 4 nics bonded on alb. The GNBD server has a 
3ware 16 channel controller on raid 6. The network aggregate throughput 
is great and so is the
performance on GFS. This GFS installation is replacing my current Lustre 
installation.

               Here is the problem - On heavy load like copying lots of 
very big files or dd in a loop (from many hpc nodes simultaneously), i 
get the following error messages -

gnbd (pid 5296: cp) got signal 9
gnbd0: Send control failed (result -4)
gnbd0: Send data failed (result -104)
gnbd0: Receive control failed (result -32)
gnbd0: shutting down socket
exitting GNBD_DO_IT ioctl
resending requests
 
                 One time with iozone, the gfs mount froze on all nodes. 
But once i disabled the oopes ok option in mount, that problem seems to 
have gone away. Hopefully
if there is an oops, that node will panic and won't freeze gfs for the 
rest of the nodes. If i update the gnbd from 1.0.8 to 1.0.9, will it fix 
the gnbd error messages on heavy load?
The clients are using GNBD fencing. I am a bit concerned that the gnbd 
client when re-opening connection with gnbd server could cause 
corruption of data or freeze of gfs mount.
Is that a possibility? Since the other parts of cluster suite are 
working fine and it is just the gnbd client that is having problems, the 
gnbd fencing probably won't kick in.

                I have another related question. During  a power outage, 
the hpc nodes would shutdown in 5 minutes and only the master node and 
storage server would run on the ups with battery pack
for another hour or so. The master node re-exports the gfs mount via nfs 
to our infrastructure servers. In this scenario, when all the hpc nodes 
are down, the cluster loses quorum. Will the gfs mount on the master 
node freeze when the cluster loses quorum? If it does, is there a way 
around it, like maybe lots of votes for master node for example? In 
Lustre, this scenario is possible. I can have a single server with 
mounted lustre volumes still up with all the other nodes down due to a 
power outage. Thanks very much.

Balagopal Pillai



From Christopher.Barry at qlogic.com  Thu Jun  7 14:26:06 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Thu, 07 Jun 2007 10:26:06 -0400
Subject: [Linux-cluster] correct cvs branch to use w/ 2.6.19.7?
Message-ID: <1181226367.16333.14.camel@localhost>

Greetings List,

Can someone recommend the best kernel/cvs branch combo to compile? I'm
running kernel.org 2.6.19.7 on an otherwise stock RHEL4UD4 box, and
cannot get STABLE or HEAD to compile. I have followed the instructions
in usage.txt, installing openais and the latest udev (v112). I'd like to
keep this kernel if possible.

I would appreciate any help or suggestions. If required, I can throw you
a link to the compile-time errors I'm seeing and kernel .config.


-- 
Regards,
-C

Christopher Barry
Systems Engineer, Principal
Qlogic Corporation



From teigland at redhat.com  Thu Jun  7 14:36:30 2007
From: teigland at redhat.com (David Teigland)
Date: Thu, 7 Jun 2007 09:36:30 -0500
Subject: [Linux-cluster] correct cvs branch to use w/ 2.6.19.7?
In-Reply-To: <1181226367.16333.14.camel@localhost>
References: <1181226367.16333.14.camel@localhost>
Message-ID: <20070607143630.GC27026@redhat.com>

On Thu, Jun 07, 2007 at 10:26:06AM -0400, Christopher Barry wrote:
> Greetings List,
> 
> Can someone recommend the best kernel/cvs branch combo to compile? I'm
> running kernel.org 2.6.19.7 on an otherwise stock RHEL4UD4 box, and
> cannot get STABLE or HEAD to compile. I have followed the instructions
> in usage.txt, installing openais and the latest udev (v112). I'd like to
> keep this kernel if possible.

The RHEL4 branch is probably the closest you'll get.  You may still need
to do some hacking to get it to build depending how different your kernel
is from RHEL4's.

Also, we're talking about the older generation of cluster code here, so
you should follow this version of the usage:
  http://sources.redhat.com/cluster/doc/usage.txt

(which uses cman-kernel; openais is only relevant to the RHEL5 generation
of the code which would require a much newer kernel)

Dave



From christopher.barry at qlogic.com  Thu Jun  7 15:31:57 2007
From: christopher.barry at qlogic.com (Christopher Barry)
Date: Thu, 7 Jun 2007 10:31:57 -0500
Subject: [Linux-cluster] correct cvs branch to use w/ 2.6.19.7?
In-Reply-To: <20070607143630.GC27026@redhat.com>
References: <1181226367.16333.14.camel@localhost>
	<20070607143630.GC27026@redhat.com>
Message-ID: <D158540CCC0AB54C8FD4818F823CCB2438329F@EPEXCH1.qlogic.org>

 

> -----Original Message-----
> From: David Teigland [mailto:teigland at redhat.com] 
> Sent: Thursday, June 07, 2007 10:37 AM
> To: Christopher Barry
> Cc: Linux-cluster at redhat.com
> Subject: Re: [Linux-cluster] correct cvs branch to use w/ 2.6.19.7?
> 
> On Thu, Jun 07, 2007 at 10:26:06AM -0400, Christopher Barry wrote:
> > Greetings List,
> > 
> > Can someone recommend the best kernel/cvs branch combo to 
> compile? I'm
> > running kernel.org 2.6.19.7 on an otherwise stock RHEL4UD4 box, and
> > cannot get STABLE or HEAD to compile. I have followed the 
> instructions
> > in usage.txt, installing openais and the latest udev 
> (v112). I'd like to
> > keep this kernel if possible.
> 
> The RHEL4 branch is probably the closest you'll get.  You may 
> still need
> to do some hacking to get it to build depending how different 
> your kernel
> is from RHEL4's.
> 
> Also, we're talking about the older generation of cluster 
> code here, so
> you should follow this version of the usage:
>   http://sources.redhat.com/cluster/doc/usage.txt
> 
> (which uses cman-kernel; openais is only relevant to the 
> RHEL5 generation
> of the code which would require a much newer kernel)
> 
> Dave
> 
> 


Thanks Dave,


I'm not compiling against 2.6.9-42ELsmp, but kernel.org 2.6.19.7

Here's the results from HEAD and STABLE:

results from HEAD compile:


[root at localhost cluster]# ./configure
--kernel_src=/usr/src/kernels/linux-2.6.19.7

Configuring Makefiles for your system...

The following fence agents will be build on this system:
 xen rps10 ipmilan wti manual baytech bullpap apc scsi vmware xcat rsb
bladecenter xvm zvm vixel brocade rsa ibmblade sanbox2 rackswitch cpint
egenera mcdata ilo drac

NOTE: xvm will build only if --enable_xen has been specified

Completed Makefile configuration

[root at localhost cluster]#

[root at localhost cluster]# make install 2>&1 | tee ../HEAD-build.log
make -C gnbd-kernel/src all
make[1]: Entering directory
`/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src'
make -C /usr/src/kernels/linux-2.6.19.7
M=/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src
symverfile=/usr/src/kernels/linux-2.6.19.7/Module.symvers modules
USING_KBUILD=yes
make[2]: Entering directory `/usr/src/kernels/linux-2.6.19.7'
  CC [M]  /usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src/gnbd.o
/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src/gnbd.c: In function
`gnbd_ctl_ioctl':
/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src/gnbd.c:717: error: too few
arguments to function `invalidate_bdev'
make[3]: *** [/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src/gnbd.o]
Error 1
make[2]: *** [_module_/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src]
Error 2
make[2]: Leaving directory `/usr/src/kernels/linux-2.6.19.7'
make[1]: *** [all] Error 2
make[1]: Leaving directory
`/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src'
make: *** [gnbd-kernel] Error 2
[root at localhost cluster]#

HEAD gags almost immediately, but STABLE gets a lot further...


results from STABLE compile:

[root at localhost cluster]# ./configure
--kernel_src=/usr/src/kernels/linux-2.6.19.7
configure cman-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure dlm-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure gnbd-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure magma

Configuring Makefiles for your system...
Completed Makefile configuration

configure ccs

Configuring Makefiles for your system...
Completed Makefile configuration

configure cman

Configuring Makefiles for your system...
Completed Makefile configuration

configure dlm

Configuring Makefiles for your system...
Completed Makefile configuration

configure fence

Configuring Makefiles for your system...
Completed Makefile configuration

configure iddev

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs

Configuring Makefiles for your system...
Completed Makefile configuration

configure gnbd

Configuring Makefiles for your system...
Completed Makefile configuration

configure gulm

Configuring Makefiles for your system...
Completed Makefile configuration

configure magma-plugins

Configuring Makefiles for your system...
Completed Makefile configuration

configure rgmanager

Configuring Makefiles for your system...
Completed Makefile configuration

[root at localhost cluster]#

relevent errors from STABLE make:

[root at localhost STABLE]# egrep -A2 -B2 'Warning|[eE]rror'
STABLE-build.log
  CC [M]  /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.o
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.c: In function
`get_dummy_sb':
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.c:235: error:
structure has no member named `bd_mount_sem'
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.c:237: error:
structure has no member named `bd_mount_sem'
make[5]: ***
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.o] Error 1
make[4]: ***
[_module_/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs] Error 2
make[4]: Leaving directory `/usr/src/kernels/linux-2.6.19.7'
make[3]: *** [all] Error 2
make[3]: Leaving directory
`/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs'
make[2]: *** [install] Error 2
make[2]: Leaving directory
`/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/usr/src/cvs-src/STABLE/cluster/gfs-kernel'
make: *** [install] Error 2
[root at localhost STABLE]#



Any ideas?


Regards,
-C



From christopher.barry at qlogic.com  Thu Jun  7 15:44:27 2007
From: christopher.barry at qlogic.com (Christopher Barry)
Date: Thu, 7 Jun 2007 10:44:27 -0500
Subject: [Linux-cluster] correct cvs branch to use w/ 2.6.19.7? (more data)
In-Reply-To: <D158540CCC0AB54C8FD4818F823CCB2438329F@EPEXCH1.qlogic.org>
References: <1181226367.16333.14.camel@localhost><20070607143630.GC27026@redhat.com>
	<D158540CCC0AB54C8FD4818F823CCB2438329F@EPEXCH1.qlogic.org>
Message-ID: <D158540CCC0AB54C8FD4818F823CCB243832A4@EPEXCH1.qlogic.org>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Christopher Barry
> Sent: Thursday, June 07, 2007 11:32 AM
> To: Linux-cluster at redhat.com
> Subject: RE: [Linux-cluster] correct cvs branch to use w/ 2.6.19.7?
> 
>  
> 
> > -----Original Message-----
> > From: David Teigland [mailto:teigland at redhat.com] 
> > Sent: Thursday, June 07, 2007 10:37 AM
> > To: Christopher Barry
> > Cc: Linux-cluster at redhat.com
> > Subject: Re: [Linux-cluster] correct cvs branch to use w/ 2.6.19.7?
> > 
> > On Thu, Jun 07, 2007 at 10:26:06AM -0400, Christopher Barry wrote:
> > > Greetings List,
> > > 
> > > Can someone recommend the best kernel/cvs branch combo to 
> > compile? I'm
> > > running kernel.org 2.6.19.7 on an otherwise stock 
> RHEL4UD4 box, and
> > > cannot get STABLE or HEAD to compile. I have followed the 
> > instructions
> > > in usage.txt, installing openais and the latest udev 
> > (v112). I'd like to
> > > keep this kernel if possible.
> > 
> > The RHEL4 branch is probably the closest you'll get.  You may 
> > still need
> > to do some hacking to get it to build depending how different 
> > your kernel
> > is from RHEL4's.
> > 
> > Also, we're talking about the older generation of cluster 
> > code here, so
> > you should follow this version of the usage:
> >   http://sources.redhat.com/cluster/doc/usage.txt
> > 
> > (which uses cman-kernel; openais is only relevant to the 
> > RHEL5 generation
> > of the code which would require a much newer kernel)
> > 
> > Dave
> > 
> > 
> 
> 
> Thanks Dave,
> 
> 
> I'm not compiling against 2.6.9-42ELsmp, but kernel.org 2.6.19.7
> 
> Here's the results from HEAD and STABLE:
> 
> results from HEAD compile:
> 
> 
> [root at localhost cluster]# ./configure
> --kernel_src=/usr/src/kernels/linux-2.6.19.7
> 
> Configuring Makefiles for your system...
> 
> The following fence agents will be build on this system:
>  xen rps10 ipmilan wti manual baytech bullpap apc scsi vmware xcat rsb
> bladecenter xvm zvm vixel brocade rsa ibmblade sanbox2 
> rackswitch cpint
> egenera mcdata ilo drac
> 
> NOTE: xvm will build only if --enable_xen has been specified
> 
> Completed Makefile configuration
> 
> [root at localhost cluster]#
> 
> [root at localhost cluster]# make install 2>&1 | tee ../HEAD-build.log
> make -C gnbd-kernel/src all
> make[1]: Entering directory
> `/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src'
> make -C /usr/src/kernels/linux-2.6.19.7
> M=/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src
> symverfile=/usr/src/kernels/linux-2.6.19.7/Module.symvers modules
> USING_KBUILD=yes
> make[2]: Entering directory `/usr/src/kernels/linux-2.6.19.7'
>   CC [M]  /usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src/gnbd.o
> /usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src/gnbd.c: In function
> `gnbd_ctl_ioctl':
> /usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src/gnbd.c:717: 
> error: too few
> arguments to function `invalidate_bdev'
> make[3]: *** [/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src/gnbd.o]
> Error 1
> make[2]: *** [_module_/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src]
> Error 2
> make[2]: Leaving directory `/usr/src/kernels/linux-2.6.19.7'
> make[1]: *** [all] Error 2
> make[1]: Leaving directory
> `/usr/src/cvs-src/HEAD/cluster/gnbd-kernel/src'
> make: *** [gnbd-kernel] Error 2
> [root at localhost cluster]#
> 
> HEAD gags almost immediately, but STABLE gets a lot further...
> 
> 
> results from STABLE compile:
> 
> [root at localhost cluster]# ./configure
> --kernel_src=/usr/src/kernels/linux-2.6.19.7
> configure cman-kernel
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure dlm-kernel
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure gfs-kernel
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure gnbd-kernel
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure magma
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure ccs
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure cman
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure dlm
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure fence
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure iddev
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure gfs
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure gnbd
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure gulm
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure magma-plugins
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> configure rgmanager
> 
> Configuring Makefiles for your system...
> Completed Makefile configuration
> 
> [root at localhost cluster]#
> 
> relevent errors from STABLE make:
> 
> [root at localhost STABLE]# egrep -A2 -B2 'Warning|[eE]rror'
> STABLE-build.log
>   CC [M]  /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.o
> /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.c: 
> In function
> `get_dummy_sb':
> /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.c:23
> 5: error:
> structure has no member named `bd_mount_sem'
> /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.c:23
> 7: error:
> structure has no member named `bd_mount_sem'
> make[5]: ***
> [/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.o] Error 1
> make[4]: ***
> [_module_/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs] Error 2
> make[4]: Leaving directory `/usr/src/kernels/linux-2.6.19.7'
> make[3]: *** [all] Error 2
> make[3]: Leaving directory
> `/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs'
> make[2]: *** [install] Error 2
> make[2]: Leaving directory
> `/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src'
> make[1]: *** [install] Error 2
> make[1]: Leaving directory 
> `/usr/src/cvs-src/STABLE/cluster/gfs-kernel'
> make: *** [install] Error 2
> [root at localhost STABLE]#
> 
> 
> 
> Any ideas?
> 
> 
> Regards,
> -C
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

oops - my egrep actually missed a bunch of WARNINGS first go round:

[root at localhost STABLE]# egrep -A2 -B2 -i 'Warning|error'
STABLE-build.log
  Building modules, stage 2.
  MODPOST 1 modules
WARNING: "kcl_addref_cluster"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_get_node_by_addr"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_get_node_addresses"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_releaseref_cluster"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_get_current_interface"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_get_node_by_nodeid"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_leave_service"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_remove_callback"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_global_service_id"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_unregister_service"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_join_service"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_start_done"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_add_callback"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
WARNING: "kcl_register_service"
[/usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko] undefined!
  CC      /usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.mod.o
  LD [M]  /usr/src/cvs-src/STABLE/cluster/dlm-kernel/src/dlm.ko
--
  Building modules, stage 2.
  MODPOST 1 modules
WARNING: "lm_register_proto"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/nolock/lock_nolock.ko]
undefined!
WARNING: "lm_unregister_proto"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/nolock/lock_nolock.ko]
undefined!
  CC
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/nolock/lock_nolock.mod.o
  LD [M]
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/nolock/lock_nolock.ko
--
  CC [M]  /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/main.o
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/main.c: In function
`create_proc_entries':
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/main.c:230: warning:
'plock_get' might be used uninitialized in this function
  CC [M]  /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/mount.o
  CC [M]  /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/plock.o
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/plock.c: In function
`lm_dlm_plock_get':
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/plock.c:1247:
warning: passing arg 3 of `do_plock_get' makes integer from pointer
without a cast
  CC [M]  /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/thread.o
  LD [M]  /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.o
  Building modules, stage 2.
  MODPOST 1 modules
WARNING: "kcl_addref_cluster"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "kcl_get_services"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "kcl_cluster_name"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "kcl_releaseref_cluster"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "kcl_get_members"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "dlm_query"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "kcl_leave_service"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "lm_register_proto"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "lm_unregister_proto"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "kcl_unregister_service"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "kcl_join_service"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "kcl_start_done"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "kcl_register_service"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
WARNING: "dlm_debug_dump"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko]
undefined!
  CC
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.mod.o
  LD [M]  /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/dlm/lock_dlm.ko
--
  Building modules, stage 2.
  MODPOST 1 modules
WARNING: "lm_register_proto"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gulm/lock_gulm.ko]
undefined!
WARNING: "lm_unregister_proto"
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gulm/lock_gulm.ko]
undefined!
  CC
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gulm/lock_gulm.mod.o
  LD [M]
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gulm/lock_gulm.ko
--
  CC [M]  /usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.o
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.c: In function
`get_dummy_sb':
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.c:235: error:
structure has no member named `bd_mount_sem'
/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.c:237: error:
structure has no member named `bd_mount_sem'
make[5]: ***
[/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs/diaper.o] Error 1
make[4]: ***
[_module_/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs] Error 2
make[4]: Leaving directory `/usr/src/kernels/linux-2.6.19.7'
make[3]: *** [all] Error 2
make[3]: Leaving directory
`/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src/gfs'
make[2]: *** [install] Error 2
make[2]: Leaving directory
`/usr/src/cvs-src/STABLE/cluster/gfs-kernel/src'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/usr/src/cvs-src/STABLE/cluster/gfs-kernel'
make: *** [install] Error 2
[root at localhost STABLE]# 



From Paul.McDowell at celera.com  Thu Jun  7 16:05:56 2007
From: Paul.McDowell at celera.com (Paul n McDowell)
Date: Thu, 7 Jun 2007 12:05:56 -0400
Subject: [Linux-cluster] release notes for cluster suite U5
Message-ID: <OF2583BFC9.3B65A4DC-ON852572F3.00579C76-852572F3.00586E85@applera.com>

Hi all,

Can anyone send me the release notes for RHEL U5  Cluster Suite?    There 
is a document provided by Redhat  that has a resolved Issues section at 
the bottom with the info that I'm looking for but these two links appear 
to be dead.


Red Hat Cluster Suite 4 Update 5 Release Notes
https://www.redhat.com/docs/manuals/csgfs/release-notes/CS_4-RHEL4U5-relnotes.html

"
Resolved Issues
The following errata contain complete listings of issues resolved in Red 
Hat Cluster Suite 4 Update 5:

http://errata.devel.redhat.com/errata/showrequest.cgi?advisory=5211

http://errata.devel.redhat.com/errata/showrequest.cgi?advisory=5221
"

Any help appreciated as ever,

Regards,

Paul McDowell
Celera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070607/9e033b6e/attachment.htm>

From jimm at simutronics.com  Thu Jun  7 23:30:28 2007
From: jimm at simutronics.com (James Miller)
Date: Thu, 7 Jun 2007 18:30:28 -0500
Subject: [Linux-cluster] Cluster snapshots
Message-ID: <08e201c7a95b$d51a7ef0$5dd810d1@e3demo>

Hello everyone,

What is the status of the Cluster Snapshot project? 

Lvm2/clmvd doesn't support cluster snapshots (at least as of 2.0.24 that I'm
using).



Thanks,
Jim



James Miller - MCSE RHCE CISSP
Sr Systems & Network Administrator
Simutronics Corp.
www.play.net
636.946.4263 x113 



From cluster at defuturo.co.uk  Fri Jun  8 11:51:19 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Fri, 08 Jun 2007 12:51:19 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
Message-ID: <1181303479.26426.63.camel@localhost.localdomain>

I've upgraded a test cluster to RHEL4U5, but I'm having some problems on
boot. ccsd and cman seem to be having a small disagreement over whether
the cluster is quorate meaning fenced fails to start up.

First, here is a normal boot sequence from another cluster running 4U4:

May 31 10:19:31 node04 ccsd[2503]: Starting ccsd 1.0.7: 
May 31 10:19:31 node04 ccsd[2503]:  Built: Aug 25 2006 15:00:06 
May 31 10:19:31 node04 ccsd[2503]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
May 31 10:19:31 node04 ccsd:  succeeded
May 31 10:19:31 node04 kernel: CMAN 2.6.9-45.15 (built Mar 27 2007 22:56:03) installed
May 31 10:19:31 node04 kernel: NET: Registered protocol family 30
May 31 10:19:31 node04 kernel: DLM 2.6.9-44.9 (built Mar 27 2007 23:00:18) installed
May 31 10:19:31 node04 ccsd[2503]: cluster.conf (cluster name = cluster1, version = 2) found. 
May 31 10:19:32 node04 kernel: CMAN: Waiting to join or form a Linux-cluster
May 31 10:19:32 node04 last message repeated 3 times
May 31 10:19:32 node04 ccsd[2503]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.1 
May 31 10:19:32 node04 ccsd[2503]: Initial status:: Inquorate 
May 31 10:19:34 node04 kernel: CMAN: sending membership request
May 31 10:19:34 node04 last message repeated 3 times
May 31 10:19:34 node04 last message repeated 3 times
May 31 10:19:34 node04 last message repeated 5 times
May 31 10:19:34 node04 last message repeated 5 times
May 31 10:19:35 node04 kernel: CMAN: got node node05
May 31 10:19:35 node04 kernel: CMAN: got node node06
May 31 10:19:35 node04 kernel: CMAN: got node node03
May 31 10:19:35 node04 kernel: CMAN: got node node01
May 31 10:19:35 node04 kernel: CMAN: got node node02
May 31 10:19:35 node04 ccsd[2503]: Cluster is quorate.  Allowing connections. 
May 31 10:19:35 node04 kernel: CMAN: quorum regained, resuming activity
May 31 10:19:35 node04 cman: startup succeeded
May 31 10:19:38 node04 defuturo: fenced succeeded


Now, some logs for the 4U5 cluster:

Jun  8 12:40:26 tamarillo ccsd[2448]: Starting ccsd 1.0.10: 
Jun  8 12:40:26 tamarillo ccsd[2448]:  Built: May 31 2007 15:48:09 
Jun  8 12:40:26 tamarillo ccsd[2448]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
Jun  8 12:40:26 tamarillo ccsd:  succeeded
Jun  8 12:40:26 tamarillo kernel: CMAN 2.6.9-50.2 (built May 31 2007 15:39:24) installed
Jun  8 12:40:26 tamarillo kernel: NET: Registered protocol family 30
Jun  8 12:40:26 tamarillo kernel: DLM 2.6.9-46.16 (built May 31 2007 15:45:51) installed
Jun  8 12:40:26 tamarillo ccsd[2448]: cluster.conf (cluster name = defuturo_test, version = 2) found. 
Jun  8 12:40:27 tamarillo kernel: CMAN: Waiting to join or form a Linux-cluster
Jun  8 12:40:27 tamarillo kernel: CMAN: sending membership request
Jun  8 12:40:28 tamarillo kernel: CMAN: got node guava
Jun  8 12:40:28 tamarillo kernel: CMAN: got node kiwano
Jun  8 12:40:28 tamarillo kernel: CMAN: quorum regained, resuming activity
Jun  8 12:40:28 tamarillo cman: startup succeeded
Jun  8 12:40:30 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:30 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:31 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:31 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:32 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:32 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:33 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:33 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:34 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:34 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:35 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:35 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:36 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:36 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:37 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:37 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:38 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:38 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:39 tamarillo ccsd[2448]: Cluster is not quorate.  Refusing connection. 
Jun  8 12:40:39 tamarillo ccsd[2448]: Error while processing connect: Connection refused 
Jun  8 12:40:40 tamarillo defuturo: fenced failed
Jun  8 12:40:40 tamarillo ccsd[2448]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4 
Jun  8 12:40:40 tamarillo ccsd[2448]: Initial status:: Quorate 


This error happens on most boots, but not all, so I suspect a race
condition. By the time I can log into the node, it's quorate and I can
start up fenced manually. I've put in some debugging and verified
that /proc/cluster/status lists the cluster as quorate immediately
before and after attempting to start fenced.

There are a couple of things of note about our set-up:

1) We're not using fence from 4U5 because of bz241217, so
fence-1.32.25-1 is installed on both clusters.

2) fenced is being started in a chroot jail by our own script which
runs:

    /usr/sbin/chroot /mnt/fenced /sbin/fence_tool -t 0 join -w

The output from that command is:

fence_tool: cannot connect to ccs -111

fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111
fence_tool: waiting for ccs connection -111


3) /etc/sysconfig/cluster sets CMAN_CLUSTER_TIMEOUT=0 and
CMAN_QUORUM_TIMEOUT=86400.

  Does anyone know what might cause ccsd to continue to refuse
connections for a lack of quorum after cman has decided the cluster is
quorate?

	Robert



From pcaulfie at redhat.com  Fri Jun  8 12:04:34 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 08 Jun 2007 13:04:34 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <1181303479.26426.63.camel@localhost.localdomain>
References: <1181303479.26426.63.camel@localhost.localdomain>
Message-ID: <466945D2.10000@redhat.com>

Robert Clark wrote:
> I've upgraded a test cluster to RHEL4U5, but I'm having some problems on
> boot. ccsd and cman seem to be having a small disagreement over whether
> the cluster is quorate meaning fenced fails to start up.
> 


>   Does anyone know what might cause ccsd to continue to refuse
> connections for a lack of quorum after cman has decided the cluster is
> quorate?
> 

The usual cause of this is the magma plugins either not being installed in the
right place or even at all. "magma_tool list" will show you which plugins are
installed, for CMAN you need the magma_sm.so plugin.

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From maciej.bogucki at artegence.com  Fri Jun  8 13:11:42 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 08 Jun 2007 15:11:42 +0200
Subject: [Linux-cluster] how to GFS mirroring? GFS+CLVM+GNBD? GFS+cmirror?
	GFS+DRBD? GFS+ddraid?
In-Reply-To: <913f0e8da235191990a1b737dab0eb81@localhost>
References: <913f0e8da235191990a1b737dab0eb81@localhost>
Message-ID: <4669558E.9060800@artegence.com>

> drbd *seems* not being exactly what i want to accomplish 
> limited 2 nodes, 

You can buy drbd+ - http://www.linbit.com/en/drbd/drbd-plus/ which allow
to build 3 node cluster.

> one node write, etc)

You could use drbd 0.8 which support two node write, and with iSCSI(or
GNBD) You could build GFS HA storage .

Best Regards
Maciej Bogucki



From maciej.bogucki at artegence.com  Fri Jun  8 13:18:47 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 08 Jun 2007 15:18:47 +0200
Subject: [Linux-cluster] fence_scsi on Redhat4U5
In-Reply-To: <742069.12378.qm@web26514.mail.ukl.yahoo.com>
References: <742069.12378.qm@web26514.mail.ukl.yahoo.com>
Message-ID: <46695737.2040501@artegence.com>

> I would like to know whether there is a plan to integrate scsi-fencing
> in redhat4U5 ?
> 
> Thanks in advance.

Hello,

I still doesn't have time to test it, but I suppose that it would work
with RH4U5. Download new cman fe. cman-2.0.60, extract
/etc/rc.d/init.d/scsi_reserve, /sbin/fence_scsi, /sbin/fence_scsi_test
and configure fencing in cluster.conf

Best Regards
Maciej Bogucki



From maciej.bogucki at artegence.com  Fri Jun  8 13:35:41 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 08 Jun 2007 15:35:41 +0200
Subject: [Linux-cluster] Linux Cluster Questions
In-Reply-To: <03fa01c7a474$e73d62f0$5dd810d1@e3demo>
References: <03fa01c7a474$e73d62f0$5dd810d1@e3demo>
Message-ID: <46695B2D.8060806@artegence.com>


> 2. fencing -- Ick!!! I don't have any of the fencing agents -- such as an
> APC MasterSwitch, IBM Blade center, Brocade McData, Sanbox/Qlogic, or Vixel
> fibre switches, Egenera blades, HP ILO ... that leaves me with GNBD and
> Manual/Manual_ack.  From the man pages it's suggested I not use
> Manual/Manual_ack for production.  I really don't need automatic re-adding
> of an node once it comes back up, what issues will I run into using manual
> fencing?

Hello,

In production You need automatic fencing, because when one node goes
down, Your GFS filesystem will be freezed(no ro/rw operations permited)
until You perform manual fencing. You could try to use new fence_scsi
agent, but You need to check if Your storage support Persistent
Reservations (SPC-2 or greater).

Best Regards
Maciej Bogucki



From maciej.bogucki at artegence.com  Fri Jun  8 13:55:47 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 08 Jun 2007 15:55:47 +0200
Subject: [Linux-cluster] fence_scsi on Redhat4U5
In-Reply-To: <46695737.2040501@artegence.com>
References: <742069.12378.qm@web26514.mail.ukl.yahoo.com>
	<46695737.2040501@artegence.com>
Message-ID: <46695FE3.7050101@artegence.com>

Maciej Bogucki napisa?(a):
>> I would like to know whether there is a plan to integrate scsi-fencing
>> in redhat4U5 ?
>>
>> Thanks in advance.
> 
> Hello,
> 
> I still doesn't have time to test it, but I suppose that it would work
> with RH4U5. Download new cman fe. cman-2.0.60, extract
> /etc/rc.d/init.d/scsi_reserve, /sbin/fence_scsi, /sbin/fence_scsi_test
> and configure fencing in cluster.conf
> 
> Best Regards
> Maciej Bogucki
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

Hello,

>From URL
https://www.redhat.com/docs/manuals/csgfs/release-notes/CS_4-RHEL4U5-relnotes.html

"SCSI reservations are now supported. Note that the SCSI reservation
feature does not work with dm-multipath."

Best Regards
Maciej Bogucki



From simanhew at yahoo.com  Fri Jun  8 14:24:55 2007
From: simanhew at yahoo.com (Siman Hew)
Date: Fri, 8 Jun 2007 07:24:55 -0700 (PDT)
Subject: [Linux-cluster] Multicast IP Address used in RHEL5 Cluster
Message-ID: <459379.77416.qm@web50106.mail.re2.yahoo.com>

Hi all,

In RHEL 5 cluster, by default, it uses multicast
method.  Just wonder what range I should specify in
cluster.conf, 239.192.000.000-239.251.255.255?
What is default one?  How cluster choose it if I do
not specify? 
Looks like it is kind of random number chose when a
cluster is started.
Any information please?

Thanks a lot. 


      ____________________________________________________________________________________
Fussy? Opinionated? Impossible to please? Perfect.  Join Yahoo!'s user panel and lay it on us. http://surveylink.yahoo.com/gmrs/yahoo_panel_invite.asp?a=7 



From pcaulfie at redhat.com  Fri Jun  8 14:38:42 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 08 Jun 2007 15:38:42 +0100
Subject: [Linux-cluster] Multicast IP Address used in RHEL5 Cluster
In-Reply-To: <459379.77416.qm@web50106.mail.re2.yahoo.com>
References: <459379.77416.qm@web50106.mail.re2.yahoo.com>
Message-ID: <466969F2.1060107@redhat.com>

Siman Hew wrote:
> Hi all,
> 
> In RHEL 5 cluster, by default, it uses multicast
> method.  Just wonder what range I should specify in
> cluster.conf, 239.192.000.000-239.251.255.255?
> What is default one?  How cluster choose it if I do
> not specify? 
> Looks like it is kind of random number chose when a
> cluster is started.


By default cman makes up a multicast address from the cluster ID, adding it to
the low bytes of 239.192.0.0 (for IPv4) or FF15:: (for IPv6). Those are
recommended internal multicast addresses.

-- 
Patrick



From cluster at defuturo.co.uk  Fri Jun  8 15:20:13 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Fri, 08 Jun 2007 16:20:13 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <466945D2.10000@redhat.com>
References: <1181303479.26426.63.camel@localhost.localdomain>
	<466945D2.10000@redhat.com>
Message-ID: <1181316013.26426.82.camel@localhost.localdomain>

  Whoops - my reply didn't make it onto the list because I sent it from
the wrong address. Let me try again:

On Fri, 2007-06-08 at 13:04 +0100, Patrick Caulfield wrote:
> Robert Clark wrote:

> > I've upgraded a test cluster to RHEL4U5, but I'm having some problems on
> > boot. ccsd and cman seem to be having a small disagreement over whether
> > the cluster is quorate meaning fenced fails to start up.

> >   Does anyone know what might cause ccsd to continue to refuse
> > connections for a lack of quorum after cman has decided the cluster is
> > quorate?

> The usual cause of this is the magma plugins either not being installed in the
> right place or even at all. "magma_tool list" will show you which plugins are
> installed, for CMAN you need the magma_sm.so plugin.

  Thanks for the quick reply. I put "magma_tool list" into the script
just before and after trying to start fenced. The output both times is:

Magma: Checking plugins in /lib/magma

File            Status  Message
----            ------  -------
magma_gulm.so   [OK]    GuLM Plugin v1.0.5
magma_sm.so     [OK]    CMAN/SM Plugin v1.1.7.4

Magma: 2 plugins available

  When I added "magma_tool quorum" as well, it reported "Connect
failure: No cluster running?".

        Robert



From chiranthlk at yahoo.com  Sat Jun  9 04:16:30 2007
From: chiranthlk at yahoo.com (chirantha pitigala)
Date: Fri, 8 Jun 2007 21:16:30 -0700 (PDT)
Subject: [Linux-cluster] Relocate of service if restart fails
Message-ID: <57335.85508.qm@web52302.mail.re2.yahoo.com>

Hi all,

In RHCS 4, if service recovery policy is set to restart, in which circumstances relocate will happen? In my setup I can see it only happens when a node fails.

thanks and regards

chirantha



      ___________________________________________________________________________________
You snooze, you lose. Get messages ASAP with AutoCheck
in the all-new Yahoo! Mail Beta.
http://advision.webevents.yahoo.com/mailbeta/newmail_html.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070608/056b89c1/attachment.htm>

From chiranthlk at yahoo.com  Sat Jun  9 04:19:36 2007
From: chiranthlk at yahoo.com (chirantha pitigala)
Date: Fri, 8 Jun 2007 21:19:36 -0700 (PDT)
Subject: [Linux-cluster] Defining dependency of services
Message-ID: <297717.56421.qm@web52312.mail.re2.yahoo.com>

Hi all,

In RHCS 4, is there any way to define dependency or order of services? As I can see services are started in the order of defined in the cluster.conf. What is the exact order? Can we add dependency between them?

thanks and regards

chirantha




      ___________________________________________________________________________________
You snooze, you lose. Get messages ASAP with AutoCheck
in the all-new Yahoo! Mail Beta.
http://advision.webevents.yahoo.com/mailbeta/newmail_html.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070608/5bfe6434/attachment.htm>

From orkcu at yahoo.com  Sat Jun  9 13:03:52 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Sat, 9 Jun 2007 06:03:52 -0700 (PDT)
Subject: [Linux-cluster] Defining dependency of services
In-Reply-To: <297717.56421.qm@web52312.mail.re2.yahoo.com>
Message-ID: <968727.49425.qm@web50610.mail.re2.yahoo.com>
--- chirantha pitigala <chiranthlk at yahoo.com> wrote:

> Hi all,
> 
> In RHCS 4, is there any way to define dependency or
> order of services? As I can see services are started
> in the order of defined in the cluster.conf. What is
> the exact order? Can we add dependency between them?
> 
I do the dependency by creating one child of another,
the child depend of success "start" of the parent
resource

you was talking about dependency betwenn services or
between resource inside services?

cu
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


       
____________________________________________________________________________________
Pinpoint customers who are looking for what you sell. 
http://searchmarketing.yahoo.com/



From chiranthlk at yahoo.com  Sat Jun  9 13:14:30 2007
From: chiranthlk at yahoo.com (chirantha pitigala)
Date: Sat, 9 Jun 2007 06:14:30 -0700 (PDT)
Subject: [Linux-cluster] Defining dependency of services
Message-ID: <870798.3019.qm@web52311.mail.re2.yahoo.com>

Hi roger,

What I need is dependency between services. e.g: service 1 depends on successful start of service 2.
Is it possible to do by making parent-child relationship between script resources of two services? 

thanks 

chirantha

----- Original Message ----
From: Roger Pe?a <orkcu at yahoo.com>
To: linux clustering <linux-cluster at redhat.com>
Sent: Saturday, June 9, 2007 9:03:52 PM
Subject: Re: [Linux-cluster] Defining dependency of services


--- chirantha pitigala <chiranthlk at yahoo.com> wrote:

> Hi all,
> 
> In RHCS 4, is there any way to define dependency or
> order of services? As I can see services are started
> in the order of defined in the cluster.conf. What is
> the exact order? Can we add dependency between them?
> 
I do the dependency by creating one child of another,
the child depend of success "start" of the parent
resource

you was talking about dependency betwenn services or
between resource inside services?

cu
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


       
____________________________________________________________________________________
Pinpoint customers who are looking for what you sell. 
http://searchmarketing.yahoo.com/

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster







       
____________________________________________________________________________________
Pinpoint customers who are looking for what you sell. 
http://searchmarketing.yahoo.com/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070609/a777e096/attachment.htm>

From orkcu at yahoo.com  Sun Jun 10 04:22:19 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Sat, 9 Jun 2007 21:22:19 -0700 (PDT)
Subject: [Linux-cluster] Defining dependency of services
In-Reply-To: <870798.3019.qm@web52311.mail.re2.yahoo.com>
Message-ID: <147317.77012.qm@web50611.mail.re2.yahoo.com>


--- chirantha pitigala <chiranthlk at yahoo.com> wrote:

> Hi roger,
> 
> What I need is dependency between services. e.g:
> service 1 depends on successful start of service 2.
> Is it possible to do by making parent-child
> relationship between script resources of two
> services? 
What I try to do is think of service as "cluster
services", a service that a cluster will bring, not a
unix service (httpd or ftp)

if I do it in that way, nothing will stop me of having
two unix services as resource of a cluster service,
more important if this two unix services depend
between them.
so I _guess_ you can define an init.d script as a
child  resoucer of another init.d script resource, I
_guess_ you will have a change that things works with
start-stop actions, but I don't know about
check-status action ...

having said that I also have to say that I never had
done something like this :-) : one service depending
of another 


> ----- Original Message ----
> From: Roger Pe?a <orkcu at yahoo.com>
> --- chirantha pitigala <chiranthlk at yahoo.com> wrote:
> 
> > Hi all,
> > 
> > In RHCS 4, is there any way to define dependency
> or
> > order of services? As I can see services are
> started
> > in the order of defined in the cluster.conf. What
> is
> > the exact order? Can we add dependency between
> them?
> > 
> I do the dependency by creating one child of
> another,
> the child depend of success "start" of the parent
> resource
> 
> you was talking about dependency betwenn services or
> between resource inside services?
> 

cu
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


       
____________________________________________________________________________________
Take the Internet to Go: Yahoo!Go puts the Internet in your pocket: mail, news, photos & more. 
http://mobile.yahoo.com/go?refer=1GNXIC



From andremachado at techforce.com.br  Sun Jun 10 19:06:06 2007
From: andremachado at techforce.com.br (andremachado)
Date: Sun, 10 Jun 2007 12:06:06 -0700
Subject: [Linux-cluster] Red Hat cluster suite in Debian GNU / Linux 4.0 Etch
Message-ID: <17dca6074f69c7aadae9f0098245cc0f@localhost>

Hello,

The Red Hat cluster suite packages in Debian 4.0 Etch are partially
broken.

See how to patch them to get your cluster up and running at [0].
You need to have your build toolchain installed for repackaging the
source files.



Regards.
Andre Felipe Machado


[0]
http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch






Em Ter, 2007-06-05 ?s 13:14 -0500, James Miller escreveu:
> Hello everyone,
> 
> I'm setting up a cluster and was hoping to get some insight into trouble I'm
> having with running clvmd. My cluster seems to be in a good state (2 nodes).
> I using Debian Etch, 2.6.18 kernel, and redhat-cluster-modules for 2.6.18.





From andremachado at techforce.com.br  Sun Jun 10 19:08:42 2007
From: andremachado at techforce.com.br (andremachado)
Date: Sun, 10 Jun 2007 12:08:42 -0700
Subject: [Linux-cluster] GFS over GNBD freezing problems
Message-ID: <8577042bf3e2aec48e89aec96da2e09d@localhost>

Hello,
I am trying to spot the cause of a seemingly random freezing of GFS over
GNBD.
Configured 3 gnbd servers. Only one actively mounted.
A fourth machine is the mounting point.
When writing files, many times it hard freezes the filesystem. 
Even using gnbd_import -RO cannot release the resource.
Only resetting and gfs_fsck after, with many fs corruption.
Tried to gnbd_export -t 600 from the gnbd server.
Tried an exhagerated clvmd -t 1800 roundtrip.
Tried to export raw partition /dev/hda3 (faster).
Tried to export logical volume (slower).

I discovered some interesting things.
If gfs is mounted at the local machine, no problems, even with big
files.
The problems arise when gfs over gnbd.
If the writing is SLOW, no problems.
Used an internet download throttled by wget.
But if fast writing from disk, or from CD or from another lan machine,
the filesystem hangs.
Also, if writing from disk, but from directories not previously READ,
then it is much more *likely* to succeed.
But if previously copied / read to another directory, and THEN writing
to gfs mounting point, the problem is very likely to appear at the
beginning.
I saw that a 150 MB file previously copied to another directory was
"presumably" written to gfs in a snap, then when writing the next small
files, the filesystem hanged.
gnbd_export without cache, as per documentation.
It seems that cached files are too fast for gfs.
The whole cluster is using regular tcp ethernet at the same segment.
There is no private net for the cluster.
It seems that the gfs is missing some configuration to be more slow
connection or lag tolerant.
But the timeout configurations I tried had no effect.
It seems there is no sync mount option for gfs_mount.
What configuration am I forgetting?

Regards.
Andre Felipe Machado



[0]
http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch




From cluster at defuturo.co.uk  Sun Jun 10 19:53:45 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Sun, 10 Jun 2007 20:53:45 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <1181316013.26426.82.camel@localhost.localdomain>
References: <1181303479.26426.63.camel@localhost.localdomain>
	<466945D2.10000@redhat.com>
	<1181316013.26426.82.camel@localhost.localdomain>
Message-ID: <1181505225.14777.25.camel@localhost.localdomain>

On Fri, 2007-06-08 at 16:20 +0100, Robert Clark wrote:
> On Fri, 2007-06-08 at 13:04 +0100, Patrick Caulfield wrote:
> > Robert Clark wrote:

> > >   Does anyone know what might cause ccsd to continue to refuse
> > > connections for a lack of quorum after cman has decided the cluster is
> > > quorate?
> 
> > The usual cause of this is the magma plugins either not being installed in the
> > right place or even at all. "magma_tool list" will show you which plugins are
> > installed, for CMAN you need the magma_sm.so plugin.
> 
>   Thanks for the quick reply. I put "magma_tool list" into the script
> just before and after trying to start fenced. The output both times is:
> 
> Magma: Checking plugins in /lib/magma
> 
> File            Status  Message
> ----            ------  -------
> magma_gulm.so   [OK]    GuLM Plugin v1.0.5
> magma_sm.so     [OK]    CMAN/SM Plugin v1.1.7.4
> 
> Magma: 2 plugins available
> 
>   When I added "magma_tool quorum" as well, it reported "Connect
> failure: No cluster running?".

  I've managed to get an strace of ccsd during the boot and it turned up
some interesting lines, which I've interspersed with selected log
entries:

Jun  8 22:20:27 localhost ccsd[2981]: Starting ccsd 1.0.10: 
Jun  8 22:20:27 localhost kernel: CMAN 2.6.9-50.2 (built May 31 2007 15:39:24) installed
Jun  8 22:20:27 localhost kernel: NET: Registered protocol family 30
Jun  8 22:20:27 localhost ccsd[2981]:  Built: May 31 2007 15:48:09 
Jun  8 22:20:27 localhost ccsd[2981]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
Jun  8 22:20:27 localhost kernel: DLM 2.6.9-46.16 (built May 31 2007 15:45:51) installed
Jun  8 22:20:28 localhost ccsd[2981]: cluster.conf (cluster name = defuturo_test, version = 2) found. 
Jun  8 22:20:28 localhost kernel: CMAN: Waiting to join or form a Linux-cluster
2990  22:20:28 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
Jun  8 22:20:29 localhost kernel: CMAN: sending membership request
2990  22:20:29 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
Jun  8 22:20:30 localhost kernel: CMAN: got node tamarillo
Jun  8 22:20:30 localhost kernel: CMAN: got node guava
Jun  8 22:20:30 localhost kernel: CMAN: quorum regained, resuming activity
Jun  8 22:20:30 localhost cman: startup succeeded
2990  22:20:30 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:31 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:32 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:33 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:34 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:35 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:36 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:37 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:38 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:39 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:40 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:41 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
2990  22:20:42 stat64("/dev/dlm-control", {st_mode=S_IFCHR|0600, st_rdev=makedev(10, 62), ...}) = 0
Jun  8 22:20:42 kiwano ccsd[2981]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4
Jun  8 22:20:42 kiwano ccsd[2981]: Initial status:: Quorate

So, it looks like the problem is that the appearance of /dev/dlm-control
is being delayed in the 4U5 cluster.

	Robert



From mkathuria at tuxtechnologies.co.in  Mon Jun 11 03:10:24 2007
From: mkathuria at tuxtechnologies.co.in (Manish Kathuria)
Date: Mon, 11 Jun 2007 08:40:24 +0530
Subject: [Linux-cluster] Problems with Cluster
Message-ID: <1df4abe60706102010g7808af27wf40901bd20618407@mail.gmail.com>

Hello,

We have configured an active-active two node cluster accessing a
shared storage. We are using Red Hat Cluster Suite on RHEL  4Update 4.
The nodes are HP Servers having ILOs as the fence devices. I am facing
certain problems in the setup and would seek suggestions.

While booting up the systems, many a times one node forces the other
one to reboot for no reason. How do I prevent that ?

We want the failover to happen when the power supply fails to either
of the nodes. In order to test the scenario, we removed the power
cables from one of the nodes. However the failover did not happen and
upon observing the logs we found that the alive node could not connect
to the fence device (ILO in this case) of the dead node since it was
powered off and the fencing could not take place. Does this mean that
we would not be able to have a failover in case of power failure for
one of the nodes. Is there a way we can do it ? How is the cluster
supposed to react when the ILO itself is powered off ?

Another failover test was conducted by removing the network
connectivity on of one of the nodes. Failover did happen smoothly in
the dead node was rebooted by the fence device. However when the dead
node was reconnected to the network, it fenced and rebooted the alive
node. How to deal with this ?

Thanks,

Manish



From pcaulfie at redhat.com  Mon Jun 11 07:41:09 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 11 Jun 2007 08:41:09 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <1181505225.14777.25.camel@localhost.localdomain>
References: <1181303479.26426.63.camel@localhost.localdomain>	<466945D2.10000@redhat.com>	<1181316013.26426.82.camel@localhost.localdomain>
	<1181505225.14777.25.camel@localhost.localdomain>
Message-ID: <466CFC95.9060804@redhat.com>

Robert Clark wrote:
> On Fri, 2007-06-08 at 16:20 +0100, Robert Clark wrote:
>> On Fri, 2007-06-08 at 13:04 +0100, Patrick Caulfield wrote:
>>> Robert Clark wrote:
> 
>>>>   Does anyone know what might cause ccsd to continue to refuse
>>>> connections for a lack of quorum after cman has decided the cluster is
>>>> quorate?
>>> The usual cause of this is the magma plugins either not being installed in the
>>> right place or even at all. "magma_tool list" will show you which plugins are
>>> installed, for CMAN you need the magma_sm.so plugin.
>>   Thanks for the quick reply. I put "magma_tool list" into the script
>> just before and after trying to start fenced. The output both times is:
>>
>> Magma: Checking plugins in /lib/magma
>>
>> File            Status  Message
>> ----            ------  -------
>> magma_gulm.so   [OK]    GuLM Plugin v1.0.5
>> magma_sm.so     [OK]    CMAN/SM Plugin v1.1.7.4
>>
>> Magma: 2 plugins available
>>
>>   When I added "magma_tool quorum" as well, it reported "Connect
>> failure: No cluster running?".
> 
>   I've managed to get an strace of ccsd during the boot and it turned up
> some interesting lines, which I've interspersed with selected log
> entries:
> 
> Jun  8 22:20:27 localhost ccsd[2981]: Starting ccsd 1.0.10: 
> Jun  8 22:20:27 localhost kernel: CMAN 2.6.9-50.2 (built May 31 2007 15:39:24) installed
> Jun  8 22:20:27 localhost kernel: NET: Registered protocol family 30
> Jun  8 22:20:27 localhost ccsd[2981]:  Built: May 31 2007 15:48:09 
> Jun  8 22:20:27 localhost ccsd[2981]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
> Jun  8 22:20:27 localhost kernel: DLM 2.6.9-46.16 (built May 31 2007 15:45:51) installed
> Jun  8 22:20:28 localhost ccsd[2981]: cluster.conf (cluster name = defuturo_test, version = 2) found. 
> Jun  8 22:20:28 localhost kernel: CMAN: Waiting to join or form a Linux-cluster
> 2990  22:20:28 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> Jun  8 22:20:29 localhost kernel: CMAN: sending membership request
> 2990  22:20:29 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> Jun  8 22:20:30 localhost kernel: CMAN: got node tamarillo
> Jun  8 22:20:30 localhost kernel: CMAN: got node guava
> Jun  8 22:20:30 localhost kernel: CMAN: quorum regained, resuming activity
> Jun  8 22:20:30 localhost cman: startup succeeded
> 2990  22:20:30 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:31 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:32 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:33 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:34 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:35 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:36 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:37 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:38 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:39 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:40 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:41 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 2990  22:20:42 stat64("/dev/dlm-control", {st_mode=S_IFCHR|0600, st_rdev=makedev(10, 62), ...}) = 0
> Jun  8 22:20:42 kiwano ccsd[2981]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4
> Jun  8 22:20:42 kiwano ccsd[2981]: Initial status:: Quorate
> 
> So, it looks like the problem is that the appearance of /dev/dlm-control
> is being delayed in the 4U5 cluster.


That's wrong - it should be created in /dev/misc/dlm-control - maybe you are
missing the udev config file that puts in the right place ? you should have
/etc/udev/rules.d/51-dlm.rules file that does this.

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From pcaulfie at redhat.com  Mon Jun 11 07:53:26 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 11 Jun 2007 08:53:26 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <466CFC95.9060804@redhat.com>
References: <1181303479.26426.63.camel@localhost.localdomain>	<466945D2.10000@redhat.com>	<1181316013.26426.82.camel@localhost.localdomain>	<1181505225.14777.25.camel@localhost.localdomain>
	<466CFC95.9060804@redhat.com>
Message-ID: <466CFF76.8050703@redhat.com>

Patrick Caulfield wrote:
> Robert Clark wrote:
>> On Fri, 2007-06-08 at 16:20 +0100, Robert Clark wrote:
>>> On Fri, 2007-06-08 at 13:04 +0100, Patrick Caulfield wrote:
>>>> Robert Clark wrote:
>>>>>   Does anyone know what might cause ccsd to continue to refuse
>>>>> connections for a lack of quorum after cman has decided the cluster is
>>>>> quorate?
>>>> The usual cause of this is the magma plugins either not being installed in the
>>>> right place or even at all. "magma_tool list" will show you which plugins are
>>>> installed, for CMAN you need the magma_sm.so plugin.
>>>   Thanks for the quick reply. I put "magma_tool list" into the script
>>> just before and after trying to start fenced. The output both times is:
>>>
>>> Magma: Checking plugins in /lib/magma
>>>
>>> File            Status  Message
>>> ----            ------  -------
>>> magma_gulm.so   [OK]    GuLM Plugin v1.0.5
>>> magma_sm.so     [OK]    CMAN/SM Plugin v1.1.7.4
>>>
>>> Magma: 2 plugins available
>>>
>>>   When I added "magma_tool quorum" as well, it reported "Connect
>>> failure: No cluster running?".
>>   I've managed to get an strace of ccsd during the boot and it turned up
>> some interesting lines, which I've interspersed with selected log
>> entries:
>>
>> Jun  8 22:20:27 localhost ccsd[2981]: Starting ccsd 1.0.10: 
>> Jun  8 22:20:27 localhost kernel: CMAN 2.6.9-50.2 (built May 31 2007 15:39:24) installed
>> Jun  8 22:20:27 localhost kernel: NET: Registered protocol family 30
>> Jun  8 22:20:27 localhost ccsd[2981]:  Built: May 31 2007 15:48:09 
>> Jun  8 22:20:27 localhost ccsd[2981]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
>> Jun  8 22:20:27 localhost kernel: DLM 2.6.9-46.16 (built May 31 2007 15:45:51) installed
>> Jun  8 22:20:28 localhost ccsd[2981]: cluster.conf (cluster name = defuturo_test, version = 2) found. 
>> Jun  8 22:20:28 localhost kernel: CMAN: Waiting to join or form a Linux-cluster
>> 2990  22:20:28 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> Jun  8 22:20:29 localhost kernel: CMAN: sending membership request
>> 2990  22:20:29 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> Jun  8 22:20:30 localhost kernel: CMAN: got node tamarillo
>> Jun  8 22:20:30 localhost kernel: CMAN: got node guava
>> Jun  8 22:20:30 localhost kernel: CMAN: quorum regained, resuming activity
>> Jun  8 22:20:30 localhost cman: startup succeeded
>> 2990  22:20:30 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:31 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:32 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:33 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:34 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:35 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:36 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:37 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:38 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:39 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:40 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:41 stat64("/dev/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>> 2990  22:20:42 stat64("/dev/dlm-control", {st_mode=S_IFCHR|0600, st_rdev=makedev(10, 62), ...}) = 0
>> Jun  8 22:20:42 kiwano ccsd[2981]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4
>> Jun  8 22:20:42 kiwano ccsd[2981]: Initial status:: Quorate
>>
>> So, it looks like the problem is that the appearance of /dev/dlm-control
>> is being delayed in the 4U5 cluster.
> 
> 
> That's wrong - it should be created in /dev/misc/dlm-control - maybe you are
> missing the udev config file that puts in the right place ? you should have
> /etc/udev/rules.d/51-dlm.rules file that does this.
> 

Oops got that wrong way round - dlm-control probably IS being created in the
right place, it's magma that is looking for it in the wrong place. I don't have
the magma version numbers to hand, but I'm pretty sure this was fixed in CVS and
would be worried if it didn't get into U5.

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From cluster at defuturo.co.uk  Mon Jun 11 08:36:51 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Mon, 11 Jun 2007 09:36:51 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <466CFF76.8050703@redhat.com>
References: <1181303479.26426.63.camel@localhost.localdomain>
	<466945D2.10000@redhat.com>
	<1181316013.26426.82.camel@localhost.localdomain>
	<1181505225.14777.25.camel@localhost.localdomain>
	<466CFC95.9060804@redhat.com>  <466CFF76.8050703@redhat.com>
Message-ID: <1181551011.5588.10.camel@localhost.localdomain>

On Mon, 2007-06-11 at 08:53 +0100, Patrick Caulfield wrote:
> Patrick Caulfield wrote:

> > That's wrong - it should be created in /dev/misc/dlm-control - maybe you are
> > missing the udev config file that puts in the right place ? you should have
> > /etc/udev/rules.d/51-dlm.rules file that does this.
> 
> Oops got that wrong way round - dlm-control probably IS being created in the
> right place, it's magma that is looking for it in the wrong place. I don't have
> the magma version numbers to hand, but I'm pretty sure this was fixed in CVS and
> would be worried if it didn't get into U5.

  I trimmed some lines like the following from the strace:

2990  22:20:29 stat64("/dev/misc/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)

so, ccsd is looking in both locations, but you're quite right that I
don't have the 51-dlm.rules file. I'll track that down and try again.

	Thanks,

		Robert



From cluster at defuturo.co.uk  Mon Jun 11 09:51:37 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Mon, 11 Jun 2007 10:51:37 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <1181551011.5588.10.camel@localhost.localdomain>
References: <1181303479.26426.63.camel@localhost.localdomain>
	<466945D2.10000@redhat.com>
	<1181316013.26426.82.camel@localhost.localdomain>
	<1181505225.14777.25.camel@localhost.localdomain>
	<466CFC95.9060804@redhat.com>  <466CFF76.8050703@redhat.com>
	<1181551011.5588.10.camel@localhost.localdomain>
Message-ID: <1181555498.5588.47.camel@localhost.localdomain>

On Mon, 2007-06-11 at 09:36 +0100, Robert Clark wrote:
> On Mon, 2007-06-11 at 08:53 +0100, Patrick Caulfield wrote:

> > Oops got that wrong way round - dlm-control probably IS being created in the
> > right place, it's magma that is looking for it in the wrong place. I don't have
> > the magma version numbers to hand, but I'm pretty sure this was fixed in CVS and
> > would be worried if it didn't get into U5.
> 
>   I trimmed some lines like the following from the strace:
> 
> 2990  22:20:29 stat64("/dev/misc/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
> 
> so, ccsd is looking in both locations, but you're quite right that I
> don't have the 51-dlm.rules file. I'll track that down and try again.

  Well, I was missing the dlm package. The system was still booting
occasionally without it. Now I've installed that, the device file is
being created in /dev/misc/dlm-control instead, but ccsd still spends
>10 seconds waiting for it to appear and fenced often fails to start as
a result.

  Is the delay here likely to simply be udev being slow?

	Robert



From pcaulfie at redhat.com  Mon Jun 11 10:05:08 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 11 Jun 2007 11:05:08 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <1181555498.5588.47.camel@localhost.localdomain>
References: <1181303479.26426.63.camel@localhost.localdomain>	<466945D2.10000@redhat.com>	<1181316013.26426.82.camel@localhost.localdomain>	<1181505225.14777.25.camel@localhost.localdomain>	<466CFC95.9060804@redhat.com>
	<466CFF76.8050703@redhat.com>	<1181551011.5588.10.camel@localhost.localdomain>
	<1181555498.5588.47.camel@localhost.localdomain>
Message-ID: <466D1E54.4060504@redhat.com>

Robert Clark wrote:
> On Mon, 2007-06-11 at 09:36 +0100, Robert Clark wrote:
>> On Mon, 2007-06-11 at 08:53 +0100, Patrick Caulfield wrote:
> 
>>> Oops got that wrong way round - dlm-control probably IS being created in the
>>> right place, it's magma that is looking for it in the wrong place. I don't have
>>> the magma version numbers to hand, but I'm pretty sure this was fixed in CVS and
>>> would be worried if it didn't get into U5.
>>   I trimmed some lines like the following from the strace:
>>
>> 2990  22:20:29 stat64("/dev/misc/dlm-control", 0xf6f2c104) = -1 ENOENT (No such file or directory)
>>
>> so, ccsd is looking in both locations, but you're quite right that I
>> don't have the 51-dlm.rules file. I'll track that down and try again.
> 
>   Well, I was missing the dlm package. The system was still booting
> occasionally without it. Now I've installed that, the device file is
> being created in /dev/misc/dlm-control instead, but ccsd still spends
>> 10 seconds waiting for it to appear and fenced often fails to start as
> a result.
> 
>   Is the delay here likely to simply be udev being slow?

It sounds like udev isn't creating it at all. What happens is that libdlm waits
10 seconds for udev to create the device file, and if it doesn't appear after
that time it will do the job itself.

Not being a udev expert I'm not really sure why that might be.

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From chiranthlk at yahoo.com  Mon Jun 11 08:04:20 2007
From: chiranthlk at yahoo.com (chirantha pitigala)
Date: Mon, 11 Jun 2007 01:04:20 -0700 (PDT)
Subject: [Linux-cluster] Relocate of service if restart fails
Message-ID: <872287.60894.qm@web52312.mail.re2.yahoo.com>

Hi all,
Does anyone have idea of this matter. I know that maximum false restart parameters no longer exists at RHCS 4 as cluemanager is not there.
In the document it says;
Restart ? Restart the
		service in the node the service is currently located. The default
		setting is Restart. If the service cannot be
		restarted in the the current node, the service is relocated.
Does it mean restart has to be validated in the start script ant throw non zero value when fails?
Really appreciate if someone can help me in this matter.

thanks and regards

chirantha 


Hi all,

In RHCS 4, if service recovery policy is set to restart, in which circumstances relocate will happen? In my setup I can see it only happens when a node fails.

thanks and regards

chirantha



      Fussy? Opinionated? Impossible to please? Perfect.  Join Yahoo!'s user panel and lay it on us.
--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster






 
____________________________________________________________________________________
Expecting? Get great news right away with email Auto-Check. 
Try the Yahoo! Mail Beta.
http://advision.webevents.yahoo.com/mailbeta/newmail_tools.html 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070611/87266b78/attachment.htm>

From cluster at defuturo.co.uk  Mon Jun 11 10:50:14 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Mon, 11 Jun 2007 11:50:14 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <466D1E54.4060504@redhat.com>
References: <1181303479.26426.63.camel@localhost.localdomain>
	<466945D2.10000@redhat.com>
	<1181316013.26426.82.camel@localhost.localdomain>
	<1181505225.14777.25.camel@localhost.localdomain>
	<466CFC95.9060804@redhat.com> <466CFF76.8050703@redhat.com>
	<1181551011.5588.10.camel@localhost.localdomain>
	<1181555498.5588.47.camel@localhost.localdomain>
	<466D1E54.4060504@redhat.com>
Message-ID: <1181559014.5588.63.camel@localhost.localdomain>

On Mon, 2007-06-11 at 11:05 +0100, Patrick Caulfield wrote:
> Robert Clark wrote:

> >   Well, I was missing the dlm package. The system was still booting
> > occasionally without it. Now I've installed that, the device file is
> > being created in /dev/misc/dlm-control instead, but ccsd still spends
> >> 10 seconds waiting for it to appear and fenced often fails to start as
> > a result.
> > 
> >   Is the delay here likely to simply be udev being slow?
> 
> It sounds like udev isn't creating it at all. What happens is that libdlm waits
> 10 seconds for udev to create the device file, and if it doesn't appear after
> that time it will do the job itself.
> 
> Not being a udev expert I'm not really sure why that might be.

  As an experiment, I've tried just loading the dlm module on a node
with no cluster services running and confirmed that dlm-control is being
created by udev.

  I must admit - I'm pretty confused now about the role of libdlm. Since
it turns out that I've been running a 4U4 cluster without the dlm
package installed (and so no libdlm) and, until this morning, my 4U5
cluster in the same state, I'm wondering: What uses libdlm?

	Robert



From pcaulfie at redhat.com  Mon Jun 11 11:00:13 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 11 Jun 2007 12:00:13 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <1181559014.5588.63.camel@localhost.localdomain>
References: <1181303479.26426.63.camel@localhost.localdomain>	<466945D2.10000@redhat.com>	<1181316013.26426.82.camel@localhost.localdomain>	<1181505225.14777.25.camel@localhost.localdomain>	<466CFC95.9060804@redhat.com>
	<466CFF76.8050703@redhat.com>	<1181551011.5588.10.camel@localhost.localdomain>	<1181555498.5588.47.camel@localhost.localdomain>	<466D1E54.4060504@redhat.com>
	<1181559014.5588.63.camel@localhost.localdomain>
Message-ID: <466D2B3D.8060700@redhat.com>

Robert Clark wrote:
> On Mon, 2007-06-11 at 11:05 +0100, Patrick Caulfield wrote:
>> Robert Clark wrote:
> 
>>>   Well, I was missing the dlm package. The system was still booting
>>> occasionally without it. Now I've installed that, the device file is
>>> being created in /dev/misc/dlm-control instead, but ccsd still spends
>>>> 10 seconds waiting for it to appear and fenced often fails to start as
>>> a result.
>>>
>>>   Is the delay here likely to simply be udev being slow?
>> It sounds like udev isn't creating it at all. What happens is that libdlm waits
>> 10 seconds for udev to create the device file, and if it doesn't appear after
>> that time it will do the job itself.
>>
>> Not being a udev expert I'm not really sure why that might be.
> 
>   As an experiment, I've tried just loading the dlm module on a node
> with no cluster services running and confirmed that dlm-control is being
> created by udev.
> 
>   I must admit - I'm pretty confused now about the role of libdlm. Since
> it turns out that I've been running a 4U4 cluster without the dlm
> package installed (and so no libdlm) and, until this morning, my 4U5
> cluster in the same state, I'm wondering: What uses libdlm?
> 


I don't think anything is, that might be the problem. But magma (which ccsd uses
to talk to the cluster manager) checks for the existence of dlm-controld anyway!
just in case you need to create any locks using magma I suppose.


-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From maciej.bogucki at artegence.com  Mon Jun 11 13:15:27 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Mon, 11 Jun 2007 15:15:27 +0200
Subject: [Linux-cluster] Relocate of service if restart fails
In-Reply-To: <57335.85508.qm@web52302.mail.re2.yahoo.com>
References: <57335.85508.qm@web52302.mail.re2.yahoo.com>
Message-ID: <466D4AEF.1040900@artegence.com>

chirantha pitigala napisa?(a):
> Hi all,
> 
> In RHCS 4, if service recovery policy is set to restart, in which
> circumstances relocate will happen? In my setup I can see it only
> happens when a node fails.

Hello,

Maybe this article will be usefull:

http://www.samag.com/documents/s=10123/sam0704a/0704a.htm

Best Regards
Maciej Bogucki



From chiranthlk at yahoo.com  Mon Jun 11 13:25:01 2007
From: chiranthlk at yahoo.com (chirantha pitigala)
Date: Mon, 11 Jun 2007 06:25:01 -0700 (PDT)
Subject: [Linux-cluster] Relocate of service if restart fails
Message-ID: <622288.42809.qm@web52302.mail.re2.yahoo.com>

Hi Maciej,

Thanks a lot.
That article has the answers  for  my questions.

best regards
chirantha 

----- Original Message ----
From: Maciej Bogucki <maciej.bogucki at artegence.com>
To: linux clustering <linux-cluster at redhat.com>; chiranthlk at yahoo.com
Sent: Monday, June 11, 2007 9:15:27 PM
Subject: Re: [Linux-cluster] Relocate of service if restart fails

chirantha pitigala napisa?(a):
> Hi all,
> 
> In RHCS 4, if service recovery policy is set to restart, in which
> circumstances relocate will happen? In my setup I can see it only
> happens when a node fails.

Hello,

Maybe this article will be usefull:

http://www.samag.com/documents/s=10123/sam0704a/0704a.htm

Best Regards
Maciej Bogucki







 
____________________________________________________________________________________
Don't pick lemons.
See all the new 2007 cars at Yahoo! Autos.
http://autos.yahoo.com/new_cars.html 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070611/ab4ec29f/attachment.htm>

From maciej.bogucki at artegence.com  Mon Jun 11 14:40:06 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Mon, 11 Jun 2007 16:40:06 +0200
Subject: [Linux-cluster] Problems with Cluster
In-Reply-To: <1df4abe60706102010g7808af27wf40901bd20618407@mail.gmail.com>
References: <1df4abe60706102010g7808af27wf40901bd20618407@mail.gmail.com>
Message-ID: <466D5EC6.1040903@artegence.com>

Manish Kathuria napisa?(a):
> Hello,
> 
> We have configured an active-active two node cluster accessing a
> shared storage. We are using Red Hat Cluster Suite on RHEL  4Update 4.
> The nodes are HP Servers having ILOs as the fence devices. I am facing
> certain problems in the setup and would seek suggestions.
> 
> While booting up the systems, many a times one node forces the other
> one to reboot for no reason. How do I prevent that ?

Try to set post_join_delay to high value fe. 600

> 
> We want the failover to happen when the power supply fails to either
> of the nodes. In order to test the scenario, we removed the power
> cables from one of the nodes. However the failover did not happen and
> upon observing the logs we found that the alive node could not connect
> to the fence device (ILO in this case) of the dead node since it was
> powered off and the fencing could not take place. Does this mean that
> we would not be able to have a failover in case of power failure for
> one of the nodes. Is there a way we can do it ? How is the cluster
> supposed to react when the ILO itself is powered off ?

You need to perform manual fencing(administrator reaction) when it happend.

> 
> Another failover test was conducted by removing the network
> connectivity on of one of the nodes. Failover did happen smoothly in
> the dead node was rebooted by the fence device. However when the dead
> node was reconnected to the network, it fenced and rebooted the alive
> node. How to deal with this ?

Try to increase post_join_delay.

Best Regards
Maciej Bogucki



From cluster at defuturo.co.uk  Mon Jun 11 15:46:32 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Mon, 11 Jun 2007 16:46:32 +0100
Subject: [Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion
In-Reply-To: <466D2B3D.8060700@redhat.com>
References: <1181303479.26426.63.camel@localhost.localdomain>
	<466945D2.10000@redhat.com>
	<1181316013.26426.82.camel@localhost.localdomain>
	<1181505225.14777.25.camel@localhost.localdomain>
	<466CFC95.9060804@redhat.com> <466CFF76.8050703@redhat.com>
	<1181551011.5588.10.camel@localhost.localdomain>
	<1181555498.5588.47.camel@localhost.localdomain>
	<466D1E54.4060504@redhat.com>
	<1181559014.5588.63.camel@localhost.localdomain>
	<466D2B3D.8060700@redhat.com>
Message-ID: <1181576792.2902.20.camel@rutabaga.defuturo.co.uk>

On Mon, 2007-06-11 at 12:00 +0100, Patrick Caulfield wrote:
> Robert Clark wrote:
> > On Mon, 2007-06-11 at 11:05 +0100, Patrick Caulfield wrote:
> >> Robert Clark wrote:

> >>>   Is the delay here likely to simply be udev being slow?

> >> It sounds like udev isn't creating it at all. What happens is that libdlm waits
> >> 10 seconds for udev to create the device file, and if it doesn't appear after
> >> that time it will do the job itself.

> >   As an experiment, I've tried just loading the dlm module on a node
> > with no cluster services running and confirmed that dlm-control is being
> > created by udev.
> > 
> >   I must admit - I'm pretty confused now about the role of libdlm. Since
> > it turns out that I've been running a 4U4 cluster without the dlm
> > package installed (and so no libdlm) and, until this morning, my 4U5
> > cluster in the same state, I'm wondering: What uses libdlm?

> I don't think anything is, that might be the problem. But magma (which ccsd uses
> to talk to the cluster manager) checks for the existence of dlm-controld anyway!
> just in case you need to create any locks using magma I suppose.

  OK, I may need to upgrade my condition from confused to baffled...

  I've slapped an strace on udevd during boot and here are some
excerpts:

Jun 11 16:23:48 localhost kernel: CMAN 2.6.9-50.2 (built May 31 2007 15:39:24) installed
Jun 11 16:23:48 localhost kernel: NET: Registered protocol family 30
Jun 11 16:23:48 localhost kernel: DLM 2.6.9-46.16 (built May 31 2007 15:45:51) installed

970   16:23:52 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={9, 0}}, NULL) = 0
970   16:23:52 select(6, [3 5], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted)
970   16:24:01 --- SIGALRM (Alarm clock) @ 0 (0) ---

970   16:24:01 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf6fc9708) = 3090
3090  16:24:01 execve("/sbin/udev", ["udev", "misc"], [/* 3 vars */])   = 0

3090  16:24:01 mknod("/dev/misc/dlm-control", S_IFCHR|0666, makedev(10, 62))   = 0

So, certainly udev is the main source of the delay (waiting for a
SIGALRM?) but, then, the same version of udev is on both clusters.

  I guess I'll add something to the startup script to wait
for /dev/misc/dlm-control to exist before starting fenced.

	Thanks,

		Robert



From mkathuria at tuxtechnologies.co.in  Mon Jun 11 16:45:02 2007
From: mkathuria at tuxtechnologies.co.in (Manish Kathuria)
Date: Mon, 11 Jun 2007 22:15:02 +0530
Subject: [Linux-cluster] Problems with Cluster
In-Reply-To: <466D5EC6.1040903@artegence.com>
References: <1df4abe60706102010g7808af27wf40901bd20618407@mail.gmail.com>
	<466D5EC6.1040903@artegence.com>
Message-ID: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com>

On 6/11/07, Maciej Bogucki <maciej.bogucki at artegence.com> wrote:
> Manish Kathuria napisa?(a):

> > We want the failover to happen when the power supply fails to either
> > of the nodes. In order to test the scenario, we removed the power
> > cables from one of the nodes. However the failover did not happen and
> > upon observing the logs we found that the alive node could not connect
> > to the fence device (ILO in this case) of the dead node since it was
> > powered off and the fencing could not take place. Does this mean that
> > we would not be able to have a failover in case of power failure for
> > one of the nodes. Is there a way we can do it ? How is the cluster
> > supposed to react when the ILO itself is powered off ?
>
> You need to perform manual fencing(administrator reaction) when it happend.
>

Isn't there any way which is automated and does not require manual
intervention ? Otherwise, the whole purpose gets defeated.



From Robert.Gil at americanhm.com  Mon Jun 11 16:59:52 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Mon, 11 Jun 2007 12:59:52 -0400
Subject: [Linux-cluster] Problems with Cluster
In-Reply-To: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA62E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

If ilo itself is off, fencing doesn't work. 

Did you add ilo as a fence device? And create a user? You create a user in the ilo for that blade, not on the chassis. You have to reboot the blade to get to the ilo manager. 


Robert Gil
Linux Systems Administrator
American Home Mortgage
Phone: 631-622-8410
Cell: 631-827-5775
Fax: 516-495-5861

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Manish Kathuria
Sent: Monday, June 11, 2007 12:45 PM
To: linux clustering
Subject: Re: [Linux-cluster] Problems with Cluster

On 6/11/07, Maciej Bogucki <maciej.bogucki at artegence.com> wrote:
> Manish Kathuria napisa?(a):

> > We want the failover to happen when the power supply fails to either 
> > of the nodes. In order to test the scenario, we removed the power 
> > cables from one of the nodes. However the failover did not happen 
> > and upon observing the logs we found that the alive node could not 
> > connect to the fence device (ILO in this case) of the dead node 
> > since it was powered off and the fencing could not take place. Does 
> > this mean that we would not be able to have a failover in case of 
> > power failure for one of the nodes. Is there a way we can do it ? 
> > How is the cluster supposed to react when the ILO itself is powered off ?
>
> You need to perform manual fencing(administrator reaction) when it happend.
>

Isn't there any way which is automated and does not require manual intervention ? Otherwise, the whole purpose gets defeated.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From chris at cmiware.com  Mon Jun 11 17:19:12 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 11 Jun 2007 12:19:12 -0500
Subject: [Linux-cluster] DRBD + Postgres on RH Cluster 5
Message-ID: <466D8410.7090403@cmiware.com>

I am attempting to run Postgres with the data to be stored on a DRBD 
partition. 

First question, is anyone using DRBD on RedHat Cluster and would you 
mind sharing your scripts?  I saw a post from several months ago where 
someone claimed to have used the script from Linux-HA version 1 to 
accomplish this, but the script was not posted to the list that I could 
tell, nor does it appear to be in CVS.

Second, I tried setting up a Postgres-8 resource through Conga which did 
not work - for the record, several of the resource scripts were omitted 
from my install of the cluster suite.  I managed to hack on the 
postgres-8 script in CVS to get it to start and stop as a custom script 
resource, but this may not be ideal.

Is anyone using this setup or maybe MySQL + DRBD and have some 
pointers?  Any advice would be appreciated greatly.

Thanks,
Chris Harms



From janne.peltonen at helsinki.fi  Mon Jun 11 18:04:22 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Mon, 11 Jun 2007 21:04:22 +0300
Subject: [Linux-cluster] ip.sh
Message-ID: <20070611180422.GF25899@helsinki.fi>

Hi!

While looking at process accounting on my new cluster system, I noticed
that something called fs.sh is run some hundreds of times per second,
sometimes more often, sometimes less. Abt 10 million times per day,
anyway.

Now I wonder if this is normal. I've got a rhel 5 based system with 25
services, 24 of which use ext3 fs's on clvm logical volumes in a SAN.
The last one uses a local disk and is confined to run on a dedicated
server (which is part of the cluster mostly because the cluster tools
are a convenient way to organize the service there). Also, the dedicated
cluster member is not connected to the SAN.

Thanks for any info.


--Janne Peltonen
Univ. of Helsinki
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From brad at bradandkim.net  Mon Jun 11 18:21:26 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Mon, 11 Jun 2007 13:21:26 -0500 (CDT)
Subject: [Linux-cluster] A single server backing up three servers
Message-ID: <58433.129.237.174.144.1181586086.squirrel@webmail.bradandkim.net>

Hi all,

I have just begun playing around with redhat cluster and I have a question
that I am not finding a straight answer to.   I would like to have one
server be a backup to 3 primary servers.   In other words I guess I would
have 3 services (db1, db2, and db3) and then a 4th server (db4) that is
available to take over for any of the 3 primary servers.   Obviously if 2
primary servers fail I am in trouble, but the odds of that are extremely
slim.  Is this possible?  Are there caveats?

TIA,

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From redhat at watson-wilson.ca  Mon Jun 11 18:31:30 2007
From: redhat at watson-wilson.ca (Neil Watson)
Date: Mon, 11 Jun 2007 14:31:30 -0400
Subject: [Linux-cluster] A single server backing up three servers
In-Reply-To: <58433.129.237.174.144.1181586086.squirrel@webmail.bradandkim.net>
References: <58433.129.237.174.144.1181586086.squirrel@webmail.bradandkim.net>
Message-ID: <20070611183130.GB5773@watson-wilson.ca>

Are you saying that you have three services run by three separate
servers?  You would like any of the three services to fail over to the
forth server?

-- 
Neil Watson             | Debian Linux
System Administrator    | Uptime 39 days
http://watson-wilson.ca



From jwhiter at redhat.com  Mon Jun 11 18:37:19 2007
From: jwhiter at redhat.com (Josef Bacik)
Date: Mon, 11 Jun 2007 14:37:19 -0400
Subject: [Linux-cluster] A single server backing up three servers
In-Reply-To: <58433.129.237.174.144.1181586086.squirrel@webmail.bradandkim.net>
References: <58433.129.237.174.144.1181586086.squirrel@webmail.bradandkim.net>
Message-ID: <20070611183718.GC26695@korben.rdu.redhat.com>

On Mon, Jun 11, 2007 at 01:21:26PM -0500, brad at bradandkim.net wrote:
> Hi all,
> 
> I have just begun playing around with redhat cluster and I have a question
> that I am not finding a straight answer to.   I would like to have one
> server be a backup to 3 primary servers.   In other words I guess I would
> have 3 services (db1, db2, and db3) and then a 4th server (db4) that is
> available to take over for any of the 3 primary servers.   Obviously if 2
> primary servers fail I am in trouble, but the odds of that are extremely
> slim.  Is this possible?  Are there caveats?
> 

Yeah you can do this with priority failover domains.  I assume from your
description that you will want 3 failover domains, one for each application.
You will add two servers for each domain, the primary server which will have the
higher priority and then db4.  Then if anything fails on the main servers they
will failover to db4, and then when they come back up the service will failback.
You will want the "restricted" and "ordered" flag set.  The only caveat you have
already specified, if db4 is down and something fails everything hangs since you
will have lost 2 nodes.  You may want to look into qdisk if you want to change
this behavior slightly.

Josef



From andremachado at techforce.com.br  Mon Jun 11 18:38:19 2007
From: andremachado at techforce.com.br (andremachado)
Date: Mon, 11 Jun 2007 11:38:19 -0700
Subject: [Linux-cluster] Re: [work around] GFS over GNBD freezing problems
In-Reply-To: <8577042bf3e2aec48e89aec96da2e09d@localhost>
References: <8577042bf3e2aec48e89aec96da2e09d@localhost>
Message-ID: <7a0ccdc86b07edc1e4e0bb650e76d2a8@localhost>

Hello,
I discovered a workaround using an undocumented feature of GFS and gnbd.
(did not found it at RH manuals and man pages)
Maybe it does not happen in red hat cluster source newer than 1.03.00, kernel 2.6.18, when used with high performance hardware and hi perf private network, but the problem is clear at low end hardware and networking.
the code could be more tolerant to these.
Described the work around at [0]
Regards.
Andre Felipe Machado


[0] http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch



From brad at bradandkim.net  Mon Jun 11 19:09:03 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Mon, 11 Jun 2007 14:09:03 -0500 (CDT)
Subject: [Linux-cluster] A single server backing up three servers
In-Reply-To: <20070611183130.GB5773@watson-wilson.ca>
References: <58433.129.237.174.144.1181586086.squirrel@webmail.bradandkim.net>
	<20070611183130.GB5773@watson-wilson.ca>
Message-ID: <47660.129.237.174.144.1181588943.squirrel@webmail.bradandkim.net>


> Are you saying that you have three services run by three separate
> servers?  You would like any of the three services to fail over to the
> forth server?
>
> --
> Neil Watson             | Debian Linux
> System Administrator    | Uptime 39 days
> http://watson-wilson.ca
>

Sorry if it was unclear, but yes, that is exactly what I am looking for.

Thanks,

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From brad at bradandkim.net  Mon Jun 11 19:11:20 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Mon, 11 Jun 2007 14:11:20 -0500 (CDT)
Subject: [Linux-cluster] A single server backing up three servers
In-Reply-To: <20070611183718.GC26695@korben.rdu.redhat.com>
References: <58433.129.237.174.144.1181586086.squirrel@webmail.bradandkim.net>
	<20070611183718.GC26695@korben.rdu.redhat.com>
Message-ID: <44846.129.237.174.144.1181589080.squirrel@webmail.bradandkim.net>


> On Mon, Jun 11, 2007 at 01:21:26PM -0500, brad at bradandkim.net wrote:
>> Hi all,
>>
>> I have just begun playing around with redhat cluster and I have a
>> question
>> that I am not finding a straight answer to.   I would like to have one
>> server be a backup to 3 primary servers.   In other words I guess I
>> would
>> have 3 services (db1, db2, and db3) and then a 4th server (db4) that is
>> available to take over for any of the 3 primary servers.   Obviously if
>> 2
>> primary servers fail I am in trouble, but the odds of that are extremely
>> slim.  Is this possible?  Are there caveats?
>>
>
> Yeah you can do this with priority failover domains.  I assume from your
> description that you will want 3 failover domains, one for each
> application.
> You will add two servers for each domain, the primary server which will
> have the
> higher priority and then db4.  Then if anything fails on the main servers
> they
> will failover to db4, and then when they come back up the service will
> failback.
> You will want the "restricted" and "ordered" flag set.  The only caveat
> you have
> already specified, if db4 is down and something fails everything hangs
> since you
> will have lost 2 nodes.  You may want to look into qdisk if you want to
> change
> this behavior slightly.
>
> Josef
>

Thanks Josef.  I will look into qdisk.


Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From mkathuria at tuxtechnologies.co.in  Tue Jun 12 01:29:00 2007
From: mkathuria at tuxtechnologies.co.in (Manish Kathuria)
Date: Tue, 12 Jun 2007 06:59:00 +0530
Subject: [Linux-cluster] Problems with Cluster
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA62E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com>
	<F154BEC9D4278D4BA9B94BDEC76193482BA62E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com>

On 6/11/07, Robert Gil <Robert.Gil at americanhm.com> wrote:
> If ilo itself is off, fencing doesn't work.

Isn't there any timeout setting such that if the ILO doesn't respond
for a certain amount of time, it is treated as fenced and the node is
considered to be dead and the failover takes place?

>
> Did you add ilo as a fence device? And create a user? You create a user in the ilo for that blade, not on the chassis. You have to reboot the blade to get to the ilo manager.

Yes, had added respective ILOs as fence devices for both the servers
and created users also.


I just want to make sure that automatic fencing happens and failover
takes place even when there is a complete power failure for one node

>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Manish Kathuria
> Sent: Monday, June 11, 2007 12:45 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Problems with Cluster
>
> On 6/11/07, Maciej Bogucki <maciej.bogucki at artegence.com> wrote:
> > Manish Kathuria napisa?(a):
>
> > > We want the failover to happen when the power supply fails to either
> > > of the nodes. In order to test the scenario, we removed the power
> > > cables from one of the nodes. However the failover did not happen
> > > and upon observing the logs we found that the alive node could not
> > > connect to the fence device (ILO in this case) of the dead node
> > > since it was powered off and the fencing could not take place. Does
> > > this mean that we would not be able to have a failover in case of
> > > power failure for one of the nodes. Is there a way we can do it ?
> > > How is the cluster supposed to react when the ILO itself is powered off ?
> >
> > You need to perform manual fencing(administrator reaction) when it happend.
> >
>
> Isn't there any way which is automated and does not require manual intervention ? Otherwise, the whole purpose gets defeated.
>



From grimme at atix.de  Tue Jun 12 06:16:30 2007
From: grimme at atix.de (Marc Grimme)
Date: Tue, 12 Jun 2007 08:16:30 +0200
Subject: [Linux-cluster] Problems with Cluster
In-Reply-To: <1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com>
References: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com>
	<F154BEC9D4278D4BA9B94BDEC76193482BA62E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com>
Message-ID: <200706120816.32533.grimme@atix.de>

On Tuesday 12 June 2007 03:29:00 Manish Kathuria wrote:
> On 6/11/07, Robert Gil <Robert.Gil at americanhm.com> wrote:
> > If ilo itself is off, fencing doesn't work.
>
> Isn't there any timeout setting such that if the ILO doesn't respond
> for a certain amount of time, it is treated as fenced and the node is
> considered to be dead and the failover takes place?
As far as I remember there is only a tcp-timeout when establishing the 
connection to the ilo-card that takes a very long time to occure (that's a 
default setting and takes minutes). I'm not sure how and where to set it.

But we've had this discussion (especially with ILO-Cards) nearly every time 
when using them and therefore and also out of other reasons we had to build 
our own fence_ilo agent. I'm quite sure that we solved the timeout problem in 
the end. It is set to 10sec per default (Config.timeout).
You can find it at 
http://download.atix.de/yum/comoonics/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-16.noarch.rpm
or directly use the yum/up2date-channel as described here:
http://www.open-sharedroot.org/faq/can-i-use-yum-or-up2date-to-install-the-software/ 
then install "comoonics-bootimage-fenceclient-ilo" and there you go.
>
> > Did you add ilo as a fence device? And create a user? You create a user
> > in the ilo for that blade, not on the chassis. You have to reboot the
> > blade to get to the ilo manager.
>
> Yes, had added respective ILOs as fence devices for both the servers
> and created users also.
We are doing so as well. Always a power user for ilo devices.
We are also automating this with the ilo client.
There is a undocumented switch -x in the fence_ilo client referenced above 
where you reference a file that might look as follows and you'll have your 
user.

  <USER_INFO MODE="write">
    <ADD_USER
      USER_NAME="power"
      USER_LOGIN="power"
      PASSWORD="the_password">
      <ADMIN_PRIV value ="N"/>
      <REMOTE_CONS_PRIV value ="N"/>
      <RESET_SERVER_PRIV value ="Y"/>
      <VIRTUAL_MEDIA_PRIV value ="N"/>
      <!--        Firmware support infomation for next tag:          -->
      <!--            iLO 2 - All version.                           -->
      <!--              iLO - All version.                           -->
      <!--         RILOE II - None                                   -->
      <CONFIG_ILO_PRIV value="Yes"/>
      <!--        Firmware support infomation for next 3 tags:       -->
      <!--            iLO 2 - None.                                  -->
      <!--              iLO - None.                                  -->
      <!--         RILOE II - All versions.                          -->
      <!--
      <CONFIG_RILO_PRIV value="Y"/>
      <LOGIN_PRIV value ="Y"/>
      <CLIENT_RANGE value ="10.10.10.1 - 254.255.255.255"/>
      -->
      <!--        Firmware support infomation for next 6 tags:       -->
      <!--            iLO 2 - None.                                  -->
      <!--              iLO - Version 1.40 and earlier.              -->
      <!--         RILOE II - None.                                  -->
      <!--
      <VIEW_LOGS_PRIV value="Yes"/>
      <CLEAR_LOGS_PRIV value="Yes"/>
      <EMS_PRIV value="Yes"/>
      <UPDATE_ILO_PRIV value="No"/>
      <CONFIG_RACK_PRIV value="Yes"/>
      <DIAG_PRIV value="Yes"/>
      -->
    </ADD_USER>
  </USER_INFO>

>
>
> I just want to make sure that automatic fencing happens and failover
> takes place even when there is a complete power failure for one node
If the timeout thing works you'll also need a second fence mechanism. 
You might think about using fence_manual as last resort, to bring that cluster 
back online after power failure and then after manual intervention.

Regards Marc.
>
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Manish Kathuria
> > Sent: Monday, June 11, 2007 12:45 PM
> > To: linux clustering
> > Subject: Re: [Linux-cluster] Problems with Cluster
> >
> > On 6/11/07, Maciej Bogucki <maciej.bogucki at artegence.com> wrote:
> > > Manish Kathuria napisa?(a):
> > > > We want the failover to happen when the power supply fails to either
> > > > of the nodes. In order to test the scenario, we removed the power
> > > > cables from one of the nodes. However the failover did not happen
> > > > and upon observing the logs we found that the alive node could not
> > > > connect to the fence device (ILO in this case) of the dead node
> > > > since it was powered off and the fencing could not take place. Does
> > > > this mean that we would not be able to have a failover in case of
> > > > power failure for one of the nodes. Is there a way we can do it ?
> > > > How is the cluster supposed to react when the ILO itself is powered
> > > > off ?
> > >
> > > You need to perform manual fencing(administrator reaction) when it
> > > happend.
> >
> > Isn't there any way which is automated and does not require manual
> > intervention ? Otherwise, the whole purpose gets defeated.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From jprats at cesca.es  Tue Jun 12 07:08:14 2007
From: jprats at cesca.es (Jordi Prats)
Date: Tue, 12 Jun 2007 09:08:14 +0200
Subject: [Linux-cluster] Assertion failed on sm_membership.c
Message-ID: <466E465E.5060301@cesca.es>

Hi,
I found my server with this error after start a second node:

CMAN got WAIT barrier not in phase 1 (...)

Assertion failed on line 106 of file 
(...)/cman-kernel-2.6.9-4/smp/src/sm_memnership.c

assertion node
time 449651800 nodeid=1

I could not get the full message, I hope this could be usefull.

Regards,
Jordi

-- 
......................................................................
         __
        / /          Jordi Prats
  C E / S / C A      Dept. de Sistemes
      /_/            Centre de Supercomputaci? de Catalunya

  Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona
  T. 93 205 6464 ? F.  93 205 6979 ? jprats at cesca.es
...................................................................... 



From frederik.wagner at physik.lmu.de  Tue Jun 12 09:07:27 2007
From: frederik.wagner at physik.lmu.de (Frederik Wagner)
Date: Tue, 12 Jun 2007 11:07:27 +0200
Subject: [Linux-cluster] stability gfs2
Message-ID: <466E624F.3020209@physik.lmu.de>

Hi group,

I was wondering about the stability of the gfs2 filesystem.
After the release of Scientific Linux 5 (SL5) I tried the gfs2
filesystem on our FC SAN Storage with a two node cluster frontend (this
are supposed to be three in the future).

The system was running well on SL44 and is now running well with SL5 and
gfs, but when I tried the gfs2 filesystem I had problems syncing (rsync)
back the storage data from the backup. I got reproducable write errors,
after which the whole filsystem was stuck and broken.

Unluckily I had to install the system in to production with gfs now, so
I cannot post the error messages. (if the error messages are needed for
further debugging, i could set up a testsystem) So up to now I'm wondering:
1. is gfs2 supposed to be completely stable?
2. are there any known serious bugs i missed?
3. why should i use gfs2 instead of gfs, except for a faster df command? :-)
4. are there any documentation speaking about the gfs2 fs system in
detail? To me the documentation looks quite 'small'...

Thanks a lot in advance,
	Frederik Wagner

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070612/e2b7d3e4/attachment.sig>

From jos at xos.nl  Tue Jun 12 09:09:37 2007
From: jos at xos.nl (Jos Vos)
Date: Tue, 12 Jun 2007 11:09:37 +0200
Subject: [Linux-cluster] stability gfs2
In-Reply-To: <466E624F.3020209@physik.lmu.de>;
	from frederik.wagner@physik.lmu.de on Tue, Jun 12, 2007 at
	11:07:27AM +0200
References: <466E624F.3020209@physik.lmu.de>
Message-ID: <20070612110937.A13614@xos037.xos.nl>

On Tue, Jun 12, 2007 at 11:07:27AM +0200, Frederik Wagner wrote:

> I was wondering about the stability of the gfs2 filesystem.
> After the release of Scientific Linux 5 (SL5) I tried the gfs2
> filesystem on our FC SAN Storage with a two node cluster frontend (this
> are supposed to be three in the future).

The RHEL5 release notes say you should not use it in production yet.
I guess that includes the answer ;-).

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From Santosh.Panigrahi at in.unisys.com  Tue Jun 12 09:39:45 2007
From: Santosh.Panigrahi at in.unisys.com (Panigrahi, Santosh Kumar)
Date: Tue, 12 Jun 2007 15:09:45 +0530
Subject: [Linux-cluster] dlm service is not stopping
Message-ID: <D566E8CF3538B54D95B925CB69CB4D2A046B16DB@inblr-exch1.eu.uis.unisys.com>

Hi Cluster Team,

 

I have configured a 2 node cluster (RHEL5). When I am shutting down the
cluster, I am stopping "rgmanager' service first and then "cman'
service.

But cman service is failing to stop and flashing the following error
message.

 

[root at m1045 panigrsk]# service cman stop

Stopping cluster:

   Stopping fencing... done

   Stopping cman... failed

/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy

                                                           [FAILED]

 

 

'cman_tool" is showing "dlm" service is still running.

 

[root at m1045 panigrsk]# cman_tool services

type             level name       id       state

dlm              1     rgmanager  00020001 none

[2]

 

Please help me, How to stop dlm service.

 

Thanks,

Santosh

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070612/bc69c0a2/attachment.htm>

From lhh at redhat.com  Tue Jun 12 13:46:18 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 12 Jun 2007 09:46:18 -0400
Subject: [Linux-cluster] Testing Failover - Failing in few cases
In-Reply-To: <24fecab20705280612l5747a5begb3ffede3f24bb885@mail.gmail.com>
References: <24fecab20705280612l5747a5begb3ffede3f24bb885@mail.gmail.com>
Message-ID: <20070612134618.GA7203@redhat.com>

On Mon, May 28, 2007 at 06:42:08PM +0530, Satya Daragani wrote:
> Hi Linux-Cluster Team,
> 
> Please help me in testing the failover with the RHEL Cluster Suite 4 with
> update 4. I am appending the details related to cluster nodes and
> configuration here. Kindly suggest me how to proceed further.
> 
>   6. No fence devices configured (chkconfig --del fenced)

This breaks 2-node mode.


> 2nd case
> 
> Currently node1 is running the httpd service, if I down the network
> interface (ifconfig eth0 down), the httpd service is failing over to the
> node2.
> 
> Then if I up the interface (ifconfig eth0 up) on node1, the service is not
> failovering to the node1 and in the /var/log/messages it is saying "unable
> to contact the cluster infrastructure". *Need your help here*

CMAN doesn't gracefully die in this case; fencing needs to happen to fix
it.

> 3rd case
> 
> Currently node1 is running the HTTPd service, if I remove the powercord (I
> mean the improper shutdown), the service is going to the recovery mode and
> not getting started on the node2. *Need your help here.*

Are there any log messages here?  I'm not sure if lack of fencing might
cause this; I suspect not.  It shouldn't stick in recovery mode.


> Currently node1 is running the httpd service, if I stop or killall the httpd
> service (service httpd stop) failover is not happening. *Need your help
> here.*

That's a bug in the script resource agent.  Edit
/usr/share/cluster/script.sh and change the 'interval' values for the
monitor + status actions to 10 seconds instead of 3600 seconds / 1 hour.

e.g.
      interval="10"

That should fix it.

Can I convince you to upgrade at least:

   rgmanager
   ccs
   magma
   magma-plugins

...to the update 5 releases?

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From wferi at niif.hu  Tue Jun 12 14:01:04 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Tue, 12 Jun 2007 16:01:04 +0200
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <20070521145058.GA2590@redhat.com> (David Teigland's message of
	"Mon, 21 May 2007 09:50:58 -0500")
References: <87648r6hdi.fsf@tac.ki.iif.hu> <87ps6tl685.fsf@szonett.ki.iif.hu>
	<20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com>
Message-ID: <87tztdmfvj.fsf@tac.ki.iif.hu>

Hi David,

Sorry if all what follows is misguided nonsense.  I'm eager to learn...

David Teigland <teigland at redhat.com> writes:

> The new code has much better caching in the dlm which will benefit flocks,
> look at these flock numbers I sent before: [...]
>
> This is testing raw flock performance.  The dlm locks for normal file
> operations should be cached and locally mastered also, so I'm not sure
> what's causing the long times.  Make sure that drop_count is zero again,
> now it's in sysfs:
>   echo 0 > /sys/fs/gfs/<foo>:<bar>/lock_module/drop_count
>
> Also, mount debugfs so we can check some stuff later:
>   mount -t debugfs none /sys/kernel/debug
>
> Then run some tests:
> - mount on nodeA
> - run the test on nodeA
> - count locks on nodeA
>   (cat /sys/kernel/debug/dlm/<bar> | grep Master | wc -l)
> - mount on nodeB (don't do anything on this node)
> - run the test again on nodeA
> - count locks on nodeA and nodeB (see above)
> - mount on nodeC (don't do anything on nodes B or C)
> - run the test again on nodeA
> - count locks on nodes A, B and C (see above)
>
> We're basically trying to produce the best-case performance from one node,
> nodeA.  That means making sure that nodeA is mastering all locks and doing
> maximum caching.  That's why it's important that we not do anything at all
> that accesses the fs on nodes B or C, or do any extra mounts/unmounts.

I made all the above tests and composed the reply a long time ago, but
now, getting back to it after that long time, I decided to satisfy your
curiosity, behold...

> Plocks will be much slower and are probably not interesting to test, but
> I'm curious if you added the "-l0" option to gfs_controld?  That option
> turns off the code that intentionally limits the rate of plocks.  See the
> old results again: [...]

Now, that switch makes ALL the difference.  With a single node
switched on, I get results like this (with abbreviated strace -c
output appended):

without -l0:

filecount=500
  iteration=0 elapsed time=10.444446 s
  iteration=1 elapsed time=9.693618 s
  iteration=2 elapsed time=10.520073 s
  iteration=3 elapsed time=10.521504 s
  iteration=4 elapsed time=10.520183 s
total elapsed time=51.699824 s
Process 5265 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 83.27    0.048525           6      7551           read
  6.73    0.003923           2      2502           fcntl64
  4.47    0.002606           1      2528           close
  3.09    0.001801           1      2551        23 open
  0.74    0.000432           0      2507           write
  0.71    0.000415           0      5033           mmap2
  0.41    0.000237           0     12528         3 _llseek
  0.31    0.000178           0      5001           munmap
  0.18    0.000107           0      5015           fstat64
  0.08    0.000049           0      2506           gettimeofday
  0.00    0.000000           0        16        14 ioctl
  0.00    0.000000           0       202       182 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.058273                 47974       229 total

with -l0:

filecount=500
  iteration=0 elapsed time=5.966146 s
  iteration=1 elapsed time=0.582058 s
  iteration=2 elapsed time=0.528272 s
  iteration=3 elapsed time=0.936438 s
  iteration=4 elapsed time=0.528147 s
total elapsed time=8.541061 s
Process 10030 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.17    0.016527           2      7551           read
 21.49    0.006213           2      2528           close
  8.16    0.002358           1      2502           fcntl64
  6.59    0.001904           1      2551        23 open
  2.21    0.000638           0      2507           write
  1.46    0.000421           0      5033           mmap2
  0.86    0.000249         249         1           execve
  0.73    0.000212           0      5001           munmap
  0.65    0.000187           0     12528         3 _llseek
  0.57    0.000165           0      5015           fstat64
  0.12    0.000034           0      2506           gettimeofday
  0.00    0.000000           0        16        14 ioctl
  0.00    0.000000           0       202       182 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.028908                 47974       229 total

Looks like the bottleneck isn't the explicit locking (be it plock or
flock), but something else, like the built-in GFS locking.

Similar dramatic speedup can be achieved (with a single node switched
on, again), by the lockproto=lock_nolock mount option, even if used
together with ignore_local_fs.  It I understand it right, this
combination leaves the cluster-wide [pf]locks alone, just eliminates
the GFS internal locking, which guards the internal consistency of the
file system (please correct me if I'm wrong).

What's strange, is that gfs_controld -l0 seems like a perfectly safe
invocation (what's the catch, ie. why was the artifical limit
introduced?), still it achieves almost the same speedup like using
lock_nolock, which would be a disaster with more than one node
mounting the fs.  (Also this trick scales pretty well to 4000 files.)

Again, the above tests were done with a single node switched on, and
I'm not sure whether the results carry over to the real cluster setup,
will test is soon.  I didn't touch drop_count either, everything was
left as default, except for the mount options and the -l option.

Also, I can send the results of the scenario suggested by you, if it's
still relevant.  In short: the locks are always mastered on node A
only, but the performance is poor nevertheless.
-- 
Regards,
Feri.



From wferi at niif.hu  Tue Jun 12 14:38:13 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Tue, 12 Jun 2007 16:38:13 +0200
Subject: [Linux-cluster] What is manual-failure override?
Message-ID: <87lkepme5m.fsf@tac.ki.iif.hu>

Hi,

The CVS commit which removed fence_manual.c (and thus my favourite
agent :) says:

Remove fence_manual; only provide manual-failure override

Good.  Now, how do I use manual fencing?  What's to be written into
cluster.conf?  (cluster 2.00.00)
-- 
Thanks,
Feri.



From teigland at redhat.com  Tue Jun 12 14:46:52 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 12 Jun 2007 09:46:52 -0500
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <87tztdmfvj.fsf@tac.ki.iif.hu>
References: <87648r6hdi.fsf@tac.ki.iif.hu> <87ps6tl685.fsf@szonett.ki.iif.hu>
	<20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87tztdmfvj.fsf@tac.ki.iif.hu>
Message-ID: <20070612144652.GB16723@redhat.com>

On Tue, Jun 12, 2007 at 04:01:04PM +0200, Ferenc Wagner wrote:
> Hi David,
> 
> Sorry if all what follows is misguided nonsense.  I'm eager to learn...
> 
> David Teigland <teigland at redhat.com> writes:
> 
> > The new code has much better caching in the dlm which will benefit flocks,
> > look at these flock numbers I sent before: [...]
> >
> > This is testing raw flock performance.  The dlm locks for normal file
> > operations should be cached and locally mastered also, so I'm not sure
> > what's causing the long times.  Make sure that drop_count is zero again,
> > now it's in sysfs:
> >   echo 0 > /sys/fs/gfs/<foo>:<bar>/lock_module/drop_count
> >
> > Also, mount debugfs so we can check some stuff later:
> >   mount -t debugfs none /sys/kernel/debug
> >
> > Then run some tests:
> > - mount on nodeA
> > - run the test on nodeA
> > - count locks on nodeA
> >   (cat /sys/kernel/debug/dlm/<bar> | grep Master | wc -l)
> > - mount on nodeB (don't do anything on this node)
> > - run the test again on nodeA
> > - count locks on nodeA and nodeB (see above)
> > - mount on nodeC (don't do anything on nodes B or C)
> > - run the test again on nodeA
> > - count locks on nodes A, B and C (see above)
> >
> > We're basically trying to produce the best-case performance from one node,
> > nodeA.  That means making sure that nodeA is mastering all locks and doing
> > maximum caching.  That's why it's important that we not do anything at all
> > that accesses the fs on nodes B or C, or do any extra mounts/unmounts.
> 
> I made all the above tests and composed the reply a long time ago, but
> now, getting back to it after that long time, I decided to satisfy your
> curiosity, behold...
> 
> > Plocks will be much slower and are probably not interesting to test, but
> > I'm curious if you added the "-l0" option to gfs_controld?  That option
> > turns off the code that intentionally limits the rate of plocks.  See the
> > old results again: [...]
> 
> Now, that switch makes ALL the difference.  With a single node
> switched on, I get results like this (with abbreviated strace -c
> output appended):
> 
> without -l0:
> 
> filecount=500
>   iteration=0 elapsed time=10.444446 s
>   iteration=1 elapsed time=9.693618 s
>   iteration=2 elapsed time=10.520073 s
>   iteration=3 elapsed time=10.521504 s
>   iteration=4 elapsed time=10.520183 s
> total elapsed time=51.699824 s
> Process 5265 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  83.27    0.048525           6      7551           read
>   6.73    0.003923           2      2502           fcntl64
>   4.47    0.002606           1      2528           close
>   3.09    0.001801           1      2551        23 open
>   0.74    0.000432           0      2507           write
>   0.71    0.000415           0      5033           mmap2
>   0.41    0.000237           0     12528         3 _llseek
>   0.31    0.000178           0      5001           munmap
>   0.18    0.000107           0      5015           fstat64
>   0.08    0.000049           0      2506           gettimeofday
>   0.00    0.000000           0        16        14 ioctl
>   0.00    0.000000           0       202       182 stat64
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.058273                 47974       229 total
> 
> with -l0:
> 
> filecount=500
>   iteration=0 elapsed time=5.966146 s
>   iteration=1 elapsed time=0.582058 s
>   iteration=2 elapsed time=0.528272 s
>   iteration=3 elapsed time=0.936438 s
>   iteration=4 elapsed time=0.528147 s
> total elapsed time=8.541061 s
> Process 10030 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  57.17    0.016527           2      7551           read
>  21.49    0.006213           2      2528           close
>   8.16    0.002358           1      2502           fcntl64
>   6.59    0.001904           1      2551        23 open
>   2.21    0.000638           0      2507           write
>   1.46    0.000421           0      5033           mmap2
>   0.86    0.000249         249         1           execve
>   0.73    0.000212           0      5001           munmap
>   0.65    0.000187           0     12528         3 _llseek
>   0.57    0.000165           0      5015           fstat64
>   0.12    0.000034           0      2506           gettimeofday
>   0.00    0.000000           0        16        14 ioctl
>   0.00    0.000000           0       202       182 stat64
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.028908                 47974       229 total
> 
> Looks like the bottleneck isn't the explicit locking (be it plock or
> flock), but something else, like the built-in GFS locking.

I'm guessing that these were run with a single node in the cluster?  The
second set of numbers (with -l0) wouldn't make much sense otherwise.  I
think if you add nodes to the cluster, the -l0 numbers will go up quite a
bit.  In the end I expect that flocks are still going to be the fastest
for you.

> Similar dramatic speedup can be achieved (with a single node switched
> on, again), by the lockproto=lock_nolock mount option, even if used
> together with ignore_local_fs.  It I understand it right, this
> combination leaves the cluster-wide [pf]locks alone, just eliminates
> the GFS internal locking, which guards the internal consistency of the
> file system (please correct me if I'm wrong).

With nolock there is no cluster (lock_nolock just returns 0 for
everything), so the cluster-wide [pf]locks have zero cost.  So this test
doesn't tell you anything.

> What's strange, is that gfs_controld -l0 seems like a perfectly safe
> invocation (what's the catch, ie. why was the artifical limit
> introduced?), 

The rate limit was introduced to prevent bad programs from flooding the
network with plock operations.  It may not be a very real problem, though,
so we might eventually disable it (-l0) by default.

> still it achieves almost the same speedup like using
> lock_nolock, which would be a disaster with more than one node
> mounting the fs.  (Also this trick scales pretty well to 4000 files.)

No, -l0 is not going to give you the performance of nolock.  I think you
must have been running with a single node in the cluster.  In that case
there are no other nodes to send/recv messages to/from, so the plock
messages are very fast.

> Again, the above tests were done with a single node switched on, and
> I'm not sure whether the results carry over to the real cluster setup,
> will test is soon.  

Ah, yep.  When you add nodes the plocks will become much slower.  Again, I
think you'll have better luck with flocks.

> I didn't touch drop_count either, everything was
> left as default, except for the mount options and the -l option.
> 
> Also, I can send the results of the scenario suggested by you, if it's
> still relevant.  In short: the locks are always mastered on node A
> only, but the performance is poor nevertheless.

Poor even in the first step when you're just mounting on nodeA?

Dave



From wferi at niif.hu  Tue Jun 12 15:06:56 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Tue, 12 Jun 2007 17:06:56 +0200
Subject: [Linux-cluster] Slowness above 500 RRDs
References: <87648r6hdi.fsf@tac.ki.iif.hu> <87ps6tl685.fsf@szonett.ki.iif.hu>
	<20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com>
Message-ID: <87ejkhmctr.fsf@tac.ki.iif.hu>

Hi David,

Here is the old mail I haven't sent before.  Meanwhile, I'm switching
in other nodes to continue the tests in my previous mail.

David Teigland <teigland at redhat.com> writes:

> Make sure that drop_count is zero again, now it's in sysfs:
>   echo 0 > /sys/fs/gfs/<foo>:<bar>/lock_module/drop_count

Ok, I set it after mount, on nodeA only.

> Then run some tests:
> - mount on nodeA
> - run the test on nodeA
filecount=500
  iteration=0 elapsed time=7.392987 s
  iteration=1 elapsed time=4.27927 s
  iteration=2 elapsed time=5.262367 s
  iteration=3 elapsed time=5.265202 s
  iteration=4 elapsed time=5.269652 s
total elapsed time=27.469478 s
> - count locks on nodeA
>   (cat /sys/kernel/debug/dlm/<bar> | grep Master | wc -l)
nodeA: 1031
> - mount on nodeB (don't do anything on this node)
> - run the test again on nodeA
filecount=500
  iteration=0 elapsed time=4.288288 s
  iteration=1 elapsed time=5.282437 s
  iteration=2 elapsed time=5.244141 s
  iteration=3 elapsed time=5.268136 s
  iteration=4 elapsed time=5.261129 s
total elapsed time=25.344131 s
> - count locks on nodeA and nodeB (see above)
nodeA: 1030
nodeB: 20
> - mount on nodeC (don't do anything on nodes B or C)
> - run the test again on nodeA
filecount=500
  iteration=0 elapsed time=4.307917 s
  iteration=1 elapsed time=5.298193 s
  iteration=2 elapsed time=5.295678 s
  iteration=3 elapsed time=5.336343 s
  iteration=4 elapsed time=5.308529 s
total elapsed time=25.54666 s
> - count locks on nodes A, B and C (see above)
nodeA: 1030
nodeB:   20
nodeC:   21

> We're basically trying to produce the best-case performance from one node,
> nodeA.  That means making sure that nodeA is mastering all locks and doing
> maximum caching.  That's why it's important that we not do anything at all
> that accesses the fs on nodes B or C, or do any extra mounts/unmounts.

Yes, this sounds very reasonable.  But looks like nodeA feels obliged
to communicate its locking process around the cluster.  What confuses
me is that he emits multicast packets even when he's the only member.
Otherwise, it passes tokens around the cluster, which makes more
sense, though still unnecessary, as he is the lock master (if I get
the lock master concept right).
-- 
Thanks,
Feri.



From wferi at niif.hu  Tue Jun 12 15:43:24 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Tue, 12 Jun 2007 17:43:24 +0200
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <20070612144652.GB16723@redhat.com> (David Teigland's message of
	"Tue, 12 Jun 2007 09:46:52 -0500")
References: <87648r6hdi.fsf@tac.ki.iif.hu> <87ps6tl685.fsf@szonett.ki.iif.hu>
	<20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87tztdmfvj.fsf@tac.ki.iif.hu>
	<20070612144652.GB16723@redhat.com>
Message-ID: <877iq9mb4z.fsf@tac.ki.iif.hu>

David Teigland <teigland at redhat.com> writes:

> On Tue, Jun 12, 2007 at 04:01:04PM +0200, Ferenc Wagner wrote:
>
>> with -l0:
>> 
>> filecount=500
>>   iteration=0 elapsed time=5.966146 s
>>   iteration=1 elapsed time=0.582058 s
>>   iteration=2 elapsed time=0.528272 s
>>   iteration=3 elapsed time=0.936438 s
>>   iteration=4 elapsed time=0.528147 s
>> total elapsed time=8.541061 s
>>
>> Looks like the bottleneck isn't the explicit locking (be it plock
>> or flock), but something else, like the built-in GFS locking.
>
> I'm guessing that these were run with a single node in the cluster?
> The second set of numbers (with -l0) wouldn't make much sense
> otherwise.

Yes, you guessed right.  For some reason I found it a good idea to
reveal this at the end only.  (Sorry.)

> In the end I expect that flocks are still going to be the fastest
> for you.

They really seem to be faster, but since the [fp]locking time is
negligible, it doesn't buy much.

> I think if you add nodes to the cluster, the -l0 numbers will go up
> quite a bit.

Let's see.  Mounting on one more node (and switching on the third):

# cman_tool services
type             level name     id       state       
fence            0     default  00010001 none        
[1 2 3]
dlm              1     clvmd    00020001 none        
[1 2 3]
dlm              1     test     000a0001 none        
[1 2]
gfs              2     test     00090001 none        
[1 2]

$ <commands to run the test>
filecount=500
  iteration=0 elapsed time=5.030971 s
  iteration=1 elapsed time=0.657682 s
  iteration=2 elapsed time=0.798228 s
  iteration=3 elapsed time=0.65742 s
  iteration=4 elapsed time=0.776301 s
total elapsed time=7.920602 s

Somewhat slower, yes.  But still pretty fast.

Mounting on the third node:

# cman_tool services
type             level name     id       state       
fence            0     default  00010001 none        
[1 2 3]
dlm              1     clvmd    00020001 none        
[1 2 3]
dlm              1     test     000a0001 none        
[1 2 3]
gfs              2     test     00090001 none        
[1 2 3]

$ <commands to run the test>
filecount=500
  iteration=0 elapsed time=0.822107 s
  iteration=1 elapsed time=0.656789 s
  iteration=2 elapsed time=0.657798 s
  iteration=3 elapsed time=0.881496 s
  iteration=4 elapsed time=0.659481 s
total elapsed time=3.677671 s

It's much the same...

>> Again, the above tests were done with a single node switched on,
>> and I'm not sure whether the results carry over to the real cluster
>> setup, will test is soon.
>
> Ah, yep.  When you add nodes the plocks will become much slower.
> Again, I think you'll have better luck with flocks.

I didn't get anywhere with flocks.  At least strace didn't lead me
anywhere.  It showed my that the flock() calls are indeed faster than
the fcntl64() calls, but neither took a significant percentage of the
full run time.  I don't say I fully understand the strace output, but
the raw numbers above reflect the reality for sure.

>> Also, I can send the results of the scenario suggested by you, if
>> it's still relevant.  In short: the locks are always mastered on
>> node A only, but the performance is poor nevertheless.
>
> Poor even in the first step when you're just mounting on nodeA?

Yes, as detailed in the other mail.
-- 
Thanks,
Feri.



From teigland at redhat.com  Tue Jun 12 16:11:48 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 12 Jun 2007 11:11:48 -0500
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <87ejkhmctr.fsf@tac.ki.iif.hu>
References: <87648r6hdi.fsf@tac.ki.iif.hu> <87ps6tl685.fsf@szonett.ki.iif.hu>
	<20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87ejkhmctr.fsf@tac.ki.iif.hu>
Message-ID: <20070612161148.GC16723@redhat.com>

On Tue, Jun 12, 2007 at 05:06:56PM +0200, Ferenc Wagner wrote:
> Hi David,
> 
> Here is the old mail I haven't sent before.  Meanwhile, I'm switching
> in other nodes to continue the tests in my previous mail.
> 
> David Teigland <teigland at redhat.com> writes:
> 
> > Make sure that drop_count is zero again, now it's in sysfs:
> >   echo 0 > /sys/fs/gfs/<foo>:<bar>/lock_module/drop_count
> 
> Ok, I set it after mount, on nodeA only.
> 
> > Then run some tests:
> > - mount on nodeA
> > - run the test on nodeA
> filecount=500
>   iteration=0 elapsed time=7.392987 s
>   iteration=1 elapsed time=4.27927 s
>   iteration=2 elapsed time=5.262367 s
>   iteration=3 elapsed time=5.265202 s
>   iteration=4 elapsed time=5.269652 s
> total elapsed time=27.469478 s
> > - count locks on nodeA
> >   (cat /sys/kernel/debug/dlm/<bar> | grep Master | wc -l)
> nodeA: 1031
> > - mount on nodeB (don't do anything on this node)
> > - run the test again on nodeA
> filecount=500
>   iteration=0 elapsed time=4.288288 s
>   iteration=1 elapsed time=5.282437 s
>   iteration=2 elapsed time=5.244141 s
>   iteration=3 elapsed time=5.268136 s
>   iteration=4 elapsed time=5.261129 s
> total elapsed time=25.344131 s
> > - count locks on nodeA and nodeB (see above)
> nodeA: 1030
> nodeB: 20
> > - mount on nodeC (don't do anything on nodes B or C)
> > - run the test again on nodeA
> filecount=500
>   iteration=0 elapsed time=4.307917 s
>   iteration=1 elapsed time=5.298193 s
>   iteration=2 elapsed time=5.295678 s
>   iteration=3 elapsed time=5.336343 s
>   iteration=4 elapsed time=5.308529 s
> total elapsed time=25.54666 s
> > - count locks on nodes A, B and C (see above)
> nodeA: 1030
> nodeB:   20
> nodeC:   21
> 
> > We're basically trying to produce the best-case performance from one node,
> > nodeA.  That means making sure that nodeA is mastering all locks and doing
> > maximum caching.  That's why it's important that we not do anything at all
> > that accesses the fs on nodes B or C, or do any extra mounts/unmounts.
> 
> Yes, this sounds very reasonable.  But looks like nodeA feels obliged
> to communicate its locking process around the cluster.  

I'm not sure what you mean here.  To see the amount of dlm locking traffic
on the network, look at port 21064.  There should be very little in the
test above... and the dlm locking that you do see should mostly be related
to file i/o, not flocks.

> What confuses me is that he emits multicast packets even when he's the
> only member.  Otherwise, it passes tokens around the cluster, which
> makes more sense, though still unnecessary, as he is the lock master (if
> I get the lock master concept right).

I think you're confusing the multicast network traffic from openais
(related to cluster membership) and the point-to-point network traffic
from the dlm (related to gfs locking).  The two types of traffic are not
related.

Dave



From teigland at redhat.com  Tue Jun 12 16:17:09 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 12 Jun 2007 11:17:09 -0500
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <877iq9mb4z.fsf@tac.ki.iif.hu>
References: <20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87tztdmfvj.fsf@tac.ki.iif.hu>
	<20070612144652.GB16723@redhat.com> <877iq9mb4z.fsf@tac.ki.iif.hu>
Message-ID: <20070612161709.GD16723@redhat.com>

On Tue, Jun 12, 2007 at 05:43:24PM +0200, Ferenc Wagner wrote:
> David Teigland <teigland at redhat.com> writes:
> 
> > On Tue, Jun 12, 2007 at 04:01:04PM +0200, Ferenc Wagner wrote:
> >
> >> with -l0:
> >> 
> >> filecount=500
> >>   iteration=0 elapsed time=5.966146 s
> >>   iteration=1 elapsed time=0.582058 s
> >>   iteration=2 elapsed time=0.528272 s
> >>   iteration=3 elapsed time=0.936438 s
> >>   iteration=4 elapsed time=0.528147 s
> >> total elapsed time=8.541061 s
> >>
> >> Looks like the bottleneck isn't the explicit locking (be it plock
> >> or flock), but something else, like the built-in GFS locking.
> >
> > I'm guessing that these were run with a single node in the cluster?
> > The second set of numbers (with -l0) wouldn't make much sense
> > otherwise.
> 
> Yes, you guessed right.  For some reason I found it a good idea to
> reveal this at the end only.  (Sorry.)
> 
> > In the end I expect that flocks are still going to be the fastest
> > for you.
> 
> They really seem to be faster, but since the [fp]locking time is
> negligible, it doesn't buy much.
> 
> > I think if you add nodes to the cluster, the -l0 numbers will go up
> > quite a bit.
> 
> Let's see.  Mounting on one more node (and switching on the third):
> 
> # cman_tool services
> type             level name     id       state       
> fence            0     default  00010001 none        
> [1 2 3]
> dlm              1     clvmd    00020001 none        
> [1 2 3]
> dlm              1     test     000a0001 none        
> [1 2]
> gfs              2     test     00090001 none        
> [1 2]

!?!? but now you're using the old RHEL4 generation stuff -- gfs_controld
is completely irrelevant there.  The analysis completely changes between
the RHEL4/RHEL5 (old/new) generations of infrastructure.

Dave



From wferi at niif.hu  Tue Jun 12 16:39:34 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Tue, 12 Jun 2007 18:39:34 +0200
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <20070612161148.GC16723@redhat.com> (David Teigland's message of
	"Tue, 12 Jun 2007 11:11:48 -0500")
References: <87648r6hdi.fsf@tac.ki.iif.hu> <87ps6tl685.fsf@szonett.ki.iif.hu>
	<20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87ejkhmctr.fsf@tac.ki.iif.hu>
	<20070612161148.GC16723@redhat.com>
Message-ID: <87zm35ktyx.fsf@tac.ki.iif.hu>

David Teigland <teigland at redhat.com> writes:

> On Tue, Jun 12, 2007 at 05:06:56PM +0200, Ferenc Wagner wrote:
> 
>> Here is the old mail I haven't sent before.  Meanwhile, I'm switching
>> in other nodes to continue the tests in my previous mail.

[...]

>> But looks like nodeA feels obliged to communicate its locking
>> process around the cluster.
>
> I'm not sure what you mean here.  To see the amount of dlm locking traffic
> on the network, look at port 21064.  There should be very little in the
> test above... and the dlm locking that you do see should mostly be related
> to file i/o, not flocks.

There was much traffic on port 21064.  Possibly related to file I/O
and not flocks, I can't tell.  But that's agrees with my speculation,
that it's not the explicit [pf]locks that take much time, but
something else.

>> What confuses me is that he emits multicast packets even when he's the
>> only member.  Otherwise, it passes tokens around the cluster, which
>> makes more sense, though still unnecessary, as he is the lock master (if
>> I get the lock master concept right).
>
> I think you're confusing the multicast network traffic from openais
> (related to cluster membership) and the point-to-point network traffic
> from the dlm (related to gfs locking).  The two types of traffic are not
> related.

I didn't notice any multicast traffic when the node wasn't alone, but
maybe it was simply dwarfed by the locking traffic.  I can check that
again later, but...

>> # cman_tool services
>> type             level name     id       state       
>> fence            0     default  00010001 none        
>> [1 2 3]
>> dlm              1     clvmd    00020001 none        
>> [1 2 3]
>> dlm              1     test     000a0001 none        
>> [1 2]
>> gfs              2     test     00090001 none        
>> [1 2]
>
> !?!? but now you're using the old RHEL4 generation stuff -- gfs_controld
> is completely irrelevant there.  The analysis completely changes between
> the RHEL4/RHEL5 (old/new) generations of infrastructure.

To my best knowledge, I'm using the new infrastructure.  There's no
cman kernel module loaded, there's no cman process running, there's an
aisexec process running, syslog contains messages like

openais[4374]: [CLM  ] CLM CONFIGURATION CHANGE 
openais[4374]: [CLM  ] New Configuration: 
openais[4374]: [CLM  ] ^Ir(0) ip(XXX.XXX.XXX.XXX)  
openais[4374]: [CLM  ] ^Ir(0) ip(XXX.XXX.XXX.XXX)  
openais[4374]: [CLM  ] ^Ir(0) ip(XXX.XXX.XXX.XXX)  
openais[4374]: [CLM  ] Members Left: 
openais[4374]: [CLM  ] Members Joined: 
openais[4374]: [CLM  ] ^Ir(0) ip(XXX.XXX.XXX.XXX)  
openais[4374]: [SYNC ] This node is within the primary component and will provide service. 
openais[4374]: [TOTEM] entering OPERATIONAL state. 
openais[4374]: [CLM  ] got nodejoin message XXX.XXX.XXX.XXX 
openais[4374]: [CLM  ] got nodejoin message XXX.XXX.XXX.XXX 
openais[4374]: [CLM  ] got nodejoin message XXX.XXX.XXX.XXX 
openais[4374]: [CPG  ] got joinlist message from node 1 
openais[4374]: [CPG  ] got joinlist message from node 2 
kernel: dlm: connecting to 3
kernel: dlm: got connection from 3

lsmod gives

gfs                   256964  1 
lock_nolock             4480  0 
lock_dlm               20684  2 
gfs2                  328076  3 gfs,lock_nolock,lock_dlm
dlm                    92340  17 lock_dlm
configfs               25616  2 dlm

How could I be running the old stuff?  Am I totally confused?
-- 
Feri.



From teigland at redhat.com  Tue Jun 12 16:44:19 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 12 Jun 2007 11:44:19 -0500
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <87zm35ktyx.fsf@tac.ki.iif.hu>
References: <20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87ejkhmctr.fsf@tac.ki.iif.hu>
	<20070612161148.GC16723@redhat.com> <87zm35ktyx.fsf@tac.ki.iif.hu>
Message-ID: <20070612164418.GF16723@redhat.com>

On Tue, Jun 12, 2007 at 06:39:34PM +0200, Ferenc Wagner wrote:
> David Teigland <teigland at redhat.com> writes:
> 
> > On Tue, Jun 12, 2007 at 05:06:56PM +0200, Ferenc Wagner wrote:
> > 
> >> Here is the old mail I haven't sent before.  Meanwhile, I'm switching
> >> in other nodes to continue the tests in my previous mail.
> 
> [...]
> 
> >> But looks like nodeA feels obliged to communicate its locking
> >> process around the cluster.
> >
> > I'm not sure what you mean here.  To see the amount of dlm locking traffic
> > on the network, look at port 21064.  There should be very little in the
> > test above... and the dlm locking that you do see should mostly be related
> > to file i/o, not flocks.
> 
> There was much traffic on port 21064.  Possibly related to file I/O
> and not flocks, I can't tell.  But that's agrees with my speculation,
> that it's not the explicit [pf]locks that take much time, but
> something else.

Could you comment the fcntl/flock calls out of the application entirely
and try it?

> >> # cman_tool services
> >> type             level name     id       state       
> >> fence            0     default  00010001 none        
> >> [1 2 3]
> >> dlm              1     clvmd    00020001 none        
> >> [1 2 3]
> >> dlm              1     test     000a0001 none        
> >> [1 2]
> >> gfs              2     test     00090001 none        
> >> [1 2]
> >
> > !?!? but now you're using the old RHEL4 generation stuff -- gfs_controld
> > is completely irrelevant there.  The analysis completely changes between
> > the RHEL4/RHEL5 (old/new) generations of infrastructure.
> 
> To my best knowledge, I'm using the new infrastructure.  There's no
> cman kernel module loaded, there's no cman process running, there's an
> aisexec process running, syslog contains messages like

Sorry, I was wrong.  I saw 'cman_tool services' which is the old method of
doing group_tool, and forgot that we made cman_tool services call
group_tool.

> How could I be running the old stuff?  Am I totally confused?

You're running the new stuff, it's me who was confused.
Dave



From Christopher.Barry at qlogic.com  Tue Jun 12 18:12:18 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Tue, 12 Jun 2007 14:12:18 -0400
Subject: [Linux-cluster] Trouble finding openais include dir in RHEL5 compile
Message-ID: <1181671939.5249.32.camel@localhost>

Greetings All,

	I'm having an incredibly difficult time simply compiling this software.
I have followed a couple of 'usage.txt' versions, and well - they just
do not work for me. I've tried HEAD and RHEL5.

	Can someone tell me what branch of cvs code compiles against the stock
RHEL5 2.6.18-8 kernel source? 

One usage.txt said that you needed openais installed - I do. Latest
openais w/ default make;make install. Still, it appears the build system
for RHEL5 branch of cluster cannot find it. HEAD's configure system is
completely different.

daemon.c:32:35: error: openais/totem/aispoll.h: No such file or
directory

What variable(s) need(s) to be set for this file - installed by
openais's make install by default in:
 /usr/local/usr/include/openais/totem/aispoll.h 
to be found?

--prefix=/usr/local/ does not work.

Also, what is the difference between the RHEL5 and RHEL50 branches, and
which one should be used? - OR - what branch *should* be used against
this kernel now? 





Thanks

-- 
Regards,
-C



From teigland at redhat.com  Tue Jun 12 18:19:07 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 12 Jun 2007 13:19:07 -0500
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <877iq9mb4z.fsf@tac.ki.iif.hu>
References: <20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87tztdmfvj.fsf@tac.ki.iif.hu>
	<20070612144652.GB16723@redhat.com> <877iq9mb4z.fsf@tac.ki.iif.hu>
Message-ID: <20070612181907.GH16723@redhat.com>

On Tue, Jun 12, 2007 at 05:43:24PM +0200, Ferenc Wagner wrote:
> Let's see.  Mounting on one more node (and switching on the third):
> 
> # cman_tool services
> type             level name     id       state       
> fence            0     default  00010001 none        
> [1 2 3]
> dlm              1     clvmd    00020001 none        
> [1 2 3]
> dlm              1     test     000a0001 none        
> [1 2]
> gfs              2     test     00090001 none        
> [1 2]
> 
> $ <commands to run the test>
> filecount=500
>   iteration=0 elapsed time=5.030971 s
>   iteration=1 elapsed time=0.657682 s
>   iteration=2 elapsed time=0.798228 s
>   iteration=3 elapsed time=0.65742 s
>   iteration=4 elapsed time=0.776301 s
> total elapsed time=7.920602 s
> 
> Somewhat slower, yes.  But still pretty fast.
> 
> Mounting on the third node:
> 
> # cman_tool services
> type             level name     id       state       
> fence            0     default  00010001 none        
> [1 2 3]
> dlm              1     clvmd    00020001 none        
> [1 2 3]
> dlm              1     test     000a0001 none        
> [1 2 3]
> gfs              2     test     00090001 none        
> [1 2 3]
> 
> $ <commands to run the test>
> filecount=500
>   iteration=0 elapsed time=0.822107 s
>   iteration=1 elapsed time=0.656789 s
>   iteration=2 elapsed time=0.657798 s
>   iteration=3 elapsed time=0.881496 s
>   iteration=4 elapsed time=0.659481 s
> total elapsed time=3.677671 s
> 
> It's much the same...

OK, that's interesting.  I was led to believe from some of my old testing
that plocks wouldn't be that fast, but maybe that was a mistaken
conclusion.  There's no reason not to use plocks with -l0 if you're
getting good results.

Dave



From sdake at redhat.com  Tue Jun 12 18:19:06 2007
From: sdake at redhat.com (Steven Dake)
Date: Tue, 12 Jun 2007 11:19:06 -0700
Subject: [Linux-cluster] Trouble finding openais include dir in RHEL5
	compile
In-Reply-To: <1181671939.5249.32.camel@localhost>
References: <1181671939.5249.32.camel@localhost>
Message-ID: <1181672346.17814.26.camel@shih.broked.org>

On Tue, 2007-06-12 at 14:12 -0400, Christopher Barry wrote:
> Greetings All,
> 
> 	I'm having an incredibly difficult time simply compiling this software.
> I have followed a couple of 'usage.txt' versions, and well - they just
> do not work for me. I've tried HEAD and RHEL5.
> 
> 	Can someone tell me what branch of cvs code compiles against the stock
> RHEL5 2.6.18-8 kernel source? 
> 

When installing openais, you should use:

make install DESTDIR=/

the standard make install will install in /usr/local which is probably
not what you want.  

Regards
-steve

> One usage.txt said that you needed openais installed - I do. Latest
> openais w/ default make;make install. Still, it appears the build system
> for RHEL5 branch of cluster cannot find it. HEAD's configure system is
> completely different.
> 
> daemon.c:32:35: error: openais/totem/aispoll.h: No such file or
> directory
> 
> What variable(s) need(s) to be set for this file - installed by
> openais's make install by default in:
>  /usr/local/usr/include/openais/totem/aispoll.h 
> to be found?
> 
> --prefix=/usr/local/ does not work.
> 
> Also, what is the difference between the RHEL5 and RHEL50 branches, and
> which one should be used? - OR - what branch *should* be used against
> this kernel now? 
> 
> 
> 
> 
> 
> Thanks
> 



From Christopher.Barry at qlogic.com  Tue Jun 12 19:11:04 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Tue, 12 Jun 2007 15:11:04 -0400
Subject: [Linux-cluster] Trouble finding openais include dir in RHEL5
	compile
In-Reply-To: <1181672346.17814.26.camel@shih.broked.org>
References: <1181671939.5249.32.camel@localhost>
	<1181672346.17814.26.camel@shih.broked.org>
Message-ID: <1181675464.5249.33.camel@localhost>

On Tue, 2007-06-12 at 11:19 -0700, Steven Dake wrote:
> On Tue, 2007-06-12 at 14:12 -0400, Christopher Barry wrote:
> > Greetings All,
> > 
> > 	I'm having an incredibly difficult time simply compiling this software.
> > I have followed a couple of 'usage.txt' versions, and well - they just
> > do not work for me. I've tried HEAD and RHEL5.
> > 
> > 	Can someone tell me what branch of cvs code compiles against the stock
> > RHEL5 2.6.18-8 kernel source? 
> > 
> 
> When installing openais, you should use:
> 
> make install DESTDIR=/
> 
> the standard make install will install in /usr/local which is probably
> not what you want.  
> 
> Regards
> -steve
> 
> > One usage.txt said that you needed openais installed - I do. Latest
> > openais w/ default make;make install. Still, it appears the build system
> > for RHEL5 branch of cluster cannot find it. HEAD's configure system is
> > completely different.
> > 
> > daemon.c:32:35: error: openais/totem/aispoll.h: No such file or
> > directory
> > 
> > What variable(s) need(s) to be set for this file - installed by
> > openais's make install by default in:
> >  /usr/local/usr/include/openais/totem/aispoll.h 
> > to be found?
> > 
> > --prefix=/usr/local/ does not work.
> > 
> > Also, what is the difference between the RHEL5 and RHEL50 branches, and
> > which one should be used? - OR - what branch *should* be used against
> > this kernel now? 
> > 
> > 
> > 
> > 
> > 
> > Thanks
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Thanks!

-C



From Christopher.Barry at qlogic.com  Tue Jun 12 19:56:58 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Tue, 12 Jun 2007 15:56:58 -0400
Subject: [Linux-cluster] Trouble finding openais include dir in RHEL5
	compile
In-Reply-To: <1181675464.5249.33.camel@localhost>
References: <1181671939.5249.32.camel@localhost>
	<1181672346.17814.26.camel@shih.broked.org>
	<1181675464.5249.33.camel@localhost>
Message-ID: <1181678219.5249.44.camel@localhost>

On Tue, 2007-06-12 at 15:11 -0400, Christopher Barry wrote:
> On Tue, 2007-06-12 at 11:19 -0700, Steven Dake wrote:
> > On Tue, 2007-06-12 at 14:12 -0400, Christopher Barry wrote:
> > > Greetings All,
> > > 
> > > 	I'm having an incredibly difficult time simply compiling this software.
> > > I have followed a couple of 'usage.txt' versions, and well - they just
> > > do not work for me. I've tried HEAD and RHEL5.
> > > 
> > > 	Can someone tell me what branch of cvs code compiles against the stock
> > > RHEL5 2.6.18-8 kernel source? 
> > > 
> > 
> > When installing openais, you should use:
> > 
> > make install DESTDIR=/
> > 
...snip


Thanks Steve. That gets me to the dlm dir, which fails with the errors
below. This is cvs RHEL5 against the stock RHEL5 kernel 2.6.18-8.el5 

Any thoughts on this? I googled this failure, but nothing relevant comes
up.


gcc -Wall  -g -I. -O2  -D_REENTRANT -c -o libdlm.o libdlm.c
libdlm.c: In function ?set_version_v5?:
libdlm.c:324: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:325: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:326: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c: In function ?set_version_v6?:
libdlm.c:335: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:336: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:337: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c: In function ?detect_kernel_version?:
libdlm.c:443: error: storage size of ?v? isn?t known
libdlm.c:446: error: invalid application of ?sizeof? to incomplete type
?struct dlm_device_version?
libdlm.c:448: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:449: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:450: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:452: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:453: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:454: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:443: warning: unused variable ?v?
libdlm.c: In function ?do_dlm_dispatch?:
libdlm.c:590: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c: In function ?ls_lock_v6?:
libdlm.c:835: error: ?struct dlm_lock_params? has no member named ?xid?
libdlm.c:837: error: ?struct dlm_lock_params? has no member named
?timeout?
libdlm.c: In function ?ls_lock?:
libdlm.c:892: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c: In function ?dlm_ls_lockx?:
libdlm.c:916: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c: In function ?dlm_ls_unlock?:
libdlm.c:1067: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c: In function ?dlm_ls_deadlock_cancel?:
libdlm.c:1099: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:1115: error: ?DLM_USER_DEADLOCK? undeclared (first use in this
function)
libdlm.c:1115: error: (Each undeclared identifier is reported only once
libdlm.c:1115: error: for each function it appears in.)
libdlm.c: In function ?dlm_ls_purge?:
libdlm.c:1134: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:1145: error: ?DLM_USER_PURGE? undeclared (first use in this
function)
libdlm.c:1146: error: ?union <anonymous>? has no member named ?purge?
libdlm.c:1147: error: ?union <anonymous>? has no member named ?purge?
libdlm.c: In function ?create_lockspace?:
libdlm.c:1311: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c: In function ?release_lockspace?:
libdlm.c:1415: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c: In function ?dlm_kernel_version?:
libdlm.c:1501: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:1502: error: invalid use of undefined type ?struct
dlm_device_version?
libdlm.c:1503: error: invalid use of undefined type ?struct
dlm_device_version?
make[2]: *** [libdlm.o] Error 1
make[2]: Leaving directory `/usr/src/cluster/dlm/lib'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/usr/src/cluster/dlm'
make: *** [all] Error 2
[root at localhost cluster]#




From karl at klxsystems.net  Tue Jun 12 23:30:07 2007
From: karl at klxsystems.net (Karl R. Balsmeier)
Date: Tue, 12 Jun 2007 16:30:07 -0700
Subject: [Linux-cluster] fence device: network card?
Message-ID: <466F2C7F.9080906@klxsystems.net>

Hi,

I have three (3) servers built and entered into the 
system-config-cluster tool as nodes.  Basically the first node has node 
2 and node 3 as members of the cluster.

For a fence device, I do not have any of the SAN or network/switch 
devices listed in the dropdown menu, and where I have read in the 
documentation that says "gnbd" Generic Network Block Device seems to be 
what i'm looking for.

Basically I read in the docs you can use a NIC card as a fence device, 
is this true?

Right now each of the 3 servers have 3 NICs, so I have a total of 9 to 
play with.  Right now I am bonding the two GB NIC's together no 
problem.  That leaves each server a 100mbps NIC.

My ultimate goal is to use these 3 machines to make a Vsftpd GFS cluster 
that I can run Iscsi over. 

Being new to this though, i'll stick to the primary questions:  How does 
one configure a fence device in the form of a NIC card?  Is the gnbd 
item relevant to this?

-karl




From eschneid at uccs.edu  Tue Jun 12 23:05:06 2007
From: eschneid at uccs.edu (Eric Schneider)
Date: Tue, 12 Jun 2007 17:05:06 -0600
Subject: [Linux-cluster] unlinked inode
Message-ID: <004701c7ad46$1dfca100$1b03c680@uccs.edu>

I am testing failover on a 2 node cluster.  When I shut down the service,
umount the drives, and do an fsck.gfs I get the following message sometimes.

Does this indicate a file system corruption?

Found unlinked inode at 84692
Add unlinked inode to l+f? (y/n)y
l+f directory at 189
Added inode #84692 to l+f dir
Found unlinked inode at 84693
Add unlinked inode to l+f? (y/n)y
Added inode #84693 to l+f dir
Found unlinked inode at 84694
Add unlinked inode to l+f? (y/n)y

Thanks,

Eric Schneider





From pcaulfie at redhat.com  Wed Jun 13 07:21:27 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 13 Jun 2007 08:21:27 +0100
Subject: [Linux-cluster] Trouble finding openais include dir in RHEL5
	compile
In-Reply-To: <1181678219.5249.44.camel@localhost>
References: <1181671939.5249.32.camel@localhost>	<1181672346.17814.26.camel@shih.broked.org>	<1181675464.5249.33.camel@localhost>
	<1181678219.5249.44.camel@localhost>
Message-ID: <466F9AF7.2030105@redhat.com>

Christopher Barry wrote:
> On Tue, 2007-06-12 at 15:11 -0400, Christopher Barry wrote:
>> On Tue, 2007-06-12 at 11:19 -0700, Steven Dake wrote:
>>> On Tue, 2007-06-12 at 14:12 -0400, Christopher Barry wrote:
>>>> Greetings All,
>>>>
>>>> 	I'm having an incredibly difficult time simply compiling this software.
>>>> I have followed a couple of 'usage.txt' versions, and well - they just
>>>> do not work for me. I've tried HEAD and RHEL5.
>>>>
>>>> 	Can someone tell me what branch of cvs code compiles against the stock
>>>> RHEL5 2.6.18-8 kernel source? 
>>>>
>>> When installing openais, you should use:
>>>
>>> make install DESTDIR=/
>>>
> ...snip
> 
> 
> Thanks Steve. That gets me to the dlm dir, which fails with the errors
> below. This is cvs RHEL5 against the stock RHEL5 kernel 2.6.18-8.el5 
> 
> Any thoughts on this? I googled this failure, but nothing relevant comes
> up.
> 
> 
> gcc -Wall  -g -I. -O2  -D_REENTRANT -c -o libdlm.o libdlm.c
> libdlm.c: In function ?set_version_v5?:
> libdlm.c:324: error: invalid use of undefined type ?struct
> dlm_device_version?
> libdlm.c:325: error: invalid use of undefined type ?struct
> dlm_device_version?
> libdlm.c:326: error: invalid use of undefined type ?struct
> dlm_device_version?
> libdlm.c: In function ?set_version_v6?:
> libdlm.c:335: error: invalid use of undefined type ?struct
> dlm_device_version?

You need the latest kernel headers that have the updated dlm includes. Or, if
you're desperate, copy them from the latest kernel and stick

#define __user

at the top...(it's what I did!)


-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From frederik.wagner at physik.lmu.de  Wed Jun 13 08:54:11 2007
From: frederik.wagner at physik.lmu.de (Frederik Wagner)
Date: Wed, 13 Jun 2007 10:54:11 +0200
Subject: [Linux-cluster] altname broken?
Message-ID: <466FB0B3.1010806@physik.lmu.de>

Hello again,

I was succesfully using the altname option under Scientific Linux 4.4 in
my cluster.conf in the clusternode definition, like:

<clusternode name="auriga" nodeid="1" votes="1">
   <altname name="auriga-hb"/> -->
   <fence>
     <method name="1">
        <device lanplus="1" name="auriga_IPMI"/>
     </method>
   </fence>
</clusternode>

since I'm using Scientific Linux 5 now, the altname option seems to be
broken, in the sense that I have problems rejoining the cluster of even
starting up the cluster. It's a bit unspecific, since I can not really
track the starting of the cman init script.

Are there any known problems regading this option?

Thanks for you help,
	Frederik Wagner

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070613/0ef4b767/attachment.sig>

From pcaulfie at redhat.com  Wed Jun 13 09:09:54 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 13 Jun 2007 10:09:54 +0100
Subject: [Linux-cluster] altname broken?
In-Reply-To: <466FB0B3.1010806@physik.lmu.de>
References: <466FB0B3.1010806@physik.lmu.de>
Message-ID: <466FB462.9010705@redhat.com>

Frederik Wagner wrote:
> Hello again,
> 
> I was succesfully using the altname option under Scientific Linux 4.4 in
> my cluster.conf in the clusternode definition, like:
> 
> <clusternode name="auriga" nodeid="1" votes="1">
>    <altname name="auriga-hb"/> -->
>    <fence>
>      <method name="1">
>         <device lanplus="1" name="auriga_IPMI"/>
>      </method>
>    </fence>
> </clusternode>
> 
> since I'm using Scientific Linux 5 now, the altname option seems to be
> broken, in the sense that I have problems rejoining the cluster of even
> starting up the cluster. It's a bit unspecific, since I can not really
> track the starting of the cman init script.
> 
> Are there any known problems regading this option?
> 

There's nothing changed in that area that I can recall. Can you track the
startup with "cman_tool join -d" and also CMAN messages that appear in syslog on
the nodes?


Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From frederik.wagner at physik.lmu.de  Wed Jun 13 11:36:04 2007
From: frederik.wagner at physik.lmu.de (Frederik Wagner)
Date: Wed, 13 Jun 2007 13:36:04 +0200
Subject: [Linux-cluster] altname broken?
In-Reply-To: <466FB462.9010705@redhat.com>
References: <466FB0B3.1010806@physik.lmu.de> <466FB462.9010705@redhat.com>
Message-ID: <466FD6A4.7030605@physik.lmu.de>

Hi Patrick,

On 06/13/2007 11:09 AM, Patrick Caulfield wrote:
> Frederik Wagner wrote:
>>
>> Are there any known problems regading this option?
>>
> 
> There's nothing changed in that area that I can recall. Can you track the
> startup with "cman_tool join -d" and also CMAN messages that appear in syslog on
> the nodes?
> 

I reintroduced the altname in my cluster.conf and get the following problem.
Starting the service "cman" on the three cluster nodes makes then hang
while trying to start the fencing:

# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... failed

                                                           [FAILED]

after some minutes.
While stoping the script, and running clustat shows, that the node only
sees itself as Online, but not the others

# clustat
Member Status: Inquorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  auriga                                1 Offline
  pod                                   2 Offline
  sulaco                                3 Online, Local

also ccs_tool doesn't  see the other nodes (e.g. ccs_tool update... does
not work with "Unable to connect to the CCS daemon: Connection refused").

The output of the cman_tool join -d (after starting ccsd by hand and
loading the modules, i suppose) is attached as output.txt. I left it a
bit running while calling clustat etc...

The relevant lines of my cluster.conf are the following:

<?xml version="1.0"?>
<cluster config_version="30" name="storage_cluster">
 <fence_daemon post_fail_delay="0" post_join_delay="3"/>
 <clusternodes>
  <clusternode name="auriga" nodeid="1" votes="1">
   <altname name="auriga-hb"/>
   <fence>
    <method name="ipmi">
     <device lanplus="1" name="auriga_IPMI"/>
    </method>
    <method name="fabric">
     <device name="fcswitch1" port="5"/>
     <device name="fcswitch2" port="5"/>
    </method>
   </fence>
  </clusternode>
  <clusternode name="pod" nodeid="2" votes="1">
   <altname name="pod-hb"/>
   <fence>
    <method name="ipmi">
     <device lanplus="1" name="pod_IPMI"/>
    </method>
    <method name="fabric">
     <device name="fcswitch1" port="6"/>
     <device name="fcswitch2" port="6"/>
    </method>
   </fence>
  </clusternode>
  <clusternode name="sulaco" nodeid="3" votes="1">
   <altname name="sulaco-hb"/>
   <fence>
    <method name="ipmi">
     <device lanplus="1" name="sulaco_IPMI"/>
    </method>
    <method name="fabric">
     <device name="fcswitch1" port="4"/>
     <device name="fcswitch2" port="4"/>
    </method>
   </fence>
  </clusternode>
 </clusternodes>
 <cman/>
 <fencedevices>
  ...
 </fencedevices>
 <rm>
  ...
 </rm>
</cluster>

Thanks for any help! And let me know if you need further output...

Frederik
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: output.txt
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070613/a71e9cd5/attachment.txt>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070613/a71e9cd5/attachment.sig>

From pcaulfie at redhat.com  Wed Jun 13 12:23:33 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 13 Jun 2007 13:23:33 +0100
Subject: [Linux-cluster] altname broken?
In-Reply-To: <466FD6A4.7030605@physik.lmu.de>
References: <466FB0B3.1010806@physik.lmu.de> <466FB462.9010705@redhat.com>
	<466FD6A4.7030605@physik.lmu.de>
Message-ID: <466FE1C5.8090806@redhat.com>

Frederik Wagner wrote:
> Hi Patrick,
> 
> On 06/13/2007 11:09 AM, Patrick Caulfield wrote:
>> Frederik Wagner wrote:
>>> Are there any known problems regading this option?
>>>
>> There's nothing changed in that area that I can recall. Can you track the
>> startup with "cman_tool join -d" and also CMAN messages that appear in syslog on
>> the nodes?
>>

Ahh, I misread you original message slightly. I though this was an upgrade from
4.4 to 4.5 not to 5.0 which is a totally new cluster system!

I just tried altname on my systems here and it seems to work fine - though it's
not a very representative test network.

Firstly, make sure that all your nodes are running the same version of the
software. I suspect you are but it is important to realise that you can't mix V4
and v5 clusters.

The TOTEM messages I'd rather leave to Steve Dake to have a look at if he has time

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From teigland at redhat.com  Wed Jun 13 13:53:51 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 13 Jun 2007 08:53:51 -0500
Subject: [Linux-cluster] Trouble finding openais include dir in RHEL5
	compile
In-Reply-To: <1181678219.5249.44.camel@localhost>
References: <1181671939.5249.32.camel@localhost>
	<1181672346.17814.26.camel@shih.broked.org>
	<1181675464.5249.33.camel@localhost>
	<1181678219.5249.44.camel@localhost>
Message-ID: <20070613135351.GA16134@redhat.com>

On Tue, Jun 12, 2007 at 03:56:58PM -0400, Christopher Barry wrote:
> Thanks Steve. That gets me to the dlm dir, which fails with the errors
> below. This is cvs RHEL5 against the stock RHEL5 kernel 2.6.18-8.el5 
> 
> Any thoughts on this? I googled this failure, but nothing relevant comes
> up.

The RHEL50 cvs branch is used for the current RHEL 5.0 release, which it
sounds like you have.

The RHEL5 cvs branch is in development, to be used for the RHEL 5.1
release.  The RHEL 5.1 kernel includes the new dlm user/kernel interface
which is in the upstream kernel.  If you want to use the RHEL5 cvs branch,
you'd need the kernel for RHEL 5.1 (which is still in development and not
accessible AFAIK.)

The cvs HEAD should be used with upstream 2.6.22-rc kernels.  Doing this
requires that you also copy the new dlm kernel headers into the system
include directory:
cp /usr/src/linux-2.6.22-rcX/include/linux/dlm*  /usr/include/linux/
and then remove the __user tags from some them (or do as Patrick suggested).

Sorry for the confusion.

Dave



From teigland at redhat.com  Wed Jun 13 14:14:55 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 13 Jun 2007 09:14:55 -0500
Subject: [Linux-cluster] stability gfs2
In-Reply-To: <466E624F.3020209@physik.lmu.de>
References: <466E624F.3020209@physik.lmu.de>
Message-ID: <20070613141455.GB16134@redhat.com>

On Tue, Jun 12, 2007 at 11:07:27AM +0200, Frederik Wagner wrote:
> Hi group,
> 
> I was wondering about the stability of the gfs2 filesystem.
> After the release of Scientific Linux 5 (SL5) I tried the gfs2
> filesystem on our FC SAN Storage with a two node cluster frontend (this
> are supposed to be three in the future).
> 
> The system was running well on SL44 and is now running well with SL5 and
> gfs, but when I tried the gfs2 filesystem I had problems syncing (rsync)
> back the storage data from the backup. I got reproducable write errors,
> after which the whole filsystem was stuck and broken.
> 
> Unluckily I had to install the system in to production with gfs now, so
> I cannot post the error messages. (if the error messages are needed for
> further debugging, i could set up a testsystem) So up to now I'm wondering:
> 1. is gfs2 supposed to be completely stable?
> 2. are there any known serious bugs i missed?
> 3. why should i use gfs2 instead of gfs, except for a faster df command? :-)
> 4. are there any documentation speaking about the gfs2 fs system in
> detail? To me the documentation looks quite 'small'...

gfs2 isn't ready to be used, stick with gfs1.

Dave



From teigland at redhat.com  Wed Jun 13 14:31:45 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 13 Jun 2007 09:31:45 -0500
Subject: [Linux-cluster] What is manual-failure override?
In-Reply-To: <87lkepme5m.fsf@tac.ki.iif.hu>
References: <87lkepme5m.fsf@tac.ki.iif.hu>
Message-ID: <20070613143145.GC16134@redhat.com>

On Tue, Jun 12, 2007 at 04:38:13PM +0200, Ferenc Wagner wrote:
> Hi,
> 
> The CVS commit which removed fence_manual.c (and thus my favourite
> agent :) says:
> 
> Remove fence_manual; only provide manual-failure override
> 
> Good.  Now, how do I use manual fencing?  What's to be written into
> cluster.conf?  (cluster 2.00.00)

You just leave the <fence> sections empty for all the nodes.  When
a node fails, fenced no longer runs fence_manual, it just waits for
you to run "fence_ack_manual <nodename>" which tells fenced to go on.

Dave



From wferi at niif.hu  Wed Jun 13 14:38:40 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Wed, 13 Jun 2007 16:38:40 +0200
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <20070612164418.GF16723@redhat.com> (David Teigland's message of
	"Tue, 12 Jun 2007 11:44:19 -0500")
References: <20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87ejkhmctr.fsf@tac.ki.iif.hu>
	<20070612161148.GC16723@redhat.com> <87zm35ktyx.fsf@tac.ki.iif.hu>
	<20070612164418.GF16723@redhat.com>
Message-ID: <87odjjkjgv.fsf@tac.ki.iif.hu>

David Teigland <teigland at redhat.com> writes:

>>>> But looks like nodeA feels obliged to communicate its locking
>>>> process around the cluster.
>>>
>>> I'm not sure what you mean here.  To see the amount of dlm locking traffic
>>> on the network, look at port 21064.  There should be very little in the
>>> test above... and the dlm locking that you do see should mostly be related
>>> to file i/o, not flocks.
>> 
>> There was much traffic on port 21064.  Possibly related to file I/O
>> and not flocks, I can't tell.  But that's agrees with my speculation,
>> that it's not the explicit [pf]locks that take much time, but
>> something else.
>
> Could you comment the fcntl/flock calls out of the application entirely
> and try it?

Let's see.  A typical test run looks like this (first with fcntl
locking; tcpdump slows down the first iteration from about 6 s):

filecount=500
  iteration=0 elapsed time=20.196318 s
  iteration=1 elapsed time=0.323969 s
  iteration=2 elapsed time=0.319929 s
  iteration=3 elapsed time=0.361738 s
  iteration=4 elapsed time=0.399365 s
total elapsed time=21.601319 s

During the first (slow) iteration, there's much traffic on port 21064.
During the next (fast) iterations there's no traffic at all on that port.
If I rerun the test immediately, there's still no traffic.
5 minutes later, without any action on my part, there's a couple of
packets again, then 20 s later a bigger bunch (around 30).
After this, the first iteration generates much traffic again, GOTO 10.

If I use flock instead, the beginning is similar, but after about 10 s
from the finish of the test, some small traffic appears by itself, and
if I rerun the test after this, it generates traffic again, although
much less than after 5 minutes.  The traffic generated 5 minutes after
the test run consists of a couple of packets followed by a much bigger
bunch 5 s later.

If I don't use any locking at all, then the situation is the same as
with fcntl locking, but the "automatic" traffic consist of a small
burst (couple of packets) 4 min 51 s after the finish, then about 30
packets 25 s later.

Does it tell you anything?  The timings are perhaps somewhat off
because of the 20 s runtime.  If you can make some sense out of this,
I'd be glad to hear it.  Also, I'd like to tweak the 5 minutes
timeout, where does it come from?  Is it settable by sysfs or
gfs_tool?
-- 
Thanks,
Feri.



From lhh at redhat.com  Wed Jun 13 14:39:07 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 13 Jun 2007 10:39:07 -0400
Subject: [Linux-cluster] What is manual-failure override?
In-Reply-To: <87lkepme5m.fsf@tac.ki.iif.hu>
References: <87lkepme5m.fsf@tac.ki.iif.hu>
Message-ID: <20070613143907.GH7203@redhat.com>

On Tue, Jun 12, 2007 at 04:38:13PM +0200, Ferenc Wagner wrote:
> Hi,
> 
> The CVS commit which removed fence_manual.c (and thus my favourite
> agent :) says:
> 
> Remove fence_manual; only provide manual-failure override
> 
> Good.  Now, how do I use manual fencing?  What's to be written into
> cluster.conf?  (cluster 2.00.00)

Nothing!

:)

No fencing -> fencing fails immediately -> activates manual override
all by itself.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jun 13 14:40:38 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 13 Jun 2007 10:40:38 -0400
Subject: [Linux-cluster] ip.sh
In-Reply-To: <20070611180422.GF25899@helsinki.fi>
References: <20070611180422.GF25899@helsinki.fi>
Message-ID: <20070613144038.GI7203@redhat.com>

On Mon, Jun 11, 2007 at 09:04:22PM +0300, Janne Peltonen wrote:
> Hi!
> 
> While looking at process accounting on my new cluster system, I noticed
> that something called fs.sh is run some hundreds of times per second,
> sometimes more often, sometimes less. Abt 10 million times per day,
> anyway.

> Now I wonder if this is normal.

10 million times per day?  No, that's decidedly not normal.

> I've got a rhel 5 based system with 25
> services, 24 of which use ext3 fs's on clvm logical volumes in a SAN.

Could you tell us what version of rgmanager you have installed?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From wferi at niif.hu  Wed Jun 13 14:40:58 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Wed, 13 Jun 2007 16:40:58 +0200
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <20070612181907.GH16723@redhat.com> (David Teigland's message of
	"Tue, 12 Jun 2007 13:19:07 -0500")
References: <20070328162726.GF22230@redhat.com>
	<20070328163850.GG22230@redhat.com> <87wt06djk7.fsf@tac.ki.iif.hu>
	<20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87tztdmfvj.fsf@tac.ki.iif.hu>
	<20070612144652.GB16723@redhat.com> <877iq9mb4z.fsf@tac.ki.iif.hu>
	<20070612181907.GH16723@redhat.com>
Message-ID: <87k5u7kjd1.fsf@tac.ki.iif.hu>

David Teigland <teigland at redhat.com> writes:

> OK, that's interesting.  I was led to believe from some of my old
> testing that plocks wouldn't be that fast, but maybe that was a
> mistaken conclusion.  There's no reason not to use plocks with -l0
> if you're getting good results.

Thanks.  Looks like GFS with the new infrastructure can deliver the
performance I need without any compromise!  Really sweet indeed.
-- 
Feri.



From lhh at redhat.com  Wed Jun 13 14:41:24 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 13 Jun 2007 10:41:24 -0400
Subject: [Linux-cluster] Defining dependency of services
In-Reply-To: <297717.56421.qm@web52312.mail.re2.yahoo.com>
References: <297717.56421.qm@web52312.mail.re2.yahoo.com>
Message-ID: <20070613144124.GJ7203@redhat.com>

On Fri, Jun 08, 2007 at 09:19:36PM -0700, chirantha pitigala wrote:
> Hi all,
> 
> In RHCS 4, is there any way to define dependency or order of services? As I can see services are started in the order of defined in the cluster.conf. What is the exact order? Can we add dependency between them?
> 

Currently, no.  There's rudimentary interservice dependencies in RHEL5,
with more advanced ones planned.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jun 13 14:43:18 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 13 Jun 2007 10:43:18 -0400
Subject: [Linux-cluster] Defining dependency of services
In-Reply-To: <870798.3019.qm@web52311.mail.re2.yahoo.com>
References: <870798.3019.qm@web52311.mail.re2.yahoo.com>
Message-ID: <20070613144318.GK7203@redhat.com>

On Sat, Jun 09, 2007 at 06:14:30AM -0700, chirantha pitigala wrote:
> Hi roger,
> 
> What I need is dependency between services. e.g: service 1 depends on successful start of service 2.
> Is it possible to do by making parent-child relationship between script resources of two services? 

Not easily; one simple way to do it is to cause the script of service B
to wait for the service provided by service A to become available before
continuing, but this has some obvious drawbacks.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From wferi at niif.hu  Wed Jun 13 14:49:02 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Wed, 13 Jun 2007 16:49:02 +0200
Subject: [Linux-cluster] What is manual-failure override?
In-Reply-To: <20070613143145.GC16134@redhat.com> (David Teigland's message of
	"Wed, 13 Jun 2007 09:31:45 -0500")
References: <87lkepme5m.fsf@tac.ki.iif.hu> <20070613143145.GC16134@redhat.com>
Message-ID: <87fy4vkizl.fsf@tac.ki.iif.hu>

David Teigland <teigland at redhat.com> writes:

> On Tue, Jun 12, 2007 at 04:38:13PM +0200, Ferenc Wagner wrote:
>> Hi,
>> 
>> The CVS commit which removed fence_manual.c (and thus my favourite
>> agent :) says:
>> 
>> Remove fence_manual; only provide manual-failure override
>> 
>> Good.  Now, how do I use manual fencing?  What's to be written into
>> cluster.conf?  (cluster 2.00.00)
>
> You just leave the <fence> sections empty for all the nodes.  When
> a node fails, fenced no longer runs fence_manual, it just waits for
> you to run "fence_ack_manual <nodename>" which tells fenced to go on.

Thanks again.
-- 
Feri.



From lhh at redhat.com  Wed Jun 13 14:49:35 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 13 Jun 2007 10:49:35 -0400
Subject: [Linux-cluster] Defining dependency of services
In-Reply-To: <147317.77012.qm@web50611.mail.re2.yahoo.com>
References: <870798.3019.qm@web52311.mail.re2.yahoo.com>
	<147317.77012.qm@web50611.mail.re2.yahoo.com>
Message-ID: <20070613144935.GL7203@redhat.com>

On Sat, Jun 09, 2007 at 09:22:19PM -0700, Roger Pe?a wrote:
> 
> --- chirantha pitigala <chiranthlk at yahoo.com> wrote:
> 
> > Hi roger,
> > 
> > What I need is dependency between services. e.g:
> > service 1 depends on successful start of service 2.
> > Is it possible to do by making parent-child
> > relationship between script resources of two
> > services? 
> What I try to do is think of service as "cluster
> services", a service that a cluster will bring, not a
> unix service (httpd or ftp)

> if I do it in that way, nothing will stop me of having
> two unix services as resource of a cluster service,
> more important if this two unix services depend
> between them.

You can have as many dependent parts of a single service currently, but
not between them:

  <service name="1">
     <script name="ftp" file=...>
        <script name="httpd" file=.../>
     </script>
  </service>

If you have RHEL4.5, you can also put all the scripts at the top level
to ensure the same ordering:

  <service name="1">
     <script name="ftp" file=.../>
     <script name="httpd" file=.../>
  </service>

However, you can't do something like this right now:

  <resources>
     <script name="ftp" file=.../>
     <script name="httpd" file=.../>
  </resources>
  <service name="1">
     <script ref="httpd"/>
  </service>
  <service name="2">
     <script ref="ftp">
        [some magic stuff to wait for service 1]
     </script>
  </service>

There's partial (read: demo) code in head CVS which implements higher
level dependencies, but it is integrated with the rest of rgmanager
yet.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jun 13 14:53:55 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 13 Jun 2007 10:53:55 -0400
Subject: [Linux-cluster] dlm service is not stopping
In-Reply-To: <D566E8CF3538B54D95B925CB69CB4D2A046B16DB@inblr-exch1.eu.uis.unisys.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A046B16DB@inblr-exch1.eu.uis.unisys.com>
Message-ID: <20070613145355.GM7203@redhat.com>

On Tue, Jun 12, 2007 at 03:09:45PM +0530, Panigrahi, Santosh Kumar wrote:
> Hi Cluster Team,
> 
> I have configured a 2 node cluster (RHEL5). When I am shutting down the
> cluster, I am stopping "rgmanager' service first and then "cman'
> service.

Could you file a bugzilla about this?  I am guessing it is an rgmanager
bug, but it seems to work for me.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From teigland at redhat.com  Wed Jun 13 15:01:21 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 13 Jun 2007 10:01:21 -0500
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <87odjjkjgv.fsf@tac.ki.iif.hu>
References: <87wt06djk7.fsf@tac.ki.iif.hu> <20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87ejkhmctr.fsf@tac.ki.iif.hu>
	<20070612161148.GC16723@redhat.com> <87zm35ktyx.fsf@tac.ki.iif.hu>
	<20070612164418.GF16723@redhat.com> <87odjjkjgv.fsf@tac.ki.iif.hu>
Message-ID: <20070613150121.GD16134@redhat.com>

On Wed, Jun 13, 2007 at 04:38:40PM +0200, Ferenc Wagner wrote:
> David Teigland <teigland at redhat.com> writes:
> 
> >>>> But looks like nodeA feels obliged to communicate its locking
> >>>> process around the cluster.
> >>>
> >>> I'm not sure what you mean here.  To see the amount of dlm locking traffic
> >>> on the network, look at port 21064.  There should be very little in the
> >>> test above... and the dlm locking that you do see should mostly be related
> >>> to file i/o, not flocks.
> >> 
> >> There was much traffic on port 21064.  Possibly related to file I/O
> >> and not flocks, I can't tell.  But that's agrees with my speculation,
> >> that it's not the explicit [pf]locks that take much time, but
> >> something else.
> >
> > Could you comment the fcntl/flock calls out of the application entirely
> > and try it?
> 
> Let's see.  A typical test run looks like this (first with fcntl
> locking; tcpdump slows down the first iteration from about 6 s):
> 
> filecount=500
>   iteration=0 elapsed time=20.196318 s
>   iteration=1 elapsed time=0.323969 s
>   iteration=2 elapsed time=0.319929 s
>   iteration=3 elapsed time=0.361738 s
>   iteration=4 elapsed time=0.399365 s
> total elapsed time=21.601319 s
> 
> During the first (slow) iteration, there's much traffic on port 21064.
> During the next (fast) iterations there's no traffic at all on that port.
> If I rerun the test immediately, there's still no traffic.
> 5 minutes later, without any action on my part, there's a couple of
> packets again, then 20 s later a bigger bunch (around 30).
> After this, the first iteration generates much traffic again, GOTO 10.
> 
> If I use flock instead, the beginning is similar, but after about 10 s
> from the finish of the test, some small traffic appears by itself, and
> if I rerun the test after this, it generates traffic again, although
> much less than after 5 minutes.  The traffic generated 5 minutes after
> the test run consists of a couple of packets followed by a much bigger
> bunch 5 s later.
> 
> If I don't use any locking at all, then the situation is the same as
> with fcntl locking, but the "automatic" traffic consist of a small
> burst (couple of packets) 4 min 51 s after the finish, then about 30
> packets 25 s later.
> 
> Does it tell you anything?  The timings are perhaps somewhat off
> because of the 20 s runtime.  If you can make some sense out of this,

It sounds pretty normal, I'd need to repeat the test myself to figure out
exactly what's happening.  The 10 sec is probably toss_secs from the dlm;
you can increase with: echo 20 >> /sys/kernel/config/dlm/cluster/toss_secs

> I'd be glad to hear it.  Also, I'd like to tweak the 5 minutes
> timeout, where does it come from?  Is it settable by sysfs or
> gfs_tool?

gfs_tool gettune <mountpoint> | grep demote_secs

should show 300, to increase:

gfs_tool settune <mountpoint> demote_secs <value>

Dave



From lgodoy at atichile.com  Wed Jun 13 15:27:32 2007
From: lgodoy at atichile.com (Luis Godoy Gonzalez)
Date: Wed, 13 Jun 2007 11:27:32 -0400
Subject: [Linux-cluster] Problems with Cluster
In-Reply-To: <1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com>
References: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com>	<F154BEC9D4278D4BA9B94BDEC76193482BA62E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com>
Message-ID: <46700CE4.30307@atichile.com>

I have similar situation, in our case whe have redundant power supply 
for ensure the server and iLo always is power up.
Other way to fix this issue is using some quorum device. I' haven't 
worked whith them,  but I think may work it.

Bye

Manish Kathuria escribi?:
> On 6/11/07, Robert Gil <Robert.Gil at americanhm.com> wrote:
>> If ilo itself is off, fencing doesn't work.
>
> Isn't there any timeout setting such that if the ILO doesn't respond
> for a certain amount of time, it is treated as fenced and the node is
> considered to be dead and the failover takes place?
>
>>
>> Did you add ilo as a fence device? And create a user? You create a 
>> user in the ilo for that blade, not on the chassis. You have to 
>> reboot the blade to get to the ilo manager.
>
> Yes, had added respective ILOs as fence devices for both the servers
> and created users also.
>
>
> I just want to make sure that automatic fencing happens and failover
> takes place even when there is a complete power failure for one node
>
>>
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com 
>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Manish Kathuria
>> Sent: Monday, June 11, 2007 12:45 PM
>> To: linux clustering
>> Subject: Re: [Linux-cluster] Problems with Cluster
>>
>> On 6/11/07, Maciej Bogucki <maciej.bogucki at artegence.com> wrote:
>> > Manish Kathuria napisa?(a):
>>
>> > > We want the failover to happen when the power supply fails to either
>> > > of the nodes. In order to test the scenario, we removed the power
>> > > cables from one of the nodes. However the failover did not happen
>> > > and upon observing the logs we found that the alive node could not
>> > > connect to the fence device (ILO in this case) of the dead node
>> > > since it was powered off and the fencing could not take place. Does
>> > > this mean that we would not be able to have a failover in case of
>> > > power failure for one of the nodes. Is there a way we can do it ?
>> > > How is the cluster supposed to react when the ILO itself is 
>> powered off ?
>> >
>> > You need to perform manual fencing(administrator reaction) when it 
>> happend.
>> >
>>
>> Isn't there any way which is automated and does not require manual 
>> intervention ? Otherwise, the whole purpose gets defeated.
>>
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From mkathuria at tuxtechnologies.co.in  Wed Jun 13 16:15:03 2007
From: mkathuria at tuxtechnologies.co.in (Manish Kathuria)
Date: Wed, 13 Jun 2007 21:45:03 +0530
Subject: [Linux-cluster] Problems with Cluster
In-Reply-To: <200706120816.32533.grimme@atix.de>
References: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com>
	<F154BEC9D4278D4BA9B94BDEC76193482BA62E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com>
	<200706120816.32533.grimme@atix.de>
Message-ID: <1df4abe60706130915j379e1932ob4976044104dffe6@mail.gmail.com>

On 6/12/07, Marc Grimme <grimme at atix.de> wrote:
> On Tuesday 12 June 2007 03:29:00 Manish Kathuria wrote:
> > On 6/11/07, Robert Gil <Robert.Gil at americanhm.com> wrote:
> > > If ilo itself is off, fencing doesn't work.
> >
> > Isn't there any timeout setting such that if the ILO doesn't respond
> > for a certain amount of time, it is treated as fenced and the node is
> > considered to be dead and the failover takes place?
> As far as I remember there is only a tcp-timeout when establishing the
> connection to the ilo-card that takes a very long time to occure (that's a
> default setting and takes minutes). I'm not sure how and where to set it.

We did wait for quite some time and followed the messages appearing in
/var/log/messages. It kept on trying to contact the ILO of the node
which was powered off.

>
> But we've had this discussion (especially with ILO-Cards) nearly every time
> when using them and therefore and also out of other reasons we had to build
> our own fence_ilo agent. I'm quite sure that we solved the timeout problem in
> the end. It is set to 10sec per default (Config.timeout).
> You can find it at
> http://download.atix.de/yum/comoonics/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-16.noarch.rpm
> or directly use the yum/up2date-channel as described here:
> http://www.open-sharedroot.org/faq/can-i-use-yum-or-up2date-to-install-the-software/
> then install "comoonics-bootimage-fenceclient-ilo" and there you go.

Thanks, I will try and see if they agree to use this version.

> >
> > > Did you add ilo as a fence device? And create a user? You create a user
> > > in the ilo for that blade, not on the chassis. You have to reboot the
> > > blade to get to the ilo manager.
> >
> > Yes, had added respective ILOs as fence devices for both the servers
> > and created users also.
> We are doing so as well. Always a power user for ilo devices.
> We are also automating this with the ilo client.
> There is a undocumented switch -x in the fence_ilo client referenced above
> where you reference a file that might look as follows and you'll have your
> user.
> > I just want to make sure that automatic fencing happens and failover
> > takes place even when there is a complete power failure for one node
> If the timeout thing works you'll also need a second fence mechanism.
> You might think about using fence_manual as last resort, to bring that cluster
> back online after power failure and then after manual intervention.
>
> Regards Marc.

Just wondering if there is any undocumented option / switch which will
force an automatic failover to one node if the ILO on the other one
fails to respond within certain time period (maybe few minutes).

Regards,
--
Manish



From mkathuria at tuxtechnologies.co.in  Wed Jun 13 16:15:29 2007
From: mkathuria at tuxtechnologies.co.in (Manish Kathuria)
Date: Wed, 13 Jun 2007 21:45:29 +0530
Subject: [Linux-cluster] Problems with Cluster
In-Reply-To: <46700CE4.30307@atichile.com>
References: <1df4abe60706110945w658766cas3e4b9280a6a4a34c@mail.gmail.com>
	<F154BEC9D4278D4BA9B94BDEC76193482BA62E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<1df4abe60706111829n210de3b9m33ffa10dc4fe3afe@mail.gmail.com>
	<46700CE4.30307@atichile.com>
Message-ID: <1df4abe60706130915s3ae98404v2a7a543c4ab88d3d@mail.gmail.com>

On 6/13/07, Luis Godoy Gonzalez <lgodoy at atichile.com> wrote:
> I have similar situation, in our case whe have redundant power supply
> for ensure the server and iLo always is power up.
> Other way to fix this issue is using some quorum device. I' haven't
> worked whith them,  but I think may work it.
>
> Bye

In this scenario also, both the nodes have redundant power supply and
the iLo will always be powered but the users want the failover to
happen when a node (and therefore the associated iLo) doesn't receive
power supply at all.



From wferi at niif.hu  Wed Jun 13 16:53:10 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Wed, 13 Jun 2007 18:53:10 +0200
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <20070613150121.GD16134@redhat.com> (David Teigland's message of
	"Wed, 13 Jun 2007 10:01:21 -0500")
References: <87wt06djk7.fsf@tac.ki.iif.hu> <20070423211717.GA22147@redhat.com>
	<20070424193600.GB11156@redhat.com> <87abvya1ke.fsf@tac.ki.iif.hu>
	<20070521145058.GA2590@redhat.com> <87ejkhmctr.fsf@tac.ki.iif.hu>
	<20070612161148.GC16723@redhat.com> <87zm35ktyx.fsf@tac.ki.iif.hu>
	<20070612164418.GF16723@redhat.com> <87odjjkjgv.fsf@tac.ki.iif.hu>
	<20070613150121.GD16134@redhat.com>
Message-ID: <878xankd8p.fsf@tac.ki.iif.hu>

David Teigland <teigland at redhat.com> writes:

> The 10 sec is probably toss_secs from the dlm; you can increase
> with: echo 20 >> /sys/kernel/config/dlm/cluster/toss_secs
>
> gfs_tool gettune <mountpoint> | grep demote_secs
>
> should show 300, to increase:
>
> gfs_tool settune <mountpoint> demote_secs <value>

Great.  No problem to do that after each mount.
-- 
Thanks,
Feri.



From lhh at redhat.com  Wed Jun 13 19:04:13 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 13 Jun 2007 15:04:13 -0400
Subject: [Linux-cluster] fence device: network card?
In-Reply-To: <466F2C7F.9080906@klxsystems.net>
References: <466F2C7F.9080906@klxsystems.net>
Message-ID: <20070613190413.GN7203@redhat.com>

On Tue, Jun 12, 2007 at 04:30:07PM -0700, Karl R. Balsmeier wrote:
> Basically I read in the docs you can use a NIC card as a fence device, 
> is this true?

I don't think there's NIC-based fencing at this point; I could be
mistaken; I haven't looked at the fence tree for some time.

Could you point me at where it noted this so I can read it and send in
corrections?

> Right now each of the 3 servers have 3 NICs, so I have a total of 9 to 
> play with.  Right now I am bonding the two GB NIC's together no 
> problem.  That leaves each server a 100mbps NIC.


> My ultimate goal is to use these 3 machines to make a Vsftpd GFS cluster 
> that I can run Iscsi over. 

GNBD is not iSCSI.  It is similar in that it implements a block device
over the network, but to use fence_gnbd, you need to be using GNBD (not
iSCSI).

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Jun 13 20:09:52 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 13 Jun 2007 16:09:52 -0400
Subject: [Linux-cluster] dlm service is not stopping
In-Reply-To: <20070613145355.GM7203@redhat.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A046B16DB@inblr-exch1.eu.uis.unisys.com>
	<20070613145355.GM7203@redhat.com>
Message-ID: <20070613200952.GO7203@redhat.com>

On Wed, Jun 13, 2007 at 10:53:55AM -0400, Lon Hohberger wrote:
> On Tue, Jun 12, 2007 at 03:09:45PM +0530, Panigrahi, Santosh Kumar wrote:
> > Hi Cluster Team,
> > 
> > I have configured a 2 node cluster (RHEL5). When I am shutting down the
> > cluster, I am stopping "rgmanager' service first and then "cman'
> > service.
> 
> Could you file a bugzilla about this?  I am guessing it is an rgmanager
> bug, but it seems to work for me.
> 

We hit this while testing the fix for another bugzilla:

If most nodes of a cluster go offline, the last node(s) lose quorum.
This causes rgmanager to exit uncleanly, failing to clean up the
lockspace.  This prevents cman from stopping.

Is this what happened for you?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From Santosh.Panigrahi at in.unisys.com  Thu Jun 14 03:54:41 2007
From: Santosh.Panigrahi at in.unisys.com (Panigrahi, Santosh Kumar)
Date: Thu, 14 Jun 2007 09:24:41 +0530
Subject: [Linux-cluster] dlm service is not stopping
In-Reply-To: <20070613200952.GO7203@redhat.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A046B16DB@inblr-exch1.eu.uis.unisys.com><20070613145355.GM7203@redhat.com>
	<20070613200952.GO7203@redhat.com>
Message-ID: <D566E8CF3538B54D95B925CB69CB4D2A0479639B@inblr-exch1.eu.uis.unisys.com>

You are absolutely right. 

There are around 5 process starting name with dlm_* are running. I am
also not able to kill these processes by (kill -9 pid). So each time, I
am rebooting this node on facing this dlm problem. Please suggest me
some other way to kill these processes.

I will file a bugzilla for the same.

Thanks
Santosh

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger
Sent: Thursday, June 14, 2007 1:40 AM
To: linux clustering
Subject: Re: [Linux-cluster] dlm service is not stopping

On Wed, Jun 13, 2007 at 10:53:55AM -0400, Lon Hohberger wrote:
> On Tue, Jun 12, 2007 at 03:09:45PM +0530, Panigrahi, Santosh Kumar
wrote:
> > Hi Cluster Team,
> > 
> > I have configured a 2 node cluster (RHEL5). When I am shutting down
the
> > cluster, I am stopping "rgmanager' service first and then "cman'
> > service.
> 
> Could you file a bugzilla about this?  I am guessing it is an
rgmanager
> bug, but it seems to work for me.
> 

We hit this while testing the fix for another bugzilla:

If most nodes of a cluster go offline, the last node(s) lose quorum.
This causes rgmanager to exit uncleanly, failing to clean up the
lockspace.  This prevents cman from stopping.

Is this what happened for you?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From janne.peltonen at helsinki.fi  Thu Jun 14 05:38:25 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Thu, 14 Jun 2007 08:38:25 +0300
Subject: [Linux-cluster] ip.sh
In-Reply-To: <20070613144038.GI7203@redhat.com>
References: <20070611180422.GF25899@helsinki.fi>
	<20070613144038.GI7203@redhat.com>
Message-ID: <20070614053825.GK25899@helsinki.fi>

On Wed, Jun 13, 2007 at 10:40:38AM -0400, Lon Hohberger wrote:
> > I've got a rhel 5 based system with 25
> > services, 24 of which use ext3 fs's on clvm logical volumes in a SAN.
> 
> Could you tell us what version of rgmanager you have installed?

Seems to be 2.0.23. And, to be exact, my system is a CentOS 5. The
release of rgmanager is 1.el5.centos.


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From pcaulfie at redhat.com  Thu Jun 14 07:56:05 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 14 Jun 2007 08:56:05 +0100
Subject: [Linux-cluster] dlm service is not stopping
In-Reply-To: <D566E8CF3538B54D95B925CB69CB4D2A0479639B@inblr-exch1.eu.uis.unisys.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A046B16DB@inblr-exch1.eu.uis.unisys.com><20070613145355.GM7203@redhat.com>	<20070613200952.GO7203@redhat.com>
	<D566E8CF3538B54D95B925CB69CB4D2A0479639B@inblr-exch1.eu.uis.unisys.com>
Message-ID: <4670F495.8060808@redhat.com>

Panigrahi, Santosh Kumar wrote:
> You are absolutely right. 
> 
> There are around 5 process starting name with dlm_* are running. I am
> also not able to kill these processes by (kill -9 pid). So each time, I
> am rebooting this node on facing this dlm problem. Please suggest me
> some other way to kill these processes.
>

You can't kill those processes, they are DLM kernel threads. You need to shut
down any 'real' processes that are using the DLM, then they will go away.

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From cluster at defuturo.co.uk  Wed Jun 13 16:31:10 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Wed, 13 Jun 2007 17:31:10 +0100
Subject: [Linux-cluster] cmirror leg failure: dmeventd dies
Message-ID: <1181752271.10296.10.camel@rutabaga.defuturo.co.uk>

  I've recently upgraded from RHEL4U4 to RHEL4U5 in order to use the new
cmirror package instead of one I was building myself from CVS.

  Now, when I fail one of the PVs, the mirrored LV isn't converted to
linear. Instead, dmeventd dies like this: 

3377  send(6, "<15>Jun 12 16:50:34 lvm[3371]: Loaded external locking library liblvm2clusterlock.so", 84, MSG_NOSIGNAL) = 84
3377  socket(PF_FILE, SOCK_STREAM, 0)   = 7
3377  connect(7, {sa_family=AF_FILE, path=@clvmd}, 110) = 0
3377  time(NULL)                        = 1181663434
3377  stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1323, ...}) = 0
3377  stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1323, ...}) = 0
3377  stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1323, ...}) = 0
3377  send(6, "<15>Jun 12 16:50:34 lvm[3371]: Finding volume group \"lvm_test1\"", 63, MSG_NOSIGNAL) = 63
3377  stat64("/proc/lvm/VGs/lvm_test1", 0xf6fa6340) = -1 ENOENT (No such file or directory)
3377  rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], ~[HUP INT QUIT KILL TERM STOP RTMIN RT_1], 8) = 0
3377  writev(2, [{"<program name unknown>", 22}, {": ", 2}, {"symbol lookup error", 19}, {": ", 2}, {"/usr/lib/liblvm2clusterlock.so", 30}, {": ", 2}, {"undefined symbol: print_log", 27}, {"", 0}, {"", 0}, {"\n", 1}], 10) = 105
3377  exit_group(127)                   = ?

and logs this:

Jun 12 13:43:57 kiwano dmeventd[3371]: dmeventd ready for processing.
Jun 12 13:43:57 kiwano dmeventd[3371]: Monitoring mirror device lvm_test1-var for events 
Jun 12 13:44:08 kiwano lvm[3371]: lvm_test1-var is now in-sync 
Jun 12 16:50:33 kiwano lvm[3371]: Mirror device, 253:5, has failed. 
Jun 12 16:50:33 kiwano lvm[3371]: Device failure in lvm_test1-var 
Jun 12 16:50:33 kiwano lvm[3371]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended

  The "undefined symbol: print_log" error looks pretty fatal and "ldd
-r /usr/lib/liblvm2clusterlock.so" reports it too, though it does the
same on working clusters and a FC6 box as well.

  Can anyone suggest how to debug this?

	Thanks,

		Robert



From mbrookov at mines.edu  Thu Jun 14 14:37:14 2007
From: mbrookov at mines.edu (Matthew B. Brookover)
Date: Thu, 14 Jun 2007 08:37:14 -0600
Subject: [Linux-cluster] fence device: network card?
In-Reply-To: <466F2C7F.9080906@klxsystems.net>
References: <466F2C7F.9080906@klxsystems.net>
Message-ID: <1181831834.26590.35.camel@merlin.Mines.EDU>

Hi Karl,

GNDB is a block server, like ISCSI.  Unfortunately there does not appear
to be any standard fencing mechanism for ISCSI.  I hacked up one of the
existing fence agents to use SNMP to turn off the network ports in a
Cisco 3750 switch.  My test systems are using an old HP switch that the
network team was not using -- it works also.

System-config-cluster would not allow me to specify my own fence agent.
Well, I looked quickly, did not see any thing obvious, gave up and
edited cluster.conf with vi.

I have attached fence_cisco to this message, and here are some notes to
use it.  You may need to get the Net-SNMP perl module from CPAN.org.

The config file for fence_cisco looks like this:

community:<PUT YOUR SNMP COMMUNITY STRING HERE>
switch:10.1.4.254
oneoften:A1:C1
twooften:A5:C2
threeoften:A2:C3

The first line is your community string, the second line is the IP
address of the network switch, the rest are the hosts.  The first column
is the host name followed by a colon separated list of the ports that
the host is attached to on the Ethernet switch.  In the cluster.conf
file, the port parameter must match an entry in the host name column.  

<?xml version="1.0"?>
<cluster alias="DEV_ACN" config_version="11" name="DEV_ACN">
        <clusternodes>
                <clusternode name="oneoften_a.devstorage.local" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="ciscofence1" port="oneoften" switch="do not know"/>
                                </method>
                        </fence>
                        <multicast addr="224.0.0.1" interface="eth1"/>
                </clusternode>
                <clusternode name="twooften_a.devstorage.local" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="ciscofence1" port="twooften" switch="do not know"/>
                                </method>
                        </fence>
                        <multicast addr="224.0.0.1" interface="eth1"/>
                </clusternode>
                <clusternode name="threeoften_a.devstorage.local" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="ciscofence1" port="threeoften" switch="do not know"/>
                                </method>
                        </fence>
                        <multicast addr="224.0.0.1" interface="eth1"/>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_cisco" ipaddr="10.1.4.254" login="testing" name="ciscofence1" passwd="password"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <cman>
                <multicast addr="224.0.0.1"/>
        </cman>
</cluster>


The fence_cisco agent will 'fence a nic' in the network switch.

I have attached fence_cisco to this message.  I would suggest that you
test every thing carefully.

Matt

On Tue, 2007-06-12 at 16:30 -0700, Karl R. Balsmeier wrote:

> Hi,
> 
> I have three (3) servers built and entered into the 
> system-config-cluster tool as nodes.  Basically the first node has node 
> 2 and node 3 as members of the cluster.
> 
> For a fence device, I do not have any of the SAN or network/switch 
> devices listed in the dropdown menu, and where I have read in the 
> documentation that says "gnbd" Generic Network Block Device seems to be 
> what i'm looking for.
> 
> Basically I read in the docs you can use a NIC card as a fence device, 
> is this true?
> 
> Right now each of the 3 servers have 3 NICs, so I have a total of 9 to 
> play with.  Right now I am bonding the two GB NIC's together no 
> problem.  That leaves each server a 100mbps NIC.
> 
> My ultimate goal is to use these 3 machines to make a Vsftpd GFS cluster 
> that I can run Iscsi over. 
> 
> Being new to this though, i'll stick to the primary questions:  How does 
> one configure a fence device in the form of a NIC card?  Is the gnbd 
> item relevant to this?
> 
> -karl
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070614/a9f84875/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fence_cisco
Type: application/x-perl
Size: 10617 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070614/a9f84875/attachment.pl>

From janne.peltonen at helsinki.fi  Thu Jun 14 14:38:27 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Thu, 14 Jun 2007 17:38:27 +0300
Subject: er, fs.sh, not ip (Re: [Linux-cluster] ip.sh)
In-Reply-To: <20070614053825.GK25899@helsinki.fi>
References: <20070611180422.GF25899@helsinki.fi>
	<20070613144038.GI7203@redhat.com>
	<20070614053825.GK25899@helsinki.fi>
Message-ID: <20070614143827.GB15269@helsinki.fi>

On Thu, Jun 14, 2007 at 08:38:25AM +0300, Janne Peltonen wrote:
> On Wed, Jun 13, 2007 at 10:40:38AM -0400, Lon Hohberger wrote:
> > > I've got a rhel 5 based system with 25
> > > services, 24 of which use ext3 fs's on clvm logical volumes in a SAN.
> > 
> > Could you tell us what version of rgmanager you have installed?
> 
> Seems to be 2.0.23. And, to be exact, my system is a CentOS 5. The
> release of rgmanager is 1.el5.centos.

I noticed a bug in this message's subject. It was about fs.sh, not
ip.sh...


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From jbrassow at redhat.com  Thu Jun 14 15:10:14 2007
From: jbrassow at redhat.com (Jonathan Brassow)
Date: Thu, 14 Jun 2007 10:10:14 -0500
Subject: [Linux-cluster] cmirror leg failure: dmeventd dies
In-Reply-To: <1181752271.10296.10.camel@rutabaga.defuturo.co.uk>
References: <1181752271.10296.10.camel@rutabaga.defuturo.co.uk>
Message-ID: <8F041834-5149-4375-86C6-290B4CDC1D50@redhat.com>


On Jun 13, 2007, at 11:31 AM, Robert Clark wrote:

>   I've recently upgraded from RHEL4U4 to RHEL4U5 in order to use  
> the new
> cmirror package instead of one I was building myself from CVS.
>
>   Now, when I fail one of the PVs, the mirrored LV isn't converted to
> linear. Instead, dmeventd dies like this:
>
> 3377  send(6, "<15>Jun 12 16:50:34 lvm[3371]: Loaded external  
> locking library liblvm2clusterlock.so", 84, MSG_NOSIGNAL) = 84
> 3377  socket(PF_FILE, SOCK_STREAM, 0)   = 7
> 3377  connect(7, {sa_family=AF_FILE, path=@clvmd}, 110) = 0
> 3377  time(NULL)                        = 1181663434
> 3377  stat64("/etc/localtime", {st_mode=S_IFREG|0644,  
> st_size=1323, ...}) = 0
> 3377  stat64("/etc/localtime", {st_mode=S_IFREG|0644,  
> st_size=1323, ...}) = 0
> 3377  stat64("/etc/localtime", {st_mode=S_IFREG|0644,  
> st_size=1323, ...}) = 0
> 3377  send(6, "<15>Jun 12 16:50:34 lvm[3371]: Finding volume group  
> \"lvm_test1\"", 63, MSG_NOSIGNAL) = 63
> 3377  stat64("/proc/lvm/VGs/lvm_test1", 0xf6fa6340) = -1 ENOENT (No  
> such file or directory)
> 3377  rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], ~[HUP INT QUIT  
> KILL TERM STOP RTMIN RT_1], 8) = 0
> 3377  writev(2, [{"<program name unknown>", 22}, {": ", 2},  
> {"symbol lookup error", 19}, {": ", 2}, {"/usr/lib/ 
> liblvm2clusterlock.so", 30}, {": ", 2}, {"undefined symbol:  
> print_log", 27}, {"", 0}, {"", 0}, {"\n", 1}], 10) = 105
> 3377  exit_group(127)                   = ?
>
> and logs this:
>
> Jun 12 13:43:57 kiwano dmeventd[3371]: dmeventd ready for processing.
> Jun 12 13:43:57 kiwano dmeventd[3371]: Monitoring mirror device  
> lvm_test1-var for events
> Jun 12 13:44:08 kiwano lvm[3371]: lvm_test1-var is now in-sync
> Jun 12 16:50:33 kiwano lvm[3371]: Mirror device, 253:5, has failed.
> Jun 12 16:50:33 kiwano lvm[3371]: Device failure in lvm_test1-var
> Jun 12 16:50:33 kiwano lvm[3371]: WARNING: dev_open(/etc/lvm/ 
> lvm.conf) called while suspended
>
>   The "undefined symbol: print_log" error looks pretty fatal and "ldd
> -r /usr/lib/liblvm2clusterlock.so" reports it too, though it does the
> same on working clusters and a FC6 box as well.
>
>   Can anyone suggest how to debug this?

/etc/lvm/lvm.conf:
locking_type = 3

?

  brassow



From lhh at redhat.com  Thu Jun 14 18:14:32 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 14 Jun 2007 14:14:32 -0400
Subject: er, fs.sh, not ip (Re: [Linux-cluster] ip.sh)
In-Reply-To: <20070614143827.GB15269@helsinki.fi>
References: <20070611180422.GF25899@helsinki.fi>
	<20070613144038.GI7203@redhat.com>
	<20070614053825.GK25899@helsinki.fi>
	<20070614143827.GB15269@helsinki.fi>
Message-ID: <20070614181432.GQ7203@redhat.com>

On Thu, Jun 14, 2007 at 05:38:27PM +0300, Janne Peltonen wrote:
> On Thu, Jun 14, 2007 at 08:38:25AM +0300, Janne Peltonen wrote:
> > On Wed, Jun 13, 2007 at 10:40:38AM -0400, Lon Hohberger wrote:
> > > > I've got a rhel 5 based system with 25
> > > > services, 24 of which use ext3 fs's on clvm logical volumes in a SAN.
> > > 
> > > Could you tell us what version of rgmanager you have installed?
> > 
> > Seems to be 2.0.23. And, to be exact, my system is a CentOS 5. The
> > release of rgmanager is 1.el5.centos.
> 
> I noticed a bug in this message's subject. It was about fs.sh, not
> ip.sh...

It doesn't matter; it shouldn't do that :)  I think it's related to
something that may be fixed in my sandbox; I'll check on it some more.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From cluster at defuturo.co.uk  Thu Jun 14 20:17:08 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Thu, 14 Jun 2007 21:17:08 +0100
Subject: [Linux-cluster] cmirror leg failure: dmeventd dies
In-Reply-To: <8F041834-5149-4375-86C6-290B4CDC1D50@redhat.com>
References: <1181752271.10296.10.camel@rutabaga.defuturo.co.uk>
	<8F041834-5149-4375-86C6-290B4CDC1D50@redhat.com>
Message-ID: <1181852229.17288.19.camel@localhost.localdomain>

On Thu, 2007-06-14 at 10:10 -0500, Jonathan Brassow wrote:
> On Jun 13, 2007, at 11:31 AM, Robert Clark wrote:
> 
> >   I've recently upgraded from RHEL4U4 to RHEL4U5 in order to use  
> > the new
> > cmirror package instead of one I was building myself from CVS.
> >
> >   Now, when I fail one of the PVs, the mirrored LV isn't converted to
> > linear. Instead, dmeventd dies

> /etc/lvm/lvm.conf:
> locking_type = 3

  Thanks - that got it. Mine was still set to 2 from lvmconf run from
the RHEL4U4 lvm2-cluster package.

	Robert



From andremachado at techforce.com.br  Thu Jun 14 20:22:14 2007
From: andremachado at techforce.com.br (andremachado)
Date: Thu, 14 Jun 2007 13:22:14 -0700
Subject: [Linux-cluster] GFS over GNBD freezes
Message-ID: <23e2820769f1b99a2a52b11409ba6e73@localhost>

Hello,
When executing a copy operation  over the same gfs over gnbd that is being written, it freezes (randomly).
the problem does not happen when gfs is locally mounted or when writing to 2 different gfs+gnbd devices.
Please, what configuration is missing?
Regards.
Andre Felipe






From maciej.bogucki at artegence.com  Fri Jun 15 08:34:28 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 15 Jun 2007 10:34:28 +0200
Subject: [Linux-cluster] GFS over GNBD freezes
In-Reply-To: <23e2820769f1b99a2a52b11409ba6e73@localhost>
References: <23e2820769f1b99a2a52b11409ba6e73@localhost>
Message-ID: <46724F14.5080505@artegence.com>

andremachado napisa?(a):
> Hello,
> When executing a copy operation  over the same gfs over gnbd that is being written, it freezes (randomly).
> the problem does not happen when gfs is locally mounted or when writing to 2 different gfs+gnbd devices.
> Please, what configuration is missing?
> Regards.
> Andre Felipe
> 

Maybe this URL would help you:

http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch

Best Regards
Maciej Bogucki



From mathieu.avila at seanodes.com  Fri Jun 15 08:52:08 2007
From: mathieu.avila at seanodes.com (Mathieu Avila)
Date: Fri, 15 Jun 2007 10:52:08 +0200
Subject: [Linux-cluster] Error when starting ccsd and proposed patch
Message-ID: <20070615105208.0fc6fd76@mathieu.toulouse>

Hello all,

I'm sometimes having trouble when starting ccsd and then gulm under
heavy CPU load. Ccsd's init script tells it is running but it's not
fully initialized.
The problem comes from the fact that ccsd's main process returns before
the daemonized process of ccsd has finished initializing its sockets.
The "cluster_communicator" thread sends a SIGTERM message to the parent
process before the main thread has finished its initialization work.

With the patch proposed in attachement, the cluster_communicator is
started after the main thread has finished initializing. It works
well under any load. Any daemon that needs to connect ccsd will
then succceed. 
It was tested with cluster-1.03, but it should work with older
versions, the ccsd files didn't seem to have changed much.

--
Mathieu Avila
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ccsd-init.patch
Type: text/x-patch
Size: 704 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070615/9930bf9f/attachment.bin>

From jad at midentity.com  Fri Jun 15 12:04:49 2007
From: jad at midentity.com (James Dyer)
Date: Fri, 15 Jun 2007 13:04:49 +0100 (BST)
Subject: [Linux-cluster] Need some advice, setting up first clustered FS
Message-ID: <Pine.LNX.4.44.0706151215250.15769-100000@new-server.midentity.com>

I'm trying to set up my first clustered FS, but before I waste time trying 
things, only to find they don't work, I thought it would be a good idea to 
ask the esteemed members of this list for some opinions.

At the moment, I have three webservers, which share storage via an NFS 
mount to a server with 1TB space on it, The file server exports a 800GB 
partition to these servers.  The 800GB partition is a stripe over 2 
500MB SATA disks. This 800GB partition is syncronised to another server 
using Unison every 30 mins.

NFS is really not working for us; hitting all sorts of problems with it. 
Additionally, the above solution is obviously not at all fault tolerant, 
nor expandable, so it's time to look at other options.

Budget limited at the moment, so really need to stick with the hardware 
I've currently got.

The solution I'm thinking of is as follows; I'd like some opinions on 
whether or not this is a good idea, or if it's stupid, or impossible etc.

1- On each of the file servers, keep the existing 800GB raid0 stripe. 
2- Using vblade, present these stripes to both file servers over AoE
3- On each file server, create a raid1 volume of both raid0 stripes
4- Put a gfs filesystem on the raid1 volume, mount on webservers using 
   gfs etc.

Some questions:
1- I'm not sure if stage 3 is do-able or not. I'm not sure if I can create 
a raid1 volume from two AoE volumes. Some things I've read say no, some 
say perhaps.
2- Can I actually present a device over AoE to the same physical server 
it's installed in, or would the volume need to be made from the AoE 
device from the other server, and the physical device on this server? 
(think that question kinda makes sense...)

Really keen to make this very expandible in the future, and fault 
tolerant, so would expect to move to a raid5/10 system at some point. 
This would be accomplished by having more file servers exporting a stripe 
over AoE.   At this point, I would imagine I'd have a couple of servers in 
front of the disk farm servers to actually create the gfs partition, and 
it is these servers that the webservers would communicate directly with.

Hope what I've written makes some semblance of sense...

Thanks in advance for any advice/pointers

James


-- 
July 27th, 2007 - System Administrator Appreciation Day - http://www.sysadminday.com/






From Robert.Gil at americanhm.com  Fri Jun 15 12:31:59 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Fri, 15 Jun 2007 08:31:59 -0400
Subject: [Linux-cluster] Need some advice, setting up first clustered FS
In-Reply-To: <Pine.LNX.4.44.0706151215250.15769-100000@new-server.midentity.com>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA66E@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

James,
 
I have been looking into similar implementations for our testing
environment. We use san everywhere, and since san is so expensive I was
considering using AoE to imitate it and a lower cost.

As you said, you have 3 webservers and 1 fileserver. If you use AoE each
of the 3 servers can mount that device and you can use GFS for the file
locking. Each server will see the SAME disk. If you use LVM on the file
server, you can expand the fileserver as much as you want. Since AoE is
block level storage, you can add additional fileservers and use LVM on
the webservers to expand the AoE disks. If this is to be production I
would add some fault tolerance, with at least channel bonding, if not
two switches for redundancy on the AoE side. When you do this however,
your system will see two sets of disks, and you will need to use
multipathing to handle the multiple paths and create a pseudo device so
in the event of a failure, it is relatively transparent to the OS. 

If you do bonded GigE, your doing pretty well as far as throughput in
comparison to FC. I assume the latencies differ significantly between
GigE and FC, but I don't know what the percent is.

Hope that helps.

Robert Gil
Linux Systems Administrator
American Home Mortgage


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Dyer
Sent: Friday, June 15, 2007 8:05 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Need some advice, setting up first clustered FS

I'm trying to set up my first clustered FS, but before I waste time
trying things, only to find they don't work, I thought it would be a
good idea to ask the esteemed members of this list for some opinions.

At the moment, I have three webservers, which share storage via an NFS
mount to a server with 1TB space on it, The file server exports a 800GB
partition to these servers.  The 800GB partition is a stripe over 2
500MB SATA disks. This 800GB partition is syncronised to another server
using Unison every 30 mins.

NFS is really not working for us; hitting all sorts of problems with it.

Additionally, the above solution is obviously not at all fault tolerant,
nor expandable, so it's time to look at other options.

Budget limited at the moment, so really need to stick with the hardware
I've currently got.

The solution I'm thinking of is as follows; I'd like some opinions on
whether or not this is a good idea, or if it's stupid, or impossible
etc.

1- On each of the file servers, keep the existing 800GB raid0 stripe. 
2- Using vblade, present these stripes to both file servers over AoE
3- On each file server, create a raid1 volume of both raid0 stripes
4- Put a gfs filesystem on the raid1 volume, mount on webservers using 
   gfs etc.

Some questions:
1- I'm not sure if stage 3 is do-able or not. I'm not sure if I can
create a raid1 volume from two AoE volumes. Some things I've read say
no, some say perhaps.
2- Can I actually present a device over AoE to the same physical server
it's installed in, or would the volume need to be made from the AoE
device from the other server, and the physical device on this server? 
(think that question kinda makes sense...)

Really keen to make this very expandible in the future, and fault
tolerant, so would expect to move to a raid5/10 system at some point. 
This would be accomplished by having more file servers exporting a
stripe 
over AoE.   At this point, I would imagine I'd have a couple of servers
in 
front of the disk farm servers to actually create the gfs partition, and
it is these servers that the webservers would communicate directly with.

Hope what I've written makes some semblance of sense...

Thanks in advance for any advice/pointers

James


--
July 27th, 2007 - System Administrator Appreciation Day -
http://www.sysadminday.com/




--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From andremachado at techforce.com.br  Fri Jun 15 14:05:11 2007
From: andremachado at techforce.com.br (andremachado)
Date: Fri, 15 Jun 2007 7:05:11 -0700
Subject: [Linux-cluster] GFS over GNBD freezes
In-Reply-To: <46724F14.5080505@artegence.com>
References: <46724F14.5080505@artegence.com>
Message-ID: <495714f1a428bff7858e54afc745bab7@localhost>

Hello,
Many thanks for your message.
Well, actually the article is from my blog...
So, I already executed all steps described there, and much more ideas also.
The synchronization solved the problem of single directly writing, but not **concurrent** access over gnbd.
oopses_ok is a frigthning proposition...
Read across RH docs, tried many cluster.conf configurations, enabled debug and collected some data.
Unfortunately, I still have only some "suspects".
It ***"seems"*** that the cluster locking (suite 1.03.02) is not so robust, and heavily depends on fast FC private networks, not implementing suitable semaphores and handshaking.
I am trying to implement iscsi now (reading docs phase). But, maybe, the problems arise again if the real cause is at GFS / clvm locking coordination and not at gnbd itself.
As RH docs suggests at fig. 1.13 [0] that the intended idea is feasible, I am trying it.
do you have any additional ideas?
How to spot the real cause?
Regards.
Andre Felipe Machado

[0] http://elibrary.fultus.com/technical/topic/com.fultus.redhat.elinux5/manuals/Cluster_Suite_Overview/s2-ov-economy-CSO.html



> 
> Maybe this URL would help you:
> 
> http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch
> 
> Best Regards
> Maciej Bogucki



From eschneid at uccs.edu  Fri Jun 15 17:29:20 2007
From: eschneid at uccs.edu (Eric Schneider)
Date: Fri, 15 Jun 2007 11:29:20 -0600
Subject: [Linux-cluster] CommuniGate as service?
Message-ID: <002701c7af72$b532b400$1b03c680@uccs.edu>

I am trying to setup CommuniGate
(http://www.stalker.com/content/default.html) as a cluster service.  I have
everything working except for one thing.  If I stop/kill the process
manually and wait for the status check to move the service nothing happens.


I get "clurgmgrd: [3087]: <info> Executing /etc/init.d/CommuniGate status"
rather than the "clurgmgrd: [3087]: <err> script:httpd-webcal: status of
/etc/init.d/httpd-webcal failed (returned 3)" I get with an apache process.

I had to add the "status" part to /etc/init.d/CommuniGate file myself, but
there is obviously a problem.  I have mucked around with it for a while but
I am just putting it out there if my mistakes are obvious.  

# If you have placed the application folder in a different directory,
#    change the APPLICATION variable
#
# The default location for the CommuniGate Pro "base directory" (a folder
# containing mail accounts, settings, logs, etc.) is /var/CommuniGate
# If you want to use a different location, change the BASEFOLDER variable
#
#
APPLICATION="/opt"
BASEFOLDER="/var/CommuniGate"
SUPPLPARAMS=
PROG="/opt/CommuniGate/CGServer"
#ADDED
pidfile=${PIDFILE-/var/run/CommuniGate.pid}
lockfile=${LOCKFILE-/var/lock/subsys/CommuniGate}
RETVAL=0
#/ADDED
[ -f ${APPLICATION}/CommuniGate/CGServer ] || exit 0

# Some Linux distributions come with the "NPTL" threads library
# that crashes quite often. The following lines are believed to force
# Linux to use the old working threads library.
#
#LD_ASSUME_KERNEL=2.4.1
#export LD_ASSUME_KERNEL

# Source function library.
if [ -f /etc/rc.d/init.d/functions ]; then
  . /etc/rc.d/init.d/functions
elif [ -f /etc/init.d/functions ]; then
  . /etc/init.d/functions
fi

ulimit -u 2000
ulimit -c 2097151
umask 0

# Custom startup parameters
if [ -f ${BASEFOLDER}/Startup.sh ]; then
  . ${BASEFOLDER}/Startup.sh
fi

case "$1" in
  start)
    if [ -d ${BASEFOLDER} ] ; then
      echo
    else
      echo "Creating the CommuniGate Base Folder..."
      mkdir ${BASEFOLDER}
      chgrp mail ${BASEFOLDER}
      chmod 2770 ${BASEFOLDER}
    fi
    echo -n "Starting CommuniGate Pro"
    ${APPLICATION}/CommuniGate/CGServer \
        --Base ${BASEFOLDER} --Daemon ${SUPPLPARAMS} \
#       --ClusterBackend
#       --ClusterFrontend
    #Comment out
    #touch /var/lock/subsys/CommuniGate
    #ADD
    touch ${pidfile}
    RETVAL=$?
    echo
    [ "$RETVAL = 0" ] && touch ${lockfile}
    #return $RETVAL
    #/ADDED
    ;;
  controller)
    echo "Starting CommuniGate Pro Cluster Controller"
    ${APPLICATION}/CommuniGate/CGServer \
        --Base ${BASEFOLDER} --Daemon ${SUPPLPARAMS} \
        --ClusterController
    touch /var/lock/subsys/CommuniGate
    ;;
  stop)
     if [ -f ${BASEFOLDER}/ProcessID ]; then
       echo "Shutting down the CommuniGate Pro Server"
       kill `cat ${BASEFOLDER}/ProcessID`
       sleep 5
     else
       echo "It looks like the CommuniGate Pro Server is not running"
     fi
     #eric
     rm -f ${pidfile}
     ##Orig
     #rm -f /var/lock/subsys/CommuniGate
     #ADDED
     RETVAL=$?
     echo
     [ "$RETVAL = 3" ] && rm -f ${lockfile} #${pidfile}
     #/ADDED
     ;;
     #ADDED
  status)
     status $PROG
     RETVAL=$?
     ;;
     #/ADDED
  *)
     echo "Usage: $0 [ start | stop | status ]"
     exit 1
esac

exit 0






From jonyahoo at directfreight.com  Sat Jun 16 03:41:01 2007
From: jonyahoo at directfreight.com (Jon Gabrielson)
Date: Fri, 15 Jun 2007 22:41:01 -0500 (CDT)
Subject: [Linux-cluster] gfs and software raid across 4 systems.
Message-ID: <5474.12.227.156.125.1181965261.squirrel@www.directfreight.com>

I have 4 servers each with their own harddrive.
I would like to setup gfs so that each can read/write
to their local harddrive and the data synced to the other 3 harddrives.
My original idea was to export all 4 harddrives with a network block
device like iscsi/gnbd/nbd and then use md software raid1 on top of
those but I've heard that this is a bad idea as md is not cluster aware.
I also found drbd but it only supports 2 harddrives.

Are there any available solutions for doing a 4 way mirror like I
am looking for?   Basically a network raid 1 across 4 harddrives
on 4 separate systems.


Thanks,


Jon.



From Robert.Gil at americanhm.com  Sun Jun 17 05:34:24 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Sun, 17 Jun 2007 01:34:24 -0400
Subject: [Linux-cluster] clusvcadm hangs starting service
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA686@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

I have a service for an ip which does a mysql check as a dependancy for
failover. For some reason clusvcadm hangs when trying to either enable,
disable, or restart that service. I get no errors in the error log and
it just hangs. This is luckily in our test environment, but this will
not be good in production. We use this floating ip for a couple of mysql
servers doing replication so we always want to have the ip pointing to
the master (rw) server.
 
Has anyone seen the services hanging? How can this be resolved? Should
rgmanager be restarted? wont this fence the server?
 
Thanks,
 
Robert Gil
Linux Systems Administrator
American Home Mortgage
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070617/5298752a/attachment.htm>

From faizn2000 at yahoo.com  Mon Jun 18 07:20:54 2007
From: faizn2000 at yahoo.com (faiz n)
Date: Mon, 18 Jun 2007 00:20:54 -0700 (PDT)
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 38, Issue 25
In-Reply-To: <20070616160007.3871E73B03@hormel.redhat.com>
Message-ID: <594781.25813.qm@web36604.mail.mud.yahoo.com>

Hi All,
   
  I have intall globus on ma PC. I have done everything but when I run the command
   
  root at host:# etc/init.d/globus-4.0.4 start
   
  I gave me the error that
  libglobus_common_gcc32.so.0. No such file or directory
  When I downloaded this file and run the command again, then problem came that:
  ltdl_common_..... No such file or directory
  When I downloaded it again and run the command again now then this error comes
   
  /usr/local /sbin/globus-start-container-detached undefined symbol lookup error: globus_callback_space_reference.
   
  so can anyone help me that why this error comes to me and how can I fix this.
   
  take Care
  Thanks
  Faiz.
  

linux-cluster-request at redhat.com wrote:
  Send Linux-cluster mailing list submissions to
linux-cluster at redhat.com

To subscribe or unsubscribe via the World Wide Web, visit
https://www.redhat.com/mailman/listinfo/linux-cluster
or, via email, send a message with subject or body 'help' to
linux-cluster-request at redhat.com

You can reach the person managing the list at
linux-cluster-owner at redhat.com

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Linux-cluster digest..."


Today's Topics:

1. CommuniGate as service? (Eric Schneider)
2. gfs and software raid across 4 systems. (Jon Gabrielson)


----------------------------------------------------------------------

Message: 1
Date: Fri, 15 Jun 2007 11:29:20 -0600
From: "Eric Schneider" 
Subject: [Linux-cluster] CommuniGate as service?
To: "'linux clustering'" 

Message-ID: <002701c7af72$b532b400$1b03c680 at uccs.edu>
Content-Type: text/plain; charset="US-ASCII"

I am trying to setup CommuniGate
(http://www.stalker.com/content/default.html) as a cluster service. I have
everything working except for one thing. If I stop/kill the process
manually and wait for the status check to move the service nothing happens.


I get "clurgmgrd: [3087]: Executing /etc/init.d/CommuniGate status"
rather than the "clurgmgrd: [3087]: script:httpd-webcal: status of
/etc/init.d/httpd-webcal failed (returned 3)" I get with an apache process.

I had to add the "status" part to /etc/init.d/CommuniGate file myself, but
there is obviously a problem. I have mucked around with it for a while but
I am just putting it out there if my mistakes are obvious. 

# If you have placed the application folder in a different directory,
# change the APPLICATION variable
#
# The default location for the CommuniGate Pro "base directory" (a folder
# containing mail accounts, settings, logs, etc.) is /var/CommuniGate
# If you want to use a different location, change the BASEFOLDER variable
#
#
APPLICATION="/opt"
BASEFOLDER="/var/CommuniGate"
SUPPLPARAMS=
PROG="/opt/CommuniGate/CGServer"
#ADDED
pidfile=${PIDFILE-/var/run/CommuniGate.pid}
lockfile=${LOCKFILE-/var/lock/subsys/CommuniGate}
RETVAL=0
#/ADDED
[ -f ${APPLICATION}/CommuniGate/CGServer ] || exit 0

# Some Linux distributions come with the "NPTL" threads library
# that crashes quite often. The following lines are believed to force
# Linux to use the old working threads library.
#
#LD_ASSUME_KERNEL=2.4.1
#export LD_ASSUME_KERNEL

# Source function library.
if [ -f /etc/rc.d/init.d/functions ]; then
. /etc/rc.d/init.d/functions
elif [ -f /etc/init.d/functions ]; then
. /etc/init.d/functions
fi

ulimit -u 2000
ulimit -c 2097151
umask 0

# Custom startup parameters
if [ -f ${BASEFOLDER}/Startup.sh ]; then
. ${BASEFOLDER}/Startup.sh
fi

case "$1" in
start)
if [ -d ${BASEFOLDER} ] ; then
echo
else
echo "Creating the CommuniGate Base Folder..."
mkdir ${BASEFOLDER}
chgrp mail ${BASEFOLDER}
chmod 2770 ${BASEFOLDER}
fi
echo -n "Starting CommuniGate Pro"
${APPLICATION}/CommuniGate/CGServer \
--Base ${BASEFOLDER} --Daemon ${SUPPLPARAMS} \
# --ClusterBackend
# --ClusterFrontend
#Comment out
#touch /var/lock/subsys/CommuniGate
#ADD
touch ${pidfile}
RETVAL=$?
echo
[ "$RETVAL = 0" ] && touch ${lockfile}
#return $RETVAL
#/ADDED
;;
controller)
echo "Starting CommuniGate Pro Cluster Controller"
${APPLICATION}/CommuniGate/CGServer \
--Base ${BASEFOLDER} --Daemon ${SUPPLPARAMS} \
--ClusterController
touch /var/lock/subsys/CommuniGate
;;
stop)
if [ -f ${BASEFOLDER}/ProcessID ]; then
echo "Shutting down the CommuniGate Pro Server"
kill `cat ${BASEFOLDER}/ProcessID`
sleep 5
else
echo "It looks like the CommuniGate Pro Server is not running"
fi
#eric
rm -f ${pidfile}
##Orig
#rm -f /var/lock/subsys/CommuniGate
#ADDED
RETVAL=$?
echo
[ "$RETVAL = 3" ] && rm -f ${lockfile} #${pidfile}
#/ADDED
;;
#ADDED
status)
status $PROG
RETVAL=$?
;;
#/ADDED
*)
echo "Usage: $0 [ start | stop | status ]"
exit 1
esac

exit 0






------------------------------

Message: 2
Date: Fri, 15 Jun 2007 22:41:01 -0500 (CDT)
From: "Jon Gabrielson" 
Subject: [Linux-cluster] gfs and software raid across 4 systems.
To: linux-cluster at redhat.com
Message-ID:
<5474.12.227.156.125.1181965261.squirrel at www.directfreight.com>
Content-Type: text/plain;charset=iso-8859-1

I have 4 servers each with their own harddrive.
I would like to setup gfs so that each can read/write
to their local harddrive and the data synced to the other 3 harddrives.
My original idea was to export all 4 harddrives with a network block
device like iscsi/gnbd/nbd and then use md software raid1 on top of
those but I've heard that this is a bad idea as md is not cluster aware.
I also found drbd but it only supports 2 harddrives.

Are there any available solutions for doing a 4 way mirror like I
am looking for? Basically a network raid 1 across 4 harddrives
on 4 separate systems.


Thanks,


Jon.



------------------------------

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

End of Linux-cluster Digest, Vol 38, Issue 25
*********************************************


       
---------------------------------
Take the Internet to Go: Yahoo!Go puts the Internet in your pocket: mail, news, photos & more. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070618/07d6be26/attachment.htm>

From samin at isaaviation.ae  Mon Jun 18 07:40:27 2007
From: samin at isaaviation.ae (Sushanth Amin)
Date: Mon, 18 Jun 2007 11:40:27 +0400
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 38, Issue 25
In-Reply-To: <594781.25813.qm@web36604.mail.mud.yahoo.com>
Message-ID: <200706181147.l5IBlA7L006366@isaaviation.ae>

Hello Faiz, 

 

Follow the steps mentioned in the link given below

 

http://vdt.cs.wisc.edu/releases/1.2.4/installing-rpms.html

 

Thanks & Regards 

 

Sushanth Amin

 

  _____  

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of faiz n
Sent: Monday, June 18, 2007 11:21 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 38, Issue 25

 

Hi All,

 

I have intall globus on ma PC. I have done everything but when I run the
command

 

root at host:# etc/init.d/globus-4.0.4 start

 

I gave me the error that

libglobus_common_gcc32.so.0. No such file or directory

When I downloaded this file and run the command again, then problem came
that:

ltdl_common_..... No such file or directory

When I downloaded it again and run the command again now then this error
comes

 

/usr/local /sbin/globus-start-container-detached undefined symbol lookup
error: globus_callback_space_reference.

 

so can anyone help me that why this error comes to me and how can I fix
this.

 

take Care

Thanks

Faiz.



linux-cluster-request at redhat.com wrote:

Send Linux-cluster mailing list submissions to
linux-cluster at redhat.com

To subscribe or unsubscribe via the World Wide Web, visit
https://www.redhat.com/mailman/listinfo/linux-cluster
or, via email, send a message with subject or body 'help' to
linux-cluster-request at redhat.com

You can reach the person managing the list at
linux-cluster-owner at redhat.com

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Linux-cluster digest..."


Today's Topics:

1. CommuniGate as service? (Eric Schneider)
2. gfs and software raid across 4 systems. (Jon Gabrielson)


----------------------------------------------------------------------

Message: 1
Date: Fri, 15 Jun 2007 11:29:20 -0600
From: "Eric Schneider" 
Subject: [Linux-cluster] CommuniGate as service?
To: "'linux clustering'" 
Message-ID: <002701c7af72$b532b400$1b03c680 at uccs.edu>
Content-Type: text/plain; charset="US-ASCII"

I am trying to setup CommuniGate
(http://www.stalker.com/content/default.html) as a cluster service. I have
everything working except for one thing. If I stop/kill the process
manually and wait for the status check to move the service nothing happens.


I get "clurgmgrd: [3087]: Executing /etc/init.d/CommuniGate status"
rather than the "clurgmgrd: [3087]: script:httpd-webcal: status of
/etc/init.d/httpd-webcal failed (returned 3)" I get with an apache process.

I had to add the "status" part to /etc/init.d/CommuniGate file myself, but
there is obviously a problem. I have mucked around with it for a while but
I am just putting it out there if my mistakes are obvious. 

# If you have placed the application folder in a different directory,
# change the APPLICATION variable
#
# The default location for the CommuniGate Pro "base directory" (a folder
# containing mail accounts, settings, logs, etc.) is /var/CommuniGate
# If you want to use a different location, change the BASEFOLDER variable
#
#
APPLICATION="/opt"
BASEFOLDER="/var/CommuniGate"
SUPPLPARAMS=
PROG="/opt/CommuniGate/CGServer"
#ADDED
pidfile=${PIDFILE-/var/run/CommuniGate.pid}
lockfile=${LOCKFILE-/var/lock/subsys/CommuniGate}
RETVAL=0
#/ADDED
[ -f ${APPLICATION}/CommuniGate/CGServer ] || exit 0

# Some Linux distributions come with the "NPTL" threads library
# that crashes quite often. The following lines are believed to force
# Linux to use the old working threads library.
#
#LD_ASSUME_KERNEL=2.4.1
#export LD_ASSUME_KERNEL

# Source function library.
if [ -f /etc/rc.d/init.d/functions ]; then
. /etc/rc.d/init.d/functions
elif [ -f /etc/init.d/functions ]; then
. /etc/init.d/functions
fi

ulimit -u 2000
ulimit -c 2097151
umask 0

# Custom startup parameters
if [ -f ${BASEFOLDER}/Startup.sh ]; then
. ${BASEFOLDER}/Startup.sh
fi

case "$1" in
start)
if [ -d ${BASEFOLDER} ] ; then
echo
else
echo "Creating the CommuniGate Base Folder..."
mkdir ${BASEFOLDER}
chgrp mail ${BASEFOLDER}
chmod 2770 ${BASEFOLDER}
fi
echo -n "Starting CommuniGate Pro"
${APPLICATION}/CommuniGate/CGServer \
--Base ${BASEFOLDER} --Daemon ${SUPPLPARAMS} \
# --ClusterBackend
# --ClusterFrontend
#Comment out
#touch /var/lock/subsys/CommuniGate
#ADD
touch ${pidfile}
RETVAL=$?
echo
[ "$RETVAL = 0" ] && touch ${lockfile}
#return $RETVAL
#/ADDED
;;
controller)
echo "Starting CommuniGate Pro Cluster Controller"
${APPLICATION}/CommuniGate/CGServer \
--Base ${BASEFOLDER} --Daemon ${SUPPLPARAMS} \
--ClusterController
touch /var/lock/subsys/CommuniGate
;;
stop)
if [ -f ${BASEFOLDER}/ProcessID ]; then
echo "Shutting down the CommuniGate Pro Server"
kill `cat ${BASEFOLDER}/ProcessID`
sleep 5
else
echo "It looks like the CommuniGate Pro Server is not running"
fi
#eric
rm -f ${pidfile}
##Orig
#rm -f /var/lock/subsys/CommuniGate
#ADDED
RETVAL=$?
echo
[ "$RETVAL = 3" ] && rm -f ${lockfile} #${pidfile}
#/ADDED
;;
#ADDED
status)
status $PROG
RETVAL=$?
;;
#/ADDED
*)
echo "Usage: $0 [ start | stop | status ]"
exit 1
esac

exit 0






------------------------------

Message: 2
Date: Fri, 15 Jun 2007 22:41:01 -0500 (CDT)
From: "Jon Gabrielson" 
Subject: [Linux-cluster] gfs and software raid across 4 systems.
To: linux-cluster at redhat.com
Message-ID:
<5474.12.227.156.125.1181965261.squirrel at www.directfreight.com>
Content-Type: text/plain;charset=iso-8859-1

I have 4 servers each with their own harddrive.
I would like to setup gfs so that each can read/write
to their local harddrive and the data synced to the other 3 harddrives.
My original idea was to export all 4 harddrives with a network block
device like iscsi/gnbd/nbd and then use md software raid1 on top of
those but I've heard that this is a bad idea as md is not cluster aware.
I also found drbd but it only supports 2 harddrives.

Are there any available solutions for doing a 4 way mirror like I
am looking for? Basically a network raid 1 across 4 harddrives
on 4 separate systems.


Thanks,


Jon.



------------------------------

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

End of Linux-cluster Digest, Vol 38, Issue 25
*********************************************

 

  

  _____  

Take the Internet to Go: Yahoo!Go puts the Internet
<http://us.rd.yahoo.com/evt=48253/*http:/mobile.yahoo.com/go?refer=1GNXIC>
in your pocket: mail, news, photos & more. 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070618/df7a7786/attachment.htm>

From stefan.hirsch at nsn.com  Mon Jun 18 08:01:56 2007
From: stefan.hirsch at nsn.com (Stefan Hirsch)
Date: Mon, 18 Jun 2007 10:01:56 +0200
Subject: [Linux-cluster] CommuniGate as service?
In-Reply-To: <002701c7af72$b532b400$1b03c680@uccs.edu>
References: <002701c7af72$b532b400$1b03c680@uccs.edu>
Message-ID: <46763BF4.3040400@nsn.com>



ext Eric Schneider wrote:
>      #ADDED
>   status)
>      status $PROG
>      RETVAL=$?
>      ;;
>      #/ADDED
>   *)
>      echo "Usage: $0 [ start | stop | status ]"
>      exit 1
> esac
> 
> exit 0
   ^^^^^^
The script exits always with zero, except for the wildcard part, otherwise
$RETVAL is ignored.

/Stefan



From urandomdev at gmail.com  Mon Jun 18 12:23:13 2007
From: urandomdev at gmail.com (Simon Jolle)
Date: Mon, 18 Jun 2007 14:23:13 +0200
Subject: [Linux-cluster] stuck with lock usrm::rg="db"
Message-ID: <648d054e0706180523y23bab83dq690d6a7fe5e6c23@mail.gmail.com>

Hi list

when doing a clustat, the rgmanager doesn't respond and you cant see
the cluster resource group (after long timeout).

A reboot solved the problem. The system even couldn't restart because
of error messages of rgmanager on screen -> only reset helped.

Sorry don't have collected any more useful informations. Please
request additional output. Attached the cluster.conf.

How to diagnose/understand the message "Node ID:0000000000000002 stuck
with lock usrm::rg="db""

on the secondary node:

Jun 15 18:19:24 oracle09 clurgmgrd[4432]: <warning> Node
ID:0000000000000002 stuck with lock usrm::rg="db"
Jun 15 18:19:54 oracle09 clurgmgrd[4432]: <warning> Node
ID:0000000000000002 stuck with lock usrm::rg="db"
Jun 15 18:20:26 oracle09 clurgmgrd[4432]: <warning> Node
ID:0000000000000002 stuck with lock usrm::rg="db"

on the primary node:
Jun 15 10:17:25 oracle08 kernel: rh_lkid 11300b8
Jun 15 10:17:25 oracle08 kernel: lockstate 0
Jun 15 10:17:25 oracle08 kernel: nodeid 2
Jun 15 10:17:25 oracle08 kernel: status 4294967279
Jun 15 10:17:25 oracle08 kernel: lkid f569ff84
Jun 15 10:17:25 oracle08 kernel: dlm: Magma: reply from 1 no lock
Jun 15 10:17:25 oracle08 kernel: dlm: reply
Jun 15 10:17:25 oracle08 kernel: rh_cmd 5
Jun 15 10:17:25 oracle08 kernel: rh_lkid eb01c5
Jun 15 10:17:25 oracle08 kernel: lockstate 0
Jun 15 10:17:25 oracle08 kernel: nodeid 2
Jun 15 10:17:25 oracle08 kernel: status 4294967279
Jun 15 10:17:25 oracle08 kernel: lkid f569ff84
Jun 15 10:17:25 oracle08 kernel: dlm: Magma: reply from 1 no lock
Jun 15 10:17:25 oracle08 kernel: dlm: reply
Jun 15 10:17:25 oracle08 kernel: rh_cmd 5
Jun 15 10:17:25 oracle08 kernel: rh_lkid 11e027c
Jun 15 10:17:25 oracle08 kernel: lockstate 0
Jun 15 10:17:26 oracle08 kernel: nodeid 2
Jun 15 10:17:26 oracle08 kernel: status 4294967279
Jun 15 10:17:26 oracle08 kernel: lkid f569ff84
Jun 15 10:17:26 oracle08 kernel: dlm: Magma: reply from 1 no lock
Jun 15 10:17:26 oracle08 kernel: dlm: reply
Jun 15 10:17:26 oracle08 kernel: rh_cmd 5
Jun 15 10:17:26 oracle08 kernel: rh_lkid 122025f
Jun 15 10:17:26 oracle08 kernel: lockstate 0
Jun 15 10:17:26 oracle08 kernel: nodeid 2
Jun 15 10:17:26 oracle08 kernel: status 4294967279
Jun 15 10:17:26 oracle08 kernel: lkid f569ff84
Jun 15 10:17:26 oracle08 kernel: dlm: Magma: reply from 1 no lock
Jun 15 10:17:26 oracle08 kernel: dlm: reply
Jun 15 10:17:26 oracle08 kernel: rh_cmd 5
Jun 15 10:17:26 oracle08 kernel: rh_lkid 12e0185
Jun 15 10:17:26 oracle08 kernel: lockstate 0
Jun 15 10:17:26 oracle08 kernel: nodeid 2
Jun 15 10:17:26 oracle08 kernel: status 4294967279
Jun 15 10:17:26 oracle08 kernel: lkid f569ff84

after initiating shutdown:

Jun 15 18:20:54 oracle08 fenced: Stopping fence domain:
Jun 15 18:20:54 oracle08 fenced: shutdown succeeded
Jun 15 18:20:54 oracle08 fenced: ESC[60G
Jun 15 18:20:54 oracle08 fenced:
Jun 15 18:20:54 oracle08 rc: Stopping fenced:  succeeded
Jun 15 18:20:54 oracle08 lock_gulmd: Stopping lock_gulmd:
Jun 15 18:20:54 oracle08 lock_gulmd: shutdown succeeded
Jun 15 18:20:54 oracle08 lock_gulmd: ESC[60G
Jun 15 18:20:54 oracle08 lock_gulmd:
Jun 15 18:20:54 oracle08 rc: Stopping lock_gulmd:  succeeded
Jun 15 18:20:54 oracle08 cman: Stopping cman:
Jun 15 18:20:58 oracle08 cman: failed to stop cman failed
Jun 15 18:20:58 oracle08 cman: ESC[60G
Jun 15 18:20:58 oracle08 cman:
Jun 15 18:20:58 oracle08 rc: Stopping cman:  failed
Jun 15 18:20:58 oracle08 ccsd: Stopping ccsd:
Jun 15 18:20:58 oracle08 ccsd[2276]: Stopping ccsd, SIGTERM received.
Jun 15 18:20:59 oracle08 ccsd: shutdown succeeded
Jun 15 18:20:59 oracle08 ccsd: ESC[60G[
Jun 15 18:20:59 oracle08 ccsd:
Jun 15 18:20:59 oracle08 rc: Stopping ccsd:  succeeded


-- 
XMPP: sjolle at swissjabber.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf.xml
Type: text/xml
Size: 2254 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070618/c3a6056f/attachment.xml>

From christian.brandes at forschungsgruppe.de  Mon Jun 18 14:57:10 2007
From: christian.brandes at forschungsgruppe.de (Christian Brandes)
Date: Mon, 18 Jun 2007 16:57:10 +0200
Subject: [Linux-cluster] cluster.conf documentation?
Message-ID: <46769D46.8020506@forschungsgruppe.de>

Is there a more comprehensive guide to /etc/cluster.conf than the man 
page, with a description of all available options?

Best regards
	Christian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 4348 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070618/31c4c8fa/attachment.bin>

From jwilson at transolutions.net  Mon Jun 18 17:13:34 2007
From: jwilson at transolutions.net (James Wilson)
Date: Mon, 18 Jun 2007 12:13:34 -0500
Subject: [Linux-cluster] gnbd_serv and gnbd export on bootup
Message-ID: <4676BD3E.7000206@transolutions.net>

Hey All,

    Just wondering if someone could help me out with getting gnbd_serv 
to start on boot up and also have gnbd_export to export the storage on 
bootup? Thanks for any help.



From rpeterso at redhat.com  Mon Jun 18 19:55:07 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Mon, 18 Jun 2007 14:55:07 -0500
Subject: [Linux-cluster] cluster.conf documentation?
In-Reply-To: <46769D46.8020506@forschungsgruppe.de>
References: <46769D46.8020506@forschungsgruppe.de>
Message-ID: <4676E31B.4040500@redhat.com>

Christian Brandes wrote:
> Is there a more comprehensive guide to /etc/cluster.conf than the man 
> page, with a description of all available options?
> 
> Best regards
>     Christian
Hi Christian,

http://sources.redhat.com/cluster/faq.html#clusterconf

Regards,

Bob Peterson
Red Hat Cluster Suite



From andremachado at techforce.com.br  Mon Jun 18 20:23:47 2007
From: andremachado at techforce.com.br (andremachado)
Date: Mon, 18 Jun 2007 13:23:47 -0700
Subject: [Linux-cluster] GFS over GNBD freezes -> confirmed gnbd flaw?
In-Reply-To: <46764DDA.1090600@artegence.com>
References: <46764DDA.1090600@artegence.com>
Message-ID: <9501e0cf83a2246371838c9926ce15cb@localhost>

Hello,
I just updated my blog page [0] with preliminary tests and conclusions about GFS, CLVM, GNBD, iSCSI.
It seems that GNBD of redhat cluster suite 1.03.02 has problems....
How spot the problem code?
Regards.
Andre Felipe Machado

[0] http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch


On Mon, 18 Jun 2007 11:18:18 +0200, Maciej Bogucki <maciej.bogucki at artegence.com> wrote:
>> I am trying to implement iscsi now (reading docs phase). But, maybe, the
> problems arise again if the real cause is at GFS / clvm locking
> coordination and not at gnbd itself.
> It is good idea to implement iSCSI. Then You will know if is it GNDB or
> GFS problem.
> 
> Best Regards
> Maciej Bogucki



From Alain.Moulle at bull.net  Tue Jun 19 14:49:18 2007
From: Alain.Moulle at bull.net (Alain Moulle)
Date: Tue, 19 Jun 2007 16:49:18 +0200
Subject: [Linux-cluster] CS4 U4/U5 / is it possible to disable the status ?
Message-ID: <4677ECEE.3040005@bull.net>

Hi

Is there a configuration possibility in the GUI or directly in the cluster.conf
to disable the periodic monitoring of services ?

Thanks
Alain Moull?




From Michael.Hagmann at hilti.com  Tue Jun 19 15:25:57 2007
From: Michael.Hagmann at hilti.com (Hagmann, Michael)
Date: Tue, 19 Jun 2007 17:25:57 +0200
Subject: [Linux-cluster] clurgmgrd - <err> #48: Unable to obtain
	clusterlock: Connectiontimed out
In-Reply-To: <20070511201903.GF15766@redhat.com>
References: <1178560496.7699.38.camel@WSBID06223>
	<20070511201903.GF15766@redhat.com>
Message-ID: <9C203D6FD2BF9D49BFF3450201DEDA5301B18B2D@LI-OWL.hag.hilti.com>

Hi all

we just hit this Problem again:

Jun 18 08:03:08 lilr623a clurgmgrd[22152]:  #48: Unable to obtain
cluster lock: Connection timed out  	
Jun 18 08:03:35 lilr623f clurgmgrd: [21651]:  Executing
/usr/local/swadmin/caa/SAP/P06WD002 status  	
Jun 18 08:05:29 lilr623f clurgmgrd[21651]:  #49: Failed getting status
for RG P06WD002	

is there any open Bugzilla about this Problem?

what we also see that the Crash maybe is realated to the cron.daily
entries. Maybe some crontab entry trigger this dlmbug?

Here you can see the crontab, the cron.daily start at 08:02 the Cluster
stuck ag 08:03 ! Also the last time it was also the same time.

root at lilr623a:/tmp# cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
 
# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 8 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly
 
root at lilr623a:/tmp# ls -l  /etc/cron.daily
total 28
lrwxrwxrwx  1 root root   28 Oct  5  2006 00-logwatch ->
../log.d/scripts/logwatch.pl
-rwxr-xr-x  1 root root  418 Apr 14  2006 00-makewhatis.cron
-rwxr-xr-x  1 root root  276 Sep 28  2004 0anacron
-rwxr-xr-x  1 root root  180 Jul 13  2005 logrotate
-rwxr-xr-x  1 root root   48 Apr  9  2006 mcelog.cron
-rwxr-xr-x  1 root root 2133 Dec  1  2004 prelink
-rwxr-xr-x  1 root root  121 Aug  8  2005 slocate.cron 

Thanks for your help

Mike

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger
Sent: Freitag, 11. Mai 2007 22:19
To: linux clustering
Subject: Re: [Linux-cluster] clurgmgrd - <err> #48: Unable to obtain
clusterlock: Connectiontimed out

On Mon, May 07, 2007 at 01:54:56PM -0400, rhurst at bidmc.harvard.edu
wrote:
> What could cause clurgmgrd fail like this?  If clurgmgrd has a hiccup 
> like this, is it supposed to shutdown its services?  Is there 
> something in our implementation that could have prevented this from
shutting down?
> 
> For unexplained reasons, we just had our CS service (WATSON) go down 
> on its own, and the syslog entry details the event as:
> 
> May  7 13:18:39 db1 clurgmgrd[17888]: <err> #48: Unable to obtain 
> cluster lock: Connection timed out May  7 13:18:41 db1 kernel: dlm: 
> Magma: reply from 2 no lock May  7 13:18:41 db1 kernel: dlm: reply May

> 7 13:18:41 db1 kernel: rh_cmd 5 May  7 13:18:41 db1 kernel: rh_lkid 
> 200242 May  7 13:18:41 db1 kernel: lockstate 2 May  7 13:18:41 db1 
> kernel: nodeid 0 May  7 13:18:41 db1 kernel: status 0 May  7 13:18:41 
> db1 kernel: lkid ee0388 May  7 13:18:41 db1 clurgmgrd[17888]: <notice>

> Stopping service WATSON

This usually is a dlm bug.  Once the DLM gets in to this state,
rgmanager blows up.  What rgmanager are you using?

(There's only one lock per service; the complexity of the service
doesn't matter...)

--
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From alkol6 at gmail.com  Tue Jun 19 19:23:11 2007
From: alkol6 at gmail.com (Senol Erdogan)
Date: Tue, 19 Jun 2007 22:23:11 +0300
Subject: [Linux-cluster] cluster.conf documentation?
In-Reply-To: <46769D46.8020506@forschungsgruppe.de>
References: <46769D46.8020506@forschungsgruppe.de>
Message-ID: <93bf230a0706191223x5774d7d1m37bb6d15b2387342@mail.gmail.com>

and,

# man ccs_tool

2007/6/18, Christian Brandes <christian.brandes at forschungsgruppe.de>:
>
> Is there a more comprehensive guide to /etc/cluster.conf than the man
> page, with a description of all available options?
>
> Best regards
>         Christian
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070619/a8eed4b8/attachment.htm>

From jparsons at redhat.com  Tue Jun 19 20:27:40 2007
From: jparsons at redhat.com (James Parsons)
Date: Tue, 19 Jun 2007 16:27:40 -0400
Subject: [Linux-cluster] cluster.conf documentation?
In-Reply-To: <93bf230a0706191223x5774d7d1m37bb6d15b2387342@mail.gmail.com>
References: <46769D46.8020506@forschungsgruppe.de>
	<93bf230a0706191223x5774d7d1m37bb6d15b2387342@mail.gmail.com>
Message-ID: <46783C3C.9000903@redhat.com>

I spent a couple of nights working on the schema description recently. 
It is split into rhel4 and rhel5 versions now, and includes some of the 
new openais/cman params. It occurs to me, however, that the 8 new 
resource types are not represented yet. I will add a description for 
those over the next few nights.
Once again, the URL is:
http://sources.redhat.com/cluster/doc/cluster_schema.html

Senol Erdogan wrote:

> and,
>
> # man ccs_tool
>
> 2007/6/18, Christian Brandes < christian.brandes at forschungsgruppe.de 
> <mailto:christian.brandes at forschungsgruppe.de>>:
>
>     Is there a more comprehensive guide to /etc/cluster.conf than the man
>     page, with a description of all available options?
>
>     Best regards
>             Christian
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>------------------------------------------------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>




From orkcu at yahoo.com  Tue Jun 19 21:19:24 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Tue, 19 Jun 2007 14:19:24 -0700 (PDT)
Subject: [Linux-cluster] RH Cluster Suit can be used to create a qmail
	cluster?
Message-ID: <687378.49599.qm@web50611.mail.re2.yahoo.com>

Hi

I am looking for ideas about to create a Qmail HA
cluster with 2 nodes and the storage in a SAN (FC
access)

right now I am in the design stage, mainly finding
potencial problems so ....
do anybody has anything to recommend ? 
(except not use qmail ;-) I would like to use postfix
or exim but my client disagree :-( no choice here)

my first problem looks like qmail is started,
monitored and managed by daemontools (sv* programs)
and svscan itseft is started through inittab or
rc.local
so my first approach is to create an sysV init script
for svscanboot (whitch is used to start svc and
svscan) and that script is the one that will be
controlled by RHCS as a script resource (alonside with
the GFS or plain FS resource, and maybe the IP
resource)

so, my idea is to "clusterizate" (that word exist ?
;-) ) the daemontool and not the qmail process, do you
agree?

thanks in advance for any tip :-)

roger


__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


      ____________________________________________________________________________________
Luggage? GPS? Comic books? 
Check out fitting gifts for grads at Yahoo! Search
http://search.yahoo.com/search?fr=oni_on_mail&p=graduation+gifts&cs=bz



From chris at cmiware.com  Tue Jun 19 22:23:45 2007
From: chris at cmiware.com (Chris Harms)
Date: Tue, 19 Jun 2007 17:23:45 -0500
Subject: [Linux-cluster] avoiding 2 Node Fencing shootout
Message-ID: <46785771.1000403@cmiware.com>

Hi all,

Setup:

Cluster Suite 5
2 Nodes each fenced by DRAC card over network interface. Flagged as two 
node in cluster.conf

<cman expected_votes="1" two_node="1"/>

As a test, I unplugged one node (Node A) from the network.  The 
remaining node (Node B) attempted to fence it, but failed (no network 
access) and never assumed the services.  Plugging in Node A a gain 
induced each to fence the other.  Something similar happens when a node 
is rebooted manually (shutdown -r now).

How does one best combat this?  I've seen reference to adding a ping 
node but no actual documentation on how to do it.

Thanks,
Chris



From rainer at ultra-secure.de  Tue Jun 19 22:33:59 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Wed, 20 Jun 2007 00:33:59 +0200
Subject: [Linux-cluster] RH Cluster Suit can be used to create a qmail
	cluster?
In-Reply-To: <687378.49599.qm@web50611.mail.re2.yahoo.com>
References: <687378.49599.qm@web50611.mail.re2.yahoo.com>
Message-ID: <48D17A8F-34A5-420D-B856-BC936FFCC797@ultra-secure.de>


Am 19.06.2007 um 23:19 schrieb Roger Pe?a:

> Hi
>
> I am looking for ideas about to create a Qmail HA
> cluster with 2 nodes and the storage in a SAN (FC
> access)
>


Only two nodes?
What backend do you want to use?
(In case you want to use vpopmail)


> right now I am in the design stage, mainly finding
> potencial problems so ....
> do anybody has anything to recommend ?


Qmail is IMO not suited for a GFS cluster.
GFS tries its best to keep write operations on the cluster-FS  
synchronized.
This is useless in the case of Qmail, because Qmail is designed to  
function even on NFS-filesystems without any kind of useful locking.
In GFS-land, Qmail just generates lots of useless I/O.


> (except not use qmail ;-) I would like to use postfix
> or exim but my client disagree :-( no choice here)
>


It's understandable. Qmail still offers a lot of value when it comes  
to virtual email-domain hosting - though the original DJB-Qmail is  
barely usable today.
But people like Matt Simerson and Bill Shupp have done tremendous  
integration-work, and helped to keep the platform on par (or in some  
cases beyond) with other systems, even commercial ones.


> my first problem looks like qmail is started,
> monitored and managed by daemontools (sv* programs)
> and svscan itseft is started through inittab or
> rc.local
> so my first approach is to create an sysV init script
> for svscanboot (whitch is used to start svc and
> svscan) and that script is the one that will be
> controlled by RHCS as a script resource (alonside with
> the GFS or plain FS resource, and maybe the IP
> resource)
>


Sometimes, it's not enough to stop the svscan-startscript.
Daemons linger around, prevent new ones from starting. After killing  
the start-scripts, it might be necessary to kill (or kill -9) any  
remaining processes.


> so, my idea is to "clusterizate" (that word exist ?
> ;-) ) the daemontool and not the qmail process, do you
> agree?
>
> thanks in advance for any tip :-)
>


You could try to run a sharedroot-cluster on RHEL4 and see how it  
performs for your workload - there are some succesful reports here on  
this list (though the one I remember uses a tremendous amount of disk- 
spindles).
This should solve your problems with the script (just fence the whole  
node - finished).

If you don't want to go that route, I'd say forget about GFS and go  
back to NFS (with a serious NFS server-platform like Solaris and  
clients like Solaris or FreeBSD) - see the picture on Bill Shupp's  
homepage for a design.
Matt Simerson's formerly FreeBSD-only (now also Solaris, Linux,  
Darwin) Mail-Toaster framework already contains most of the  
integration-work necessary (distribute configfiles etc. - take a look  
at the source, it's amazing).

Above a certain amount of users (500k, probably varies), shared- 
storage may be the wrong answer anyway.
Then, a distributed setup might be better suited.
How many users will you have to support?


cheers,
Rainer
-- 
Rainer Duffner
CISSP, LPI, MCSE
rainer at ultra-secure.de





From rpeterso at redhat.com  Tue Jun 19 22:52:45 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 19 Jun 2007 17:52:45 -0500
Subject: [Linux-cluster] avoiding 2 Node Fencing shootout
In-Reply-To: <46785771.1000403@cmiware.com>
References: <46785771.1000403@cmiware.com>
Message-ID: <46785E3D.2070200@redhat.com>

Chris Harms wrote:
> Hi all,
> 
> Setup:
> 
> Cluster Suite 5
> 2 Nodes each fenced by DRAC card over network interface. Flagged as two 
> node in cluster.conf
> 
> <cman expected_votes="1" two_node="1"/>
> 
> As a test, I unplugged one node (Node A) from the network.  The 
> remaining node (Node B) attempted to fence it, but failed (no network 
> access) and never assumed the services.  Plugging in Node A a gain 
> induced each to fence the other.  Something similar happens when a node 
> is rebooted manually (shutdown -r now).
> 
> How does one best combat this?  I've seen reference to adding a ping 
> node but no actual documentation on how to do it.
> 
> Thanks,
> Chris
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
Hi Chris,

Here's a good place to start:

http://sources.redhat.com/cluster/faq.html#quorum

There's a couple FAQ entries after that one too pertaining to tie-breaking.

Regards,

Bob Peterson
Red Hat Cluster Suite



From orkcu at yahoo.com  Wed Jun 20 00:42:11 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Tue, 19 Jun 2007 17:42:11 -0700 (PDT)
Subject: [Linux-cluster] RH Cluster Suit can be used to create a qmail
	cluster?
In-Reply-To: <48D17A8F-34A5-420D-B856-BC936FFCC797@ultra-secure.de>
Message-ID: <451387.21922.qm@web50606.mail.re2.yahoo.com>


--- Rainer Duffner <rainer at ultra-secure.de> wrote:

> 
> Am 19.06.2007 um 23:19 schrieb Roger Pe?a:
> 
> > Hi
> >
> > I am looking for ideas about to create a Qmail HA
> > cluster with 2 nodes and the storage in a SAN (FC
> > access)
> >
> 
> 
> Only two nodes?
> What backend do you want to use?
> (In case you want to use vpopmail)
backend for what? for user data? we plan to use ldap,
another two more server to the cluster,  but I was
talking about just mail related (smtp, pop-imap) nodes

> 
> 
> > right now I am in the design stage, mainly finding
> > potencial problems so ....
> > do anybody has anything to recommend ?
> 
> 
> Qmail is IMO not suited for a GFS cluster.
> GFS tries its best to keep write operations on the
> cluster-FS  
> synchronized.
> This is useless in the case of Qmail, because Qmail
> is designed to  
> function even on NFS-filesystems without any kind of
> useful locking.
> In GFS-land, Qmail just generates lots of useless
> I/O.
you are thinking about maildir advantages? yeap, I
know that with maildir you will not have locking
problems (practical meaning although there is a
teorical chance :-) )
but I was thinking in FS cache, default config for
ext3 are not  suitable, maybe I should go into the
tunning ext3 area but .... I thought GFS will take
care of FS sincronization more easy than tunning ext3
to not make cache or just do it for little time

> 
> 
> > (except not use qmail ;-) I would like to use
> postfix
> > or exim but my client disagree :-( no choice here)
> >
> 
> 
> It's understandable. Qmail still offers a lot of
> value when it comes  
> to virtual email-domain hosting - though the
> original DJB-Qmail is  
> barely usable today.
> But people like Matt Simerson and Bill Shupp have
> done tremendous  
> integration-work, and helped to keep the platform on
> par (or in some  
> cases beyond) with other systems, even commercial
> ones.
> 
> 
> > my first problem looks like qmail is started,
> > monitored and managed by daemontools (sv*
> programs)
> > and svscan itseft is started through inittab or
> > rc.local
> > so my first approach is to create an sysV init
> script
> > for svscanboot (whitch is used to start svc and
> > svscan) and that script is the one that will be
> > controlled by RHCS as a script resource (alonside
> with
> > the GFS or plain FS resource, and maybe the IP
> > resource)
> >
> 
> 
> Sometimes, it's not enough to stop the
> svscan-startscript.
> Daemons linger around, prevent new ones from
> starting. After killing  
> the start-scripts, it might be necessary to kill (or
> kill -9) any  
> remaining processes.

good to know it :-)
I will be looking for this problem :-)

> 
> 
> > so, my idea is to "clusterizate" (that word exist
> ?
> > ;-) ) the daemontool and not the qmail process, do
> you
> > agree?
> >
> > thanks in advance for any tip :-)
> >
> 
> 
> You could try to run a sharedroot-cluster on RHEL4
> and see how it  
> performs for your workload - there are some
> succesful reports here on  
> this list (though the one I remember uses a
> tremendous amount of disk- 
> spindles).
> This should solve your problems with the script
> (just fence the whole  
> node - finished).
> 
> If you don't want to go that route, I'd say forget
> about GFS and go  
> back to NFS (with a serious NFS server-platform like
> Solaris and  
> clients like Solaris or FreeBSD) - see the picture

another requisite for the solution:
the OS has to be RHEL, RHEL5 as the preferred



> on Bill Shupp's  
> homepage for a design.
> Matt Simerson's formerly FreeBSD-only (now also
> Solaris, Linux,  
> Darwin) Mail-Toaster framework already contains most
> of the  
> integration-work necessary (distribute configfiles
> etc. - take a look  
> at the source, it's amazing).

I will do 

> 
> Above a certain amount of users (500k, probably
> varies), shared- 
> storage may be the wrong answer anyway.
> Then, a distributed setup might be better suited.
> How many users will you have to support?

I guess few hundreds of thousands but I hope not 500k,
maybe 200k or 300k
I know this is an important data to be uncertain but
as I said I am in the process of finding potentials
problems yet :-) in the next few days-weeks I will
have more deep understand of the environment


> Rainer
thanks a lot Rainier

cu
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


       
____________________________________________________________________________________
Be a better Globetrotter. Get better travel answers from someone who knows. Yahoo! Answers - Check it out.
http://answers.yahoo.com/dir/?link=list&sid=396545469



From anujhere at gmail.com  Tue Jun 19 14:25:16 2007
From: anujhere at gmail.com (anugunj anuj singh)
Date: Tue, 19 Jun 2007 19:55:16 +0530
Subject: [Linux-cluster] drbd+GFS HA cluster
Message-ID: <1182263116.7430.16.camel@anugunj.sytes.net>

Hi,
I installed drbd-8.0.3 on RHEL4, my drbd.conf

global { usage-count yes; }
       common { syncer { rate 10M; } } 
       resource r0 { 
            protocol C; 
            net { 
                 cram-hmac-alg sha1; 
                 shared-secret "anugunj"; 
            } 
            on node0005.anugunj.com { 
                 device    /dev/drbd1; 
                 disk      /dev/hda6; 
                 address   10.1.1.3:7789; 
                 meta-disk  internal; 
            } 
            on node0021.anugunj.com { 
                 device    /dev/drbd1; 
                 disk      /dev/sdb1; 
                 address   10.1.1.11:7789; 
                 meta-disk  internal; 
            } 
       } 
                                                
created drbd-meta data
drbdadm create-md r0
 , and then gfs file system on it. 
I am trying to do mirroring of cluster, using gfs file system ,
currently I have to make one drbd hard disk primary, other secondary,
only then I am able to mount and use, on other system it is not
mounting, i have to make it primary with.
drbdadm primary all.

has anyone tried drbd on gfs, filesystem. Version i am using does
support gfs (according to changelog)
http://svn.drbd.org/drbd/trunk/ChangeLog . Target is for HA of cluster.
while mounting drbd hard disks on both systems simultaneous. 

thanks and regards
anugunj "anuj singh"                           
                           



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070619/dd1a75e4/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070619/dd1a75e4/attachment.sig>

From mathieu.avila at seanodes.com  Wed Jun 20 08:09:54 2007
From: mathieu.avila at seanodes.com (Mathieu Avila)
Date: Wed, 20 Jun 2007 10:09:54 +0200
Subject: [Linux-cluster] Error when starting ccsd and proposed patch
In-Reply-To: <20070615105208.0fc6fd76@mathieu.toulouse>
References: <20070615105208.0fc6fd76@mathieu.toulouse>
Message-ID: <20070620100954.1e8b2697@mathieu.toulouse>

Sorry to bother you with this ; am i the only one that spotted this
issue ?
I did review the code from cluster-2 and cluster-1.04 and the patch is
also relevant there.
A easy way of running into this problem is to generate CPU load on a
node, and then do loops of ccsd and gulm start/stop. Sometimes, gulm
will get out with an error complaining that it was unable to contact
ccsd.

Le Fri, 15 Jun 2007 10:52:08 +0200,
Mathieu Avila <mathieu.avila at seanodes.com> a ?crit :

> Hello all,
> 
> I'm sometimes having trouble when starting ccsd and then gulm under
> heavy CPU load. Ccsd's init script tells it is running but it's not
> fully initialized.
> The problem comes from the fact that ccsd's main process returns
> before the daemonized process of ccsd has finished initializing its
> sockets. The "cluster_communicator" thread sends a SIGTERM message to
> the parent process before the main thread has finished its
> initialization work.
> 
> With the patch proposed in attachement, the cluster_communicator is
> started after the main thread has finished initializing. It works
> well under any load. Any daemon that needs to connect ccsd will
> then succceed. 
> It was tested with cluster-1.03, but it should work with older
> versions, the ccsd files didn't seem to have changed much.
> 
> --
> Mathieu Avila



From dan.deshayes at algitech.com  Wed Jun 20 08:18:22 2007
From: dan.deshayes at algitech.com (Dan Deshayes)
Date: Wed, 20 Jun 2007 10:18:22 +0200
Subject: [Linux-cluster] problem starting clvmd on second node.
Message-ID: <4678E2CE.4050302@algitech.com>

Hello,
I'm having problem starting the clvmd on the second node.
I'm running Centos 5 resently updated. Its going to be a 3node HA cluster.
What I've done is the following.

creating the filesystems on the fiberdiscs thats devided with lvm.
mkfs.gfs -p lock_dlm -t acl002:project_logs -j 3 /dev/projectVG/logs
mkfs.gfs -p lock_dlm -t acl002:project_web -j 3 /dev/projectVG/web
mkfs.gfs -p lock_dlm -t acl002:project_db -j 3 /dev/projectVG/db

Then I start the cluster namned acl002 on all nodes and clvmd on the 
first node,
it starts and i can mount/unmount and write to the volumes, but i get a 
clvmd -T20 process
When I go to the second node and starts clvmd it hangs in the vgscan.

I'm using locking_type = 3 in the lvm.conf file, before I used 2 with 
the liblvm2clusterlock.so libary
but it doesn't seems to be availible anymore and this does not seem to 
be related to my problem(?).

It works if I start the clvmd on the second node first but then the 
first node gives the same error.

Maybe someone can give me a hint in the right direction.
Thanks in advance.
Dan



From rainer at ultra-secure.de  Wed Jun 20 08:19:00 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Wed, 20 Jun 2007 10:19:00 +0200
Subject: [Linux-cluster] RH Cluster Suit can be used to create a qmail
	cluster?
In-Reply-To: <451387.21922.qm@web50606.mail.re2.yahoo.com>
References: <451387.21922.qm@web50606.mail.re2.yahoo.com>
Message-ID: <4678E2F4.7000201@ultra-secure.de>

Roger Pe?a wrote:
>
> backend for what? for user data? 

Yes.
It's a rather important question, IMO.

> you are thinking about maildir advantages? yeap, I
> know that with maildir you will not have locking
> problems (practical meaning although there is a
> teorical chance :-) )
> but I was thinking in FS cache, default config for
> ext3 are not  suitable, maybe I should go into the
> tunning ext3 area but .... I thought GFS will take
> care of FS sincronization more easy than tunning ext3
> to not make cache or just do it for little time
>
>   

I'm not sure what you mean here.


> good to know it :-)
> I will be looking for this problem :-)
>
>   

Rather, problems tend to look for you ;-)

> another requisite for the solution:
> the OS has to be RHEL, RHEL5 as the preferred
>
>   

It's just that RHEL doesn't offer most of the needed software 
out-of-the-box (apart from the ldap-client).
And even if it does, you need to recompile it yourself, because it needs 
other compilation-options.


> I guess few hundreds of thousands but I hope not 500k,
> maybe 200k or 300k
> I know this is an important data to be uncertain but
> as I said I am in the process of finding potentials
> problems yet :-) in the next few days-weeks I will
> have more deep understand of the environment
>
>   

300k would be still OK for a shared storage.
What kind of SAN do you have?



cheers,
Rainer




From pcaulfie at redhat.com  Wed Jun 20 10:48:35 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 20 Jun 2007 11:48:35 +0100
Subject: [Linux-cluster] Error when starting ccsd and proposed patch
In-Reply-To: <20070620100954.1e8b2697@mathieu.toulouse>
References: <20070615105208.0fc6fd76@mathieu.toulouse>
	<20070620100954.1e8b2697@mathieu.toulouse>
Message-ID: <46790603.1080204@redhat.com>

Mathieu Avila wrote:
> Sorry to bother you with this ; am i the only one that spotted this
> issue ?

I think you might be ;-)


> I did review the code from cluster-2 and cluster-1.04 and the patch is
> also relevant there.
> A easy way of running into this problem is to generate CPU load on a
> node, and then do loops of ccsd and gulm start/stop. Sometimes, gulm
> will get out with an error complaining that it was unable to contact
> ccsd.

Yes, I can believe the problem is still in ccsd2 - it's really the same thing,
ccsd hasnt been touched for a while

Your patch is appreciated, really, it's just that things are a bit hectic at the
moment, we'll get around to testing/integrating it soon I hope.

Thanks,

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From Santosh.Panigrahi at in.unisys.com  Wed Jun 20 11:12:35 2007
From: Santosh.Panigrahi at in.unisys.com (Panigrahi, Santosh Kumar)
Date: Wed, 20 Jun 2007 16:42:35 +0530
Subject: [Linux-cluster] failover domain conf. with conga
Message-ID: <D566E8CF3538B54D95B925CB69CB4D2A049E88D8@inblr-exch1.eu.uis.unisys.com>

Hi,

 

I am not able to configure fail over domain in RHEL5 through conga
utility. On trying to do so, I am getting an error page from luci
service. But I am able to configure it through system-config-cluster
utility.

I got following information from RHEl5 release notes. 

[At present, conga and luci do not allow users to create and configure
failover domains. 

To create failover domains, use system-config-cluster. You need to
manually edit /etc/cluster/cluster.conf to configure failover domains
created this way.]

https://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/release-not
es/RELEASE-NOTES-x86-en.html 

 

I want to know when will be the next release of conga utility with the
failover domain feature ?

 

Thanks and Regards,

Santosh

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070620/1a25e804/attachment.htm>

From Robert.Gil at americanhm.com  Wed Jun 20 12:18:02 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Wed, 20 Jun 2007 08:18:02 -0400
Subject: [Linux-cluster] problem starting clvmd on second node.
In-Reply-To: <4678E2CE.4050302@algitech.com>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA6D3@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

What exactly is the error? If its permission denied it may most likely
have to do with fenced not running. If lvm skips the clustered
filesystems, then look at the lvm.conf to make sure its right.


Robert Gil
Linux Systems Administrator
American Home Mortgage


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dan Deshayes
Sent: Wednesday, June 20, 2007 4:18 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] problem starting clvmd on second node.

Hello,
I'm having problem starting the clvmd on the second node.
I'm running Centos 5 resently updated. Its going to be a 3node HA
cluster.
What I've done is the following.

creating the filesystems on the fiberdiscs thats devided with lvm.
mkfs.gfs -p lock_dlm -t acl002:project_logs -j 3 /dev/projectVG/logs
mkfs.gfs -p lock_dlm -t acl002:project_web -j 3 /dev/projectVG/web
mkfs.gfs -p lock_dlm -t acl002:project_db -j 3 /dev/projectVG/db

Then I start the cluster namned acl002 on all nodes and clvmd on the
first node, it starts and i can mount/unmount and write to the volumes,
but i get a clvmd -T20 process When I go to the second node and starts
clvmd it hangs in the vgscan.

I'm using locking_type = 3 in the lvm.conf file, before I used 2 with
the liblvm2clusterlock.so libary but it doesn't seems to be availible
anymore and this does not seem to be related to my problem(?).

It works if I start the clvmd on the second node first but then the
first node gives the same error.

Maybe someone can give me a hint in the right direction.
Thanks in advance.
Dan

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From dan.deshayes at algitech.com  Wed Jun 20 13:22:08 2007
From: dan.deshayes at algitech.com (Dan Deshayes)
Date: Wed, 20 Jun 2007 15:22:08 +0200
Subject: [Linux-cluster] problem starting clvmd on second node.
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA6D3@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <F154BEC9D4278D4BA9B94BDEC76193482BA6D3@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <46792A00.8080707@algitech.com>

when i start the clvmd on the second node in debugmode it seems to start 
fine.

[root at asl012 conf.d]# clvmd -d
CLVMD[aaabc2c0]: Jun 20 15:06:13 CLVMD started
CLVMD[aaabc2c0]: Jun 20 15:06:22 Cluster ready, doing some more 
initialisation
CLVMD[aaabc2c0]: Jun 20 15:06:22 starting LVM thread
CLVMD[aaabc2c0]: Jun 20 15:06:22 clvmd ready for work
CLVMD[aaabc2c0]: Jun 20 15:06:22 Using timeout of 60 seconds
CLVMD[41401940]: Jun 20 15:06:22 LVM thread function started
File descriptor 5 left open
CLVMD[41401940]: Jun 20 15:06:23 LVM thread waiting for work

but when i try to mount it the proccess just freezes. /var/logs/messages:
Jun 20 15:06:22 asl012 clvmd: Cluster LVM daemon started - connected to CMAN
Jun 20 15:07:19 asl012 kernel: Trying to join cluster "lock_dlm", 
"acl002:project_db"
Jun 20 15:07:19 asl012 kernel: dlm: connecting to 2
Jun 20 15:07:19 asl012 kernel: Joined cluster. Now mounting FS...

and then nothing happens.
root     19031  0.0  0.0   3628   332 pts/0    D    15:07   0:00 
/sbin/mount.gfs /dev/projectVG/db /project/db/ -o rw
after trying to do this the filesystem locks up on the first node also.

the fenced is running and starts fine when starting the cman-service.
I'm not tryin to make the filesystem failover but to be mounted by all 
nodes always.

/Dan

Robert Gil wrote:

>What exactly is the error? If its permission denied it may most likely
>have to do with fenced not running. If lvm skips the clustered
>filesystems, then look at the lvm.conf to make sure its right.
>
>
>Robert Gil
>Linux Systems Administrator
>American Home Mortgage
>
>
>-----Original Message-----
>From: linux-cluster-bounces at redhat.com
>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dan Deshayes
>Sent: Wednesday, June 20, 2007 4:18 AM
>To: linux-cluster at redhat.com
>Subject: [Linux-cluster] problem starting clvmd on second node.
>
>Hello,
>I'm having problem starting the clvmd on the second node.
>I'm running Centos 5 resently updated. Its going to be a 3node HA
>cluster.
>What I've done is the following.
>
>creating the filesystems on the fiberdiscs thats devided with lvm.
>mkfs.gfs -p lock_dlm -t acl002:project_logs -j 3 /dev/projectVG/logs
>mkfs.gfs -p lock_dlm -t acl002:project_web -j 3 /dev/projectVG/web
>mkfs.gfs -p lock_dlm -t acl002:project_db -j 3 /dev/projectVG/db
>
>Then I start the cluster namned acl002 on all nodes and clvmd on the
>first node, it starts and i can mount/unmount and write to the volumes,
>but i get a clvmd -T20 process When I go to the second node and starts
>clvmd it hangs in the vgscan.
>
>I'm using locking_type = 3 in the lvm.conf file, before I used 2 with
>the liblvm2clusterlock.so libary but it doesn't seems to be availible
>anymore and this does not seem to be related to my problem(?).
>
>It works if I start the clvmd on the second node first but then the
>first node gives the same error.
>
>Maybe someone can give me a hint in the right direction.
>Thanks in advance.
>Dan
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>  
>



From rpeterso at redhat.com  Wed Jun 20 13:59:20 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 20 Jun 2007 08:59:20 -0500
Subject: [Linux-cluster] problem starting clvmd on second node.
In-Reply-To: <46792A00.8080707@algitech.com>
References: <F154BEC9D4278D4BA9B94BDEC76193482BA6D3@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<46792A00.8080707@algitech.com>
Message-ID: <467932B8.4060506@redhat.com>

Dan Deshayes wrote:
> but when i try to mount it the proccess just freezes. /var/logs/messages:
> Jun 20 15:06:22 asl012 clvmd: Cluster LVM daemon started - connected to 
> CMAN
> Jun 20 15:07:19 asl012 kernel: Trying to join cluster "lock_dlm", 
> "acl002:project_db"
> Jun 20 15:07:19 asl012 kernel: dlm: connecting to 2
> Jun 20 15:07:19 asl012 kernel: Joined cluster. Now mounting FS...
> 
> and then nothing happens.
> root     19031  0.0  0.0   3628   332 pts/0    D    15:07   0:00 
> /sbin/mount.gfs /dev/projectVG/db /project/db/ -o rw
> after trying to do this the filesystem locks up on the first node also.
> 
> /Dan

Hi Dan,

So apparently it's the mount that's hanging, not clvmd.
By any chance, are you using manual fencing?  Because I think this behavior
can be caused when a node is manually fenced, but the
fence_ack_manual script was never run.  If that's the problem, try:

fence_ack_manual -n <fenced_node>

Regards,

Bob Peterson
Red Hat Cluster Suite



From kristoffer.lippert at jppol.dk  Wed Jun 20 14:29:02 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Wed, 20 Jun 2007 16:29:02 +0200
Subject: [Linux-cluster] Performance on GFS, OCFS2 or GFS2?
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA6D3@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <4678E2CE.4050302@algitech.com>
	<F154BEC9D4278D4BA9B94BDEC76193482BA6D3@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA51A7941@exchsrv07.rootdom.dk>

Hi,

I'm considering different options for a SAN. Are there anyone who knows
of a comparison of the GFS,  GFS2 and OCFS2 filesystems?
Mostly i'm concerned about performance and stability (maturity).

It'll be running on a RHEL 5. 

Kind regards
Kristoffer




From dan.deshayes at algitech.com  Wed Jun 20 14:34:01 2007
From: dan.deshayes at algitech.com (Dan Deshayes)
Date: Wed, 20 Jun 2007 16:34:01 +0200
Subject: [Linux-cluster] problem starting clvmd on second node.
In-Reply-To: <467932B8.4060506@redhat.com>
References: <F154BEC9D4278D4BA9B94BDEC76193482BA6D3@melpsechmbx15.HQ.CORP.AMERICANHM.COM>	<46792A00.8080707@algitech.com>
	<467932B8.4060506@redhat.com>
Message-ID: <46793AD9.4020308@algitech.com>

Robert Peterson wrote:

> Dan Deshayes wrote:
>
>> but when i try to mount it the proccess just freezes. 
>> /var/logs/messages:
>> Jun 20 15:06:22 asl012 clvmd: Cluster LVM daemon started - connected 
>> to CMAN
>> Jun 20 15:07:19 asl012 kernel: Trying to join cluster "lock_dlm", 
>> "acl002:project_db"
>> Jun 20 15:07:19 asl012 kernel: dlm: connecting to 2
>> Jun 20 15:07:19 asl012 kernel: Joined cluster. Now mounting FS...
>>
>> and then nothing happens.
>> root     19031  0.0  0.0   3628   332 pts/0    D    15:07   0:00 
>> /sbin/mount.gfs /dev/projectVG/db /project/db/ -o rw
>> after trying to do this the filesystem locks up on the first node also.
>>
>> /Dan
>
>
> Hi Dan,
>
> So apparently it's the mount that's hanging, not clvmd.
> By any chance, are you using manual fencing?  Because I think this 
> behavior
> can be caused when a node is manually fenced, but the
> fence_ack_manual script was never run.  If that's the problem, try:
>
> fence_ack_manual -n <fenced_node>
>
> Regards,
>
> Bob Peterson
> Red Hat Cluster Suite
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Hey Robert,
thats right, its not really the clvmd but when starting as a service it 
performs a vgscan wich also freezes.
Correct, I'm using manual fenceing for the moment, but I don't want to 
fence the other node since it hasn't faild,
but to have the filesystem mounted by all the nodes at the same time. 
Just fence when it fails.
Maybe I've missunderstood the possibility of this? Though I've used it 
on previous versions of centos.

Regards, Dan



From jparsons at redhat.com  Wed Jun 20 14:34:37 2007
From: jparsons at redhat.com (jim parsons)
Date: Wed, 20 Jun 2007 10:34:37 -0400
Subject: [Linux-cluster] failover domain conf. with conga
In-Reply-To: <D566E8CF3538B54D95B925CB69CB4D2A049E88D8@inblr-exch1.eu.uis.unisys.com>
References: <D566E8CF3538B54D95B925CB69CB4D2A049E88D8@inblr-exch1.eu.uis.unisys.com>
Message-ID: <1182350078.3302.9.camel@localhost.localdomain>

On Wed, 2007-06-20 at 16:42 +0530, Panigrahi, Santosh Kumar wrote:
> Hi,
> 
>  
> 
> I am not able to configure fail over domain in RHEL5 through conga
> utility. On trying to do so, I am getting an error page from luci
> service. But I am able to configure it through system-config-cluster
> utility.
> 
> I got following information from RHEl5 release notes. 
> 
> [At present, conga and luci do not allow users to create and configure
> failover domains. 
> 
> To create failover domains, use system-config-cluster. You need to
> manually edit /etc/cluster/cluster.conf to configure failover domains
> created this way.]
> 
> https://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/release-notes/RELEASE-NOTES-x86-en.html 
> 
>  
> 
> I want to know when will be the next release of conga utility with the
> failover domain feature ?
There was a z-stream (asynchronous update) release of Conga that
addresses some VM issues and also offers creation and config of Fdoms.

You want the 0.9.2-6.el5 builds for ricci, luci, and modcluster.

If you are using these versions, then you have encountered something I
would be interested in knowing more about :)

I would also be interested in any usability comments regarding the conga
UI that you care to offer. 

BTW, here is a page that may be helpful to you:
http://sourceware.org/cluster/conga/

I will file a bug to update the documentation. Thanks for trying out
conga.

-J





From Michael.Hagmann at hilti.com  Wed Jun 20 14:55:47 2007
From: Michael.Hagmann at hilti.com (Hagmann, Michael)
Date: Wed, 20 Jun 2007 16:55:47 +0200
Subject: [Linux-cluster] Red Hat Cluster for Symantec Veritas NetBackup 6
In-Reply-To: <1182350078.3302.9.camel@localhost.localdomain>
References: <D566E8CF3538B54D95B925CB69CB4D2A049E88D8@inblr-exch1.eu.uis.unisys.com>
	<1182350078.3302.9.camel@localhost.localdomain>
Message-ID: <9C203D6FD2BF9D49BFF3450201DEDA5301B18E29@LI-OWL.hag.hilti.com>

Hi all

we have a lot of Red Hat 4 Clusters for Oracle / SAP operational. Now we
also have to move our Tru64 / TruCluster NetBackup Infrastructure to
Linux. Now we are thinking to move the NetBackup to a Red Hat Cluster.
Does someone have this running? When yes are there any important points
to know? 

thanks for your comments

Mike


Michael Hagmann
UNIX Systems Engineering
Enterprise Systems Technology

Hilti Corporation
9494 Schaan  Liechtenstein

Department FIBS
Feldkircherstrasse 100   P.O.Box 333
P +423-234 2467  F +423-234 6467
E michael.hagmann at hilti.com
www.hilti.com




From rainer at ultra-secure.de  Wed Jun 20 15:04:30 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Wed, 20 Jun 2007 17:04:30 +0200
Subject: [Linux-cluster] Performance on GFS, OCFS2 or GFS2?
In-Reply-To: <00B9BFA1C44A674794C9A1A4F5A22CA51A7941@exchsrv07.rootdom.dk>
References: <4678E2CE.4050302@algitech.com>	<F154BEC9D4278D4BA9B94BDEC76193482BA6D3@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<00B9BFA1C44A674794C9A1A4F5A22CA51A7941@exchsrv07.rootdom.dk>
Message-ID: <467941FE.8000209@ultra-secure.de>

Kristoffer Lippert wrote:
> Hi,
>
> I'm considering different options for a SAN. Are there anyone who knows
> of a comparison of the GFS,  GFS2 and OCFS2 filesystems?
> Mostly i'm concerned about performance and stability (maturity).
>
> It'll be running on a RHEL 5. 
>
> Kind regards
> Kristoffer
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   


There was a comparison in some back-issue of iX magazine (German monthly 
IT-magazine - http://www.heise.de/ix).
It was some time ago (a year, or more), so they didn't include GFS2.
But as GFS2 isn't ready for prime-time anyway, it should not be a big 
problem.

I don't remember in which issue it was - search their archive.



cheers,
Rainer



From rpeterso at redhat.com  Wed Jun 20 16:20:25 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 20 Jun 2007 11:20:25 -0500
Subject: [Linux-cluster] problem starting clvmd on second node.
In-Reply-To: <46793AD9.4020308@algitech.com>
References: <F154BEC9D4278D4BA9B94BDEC76193482BA6D3@melpsechmbx15.HQ.CORP.AMERICANHM.COM>	<46792A00.8080707@algitech.com>	<467932B8.4060506@redhat.com>
	<46793AD9.4020308@algitech.com>
Message-ID: <467953C9.5080504@redhat.com>

Dan Deshayes wrote:
> Hey Robert,
> thats right, its not really the clvmd but when starting as a service it 
> performs a vgscan wich also freezes.
> Correct, I'm using manual fenceing for the moment, but I don't want to 
> fence the other node since it hasn't faild,
> but to have the filesystem mounted by all the nodes at the same time. 
> Just fence when it fails.
> Maybe I've missunderstood the possibility of this? Though I've used it 
> on previous versions of centos.
> 
> Regards, Dan

Hi Dan,

Is this the old cluster infrastructure (e.g. rhel4/centos 4/stable or equiv) or
the new cluster infrastructure (e.g. rhel5/HEAD or equiv)?

Since you're using manual fencing, perhaps you should start from the beginning
in case a node thinks it needs a fence ack:

1. power off all nodes
2. power on all nodes
3. start clustering on all nodes
4. do group_tool -v on all nodes to make sure there are no error conditions
5. start clvmd on all nodes
6. do your mounts

Let me know what happens.

Regards,

Bob Peterson



From chris at cmiware.com  Wed Jun 20 16:55:52 2007
From: chris at cmiware.com (Chris Harms)
Date: Wed, 20 Jun 2007 11:55:52 -0500
Subject: [Linux-cluster] Diskless Quorum Disk
Message-ID: <46795C18.5040408@cmiware.com>

I'm interested in using qdisk heuristics to circumvent a fencing duel in 
my two node cluster, however I have no shared storage so I'm mostly 
interested in network tests.  The FAQ indicates "You don't have to use a 
disk or partition to get this functionality." however Conga complains 
about not setting a device or label, and errors when I enter a dummy 
label. 

To what should I set the device / label if I just want to ping the 
gateway for example?

Cheers,
Chris



From lhh at redhat.com  Wed Jun 20 21:44:08 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 20 Jun 2007 17:44:08 -0400
Subject: [Linux-cluster] Diskless Quorum Disk
In-Reply-To: <46795C18.5040408@cmiware.com>
References: <46795C18.5040408@cmiware.com>
Message-ID: <20070620214408.GL4687@redhat.com>

On Wed, Jun 20, 2007 at 11:55:52AM -0500, Chris Harms wrote:
> I'm interested in using qdisk heuristics to circumvent a fencing duel in 
> my two node cluster, however I have no shared storage so I'm mostly 
> interested in network tests.  The FAQ indicates "You don't have to use a 
> disk or partition to get this functionality." however Conga complains 
> about not setting a device or label, and errors when I enter a dummy 
> label.

The FAQ is incorrect probably because of context; I'll clarify it.  You
don't need to use *qdiskd* to prevent a 'fence duel' in the case that the
cluster is configured in the following way:

http://sources.redhat.com/cluster/faq.html#two_node_correct

... however, support for "diskless mode" is not implemented.  File a
bugzilla / feature request so we can track it?

> To what should I set the device / label if I just want to ping the 
> gateway for example?

You can't right now; it uses the quorum partition to converge on things
and do voting.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From chris at cmiware.com  Wed Jun 20 22:57:05 2007
From: chris at cmiware.com (Chris Harms)
Date: Wed, 20 Jun 2007 17:57:05 -0500
Subject: [Linux-cluster] Diskless Quorum Disk
In-Reply-To: <20070620214408.GL4687@redhat.com>
References: <46795C18.5040408@cmiware.com> <20070620214408.GL4687@redhat.com>
Message-ID: <4679B0C1.9090005@cmiware.com>

My nodes were set to "quorum=1 two_node=1" and fenced by DRAC cards 
using telnet over their NICs.  The same NICs used in my bonded config on 
the OS so I assumed it was on the same network path.  Perhaps I assume 
incorrectly.

Desired effect would be survivor claims service(s) running on 
unreachable node and attempts to fence unreachable node or bring it back 
online without fencing should it establish contact.  Actual result was 
survivor spun its wheels trying to fence unreachable node and did not 
assume services.  Restoring network connectivity induced the previously 
unreachable node to reboot and the surviving node experienced some kind 
of weird power off and then powered back on (???).

Ergo I figured I must need quorum disk so I can use something like a 
ping node.  My present plan is to use a loop device for the quorum disk 
device and then setup ping heuristics.  Will this even work, i.e. do the 
nodes both need to see the same qdisk or can I fool the service with a 
loop device?  I am not deploying GFS or GNDB and I have no SAN.  My only 
option would be to add another DRBD partition for this purpose which may 
or may not work.

What is the proper setup option, two_node=1 or qdisk?

Chris

Lon Hohberger wrote:
> On Wed, Jun 20, 2007 at 11:55:52AM -0500, Chris Harms wrote:
>   
>> I'm interested in using qdisk heuristics to circumvent a fencing duel in 
>> my two node cluster, however I have no shared storage so I'm mostly 
>> interested in network tests.  The FAQ indicates "You don't have to use a 
>> disk or partition to get this functionality." however Conga complains 
>> about not setting a device or label, and errors when I enter a dummy 
>> label.
>>     
>
> The FAQ is incorrect probably because of context; I'll clarify it.  You
> don't need to use *qdiskd* to prevent a 'fence duel' in the case that the
> cluster is configured in the following way:
>
> http://sources.redhat.com/cluster/faq.html#two_node_correct
>
> ... however, support for "diskless mode" is not implemented.  File a
> bugzilla / feature request so we can track it?
>
>   
>> To what should I set the device / label if I just want to ping the 
>> gateway for example?
>>     
>
> You can't right now; it uses the quorum partition to converge on things
> and do voting.
>
>   



From David.Schroeder at flinders.edu.au  Wed Jun 20 23:58:48 2007
From: David.Schroeder at flinders.edu.au (David Schroeder)
Date: Thu, 21 Jun 2007 09:28:48 +0930
Subject: [Linux-cluster] Cluster service restarting 
Message-ID: <4679BF38.6080509@flinders.edu.au>

Hi,

We have been running web and database clusters successfully for several 
years on RHEL 3 and 4 and we now have one of each on RHEL 5.

The setup is very straight forward, 2 nodes active/active with one 
running the webserver the other the databases.

We have found the services restart in place regularly, up to 2 or 3 
times a day sometimes. The cause is the Failure to ping one or another 
of the clustered service IP addresses and is evident from the log 
entries. This happens less frequently on the database server with one 
clustered interface than it does with the webserver that has 5. The 
failure to ping that is reported in the logs for the webserver is not 
always on the same IP address and it seems quite random in time and 
which in which IP address it reports is at fault. There are no load 
related issues as this is still in the testing stage.

I have turned the "Monitor Link" setting off and it still happens.

Are there any settings that will increase the timeout as I'm sure the 
interface does not go down.

Any other pointers or suggestions?

Thanks
-- 
David Schroeder
Server Support
Information Services Division
Flinders University
Adelaide, Australia
Ph: +61 8 8201 2689



From Robert.Hell at fabasoft.com  Thu Jun 21 04:04:30 2007
From: Robert.Hell at fabasoft.com (Hell, Robert)
Date: Thu, 21 Jun 2007 06:04:30 +0200
Subject: [Linux-cluster] Getting cman to use a different NIC for Heartbeat 
Message-ID: <B710F3299F04664DB6B37C258FDEEB948EC8D2@FABAMAIL.fabagl.fabasoft.com>

Hi!

 

We got a 2-node cluster with RHEL5 Cluster Suite. We want to use a dedicated network for heartbeat communication.

So we have 2 interfaces - one for "data" and one for heartbeat communication.

 

I tried the way explained in http://sources.redhat.com/cluster/faq.html#cman_heartbeat_nic but when I start cman:

Starting cluster:

   Loading modules... done

   Mounting configfs... done

   Starting ccsd... done

   Starting cman... failed

cman not started: Overridden node name is not in CCS /usr/sbin/cman_tool: aisexec daemon didn't start

                                                           [FAILED]

 

I used the name pg-hba-001 (heartbeat) in /etc/init.d/cman - the node is configured with the name pg-ba-001 in cluster.conf.

When I use the heartbeat-name in cluster.conf there's no problem. But that's not what I want.

 

It would be nice if there is any way to tell cman that it should use the heartbeat connection for heartbeat communication - but the node name should be the original pg-ba-001.

The perfect solution in my opinion would be that cman uses both available paths - first the heartbeat connection (which is in fact a direct node to node connection) and if this one fails for any reason the connection over regular LAN.

 

Is there any way to achieve that?

 

Thanks in advance,

Robert

 

Fabasoft R&D Software GmbH & Co KG
Honauerstra?e 4
4020 Linz
Austria
Tel: [43] (732) 60 61 62
Fax: [43] (732) 60 61 62-609
E-Mail: Robert.Hell at fabasoft.com <mailto:Robert.Hell at fabasoft.com>  
www.fabasoft.com <http://www.fabasoft.com> 

 

Fabasoft R&D Software GmbH & Co KG: Handelsgericht Linz, FN 190334d

Komplement?r: Fabasoft R&D Software GmbH, Handelsgericht Linz, FN 190091x

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070621/62c7b89b/attachment.htm>

From Alain.Moulle at bull.net  Thu Jun 21 05:43:53 2007
From: Alain.Moulle at bull.net (Alain Moulle)
Date: Thu, 21 Jun 2007 07:43:53 +0200
Subject: [Linux-cluster] CS4 U4/U5 a way to disable service monitoring ?
Message-ID: <467A1019.7050408@bull.net>

Hi

Is there a way to disable the periodic status of service either
with GUI, either directly in cluster.conf ?

Thanks a lot.

Alain Moull?



From manami_mukherjee at yahoo.com  Thu Jun 21 09:13:23 2007
From: manami_mukherjee at yahoo.com (manami mukherjee)
Date: Thu, 21 Jun 2007 02:13:23 -0700 (PDT)
Subject: [Linux-cluster] Setup of Linux cluster in VMware
Message-ID: <947291.17990.qm@web62510.mail.re1.yahoo.com>

Hi All,
I have just registered to this group.

I have one query :
i was trying to set up Linux cluster using rhel 4.0 in
Vmware.

The problem i am facing with shared disk , and i dont
have any proper documenation for this.

Please let me know if anyone of you have worked on
this .

Thanks,
Manami


 
____________________________________________________________________________________
Bored stiff? Loosen up... 
Download and play hundreds of games for free on Yahoo! Games.
http://games.yahoo.com/games/front



From haller at atix.de  Thu Jun 21 09:54:27 2007
From: haller at atix.de (Dirk Haller)
Date: Thu, 21 Jun 2007 11:54:27 +0200
Subject: [Linux-cluster] Setup of Linux cluster in VMware
In-Reply-To: <947291.17990.qm@web62510.mail.re1.yahoo.com>
References: <947291.17990.qm@web62510.mail.re1.yahoo.com>
Message-ID: <200706211154.28175.haller@atix.de>

Hello Manami,

what is your exact problem? Setting up shared disks in VMware?
If yes, have a look on this doc...
http://www.vmware.com/support/gsx3/doc/ha_configs_gsx.html

This also works with the free VMware server version.

Regards,
Dirk
 
On Thursday 21 June 2007 11:13:23 manami mukherjee wrote:
> Hi All,
> I have just registered to this group.
>
> I have one query :
> i was trying to set up Linux cluster using rhel 4.0 in
> Vmware.
>
> The problem i am facing with shared disk , and i dont
> have any proper documenation for this.
>
> Please let me know if anyone of you have worked on
> this .
>
> Thanks,
> Manami
>
>
>
> ___________________________________________________________________________
>_________ Bored stiff? Loosen up...
> Download and play hundreds of games for free on Yahoo! Games.
> http://games.yahoo.com/games/front
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards Dirk Haller
Tel.: +49-89 452 3538 -13

****
ATIX - Gesellschaft f?r Informationstechnologie und Consulting mbH
Einsteinstrasse 10
D-85716 Unterschleissheim
Tel.: +49-89 452 3538 -0
http://www.atix.de/

!!! ATIX auf dem Linux Tag in Berlin: !!!
LinuxTag, 30.05. - 02.06.2007, Berlin
Linux-Verband Stand: Halle 12, Stand 56

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From jprats at cesca.es  Thu Jun 21 10:57:03 2007
From: jprats at cesca.es (Jordi Prats)
Date: Thu, 21 Jun 2007 12:57:03 +0200
Subject: [Linux-cluster] Very poor performance of GFS2 
Message-ID: <467A597F.7080604@cesca.es>

Hi all,
I'm getting a very poor performance using GFS2. Here some numbers:

A disk usage on a GFS2 filesystem:

[root at urani CLUSTER]# time du -hs PostgreSQL814/postgresql-8.1.4/
121M    PostgreSQL814/postgresql-8.1.4/

real    5m49.597s
user    0m0.000s
sys     0m0.004s


On the local disk:

[root at urani CLUSTER]# time du -hs postgresql-8.1.4/
116M    postgresql-8.1.4/

real    0m0.015s
user    0m0.000s
sys     0m0.016s

The GFS2 filesystem was created using this:

mkfs.gfs2 -t dades_test:postgres814 -p lock_dlm -j 2 /dev/data/postgres814

It is mounted on two machines (urani and plutoni):

/dev/data/mysql5020 on /CLUSTER/MySQL5020 type gfs2 
(rw,hostdata=jid=1:id=196610:first=0)
/dev/data/postgres814 on /CLUSTER/PostgreSQL814 type gfs2 
(rw,hostdata=jid=1:id=65538:first=0)

I supose it's caused by some problem on my configuration. What I am missing?

Thank you!

Jordi

-- 
......................................................................
         __
        / /          Jordi Prats
  C E / S / C A      Dept. de Sistemes
      /_/            Centre de Supercomputaci? de Catalunya

  Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona
  T. 93 205 6464 ? F.  93 205 6979 ? jprats at cesca.es
...................................................................... 



From swplotner at amherst.edu  Thu Jun 21 14:04:28 2007
From: swplotner at amherst.edu (Steffen Plotner)
Date: Thu, 21 Jun 2007 10:04:28 -0400
Subject: [Linux-cluster] Setup of Linux cluster in VMware
References: <947291.17990.qm@web62510.mail.re1.yahoo.com>
Message-ID: <0456A130F613AD459887FF012652963F0125EE7A@mail7.amherst.edu>

Hi,
 
I have a cluster that resides partially in vmware virtual machines and physical machines. To get at shared disks I use the iscsi initiator from within the VM.
 
Steffen
 
________________________________ 
Steffen Plotner 
Systems Administrator/Programmer 
Systems & Networking 
Amherst College 
PO BOX 5000 
Amherst, MA 01002-5000 
Tel (413) 542-2348 
Fax (413) 542-2626 
Email: swplotner at amherst.edu 
________________________________ 

________________________________

From: linux-cluster-bounces at redhat.com on behalf of manami mukherjee
Sent: Thu 6/21/2007 5:13 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Setup of Linux cluster in VMware



Hi All,
I have just registered to this group.

I have one query :
i was trying to set up Linux cluster using rhel 4.0 in
Vmware.

The problem i am facing with shared disk , and i dont
have any proper documenation for this.

Please let me know if anyone of you have worked on
this .

Thanks,
Manami



____________________________________________________________________________________
Bored stiff? Loosen up...
Download and play hundreds of games for free on Yahoo! Games.
http://games.yahoo.com/games/front

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070621/d6146031/attachment.htm>

From janne.peltonen at helsinki.fi  Thu Jun 21 15:37:33 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Thu, 21 Jun 2007 18:37:33 +0300
Subject: [Linux-cluster] rgmanager-2.0.24 doesn't execute script status
Message-ID: <20070621153732.GD15269@helsinki.fi>

Hi.

It seems to me that there is a bug in rgmanager 2.0.24 (at least in the
centos build). It doesn't execute status for service scripts, even if
there is this line in /usr/share/cluster/script.sh:

        <action name="status" interval="30s" timeout="0"/>

Downgrading back to version 2.0.23 seemed to help. It works with the
same configuration that version 2.0.24 doesn't.

(Perhaps I'd better learn to use Bugzilla...)


--Janne Peltonen
Univ. of Helsinki

P.S. Any news on the fs.sh front?
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From mvz+rhcluster at nimium.hr  Thu Jun 21 16:42:47 2007
From: mvz+rhcluster at nimium.hr (Miroslav Zubcic)
Date: Thu, 21 Jun 2007 18:42:47 +0200
Subject: [Linux-cluster] Status script action timeout
Message-ID: <467AAA87.10502@nimium.hr>

Hello,

Today early in the morning I have suffered from oracle (yuck!) listener
bug. Routine in status () in service monitoring script in my RHCS never
returned, just hanged forever. I have received error only after oracle
listener was restarted and oracle has been restarted in working hours
because listener hanged in 04:03 AM.

Is there some timeout parameter for rgmanager? Something to put in
cluster.conf? Man page doesn't mention anything, rh-cs-en.pdf neither,
so RTFM dosen't help. There must be some way to control timeout from
status routine in cluster control script right?

Thanks ...


-- 
Miroslav Zubcic, Nimium d.o.o., email: <mzubcic at nimium.hr>
Tel: +385 01 6390 782, Fax: +385 01 4852 640, Mobile: +385 098 942 8672
Gredicka 3, 10000 Zagreb, Hrvatska



From wkenji at labs.fujitsu.com  Fri Jun 22 08:53:41 2007
From: wkenji at labs.fujitsu.com (Kenji Wakamiya)
Date: Fri, 22 Jun 2007 17:53:41 +0900
Subject: [Linux-cluster] Can't mount snapshot LUN with new infrastructure
Message-ID: <467B8E15.4080209@labs.fujitsu.com>

Hello,

I have a three node GFS1 cluster that is based on OpenAIS and CentOS 5,
and am using NetApp's iSCSI LUN as a block device.

Now I want to mount that LUN's snapshot LUN with lock_nolock on any
member of the three nodes, with having mounted original LUN.

With old version of CMAN (not OpenAIS), the same thing worked well.
But I've got the following error with new infrastructure:

# mount -t gfs
/dev/isda on /web type gfs (rw,hostdata=jid=2:id=131074:first=0,acl)
# mount -t gfs -o lockproto=lock_nolock /dev/isdb /testsnap
/sbin/mount.gfs: error 17 mounting /dev/isdb on /testsnap

/var/log/mssages:
Jun 22 17:10:46 node17 kernel: kobject_add failed for cluster3:gfstest with -EEXIST, don't try to register things with the same name in the same directory.
Jun 22 17:10:46 node17 kernel:  [<c04dc1e0>] kobject_add+0x147/0x16d
Jun 22 17:10:46 node17 kernel:  [<c04dc2ec>] kobject_register+0x19/0x30
Jun 22 17:10:46 node17 kernel:  [<f8c599ca>] fill_super+0x3df/0x5c9 [gfs]
Jun 22 17:10:46 node17 kernel:  [<c04708f6>] get_sb_bdev+0xc6/0x110
Jun 22 17:10:46 node17 kernel:  [<c0453f2e>] __alloc_pages+0x57/0x27e
Jun 22 17:10:46 node17 kernel:  [<f8c58d15>] gfs_get_sb+0x12/0x16 [gfs]
Jun 22 17:10:46 node17 kernel:  [<f8c595eb>] fill_super+0x0/0x5c9 [gfs]
Jun 22 17:10:46 node17 kernel:  [<c04703bb>] vfs_kern_mount+0x7d/0xf2
Jun 22 17:10:46 node17 kernel:  [<c0470462>] do_kern_mount+0x25/0x36
Jun 22 17:10:46 node17 kernel:  [<c0482eb4>] do_mount+0x5d6/0x646
Jun 22 17:10:46 node17 kernel:  [<c044f86c>] find_get_pages_tag+0x30/0x6e
Jun 22 17:10:46 node17 kernel:  [<c0455887>] pagevec_lookup_tag+0x1b/0x22
Jun 22 17:10:46 node17 kernel:  [<c0453c5d>] get_page_from_freelist+0x96/0x310
Jun 22 17:10:46 node17 kernel:  [<c0453e6d>] get_page_from_freelist+0x2a6/0x310
Jun 22 17:10:46 node17 kernel:  [<c0453c5d>] get_page_from_freelist+0x96/0x310
Jun 22 17:10:46 node17 kernel:  [<c0481dc5>] copy_mount_options+0x26/0x109
Jun 22 17:10:46 node17 kernel:  [<c0482f91>] sys_mount+0x6d/0xa5
Jun 22 17:10:46 node17 kernel:  [<c0403eff>] syscall_call+0x7/0xb
Jun 22 17:10:46 node17 kernel:  =======================

Now I have /sys/fs/gfs/cluster3:gfstest/.  So I suspect that a
conflict between two LUN's lock table names is occurring in sysfs.

Is there any good solution?

Thanks,
Kenji




From GavinF at itdynamics.co.za  Fri Jun 22 10:14:35 2007
From: GavinF at itdynamics.co.za (Gavin Fietze)
Date: Fri, 22 Jun 2007 12:14:35 +0200
Subject: [Linux-cluster] QDISK problem
Message-ID: <467BBD2B.A5F6.00ED.0@itdynamics.co.za>

I am trying to get qdiskd to work in a 3 node cluster using RHEL 5 AP . The 3 nodes are virtual machines running XEN. Domain0 is also RHEL 5 AP
I have upgraded cman and rgmanager to cman-2.0.60-1.el5 and rgmanager-2.0.23-1 respectively, everything else stock standard.
 
When I run clustat and "cman_tool nodes" I get strange output for the qdisk object :
 
[root at node1 ~]#
[root at node1 ~]# clustat
/dev/sdc1?? U?? U?? not found
realloc 1232
Member Status: Quorate
 
  Member Name                        ID   Status
  ------ ----                        ---- ------
  node1.sv.dynamics.co.za               1 Online, Local, rgmanager
  node2.sv.dynamics.co.za               2 Online, rgmanager
  node3.sv.dynamics.co.za               3 Online, rgmanager
  /dev/sdc1?? U?? U??                  0 Online, Estranged, rgmanager
 
  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  service:it           node2.sv.dynamics.co.za        started
  service:tru          node3.sv.dynamics.co.za        started
  service:aql          (node1.sv.dynamics.co.za)      failed
  service:if           node3.sv.dynamics.co.za        started
  service:fc           (node1.sv.dynamics.co.za)      failed
  service:com          (node1.sv.dynamics.co.za)      failed
  service:xprint       node3.sv.dynamics.co.za        started
[root at node1 ~]#
[root at node1 ~]#
[root at node1 ~]#
[root at node1 ~]# cman_tool nodes

Node  Sts   Inc   Joined               Name
   0   M      0   2007-06-22 10:49:54  /dev/sdc1?? U?? U??
   1   M      4   2007-06-22 10:47:14  node1.sv.dynamics.co.za
   2   M     52   2007-06-22 10:47:14  node2.sv.dynamics.co.za
   3   M     52   2007-06-22 10:47:14  node3.sv.dynamics.co.za
 
mkqisk does not report any funnies:
 
[root at node1 ~]# mkqdisk -L
mkqdisk v0.5.1
/dev/sdc1:
        Magic:   eb7a62c2
        Label:   epoc
        Created: Thu Jun 14 16:04:49 2007
        Host:    node3.svdynamics.co.za
 
Is this normal, and will it effect the operation of qdiskd?
Can someone tell me what Inc represents in the cman_tool nodes output?
 
Thanks
 
Gavin Fietze
IT Dynamics (Pty) Ltd
Direct +27 (0)31 7130826
Fax +27 (0)31 7020613
Mobile +27 (0)83 5012516
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070622/df715df0/attachment.htm>

From teigland at redhat.com  Fri Jun 22 13:33:08 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 22 Jun 2007 08:33:08 -0500
Subject: [Linux-cluster] Very poor performance of GFS2
In-Reply-To: <467A597F.7080604@cesca.es>
References: <467A597F.7080604@cesca.es>
Message-ID: <20070622133308.GA6381@redhat.com>

On Thu, Jun 21, 2007 at 12:57:03PM +0200, Jordi Prats wrote:
> Hi all,
> I'm getting a very poor performance using GFS2. Here some numbers:
> 
> A disk usage on a GFS2 filesystem:
> 
> [root at urani CLUSTER]# time du -hs PostgreSQL814/postgresql-8.1.4/
> 121M    PostgreSQL814/postgresql-8.1.4/
> 
> real    5m49.597s
> user    0m0.000s
> sys     0m0.004s
> 
> 
> On the local disk:
> 
> [root at urani CLUSTER]# time du -hs postgresql-8.1.4/
> 116M    postgresql-8.1.4/
> 
> real    0m0.015s
> user    0m0.000s
> sys     0m0.016s

You should compare with gfs1, that will tell you if the gfs2 numbers are
in the right ballpark.

Dave



From teigland at redhat.com  Fri Jun 22 13:40:54 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 22 Jun 2007 08:40:54 -0500
Subject: [Linux-cluster] Can't mount snapshot LUN with new infrastructure
In-Reply-To: <467B8E15.4080209@labs.fujitsu.com>
References: <467B8E15.4080209@labs.fujitsu.com>
Message-ID: <20070622134054.GB6381@redhat.com>

On Fri, Jun 22, 2007 at 05:53:41PM +0900, Kenji Wakamiya wrote:
> Hello,
> 
> I have a three node GFS1 cluster that is based on OpenAIS and CentOS 5,
> and am using NetApp's iSCSI LUN as a block device.
> 
> Now I want to mount that LUN's snapshot LUN with lock_nolock on any
> member of the three nodes, with having mounted original LUN.
> 
> With old version of CMAN (not OpenAIS), the same thing worked well.
> But I've got the following error with new infrastructure:
> 
> # mount -t gfs
> /dev/isda on /web type gfs (rw,hostdata=jid=2:id=131074:first=0,acl)
> # mount -t gfs -o lockproto=lock_nolock /dev/isdb /testsnap
> /sbin/mount.gfs: error 17 mounting /dev/isdb on /testsnap

> Now I have /sys/fs/gfs/cluster3:gfstest/.  So I suspect that a
> conflict between two LUN's lock table names is occurring in sysfs.

Yes, you're exactly right.  You can also override the locktable name with
a mount option:

  mount -t gfs -o lockproto=lock_nolock,locktable=foo /dev/isdb /testsnap

Dave



From simanhew at gmail.com  Fri Jun 22 16:57:35 2007
From: simanhew at gmail.com (siman hew)
Date: Fri, 22 Jun 2007 12:57:35 -0400
Subject: [Linux-cluster] What is "Password Script" field for in Fence Device
	Configuration
Message-ID: <6596a7c70706220957s5dfad6cel69337069ace15be6@mail.gmail.com>

Hi all,

I tried to use old GUI (and Conga) to define a fence device on RHEL4U5, I
found there is new field "Password Script" under Password field.  What is
thid field for?  I can not find it on RHEL5 (old GUI & Conga)
.  Is this particular for 4U5 ?

Any explaination is very apprecaited.

Thanks,

Siman
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070622/13ba63fc/attachment.htm>

From oliver.olsen at advance.as  Fri Jun 22 17:51:05 2007
From: oliver.olsen at advance.as (Oliver Olsen)
Date: Fri, 22 Jun 2007 19:51:05 +0200
Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4)
Message-ID: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as>

Hi,
I'm currently in the process of upgrading a RHEL 3 cluster to RHEL 4,  
and I haven't been able to figure out how to use snapshots in a proper  
way. We have other RHEL 4 systems using LVM2, and I use snapshots with  
ext3 for backup with great success on those. As the demand for uptime  
on this particular cluster is cruicial (and regular filebackup using  
tar is way too slow!) I was hoping to accomplish the same kind of  
snapshots with GFS1 and RHEL 4 as with ext3.

My test-environment consists of 3 VM's in a VMware Server 1.0.3  
environment, using DLM and one GFS filesystem on the VM's. Fencing and  
connectivity is working just fine, but when I try to use snapshots  
things go wrong.

I try to initiate the snapshot via
[root at gfs1 Scripts]# lvcreate -L500M -s -n snap /dev/GFS/LV1
Logical volume "snap" created

When I try to mount the snapshot (in the same way as I do with an ext3  
snapshot) it does not seem to work

[root at gfs1 Scripts]# mount -t gfs
/dev/mapper/GFS-LV1 on /mnt/GFS type gfs (rw)

[root at gfs1 Scripts]# mount -t gfs /dev/GFS/snap /mnt/GFS-snap
mount: File exists

Also - when I try to do changes in the /mnt/GFS filesystem after  
creating the snapshot, there does not seem to be any changes in the  
snapshot when I check the graphical interface of LVM (it says Snapshot  
usage: 0%). Attributes are "swi-a-" and GFS (clustered) according to  
LVM.

I assume I am missing something vital here, but I haven't been able to  
find the documentation which explains this in an easy manner.
Any inputs will be highly appreciated!

Best regards,
Oliver Olsen



From Greg.Caetano at hp.com  Fri Jun 22 18:08:35 2007
From: Greg.Caetano at hp.com (Caetano, Greg)
Date: Fri, 22 Jun 2007 14:08:35 -0400
Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4)
In-Reply-To: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as>
References: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as>
Message-ID: <E4BD1443AB30DA45AC27E87417E758D402320FAA@tayexc19.americas.cpqcorp.net>

Oliver

What does the following command show for the status of your snapshot
volume and/or device

# lvdisplay /dev/GFS/LV1 


Greg Caetano
HP TSG Linux Solutions Alliances Engineering
Chicago, IL
greg.caetano at hp.com
Red Hat Certified Engineer
RHCE#803004972711193
-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Oliver Olsen
Sent: Friday, June 22, 2007 12:51 PM
To: linux clustering
Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4)

Hi,
I'm currently in the process of upgrading a RHEL 3 cluster to RHEL 4,
and I haven't been able to figure out how to use snapshots in a proper
way. We have other RHEL 4 systems using LVM2, and I use snapshots with
ext3 for backup with great success on those. As the demand for uptime on
this particular cluster is cruicial (and regular filebackup using tar is
way too slow!) I was hoping to accomplish the same kind of snapshots
with GFS1 and RHEL 4 as with ext3.

My test-environment consists of 3 VM's in a VMware Server 1.0.3
environment, using DLM and one GFS filesystem on the VM's. Fencing and
connectivity is working just fine, but when I try to use snapshots
things go wrong.

I try to initiate the snapshot via
[root at gfs1 Scripts]# lvcreate -L500M -s -n snap /dev/GFS/LV1 Logical
volume "snap" created

When I try to mount the snapshot (in the same way as I do with an ext3
snapshot) it does not seem to work

[root at gfs1 Scripts]# mount -t gfs
/dev/mapper/GFS-LV1 on /mnt/GFS type gfs (rw)

[root at gfs1 Scripts]# mount -t gfs /dev/GFS/snap /mnt/GFS-snap
mount: File exists

Also - when I try to do changes in the /mnt/GFS filesystem after
creating the snapshot, there does not seem to be any changes in the
snapshot when I check the graphical interface of LVM (it says Snapshot
usage: 0%). Attributes are "swi-a-" and GFS (clustered) according to
LVM.

I assume I am missing something vital here, but I haven't been able to
find the documentation which explains this in an easy manner.
Any inputs will be highly appreciated!

Best regards,
Oliver Olsen

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From oliver.olsen at advance.as  Fri Jun 22 18:17:17 2007
From: oliver.olsen at advance.as (Oliver Olsen)
Date: Fri, 22 Jun 2007 20:17:17 +0200
Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4)
In-Reply-To: <E4BD1443AB30DA45AC27E87417E758D402320FAA@tayexc19.americas.cpqcorp.net>
References: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as>
	<E4BD1443AB30DA45AC27E87417E758D402320FAA@tayexc19.americas.cpqcorp.net>
Message-ID: <20070622201717.airzxizzko84kc0o@home.advance.as>

Quoting "Caetano, Greg" <Greg.Caetano at hp.com>:

> Oliver
>
> What does the following command show for the status of your snapshot
> volume and/or device
>
> # lvdisplay /dev/GFS/LV1
>
Greg,
The output is as follows

[root at gfs1 Scripts]# lvdisplay /dev/GFS/LV1
   --- Logical volume ---
   LV Name                /dev/GFS/LV1
   VG Name                GFS
   LV UUID                zCkOfT-2Nua-Z3V2-GEy7-vrrs-p0IX-qu01kr
   LV Write Access        read/write
   LV snapshot status     source of
                          /dev/GFS/snap [active]
   LV Status              available
   # open                 1
   LV Size                1.56 GB
   Current LE             400
   Segments               1
   Allocation             inherit
   Read ahead sectors     0
   Block device           253:3

[root at gfs1 Scripts]# lvdisplay /dev/GFS/snap
   --- Logical volume ---
   LV Name                /dev/GFS/snap
   VG Name                GFS
   LV UUID                KSPZxh-y9j4-h1c3-JaYo-O8FZ-2dPr-Nbv09S
   LV Write Access        read/write
   LV snapshot status     active destination for /dev/GFS/LV1
   LV Status              available
   # open                 0
   LV Size                1.56 GB
   Current LE             400
   COW-table size         100.00 MB
   COW-table LE           25
   Allocated to snapshot  0.02%
   Snapshot chunk size    8.00 KB
   Segments               1
   Allocation             inherit
   Read ahead sectors     0
   Block device           253:5

(I used "lvcreate -L100M -s -n snap /dev/GFS/LV1" for this particular  
snapshot)

Best regards,
Oliver Olsen



From teigland at redhat.com  Fri Jun 22 18:45:42 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 22 Jun 2007 13:45:42 -0500
Subject: [Linux-cluster] Snapshots with GFS1/RHEL 4 (U4)
In-Reply-To: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as>
References: <20070622195105.r0bk5hg4kk0ws4oo@home.advance.as>
Message-ID: <20070622184542.GE6381@redhat.com>

On Fri, Jun 22, 2007 at 07:51:05PM +0200, Oliver Olsen wrote:
> I assume I am missing something vital here, but I haven't been able to  
> find the documentation which explains this in an easy manner.

You need clustered snapshots in lvm2 which don't exist.  Clustered
mirroring in lvm2 was recently introduced, though, which you can use with
gfs.

Dave



From lhh at redhat.com  Fri Jun 22 20:36:04 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 22 Jun 2007 16:36:04 -0400
Subject: [Linux-cluster] Diskless Quorum Disk
In-Reply-To: <4679B0C1.9090005@cmiware.com>
References: <46795C18.5040408@cmiware.com> <20070620214408.GL4687@redhat.com>
	<4679B0C1.9090005@cmiware.com>
Message-ID: <20070622203604.GO4687@redhat.com>

On Wed, Jun 20, 2007 at 05:57:05PM -0500, Chris Harms wrote:
> My nodes were set to "quorum=1 two_node=1" and fenced by DRAC cards 
> using telnet over their NICs.  The same NICs used in my bonded config on 
> the OS so I assumed it was on the same network path.  Perhaps I assume 
> incorrectly.

That sounds mostly right.  The point is that a node disconnected from
the cluster must not be able to fence a node which is supposedly still
connected.

That is: 'A' must not be able to fence 'B' if 'A' becomes disconnected
from the cluster.  However, 'A' must be able to be fenced if 'A' becomes
disconnected.

Why was DRAC unreachable; was it unplugged too? (Is DRAC like IPMI - in
that it shares a NIC with the host machine?)

> Desired effect would be survivor claims service(s) running on 
> unreachable node and attempts to fence unreachable node or bring it back 
> online without fencing should it establish contact.  Actual result was 
> survivor spun its wheels trying to fence unreachable node and did not 
> assume services.

Yes, this is an unfortunate limitation of using (most) integrated power
management systems.  Basically, some BMCs share a NIC with the host
(IPMI), and some run off of the machine's power supply (IPMI, iLO,
DRAC).  When the fence device becomes unreachable, we don't know whether
it's a total network outage or a "power disconnected" state.

* If the power to a node has been disconnected, it's safe to recover.

* If the node just lost all of its network connectivity, it's *NOT* safe
to recover.

* In both cases, we can not confirm the node is dead... which is why we
don't recover.

> Restoring network connectivity induced the previously 
> unreachable node to reboot and the surviving node experienced some kind 
> of weird power off and then powered back on (???).

That doesn't sound right; the surviving node should have stayed put (not
rebooted).

> Ergo I figured I must need quorum disk so I can use something like a 
> ping node.  My present plan is to use a loop device for the quorum disk 
> device and then setup ping heuristics.  Will this even work, i.e. do the 
> nodes both need to see the same qdisk or can I fool the service with a 
> loop device?

I don't believe the effect of tricking qdiskd in this way have been
explored; I don't see why it wouldn't work in theory, but... qdiskd with
or without a disk won't fix the behavior you experienced (uncertain
state due to failure to fence -> retry / wait for node to come back).

> I am not deploying GFS or GNDB and I have no SAN.  My only 
> option would be to add another DRBD partition for this purpose which may 
> or may not work.

> What is the proper setup option, two_node=1 or qdisk?

In your case, I'd say two_node="1".

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From wkenji at labs.fujitsu.com  Fri Jun 22 20:39:33 2007
From: wkenji at labs.fujitsu.com (Kenji Wakamiya)
Date: Sat, 23 Jun 2007 05:39:33 +0900
Subject: [Linux-cluster] Can't mount snapshot LUN with new infrastructure
In-Reply-To: <20070622134054.GB6381@redhat.com>
References: <467B8E15.4080209@labs.fujitsu.com>
	<20070622134054.GB6381@redhat.com>
Message-ID: <467C3385.2080802@labs.fujitsu.com>

David Teigland wrote:
 >> Now I have /sys/fs/gfs/cluster3:gfstest/.  So I suspect that a
 >> conflict between two LUN's lock table names is occurring in sysfs.
 >
 > Yes, you're exactly right.  You can also override the locktable name with
 > a mount option:
 >
 >   mount -t gfs -o lockproto=lock_nolock,locktable=foo /dev/isdb /testsnap

I didn't know that option existed.  It worked!
Thank you!

Kenji




From lhh at redhat.com  Fri Jun 22 20:40:43 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 22 Jun 2007 16:40:43 -0400
Subject: [Linux-cluster] Cluster service restarting
In-Reply-To: <4679BF38.6080509@flinders.edu.au>
References: <4679BF38.6080509@flinders.edu.au>
Message-ID: <20070622204042.GP4687@redhat.com>

On Thu, Jun 21, 2007 at 09:28:48AM +0930, David Schroeder wrote:
> Hi,
> 
> We have been running web and database clusters successfully for several 
> years on RHEL 3 and 4 and we now have one of each on RHEL 5.
> 
> The setup is very straight forward, 2 nodes active/active with one 
> running the webserver the other the databases.
> 
> We have found the services restart in place regularly, up to 2 or 3 
> times a day sometimes. The cause is the Failure to ping one or another 
> of the clustered service IP addresses and is evident from the log 
> entries. This happens less frequently on the database server with one 
> clustered interface than it does with the webserver that has 5. The 
> failure to ping that is reported in the logs for the webserver is not 
> always on the same IP address and it seems quite random in time and 
> which in which IP address it reports is at fault. There are no load 
> related issues as this is still in the testing stage.
> 
> I have turned the "Monitor Link" setting off and it still happens.
> 
> Are there any settings that will increase the timeout as I'm sure the 
> interface does not go down.
> 
> Any other pointers or suggestions?

You can disable the check; remove these from /usr/share/cluster/ip.sh:

        <!-- Checks to see if we can ping the IP address locally -->
        <action name="status" depth="10" interval="60" timeout="20"/>
        <action name="monitor" depth="10" interval="60" timeout="20"/>

Update your /etc/cluster/cluster.conf's config_version and redistribute
the configuration file using ccs_tool update.  This will cause rgmanager
to stop doing the 'ping' checks.

-- Lon


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From chris at cmiware.com  Fri Jun 22 23:43:53 2007
From: chris at cmiware.com (Chris Harms)
Date: Fri, 22 Jun 2007 18:43:53 -0500
Subject: [Linux-cluster] Diskless Quorum Disk
In-Reply-To: <20070622203604.GO4687@redhat.com>
References: <46795C18.5040408@cmiware.com>
	<20070620214408.GL4687@redhat.com>	<4679B0C1.9090005@cmiware.com>
	<20070622203604.GO4687@redhat.com>
Message-ID: <467C5EB9.20904@cmiware.com>

Lon, thank you for the response.  It appears that what I thought was a 
fence duel, was actually the cluster fencing the proper node and DRBD 
halting the surviving node after a split brain scenario.  (Have some 
work to do on my drbd.conf obviously.)  After the fenced node revived, 
it saw that the other was unresponsive (it had been halted) and then 
fenced it; in this case inducing it to power on.

Our DRAC shares the NICs with the host.  We will probably hack on the 
DRAC fence script a little to take advantage of some other features 
available besides doing a poweroff poweron.

Using two_node=1 may be an option again, but then the FAQ indicates the 
quorum disk might still be beneficial.  Using a loop device didn't seem 
to go so well, but that could be due to configuration error.  Having one 
node not see the qdisk is probably an automatic test failure.

Thanks again,
Chris


Lon Hohberger wrote:
> On Wed, Jun 20, 2007 at 05:57:05PM -0500, Chris Harms wrote:
>   
>> My nodes were set to "quorum=1 two_node=1" and fenced by DRAC cards 
>> using telnet over their NICs.  The same NICs used in my bonded config on 
>> the OS so I assumed it was on the same network path.  Perhaps I assume 
>> incorrectly.
>>     
>
> That sounds mostly right.  The point is that a node disconnected from
> the cluster must not be able to fence a node which is supposedly still
> connected.
>
> That is: 'A' must not be able to fence 'B' if 'A' becomes disconnected
> from the cluster.  However, 'A' must be able to be fenced if 'A' becomes
> disconnected.
>
> Why was DRAC unreachable; was it unplugged too? (Is DRAC like IPMI - in
> that it shares a NIC with the host machine?)
>
>   
>> Desired effect would be survivor claims service(s) running on 
>> unreachable node and attempts to fence unreachable node or bring it back 
>> online without fencing should it establish contact.  Actual result was 
>> survivor spun its wheels trying to fence unreachable node and did not 
>> assume services.
>>     
>
> Yes, this is an unfortunate limitation of using (most) integrated power
> management systems.  Basically, some BMCs share a NIC with the host
> (IPMI), and some run off of the machine's power supply (IPMI, iLO,
> DRAC).  When the fence device becomes unreachable, we don't know whether
> it's a total network outage or a "power disconnected" state.
>
> * If the power to a node has been disconnected, it's safe to recover.
>
> * If the node just lost all of its network connectivity, it's *NOT* safe
> to recover.
>
> * In both cases, we can not confirm the node is dead... which is why we
> don't recover.
>
>   
>> Restoring network connectivity induced the previously 
>> unreachable node to reboot and the surviving node experienced some kind 
>> of weird power off and then powered back on (???).
>>     
>
> That doesn't sound right; the surviving node should have stayed put (not
> rebooted).
>
>   
>> Ergo I figured I must need quorum disk so I can use something like a 
>> ping node.  My present plan is to use a loop device for the quorum disk 
>> device and then setup ping heuristics.  Will this even work, i.e. do the 
>> nodes both need to see the same qdisk or can I fool the service with a 
>> loop device?
>>     
>
> I don't believe the effect of tricking qdiskd in this way have been
> explored; I don't see why it wouldn't work in theory, but... qdiskd with
> or without a disk won't fix the behavior you experienced (uncertain
> state due to failure to fence -> retry / wait for node to come back).
>
>   
>> I am not deploying GFS or GNDB and I have no SAN.  My only 
>> option would be to add another DRBD partition for this purpose which may 
>> or may not work.
>>     
>
>   
>> What is the proper setup option, two_node=1 or qdisk?
>>     
>
> In your case, I'd say two_node="1".
>
>   



From bsachnoff at incisent.com  Sat Jun 23 22:23:56 2007
From: bsachnoff at incisent.com (Brent Sachnoff)
Date: Sat, 23 Jun 2007 17:23:56 -0500
Subject: [Linux-cluster] a couple of questions regarding clusters
Message-ID: <91E302EA72562F43AEDC06DD8FAE7D3602855F82@ord1mail01.firstlook.biz>

I have a 3 node cluster running redhat 4 with gfs.  What is the proper
way to have a node leave the cluster for maintenance and then rejoin
after maintenance is completed?  From the docs, I have read that I need
to unmount gfs and then stop all the services in the following order:
rgmanager, gfs, clvmd, fenced.  I can then issue a cman_tool leave
(remove) request.

 

I have also noticed that if I lose ip connectivity to a certain node I
lose gfs connectivity with the other two nodes.  I thought that I would
only need 2 votes to continue connectivity. 

 

Thanks for the help!

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070623/f8e8f401/attachment.htm>

From manjusc13 at rediffmail.com  Mon Jun 25 03:57:19 2007
From: manjusc13 at rediffmail.com (manjunath c shanubog)
Date: 25 Jun 2007 03:57:19 -0000
Subject: [Linux-cluster] Cluster configuration on redhat AS 4
Message-ID: <20070625035719.12708.qmail@webmail6.rediffmail.com>

Hi,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I have to setup two node cluster with redhat AS 4 and cluster suite with GFS. The application which is to be installed is MySql database. I would like&nbsp;to have a solution for the below queries&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1. Detailed installation guide for cluster suite installation and is it possible to load balance on redhat 4/5 linux.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2. Do i need to have a separate cluster suite for MySql, if so which one is Good.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3. Guide or document for Installation of MySQL on cluster.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4. In windows clustering there is no need of fencing device, why is it necessary in linux. if so which is good fencing device and its configuration details.Thanking YouManjunath
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070625/14725fc2/attachment.htm>

From jprats at cesca.es  Mon Jun 25 06:48:05 2007
From: jprats at cesca.es (Jordi Prats)
Date: Mon, 25 Jun 2007 08:48:05 +0200
Subject: [Linux-cluster] Very poor performance of GFS2
In-Reply-To: <20070622133308.GA6381@redhat.com>
References: <467A597F.7080604@cesca.es> <20070622133308.GA6381@redhat.com>
Message-ID: <467F6525.3040506@cesca.es>

I supose you say so because the disk could be slow, but it should not be 
this slow because they virtual machines and they are accesing the same 
way as is the local disk. (Both are LVM volumes)

I found almost no documentation about how to install GFS2, so I'm 
assuming I did something wrong. I supose GFS2 do not add about 5 minutes 
of delay because of it's operations!

Jordi

David Teigland wrote:
> On Thu, Jun 21, 2007 at 12:57:03PM +0200, Jordi Prats wrote:
>   
>> Hi all,
>> I'm getting a very poor performance using GFS2. Here some numbers:
>>
>> A disk usage on a GFS2 filesystem:
>>
>> [root at urani CLUSTER]# time du -hs PostgreSQL814/postgresql-8.1.4/
>> 121M    PostgreSQL814/postgresql-8.1.4/
>>
>> real    5m49.597s
>> user    0m0.000s
>> sys     0m0.004s
>>
>>
>> On the local disk:
>>
>> [root at urani CLUSTER]# time du -hs postgresql-8.1.4/
>> 116M    postgresql-8.1.4/
>>
>> real    0m0.015s
>> user    0m0.000s
>> sys     0m0.016s
>>     
>
> You should compare with gfs1, that will tell you if the gfs2 numbers are
> in the right ballpark.
>
> Dave
>
>
>
>   


-- 
......................................................................
         __
        / /          Jordi Prats
  C E / S / C A      Dept. de Sistemes
      /_/            Centre de Supercomputaci? de Catalunya

  Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona
  T. 93 205 6464 ? F.  93 205 6979 ? jprats at cesca.es
...................................................................... 



From tmornini at engineyard.com  Mon Jun 25 07:19:01 2007
From: tmornini at engineyard.com (Tom Mornini)
Date: Mon, 25 Jun 2007 00:19:01 -0700
Subject: [Linux-cluster] Very poor performance of GFS2
In-Reply-To: <467F6525.3040506@cesca.es>
References: <467A597F.7080604@cesca.es> <20070622133308.GA6381@redhat.com>
	<467F6525.3040506@cesca.es>
Message-ID: <E4D16754-56BB-47A4-98FB-8CAE57176E65@engineyard.com>

On Jun 24, 2007, at 11:48 PM, Jordi Prats wrote:

> I supose you say so because the disk could be slow, but it should  
> not be this slow because they virtual machines and they are  
> accesing the same way as is the local disk. (Both are LVM volumes)
>
> I found almost no documentation about how to install GFS2, so I'm  
> assuming I did something wrong. I supose GFS2 do not add about 5  
> minutes of delay because of it's operations!

I think you're right. GFS2 is *supposed* to be faster than the  
original GFS, and here's what I get.

ey00-s00001 data # pwd
/data

ey00-s00001 data # time du -hs postgresql-8.2.4/
76M     postgresql-8.2.4/

real    0m0.061s
user    0m0.000s
sys     0m0.050s

ey00-s00001 data # df -Th
Filesystem    Type    Size  Used Avail Use% Mounted on
/dev/sda1 reiserfs    2.0G  783M  1.3G  39% /
udev         tmpfs    512M  120K  512M   1% /dev
shm          tmpfs    512M     0  512M   0% /dev/shm
/dev/sdb1      gfs    227G  128G   99G  57% /data

-- 
-- Tom Mornini, CTO
-- Engine Yard, Ruby on Rails Hosting
-- Support, Scalability, Reliability
-- (866) 518-YARD (9273)



From admin.cluster at gmail.com  Mon Jun 25 09:50:59 2007
From: admin.cluster at gmail.com (Anthony)
Date: Mon, 25 Jun 2007 11:50:59 +0200
Subject: [Linux-cluster] eclipse for RHEL4
Message-ID: <467F9003.7000603@gmail.com>

Hello,

i have problems running eclipse on my RHEL box,
when i use the gz version from the eclipse website.

so i am looking for a Eclipse RPM for RHEL 4, couldn't find it on the 
RHN network.

Does anyone has a link for the Eclipse RPM for the RHEL 4 update 2

Thnks,
Anthony.




From davea at support.kcm.org  Mon Jun 25 15:44:36 2007
From: davea at support.kcm.org (Dave Augustus)
Date: Mon, 25 Jun 2007 10:44:36 -0500
Subject: [Linux-cluster] No failover occurs!
Message-ID: <1182786276.12133.2.camel@kcm40202.kcmhq.org>

I am familiar with Heartbeat and new to RHCS.

Anyhow:

I created a 2 node cluster with no quorum drive.
added an ip address on the public eth
added an ip address on the private eth
added the script apache, with proper configs on both hosts

The only way I can get all 3 to run is to reboot the nodes.

Shouldn't it failover if a service fails to start?

Thanks,
Dave



From andrewxwang at yahoo.com.tw  Mon Jun 25 17:02:27 2007
From: andrewxwang at yahoo.com.tw (Andrew Wang)
Date: Tue, 26 Jun 2007 01:02:27 +0800 (CST)
Subject: [Linux-cluster] Fwd: Sun N1 Grid Engine Software and the Tokyo
	Institute of Technology Super Computer Grid (Sun BluePrint)
Message-ID: <966992.58014.qm@web73503.mail.tp2.yahoo.com>

The blueprint talks about how SGE 6 is used in the
TSUBAME supercomputer (no. 7 on the June 2006 TOP500
list).
 
It talks about the tight SGE-SSH integration in:

  Chapter 3: SSH for Sun N1 Grid Engine Software

You can download the PDF at:
http://www.sun.com/blueprints/0607/820-1695.html

Andrew.




      ____________________________________________________________________________________
?????????????????????? Yahoo!??????????
http://tw.mobile.yahoo.com/texts/mail.php



From Christopher.Barry at qlogic.com  Mon Jun 25 17:59:14 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Mon, 25 Jun 2007 13:59:14 -0400
Subject: [Linux-cluster] Fwd: Sun N1 Grid Engine Software and the Tokyo
	Institute of Technology Super Computer Grid (Sun BluePrint)
In-Reply-To: <966992.58014.qm@web73503.mail.tp2.yahoo.com>
References: <966992.58014.qm@web73503.mail.tp2.yahoo.com>
Message-ID: <1182794354.5240.16.camel@localhost>

On Tue, 2007-06-26 at 01:02 +0800, Andrew Wang wrote:
> The blueprint talks about how SGE 6 is used in the
> TSUBAME supercomputer (no. 7 on the June 2006 TOP500
> list).
>  
> It talks about the tight SGE-SSH integration in:
> 
>   Chapter 3: SSH for Sun N1 Grid Engine Software
> 
> You can download the PDF at:
> http://www.sun.com/blueprints/0607/820-1695.html
> 
> Andrew.
> 
> 
> 

First, do not cross-post.
Second, SGE is not RHCS.
Third, and the most interesting, the blueprint design on the cover of
this PDF appears to be of a bathroom, with the text "To Drain -->" being
the most prominent aspect of the blueprint! Not the best connotation...

Doh!
-C



From andremachado at techforce.com.br  Mon Jun 25 19:34:06 2007
From: andremachado at techforce.com.br (andremachado)
Date: Mon, 25 Jun 2007 12:34:06 -0700
Subject: [Linux-cluster] how to GFS+iSCSI failover?
Message-ID: <bf3013ea872512c5abc922fe61f418c4@localhost>

Hello,
I have 2 nodes (node_1, node_2) with 2 respective GFS (gfs_1, gfs_2) lvm exported trough iSCSI Enterprise Target to a third node node_3 with Open-iSCSI. 
What is the correct way to implement a cluster failover service to verify if node_1 becomes unavailable, 
umount/logout gfs_1 and then login/mount the gfs_2 from node_2 to the same mount point at node_3?

Should I configure a failover domain restricted, prioritized node_1 and node_2, each with a private resource gfs,
with THEIR local mount points or node_3 mount point?
Should I configure a resource custom script? (it seems likely).
How then monitor the node availability? 
could you explain the gfs, ip and script resources behaviour further than the RH docs?

Regards.
Andre Felipe Machado


updated [0] in June 21 2007

[0] http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch



From chris at cmiware.com  Mon Jun 25 20:08:41 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 25 Jun 2007 15:08:41 -0500
Subject: [Linux-cluster] trouble reinstalling cluster suite 5
Message-ID: <468020C9.8080208@cmiware.com>

After some trouble with my first go, I ended up uninstalling the cluster 
packages and attempting to reinstall.  However, one of the nodes 
apparently can't forget some info about the previous installation.  
After reinstalling ricci and luci, and doing create cluster the nodes 
reboot as normal.  Upon reboot, the offending node attempts to fence the 
other (I haven't even gotten to the point of setting that up yet) while 
the other reports

ccsd[3414]: Unable to connect to cluster infrastructure after XYZ seconds.

attempts to stop cman via service cman stop summarily fail and fenced 
and ccsd have to be killed before it will succeed.

Any ideas on what or where it would be storing information about the old 
install, or how I should properly uninstall the software before starting 
over?

Thanks in advance,
Chris



From chris at cmiware.com  Mon Jun 25 22:13:10 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 25 Jun 2007 17:13:10 -0500
Subject: [Linux-cluster] trouble reinstalling cluster suite 5 (correction)
In-Reply-To: <468020C9.8080208@cmiware.com>
References: <468020C9.8080208@cmiware.com>
Message-ID: <46803DF6.6040002@cmiware.com>

something is apparently wrong with openais on a node.  repeated removal 
and reinstall of this package yields no change and I get the following 
errors when trying to install this node in my cluster through Conga:

openais[3488]: [MAIN ] Error reading CCS info, cannot start
Jun 25 17:03:40 openais[3488]: [MAIN ] ?h??
Jun 25 17:03:40 openais[3488]: [MAIN ] AIS Executive exiting (-9).
Jun 25 17:04:06 ccsd[3447]: Unable to connect to cluster infrastructure

I'm sure the garbled log entry is not normal.  Any ideas?



Chris Harms wrote:
> After some trouble with my first go, I ended up uninstalling the 
> cluster packages and attempting to reinstall.  However, one of the 
> nodes apparently can't forget some info about the previous 
> installation.  After reinstalling ricci and luci, and doing create 
> cluster the nodes reboot as normal.  Upon reboot, the offending node 
> attempts to fence the other (I haven't even gotten to the point of 
> setting that up yet) while the other reports
>
> ccsd[3414]: Unable to connect to cluster infrastructure after XYZ 
> seconds.
>
> attempts to stop cman via service cman stop summarily fail and fenced 
> and ccsd have to be killed before it will succeed.
>
> Any ideas on what or where it would be storing information about the 
> old install, or how I should properly uninstall the software before 
> starting over?
>
> Thanks in advance,
> Chris
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From chris at cmiware.com  Mon Jun 25 22:59:53 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 25 Jun 2007 17:59:53 -0500
Subject: [Linux-cluster] trouble reinstalling cluster suite 5 (solved)
In-Reply-To: <46803DF6.6040002@cmiware.com>
References: <468020C9.8080208@cmiware.com> <46803DF6.6040002@cmiware.com>
Message-ID: <468048E9.3080208@cmiware.com>

For the sake of completeness it was a bad entry in /etc/hosts file

Chris Harms wrote:
> something is apparently wrong with openais on a node.  repeated 
> removal and reinstall of this package yields no change and I get the 
> following errors when trying to install this node in my cluster 
> through Conga:
>
> openais[3488]: [MAIN ] Error reading CCS info, cannot start
> Jun 25 17:03:40 openais[3488]: [MAIN ] ?h??
> Jun 25 17:03:40 openais[3488]: [MAIN ] AIS Executive exiting (-9).
> Jun 25 17:04:06 ccsd[3447]: Unable to connect to cluster infrastructure
>
> I'm sure the garbled log entry is not normal.  Any ideas?
>
>
>
> Chris Harms wrote:
>> After some trouble with my first go, I ended up uninstalling the 
>> cluster packages and attempting to reinstall.  However, one of the 
>> nodes apparently can't forget some info about the previous 
>> installation.  After reinstalling ricci and luci, and doing create 
>> cluster the nodes reboot as normal.  Upon reboot, the offending node 
>> attempts to fence the other (I haven't even gotten to the point of 
>> setting that up yet) while the other reports
>>
>> ccsd[3414]: Unable to connect to cluster infrastructure after XYZ 
>> seconds.
>>
>> attempts to stop cman via service cman stop summarily fail and fenced 
>> and ccsd have to be killed before it will succeed.
>>
>> Any ideas on what or where it would be storing information about the 
>> old install, or how I should properly uninstall the software before 
>> starting over?
>>
>> Thanks in advance,
>> Chris
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From pcaulfie at redhat.com  Tue Jun 26 07:32:35 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 26 Jun 2007 08:32:35 +0100
Subject: [Linux-cluster] trouble reinstalling cluster suite 5 (solved)
In-Reply-To: <468048E9.3080208@cmiware.com>
References: <468020C9.8080208@cmiware.com> <46803DF6.6040002@cmiware.com>
	<468048E9.3080208@cmiware.com>
Message-ID: <4680C113.1000207@redhat.com>

Chris Harms wrote:
> For the sake of completeness it was a bad entry in /etc/hosts file
> 

Do you still have the /etc/hosts file that caused this crash?

We have a bugzilla entry open for this bug, but no-one has manage to provide the
offending file for me to fix it.

-- 
Patrick

Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 ITE, UK.
Registered in England and Wales under Company Registration No. 3798903



From chris at cmiware.com  Tue Jun 26 15:09:39 2007
From: chris at cmiware.com (Chris Harms)
Date: Tue, 26 Jun 2007 10:09:39 -0500
Subject: [Linux-cluster] trouble reinstalling cluster suite 5 (solved)
In-Reply-To: <4680C113.1000207@redhat.com>
References: <468020C9.8080208@cmiware.com>
	<46803DF6.6040002@cmiware.com>	<468048E9.3080208@cmiware.com>
	<4680C113.1000207@redhat.com>
Message-ID: <46812C33.3070007@cmiware.com>

I believe it contained the host.domain.tld of the machine in the 
127.0.0.1 entry.

Patrick Caulfield wrote:
> Chris Harms wrote:
>   
>> For the sake of completeness it was a bad entry in /etc/hosts file
>>
>>     
>
> Do you still have the /etc/hosts file that caused this crash?
>
> We have a bugzilla entry open for this bug, but no-one has manage to provide the
> offending file for me to fix it.
>
>   



From johnson.eric at gmail.com  Tue Jun 26 18:47:46 2007
From: johnson.eric at gmail.com (eric johnson)
Date: Tue, 26 Jun 2007 14:47:46 -0400
Subject: [Linux-cluster] GFS2 - Simple script that seems to make GFS2 sad
Message-ID: <c44f27df0706261147k4b44754dnb1c4553c632e67fa@mail.gmail.com>

Hi -

I had experimented with GFS a few months back.  I'm interested in it,
but know that it isn't quite production worthy yet - at least not
quite for my needs.

Now that GFS2 is emerging, I thought I'd give it a quick try again
just to see how things were shaping up.

I've got a script that seems to make our installation sad...

Take this script

> cat foo.pl

my $i=0;
my $max=shift(@ARGV);
my $d=shift(@ARGV);
if (not defined $d) {
    $d="";
}
foreach(my $i=0;$i<$max;$i++) {
    my $filename=sprintf("%s-%d%s",rand()*100000,$i,$d);
    open FOO, ">$filename";
    for (my $j=0;$j<1500;++$j) {
        print FOO "This is fun!!\n";
    }
    close FOO;
}

Assuming a mount at /gfs

Queue up a good chunk of these - each working their own directory...

cd /gfs
mkdir foo1
cd foo1
perl -w ~/foo.pl 10000000 A &
cd ..
mkdir foo2
cd foo2
perl -w ~/foo.pl 10000000 A &
cd ..
mkdir foo3
cd foo3
perl -w ~/foo.pl 10000000 A &
cd ..
mkdir foo4
cd foo4
perl -w ~/foo.pl 10000000 A &
cd ..
mkdir foo5
cd foo5
perl -w ~/foo.pl 10000000 A &

After a few minutes, the mount seems to disappear.

> cd /gfs
-bash: cd: /gfs: Input/output error


It seems likely that I have something misconfigured...

-Eric



From johnson.eric at gmail.com  Tue Jun 26 19:24:03 2007
From: johnson.eric at gmail.com (eric johnson)
Date: Tue, 26 Jun 2007 15:24:03 -0400
Subject: [Linux-cluster] Re: GFS2 - Simple script that seems to make GFS2 sad
In-Reply-To: <c44f27df0706261147k4b44754dnb1c4553c632e67fa@mail.gmail.com>
References: <c44f27df0706261147k4b44754dnb1c4553c632e67fa@mail.gmail.com>
Message-ID: <c44f27df0706261224n604d01c8ke2f181c68348506b@mail.gmail.com>

A dmesg reveals this...

GFS2: fsid=gfs:gfs.0: gfs2_delete_inode: 13
GFS2: fsid=gfs:gfs.0: fatal: assertion "gfs2_glock_is_held_excl(gl)" failed
GFS2: fsid=gfs:gfs.0:   function = glock_lo_after_commit, file =
fs/gfs2/lops.c, line = 61

I'm a bit out of my element here so I may be highlighting a red herring.

-Eric



From chris at cmiware.com  Tue Jun 26 23:16:44 2007
From: chris at cmiware.com (Chris Harms)
Date: Tue, 26 Jun 2007 18:16:44 -0500
Subject: [Linux-cluster] manual fencing problem
Message-ID: <46819E5C.1090802@cmiware.com>

Trying to setup manual fencing for testing purposes in Conga gave me the 
following errors:

agent "fence_manual" reports: failed: fence_manual no node name

It appears this came up before:
http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html

but is still unresolved.

Cheers,
Chris



From rgrover1 at gmail.com  Wed Jun 27 03:54:07 2007
From: rgrover1 at gmail.com (Rohit Grover)
Date: Wed, 27 Jun 2007 15:54:07 +1200
Subject: [Linux-cluster] wishing to run GFS on iSCSI with redundancy
Message-ID: <C1E9042E-7DB2-41D3-84F0-7A1867F448D5@gmail.com>

Hello,

We'd like to run GFS in a cluster serviced by a pool of iSCSI disks.  
We would like to use RAID to add redundancy to the storage, but  
there's literature on the net saying that linux's MD driver is not  
cluster safe. Since CLVM doesn't support RAID, what options do we  
have other than pairing the iSCSI disks with DRBD?

thanks,
Rohit Grover.



From jprats at cesca.es  Wed Jun 27 11:16:51 2007
From: jprats at cesca.es (Jordi Prats)
Date: Wed, 27 Jun 2007 13:16:51 +0200
Subject: [Linux-cluster] crash GFS2
Message-ID: <46824723.8080704@cesca.es>

Hi,
I've got this crash using GFS2 and exporting it with NFS.

Jordi

Message from syslogd at urani at Wed Jun 27 06:57:50 2007 ...
urani kernel: ------------[ cut here ]------------

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: invalid opcode: 0000 [#1]

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: SMP

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: CPU:    0

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: EIP:    0061:[<ee1dc1e6>]    Not tainted VLI

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: EFLAGS: 00010282   (2.6.20-1.2952.fc6xen #1)

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: EIP is at gfs2_glock_nq+0xff/0x19d [gfs2]

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: eax: 00000020   ebx: eac37e88   ecx: ffffffff   edx: f5416000

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: esi: eac37d40   edi: dffc4f40   ebp: dffc4f40   esp: eac37d0c

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: ds: 007b   es: 007b   ss: 0069

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: Process nfsd (pid: 22673, ti=eac37000 task=d2f371b0 
task.ti=eac37000)

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: Stack: ee1f2324 00000002 00000001 e6b79000 00000000 
eac37d40 dcb05b0c dcb05d04

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel:        e6b79000 ee1e8c88 eac37d40 ee1dc4bd ffffffe4 
eac37d40 eac37d40 dffc4f40

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel:        00005891 00000001 00000002 00000000 00000402 
ee1e8c81 dcb05b0c ee1e8c3d

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel: Call Trace:

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel:  [<ee1e8c88>] gfs2_delete_inode+0x4b/0x14f [gfs2]

Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
urani kernel:  [<ee1dc4bd>] gfs2_holder_uninit+0xb/0x1b [gfs2]

Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
urani kernel:  [<ee1e8c81>] gfs2_delete_inode+0x44/0x14f [gfs2]

Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
urani kernel:  [<ee1e8c3d>] gfs2_delete_inode+0x0/0x14f [gfs2]

Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
urani kernel:  [<c047717c>] generic_delete_inode+0xa3/0x10b

Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
urani kernel:  [<c047681e>] iput+0x60/0x62

Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
urani kernel:  [<ee1ded52>] gfs2_createi+0xcb8/0xcf2 [gfs2]

Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
urani kernel:  [<c0417672>] __might_sleep+0x21/0xc1

Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
urani kernel:  [<ee1e7d6c>] gfs2_create+0x5d/0x101 [gfs2]

Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
urani kernel:  [<ee1de0f5>] gfs2_createi+0x5b/0xcf2 [gfs2]

Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
urani kernel:  [<ee1dc84a>] gfs2_glock_nq_num+0x3f/0x64 [gfs2]

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<c046d1b2>] vfs_create+0xca/0x134

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<ee2cf843>] nfsd_create_v3+0x27f/0x468 [nfsd]

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<ee2d461d>] nfsd3_proc_create+0x15e/0x16c [nfsd]

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<ee2ca200>] nfsd_dispatch+0xc5/0x180 [nfsd]

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<ee15de7b>] svcauth_unix_set_client+0x165/0x19a [sunrpc]

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<ee15b1c1>] svc_process+0x355/0x610 [sunrpc]

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<ee2ca73b>] nfsd+0x173/0x278 [nfsd]

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<ee2ca5c8>] nfsd+0x0/0x278 [nfsd]

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  [<c0405177>] kernel_thread_helper+0x7/0x10

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel:  =======================

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel: Code: 0c c7 04 24 17 23 1f ee 89 44 24 04 e8 28 1b 24 d2 
8b 47 2c 8b 57 14 89 44 24 08 89 54 24 04 c7 04 24 24 23 1f ee e8 0e 1b 
24 d2 <0f> 0b eb fe 39 58 0c 74 0e 89 d0 8b 10 0f 18 02 90 39 c8 75 ef

Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
urani kernel: EIP: [<ee1dc1e6>] gfs2_glock_nq+0xff/0x19d [gfs2] SS:ESP 
0069:eac37d0c

Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ...
urani kernel: Oops: 0000 [#2]

Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ...
urani kernel: SMP

Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ...
urani kernel: CPU:    0

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel: EIP:    0061:[<ee1e86bc>]    Not tainted VLI

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel: EFLAGS: 00010297   (2.6.20-1.2952.fc6xen #1)

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel: EIP is at gfs2_permission+0x36/0xc2 [gfs2]

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel: eax: 00000000   ebx: 00000000   ecx: 00000000   edx: ea533730

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel: esi: dfe10078   edi: dfe10094   ebp: e8c74b8c   esp: e6f4ae68

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel: ds: 007b   es: 007b   ss: 0069

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel: Process nfsd (pid: 22678, ti=e6f4a000 task=ea533730 
task.ti=e6f4a000)

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel: Stack: c0462732 00000003 0000000a 00000000 c042827c 
e6cea810 eb345400 e6f4aebc

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:        11270000 ee1e49fa ee2cc954 e8c74b8c ee1e8686 
00000000 00000003 c046cb7d

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:        00000003 e8c74b8c c678d140 11270000 ee2cd34b 
d49de1a8 e6cea804 c678d140

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel: Call Trace:

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:  [<c0462732>] kfree+0xe/0x6f

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:  [<c042827c>] set_current_groups+0x154/0x160

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:  [<ee1e49fa>] gfs2_decode_fh+0xe2/0xe9 [gfs2]

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:  [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:  [<ee1e8686>] gfs2_permission+0x0/0xc2 [gfs2]

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:  [<c046cb7d>] permission+0x9e/0xdb

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:  [<ee2cd34b>] nfsd_permission+0x87/0xd5 [nfsd]

Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
urani kernel:  [<ee2cce47>] fh_verify+0x434/0x519 [nfsd]

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<ee2d459b>] nfsd3_proc_create+0xdc/0x16c [nfsd]

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<ee2d271f>] nfsd_cache_lookup+0x1c7/0x2ab [nfsd]

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<c04282ca>] groups_alloc+0x42/0xae

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<ee2ca200>] nfsd_dispatch+0xc5/0x180 [nfsd]

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<ee15de7b>] svcauth_unix_set_client+0x165/0x19a [sunrpc]

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<ee15b1c1>] svc_process+0x355/0x610 [sunrpc]

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<c0404ff2>] hypervisor_callback+0x46/0x50

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<ee2ca73b>] nfsd+0x173/0x278 [nfsd]

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<ee2ca5c8>] nfsd+0x0/0x278 [nfsd]

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  [<c0405177>] kernel_thread_helper+0x7/0x10

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel:  =======================

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel: Code: 24 04 8b b0 f4 01 00 00 8d 7e 1c 89 f8 e8 69 96 42 
d2 8b 4e 44 eb 14 8b 41 0c 65 8b 15 08 00 00 00 3b 82 a8 00 00 00 74 11 
89 d9 <8b> 19 0f 18 03 90 8d 46 44 39 c1 75 df eb 41 89 f8 e8 20 96 42

Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
urani kernel: EIP: [<ee1e86bc>] gfs2_permission+0x36/0xc2 [gfs2] SS:ESP 
0069:e6f4ae68

-- 
......................................................................
         __
        / /          Jordi Prats
  C E / S / C A      Dept. de Sistemes
      /_/            Centre de Supercomputaci? de Catalunya

  Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona
  T. 93 205 6464 ? F.  93 205 6979 ? jprats at cesca.es
...................................................................... 



From jprats at cesca.es  Wed Jun 27 11:19:57 2007
From: jprats at cesca.es (Jordi Prats)
Date: Wed, 27 Jun 2007 13:19:57 +0200
Subject: [Linux-cluster] [Fwd: crash GFS2]
Message-ID: <468247DD.1080802@cesca.es>

Hi,
On the console I've found a lot of this messages:

Hope this helps!

Jordi

 =======================
BUG: soft lockup detected on CPU#0!
 [<c04404e7>] softlockup_tick+0xaa/0xc1
 [<c0408d4f>] timer_interrupt+0x552/0x59f
 [<c0441afe>] handle_level_irq+0xd1/0xdf
 [<c0611d6a>] _spin_lock_irqsave+0x12/0x17
 [<c0525a5e>] __add_entropy_words+0x56/0x18b
 [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
 [<c044075d>] handle_IRQ_event+0x1e/0x47
 [<c0441ac0>] handle_level_irq+0x93/0xdf
 [<c0441a2d>] handle_level_irq+0x0/0xdf
 [<c040686e>] do_IRQ+0xb5/0xdb
 [<c0549455>] evtchn_do_upcall+0x5f/0x97
 [<c0404ff2>] hypervisor_callback+0x46/0x50
 [<c04e007b>] simple_strtoul+0xab/0xc5
 [<c04e24f7>] _raw_spin_lock+0x67/0xd9
 [<ee1e86a3>] gfs2_permission+0x1d/0xc2 [gfs2]
 [<c0462732>] kfree+0xe/0x6f
 [<c042827c>] set_current_groups+0x154/0x160
 [<ee1e49fa>] gfs2_decode_fh+0xe2/0xe9 [gfs2]
 [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
 [<ee1e8686>] gfs2_permission+0x0/0xc2 [gfs2]
 [<c046cb7d>] permission+0x9e/0xdb
 [<ee2cd34b>] nfsd_permission+0x87/0xd5 [nfsd]
 [<ee2cce47>] fh_verify+0x434/0x519 [nfsd]
 [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
 [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
 [<ee2d459b>] nfsd3_proc_create+0xdc/0x16c [nfsd]
 [<ee2d271f>] nfsd_cache_lookup+0x1c7/0x2ab [nfsd]
 [<c04282ca>] groups_alloc+0x42/0xae
 [<ee2ca200>] nfsd_dispatch+0xc5/0x180 [nfsd]
 [<ee15de7b>] svcauth_unix_set_client+0x165/0x19a [sunrpc]
 [<ee15b1c1>] svc_process+0x355/0x610 [sunrpc]
 [<c0404ff2>] hypervisor_callback+0x46/0x50
 [<ee2ca73b>] nfsd+0x173/0x278 [nfsd]
 [<ee2ca5c8>] nfsd+0x0/0x278 [nfsd]
 [<c0405177>] kernel_thread_helper+0x7/0x10
 =======================

-- 
......................................................................
         __
        / /          Jordi Prats
  C E / S / C A      Dept. de Sistemes
      /_/            Centre de Supercomputaci? de Catalunya

  Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona
  T. 93 205 6464 ? F.  93 205 6979 ? jprats at cesca.es
...................................................................... 

-------------- next part --------------
An embedded message was scrubbed...
From: Jordi Prats <jprats at cesca.es>
Subject: crash GFS2
Date: Wed, 27 Jun 2007 13:16:51 +0200
Size: 10208
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070627/f7b3dcac/attachment.eml>

From brian at bcons.com  Wed Jun 27 13:11:25 2007
From: brian at bcons.com (Brian C. O'Berry)
Date: Wed, 27 Jun 2007 09:11:25 -0400
Subject: [Linux-cluster] Possible to Share Berkeley DB Environment via GFS?
Message-ID: <468261FD.1090906@bcons.com>

Is it possible to share a Berkeley DB environment, via a common GFS 
filesystem, between Concurrent Data Store applications running on 
different systems?

In reference to *remote* filesystems, the Berkeley DB Reference Guide 
states that

"Remote filesystems rarely support mapping files into process memory, 
and even more rarely support correct semantics for mutexes after the 
attempt succeeds. For this reason, we strongly recommend that the 
database environment directory reside in a local filesystem... For 
remote filesystems that do allow system files to be mapped into process 
memory, home directories accessed via remote filesystems cannot be used 
simultaneously from multiple clients. None of the commercial remote 
filesystems available today implement coherent, distributed shared 
memory for remote-mounted files. As a result, different machines will 
see different versions of these shared regions, and the system behavior 
is undefined."

Based on that, I'd expect sharing the environment (home) directory 
between systems to be infeasible, but I know little about GFS.  Can 
someone verify one way or the other whether such sharing is possible?

Thanks,

Brian



From wcheng at redhat.com  Wed Jun 27 14:33:02 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Wed, 27 Jun 2007 10:33:02 -0400
Subject: [Linux-cluster] [Fwd: crash GFS2]
In-Reply-To: <468247DD.1080802@cesca.es>
References: <468247DD.1080802@cesca.es>
Message-ID: <4682751E.8050907@redhat.com>

Jordi Prats wrote:
> Hi,
> On the console I've found a lot of this messages:

We'll have few critical GFS2-NFS fixes ready late today via Red Hat 
bugzilla 243136. If you like to help testing this out, let us know your 
kernel version...

-- Wendy
>
> =======================
> BUG: soft lockup detected on CPU#0!
> [<c04404e7>] softlockup_tick+0xaa/0xc1
> [<c0408d4f>] timer_interrupt+0x552/0x59f
> [<c0441afe>] handle_level_irq+0xd1/0xdf
> [<c0611d6a>] _spin_lock_irqsave+0x12/0x17
> [<c0525a5e>] __add_entropy_words+0x56/0x18b
> [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
> [<c044075d>] handle_IRQ_event+0x1e/0x47
> [<c0441ac0>] handle_level_irq+0x93/0xdf
> [<c0441a2d>] handle_level_irq+0x0/0xdf
> [<c040686e>] do_IRQ+0xb5/0xdb
> [<c0549455>] evtchn_do_upcall+0x5f/0x97
> [<c0404ff2>] hypervisor_callback+0x46/0x50
> [<c04e007b>] simple_strtoul+0xab/0xc5
> [<c04e24f7>] _raw_spin_lock+0x67/0xd9
> [<ee1e86a3>] gfs2_permission+0x1d/0xc2 [gfs2]
> [<c0462732>] kfree+0xe/0x6f
> [<c042827c>] set_current_groups+0x154/0x160
> [<ee1e49fa>] gfs2_decode_fh+0xe2/0xe9 [gfs2]
> [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
> [<ee1e8686>] gfs2_permission+0x0/0xc2 [gfs2]
> [<c046cb7d>] permission+0x9e/0xdb
> [<ee2cd34b>] nfsd_permission+0x87/0xd5 [nfsd]
> [<ee2cce47>] fh_verify+0x434/0x519 [nfsd]
> [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
> [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
> [<ee2d459b>] nfsd3_proc_create+0xdc/0x16c [nfsd]
> [<ee2d271f>] nfsd_cache_lookup+0x1c7/0x2ab [nfsd]
> [<c04282ca>] groups_alloc+0x42/0xae
> [<ee2ca200>] nfsd_dispatch+0xc5/0x180 [nfsd]
> [<ee15de7b>] svcauth_unix_set_client+0x165/0x19a [sunrpc]
> [<ee15b1c1>] svc_process+0x355/0x610 [sunrpc]
> [<c0404ff2>] hypervisor_callback+0x46/0x50
> [<ee2ca73b>] nfsd+0x173/0x278 [nfsd]
> [<ee2ca5c8>] nfsd+0x0/0x278 [nfsd]
> [<c0405177>] kernel_thread_helper+0x7/0x10
> =======================
>
>
> ------------------------------------------------------------------------
>
> Subject:
> crash GFS2
> From:
> Jordi Prats <jprats at cesca.es>
> Date:
> Wed, 27 Jun 2007 13:16:51 +0200
> To:
> linux-cluster at redhat.com
>
> To:
> linux-cluster at redhat.com
>
>
> Hi,
> I've got this crash using GFS2 and exporting it with NFS.
>
> Jordi
>
> Message from syslogd at urani at Wed Jun 27 06:57:50 2007 ...
> urani kernel: ------------[ cut here ]------------
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: invalid opcode: 0000 [#1]
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: SMP
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: CPU:    0
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: EIP:    0061:[<ee1dc1e6>]    Not tainted VLI
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: EFLAGS: 00010282   (2.6.20-1.2952.fc6xen #1)
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: EIP is at gfs2_glock_nq+0xff/0x19d [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: eax: 00000020   ebx: eac37e88   ecx: ffffffff   edx: 
> f5416000
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: esi: eac37d40   edi: dffc4f40   ebp: dffc4f40   esp: 
> eac37d0c
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: ds: 007b   es: 007b   ss: 0069
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: Process nfsd (pid: 22673, ti=eac37000 task=d2f371b0 
> task.ti=eac37000)
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: Stack: ee1f2324 00000002 00000001 e6b79000 00000000 
> eac37d40 dcb05b0c dcb05d04
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel:        e6b79000 ee1e8c88 eac37d40 ee1dc4bd ffffffe4 
> eac37d40 eac37d40 dffc4f40
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel:        00005891 00000001 00000002 00000000 00000402 
> ee1e8c81 dcb05b0c ee1e8c3d
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel: Call Trace:
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel:  [<ee1e8c88>] gfs2_delete_inode+0x4b/0x14f [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
> urani kernel:  [<ee1dc4bd>] gfs2_holder_uninit+0xb/0x1b [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
> urani kernel:  [<ee1e8c81>] gfs2_delete_inode+0x44/0x14f [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
> urani kernel:  [<ee1e8c3d>] gfs2_delete_inode+0x0/0x14f [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
> urani kernel:  [<c047717c>] generic_delete_inode+0xa3/0x10b
>
> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
> urani kernel:  [<c047681e>] iput+0x60/0x62
>
> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
> urani kernel:  [<ee1ded52>] gfs2_createi+0xcb8/0xcf2 [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
> urani kernel:  [<c0417672>] __might_sleep+0x21/0xc1
>
> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
> urani kernel:  [<ee1e7d6c>] gfs2_create+0x5d/0x101 [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
> urani kernel:  [<ee1de0f5>] gfs2_createi+0x5b/0xcf2 [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
> urani kernel:  [<ee1dc84a>] gfs2_glock_nq_num+0x3f/0x64 [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<c046d1b2>] vfs_create+0xca/0x134
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<ee2cf843>] nfsd_create_v3+0x27f/0x468 [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<ee2d461d>] nfsd3_proc_create+0x15e/0x16c [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<ee2ca200>] nfsd_dispatch+0xc5/0x180 [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<ee15de7b>] svcauth_unix_set_client+0x165/0x19a [sunrpc]
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<ee15b1c1>] svc_process+0x355/0x610 [sunrpc]
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<ee2ca73b>] nfsd+0x173/0x278 [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<ee2ca5c8>] nfsd+0x0/0x278 [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  [<c0405177>] kernel_thread_helper+0x7/0x10
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel:  =======================
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel: Code: 0c c7 04 24 17 23 1f ee 89 44 24 04 e8 28 1b 24 d2 
> 8b 47 2c 8b 57 14 89 44 24 08 89 54 24 04 c7 04 24 24 23 1f ee e8 0e 
> 1b 24 d2 <0f> 0b eb fe 39 58 0c 74 0e 89 d0 8b 10 0f 18 02 90 39 c8 75 ef
>
> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
> urani kernel: EIP: [<ee1dc1e6>] gfs2_glock_nq+0xff/0x19d [gfs2] SS:ESP 
> 0069:eac37d0c
>
> Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ...
> urani kernel: Oops: 0000 [#2]
>
> Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ...
> urani kernel: SMP
>
> Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ...
> urani kernel: CPU:    0
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel: EIP:    0061:[<ee1e86bc>]    Not tainted VLI
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel: EFLAGS: 00010297   (2.6.20-1.2952.fc6xen #1)
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel: EIP is at gfs2_permission+0x36/0xc2 [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel: eax: 00000000   ebx: 00000000   ecx: 00000000   edx: 
> ea533730
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel: esi: dfe10078   edi: dfe10094   ebp: e8c74b8c   esp: 
> e6f4ae68
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel: ds: 007b   es: 007b   ss: 0069
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel: Process nfsd (pid: 22678, ti=e6f4a000 task=ea533730 
> task.ti=e6f4a000)
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel: Stack: c0462732 00000003 0000000a 00000000 c042827c 
> e6cea810 eb345400 e6f4aebc
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:        11270000 ee1e49fa ee2cc954 e8c74b8c ee1e8686 
> 00000000 00000003 c046cb7d
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:        00000003 e8c74b8c c678d140 11270000 ee2cd34b 
> d49de1a8 e6cea804 c678d140
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel: Call Trace:
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:  [<c0462732>] kfree+0xe/0x6f
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:  [<c042827c>] set_current_groups+0x154/0x160
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:  [<ee1e49fa>] gfs2_decode_fh+0xe2/0xe9 [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:  [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:  [<ee1e8686>] gfs2_permission+0x0/0xc2 [gfs2]
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:  [<c046cb7d>] permission+0x9e/0xdb
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:  [<ee2cd34b>] nfsd_permission+0x87/0xd5 [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
> urani kernel:  [<ee2cce47>] fh_verify+0x434/0x519 [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<ee2d459b>] nfsd3_proc_create+0xdc/0x16c [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<ee2d271f>] nfsd_cache_lookup+0x1c7/0x2ab [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<c04282ca>] groups_alloc+0x42/0xae
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<ee2ca200>] nfsd_dispatch+0xc5/0x180 [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<ee15de7b>] svcauth_unix_set_client+0x165/0x19a [sunrpc]
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<ee15b1c1>] svc_process+0x355/0x610 [sunrpc]
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<c0404ff2>] hypervisor_callback+0x46/0x50
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<ee2ca73b>] nfsd+0x173/0x278 [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<ee2ca5c8>] nfsd+0x0/0x278 [nfsd]
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  [<c0405177>] kernel_thread_helper+0x7/0x10
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel:  =======================
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel: Code: 24 04 8b b0 f4 01 00 00 8d 7e 1c 89 f8 e8 69 96 42 
> d2 8b 4e 44 eb 14 8b 41 0c 65 8b 15 08 00 00 00 3b 82 a8 00 00 00 74 
> 11 89 d9 <8b> 19 0f 18 03 90 8d 46 44 39 c1 75 df eb 41 89 f8 e8 20 96 42
>
> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
> urani kernel: EIP: [<ee1e86bc>] gfs2_permission+0x36/0xc2 [gfs2] 
> SS:ESP 0069:e6f4ae68
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From laule75 at yahoo.fr  Wed Jun 27 15:42:17 2007
From: laule75 at yahoo.fr (laurent)
Date: Wed, 27 Jun 2007 15:42:17 +0000 (GMT)
Subject: [Linux-cluster] Redhat cluster limits
Message-ID: <200548.29516.qm@web26510.mail.ukl.yahoo.com>

Hello,

I'm trying to test extensively redhat Cluster Suite and GFS  in order to replace eventually Veritas cluster products.

Unfortunately,  we've
experienced problems  when we try to play with a big number of
filesystems and services (around 100) ==>  high CPU consumption to
monitor resources, memory allocation problems when creating lot of
logical volumes, etc. ; 

Have you ever experienced such kind of issues ? 

I
would be glad to ear any feedback from people who have carried out some
stress tests on the redhat cluster suites (maximum number of
services/resources created, max number of volume group and logical
volumes, etc etc...)

Kind regards

Laurent





      _____________________________________________________________________________ 
Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers Yahoo! Mail 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070627/e214115c/attachment.htm>

From rpeterso at redhat.com  Wed Jun 27 16:29:20 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Wed, 27 Jun 2007 11:29:20 -0500
Subject: [Linux-cluster] Redhat cluster limits
In-Reply-To: <200548.29516.qm@web26510.mail.ukl.yahoo.com>
References: <200548.29516.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <1182961760.11507.27.camel@technetium.msp.redhat.com>

On Wed, 2007-06-27 at 15:42 +0000, laurent wrote:
> Hello,
> 
> I'm trying to test extensively redhat Cluster Suite and GFS  in order
> to replace eventually Veritas cluster products.
> 
> Unfortunately,  we've experienced problems  when we try to play with a
> big number of filesystems and services (around 100) ==>  high CPU
> consumption to monitor resources, memory allocation problems when
> creating lot of logical volumes, etc. ; 
> 
> Have you ever experienced such kind of issues ? 
> 
> I would be glad to ear any feedback from people who have carried out
> some stress tests on the redhat cluster suites (maximum number of
> services/resources created, max number of volume group and logical
> volumes, etc etc...)
> 
> Kind regards
> 
> Laurent

Hi Laurent,

What version of cluster suite/RHEL/Centos/etc., are you using?
I suggest opening up bugzillas against each of the problems you've seen.

I can't speak for Red Hat or any other developer, but my take is this:
Rather than ask if cluster suite can do the job, let's get the
problems solved so it can do the job.
If there's a problem, we want to hear about them.  If we need to
fix the code, that makes it better for everyone to use.

Regards,

Bob Peterson
Red Hat Cluster Suite




From jprats at cesca.es  Wed Jun 27 18:38:21 2007
From: jprats at cesca.es (Jordi Prats)
Date: Wed, 27 Jun 2007 20:38:21 +0200
Subject: [Linux-cluster] [Fwd: crash GFS2]
In-Reply-To: <4682751E.8050907@redhat.com>
References: <468247DD.1080802@cesca.es> <4682751E.8050907@redhat.com>
Message-ID: <4682AE9D.1070103@cesca.es>

Hi,
I'm using a xen guest kernel 2.6.20 on a Fedora Core 6

My uname is:

Linux urani 2.6.20-1.2952.fc6xen #1 SMP Wed May 16 19:19:04 EDT 2007
i686 i686 i386 GNU/Linux

Jordi

Wendy Cheng wrote:
> Jordi Prats wrote:
>> Hi,
>> On the console I've found a lot of this messages:
> 
> We'll have few critical GFS2-NFS fixes ready late today via Red Hat
> bugzilla 243136. If you like to help testing this out, let us know your
> kernel version...
> 
> -- Wendy
>>
>> =======================
>> BUG: soft lockup detected on CPU#0!
>> [<c04404e7>] softlockup_tick+0xaa/0xc1
>> [<c0408d4f>] timer_interrupt+0x552/0x59f
>> [<c0441afe>] handle_level_irq+0xd1/0xdf
>> [<c0611d6a>] _spin_lock_irqsave+0x12/0x17
>> [<c0525a5e>] __add_entropy_words+0x56/0x18b
>> [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
>> [<c044075d>] handle_IRQ_event+0x1e/0x47
>> [<c0441ac0>] handle_level_irq+0x93/0xdf
>> [<c0441a2d>] handle_level_irq+0x0/0xdf
>> [<c040686e>] do_IRQ+0xb5/0xdb
>> [<c0549455>] evtchn_do_upcall+0x5f/0x97
>> [<c0404ff2>] hypervisor_callback+0x46/0x50
>> [<c04e007b>] simple_strtoul+0xab/0xc5
>> [<c04e24f7>] _raw_spin_lock+0x67/0xd9
>> [<ee1e86a3>] gfs2_permission+0x1d/0xc2 [gfs2]
>> [<c0462732>] kfree+0xe/0x6f
>> [<c042827c>] set_current_groups+0x154/0x160
>> [<ee1e49fa>] gfs2_decode_fh+0xe2/0xe9 [gfs2]
>> [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
>> [<ee1e8686>] gfs2_permission+0x0/0xc2 [gfs2]
>> [<c046cb7d>] permission+0x9e/0xdb
>> [<ee2cd34b>] nfsd_permission+0x87/0xd5 [nfsd]
>> [<ee2cce47>] fh_verify+0x434/0x519 [nfsd]
>> [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
>> [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
>> [<ee2d459b>] nfsd3_proc_create+0xdc/0x16c [nfsd]
>> [<ee2d271f>] nfsd_cache_lookup+0x1c7/0x2ab [nfsd]
>> [<c04282ca>] groups_alloc+0x42/0xae
>> [<ee2ca200>] nfsd_dispatch+0xc5/0x180 [nfsd]
>> [<ee15de7b>] svcauth_unix_set_client+0x165/0x19a [sunrpc]
>> [<ee15b1c1>] svc_process+0x355/0x610 [sunrpc]
>> [<c0404ff2>] hypervisor_callback+0x46/0x50
>> [<ee2ca73b>] nfsd+0x173/0x278 [nfsd]
>> [<ee2ca5c8>] nfsd+0x0/0x278 [nfsd]
>> [<c0405177>] kernel_thread_helper+0x7/0x10
>> =======================
>>
>>
>> ------------------------------------------------------------------------
>>
>> Subject:
>> crash GFS2
>> From:
>> Jordi Prats <jprats at cesca.es>
>> Date:
>> Wed, 27 Jun 2007 13:16:51 +0200
>> To:
>> linux-cluster at redhat.com
>>
>> To:
>> linux-cluster at redhat.com
>>
>>
>> Hi,
>> I've got this crash using GFS2 and exporting it with NFS.
>>
>> Jordi
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:50 2007 ...
>> urani kernel: ------------[ cut here ]------------
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: invalid opcode: 0000 [#1]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: SMP
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: CPU:    0
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: EIP:    0061:[<ee1dc1e6>]    Not tainted VLI
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: EFLAGS: 00010282   (2.6.20-1.2952.fc6xen #1)
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: EIP is at gfs2_glock_nq+0xff/0x19d [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: eax: 00000020   ebx: eac37e88   ecx: ffffffff   edx:
>> f5416000
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: esi: eac37d40   edi: dffc4f40   ebp: dffc4f40   esp:
>> eac37d0c
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: ds: 007b   es: 007b   ss: 0069
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: Process nfsd (pid: 22673, ti=eac37000 task=d2f371b0
>> task.ti=eac37000)
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: Stack: ee1f2324 00000002 00000001 e6b79000 00000000
>> eac37d40 dcb05b0c dcb05d04
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel:        e6b79000 ee1e8c88 eac37d40 ee1dc4bd ffffffe4
>> eac37d40 eac37d40 dffc4f40
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel:        00005891 00000001 00000002 00000000 00000402
>> ee1e8c81 dcb05b0c ee1e8c3d
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel: Call Trace:
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel:  [<ee1e8c88>] gfs2_delete_inode+0x4b/0x14f [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:51 2007 ...
>> urani kernel:  [<ee1dc4bd>] gfs2_holder_uninit+0xb/0x1b [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
>> urani kernel:  [<ee1e8c81>] gfs2_delete_inode+0x44/0x14f [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
>> urani kernel:  [<ee1e8c3d>] gfs2_delete_inode+0x0/0x14f [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
>> urani kernel:  [<c047717c>] generic_delete_inode+0xa3/0x10b
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
>> urani kernel:  [<c047681e>] iput+0x60/0x62
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
>> urani kernel:  [<ee1ded52>] gfs2_createi+0xcb8/0xcf2 [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
>> urani kernel:  [<c0417672>] __might_sleep+0x21/0xc1
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
>> urani kernel:  [<ee1e7d6c>] gfs2_create+0x5d/0x101 [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
>> urani kernel:  [<ee1de0f5>] gfs2_createi+0x5b/0xcf2 [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:52 2007 ...
>> urani kernel:  [<ee1dc84a>] gfs2_glock_nq_num+0x3f/0x64 [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<c046d1b2>] vfs_create+0xca/0x134
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<ee2cf843>] nfsd_create_v3+0x27f/0x468 [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<ee2d461d>] nfsd3_proc_create+0x15e/0x16c [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<ee2ca200>] nfsd_dispatch+0xc5/0x180 [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<ee15de7b>] svcauth_unix_set_client+0x165/0x19a [sunrpc]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<ee15b1c1>] svc_process+0x355/0x610 [sunrpc]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<ee2ca73b>] nfsd+0x173/0x278 [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<ee2ca5c8>] nfsd+0x0/0x278 [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  [<c0405177>] kernel_thread_helper+0x7/0x10
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel:  =======================
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel: Code: 0c c7 04 24 17 23 1f ee 89 44 24 04 e8 28 1b 24 d2
>> 8b 47 2c 8b 57 14 89 44 24 08 89 54 24 04 c7 04 24 24 23 1f ee e8 0e
>> 1b 24 d2 <0f> 0b eb fe 39 58 0c 74 0e 89 d0 8b 10 0f 18 02 90 39 c8 75 ef
>>
>> Message from syslogd at urani at Wed Jun 27 06:57:53 2007 ...
>> urani kernel: EIP: [<ee1dc1e6>] gfs2_glock_nq+0xff/0x19d [gfs2] SS:ESP
>> 0069:eac37d0c
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ...
>> urani kernel: Oops: 0000 [#2]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ...
>> urani kernel: SMP
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:50 2007 ...
>> urani kernel: CPU:    0
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel: EIP:    0061:[<ee1e86bc>]    Not tainted VLI
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel: EFLAGS: 00010297   (2.6.20-1.2952.fc6xen #1)
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel: EIP is at gfs2_permission+0x36/0xc2 [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel: eax: 00000000   ebx: 00000000   ecx: 00000000   edx:
>> ea533730
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel: esi: dfe10078   edi: dfe10094   ebp: e8c74b8c   esp:
>> e6f4ae68
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel: ds: 007b   es: 007b   ss: 0069
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel: Process nfsd (pid: 22678, ti=e6f4a000 task=ea533730
>> task.ti=e6f4a000)
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel: Stack: c0462732 00000003 0000000a 00000000 c042827c
>> e6cea810 eb345400 e6f4aebc
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:        11270000 ee1e49fa ee2cc954 e8c74b8c ee1e8686
>> 00000000 00000003 c046cb7d
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:        00000003 e8c74b8c c678d140 11270000 ee2cd34b
>> d49de1a8 e6cea804 c678d140
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel: Call Trace:
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:  [<c0462732>] kfree+0xe/0x6f
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:  [<c042827c>] set_current_groups+0x154/0x160
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:  [<ee1e49fa>] gfs2_decode_fh+0xe2/0xe9 [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:  [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:  [<ee1e8686>] gfs2_permission+0x0/0xc2 [gfs2]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:  [<c046cb7d>] permission+0x9e/0xdb
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:  [<ee2cd34b>] nfsd_permission+0x87/0xd5 [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:51 2007 ...
>> urani kernel:  [<ee2cce47>] fh_verify+0x434/0x519 [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<ee2cc954>] nfsd_acceptable+0x0/0xbf [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<c0611dcf>] _spin_unlock_irqrestore+0x8/0x16
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<ee2d459b>] nfsd3_proc_create+0xdc/0x16c [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<ee2d271f>] nfsd_cache_lookup+0x1c7/0x2ab [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<c04282ca>] groups_alloc+0x42/0xae
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<ee2ca200>] nfsd_dispatch+0xc5/0x180 [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<ee15de7b>] svcauth_unix_set_client+0x165/0x19a [sunrpc]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<ee15b1c1>] svc_process+0x355/0x610 [sunrpc]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<c0404ff2>] hypervisor_callback+0x46/0x50
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<ee2ca73b>] nfsd+0x173/0x278 [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<ee2ca5c8>] nfsd+0x0/0x278 [nfsd]
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  [<c0405177>] kernel_thread_helper+0x7/0x10
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel:  =======================
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel: Code: 24 04 8b b0 f4 01 00 00 8d 7e 1c 89 f8 e8 69 96 42
>> d2 8b 4e 44 eb 14 8b 41 0c 65 8b 15 08 00 00 00 3b 82 a8 00 00 00 74
>> 11 89 d9 <8b> 19 0f 18 03 90 8d 46 44 39 c1 75 df eb 41 89 f8 e8 20 96 42
>>
>> Message from syslogd at urani at Wed Jun 27 07:00:52 2007 ...
>> urani kernel: EIP: [<ee1e86bc>] gfs2_permission+0x36/0xc2 [gfs2]
>> SS:ESP 0069:e6f4ae68
>>
>> ------------------------------------------------------------------------
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 

-- 
......................................................................
        __
       / /          Jordi Prats Catal?
 C E / S / C A      Departament de Sistemes
     /_/            Centre de Supercomputaci? de Catalunya

 Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona
 T. 93 205 6464 ? F.  93 205 6979 ? jprats at cesca.es
......................................................................
pgp:0x5D0D1321
......................................................................



From nrbwpi at gmail.com  Wed Jun 27 22:35:57 2007
From: nrbwpi at gmail.com (nrbwpi at gmail.com)
Date: Wed, 27 Jun 2007 18:35:57 -0400
Subject: [Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing
In-Reply-To: <1181201333.25918.229.camel@quoit>
References: <6eee34430706061627p69a8080lf39168322926a769@mail.gmail.com>
	<1181201333.25918.229.camel@quoit>
Message-ID: <6eee34430706271535jd4b13c0hf5b99219db4db6ec@mail.gmail.com>

Thanks for your reply

I switched the hardware over to Fedora core 6, brought the system up2date,
and configured it the same as before with GFS2. Uname returns the following
kernel string: "Linux fu2 2.6.20-1.2952.fc6 #1 SMP Wed May 16 18:18:22 EDT
2007 x86_64 x86_64 x86_64 GNU/Linux".

The same fencing occurred after several hours of writing zeros to the volume
with dd in 250MB files.  This time, however, I noticed a kernel panic on the
fenced node.  The kernel output in /var/log/messages is below.  Could this
be a hardware configuration issue, or a bug in the kernel?



#####################################



Kernel panic



#####################################



Jun 26 10:00:41 fu2 kernel: ------------[ cut here ]------------

Jun 26 10:00:41 fu2 kernel: kernel BUG at lib/list_debug.c:67!

Jun 26 10:00:41 fu2 kernel: invalid opcode: 0000 [1] SMP

Jun 26 10:00:41 fu2 kernel: last sysfs file: /devices/pci0000:00/0000:00:
02.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/irq

Jun 26 10:00:41 fu2 kernel: CPU 7Jun 26 10:00:41 fu2 kernel: Modules linked
in: lock_dlm gfs2 dlm configfs ipt_MASQUERADE iptable_nat nf_nat
nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink ipt_REJECT xt_tcpudp
iptable_filter ip_tables x_tables bridge autofs4 hidp xfs rfcomm l2cap
bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core
ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath video sbs
i2c_ec i2c_core dock button battery asus_acpi backlight ac parport_pc lp
parport sg ata_piix libata pcspkr bnx2 ide_cd cdrom serio_raw dm_snapshot
dm_zero dm_mirror dm_mod lpfc scsi_transport_fc shpchp megaraid_sas sd_mod
scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd

Jun 26 10:00:41 fu2 kernel: Pid: 4142, comm: gfs2_logd Not tainted
2.6.20-1.2952.fc6 #1

Jun 26 10:00:41 fu2 kernel: RIP: 0010:[<ffffffff80341368>]
[<ffffffff80341368>] list_del+0x21/0x5b

Jun 26 10:00:41 fu2 kernel: RSP: 0018:ffff81011e247d00  EFLAGS: 00010082

Jun 26 10:00:41 fu2 kernel: RAX: 0000000000000058 RBX: ffff81011aa40000 RCX:
ffffffff8057fc58

Jun 26 10:00:41 fu2 kernel: RDX: ffffffff8057fc58 RSI: 0000000000000000 RDI:
ffffffff8057fc40

Jun 26 10:00:41 fu2 kernel: RBP: ffff81012da3f7c0 R08: ffffffff8057fc58 R09:
0000000000000001

Jun 26 10:00:41 fu2 kernel: R10: 0000000000000000 R11: ffff81012fd9d0c0 R12:
ffff81011aa40f70

Jun 26 10:00:41 fu2 kernel: R13: ffff810123fb1a00 R14: ffff810123fb05d8 R15:
0000000000000036

Jun 26 10:00:41 fu2 kernel: FS:  0000000000000000(0000)
GS:ffff81012fdb47c0(0000) knlGS:0000000000000000

Jun 26 10:00:41 fu2 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b

Jun 26 10:00:41 fu2 kernel: CR2: 00002aaaadfbe008 CR3: 0000000042c20000 CR4:
00000000000006e0

Jun 26 10:00:41 fu2 kernel: Process gfs2_logd (pid: 4142, threadinfo
ffff81011e246000, task ffff810121d35800)

Jun 26 10:00:41 fu2 kernel: Stack:  ffff810123fb1a00 ffffffff802cc6e7
0000003c00000000 ffff81012da3f7c0

Jun 26 10:00:41 fu2 kernel:  000000000000003c ffff810123fb0400
0000000000000000 ffff810123fb1a00

Jun 26 10:00:41 fu2 kernel:  ffff81012da3f800 ffffffff802cc8be
ffff810123fb07e8 ffff810123fb0400

Jun 26 10:00:41 fu2 kernel: Call Trace:

Jun 26 10:00:41 fu2 kernel:  [<ffffffff802cc6e7>] free_block+0xb1/0x142

Jun 26 10:00:41 fu2 kernel:  [<ffffffff802cc8be>] cache_flusharray+0x7d/0xb1

Jun 26 10:00:41 fu2 kernel:  [<ffffffff8020765f>]
kmem_cache_free+0x1ef/0x20c

Jun 26 10:00:41 fu2 kernel:  [<ffffffff88445628>]
:gfs2:databuf_lo_before_commit+0x576/0x5c6

Jun 26 10:00:41 fu2 kernel:  [<ffffffff88443acf>]
:gfs2:gfs2_log_flush+0x11e/0x2d3

Jun 26 10:00:41 fu2 kernel:  [<ffffffff88438310>] :gfs2:gfs2_logd+0xab/0x15b

Jun 26 10:00:41 fu2 kernel:  [<ffffffff88438265>] :gfs2:gfs2_logd+0x0/0x15b

Jun 26 10:00:41 fu2 kernel:  [<ffffffff80297a1e>]
keventd_create_kthread+0x0/0x6a

Jun 26 10:00:41 fu2 kernel:  [<ffffffff802318bd>] kthread+0xd0/0xff

Jun 26 10:00:41 fu2 kernel:  [<ffffffff8025aec8>] child_rip+0xa/0x12

Jun 26 10:00:41 fu2 kernel:  [<ffffffff80297a1e>]
keventd_create_kthread+0x0/0x6a

Jun 26 10:00:41 fu2 kernel:  [<ffffffff802317ed>] kthread+0x0/0xff

Jun 26 10:00:41 fu2 kernel:  [<ffffffff8025aebe>] child_rip+0x0/0x12

Jun 26 10:00:41 fu2 kernel:

Jun 26 10:00:41 fu2 kernel:

Jun 26 10:00:41 fu2 kernel: Code: 0f 0b eb fe 48 8b 07 48 8b 50 08 48 39 fa
74 12 48 c7 c7 97

Jun 26 10:00:41 fu2 kernel: RIP  [<ffffffff80341368>] list_del+0x21/0x5b

Jun 26 10:00:41 fu2 kernel:  RSP <ffff81011e247d00>


On 6/7/07, Steven Whitehouse <swhiteho at redhat.com> wrote:
>
> Hi,
>
> The version of GFS2 in RHEL5 is rather old. Please use Fedora, the
> upstream kernel or wait until RHEL 5.1 is out. This should solve the
> problem that you are seeing,
>
> Steve.
>
> On Wed, 2007-06-06 at 19:27 -0400, nrbwpi at gmail.com wrote:
> > Hello,
> >
> > Installed RHEL5 on a new two node cluster with Shared FC storage.  The
> > two shared storage boxes are each split into 6.9TB LUNs for a total of
> > 4 - 6.9TB LUNS.  Each machine is connected via a single 100Mb
> > connection to a switch and a single FC connection to a FC switch.
> >
> > The 4 LUNs have LVM on them with GFS2.  The file systems are mountable
> > from each box.  When performing a script dd write of zeros in 250MB
> > file sizes to the file system from each box to different LUNS, one of
> > the nodes in the cluster is fenced by the other one.  File size does
> > not seem to matter.
> >
> > My first guess at the problem was the heartbeat timeout in openais.
> > In the cluster.conf below I added the totem line to hopefully raise
> > the timeout to 10 seconds.  This however did not resolve the problem.
> > Both boxes are running the latest updates as of 2 days ago from
> > up2date.
> >
> > Below is the cluster.conf and what is seen in the logs.  Any
> > suggestions would be greatly appreciated.
> >
> > Thanks!
> >
> > Neal
> >
> >
> >
> > ##########################################
> >
> > Cluster.conf
> >
> > ##########################################
> >
> >
> > <?xml version="1.0"?>
> > <cluster alias="storage1" config_version="4" name="storage1">
> >         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >         <clusternodes>
> >                 <clusternode name="fu1" nodeid="1" votes="1">
> >                         <fence>
> >                                 <method name="1">
> >                                         <device name="apc4" port="1"
> > switch="1"/>
> >                                 </method>
> >                         </fence>
> >                         <multicast addr=" 224.10.10.10"
> > interface="eth0"/>
> >                 </clusternode>
> >                 <clusternode name="fu2" nodeid="2" votes="1">
> >                         <fence>
> >                                 <method name="1">
> >                                         <device name="apc4" port="2"
> > switch="1"/>
> >                                 </method>
> >                         </fence>
> >                         <multicast addr="224.10.10.10"
> > interface="eth0"/>
> >                 </clusternode>
> >         </clusternodes>
> >         <cman expected_votes="1" two_node="1">
> >                 <multicast addr="224.10.10.10"/>
> >                 <totem token="10000"/>
> >         </cman>
> >         <fencedevices>
> >                 <fencedevice agent="fence_apc" ipaddr="192.168.14.193"
> > login="apc" name="apc4" passwd="apc"/>
> >         </fencedevices>
> >         <rm>
> >                 <failoverdomains/>
> >                 <resources/>
> >         </rm>
> > </cluster>
> >
> >
> > #####################################################
> >
> > /var/log/messages
> >
> > #####################################################
> >
> > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the
> > OPERATIONAL state.
> > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket
> > recv buffer size (262142 bytes).
> > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket
> > send buffer size (262142 bytes).
> > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from
> > 2.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from
> > 0.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token
> > because I am the rep.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high
> > seq received 6e
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member
> > 192.168.14.195:
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep
> > 192.168.14.195
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e
> > received flag 0
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate
> > any messages in recovery.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for
> > ring 14
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
> > Jun  5 20:19:34 fu1 kernel: dlm: closing connection to node 2
> > Jun  5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec
> > post_fail_delay
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0)
> > ip(192.168.14.195)
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
> > Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2"
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0)
> > ip(192.168.14.197)
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
> > Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the
> > primary component and will provide service.
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0)
> > ip(192.168.14.195)
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
> > Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the
> > primary component and will provide service.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state.
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] got nodejoin message
> > 192.168.14.195
> > Jun  5 20:19:34 fu1 openais[5351]: [CPG  ] got joinlist message from
> > node 1
> > Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> > Trying to acquire journal lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> > Trying to acquire journal lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> > Looking at journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> > Trying to acquire journal lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> > Trying to acquire journal lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> > Looking at journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> > Looking at journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> > Looking at journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> > Acquiring the transaction lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> > Replaying journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> > Replayed 0 of 0 blocks
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> > Found 0 revoke tags
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> > Journal replayed in 1s
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> > Done
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> > Acquiring the transaction lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> > Replaying journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> > Replayed 0 of 0 blocks
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> > Found 0 revoke tags
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> > Journal replayed in 1s
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> > Done
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> > Acquiring the transaction lock...
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> > Acquiring the transaction lock...
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> > Replaying journal...
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> > Replayed 222 of 223 blocks
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> > Found 1 revoke tags
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> > Journal replayed in 1s
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> > Done
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> > Replaying journal...
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> > Replayed 438 of 439 blocks
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> > Found 1 revoke tags
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> > Journal replayed in 1s
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> > Done
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070627/d74d2fe0/attachment.htm>

From swhiteho at redhat.com  Wed Jun 27 23:40:13 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Thu, 28 Jun 2007 00:40:13 +0100
Subject: [Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing
In-Reply-To: <6eee34430706271535jd4b13c0hf5b99219db4db6ec@mail.gmail.com>
References: <6eee34430706061627p69a8080lf39168322926a769@mail.gmail.com>
	<1181201333.25918.229.camel@quoit>
	<6eee34430706271535jd4b13c0hf5b99219db4db6ec@mail.gmail.com>
Message-ID: <1182987613.3386.36.camel@localhost.localdomain>

Hi,

On Wed, 2007-06-27 at 18:35 -0400, nrbwpi at gmail.com wrote:
> Thanks for your reply
> 
> I switched the hardware over to Fedora core 6, brought the system
> up2date, and configured it the same as before with GFS2. Uname returns
> the following kernel string: "Linux fu2 2.6.20-1.2952.fc6 #1 SMP Wed
> May 16 18:18:22 EDT 2007  86_64 x86_64 x86_64 GNU/Linux". 
> 
> The same fencing occurred after several hours of writing zeros to the
> volume with dd in 250MB files.  This time, however, I noticed a kernel
> panic on the fenced node.  The kernel output in /var/log/messages is
> below.  Could this be a hardware configuration issue, or a bug in the
> kernel?
> 
Its a kernel bug. We are currently working on fixing something in the
same area, so it might be that you've tripped over the same thing, or
something related anyway. There are also a few patches (quite recent,
again) which are in the git tree, but haven't made it into FC-6 yet, so
it might also be one of those that will fix the problem. I'll try and
get another set of update patches done shortly - I'm out of the office
at that moment which makes such things a bit slower than usual I'm
afraid.

If you are able to test the current GFS2 git tree kernel and you are
still having the problem, then please report it through the Red Hat
bugzilla,

Steve.

>  
> 
> #####################################
> 
>  
> 
> Kernel panic
> 
>  
> 
> #####################################
> 
>  
> 
> Jun 26 10:00:41 fu2 kernel: ------------[ cut here ]------------
> 
> Jun 26 10:00:41 fu2 kernel: kernel BUG at lib/list_debug.c:67!
> 
> Jun 26 10:00:41 fu2 kernel: invalid opcode: 0000 [1] SMP
> 
> Jun 26 10:00:41 fu2 kernel: last sysfs
> file: /devices/pci0000:00/0000:00:02.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/irq
> 
> Jun 26 10:00:41 fu2 kernel: CPU 7Jun 26 10:00:41 fu2 kernel: Modules
> linked in: lock_dlm gfs2 dlm configfs ipt_MASQUERADE iptable_nat
> nf_nat nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink ipt_REJECT
> xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp xfs
> rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa
> ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi
> dm_multipath video sbs i2c_ec i2c_core dock button battery asus_acpi
> backlight ac parport_pc lp parport sg ata_piix libata pcspkr bnx2
> ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod lpfc
> scsi_transport_fc shpchp megaraid_sas sd_mod scsi_mod ext3 jbd
> ehci_hcd ohci_hcd uhci_hcd
> 
> Jun 26 10:00:41 fu2 kernel: Pid: 4142, comm: gfs2_logd Not tainted
> 2.6.20-1.2952.fc6 #1
> 
> Jun 26 10:00:41 fu2 kernel: RIP: 0010:[<ffffffff80341368>]
> [<ffffffff80341368>] list_del+0x21/0x5b
> 
> Jun 26 10:00:41 fu2 kernel: RSP: 0018:ffff81011e247d00  EFLAGS:
> 00010082
> 
> Jun 26 10:00:41 fu2 kernel: RAX: 0000000000000058 RBX:
> ffff81011aa40000 RCX: ffffffff8057fc58
> 
> Jun 26 10:00:41 fu2 kernel: RDX: ffffffff8057fc58 RSI:
> 0000000000000000 RDI: ffffffff8057fc40
> 
> Jun 26 10:00:41 fu2 kernel: RBP: ffff81012da3f7c0 R08:
> ffffffff8057fc58 R09: 0000000000000001
> 
> Jun 26 10:00:41 fu2 kernel: R10: 0000000000000000 R11:
> ffff81012fd9d0c0 R12: ffff81011aa40f70
> 
> Jun 26 10:00:41 fu2 kernel: R13: ffff810123fb1a00 R14:
> ffff810123fb05d8 R15: 0000000000000036
> 
> Jun 26 10:00:41 fu2 kernel: FS:  0000000000000000(0000)
> GS:ffff81012fdb47c0(0000) knlGS:0000000000000000
> 
> Jun 26 10:00:41 fu2 kernel: CS:  0010 DS: 0018 ES: 0018 CR0:
> 000000008005003b
> 
> Jun 26 10:00:41 fu2 kernel: CR2: 00002aaaadfbe008 CR3:
> 0000000042c20000 CR4: 00000000000006e0
> 
> Jun 26 10:00:41 fu2 kernel: Process gfs2_logd (pid: 4142, threadinfo
> ffff81011e246000, task ffff810121d35800)
> 
> Jun 26 10:00:41 fu2 kernel: Stack:  ffff810123fb1a00 ffffffff802cc6e7
> 0000003c00000000 ffff81012da3f7c0
> 
> Jun 26 10:00:41 fu2 kernel:  000000000000003c ffff810123fb0400
> 0000000000000000 ffff810123fb1a00
> 
> Jun 26 10:00:41 fu2 kernel:  ffff81012da3f800 ffffffff802cc8be
> ffff810123fb07e8 ffff810123fb0400
> 
> Jun 26 10:00:41 fu2 kernel: Call Trace:
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff802cc6e7>] free_block
> +0xb1/0x142
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff802cc8be>] cache_flusharray
> +0x7d/0xb1
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff8020765f>] kmem_cache_free
> +0x1ef/0x20c
> 
> Jun 26 10:00:41 fu2 kernel:
> [<ffffffff88445628>] :gfs2:databuf_lo_before_commit+0x576/0x5c6
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff88443acf>] :gfs2:gfs2_log_flush
> +0x11e/0x2d3
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff88438310>] :gfs2:gfs2_logd
> +0xab/0x15b
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff88438265>] :gfs2:gfs2_logd
> +0x0/0x15b
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff80297a1e>]
> keventd_create_kthread+0x0/0x6a
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff802318bd>] kthread+0xd0/0xff
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff8025aec8>] child_rip+0xa/0x12
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff80297a1e>]
> keventd_create_kthread+0x0/0x6a
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff802317ed>] kthread+0x0/0xff
> 
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff8025aebe>] child_rip+0x0/0x12
> 
> Jun 26 10:00:41 fu2 kernel:
> 
> Jun 26 10:00:41 fu2 kernel:
> 
> Jun 26 10:00:41 fu2 kernel: Code: 0f 0b eb fe 48 8b 07 48 8b 50 08 48
> 39 fa 74 12 48 c7 c7 97
> 
> Jun 26 10:00:41 fu2 kernel: RIP  [<ffffffff80341368>] list_del
> +0x21/0x5b
> 
> Jun 26 10:00:41 fu2 kernel:  RSP <ffff81011e247d00>
> 
> 
> 
> On 6/7/07, Steven Whitehouse <swhiteho at redhat.com> wrote:
>         Hi,
>         
>         The version of GFS2 in RHEL5 is rather old. Please use Fedora,
>         the
>         upstream kernel or wait until RHEL 5.1 is out. This should
>         solve the
>         problem that you are seeing,
>         
>         Steve.
>         
>         On Wed, 2007-06-06 at 19:27 -0400, nrbwpi at gmail.com wrote:
>         > Hello,
>         >
>         > Installed RHEL5 on a new two node cluster with Shared FC
>         storage.  The
>         > two shared storage boxes are each split into 6.9TB LUNs for
>         a total of
>         > 4 - 6.9TB LUNS.  Each machine is connected via a single
>         100Mb
>         > connection to a switch and a single FC connection to a FC
>         switch.
>         >
>         > The 4 LUNs have LVM on them with GFS2.  The file systems are
>         mountable 
>         > from each box.  When performing a script dd write of zeros
>         in 250MB
>         > file sizes to the file system from each box to different
>         LUNS, one of
>         > the nodes in the cluster is fenced by the other one.  File
>         size does 
>         > not seem to matter.
>         >
>         > My first guess at the problem was the heartbeat timeout in
>         openais.
>         > In the cluster.conf below I added the totem line to
>         hopefully raise
>         > the timeout to 10 seconds.  This however did not resolve the
>         problem. 
>         > Both boxes are running the latest updates as of 2 days ago
>         from
>         > up2date.
>         >
>         > Below is the cluster.conf and what is seen in the logs.  Any
>         > suggestions would be greatly appreciated.
>         > 
>         > Thanks!
>         >
>         > Neal
>         >
>         >
>         >
>         > ##########################################
>         >
>         > Cluster.conf
>         >
>         > ##########################################
>         >
>         >
>         > <?xml version=" 1.0"?>
>         > <cluster alias="storage1" config_version="4"
>         name="storage1">
>         >         <fence_daemon post_fail_delay="0"
>         post_join_delay="3"/>
>         >         <clusternodes>
>         >                 <clusternode name="fu1" nodeid="1"
>         votes="1">
>         >                         <fence>
>         >                                 <method name="1"> 
>         >                                         <device name="apc4"
>         port="1"
>         > switch="1"/>
>         >                                 </method>
>         >                         </fence> 
>         >                         <multicast addr=" 224.10.10.10"
>         > interface="eth0"/>
>         >                 </clusternode>
>         >                 <clusternode name="fu2" nodeid="2"
>         votes="1"> 
>         >                         <fence>
>         >                                 <method name="1">
>         >                                         <device name="apc4"
>         port="2"
>         > switch="1"/>
>         >                                 </method>
>         >                         </fence>
>         >                         <multicast addr="224.10.10.10"
>         > interface="eth0"/>
>         >                 </clusternode>
>         >         </clusternodes>
>         >         <cman expected_votes="1" two_node="1"> 
>         >                 <multicast addr="224.10.10.10"/>
>         >                 <totem token="10000"/>
>         >         </cman>
>         >         <fencedevices> 
>         >                 <fencedevice agent="fence_apc"
>         ipaddr="192.168.14.193"
>         > login="apc" name="apc4" passwd="apc"/>
>         >         </fencedevices>
>         >         <rm>
>         >                 <failoverdomains/>
>         >                 <resources/>
>         >         </rm>
>         > </cluster>
>         >
>         > 
>         > #####################################################
>         >
>         > /var/log/messages
>         >
>         > #####################################################
>         >
>         > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was
>         lost in the 
>         > OPERATIONAL state.
>         > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast
>         socket
>         > recv buffer size (262142 bytes).
>         > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit
>         multicast socket 
>         > send buffer size (262142 bytes).
>         > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER
>         state from
>         > 2.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER
>         state from
>         > 0.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit
>         token
>         > because I am the rep.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru
>         6e high
>         > seq received 6e
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT
>         state. 
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY
>         state.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] position [0]
>         member
>         > 192.168.14.195:
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq
>         16 rep 
>         > 192.168.14.195
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high
>         delivered 6e
>         > received flag 0
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to
>         originate 
>         > any messages in recovery.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new
>         sequence id for
>         > ring 14
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial
>         ORF token
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION
>         CHANGE 
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New
>         Configuration:
>         > Jun  5 20:19:34 fu1 kernel: dlm: closing connection to node
>         2
>         > Jun  5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member
>         after 0 sec
>         > post_fail_delay
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0)
>         > ip(192.168.14.195)
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
>         > Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2" 
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0)
>         > ip(192.168.14.197)
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
>         > Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is
>         within the 
>         > primary component and will provide service.
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION
>         CHANGE
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New
>         Configuration:
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) 
>         > ip(192.168.14.195)
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
>         > Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is
>         within the 
>         > primary component and will provide service.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering
>         OPERATIONAL state.
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] got nodejoin
>         message
>         > 192.168.14.195
>         > Jun  5 20:19:34 fu1 openais[5351]: [CPG  ] got joinlist
>         message from
>         > node 1
>         > Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0:
>         jid=1: 
>         > Trying to acquire journal lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0:
>         jid=1:
>         > Trying to acquire journal lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0:
>         jid=1: 
>         > Looking at journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0:
>         jid=1:
>         > Trying to acquire journal lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0:
>         jid=1: 
>         > Trying to acquire journal lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0:
>         jid=1:
>         > Looking at journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0:
>         jid=1: 
>         > Looking at journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0:
>         jid=1:
>         > Looking at journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0:
>         jid=1:
>         > Acquiring the transaction lock... 
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0:
>         jid=1:
>         > Replaying journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0:
>         jid=1:
>         > Replayed 0 of 0 blocks
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0:
>         jid=1: 
>         > Found 0 revoke tags
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0:
>         jid=1:
>         > Journal replayed in 1s
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0:
>         jid=1:
>         > Done 
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0:
>         jid=1:
>         > Acquiring the transaction lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0:
>         jid=1:
>         > Replaying journal... 
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0:
>         jid=1:
>         > Replayed 0 of 0 blocks
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0:
>         jid=1:
>         > Found 0 revoke tags
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0:
>         jid=1: 
>         > Journal replayed in 1s
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0:
>         jid=1:
>         > Done
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0:
>         jid=1:
>         > Acquiring the transaction lock... 
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0:
>         jid=1:
>         > Acquiring the transaction lock...
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0:
>         jid=1:
>         > Replaying journal... 
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0:
>         jid=1:
>         > Replayed 222 of 223 blocks
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0:
>         jid=1:
>         > Found 1 revoke tags
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0:
>         jid=1: 
>         > Journal replayed in 1s
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0:
>         jid=1:
>         > Done
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0:
>         jid=1:
>         > Replaying journal... 
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0:
>         jid=1:
>         > Replayed 438 of 439 blocks
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0:
>         jid=1:
>         > Found 1 revoke tags
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0:
>         jid=1: 
>         > Journal replayed in 1s
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0:
>         jid=1:
>         > Done
>         >
>         >
>         > --
>         > Linux-cluster mailing list
>         > Linux-cluster at redhat.com
>         > https://www.redhat.com/mailman/listinfo/linux-cluster
>         
>         --
>         Linux-cluster mailing list
>         Linux-cluster at redhat.com
>         https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From janne.peltonen at helsinki.fi  Thu Jun 28 15:45:37 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Thu, 28 Jun 2007 18:45:37 +0300
Subject: [Linux-cluster] Cluster node without access to all resources -
	trouble
Message-ID: <20070628154537.GA3108@helsinki.fi>

Hi.

I'm running a five node cluster. Four of the nodes run services that
need access to a SAN, but the fifth doesn't. (The fifth node belongs to
the cluster to avoid a cluster with an even number of nodes.
Additionally, the fifth node is a stand-alone rack server, while the
four other nodes are blade server, two of the in two different blade
racks - this way, even if either of the blade racks goes down, I won't
lose the cluster.) This seems to create all sorts of trouble. For
example, if I try to manipulate clvm'd filesystems on the other four
nodes, they refuse to commit changes if the fifth node is up. And even
if I've restricted the SAN-access-needing services to run only on the
four nodes that have the access, the cluster system tries to shut the
services down in the fifth node also (when quorum is lost, for example)
- and complains about being unable to stop them and, on the nodes that
should run the services, refuses to restart them until I've removed the
fifth node from the cluster and fenced it. (Or, rather, I've removed the
fifth node from the cluster and one of the other nodes has successfully
fenced it.)

So.

Is it really necessary that all the members in a cluster have access to
all the resources that any of the members have, even if the services in
the cluster are partitioned to run in only a part of the cluster? Or is
there a way to tell the cluster that it shouldn't care about the fifth
members opinion about certain services; that is, it doesn't need to
check if the services are running on it, because they never do. Or
should I just make sure that the fifth member always comes up last (that
is, won't be running while the others are coming up)? Or should I aceept
that I'm going to create more harm than avoiding by letting the fifth
node belong to the cluster, and just run it outside the cluster?

Sorry if this was incoherent. I'm a bit tired; this system should be in
production in two weeks, and unexpected problems (that didn't come up
during testing) keep coming up... Any suggestions would be greatly
appreciated.


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From Robert.Gil at americanhm.com  Thu Jun 28 15:50:07 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Thu, 28 Jun 2007 11:50:07 -0400
Subject: [Linux-cluster] Cluster node without access to all resources
	-trouble
In-Reply-To: <20070628154537.GA3108@helsinki.fi>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA777@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

What version of cluster are you running?


Robert Gil
Linux Systems Administrator
American Home Mortgage

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen
Sent: Thursday, June 28, 2007 11:46 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Cluster node without access to all resources
-trouble

Hi.

I'm running a five node cluster. Four of the nodes run services that
need access to a SAN, but the fifth doesn't. (The fifth node belongs to
the cluster to avoid a cluster with an even number of nodes.
Additionally, the fifth node is a stand-alone rack server, while the
four other nodes are blade server, two of the in two different blade
racks - this way, even if either of the blade racks goes down, I won't
lose the cluster.) This seems to create all sorts of trouble. For
example, if I try to manipulate clvm'd filesystems on the other four
nodes, they refuse to commit changes if the fifth node is up. And even
if I've restricted the SAN-access-needing services to run only on the
four nodes that have the access, the cluster system tries to shut the
services down in the fifth node also (when quorum is lost, for example)
- and complains about being unable to stop them and, on the nodes that
should run the services, refuses to restart them until I've removed the
fifth node from the cluster and fenced it. (Or, rather, I've removed the
fifth node from the cluster and one of the other nodes has successfully
fenced it.)

So.

Is it really necessary that all the members in a cluster have access to
all the resources that any of the members have, even if the services in
the cluster are partitioned to run in only a part of the cluster? Or is
there a way to tell the cluster that it shouldn't care about the fifth
members opinion about certain services; that is, it doesn't need to
check if the services are running on it, because they never do. Or
should I just make sure that the fifth member always comes up last (that
is, won't be running while the others are coming up)? Or should I aceept
that I'm going to create more harm than avoiding by letting the fifth
node belong to the cluster, and just run it outside the cluster?

Sorry if this was incoherent. I'm a bit tired; this system should be in
production in two weeks, and unexpected problems (that didn't come up
during testing) keep coming up... Any suggestions would be greatly
appreciated.


--Janne
--
Janne Peltonen <janne.peltonen at helsinki.fi>

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From janne.peltonen at helsinki.fi  Thu Jun 28 16:23:36 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Thu, 28 Jun 2007 19:23:36 +0300
Subject: [Linux-cluster] Cluster node without access to all resources
	-trouble
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA777@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <20070628154537.GA3108@helsinki.fi>
	<F154BEC9D4278D4BA9B94BDEC76193482BA777@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <20070628162336.GA3221@helsinki.fi>

On Thu, Jun 28, 2007 at 11:50:07AM -0400, Robert Gil wrote:
> What version of cluster are you running?

[jmmpelto at pcn3 ~]$ sudo rpm -qa 'lvm*cluster|cman|rgmanager'
cman-2.0.60-1.el5
lvm2-cluster-2.02.16-3.el5
rgmanager-2.0.23-1.el5.centos

I'm not running rgmanager-2.0.24 because it didn't seem to run the
script status checks (!).


--Janne

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen
> Sent: Thursday, June 28, 2007 11:46 AM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Cluster node without access to all resources
> -trouble
> 
> Hi.
> 
> I'm running a five node cluster. Four of the nodes run services that
> need access to a SAN, but the fifth doesn't. (The fifth node belongs to
> the cluster to avoid a cluster with an even number of nodes.
> Additionally, the fifth node is a stand-alone rack server, while the
> four other nodes are blade server, two of the in two different blade
> racks - this way, even if either of the blade racks goes down, I won't
> lose the cluster.) This seems to create all sorts of trouble. For
> example, if I try to manipulate clvm'd filesystems on the other four
> nodes, they refuse to commit changes if the fifth node is up. And even
> if I've restricted the SAN-access-needing services to run only on the
> four nodes that have the access, the cluster system tries to shut the
> services down in the fifth node also (when quorum is lost, for example)
> - and complains about being unable to stop them and, on the nodes that
> should run the services, refuses to restart them until I've removed the
> fifth node from the cluster and fenced it. (Or, rather, I've removed the
> fifth node from the cluster and one of the other nodes has successfully
> fenced it.)
> 
> So.
> 
> Is it really necessary that all the members in a cluster have access to
> all the resources that any of the members have, even if the services in
> the cluster are partitioned to run in only a part of the cluster? Or is
> there a way to tell the cluster that it shouldn't care about the fifth
> members opinion about certain services; that is, it doesn't need to
> check if the services are running on it, because they never do. Or
> should I just make sure that the fifth member always comes up last (that
> is, won't be running while the others are coming up)? Or should I aceept
> that I'm going to create more harm than avoiding by letting the fifth
> node belong to the cluster, and just run it outside the cluster?
> 
> Sorry if this was incoherent. I'm a bit tired; this system should be in
> production in two weeks, and unexpected problems (that didn't come up
> during testing) keep coming up... Any suggestions would be greatly
> appreciated.
> 
> 
> --Janne
> --
> Janne Peltonen <janne.peltonen at helsinki.fi>
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From Robert.Gil at americanhm.com  Thu Jun 28 16:29:04 2007
From: Robert.Gil at americanhm.com (Robert Gil)
Date: Thu, 28 Jun 2007 12:29:04 -0400
Subject: [Linux-cluster] Cluster node without access to all
	resources-trouble
In-Reply-To: <20070628162336.GA3221@helsinki.fi>
Message-ID: <F154BEC9D4278D4BA9B94BDEC76193482BA778@melpsechmbx15.HQ.CORP.AMERICANHM.COM>

I cant really help you there. In EL4 each of the services are separate.
So a node can be part of the cluster but doesn't need to share the
resources such as a shared san disk. If you have the resources set up so
that it requires that resource, then it should be fenced. 


Robert Gil
Linux Systems Administrator
American Home Mortgage
Phone: 631-622-8410
Cell: 631-827-5775
Fax: 516-495-5861

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen
Sent: Thursday, June 28, 2007 12:24 PM
To: linux clustering
Subject: Re: [Linux-cluster] Cluster node without access to all
resources-trouble

On Thu, Jun 28, 2007 at 11:50:07AM -0400, Robert Gil wrote:
> What version of cluster are you running?

[jmmpelto at pcn3 ~]$ sudo rpm -qa 'lvm*cluster|cman|rgmanager'
cman-2.0.60-1.el5
lvm2-cluster-2.02.16-3.el5
rgmanager-2.0.23-1.el5.centos

I'm not running rgmanager-2.0.24 because it didn't seem to run the
script status checks (!).


--Janne

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen
> Sent: Thursday, June 28, 2007 11:46 AM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Cluster node without access to all resources 
> -trouble
> 
> Hi.
> 
> I'm running a five node cluster. Four of the nodes run services that 
> need access to a SAN, but the fifth doesn't. (The fifth node belongs 
> to the cluster to avoid a cluster with an even number of nodes.
> Additionally, the fifth node is a stand-alone rack server, while the 
> four other nodes are blade server, two of the in two different blade 
> racks - this way, even if either of the blade racks goes down, I won't

> lose the cluster.) This seems to create all sorts of trouble. For 
> example, if I try to manipulate clvm'd filesystems on the other four 
> nodes, they refuse to commit changes if the fifth node is up. And even

> if I've restricted the SAN-access-needing services to run only on the 
> four nodes that have the access, the cluster system tries to shut the 
> services down in the fifth node also (when quorum is lost, for 
> example)
> - and complains about being unable to stop them and, on the nodes that

> should run the services, refuses to restart them until I've removed 
> the fifth node from the cluster and fenced it. (Or, rather, I've 
> removed the fifth node from the cluster and one of the other nodes has

> successfully fenced it.)
> 
> So.
> 
> Is it really necessary that all the members in a cluster have access 
> to all the resources that any of the members have, even if the 
> services in the cluster are partitioned to run in only a part of the 
> cluster? Or is there a way to tell the cluster that it shouldn't care 
> about the fifth members opinion about certain services; that is, it 
> doesn't need to check if the services are running on it, because they 
> never do. Or should I just make sure that the fifth member always 
> comes up last (that is, won't be running while the others are coming 
> up)? Or should I aceept that I'm going to create more harm than 
> avoiding by letting the fifth node belong to the cluster, and just run
it outside the cluster?
> 
> Sorry if this was incoherent. I'm a bit tired; this system should be 
> in production in two weeks, and unexpected problems (that didn't come 
> up during testing) keep coming up... Any suggestions would be greatly 
> appreciated.
> 
> 
> --Janne
> --
> Janne Peltonen <janne.peltonen at helsinki.fi>
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Janne Peltonen <janne.peltonen at helsinki.fi>

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From janne.peltonen at helsinki.fi  Thu Jun 28 16:54:05 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Thu, 28 Jun 2007 19:54:05 +0300
Subject: [Linux-cluster] Cluster node without access to all
	resources-trouble
In-Reply-To: <F154BEC9D4278D4BA9B94BDEC76193482BA778@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
References: <20070628162336.GA3221@helsinki.fi>
	<F154BEC9D4278D4BA9B94BDEC76193482BA778@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
Message-ID: <20070628165405.GB3221@helsinki.fi>

On Thu, Jun 28, 2007 at 12:29:04PM -0400, Robert Gil wrote:
> I cant really help you there. In EL4 each of the services are separate.
> So a node can be part of the cluster but doesn't need to share the
> resources such as a shared san disk. If you have the resources set up so
> that it requires that resource, then it should be fenced. 

Yep.

The situation seems to be this (someone who really knows abt the inner
workings of the resource group manager, correct me):

*when a clurgmgrd starts, it wants to know the status of all the
services, and to make thing sure, it stops all services locally
(unmounts the filesystems, runs the scripts with "stop") - and asks the
already-running cluster members their idea of the status

*when the clurgmgrd on the fifth node starts, it tries to stop locally
the SAN requiring services - and cannot match the /dev/<vg>/<lv> paths
with real nodes, so it ends up with incoherent information about their
status

*if all the nodes with SAN access are restarted (while the fifth node is
up), the nodes with SAN access first stop the services locally - and
then, apparently, ask the fifth node about the service status. Result:
a line like the following, for each service:

--cut--
Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: <err> #34: Cannot get status for service service:im  
--cut--

(what is weird, though, is that the fifth node knows perfectly well the
status of this particular service, since it's running the service
(service:im doesn't need the SAN access) - perhaps there is some other
reason not to believe the fifth node at this point. can't imagine what
it'd be, though.)

*after that, the nodes with SAN access do nothing about any services
until after the fifth node has left the cluster and has been fenced. So,
apparently the other nodes conclude that the fifth node is 'bad' and
could be interfering with their SAN access requiring services. When the
fifth node has been fenced, the other nodes start the services. And the
fifth node can join the cluster and start the services that should be
running there...

> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen
> > Sent: Thursday, June 28, 2007 11:46 AM
> > To: linux-cluster at redhat.com
> > Subject: [Linux-cluster] Cluster node without access to all resources 
> > -trouble
> > 
> > Hi.
> > 
> > I'm running a five node cluster. Four of the nodes run services that 
> > need access to a SAN, but the fifth doesn't. (The fifth node belongs 
> > to the cluster to avoid a cluster with an even number of nodes.
> > Additionally, the fifth node is a stand-alone rack server, while the 
> > four other nodes are blade server, two of the in two different blade 
> > racks - this way, even if either of the blade racks goes down, I won't
> 
> > lose the cluster.) This seems to create all sorts of trouble. For 
> > example, if I try to manipulate clvm'd filesystems on the other four 
> > nodes, they refuse to commit changes if the fifth node is up. And even
> 
> > if I've restricted the SAN-access-needing services to run only on the 
> > four nodes that have the access, the cluster system tries to shut the 
> > services down in the fifth node also (when quorum is lost, for 
> > example)
> > - and complains about being unable to stop them and, on the nodes that
> 
> > should run the services, refuses to restart them until I've removed 
> > the fifth node from the cluster and fenced it. (Or, rather, I've 
> > removed the fifth node from the cluster and one of the other nodes has
> 
> > successfully fenced it.)
> > 
> > So.
> > 
> > Is it really necessary that all the members in a cluster have access 
> > to all the resources that any of the members have, even if the 
> > services in the cluster are partitioned to run in only a part of the 
> > cluster? Or is there a way to tell the cluster that it shouldn't care 
> > about the fifth members opinion about certain services; that is, it 
> > doesn't need to check if the services are running on it, because they 
> > never do. Or should I just make sure that the fifth member always 
> > comes up last (that is, won't be running while the others are coming 
> > up)? Or should I aceept that I'm going to create more harm than 
> > avoiding by letting the fifth node belong to the cluster, and just run
> it outside the cluster?
> > 
> > Sorry if this was incoherent. I'm a bit tired; this system should be 
> > in production in two weeks, and unexpected problems (that didn't come 
> > up during testing) keep coming up... Any suggestions would be greatly 
> > appreciated.
> > 
> > 
> > --Janne
> > --
> > Janne Peltonen <janne.peltonen at helsinki.fi>
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Janne Peltonen <janne.peltonen at helsinki.fi>
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From mike at duncancg.com  Thu Jun 28 17:43:41 2007
From: mike at duncancg.com (Mike Duncan)
Date: Thu, 28 Jun 2007 12:43:41 -0500
Subject: [Linux-cluster] Newbie Question
Message-ID: <1183052621.3641.5.camel@Dilbert>

I hope this is a good place for beginner questions.  If not, please let
me know and I'll go away....

I am trying to construct my first cluster.  I have 6 PCs (P3s) and am
trying to learn the basics.  I have MPI up and running so that the nodes
will answer and they run a simple "hello world" program.

Here's my main question:  I'm running Fedora Core 6 on this cluster, and
would like to implement GFS as each node has a huge data HDD--160G per
node.  However I cannot find any information on the Net about
implementing GFS on FC6.  I found some old info about FC4, but it was
useless.  I see that RHEL has the functionality included.  Do I need to
install RHEL?  I have a legal copy of RHEL ES 4.

Thanks for any assistance!

Mike



From lhh at redhat.com  Thu Jun 28 18:18:13 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 28 Jun 2007 14:18:13 -0400
Subject: [Linux-cluster] QDISK problem
In-Reply-To: <467BBD2B.A5F6.00ED.0@itdynamics.co.za>
References: <467BBD2B.A5F6.00ED.0@itdynamics.co.za>
Message-ID: <20070628181813.GO8818@redhat.com>

On Fri, Jun 22, 2007 at 12:14:35PM +0200, Gavin Fietze wrote:
> I am trying to get qdiskd to work in a 3 node cluster using RHEL 5 AP . The 3 nodes are virtual machines running XEN. Domain0 is also RHEL 5 AP
> I have upgraded cman and rgmanager to cman-2.0.60-1.el5 and rgmanager-2.0.23-1 respectively, everything else stock standard.
>  
> When I run clustat and "cman_tool nodes" I get strange output for the qdisk object :

This is a bug in rgmanager; it will be fixed in 5.1.  For now, since you
have 3 nodes, just go without qdisk.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jun 28 18:19:58 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 28 Jun 2007 14:19:58 -0400
Subject: [Linux-cluster] No failover occurs!
In-Reply-To: <1182786276.12133.2.camel@kcm40202.kcmhq.org>
References: <1182786276.12133.2.camel@kcm40202.kcmhq.org>
Message-ID: <20070628181958.GP8818@redhat.com>

On Mon, Jun 25, 2007 at 10:44:36AM -0500, Dave Augustus wrote:
> I am familiar with Heartbeat and new to RHCS.
> 
> Anyhow:
> 
> I created a 2 node cluster with no quorum drive.
> added an ip address on the public eth
> added an ip address on the private eth
> added the script apache, with proper configs on both hosts
> 
> The only way I can get all 3 to run is to reboot the nodes.
> 
> Shouldn't it failover if a service fails to start?

It's a configuration option.


  <service name="xxx" recovery="restart|relocate|disable" ...>

Restart = default = restart on the same node.
Relocate = "failover"
Disable = don't bother trying; disable service immediately.

Could you post your cluster.conf?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jun 28 18:22:19 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 28 Jun 2007 14:22:19 -0400
Subject: [Linux-cluster] a couple of questions regarding clusters
In-Reply-To: <91E302EA72562F43AEDC06DD8FAE7D3602855F82@ord1mail01.firstlook.biz>
References: <91E302EA72562F43AEDC06DD8FAE7D3602855F82@ord1mail01.firstlook.biz>
Message-ID: <20070628182219.GQ8818@redhat.com>

On Sat, Jun 23, 2007 at 05:23:56PM -0500, Brent Sachnoff wrote:
> I have a 3 node cluster running redhat 4 with gfs.  What is the proper
> way to have a node leave the cluster for maintenance and then rejoin
> after maintenance is completed?  From the docs, I have read that I need
> to unmount gfs and then stop all the services in the following order:
> rgmanager, gfs, clvmd, fenced.  I can then issue a cman_tool leave
> (remove) request.

That should be correct.  'cman_tool leave remove' will decrement the quorum
vote count.


> I have also noticed that if I lose ip connectivity to a certain node I
> lose gfs connectivity with the other two nodes.  I thought that I would
> only need 2 votes to continue connectivity. 

I assume that in this case, you are disconnecting the cable (not doing a
clean shutdown as above).

That's correct, except fencing must complete in order for you to
maintain full access to the GFS volume. 

If you don't have fencing configured, you will lose connectivity to GFS
volume.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jun 28 18:24:15 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 28 Jun 2007 14:24:15 -0400
Subject: [Linux-cluster] Cluster configuration on redhat AS 4
In-Reply-To: <20070625035719.12708.qmail@webmail6.rediffmail.com>
References: <20070625035719.12708.qmail@webmail6.rediffmail.com>
Message-ID: <20070628182415.GR8818@redhat.com>

On Mon, Jun 25, 2007 at 03:57:19AM -0000, manjunath c shanubog wrote:
> Hi,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I have to setup two node cluster with redhat AS 4 and cluster suite with GFS. The application which is to be installed is MySql database. I would like&nbsp;to have a solution for the below queries&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1. Detailed installation guide for cluster suite installation and is it possible to load balance on redhat 4/5 linux.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2. Do i need to have a separate cluster suite for MySql, if so which one is Good.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3. Guide or document for Installation of MySQL on cluster.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4. In windows clustering there is no need of fencing device, why is it necessary in linux. if so which is good fencing device and its configuration details.Thanking YouManjunath

I don't think active/active mysql clustering currently works; someone
else can correct me if I'm wrong on this.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jun 28 18:25:55 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 28 Jun 2007 14:25:55 -0400
Subject: [Linux-cluster] manual fencing problem
In-Reply-To: <46819E5C.1090802@cmiware.com>
References: <46819E5C.1090802@cmiware.com>
Message-ID: <20070628182553.GS8818@redhat.com>

On Tue, Jun 26, 2007 at 06:16:44PM -0500, Chris Harms wrote:
> Trying to setup manual fencing for testing purposes in Conga gave me the 
> following errors:
> 
> agent "fence_manual" reports: failed: fence_manual no node name
> 
> It appears this came up before:
> http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html
> 
> but is still unresolved.

Ew.. What releases of conga (luci/ricci) system-config-cluster and fence
do you have installed?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jun 28 18:30:32 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 28 Jun 2007 14:30:32 -0400
Subject: [Linux-cluster] wishing to run GFS on iSCSI with redundancy
In-Reply-To: <C1E9042E-7DB2-41D3-84F0-7A1867F448D5@gmail.com>
References: <C1E9042E-7DB2-41D3-84F0-7A1867F448D5@gmail.com>
Message-ID: <20070628183032.GT8818@redhat.com>

On Wed, Jun 27, 2007 at 03:54:07PM +1200, Rohit Grover wrote:
> Hello,
> 
> We'd like to run GFS in a cluster serviced by a pool of iSCSI disks.  
> We would like to use RAID to add redundancy to the storage, but  
> there's literature on the net saying that linux's MD driver is not  
> cluster safe. Since CLVM doesn't support RAID, what options do we  
> have other than pairing the iSCSI disks with DRBD?

Ok, couple of notes:

* MD is only unsafe in a cluster if it's used on multiple cluster nodes.
That is, it should be fairly easy to implement a resource agent which
assembles MD devices from network block devices - on one node at a time.

* DRBD only will work with two writers (if 0.8.x+).  I'm not sure how many
mirror targets you can maintain.

* Aren't most iSCSI targets RAID arrays already (?)

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jun 28 18:33:42 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 28 Jun 2007 14:33:42 -0400
Subject: [Linux-cluster] Cluster node without access to all resources -
	trouble
In-Reply-To: <20070628154537.GA3108@helsinki.fi>
References: <20070628154537.GA3108@helsinki.fi>
Message-ID: <20070628183341.GU8818@redhat.com>

On Thu, Jun 28, 2007 at 06:45:37PM +0300, Janne Peltonen wrote:
> 
> So.
> 
> Is it really necessary that all the members in a cluster have access to
> all the resources that any of the members have, even if the services in
> the cluster are partitioned to run in only a part of the cluster?

No.

>  Or is
> there a way to tell the cluster that it shouldn't care about the fifth
> members opinion about certain services; that is, it doesn't need to
> check if the services are running on it, because they never do.

(1) don't start rgmanager on the fifth node :), or

(2) if you do start rgmanager on the fifth node, make all services be
part of a "restricted" failover domain comprised of the other four
nodes.

> Or
> should I just make sure that the fifth member always comes up last (that
> is, won't be running while the others are coming up)? Or should I aceept
> that I'm going to create more harm than avoiding by letting the fifth
> node belong to the cluster, and just run it outside the cluster?

If the above two don't work, it's a bug.

(Oh! and those status-script checks will be fixed in 5.1 ;) ).

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jun 28 18:39:44 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 28 Jun 2007 14:39:44 -0400
Subject: [Linux-cluster] Cluster node without access to all
	resources-trouble
In-Reply-To: <20070628165405.GB3221@helsinki.fi>
References: <20070628162336.GA3221@helsinki.fi>
	<F154BEC9D4278D4BA9B94BDEC76193482BA778@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<20070628165405.GB3221@helsinki.fi>
Message-ID: <20070628183944.GV8818@redhat.com>

On Thu, Jun 28, 2007 at 07:54:05PM +0300, Janne Peltonen wrote:
> On Thu, Jun 28, 2007 at 12:29:04PM -0400, Robert Gil wrote:
> > I cant really help you there. In EL4 each of the services are separate.
> > So a node can be part of the cluster but doesn't need to share the
> > resources such as a shared san disk. If you have the resources set up so
> > that it requires that resource, then it should be fenced. 

RHEL5 is the same FWIW, or should be.

> *when a clurgmgrd starts, it wants to know the status of all the
> services, and to make thing sure, it stops all services locally
> (unmounts the filesystems, runs the scripts with "stop") - and asks the
> already-running cluster members their idea of the status

Right.

> *when the clurgmgrd on the fifth node starts, it tries to stop locally
> the SAN requiring services - and cannot match the /dev/<vg>/<lv> paths
> with real nodes, so it ends up with incoherent information about their
> status

This should not cause a problem.


> *if all the nodes with SAN access are restarted (while the fifth node is
> up), the nodes with SAN access first stop the services locally - and
> then, apparently, ask the fifth node about the service status. Result:
> a line like the following, for each service:
> 
> --cut--
> Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: <err> #34: Cannot get status for service service:im  
> --cut--

What do you mean here, (sorry, being daft)

Restart all nodes = "just rgmanager on all nodes", or "reboot all
nodes"?


> (what is weird, though, is that the fifth node knows perfectly well the
> status of this particular service, since it's running the service
> (service:im doesn't need the SAN access) - perhaps there is some other
> reason not to believe the fifth node at this point. can't imagine what
> it'd be, though.)

cman_tool services from each node could help here.


> *after that, the nodes with SAN access do nothing about any services
> until after the fifth node has left the cluster and has been fenced. 

If you're rebooting the other 4 nodes, it sounds like the 5th is holding
some sort of a lock which it shouldn't be across quorum transitions
(which would be a bug).

If this is the case, could you:

* install rgmanager-debuginfo
* get me a backtrace:

    gdb clurgmgrd `pidof clurgmgrd`
    thr a a bt

-- Lon
-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Jun 28 18:41:05 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 28 Jun 2007 14:41:05 -0400
Subject: [Linux-cluster] Newbie Question
In-Reply-To: <1183052621.3641.5.camel@Dilbert>
References: <1183052621.3641.5.camel@Dilbert>
Message-ID: <20070628184105.GW8818@redhat.com>

On Thu, Jun 28, 2007 at 12:43:41PM -0500, Mike Duncan wrote:
> I hope this is a good place for beginner questions.  If not, please let
> me know and I'll go away....
> 
> I am trying to construct my first cluster.  I have 6 PCs (P3s) and am
> trying to learn the basics.  I have MPI up and running so that the nodes
> will answer and they run a simple "hello world" program.
> 
> Here's my main question:  I'm running Fedora Core 6 on this cluster, and
> would like to implement GFS as each node has a huge data HDD--160G per
> node.  However I cannot find any information on the Net about
> implementing GFS on FC6.  I found some old info about FC4, but it was
> useless.  I see that RHEL has the functionality included.  Do I need to
> install RHEL?

No, you don't; the RHEL5 documentation for installing/configuring GFS
should apply to Fedora Core 6.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From janne.peltonen at helsinki.fi  Thu Jun 28 18:40:54 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Thu, 28 Jun 2007 21:40:54 +0300
Subject: [Linux-cluster] Cluster node without access to all resources -
	trouble
In-Reply-To: <20070628183341.GU8818@redhat.com>
References: <20070628154537.GA3108@helsinki.fi>
	<20070628183341.GU8818@redhat.com>
Message-ID: <20070628184053.GD3221@helsinki.fi>

On Thu, Jun 28, 2007 at 02:33:42PM -0400, Lon Hohberger wrote:
> (1) don't start rgmanager on the fifth node :), or

...now there's an idea :)

> (2) if you do start rgmanager on the fifth node, make all services be
> part of a "restricted" failover domain comprised of the other four
> nodes.
> 
> > Or
> > should I just make sure that the fifth member always comes up last (that
> > is, won't be running while the others are coming up)? Or should I aceept
> > that I'm going to create more harm than avoiding by letting the fifth
> > node belong to the cluster, and just run it outside the cluster?
> 
> If the above two don't work, it's a bug.

If (2) means /all/ the services, even the one that should be running on
the fifth node, it's more or less equal to (1), isn't it? That is, the
service that I want to be running on node five can't be a clustered
service (which is, come to think of it, exactly what I want...)

> (Oh! and those status-script checks will be fixed in 5.1 ;) ).

(Thanks.)


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From chris at cmiware.com  Thu Jun 28 19:00:05 2007
From: chris at cmiware.com (Chris Harms)
Date: Thu, 28 Jun 2007 14:00:05 -0500
Subject: [Linux-cluster] manual fencing problem
In-Reply-To: <20070628182553.GS8818@redhat.com>
References: <46819E5C.1090802@cmiware.com> <20070628182553.GS8818@redhat.com>
Message-ID: <46840535.9080608@cmiware.com>

luci-0.9.2-6.el5
ricci-0.9.2-6.el5
cman-2.0.64-1.el5
fence_tool 2.0.64 (built May 10 2007 17:58:41)


Lon Hohberger wrote:
> On Tue, Jun 26, 2007 at 06:16:44PM -0500, Chris Harms wrote:
>   
>> Trying to setup manual fencing for testing purposes in Conga gave me the 
>> following errors:
>>
>> agent "fence_manual" reports: failed: fence_manual no node name
>>
>> It appears this came up before:
>> http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html
>>
>> but is still unresolved.
>>     
>
> Ew.. What releases of conga (luci/ricci) system-config-cluster and fence
> do you have installed?
>
>   



From Robert.Hell at fabasoft.com  Thu Jun 28 19:12:09 2007
From: Robert.Hell at fabasoft.com (Hell, Robert)
Date: Thu, 28 Jun 2007 21:12:09 +0200
Subject: AW: [Linux-cluster] QDISK problem
In-Reply-To: <20070628181813.GO8818@redhat.com>
References: <467BBD2B.A5F6.00ED.0@itdynamics.co.za>
	<20070628181813.GO8818@redhat.com>
Message-ID: <B710F3299F04664DB6B37C258FDEEB94544722@FABAMAIL.fabagl.fabasoft.com>

By the way - is there a release date for 5.1?

I couldn't find one online ...

Regards,
Robert

-----Urspr?ngliche Nachricht-----
Von: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von Lon Hohberger
Gesendet: Donnerstag, 28. Juni 2007 20:18
An: linux clustering
Betreff: Re: [Linux-cluster] QDISK problem

On Fri, Jun 22, 2007 at 12:14:35PM +0200, Gavin Fietze wrote:
> I am trying to get qdiskd to work in a 3 node cluster using RHEL 5 AP . The 3 nodes are virtual machines running XEN. Domain0 is also RHEL 5 AP
> I have upgraded cman and rgmanager to cman-2.0.60-1.el5 and rgmanager-2.0.23-1 respectively, everything else stock standard.
>  
> When I run clustat and "cman_tool nodes" I get strange output for the qdisk object :

This is a bug in rgmanager; it will be fixed in 5.1.  For now, since you
have 3 nodes, just go without qdisk.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jparsons at redhat.com  Thu Jun 28 19:39:47 2007
From: jparsons at redhat.com (jim parsons)
Date: Thu, 28 Jun 2007 15:39:47 -0400
Subject: [Linux-cluster] manual fencing problem
In-Reply-To: <20070628182553.GS8818@redhat.com>
References: <46819E5C.1090802@cmiware.com> <20070628182553.GS8818@redhat.com>
Message-ID: <1183059587.3017.4.camel@localhost.localdomain>

On Thu, 2007-06-28 at 14:25 -0400, Lon Hohberger wrote:
> On Tue, Jun 26, 2007 at 06:16:44PM -0500, Chris Harms wrote:
> > Trying to setup manual fencing for testing purposes in Conga gave me the 
> > following errors:
> > 
> > agent "fence_manual" reports: failed: fence_manual no node name
> > 
> > It appears this came up before:
> > http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html
> > 
> > but is still unresolved.
> 
> Ew.. What releases of conga (luci/ricci) system-config-cluster and fence
> do you have installed?
> 
This is a known bug and was fixed in the current 5.1 beta...we can
provide you a patch whether you are using 4 or 5. What version are you
running?

-J



From jparsons at redhat.com  Thu Jun 28 19:41:43 2007
From: jparsons at redhat.com (jim parsons)
Date: Thu, 28 Jun 2007 15:41:43 -0400
Subject: AW: [Linux-cluster] QDISK problem
In-Reply-To: <B710F3299F04664DB6B37C258FDEEB94544722@FABAMAIL.fabagl.fabasoft.com>
References: <467BBD2B.A5F6.00ED.0@itdynamics.co.za>
	<20070628181813.GO8818@redhat.com>
	<B710F3299F04664DB6B37C258FDEEB94544722@FABAMAIL.fabagl.fabasoft.com>
Message-ID: <1183059703.3017.6.camel@localhost.localdomain>

On Thu, 2007-06-28 at 21:12 +0200, Hell, Robert wrote:
> By the way - is there a release date for 5.1?
> 
> I couldn't find one online ...
Beta froze yesterday. Prolly a week?
-j

> -----Urspr?ngliche Nachricht-----
> Von: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von Lon Hohberger
> Gesendet: Donnerstag, 28. Juni 2007 20:18
> An: linux clustering
> Betreff: Re: [Linux-cluster] QDISK problem
> 
> On Fri, Jun 22, 2007 at 12:14:35PM +0200, Gavin Fietze wrote:
> > I am trying to get qdiskd to work in a 3 node cluster using RHEL 5 AP . The 3 nodes are virtual machines running XEN. Domain0 is also RHEL 5 AP
> > I have upgraded cman and rgmanager to cman-2.0.60-1.el5 and rgmanager-2.0.23-1 respectively, everything else stock standard.
> >  
> > When I run clustat and "cman_tool nodes" I get strange output for the qdisk object :
> 
> This is a bug in rgmanager; it will be fixed in 5.1.  For now, since you
> have 3 nodes, just go without qdisk.
> 
> -- Lon
> 



From chris at cmiware.com  Thu Jun 28 19:56:29 2007
From: chris at cmiware.com (Chris Harms)
Date: Thu, 28 Jun 2007 14:56:29 -0500
Subject: [Linux-cluster] manual fencing problem
In-Reply-To: <1183059587.3017.4.camel@localhost.localdomain>
References: <46819E5C.1090802@cmiware.com> <20070628182553.GS8818@redhat.com>
	<1183059587.3017.4.camel@localhost.localdomain>
Message-ID: <4684126D.3070302@cmiware.com>

Using CS 5 on RHEL 5 via RedHat Network.

Thanks,
Chris

jim parsons wrote:
> On Thu, 2007-06-28 at 14:25 -0400, Lon Hohberger wrote:
>   
>> On Tue, Jun 26, 2007 at 06:16:44PM -0500, Chris Harms wrote:
>>     
>>> Trying to setup manual fencing for testing purposes in Conga gave me the 
>>> following errors:
>>>
>>> agent "fence_manual" reports: failed: fence_manual no node name
>>>
>>> It appears this came up before:
>>> http://www.redhat.com/archives/linux-cluster/2006-May/msg00088.html
>>>
>>> but is still unresolved.
>>>       
>> Ew.. What releases of conga (luci/ricci) system-config-cluster and fence
>> do you have installed?
>>
>>     
> This is a known bug and was fixed in the current 5.1 beta...we can
> provide you a patch whether you are using 4 or 5. What version are you
> running?
>
> -J
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From mike at duncancg.com  Thu Jun 28 20:04:49 2007
From: mike at duncancg.com (Mike Duncan)
Date: Thu, 28 Jun 2007 15:04:49 -0500
Subject: [Linux-cluster] Newbie Question
In-Reply-To: <20070628184105.GW8818@redhat.com>
References: <1183052621.3641.5.camel@Dilbert>
	<20070628184105.GW8818@redhat.com>
Message-ID: <1183061089.3641.12.camel@Dilbert>

Hmm.  OK.  Does that mean I still need to acquire RH's Cluster Suite?
Unfortunately, I'm doing this as a personal project, and do not have a
company's resources behind me. 

Mike

>  useless.  I see that RHEL has the functionality included.  Do I need to
> > install RHEL?
> 
> No, you don't; the RHEL5 documentation for installing/configuring GFS
> should apply to Fedora Core 6.
> 




From janne.peltonen at helsinki.fi  Thu Jun 28 20:49:11 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Thu, 28 Jun 2007 23:49:11 +0300
Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate
Message-ID: <20070628204911.GE3221@helsinki.fi>

Hi.

There is a clustered vg in my five-node cluster (four of which have
access to the SAN where the physical devices reside). It seems to me
that the more LV's and PV's I've got, the more time it takes to get the
VG activated. When I had 8 PV's and 79 LV's, it took some ten minutes to
activate the VG on each node. Now, I extended the VG by 39 PV's (so that my
sometimes-disk-intensive services wouldn't interfere with each other,
each has its own 'disk' from the SAN) and I haven't succeeded in
activating the VG anymore. (I did the pvcreates on one node, and they
were fast. Thereafter I run the vgextend on the same node, and it took a
couple of minutes. Then I tried to do vgdisplay on one of the other
nodes, and got an error (the lvm couldn't find the new PV's and then,
the VG itself). I rebooted all the nodes but they haven't booted
yet. The cluster seems to be up, and the nodes have all successfully
started CLVMD (I'm seeing this from our remote syslog host, I don't have
access to the node consoles right now (stupid java remote consoles
and stupid JVMs that don't handle slow X11 connections well)) - and
that's it. They are all probably trying to activate the VG with 47 PV's,
and it seems to take ages.  It started an three-quarters of an hour ago,
and now I'm going to sleep (it's midnight here) and see if it'll be up
by tomorrow morning...


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From janne.peltonen at helsinki.fi  Thu Jun 28 20:51:19 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Thu, 28 Jun 2007 23:51:19 +0300
Subject: [Linux-cluster] Cluster node without access to all
	resources-trouble
In-Reply-To: <20070628183944.GV8818@redhat.com>
References: <20070628162336.GA3221@helsinki.fi>
	<F154BEC9D4278D4BA9B94BDEC76193482BA778@melpsechmbx15.HQ.CORP.AMERICANHM.COM>
	<20070628165405.GB3221@helsinki.fi>
	<20070628183944.GV8818@redhat.com>
Message-ID: <20070628205119.GF3221@helsinki.fi>

On Thu, Jun 28, 2007 at 02:39:44PM -0400, Lon Hohberger wrote:
> 
> > *if all the nodes with SAN access are restarted (while the fifth node is
> > up), the nodes with SAN access first stop the services locally - and
> > then, apparently, ask the fifth node about the service status. Result:
> > a line like the following, for each service:
> > 
> > --cut--
> > Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: <err> #34: Cannot get status for service service:im  
> > --cut--
> 
> What do you mean here, (sorry, being daft)
> 
> Restart all nodes = "just rgmanager on all nodes", or "reboot all
> nodes"?

Reboot all nodes.

> > *after that, the nodes with SAN access do nothing about any services
> > until after the fifth node has left the cluster and has been fenced. 
> If you're rebooting the other 4 nodes, it sounds like the 5th is holding
> some sort of a lock which it shouldn't be across quorum transitions
> (which would be a bug).
> 
> If this is the case, could you:
> 
> * install rgmanager-debuginfo
> * get me a backtrace:
> 
>     gdb clurgmgrd `pidof clurgmgrd`
>     thr a a bt

I'll try to find the time for this tomorrow or something. (This
behaviour doesn't really make the cluster un-production-useable, so I'm
trying to solve the other problems first ;)


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From janne.peltonen at helsinki.fi  Thu Jun 28 21:49:07 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Fri, 29 Jun 2007 00:49:07 +0300
Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate
In-Reply-To: <20070628204911.GE3221@helsinki.fi>
References: <20070628204911.GE3221@helsinki.fi>
Message-ID: <20070628214907.GA3768@helsinki.fi>

On Thu, Jun 28, 2007 at 11:49:11PM +0300, Janne Peltonen wrote:
> that's it. They are all probably trying to activate the VG with 47 PV's,
> and it seems to take ages.  It started an three-quarters of an hour ago,
> and now I'm going to sleep (it's midnight here) and see if it'll be up
> by tomorrow morning...

So I didn't go to sleep. The VG took /one and a half hours/ to activate.
And operations such as pvdisplay /dev/sdau1 take ages (minutes and
minutes), and the pvdisplay appears to hog cpu. Meanwhile:

[jmmpelto at pcn3 ~]$ sudo dd if=/dev/sdau1 of=/tmp/huu bs=1k count=10000
10000+0 records in
10000+0 records out
10240000 bytes (10 MB) copied, 0.30667 seconds, 33.4 MB/s

Er.


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From rgrover1 at gmail.com  Thu Jun 28 22:49:35 2007
From: rgrover1 at gmail.com (Rohit Grover)
Date: Fri, 29 Jun 2007 10:49:35 +1200
Subject: [Linux-cluster] wishing to run GFS on iSCSI with redundancy
In-Reply-To: <20070628183032.GT8818@redhat.com>
References: <C1E9042E-7DB2-41D3-84F0-7A1867F448D5@gmail.com>
	<20070628183032.GT8818@redhat.com>
Message-ID: <426bed110706281549n70b37bfaq178c04530540c568@mail.gmail.com>

Hello,

Thanks a lot for responding.

Ok, couple of notes:
>
> * MD is only unsafe in a cluster if it's used on multiple cluster nodes.
> That is, it should be fairly easy to implement a resource agent which
> assembles MD devices from network block devices - on one node at a time.



True. I would like to have MD assembling iSCSI initiators (the same set, of
course) at multiple nodes. This will facilitate load distribution.
Isn't it true that if MD is made to not cache any data flowing through it
(and leave GFS to do caching and coherency control across the cluster), then
MD should be a viable solution to putting together iSCSI initiators with
RAID?


* DRBD only will work with two writers (if 0.8.x+).  I'm not sure how many
> mirror targets you can maintain.


Could you please elaborate on this? I don't understand what is meant by
'DRBD will only work with two writers'. Thanks.


* Aren't most iSCSI targets RAID arrays already (?)


Yes, they are in our case. But we also want to survive software/firmware
failures of the iSCSI targets.


Thanks,
Rohit Grover.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070629/880c3f98/attachment.htm>

From janne.peltonen at helsinki.fi  Fri Jun 29 07:21:42 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Fri, 29 Jun 2007 10:21:42 +0300
Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate
In-Reply-To: <20070628214907.GA3768@helsinki.fi>
References: <20070628204911.GE3221@helsinki.fi>
	<20070628214907.GA3768@helsinki.fi>
Message-ID: <20070629072141.GE29854@helsinki.fi>

On Fri, Jun 29, 2007 at 12:49:07AM +0300, Janne Peltonen wrote:
> On Thu, Jun 28, 2007 at 11:49:11PM +0300, Janne Peltonen wrote:
> > that's it. They are all probably trying to activate the VG with 47 PV's,
> > and it seems to take ages.  It started an three-quarters of an hour ago,
> > and now I'm going to sleep (it's midnight here) and see if it'll be up
> > by tomorrow morning...
> 
> So I didn't go to sleep. The VG took /one and a half hours/ to activate.
> And operations such as pvdisplay /dev/sdau1 take ages (minutes and
> minutes), and the pvdisplay appears to hog cpu. Meanwhile:

Some data to give you the feel of things:

[jmmpelto at pcn1 ~]$ time sudo service clvmd restart
Deactivating VG mappi-primary:   0 logical volume(s) in volume group "mappi-primary" now active
                                                           [  OK  ]
Stopping clvm:                                             [  OK  ]
Starting clvmd:                                            [  OK  ]
Activating VGs:   2 logical volume(s) in volume group "main" now active
  78 logical volume(s) in volume group "mappi-primary" now active
                                                           [  OK  ]

real    4m40.448s
user    0m0.662s
sys     0m0.299s
[jmmpelto at pcn1 ~]$ time sudo vgextend mappi-primary $(for LETTER in {i..r}; do echo /dev/sd${LETTER}1; done)
Password:
  Volume group "mappi-primary" successfully extended

real    0m7.534s
user    0m0.197s
sys     0m0.112s
[jmmpelto at pcn1 ~]$ time sudo service clvmd restart
Deactivating VG mappi-primary:   0 logical volume(s) in volume group "mappi-primary" now active
                                                           [  OK  ]
Stopping clvm:                                             [  OK  ]
Starting clvmd:                                            [  OK  ]
Activating VGs:   2 logical volume(s) in volume group "main" now active
  78 logical volume(s) in volume group "mappi-primary" now active
                                                           [  OK  ]

real    43m17.340s
user    0m2.473s
sys     0m0.528s

Adding ten PV's increased the deactivation-activation-cycle time tenfold.
Whatever might be the reason for this.



--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From janne.peltonen at helsinki.fi  Fri Jun 29 07:35:55 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Fri, 29 Jun 2007 10:35:55 +0300
Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate
In-Reply-To: <20070629072141.GE29854@helsinki.fi>
References: <20070628204911.GE3221@helsinki.fi>
	<20070628214907.GA3768@helsinki.fi>
	<20070629072141.GE29854@helsinki.fi>
Message-ID: <20070629073554.GF29854@helsinki.fi>

There seems to be great variation in the cycle time in different SAN
load conditions:

On Fri, Jun 29, 2007 at 10:21:42AM +0300, Janne Peltonen wrote:
> [jmmpelto at pcn1 ~]$ time sudo service clvmd restart
> Deactivating VG mappi-primary:   0 logical volume(s) in volume group "mappi-primary" now active
>                                                            [  OK  ]
> Stopping clvm:                                             [  OK  ]
> Starting clvmd:                                            [  OK  ]
> Activating VGs:   2 logical volume(s) in volume group "main" now active
>   78 logical volume(s) in volume group "mappi-primary" now active
>                                                            [  OK  ]
> 
> real    4m40.448s
> user    0m0.662s
> sys     0m0.299s

(added and reduced 10 PV's) (and the activity on the SAN on other nodes
decreased)

[jmmpelto at pcn1 ~]$ time sudo service clvmd restart
Deactivating VG mappi-primary:   0 logical volume(s) in volume group "mappi-primary" now active
                                                           [  OK  ]
Stopping clvm:                                             [  OK  ]
Starting clvmd:                                            [  OK  ]
Activating VGs:   2 logical volume(s) in volume group "main" now active
  78 logical volume(s) in volume group "mappi-primary" now active
                                                           [  OK  ]

real    1m54.891s
user    0m0.672s
sys     0m0.324s
[jmmpelto at pcn1 ~]$ time sudo service clvmd restart
Password:
Deactivating VG mappi-primary:   0 logical volume(s) in volume group "mappi-primary" now active
                                                           [  OK  ]
Stopping clvm:                                             [  OK  ]
Starting clvmd:                                            [  OK  ]
Activating VGs:   2 logical volume(s) in volume group "main" now active
  78 logical volume(s) in volume group "mappi-primary" now active
                                                           [  OK  ]

real    2m3.736s
user    0m0.660s
sys     0m0.321s


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From breeves at redhat.com  Fri Jun 29 08:54:06 2007
From: breeves at redhat.com (Bryn M. Reeves)
Date: Fri, 29 Jun 2007 09:54:06 +0100
Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate
In-Reply-To: <20070628204911.GE3221@helsinki.fi>
References: <20070628204911.GE3221@helsinki.fi>
Message-ID: <4684C8AE.5030706@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Janne Peltonen wrote:
> Hi.
> 
> There is a clustered vg in my five-node cluster (four of which have
> access to the SAN where the physical devices reside). It seems to me
> that the more LV's and PV's I've got, the more time it takes to get the
> VG activated. When I had 8 PV's and 79 LV's, it took some ten minutes to
> activate the VG on each node. Now, I extended the VG by 39 PV's (so that my

Did you put a metadata copy on each PV? Check with vgdisplay:

  --- Volume group ---
  VG Name               t0
  System ID
  Format                lvm2
  Metadata Areas        4          <---------
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                64
  Act PV                64
  VG Size               7.75 GB
  PE Size               4.00 MB
  Total PE              1984
  Alloc PE / Size       0 / 0
  Free  PE / Size       1984 / 7.75 GB
  VG UUID               PcERts-A1MU-KoT6-gu6R-Acuw-MYuf-08Ximv

If that's the case, you probably don't want one for each PV here. It's
unnecessary and will slow the tools down a lot when there are a large
number of PVs.

Check out the --metadatacopies option to pvcreate and re-create the
volume group with a much smaller number of MDAs if this is the problem
you're seeing. If you're careful you can do this in-place via the
- --restorefile option to pvcreate and vgcfgbackup/vgcfgrestore.

Regards,
Bryn.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGhMiu6YSQoMYUY94RApX4AKCEW1/2ybBF7dmGnIHVTN1kKUpYUACfbxt4
kBTSA2222ahlwxasXKp3ffg=
=Gohu
-----END PGP SIGNATURE-----



From janne.peltonen at helsinki.fi  Fri Jun 29 09:21:32 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Fri, 29 Jun 2007 12:21:32 +0300
Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate
In-Reply-To: <4684C8AE.5030706@redhat.com>
References: <20070628204911.GE3221@helsinki.fi> <4684C8AE.5030706@redhat.com>
Message-ID: <20070629092132.GG29854@helsinki.fi>

On Fri, Jun 29, 2007 at 09:54:06AM +0100, Bryn M. Reeves wrote:
> > There is a clustered vg in my five-node cluster (four of which have
> > access to the SAN where the physical devices reside). It seems to me
> > that the more LV's and PV's I've got, the more time it takes to get the
> > VG activated. When I had 8 PV's and 79 LV's, it took some ten minutes to
> > activate the VG on each node. Now, I extended the VG by 39 PV's (so that my
> 
> Did you put a metadata copy on each PV? Check with vgdisplay:
[..]
> If that's the case, you probably don't want one for each PV here. It's
> unnecessary and will slow the tools down a lot when there are a large
> number of PVs.
> 
> Check out the --metadatacopies option to pvcreate and re-create the
> volume group with a much smaller number of MDAs if this is the problem
> you're seeing. If you're careful you can do this in-place via the
> - --restorefile option to pvcreate and vgcfgbackup/vgcfgrestore.

THANK you. And here I was thinking I knew the lvm tools completely...
Apparantly one'd have to read the man pages of even everyday tools now
and then. ;)


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From janne.peltonen at helsinki.fi  Fri Jun 29 13:03:19 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Fri, 29 Jun 2007 16:03:19 +0300
Subject: [Linux-cluster] How to avoid all services starting on the first
	node that boots?
Message-ID: <20070629130318.GJ29854@helsinki.fi>

Hi!

In my cluster, there are 39 services spread over 4 nodes. Any service
can run on any node, so I've set the failover domain
priorities up so that when any node goes down the services are spread
more or less evenly on the remaining nodes. Even if there is but one
node remaining, it can run the services.

But there seems to be a catch. It seems to me that the first node that
starts the rgmanager starts up all the services - and, since starting up
the services takes up a lot of resources, it takes a long time (well,
abt five minutes) until the services are relocated where they belong.

Is there a way to increase the time that a node waits for a prior member
in the failover domain to come up before it tries to start the service
in the group? I couldn't find any, but perhaps I didn't search well
enough.

Thanks for any advice.


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From rpeterso at redhat.com  Fri Jun 29 13:10:04 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Fri, 29 Jun 2007 08:10:04 -0500
Subject: [Linux-cluster] Newbie Question
In-Reply-To: <1183061089.3641.12.camel@Dilbert>
References: <1183052621.3641.5.camel@Dilbert>
	<20070628184105.GW8818@redhat.com> <1183061089.3641.12.camel@Dilbert>
Message-ID: <1183122604.11507.35.camel@technetium.msp.redhat.com>

On Thu, 2007-06-28 at 15:04 -0500, Mike Duncan wrote:
> Hmm.  OK.  Does that mean I still need to acquire RH's Cluster Suite?
> Unfortunately, I'm doing this as a personal project, and do not have a
> company's resources behind me. 
> 
> Mike
Hi Mike,

This is all open-source software, so you don't need to buy anything.
You can download the source code for the entire cluster suite,
compile it, install it, and use it.  

You can fetch the entire source code tree from CVS with this command: 

cvs -d :ext:sources.redhat.com:/cvs/cluster co cluster 

There are install packages for various platforms too, if you
don't want to compile it yourself.

Regards,

Bob Peterson
Red Hat Cluster Suite




From janne.peltonen at helsinki.fi  Fri Jun 29 13:25:56 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Fri, 29 Jun 2007 16:25:56 +0300
Subject: [Linux-cluster] fs.sh?
Message-ID: <20070629132556.GK29854@helsinki.fi>

Hi!

I had the trouble with fs.sh a while ago, and, well, it hasn't gone
anywhere. Are there any news? Thanks.


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From sholmes at surf7.com  Fri Jun 29 13:58:54 2007
From: sholmes at surf7.com (steven holmes)
Date: Fri, 29 Jun 2007 06:58:54 -0700 (PDT)
Subject: [Linux-cluster] a cluster in a cluster
Message-ID: <802137.86357.qm@web403.biz.mail.mud.yahoo.com>

has any one tried to build a storage cluster and then build vmware on those hosts and make vm windows cluster accros the 2 hosts.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070629/ebf9b5ce/attachment.htm>

From carlopmart at gmail.com  Fri Jun 29 15:19:07 2007
From: carlopmart at gmail.com (carlopmart)
Date: Fri, 29 Jun 2007 17:19:07 +0200
Subject: [Linux-cluster] a cluster in a cluster
In-Reply-To: <802137.86357.qm@web403.biz.mail.mud.yahoo.com>
References: <802137.86357.qm@web403.biz.mail.mud.yahoo.com>
Message-ID: <468522EB.9070404@gmail.com>



steven holmes wrote:
> has any one tried to build a storage cluster and then build vmware on 
> those hosts and make vm windows cluster accros the 2 hosts.
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 
guests with rhcs and vmware-tools installed (very important). Works very 
very well ...

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From kristoffer.lippert at jppol.dk  Fri Jun 29 15:23:20 2007
From: kristoffer.lippert at jppol.dk (Kristoffer Lippert)
Date: Fri, 29 Jun 2007 17:23:20 +0200
Subject: [linux-cluster] multipath issue... Smells of hardware issue.
Message-ID: <00B9BFA1C44A674794C9A1A4F5A22CA51A79EB@exchsrv07.rootdom.dk>

Hi,

I have a setup with two identical RX200s3 FuSi servers talking to a SAN
(SX60 + extra controller), and that works fine with gfs1.

I do however see some errors on one of the servers. It's in my message
log and only now and then now and then (though always under load, but i
cant load it and thereby force it to give the error).

The error says:
Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed
Jun 28 15:44:17 app02 multipathd: main_disk_volume1: remaining active
paths: 1
Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI error: return code =
0x00070000
Jun 28 15:44:17 app02 kernel: end_request: I/O error, dev sdb, sector
705160231
Jun 28 15:44:17 app02 kernel: device-mapper: multipath: Failing path
8:16.
Jun 28 15:44:22 app02 multipathd: sdb: readsector0 checker reports path
is up
Jun 28 15:44:22 app02 multipathd: 8:16: reinstated
Jun 28 15:44:22 app02 multipathd: main_disk_volume1: remaining active
paths: 2
Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed
Jun 28 15:46:02 app02 multipathd: main_disk_volume1: remaining active
paths: 1
Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI error: return code =
0x00070000
Jun 28 15:46:02 app02 kernel: end_request: I/O error, dev sdc, sector
739870727
Jun 28 15:46:02 app02 kernel: device-mapper: multipath: Failing path
8:32.
Jun 28 15:46:06 app02 multipathd: sdc: readsector0 checker reports path
is up
Jun 28 15:46:06 app02 multipathd: 8:32: reinstated
Jun 28 15:46:06 app02 multipathd: main_disk_volume1: remaining active
paths: 2

To me i looks like a fiber that bounces up and down. (There is no switch
involved).

Sometimes i only get a slightly shorter version:
Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI error: return code =
0x00070000
Jun 29 09:04:32 app02 kernel: end_request: I/O error, dev sdb, sector
2782490295
Jun 29 09:04:32 app02 kernel: device-mapper: multipath: Failing path
8:16.
Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed
Jun 29 09:04:32 app02 multipathd: main_disk_volume1: remaining active
paths: 1
Jun 29 09:04:37 app02 multipathd: sdb: readsector0 checker reports path
is up
Jun 29 09:04:37 app02 multipathd: 8:16: reinstated
Jun 29 09:04:37 app02 multipathd: main_disk_volume1: remaining active
paths: 2


Any sugestions, but start swapping hardware?

Mvh / Kind regards

Kristoffer Lippert
Systemansvarlig
JP/Politiken A/S
Online Magasiner

Tlf. +45 8738 3032
Cell. +45 6062 8703

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070629/9be6414e/attachment.htm>

From Christopher.Barry at qlogic.com  Fri Jun 29 16:26:14 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Fri, 29 Jun 2007 12:26:14 -0400
Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate
In-Reply-To: <20070629073554.GF29854@helsinki.fi>
References: <20070628204911.GE3221@helsinki.fi>
	<20070628214907.GA3768@helsinki.fi>
	<20070629072141.GE29854@helsinki.fi>
	<20070629073554.GF29854@helsinki.fi>
Message-ID: <1183134374.10344.51.camel@localhost>

On Fri, 2007-06-29 at 10:35 +0300, Janne Peltonen wrote:
> There seems to be great variation in the cycle time in different SAN
> load conditions:
> 
> On Fri, Jun 29, 2007 at 10:21:42AM +0300, Janne Peltonen wrote:
> > [jmmpelto at pcn1 ~]$ time sudo service clvmd restart
> > Deactivating VG mappi-primary:   0 logical volume(s) in volume group "mappi-primary" now active
> >                                                            [  OK  ]
> > Stopping clvm:                                             [  OK  ]
> > Starting clvmd:                                            [  OK  ]
> > Activating VGs:   2 logical volume(s) in volume group "main" now active
> >   78 logical volume(s) in volume group "mappi-primary" now active
> >                                                            [  OK  ]
> > 
> > real    4m40.448s
> > user    0m0.662s
> > sys     0m0.299s
> 
> (added and reduced 10 PV's) (and the activity on the SAN on other nodes
> decreased)
> 
> [jmmpelto at pcn1 ~]$ time sudo service clvmd restart
> Deactivating VG mappi-primary:   0 logical volume(s) in volume group "mappi-primary" now active
>                                                            [  OK  ]
> Stopping clvm:                                             [  OK  ]
> Starting clvmd:                                            [  OK  ]
> Activating VGs:   2 logical volume(s) in volume group "main" now active
>   78 logical volume(s) in volume group "mappi-primary" now active
>                                                            [  OK  ]
> 
> real    1m54.891s
> user    0m0.672s
> sys     0m0.324s
> [jmmpelto at pcn1 ~]$ time sudo service clvmd restart
> Password:
> Deactivating VG mappi-primary:   0 logical volume(s) in volume group "mappi-primary" now active
>                                                            [  OK  ]
> Stopping clvm:                                             [  OK  ]
> Starting clvmd:                                            [  OK  ]
> Activating VGs:   2 logical volume(s) in volume group "main" now active
>   78 logical volume(s) in volume group "mappi-primary" now active
>                                                            [  OK  ]
> 
> real    2m3.736s
> user    0m0.660s
> sys     0m0.321s
> 
> 
> --Janne


What's interesting to me here is the huge difference in real vs. user or
sys time. It appears to spend most of the time waiting around.

Can you trace the process to see what it's doing and where it sits and
waits?



-- 
Regards,
-C



From jparsons at redhat.com  Fri Jun 29 16:34:18 2007
From: jparsons at redhat.com (jim parsons)
Date: Fri, 29 Jun 2007 12:34:18 -0400
Subject: [Linux-cluster] a cluster in a cluster
In-Reply-To: <468522EB.9070404@gmail.com>
References: <802137.86357.qm@web403.biz.mail.mud.yahoo.com>
	<468522EB.9070404@gmail.com>
Message-ID: <1183134858.3313.2.camel@localhost.localdomain>

On Fri, 2007-06-29 at 17:19 +0200, carlopmart wrote:
> 
> steven holmes wrote:
> > has any one tried to build a storage cluster and then build vmware on 
> > those hosts and make vm windows cluster accros the 2 hosts.
> > 
> > 
> > ------------------------------------------------------------------------
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 
> guests with rhcs and vmware-tools installed (very important). Works very 
> very well ...
> 
This works OK in RHEL5 using xen as well. Here is a recipe for how to do
it in the interface:
http://sourceware.org/cluster/conga/cookbook/VMs_as_clusters/

Even if you don't use the interface for deployment, the pre-requisite
list shown in these slides is valuable.

-J



From breeves at redhat.com  Fri Jun 29 16:37:08 2007
From: breeves at redhat.com (Bryn M. Reeves)
Date: Fri, 29 Jun 2007 17:37:08 +0100
Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate
In-Reply-To: <1183134374.10344.51.camel@localhost>
References: <20070628204911.GE3221@helsinki.fi>	<20070628214907.GA3768@helsinki.fi>	<20070629072141.GE29854@helsinki.fi>	<20070629073554.GF29854@helsinki.fi>
	<1183134374.10344.51.camel@localhost>
Message-ID: <46853534.4030601@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Christopher Barry wrote:
> What's interesting to me here is the huge difference in real vs. user or
> sys time. It appears to spend most of the time waiting around.
> 
> Can you trace the process to see what it's doing and where it sits and
> waits?
> 

It'll be waiting for I/O to all the metadata areas on all the different PVs.

Regards,
Bryn.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGhTU06YSQoMYUY94RArmcAKDVubXRwakPZM24kyDBmdk/v8V9IQCgohPF
edZ1PUYQStsZheHmnzO74bI=
=bu6q
-----END PGP SIGNATURE-----



From Christopher.Barry at qlogic.com  Fri Jun 29 16:47:02 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Fri, 29 Jun 2007 12:47:02 -0400
Subject: [Linux-cluster] a cluster in a cluster
In-Reply-To: <1183134858.3313.2.camel@localhost.localdomain>
References: <802137.86357.qm@web403.biz.mail.mud.yahoo.com>
	<468522EB.9070404@gmail.com>
	<1183134858.3313.2.camel@localhost.localdomain>
Message-ID: <1183135623.10344.54.camel@localhost>

On Fri, 2007-06-29 at 12:34 -0400, jim parsons wrote:
> On Fri, 2007-06-29 at 17:19 +0200, carlopmart wrote:
> > 
> > steven holmes wrote:
> > > has any one tried to build a storage cluster and then build vmware on 
> > > those hosts and make vm windows cluster accros the 2 hosts.
> > > 
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 
> > guests with rhcs and vmware-tools installed (very important). Works very 
> > very well ...
> > 
> This works OK in RHEL5 using xen as well. Here is a recipe for how to do
> it in the interface:
> http://sourceware.org/cluster/conga/cookbook/VMs_as_clusters/
> 
> Even if you don't use the interface for deployment, the pre-requisite
> list shown in these slides is valuable.
> 
> -J
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


Has anyone created a cluster on VMware ESX 3.0.x with FC SAN yet?



-- 
Regards,
-C



From Christopher.Barry at qlogic.com  Fri Jun 29 16:48:24 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Fri, 29 Jun 2007 12:48:24 -0400
Subject: [Linux-cluster] Clustered VGs with many PVs slow to activate
In-Reply-To: <46853534.4030601@redhat.com>
References: <20070628204911.GE3221@helsinki.fi>
	<20070628214907.GA3768@helsinki.fi>	<20070629072141.GE29854@helsinki.fi>
	<20070629073554.GF29854@helsinki.fi>
	<1183134374.10344.51.camel@localhost> <46853534.4030601@redhat.com>
Message-ID: <1183135705.10344.57.camel@localhost>

On Fri, 2007-06-29 at 17:37 +0100, Bryn M. Reeves wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Christopher Barry wrote:
> > What's interesting to me here is the huge difference in real vs. user or
> > sys time. It appears to spend most of the time waiting around.
> > 
> > Can you trace the process to see what it's doing and where it sits and
> > waits?
> > 
> 
> It'll be waiting for I/O to all the metadata areas on all the different PVs.
> 
> Regards,
> Bryn.
> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (GNU/Linux)
> Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org
> 
> iD8DBQFGhTU06YSQoMYUY94RArmcAKDVubXRwakPZM24kyDBmdk/v8V9IQCgohPF
> edZ1PUYQStsZheHmnzO74bI=
> =bu6q
> -----END PGP SIGNATURE-----
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


And is this currently a serial event?


-- 
Regards,
-C



From sholmes at surf7.com  Fri Jun 29 17:03:53 2007
From: sholmes at surf7.com (steven holmes)
Date: Fri, 29 Jun 2007 10:03:53 -0700 (PDT)
Subject: [Linux-cluster] a cluster in a cluster
In-Reply-To: <468522EB.9070404@gmail.com>
Message-ID: <33489.66750.qm@web406.biz.mail.mud.yahoo.com>

i am running into a small problem i get the windows cluster built but when i do the msdtc the server i do it from the other server can not write to it.

carlopmart <carlopmart at gmail.com> wrote:  

steven holmes wrote:
> has any one tried to build a storage cluster and then build vmware on 
> those hosts and make vm windows cluster accros the 2 hosts.
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 
guests with rhcs and vmware-tools installed (very important). Works very 
very well ...

-- 
CL Martinez
carlopmart {at} gmail {d0t} com

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070629/3d249493/attachment.htm>

From sholmes at surf7.com  Fri Jun 29 17:07:44 2007
From: sholmes at surf7.com (steven holmes)
Date: Fri, 29 Jun 2007 10:07:44 -0700 (PDT)
Subject: [Linux-cluster] a cluster in a cluster
In-Reply-To: <1183135623.10344.54.camel@localhost>
Message-ID: <580106.36531.qm@web404.biz.mail.mud.yahoo.com>

yes but you can not use vmotion when you do this.

Christopher Barry <Christopher.Barry at qlogic.com> wrote:  On Fri, 2007-06-29 at 12:34 -0400, jim parsons wrote:
> On Fri, 2007-06-29 at 17:19 +0200, carlopmart wrote:
> > 
> > steven holmes wrote:
> > > has any one tried to build a storage cluster and then build vmware on 
> > > those hosts and make vm windows cluster accros the 2 hosts.
> > > 
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 
> > guests with rhcs and vmware-tools installed (very important). Works very 
> > very well ...
> > 
> This works OK in RHEL5 using xen as well. Here is a recipe for how to do
> it in the interface:
> http://sourceware.org/cluster/conga/cookbook/VMs_as_clusters/
> 
> Even if you don't use the interface for deployment, the pre-requisite
> list shown in these slides is valuable.
> 
> -J
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


Has anyone created a cluster on VMware ESX 3.0.x with FC SAN yet?



-- 
Regards,
-C

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070629/6a4a9f1f/attachment.htm>

From sholmes at surf7.com  Fri Jun 29 17:12:27 2007
From: sholmes at surf7.com (steven holmes)
Date: Fri, 29 Jun 2007 10:12:27 -0700 (PDT)
Subject: [Linux-cluster] a cluster in a cluster
In-Reply-To: <468522EB.9070404@gmail.com>
Message-ID: <714947.4942.qm@web414.biz.mail.mud.yahoo.com>

this is what i used do you have the recipe you used

carlopmart <carlopmart at gmail.com> wrote:   

steven holmes wrote:
> has any one tried to build a storage cluster and then build vmware on 
> those hosts and make vm windows cluster accros the 2 hosts.
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4 
guests with rhcs and vmware-tools installed (very important). Works very 
very well ...

-- 
CL Martinez
carlopmart {at} gmail {d0t} com

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070629/2392f375/attachment.htm>

From carlopmart at gmail.com  Fri Jun 29 19:17:52 2007
From: carlopmart at gmail.com (carlopmart)
Date: Fri, 29 Jun 2007 21:17:52 +0200
Subject: [Linux-cluster] a cluster in a cluster
In-Reply-To: <714947.4942.qm@web414.biz.mail.mud.yahoo.com>
References: <714947.4942.qm@web414.biz.mail.mud.yahoo.com>
Message-ID: <46855AE0.8080307@gmail.com>

steven holmes wrote:
> this is what i used do you have the recipe you used
> 
> */carlopmart <carlopmart at gmail.com>/* wrote:
> 
> 
> 
>     steven holmes wrote:
>      > has any one tried to build a storage cluster and then build
>     vmware on
>      > those hosts and make vm windows cluster accros the 2 hosts.
>      >
>      >
>      >
>     ------------------------------------------------------------------------
>      >
>      > --
>      > Linux-cluster mailing list
>      > Linux-cluster at redhat.com
>      > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
>     Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4
>     guests with rhcs and vmware-tools installed (very important). Works
>     very
>     very well ...
> 
>     -- 
>     CL Martinez
>     carlopmart {at} gmail {d0t} com
> 
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com
>     https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
Yes, here it is:

  a) Install two host rhel4 nodes
  b) apply all patches to rhel4
  c) Install development packages (kernel-devel, gcc, etc) on both hosts
  d) prepare some type of shared storage like iscsi, FC san, drbd, etc ...
  e) Install rhcs for rhel4 and do initial configuration, fence device 
included.
  f) configure shared storage on both nodes and format with gfs filesystem.
  g) Install vmware binaries on both servers (also you can install 
vmware binaries on shared storage, but I haven't test it yet)
  h) Install first rhel4 guest, applying all patches and vmware-tools 
(very very very important) on one node
  i) Install second rhel4 guest and do the same like you do it with the 
first guest, and on the same node
  j) Next you need to copy /etc/vmware/vm-* on the second host node
  k) Modify .vmx files to use same uuid on both nodes (you can do this 
starting rhel4 guest on the second node and accept warning about uuid)
  l) Create start and stop scripts for the guests using vmware-cmd command
  m) Do some tests using this scripts.
  n) Configure rhcs on hosts and test guest machines.
  o) Stop virtual machines and configure virtual shared storage.
  p) Install rhcs and gfs on both guest machines
  q) configure cluster services and test it.
  r) finish.


I think that this should work with rhel5, but I haven't test it ....


-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From sholmes at surf7.com  Sat Jun 30 00:57:14 2007
From: sholmes at surf7.com (Steven C Holmes)
Date: Fri, 29 Jun 2007 19:57:14 -0500
Subject: [Linux-cluster] a cluster in a cluster
In-Reply-To: <46855AE0.8080307@gmail.com>
References: <714947.4942.qm@web414.biz.mail.mud.yahoo.com>
	<46855AE0.8080307@gmail.com>
Message-ID: <002801c7bab1$99b6a4d0$cd23ee70$@com>

You did the same thing I did except for the uuid and I bet that is they key
tank you very much I bet this fixes the problem.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of carlopmart
Sent: Friday, June 29, 2007 2:18 PM
To: linux clustering
Subject: Re: [Linux-cluster] a cluster in a cluster

steven holmes wrote:
> this is what i used do you have the recipe you used
> 
> */carlopmart <carlopmart at gmail.com>/* wrote:
> 
> 
> 
>     steven holmes wrote:
>      > has any one tried to build a storage cluster and then build
>     vmware on
>      > those hosts and make vm windows cluster accros the 2 hosts.
>      >
>      >
>      >
>
------------------------------------------------------------------------
>      >
>      > --
>      > Linux-cluster mailing list
>      > Linux-cluster at redhat.com
>      > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
>     Yes, using RHEL4.4 cluster suite, vmware server 1.0.3, and rhel4.4
>     guests with rhcs and vmware-tools installed (very important). Works
>     very
>     very well ...
> 
>     -- 
>     CL Martinez
>     carlopmart {at} gmail {d0t} com
> 
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com
>     https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
Yes, here it is:

  a) Install two host rhel4 nodes
  b) apply all patches to rhel4
  c) Install development packages (kernel-devel, gcc, etc) on both hosts
  d) prepare some type of shared storage like iscsi, FC san, drbd, etc ...
  e) Install rhcs for rhel4 and do initial configuration, fence device 
included.
  f) configure shared storage on both nodes and format with gfs filesystem.
  g) Install vmware binaries on both servers (also you can install 
vmware binaries on shared storage, but I haven't test it yet)
  h) Install first rhel4 guest, applying all patches and vmware-tools 
(very very very important) on one node
  i) Install second rhel4 guest and do the same like you do it with the 
first guest, and on the same node
  j) Next you need to copy /etc/vmware/vm-* on the second host node
  k) Modify .vmx files to use same uuid on both nodes (you can do this 
starting rhel4 guest on the second node and accept warning about uuid)
  l) Create start and stop scripts for the guests using vmware-cmd command
  m) Do some tests using this scripts.
  n) Configure rhcs on hosts and test guest machines.
  o) Stop virtual machines and configure virtual shared storage.
  p) Install rhcs and gfs on both guest machines
  q) configure cluster services and test it.
  r) finish.


I think that this should work with rhel5, but I haven't test it ....


-- 
CL Martinez
carlopmart {at} gmail {d0t} com

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From chris at cmiware.com  Sat Jun 30 18:41:03 2007
From: chris at cmiware.com (Chris Harms)
Date: Sat, 30 Jun 2007 13:41:03 -0500
Subject: [Linux-cluster] IP monitor failing periodically
Message-ID: <4686A3BF.9070609@cmiware.com>

I am experiencing periodic failovers due to a floating IP address not 
passing the status check:

clurgmgrd: [9975]: <warning> Failed to ping 192.168.13.204
Jun 30 11:41:47 nodeA clurgmgrd[9975]: <notice> status on ip 
"192.168.13.204" returned 1 (generic error)

Both nodes have bonded NICs with gigabit connections to redundant 
switches, so it is unlikely they are going down, nothing in the logs 
about linux losing the links.  I parked all the cluster services - 2 
Postgres services and 1 Apache - on one node and allowed it to run 
overnight.  There would be no client activity during this time. One 
Postgres service failed two times in this manner and the other failed 
once in this manner.  The Apache service did not fail.

What can I do to resolve this or get more information out of the system 
to resolve this?

Thanks in advance,
Chris