From pcaulfie at redhat.com Thu Nov 1 08:46:12 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 01 Nov 2007 08:46:12 +0000
Subject: [Linux-cluster] DLM Document
Message-ID: <47299254.4040709@redhat.com>

For those wanting more detail on writing applications to use the Red Hat DLM,
I have prepared this document:

http://people.redhat.com/pcaulfie/docs/rhdlmbook.pdf

It's based quite heavily on the IBM dlmbook document, so many thanks to
Kristin Thomas for that work. It has been updated and modified to include
things specific to our DLM, including the API reference.

Any comments gratefully received.

Patrick

From sanelson at gmail.com Thu Nov 1 11:03:12 2007
From: sanelson at gmail.com (Stephen Nelson-Smith)
Date: Thu, 1 Nov 2007 11:03:12 +0000
Subject: [Linux-cluster] High Availability Virtualisation
Message-ID:

Hello,

I presently run a bunch of OpenVZ VEs on a fairly beefy machine.

I am somewhat concerned that if this machine fails, the VMs fail too.

Other than using redundant hardware (multiple PSUs, mirrored disks, etc.),
how can I increase availability? I could put the virtual environments on a
shared filesystem, but really I'd like some kind of failover mechanism.
Is this asking too much?

This looks interesting: http://www.pro-linux.de/work/virtual-ha/virtual-ha5.html

But my German is very rusty, so it's heavy going!

Any ideas?

S.

From mike at technomonk.com Thu Nov 1 12:28:18 2007
From: mike at technomonk.com (Mike Preston - Technomonk Industries)
Date: Thu, 01 Nov 2007 12:28:18 +0000
Subject: [Linux-cluster] High Availability Virtualisation
In-Reply-To:
References:
Message-ID: <4729C662.7010605@technomonk.com>

Stephen Nelson-Smith wrote:
> Hello,
>
> I presently run a bunch of OpenVZ VEs on a fairly beefy machine.
>
> I am somewhat concerned that if this machine fails, the VMs fail too.
>
> Other than using redundant hardware (multiple PSUs, mirrored disks, etc.),
> how can I increase availability? I could put the virtual environments on a
> shared filesystem, but really I'd like some kind of failover mechanism.
> Is this asking too much?
>
> This looks interesting: http://www.pro-linux.de/work/virtual-ha/virtual-ha5.html
>
> But my German is very rusty, so it's heavy going!

Google Translate helps:
http://translate.google.com/translate?u=http%3A%2F%2Fwww.pro-linux.de%2Fwork%2Fvirtual-ha%2Fvirtual-ha5.html&langpair=de%7Cen&hl=en&ie=UTF8

How I would do it is with at least one other server. Partition the machines
into multiple Xen domains, each Xen domain running something like VServer
(supported in Debian) or, if you prefer it, OpenVZ. With shared storage
(which is where DRBD comes in; it works like a network RAID 1) you can
seamlessly migrate Xen domains from machine to machine, or restart them on
the other machine in the event of failure. This allows a failed machine to
have all of its Xen domains started up on another server (since DRBD has
kept the storage in sync between them) or, if the downtime is scheduled,
live migrated to other boxes.

Mike

> Any ideas?
>
> S.

--
Mike Preston
mike at technomonk.com
Technomonk Industries
T: +44 (0) 116 2 988 433
M: +44 (0) 7849 72 68 27
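[Editor's sketch: the commands below illustrate the DRBD-plus-Xen approach Mike
describes. The resource name "ve01", target host "nodeb" and config path are
hypothetical, and details such as enabling xend relocation and dual-primary
DRBD for live migration are glossed over.]

    # Check that the DRBD resource backing the guest is connected and in sync:
    drbdadm cstate ve01      # expect "Connected"
    drbdadm dstate ve01      # expect "UpToDate/UpToDate"

    # Scheduled downtime: move the running guest to the other host
    xm migrate --live ve01 nodeb

    # Unplanned failure of the first host: promote the replica on the
    # survivor and restart the guest from its config file
    drbdadm primary ve01
    xm create /etc/xen/ve01.cfg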
From bmarzins at redhat.com Thu Nov 1 17:49:39 2007
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Thu, 1 Nov 2007 12:49:39 -0500
Subject: [Linux-cluster] fence gnbd doesn't work as expected
In-Reply-To: <4726E4B7.9000700@gmail.com>
References: <4726E4B7.9000700@gmail.com>
Message-ID: <20071101174939.GD3435@ether.msp.redhat.com>

On Tue, Oct 30, 2007 at 09:00:55AM +0100, carlopmart wrote:
> Hi all,
>
> I have already installed a two-node cluster using gnbd as a fence device.
> When the two nodes come up at the same time everything works OK, but when
> I need to start only one node, GFS doesn't mount because the fence device
> doesn't work. The error is:
>
>   Mounting GFS filesystems: /sbin/mount.gfs: lock_dlm_join: gfs_controld
>   join error: -22
>   /sbin/mount.gfs: error mounting lockproto lock_dlm.
>
> I am using a third server as a GNBD server without serving disks. Why
> doesn't this work? Perhaps I need a quorum disk?

Let me see if I understand what you are doing. You want to use fence_gnbd as
your fence device, but the nodes in your cluster aren't actually using gnbd
devices for their shared storage. If this is true, it won't work at all.

All fence_gnbd guarantees is that the fenced node will not be able to access
its gnbd devices. If the GFS filesystems are on the gnbd devices, this will
keep the fenced node from being able to corrupt them. If a GFS filesystem is
not on a GNBD device, fence_gnbd does nothing at all to protect it from
corruption.

You really need a quorum disk to deal with this.

-Ben

> My cluster.conf:
>
> [quoted cluster.conf omitted; the XML did not survive the archive]
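[Editor's sketch: a minimal quorum-disk setup of the kind Ben recommends for a
two-node cluster. The device path, label, votes and timings are illustrative
and would need tuning for a real deployment.]

    # Initialise the quorum disk on a small shared LUN visible to both nodes
    mkqdisk -c /dev/sdc1 -l hpulabs_qdisk

    # Then, in cluster.conf, add something along the lines of:
    #   <quorumd interval="1" tko="10" votes="1" label="hpulabs_qdisk"/>
    # and raise <cman expected_votes="..."/> to account for the extra vote.

    # Finally, start the daemon on every node
    service qdiskd start
    chkconfig qdiskd on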
From lhh at redhat.com Fri Nov 30 01:39:54 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Nov 2007 20:39:54 -0500
Subject: [Linux-cluster] I give up
In-Reply-To: <474EED84.8050506@efacec.pt>
References: <474D9334.3020602@bxwa.com> <1196268313.2827.21.camel@localhost.localdomain> <474DA9BF.7050006@bxwa.com> <1196281964.2827.76.camel@localhost.localdomain> <474DDD32.1020908@bxwa.com> <474DEB48.3000209@redhat.com> <474DF79D.6040403@bxwa.com> <474EEADB.9040908@bxwa.com> <474EED84.8050506@efacec.pt>
Message-ID: <1196386794.10025.13.camel@ayanami.boston.devel.redhat.com>

On Thu, 2007-11-29 at 16:49 +0000, Marcos David wrote:
> There are no failover domains defined in your cluster.conf.
> This could explain why no other nodes take over the service....

His logs didn't indicate a failover domain problem - one of the nodes didn't
try to fence the other when it should have, as far as I can tell.

-- Lon

From lhh at redhat.com Fri Nov 30 01:40:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Nov 2007 20:40:51 -0500
Subject: [Linux-cluster] I give up
In-Reply-To: <474EF015.10103@bxwa.com>
References: <474D9334.3020602@bxwa.com> <1196268313.2827.21.camel@localhost.localdomain> <474DA9BF.7050006@bxwa.com> <1196281964.2827.76.camel@localhost.localdomain> <474DDD32.1020908@bxwa.com> <474DEB48.3000209@redhat.com> <474DF79D.6040403@bxwa.com> <474EEADB.9040908@bxwa.com> <474EED84.8050506@efacec.pt> <474EF015.10103@bxwa.com>
Message-ID: <1196386851.10025.15.camel@ayanami.boston.devel.redhat.com>

On Thu, 2007-11-29 at 09:00 -0800, Scott Becker wrote:
> Marcos David wrote:
> > There are no failover domains defined in your cluster.conf.
> > This could explain why no other nodes take over the service....
>
> It's my understanding from the man pages that failover domains are an
> option to configure a service to run on only a subset of the nodes.

Correct. Or an ordered set of nodes, e.g. if you want a service to prefer
node 1, for example.

-- Lon

From lhh at redhat.com Fri Nov 30 01:41:37 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Nov 2007 20:41:37 -0500
Subject: [Linux-cluster] I give up
In-Reply-To: <1196356312.16961.5.camel@localhost.localdomain>
References: <474D9334.3020602@bxwa.com> <1196268313.2827.21.camel@localhost.localdomain> <474DA9BF.7050006@bxwa.com> <1196281964.2827.76.camel@localhost.localdomain> <474DDD32.1020908@bxwa.com> <474DEB48.3000209@redhat.com> <474DF79D.6040403@bxwa.com> <474EEADB.9040908@bxwa.com> <474EED84.8050506@efacec.pt> <474EF015.10103@bxwa.com> <1196356312.16961.5.camel@localhost.localdomain>
Message-ID: <1196386897.10025.17.camel@ayanami.boston.devel.redhat.com>

On Thu, 2007-11-29 at 18:11 +0100, jr wrote:
> from my understanding a failover domain is required whenever you want
> other nodes to take over a service. the subset is if you make it
> restricted, isn't it?
> regards,

They're optional. If you don't define one, it's the same as saying
"unordered, unrestricted failover domain of all nodes in the cluster".

-- Lon

From gsrlinux at gmail.com Fri Nov 30 04:03:06 2007
From: gsrlinux at gmail.com (GS R)
Date: Fri, 30 Nov 2007 09:33:06 +0530
Subject: [Linux-cluster] Conga and Ricci certificate
In-Reply-To: <50352.62.101.100.5.1196349932.squirrel@picard.linux.it>
References: <50352.62.101.100.5.1196349932.squirrel@picard.linux.it>
Message-ID:

> With system-config-cluster I made a failover domain with 2 Xen VMs and an
> Apache HTTP server. This cluster works.
>
> Now I'm installing luci and ricci on dom0, and ricci on both domUs.
>
> In Conga:
>
> HOMEBASE ----> Add an existing cluster
>
> I insert the IP of dom0, but when I insert the domU IP I get this error:
>
> The following errors occurred:
>
> * Unable to add the key for node vm03-dadmin.example.prv to the trusted
>   keys list.
> * Unable to connect to the ricci agent on vm03-dadmin.example.prv:
>   Unable to establish an SSL connection to vm03-dadmin.example.prv:11111:
>   ricci's certificate is not trusted
>
> I don't understand. With dom0 and the first domU it's OK.... Only on the
> second domU do I see this error ...
> :-(
>
> On dom0:
>
> tail -f /var/log/messages
>
> Nov 29 16:12:43 zeus03-dom0 luci[28835]: Unable to establish an SSL
> connection to zeus03-dadmin.replynet.prv:11111: ricci's certificate is not
> trusted
> Nov 29 16:26:37 zeus03-dom0 luci[28835]: Error reading from
> zeus03-dadmin.replynet.prv:11111: timeout
> Nov 29 16:26:37 zeus03-dom0 luci[28835]: The SSL certificate for host
> "zeus03-dadmin.replynet.prv" is not trusted. Aborting connection attempt.
> Nov 29 16:26:37 zeus03-dom0 luci[28835]: Unable to establish an SSL
> connection to zeus03-dadmin.replynet.prv:11111: ricci's certificate is not
> trusted
>
> I don't understand.

Hi,

1. Make sure you have the correct entries in /etc/hosts for all the nodes.
   luci refers to /etc/hosts.
2. Start ricci on all the nodes before adding them to the cluster using
   Conga.

luci is not able to identify zeus03-dadmin.replynet.prv. Adding the entries
in /etc/hosts should fix your problem.

One more thing: since you are dealing with Xen VMs, I hope you are aware that
your dom0 should not be a part of its domU cluster and vice versa.

Thanks
GSR

From marcos.david at efacec.pt Fri Nov 30 09:16:11 2007
From: marcos.david at efacec.pt (Marcos David)
Date: Fri, 30 Nov 2007 09:16:11 +0000
Subject: [Linux-cluster] I give up
In-Reply-To: <1196386897.10025.17.camel@ayanami.boston.devel.redhat.com>
References: <474D9334.3020602@bxwa.com> <1196268313.2827.21.camel@localhost.localdomain> <474DA9BF.7050006@bxwa.com> <1196281964.2827.76.camel@localhost.localdomain> <474DDD32.1020908@bxwa.com> <474DEB48.3000209@redhat.com> <474DF79D.6040403@bxwa.com> <474EEADB.9040908@bxwa.com> <474EED84.8050506@efacec.pt> <474EF015.10103@bxwa.com> <1196356312.16961.5.camel@localhost.localdomain> <1196386897.10025.17.camel@ayanami.boston.devel.redhat.com>
Message-ID: <474FD4DB.6030704@efacec.pt>

Lon Hohberger wrote:
> On Thu, 2007-11-29 at 18:11 +0100, jr wrote:
>> from my understanding a failover domain is required whenever you want
>> other nodes to take over a service. the subset is if you make it
>> restricted, isn't it?
>> regards,
>
> They're optional. If you don't define one, it's the same as saying
> "unordered, unrestricted failover domain of all nodes in the cluster".
>
> -- Lon

Ok, thanks for clearing that up.

Greets,
Marcos David

From johannes.russek at io-consulting.net Fri Nov 30 10:23:09 2007
From: johannes.russek at io-consulting.net (jr)
Date: Fri, 30 Nov 2007 11:23:09 +0100
Subject: [Linux-cluster] Live migration of VMs instead of relocation
Message-ID: <1196418189.16961.9.camel@localhost.localdomain>

Hello everybody,

I was wondering if I could somehow get rgmanager to use live migration of VMs
when the preferred member of a failover domain for a certain VM service comes
up again after a failure.

The way it is right now is that if rgmanager detects a failure of a node, the
virtual machine gets taken over by a different node with a lower priority. As
soon as the primary node comes back into the cluster, rgmanager relocates the
VM to that node, which means shutting it down and starting it on that node
again. As I managed to get live migration working in the cluster, I'd like to
have rgmanager make use of that.

Is there a known configuration for this?

Best regards,
johannes russek
From mousavi.ehsan at gmail.com Fri Nov 30 11:30:20 2007
From: mousavi.ehsan at gmail.com (Ehsan Mousavi)
Date: Fri, 30 Nov 2007 15:00:20 +0330
Subject: [Linux-cluster] C-Sharifi
Message-ID:

C-Sharifi Cluster Engine: The Second Success Story on the "Kernel-Level
Paradigm" for Distributed Computing Support

Contrary to the two schools of thought on providing system software support
for distributed computation, which advocate either the development of a whole
new distributed operating system (like Mach) or the development of
library-based or patch-based middleware on top of existing operating systems
(like MPI, Kerrighed and Mosix), Dr. Mohsen Sharifi hypothesized another
school of thought as his thesis in 1986: that all distributed systems software
requirements and supports can be, and must be, built at the kernel level of
existing operating systems; requirements like ease of programming, simplicity,
efficiency, accessibility, etc., which may be coined as "usability".

Although the latter belief was hard to realize, a sample byproduct called DIPC
was built purely on this thesis and openly announced to the Linux community
worldwide in 1993. It was admired for providing the necessary support for
distributed communication at the kernel level of Linux for the first time in
the world, and for providing ease of programming as a consequence of being
realized at the kernel level. However, it was criticized at the same time as
being inefficient. This did not force the school to trade ease of programming
for efficiency; instead, it tried hard to achieve efficiency, alongside ease
of programming and simplicity, without abandoning the school that advocates
the provision of all needs at the kernel level. The result of this effort is
now manifested in the C-Sharifi Cluster Engine.

C-Sharifi is a cost-effective distributed system software engine in support of
high-performance computing on clusters of off-the-shelf computers. It is
wholly implemented in the kernel and, as a consequence of following this
school, it offers ease of programming, ease of clustering and simplicity, and
it can be configured to fit as closely as possible the efficiency requirements
of applications that need high performance. It supports both distributed
shared memory and message passing styles, it is built in Linux, and its
cost/performance ratio in some scientific applications (like meteorology and
cryptanalysis) has shown to be far better than that of non-kernel-based
solutions and engines (like MPI, Kerrighed and Mosix).

Best regards,
Leili Mirtaheri
Ehsan Mousavi
C-Sharifi Development Team

From xbfair at citistreetonline.com Fri Nov 30 14:34:45 2007
From: xbfair at citistreetonline.com (Fair, Brian)
Date: Fri, 30 Nov 2007 09:34:45 -0500
Subject: [Linux-cluster] Adding new file system caused problems
In-Reply-To: <474C5260.6030908@noaa.gov>
References: <474C5260.6030908@noaa.gov>
Message-ID: <97F238EA86B5704DBAD740518CF829100394AE0C@hwpms600.tbo.citistreet.org>

I think this is something we see.
The workaround has basically been to disable clustering (LVM-wise) when doing
this kind of change, and to handle it manually, i.e.:

vgchange -c n              to disable the cluster flag
lvmconf --disable-cluster  on all nodes
rescan/discover the LUN, whatever, on all nodes
lvcreate                   on one node
lvchange --refresh         on every node
lvchange -a y              on one node
gfs_grow                   on one host (you can run this on the other to
                           confirm; it should say it can't grow any more)

When done, I've been putting things back how they were with vgchange -c y and
lvmconf --enable-cluster, though I think if you just left it unclustered it'd
be fine... What you won't want to do is leave the VG clustered but not
--enable-cluster... if you do this, the clustered volume groups won't be
activated when you reboot.

Hope this helps... if anyone knows of a definitive fix for this I'd like to
hear about it. We haven't pushed for it since it isn't too big of a hassle and
we aren't constantly adding new volumes, but it is a pain.

Brian Fair, UNIX Administrator, CitiStreet
904.791.2662

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Randy Brown
Sent: Tuesday, November 27, 2007 12:23 PM
To: linux clustering
Subject: [Linux-cluster] Adding new file system caused problems

I am running a two-node cluster using CentOS 5 that is basically being used as
a NAS head for our iSCSI-based storage. Here are the related RPMs and their
versions I am using:

kmod-gfs-0.1.16-5.2.6.18_8.1.14.el5
kmod-gfs-0.1.16-6.2.6.18_8.1.15.el5
system-config-lvm-1.0.22-1.0.el5
cman-2.0.64-1.0.1.el5
rgmanager-2.0.24-1.el5.centos
gfs-utils-0.1.11-3.el5
lvm2-2.02.16-3.el5
lvm2-cluster-2.02.16-3.el5

This morning I created a 100GB volume on our storage unit and proceeded to
make it available to the cluster so it could be served via NFS to a client on
our network. I used pvcreate and vgcreate as I always do and created a new
volume group. When I went to create the logical volume I saw this message:

Error locking on node nfs1-cluster.nws.noaa.gov: Volume group for uuid not
found: 9crOQoM3V0fcuZ1E2163k9vdRLK7njfvnIIMTLPGreuvGmdB1aqx6KR4t7mmDRDs

I figured I had done something wrong and tried to remove the lvol and
couldn't. lvdisplay showed that the logvol had been created, and vgdisplay
looked good with the exception of the volume not being activated. So I ran
vgchange -aly, which didn't return any error but also did not activate the
volume. I then rebooted the node, which made everything OK. I could now see
the VG and lvol, both were active, and I could now create the GFS filesystem
on the lvol. The filesystem mounted and I thought I was in the clear.

However, node #2 wasn't picking this new filesystem up at all. I stopped the
cluster services on this node, which all stopped cleanly, and then tried to
restart them. cman started fine but clvmd didn't: it hung on the vgscan. Even
after a reboot of node #2, clvmd would not start and would hang on the vgscan.
It wasn't until I shut down both nodes completely and restarted the cluster
that both nodes could see the new filesystem.

I'm sure it's my own ignorance that's making this more difficult than it needs
to be. Am I missing a step? Is more information required to help? Any
assistance in figuring out what happened here would be greatly appreciated. I
know I'm going to need to do similar tasks in the future and obviously can't
afford to bring everything down in order for the cluster to see a new
filesystem.

Thank you,

Randy

P.S. Here is my cluster.conf:
[root at nfs2-cluster ~]# cat /etc/cluster/cluster.conf
[cluster.conf XML not preserved in the archive]
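[Editor's sketch: a concrete version of the sequence Brian describes at the top
of his message, assuming a new LUN behind /sys/class/scsi_host/host0, a volume
group vg_data, a 100G logical volume lv_new and a GFS mount at /fs/shared; all
of these names and sizes are illustrative.]

    # on every node: take LVM out of cluster-locking mode and pick up the LUN
    lvmconf --disable-cluster
    vgchange -c n vg_data
    echo "- - -" > /sys/class/scsi_host/host0/scan   # or your HBA's rescan tool

    # on one node only: create the new logical volume
    lvcreate -L 100G -n lv_new vg_data

    # on every node: refresh the metadata; then activate on one node
    lvchange --refresh vg_data/lv_new
    lvchange -a y vg_data/lv_new

    # grow an existing GFS (or gfs_mkfs a brand-new one), then put the
    # cluster flags back the way they were
    gfs_grow /fs/shared
    lvmconf --enable-cluster
    vgchange -c y vg_data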
From balajisundar at midascomm.com Fri Nov 30 14:59:18 2007
From: balajisundar at midascomm.com (Balaji)
Date: Fri, 30 Nov 2007 20:29:18 +0530
Subject: [Linux-cluster] RHEL4 Update 4 Cluster Suite Download for Testing
Message-ID: <47502546.3070205@midascomm.com>

Dear All,

I have downloaded the Red Hat Enterprise Linux 4 Update 4 AS 30-day evaluation
copy, installed it, and am testing it, and I need the Cluster Suite for it.
The Cluster Suite is not available on the Red Hat site.

Can anyone please send me the link for the Cluster Suite supported on Red Hat
Enterprise Linux 4 Update 4 AS?

Regards
-S.Balaji

From lhh at redhat.com Fri Nov 30 10:18:26 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Nov 2007 05:18:26 -0500
Subject: [Linux-cluster] Live migration of VMs instead of relocation
In-Reply-To: <1196418189.16961.9.camel@localhost.localdomain>
References: <1196418189.16961.9.camel@localhost.localdomain>
Message-ID: <1196417906.2454.18.camel@localhost.localdomain>

On Fri, 2007-11-30 at 11:23 +0100, jr wrote:
> Hello everybody,
> I was wondering if I could somehow get rgmanager to use live migration of
> VMs when the preferred member of a failover domain for a certain VM service
> comes up again after a failure. The way it is right now is that if
> rgmanager detects a failure of a node, the virtual machine gets taken over
> by a different node with a lower priority. As soon as the primary node
> comes back into the cluster, rgmanager relocates the VM to that node, which
> means shutting it down and starting it on that node again. As I managed to
> get live migration working in the cluster, I'd like to have rgmanager make
> use of that.
> Is there a known configuration for this?
> Best regards,

5.1 (+updates) does (or should do?) "migrate-or-nothing" when relocating VMs
back to the preferred node. That is, if it can't do a migrate, it leaves the
VM where it is.

The caveat is of course that the VM is at the top level, with no parent node /
no children in the resource tree (i.e. it shouldn't be a child of a
<service>), like so:

[example snippet not preserved in the archive]

Parent/child dependencies aren't allowed because of the stop/start nature of
other resources: to stop a node, its children must be stopped, but to start a
node, its parents must be started.

Note that currently, as of 5.1, it's pause-migration, not live-migration - to
change this, you need to edit vm.sh and change the "xm migrate ..." command
line to "xm migrate -l ...".

The upside of pause-migration is that it's a simpler and faster overall
operation to transfer the VM from one machine to another. The downside is of
course that your downtime is several seconds during the migrate rather than
the typical <1 sec for live migration.

We plan to switch to live migration as the default instead of pause-migration
(with the ability to select pause migration if desired) in the next update.
Actually, the change is in CVS if you don't want to hax around with the
resource agent:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/cluster/rgmanager/src/resources/vm.sh?rev=1.1.2.9&content-type=text/plain&cvsroot=cluster&only_with_tag=RHEL5

... it hasn't had a lot of testing though. :)

-- Lon
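[Editor's sketch: a small illustration of what Lon describes. The guest name,
target host and cluster.conf fragment are hypothetical, and the exact contents
of vm.sh differ between releases.]

    # In cluster.conf the VM resource sits directly under <rm>, not inside a
    # <service> element, along the lines of:
    #   <rm>
    #     <vm name="guest1" path="/etc/xen" autostart="1"/>
    #   </rm>

    # What vm.sh runs for the default pause (stop-and-copy) migration:
    xm migrate guest1 node2.example.com

    # The change Lon mentions for live migration:
    xm migrate -l guest1 node2.example.com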
From lhh at redhat.com Fri Nov 30 10:19:31 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Nov 2007 05:19:31 -0500
Subject: [Linux-cluster] on bundling http and https
In-Reply-To: <1196367991.5923.19.camel@ubuntu>
References: <1196367991.5923.19.camel@ubuntu>
Message-ID: <1196417971.2454.20.camel@localhost.localdomain>

On Thu, 2007-11-29 at 15:26 -0500, Yanik Doucet wrote:
> Hello
>
> I'm trying Piranha to see if we could throw out our current closed-source
> solution.
>
> My test setup consists of a client, 2 LVS directors and 2 webservers.
>
> I first made a virtual HTTP server and it's working great. Nothing too
> fancy, but I can pull the switch on a director or a webserver with little
> impact on availability.
>
> Now I'm trying to bundle HTTP and HTTPS to make sure the client connects to
> the same server for both protocols. This is where it fails. I have the
> exact same problem as this guy:
>
> http://osdir.com/ml/linux.redhat.piranha/2006-03/msg00014.html
>
> I set up the firewall marks with Piranha, then did the same thing with
> iptables, but when I restart pulse, ipvsadm fails to start the virtual
> service HTTPS as explained in the above link.

If that email is right, it looks like a bug in piranha.

-- Lon

From johannes.russek at io-consulting.net Fri Nov 30 15:23:26 2007
From: johannes.russek at io-consulting.net (jr)
Date: Fri, 30 Nov 2007 16:23:26 +0100
Subject: [Linux-cluster] Live migration of VMs instead of relocation
In-Reply-To: <1196417906.2454.18.camel@localhost.localdomain>
References: <1196418189.16961.9.camel@localhost.localdomain> <1196417906.2454.18.camel@localhost.localdomain>
Message-ID: <1196436206.2437.4.camel@localhost.localdomain>

Hi Lon,

Thank you for your detailed answer. That's very good news; I'm going to update
to 5.1 as soon as that is possible here. I already did the "hax", i.e. added
-l in the resource agent. :)

Thanks!

Regards,
johannes

> We plan to switch to live migration as the default instead of
> pause-migration (with the ability to select pause migration if desired) in
> the next update. Actually, the change is in CVS if you don't want to hax
> around with the resource agent:
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/cluster/rgmanager/src/resources/vm.sh?rev=1.1.2.9&content-type=text/plain&cvsroot=cluster&only_with_tag=RHEL5
>
> ... it hasn't had a lot of testing though. :)
>
> -- Lon

From johannes.russek at io-consulting.net Fri Nov 30 17:05:22 2007
From: johannes.russek at io-consulting.net (jr)
Date: Fri, 30 Nov 2007 18:05:22 +0100
Subject: [Linux-cluster] Adding new file system caused problems
In-Reply-To: <97F238EA86B5704DBAD740518CF829100394AE0C@hwpms600.tbo.citistreet.org>
References: <474C5260.6030908@noaa.gov> <97F238EA86B5704DBAD740518CF829100394AE0C@hwpms600.tbo.citistreet.org>
Message-ID: <1196442322.2437.8.camel@localhost.localdomain>

Is this a bug? I'm getting the exact same thing, only during setup of a new
clustered volume group - no resize or anything. What are the odds of having
the LVM under the GFS not clustered? I can't restart the whole cluster when I
add a new clustered filesystem.

Regards,
johannes

On Friday, 2007-11-30 at 09:34 -0500, Fair, Brian wrote:
> I think this is something we see.
> The workaround has basically been to disable clustering (LVM-wise) when
> doing this kind of change, and to handle it manually, i.e.:
>
> vgchange -c n              to disable the cluster flag
> lvmconf --disable-cluster  on all nodes
> rescan/discover the LUN, whatever, on all nodes
> lvcreate                   on one node
> lvchange --refresh         on every node
> lvchange -a y              on one node
> gfs_grow                   on one host (you can run this on the other to
>                            confirm; it should say it can't grow any more)
>
> When done, I've been putting things back how they were with vgchange -c y
> and lvmconf --enable-cluster, though I think if you just left it
> unclustered it'd be fine... What you won't want to do is leave the VG
> clustered but not --enable-cluster... if you do this, the clustered volume
> groups won't be activated when you reboot.
>
> Hope this helps... if anyone knows of a definitive fix for this I'd like to
> hear about it. We haven't pushed for it since it isn't too big of a hassle
> and we aren't constantly adding new volumes, but it is a pain.
>
> Brian Fair, UNIX Administrator, CitiStreet
> 904.791.2662
>
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Randy Brown
> Sent: Tuesday, November 27, 2007 12:23 PM
> To: linux clustering
> Subject: [Linux-cluster] Adding new file system caused problems
>
> I am running a two-node cluster using CentOS 5 that is basically being used
> as a NAS head for our iSCSI-based storage. Here are the related RPMs and
> their versions I am using:
>
> kmod-gfs-0.1.16-5.2.6.18_8.1.14.el5
> kmod-gfs-0.1.16-6.2.6.18_8.1.15.el5
> system-config-lvm-1.0.22-1.0.el5
> cman-2.0.64-1.0.1.el5
> rgmanager-2.0.24-1.el5.centos
> gfs-utils-0.1.11-3.el5
> lvm2-2.02.16-3.el5
> lvm2-cluster-2.02.16-3.el5
>
> This morning I created a 100GB volume on our storage unit and proceeded to
> make it available to the cluster so it could be served via NFS to a client
> on our network. I used pvcreate and vgcreate as I always do and created a
> new volume group. When I went to create the logical volume I saw this
> message:
>
> Error locking on node nfs1-cluster.nws.noaa.gov: Volume group for uuid not
> found: 9crOQoM3V0fcuZ1E2163k9vdRLK7njfvnIIMTLPGreuvGmdB1aqx6KR4t7mmDRDs
>
> I figured I had done something wrong and tried to remove the lvol and
> couldn't. lvdisplay showed that the logvol had been created, and vgdisplay
> looked good with the exception of the volume not being activated. So I ran
> vgchange -aly, which didn't return any error but also did not activate the
> volume. I then rebooted the node, which made everything OK. I could now see
> the VG and lvol, both were active, and I could now create the GFS
> filesystem on the lvol. The filesystem mounted and I thought I was in the
> clear.
>
> However, node #2 wasn't picking this new filesystem up at all. I stopped
> the cluster services on this node, which all stopped cleanly, and then
> tried to restart them. cman started fine but clvmd didn't: it hung on the
> vgscan. Even after a reboot of node #2, clvmd would not start and would
> hang on the vgscan. It wasn't until I shut down both nodes completely and
> restarted the cluster that both nodes could see the new filesystem.
>
> I'm sure it's my own ignorance that's making this more difficult than it
> needs to be. Am I missing a step? Is more information required to help?
> Any assistance in figuring out what happened here would be greatly
> appreciated.
> I know I'm going to need to do similar tasks in the future and obviously
> can't afford to bring everything down in order for the cluster to see a new
> filesystem.
>
> Thank you,
>
> Randy
>
> P.S. Here is my cluster.conf:
> [root at nfs2-cluster ~]# cat /etc/cluster/cluster.conf
> [quoted cluster.conf XML not preserved in the archive]

From pillai at mathstat.dal.ca Fri Nov 30 18:07:33 2007
From: pillai at mathstat.dal.ca (Balagopal Pillai)
Date: Fri, 30 Nov 2007 14:07:33 -0400
Subject: [Linux-cluster] RHEL4 Update 4 Cluster Suite Download for Testing
In-Reply-To: <47502546.3070205@midascomm.com>
References: <47502546.3070205@midascomm.com>
Message-ID: <47505165.4030905@mathstat.dal.ca>

It can be downloaded from CentOS (http://www.centos.org/):

http://centos.arcticnetwork.ca/4.5/csgfs/

This is for 4.5. The 4.4 one is at http://vault.centos.org/4.4/csgfs/

Balaji wrote:
> Dear All,
>
> I have downloaded the Red Hat Enterprise Linux 4 Update 4 AS 30-day
> evaluation copy, installed it, and am testing it, and I need the Cluster
> Suite for it. The Cluster Suite is not available on the Red Hat site.
>
> Can anyone please send me the link for the Cluster Suite supported on Red
> Hat Enterprise Linux 4 Update 4 AS?
>
> Regards
> -S.Balaji

From scottb at bxwa.com Fri Nov 30 22:57:44 2007
From: scottb at bxwa.com (Scott Becker)
Date: Fri, 30 Nov 2007 14:57:44 -0800
Subject: [Linux-cluster] File system checking
Message-ID: <47509568.905@bxwa.com>

Does anybody know the best way to check that a filesystem is healthy?

I'm working on a light self-check script (to be run once a minute), and
creating a file and checking its existence may not work because of write
caching. Checking the mount status is probably better, but I don't know.

I've had full filesystems, and once the kernel detected an error and
remounted read-only. Other times, when a drive in the RAID array was slowly
failing, it would hang on all I/O for a spell.

If there's an existing source module or a script somebody is aware of, that
would be great.

thanks
scottb
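[Editor's sketch: one possible shape for the kind of once-a-minute self-check
Scott describes - look for a read-only remount in /proc/mounts and push a small
fsync'd write through with a bounded wait. The mount point, timeout and probe
file name are illustrative only.]

    #!/bin/bash
    # Crude health probe for a mounted filesystem (illustrative).
    MNT=/fs/shared          # filesystem to check
    TIMEOUT=15              # seconds to wait for the write probe

    # 1) Is it mounted at all, and has the kernel remounted it read-only
    #    after detecting an error?
    opts=$(awk -v m="$MNT" '$2 == m { print $4 }' /proc/mounts)
    [ -n "$opts" ] || { echo "CRITICAL: $MNT is not mounted"; exit 2; }
    case ",$opts," in
        *,ro,*) echo "CRITICAL: $MNT is mounted read-only"; exit 2 ;;
    esac

    # 2) Can a small write actually reach the disk (not just the page cache)
    #    within a bounded time? conv=fsync forces the data out; the probe is
    #    abandoned if the I/O hangs the way Scott describes.
    probe="$MNT/.healthcheck.$$"
    dd if=/dev/zero of="$probe" bs=4k count=1 conv=fsync >/dev/null 2>&1 &
    pid=$!
    for i in $(seq "$TIMEOUT"); do
        kill -0 "$pid" 2>/dev/null || break
        sleep 1
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"
        echo "CRITICAL: write to $MNT hung for ${TIMEOUT}s"
        exit 2
    fi
    wait "$pid" || { echo "CRITICAL: write to $MNT failed"; exit 2; }
    rm -f "$probe"

    echo "OK: $MNT is mounted read-write and accepting writes"
    exit 0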