From mousavi.ehsan at gmail.com Sat Dec 1 05:34:11 2007
From: mousavi.ehsan at gmail.com (Ehsan Mousavi)
Date: Sat, 1 Dec 2007 09:04:11 +0330
Subject: [Linux-cluster] CSharifi Next Generation of HPC
In-Reply-To: <20071130170010.2F41973C02@hormel.redhat.com>
References: <20071130170010.2F41973C02@hormel.redhat.com>
Message-ID:

C-Sharifi Cluster Engine: The Second Success Story on the "Kernel-Level Paradigm" for Distributed Computing Support

In contrast to the two prevailing schools of thought on system software support for distributed computing, which advocate either building a whole new distributed operating system (like Mach) or building library-based or patch-based middleware on top of existing operating systems (like MPI, Kerrighed and Mosix), Dr. Mohsen Sharifi proposed a third school of thought in his 1986 thesis: that all distributed systems software requirements, such as Ease of Programming, Simplicity, Efficiency and Accessibility (collectively, Usability), can and must be built at the kernel level of existing operating systems.

Although this belief was hard to realize, a sample system called DIPC was built purely on this thesis and announced openly to the Linux community worldwide in 1993. DIPC was praised for providing support for distributed communication at the kernel level of Linux for the first time, and for the ease of programming that followed from the kernel-level approach; at the same time, it was criticized as inefficient. Rather than trade ease of programming for efficiency, the group worked to achieve efficiency alongside ease of programming and simplicity, without abandoning the principle that everything should be provided at the kernel level. The result of that effort is the C-Sharifi Cluster Engine.

C-Sharifi is a cost-effective distributed system software engine for high performance computing on clusters of off-the-shelf computers. It is implemented entirely in the kernel and, as a consequence, offers ease of programming, ease of clustering and simplicity, and it can be configured to match the efficiency requirements of applications that need high performance. It supports both distributed shared memory and message passing, it is implemented in Linux, and in some scientific applications (such as meteorology and cryptanalysis) its cost/performance ratio has proven far better than that of non-kernel-based solutions and engines (like MPI, Kerrighed and Mosix).

Best Regards,
~Ehsan Mousavi
C-Sharifi Development Team

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of linux-cluster-request at redhat.com
Sent: Friday, November 30, 2007 8:30 PM
To: linux-cluster at redhat.com
Subject: Linux-cluster Digest, Vol 43, Issue 46

Send Linux-cluster mailing list submissions to
   linux-cluster at redhat.com

To subscribe or unsubscribe via the World Wide Web, visit
   https://www.redhat.com/mailman/listinfo/linux-cluster
or, via email, send a message with subject or body 'help' to
   linux-cluster-request at redhat.com

You can reach the person managing the list at
   linux-cluster-owner at redhat.com

When replying, please edit your Subject line so it is more specific than "Re: Contents of Linux-cluster digest..."

Today's Topics:

   1. Live migration of VMs instead of relocation (jr)
   2. C-Sharifi (Ehsan Mousavi)
   3. RE: Adding new file system caused problems (Fair, Brian)
   4. RHEL4 Update 4 Cluster Suite Download for Testing (Balaji)
   5. Re: Live migration of VMs instead of relocation (Lon Hohberger)
   6. Re: on bundling http and https (Lon Hohberger)
   7. Re: Live migration of VMs instead of relocation (jr)

----------------------------------------------------------------------

Message: 1
Date: Fri, 30 Nov 2007 11:23:09 +0100
From: jr
Subject: [Linux-cluster] Live migration of VMs instead of relocation
To: linux clustering
Message-ID: <1196418189.16961.9.camel at localhost.localdomain>
Content-Type: text/plain

Hello everybody,
I was wondering if I could somehow get rgmanager to use live migration of VMs when the preferred member of a failover domain for a certain VM service comes up again after a failure. The way it is right now, if rgmanager detects a failure of a node, the virtual machine gets taken over by a different node with a lower priority. As soon as the primary node comes back into the cluster, rgmanager relocates the VM to that node, which means shutting it down and starting it again there. Since I managed to get live migration working in the cluster, I'd like rgmanager to make use of that. Is there a known configuration for this?
best regards,
johannes russek

------------------------------

Message: 2
Date: Fri, 30 Nov 2007 15:00:20 +0330
From: "Ehsan Mousavi"
Subject: [Linux-cluster] C-Sharifi
To: Linux-cluster at redhat.com
Message-ID:
Content-Type: text/plain; charset="iso-8859-1"
C-Sharifi Cluster Engine: The Second Success Story on the "Kernel-Level Paradigm" for Distributed Computing Support

Best Regards,
Leili Mirtaheri
~Ehsan Mousavi
C-Sharifi Development Team

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://www.redhat.com/archives/linux-cluster/attachments/20071130/86c9af20/attachment.html

------------------------------

Message: 3
Date: Fri, 30 Nov 2007 09:34:45 -0500
From: "Fair, Brian"
Subject: RE: [Linux-cluster] Adding new file system caused problems
To: "linux clustering"
Message-ID: <97F238EA86B5704DBAD740518CF829100394AE0C at hwpms600.tbo.citistreet.org>
Content-Type: text/plain; charset="us-ascii"

I think this is something we see too. The workaround has basically been to disable clustering (LVM-wise) when doing this kind of change and to handle it manually, i.e.:

 - vgchange -c n to disable the cluster flag
 - lvmconf --disable-cluster on all nodes
 - rescan/discover the LUN (or whatever applies) on all nodes
 - lvcreate on one node
 - lvchange --refresh on every node
 - lvchange -a y on one node
 - gfs_grow on one host (you can run it on the other to confirm; it should say it can't grow any more)

When done, I've been putting things back how they were with vgchange -c y and lvmconf --enable-cluster, though I think that if you just left it unclustered it would be fine. What you won't want to do is leave the VG clustered but not --enable-cluster; if you do that, the clustered volume groups won't be activated when you reboot.

Hope this helps. If anyone knows of a definitive fix for this I'd like to hear about it. We haven't pushed for one, since it isn't too big of a hassle and we aren't constantly adding new volumes, but it is a pain.

Brian Fair, UNIX Administrator, CitiStreet
904.791.2662
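For reference, Brian's sequence can be written out as a short shell sketch. This is only an illustration: the node list, volume group and logical volume names, the size, and the mount point below are placeholders, the LUN rescan step is storage-specific, and the exact lvmconf/clvmd handling may differ per installation.

#!/bin/sh
# Sketch of the manual workaround described above; all names are placeholders.
NODES="node1 node2"        # all cluster nodes (placeholder)
VG=datavg                  # volume group (placeholder)
LV=newlv                   # new logical volume (placeholder)

vgchange -c n $VG                          # drop the clustered flag on the VG
for n in $NODES; do
    ssh $n "lvmconf --disable-cluster"     # switch LVM to local locking on every node
    # rescan/discover the new LUN on $n here -- this step is storage-specific
done

lvcreate -L 100G -n $LV $VG                # create the LV on one node only
for n in $NODES; do
    ssh $n "lvchange --refresh $VG/$LV"    # make every node re-read the LVM metadata
done
lvchange -a y $VG/$LV                      # activate it on one node
gfs_grow /mnt/gfs                          # or gfs_mkfs + mount for a brand-new volume

vgchange -c y $VG                          # put the clustered flag back
for n in $NODES; do
    ssh $n "lvmconf --enable-cluster"      # and re-enable cluster locking
done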
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Randy Brown
Sent: Tuesday, November 27, 2007 12:23 PM
To: linux clustering
Subject: [Linux-cluster] Adding new file system caused problems

I am running a two-node cluster using CentOS 5 that is basically being used as a NAS head for our iSCSI-based storage. Here are the related RPMs and the versions I am using:

kmod-gfs-0.1.16-5.2.6.18_8.1.14.el5
kmod-gfs-0.1.16-6.2.6.18_8.1.15.el5
system-config-lvm-1.0.22-1.0.el5
cman-2.0.64-1.0.1.el5
rgmanager-2.0.24-1.el5.centos
gfs-utils-0.1.11-3.el5
lvm2-2.02.16-3.el5
lvm2-cluster-2.02.16-3.el5

This morning I created a 100GB volume on our storage unit and proceeded to make it available to the cluster so it could be served via NFS to a client on our network. I used pvcreate and vgcreate as I always do and created a new volume group. When I went to create the logical volume I saw this message:

Error locking on node nfs1-cluster.nws.noaa.gov: Volume group for uuid not found: 9crOQoM3V0fcuZ1E2163k9vdRLK7njfvnIIMTLPGreuvGmdB1aqx6KR4t7mmDRDs

I figured I had done something wrong and tried to remove the logical volume, and couldn't. lvdisplay showed that the logical volume had been created, and vgdisplay looked good except that the volume was not activated. So I ran vgchange -aly, which didn't return any error but also did not activate the volume. I then rebooted the node, which made everything OK: I could now see the VG and the logical volume, both were active, and I could create the GFS file system on the logical volume. The file system mounted and I thought I was in the clear.

However, node #2 wasn't picking this new filesystem up at all. I stopped the cluster services on this node, which all stopped cleanly, and then tried to restart them. cman started fine but clvmd didn't; it hung on the vgscan. Even after a reboot of node #2, clvmd would not start and would hang on the vgscan. It wasn't until I shut down both nodes completely and started the cluster again that both nodes could see the new filesystem.

I'm sure it's my own ignorance that's making this more difficult than it needs to be. Am I missing a step? Is more information required to help? Any assistance in figuring out what happened here would be greatly appreciated. I know I'm going to need to do similar tasks in the future and obviously can't afford to bring everything down in order for the cluster to see a new filesystem.

Thank you,
Randy

P.S. Here is my cluster.conf:
[root at nfs2-cluster ~]# cat /etc/cluster/cluster.conf

-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://www.redhat.com/archives/linux-cluster/attachments/20071130/92c7a845/attachment.html

------------------------------

Message: 4
Date: Fri, 30 Nov 2007 20:29:18 +0530
From: Balaji
Subject: [Linux-cluster] RHEL4 Update 4 Cluster Suite Download for Testing
To: linux-cluster at redhat.com
Message-ID: <47502546.3070205 at midascomm.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Dear All,
I have downloaded the Red Hat Enterprise Linux 4 Update 4 AS 30-day evaluation copy, installed it, and am testing it, and I need the matching Cluster Suite. The Cluster Suite for it is not available on the Red Hat site. Could anyone please send me the link for the Cluster Suite supported on Red Hat Enterprise Linux 4 Update 4 AS?

Regards
-S.Balaji

------------------------------

Message: 5
Date: Fri, 30 Nov 2007 05:18:26 -0500
From: Lon Hohberger
Subject: Re: [Linux-cluster] Live migration of VMs instead of relocation
To: linux clustering
Message-ID: <1196417906.2454.18.camel at localhost.localdomain>
Content-Type: text/plain

On Fri, 2007-11-30 at 11:23 +0100, jr wrote:
> Hello everybody,
> I was wondering if I could somehow get rgmanager to use live migration
> of VMs when the preferred member of a failover domain for a certain VM
> service comes up again after a failure. The way it is right now, if
> rgmanager detects a failure of a node, the virtual machine gets taken
> over by a different node with a lower priority. As soon as the primary
> node comes back into the cluster, rgmanager relocates the VM to that
> node, which means shutting it down and starting it again there. Since I
> managed to get live migration working in the cluster, I'd like rgmanager
> to make use of that.
> Is there a known configuration for this?
> best regards,

5.1 (+updates) does (or should do?) "migrate-or-nothing" when relocating VMs back to the preferred node. That is, if it can't do a migrate, it leaves the VM where it is.
The caveat, of course, is that the VM must be at the top level of the resource tree, with no parent and no children (i.e. it shouldn't be a child of a service). Parent/child dependencies aren't allowed because of the stop/start nature of other resources: to stop a node, its children must be stopped, but to start a node, its parents must be started.

Note that currently, as of 5.1, it's pause-migration, not live-migration. To change this, you need to edit vm.sh and change the "xm migrate ..." command line to "xm migrate -l ...". The upside of pause-migration is that it's a simpler and faster overall operation to transfer the VM from one machine to another. The downside is, of course, that your downtime is several seconds during the migrate rather than the typical <1 sec for live migration.

We plan to switch to live migrate as the default instead of pause-migrate (with the ability to select pause migration if desired) in the next update. Actually, the change is in CVS if you don't want to hax around with the resource agent:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/cluster/rgmanager/src/resources/vm.sh?rev=1.1.2.9&content-type=text/plain&cvsroot=cluster&only_with_tag=RHEL5

... hasn't had a lot of testing though. :)

-- Lon

------------------------------
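The hand edit Lon describes above amounts to a one-line change to the migrate command in the vm.sh resource agent. A sketch of applying it is below; the path /usr/share/cluster/vm.sh is an assumption (the usual install location for rgmanager resource agents), and the exact command line in the script may differ between releases, so check it before editing.

# Switch rgmanager's VM agent from pause-migration to live-migration (sketch).
# Assumes vm.sh lives in /usr/share/cluster/; verify the path and the exact
# "xm migrate" line on your system before changing anything.
cp /usr/share/cluster/vm.sh /usr/share/cluster/vm.sh.orig
sed -i 's/xm migrate /xm migrate -l /' /usr/share/cluster/vm.sh

Pulling the updated vm.sh from the CVS URL above achieves the same thing without hand-editing, at the cost of running a less-tested agent.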
Message: 6
Date: Fri, 30 Nov 2007 05:19:31 -0500
From: Lon Hohberger
Subject: Re: [Linux-cluster] on bundling http and https
To: linux clustering
Message-ID: <1196417971.2454.20.camel at localhost.localdomain>
Content-Type: text/plain

On Thu, 2007-11-29 at 15:26 -0500, Yanik Doucet wrote:
> Hello
>
> I'm trying piranha to see if we could throw out our current closed
> source solution.
>
> My test setup consists of a client, 2 LVS directors and 2 webservers.
>
> I first made a virtual HTTP server and it's working great. Nothing
> too fancy, but I can pull the switch on a director or a webserver with
> little impact on availability.
>
> Now I'm trying to bundle HTTP and HTTPS to make sure the client
> connects to the same server for both protocols. This is where it fails.
> I have the exact same problem as this guy:
>
> http://osdir.com/ml/linux.redhat.piranha/2006-03/msg00014.html
>
> I set up the firewall marks with piranha, then did the same thing with
> iptables, but when I restart pulse, ipvsadm fails to start the virtual
> service HTTPS, as explained in the above link.

If that email is right, it looks like a bug in piranha.

-- Lon

------------------------------

Message: 7
Date: Fri, 30 Nov 2007 16:23:26 +0100
From: jr
Subject: Re: [Linux-cluster] Live migration of VMs instead of relocation
To: linux clustering
Message-ID: <1196436206.2437.4.camel at localhost.localdomain>
Content-Type: text/plain

Hi Lon,
Thank you for your detailed answer. That's very good news; I'm going to update to 5.1 as soon as that is possible here. I already did the "hax", i.e. added -l in the resource agent :)
Thanks!
regards,
johannes

> We plan to switch to live migrate as the default instead of pause-migrate
> (with the ability to select pause migration if desired) in the next
> update. Actually, the change is in CVS if you don't want to hax around
> with the resource agent:
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/cluster/rgmanager/src/resources/vm.sh?rev=1.1.2.9&content-type=text/plain&cvsroot=cluster&only_with_tag=RHEL5
>
> ... hasn't had a lot of testing though. :)
>
> -- Lon
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

------------------------------

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

End of Linux-cluster Digest, Vol 43, Issue 46
*********************************************

From orkcu at yahoo.com Sat Dec 1 11:38:58 2007
From: orkcu at yahoo.com (Roger Peña)
Date: Sat, 1 Dec 2007 03:38:58 -0800 (PST)
Subject: [Linux-cluster] File system checking
In-Reply-To: <47509568.905@bxwa.com>
Message-ID: <839066.46392.qm@web50608.mail.re2.yahoo.com>

--- Scott Becker wrote:
> Does anybody know the best way to check that a
> filesystem is healthy?
>
> I'm working on a light selfcheck script (to be run
> once a minute), and creating a file and checking its
> existence may not work because of write caching.
> Checking the mount status is probably better, but I
> don't know. I've had full filesystems, and once the
> kernel detected an error and remounted read-only.
> Other times, when a drive in the RAID array was
> slowly failing, it would hang on all IO for a spell.
>
> If there's an existing source module or a script
> somebody is aware of, that would be great.

Take a look at fs.sh; you might find some good functions and procedures there to get what you want.

cu
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )
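For the kind of once-a-minute selfcheck Scott describes, a minimal sketch is shown below. It assumes the mount point is /san and that a mount check plus a bounded-time synchronous write is an acceptable health signal; the probe file name, the 30-second limit, and the use of dd's oflag=sync to push the write past the cache are all illustrative choices, not taken from the thread.

#!/bin/sh
# Minimal filesystem health probe (sketch). MNT and LIMIT are placeholders.
MNT=/san
LIMIT=30    # seconds before we treat the filesystem as hung

# 1. Is it mounted, and still read-write?  A kernel-detected error that
#    forces a read-only remount shows up in the options field of /proc/mounts.
awk -v m="$MNT" '$2 == m && $4 ~ /^rw/ { ok = 1 } END { exit !ok }' /proc/mounts || {
    echo "$MNT is missing or read-only" >&2
    exit 1
}

# 2. Push one block straight to disk (oflag=sync bypasses the write cache),
#    with a crude watchdog so a hung array does not also hang the monitor.
dd if=/dev/zero of="$MNT/.fscheck.$$" bs=4096 count=1 oflag=sync >/dev/null 2>&1 &
writer=$!
( sleep "$LIMIT"; kill "$writer" 2>/dev/null ) &
watchdog=$!
if wait "$writer"; then
    rm -f "$MNT/.fscheck.$$"
    kill "$watchdog" 2>/dev/null   # the leftover sleep just expires harmlessly
    exit 0
fi
echo "write to $MNT failed or timed out" >&2
exit 1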
From DRand at amnesty.org Sat Dec 1 12:01:29 2007
From: DRand at amnesty.org (DRand at amnesty.org)
Date: Sat, 1 Dec 2007 12:01:29 +0000
Subject: [Linux-cluster] Restarting a frozen cluster
Message-ID:

Hi there,

We've been trying to get GFS stabilized on Debian for a month or so now. We have a very simple installation with three nodes, started like this:

=========
start-stop-daemon --start --quiet --pidfile $PIDFILE --exec $DAEMON -- $CCSD_OPTIONS
echo "$NAME."
/sbin/lock_gulmd -n aicluster -s cms1,cms2,cmsqa
sleep 1
/bin/mount -t gfs -o acl /dev/sda /san
=========

Every few days the cluster hangs for some reason. That's a separate problem, though. For now our main problem is that the only way to restart is to reboot all three machines; we can't shut down and restart the cluster without a reboot. Here are the messages we're getting:

cms1:/home/alfresco# gulm_tool shutdown cms1
Cannot shutdown cms1. Maybe try unmounting gfs?
cms1:/home/alfresco# umount /san
umount: /san: device is busy
umount: /san: device is busy

Any ideas?

Regards,
Damon.

Working to protect human rights worldwide

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ferry.harmusial at gmail.com Sat Dec 1 14:21:18 2007
From: ferry.harmusial at gmail.com (Ferry Harmusial)
Date: Sat, 1 Dec 2007 15:21:18 +0100
Subject: [Linux-cluster] use of quorum disk 4u5 does not prevent fencing after missing too many heartbeats
Message-ID: <6bc73b1f0712010621k3421bb70gaf2a371dc21485f1@mail.gmail.com>

Hello All,

When I use a quorum disk in 4u5, it does not prevent fencing after missing too many heartbeats.

http://sources.redhat.com/cluster/faq.html#quorum

I set up the heuristic program (ping) so that both nodes still report themselves "fit for duty" even with the cluster communication link disconnected. I have set deadnode_timer to a value more than twice the time needed for the quorum daemon to time out.

Any pointers on what I am missing would be much appreciated.

Kind Regards,
Ferry Harmusial

[root at vm2 ~]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1      localhost.localdomain   localhost
172.16.77.22   vm2.localdomain         vm2
172.16.77.21   vm1.localdomain         vm1

[root at vm2 ~]# ip addr
1: lo: mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:21:9a:75 brd ff:ff:ff:ff:ff:ff
    inet 172.16.169.128/24 brd 172.16.169.255 scope global eth0
    inet6 fe80::20c:29ff:fe21:9a75/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:21:9a:7f brd ff:ff:ff:ff:ff:ff
    inet 172.16.77.22/24 brd 172.16.77.255 scope global eth1
    inet6 fe80::20c:29ff:fe21:9a7f/64 scope link
       valid_lft forever preferred_lft forever
4: sit0: mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0