From anasnajj at gmail.com  Wed Jun  2 02:42:20 2010
From: anasnajj at gmail.com (Anas Alnajjar)
Date: Wed, 2 Jun 2010 05:42:20 +0300
Subject: [Linux-cluster] check  status time out
Message-ID: <000301cb01fd$3b2bc8f0$b1835ad0$@com>

Dear all,

I hope you are all enjoying life.

I have a Red Hat cluster on CentOS 5.4, and I created a Script resource to
handle my service ("/etc/init.d/xxxx"). I need to change the status check
timeout, because my service takes a long time to return its status. How can
I do this?

BR

 


From glisha at gmail.com  Wed Jun  2 14:36:43 2010
From: glisha at gmail.com (Georgi Stanojevski)
Date: Wed, 2 Jun 2010 16:36:43 +0200
Subject: [Linux-cluster] check status time out
In-Reply-To: <000301cb01fd$3b2bc8f0$b1835ad0$@com>
References: <000301cb01fd$3b2bc8f0$b1835ad0$@com>
Message-ID: <AANLkTincnvXgrN_KlRlzi9EIUHrxEx7DwbF5LpHBGlhR@mail.gmail.com>

On Wed, Jun 2, 2010 at 4:42 AM, Anas Alnajjar <anasnajj at gmail.com> wrote:

> I have Redhat cluster on Centos 5.4 and I make Script resource to handle my
> service "/etc/init.d/xxxx" but I need to modify check status time out
> because my service take long time to return back its status so how i can do
> this

According to /usr/share/cluster/script.sh, you can't set a timeout for the
status check:

<!-- This is just a wrapper for LSB init scripts, so monitor
       and status can't have a timeout, nor do they do any extra
       work regardless of the depth -->

So I guess it waits indefinitely for the status script to return?

Are you sure you need to increase the timeout? Does rgmanager kill
your resource because the status check runs for a long time, or because
it returns a non-zero exit code?

I have just the opposite problem. If my status check doesn't return
within, say, 60 s, I need to restart the service, and according to the
comments in script.sh I can't do that?
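
One possible approach to the "status must return within N seconds" case is to enforce the limit inside the init script's own status handler, since script.sh only passes the exit code through. The following is a minimal sketch, assuming bash; run_with_timeout is a hypothetical helper written for this illustration, not something rgmanager provides:

```shell
#!/bin/bash
# Hypothetical helper: run a command, and send it SIGTERM if it is still
# running after the given number of seconds.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
run_with_timeout() {
    local limit=$1 cmd_pid watchdog_pid rc=0
    shift

    # Start the real command in the background and remember its PID.
    "$@" &
    cmd_pid=$!

    # Watchdog: after $limit seconds, terminate the command if it still runs.
    ( sleep "$limit"; kill -TERM "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog_pid=$!

    # Collect the command's exit code (non-zero if the watchdog killed it).
    wait "$cmd_pid" || rc=$?

    # The command finished, so the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true

    return "$rc"
}
```

A status handler could then call something like `run_with_timeout 60 /etc/init.d/myService status` (myService being a placeholder name) and return its exit code, so that a hung check shows up as a failed one.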

-- 
Glisha



From dhoffutt at gmail.com  Wed Jun  2 15:50:09 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Wed, 2 Jun 2010 10:50:09 -0500
Subject: [Linux-cluster] check status time out
In-Reply-To: <AANLkTincnvXgrN_KlRlzi9EIUHrxEx7DwbF5LpHBGlhR@mail.gmail.com>
References: <000301cb01fd$3b2bc8f0$b1835ad0$@com>
	<AANLkTincnvXgrN_KlRlzi9EIUHrxEx7DwbF5LpHBGlhR@mail.gmail.com>
Message-ID: <AANLkTimVS6_6MybtQF0MCTq_XSp2_i0aZC6CsJgTdPkY@mail.gmail.com>

Life is absolutely enjoyable! Hope yours is as well!

What one might consider in such a situation is instead calling a custom
wrapper script...

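Have the custom script do something like:

a "thought script" (untested; date +%s assumes GNU date):

#!/bin/sh
# Grace period after start during which status always reports success.
myTimeOut=60    # seconds; 120 may suit slower services
stampFile=/var/lock/subsys/customScriptStartTimeStamp

case "$1" in
start)
    /etc/init.d/myService start || exit $?
    date +%s > "$stampFile"    # record the start time as epoch seconds
    ;;
stop)
    /etc/init.d/myService stop
    ;;
status)
    serviceStartedAt=$(cat "$stampFile" 2>/dev/null || echo 0)
    now=$(date +%s)
    if [ $((now - serviceStartedAt)) -gt "$myTimeOut" ]; then
        # Grace period is over: report the real status.
        /etc/init.d/myService status
    else
        # Still within the grace period: report success.
        exit 0
    fi
    ;;
esac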

So the wrapper won't start querying the real service for a status until
after the timeout specified in the myTimeOut variable....

Just an idea...



From kitgerrits at gmail.com  Wed Jun  2 21:16:36 2010
From: kitgerrits at gmail.com (Kit Gerrits)
Date: Wed, 2 Jun 2010 23:16:36 +0200
Subject: [Linux-cluster] check status time out
In-Reply-To: <AANLkTimVS6_6MybtQF0MCTq_XSp2_i0aZC6CsJgTdPkY@mail.gmail.com>
Message-ID: <4c06ca34.1067f10a.4fa2.6595@mx.google.com>


You can also try playing with script.sh.
From: http://sources.redhat.com/cluster/wiki/FAQ/RGManager#rgm_svcstart

How can I change the interval at which rgmanager checks a given service?

The interval is in the script for each service, in /usr/share/cluster/ 

It's easier to just change the script.sh file to use whatever value you want
(<5 is not supported, though). Checking is per-resource-type, not
per-service, because it takes more system time to check one resource type
vs. another resource type. 

That is, a check on a "script" might happen only every 30 seconds, while a
check on an "ip" might happen every 10 seconds. 

The status checks are not supposed to consume system resources.
Historically, people have done one of two things that generate support
calls:

*	Not setting a status check interval at all (why is my service not
being checked?), or
*	Setting the status check interval far too low, such as 10 seconds
for an Oracle service (why is the cluster acting strange/running
slowly?).

If the status check interval is lower than the actual amount of time it
takes to check the status of a service, you end up with endless
status-checking, which is a pure waste of resources. 
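
To see which interval an agent currently uses before changing it, one could pull the value out of the agent file. This is a minimal sketch, assuming the agent's metadata contains a line of the form `<action name="status" interval="30s" timeout="0"/>`; that layout is an assumption about your copy of the file, and status_interval is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the interval attribute of the "status" action
# found in a resource agent file. Prints nothing if no such line exists.
status_interval() {
    sed -n 's/.*<action name="status"[^>]*interval="\([^"]*\)".*/\1/p' "$1"
}
```

For the script resource this would be called as `status_interval /usr/share/cluster/script.sh`. Keep a backup of the agent before editing it, since a package update may overwrite local changes.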



From glisha at gmail.com  Thu Jun  3 20:46:39 2010
From: glisha at gmail.com (Georgi Stanojevski)
Date: Thu, 3 Jun 2010 22:46:39 +0200
Subject: [Linux-cluster] only one service fails-over out of two depended
	services.
Message-ID: <AANLkTinNA6HaZfbJHJflr6bGIpiLtPwjFVwqxgL3r7N1@mail.gmail.com>

Hi,

I have configured two services in my two-node cluster (RHEL 5.4).

service1 - with ip, ha-lvm and fs resources.
service2 - with a script resource which depends on service1.

When I manually relocate the services, everything works as expected.

But when I fail one node (halt -f), only service1 gets relocated to
the other node. service2 "stays" on the failed node in the started state.

The logs say that only service1 will be taken over from the failed
node. There is no mention that service2 should be failed over to the
working node.

Jun  3 22:17:35 node1 clurgmgrd[22963]: <info> Waiting for node #2 to be fenced
Jun  3 22:17:43 node1 clurgmgrd[22963]: <info> Node #2 fenced; continuing
Jun  3 22:17:43 node1 clurgmgrd[22963]: <debug> Evaluating RG service:service1, state started, owner node2
Jun  3 22:17:43 node1 clurgmgrd[22963]: <debug> Evaluating RG service:service2, state started, owner node2
Jun  3 22:17:43 node1 clurgmgrd[22963]: <notice> Taking over service service:service1 from down member node2
...
Jun  3 22:17:45 yeti clurgmgrd[22963]: <notice> Service service:service1 started
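
In a longer log, the same gap (a service evaluated but never taken over) could be found mechanically. This is a minimal sketch, assuming one unwrapped syslog line per message in the format quoted above; not_taken_over is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: list services that rgmanager evaluated after a node
# failure but for which no "Taking over service" message was logged.
not_taken_over() {
    awk '
        /Evaluating RG/ {
            # The service name follows the word "RG" and ends with a comma.
            for (i = 1; i < NF; i++)
                if ($i == "RG") { s = $(i + 1); sub(/,$/, "", s); seen[s] = 1 }
        }
        /Taking over service/ {
            # The service name follows the words "over service".
            for (i = 2; i < NF; i++)
                if ($(i - 1) == "over" && $i == "service") taken[$(i + 1)] = 1
        }
        END {
            for (s in seen)
                if (!(s in taken)) print s
        }
    ' "$1"
}
```

Run against the excerpt above, this would report service:service2 only.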

Does anyone have an idea if I am mis-configuring something?

Here is clustat when one node is failed:
===
Cluster Status for cluster1 @ Thu Jun  3 22:34:52 2010
Member Status: Quorate

 Member Name                ID   Status
 ------ ----                ---- ------
 node1                         1 Online, Local, rgmanager
 node2                         2 Offline

 Service Name               Owner (Last)               State
 ------- ----               ----- ------               -----
 service:service1           node1                      started
 service:service2           node2                      started
===

Here is the snippet of cluster.conf regarding the services:
===
        <service autostart="1" exclusive="0" name="service1" recovery="relocate" priority="1">
            <ip ref="10.1.1.1"/>
            <lvm ref="lvm1"/>
            <fs ref="fs1"/>
        </service>
        <service autostart="1" exclusive="0" name="service2" recovery="relocate" depend="service:service1" depend_mode="hard" priority="2">
            <script ref="scriptsresource"/>
        </service>
===

Whole cluster.conf is attached.

Thank you very much for any input.

--
Glisha
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/octet-stream
Size: 2018 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100603/9d7cc2d6/attachment.obj>

From ricardo.maeda at webbertek.com.br  Mon Jun  7 17:41:53 2010
From: ricardo.maeda at webbertek.com.br (Ricardo Masashi Maeda)
Date: Mon, 7 Jun 2010 14:41:53 -0300 (BRT)
Subject: [Linux-cluster] What are the recommend settings when using a
 multipathed device for my cluster's quorum disk?
In-Reply-To: <1006019587.115.1275932205503.JavaMail.root@mail.webbertek>
Message-ID: <1487892257.129.1275932513648.JavaMail.root@mail.webbertek>

Hi, everybody, 

We've configured our qdisk/cman/multipath timeout settings, based on the following KB: http://kbase.redhat.com/faq/docs/DOC-2882.

The cluster is RHCS 5.4 + PowerPath 5.3.1 (1).

Basically, I've tried the following values, as you can see in cluster.conf (2):
PowerPath failover = X = 45 seconds
qdisk failover = X * 1.3 = 58.5 s (tko = 59 at interval = 1 s)
cman failover = X * 2.7 = 121.5 s (token = 122000 ms)
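
The arithmetic above can be reproduced with a short script. This is a sketch only: the 1.3 and 2.7 multipliers are the ones used in this message, rounding up to whole seconds is an assumption, and calc_timeouts is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: derive qdisk tko and totem token values from the
# multipath failover time X (seconds) and the qdisk interval (seconds).
# Usage: calc_timeouts X [INTERVAL]
calc_timeouts() {
    awk -v x="$1" -v interval="${2:-1}" 'BEGIN {
        qdisk = x * 1.3              # qdisk failover time in seconds
        cman  = x * 2.7              # cman failover time in seconds

        # tko counts missed intervals, so round up to a whole number of them.
        tko = int(qdisk / interval)
        if (tko * interval < qdisk) tko++

        # The totem token is given in milliseconds; round up to whole seconds.
        token_s = int(cman)
        if (token_s < cman) token_s++

        printf "tko=%d token=%d\n", tko, token_s * 1000
    }'
}

calc_timeouts 45 1
```

With X = 45 and interval = 1 this prints `tko=59 token=122000`, matching the values in the cluster.conf below.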

However, when we ran a simple test by taking down the heartbeat interface, it took about 6 minutes to fence one of the nodes (3).

We'd like to know whether this behavior is expected.

I really appreciate any help on that!

Thanks!

(1) [root at mercurio dell]# rpm -qi EMCpower.LINUX
Name        : EMCpower.LINUX               Relocations: / 
Version     : 5.3.1.00.00                       Vendor: EMC, Inc.
Release     : 111                           Build Date: Thu 13 Aug 2009 04:01:31 PM BRT
Install Date: Wed 02 Jun 2010 03:01:44 PM BRT      Build Host: lsca2111.lss.emc.com
Group       : System Environment/Kernel     Source RPM: EMCpower.LINUX-5.3.1.00.00-111.src.rpm
Size        : 22070425                         License: Copyright (c) 2002-2009, EMC Corporation. All Rights Reserved.
Signature   : (none)
Summary     : EMC PowerPath
Description :
Multi-path software providing fail-over and load-sharing for SCSI disks.

(2) Source: /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="clu-informix" config_version="17" name="clu-informix">
        <fence_daemon clean_start="0" post_fail_delay="30" post_join_delay="5"/>
        <clusternodes>
                <clusternode name="clu-urano" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="fence_urano"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="clu-gemini" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="fence_gemini"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman quorum_dev_poll="50000" expected_votes="3"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" ipaddr="gemini-ipmi" login="cluster" name="fence_gemini" passwd="clusteraguia" method="cycle"/>
                <fencedevice agent="fence_ipmilan" ipaddr="urano-ipmi" login="cluster" name="fence_urano" passwd="clusteraguia" method="cycle"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="srvkrm" nofailback="0" ordered="0" restricted="0">
                                <failoverdomainnode name="clu-urano" priority="1"/>
                                <failoverdomainnode name="clu-gemini" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="srvvdsa" nofailback="0" ordered="0" restricted="0">
                                <failoverdomainnode name="clu-urano" priority="1"/>
                                <failoverdomainnode name="clu-gemini" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
        ... # Removed service and resource tags
        </rm>
        <totem token="122000"/>
        <quorumd device="/dev/emcpowera1" interval="1" min_score="1" tko="59" votes="1"/>
</cluster>


(3) Heartbeat tests:

[root at gemini ~]# clustat
Member Status: Quorate

 Member Name                                                    ID   Status
 ------ ----                                                    ---- ------
 clu-urano                                                          1 Online, rgmanager
 clu-gemini                                                         2 Online, Local, rgmanager
 /dev/emcpowera1                                                    0 Online, Quorum Disk

 Service Name                                          Owner (Last)                                          State         
 ------- ----                                          ----- ------                                          -----         
 service:srvkrm                                        clu-urano                                             started       
 service:srvvdsa                                       clu-urano                                             started       

(3.1) Removed the heartbeat interface in gemini server, at Jun 7, 13:55:07.

(3.2) After about 80 seconds, gemini logged 'token lost'.
Jun  7 13:56:28 gemini openais[5922]: [TOTEM] The token was lost in the OPERATIONAL state. 
Jun  7 13:56:28 gemini openais[5922]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). 
Jun  7 13:56:28 gemini openais[5922]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). 
Jun  7 13:56:28 gemini openais[5922]: [TOTEM] entering GATHER state from 2. 

(3.3) Then, after another 121 seconds, urano logged the second 'token lost'.
Jun  7 13:58:29 urano openais[5837]: [TOTEM] The token was lost in the OPERATIONAL state. 
Jun  7 13:58:29 urano openais[5837]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). 
Jun  7 13:58:29 urano openais[5837]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). 
Jun  7 13:58:29 urano openais[5837]: [TOTEM] entering GATHER state from 2. 

(3.4) After about another 122 seconds, node urano left the cluster.
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] entering GATHER state from 0. 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] Creating commit token because I am the rep. 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] Saving state aru 34 high seq received 34 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] Storing new sequence id for ring 140 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] entering COMMIT state. 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] entering RECOVERY state. 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] position [0] member 10.1.1.32: 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] previous ring seq 316 rep 10.1.1.32 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] aru 34 high delivered 34 received flag 1 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] Did not need to originate any messages in recovery. 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] Sending initial ORF token 
Jun  7 14:00:32 gemini openais[5922]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun  7 14:00:32 gemini openais[5922]: [CLM  ] New Configuration: 
Jun  7 14:00:32 gemini openais[5922]: [CLM  ]   r(0) ip(10.1.1.32)  
Jun  7 14:00:32 gemini openais[5922]: [CLM  ] Members Left: 
Jun  7 14:00:32 gemini openais[5922]: [CLM  ]   r(0) ip(10.1.1.39)  
Jun  7 14:00:32 gemini openais[5922]: [CLM  ] Members Joined: 
Jun  7 14:00:32 gemini openais[5922]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun  7 14:00:32 gemini openais[5922]: [CLM  ] New Configuration: 
Jun  7 14:00:32 gemini kernel: dlm: closing connection to node 1
Jun  7 14:00:32 gemini openais[5922]: [CLM  ]   r(0) ip(10.1.1.32)  
Jun  7 14:00:32 gemini openais[5922]: [CLM  ] Members Left: 
Jun  7 14:00:32 gemini openais[5922]: [CLM  ] Members Joined: 
Jun  7 14:00:32 gemini openais[5922]: [SYNC ] This node is within the primary component and will provide service. 
Jun  7 14:00:32 gemini openais[5922]: [TOTEM] entering OPERATIONAL state. 
Jun  7 14:00:32 gemini openais[5922]: [CLM  ] got nodejoin message 10.1.1.32 
Jun  7 14:00:32 gemini openais[5922]: [CPG  ] got joinlist message from node 2

(3.5) After 48 seconds (post_fail_delay), urano was fenced.
Jun  7 14:01:20 gemini fenced[5971]: clu-urano not a cluster member after 48 sec post_fail_delay
Jun  7 14:01:20 gemini fenced[5971]: fencing node "clu-urano"
Jun  7 14:01:20 gemini fenced[5971]: fence "clu-urano" success
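
To see where the time goes, the gaps between the timestamps quoted in the logs above can be added up. This is a minimal sketch that assumes all timestamps fall on the same day; to_seconds and elapsed are hypothetical helper names:

```shell
#!/bin/sh
# Seconds since midnight for an HH:MM:SS timestamp.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Elapsed seconds between two same-day timestamps.
# Usage: elapsed START END
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Interface removed, first token loss, second token loss, node left, fenced.
prev=13:55:07
for t in 13:56:28 13:58:29 14:00:32 14:01:20; do
    echo "$prev -> $t: $(elapsed "$prev" "$t") s"
    prev=$t
done
echo "total: $(elapsed 13:55:07 14:01:20) s"
```

The four phases come out at 81, 121, 123 and 48 seconds, 373 seconds in total, which is roughly the six minutes reported. Two of the phases are each about one token timeout (122000 ms) long.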



*Ricardo Masashi Maeda*
Oracle Consultant / DBA
ricardo.maeda at webbertek.com.br

*Webbertek - Professional IT Services*
+55 (41) 4063-8448 - landline
+55 (41) 8834-8354 - mobile





From fdinitto at redhat.com  Tue Jun  8 04:01:03 2010
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 08 Jun 2010 06:01:03 +0200
Subject: [Linux-cluster] Cluster 3.0.13 stable release
Message-ID: <4C0DC07F.3030704@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256


The cluster team and its community are proud to announce the 3.0.13
stable release from the STABLE3 branch.

This release contains a few major bug fixes. We strongly recommend
people to update their clusters.

In order to build/run the 3.0.13 release you will need:

- - corosync 1.2.3
- - openais 1.1.3
- - linux kernel 2.6.31 (only for GFS1 users)

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.13.tar.bz2

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Thanks/congratulations to all people that contributed to achieve this
great milestone.

Happy clustering,
Fabio

Under the hood (from 3.0.12):

Bob Peterson (1):
      Fix device name and mount point in utils

Christine Caulfield (2):
      config: Fix ccs_tool create -n
      cman: fix quorum recalculation when a node is externally killed

David Teigland (2):
      dlm_controld: wrong fencing time comparison
      dlm_controld: wrong fencing time comparison (2)

Fabio M. Di Nitto (2):
      cman init: wait for corosync daemon to exit on stop
      add missing man pages

Jonathan Brassow (1):
      halvm: Fix bug 506587: lvm agent incorrectly reports vg is in
volume_list

Lon Hohberger (1):
      resource-agents: Resolve incorrect default

Marek 'marx' Grac (1):
      fence_wti: Add direct support for WTI VMR

 cman/daemon/commands.c           |    6 +++---
 cman/init.d/cman.in              |    7 +++++++
 cman/man/Makefile                |    3 ++-
 cman/man/cman_notify.8           |   17 +++++++++++++++++
 config/tools/ccs_tool/editconf.c |    2 +-
 fence/agents/wti/fence_wti.py    |    2 +-
 gfs2/libgfs2/misc.c              |   27 +++++++++++++++------------
 group/dlm_controld/cpg.c         |   23 +++++++++++++++++------
 group/man/Makefile               |    2 ++
 group/man/dlm_controld.pcmk.8    |    1 +
 group/man/gfs_controld.pcmk.8    |    1 +
 rgmanager/src/resources/lvm.sh   |    2 +-
 rgmanager/src/resources/vm.sh    |    2 +-
 13 files changed, 69 insertions(+), 26 deletions(-)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBCAAGBQJMDcB9AAoJEFA6oBJjVJ+O1CYP/2ACxWh9jqKGK3PLogZo1ICA
RkkCFu9qGT3bd16xz+38XRUp62Wgwos/TsBiJoiQnjC9275HcWvV2Vye+OIRaGsU
yCo4Rk9bTX0o534yMEh/sijApJjKrhD3uzw4Ed15alfy/8q+7uSUs7m/mi+LsKc3
SJ2UxHJ9I4KUVmtHn8e49dDS8L/0ybwyQCq32pzvsnPJy1XiJObMbfVbcU9+2yPj
dt3C5R0J+/P7mASZdHfIsMvMj0/cpssZnWcJjK1F8pbIEJyBNC6TbYeIo37euD8L
T1mrLkGSx+Z9Lwzct4sq+DaebMXVZEXO9Q63NzpMfrbG1unoQrZDGpTtEI3qVHCp
/pvCYaGtuFpiqyAQpbQnNAWpv4gVP7KKhAdRdaJJ7FpPgQ7Ir/+Cvh5XRz/6dLw5
CL4nhjVciAOyqQs3CUoTd8TnoQmWMZjMwKnWYyiPXlWookzLfhfNYsHyWLbP0MxJ
6W/wR6E0jSVxCrb4tsqfEL2gUWoMaeHitUSNAf+l4cjZp6WVuVBIL+RPya/UOUmE
c+nmkUCkLJs9/aiEdzt2s7hKToOgUbQFcq6fV8MZxC9jYj2rpxypDFLDolB/LYPj
YXtzRfSedrj3U+zxRJRddEZa0/DQcZz2c5ERthQhE7yeNgk6uwiozzfMGh/6g2+y
7vfO0A1fJgb/zqdXQRgj
=9hrR
-----END PGP SIGNATURE-----



From dhoffutt at gmail.com  Fri Jun 11 18:05:02 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Fri, 11 Jun 2010 13:05:02 -0500
Subject: [Linux-cluster] Higher Grained Definition of IP Address Assignments?
Message-ID: <AANLkTinf_FHZt7E6SIJV4sZLOMRAYsHc7xz0Q1JEVXoJ@mail.gmail.com>

Hello.

We have a scenario where a server would have one NIC on a 10.1.1.x network
and multiple NICs on a 10.1.2.x network. The server requires the NICs on the
10.1.2 network each have an independent 10.1.2 network address assigned.

In converting this into an HA solution, we're reaching some difficulty in
that if the Cluster Service is using a Cluster Resourced IP of 10.1.1.50, it
gets assigned to the appropriate NIC on the cluster node.

However, the multiple IP resources on the 10.1.2.x network are all getting
assigned to a single NIC on the physical cluster node instead of spreading
out across those 10.1.2.x NICs.

This particular cluster is Cluster Suite as released with RHEL5.5.

An example cluster.conf snippet regarding IP address resource:

<ip address="10.1.1.101" monitor_link="1"/>
<ip address="10.1.2.101" monitor_link="1"/>
<ip address="10.1.2.102" monitor_link="1"/>
<ip address="10.1.2.103" monitor_link="1"/>

Please imagine that these four IP resource addresses have been assigned to
one cluster service.

Are there any more configurable parameters that can go in there? I would
like to be able to assign these four addresses to four separate NICs on the
cluster node.

Right now what happens is the first IP goes correctly to the cluster node's
NIC that is assigned a 10.1.1 address, and though the cluster node's three
other NICs are assigned 10.1.2 addresses, the three remaining IPs get
bunched up on a single 10.1.2 NIC.

Bonding/aggregating isn't an option unfortunately.

Hope this makes sense...

Thank you,
Dusty

From glisha at gmail.com  Fri Jun 11 21:00:21 2010
From: glisha at gmail.com (Georgi Stanojevski)
Date: Fri, 11 Jun 2010 23:00:21 +0200
Subject: [Linux-cluster] only one service fails-over out of two depended
	services.
In-Reply-To: <AANLkTinNA6HaZfbJHJflr6bGIpiLtPwjFVwqxgL3r7N1@mail.gmail.com>
References: <AANLkTinNA6HaZfbJHJflr6bGIpiLtPwjFVwqxgL3r7N1@mail.gmail.com>
Message-ID: <AANLkTimZ-vogiHxBNLV-fU0hGRozosRsvfYju5Q87WwX@mail.gmail.com>

>        <service autostart="1" exclusive="0" name="service1"
> recovery="relocate" priority="1">
>            <ip ref="10.1.1.1"/>
>            <lvm ref="lvm1"/>
>            <fs ref="fs1"/>
>        </service>
>        <service autostart="1" exclusive="0" name="service2"
> recovery="relocate" depend="service:service1" depend_mode="hard"
> priority="2">
>            <script ref="scriptsresource"/>
>        </service>

If I change the order of the services in cluster.conf, I get "better" results.

I put the dependent service first (service2 before service1).

Now, at least, when the active node fails, the cluster tries to start
service2. Since it can't start without service1 (the startup script
returns -1), the cluster puts it in the failed state. Then it tries to
start service1, and after starting that successfully it starts service2.

-- 
Glisha



From kitgerrits at gmail.com  Sat Jun 12 02:58:42 2010
From: kitgerrits at gmail.com (Kit Gerrits)
Date: Sat, 12 Jun 2010 04:58:42 +0200
Subject: [Linux-cluster] Higher Grained Definition of IP Address
	Assignments?
In-Reply-To: <AANLkTinf_FHZt7E6SIJV4sZLOMRAYsHc7xz0Q1JEVXoJ@mail.gmail.com>
Message-ID: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>

Hello,
 
What you want sounds more like Load Balancing than HA Clustering.
 
I would suggest building an LVS load-balancing cluster with 10.1.1.x as the
front-end network and 10.1.2.x as the back-end network.
Make the LVS the default gateway for your 'cluster servers' (realservers),
then configure 10.1.1.50 on your LVS cluster as the virtual IP, with the
10.1.2.x realservers as the back end using NAT routing.
 
Documentation is available at:
http://www.austintek.com/LVS/LVS-HOWTO/
or, more specifically:
http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-NAT.html
 
LVS should be included in Red Hat Advanced Platform.
 
Yes, running a load-balancing cluster means 2 more servers and 2 more
subscriptions, but it will allow for highly available load balancing
(implicitly allowing you to take realservers offline for maintenance).
 
 
Regards,
 
Kit Gerrits


From michael.lackner at mu-leoben.at  Mon Jun 14 12:00:35 2010
From: michael.lackner at mu-leoben.at (Michael Lackner)
Date: Mon, 14 Jun 2010 14:00:35 +0200
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>
Message-ID: <4C1619E3.7070601@mu-leoben.at>

Hello!

I am currently building a Cluster sitting on CentOS 5 for GFS usage.

At the moment, the storage subsystem consists of an HP MSA2312
Fibre Channel SAN linked to an 8 Gbit FC switch. Three client machines
are connected to that switch over 8 Gbit FC. The disks themselves are
12 x 15,000 rpm SAS drives configured as RAID-5 with two hot spares.

Now, the whole storage shall be shared (single filesystem), here GFS
comes in.

The Cluster is only 3 nodes large at the moment, more nodes will be
added later on. I am currently testing GFS1 and GFS2 for performance.
Lock Management is done over single 1Gbit Ethernet Links (1 per
machine).

The thing is, with a few tunable parameters set, GFS1 performs far
better than the newer GFS2 across the board; for writes, GFS1 is
roughly twice as fast.

But concurrent reads are abysmal. The total write performance
(all nodes combined) sits around 280-330 MB/s, whereas the
read performance is as low as 30-40 MB/s when doing concurrent
reads. Surprisingly, single-node read is somewhat OK at 180 MB/s,
but as soon as several nodes are reading from GFS (version 1 at the
moment) at the same time, things turn ugly.

This is strange, because for writes, global performance across the
cluster increases slightly when adding more nodes. But for reads, the
opposite seems to be true.

For read and write tests, separate test files were created and read for
each node, with each test file sitting in its own subdirectory, so no
node would access another node's file.

GFS1 created with the following mkfs.gfs parameters:
"-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
(4kB block size, 16 * 128MB journals, 2GB resource groups,
Distributed Lock Manager)

Mount Options set: "noatime,nodiratime,noquota"

Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
demote_secs 20"
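
For reference, GFS1 tunables like these are set per mount point with
"gfs_tool settune". A minimal sketch, assuming a hypothetical mount
point /mnt/gfs; the function name apply_gfs_tunables and the DRY_RUN
switch are made up for illustration, and by default the commands are
only printed, not executed:

```shell
#!/bin/sh
# Sketch only: emit one "gfs_tool settune" call per tunable listed above.
# The mount point is passed in; /mnt/gfs in the usage line is hypothetical.
apply_gfs_tunables() {
    mountpoint=$1
    for setting in "glock_purge 50" "statfs_slots 128" \
                   "statfs_fast 1" "demote_secs 20"; do
        # $setting is left unquoted on purpose so that it splits
        # into the tunable name and its value.
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo gfs_tool settune "$mountpoint" $setting
        else
            gfs_tool settune "$mountpoint" $setting
        fi
    done
}

# Usage (dry run, prints four commands):
#   apply_gfs_tunables /mnt/gfs
```

Setting DRY_RUN=0 would run the commands for real against the given
mount point.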

Also, in /etc/cluster/cluster.conf, I added this:
<dlm plock_ownership="1" plock_rate_limit="0"/>
<gfs_controld plock_rate_limit="0"/>

Any ideas on how to figure out what's going wrong, and how to
tune GFS1 for better concurrent read performance, or tune GFS2
in general to be competitive/better than GFS1?

I'm dreaming about 300MB/sec read, 300MB/sec write sequentially
and somewhat good reaction times while under heavy sequential
and/or random load. But for now, I just wanna get the seq reading
to work acceptably fast.

Thanks a lot for your help!

-- 
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505



From dhoffutt at gmail.com  Mon Jun 14 12:14:57 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Mon, 14 Jun 2010 07:14:57 -0500
Subject: [Linux-cluster] Higher Grained Definition of IP Address
	Assignments?
Message-ID: <4C161D41.9070001@gmail.com>

Appreciate the info, but indeed what we need is HA.

Perhaps I need to ask whether a cluster developer would be willing to
add a new configuration item to the IP tag in cluster.conf that would
allow one to specify an IP label to apply the IP resource to.

This /could/ be done via a cluster resource script - but then we'd lose 
the ability to have the cluster software monitor the link and relocate 
the service should the link be lost.

Kit Gerrits wrote:
> Hello,
>  
> What you want sounds more like Load Balancing than HA Clustering.
>  
> I would suggest building a lvs load balancing cluster with 10.1.1.x as 
> front-end IP and 10.1.2.x as backend IP.
> Make the LVS the default gateway for your 'cluster servers' 
> (realservers), then configure 10.1.1.50 on your LVS cluster as Virtual 
> IP with the 10.1.2.x realservers as backend using NAT routing.
> 
> Documentation is available at:
> http://www.austintek.com/LVS/LVS-HOWTO/
> or, more specifically:
> http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-NAT.html
>  
> LVS should be included in Red Hat Advanced Platform.
>  
> Yes, running a LoadBalancing cluster means 2 more servers and 2 more 
> subscriptions, but it will allow for highly-available Load Balancing.
> (implicitly allowing you to take realservers offline for maintenance)
>  
>  
> Regards,
>  
> Kit Gerrits



From swhiteho at redhat.com  Mon Jun 14 12:33:53 2010
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 14 Jun 2010 13:33:53 +0100
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4C1619E3.7070601@mu-leoben.at>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>
	<4C1619E3.7070601@mu-leoben.at>
Message-ID: <1276518833.3158.302.camel@localhost.localdomain>

Hi,

On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
> Hello!
> 
> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
> 
> At the moment, the storage subsystem consists of an HP MSA2312
> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines
> are connected to that switch over 8gbit FC. The disks themselves are
> 12 * 15.000rpm SAS configured in RAID-5 with two hotspares.
> 
> Now, the whole storage shall be shared (single filesystem), here GFS
> comes in.
> 
> The Cluster is only 3 nodes large at the moment, more nodes will be
> added later on. I am currently testing GFS1 and GFS2 for performance.
> Lock Management is done over single 1Gbit Ethernet Links (1 per
> machine).
> 
> Thing is, with GFS1 I get far better performance than with the newer
> GFS2 across the board, with a few tunable parameters set, for writes
> GFS1 is roughly twice as fast.
> 
What tests are you running? GFS2 is generally faster than GFS1 except
for streaming writes, which is an area that we are putting some effort
into solving currently. Small writes (one fs block (4k default) or less)
on GFS2 are much faster than on GFS1.

> But, concurrent reads are totally abysmal. The total write performance
> (all nodes combined) sits around 280-330Mbyte/sec, whereas the
> READ performance is as low as 30-40Mbyte/sec when doing concurrent
> reads. Surprisingly, single-node read is somewhat ok at 180Mbyte/sec,
> but as soon as several nodes are reading from GFS (version 1 at the
> moment) at the same time,  things turn ugly.
> 
Reads on GFS2 should be much faster than GFS1, so it sounds as if
something isn't working correctly for some reason. For cached data,
reads on GFS2 should be as fast as ext2/3 since the code path is
identical (to the page cache) and only changes if pages are not cached.
GFS1 does its locking at a higher level, so there will be more overhead
for cached reads in general.

If you prepare the test files for reading all from one node (or even
just a different node from the one on which you are running the read
tests), do make sure that you sync them to disk on that node before
starting the tests, to avoid issues with caching.
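
One way this could look in practice is sketched below. The sync step
is the one described above; dropping the page cache on the reading
nodes is an additional assumption of this sketch (it needs root and is
not something the advice above requires). The function names and the
DROP_CACHES_FILE override are hypothetical.

```shell
#!/bin/sh
# Sketch only. Run flush_writer on the node that created the test files.
flush_writer() {
    # Flush dirty data to disk so other nodes do not have to wait for it.
    sync
}

# Run drop_reader_cache (as root) on each node before its read test.
# Writing 3 to drop_caches asks the kernel to free clean page cache,
# dentries and inodes. DROP_CACHES_FILE exists only so the function can
# be pointed at another file when trying it out.
drop_reader_cache() {
    sync
    echo 3 > "${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}"
}

# Usage:
#   writer node:   flush_writer
#   reader nodes:  drop_reader_cache
```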

> This is strange, because for writes, global performance across the
> cluster increases slightly when adding more nodes. But for reads, the
> opposite seems to be true.
> 
> For read and write tests, separate testfiles were created and read for
> each node, with each testfile sitting in its own subdirectory, so no
> node would access another node's file.
> 
That sounds like a good test set up to me.

> GFS1 created with the following mkfs.gfs parameters:
> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
> (4kB block size, 16 * 128MB journals, 2GB resource groups,
> Distributed Lock Manager)
> 
> Mount Options set: "noatime,nodiratime,noquota"
> 
> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
> demote_secs 20"
You shouldn't normally need to set the glock_purge and demote_secs to
anything other than the default. These settings no longer exist in GFS2
since it makes use of the shrinker subsystem provided by the VM and is
auto-tuning. If your workload is metadata heavy, you could try boosting
the journal size and/or the incore_log_blocks tunable.

> 
> Also, in /etc/cluster/cluster.conf, I added this:
> <dlm plock_ownership="1" plock_rate_limit="0"/>
> <gfs_controld plock_rate_limit="0"/>
> 
> Any ideas on how to figure out what's going wrong, and how to
> tune GFS1 for better concurrent read performance, or tune GFS2
> in general to be competitive/better than GFS1?
> 
> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially
> and somewhat good reaction times while under heavy sequential
> and/or random load. But for now, I just wanna get the seq reading
> to work acceptably fast.
> 
> Thanks a lot for your help!
> 
Can you try doing some I/O direct to the block device so that we can get
an idea of what the raw device can manage? Use dd for both reads and
writes across the nodes (with different disk locations on each node to
simulate different files).
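
A rough sketch of the read half of such a test follows. The device
path, offsets and sizes in the usage lines are hypothetical, as is the
function name raw_read; O_DIRECT (iflag=direct) is an assumption of
this sketch, used to bypass the page cache. The write variant is left
out deliberately, since writing to the raw device destroys whatever
filesystem is on it.

```shell
#!/bin/sh
# Sketch only: raw_read DEVICE OFFSET_MB COUNT_MB
# Reads COUNT_MB megabytes from DEVICE, starting OFFSET_MB megabytes in.
# DD_FLAGS defaults to iflag=direct; set it to an empty string to use
# buffered reads instead (e.g. when trying this on a regular file).
raw_read() {
    dev=$1
    offset_mb=$2
    count_mb=$3
    # skip= is counted in units of bs, so with bs=1M it is an offset in MB.
    dd if="$dev" of=/dev/null bs=1M skip="$offset_mb" count="$count_mb" \
        ${DD_FLAGS-iflag=direct}
}

# Usage, one command per node, started at the same time
# (hypothetical device and offsets):
#   node1:  raw_read /dev/mapper/san_lun 0      20000
#   node2:  raw_read /dev/mapper/san_lun 200000 20000
#   node3:  raw_read /dev/mapper/san_lun 400000 20000
```

dd reports the elapsed time and throughput on stderr when it finishes,
which can be compared with the per-port figures from the switch.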

I'm wondering if the problem might be due to the seek pattern generated
by the multiple read locations.

Steve.




From Martin.Waite at datacash.com  Mon Jun 14 13:18:16 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Mon, 14 Jun 2010 14:18:16 +0100
Subject: [Linux-cluster] Higher Grained Definition of IP
	AddressAssignments?
In-Reply-To: <4C161D41.9070001@gmail.com>
References: <4C161D41.9070001@gmail.com>
Message-ID: <A78DB34D00374344A0AB65B6523C05DC059BE4CC@marsden.win.datacash.com>

Hi,

/usr/share/cluster/ip.sh appears to perform the link-monitoring in the
"status" command, which is called periodically.  I don't know that
either rgmanager or cman or other cluster software are directly involved
in that.

The "ip" configuration already supports an "interface" attribute:

      <ip address="192.168.2.120" interface="eth0" monitor_link="1"/>

regards,

Martin


From michael.lackner at mu-leoben.at  Mon Jun 14 14:21:44 2010
From: michael.lackner at mu-leoben.at (Michael Lackner)
Date: Mon, 14 Jun 2010 16:21:44 +0200
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <1276518833.3158.302.camel@localhost.localdomain>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>	<4C1619E3.7070601@mu-leoben.at>
	<1276518833.3158.302.camel@localhost.localdomain>
Message-ID: <4C163AF8.9000609@mu-leoben.at>

Hello!

Thanks for your reply. Unfortunately, I forgot to mention HOW I was
actually testing. Stupid of me.

I tested with dd, doing reads and writes with a 4kB block size and a
total testfile size of 160GB per node. I read from /dev/zero for the
write tests and wrote to /dev/null for the read tests. So the I/O was
totally sequential, with a somewhat small block size (equal to the
filesystem block size).
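
In shell terms, the tests described above would look roughly like the
sketch below. The paths in the usage lines are hypothetical, the
function names are made up for illustration, and the block count
assumes that 160GB means 160 GiB (160 GiB / 4 KiB = 41943040 blocks).

```shell
#!/bin/sh
# Sketch only: sequential write and read tests with a 4 kB block size.
BLOCK_SIZE=4096

# write_test FILE BLOCKS: stream zeros into FILE.
write_test() {
    dd if=/dev/zero of="$1" bs="$BLOCK_SIZE" count="$2"
}

# read_test FILE: stream FILE into /dev/null.
read_test() {
    dd if="$1" of=/dev/null bs="$BLOCK_SIZE"
}

# Usage on one node (hypothetical path, one subdirectory per node):
#   write_test /mnt/gfs/node1/testfile 41943040
#   sync
#   read_test /mnt/gfs/node1/testfile
```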

The performance was measured directly on the Fibre Channel switch, which
offers nice per-port monitoring for that purpose.

I have yet to do some serious read testing on GFS2. I aborted my GFS2
tests because write performance was not up to GFS1 to begin with. My
older GFS2 benchmarks (I did these with a 2-node configuration before)
are lost, so I will need to redo them to give you some numbers.

After each write test I did a "sync" to flush everything to disk. I did
not do this before or after the read tests, though.

As you mentioned journal size: "gfs_tool counters <mountpoint>" said
that only 2-3% of the log space was in use after the tests (I guess
this is the per-node fs journal?).

As for the direct I/O tests, do you mean testing without ANY caching
going on, i.e. synchronous writes? What I did before was test EXT3
(~190MB/s) and XFS (~320MB/s) on the storage array. I think what I'm
getting here is raw throughput, since I am not monitoring in the OS,
but at the Fibre Channel switch itself.

I will do GFS2 read tests similar to those conducted for GFS1. I'll be
able to do that tomorrow morning, then I can post the numbers here.

Thanks!

-- 
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505



From dhoffutt at gmail.com  Mon Jun 14 14:23:06 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Mon, 14 Jun 2010 09:23:06 -0500
Subject: [Linux-cluster] Higher Grained Definition of IP
	AddressAssignments?
In-Reply-To: <A78DB34D00374344A0AB65B6523C05DC059BE4CC@marsden.win.datacash.com>
References: <4C161D41.9070001@gmail.com>
	<A78DB34D00374344A0AB65B6523C05DC059BE4CC@marsden.win.datacash.com>
Message-ID: <AANLkTilXULFMfoxdTa__pa3VKZAw99jRBMjBo0dUZABZ@mail.gmail.com>

Martin,

A thousand most sincere gratitudes.

This is *exactly* what we need (I'm presuming this attribute looks for an
interface labeled "eth0", as in your example, and applies that 192 address
to it?). Testing immediately!

If you have a moment, where did you find this attribute?




From Martin.Waite at datacash.com  Mon Jun 14 14:40:00 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Mon, 14 Jun 2010 15:40:00 +0100
Subject: [Linux-cluster] Higher Grained Definition of
	IPAddressAssignments?
In-Reply-To: <AANLkTilXULFMfoxdTa__pa3VKZAw99jRBMjBo0dUZABZ@mail.gmail.com>
References: <4C161D41.9070001@gmail.com><A78DB34D00374344A0AB65B6523C05DC059BE4CC@marsden.win.datacash.com>
	<AANLkTilXULFMfoxdTa__pa3VKZAw99jRBMjBo0dUZABZ@mail.gmail.com>
Message-ID: <A78DB34D00374344A0AB65B6523C05DC059BE580@marsden.win.datacash.com>

Dustin,

A thousand sincere apologies.

Unfortunately, tracing through the ip script with this attribute
enabled, I can see that this has absolutely no effect.

Sorry to get your hopes up.

regards,

Martin


From swhiteho at redhat.com  Mon Jun 14 14:48:13 2010
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 14 Jun 2010 15:48:13 +0100
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4C163AF8.9000609@mu-leoben.at>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>
	<4C1619E3.7070601@mu-leoben.at>
	<1276518833.3158.302.camel@localhost.localdomain>
	<4C163AF8.9000609@mu-leoben.at>
Message-ID: <1276526893.3158.317.camel@localhost.localdomain>

Hi,

On Mon, 2010-06-14 at 16:21 +0200, Michael Lackner wrote:
> Hello!
> 
> Thanks for your reply. I unfortunately forgot to mention HOW I was
> actually testing, stupid.
> 
> I tested with dd, doing 4kB blocksize reads and writes, 160GB total
> testfile size per node. I read from /dev/zero for writing tests and
> wrote to /dev/null for reading tests. So, totally sequential, somewhat
> small blocksize (equal to filesystem BS).
> 
> The performance was measured directly on the Fibrechannel Switch, which
> offers nice per-port monitoring for that purpose.
> 
> I have yet to do some serious read testing on GFS2. I have aborted my
> GFS2 tests as write performance was not up to GFS1 to begin with. My
> older GFS2 benchmarks (I did this with a 2-node configuration before)
> are lost, I will need to re-do them to give you some numbers.
> 
Ok, so these are streaming writes, and plenty large enough to be
affected by the gfs2 performance issue. The reason we have that issue in
GFS2 but not GFS1 is that the lock ordering is different. We try to make
maximum use of the page cache in GFS2, which gives us the faster reads,
but also (due to page-at-a-time write code) the slower streaming writes.
The smaller writes are faster because the overall overhead for writing
is lower on GFS2. However, that overhead is per page written on GFS2,
but per write call on GFS1, which results in the slower writes when
streaming on GFS2.

It is pretty tricky to fix because it requires being able to do
multi-page writes which are problematic due to the (page) locking order
requirements.

> After each write test I did a "sync" to flush everything to disks. I
> did not do this before or after read tests though.
> 
> As you mentioned Journal Size, "gfs_tool counters <mountpoint>" said
> that only 2-3% logspace were in use after the tests (I guess this is
> the per-node fs journal?).
> 
You need to measure the log space during the tests rather than at the
end, but since you are doing streaming writes, the amount of metadata is
relatively small anyway, so that's probably not an issue.
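
Sampling during a test could be done with a small loop like the sketch
below, started in the background just before the dd run. The function
name sample_during, the interval, the sample count, the mount point and
the log file name are all hypothetical; "gfs_tool counters" is the
command quoted above, and its output is logged as-is rather than parsed.

```shell
#!/bin/sh
# Sketch only: sample_during INTERVAL_SECONDS SAMPLES COMMAND [ARGS...]
# Runs COMMAND SAMPLES times, INTERVAL_SECONDS apart, printing a
# timestamp before each run.
sample_during() {
    interval=$1
    samples=$2
    shift 2
    i=0
    while [ "$i" -lt "$samples" ]; do
        date '+%H:%M:%S'
        "$@"
        i=$((i + 1))
        # No need to wait after the final sample.
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}

# Usage (hypothetical mount point; 120 samples, 5 seconds apart):
#   sample_during 5 120 gfs_tool counters /mnt/gfs >> counters.log &
```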

> As for the direct I/O tests, by that you mean testing without ANY
> caching going on, a synchronous write? What I did before was test EXT3
> (~190MB/s) and XFS (~320MB/s) on the Storage Array. I think what I'm
> getting here is raw throughput, since I am not monitoring in the OS,
> but at the Fibrechannel Switch itself.
> 
I was thinking of just testing the block device without any fs on it.
That would give you an absolute max figure. However, bearing in mind the
similarities between the GFS2 on-disk layout and ext3, I would expect
the performance to be closer (on a single node basis) to that than to
XFS. There is always going to be some overhead relating to using a
cluster filesystem, so single node tests will be slower. Having
said that, there shouldn't be a huge gap and the scaling wrt the number
of nodes that you are looking for should be achievable.

> I will do GFS2 read tests similar to those conducted for GFS1. I'll be
> able to do that tomorrow morning, then I can post the numbers here.
> 
Ok. That would be interesting. Thanks,

Steve.




From Eric.Johnson at mtsallstream.com  Mon Jun 14 15:08:28 2010
From: Eric.Johnson at mtsallstream.com (Johnson, Eric)
Date: Mon, 14 Jun 2010 10:08:28 -0500
Subject: [Linux-cluster] pvmove with CLVM messages
Message-ID: <CD9C931A046A4A41876F7F5E134298B403AFED3A@PTEEXB02.mtsallstream.com>

We had to move all of our data to a new SAN and used LVM with pvmove to
do so. This worked great. With the 2 RHEL 5.5 clusters we have that use
CLVM, some messages appeared in /var/log/messages while doing this and I
don't know what they mean:

Jun  8 12:23:07 hostname clogd[29492]: *** Region #155906 skipped during
recovery ***
Jun  8 12:23:15 hostname clogd[29492]: *** Region #287056 skipped during
recovery ***
Jun  8 12:23:28 hostname clogd[29492]: *** Region #18165 skipped during
recovery ***
Jun  8 12:23:52 hostname clogd[29492]: *** Region #227549 skipped during
recovery ***
Jun  8 12:29:23 hostname clogd[29492]: *** Region #198876 skipped during
recovery ***
Jun  8 12:31:08 hostname clogd[29492]: *** Region #197478 skipped during
recovery ***
Jun  8 12:31:58 hostname clogd[29492]: *** Region #197538 skipped during
recovery ***

Is this an error? The data seemed to transition fine and database
consistency checks that were done after passed with no errors. Here's
what I have installed for package versions:

kernel-2.6.18-194.3.1.el5
lvm2-2.02.56-8.el5_5.4
lvm2-cluster-2.02.56-7.el5_5.3
cmirror-1.1.39-8.el5
openais-0.80.6-16.el5_5.1
rgmanager-2.0.52-6.el5

Thanks,
Eric

Is it really necessary to print this email?

MTS ALLSTREAM INC. CONFIDENTIALITY WARNING: This email message is confidential and intended only for the named recipient(s). If you are not the intended recipient, or an agent responsible for delivering it to the intended recipient, or if this message has been sent to you in error, you are hereby notified that any review, use, dissemination, distribution or copying of this message or its contents is strictly prohibited. If you have received this message in error, please notify the sender immediately and delete the original message. If there is an agreement attached with this message, such agreement will not be binding until it is signed by all parties named therein.



From Chris.Jankowski at hp.com  Mon Jun 14 15:09:30 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Mon, 14 Jun 2010 15:09:30 +0000
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4C163AF8.9000609@mu-leoben.at>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>
	<4C1619E3.7070601@mu-leoben.at>
	<1276518833.3158.302.camel@localhost.localdomain>
	<4C163AF8.9000609@mu-leoben.at>
Message-ID: <036B68E61A28CA49AC2767596576CD596BA5E2300B@GVW1113EXC.americas.hpqcorp.net>

Michael,

For comparison, could you do your dd(1) tests with a very large block size (1 MB) and tell us the results, please?

I have a vague hunch that the problem may have something to do with coalescing or not of IO operations.

Also, which IO scheduler are you using?
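
For reference, on 2.6 kernels the active elevator of a disk is the bracketed entry in /sys/block/<dev>/queue/scheduler. A minimal sketch for reading it follows; the device name "sda" is only a placeholder and the helper name is made up for illustration.

```shell
#!/bin/sh
# Print the active I/O scheduler from a line such as
# "noop anticipatory deadline [cfq]" (the active one is in brackets).
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Hypothetical device name; substitute the disk that is under test.
dev=sda
sched_file="/sys/block/$dev/queue/scheduler"
if [ -r "$sched_file" ]; then
    active_scheduler "$(cat "$sched_file")"
fi
```

The parsing is kept in its own function so it can be checked against a sample line without touching a real device.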

Thanks and regards,

Chris Jankowski


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
Sent: Tuesday, 15 June 2010 00:22
To: linux clustering
Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems

Hello!

Thanks for your reply. Unfortunately, I forgot to mention HOW I was actually testing. Stupid of me.

I tested with dd, doing reads and writes with a 4kB blocksize and a total testfile size of 160GB per node. I read from /dev/zero for the write tests and wrote to /dev/null for the read tests. So: totally sequential, with a somewhat small blocksize (equal to the filesystem blocksize).

The performance was measured directly on the Fibrechannel switch, which offers nice per-port monitoring for that purpose.

I have yet to do some serious read testing on GFS2. I aborted my GFS2 tests because write performance was not up to GFS1 to begin with. My older GFS2 benchmarks (done earlier with a 2-node configuration) are lost, so I will need to redo them to give you some numbers.

After each write test I did a "sync" to flush everything to disk. I did not do this before or after the read tests, though.

Since you mentioned journal size: "gfs_tool counters <mountpoint>" reported that only 2-3% of the log space was in use after the tests (I guess this is the per-node fs journal?).

As for the direct I/O tests: do you mean testing without ANY caching going on, i.e. synchronous writes? What I did before was test EXT3 (~190MB/s) and XFS (~320MB/s) on the storage array. I think what I'm getting here is raw throughput, since I am monitoring not in the OS but at the Fibrechannel switch itself.

I will do GFS2 read tests similar to those conducted for GFS1. I'll be able to do that tomorrow morning, and then I can post the numbers here.

Thanks!

Steven Whitehouse wrote:
> Hi,
>
> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>   
>> Hello!
>>
>> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
>>
>> At the moment, the storage subsystem consists of an HP MSA2312 
>> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines 
>> are connected to that switch over 8gbit FC. The disks themselves are
>> 12 * 15.000rpm SAS configured in RAID-5 with two hotspares.
>>
>> Now, the whole storage shall be shared (single filesystem), here GFS 
>> comes in.
>>
>> The Cluster is only 3 nodes large at the moment, more nodes will be 
>> added later on. I am currently testing GFS1 and GFS2 for performance.
>> Lock Management is done over single 1Gbit Ethernet Links (1 per 
>> machine).
>>
>> Thing is, with GFS1 I get far better performance than with the newer
>> GFS2 across the board, with a few tunable parameters set, for writes
>> GFS1 is roughly twice as fast.
>>
>>     
> What tests are you running? GFS2 is generally faster than GFS1 except 
> for streaming writes, which is an area that we are putting some effort 
> into solving currently. Small writes (one fs block (4k default) or 
> less) on GFS2 are much faster than on GFS1.
>
>   
>> But, concurrent reads are totally abysmal. The total write 
>> performance (all nodes combined) sits around 280-330Mbyte/sec, 
>> whereas the READ performance is as low as 30-40Mbyte/sec when doing 
>> concurrent reads. Surprisingly, single-node read is somewhat ok at 
>> 180Mbyte/sec, but as soon as several nodes are reading from GFS 
>> (version 1 at the
>> moment) at the same time,  things turn ugly.
>>
>>     
> Reads on GFS2 should be much faster than GFS1, so it sounds as if 
> something isn't working correctly for some reason. For cached data, 
> reads on GFS2 should be as fast as ext2/3 since the code path is 
> identical (to the page cache) and only changes if pages are not cached.
> GFS1 does its locking at a higher level, so there will be more 
> overhead for cached reads in general.
>
> Do make sure that if you are preparing the test files for reading all 
> from one node (or even just a different node to that on which you sre 
> running the read tests) that you need to sync them to disk on that 
> node before starting the tests to avoid issues with caching.
>
>   
>> This is strange, because for writes, global performance across the 
>> cluster increases slightly when adding more nodes. But for reads, the 
>> oppsite seems to be true.
>>
>> For read and write tests, separate testfiles were created and read 
>> for each node, with each testfile sitting in its own subdirectory, so 
>> no node would access another nodes file.
>>
>>     
> That sounds like a good test set up to me.
>
>   
>> GFS1 created with the following mkfs.gfs parameters:
>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>> (4kB blocksite, 16 * 128MB journals, 2GB resource groups, Distributed 
>> LockManager)
>>
>> Mount Options set: "noatime,nodiratime,noquota"
>>
>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
>> demote_secs 20"
>>     
> You shouldn't normally need to set the glock_purge and demote_secs to 
> anything other than the default. These settings no longer exist in 
> GFS2 since it makes use of the shrinker subsystem provided by the VM 
> and is auto-tuning. If your workload is metadata heavy, you could try 
> boosting the journal size and/or the incore_log_blocks tunable.
>
>   
>> Also, in /etc/cluster/cluster.conf, I added this:
>> <dlm plock_ownership="1" plock_rate_limit="0"/> <gfs_controld 
>> plock_rate_limit="0"/>
>>
>> Any ideas on how to figure out what's going wrong, and how to tune 
>> GFS1 for better concurrent read performance, or tune GFS2 in general 
>> to be competitive/better than GFS1?
>>
>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially and 
>> somewhat good reaction times while under heavy sequential and/or 
>> random load. But for now, I just wanna get the seq reading to work 
>> acceptably fast.
>>
>> Thanks a lot for your help!
>>
>>     
> Can you try doing some I/O direct to the block device so that we can 
> get an idea of what the raw device can manage? Using dd both read and 
> write, across the nodes (different disk locations on each node to 
> simulate different files).
>
> I'm wondering if the problem might be due to the seek pattern 
> generated by the multiple read locations,
>
> Steve.
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   
--
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From garromo at us.ibm.com  Mon Jun 14 15:45:45 2010
From: garromo at us.ibm.com (Gary Romo)
Date: Mon, 14 Jun 2010 09:45:45 -0600
Subject: [Linux-cluster] pvmove with CLVM messages
In-Reply-To: <CD9C931A046A4A41876F7F5E134298B403AFED3A@PTEEXB02.mtsallstream.com>
References: <CD9C931A046A4A41876F7F5E134298B403AFED3A@PTEEXB02.mtsallstream.com>
Message-ID: <OF994EA4A3.108BD017-ON87257742.005685E9-87257742.00569670@us.ibm.com>

Hello Eric.

Can we see the pvmove syntax you used?  some examples?

Thank you!

Gary 



From:
"Johnson, Eric" <Eric.Johnson at mtsallstream.com>
To:
<linux-cluster at redhat.com>
Date:
06/14/2010 09:40 AM
Subject:
[Linux-cluster] pvmove with CLVM messages
Sent by:
linux-cluster-bounces at redhat.com



We had to move all of our data to a new SAN and used LVM with pvmove to
do so. This worked great. With the 2 RHEL 5.5 clusters we have that use
CLVM, some messages appeared in /var/log/messages while doing this and I
don't know what they mean:

Jun  8 12:23:07 hostname clogd[29492]: *** Region #155906 skipped during
recovery ***
Jun  8 12:23:15 hostname clogd[29492]: *** Region #287056 skipped during
recovery ***
Jun  8 12:23:28 hostname clogd[29492]: *** Region #18165 skipped during
recovery ***
Jun  8 12:23:52 hostname clogd[29492]: *** Region #227549 skipped during
recovery ***
Jun  8 12:29:23 hostname clogd[29492]: *** Region #198876 skipped during
recovery ***
Jun  8 12:31:08 hostname clogd[29492]: *** Region #197478 skipped during
recovery ***
Jun  8 12:31:58 hostname clogd[29492]: *** Region #197538 skipped during
recovery ***

Is this an error? The data seemed to transition fine and database
consistency checks that were done after passed with no errors. Here's
what I have installed for package versions:

kernel-2.6.18-194.3.1.el5
lvm2-2.02.56-8.el5_5.4
lvm2-cluster-2.02.56-7.el5_5.3
cmirror-1.1.39-8.el5
openais-0.80.6-16.el5_5.1
rgmanager-2.0.52-6.el5

Thanks,
Eric

 
 

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


From Eric.Johnson at mtsallstream.com  Mon Jun 14 16:32:02 2010
From: Eric.Johnson at mtsallstream.com (Johnson, Eric)
Date: Mon, 14 Jun 2010 11:32:02 -0500
Subject: [Linux-cluster] pvmove with CLVM messages
In-Reply-To: <mailman.33.1276531206.17226.linux-cluster@redhat.com>
References: <mailman.33.1276531206.17226.linux-cluster@redhat.com>
Message-ID: <CD9C931A046A4A41876F7F5E134298B403AFEDF6@PTEEXB02.mtsallstream.com>

Hi Gary,

Once the new SAN LUNs were presented to us, I just marked them as
physical volumes, extended the VG, then pvmove'd the old PVs to the new
ones. We had 3 existing LUNs that were moving to 1 larger LUN on the new
SAN.

pvcreate /dev/mapper/mpath4
vgextend datavg /dev/mapper/mpath4
pvmove -b /dev/mapper/mpath1 /dev/mapper/mpath4
pvmove -i 2  (to watch the progress)
pvmove -b /dev/mapper/mpath2 /dev/mapper/mpath4
pvmove -i 2  (to watch the progress)
pvmove -b /dev/mapper/mpath3 /dev/mapper/mpath4
pvmove -i 2  (to watch the progress)
lvdisplay -m /dev/datavg/* | grep "Physical volume"  (only mpath4 shows)
vgreduce datavg /dev/mapper/mpath1 /dev/mapper/mpath2 /dev/mapper/mpath3
pvremove /dev/mapper/mpath1 /dev/mapper/mpath2 /dev/mapper/mpath3

The only difference with the clustered systems was that the cmirror
package needed to be installed and the clogd daemon running.
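
The sequence above generalises to any number of old PVs. The following is only a dry-run sketch: it prints the commands instead of running them, the helper name is hypothetical, and the VG and device names are taken from the commands above. It assumes pvmove is run without -b, in which case it stays in the foreground until the move has finished.

```shell
#!/bin/sh
# Print (not run) the LVM commands to migrate a VG onto one new PV.
# Usage: migration_plan VG NEW_PV OLD_PV...
migration_plan() {
    vg=$1; new=$2; shift 2
    echo "pvcreate $new"
    echo "vgextend $vg $new"
    for old in "$@"; do
        # Without -b, pvmove blocks until this move is complete.
        echo "pvmove $old $new"
    done
    echo "vgreduce $vg $*"
    echo "pvremove $*"
}

migration_plan datavg /dev/mapper/mpath4 \
    /dev/mapper/mpath1 /dev/mapper/mpath2 /dev/mapper/mpath3
```

Reviewing the printed plan before executing anything makes it easy to confirm that only the old PVs end up in the vgreduce and pvremove steps.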

Eric


>Message: 4
>Date: Mon, 14 Jun 2010 09:45:45 -0600
>From: Gary Romo <garromo at us.ibm.com>
>To: linux clustering <linux-cluster at redhat.com>
>Cc: linux-cluster at redhat.com, linux-cluster-bounces at redhat.com
>Subject: Re: [Linux-cluster] pvmove with CLVM messages
>Message-ID:
>
<OF994EA4A3.108BD017-ON87257742.005685E9-87257742.00569670 at us.ibm.com>
>Content-Type: text/plain; charset="us-ascii"
>
>Hello Eric.
>
>Can we see the pvmove syntax you used?  some examples?
>
>Thank you!
>
>Gary 






From christoph at macht-blau.org  Tue Jun 15 07:27:23 2010
From: christoph at macht-blau.org (C. Handel)
Date: Tue, 15 Jun 2010 09:27:23 +0200
Subject: [Linux-cluster] Higher Grained Definition of IP
	AddressAssignments
Message-ID: <AANLkTimxIJy4Kg2tvfa3Uh8tk1RI7EbTM1kx_QVKQNDD@mail.gmail.com>

[define interface of cluster controlled ip resource]

> /usr/share/cluster/ip.sh appears to perform the link-monitoring in the

This is a resource agent script. What attributes a resource agent
accepts can be found by calling it with the option meta-data

/usr/share/cluster/ip.sh meta-data

There is no attribute interface. The agent will add the additional
address to the first interface that is in the same subnet.

You could edit the script and add a parameter interface yourself. Add
a new parameter into the XML at the beginning and access it in the
script with OCF_RESKEY_...
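
As a rough illustration of that mechanism: rgmanager hands each resource attribute to the agent as an environment variable named OCF_RESKEY_<attribute>. The sketch below shows how a modified agent might honour a hypothetical "interface" attribute. The function names and the fallback argument are made up for illustration and are not the actual ip.sh code; the commands are only printed, not executed.

```shell
#!/bin/sh
# Pick the interface for a cluster IP: use the (hypothetical) "interface"
# attribute if the admin set it, otherwise the fallback passed as $1
# (for example, whatever the agent's subnet matching would have chosen).
choose_interface() {
    if [ -n "${OCF_RESKEY_interface:-}" ]; then
        echo "$OCF_RESKEY_interface"
    else
        echo "$1"
    fi
}

# Print the iproute2 command that would add ADDRESS/PREFIX to that interface.
# Usage: add_addr_cmd ADDRESS PREFIX FALLBACK_IFACE
add_addr_cmd() {
    echo "ip addr add $1/$2 dev $(choose_interface "$3")"
}

# Documentation-range address used purely as an example.
OCF_RESKEY_interface=eth1
add_addr_cmd 192.0.2.10 24 eth0
unset OCF_RESKEY_interface
```

The new attribute would also have to be declared in the agent's meta-data XML, most safely by copying the form of an existing parameter entry in the same script.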

I don't understand what you are trying to do. If you are only handling
network interfaces as services, then rhcs is most likely the wrong
tool. If you would explain your goal we could probably suggest other
solutions.

Greetings
   Christoph



From michael.lackner at mu-leoben.at  Tue Jun 15 12:04:09 2010
From: michael.lackner at mu-leoben.at (Michael Lackner)
Date: Tue, 15 Jun 2010 14:04:09 +0200
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <036B68E61A28CA49AC2767596576CD596BA5E2300B@GVW1113EXC.americas.hpqcorp.net>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>	<4C1619E3.7070601@mu-leoben.at>	<1276518833.3158.302.camel@localhost.localdomain>	<4C163AF8.9000609@mu-leoben.at>
	<036B68E61A28CA49AC2767596576CD596BA5E2300B@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <4C176C39.9070206@mu-leoben.at>

Hello!

I tried to do R/W tests comparing 4kB blocksize to 1MB blocksize now,
and the difference in performance was negligible. Also, GFS2 was almost
on the same speed level when compared to GFS1 for Reads (see below
why..). I/O scheduler is "cfq" by the way. I never really cared about
the I/O scheduler since I do not yet understand the differences between
the available ones anyway.

But, I found out something else. As suggested by Steven in his reply, I
ran tests both on the GFS1/2 filesystems, and also on the raw
blockdevice, and surprisingly the results were almost the same!

So: GFS1 as well as GFS2 3-Node concurrent, sequential Reads showed a total
of 40MB/s (GFS1) and 45MB/s (GFS2) using a blocksize of 1MB. For single-node
sequential read the performance went up to a nice 180-190MB/s for both FS
versions.

Now, the surprising part: Doing a dd read on the raw blockdevice with 3
nodes showed a total of only ~60MB/s!! Almost as low as reading from
GFS1/2 with multiple nodes at the same time!! When reading the raw
blockdevice on a single node, I got slightly over 190MB/s again.
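
A raw-device comparison of this kind could look roughly like the sketch below, where each node reads a different region of the shared device so that the reads simulate separate files. This is not the exact command line used for the numbers above: the device path, the region size and the helper name are assumptions, and iflag=direct (bypassing the page cache) assumes a GNU dd that supports it. The commands are printed rather than executed.

```shell
#!/bin/sh
# Print a dd command that reads REGION_MB MiB from the raw device,
# starting at a node-specific offset so that nodes do not overlap.
# Usage: raw_read_cmd DEVICE NODE_INDEX REGION_MB
raw_read_cmd() {
    skip=$(( $2 * $3 ))    # offset in 1 MiB blocks for this node
    echo "dd if=$1 of=/dev/null bs=1M skip=$skip count=$3 iflag=direct"
}

# Hypothetical device path and a 10 GiB region per node (indices 0, 1, 2).
for node in 0 1 2; do
    raw_read_cmd /dev/mapper/mpathX "$node" 10240
done
```

Each node would run only its own line, all started at roughly the same time, while the totals are read off the switch port counters as before.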

So, this concurrent read issue seems not to be a GFS1 or GFS2 problem, but
more a problem of the underlying storage. This is extremely surprising
and a bit shocking I must say.

I guess for the Reads I will need to check the SAN itself, see if I can
do any optimization on it..  That thing can't possibly be that bad when
it comes to reading..

Thanks a lot for your ideas so far!

Jankowski, Chris wrote:
> Michael,
>
> For comparison, could you do your dd(1) tests with a very large block size (1 MB) and tell us the results, please?
>
> I have a vague hunch that the problem may have something to do with coalescing or not of IO operations.
>
> Also, which IO scheduler are you using?
>
> Thanks abnd regards,
>
> Chris Jankowski
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
> Sent: Tuesday, 15 June 2010 00:22
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems
>
> Hello!
>
> Thanks for your reply. I unfortunately forgot to mention, HOW I was actually testing, stupid.
>
> I tested with dd, doing 4kB blocksize reads and writes, 160GB total testfile size per node.
> I read from /dev/zero for writing tests and wrote to /dev/null for reading tests. So, totally sequential, somewhat small blocksize (equal to filesystem BS).
>
> The performance was measured directly on the Fibrechannel Switch, which offers nice per-port monitoring for that purpose.
>
> I have yet to do some serious read testing on GFS2. I have aborted my
> GFS2 tests as
> write performance was not up to GFS1 to begin with. My older GFS2 benchmarks (i did this with a 2-node configuration before) are lost, I will need to re-do them to give you some numbers.
>
> After each write test I did a "sync" to flush everything to disks.  I did not do this before or after read tests though..
>
> As you mentioned Journal Size, "gfs_tool counters <mountpoint>" said, that only 2-3% logspace were in use after the tests (I guess this is the per-node fs journal?).
>
> As for the direct I/O tests, by that you mean testing without ANY caching going on, a synchronous write? What I did before was test EXT3 (~190MB/s) and XFS
> (~320MB/s)
> on the Storage Array. I think what I'm getting here is raw throughput, since I am not monitoring in the OS, but at the Fibrechannel Switch itself..
>
> I will do GFS2 read tests similiar to those conducted for GFS1. I'll be able to do that tomorrow morning, then I can post the numbers here.
>
> Thanks!
>
> Steven Whitehouse wrote:
>   
>> Hi,
>>
>> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>>   
>>     
>>> Hello!
>>>
>>> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
>>>
>>> At the moment, the storage subsystem consists of an HP MSA2312 
>>> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines 
>>> are connected to that switch over 8gbit FC. The disks themselves are
>>> 12 * 15.000rpm SAS configured in RAID-5 with two hotspares.
>>>
>>> Now, the whole storage shall be shared (single filesystem), here GFS 
>>> comes in.
>>>
>>> The Cluster is only 3 nodes large at the moment, more nodes will be 
>>> added later on. I am currently testing GFS1 and GFS2 for performance.
>>> Lock Management is done over single 1Gbit Ethernet Links (1 per 
>>> machine).
>>>
>>> Thing is, with GFS1 I get far better performance than with the newer
>>> GFS2 across the board, with a few tunable parameters set, for writes
>>> GFS1 is roughly twice as fast.
>>>
>>>     
>>>       
>> What tests are you running? GFS2 is generally faster than GFS1 except 
>> for streaming writes, which is an area that we are putting some effort 
>> into solving currently. Small writes (one fs block (4k default) or 
>> less) on GFS2 are much faster than on GFS1.
>>
>>   
>>     
>>> But, concurrent reads are totally abysmal. The total write 
>>> performance (all nodes combined) sits around 280-330Mbyte/sec, 
>>> whereas the READ performance is as low as 30-40Mbyte/sec when doing 
>>> concurrent reads. Surprisingly, single-node read is somewhat ok at 
>>> 180Mbyte/sec, but as soon as several nodes are reading from GFS 
>>> (version 1 at the
>>> moment) at the same time,  things turn ugly.
>>>
>>>     
>>>       
>> Reads on GFS2 should be much faster than GFS1, so it sounds as if 
>> something isn't working correctly for some reason. For cached data, 
>> reads on GFS2 should be as fast as ext2/3 since the code path is 
>> identical (to the page cache) and only changes if pages are not cached.
>> GFS1 does its locking at a higher level, so there will be more 
>> overhead for cached reads in general.
>>
>> Do make sure that if you are preparing the test files for reading all 
>> from one node (or even just a different node to that on which you sre 
>> running the read tests) that you need to sync them to disk on that 
>> node before starting the tests to avoid issues with caching.
>>
>>   
>>     
>>> This is strange, because for writes, global performance across the 
>>> cluster increases slightly when adding more nodes. But for reads, the 
>>> oppsite seems to be true.
>>>
>>> For read and write tests, separate testfiles were created and read 
>>> for each node, with each testfile sitting in its own subdirectory, so 
>>> no node would access another nodes file.
>>>
>>>     
>>>       
>> That sounds like a good test set up to me.
>>
>>   
>>     
>>> GFS1 created with the following mkfs.gfs parameters:
>>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>>> (4kB blocksite, 16 * 128MB journals, 2GB resource groups, Distributed 
>>> LockManager)
>>>
>>> Mount Options set: "noatime,nodiratime,noquota"
>>>
>>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
>>> demote_secs 20"
>>>     
>>>       
>> You shouldn't normally need to set the glock_purge and demote_secs to 
>> anything other than the default. These settings no longer exist in 
>> GFS2 since it makes use of the shrinker subsystem provided by the VM 
>> and is auto-tuning. If your workload is metadata heavy, you could try 
>> boosting the journal size and/or the incore_log_blocks tunable.
>>
>>   
>>     
>>> Also, in /etc/cluster/cluster.conf, I added this:
>>> <dlm plock_ownership="1" plock_rate_limit="0"/> <gfs_controld 
>>> plock_rate_limit="0"/>
>>>
>>> Any ideas on how to figure out what's going wrong, and how to tune 
>>> GFS1 for better concurrent read performance, or tune GFS2 in general 
>>> to be competitive/better than GFS1?
>>>
>>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially and 
>>> somewhat good reaction times while under heavy sequential and/or 
>>> random load. But for now, I just wanna get the seq reading to work 
>>> acceptably fast.
>>>
>>> Thanks a lot for your help!
>>>
>>>     
>>>       
>> Can you try doing some I/O direct to the block device so that we can 
>> get an idea of what the raw device can manage? Using dd both read and 
>> write, across the nodes (different disk locations on each node to 
>> simulate different files).
>>
>> I'm wondering if the problem might be due to the seek pattern 
>> generated by the multiple read locations,
>>
>> Steve.
>>     
-- 
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505



From dhoffutt at gmail.com  Tue Jun 15 12:39:40 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Tue, 15 Jun 2010 07:39:40 -0500
Subject: [Linux-cluster] Higher Grained Definition of
	IP	AddressAssignments
In-Reply-To: <AANLkTimxIJy4Kg2tvfa3Uh8tk1RI7EbTM1kx_QVKQNDD@mail.gmail.com>
References: <AANLkTimxIJy4Kg2tvfa3Uh8tk1RI7EbTM1kx_QVKQNDD@mail.gmail.com>
Message-ID: <4C17748C.8010801@gmail.com>

I've spent the past year architecting an HA cluster with RHCS and it's 
working wonderfully. I have not seen anything superior.

Due to a new customer-driven feature of our software, we need a cluster 
service/resource group to be able to hold up to eight distinct IPs on 
one particular network, because the software that RHCS makes highly 
available performs its own load balancing. Placing the load balancing 
elsewhere is not an option due to the nature of the product.

Regarding "OCF_RESKEY_," will google more on this and appreciate the 
tip. Must work this out some way.

~ Dusty

C. Handel wrote:
> [define interface of cluster controlled ip resource]
>
>   
>> /usr/share/cluster/ip.sh appears to perform the link-monitoring in the
>>     
>
> This is a resource agent script. What attributes a resource agent
> accepts can be found by calling it with the option meta-data
>
> /usr/share/cluster/ip.sh meta-data
>
> There is no attribute interface. The agent will add the additional
> address to the first interface that is in the same subnet.
>
> You could edit the script and add a parameter interface yourself. Add
> a new parameter into the XML at the beginning and access it in the
> script with OCF_RESKEY_...
>
> I don't understand what you are trying to do. If you are only handling
> network interfaces as services, then rhcs is most likely the wrong
> tool. If you would explain your goal we could probably suggest other
> solutions.
>
> Greetings
>    Christoph
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>   


From Chris.Jankowski at hp.com  Tue Jun 15 13:41:53 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Tue, 15 Jun 2010 13:41:53 +0000
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4C176C39.9070206@mu-leoben.at>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>
	<4C1619E3.7070601@mu-leoben.at>
	<1276518833.3158.302.camel@localhost.localdomain>
	<4C163AF8.9000609@mu-leoben.at>
	<036B68E61A28CA49AC2767596576CD596BA5E2300B@GVW1113EXC.americas.hpqcorp.net>
	<4C176C39.9070206@mu-leoben.at>
Message-ID: <036B68E61A28CA49AC2767596576CD596BA5E233E0@GVW1113EXC.americas.hpqcorp.net>

Michael,

Would you be willing to repeat the large-block tests with a different I/O scheduler? Specifically, there is one scheduler (noop) that is effectively a null scheduler.

I think that I saw cases when the cfq IO scheduler was not working all that great on single streams.
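
On kernels of this generation the elevator can be switched per device at run time through sysfs, which makes such a comparison quick to set up. A minimal sketch that only prints the commands is shown below; the device names are placeholders for the disks actually under test, and the helper name is made up for illustration.

```shell
#!/bin/sh
# Print the command that would switch DEVICE (e.g. sdb) to SCHEDULER.
# Writing to this sysfs file needs root and takes effect immediately,
# but does not survive a reboot.
set_scheduler_cmd() {
    echo "echo $2 > /sys/block/$1/queue/scheduler"
}

# Hypothetical device names; substitute the disks under test.
for dev in sdb sdc; do
    set_scheduler_cmd "$dev" noop
done
```

The change would need to be made on every node before rerunning the concurrent reads, and can be reverted by writing "cfq" back to the same file.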

Thanks and regards,

Chris 

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
Sent: Tuesday, 15 June 2010 22:04
To: linux clustering
Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems

Hello!

I tried to do R/W tests comparing 4kB blocksize to 1MB blocksize now, and the difference in performance was negligible. Also, GFS2 was almost on the same speed level when compared to GFS1 for Reads (see below why..). I/O scheduler is "cfq" by the way. I never really cared about the I/O scheduler since I do not yet understand the differences between the available ones anyway.

But, I found out something else. As suggested by Steven in his reply, I ran tests both on the GFS1/2 filesystems, and also on the raw blockdevice, and surprisingly the  results were almost the same!

So: GFS1 as well as GFS2 3-Node concurrent, sequential Reads showed a total of 40MB/s (GFS1) and 45MB/s (GFS2) using a blocksize of 1MB. For single-node sequential read the performance went up to a nice 180-190MB/s for both FS versions.

Now, the surprising part: Doing a dd read on the raw blockdevice with 3 nodes showed a total of only ~60MB/s!! Almost as low as reading from GFS1/2 with multiple nodes at the same time!! When reading the raw blockdevice on a single node, I got slightly over 190MB/s again.

So, this concurrent read issue seems not to be a GFS1 or GFS2 problem, but more a problem of the underlying storage. This is extremely surprising and a bit shocking I must say.

I guess for the Reads I will need to check the SAN itself, see if I can do any optimization on it..  That thing can't possibly be that bad when it comes to reading..

Thanks a lot for your ideas so far!

Jankowski, Chris wrote:
> Michael,
>
> For comparison, could you do your dd(1) tests with a very large block size (1 MB) and tell us the results, please?
>
> I have a vague hunch that the problem may have something to do with coalescing or not of IO operations.
>
> Also, which IO scheduler are you using?
>
> Thanks abnd regards,
>
> Chris Jankowski
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
> Sent: Tuesday, 15 June 2010 00:22
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
> problems
>
> Hello!
>
> Thanks for your reply. I unfortunately forgot to mention, HOW I was actually testing, stupid.
>
> I tested with dd, doing 4kB blocksize reads and writes, 160GB total testfile size per node.
> I read from /dev/zero for writing tests and wrote to /dev/null for reading tests. So, totally sequential, somewhat small blocksize (equal to filesystem BS).
>
> The performance was measured directly on the Fibrechannel Switch, which offers nice per-port monitoring for that purpose.
>
> I have yet to do some serious read testing on GFS2. I have aborted my
> GFS2 tests as
> write performance was not up to GFS1 to begin with. My older GFS2 benchmarks (i did this with a 2-node configuration before) are lost, I will need to re-do them to give you some numbers.
>
> After each write test I did a "sync" to flush everything to disks.  I did not do this before or after read tests though..
>
> As you mentioned Journal Size, "gfs_tool counters <mountpoint>" said, that only 2-3% logspace were in use after the tests (I guess this is the per-node fs journal?).
>
> As for the direct I/O tests, by that you mean testing without ANY caching going on, a synchronous write? What I did before was test EXT3 (~190MB/s) and XFS (~320MB/s) on the Storage Array. I think what I'm getting here is raw throughput, since I am not monitoring in the OS, but at the Fibrechannel Switch itself..
>
> I will do GFS2 read tests similar to those conducted for GFS1. I'll be able to do that tomorrow morning, then I can post the numbers here.
>
> Thanks!
>
> Steven Whitehouse wrote:
>   
>> Hi,
>>
>> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>>   
>>     
>>> Hello!
>>>
>>> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
>>>
>>> At the moment, the storage subsystem consists of an HP MSA2312 
>>> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines 
>>> are connected to that switch over 8gbit FC. The disks themselves are
>>> 12 * 15.000rpm SAS configured in RAID-5 with two hotspares.
>>>
>>> Now, the whole storage shall be shared (single filesystem), here GFS 
>>> comes in.
>>>
>>> The Cluster is only 3 nodes large at the moment, more nodes will be 
>>> added later on. I am currently testing GFS1 and GFS2 for performance.
>>> Lock Management is done over single 1Gbit Ethernet Links (1 per 
>>> machine).
>>>
>>> Thing is, with GFS1 I get far better performance than with the newer
>>> GFS2 across the board, with a few tunable parameters set, for writes
>>> GFS1 is roughly twice as fast.
>>>
>>>     
>>>       
>> What tests are you running? GFS2 is generally faster than GFS1 except 
>> for streaming writes, which is an area that we are putting some 
>> effort into solving currently. Small writes (one fs block (4k 
>> default) or
>> less) on GFS2 are much faster than on GFS1.
>>
>>   
>>     
>>> But, concurrent reads are totally abysmal. The total write 
>>> performance (all nodes combined) sits around 280-330Mbyte/sec, 
>>> whereas the READ performance is as low as 30-40Mbyte/sec when doing 
>>> concurrent reads. Surprisingly, single-node read is somewhat ok at 
>>> 180Mbyte/sec, but as soon as several nodes are reading from GFS 
>>> (version 1 at the
>>> moment) at the same time,  things turn ugly.
>>>
>>>     
>>>       
>> Reads on GFS2 should be much faster than GFS1, so it sounds as if 
>> something isn't working correctly for some reason. For cached data, 
>> reads on GFS2 should be as fast as ext2/3 since the code path is 
>> identical (to the page cache) and only changes if pages are not cached.
>> GFS1 does its locking at a higher level, so there will be more 
>> overhead for cached reads in general.
>>
>> If you are preparing the test files for reading all from one node
>> (or even just a different node to that on which you are running the
>> read tests), do make sure that you sync them to disk on that node
>> before starting the tests, to avoid issues with caching.
>>
>>   
>>     
>>> This is strange, because for writes, global performance across the 
>>> cluster increases slightly when adding more nodes. But for reads, 
>>> the opposite seems to be true.
>>>
>>> For read and write tests, separate testfiles were created and read 
>>> for each node, with each testfile sitting in its own subdirectory, 
>>> so no node would access another node's file.
>>>
>>>     
>>>       
>> That sounds like a good test set up to me.
>>
>>   
>>     
>>> GFS1 created with the following mkfs.gfs parameters:
>>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>>> (4kB block size, 16 * 128MB journals, 2GB resource groups, Distributed
>>> Lock Manager)
>>>
>>> Mount Options set: "noatime,nodiratime,noquota"
>>>
>>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
>>> demote_secs 20"
>>>     
>>>       
>> You shouldn't normally need to set the glock_purge and demote_secs to 
>> anything other than the default. These settings no longer exist in
>> GFS2 since it makes use of the shrinker subsystem provided by the VM 
>> and is auto-tuning. If your workload is metadata heavy, you could try 
>> boosting the journal size and/or the incore_log_blocks tunable.
>>
>>   
>>     
>>> Also, in /etc/cluster/cluster.conf, I added this:
>>> <dlm plock_ownership="1" plock_rate_limit="0"/> <gfs_controld 
>>> plock_rate_limit="0"/>
>>>
>>> Any ideas on how to figure out what's going wrong, and how to tune
>>> GFS1 for better concurrent read performance, or tune GFS2 in general 
>>> to be competitive/better than GFS1?
>>>
>>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially and 
>>> somewhat good reaction times while under heavy sequential and/or 
>>> random load. But for now, I just wanna get the seq reading to work 
>>> acceptably fast.
>>>
>>> Thanks a lot for your help!
>>>
>>>     
>>>       
>> Can you try doing some I/O direct to the block device so that we can 
>> get an idea of what the raw device can manage? Using dd both read and 
>> write, across the nodes (different disk locations on each node to 
>> simulate different files).
>>
>> I'm wondering if the problem might be due to the seek pattern 
>> generated by the multiple read locations,
>>
>> Steve.
>>     
--
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jeff.sturm at eprize.com  Tue Jun 15 18:45:49 2010
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Tue, 15 Jun 2010 14:45:49 -0400
Subject: [Linux-cluster] ENOSPC on write()
Message-ID: <64D0546C5EBBD147B75DE133D798665F055D94E1@hugo.eprize.local>

I'm trying to diagnose a mysterious ENOSPC that happens infrequently
when our application writes to a GFS filesystem.

Here's an example system call:

       write(22, "\0\0", 2)                    = 2
       write(22, "\4\7\01012345678\4\10\10\10\3\n\0\0\0\4\3\5\0\0\0\4\3\30\0\0\0"..., 299718) = -1 ENOSPC (No space left on device)

The file is empty prior to the first write().  2 bytes are successfully
written before the 2nd write() fails.

The filesystem has plenty of blocks and inodes free.  Does this indicate
some sort of resource starvation?  If so, what is an effective strategy
to mitigate?  Repeat the operation?  Write in smaller chunks?

Any help appreciated,

Jeff





From michael.lackner at mu-leoben.at  Wed Jun 16 07:49:37 2010
From: michael.lackner at mu-leoben.at (Michael Lackner)
Date: Wed, 16 Jun 2010 09:49:37 +0200
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <036B68E61A28CA49AC2767596576CD596BA5E233E0@GVW1113EXC.americas.hpqcorp.net>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>	<4C1619E3.7070601@mu-leoben.at>	<1276518833.3158.302.camel@localhost.localdomain>	<4C163AF8.9000609@mu-leoben.at>	<036B68E61A28CA49AC2767596576CD596BA5E2300B@GVW1113EXC.americas.hpqcorp.net>	<4C176C39.9070206@mu-leoben.at>
	<036B68E61A28CA49AC2767596576CD596BA5E233E0@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <4C188211.8030409@mu-leoben.at>

Chris,

Can do. Which one shall I try? I got these four to choose from:

* noop
* anticipatory
* deadline
* cfq

One more thing: because of the Fibrechannel storage I am using multipathing,
and I cannot set the scheduler for the multipath device (/dev/dm-0), because
"/sys/block/dm-0/queue/scheduler" doesn't exist. I actually have four paths
to the storage, which I can see as "/dev/sda", "/dev/sdb", "/dev/sdc" and
"/dev/sdd".

I guess it's ok if I change the scheduler for those four? Is it ok to just
run a command similar to the one below, and will this change the scheduler
on the fly?

"echo noop > /sys/block/sd*/queue/scheduler"

Because at the moment, the scheduler file for each block device contains
this line:

"noop anticipatory deadline [cfq]"

Maybe I would have to do something like
"echo [noop] anticipatory deadline cfq > /sys/block/sd*/queue/scheduler"
instead?

Thanks for the help.

Jankowski, Chris wrote:
> Michael,
>
> Would you be willing to repeat the tests with a large block size and a different IO scheduler? Specifically, there is a scheduler that is effectively a null scheduler.
>
> I think that I saw cases when the cfq IO scheduler was not working all that great on single streams.
>
> Thanks and regards,
>
> Chris 
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
> Sent: Tuesday, 15 June 2010 22:04
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems
>
> Hello!
>
> I tried to do R/W tests comparing 4kB blocksize to 1MB blocksize now, and the difference in performance was negligible. Also, GFS2 was almost on the same speed level when compared to GFS1 for Reads (see below why..). I/O scheduler is "cfq" by the way. I never really cared about the I/O scheduler since I do not yet understand the differences between the available ones anyway.
>
> But, I found out something else. As suggested by Steven in his reply, I ran tests both on the GFS1/2 filesystems, and also on the raw blockdevice, and surprisingly the  results were almost the same!
>
> So: GFS1 as well as GFS2 3-Node concurrent, sequential Reads showed a total of 40MB/s (GFS1) and 45MB/s (GFS2) using a blocksize of 1MB. For single-node sequential read the performance went up to a nice 180-190MB/s for both FS versions.
>
> Now, the surprising part: Doing a dd read on the raw blockdevice with 3 nodes showed a total of only ~60MB/s!! Almost as low as reading from GFS1/2 with multiple nodes at the same time!! When reading the raw blockdevice on a single node, I got slightly over 190MB/s again.
>
> So, this concurrent read issue seems not to be a GFS1 or GFS2 problem, but more a problem of the underlying storage. This is extremely surprising and a bit shocking I must say.
>
> I guess for the Reads I will need to check the SAN itself, see if I can do any optimization on it..  That thing can't possibly be that bad when it comes to reading..
>
> Thanks a lot for your ideas so far!
-- 
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505



From Chris.Jankowski at hp.com  Wed Jun 16 08:31:24 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Wed, 16 Jun 2010 08:31:24 +0000
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4C188211.8030409@mu-leoben.at>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>
	<4C1619E3.7070601@mu-leoben.at>
	<1276518833.3158.302.camel@localhost.localdomain>
	<4C163AF8.9000609@mu-leoben.at>
	<036B68E61A28CA49AC2767596576CD596BA5E2300B@GVW1113EXC.americas.hpqcorp.net>
	<4C176C39.9070206@mu-leoben.at>
	<036B68E61A28CA49AC2767596576CD596BA5E233E0@GVW1113EXC.americas.hpqcorp.net>
	<4C188211.8030409@mu-leoben.at>
Message-ID: <036B68E61A28CA49AC2767596576CD596BA5E23614@GVW1113EXC.americas.hpqcorp.net>

Michael,

I do not know the process for setting this up in a multipathing configuration, but the scheduler to test is the noop scheduler.

Please let us know what it yields.

Regards,

Chris 




From goutam.baul at cesc.co.in  Wed Jun 16 13:42:25 2010
From: goutam.baul at cesc.co.in (Goutam Baul)
Date: Wed, 16 Jun 2010 19:12:25 +0530
Subject: [Linux-cluster] RHEL 5.4 cluster with DRAC6
Message-ID: <NCEPJJMAJKELEEEMCIHNCEKPFPAA.goutam.baul@cesc.co.in>

Dear List Members,

We are trying to create a two-node cluster with RHEL 5.4 (AP). The hardware
is two Dell R610 servers with iDRAC6 cards, which we plan to use for
fencing. The present situation is as follows:

1.	We are able to fence the remote host by running fence_ipmilan against
the IP address of the remote host's DRAC card.
2.	The service is relocated if the host running it is shut down (init 0)
or restarted (init 6).
3.	If we power cycle one node from the other node using the ipmitool
command, the service is not relocated to the other machine. clustat
reports the service as "started" on the node that was power cycled, even
though that node's status is reported as "Offline". The log file on the
surviving node reports that it is failing to fence the other node.

	The IP addresses of the setup are as follows:

	Node : wmd01.tibs.edu.in
	                IP address of the machine is 10.100.4.11
	                IP address of the DRAC is 10.100.4.17

	Node : wmd02.tibs.edu.in
	                IP address of the machine is 10.100.4.12
	                IP address of the DRAC is 10.100.4.16

	The cluster.conf file is given below.

<?xml version="1.0"?>
<cluster config_version="7" name="tibs_wmd">
        <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="wmd01.tibs.edu.in" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="wmd02.tibs.edu.in" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" auth="" ipaddr="10.100.4.17" login="root" name="wmd01_ipmi" passwd="calvin"/>
                <fencedevice agent="fence_ipmilan" auth="" ipaddr="10.100.4.16" login="root" name="wmd02_ipmi" passwd="calvin"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="wmd_http" ordered="1" restricted="0">
                                <failoverdomainnode name="wmd01.tibs.edu.in" priority="2"/>
                                <failoverdomainnode name="wmd02.tibs.edu.in" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.100.4.13" monitor_link="1"/>
                        <script file="/etc/init.d/httpd" name="wmd_http_script"/>
                </resources>
                <service autostart="1" domain="wmd_http" name="wmd_http_srvc" recovery="relocate">
                        <ip ref="10.100.4.13"/>
                        <script ref="wmd_http_script"/>
                </service>
        </rm>
</cluster>

Kindly help us resolve this issue. We are completely stuck.

With regards,

Goutam


From michael.lackner at mu-leoben.at  Wed Jun 16 13:53:37 2010
From: michael.lackner at mu-leoben.at (Michael Lackner)
Date: Wed, 16 Jun 2010 15:53:37 +0200
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <036B68E61A28CA49AC2767596576CD596BA5E23614@GVW1113EXC.americas.hpqcorp.net>
References: <4c12f7e4.5120e30a.7714.7ed8@mx.google.com>	<4C1619E3.7070601@mu-leoben.at>	<1276518833.3158.302.camel@localhost.localdomain>	<4C163AF8.9000609@mu-leoben.at>	<036B68E61A28CA49AC2767596576CD596BA5E2300B@GVW1113EXC.americas.hpqcorp.net>	<4C176C39.9070206@mu-leoben.at>	<036B68E61A28CA49AC2767596576CD596BA5E233E0@GVW1113EXC.americas.hpqcorp.net>	<4C188211.8030409@mu-leoben.at>
	<036B68E61A28CA49AC2767596576CD596BA5E23614@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <4C18D761.8070606@mu-leoben.at>

Hello!

Ok, I got the results. It seems that the scheduler can only be set for
real, physical block devices (not multipath devices), which should be ok,
I assume.

For curiosity's sake I tested all four schedulers for the dd read with
1MB blocksize. Here are the results, per node as well as the total over
all three nodes. Numbers are in MB/sec again, sorted from slowest to
fastest:

cfq: 15.8 / 15.8 / 15.2 (=46.8MB/s total)
noop: 24.3 / 24.1 / 24.3 (=72.7MB/s total)
deadline: 24.6 / 24.5 / 24.2 (=73.3MB/s total)
anticipatory: 24.9 / 24.8 / 24.5 (=74.2MB/s total)

Before and after each test, I flushed the write caches ("sync") and purged
all I/O caches ("echo 3 > /proc/sys/vm/drop_caches") to get results
unaffected by caching.
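For anyone who wants to repeat this, here is a rough sketch of the steps,
assuming the path devices behind the multipath device are sda to sdd as on
my nodes. The function names and the sysfs-root argument are made up for
illustration. Only the scheduler name is written to the file, not the whole
bracketed line, and each path device is written separately, because a single
redirect to /sys/block/sd*/queue/scheduler does not work once the glob
matches more than one file.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of any tool.

# Print the active scheduler, i.e. the entry in square brackets,
# e.g. "noop anticipatory deadline [cfq]" -> cfq
current_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# set_scheduler SCHED SYSFS_ROOT DEV...
# Write the scheduler name to each path device in turn.
set_scheduler() {
    sched=$1
    sysfs=$2
    shift 2
    for dev in "$@"; do
        f="$sysfs/$dev/queue/scheduler"
        if [ -w "$f" ]; then
            echo "$sched" > "$f"
        else
            echo "skipping $dev: $f is not writable" >&2
        fi
    done
}

# Flush dirty pages, then drop page cache, dentries and inodes so the next
# read test is not served from memory. The optional argument only exists
# so the function can be tried against a scratch file.
flush_caches() {
    sync
    echo 3 > "${1:-/proc/sys/vm/drop_caches}"
}

# Usage (as root), assuming the four paths are sda..sdd:
#   set_scheduler noop /sys/block sda sdb sdc sdd
#   current_scheduler /sys/block/sda/queue/scheduler
#   flush_caches
```

The change takes effect immediately and does not survive a reboot, so it
has to be repeated on every node before each test run.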

So it seems the "anticipatory" scheduler wins for sequential reads, closely
followed by "deadline" and "noop". The only one that seems to really suck
is the default one, "cfq". I have not done any write tests with the
different schedulers so far, nor any random I/O tests. Also no single-node
tests this time (no more time today).

While this shows a significant improvement for this specific workload
(roughly 47MB/s to 74MB/s in total), it's definitely still far below our
expectations...

I will also check the impact of the schedulers on sequential writes and
random I/O as soon as I've figured out how to run some good random I/O
tests.

In the meantime, I would be happy to hear any additional suggestions to
further improve performance.

Thanks!

Jankowski, Chris wrote:
> Michael,
>
> I do not know the process for setting this up in a multipathing configuration, but the scheduler to test is the noop scheduler.
>
> Please let us know what it yields.
>
> Regards,
>
> Chris 
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
> Sent: Wednesday, 16 June 2010 17:50
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems
>
> Chris,
>
> Can do. Which one shall I try? I got these four to choose from:
>
> * noop
> * anticipatory
> * deadline
> * cfq
>
> One more thing: because of the Fibrechannel Storage I am using multipathing, and I cannot set the scheduler for the multipath device (/dev/dm-0), because "/sys/block/dm-0/queue/scheduler" doesn't exist. I actually have four paths to the storage that I can see as "/dev/sda", "/dev/sdb", "/dev/sdc" and "/dev/sdd".
>
> I guess it's ok if I change the scheduler for those four? Is it ok to just run a command similar to the one below, and will this change the scheduler on the fly?
>
> "echo noop > /sys/block/sd*/queue/scheduler"
>
> Cause at the moment, the scheduler files for each blockdevice contain this line:
>
> "noop anticipatory deadline [cfq]"
>
> Maybe I would have to do something like
> "echo [noop] anticipatory deadline cfq > /sys/block/sd*/queue/scheduler"
> instead?
>
> Thanks for the help.
>
> Jankowski, Chris wrote:
>   
>> Michael,
>>
>> Would you be willing to repeat the tests with large block with different IO scheduler. Specifically there is a scheduler that actually is a null scheduler.
>>
>> I think that I saw cases when the cfq IO scheduler was not working all that great on single streams.
>>
>> Thanks and regards,
>>
>> Chris
>>


-- 
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505



From rajatjpatel at gmail.com  Wed Jun 16 14:05:25 2010
From: rajatjpatel at gmail.com (rajatjpatel)
Date: Wed, 16 Jun 2010 19:35:25 +0530
Subject: [Linux-cluster] http://studyrat.blogspot.com
Message-ID: <AANLkTinDCXIYP-o32rV3dDCF4FqK7bBUOnNB5-h-kIAb@mail.gmail.com>

Regards,

Rajat J Patel

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...
THEN THEY FIGHT YOU...
THEN YOU WIN...

From sklemer at gmail.com  Wed Jun 16 20:04:26 2010
From: sklemer at gmail.com (=?UTF-8?B?16nXnNeV150g16fXnNee16g=?=)
Date: Wed, 16 Jun 2010 23:04:26 +0300
Subject: [Linux-cluster] RHEL 5.4 cluster with DRAC6
In-Reply-To: <NCEPJJMAJKELEEEMCIHNCEKPFPAA.goutam.baul@cesc.co.in>
References: <NCEPJJMAJKELEEEMCIHNCEKPFPAA.goutam.baul@cesc.co.in>
Message-ID: <AANLkTiml1KHpMgJIym2XZofdyh_yy5XTuPl2yaOvnVKa@mail.gmail.com>

Hi.

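You forgot to assign the fence devices to the cluster nodes. Each
<clusternode> still has an empty <fence/> element, so fenced has no method
to use for that node. Reference the matching <fencedevice> from inside the
<fence> block of each node, and increase config_version when you update the
file:

<clusternodes>
        <clusternode name="wmd01.tibs.edu.in" nodeid="1" votes="1">
                <fence>
                        <method name="1">
                                <device name="wmd01_ipmi"/>
                        </method>
                </fence>
        </clusternode>
        <clusternode name="wmd02.tibs.edu.in" nodeid="2" votes="1">
                <fence>
                        <method name="1">
                                <device name="wmd02_ipmi"/>
                        </method>
                </fence>
        </clusternode>
</clusternodes>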
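
A quick way to spot this kind of mistake is to count the empty fence
elements in the file. The sketch below is only a plain text match, not a
real XML parse, so it assumes the usual one-tag-per-line layout used in this
thread; count_empty_fence is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print how many <fence/> elements in a cluster.conf
# are empty, i.e. how many cluster nodes have no fence method assigned.
count_empty_fence() {
    conf=${1:-/etc/cluster/cluster.conf}
    # Matches <fence/> and <fence />, but not <fencedevice .../> or
    # <fence_daemon .../>. grep -c prints 0 when nothing matches.
    grep -c '<fence[[:space:]]*/>' "$conf"
}

# Usage:
#   count_empty_fence /etc/cluster/cluster.conf
```

Anything other than 0 means at least one node cannot be fenced by the
cluster, even if running fence_ipmilan by hand works.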


On Wed, Jun 16, 2010 at 4:42 PM, Goutam Baul <goutam.baul at cesc.co.in> wrote:

> Dear List Members,
>
> We are trying to create a two-node cluster with RHEL 5.4 (AP). The hardware
> is two nos. DELL R610 servers. These servers are having iDRAC6 and we are
> planning to do the fencing using these cards. The present situation is as
> follows:
>
>    1. We are able to fence the remote host by issuing the command
>    fence_ipmilan for the IP address of the DRAC card of the remote host
>    2. The service is getting relocated if the host running the service is
>    shutdown (init 0) or restarted (init 6).
>    3. But if we power cycle one node from the other node using the
>    ipmitool command then the service is not getting relocated to the other
>    machine. The clustat reports that the service is in "started" state in the
>    node that has been power cycled though the status of the node is reported to
>    be "Offline". The log file of the node that is not power cycled reports that
>    it is failing to fence the other node.
>
>    The IP addresses of the setup are as follows:
>
>    Node : wmd01.tibs.edu.in
>                    IP address of the machine is 10.100.4.11
>                    IP address of the DRAC is 10.100.4.17
>
>    Node : wmd02.tibs.edu.in
>                    IP address of the machine is 10.100.4.12
>                    IP address of the DRAC is 10.100.4.16
>
>    The cluster.conf file is given below.
>
> <?xml version="1.0"?>
> <cluster config_version="7" name="tibs_wmd">
>         <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="wmd01.tibs.edu.in" nodeid="1" votes="1">
>                         <fence/>
>                 </clusternode>
>                 <clusternode name="wmd02.tibs.edu.in" nodeid="2" votes="1">
>                         <fence/>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <fencedevices>
>                 <fencedevice agent="fence_ipmilan" auth="" ipaddr="10.100.4.17" login="root" name="wmd01_ipmi" passwd="calvin"/>
>                 <fencedevice agent="fence_ipmilan" auth="" ipaddr="10.100.4.16" login="root" name="wmd02_ipmi" passwd="calvin"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="wmd_http" ordered="1" restricted="0">
>                                 <failoverdomainnode name="wmd01.tibs.edu.in" priority="2"/>
>                                 <failoverdomainnode name="wmd02.tibs.edu.in" priority="1"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources>
>                         <ip address="10.100.4.13" monitor_link="1"/>
>                         <script file="/etc/init.d/httpd" name="wmd_http_script"/>
>                 </resources>
>                 <service autostart="1" domain="wmd_http" name="wmd_http_srvc" recovery="relocate">
>                         <ip ref="10.100.4.13"/>
>                         <script ref="wmd_http_script"/>
>                 </service>
>         </rm>
> </cluster>
>
> Kindly help us resolve this issue. We are completely stuck.
>
> With regards,
>
> Goutam
>
>

From kitgerrits at gmail.com  Thu Jun 17 07:09:44 2010
From: kitgerrits at gmail.com (Kit Gerrits)
Date: Thu, 17 Jun 2010 09:09:44 +0200
Subject: [Linux-cluster] Higher Grained Definition
	ofIP	AddressAssignments
In-Reply-To: <4C17748C.8010801@gmail.com>
Message-ID: <4c19ca35.0d44d80a.414b.fffff48d@mx.google.com>

In that case, might it be easier to simply use the host IP addresses and not
the cluster IPs?
(The application would then need to handle up/down events itself.)
 
 
Regards,
 
Kit

  _____  

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dustin Henry Offutt
Sent: dinsdag 15 juni 2010 14:40
To: linux clustering
Subject: Re: [Linux-cluster] Higher Grained Definition ofIP
AddressAssignments


I've spent the past year architecting an HA cluster with RHCS and it's
working wonderfully. I have not seen anything superior.

Due to a new customer-driven feature of our software, we need to add the
ability for a cluster service/resource group to have up to eight distinct
IPs on one particular network due to the software being made highly
available via RHCS performing its own load balancing. Placing the load
balancing elsewhere is not an option due to the nature of the product.

Regarding "OCF_RESKEY_," will google more on this and appreciate the tip.
Must work this out some way.

~ Dusty

C. Handel wrote:

[define interface of cluster controlled ip resource]

/usr/share/cluster/ip.sh appears to perform the link-monitoring in the

This is a resource agent script. What attributes a resource agent
accepts can be found by calling it with the option meta-data

/usr/share/cluster/ip.sh meta-data

There is no attribute interface. The agent will add the additional
address to the first interface that is in the same subnet.

You could edit the script and add a parameter interface yourself. Add
a new parameter into the XML at the beginning and access it in the
script with OCF_RESKEY_...
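
As a rough illustration of that last step: rgmanager hands each resource
attribute to the agent as an environment variable named
OCF_RESKEY_<attribute>, so a new attribute could be read roughly as below.
The attribute name "interface" and the function choose_interface are
hypothetical; the stock ip.sh has neither, and the matching parameter would
still have to be declared in the agent's meta-data XML.

```shell
#!/bin/sh
# Hypothetical sketch: prefer an explicitly configured interface over the
# one the agent detected on its own (the same-subnet lookup).
choose_interface() {
    detected=$1
    if [ -n "${OCF_RESKEY_interface:-}" ]; then
        # Value of the (assumed) interface="..." attribute in cluster.conf
        echo "$OCF_RESKEY_interface"
    else
        # Attribute not set: keep the agent's existing behaviour
        echo "$detected"
    fi
}

# Usage inside the agent, where eth0 stands for whatever was detected:
#   dev=$(choose_interface eth0)
```

Falling back to the detected interface keeps existing cluster.conf files
working unchanged when the attribute is absent.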



I don't understand what you are trying to do. If you are only handling
network interfaces as services, then rhcs is most likely the wrong
tool. If you would explain your goal we could probably suggest other
solutions.

Greetings

   Christoph




From kitgerrits at gmail.com  Thu Jun 17 07:14:50 2010
From: kitgerrits at gmail.com (Kit Gerrits)
Date: Thu, 17 Jun 2010 09:14:50 +0200
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4C176C39.9070206@mu-leoben.at>
Message-ID: <4c19cb67.014fd80a.136c.fffff102@mx.google.com>


Didn't you have that HP MSA with the fibrechannel interfaces?

I have exactly the same device, also with HP DL380 and HP DL 580 hosts with
two FC interfaces.
I've seen similarly insane statistics using only ext2fs mounts. (even worse,
around 7MB/s)
It went away after a while, but I have no idea where it came from or why it
left.
(I was backing up files with tar-over-ssh)

I would really like to know how you get rid of it, if ever.


Regards,

Kit 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
Sent: dinsdag 15 juni 2010 14:04
To: linux clustering
Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems

Hello!

I tried to do R/W tests comparing 4kB blocksize to 1MB blocksize now, and
the difference in performance was negligible. Also, GFS2 was almost on the
same speed level when compared to GFS1 for Reads (see below why..). I/O
scheduler is "cfq" by the way. I never really cared about the I/O scheduler
since I do not yet understand the differences between the available ones
anyway.

But, I found out something else. As suggested by Steven in his reply, I ran
tests both on the GFS1/2 filesystems, and also on the raw blockdevice, and
surprisingly the  results were almost the same!

So: GFS1 as well as GFS2 3-Node concurrent, sequential Reads showed a total
of 40MB/s (GFS1) and 45MB/s (GFS2) using a blocksize of 1MB. For single-node
sequential read the performance went up to a nice 180-190MB/s for both FS
versions.

Now, the surprising part: Doing a dd read on the raw blockdevice with 3
nodes showed a total of only ~60MB/s!! Almost as low as reading from GFS1/2
with multiple nodes at the same time!! When reading the raw blockdevice on a
single node, I got slightly over 190MB/s again.

So, this concurrent read issue seems not to be a GFS1 or GFS2 problem, but
more a problem of the underlying storage. This is extremely surprising and a
bit shocking I must say.

I guess for the Reads I will need to check the SAN itself, see if I can do
any optimization on it..  That thing can't possibly be that bad when it
comes to reading..

Thanks a lot for your ideas so far!

Jankowski, Chris wrote:
> Michael,
>
> For comparison, could you do your dd(1) tests with a very large block size
> (1 MB) and tell us the results, please?
>
> I have a vague hunch that the problem may have something to do with
> coalescing or not of IO operations.
>
> Also, which IO scheduler are you using?
>
> Thanks and regards,
>
> Chris Jankowski
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
> Sent: Tuesday, 15 June 2010 00:22
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
> problems
>
> Hello!
>
> Thanks for your reply. I unfortunately forgot to mention HOW I was
> actually testing, stupid.
>
> I tested with dd, doing 4kB blocksize reads and writes, 160GB total
> testfile size per node.
> I read from /dev/zero for writing tests and wrote to /dev/null for reading
> tests. So, totally sequential, somewhat small blocksize (equal to filesystem
> BS).
>
> The performance was measured directly on the Fibrechannel Switch, which
> offers nice per-port monitoring for that purpose.
>
> I have yet to do some serious read testing on GFS2. I have aborted my
> GFS2 tests as write performance was not up to GFS1 to begin with. My older
> GFS2 benchmarks (I did this with a 2-node configuration before) are lost,
> I will need to re-do them to give you some numbers.
>
> After each write test I did a "sync" to flush everything to disks.  I did
> not do this before or after read tests though..
>
> As you mentioned Journal Size, "gfs_tool counters <mountpoint>" said that
> only 2-3% logspace were in use after the tests (I guess this is the per-node
> fs journal?).
>
> As for the direct I/O tests, by that you mean testing without ANY
> caching going on, a synchronous write? What I did before was test EXT3
> (~190MB/s) and XFS (~320MB/s) on the Storage Array. I think what I'm
> getting here is raw throughput, since I am not monitoring in the OS, but
> at the Fibrechannel Switch itself..
>
> I will do GFS2 read tests similar to those conducted for GFS1. I'll be
> able to do that tomorrow morning, then I can post the numbers here.
>
> Thanks!
>
> Steven Whitehouse wrote:
>   
>> Hi,
>>
>> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>>   
>>     
>>> Hello!
>>>
>>> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
>>>
>>> At the moment, the storage subsystem consists of an HP MSA2312 
>>> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines 
>>> are connected to that switch over 8gbit FC. The disks themselves are
>>> 12 * 15.000rpm SAS configured in RAID-5 with two hotspares.
>>>
>>> Now, the whole storage shall be shared (single filesystem), here GFS 
>>> comes in.
>>>
>>> The Cluster is only 3 nodes large at the moment, more nodes will be 
>>> added later on. I am currently testing GFS1 and GFS2 for performance.
>>> Lock Management is done over single 1Gbit Ethernet Links (1 per 
>>> machine).
>>>
>>> Thing is, with GFS1 I get far better performance than with the newer
>>> GFS2 across the board. With a few tunable parameters set, GFS1 is
>>> roughly twice as fast for writes.
>>>
>>>     
>>>       
>> What tests are you running? GFS2 is generally faster than GFS1 except
>> for streaming writes, which is an area that we are putting some
>> effort into solving currently. Small writes (one fs block (4k default)
>> or less) on GFS2 are much faster than on GFS1.
>>
>>   
>>     
>>> But, concurrent reads are totally abysmal. The total write
>>> performance (all nodes combined) sits around 280-330Mbyte/sec,
>>> whereas the READ performance is as low as 30-40Mbyte/sec when doing
>>> concurrent reads. Surprisingly, single-node read is somewhat ok at
>>> 180Mbyte/sec, but as soon as several nodes are reading from GFS
>>> (version 1 at the moment) at the same time, things turn ugly.
>>>
>>>     
>>>       
>> Reads on GFS2 should be much faster than GFS1, so it sounds as if 
>> something isn't working correctly for some reason. For cached data, 
>> reads on GFS2 should be as fast as ext2/3 since the code path is 
>> identical (to the page cache) and only changes if pages are not cached.
>> GFS1 does its locking at a higher level, so there will be more 
>> overhead for cached reads in general.
>>
>> Do make sure that if you are preparing the test files for reading all
>> from one node (or even just a different node to that on which you are
>> running the read tests), you sync them to disk on that node before
>> starting the tests to avoid issues with caching.
>>
>>   
>>     
>>> This is strange, because for writes, global performance across the
>>> cluster increases slightly when adding more nodes. But for reads,
>>> the opposite seems to be true.
>>>
>>> For read and write tests, separate testfiles were created and read
>>> for each node, with each testfile sitting in its own subdirectory,
>>> so no node would access another node's file.
>>>
>>>     
>>>       
>> That sounds like a good test set up to me.
>>
>>   
>>     
>>> GFS1 created with the following mkfs.gfs parameters:
>>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>>> (4kB blocksize, 16 * 128MB journals, 2GB resource groups,
>>> Distributed LockManager)
>>>
>>> Mount Options set: "noatime,nodiratime,noquota"
>>>
>>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
>>> demote_secs 20"
>>>     
>>>       
>> You shouldn't normally need to set the glock_purge and demote_secs to
>> anything other than the default. These settings no longer exist in GFS2
>> since it makes use of the shrinker subsystem provided by the VM and is
>> auto-tuning. If your workload is metadata heavy, you could try boosting
>> the journal size and/or the incore_log_blocks tunable.
>>
>>   
>>     
>>> Also, in /etc/cluster/cluster.conf, I added this:
>>> <dlm plock_ownership="1" plock_rate_limit="0"/>
>>> <gfs_controld plock_rate_limit="0"/>
>>>
>>> Any ideas on how to figure out what's going wrong, and how to tune
>>> GFS1 for better concurrent read performance, or tune GFS2 in general 
>>> to be competitive/better than GFS1?
>>>
>>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially and 
>>> somewhat good reaction times while under heavy sequential and/or 
>>> random load. But for now, I just wanna get the seq reading to work 
>>> acceptably fast.
>>>
>>> Thanks a lot for your help!
>>>
>>>     
>>>       
>> Can you try doing some I/O direct to the block device so that we can 
>> get an idea of what the raw device can manage? Using dd both read and 
>> write, across the nodes (different disk locations on each node to 
>> simulate different files).
>>
>> I'm wondering if the problem might be due to the seek pattern 
>> generated by the multiple read locations,
>>
>> Steve.
>>     
--
Michael Lackner
Chair of Information Technology, University of Leoben IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505




From kitgerrits at gmail.com  Thu Jun 17 07:25:56 2010
From: kitgerrits at gmail.com (Kit Gerrits)
Date: Thu, 17 Jun 2010 09:25:56 +0200
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4C18D761.8070606@mu-leoben.at>
Message-ID: <4c19ce01.e198d80a.799b.fffff293@mx.google.com>


Multipathing has a round-robin and a failover scheduler, which can be
configured in /etc/multipath.conf

The path_selector value only seems to support round-robin:
http://storagefoo.blogspot.com/2006/08/linux-native-multipathing-device.html


Maybe this helps:
		#
		# name    : path_grouping_policy
		# scope   : multipath
		# desc    : path grouping policy to apply to this multipath
		# values  : failover, multibus, group_by_serial
		# default : failover
		#
		path_grouping_policy	multibus

Specifies the default path grouping policy to apply to unspecified
multipaths. Possible values include:
failover = 1 path per priority group
multibus = all valid paths in 1 priority group
group_by_serial = 1 priority group per detected serial number
group_by_prio = 1 priority group per path priority value
group_by_node_name = 1 priority group per target node name
The default value is failover. 
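
As a rough sketch only, this is where such a setting would sit in
/etc/multipath.conf. The "user_friendly_names" line is just there as an
example neighbour, and whether "multibus" is the right policy depends on the
storage array, so treat the value as an assumption rather than a
recommendation:

```
# /etc/multipath.conf (fragment, illustrative values)
defaults {
        user_friendly_names     yes
        # assumption: put all valid paths into one priority group
        path_grouping_policy    multibus
}
```

After changing the file, the multipath maps have to be reloaded before the
new grouping shows up in "multipath -l".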

 

Regards,

Kit

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
Sent: woensdag 16 juni 2010 15:54
To: linux clustering
Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems

Hello!

Ok, I got the results. It seems that the scheduler can only be set for real,
physical block devices (not multipath devices), which should be ok I assume.
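
A minimal sketch of how the scheduler can be switched on all path devices at
once. In bash, a redirect to a glob that matches several files fails with an
"ambiguous redirect" error, so a loop is needed. Only the scheduler name is
written, not the bracketed list that the file shows when read. The function
name and the optional second argument (a directory to use instead of
/sys/block, handy for a dry run) are my own additions:

```shell
#!/bin/bash
# Sketch: switch the I/O scheduler on every sd* path device.
# The second argument is a hypothetical override for the sysfs root;
# it defaults to the real /sys/block.
set_scheduler() {
    local sched=$1
    local root=${2:-/sys/block}
    local f
    for f in "$root"/sd*/queue/scheduler; do
        # An unmatched glob stays literal, so skip anything not writable.
        [ -w "$f" ] || continue
        echo "$sched" > "$f"
        # Read the file back to show what is now selected.
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
```

Running "set_scheduler noop" as root takes effect on the fly. The setting is
not persistent across reboots.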

For curiosity's sake I tested all four schedulers for the dd read with 1MB
blocksize. Here are the results, per node and in total over all three
nodes. Numbers are in MB/sec again, sorted from slowest to fastest:

cfq: 15.8 / 15.8 / 15.2 (=46.8MB/s total)
noop: 24.3 / 24.1 / 24.3 (=72.7MB/s total)
deadline: 24.6 / 24.5 / 24.2 (=73.3MB/s total)
anticipatory: 24.9 / 24.8 / 24.5 (=74.2MB/s total)
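
For anyone who wants to re-check the totals, a small sketch using awk. The
helper names "sum_rates" and "speedup" are made up for this example:

```shell
# Sum a "a / b / c" list of per-node rates (MB/s).
sum_rates() {
    echo "$1" | awk -F' */ *' '{ t = 0; for (i = 1; i <= NF; i++) t += $i; printf "%.1f\n", t }'
}

# Ratio of two totals, e.g. best scheduler versus cfq.
speedup() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.2f\n", new / old }'
}

sum_rates "15.8 / 15.8 / 15.2"   # cfq total: 46.8
sum_rates "24.9 / 24.8 / 24.5"   # anticipatory total: 74.2
speedup 74.2 46.8                # 1.59
```

So the best scheduler reads about 1.59 times as fast as cfq in this test.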

Before and after each test, I flushed the write caches ("sync") and purged
all I/O caches ("echo 3 > /proc/sys/vm/drop_caches") to get results
unaffected by caching.
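
Roughly, one such test run can be sketched like this. The function name and
the default block count are assumptions for illustration, and GNU dd is
assumed for the "bs=1M" syntax:

```shell
# Sketch: one cold-cache sequential read, discarding the data.
# read_test SOURCE [COUNT]; SOURCE is a file or block device,
# COUNT is the number of 1MB blocks to read (default 1024).
read_test() {
    local src=$1
    local count=${2:-1024}
    sync    # flush dirty pages first
    # Dropping the page cache needs root; skip it quietly otherwise.
    if [ -w /proc/sys/vm/drop_caches ]; then
        echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true
    fi
    # dd prints its throughput summary on stderr.
    dd if="$src" of=/dev/null bs=1M count="$count"
}
```

Called as "read_test /path/to/testfile 4096" on each node at the same time,
this gives the concurrent read case. Without root the cache is not dropped,
so the numbers may then be inflated by cached data.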

So it seems the "anticipatory" scheduler wins for sequential reads, closely
followed by "deadline" and "noop". The only one that seems to really suck is
the default one, "cfq". I have not yet done any write tests with the
different schedulers, nor any random I/O tests. Also no single-node tests
this time (no more time today).

While this shows some significant improvement for this specific workload,
it's definitely still far below our expectations...

I will also check for the impact of the schedulers on sequential writes and
random I/O as soon as I've figured out how to run some good random I/O
tests.

In the meantime, I would be happy to listen to any additional suggestions to
further improve performance.

Thanks!

Jankowski, Chris wrote:
> Michael,
>
> I do not know the process for setting this up in a multipathing
> configuration, but the scheduler to test is the noop scheduler.
>
> Please let us know what it yields.
>
> Regards,
>
> Chris
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
> Sent: Wednesday, 16 June 2010 17:50
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
> problems
>
> Chris,
>
> Can do. Which one shall I try? I got these four to choose from:
>
> * noop
> * anticipatory
> * deadline
> * cfq
>
> One more thing, because of the Fibrechannel Storage I am using
> multipathing. And I cannot set the scheduler for the multipath device
> (/dev/dm-0), because "/sys/block/dm-0/queue/scheduler" doesn't exist. I
> actually have four paths to the storage that I can see as "/dev/sda",
> "/dev/sdb", "/dev/sdc" and "/dev/sdd".
>
> I guess it's ok if I change the scheduler for those four? Is it ok to just
> run a command similar to the one below, and will this change the scheduler
> on the fly?
>
> "echo noop > /sys/block/sd*/queue/scheduler"
>
> Because at the moment, the scheduler files for each blockdevice contain
> this line:
>
> "noop anticipatory deadline [cfq]"
>
> Maybe I would have to do something like "echo [noop] anticipatory
> deadline cfq > /sys/block/sd*/queue/scheduler"
> instead?
>
> Thanks for the help.
>
> Jankowski, Chris wrote:
>   
>> Michael,
>>
>> Would you be willing to repeat the tests with a large block size and a
>> different IO scheduler? Specifically, there is a scheduler that actually
>> is a null scheduler.
>>
>> I think that I saw cases when the cfq IO scheduler was not working all
>> that great on single streams.
>>
>> Thanks and regards,
>>
>> Chris
>>
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com 
>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael 
>> Lackner
>> Sent: Tuesday, 15 June 2010 22:04
>> To: linux clustering
>> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
>> problems
>>
>> Hello!
>>
>> I tried to do R/W tests comparing 4kB blocksize to 1MB blocksize now, and
>> the difference in performance was negligible. Also, GFS2 was almost on the
>> same speed level when compared to GFS1 for Reads (see below why..). I/O
>> scheduler is "cfq" by the way. I never really cared about the I/O scheduler
>> since I do not yet understand the differences between the available ones
>> anyway.
>>
>> But, I found out something else. As suggested by Steven in his reply, I
>> ran tests both on the GFS1/2 filesystems, and also on the raw blockdevice,
>> and surprisingly the results were almost the same!
>>
>> So: GFS1 as well as GFS2 3-Node concurrent, sequential Reads showed a
>> total of 40MB/s (GFS1) and 45MB/s (GFS2) using a blocksize of 1MB. For
>> single-node sequential read the performance went up to a nice 180-190MB/s
>> for both FS versions.
>>
>> Now, the surprising part: Doing a dd read on the raw blockdevice with 3
>> nodes showed a total of only ~60MB/s!! Almost as low as reading from GFS1/2
>> with multiple nodes at the same time!! When reading the raw blockdevice on a
>> single node, I got slightly over 190MB/s again.
>>
>> So, this concurrent read issue seems not to be a GFS1 or GFS2 problem,
>> but more a problem of the underlying storage. This is extremely surprising
>> and a bit shocking I must say.
>>
>> I guess for the Reads I will need to check the SAN itself, see if I can
>> do any optimization on it..  That thing can't possibly be that bad when it
>> comes to reading..
>>
>> Thanks a lot for your ideas so far!
>>
>> Jankowski, Chris wrote:
>>   
>>     
>>> Michael,
>>>
>>> For comparison, could you do your dd(1) tests with a very large block
>>> size (1 MB) and tell us the results, please?
>>>
>>> I have a vague hunch that the problem may have something to do with
>>> whether IO operations are coalesced or not.
>>>
>>> Also, which IO scheduler are you using?
>>>
>>> Thanks and regards,
>>>
>>> Chris Jankowski
>>>
>>>
>>> -----Original Message-----
>>> From: linux-cluster-bounces at redhat.com 
>>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael 
>>> Lackner
>>> Sent: Tuesday, 15 June 2010 00:22
>>> To: linux clustering
>>> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
>>> problems
>>>
>>> Hello!
>>>
>>> Thanks for your reply. I unfortunately forgot to mention HOW I was
>>> actually testing, stupid.
>>>
>>> I tested with dd, doing 4kB blocksize reads and writes, 160GB total
>>> testfile size per node.
>>> I read from /dev/zero for writing tests and wrote to /dev/null for
>>> reading tests. So, totally sequential, somewhat small blocksize (equal to
>>> filesystem BS).
>>>
>>> The performance was measured directly on the Fibrechannel Switch, which
>>> offers nice per-port monitoring for that purpose.
>>>
>>> I have yet to do some serious read testing on GFS2. I have aborted my
>>> GFS2 tests as write performance was not up to GFS1 to begin with. My
>>> older GFS2 benchmarks (I did this with a 2-node configuration before)
>>> are lost, I will need to re-do them to give you some numbers.
>>>
>>> After each write test I did a "sync" to flush everything to disks.  I
>>> did not do this before or after read tests though..
>>>
>>> As you mentioned Journal Size, "gfs_tool counters <mountpoint>" said
>>> that only 2-3% logspace were in use after the tests (I guess this is the
>>> per-node fs journal?).
>>>
>>> As for the direct I/O tests, by that you mean testing without ANY
>>> caching going on, a synchronous write? What I did before was test
>>> EXT3 (~190MB/s) and XFS (~320MB/s) on the Storage Array. I think what
>>> I'm getting here is raw throughput, since I am not monitoring in the OS,
>>> but at the Fibrechannel Switch itself..
>>>
>>> I will do GFS2 read tests similar to those conducted for GFS1. I'll be
>>> able to do that tomorrow morning, then I can post the numbers here.
>>>
>>> Thanks!
>>>
>>> Steven Whitehouse wrote:
>>>   
>>>     
>>>       
>>>> Hi,
>>>>
>>>> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>>>>   
>>>>     
>>>>       
>>>>         
>>>>> Hello!
>>>>>
>>>>> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
>>>>>
>>>>> At the moment, the storage subsystem consists of an HP MSA2312 
>>>>> Fibrechannel SAN linked to an FC 8gbit switch. Three client 
>>>>> machines are connected to that switch over 8gbit FC. The disks 
>>>>> themselves are
>>>>> 12 * 15.000rpm SAS configured in RAID-5 with two hotspares.
>>>>>
>>>>> Now, the whole storage shall be shared (single filesystem), here 
>>>>> GFS comes in.
>>>>>
>>>>> The Cluster is only 3 nodes large at the moment, more nodes will 
>>>>> be added later on. I am currently testing GFS1 and GFS2 for
>>>>> performance.
>>>>> Lock Management is done over single 1Gbit Ethernet Links (1 per 
>>>>> machine).
>>>>>
>>>>> Thing is, with GFS1 I get far better performance than with the
>>>>> newer GFS2 across the board, with a few tunable parameters set. For
>>>>> writes GFS1 is roughly twice as fast.
>>>>>
>>>>>     
>>>>>       
>>>>>         
>>>>>           
>>>> What tests are you running? GFS2 is generally faster than GFS1
>>>> except for streaming writes, which is an area that we are putting
>>>> some effort into solving currently. Small writes (one fs block (4k
>>>> default) or less) on GFS2 are much faster than on GFS1.
>>>>
>>>>   
>>>>     
>>>>       
>>>>         
>>>>> But, concurrent reads are totally abysmal. The total write 
>>>>> performance (all nodes combined) sits around 280-330Mbyte/sec, 
>>>>> whereas the READ performance is as low as 30-40Mbyte/sec when 
>>>>> doing concurrent reads. Surprisingly, single-node read is somewhat 
>>>>> ok at 180Mbyte/sec, but as soon as several nodes are reading from 
>>>>> GFS (version 1 at the
>>>>> moment) at the same time,  things turn ugly.
>>>>>
>>>>>     
>>>>>       
>>>>>         
>>>>>           
>>>> Reads on GFS2 should be much faster than GFS1, so it sounds as if 
>>>> something isn't working correctly for some reason. For cached data, 
>>>> reads on GFS2 should be as fast as ext2/3 since the code path is 
>>>> identical (to the page cache) and only changes if pages are not cached.
>>>> GFS1 does its locking at a higher level, so there will be more 
>>>> overhead for cached reads in general.
>>>>
>>>> Do make sure that if you are preparing the test files for reading
>>>> all from one node (or even just a different node to that on which
>>>> you are running the read tests), you sync them to disk on that
>>>> node before starting the tests to avoid issues with caching.
>>>>
>>>>   
>>>>     
>>>>       
>>>>         
>>>>> This is strange, because for writes, global performance across the
>>>>> cluster increases slightly when adding more nodes. But for reads,
>>>>> the opposite seems to be true.
>>>>>
>>>>> For read and write tests, separate testfiles were created and read
>>>>> for each node, with each testfile sitting in its own subdirectory,
>>>>> so no node would access another node's file.
>>>>>
>>>>>     
>>>>>       
>>>>>         
>>>>>           
>>>> That sounds like a good test set up to me.
>>>>
>>>>   
>>>>     
>>>>       
>>>>         
>>>>> GFS1 created with the following mkfs.gfs parameters:
>>>>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>>>>> (4kB blocksize, 16 * 128MB journals, 2GB resource groups,
>>>>> Distributed Lock Manager)
>>>>>
>>>>> Mount Options set: "noatime,nodiratime,noquota"
>>>>>
>>>>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
>>>>> demote_secs 20"
>>>>>     
>>>>>       
>>>>>         
>>>>>           
>>>> You shouldn't normally need to set the glock_purge and demote_secs
>>>> to anything other than the default. These settings no longer exist
>>>> in GFS2 since it makes use of the shrinker subsystem provided by the
>>>> VM and is auto-tuning. If your workload is metadata heavy, you
>>>> could try boosting the journal size and/or the incore_log_blocks
>>>> tunable.
>>>>
>>>>   
>>>>     
>>>>       
>>>>         
>>>>> Also, in /etc/cluster/cluster.conf, I added this:
>>>>> <dlm plock_ownership="1" plock_rate_limit="0"/> <gfs_controld 
>>>>> plock_rate_limit="0"/>
>>>>>
>>>>> Any ideas on how to figure out what's going wrong, and how to tune
>>>>> GFS1 for better concurrent read performance, or tune GFS2 in 
>>>>> general to be competitive/better than GFS1?
>>>>>
>>>>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially 
>>>>> and somewhat good reaction times while under heavy sequential 
>>>>> and/or random load. But for now, I just wanna get the seq reading 
>>>>> to work acceptably fast.
>>>>>
>>>>> Thanks a lot for your help!
>>>>>
>>>>>     
>>>>>       
>>>>>         
>>>>>           
>>>> Can you try doing some I/O direct to the block device so that we 
>>>> can get an idea of what the raw device can manage? Using dd both 
>>>> read and write, across the nodes (different disk locations on each 
>>>> node to simulate different files).
>>>>
>>>> I'm wondering if the problem might be due to the seek pattern 
>>>> generated by the multiple read locations,
>>>>
>>>> Steve.
>>>>         
> --
> Michael Lackner
> Chair of Information Technology, University of Leoben IT 
> Administration michael.lackner at mu-leoben.at | +43 (0)3842/402-1505
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   


--
Michael Lackner
Chair of Information Technology, University of Leoben IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505




From Chris.Jankowski at hp.com  Thu Jun 17 07:40:44 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Thu, 17 Jun 2010 07:40:44 +0000
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4c19ce01.e198d80a.799b.fffff293@mx.google.com>
References: <4C18D761.8070606@mu-leoben.at>
	<4c19ce01.e198d80a.799b.fffff293@mx.google.com>
Message-ID: <036B68E61A28CA49AC2767596576CD596BA5E239E6@GVW1113EXC.americas.hpqcorp.net>

Kit,

I think that you are mixing up multipathing and host path failover with the IO scheduler.
They are IMHO completely different areas of the operating system.

Michael was testing the impact of different IO schedulers.

Regards,

Chris Jankowski

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kit Gerrits
Sent: Thursday, 17 June 2010 17:26
To: 'linux clustering'
Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems


Multipathing has a round-robin and a failover scheduler, which can be configured in /etc/multipath.conf

The path_selector value only seems to support round-robin:
http://storagefoo.blogspot.com/2006/08/linux-native-multipathing-device.html


Maybe this helps:
                #
                # name    : path_grouping_policy
                # scope   : multipath
                # desc    : path grouping policy to apply to this multipath
                # values  : failover, multibus, group_by_serial
                # default : failover
                #
                path_grouping_policy    multibus

Specifies the default path grouping policy to apply to unspecified multipaths. Possible values include:
failover = 1 path per priority group
multibus = all valid paths in 1 priority group
group_by_serial = 1 priority group per detected serial number
group_by_prio = 1 priority group per path priority value
group_by_node_name = 1 priority group per target node name
The default value is failover.



Regards,

Kit

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From michael.lackner at mu-leoben.at  Thu Jun 17 08:05:39 2010
From: michael.lackner at mu-leoben.at (Michael Lackner)
Date: Thu, 17 Jun 2010 10:05:39 +0200
Subject: [Linux-cluster] GFS (1 & partially 2) performance problems
In-Reply-To: <4c19cb67.014fd80a.136c.fffff102@mx.google.com>
References: <4c19cb67.014fd80a.136c.fffff102@mx.google.com>
Message-ID: <4C19D753.1020303@mu-leoben.at>

Hello, Kit!

I concatenated the two mails of yours in my quote, I hope that's ok?

I do have a HP MSA2312fc here, yes. Fibrechannel. With EXT3 and XFS
performance was pretty good, but of course, those aren't cluster-aware and
can only ever be tested in single-node configuration (I didn't try multiple
volumes with 1 volume for each client with EXT/XFS though, since we need
a shared filesystem). If I test GFS1/2, both are also reasonably fast in
single-node config for reads (those 180-190MB/s I was talking about).

In single node operation I have never seen such drastic drops as you have,
no single-digit MB/s numbers..

As for the multipathing, this is what multipath -l tells me:

mpath0 (3600c0ff000da8493da059a4b01000000) dm-0 HP,MSA2312fc
[size=2.7T][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:2:1 sdc 8:32  [active][undef]
 \_ 2:0:3:1 sdd 8:48  [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:1 sda 8:0   [active][undef]
 \_ 2:0:1:1 sdb 8:16  [active][undef]

In "/etc/multipath.conf" I have only set the necessary device blacklistings
and the "user_friendly_names yes" option, nothing else. But I don't think this
can have performance implications? The slowest single FC link is 4Gbps, which
would equal a theoretical maximum of 512MB/s (+full duplex?) per link. I'm no
expert here, but I would guess that the multipath scheduling is not too
important for GFS performance.
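
As a rough sanity check on that per-link figure, here is a minimal shell
sketch. It assumes the usual 4GFC parameters (4.25 Gbaud line rate, 8b/10b
encoding), which are not taken from this thread, and the function names are
made up for illustration.

```shell
#!/bin/bash
# Rough per-direction payload rate of an FC link, in MB/s (decimal).
# Assumption: 8b/10b encoding, so 10 line bits carry 1 payload byte.
# $1 = line rate in Mbaud (4GFC is assumed to run at 4250 Mbaud).
fc_payload_mbs() {
    echo $(( $1 / 10 ))
}

# Naive figure that ignores encoding: nominal Gbit/s divided by 8.
# $1 = nominal link speed in Gbit/s.
naive_mbs() {
    echo $(( $1 * 1000 / 8 ))
}

echo "4GFC naive:   $(naive_mbs 4) MB/s"
echo "4GFC payload: $(fc_payload_mbs 4250) MB/s (before framing overhead)"
```

Whichever figure is used, a combined ~60MB/s across three nodes is far below
what a single link can carry, so link bandwidth is unlikely to be the limit.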

It seems just to be responsible for choosing the right FC links to transfer
data over. But even if all three of my clients chose to use the same link to
the MSA2312fc, it should still be ok? Switch monitoring however tells me that
the client transfers are being distributed over several links that the MSA
has anyway.

Round-Robin at work I suppose. FC links pretty much under-utilized...

Thanks!

Kit Gerrits wrote:
> Didn't you have that HP MSA with the fibrechannel interfaces?
>
> I have exactly the same device, also with HP DL380 and HP DL 580 hosts with
> two FC interfaces.
> I've seen similarly insane statistics using only ext2fs mounts. (even worse,
> around 7MB/s)
> It went away after a while, but I have no idea where it came from or why it
> left.
> (I was backing up files with tar-over-ssh)
>
> I would really like to know how you get rid of it, if ever.
>
>
> Multipathing has a round-robin and a failover scheduler, which can be
> configured in /etc/multipath.conf
>
> The path_selector value only seems to support round-robin:
> http://storagefoo.blogspot.com/2006/08/linux-native-multipathing-device.html
>
>
> Maybe this helps:
> 		#
> 		# name    : path_grouping_policy
> 		# scope   : multipath
> 		# desc    : path grouping policy to apply to this multipath
> 		# values  : failover, multibus, group_by_serial
> 		# default : failover
> 		#
> 		path_grouping_policy	multibus
>
> Specifies the default path grouping policy to apply to unspecified
> multipaths. Possible values include:
> failover = 1 path per priority group
> multibus = all valid paths in 1 priority group
> group_by_serial = 1 priority group per detected serial number
> group_by_prio = 1 priority group per path priority value
> group_by_node_name = 1 priority group per target node name
> The default value is failover. 
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
> Sent: dinsdag 15 juni 2010 14:04
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems
>
> Hello!
>
> I tried to do R/W tests comparing 4kB blocksize to 1MB blocksize now, and
> the difference in performance was negligible. Also, GFS2 was almost on the
> same speed level when compared to GFS1 for Reads (see below why..). I/O
> scheduler is "cfq" by the way. I never really cared about the I/O scheduler
> since I do not yet understand the differences between the available ones
> anyway.
>
> But, I found out something else. As suggested by Steven in his reply, I ran
> tests both on the GFS1/2 filesystems, and also on the raw blockdevice, and
> surprisingly the  results were almost the same!
>
> So: GFS1 as well as GFS2 3-Node concurrent, sequential Reads showed a total
> of 40MB/s (GFS1) and 45MB/s (GFS2) using a blocksize of 1MB. For single-node
> sequential read the performance went up to a nice 180-190MB/s for both FS
> versions.
>
> Now, the surprising part: Doing a dd read on the raw blockdevice with 3
> nodes showed a total of only ~60MB/s!! Almost as low as reading from GFS1/2
> with multiple nodes at the same time!! When reading the raw blockdevice on a
> single node, I got slightly over 190MB/s again.
>
> So, this concurrent read issue seems not to be a GFS1 or GFS2 problem, but
> more a problem of the underlying storage. This is extremely surprising and a
> bit shocking I must say.
>
> I guess for the Reads I will need to check the SAN itself, see if I can do
> any optimization on it..  That thing can't possibly be that bad when it
> comes to reading..
>
> Thanks a lot for your ideas so far!
>
> Jankowski, Chris wrote:
>> Michael,
>>
>> For comparison, could you do your dd(1) tests with a very large block size
>> (1 MB) and tell us the results, please?
>>
>> I have a vague hunch that the problem may have something to do with
>> coalescing or not of IO operations.
>>
>> Also, which IO scheduler are you using?
>>
>> Thanks and regards,
>>
>> Chris Jankowski
>>
>>
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com
>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
>> Sent: Tuesday, 15 June 2010 00:22
>> To: linux clustering
>> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems
>>
>> Hello!
>>
>> Thanks for your reply. I unfortunately forgot to mention HOW I was
>> actually testing, stupid.
>>
>> I tested with dd, doing 4kB blocksize reads and writes, 160GB total
>> testfile size per node.
>>
>> I read from /dev/zero for writing tests and wrote to /dev/null for reading
>> tests. So, totally sequential, somewhat small blocksize (equal to filesystem
>> BS).
>>
>> The performance was measured directly on the Fibrechannel Switch, which
>> offers nice per-port monitoring for that purpose.
>>
>> I have yet to do some serious read testing on GFS2. I have aborted my
>> GFS2 tests as write performance was not up to GFS1 to begin with. My older
>> GFS2 benchmarks (I did this with a 2-node configuration before) are lost,
>> I will need to re-do them to give you some numbers.
>>
>> After each write test I did a "sync" to flush everything to disks. I did
>> not do this before or after read tests though..
>>
>> As you mentioned Journal Size, "gfs_tool counters <mountpoint>" said that
>> only 2-3% logspace were in use after the tests (I guess this is the per-node
>> fs journal?).
>>
>> As for the direct I/O tests, by that you mean testing without ANY
>> caching going on, a synchronous write? What I did before was test EXT3
>> (~190MB/s) and XFS (~320MB/s) on the Storage Array. I think what I'm
>> getting here is raw throughput, since I am not monitoring in the OS, but
>> at the Fibrechannel Switch itself..
>>
>> I will do GFS2 read tests similar to those conducted for GFS1. I'll be
>> able to do that tomorrow morning, then I can post the numbers here.
>>
>> Thanks!
>>
>> Steven Whitehouse wrote:
>>> Hi,
>>>
>>> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>>>> Hello!
>>>>
>>>> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
>>>>
>>>> At the moment, the storage subsystem consists of an HP MSA2312 
>>>> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines 
>>>> are connected to that switch over 8gbit FC. The disks themselves are
>>>> 12 * 15.000rpm SAS configured in RAID-5 with two hotspares.
>>>>
>>>> Now, the whole storage shall be shared (single filesystem), here GFS 
>>>> comes in.
>>>>
>>>> The Cluster is only 3 nodes large at the moment, more nodes will be 
>>>> added later on. I am currently testing GFS1 and GFS2 for performance.
>>>> Lock Management is done over single 1Gbit Ethernet Links (1 per 
>>>> machine).
>>>>
>>>> Thing is, with GFS1 I get far better performance than with the newer
>>>> GFS2 across the board, with a few tunable parameters set, for writes
>>>> GFS1 is roughly twice as fast.
>>>>
>>> What tests are you running? GFS2 is generally faster than GFS1 except
>>> for streaming writes, which is an area that we are putting some
>>> effort into solving currently. Small writes (one fs block (4k
>>> default) or less) on GFS2 are much faster than on GFS1.
>>>
>>>> But, concurrent reads are totally abysmal. The total write
>>>> performance (all nodes combined) sits around 280-330Mbyte/sec,
>>>> whereas the READ performance is as low as 30-40Mbyte/sec when doing
>>>> concurrent reads. Surprisingly, single-node read is somewhat ok at
>>>> 180Mbyte/sec, but as soon as several nodes are reading from GFS
>>>> (version 1 at the moment) at the same time, things turn ugly.
>>>>
>>> Reads on GFS2 should be much faster than GFS1, so it sounds as if 
>>> something isn't working correctly for some reason. For cached data, 
>>> reads on GFS2 should be as fast as ext2/3 since the code path is 
>>> identical (to the page cache) and only changes if pages are not cached.
>>> GFS1 does its locking at a higher level, so there will be more 
>>> overhead for cached reads in general.
>>>
>>> Do make sure that if you are preparing the test files for reading all
>>> from one node (or even just a different node to that on which you are
>>> running the read tests) that you sync them to disk on that
>>> node before starting the tests to avoid issues with caching.
>>>
>>>> This is strange, because for writes, global performance across the
>>>> cluster increases slightly when adding more nodes. But for reads,
>>>> the opposite seems to be true.
>>>>
>>>> For read and write tests, separate testfiles were created and read
>>>> for each node, with each testfile sitting in its own subdirectory,
>>>> so no node would access another node's file.
>>>>
>>> That sounds like a good test set up to me.
>>>
>>>> GFS1 created with the following mkfs.gfs parameters:
>>>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>>>> (4kB blocksize, 16 * 128MB journals, 2GB resource groups,
>>>> Distributed LockManager)
>>>>
>>>> Mount Options set: "noatime,nodiratime,noquota"
>>>>
>>>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
>>>> demote_secs 20"
>>>>
>>> You shouldn't normally need to set the glock_purge and demote_secs to 
>>> anything other than the default. These settings no longer exist in
>>> GFS2 since it makes use of the shrinker subsystem provided by the VM 
>>> and is auto-tuning. If your workload is metadata heavy, you could try 
>>> boosting the journal size and/or the incore_log_blocks tunable.
>>>
>>>> Also, in /etc/cluster/cluster.conf, I added this:
>>>> <dlm plock_ownership="1" plock_rate_limit="0"/>
>>>> <gfs_controld plock_rate_limit="0"/>
>>>>
>>>> Any ideas on how to figure out what's going wrong, and how to tune
>>>> GFS1 for better concurrent read performance, or tune GFS2 in general 
>>>> to be competitive/better than GFS1?
>>>>
>>>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially and 
>>>> somewhat good reaction times while under heavy sequential and/or 
>>>> random load. But for now, I just wanna get the seq reading to work 
>>>> acceptably fast.
>>>>
>>>> Thanks a lot for your help!
>>>>
>>> Can you try doing some I/O direct to the block device so that we can 
>>> get an idea of what the raw device can manage? Using dd both read and 
>>> write, across the nodes (different disk locations on each node to 
>>> simulate different files).
>>>
>>> I'm wondering if the problem might be due to the seek pattern 
>>> generated by the multiple read locations,
>>>
>>> Steve.
> --
> Michael Lackner
> Chair of Information Technology, University of Leoben IT Administration
> michael.lackner at mu-leoben.at | +43 (0)3842/402-1505
>
>   


-- 
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505



From dhoffutt at gmail.com  Thu Jun 17 14:00:54 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Thu, 17 Jun 2010 09:00:54 -0500
Subject: [Linux-cluster] Higher Grained Definition ofIP
	AddressAssignments
In-Reply-To: <4c19ca35.0d44d80a.414b.fffff48d@mx.google.com>
References: <4C17748C.8010801@gmail.com>
	<4c19ca35.0d44d80a.414b.fffff48d@mx.google.com>
Message-ID: <AANLkTimAAgZw0RBTPdHInu8KiHwurdGX5WXS3e31bgCn@mail.gmail.com>

Using the nodes' IPs would not work. The software being made HA must keep
its IPs the same no matter what node it's running on. We could script an IP
change, but then we'd be putting IP logic and monitoring in two places: the
cluster software and our custom scripting. That's not a clean solution
and is rather a step backwards.

We may as well just do our own HA if we were starting down that road. When
we sell our product the customer must also purchase Redhat Support for their
OS and cluster software. I would think Redhat should pony up to get this
done as the product we are selling is selling well and inducing Redhat
Support sales.

An official feature request has been submitted to Redhat.

Also, I'm working on the /usr/share/cluster/ip.sh script myself to add the
feature. Hopefully it works out.


On Thu, Jun 17, 2010 at 2:09 AM, Kit Gerrits <kitgerrits at gmail.com> wrote:

> In that case, might it be easier to simply use the host IP addresses and
> not the cluster IPs?
> (the application will need to handle up/down events itself)
>
>
> Regards,
>
> Kit
>
>  ------------------------------
> *From:* linux-cluster-bounces at redhat.com [mailto:
> linux-cluster-bounces at redhat.com] *On Behalf Of *Dustin Henry Offutt
> *Sent:* dinsdag 15 juni 2010 14:40
> *To:* linux clustering
> *Subject:* Re: [Linux-cluster] Higher Grained Definition ofIP
> AddressAssignments
>
> I've spent the past year architecting an HA cluster with RHCS and it's
> working wonderfully. I have not seen anything superior.
>
> Due to a new customer-driven feature of our software, we need to add the
> ability for a cluster service/resource group to have up to eight distinct
> IPs on one particular network due to the software being made highly
> available via RHCS performing its own load balancing. Placing the load
> balancing elsewhere is not an option due to the nature of the product.
>
> Regarding "OCF_RESKEY_," will google more on this and appreciate the tip.
> Must work this out some way.
>
> ~ Dusty
>
> C. Handel wrote:
>
> [define interface of cluster controlled ip resource]
>
>
>
> /usr/share/cluster/ip.sh appears to perform the link-monitoring in the
>
>
> This is a resource agent script. What attributes a resource agent
> accepts can be found by calling it with the option meta-data
>
> /usr/share/cluster/ip.sh meta-data
>
> There is no attribute interface. The agent will add the additional
> address to the first interface that is in the same subnet.
>
> You could edit the script and add a parameter interface yourself. Add
> a new parameter into the XML at the beginning and access it in the
> script with OCF_RESKEY_...
>
> I don't understand what you are trying to do. If you are only handling
> network interfaces as services, then rhcs is most likely the wrong
> tool. If you would explain your goal we could probably suggest other
> solutions.
>
> Greetings
>    Christoph
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

From jimbobpalmer at gmail.com  Thu Jun 17 15:58:44 2010
From: jimbobpalmer at gmail.com (jimbob palmer)
Date: Thu, 17 Jun 2010 17:58:44 +0200
Subject: [Linux-cluster] qdisk WITHOUT fencing
Message-ID: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>

Dear distinguished linux-cluster members!

I have two data centers linked by physical fibre. Everything goes over
this physical route: everything.

I would like to setup a high availability nfs server with drbd:
* drbd to replicate storage
* nfsd running
* floating ip

If the physical link between the two data centers is lost, I would
like the primary data center to win.

I've setup a qdisk, and this works well: the node which can access the
qdisk wins. i.e. the primary datacenter, which is the data center
where the san holding the qdisk also lives, wins.

Unfortunately for me, I get pages and pages of errors about being
unable to fence the secondary node.

The docs tell me that I absolutely must use power fencing, but in this
case fencing makes no sense: it won't work when the link between the
data centers is severed. The network and the qdisk together decide
who "wins".

So what should I do?

Many thanks in advance.



From dhoffutt at gmail.com  Thu Jun 17 19:59:59 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Thu, 17 Jun 2010 14:59:59 -0500
Subject: [Linux-cluster] Higher Grained Definition ofIP
	AddressAssignments
In-Reply-To: <AANLkTimAAgZw0RBTPdHInu8KiHwurdGX5WXS3e31bgCn@mail.gmail.com>
References: <4C17748C.8010801@gmail.com>
	<4c19ca35.0d44d80a.414b.fffff48d@mx.google.com>
	<AANLkTimAAgZw0RBTPdHInu8KiHwurdGX5WXS3e31bgCn@mail.gmail.com>
Message-ID: <AANLkTinM89aHpfqW5GDXl3OAGgibiHchXZdIpgligFaB@mail.gmail.com>

I believe this issue has been resolved by altering /usr/share/cluster/ip.sh.

The modified script adds XML for a new "device" parameter.

The new 'device' variable is passed to the ip_op function and from there to
the ipv4 and ipv6 functions. Previously, those functions iterated through all
network devices and, upon finding a device whose configuration matched the IP
to be assigned, assigned the IP there, which caused all the IPs to bunch up
on one device. With the added logic, if a "device" value is requested, the
iteration only accepts the device whose name matches it.

Is there some way to get this put into the Cluster Suite officially so that
it may be supported?

Thank you...

(diff -cB)

*** ip.sh.original    2010-06-17 10:43:00.000000000 -0500
--- ip.sh    2010-06-17 14:42:26.000000000 -0500
***************
*** 86,91 ****
--- 86,104 ----
              <content type="string"/>
          </parameter>

+         <parameter name="device">
+             <longdesc lang="en">
+                 Specify network device to bring this
+                 IP up on. Optional. Example: "eth0"
+             </longdesc>
+
+             <shortdesc lang="en">
+                 Network device
+             </shortdesc>
+
+             <content type="string" default="auto"/>
+         </parameter>
+
          <parameter name="monitor_link">
              <longdesc lang="en">
                  Enabling this causes the status check to fail if
***************
*** 571,576 ****
--- 583,589 ----
      declare addr_exp=$(ipv6_expand $addr)

      while read dev ifaddr_exp maskbits; do
+             if ([ -z $3 ] || [ "$3" = "auto" ]) || [ "$dev" = "$3" ]; then
              if [ -z "$dev" ]; then
                  continue
          fi
***************
*** 636,641 ****
--- 649,655 ----
          fi

          return 0
+             fi
      done < <(ipv6_list_interfaces)

      return 1
***************
*** 651,656 ****
--- 664,670 ----
      declare addr=$2

      while read dev ifaddr maskbits; do
+             if ([ -z $3 ] || [ "$3" = "auto" ]) || [ "$dev" = "$3" ]; then
              if [ -z "$dev" ]; then
                  continue
          fi
***************
*** 715,720 ****
--- 729,735 ----
          fi

          return 0
+             fi
      done  < <(ipv4_list_interfaces)

      return 1
***************
*** 813,819 ****

  #
  # Usage:
! # ip_op <family> <operation> <address> [quiet]
  #
  ip_op()
  {
--- 827,833 ----

  #
  # Usage:
! # ip_op <family> <operation> <address> <device> [quiet]
  #
  ip_op()
  {
***************
*** 866,872 ****

      case $1 in
      inet)
!         ipv4 $2 $3
          return $?
          ;;
      inet6)
--- 880,886 ----

      case $1 in
      inet)
!         ipv4 $2 $3 $4
          return $?
          ;;
      inet6)
***************
*** 923,929 ****
          ocf_log debug "${OCF_RESKEY_address} already configured"
          exit 0
      fi
!     ip_op ${OCF_RESKEY_family} add ${OCF_RESKEY_address}
      if [ $? -ne 0 ]; then
          exit $OCF_ERR_GENERIC
      fi
--- 937,943 ----
          ocf_log debug "${OCF_RESKEY_address} already configured"
          exit 0
      fi
!     ip_op ${OCF_RESKEY_family} add ${OCF_RESKEY_address} ${OCF_RESKEY_device}
      if [ $? -ne 0 ]; then
          exit $OCF_ERR_GENERIC
      fi
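
The device-matching condition added in the diff can be tried in isolation.
The following is only a sketch and not part of ip.sh: the function names
device_matches and pick_device and the sample interface names are made up
for illustration. It also quotes the requested value, which the diff leaves
unquoted.

```shell
#!/bin/bash
# Hypothetical helper mirroring the condition added to ipv4()/ipv6():
# accept a candidate interface when no device was requested, when the
# request is "auto", or when the two names are equal.
# $1 = requested device (may be empty), $2 = candidate interface name
device_matches() {
    [ -z "$1" ] || [ "$1" = "auto" ] || [ "$2" = "$1" ]
}

# Walk a made-up interface list in order, the way the while-read loops
# in the agent do, and print the first interface that is accepted.
# $1 = requested device (may be empty)
pick_device() {
    while read -r dev; do
        [ -z "$dev" ] && continue
        if device_matches "$1" "$dev"; then
            echo "$dev"
            return 0
        fi
    done <<EOF
eth0
eth1
eth2
EOF
    return 1
}

pick_device ""      # no device requested: first interface is accepted
pick_device "auto"  # treated the same as an empty request
pick_device "eth1"  # only eth1 is accepted
```

With the patched agent, the parameter would presumably be set like any other
resource attribute in cluster.conf, for example device="eth1" on the ip
resource; that usage is an assumption and has not been tested here.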


On Thu, Jun 17, 2010 at 9:00 AM, Dustin Henry Offutt <dhoffutt at gmail.com> wrote:

> Using the node's IPs would not work. The software being made HA must keep
> its IPs the same no matter what node its running on. Could script an IP
> change, but then we're putting IP logic and monitoring in two places: The
> cluster software and in our custom scripting. That's not a clean solution
> and is rather going backwards.
>
> We may as well just do our own HA if we were starting down that road. When
> we sell our product the customer must also purchase Redhat Support for their
> OS and cluster software. I would think Redhat should pony up to get this
> done as the product we are selling is selling well and inducing Redhat
> Support sales.
>
> An official feature request has been submitted to Redhat.
>
> Also, I'm working on the /usr/share/cluster/ip.sh script myself to add the
> feature. Hopefully it works out.
>
>
>
> On Thu, Jun 17, 2010 at 2:09 AM, Kit Gerrits <kitgerrits at gmail.com> wrote:
>
>> In that case, might it be easier to simply use the host IP addresses and
>> not the cluster IPs?
>> (the application will need to handle up/down events itself)
>>
>>
>> Regards,
>>
>> Kit
>>
>>  ------------------------------
>> *From:* linux-cluster-bounces at redhat.com [mailto:
>> linux-cluster-bounces at redhat.com] *On Behalf Of *Dustin Henry Offutt
>> *Sent:* dinsdag 15 juni 2010 14:40
>> *To:* linux clustering
>> *Subject:* Re: [Linux-cluster] Higher Grained Definition ofIP
>> AddressAssignments
>>
>> I've spent the past year architecting an HA cluster with RHCS and it's
>> working wonderfully. I have not seen anything superior.
>>
>> Due to a new customer-driven feature of our software, we need to add the
>> ability for a cluster service/resource group to have up to eight distinct
>> IPs on one particular network due to the software being made highly
>> available via RHCS performing its own load balancing. Placing the load
>> balancing elsewhere is not an option due to the nature of the product.
>>
>> Regarding "OCF_RESKEY_," will google more on this and appreciate the tip.
>> Must work this out some way.
>>
>> ~ Dusty
>>
>> C. Handel wrote:
>>
>> [define interface of cluster controlled ip resource]
>>
>>
>>
>> /usr/share/cluster/ip.sh appears to perform the link-monitoring in the
>>
>>
>> This is a resource agent script. What attributes a resource agent
>> accepts can be found by calling it with the option meta-data
>>
>> /usr/share/cluster/ip.sh meta-data
>>
>> There is no attribute interface. The agent will add the additional
>> address to the first interface that is in the same subnet.
>>
>> You could edit the script and add a parameter interface yourself. Add
>> a new parameter into the XML at the beginning and access it in the
>> script with OCF_RESKEY_...
>>
>> I don't understand what you are trying to do. If you are only handling
>> network interfaces as services, then rhcs is most likely the wrong
>> tool. If you would explain your goal we could probably suggest other
>> solutions.
>>
>> Greetings
>>    Christoph
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>

From Chris.Jankowski at hp.com  Thu Jun 17 23:31:25 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Thu, 17 Jun 2010 23:31:25 +0000
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
Message-ID: <036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>

Jim,

You have hit an architectural limitation that is specific to the Linux Cluster design and that other clusters tend not to have.

Linux Cluster assumes that you will *always* be able to execute fencing of *all* other nodes.  In fact, this is a stated *prerequisite* for correct operation of the cluster.

This is all very well when you have two PCs under your desk and a power switch.

However, this model completely fails when any network more complex than a power switch is present. Your network fails and you have a partitioned cluster that cannot fence. It all gets stuck. From a practical, operational IT point of view this is a disaster worse than not having a cluster.

Having come to Linux Cluster with a TruCluster background, I always had a problem with the STONITH approach used by Linux Cluster. I deem it harmful. But I see no inclination anywhere in the Linux Cluster world to remove it.

I believe that there is a major philosophical chasm between the design stance of Linux Cluster and that of others. Linux Cluster seems to be saying "A node is the centre of the world and can control it".  Other clusters take the opposite stance: "A node is a part of the world, cannot control it and may have a very limited visibility of the world in some circumstances."

Regards,

Chris Jankowski



-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of jimbob palmer
Sent: Friday, 18 June 2010 01:59
To: linux-cluster at redhat.com
Subject: [Linux-cluster] qdisk WITHOUT fencing

Dear distinguished linux-cluster members!

I have two data centers linked by physical fibre. Everything goes over this physical route: everything.

I would like to setup a high availability nfs server with drbd:
* drbd to replicate storage
* nfsd running
* floating ip

If the physical link between the two data centers is lost, I would like the primary data center to win.

I've setup a qdisk, and this works well: the node which can access the qdisk wins. i.e. the primary datacenter, which is the data center where the san holding the qdisk also lives, wins.

Unfortunately for me, I get pages and pages of errors about being unable to fence the secondary node.

The docs tell me that I absolutely must use power fencing, but in this case fencing makes no sense: it won't work when the link between the data centers is severed. The network, and the qdisk is the decider for who "wins".

So what should I do?

Many thanks in advance.




From jcasale at activenetwerx.com  Thu Jun 17 23:46:13 2010
From: jcasale at activenetwerx.com (Joseph L. Casale)
Date: Thu, 17 Jun 2010 23:46:13 +0000
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
	<036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <CA5A491E9DEFBE4CB777DE97E21575E9056C3F8D@prato.activenetwerx.local>

>The Linux Cluster seems to be saying "A node is the centre of the world and can control it".

While I won't question your knowledge on the subject, doesn't a quorum
mitigate this to some degree?

As for the OP's original dilemma, if you can design fault tolerance into
your own procedure, you can trivially write your own fence agent script
like I did for an HP ProCurve (the iscsi script in 5.5 didn't work in my
5.4 cluster as a result of newer deps).

You can make use of whatever technologies you want, such as iptables or
switch ports, in your own script and return "success" to fenced so things
carry on.

I use drbd between my two nodes w/o a qdisk and have drbd play a role in
mitigating the issues you describe. My two nodes are also separated by fiber
and I encountered the same issue where one node might not be able to fence
the other properly.

jlc





From gordan at bobich.net  Thu Jun 17 23:54:14 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Fri, 18 Jun 2010 00:54:14 +0100
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
	<036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <4C1AB5A6.1090509@bobich.net>

The problem you are overlooking is that without a reliable way to 
prevent split-brain, you cannot ensure that the services you are trying 
to make resilient will handle failure without resource clashes.

If you have a suggestion on how to make that viable, I'm sure it will be 
listened to. But I cannot see how you can logically prevent resource 
clash (or worse, in case of a shared file system, data corruption) 
without a reliable fencing method.

If all you want to do is fail over some floating IPs, then fair enough, 
you might be able to get away to some extent without fencing (you can 
always manually get into the nodes via their fixed IPs to rectify any 
issues). For anything more complex, I don't see how you can make do 
without reliable fencing.

Gordan

On 18/06/2010 00:31, Jankowski, Chris wrote:
> Jim,
>
> You have hit an architectural limitation that is specific to the Linux Cluster design and that other clusters tend not to have.
>
> Linux Cluster assumes that you will *always* be able to execute fencing of *all* other nodes.  In fact, this is a stated *prerequisite* for correct operation of the cluster.
>
> This is all very well when you have two PCs under your desk and a power switch.
>
> However, this model completely fails when any network more complex than a power switch is present. Your network fails and you have a partitioned cluster that cannot fence. It all gets stuck. From a practical, operational IT point of view this is a disaster worse than not having a cluster.
>
> Having come to Linux Cluster with a TruCluster background, I always had a problem with the STONITH approach used by Linux Cluster. I deem it harmful. But I see no inclination anywhere in the Linux Cluster world to remove it.
>
> I believe that there is a major philosophical chasm between the design stance of Linux Cluster and that of others. Linux Cluster seems to be saying "A node is the centre of the world and can control it".  Other clusters take the opposite stance: "A node is a part of the world, cannot control it and may have a very limited visibility of the world in some circumstances."
>
> Regards,
>
> Chris Jankowski
>
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of jimbob palmer
> Sent: Friday, 18 June 2010 01:59
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] qdisk WITHOUT fencing
>
> Dear distinguished linux-cluster members!
>
> I have two data centers linked by physical fibre. Everything goes over this physical route: everything.
>
> I would like to setup a high availability nfs server with drbd:
> * drbd to replicate storage
> * nfsd running
> * floating ip
>
> If the physical link between the two data centers is lost, I would like the primary data center to win.
>
> I've setup a qdisk, and this works well: the node which can access the qdisk wins. i.e. the primary datacenter, which is the data center where the san holding the qdisk also lives, wins.
>
> Unfortunately for me, I get pages and pages of errors about being unable to fence the secondary node.
>
> The docs tell me that I absolutely must use power fencing, but in this case fencing makes no sense: it won't work when the link between the data centers is severed. The network, and the qdisk is the decider for who "wins".
>
> So what should I do?
>
> Many thanks in advance.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From jumanjiman at gmail.com  Thu Jun 17 23:57:48 2010
From: jumanjiman at gmail.com (Paul Morgan)
Date: Thu, 17 Jun 2010 19:57:48 -0400
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
Message-ID: <AANLkTikrwMHLQi2ewROY2HkhOex4GiWKeqxdUaWlfAQR@mail.gmail.com>

The goal of fencing is to guarantee that errant nodes cannot corrupt file
systems. If you can guarantee that by some other means, then you could write
a custom fence agent script that returns 0 once that guarantee holds.
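
As a rough illustration of that idea, here is a minimal sketch of such an agent in Python. It assumes the cluster hands the agent its options as name=value lines on standard input; the option name "nodename" and the helper peer_is_isolated() are hypothetical placeholders, and the real safety check is entirely site-specific.

```python
import sys


def parse_options(lines):
    """Parse name=value lines (assumed to arrive on the agent's stdin)."""
    opts = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments and anything that is not name=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def peer_is_isolated(node):
    """Hypothetical site-specific check.

    It must return True only when the node provably cannot touch the
    shared data any more. The placeholder never claims success.
    """
    return False


def fence(opts, check=peer_is_isolated):
    """Return 0 (success) only when the guarantee holds, 1 otherwise."""
    node = opts.get("nodename")  # option name is an assumption
    if not node:
        return 1
    return 0 if check(node) else 1


def main():
    # A real agent script would end with: sys.exit(main())
    return fence(parse_options(sys.stdin))
```

The important design point is the default: when the check cannot prove isolation, the agent reports failure, so the cluster keeps blocking rather than risking corruption.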


From gordan at bobich.net  Thu Jun 17 23:58:10 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Fri, 18 Jun 2010 00:58:10 +0100
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <CA5A491E9DEFBE4CB777DE97E21575E9056C3F8D@prato.activenetwerx.local>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>	<036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
	<CA5A491E9DEFBE4CB777DE97E21575E9056C3F8D@prato.activenetwerx.local>
Message-ID: <4C1AB692.4060601@bobich.net>

On 18/06/2010 00:46, Joseph L. Casale wrote:

> You can make use of whatever technologies you want such as iptables, switch
> ports etc in your own script and return a "success" to the fenced so things
> carry on.

Absolutely. Many things can be used to provide reliable fencing.

> I use drbd between my two nodes w/o a qdisk and have drbd play a role in mitigating
> the issues you describe. My two nodes are also separated by fiber and I
> encountered the same issue where one node might be able to fence the other
> properly.

Sure, DRBD can assume it's primary after a timeout when the peer goes 
away, but the stonith option is there for a reason. Ultimately, when the 
other node re-joins, one DRBD instance will "win", and the other will 
sync to it. If something else was accessing the losing instance at the 
time, it'll have the rug pulled out from under it when its FS consistency 
suddenly goes down the pan.

Gordan



From brem.belguebli at gmail.com  Fri Jun 18 06:12:36 2010
From: brem.belguebli at gmail.com (Brem Belguebli)
Date: Fri, 18 Jun 2010 08:12:36 +0200
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
	<036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <1276841556.7576.8.camel@newgen.localdomain>

If I may make this comparison:
- All the other known cluster stacks (linux/unix/win....) have the
Japanese (Harakiri) sense of honor, i.e. if a node goes wrong and commits
suicide, all the remaining nodes blindly trust that the node
committed suicide.
- RHCS has the Italian (Mafioso) sense of honor: when a node goes
wrong, even if some cluster process makes this node commit suicide
(qdisk for instance), the remaining nodes do not trust it until some
node of the cluster "shoots the sick node in the head".

It's clear that geo clustering with RHCS is normally impossible due to
this constraint, though some scripting logic could allow fencing to be
bypassed completely while still guaranteeing the integrity of the cluster.

Brem





From Chris.Jankowski at hp.com  Fri Jun 18 06:57:13 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Fri, 18 Jun 2010 06:57:13 +0000
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <1276841556.7576.8.camel@newgen.localdomain>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
	<036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
	<1276841556.7576.8.camel@newgen.localdomain>
Message-ID: <036B68E61A28CA49AC2767596576CD596BA5FFDABB@GVW1113EXC.americas.hpqcorp.net>

Brem,

I love this analogy.

Using the analogy you gave, the problem with a mafioso is that he cannot kill all other mafiosos in the gang when they are all sitting in solitary confinement cells (:-)).

I would like to remark that this STONITH business causes endless problems in clusters within a single data centre too. For example a temporary hiccup on the network that causes short heartbeat failure triggers all nodes of the cluster to kill the other nodes. And boy, do they succeed with a typical HP iLO fencing. You can see all your nodes going down. Then they come back and the shootout continues essentially indefinitely if fencing works. If not, then they all block.

And all of that is so unnecessary, as a combination of a properly implemented quorum disk and SCSI reservations with local boot disks and data disks on shared storage could provide quorum maintenance, split-brain avoidance and protection of the integrity of the filesystem. DEC ASE cluster on Ultrix and MIPS hardware had that in 1991. You do not even need GFS2, although it is very nice to have a real cluster filesystem.

By the way, I believe that a commercial stretched cluster on Linux is not possible if you rely on LVM for distributed storage. Linux LVM is architecturally incapable of providing any resilience over distance, IMHO.  It is missing the plex and subdisk layers as in Veritas LVM and has no notion of location, so it cannot tell which piece of storage is in which data centre. The only volume manager that I know that has this feature is in OpenVMS.  Perhaps the latest Veritas has it too.

One could use distributed storage arrays of the type of HP P4000 (acquired with LeftHand Networks). This shifts the problem from the OS to the storage vendor.

What distributed storage would you use in a hypothetical stretched cluster?

Regards,

Chris Jankowski




From volker at ixolution.de  Fri Jun 18 08:09:41 2010
From: volker at ixolution.de (Volker Dormeyer)
Date: Fri, 18 Jun 2010 10:09:41 +0200
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
Message-ID: <20100618080941.GA3674@dijkstra>

Hi,

On Thu, Jun 17, 2010 at 05:58:44PM +0200,
jimbob palmer <jimbobpalmer at gmail.com> wrote:
> I have two data centers linked by physical fibre. Everything goes over
> this physical route: everything.
> 
> I would like to setup a high availability nfs server with drbd:
> * drbd to replicate storage
> * nfsd running
> * floating ip
> 
> If the physical link between the two data centers is lost, I would
> like the primary data center to win.

This is a real problem, as already described by others in this thread. It
isn't easy to resolve reliably with the current architecture.

In my opinion, a third independent location (i.e. third datacenter) with a
third node/quorum server would be a solution. But the problem of fencing the
node persists if one datacenter fails. However, as there would still be
a majority, because the two other datacenters are alive, fencing could
be scripted to be not that strict... Of course, many scenarios are conceivable.

> I've setup a qdisk, and this works well: the node which can access the
> qdisk wins. i.e. the primary datacenter, which is the data center
> where the san holding the qdisk also lives, wins.

fenced creates a FIFO if it was not able to fence the failed node. In RHCS it
is created at /var/run/cluster/fenced_override. You can override fencing
by using this FIFO.

You can use fence_ack_manual to "ack" fencing through the FIFO in case fenced
is not able to fence successfully, e.g.:

    fence_ack_manual -eOn <name of failed node>

After this, the remaining node will continue its work. You might be able to
put this in some scripting logic.
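
A minimal sketch of what such scripting logic could look like, in Python. The command line is copied from the example above and its flags are not verified here; the function names and the confirmed_down argument are hypothetical, and how you confirm that the node is really down is left entirely to you.

```python
import subprocess


def ack_command(node):
    """Build the manual-acknowledge command shown in the message above."""
    return ["fence_ack_manual", "-eOn", node]


def acknowledge_fence(node, confirmed_down, run=subprocess.call):
    """Acknowledge the fence only after an external confirmation.

    confirmed_down must come from an operator or an out-of-band check
    (hypothetical); without it nothing is executed.
    Returns True when the command ran and exited with status 0.
    """
    if not confirmed_down:
        return False
    return run(ack_command(node)) == 0
```

Passing the runner as a parameter makes it possible to dry-run the logic without touching a live cluster.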

Of course, this will not solve the entire problem, and you still run the risk
of ending up with a split-brain.


Regards,
Volker



From gordan at bobich.net  Fri Jun 18 08:38:04 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Fri, 18 Jun 2010 09:38:04 +0100
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <036B68E61A28CA49AC2767596576CD596BA5FFDABB@GVW1113EXC.americas.hpqcorp.net>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>	<036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>	<1276841556.7576.8.camel@newgen.localdomain>
	<036B68E61A28CA49AC2767596576CD596BA5FFDABB@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <4C1B306C.1040300@bobich.net>

On 06/18/2010 07:57 AM, Jankowski, Chris wrote:

> Using the analogy you gave, the problem with a mafioso is that he cannot kill
> all other mafiosos in the gang when they are all sitting in solitary confinement
> cells (:-)).

Do you have a better idea? How do you propose to ensure that there is no 
resource clash when a node becomes intermittent or half-dead? How do you 
prevent its interference from bringing down the service? What do you 
propose? More importantly, how would you propose to handle this when 
ensuring consistency is of paramount importance, e.g. when using a 
cluster file system?

> I would like to remark that this STONITH business causes endless
> problems in clusters within a single data centre too. For example a
> temporary hiccup on the network that causes short heartbeat failure
> triggers all nodes of the cluster to kill the other nodes. And boy,
> do they succeed with a typical HP iLO fencing. You can see all your
> nodes going down. Then they come back and the shootout continues
> essentially indefinitely if fencing works. If not, then they all
> block.

If your network is that intermittent, you have bigger problems.
But you can adjust your cman timeout values (<totem token = "[timeout in 
milliseconds]"/>) to something more appropriate to the quality of your 
network.
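
For illustration, a small Python sketch that sets the token attribute on the totem element of a cluster.conf document. It assumes the element sits directly under the root; the function name and the 30000 ms used in the test are arbitrary, not recommendations.

```python
import xml.etree.ElementTree as ET


def set_totem_token(conf_xml, token_ms):
    """Return conf_xml with <totem token="..."/> set to token_ms.

    Assumes <totem> is a direct child of the root element; it is
    created if it does not exist yet.
    """
    root = ET.fromstring(conf_xml)
    totem = root.find("totem")
    if totem is None:
        totem = ET.SubElement(root, "totem")
    totem.set("token", str(token_ms))
    return ET.tostring(root, encoding="unicode")
```

On a real cluster the configuration version would also need to be incremented and the file propagated to all nodes; the sketch deliberately leaves that out.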

> And all of that is so unnecessary, as a combination of a properly
> implemented quorum  disk and SCSI reservations with local boot disks
> and data disks on shared storage  could provide quorum maintenance,
> split-brain avoidance and protection of the integrity  of the
> filesystem.

I disagree. If a node starts to go wrong, it cannot be trusted to not 
trash the file system, ignoring quorums and suchlike. Data integrity is 
too important to take that risk.

> DEC ASE cluster on Ultrix and MIPS hardware had that in 1991. You do
> not  even need GFS2, although it is very nice to have a real cluster
> filesystem.

If you want something that's looser than a proper cluster FS without the 
need for fencing (and are happy to live with the fact that when 
split-brain occurs, one of the files will win and the other copies _will_ 
get trashed), you may want to look into GlusterFS if you haven't already.

> By the way, I believe that commercial stretched cluster on Linux is
> not possible if you rely on LVM for distributed storage. Linux LVM
> is architecturally incapable of providing any resilience over
> distance, IMHO. It is missing the plex and subdisk layers as in
> Veritas LVM and has no notion of location, so it cannot tell
> which piece of storage is in which data centre. The only volume
> manager that I know that has this feature is in OpenVMS.  Perhaps
> the latest Veritas has it too.

I never actually found a purpose for LVM that cannot be done away with 
if you apply a modicum of forward planning (something that seems to be 
becoming quite rare in most industries these days). There are generally 
better ways than LVM to achieve the things that LVM is supposed to do.

> One could use distributed storage arrays of the type of HP P4000
> (bought with Left Hand Networks). This shifts the problem from the
> OS to the storage vendor.
>
> What distributed storage would you use in a hypothetical stretched
> cluster?

Depends on what exactly your use-case is. In most use-cases, properly 
distributed storage (a-la CleverSafe) comes with too much of a 
performance penalty to be useful when geographically dispersed. The 
single most defining measure of performance of a system is access time 
latencies. When caching gets difficult and your ping times move from LAN 
(slow) to WAN (ridiculous), performance generally becomes completely 
unworkable.

Gordan




From brem.belguebli at gmail.com  Fri Jun 18 09:27:10 2010
From: brem.belguebli at gmail.com (brem belguebli)
Date: Fri, 18 Jun 2010 11:27:10 +0200
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <036B68E61A28CA49AC2767596576CD596BA5FFDABB@GVW1113EXC.americas.hpqcorp.net>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
	<036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
	<1276841556.7576.8.camel@newgen.localdomain>
	<036B68E61A28CA49AC2767596576CD596BA5FFDABB@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <AANLkTil-kB88ikev8DoDFVAVj5ZbYwtOSzIIIzhPgksi@mail.gmail.com>

2010/6/18 Jankowski, Chris <Chris.Jankowski at hp.com>:
> Brem,
>
> I love this analogy.
>
I saw the light after a few beers, discussing RHCS with colleagues ;-)

> Using the analogy you gave, the problem with a mafioso is that he cannot kill all other mafiosos in the gang when they are all sitting in solitary confinement cells (:-)).

Indeed, this is why fencing cannot fit with stretched (Geo, metro)
clusters... without hacking the setup, with the risk of not being
supported anymore.
>
> I would like to remark that this STONITH business causes endless problems in clusters within a single data centre too. For example a temporary hiccup on the network that causes short heartbeat failure triggers all nodes of the cluster to kill the other nodes. And boy, do they succeed with a typical HP iLO fencing. You can see all your nodes going down. Then they come back and the shootout continues essentially indefinitely if fencing works. If not, then they all block.
>
Timers (TKO, qdisk, DM-MP...) are very important when setting up a
cluster, and so is network protection (bonding; multiring is not supported
yet). A temporary network hiccup shouldn't last more than a
few seconds (5 at most); anything longer has to be considered an outage.

> And all of that is so unnecessary, as a combination of a properly implemented quorum disk and SCSI reservations with local boot disks and data disks on shared storage could provide quorum maintenance, split-brain avoidance and protection of the integrity of the filesystem. DEC ASE cluster on Ultrix and MIPS hardware had that in 1991. You do not even need GFS2, although it is very nice to have a real cluster filesystem.
>
In my geo cluster setup (2 sites), I cannot rely on SCSI reservations:
if the interconnect (both SAN and MAN) goes down, the nodes from
one site won't be able to clear the other site's LUN reservations,
ending up in a split-brain situation.

Ideally, a tie-breaker should be located on a 3rd site: an iSCSI LUN
accessible from both production sites, acting as the quorum disk.
In case one of the 2 sites gets isolated, its nodes won't be able to
access this LUN and qdisk should instruct these nodes to commit
suicide (panic, or hard reset).
This should be combined with a watchdog mechanism that monitors whether
the cluster is quorate and, in case it's not anymore, hard resets the
faulty nodes.
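
The decision such a watchdog has to make can be sketched in a few lines of Python. This assumes a simple majority rule (more than half of the expected votes) and a hypothetical vote layout of one vote per node plus one for the quorum disk; how the votes are actually read from the cluster is not shown.

```python
def quorum_threshold(expected_votes):
    """Simple majority: more than half of the expected votes."""
    return expected_votes // 2 + 1


def watchdog_action(votes_seen, expected_votes):
    """Decide what the local watchdog should do.

    votes_seen is the number of votes this node can currently count,
    including the quorum disk vote if the LUN is reachable.
    """
    if votes_seen >= quorum_threshold(expected_votes):
        return "continue"
    return "hard-reset"
```

With two nodes and a one-vote quorum disk (3 expected votes), the isolated site counts only its own vote and resets itself, while the site that still reaches the third-site LUN counts 2 and carries on.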

> By the way, I believe that commercial stretched cluster on Linux is not possible if you rely on LVM for distributed storage. Linux LVM is architecturally incapable of providing any resilience over distance, IMHO.

You mean LVM mirroring? If so, as for all mirroring mechanisms
(and even synchronous replication ones), most vendors (Veritas,
storage vendors) tend to say max 100 km between 2 sites, i.e. less than
2 or 3 ms latency.
I'm seeing some new features coming with LVM mirroring, i.e. mirror log
redundancy, the already existing cluster-aware mirror log, partial
synchronization, device-mapper cluster awareness, etc.
Plus the most awaited feature, dm-replicator, which will bring a new
"era" in managing DR situations (but still incompatible with the
fencing constraint!!!)
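
To put the 100 km figure in perspective, here is a back-of-the-envelope calculation in Python. It assumes light travels through glass fibre at roughly 200,000 km/s (about two thirds of its speed in vacuum) and counts propagation delay only; the constant and function names are just for this sketch.

```python
# Approximate propagation speed in optical fibre, in km per millisecond.
FIBRE_KM_PER_MS = 200.0


def round_trip_ms(distance_km):
    """Propagation-only round-trip time over a fibre of the given length."""
    return 2 * distance_km / FIBRE_KM_PER_MS


print(round_trip_ms(100))  # 1.0
```

So the fibre itself accounts for about 1 ms per round trip at 100 km. The remainder of a 2 to 3 ms budget would presumably be taken up by switches, protocol round trips and the arrays themselves, and every synchronous write pays at least one such round trip.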

> It is missing the plex and subdisk layers as in Veritas LVM and has no notion of location, so it cannot tell which piece of storage is in which data centre. The only volume manager that I know that has this feature is in OpenVMS. Perhaps the latest Veritas has it too.
It's (the location thing)  a design choice from Symantec SF, but it is
not absolutely necessary to build stretched clusters. Look at HP-UX
Serviceguard based on HP-UX LVM for instance.
Concerning the plex and subdisk layer, I think it's just a matter of
terminology (PV,mirror leg,VG, and LV), not a difference IMHO.

>
> One could use distributed storage arrays of the type of HP P4000 (bought with Left Hand Networks). This shifts the problem from the OS to the storage vendor.
>
How do you address remote site replication ?
> What distributed storage would you use in a hypothetical stretched cluster?
>
In our environment we use HP high-end frames (XP24000), and some
pairs have CA enabled.
> Regards,
>
> Chris Jankowski
>
Brem



From Chris.Jankowski at hp.com  Fri Jun 18 10:28:04 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Fri, 18 Jun 2010 10:28:04 +0000
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <4C1B306C.1040300@bobich.net>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>
	<036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>
	<1276841556.7576.8.camel@newgen.localdomain>
	<036B68E61A28CA49AC2767596576CD596BA5FFDABB@GVW1113EXC.americas.hpqcorp.net>
	<4C1B306C.1040300@bobich.net>
Message-ID: <036B68E61A28CA49AC2767596576CD596BA5FFDC19@GVW1113EXC.americas.hpqcorp.net>

Gordan,

>>>Do you have a better idea? How do you propose to ensure that there is no resource clash when a node becomes intermittent or half-dead? How do you prevent its interference from bringing down the service? What do you propose? More importantly, how would you propose to handle this when ensuring consistency is of paramount importance, e.g. when using a cluster file system?

I believe that SCSI reservations are the key to protection.  One can form a group of hosts that are allowed to access storage and exclude those that had their membership revoked. Note that this is a protective mechanism - the stance here is: "This is ours and we protect it".  A node that has been ejected cannot do damage anymore.  This is the philosophically opposite approach to fencing, which is: "I'll go out and shoot everybody whom I consider suspect and I am not going to come back until I've successfully shot everybody whom I consider suspect."

A properly implemented quorum disk is the key for management of the cluster membership. Based on access to quorum disk one can then establish who is the member. The nodes ejected are configured to commit suicide, reboot and try to rejoin the cluster. Then, based on membership one can set up SCSI reservations on shared storage.  This will protect the integrity of the filesystems including shared cluster filesystem.

Note that there is natural affinity between the quorum disk on shared storage and shared cluster file system on the shared storage. Whoever has access to the quorum disk has access to shared storage and can stay as a member. Whoever does not should be ejected. Whether such node is dead, half-dead or actively looking for mischief is irrelevant, because it does not have access to storage once SCSI reservations have been set to exclude it. It won't get anywhere without access to storage.

The cluster will reform after failure and won't need fencing.
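
The membership-then-reservation sequence described above can be sketched with sg_persist from sg3_utils. This is only an illustration and not how TruCluster implements it: the device path /dev/sdX, the per-node keys and the helper function names are assumptions, and the script only prints the commands it would issue rather than touching a device.

```bash
#!/bin/bash
# Sketch: print the sg_persist commands a surviving member might issue.
# Assumptions (not from the thread): each node registers a unique key,
# the reservation type is 5 (Write Exclusive, Registrants Only), and
# /dev/sdX stands in for the shared LUN. Nothing is executed here.

# A node joining the cluster registers its own key with the device.
register_cmd() {
    local key=$1 dev=$2
    echo "sg_persist --out --register --param-sark=${key} ${dev}"
}

# One member takes the reservation; only registered nodes may then write.
reserve_cmd() {
    local key=$1 dev=$2
    echo "sg_persist --out --reserve --param-rk=${key} --prout-type=5 ${dev}"
}

# A survivor removes an ejected node's key, which cuts off its writes.
eject_cmd() {
    local my_key=$1 victim_key=$2 dev=$3
    echo "sg_persist --out --preempt-abort --param-rk=${my_key} --param-sark=${victim_key} --prout-type=5 ${dev}"
}

register_cmd 0x1 /dev/sdX
reserve_cmd 0x1 /dev/sdX
eject_cmd 0x1 0x2 /dev/sdX
```

Whether the ejected node then reboots and re-registers is a policy decision for the membership layer; the sketch covers only the storage side.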

This is how DEC/Compaq/HP TruCluster V5.x works. It does support a shared cluster filesystem.  In fact, this is the only filesystem that it supports except for UFS for CDROMs. And it supports shared root. There is only one password file, one group file, one set of binaries and libraries, all shared in CFS. And it has a rolling upgrade. It works reliably and there is not a trace of fencing in it.  So, it can be done.  This is living proof and it works. Those clusters used to run multiterabyte Oracle RAC databases when Alpha was still actively marketed.

Here is an excellent Technical Overview of it:

http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_ACRO_DUX/ARHGVETE.PDF

For a long time there was hesitance to rely on SCSI reservations, because shared parallel SCSI was rarefied and exotic. Then we had FC and it was expensive. Not for everybody, it was said. But today one can buy $500 iSCSI storage arrays. They support all required protocols. This is now a commodity. If one has a system that is important enough to be clustered then a block mode array should not be a cost problem (iSCSI, FC, switched SAS and shortly FCoE).

-------------

I would also like to remark that, from a practical operations point of view, the great amount of effort that is expended on trying to do something with the network in the cluster or checking server interfaces is at best useless, but mostly harmful.  The pragmatic stance today in a data centre environment is:

- we have bonded interfaces connected to different switches - this takes care of redundancy of the local link. If switches are properly configured it will even propagate upstream switch failure to the local link and force failover in the bond.
- cluster nodes cannot fix the network - no matter what they think about it.  Therefore services should ride through network failures.  Failing over a service because the network went down does not help on networks with redundancy correctly implemented.  Actually it hurts. If you fail over a database then users have to log in again; they lose their sessions, context and often data. You also lose the warm database cache (Oracle SGA) with all the right blocks in it.

All this business of trying to ping your default gateway is plain silly. As if we had different gateways for each member of the cluster. And trying to marry quorum disk with heuristics that ping gateways seems to be even sillier.

Regards,

Chris Jankowski



-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic
Sent: Friday, 18 June 2010 18:38
To: linux clustering
Subject: Re: [Linux-cluster] qdisk WITHOUT fencing

On 06/18/2010 07:57 AM, Jankowski, Chris wrote:

> Using the analogy you gave, the problem with a mafioso is that he 
> cannot kill all other mafiosos in the gang when they are all sitting 
> in solitary confinement cells (:-)).

Do you have a better idea? How do you propose to ensure that there is no resource clash when a node becomes intermittent or half-dead? How do you prevent its interference from bringing down the service? What do you propose? More importantly, how would you propose to handle this when ensuring consistency is of paramount importance, e.g. when using a cluster file system?

> I would like to remark that this STONITH business causes endless 
> problems in clusters within a single data centre too. For example a 
> temporary hiccup on the network that causes short heartbeat failure 
> triggers all nodes of the cluster to kill the other nodes. And boy, do 
> they succeed with a typical HP iLO fencing. You can see all your nodes 
> going down. Then they come back and the shootout continues essentially 
> indefinitely if fencing works. If not, then they all block.

If your network is that intermittent, you have bigger problems.
But you can adjust your cman timeout values (<totem token = "[timeout in
milliseconds]"/>) to something more appropriate to the quality of your network.

> And all of that is so unnecessary, as a combination of a properly 
> implemented quorum  disk and SCSI reservations with local boot disks 
> and data disks on shared storage  could provide quorum maintenance, 
> split-brain avoidance and protection of the integrity  of the 
> filesystem.

I disagree. If a node starts to go wrong, it cannot be trusted to not trash the file system, ignoring quorums and suchlike. Data integrity is too important to take that risk.

> DEC ASE cluster on Ultrix and MIPS hardware had that in 1991. You do 
> not  even need GFS2, although it is very nice to have a real cluster 
> filesystem.

If you want something that's looser than a proper cluster FS without the need for fencing (and are happy to live with the fact that when splitbrain occurs, one of the files will win and the other copies _will_ get trashed), you may want to look into GlusterFS if you haven't already.

> By the way, I believe that commercial stretched cluster on Linux is 
> not possible if you rely on LVM for distributed storage. Linux LVM is 
> architecturally incapable of providing any resilience over distance, 
> IMHO. It is missing the plex and subdisk layers as in Veritas LVM and 
> has no notion of location, so it cannot tell which piece of 
> storage is in which data centre. The only volume manager that I know 
> that has this feature is in OpenVMS.  Perhaps the latest Veritas has 
> it too.

I never actually found a purpose for LVM that cannot be done away with if you apply a modicum of forward planning (something that seems to be becoming quite rare in most industries these days). There are generally better ways than LVM to achieve the things that LVM is supposed to do.

> One could use distributed storage arrays of the type of HP P4000 
> (bought with Left Hand Networks). This shifts the problem from the OS 
> to the storage vendor.
>
> What distributed storage would you use in a hypothetical stretched 
> cluster?

Depends on what exactly your use-case is. In most use-cases, properly distributed storage (a-la CleverSafe) comes with too much of a performance penalty to be useful when geographically dispersed. The single most defining measure of performance of a system is access time latencies. When caching gets difficult and your ping times move from LAN (slow) to WAN (ridiculous), performance generally becomes completely unworkable.

Gordan





From gordan at bobich.net  Fri Jun 18 10:49:58 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Fri, 18 Jun 2010 11:49:58 +0100
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <036B68E61A28CA49AC2767596576CD596BA5FFDC19@GVW1113EXC.americas.hpqcorp.net>
References: <AANLkTil6aHw1K8eNIlgFyfw8uXQ8P9bmyRJ6s-eRAGcs@mail.gmail.com>	<036B68E61A28CA49AC2767596576CD596BA5E23C01@GVW1113EXC.americas.hpqcorp.net>	<1276841556.7576.8.camel@newgen.localdomain>	<036B68E61A28CA49AC2767596576CD596BA5FFDABB@GVW1113EXC.americas.hpqcorp.net>	<4C1B306C.1040300@bobich.net>
	<036B68E61A28CA49AC2767596576CD596BA5FFDC19@GVW1113EXC.americas.hpqcorp.net>
Message-ID: <4C1B4F56.8050309@bobich.net>

On 06/18/2010 11:28 AM, Jankowski, Chris wrote:

Can you please sort out the (lack of) word-wraps in your email client?

> Do you have a better idea? How do you propose to ensure that there
> is no resource clash when a node becomes intermittent or half-dead?
> How do you prevent its interference from bringing down the service?
> What do you propose? More importantly, how would you propose to handle
> this when ensuring consistency is of paramount importance, e.g. when
> using a cluster file system?
>
> I believe that SCSI reservations are the key for protection.  One can
> form a group of hosts that are allowed to access storage and exclude
> those that had their membership revoked. Note that this is a protective
> mechanism - the stance here is: "This is ours and we protect it".  A
> node that has been ejected cannot do damage anymore.  This is a
> philosophically opposite approach to fencing, which is: "I'll go out and
> shoot everybody whom I consider suspect and I am not going to come back
> until I've successfully shot everybody whom I consider suspect."

It isn't opposite philosophically at all. Instead of fencing by powering 
off the offending machine, you are fencing by cutting the machine off 
from the SAN. Logically, the two are identical, but you then also 
potentially need to apply other fencing for, say, network resources. 
I've written a fencing agent before for a managed switch to fence a 
machine by fencing its switch port. That works as well as power 
fencing, but it isn't at all fundamentally different.
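
For readers who have not written one, a fence agent of this kind is a small program that receives its options as name=value lines on standard input. The sketch below shows only that skeleton and is not the agent mentioned above. The option names (ipaddr, port, action) follow common agent conventions but are assumptions here, and switch_port_cmd is a hypothetical placeholder that prints the intended operation instead of contacting a switch.

```bash
#!/bin/bash
# Sketch of a switch-port fence agent skeleton.
# Assumption: options arrive on stdin as name=value lines.

# Hypothetical placeholder: a real agent would talk to the switch here
# (SNMP, SSH, telnet); this one only prints the intended operation.
switch_port_cmd() {
    local switch=$1 port=$2 state=$3
    echo "switch=${switch} port=${port} admin-state=${state}"
}

fence_switch_port() {
    local ipaddr="" port="" action="off" name value
    # Parse name=value pairs from stdin.
    while IFS='=' read -r name value; do
        case $name in
            ipaddr) ipaddr=$value ;;
            port)   port=$value ;;
            action) action=$value ;;
            *)      ;;   # ignore blank lines and unknown options
        esac
    done
    if [ -z "$ipaddr" ] || [ -z "$port" ]; then
        echo "missing ipaddr or port" >&2
        return 1
    fi
    case $action in
        off) switch_port_cmd "$ipaddr" "$port" down ;;
        on)  switch_port_cmd "$ipaddr" "$port" up ;;
        *)   echo "unsupported action: $action" >&2; return 1 ;;
    esac
}

printf 'ipaddr=10.0.0.2\nport=Gi0/7\naction=off\n' | fence_switch_port
```

The same skeleton would serve for a SAN-side agent: only the placeholder function changes, which is the sense in which the two kinds of fencing are logically the same.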

> A properly implemented quorum disk is the key for management of the
> cluster membership. Based on access to quorum disk one can then
> establish who is the member. The nodes ejected are configured to
> commit suicide, reboot and try to rejoin the cluster.

If a node crashes, it cannot be expected to remain functional enough to 
commit suicide.

> Then, based on membership one can set up SCSI reservations on shared
> storage.  This will protect the integrity of the filesystems including
> shared cluster filesystem.

See above - the distinction between powering a node off and cutting off 
all its network access is pretty immaterial. It doesn't get you away from 
the fundamental problem that you need a reliable way of preventing the 
failing node from rejoining the cluster.

> Note that there is natural affinity between the quorum disk on shared
> storage and shared cluster file system on the shared storage. Whoever
> has access to the quorum disk has access to shared storage and can
> stay as a member. Whoever does not should be ejected. Whether such
> node is dead, half-dead or actively looking for mischief is irrelevant,
> because it does not have access to storage once SCSI reservations have
> been set to exclude it. It won't get anywhere without access to storage.

Sure - but I don't think anyone ever argued that power based fencing is 
mandatory. Brocade switch based fencing from the SAN was supported last 
time I checked the list of supported fencing devices for RHCS.

> This is how DEC/Compaq/HP TruCluster V5.x works. It does support shared
> cluster filesystem.  In fact, this is the only filesystem that it
> supports except for UFS for CDROMS. And it supports shared root.

Shared Root is supported on Linux, in a lot of ways. Open Shared Root is 
one example, and I've even written a set of extensions to make that work 
on GlusterFS. I think it's in the OSR contrib repository.

> There is only one password file, one group file, one set of binaries
> and libraries all shared in CFS. And it has a rolling upgrade. It
> works reliably and there is not a trace of fencing in it.  So, it can
> be done.  This is a living proof and it works.

I think we are not agreeing entirely on what "fencing" actually is. And 
you are still talking about solving a problem that isn't hard to solve 
with RHCS - single SAN, at one location. If the machines are in one 
place, fencing isn't a problem. What's difficult is fencing in a 
geographically dispersed setup. I thought this was the main point of 
this thread.

Gordan



From kkovachev at varna.net  Fri Jun 18 10:50:38 2010
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Fri, 18 Jun 2010 13:50:38 +0300
Subject: [Linux-cluster] Higher Grained Definition ofIP
	AddressAssignments
In-Reply-To: <AANLkTinM89aHpfqW5GDXl3OAGgibiHchXZdIpgligFaB@mail.gmail.com>
References: <4C17748C.8010801@gmail.com>
	<4c19ca35.0d44d80a.414b.fffff48d@mx.google.com>
	<AANLkTimAAgZw0RBTPdHInu8KiHwurdGX5WXS3e31bgCn@mail.gmail.com>
	<AANLkTinM89aHpfqW5GDXl3OAGgibiHchXZdIpgligFaB@mail.gmail.com>
Message-ID: <79f8d6a200e1641f24db35271498aa99@mx.varna.net>

On Thu, 17 Jun 2010 14:59:59 -0500, Dustin Henry Offutt
<dhoffutt at gmail.com> wrote:
> Believe this issue has been resolved by altering
> /usr/share/cluster/ip.sh.
> 
> The resulting script has added new XML for a new "device" parameter.
> 
> New variable 'device' is passed to the ip_op function and then to
> functions ipv4 and ipv6. The ipv4 and ipv6 function iterate through all
> network devices and, upon finding a device with a configuration similar
> to the IP needing to be assigned, would assign the IP there, which caused
> all the IPs to bunch up on one device. The added logic here will go
> through the iteration, and if there is a "device" variable requested it
> is matched against the device name in the function.
> 
> Is there some way to get this put into the Cluster Suite officially so
> that it may be supported?
> 
> Thank you...
> 
> (diff -cB)

You should also modify cluster.rng and add 'device' as an optional
attribute to the 'ip' element (around line 1010), or else the config won't
validate.

Why not use OCF_RESKEY_device inside ipv4/ipv6 directly, instead of
passing it to ip_op? With
  ip_op <family> <operation> <address> <device> [quiet]
if device is empty but quiet is present, quiet will be taken as the
device.
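
The positional-argument problem can be reproduced in a few lines of bash. In this sketch ip_op_positional and ip_op_env are simplified stand-ins, not the real functions from ip.sh; they only report which device and quiet flag they ended up with. It assumes the variable can actually be empty; if the "auto" default from the metadata is always exported, the problem may not arise in practice.

```bash
#!/bin/bash
# Stand-in with the patched signature:
#   ip_op <family> <operation> <address> <device> [quiet]
ip_op_positional() {
    local device=$4 quiet=$5
    echo "device='${device}' quiet='${quiet}'"
}

# Alternative: take the device from the environment variable exported
# for the "device" attribute, so the positional arguments keep their
# original meaning.
ip_op_env() {
    local quiet=$4
    local device=${OCF_RESKEY_device:-auto}
    echo "device='${device}' quiet='${quiet}'"
}

OCF_RESKEY_device=""          # attribute not set in cluster.conf

# The unquoted empty expansion vanishes, so "quiet" slides into the
# device slot.
ip_op_positional inet add 10.0.0.5 ${OCF_RESKEY_device} quiet
# prints: device='quiet' quiet=''

ip_op_env inet add 10.0.0.5 quiet
# prints: device='auto' quiet='quiet'
```

Quoting "${OCF_RESKEY_device}" at the call site would also keep the slots aligned, but reading the variable inside the function avoids changing the signature at all.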

> 
> *** ip.sh.original    2010-06-17 10:43:00.000000000 -0500
> --- ip.sh    2010-06-17 14:42:26.000000000 -0500
> ***************
> *** 86,91 ****
> --- 86,104 ----
>               <content type="string"/>
>           </parameter>
> 
> +         <parameter name="device">
> +             <longdesc lang="en">
> +                 Specify network device to bring this
> +                 IP up on. Optional. Example: "eth0"
> +             </longdesc>
> +
> +             <shortdesc lang="en">
> +                 Network device
> +             </shortdesc>
> +
> +             <content type="string" default="auto"/>
> +         </parameter>
> +
>           <parameter name="monitor_link">
>               <longdesc lang="en">
>                   Enabling this causes the status check to fail if
> ***************
> *** 571,576 ****
> --- 583,589 ----
>       declare addr_exp=$(ipv6_expand $addr)
> 
>       while read dev ifaddr_exp maskbits; do
> +             if ([ -z $3 ] || [ "$3" = "auto" ]) || [ "$dev" = "$3" ]; then
>               if [ -z "$dev" ]; then
>                   continue
>           fi
> ***************
> *** 636,641 ****
> --- 649,655 ----
>           fi
> 
>           return 0
> +             fi
>       done < <(ipv6_list_interfaces)
> 
>       return 1
> ***************
> *** 651,656 ****
> --- 664,670 ----
>       declare addr=$2
> 
>       while read dev ifaddr maskbits; do
> +             if ([ -z $3 ] || [ "$3" = "auto" ]) || [ "$dev" = "$3" ]; then
>               if [ -z "$dev" ]; then
>                   continue
>           fi
> ***************
> *** 715,720 ****
> --- 729,735 ----
>           fi
> 
>           return 0
> +             fi
>       done  < <(ipv4_list_interfaces)
> 
>       return 1
> ***************
> *** 813,819 ****
> 
>   #
>   # Usage:
> ! # ip_op <family> <operation> <address> [quiet]
>   #
>   ip_op()
>   {
> --- 827,833 ----
> 
>   #
>   # Usage:
> ! # ip_op <family> <operation> <address> <device> [quiet]
>   #
>   ip_op()
>   {
> ***************
> *** 866,872 ****
> 
>       case $1 in
>       inet)
> !         ipv4 $2 $3
>           return $?
>           ;;
>       inet6)
> --- 880,886 ----
> 
>       case $1 in
>       inet)
> !         ipv4 $2 $3 $4
>           return $?
>           ;;
>       inet6)
> ***************
> *** 923,929 ****
>           ocf_log debug "${OCF_RESKEY_address} already configured"
>           exit 0
>       fi
> !     ip_op ${OCF_RESKEY_family} add ${OCF_RESKEY_address}
>       if [ $? -ne 0 ]; then
>           exit $OCF_ERR_GENERIC
>       fi
> --- 937,943 ----
>           ocf_log debug "${OCF_RESKEY_address} already configured"
>           exit 0
>       fi
> !     ip_op ${OCF_RESKEY_family} add ${OCF_RESKEY_address} ${OCF_RESKEY_device}
>       if [ $? -ne 0 ]; then
>           exit $OCF_ERR_GENERIC
>       fi
> 
> 
> On Thu, Jun 17, 2010 at 9:00 AM, Dustin Henry Offutt
> <dhoffutt at gmail.com>wrote:
> 
>> Using the node's IPs would not work. The software being made HA must keep
>> its IPs the same no matter what node it's running on. Could script an IP
>> change, but then we're putting IP logic and monitoring in two places: the
>> cluster software and our custom scripting. That's not a clean solution
>> and is rather going backwards.
>>
>> We may as well just do our own HA if we were starting down that road.
>> When we sell our product the customer must also purchase Redhat Support
>> for their OS and cluster software. I would think Redhat should pony up to
>> get this done as the product we are selling is selling well and inducing
>> Redhat Support sales.
>>
>> An official feature request has been submitted to Redhat.
>>
>> Also, I'm working on the /usr/share/cluster/ip.sh script myself to add
>> the feature. Hopefully it works out.
>>
>>
>>
>> On Thu, Jun 17, 2010 at 2:09 AM, Kit Gerrits <kitgerrits at gmail.com>
>> wrote:
>>
>>>  In that case, might it be easier to simply use the host IP addresses
>>> and not the cluster IPs?
>>> (the application will need to handle up/down events itself)
>>>
>>>
>>> Regards,
>>>
>>> Kit
>>>
>>>  ------------------------------
>>> *From:* linux-cluster-bounces at redhat.com [mailto:
>>> linux-cluster-bounces at redhat.com] *On Behalf Of *Dustin Henry Offutt
>>> *Sent:* dinsdag 15 juni 2010 14:40
>>> *To:* linux clustering
>>> *Subject:* Re: [Linux-cluster] Higher Grained Definition ofIP
>>> AddressAssignments
>>>
>>> I've spent the past year architecting an HA cluster with RHCS and it's
>>> working wonderfully. I have not seen anything superior.
>>>
>>> Due to a new customer-driven feature of our software, we need to add
>>> the ability for a cluster service/resource group to have up to eight
>>> distinct IPs on one particular network due to the software being made
>>> highly available via RHCS performing its own load balancing. Placing the
>>> load balancing elsewhere is not an option due to the nature of the
>>> product.
>>>
>>> Regarding "OCF_RESKEY_," will google more on this and appreciate the
>>> tip. Must work this out some way.
>>>
>>> ~ Dusty
>>>
>>> C. Handel wrote:
>>>
>>> [define interface of cluster controlled ip resource]
>>>
>>>
>>>
>>> /usr/share/cluster/ip.sh appears to perform the link-monitoring in the
>>>
>>>
>>> This is a resource agent script. What attributes a resource agent
>>> accepts can be found by calling it with the option meta-data
>>>
>>> /usr/share/cluster/ip.sh meta-data
>>>
>>> There is no attribute interface. The agent will add the additional
>>> address to the first interface that is in the same subnet.
>>>
>>> You could edit the script and add a parameter interface yourself. Add
>>> a new parameter into the XML at the beginning and access it in the
>>> script with OCF_RESKEY_...
>>>
>>> I don't understand what you are trying to do. If you are only handling
>>> network interfaces as services, then rhcs is most likely the wrong
>>> tool. If you would explain your goal we could probably suggest other
>>> solutions.
>>>
>>> Greetings
>>>    Christoph
>>>
>>>
>>
>>



From dhoffutt at gmail.com  Fri Jun 18 12:42:51 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Fri, 18 Jun 2010 07:42:51 -0500
Subject: [Linux-cluster] Higher Grained Definition
	ofIP	AddressAssignments
In-Reply-To: <79f8d6a200e1641f24db35271498aa99@mx.varna.net>
References: <4C17748C.8010801@gmail.com>	<4c19ca35.0d44d80a.414b.fffff48d@mx.google.com>	<AANLkTimAAgZw0RBTPdHInu8KiHwurdGX5WXS3e31bgCn@mail.gmail.com>	<AANLkTinM89aHpfqW5GDXl3OAGgibiHchXZdIpgligFaB@mail.gmail.com>
	<79f8d6a200e1641f24db35271498aa99@mx.varna.net>
Message-ID: <4C1B69CB.8060302@gmail.com>

Aye-aye. Will do.

Kaloyan Kovachev wrote:
> On Thu, 17 Jun 2010 14:59:59 -0500, Dustin Henry Offutt
> <dhoffutt at gmail.com> wrote:
>> Believe this issue has been resolved by altering
>> /usr/share/cluster/ip.sh.
>>
>> The resulting script has added new XML for a new "device" parameter.
>>
>> New variable 'device' is passed to the ip_op function and then to
>> functions ipv4 and ipv6. The ipv4 and ipv6 function iterate through all
>> network devices and, upon finding a device with a configuration similar
>> to the IP needing to be assigned, would assign the IP there, which caused
>> all the IPs to bunch up on one device. The added logic here will go
>> through the iteration, and if there is a "device" variable requested it
>> is matched against the device name in the function.
>>
>> Is there some way to get this put into the Cluster Suite officially so
>> that it may be supported?
>>
>> Thank you...
>>
>> (diff -cB)
>
> You should also modify cluster.rng and add 'device' as an optional
> attribute to the 'ip' element (around line 1010) or else the config won't
> validate
>
> why not use OCF_RESKEY_device inside ipv4/6 directly, instead of passing
> it to ip_op?
>   ip_op <family> <operation> <address> <device> [quiet]
> ... if device is empty, but there is quiet present it will be accepted as
> device
>


From dxh at yahoo.com  Fri Jun 18 15:33:53 2010
From: dxh at yahoo.com (Don Hoover)
Date: Fri, 18 Jun 2010 08:33:53 -0700 (PDT)
Subject: [Linux-cluster] qdisk WITHOUT fencing
Message-ID: <117863.60382.qm@web65513.mail.ac4.yahoo.com>

Couldn't the geo cluster be most reliably solved by writing or using a fence based on a script to make SAN changes or based on controlling the storage replication?

Maybe it's just a matter of the fact that you need different kinds of fencing than are currently available. 



From brem.belguebli at gmail.com  Fri Jun 18 16:15:09 2010
From: brem.belguebli at gmail.com (brem belguebli)
Date: Fri, 18 Jun 2010 18:15:09 +0200
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <117863.60382.qm@web65513.mail.ac4.yahoo.com>
References: <117863.60382.qm@web65513.mail.ac4.yahoo.com>
Message-ID: <AANLkTilBCV1kDWqSpc6qxURqBCqcpum6yXKJHK9wnSvI@mail.gmail.com>

How do you deal with fencing when the intersite interconnects (SAN and
LAN) are themselves the cause of the failure?


2010/6/18 Don Hoover <dxh at yahoo.com>:
> Couldn't the geo cluster be most reliably solved by writing or using a fence based on a script to make SAN changes or based on controlling the storage replication?
>
> Maybe it's just a matter of the fact that you need different kinds of fencing than are currently available.
>



From kieran at digital-crocus.com  Mon Jun 21 01:24:15 2010
From: kieran at digital-crocus.com (Kieran Simkin)
Date: Mon, 21 Jun 2010 02:24:15 +0100
Subject: [Linux-cluster] Newbie question - why the 16 node limit on Cluster
 Suite and does this also apply to GFS?
Message-ID: <4C1EBF3F.4060407@digital-crocus.com>

Hi there,
I'm new to Cluster Suite and just reading the documentation from Redhat 
it mentions a 16 node limit - I just wondered why this limit was so low, 
if it's likely to change anytime soon, and whether this limit also 
applies to GFS alone?

Thanks,

-- 
~Kieran Simkin
http://slinq.com/
http://www.hybrid-cluster.com/
+44 (0) 1273 929209



From volker at ixolution.de  Mon Jun 21 06:39:51 2010
From: volker at ixolution.de (Volker Dormeyer)
Date: Mon, 21 Jun 2010 08:39:51 +0200
Subject: [Linux-cluster] Newbie question - why the 16 node limit on
 Cluster Suite and does this also apply to GFS?
In-Reply-To: <4C1EBF3F.4060407@digital-crocus.com>
References: <4C1EBF3F.4060407@digital-crocus.com>
Message-ID: <20100621063951.GA3542@dijkstra>

Hi,

On Mon, Jun 21, 2010 at 02:24:15AM +0100,
Kieran Simkin <kieran at digital-crocus.com> wrote:
> I'm new to Cluster Suite and just reading the documentation from
> Redhat it mentions a 16 node limit - I just wondered why this limit
> was so low, if it's likely to change anytime soon, and whether this
> limit also applies to GFS alone?

Currently, the number is limited by qdiskd, which supports only 16 slots.

Even though GFS is not limited to 16 nodes, this is what Red Hat currently
supports.

Regards,
Volker



From kkovachev at varna.net  Mon Jun 21 07:52:39 2010
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Mon, 21 Jun 2010 10:52:39 +0300
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <AANLkTilBCV1kDWqSpc6qxURqBCqcpum6yXKJHK9wnSvI@mail.gmail.com>
References: <117863.60382.qm@web65513.mail.ac4.yahoo.com>
	<AANLkTilBCV1kDWqSpc6qxURqBCqcpum6yXKJHK9wnSvI@mail.gmail.com>
Message-ID: <4ae3464f967fb89bc0f525efaffe35b5@mx.varna.net>

On Fri, 18 Jun 2010 18:15:09 +0200, brem belguebli
<brem.belguebli at gmail.com> wrote:
> How do you deal with fencing when the intersite interconnects (SAN and
> LAN) are the cause of the failure ?
> 

GPRS or the good old modem over a phone line?

> 
> 2010/6/18 Don Hoover <dxh at yahoo.com>:
>> Couldn't the geo cluster be most reliably solved by writing or using a
>> fence based on a script to make SAN changes or based on controlling the
>> storage replication?
>>
>> Maybe it's just a matter of the fact that you need different kinds of
>> fencing than are currently available.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From michael.lackner at mu-leoben.at  Mon Jun 21 08:56:42 2010
From: michael.lackner at mu-leoben.at (Michael Lackner)
Date: Mon, 21 Jun 2010 10:56:42 +0200
Subject: [Linux-cluster] Newbie question - why the 16 node limit on
 Cluster Suite and does this also apply to GFS?
In-Reply-To: <20100621063951.GA3542@dijkstra>
References: <4C1EBF3F.4060407@digital-crocus.com>
	<20100621063951.GA3542@dijkstra>
Message-ID: <4C1F294A.7090505@mu-leoben.at>

Hello!

I'm also interested in this (just curious). When not using a quorum disk,
but when using GFS/GFS2, would it be possible to create a cluster with more
than 16 nodes with current releases of the cluster suite, or would the
software not allow it?

Maybe GFS would also refuse to create >16 journals or something?

Thanks.

Volker Dormeyer wrote:
> Hi,
>
> On Mon, Jun 21, 2010 at 02:24:15AM +0100,
> Kieran Simkin <kieran at digital-crocus.com> wrote:
>   
>> I'm new to Cluster Suite and just reading the documentation from
>> Redhat it mentions a 16 node limit - I just wondered why this limit
>> was so low, if it's likely to change anytime soon, and whether this
>> limit also applies to GFS alone?
>>     
>
> Currently, the number is limited by qdiskd, which supports 16 slots, only.
>
> Even though GFS is not limited to 16 nodes, this is what Red Hat currently
> supports.
>
> Regards,
> Volker
-- 
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505



From Martin.Waite at datacash.com  Mon Jun 21 09:14:17 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Mon, 21 Jun 2010 10:14:17 +0100
Subject: [Linux-cluster] permanently removing node from running cluster
Message-ID: <A78DB34D00374344A0AB65B6523C05DC05A079FD@marsden.win.datacash.com>

Hi,

Is it possible to permanently remove a node from a running cluster?

All my attempts result in the node being in the state "offline,
estranged", and the node still counting as a member in the "Nodes: "
count from cman_tool status ( but not in the "Expected votes:" count -
so I think the quorum size is correct).

It appears that the only way to permanently remove references to a node
is to restart cman on the surviving nodes.

My procedure for removing the node is:

1.    relocate any services running on the node
2.    edit cluster.conf to remove the node from clusternodes
3.    push the config to the cluster with ccs_tool
4.    stop rgmanager on the node to be removed
5.    stop cman on the node to be removed.
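
The five steps above can be sketched as an ordered command plan. This is
only an illustration: the function name and the node and service names
passed to it are hypothetical, and the invocations (clusvcadm -r ... -m ...,
ccs_tool update, the rgmanager and cman init scripts) are assumed to match a
RHEL 5 style cluster, so check them against the man pages on your release
before running anything.

```python
def removal_plan(node, services, target, conf="/etc/cluster/cluster.conf"):
    """Return (host, argv) pairs for removing `node` from the cluster.

    Editing cluster.conf (step 2) is a manual step and is therefore not
    part of the plan. `target` is a surviving node (hypothetical name).
    """
    plan = []
    # 1. relocate any services running on the node to be removed
    for svc in services:
        plan.append((target, ["clusvcadm", "-r", svc, "-m", target]))
    # 3. push the edited config (with a higher config_version) to the cluster
    plan.append((target, ["ccs_tool", "update", conf]))
    # 4. and 5. stop rgmanager, then cman, on the node being removed
    plan.append((node, ["service", "rgmanager", "stop"]))
    plan.append((node, ["service", "cman", "stop"]))
    return plan


if __name__ == "__main__":
    for host, argv in removal_plan("svXprdclu001", ["ACTIVESITE"], "svXprdclu002"):
        print(host + ": " + " ".join(argv))
```

The plan is returned rather than executed so that it can be reviewed first;
the order of the entries is the only thing the sketch tries to capture.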

 

At this point, clustat on a surviving node shows:

Cluster Status for EDISV1DBM @ Mon Jun 21 09:46:45 2010
Member Status: Quorate

 Member Name                          ID   Status
 ------ ----                          ---- ------
 svXprdclu002                            2 Online, Local, rgmanager
 svXprdclu003                            3 Online, rgmanager
 svXprdclu004                            4 Online, rgmanager
 svXprdclu005                            5 Online, rgmanager
 svXprdclu001                            1 Offline, Estranged

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:ACTIVESITE             svXprdclu002                   started
 service:MASTERVIP              svXprdclu002                   started

The removed node (svXprdclu001) is still known to the cluster, but is
now "estranged".

The node has been removed from the "Expected votes" count, but not the
"Nodes" count:

sudo /usr/sbin/cman_tool status
Version: 6.2.0
Config Version: 19
Cluster Name: EDISV1DBM
Cluster Id: 35945
Cluster Member: Yes
Cluster Generation: 1008
Membership state: Cluster-Member
Nodes: 5
Expected votes: 4
Total votes: 4
Quorum: 3
Active subsystems: 8
Flags: Dirty
Ports Bound: 0 177
Node name: svXprdclu004
Node ID: 4
Multicast addresses: 239.192.0.1
Node addresses: 10.3.18.24

If I then choose a node (not running the services) and restart cman,
this node no longer _sees_ the removed node:

[martin at cp1edidbm003 ~]$ sudo /usr/sbin/clustat
Cluster Status for EDISV1DBM @ Mon Jun 21 09:53:34 2010
Member Status: Quorate

 Member Name                          ID   Status
 ------ ----                          ---- ------
 svXprdclu002                            2 Online
 svXprdclu003                            3 Online, Local
 svXprdclu004                            4 Online
 svXprdclu005                            5 Online

[martin at cp1edidbm003 ~]$ sudo /usr/sbin/cman_tool status
Version: 6.2.0
Config Version: 19
Cluster Name: EDISV1DBM
Cluster Id: 35945
Cluster Member: Yes
Cluster Generation: 1008
Membership state: Cluster-Member
Nodes: 4
Expected votes: 4
Total votes: 4
Quorum: 3
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: svXprdclu003
Node ID: 3
Multicast addresses: 239.192.0.1
Node addresses: 10.3.18.23

However, I would prefer not to relocate my services in order to restart
cman on every node.
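
The mismatch described in this message (a "Nodes" count that is higher than
the "Expected votes" count) can be spotted mechanically. The following is a
minimal sketch that parses "Key: value" lines in the style of the
cman_tool status output quoted above; the function names are made up for
illustration, and it assumes one vote per node, as in that output.

```python
def parse_status(text):
    """Turn 'Key: value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def stale_entries(text):
    """Listed nodes that no longer contribute an expected vote.

    Assumes every node carries exactly one vote.
    """
    fields = parse_status(text)
    return int(fields["Nodes"]) - int(fields["Expected votes"])


# Excerpt of the status output from a node that has not restarted cman.
SAMPLE = """\
Nodes: 5
Expected votes: 4
Total votes: 4
Quorum: 3
"""

if __name__ == "__main__":
    print(stale_entries(SAMPLE))
```

A non-zero result only says that a removed node is still listed; it does not
say anything about quorum, which is derived from the vote counts.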

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100621/2b9fdd88/attachment.htm>

From gordan at bobich.net  Mon Jun 21 09:20:34 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Mon, 21 Jun 2010 10:20:34 +0100
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <4ae3464f967fb89bc0f525efaffe35b5@mx.varna.net>
References: <117863.60382.qm@web65513.mail.ac4.yahoo.com>	<AANLkTilBCV1kDWqSpc6qxURqBCqcpum6yXKJHK9wnSvI@mail.gmail.com>
	<4ae3464f967fb89bc0f525efaffe35b5@mx.varna.net>
Message-ID: <4C1F2EE2.60405@bobich.net>

On 06/21/2010 08:52 AM, Kaloyan Kovachev wrote:
> On Fri, 18 Jun 2010 18:15:09 +0200, brem belguebli
> <brem.belguebli at gmail.com>  wrote:
>> How do you deal with fencing when the intersite interconnects (SAN and
>> LAN) are the cause of the failure ?
>>
>
> GPRS or the good old modem over a phone line?

That isn't going to work if the whole site is down for whatever reason 
(unlikely as it may be).

To protect yourself from the 100% outage of a remote site, the only sane 
way of approaching it that I can think of is to do something like the 
following:

1) Make each node fence itself off from the failed node using iptables 
or some other firewalling method. The SAN should also be prevented from 
allowing the booted out node back onto it.

2) Fail over the IP address or DNS name of the service. Since it's 
across different sites, you are likely to have to use something like RIP 
to re-route the IPs, so DNS on short refresh may well be an easier and 
possibly safer option. It'll mean some downtime, but probably less than 
any manual intervention in an unplanned case.

It's not entirely ideal, but it's about as good as it is likely to get. 
And you can write a fencing agent to do something like this easily enough.

Gordan
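
A fencing agent along the lines described in the message above might start
from the following sketch. It assumes the agent receives its options as
name=value lines on standard input and that dropping all traffic to and from
the failed node's address with iptables is the desired effect. The option
name "ipaddr", the function names and the address used in the example are
hypothetical, and nothing here deals with the SAN side.

```python
def parse_options(lines):
    """Read name=value lines, ignoring blanks and '#' comments."""
    opts = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        opts[name.strip()] = value.strip()
    return opts


def isolation_rules(failed_ip):
    """iptables commands that cut the local host off from the failed node."""
    return [
        ["iptables", "-I", "INPUT", "-s", failed_ip, "-j", "DROP"],
        ["iptables", "-I", "OUTPUT", "-d", failed_ip, "-j", "DROP"],
    ]


if __name__ == "__main__":
    # A real agent would pass sys.stdin here and execute each rule;
    # this demo only prints the commands for a made-up address.
    options = parse_options(["# example input\n", "ipaddr=192.0.2.10\n"])
    for rule in isolation_rules(options["ipaddr"]):
        print(" ".join(rule))
```

Printing instead of executing keeps the sketch harmless; an actual agent
would also have to report success or failure through its exit status.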



From ccaulfie at redhat.com  Mon Jun 21 09:31:46 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Mon, 21 Jun 2010 10:31:46 +0100
Subject: [Linux-cluster] permanently removing node from running cluster
In-Reply-To: <A78DB34D00374344A0AB65B6523C05DC05A079FD@marsden.win.datacash.com>
References: <A78DB34D00374344A0AB65B6523C05DC05A079FD@marsden.win.datacash.com>
Message-ID: <4C1F3182.3050208@redhat.com>

On 21/06/10 10:14, Martin Waite wrote:
> Hi,
>
> Is it possible to permanently remove a node from a running cluster ?
>
> All my attempts result in the node being in the state "offline,
> estranged", and the node still counting as a member in the "Nodes: "
> count from cman_tool status ( but not in the "Expected votes:" count -
> so I think the quorum size is correct).
>
> It appears that the only way to permanently remove references to a node
> is to restart cman on the surviving nodes.
>
> My procedure for removing the node is:
>
> 1. relocate any services running on the node
>
> 2. edit cluster.conf to remove the node from clusternodes
>
> 3. push the config to the cluster with ccs_tool
>
> 4. stop rgmanager on the node to be removed
>
> 5. stop cman on the node to be removed.
>
> At this point, clustat on a surviving node shows:
>
> Cluster Status for EDISV1DBM @ Mon Jun 21 09:46:45 2010
>
> Member Status: Quorate
>
> Member Name ID Status
>
> ------ ---- ---- ------
>
> svXprdclu002 2 Online, Local, rgmanager
>
> svXprdclu003 3 Online, rgmanager
>
> svXprdclu004 4 Online, rgmanager
>
> svXprdclu005 5 Online, rgmanager
>
> svXprdclu001 1 Offline, Estranged
>
> Service Name Owner (Last) State
>
> ------- ---- ----- ------ -----
>
> service:ACTIVESITE svXprdclu002 started
>
> service:MASTERVIP svXprdclu002 started
>
> The removed node (svXprdclu001) is still known to the cluster, but is
> now "estranged".
>
> The node has been removed from the "Expected votes" count, but not the
> "Nodes" count:
>
> sudo /usr/sbin/cman_tool status
>
> Version: 6.2.0
>
> Config Version: 19
>
> Cluster Name: EDISV1DBM
>
> Cluster Id: 35945
>
> Cluster Member: Yes
>
> Cluster Generation: 1008
>
> Membership state: Cluster-Member
>
> Nodes: 5
>
> Expected votes: 4
>
> Total votes: 4
>
> Quorum: 3
>
> Active subsystems: 8
>
> Flags: Dirty
>
> Ports Bound: 0 177
>
> Node name: svXprdclu004
>
> Node ID: 4
>
> Multicast addresses: 239.192.0.1
>
> Node addresses: 10.3.18.24
>
> If I then choose a node (not running the services) and restart cman,
> this node no longer _/sees/_ the removed node:
>
> [martin at cp1edidbm003 ~]$ sudo /usr/sbin/clustat
>
> Cluster Status for EDISV1DBM @ Mon Jun 21 09:53:34 2010
>
> Member Status: Quorate
>
> Member Name ID Status
>
> ------ ---- ---- ------
>
> svXprdclu002 2 Online
>
> svXprdclu003 3 Online, Local
>
> svXprdclu004 4 Online
>
> svXprdclu005 5 Online
>
> [martin at cp1edidbm003 ~]$ sudo /usr/sbin/cman_tool status
>
> Version: 6.2.0
>
> Config Version: 19
>
> Cluster Name: EDISV1DBM
>
> Cluster Id: 35945
>
> Cluster Member: Yes
>
> Cluster Generation: 1008
>
> Membership state: Cluster-Member
>
> Nodes: 4
>
> Expected votes: 4
>
> Total votes: 4
>
> Quorum: 3
>
> Active subsystems: 7
>
> Flags: Dirty
>
> Ports Bound: 0
>
> Node name: svXprdclu003
>
> Node ID: 3
>
> Multicast addresses: 239.192.0.1
>
> Node addresses: 10.3.18.23
>
> However, I would prefer not to relocate my services in order to restart
> cman on every node.
>


You don't say what version of clustering you are using. In cluster3, 
nodes can be removed permanently from the internal cluster lists by 
removing them from cluster.conf and reloading it. In versions before that, 
they hang around until the whole cluster is rebooted.

It's just a name in a list and the inconvenience should be purely 
cosmetic. A node that is not in the cluster has no effect on any cluster 
operations.

Chrissie
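
The figures in the quoted output illustrate why the leftover entry is
harmless: quorum follows from the expected votes, not from the length of the
node list. A minimal sketch, assuming a simple-majority rule (this matches
"Expected votes: 4" and "Quorum: 3" above, but it is not taken from the cman
source):

```python
def quorum(expected_votes):
    """Smallest number of votes that forms a strict majority."""
    return expected_votes // 2 + 1


def is_quorate(total_votes, expected_votes):
    """True when the votes present reach the majority threshold."""
    return total_votes >= quorum(expected_votes)


if __name__ == "__main__":
    # Four expected votes, as in the output quoted above.
    print(quorum(4), is_quorate(4, 4))
```

With four expected votes the threshold stays at three whether the node list
shows four or five entries.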



From Martin.Waite at datacash.com  Mon Jun 21 09:47:43 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Mon, 21 Jun 2010 10:47:43 +0100
Subject: [Linux-cluster] permanently removing node from running cluster
In-Reply-To: <4C1F3182.3050208@redhat.com>
References: <A78DB34D00374344A0AB65B6523C05DC05A079FD@marsden.win.datacash.com>
	<4C1F3182.3050208@redhat.com>
Message-ID: <A78DB34D00374344A0AB65B6523C05DC05A07A23@marsden.win.datacash.com>



> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Christine Caulfield
> Sent: 21 June 2010 10:32
> To: linux clustering
> Subject: Re: [Linux-cluster] permanently removing node from running
cluster
> 
> On 21/06/10 10:14, Martin Waite wrote:
> > Hi,
> >
> > Is it possible to permanently remove a node from a running cluster ?
> >
> > All my attempts result in the node being in the state "offline,
> > estranged", and the node still counting as a member in the "Nodes: "
> > count from cman_tool status ( but not in the "Expected votes:" count
-
> > so I think the quorum size is correct).
> >
> > It appears that the only way to permanently remove references to a
node
> > is to restart cman on the surviving nodes.
> >
> > My procedure for removing the node is:
> >
> > 1. relocate any services running on the node
> >
> > 2. edit cluster.conf to remove the node from clusternodes
> >
> > 3. push the config to the cluster with ccs_tool
> >
> > 4. stop rgmanager on the node to be removed
> >
> > 5. stop cman on the node to be removed.
> >
> > At this point, clustat on a surviving node shows:
> >
> > Cluster Status for EDISV1DBM @ Mon Jun 21 09:46:45 2010
> >
> > Member Status: Quorate
> >
> > Member Name ID Status
> >
> > ------ ---- ---- ------
> >
> > svXprdclu002 2 Online, Local, rgmanager
> >
> > svXprdclu003 3 Online, rgmanager
> >
> > svXprdclu004 4 Online, rgmanager
> >
> > svXprdclu005 5 Online, rgmanager
> >
> > svXprdclu001 1 Offline, Estranged
> >
> > Service Name Owner (Last) State
> >
> > ------- ---- ----- ------ -----
> >
> > service:ACTIVESITE svXprdclu002 started
> >
> > service:MASTERVIP svXprdclu002 started
> >
> > The removed node (svXprdclu001) is still known to the cluster, but
is
> > now "estranged".
> >
> > The node has been removed from the "Expected votes" count, but not
the
> > "Nodes" count:
> >
> > sudo /usr/sbin/cman_tool status
> >
> > Version: 6.2.0
> >
> > Config Version: 19
> >
> > Cluster Name: EDISV1DBM
> >
> > Cluster Id: 35945
> >
> > Cluster Member: Yes
> >
> > Cluster Generation: 1008
> >
> > Membership state: Cluster-Member
> >
> > Nodes: 5
> >
> > Expected votes: 4
> >
> > Total votes: 4
> >
> > Quorum: 3
> >
> > Active subsystems: 8
> >
> > Flags: Dirty
> >
> > Ports Bound: 0 177
> >
> > Node name: svXprdclu004
> >
> > Node ID: 4
> >
> > Multicast addresses: 239.192.0.1
> >
> > Node addresses: 10.3.18.24
> >
> > If I then choose a node (not running the services) and restart cman,
> > this node no longer _/sees/_ the removed node:
> >
> > [martin at cp1edidbm003 ~]$ sudo /usr/sbin/clustat
> >
> > Cluster Status for EDISV1DBM @ Mon Jun 21 09:53:34 2010
> >
> > Member Status: Quorate
> >
> > Member Name ID Status
> >
> > ------ ---- ---- ------
> >
> > svXprdclu002 2 Online
> >
> > svXprdclu003 3 Online, Local
> >
> > svXprdclu004 4 Online
> >
> > svXprdclu005 5 Online
> >
> > [martin at cp1edidbm003 ~]$ sudo /usr/sbin/cman_tool status
> >
> > Version: 6.2.0
> >
> > Config Version: 19
> >
> > Cluster Name: EDISV1DBM
> >
> > Cluster Id: 35945
> >
> > Cluster Member: Yes
> >
> > Cluster Generation: 1008
> >
> > Membership state: Cluster-Member
> >
> > Nodes: 4
> >
> > Expected votes: 4
> >
> > Total votes: 4
> >
> > Quorum: 3
> >
> > Active subsystems: 7
> >
> > Flags: Dirty
> >
> > Ports Bound: 0
> >
> > Node name: svXprdclu003
> >
> > Node ID: 3
> >
> > Multicast addresses: 239.192.0.1
> >
> > Node addresses: 10.3.18.23
> >
> > However, I would prefer not to relocate my services in order to
restart
> > cman on every node.
> >
> 
> 
> You don't say what version of clustering you are using. In cluster3
> nodes can be removed permanently from the internal cluster lists by
> removing it from cluster.conf and reloading it. In versions Before
that
> they hang around until the whole cluster is rebooted.
> 
> It's just a name in a list and the inconvenience should be purely
> cosmetic. A node that is not in the cluster has no effect on any
cluster
> operations.
> 
> Chrissie
> 

Hi Chrissie, 

I am using whatever version of cluster comes with RHEL 5.4 - this is
likely to be cluster2 given its behaviour.

I was hoping that the estranged node was just a cosmetic nuisance - so
thanks for the confirmation.

regards,
Martin





From kkovachev at varna.net  Mon Jun 21 10:28:57 2010
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Mon, 21 Jun 2010 13:28:57 +0300
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <4C1F2EE2.60405@bobich.net>
References: <117863.60382.qm@web65513.mail.ac4.yahoo.com>	<AANLkTilBCV1kDWqSpc6qxURqBCqcpum6yXKJHK9wnSvI@mail.gmail.com>
	<4ae3464f967fb89bc0f525efaffe35b5@mx.varna.net>
	<4C1F2EE2.60405@bobich.net>
Message-ID: <7bd83f425b9ecfa455db424e168840fe@mx.varna.net>

On Mon, 21 Jun 2010 10:20:34 +0100, Gordan Bobic <gordan at bobich.net>
wrote:
> On 06/21/2010 08:52 AM, Kaloyan Kovachev wrote:
>> On Fri, 18 Jun 2010 18:15:09 +0200, brem belguebli
>> <brem.belguebli at gmail.com>  wrote:
>>> How do you deal with fencing when the intersite interconnects (SAN and
>>> LAN) are the cause of the failure ?
>>>
>>
>> GPRS or the good old modem over a phone line?
> 
> That isn't going to work if the whole site is down for whatever reason 
> (unlikely as it may be).
> 

If the whole site is down because of a power failure - yes (well, then you
don't need to actually fence anything), but if the failure is just in the
intersite connection, an alternative low-speed connection used simply to
fence the remote nodes and tell the remote SAN to block their access should
be enough.

> To protect yourself from the 100% outage of a remote site, the only sane
> way I of approaching it I can think of is to do something like the 
> following:
> 
> 1) Make each node fence itself off from the failed node using iptables 
> or some other firewalling method. The SAN should also be prevented from 
> allowing the booted out node back onto it.
> 

Then each node has to do that kind of fencing itself; having a single node
block the port(s) on the switch (to the remote location) should be easier to
implement as a fencing agent. Again, having an additional communication
channel will help: "if it's just the link, then fence the remote nodes and
don't block the port(s)". This would avoid manual intervention to restore
the link after the outage is fixed.

> 2) Fail over the IP address or DNS name of the service. Since it's 
> across different sites, you are likely to have to use something like RIP
> to re-route the IPs, so DNS on short refresh may well be an easier and 
> possibly safer option. It'll mean some downtime, but probably less than 
> any manual intervention in an unplanned case.
> 
> It's not entirely ideal, bit it's about as good as it is likely to get. 
> And you can write a fencing agent to do something like this easily
> enough.
> 
> Gordan
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From gordan at bobich.net  Mon Jun 21 11:02:51 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Mon, 21 Jun 2010 12:02:51 +0100
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <7bd83f425b9ecfa455db424e168840fe@mx.varna.net>
References: <117863.60382.qm@web65513.mail.ac4.yahoo.com>	<AANLkTilBCV1kDWqSpc6qxURqBCqcpum6yXKJHK9wnSvI@mail.gmail.com>	<4ae3464f967fb89bc0f525efaffe35b5@mx.varna.net>	<4C1F2EE2.60405@bobich.net>
	<7bd83f425b9ecfa455db424e168840fe@mx.varna.net>
Message-ID: <4C1F46DB.8040902@bobich.net>

On 06/21/2010 11:28 AM, Kaloyan Kovachev wrote:
> On Mon, 21 Jun 2010 10:20:34 +0100, Gordan Bobic<gordan at bobich.net>
> wrote:
>> On 06/21/2010 08:52 AM, Kaloyan Kovachev wrote:
>>> On Fri, 18 Jun 2010 18:15:09 +0200, brem belguebli
>>> <brem.belguebli at gmail.com>   wrote:
>>>> How do you deal with fencing when the intersite interconnects (SAN and
>>>> LAN) are the cause of the failure ?
>>>>
>>>
>>> GPRS or the good old modem over a phone line?
>>
>> That isn't going to work if the whole site is down for whatever reason
>> (unlikely as it may be).
>>
>
> If the whole site is down because of a power failure - yes (well, then you
> don't need to actually fence anything) , but if the failure is just in the
> intersite connection - alternative low speed connection to simply fence the
> remote nodes and tell the remote SAN to block it's access should be enough.

The problem is that although you don't need to fence anything, you need to:
1) Verify that the site is properly down
2) Make sure it stays down

Otherwise you are risking resource clashes.

>> To protect yourself from the 100% outage of a remote site, the only sane
>> way I of approaching it I can think of is to do something like the
>> following:
>>
>> 1) Make each node fence itself off from the failed node using iptables
>> or some other firewalling method. The SAN should also be prevented from
>> allowing the booted out node back onto it.
>>
>
> then each node should do that kind of fencing, but if a single node blocks
> the port(s) on the switch (to the remote location) should be easier to do
> as fencing agent. Again having additional communication channel will help -
> "if it's just the link, then fence the remote nodes and don't block the
> port(s)" this would avoid manual intervention to restore the link after the
> outage is fixed

There is no reason why you couldn't fire off the iptables fencing 
command to each node via SSH, so that whichever node does the fencing, 
covers it for all nodes.

Gordan
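
The SSH fan-out suggested above could look roughly like this: whichever node
performs the fencing builds one ssh invocation per surviving node, each of
which inserts the blocking rule remotely. This is a sketch under
assumptions: the function name, the root login, key-based authentication
(hence BatchMode) and the addresses in the example are all hypothetical.

```python
def fanout_commands(surviving_nodes, failed_ip, user="root"):
    """One ssh command per surviving node, each adding a DROP rule."""
    remote = "iptables -I INPUT -s %s -j DROP" % failed_ip
    return [
        # BatchMode makes ssh fail instead of prompting for a password.
        ["ssh", "-o", "BatchMode=yes", "%s@%s" % (user, node), remote]
        for node in surviving_nodes
    ]


if __name__ == "__main__":
    for cmd in fanout_commands(["svXprdclu002", "svXprdclu003"], "192.0.2.10"):
        print(" ".join(cmd))
```

The commands are only constructed here; running them, and deciding what to
do when one surviving node cannot be reached, is left to the agent.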



From parshu001 at gmail.com  Mon Jun 21 11:46:55 2010
From: parshu001 at gmail.com (parshuram prasad)
Date: Mon, 21 Jun 2010 17:16:55 +0530
Subject: [Linux-cluster] cluster configuration
Message-ID: <AANLkTilMO39vU9z07xUjAM5YUa7JIwvI8y88GQJEPagj@mail.gmail.com>

Hi All,

Please provide me with step-by-step instructions for clustering on Linux EL 5.3.


-- 
Warm Regards
Parshuram Prasad
+91-9560170372
Sr. System Administrator & Database Administrator

Stratoshear Technology Pvt. Ltd.

BPS House Green Park -16
www.stratoshear.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100621/5526553c/attachment.htm>

From brem.belguebli at gmail.com  Mon Jun 21 13:02:22 2010
From: brem.belguebli at gmail.com (brem belguebli)
Date: Mon, 21 Jun 2010 15:02:22 +0200
Subject: [Linux-cluster] qdisk WITHOUT fencing
In-Reply-To: <4C1F46DB.8040902@bobich.net>
References: <117863.60382.qm@web65513.mail.ac4.yahoo.com>
	<AANLkTilBCV1kDWqSpc6qxURqBCqcpum6yXKJHK9wnSvI@mail.gmail.com>
	<4ae3464f967fb89bc0f525efaffe35b5@mx.varna.net>
	<4C1F2EE2.60405@bobich.net>
	<7bd83f425b9ecfa455db424e168840fe@mx.varna.net>
	<4C1F46DB.8040902@bobich.net>
Message-ID: <AANLkTilUZdir4VM3czsDxNrnb1oOXBODJjPgig7LFWrG@mail.gmail.com>

In my experience, and from some good practices (IMHO) I've seen in a lot
of production environments, a cluster must never be autostarted.

This is to prevent a power-flapping node from accessing the storage
almost randomly.

The second thing is to have a reliable suicide procedure, generally
based on a hardware watchdog mechanism.
Almost all the known vendors provide reliable hardware that can be
used for that. That would imply that this autofence mechanism is
supported only on certified hardware.

A simple watchdog agent would monitor the cluster state; if it
goes inquorate, the node is hard reset without any further
consideration.
When coupled with autostart off, there is no risk anymore.
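
The decision logic of such a watchdog agent can be sketched as follows. It
assumes the quorum state is read from the "Member Status:" line that clustat
prints (as seen elsewhere in this archive) and that the reset is done
through the Linux SysRq trigger; the function names are made up, and the
wording clustat uses for the non-quorate case is not shown in this thread,
so anything other than "Quorate" is treated as a failure.

```python
def member_status(clustat_output):
    """Return the value of the 'Member Status:' line, or None if absent."""
    for line in clustat_output.splitlines():
        if line.strip().startswith("Member Status:"):
            return line.split(":", 1)[1].strip()
    return None


def should_reset(clustat_output):
    """Reset unless the node positively reports that it is quorate."""
    return member_status(clustat_output) != "Quorate"


def hard_reset():
    """Immediate reboot via /proc/sysrq-trigger (root only, no sync)."""
    with open("/proc/sysrq-trigger", "w") as trigger:
        trigger.write("b")


if __name__ == "__main__":
    sample = "Cluster Status for EDISV1DBM\nMember Status: Quorate\n"
    print(should_reset(sample))  # hard_reset() is deliberately not called
```

Treating missing or unreadable output as a reason to reset is the cautious
choice for data safety, at the price of rebooting a healthy node whenever
the status command itself fails.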

>> GPRS or the good old modem over a phone line?
In the datacenters I manage, mobile communications do not work;
they're practically Faraday cages.
I thought about POTS lines, but it made me feel like I was going back
to the 90's....

> The problem is that although you don't need to fence anything, you need to:
> 1) Verify that the site is properly down
> 2) Make sure it stays down

1 --> Best case, it is an electrical problem and all the nodes and storage
are off; if they are not (interconnect failure, for instance), the watchdog
mechanism described above has done its job (it needs to be coupled to a
3rd-site tie breaker).
2 --> Forbid cluster autostart to avoid this kind of problem.


2010/6/21 Gordan Bobic <gordan at bobich.net>:
> On 06/21/2010 11:28 AM, Kaloyan Kovachev wrote:
>>
>> On Mon, 21 Jun 2010 10:20:34 +0100, Gordan Bobic<gordan at bobich.net>
>> wrote:
>>>
>>> On 06/21/2010 08:52 AM, Kaloyan Kovachev wrote:
>>>>
>>>> On Fri, 18 Jun 2010 18:15:09 +0200, brem belguebli
>>>> <brem.belguebli at gmail.com> wrote:
>>>>>
>>>>> How do you deal with fencing when the intersite interconnects (SAN and
>>>>> LAN) are the cause of the failure ?
>>>>>
>>>>
>>>> GPRS or the good old modem over a phone line?
>>>
>>> That isn't going to work if the whole site is down for whatever reason
>>> (unlikely as it may be).
>>>
>>
>> If the whole site is down because of a power failure - yes (well, then you
>> don't need to actually fence anything) , but if the failure is just in the
>> intersite connection - alternative low speed connection to simply fence
>> the
>> remote nodes and tell the remote SAN to block it's access should be
>> enough.
>
> The problem is that although you don't need to fence anything, you need to:
> 1) Verify that the site is properly down
> 2) Make sure it stays down
>
> Otherwise you are risking resource clashes.
>
>>> To protect yourself from the 100% outage of a remote site, the only sane
>>
>>> way I of approaching it I can think of is to do something like the
>>> following:
>>>
>>> 1) Make each node fence itself off from the failed node using iptables
>>> or some other firewalling method. The SAN should also be prevented from
>>> allowing the booted out node back onto it.
>>>
>>
>> then each node should do that kind of fencing, but if a single node blocks
>> the port(s) on the switch (to the remote location) should be easier to do
>> as fencing agent. Again having additional communication channel will help
>> -
>> "if it's just the link, then fence the remote nodes and don't block the
>> port(s)" this would avoid manual intervention to restore the link after
>> the
>> outage is fixed
>
> There is no reason why you couldn't fire off the iptables fencing command to
> each node via SSH, so that whichever node does the fencing, covers it for
> all nodes.
>
> Gordan
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From finnzi at finnzi.com  Mon Jun 21 13:06:57 2010
From: finnzi at finnzi.com (=?ISO-8859-1?Q?Finnur_=D6rn_Gu=F0mundsson?=)
Date: Mon, 21 Jun 2010 13:06:57 +0000
Subject: [Linux-cluster] cluster configuration
In-Reply-To: <AANLkTilMO39vU9z07xUjAM5YUa7JIwvI8y88GQJEPagj@mail.gmail.com>
References: <AANLkTilMO39vU9z07xUjAM5YUa7JIwvI8y88GQJEPagj@mail.gmail.com>
Message-ID: <4C1F63F1.5050008@finnzi.com>

On 21.6.2010 11:46, parshuram prasad wrote:
> Hi All,
>
> please provide  me step by step clustering in linux el 5.3
>
>
> -- 
> Warm Regards
> Parshuram Prasad
> +91-9560170372
> Sr. System Administrator & Database Administrator
>
> Stratoshear Technology Pvt. Ltd.
>
> BPS House Green Park -16
> www.stratoshear.com <http://www.stratoshear.com>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
Hi there,

You might want to consider reading this guide before attempting to build 
a cluster: 
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.5/html/Cluster_Administration/index.html

Bgrds,
Finnur
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100621/3fac9278/attachment.htm>

From dhoffutt at gmail.com  Mon Jun 21 14:07:53 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Mon, 21 Jun 2010 09:07:53 -0500
Subject: [Linux-cluster] A couple of Multicast Questions
Message-ID: <AANLkTikI62b9YJuqyWWst9CPfIBKNV11gbfG4V6nB1Ac@mail.gmail.com>

Regarding the CS as released with RH 5.4, 5.5 and as expected with 6.0 if
anything might change...:

Should one adjust the multicast address being used if running multiple
clusters? If no, how is it that they can differentiate traffic between the
clusters?

If an unrelated application is on the same network as these clusters, should
care be taken to change the multicast address being used by the cluster(s)?

Would it be best practice, when altering the multicast address, to edit
/etc/ais/openais.conf or to add multicast XML tags in cluster.conf, or does
it not really matter?

Thank you
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100621/1d9063a4/attachment.htm>

From jayfitzpatrick at gmail.com  Mon Jun 21 15:07:34 2010
From: jayfitzpatrick at gmail.com (Jason Fitzpatrick)
Date: Mon, 21 Jun 2010 16:07:34 +0100
Subject: [Linux-cluster] Basic Active Active File Server
Message-ID: <AANLkTinz5e9XM9dEhuz9LWFMJtaSX8AEILRO0cK2_EsH@mail.gmail.com>

Hi all

I am having no end of trouble getting a basic Active Active Cluster
working. At the moment it is in test / proof of concept and has manual
fencing in place, but I cannot for the life of me get the 2 nodes to
join the one cluster (they both report joined in crm_tool status,
but only to a local clustered instance, if that makes any sense).

I have tried to use luci and system-config-cluster to get this up and
running and have been at it for over a week. The network guys swear that
there is nothing blocking multicast traffic between the nodes, and the
firewalls have been disabled (the nodes are on the same VLAN but connected
to different switches). The servers have been rebuilt and have RHEL 5.5
installed.

Shared Storage is being provided by an Active Active DRBD setup
(tested and working)

I have attached a copy of my cluster.conf

Thanks in advance

Jay

--

"The only difference between saints and sinners is that every saint
has a past while every sinner has a future. "
- Oscar Wilde
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/octet-stream
Size: 1460 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100621/b88f96ea/attachment.obj>

From kkovachev at varna.net  Mon Jun 21 15:49:13 2010
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Mon, 21 Jun 2010 18:49:13 +0300
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <AANLkTinz5e9XM9dEhuz9LWFMJtaSX8AEILRO0cK2_EsH@mail.gmail.com>
References: <AANLkTinz5e9XM9dEhuz9LWFMJtaSX8AEILRO0cK2_EsH@mail.gmail.com>
Message-ID: <016be92770a57631ff512604864c06b3@mx.varna.net>

Hi,
On Mon, 21 Jun 2010 16:07:34 +0100, Jason Fitzpatrick
<jayfitzpatrick at gmail.com> wrote:
> Hi all
> 
> I am having no end of trouble getting a basic Active Active Cluster
> working. at the moment it is in test / proof of concept and has manual
> fencing in place but I cannot for the life of me get the 2 nodes to
> join to the one cluster (they both report joined in crm_tool status
> but only to a local clustered instance if that makes any sence)
> 
> I have tried to use luci and system-config-cluster to get this up and
> running and have been at it over a week, the network guys swear that
> there is nothing blocking multicast traffic between them and the
> firewalls have been disabled (they are on the same VLAN but connected
> to different switches) servers have been rebuilt and have RHEL 5.5
> installed
> 

Your problem is the multicast traffic. Check with tcpdump whether it is
reaching the other server at all (network); if it is, then double-check
the firewall.
Alternatively, you may try using broadcast instead of multicast.

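A minimal sketch of the tcpdump check described above. The interface name
(eth0) and the UDP port (5405, assumed here to be the cman/openais port in
use) are placeholders that are not given in this thread and should be
checked against the local configuration. The helper only prints the
command, so it can be reviewed before being run as root on the receiving
node.

```shell
#!/bin/sh
# Print (do not run) a tcpdump command that watches one interface for
# cluster traffic on a given UDP port.
# Assumptions: eth0 and port 5405 are illustrative defaults only.
mcast_capture_cmd() {
    iface=${1:-eth0}
    port=${2:-5405}
    # -n: do not resolve names, -i: interface to listen on
    printf 'tcpdump -n -i %s udp port %s\n' "$iface" "$port"
}

# Show the command for the assumed defaults.
mcast_capture_cmd eth0 5405
```

If packets sent by the other node never show up in the capture, the
traffic is being lost on the network rather than dropped by the local
firewall.
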
> Shared Storage is being provided by an Active Active DRBD setup
> (tested and working)
> 
> I have attached a copy of my cluster.conf
> 
> Thanks in advance
> 
> Jay
> 
> --
> 
> "The only difference between saints and sinners is that every saint
> has a past while every sinner has a future. "
> - Oscar Wilde



From jumanjiman at gmail.com  Mon Jun 21 15:50:49 2010
From: jumanjiman at gmail.com (Paul Morgan)
Date: Mon, 21 Jun 2010 11:50:49 -0400
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <AANLkTinz5e9XM9dEhuz9LWFMJtaSX8AEILRO0cK2_EsH@mail.gmail.com>
References: <AANLkTinz5e9XM9dEhuz9LWFMJtaSX8AEILRO0cK2_EsH@mail.gmail.com>
Message-ID: <AANLkTinx6ftpFUrFpVhS7oGe7bj6ynYZ-5Hif1ZZGGC5@mail.gmail.com>

By "cannot join", do you mean the logs actually report "failed to join" or
do they join, then one gets fenced?

If the latter, ask your network team if they're using IGMP snooping and/or
an IGMP querier. If so, it's likely they only do IGMP v2 and you'll need to
force v2 on your interfaces via sysctl.conf. Current kernels default to v3.

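One way to pin an interface to IGMPv2, sketched under the assumption that
the cluster interface is eth0 (the interface name is not given in this
thread). force_igmp_version is the per-interface Linux sysctl; the helper
below only prints the line to add and changes nothing itself.

```shell
#!/bin/sh
# Print the sysctl.conf line that forces IGMP version 2 on one interface.
# Assumption: eth0 is a placeholder; substitute the real cluster interface.
igmp_v2_sysctl_line() {
    iface=${1:-eth0}
    printf 'net.ipv4.conf.%s.force_igmp_version = 2\n' "$iface"
}

# Show the line for the assumed interface.
igmp_v2_sysctl_line eth0
```

The printed line goes into /etc/sysctl.conf and can be loaded with
'sysctl -p'.
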
-paul

On Jun 21, 2010 11:13 AM, "Jason Fitzpatrick" <jayfitzpatrick at gmail.com>
wrote:
> Hi all
>
> I am having no end of trouble getting a basic Active Active Cluster
> working. at the moment it is in test / proof of concept and has manual
> fencing in place but I cannot for the life of me get the 2 nodes to
> join to the one cluster (they both report joined in crm_tool status
> but only to a local clustered instance if that makes any sence)
>
> I have tried to use luci and system-config-cluster to get this up and
> running and have been at it over a week, the network guys swear that
> there is nothing blocking multicast traffic between them and the
> firewalls have been disabled (they are on the same VLAN but connected
> to different switches) servers have been rebuilt and have RHEL 5.5
> installed
>
> Shared Storage is being provided by an Active Active DRBD setup
> (tested and working)
>
> I have attached a copy of my cluster.conf
>
> Thanks in advance
>
> Jay
>
> --
>
> "The only difference between saints and sinners is that every saint
> has a past while every sinner has a future. "
> - Oscar Wilde

From Martin.Waite at datacash.com  Mon Jun 21 15:50:54 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Mon, 21 Jun 2010 16:50:54 +0100
Subject: [Linux-cluster] frozen services are stopped when rgmanager is
	restarted
Message-ID: <A78DB34D00374344A0AB65B6523C05DC05A07CFE@marsden.win.datacash.com>

Hi,

 

RHEL 5.4:  cluster2 (I think).

I expected to be able to freeze a service on a node and restart
rgmanager on that node without interrupting the service.  In practice,
starting rgmanager causes the service to be stopped.

Is this what is supposed to happen?  I thought the whole point of
freezing services was to allow maintenance (including restarting cluster
software).

Are there any options to prevent the services from being stopped when
rgmanager is started?

One effect of rgmanager stopping the service is that the cluster reaches
an inconsistent state.  Once rgmanager has restarted, the cluster
believes that the services are still frozen, whereas in reality they are
stopped.  Any attempt to unfreeze the service causes the service to
fail over to a standby node.

regards,

Martin

 

 

sudo /usr/sbin/clustat
Cluster Status for EDISV1DBM @ Mon Jun 21 16:27:05 2010
Member Status: Quorate

 Member Name                                           ID   Status
 ------ ----                                           ---- ------
 svXprdclu001                                              1 Online, rgmanager
 svXprdclu002                                              2 Online, Local, rgmanager
 svXprdclu003                                              3 Online, rgmanager
 svXprdclu004                                              4 Online, rgmanager
 svXprdclu005                                              5 Online, rgmanager

 Service Name                                 Owner (Last)                 State
 ------- ----                                 ----- ------                 -----
 service:ACTIVESITE                           svXprdclu002                 started
 service:MASTERVIP                            svXprdclu002                 started

[martin at cp1edidbm002 ~]$ sudo /usr/sbin/clusvcadm -Z ACTIVESITE
Local machine freezing service:ACTIVESITE...Success

[martin at cp1edidbm002 ~]$ sudo /usr/sbin/clusvcadm -Z MASTERVIP
Local machine freezing service:MASTERVIP...Success

[martin at cp1edidbm002 ~]$ sudo /usr/sbin/clustat
Cluster Status for EDISV1DBM @ Mon Jun 21 16:34:02 2010
Member Status: Quorate

 Member Name                                           ID   Status
 ------ ----                                           ---- ------
 svXprdclu001                                              1 Online, rgmanager
 svXprdclu002                                              2 Online, Local, rgmanager
 svXprdclu003                                              3 Online, rgmanager
 svXprdclu004                                              4 Online, rgmanager
 svXprdclu005                                              5 Online, rgmanager

 Service Name                                 Owner (Last)                 State
 ------- ----                                 ----- ------                 -----
 service:ACTIVESITE                           svXprdclu002                 started    [Z]
 service:MASTERVIP                            svXprdclu002                 started    [Z]

[martin at cp1edidbm002 ~]$ sudo /etc/init.d/rgmanager stop
Shutting down Cluster Service Manager...
Waiting for services to stop:                              [  OK  ]
Cluster Service Manager is stopped.

[martin at cp1edidbm002 ~]$ sudo /etc/init.d/rgmanager start
Starting Cluster Service Manager:                          [  OK  ]

#
# the services are stopped by rgmanager start.  Ugh!
#

[martin at cp1edidbm002 ~]$ sudo /usr/sbin/clustat
Cluster Status for EDISV1DBM @ Mon Jun 21 16:35:34 2010
Member Status: Quorate

 Member Name                                           ID   Status
 ------ ----                                           ---- ------
 svXprdclu001                                              1 Online, rgmanager
 svXprdclu002                                              2 Online, Local, rgmanager
 svXprdclu003                                              3 Online, rgmanager
 svXprdclu004                                              4 Online, rgmanager
 svXprdclu005                                              5 Online, rgmanager

 Service Name                                 Owner (Last)                 State
 ------- ----                                 ----- ------                 -----
 service:ACTIVESITE                           svXprdclu002                 started    [Z]
 service:MASTERVIP                            svXprdclu002                 started    [Z]

 

=========================================

 

The logs show that the service is stopped as rgmanager is started on
svXprdclu002.

Jun 21 16:31:19 cp1edidbm002 clurgmgrd: [14256]: <info> Executing /home/martin/dc-dsm status
Jun 21 16:34:58 cp1edidbm002 rgmanager: [15526]: <notice> Shutting down Cluster Service Manager...
Jun 21 16:34:58 cp1edidbm002 clurgmgrd[14256]: <notice> Shutting down
Jun 21 16:35:08 cp1edidbm002 clurgmgrd[14256]: <notice> Shutdown complete, exiting
Jun 21 16:35:08 cp1edidbm002 rgmanager: [15526]: <notice> Cluster Service Manager is stopped.

Jun 21 16:35:16 cp1edidbm002 kernel: dlm: Using TCP for communications
Jun 21 16:35:16 cp1edidbm002 kernel: dlm: got connection from 4
Jun 21 16:35:16 cp1edidbm002 kernel: dlm: got connection from 5
Jun 21 16:35:16 cp1edidbm002 kernel: dlm: got connection from 1
Jun 21 16:35:16 cp1edidbm002 kernel: dlm: got connection from 3
Jun 21 16:35:17 cp1edidbm002 clurgmgrd[15574]: <notice> Resource Group Manager Starting
Jun 21 16:35:17 cp1edidbm002 clurgmgrd[15574]: <info> Loading Service Data
Jun 21 16:35:17 cp1edidbm002 clurgmgrd[15574]: <info> Initializing Services
Jun 21 16:35:17 cp1edidbm002 clurgmgrd: [15574]: <info> Executing /bin/true stop
Jun 21 16:35:17 cp1edidbm002 clurgmgrd: [15574]: <info> Removing IPv4 address 10.3.17.20/24 from bond0
Jun 21 16:35:27 cp1edidbm002 clurgmgrd: [15574]: <info> Executing /home/martin/dc-dsm stop
Jun 21 16:35:27 cp1edidbm002 clurgmgrd[15574]: <info> Services Initialized
Jun 21 16:35:27 cp1edidbm002 clurgmgrd[15574]: <info> State change: Local UP
Jun 21 16:35:27 cp1edidbm002 clurgmgrd[15574]: <info> State change: svXprdclu001 UP
Jun 21 16:35:27 cp1edidbm002 clurgmgrd[15574]: <info> State change: svXprdclu003 UP
Jun 21 16:35:27 cp1edidbm002 clurgmgrd[15574]: <info> State change: svXprdclu004 UP
Jun 21 16:35:27 cp1edidbm002 clurgmgrd[15574]: <info> State change: svXprdclu005 UP

 

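The stop calls can be picked out of the log mechanically. A small sketch,
assuming the messages are read from the unwrapped syslog file (typically
/var/log/messages, which is an assumption about this host) rather than
from the line-wrapped copy in this mail. The function name is
hypothetical.

```shell
#!/bin/sh
# Read syslog lines on stdin and print only the resource agent "stop"
# calls logged by clurgmgrd (lines of the form "Executing <script> stop").
rgmanager_stops() {
    grep 'clurgmgrd' | grep -E 'Executing .+ stop *$'
}

# Typical use on a cluster node (path is an assumption):
#   rgmanager_stops < /var/log/messages
```

Run against the excerpt above, this would list the '/bin/true stop' and
'/home/martin/dc-dsm stop' calls that follow 'Initializing Services'.
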

From jayfitzpatrick at gmail.com  Mon Jun 21 16:11:24 2010
From: jayfitzpatrick at gmail.com (Jason Fitzpatrick)
Date: Mon, 21 Jun 2010 17:11:24 +0100
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <AANLkTinx6ftpFUrFpVhS7oGe7bj6ynYZ-5Hif1ZZGGC5@mail.gmail.com>
References: <AANLkTinz5e9XM9dEhuz9LWFMJtaSX8AEILRO0cK2_EsH@mail.gmail.com>
	<AANLkTinx6ftpFUrFpVhS7oGe7bj6ynYZ-5Hif1ZZGGC5@mail.gmail.com>
Message-ID: <AANLkTikgzhU3M1G3zYkolyoRtYKq8lTfWPRM4CpCOQEo@mail.gmail.com>

Hi Paul..

I have done a crm_tool leave force on the nodes in an attempt to get
them to join back in, but they simply form a cluster on their own. I
have asked about IGMP and have been told that it is not running on
the network.

Since this is a test system, I have built it on ESX, with one node in
each of our datacenters. I am now moving both of the nodes to the same
ESX server and will try again.

Jay

On 21 June 2010 16:50, Paul Morgan <jumanjiman at gmail.com> wrote:
> By "cannot join", do you mean the logs actually report "failed to join" or
> do they join, then one gets fenced?
>
> If the latter, ask your network team if they're using igmp snoop and/or igmp
> query. If so, it's likely they only do igmp v2 and you'll need to force v2
> on your interfaces via sysctl.conf.  current kernels default to v3.
>
> -paul
>
> On Jun 21, 2010 11:13 AM, "Jason Fitzpatrick" <jayfitzpatrick at gmail.com>
> wrote:
>> Hi all
>>
>> I am having no end of trouble getting a basic Active Active Cluster
>> working. at the moment it is in test / proof of concept and has manual
>> fencing in place but I cannot for the life of me get the 2 nodes to
>> join to the one cluster (they both report joined in crm_tool status
>> but only to a local clustered instance if that makes any sence)
>>
>> I have tried to use luci and system-config-cluster to get this up and
>> running and have been at it over a week, the network guys swear that
>> there is nothing blocking multicast traffic between them and the
>> firewalls have been disabled (they are on the same VLAN but connected
>> to different switches) servers have been rebuilt and have RHEL 5.5
>> installed
>>
>> Shared Storage is being provided by an Active Active DRBD setup
>> (tested and working)
>>
>> I have attached a copy of my cluster.conf
>>
>> Thanks in advance
>>
>> Jay
>>
>> --
>>
>> "The only difference between saints and sinners is that every saint
>> has a past while every sinner has a future. "
>> - Oscar Wilde
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 

"The only difference between saints and sinners is that every saint
has a past while every sinner has a future. "
- Oscar Wilde



From jayfitzpatrick at gmail.com  Mon Jun 21 16:14:25 2010
From: jayfitzpatrick at gmail.com (Jason Fitzpatrick)
Date: Mon, 21 Jun 2010 17:14:25 +0100
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <016be92770a57631ff512604864c06b3@mx.varna.net>
References: <AANLkTinz5e9XM9dEhuz9LWFMJtaSX8AEILRO0cK2_EsH@mail.gmail.com>
	<016be92770a57631ff512604864c06b3@mx.varna.net>
Message-ID: <AANLkTimh2jPhuEWrIjuk7nflKERO4kaA_rgOebummhZi@mail.gmail.com>

Hi..

I have tried both multicast and broadcast to no avail. As above, I am
moving the systems to the same ESX to try to rule out the networking
end of things. I have not tried tcpdump, but was running wireshark
in an attempt to do the same as you recommended.

Jay

On 21 June 2010 16:49, Kaloyan Kovachev <kkovachev at varna.net> wrote:
> Hi,
> On Mon, 21 Jun 2010 16:07:34 +0100, Jason Fitzpatrick
> <jayfitzpatrick at gmail.com> wrote:
>> Hi all
>>
>> I am having no end of trouble getting a basic Active Active Cluster
>> working. at the moment it is in test / proof of concept and has manual
>> fencing in place but I cannot for the life of me get the 2 nodes to
>> join to the one cluster (they both report joined in crm_tool status
>> but only to a local clustered instance if that makes any sence)
>>
>> I have tried to use luci and system-config-cluster to get this up and
>> running and have been at it over a week, the network guys swear that
>> there is nothing blocking multicast traffic between them and the
>> firewalls have been disabled (they are on the same VLAN but connected
>> to different switches) servers have been rebuilt and have RHEL 5.5
>> installed
>>
>
> your problem is the multicast traffic - check with tcpdump if it is
> comming to the other server at all (network) and if it is, then doublecheck
> the firewall.
> alternatively you may try using broadcast instead of multicast
>
>> Shared Storage is being provided by an Active Active DRBD setup
>> (tested and working)
>>
>> I have attached a copy of my cluster.conf
>>
>> Thanks in advance
>>
>> Jay
>>
>> --
>>
>> "The only difference between saints and sinners is that every saint
>> has a past while every sinner has a future. "
>> - Oscar Wilde
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 

"The only difference between saints and sinners is that every saint
has a past while every sinner has a future. "
- Oscar Wilde



From volker at ixolution.de  Mon Jun 21 18:54:08 2010
From: volker at ixolution.de (Volker Dormeyer)
Date: Mon, 21 Jun 2010 20:54:08 +0200
Subject: [Linux-cluster] Newbie question - why the 16 node limit on
 Cluster Suite and does this also apply to GFS?
In-Reply-To: <4C1F294A.7090505@mu-leoben.at>
References: <4C1EBF3F.4060407@digital-crocus.com>
	<20100621063951.GA3542@dijkstra> <4C1F294A.7090505@mu-leoben.at>
Message-ID: <20100621185408.GA3704@dijkstra>

Hi,

On Mon, Jun 21, 2010 at 10:56:42AM +0200,
Michael Lackner <michael.lackner at mu-leoben.at> wrote:
> I'm also interested in this (just curious). When not using a quorum
> disk, but
> when using GFS/GFS2, would it be possible to create a cluster with more than
> 16 nodes with actual releases of the cluster suite, or would the
> software not
> allow it?

GFS2 is able to scale to more than 16 nodes, but that is not officially
supported. I have not tried to use more than 16 nodes myself so far.

There was a discussion on this list one or two months ago.

Regards,
Volker



From dhoffutt at gmail.com  Mon Jun 21 20:11:00 2010
From: dhoffutt at gmail.com (Dustin Henry Offutt)
Date: Mon, 21 Jun 2010 15:11:00 -0500
Subject: [Linux-cluster] Newbie question - why the 16 node limit on
	Cluster Suite and does this also apply to GFS?
In-Reply-To: <20100621185408.GA3704@dijkstra>
References: <4C1EBF3F.4060407@digital-crocus.com>
	<20100621063951.GA3542@dijkstra> <4C1F294A.7090505@mu-leoben.at>
	<20100621185408.GA3704@dijkstra>
Message-ID: <AANLkTinv3no4m1FtG5aaLHgue4ps6zI9OV1qBxQOwAnK@mail.gmail.com>

https://www.redhat.com/archives/linux-cluster/2010-May/msg00003.html


On Mon, Jun 21, 2010 at 1:54 PM, Volker Dormeyer <volker at ixolution.de> wrote:

> Hi,
>
> On Mon, Jun 21, 2010 at 10:56:42AM +0200,
> Michael Lackner <michael.lackner at mu-leoben.at> wrote:
> > I'm also interested in this (just curious). When not using a quorum
> > disk, but
> > when using GFS/GFS2, would it be possible to create a cluster with more
> than
> > 16 nodes with actual releases of the cluster suite, or would the
> > software not
> > allow it?
>
> GFS2 is able to scale with more than 16 nodes. But it is not officially
> supported. For myself, I didn't try to use more than 16 nodes, so far.
>
> There was a discussion on this list one or two months ago.
>
> Regards,
> Volker
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

From michael.lackner at mu-leoben.at  Tue Jun 22 06:31:36 2010
From: michael.lackner at mu-leoben.at (Michael Lackner)
Date: Tue, 22 Jun 2010 08:31:36 +0200
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <AANLkTimh2jPhuEWrIjuk7nflKERO4kaA_rgOebummhZi@mail.gmail.com>
References: <AANLkTinz5e9XM9dEhuz9LWFMJtaSX8AEILRO0cK2_EsH@mail.gmail.com>	<016be92770a57631ff512604864c06b3@mx.varna.net>
	<AANLkTimh2jPhuEWrIjuk7nflKERO4kaA_rgOebummhZi@mail.gmail.com>
Message-ID: <4C2058C8.7030900@mu-leoben.at>

Greets.

For what it's worth, I have had a similar problem, with our central
networking department telling me that "it should work". I have had a
single node unable to join the 3-node cluster I'm currently trying to
get working. All physical machines, no VMs.

Solution: Hook them all up on the same switch, no more problems with
that part...

Jason Fitzpatrick wrote:
> Hi..
>
> I have tried both multicast and broadcast to no avail, as above I am
> moving the systems to the same ESX to try and rule out the networking
> end of things, I have not tried the tcpdump but was running wireshark
> in an attempt to do the same as you recommended
>
> Jay
>
> On 21 June 2010 16:49, Kaloyan Kovachev <kkovachev at varna.net> wrote:
>   
>> Hi,
>> On Mon, 21 Jun 2010 16:07:34 +0100, Jason Fitzpatrick
>> <jayfitzpatrick at gmail.com> wrote:
>>     
>>> Hi all
>>>
>>> I am having no end of trouble getting a basic Active Active Cluster
>>> working. at the moment it is in test / proof of concept and has manual
>>> fencing in place but I cannot for the life of me get the 2 nodes to
>>> join to the one cluster (they both report joined in crm_tool status
>>> but only to a local clustered instance if that makes any sence)
>>>
>>> I have tried to use luci and system-config-cluster to get this up and
>>> running and have been at it over a week, the network guys swear that
>>> there is nothing blocking multicast traffic between them and the
>>> firewalls have been disabled (they are on the same VLAN but connected
>>> to different switches) servers have been rebuilt and have RHEL 5.5
>>> installed
>>>
>>>       
>> your problem is the multicast traffic - check with tcpdump if it is
>> comming to the other server at all (network) and if it is, then doublecheck
>> the firewall.
>> alternatively you may try using broadcast instead of multicast
>>
>>     
>>> Shared Storage is being provided by an Active Active DRBD setup
>>> (tested and working)
>>>
>>> I have attached a copy of my cluster.conf
>>>
>>> Thanks in advance
>>>
>>> Jay
>>>
>>> --
>>>
>>> "The only difference between saints and sinners is that every saint
>>> has a past while every sinner has a future. "
>>> - Oscar Wilde
>>>       
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
-- 
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505




From ccaulfie at redhat.com  Tue Jun 22 06:50:47 2010
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 22 Jun 2010 07:50:47 +0100
Subject: [Linux-cluster] A couple of Multicast Questions
In-Reply-To: <AANLkTikI62b9YJuqyWWst9CPfIBKNV11gbfG4V6nB1Ac@mail.gmail.com>
References: <AANLkTikI62b9YJuqyWWst9CPfIBKNV11gbfG4V6nB1Ac@mail.gmail.com>
Message-ID: <4C205D47.4080302@redhat.com>

On 21/06/10 15:07, Dustin Henry Offutt wrote:
> Regarding the CS as released with RH 5.4, 5.5 and as expected with 6.0
> if anything might change...:
>
> Should one adjust the multicast address being used if running multiple
> clusters? If no, how is it that they can differentiate traffic between
> the clusters?

cman sets the multicast address based on a hash of the cluster name.
So if you have two clusters with different names they should use
different multicast addresses. The hash it uses is fairly primitive
though, so it's worth checking that the two clusters are using different
addresses. cman_tool status will tell you this. If you need to change
the multicast address manually, then do so inside the <cman> tag of
cluster.conf.

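To compare two clusters, the address can be pulled out of the status
output on each one. A sketch, assuming the status output carries a line
beginning 'Multicast addresses:' (the label is an assumption to verify
against the installed cman version); the function name and the example
address are hypothetical.

```shell
#!/bin/sh
# Read `cman_tool status` output on stdin and print the multicast
# address(es) it reports.
# Assumption: the relevant line starts with "Multicast addresses:".
cman_mcast_addr() {
    sed -n 's/^Multicast addresses:[[:space:]]*//p'
}

# On a cluster node:
#   cman_tool status | cman_mcast_addr
#
# A manual override in cluster.conf would look roughly like this
# (239.192.0.1 is a made-up example address):
#   <cman>
#       <multicast addr="239.192.0.1"/>
#   </cman>
```

Running the pipeline on one node of each cluster and comparing the two
results shows whether the name hash has produced a collision.
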

> If an unrelated application is on the same network as these clusters,
> should care be taken to change the multicast address being used by the
> cluster(s)?

That's up to the application you are using. If the application uses
both the same multicast address AND port number that openais is using,
then you will have to change one or the other. It's rather unlikely though.

> Would it be best practice, if altering the multicast address to edit
> /etc/ais/openais.conf or add multicast xtags in the cluster.conf or does
> it not really matter?

When using RHCS/cman the openais.conf file is never read. All
configuration should be in cluster.conf.


Chrissie



From tom+linux-cluster at oneshoeco.com  Tue Jun 22 06:54:37 2010
From: tom+linux-cluster at oneshoeco.com (Tom Lanyon)
Date: Tue, 22 Jun 2010 16:24:37 +0930
Subject: [Linux-cluster] frozen services are stopped when rgmanager
	is	restarted
In-Reply-To: <A78DB34D00374344A0AB65B6523C05DC05A07CFE@marsden.win.datacash.com>
References: <A78DB34D00374344A0AB65B6523C05DC05A07CFE@marsden.win.datacash.com>
Message-ID: <CE1D02A6-460F-4731-B6D1-390F108E8535@oneshoeco.com>

On 22/06/2010, at 1:20 AM, Martin Waite wrote:
> Hi,
> 
> RHEL 5.4:  cluster2 (I think).
> 
> I expected to be able to freeze a service on a node and restart rgmanager on that node without interrupting the service.   In practice, starting rgmanager causes the service to be stopped. 
> 
> Is this what is supposed to happen ?  I thought the whole point of freezing services was to allow maintenance (including restarting cluster software).
> 
> Are there any options to prevent the services from being stopped when rgmanager is started ?
> 
> One effect of rgmanager stopping the service is that the cluster reaches an inconsistent state.  Once rgmanager has restarted, the cluster believes that the services are still frozen, where in reality they are stopped.   Any attempt to unfreeze the service causes the service to failover to a standby node.

I did this recently to upgrade rgmanager on a production cluster with no downtime to services, however I can't find the reference materials I used to do so...

The basic steps are:

* freeze all services
	/usr/sbin/clusvcadm -Z <service>
* stop rgmanager
	/sbin/service rgmanager stop
* upgrade rgmanager 
	yum upgrade rgmanager
* restart rgmanager manually, using the -N flag
	/usr/sbin/clurgmgrd -N
* wait until rgmanager is running again (check 'clustat' output)
* unfreeze the services
	/usr/sbin/clusvcadm -U <service>
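
The steps above can be sketched as a dry run that prints the commands in
order for a given list of services, so the sequence can be reviewed before
anything is executed. The function name is hypothetical; the commands are
the ones listed above, and the service names in the example are taken from
the original report.

```shell
#!/bin/sh
# Print, without running, the freeze / stop / upgrade / start -N / unfreeze
# sequence for the services named on the command line.
rgmanager_upgrade_plan() {
    # 1. freeze every service
    for svc in "$@"; do
        printf '/usr/sbin/clusvcadm -Z %s\n' "$svc"
    done
    # 2-4. stop, upgrade, and restart rgmanager with -N
    printf '%s\n' '/sbin/service rgmanager stop' \
                  'yum upgrade rgmanager' \
                  '/usr/sbin/clurgmgrd -N'
    # 5. manual checkpoint before unfreezing
    printf '%s\n' '# wait here until clustat shows rgmanager running again'
    # 6. unfreeze every service
    for svc in "$@"; do
        printf '/usr/sbin/clusvcadm -U %s\n' "$svc"
    done
}

rgmanager_upgrade_plan ACTIVESITE MASTERVIP
```

Printing rather than executing keeps the wait step manual, which matters
here because unfreezing before rgmanager is back would defeat the purpose.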

Tom



From kitgerrits at gmail.com  Tue Jun 22 08:12:54 2010
From: kitgerrits at gmail.com (Kit Gerrits)
Date: Tue, 22 Jun 2010 10:12:54 +0200
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <AANLkTimh2jPhuEWrIjuk7nflKERO4kaA_rgOebummhZi@mail.gmail.com>
Message-ID: <4c207084.5cebd80a.320c.ffff88b0@mx.google.com>


Keep in mind that multicast requires a multicast router to handle the
traffic.
Mere Layer2 connectivity is not enough.

If broadcast does work, that might be your problem.

Kit 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jason Fitzpatrick
Sent: maandag 21 juni 2010 18:14
To: linux clustering
Subject: Re: [Linux-cluster] Basic Active Active File Server

Hi..

I have tried both multicast and broadcast to no avail, as above I am moving
the systems to the same ESX to try and rule out the networking end of
things, I have not tried the tcpdump but was running wireshark in an attempt
to do the same as you recommended

Jay

On 21 June 2010 16:49, Kaloyan Kovachev <kkovachev at varna.net> wrote:
> Hi,
> On Mon, 21 Jun 2010 16:07:34 +0100, Jason Fitzpatrick 
> <jayfitzpatrick at gmail.com> wrote:
>> Hi all
>>
>> I am having no end of trouble getting a basic Active Active Cluster 
>> working. at the moment it is in test / proof of concept and has 
>> manual fencing in place but I cannot for the life of me get the 2 
>> nodes to join to the one cluster (they both report joined in crm_tool 
>> status but only to a local clustered instance if that makes any 
>> sence)
>>
>> I have tried to use luci and system-config-cluster to get this up and 
>> running and have been at it over a week, the network guys swear that 
>> there is nothing blocking multicast traffic between them and the 
>> firewalls have been disabled (they are on the same VLAN but connected 
>> to different switches) servers have been rebuilt and have RHEL 5.5 
>> installed
>>
>
> your problem is the multicast traffic - check with tcpdump if it is 
> comming to the other server at all (network) and if it is, then 
> doublecheck the firewall.
> alternatively you may try using broadcast instead of multicast
>
>> Shared Storage is being provided by an Active Active DRBD setup 
>> (tested and working)
>>
>> I have attached a copy of my cluster.conf
>>
>> Thanks in advance
>>
>> Jay
>>
>> --
>>
>> "The only difference between saints and sinners is that every saint 
>> has a past while every sinner has a future. "
>> - Oscar Wilde
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 

"The only difference between saints and sinners is that every saint has a
past while every sinner has a future. "
- Oscar Wilde

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Martin.Waite at datacash.com  Tue Jun 22 08:13:16 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Tue, 22 Jun 2010 09:13:16 +0100
Subject: [Linux-cluster] frozen services are stopped when
	rgmanageris	restarted
In-Reply-To: <CE1D02A6-460F-4731-B6D1-390F108E8535@oneshoeco.com>
References: <A78DB34D00374344A0AB65B6523C05DC05A07CFE@marsden.win.datacash.com>
	<CE1D02A6-460F-4731-B6D1-390F108E8535@oneshoeco.com>
Message-ID: <A78DB34D00374344A0AB65B6523C05DC05A07DB1@marsden.win.datacash.com>



> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Tom Lanyon
> Sent: 22 June 2010 07:55
> To: linux clustering
> Subject: Re: [Linux-cluster] frozen services are stopped when
> rgmanageris restarted
>
> On 22/06/2010, at 1:20 AM, Martin Waite wrote:
> > Hi,
> >
> > RHEL 5.4:  cluster2 (I think).
> >
> > I expected to be able to freeze a service on a node and restart
> > rgmanager on that node without interrupting the service.   In
> > practice, starting rgmanager causes the service to be stopped.
> >
> > Is this what is supposed to happen ?  I thought the whole point of
> > freezing services was to allow maintenance (including restarting
> > cluster software).
> >
> > Are there any options to prevent the services from being stopped
> > when rgmanager is started ?
> >
> > One effect of rgmanager stopping the service is that the cluster
> > reaches an inconsistent state.  Once rgmanager has restarted, the
> > cluster believes that the services are still frozen, where in reality
> > they are stopped.   Any attempt to unfreeze the service causes the
> > service to failover to a standby node.
>
> I did this recently to upgrade rgmanager on a production cluster with
> no downtime to services, however I can't find the reference materials
> I used to do so...
> 
> The basic steps are:
> 
> * freeze all services
> 	/usr/sbin/clusvcadm -Z <service>
> * stop rgmanager
> 	/sbin/service rgmanager stop
> * upgrade rgmanager
> 	yum upgrade rgmanager
> * restart rgmanager manually, using the -N flag
> 	/usr/sbin/clurgmgrd -N
> * wait until rgmanager is running again (check 'clustat' output)
> * unfreeze the services
> 	/usr/sbin/clusvcadm -U <service>
> 
> Tom
> 

Perfect.

Thanks Tom.

regards,
Martin




From jayfitzpatrick at gmail.com  Tue Jun 22 08:39:36 2010
From: jayfitzpatrick at gmail.com (Jason Fitzpatrick)
Date: Tue, 22 Jun 2010 09:39:36 +0100
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <4c207084.5cebd80a.320c.ffff88b0@mx.google.com>
References: <AANLkTimh2jPhuEWrIjuk7nflKERO4kaA_rgOebummhZi@mail.gmail.com>
	<4c207084.5cebd80a.320c.ffff88b0@mx.google.com>
Message-ID: <AANLkTil8y04QP-p031b7UNajMcxqW0Xx-p-fnB3ximPo@mail.gmail.com>

Hi Kit..

Awesome and all as I am when it comes to computers, networking is a
serious weak point ;0)

How would I go about checking if multicast traffic is making it
between the two nodes? They are now hosted on the same ESX server and
therefore only hitting the virtual switch in the ESX, so they should
not have to traverse our network at all.

I will review the ESX switch config and DNS (I have a horrible feeling
that the DNS gremlins are responsible, but am pretty sure that this
should not affect Multicast)

Also, regarding the multicast address that should be used: I am using
244.0.0.1, which I believe is the All Hosts multicast group that contains
all systems on the same network segment, but I am not 100% sure this is
the correct setting.
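
One way to sanity-check an address like this is a small Python sketch using
the standard ipaddress module. It is an illustration only; the function name
classify is hypothetical. IPv4 multicast is the 224.0.0.0/4 range, and the
all-hosts group is 224.0.0.1, so it is worth checking which range the
configured address falls in.

```python
import ipaddress


def classify(addr):
    """Return a rough label for an IPv4 address (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    if ip.is_multicast:      # 224.0.0.0/4 for IPv4
        return "multicast"
    if ip.is_reserved:       # 240.0.0.0/4 for IPv4
        return "reserved"
    return "other"


if __name__ == "__main__":
    for addr in ("244.0.0.1", "224.0.0.1", "239.192.249.206"):
        print(addr, classify(addr))
```

If the address in cluster.conf is not reported as multicast, that alone could
explain nodes not seeing each other.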

Thanks again

Jay

On 22 June 2010 09:12, Kit Gerrits <kitgerrits at gmail.com> wrote:
>
> Keep in mind that multicast requires a multicast router to handle the
> traffic.
> Mere Layer2 connectivity is not enough.
>
> If broadcast does work, that might be your problem.
>
> Kit
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jason Fitzpatrick
> Sent: maandag 21 juni 2010 18:14
> To: linux clustering
> Subject: Re: [Linux-cluster] Basic Active Active File Server
>
> Hi..
>
> I have tried both multicast and broadcast to no avail, as above I am moving
> the systems to the same ESX to try and rule out the networking end of
> things, I have not tried the tcpdump but was running wireshark in an attempt
> to do the same as you recommended
>
> Jay
>
> On 21 June 2010 16:49, Kaloyan Kovachev <kkovachev at varna.net> wrote:
>> Hi,
>> On Mon, 21 Jun 2010 16:07:34 +0100, Jason Fitzpatrick
>> <jayfitzpatrick at gmail.com> wrote:
>>> Hi all
>>>
>>> I am having no end of trouble getting a basic Active Active Cluster
>>> working. at the moment it is in test / proof of concept and has
>>> manual fencing in place but I cannot for the life of me get the 2
>>> nodes to join to the one cluster (they both report joined in crm_tool
>>> status but only to a local clustered instance if that makes any
>>> sense)
>>>
>>> I have tried to use luci and system-config-cluster to get this up and
>>> running and have been at it over a week, the network guys swear that
>>> there is nothing blocking multicast traffic between them and the
>>> firewalls have been disabled (they are on the same VLAN but connected
>>> to different switches) servers have been rebuilt and have RHEL 5.5
>>> installed
>>>
>>
>> Your problem is the multicast traffic - check with tcpdump whether it is
>> coming to the other server at all (network) and, if it is, then
>> double-check the firewall.
>> Alternatively you may try using broadcast instead of multicast.
>>
>>> Shared Storage is being provided by an Active Active DRBD setup
>>> (tested and working)
>>>
>>> I have attached a copy of my cluster.conf
>>>
>>> Thanks in advance
>>>
>>> Jay
>>>
>>> --
>>>
>>> "The only difference between saints and sinners is that every saint
>>> has a past while every sinner has a future. "
>>> - Oscar Wilde
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> --
>
> "The only difference between saints and sinners is that every saint has a
> past while every sinner has a future. "
> - Oscar Wilde
>
>
>



-- 

"The only difference between saints and sinners is that every saint
has a past while every sinner has a future. "
- Oscar Wilde



From jayfitzpatrick at gmail.com  Tue Jun 22 10:34:01 2010
From: jayfitzpatrick at gmail.com (Jason Fitzpatrick)
Date: Tue, 22 Jun 2010 11:34:01 +0100
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <AANLkTil8y04QP-p031b7UNajMcxqW0Xx-p-fnB3ximPo@mail.gmail.com>
References: <AANLkTimh2jPhuEWrIjuk7nflKERO4kaA_rgOebummhZi@mail.gmail.com>
	<4c207084.5cebd80a.320c.ffff88b0@mx.google.com>
	<AANLkTil8y04QP-p031b7UNajMcxqW0Xx-p-fnB3ximPo@mail.gmail.com>
Message-ID: <AANLkTilC2Ia_daUdS_AH8kkGemP-8MQicJILunN-uYRA@mail.gmail.com>

Hi all

Well, I stripped out everything and started again, and now the cluster
seems to be up and running.

I believe that I had some ocfs2 drm issues (installed to test this
cluster using heartbeat) and these were preventing the cluster from
coming online cleanly.

Thanks for all the help

Jay



-- 

"The only difference between saints and sinners is that every saint
has a past while every sinner has a future. "
- Oscar Wilde



From jayfitzpatrick at gmail.com  Tue Jun 22 11:43:36 2010
From: jayfitzpatrick at gmail.com (Jason Fitzpatrick)
Date: Tue, 22 Jun 2010 12:43:36 +0100
Subject: [Linux-cluster] Samba Statefull Failover
Message-ID: <AANLkTik3uHPcJm_M4PkVNEMZ24z2iHtYa_uU3s-Ol4Cp@mail.gmail.com>

Hi all

Just wondering if it is possible to statefully migrate SMB connections
between cluster nodes. I am running CTDB (Samba's clustering software),
but all connections are dropped when the service is failed over between
nodes.

Setup is as follows

2 node cluster
DRBD backend shared storage in Master Master configuration
cman presenting GFS2 /Storage folder
Samba + Winbind + CTDB used to present /Storage/Test_Share via
\\clustername\test_share (both nodes are AD integrated)

Connections to replicated storage are working fine, AD accounts are
authenticated correctly and smbstatus shows that CTDB is load
balancing the cluster address between nodes correctly.

When I run ctdb shutdown I expect existing connections to be migrated
to the other active node, but instead all connections are dropped (to
both nodes) and all active file transfers / locks are lost. I assume
that I need to have the lockdb on the shared storage, but I haven't
a monkey's how to do it.

logs / configs available on request

Thanks a mill

Jay

PS free beer for anyone who can sort this for me, I have been at this for
2 weeks with heartbeat clusters and got nowhere!



-- 

"The only difference between saints and sinners is that every saint
has a past while every sinner has a future. "
- Oscar Wilde



From adas at redhat.com  Tue Jun 22 14:12:14 2010
From: adas at redhat.com (Abhijith Das)
Date: Tue, 22 Jun 2010 10:12:14 -0400 (EDT)
Subject: [Linux-cluster] Samba Statefull Failover
In-Reply-To: <AANLkTik3uHPcJm_M4PkVNEMZ24z2iHtYa_uU3s-Ol4Cp@mail.gmail.com>
Message-ID: <1566907717.695201277215934632.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>


Hi Jason,

Even with CTDB, migration of active connections/locks is not possible and in
fact, from what I understand, is not required.
Windows clients reconnect and reacquire locks. As far as file transfers go, it
really depends on the application and the kind of operations it performs.
Most apps shouldn't notice anything beyond slowing down for a second or so.

Cheers!
--Abhi



From adas at redhat.com  Tue Jun 22 14:33:44 2010
From: adas at redhat.com (Abhijith Das)
Date: Tue, 22 Jun 2010 10:33:44 -0400 (EDT)
Subject: [Linux-cluster] Samba Statefull Failover
In-Reply-To: <AANLkTik3uHPcJm_M4PkVNEMZ24z2iHtYa_uU3s-Ol4Cp@mail.gmail.com>
Message-ID: <406355108.698261277217224346.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>


----- "Jason Fitzpatrick" <jayfitzpatrick at gmail.com> wrote:

> When I run ctdb shutdown I expect existing connections to be migrated

Also, I think "ctdb shutdown" is not the right command; use "ctdb disable"
instead (it should all be in the man pages). Only one node should fail so that
the other node can take over the IP address. If the IP address is not taken
over, the clients will probably not be able to reconnect.

Cheers!
--Abhi



From Frank.de.Groodt at interaccess.nl  Tue Jun 22 14:53:08 2010
From: Frank.de.Groodt at interaccess.nl (Frank de Groodt)
Date: Tue, 22 Jun 2010 16:53:08 +0200
Subject: [Linux-cluster] Samba Statefull Failover
In-Reply-To: <406355108.698261277217224346.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
References: <AANLkTik3uHPcJm_M4PkVNEMZ24z2iHtYa_uU3s-Ol4Cp@mail.gmail.com>,
	<406355108.698261277217224346.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Message-ID: <42A27C666486E746BD0C6A60142EC3AB03275B45FE@NTHVSEXCHMAIL01.interaccess.nl>

Make sure you use the virtual public IP addresses managed by CTDB, not the ones bound to your NICs.

Frank.



From jeff.sturm at eprize.com  Tue Jun 22 15:47:18 2010
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Tue, 22 Jun 2010 11:47:18 -0400
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <AANLkTil8y04QP-p031b7UNajMcxqW0Xx-p-fnB3ximPo@mail.gmail.com>
References: <AANLkTimh2jPhuEWrIjuk7nflKERO4kaA_rgOebummhZi@mail.gmail.com><4c207084.5cebd80a.320c.ffff88b0@mx.google.com>
	<AANLkTil8y04QP-p031b7UNajMcxqW0Xx-p-fnB3ximPo@mail.gmail.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F055D9599@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Jason Fitzpatrick
> Sent: Tuesday, June 22, 2010 4:40 AM
> To: Kit Gerrits
> Cc: linux clustering
> Subject: Re: [Linux-cluster] Basic Active Active File Server
> 
> Hi Kit..
> 
> Awesome and all as I am when it comes to computers, networking is a
> serious weak point ;0)
> 
> How would I go about checking if multicast traffic is making it
> between the two nodes, ...

When I suspect multicast problems I like to do a sanity check with
multicast ping.

First, on all cluster hosts make sure you enable ICMP multicast
responses:

    echo 0 >/proc/sys/net/ipv4/icmp_echo_ignore_broadcasts

Second, find your multicast address on a cluster node.  This is shown by
"cman_tool status" or "ip maddr":

    # ip maddr
    1:      lo
            inet  224.0.0.1
    2:      eth0
            link  01:00:5e:00:00:01
            inet  224.0.0.1
    3:      eth1
            link  01:00:5e:40:f9:ce
            link  01:00:5e:00:00:01
            inet  239.192.249.206    <---- my multicast address
            inet  224.0.0.1

Noting the address and interface name above, try a multicast ping with a
count of at least 2:

    # ping -I eth1 -b -L 239.192.249.206 -c 2
    PING 239.192.249.206 (239.192.249.206) from 10.65.3.166 eth1: 56(84) bytes of data.
    64 bytes from 10.65.3.102: icmp_seq=1 ttl=64 time=0.394 ms
    64 bytes from 10.65.3.86: icmp_seq=1 ttl=64 time=0.415 ms (DUP!)
    64 bytes from 10.65.3.182: icmp_seq=1 ttl=64 time=0.418 ms (DUP!)
    64 bytes from 10.65.3.134: icmp_seq=1 ttl=64 time=0.420 ms (DUP!)
    64 bytes from 10.65.3.87: icmp_seq=1 ttl=64 time=0.971 ms (DUP!)
    64 bytes from 10.65.3.183: icmp_seq=1 ttl=64 time=0.985 ms (DUP!)
    64 bytes from 10.65.3.103: icmp_seq=1 ttl=64 time=0.987 ms (DUP!)
    64 bytes from 10.65.3.167: icmp_seq=1 ttl=64 time=0.990 ms (DUP!)
    64 bytes from 10.65.3.135: icmp_seq=1 ttl=64 time=0.992 ms (DUP!)
    64 bytes from 10.65.3.134: icmp_seq=2 ttl=64 time=0.486 ms

9 ping responses on a cluster of size 10.  Looks good.  Repeat this test
on each cluster member--you should see a consistent number of replies.
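
Counting the replies by eye gets tedious on larger clusters. As a rough
sketch (Python, not part of the original procedure; the function name
responders and the shortened sample are assumptions), the distinct responders
can be pulled out of saved ping output like this:

```python
import re

# Matches the source address in lines such as
# "64 bytes from 10.65.3.102: icmp_seq=1 ttl=64 time=0.394 ms"
REPLY = re.compile(r"bytes from (\d+\.\d+\.\d+\.\d+):")


def responders(ping_output):
    """Return the distinct replying addresses in order of first appearance."""
    seen = []
    for line in ping_output.splitlines():
        match = REPLY.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Shortened sample taken from the ping output above.
SAMPLE = """\
64 bytes from 10.65.3.102: icmp_seq=1 ttl=64 time=0.394 ms
64 bytes from 10.65.3.86: icmp_seq=1 ttl=64 time=0.415 ms (DUP!)
64 bytes from 10.65.3.134: icmp_seq=1 ttl=64 time=0.420 ms (DUP!)
64 bytes from 10.65.3.134: icmp_seq=2 ttl=64 time=0.486 ms
"""

if __name__ == "__main__":
    hosts = responders(SAMPLE)
    print(len(hosts), "distinct responders:", ", ".join(hosts))
```

Running it against the output collected on each member and comparing the
counts gives the same consistency check without counting lines manually.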

-Jeff





From crh at ubiqx.mn.org  Tue Jun 22 16:08:35 2010
From: crh at ubiqx.mn.org (Christopher R. Hertel)
Date: Tue, 22 Jun 2010 11:08:35 -0500
Subject: [Linux-cluster] Samba Statefull Failover
In-Reply-To: <42A27C666486E746BD0C6A60142EC3AB03275B45FE@NTHVSEXCHMAIL01.interaccess.nl>
References: <AANLkTik3uHPcJm_M4PkVNEMZ24z2iHtYa_uU3s-Ol4Cp@mail.gmail.com>,
	<406355108.698261277217224346.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
	<42A27C666486E746BD0C6A60142EC3AB03275B45FE@NTHVSEXCHMAIL01.interaccess.nl>
Message-ID: <4C20E003.2060608@ubiqx.mn.org>

Please note that the SMB/CIFS protocol itself does not gracefully recover
from a failover.  SMB2 is much better in this regard.  This is a client-side
problem due to limitations in the protocol and client expectations.

Chris -)-----


-- 
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org



From Martin.Waite at datacash.com  Tue Jun 22 16:18:50 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Tue, 22 Jun 2010 17:18:50 +0100
Subject: [Linux-cluster] running clurgmgr directly causes clustat malfunction
Message-ID: <A78DB34D00374344A0AB65B6523C05DC05A08126@marsden.win.datacash.com>

Hi,

RHEL 5.4: cluster2.

Following Tom's advice from earlier today, in order to work around a
problem with starting rgmanager causing frozen services to stop, I
started /usr/sbin/clurgmgrd directly rather than through an init.d
script.   This enables the "-N" flag to be passed in on the command
line.

However, starting rgmanager this way (with or without the -N flag)
causes problems with local invocations of clustat - ie. rgmanager cannot
be seen in its output.  (clustat run on other cluster nodes DO see
rgmanager on this node however).

I have waited for minutes after invoking /usr/sbin/clurgmgrd for it to
show up in clustat output, but with no joy.

I have traced through the init.d script and cannot see that very much
happens in there to affect how clurgmgrd is run.

Any ideas anyone ?

regards,
Martin

Eg:

[martin at cp1edidbm001 ~]$ sudo /sbin/service rgmanager stop
Shutting down Cluster Service Manager...
Waiting for services to stop:                              [  OK  ]
Cluster Service Manager is stopped.

[martin at cp1edidbm001 ~]$ sudo /usr/sbin/clurgmgrd start

[martin at cp1edidbm001 ~]$ sudo /usr/sbin/clustat
Cluster Status for EDISV1DBM @ Tue Jun 22 17:08:32 2010
Member Status: Quorate

 Member Name                                           ID   Status
 ------ ----                                           ---- ------
 svXprdclu001                                              1 Online, Local
 svXprdclu002                                              2 Online
 svXprdclu003                                              3 Online
 svXprdclu004                                              4 Online
 svXprdclu005                                              5 Online

[martin at cp1edidbm001 ~]$ sudo /sbin/service rgmanager stop
Shutting down Cluster Service Manager...
Waiting for services to stop:                              [  OK  ]
Cluster Service Manager is stopped.

[martin at cp1edidbm001 ~]$ sudo /sbin/service rgmanager start
Starting Cluster Service Manager:                          [  OK  ]

[martin at cp1edidbm001 ~]$ sudo /usr/sbin/clustat
Cluster Status for EDISV1DBM @ Tue Jun 22 17:09:32 2010
Member Status: Quorate

 Member Name                                           ID   Status
 ------ ----                                           ---- ------
 svXprdclu001                                              1 Online, Local, rgmanager
 svXprdclu002                                              2 Online, rgmanager
 svXprdclu003                                              3 Online, rgmanager
 svXprdclu004                                              4 Online, rgmanager
 svXprdclu005                                              5 Online, rgmanager

 Service Name                                 Owner (Last)                 State
 ------- ----                                 ----- ------                 -----
 service:ACTIVESITE                           svXprdclu001                 started    [Z]
 service:MASTERVIP                            svXprdclu001                 started    [Z]


From tom+linux-cluster at oneshoeco.com  Wed Jun 23 00:40:17 2010
From: tom+linux-cluster at oneshoeco.com (Tom Lanyon)
Date: Wed, 23 Jun 2010 10:10:17 +0930
Subject: [Linux-cluster] running clurgmgr directly causes clustat
	malfunction
In-Reply-To: <A78DB34D00374344A0AB65B6523C05DC05A08126@marsden.win.datacash.com>
References: <A78DB34D00374344A0AB65B6523C05DC05A08126@marsden.win.datacash.com>
Message-ID: <0669A4DB-95CD-47D4-93DC-8C9D06626347@oneshoeco.com>

On 23/06/2010, at 1:48 AM, Martin Waite wrote:

> Hi,
>  
> RHEL 5.4: cluster2.
>  
> Following Tom's advice from earlier today, in order to work around a problem with starting rgmanager causing frozen services to stop, I started /usr/sbin/clurgmgrd directly rather than through an init.d script.   This enables the "-N" flag to be passed in on the command line.
>  
> However, starting rgmanager this way (with or without the -N flag) causes problems with local invocations of clustat - ie. rgmanager cannot be seen in its output.  (clustat run on other cluster nodes DO see rgmanager on this node however). 
>  
> I have waited for minutes after invoking /usr/sbin/clurgmgrd for it to show up in clustat output, but with no joy.
>  
> I have traced through the init.d script and cannot see that very much happens in there to affect how clurgmgrd is run.
>  
> Any ideas anyone ?

When you run "clurgmgrd -N" manually, have you checked /var/log/messages to see whether it is indeed starting correctly?

You could also try running clurgmgrd with the -f and -d flags to run in the foreground and enable debugging, so you can see what's going on.

FYI it works for me on the following - perhaps you've just found a cman/rgmanager incompatibility?
	cman-2.0.98-1.el5_3.4
	openais-0.80.3-22.el5_3.8
	rgmanager-2.0.52-6.el5


> [martin at cp1edidbm001 ~]$ sudo /sbin/service rgmanager stop
> [martin at cp1edidbm001 ~]$ sudo /usr/sbin/clurgmgrd start

Are you actually running this verbatim? If so, you have the wrong command :) - it should be:
	$ sudo /usr/sbin/clurgmgrd -N

>  [martin at cp1edidbm001 ~]$ sudo /usr/sbin/clustat


Regards,
Tom



From kitgerrits at gmail.com  Wed Jun 23 07:32:44 2010
From: kitgerrits at gmail.com (Kit Gerrits)
Date: Wed, 23 Jun 2010 09:32:44 +0200
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <AANLkTil8y04QP-p031b7UNajMcxqW0Xx-p-fnB3ximPo@mail.gmail.com>
Message-ID: <4c21b89d.6161e30a.3abf.5990@mx.google.com>


Hey Jay,

I'm not at work at the moment, but this should get you started:

1/ The simplest test is to tell clustering to use broadcast instead of
multicast.
# In a single ESX server, you can use a host-only vSwitch for that.
# Disable multicast by removing the multicast reference from the cluster
# configuration and restart the cluster


You can check multicast traffic between nodes in 2 ways:
2/ dumping packets
#relatively simple

tcpdump -i <interface> ip multicast
# I'm not sure (can't test from here), else try:
tcpdump -i <interface> ether multicast

# That should show multicast packets traveling over the interface to and
# from the hosts and the multicast IP


3/ ping tests

# Enable responses to broadcast pings in both host O/S'es, then ping
# them on their multicast address:
# http://kerneltrap.org/node/16225
echo "0" > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts

# find out the multicast IP of the cluster 
cman_tool status
# Ex: Multicast addresses: 239.192.12.239

# ping the multicast IP from each host
ping -L 239.192.12.239
#You should see ping replies.
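
# The lookup step can be scripted. A rough Python sketch, assuming only that
# the saved output of "cman_tool status" contains a line of the form shown in
# the example above; the function name multicast_addresses is hypothetical
# and the other lines of the real output are not modelled here.

```python
def multicast_addresses(status_text):
    """Return the addresses listed on the 'Multicast addresses:' line."""
    for line in status_text.splitlines():
        if line.strip().startswith("Multicast addresses:"):
            # Everything after the first colon, split on whitespace.
            return line.split(":", 1)[1].split()
    return []


if __name__ == "__main__":
    # Only this one line is taken from the example above.
    sample = "Multicast addresses: 239.192.12.239"
    print(multicast_addresses(sample))
```

# The returned address can then be fed to the ping test on each host.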


About DNS:
Clustering will use the IP that the hostname resolves to.
The interface that IP resolves to will be used for multicast traffic.
If you need to use another interface, give the IP on that interface its own
hostname and put that in the cluster config.

About 224.0.0.1:
I'm not sure, either.
I will try it out at the office.


Footnote:
http://sourceware.org/cluster/doc/usage.txt
Advanced Network Configuration
------------------------------

* UDP Port

CMAN uses UDP port 6809 by default.  A different port number can be used by:

<cman port="6809">
</cman>


* Multicast

CMAN can be configured to use multicast instead of broadcast (broadcast is
used by default if no multicast parameters are given.)  To configure
multicast add one line under the <cman> section and another under the
<clusternode> section:

<cman>
    <multicast addr="224.0.0.1"/>
</cman>

<clusternode name="nd1">
    <multicast addr="224.0.0.1" interface="eth0"/>
</clusternode>

The multicast addresses must match and the address must be usable on the
interface name given for the node.
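
The "addresses must match" rule is easy to get wrong by hand. Below is a
small Python sketch for checking it, offered as an illustration only: the
function name mismatched_nodes is hypothetical, and the sample wraps the two
fragments above in an assumed <cluster>/<clusternodes> layout with a made-up
second node that deliberately uses a different address.

```python
import xml.etree.ElementTree as ET

# Assumed layout for illustration; node "nd2" and its address are made up.
SAMPLE = """
<cluster name="example">
  <cman>
    <multicast addr="224.0.0.1"/>
  </cman>
  <clusternodes>
    <clusternode name="nd1">
      <multicast addr="224.0.0.1" interface="eth0"/>
    </clusternode>
    <clusternode name="nd2">
      <multicast addr="224.0.0.2" interface="eth0"/>
    </clusternode>
  </clusternodes>
</cluster>
"""


def mismatched_nodes(conf_xml):
    """Return (node name, node addr) pairs that differ from the <cman> addr."""
    root = ET.fromstring(conf_xml.strip())
    cman_mc = root.find("./cman/multicast")
    if cman_mc is None:
        return []  # no multicast configured under <cman>, nothing to compare
    expected = cman_mc.get("addr")
    bad = []
    for node in root.iter("clusternode"):
        mc = node.find("multicast")
        addr = mc.get("addr") if mc is not None else None
        if addr != expected:
            bad.append((node.get("name"), addr))
    return bad


if __name__ == "__main__":
    print(mismatched_nodes(SAMPLE))
```

An empty result means every node agrees with the <cman> section; anything
listed is worth fixing before restarting the cluster.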



Regards,

Kit

-----Original Message-----
From: Jason Fitzpatrick [mailto:jayfitzpatrick at gmail.com] 
Sent: dinsdag 22 juni 2010 10:40
To: Kit Gerrits
Cc: linux clustering
Subject: Re: [Linux-cluster] Basic Active Active File Server

Hi Kit..

Awesome and all as I am when it comes to computers, networking is a serious
weak point ;0)

How would I go about checking whether multicast traffic is making it between
the two nodes? They are now hosted on the same ESX server and therefore only
hit the virtual switch in the ESX host, so the traffic should not have to
traverse our network at all.

I will review the ESX switch config and DNS (I have a horrible feeling that
the DNS gremlins are responsible, but am pretty sure that this should not
affect Multicast)

Also, regarding the multicast address that should be used: I am using
244.0.0.1, which I believe is the All Hosts multicast group that contains all
systems on the same network segment, but I am not 100% sure this is the
correct setting.

Thanks again

Jay

On 22 June 2010 09:12, Kit Gerrits <kitgerrits at gmail.com> wrote:
>
> Keep in mind that multicast requires a multicast router to handle the 
> traffic.
> Mere Layer2 connectivity is not enough.
>
> If broadcast does work, that might be your problem.
>
> Kit
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jason 
> Fitzpatrick
> Sent: maandag 21 juni 2010 18:14
> To: linux clustering
> Subject: Re: [Linux-cluster] Basic Active Active File Server
>
> Hi..
>
> I have tried both multicast and broadcast to no avail, as above I am 
> moving the systems to the same ESX to try and rule out the networking 
> end of things, I have not tried the tcpdump but was running wireshark 
> in an attempt to do the same as you recommended
>
> Jay
>
> On 21 June 2010 16:49, Kaloyan Kovachev <kkovachev at varna.net> wrote:
>> Hi,
>> On Mon, 21 Jun 2010 16:07:34 +0100, Jason Fitzpatrick 
>> <jayfitzpatrick at gmail.com> wrote:
>>> Hi all
>>>
>>> I am having no end of trouble getting a basic Active Active Cluster 
>>> working. at the moment it is in test / proof of concept and has 
>>> manual fencing in place but I cannot for the life of me get the 2 
>>> nodes to join to the one cluster (they both report joined in 
>>> crm_tool status but only to a local clustered instance if that makes
>>> any sense)
>>>
>>> I have tried to use luci and system-config-cluster to get this up 
>>> and running and have been at it over a week, the network guys swear 
>>> that there is nothing blocking multicast traffic between them and 
>>> the firewalls have been disabled (they are on the same VLAN but 
>>> connected to different switches) servers have been rebuilt and have 
>>> RHEL 5.5 installed
>>>
>>
>> your problem is the multicast traffic - check with tcpdump if it is
>> coming to the other server at all (network) and if it is, then
>> double-check the firewall.
>> alternatively you may try using broadcast instead of multicast
>>
>>> Shared Storage is being provided by an Active Active DRBD setup 
>>> (tested and working)
>>>
>>> I have attached a copy of my cluster.conf
>>>
>>> Thanks in advance
>>>
>>> Jay
>>>
>>> --
>>>
>>> "The only difference between saints and sinners is that every saint 
>>> has a past while every sinner has a future. "
>>> - Oscar Wilde
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> --
>
> "The only difference between saints and sinners is that every saint 
> has a past while every sinner has a future. "
> - Oscar Wilde
>
>
>



-- 

"The only difference between saints and sinners is that every saint has a
past while every sinner has a future. "
- Oscar Wilde



From Martin.Waite at datacash.com  Wed Jun 23 08:59:31 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Wed, 23 Jun 2010 09:59:31 +0100
Subject: [Linux-cluster] running clurgmgr directly causes
	clustatmalfunction
In-Reply-To: <0669A4DB-95CD-47D4-93DC-8C9D06626347@oneshoeco.com>
References: <A78DB34D00374344A0AB65B6523C05DC05A08126@marsden.win.datacash.com>
	<0669A4DB-95CD-47D4-93DC-8C9D06626347@oneshoeco.com>
Message-ID: <A78DB34D00374344A0AB65B6523C05DC05A08216@marsden.win.datacash.com>



> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tom Lanyon
> Sent: 23 June 2010 01:40
> To: linux clustering
> Subject: Re: [Linux-cluster] running clurgmgr directly causes
> clustatmalfunction
>
> On 23/06/2010, at 1:48 AM, Martin Waite wrote:
>
> > Hi,
> >
> > RHEL 5.4: cluster2.
> >
> > Following Tom's advice from earlier today, in order to work around a
> > problem with starting rgmanager causing frozen services to stop, I
> > started /usr/sbin/clurgmgrd directly rather than through an init.d
> > script.  This enables the "-N" flag to be passed in on the command line.
> >
> > However, starting rgmanager this way (with or without the -N flag)
> > causes problems with local invocations of clustat - ie. rgmanager cannot
> > be seen in its output.  (clustat run on other cluster nodes DO see
> > rgmanager on this node however).
> >
> > I have waited for minutes after invoking /usr/sbin/clurgmgrd for it to
> > show up in clustat output, but with no joy.
> >
> > I have traced through the init.d script and cannot see that very much
> > happens in there to affect how clurgmgrd is run.
> >
> > Any ideas anyone ?
>
> When you run "clurgmgrd -N" manually, have you checked /var/log/messages
> to see whether it is indeed starting correctly?
>
> You could also try running clurgmgrd with the -f and -d flags to run in
> the foreground and enable debugging, so you can see what's going on.
>
> FYI it works for me on the following - perhaps you've just found a
> cman/rgmanager incompatibility?
> 	cman-2.0.98-1.el5_3.4
> 	openais-0.80.3-22.el5_3.8
> 	rgmanager-2.0.52-6.el5
>
> > [martin at cp1edidbm001 ~]$ sudo /sbin/service rgmanager stop
> > [martin at cp1edidbm001 ~]$ sudo /usr/sbin/clurgmgrd start
>
> Are you actually running this verbatim? If so, you have the wrong
> command :) - it should be:
> 	$ sudo /usr/sbin/clurgmgrd -N
>
> >  [martin at cp1edidbm001 ~]$ sudo /usr/sbin/clustat
>

Hi Tom,

I was running that verbatim.   I re-ran the sequence:

[martin at cp1edidbm001 ~]$ sudo /etc/init.d/rgmanager stop
Shutting down Cluster Service Manager...
Waiting for services to stop:                              [  OK  ]
Cluster Service Manager is stopped.
[martin at cp1edidbm001 ~]$ sudo /usr/sbin/clurgmgrd -f -d -N
[20839] info: I am node #1
[20839] debug: Fence domain already joined or no fencing configured
[20839] notice: Resource Group Manager Starting
[20839] info: Loading Service Data
[20839] debug: Loading Resource Rules
[20839] debug: 0 rules loaded
[20839] debug: Building Resource Trees
[20839] debug: 0 resources defined
[20839] debug: Loading Failover Domains
[20839] debug: 2 domains defined
[20839] debug: 1 events defined
[20839] info: Skipping stop-before-start: overridden by administrator
[20839] debug: Event: Port Opened
[20839] info: State change: Local UP
[20839] info: State change: svXprdclu002 UP
[20839] info: State change: svXprdclu003 UP
[20839] info: State change: svXprdclu004 UP
[20839] info: State change: svXprdclu005 UP
[20882] debug: Event (1:1:1) Processed
[20882] debug: Event (0:2:1) Processed
[20882] debug: Event (0:3:1) Processed
[20882] debug: Event (0:4:1) Processed
[20882] debug: Event (0:5:1) Processed
[20882] debug: 5 events processed

other window....

[martin at cp1edidbm001 ~]$ sudo /usr/sbin/clustat
Cluster Status for EDISV1DBM @ Wed Jun 23 09:53:06 2010
Member Status: Quorate

 Member Name                                  ID   Status
 ------ ----                                  ---- ------
 svXprdclu001                                     1 Online, Local
 svXprdclu002                                     2 Online
 svXprdclu003                                     3 Online
 svXprdclu004                                     4 Online
 svXprdclu005                                     5 Online

So still no rgmanager output.
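
For anyone wanting to script this check, here is a rough Python sketch. It
assumes that when rgmanager is visible, clustat adds an "rgmanager" flag to
the member's Status column (for example "Online, Local, rgmanager"); that
format is my assumption and is not shown in the output above. The function
name is made up for this illustration.

```python
CLUSTAT_SAMPLE = """\
 Member Name                                  ID   Status
 ------ ----                                  ---- ------
 svXprdclu001                                     1 Online, Local
 svXprdclu002                                     2 Online
"""


def members_with_rgmanager(clustat_text):
    """Return member names whose status flags include 'rgmanager'."""
    found = []
    for line in clustat_text.splitlines():
        fields = line.split(None, 2)  # name, node id, status flags
        # Member rows have a numeric ID in the second column; the header
        # and separator rows do not, so they are skipped here.
        if len(fields) == 3 and fields[1].isdigit():
            name, _node_id, status = fields
            flags = [flag.strip() for flag in status.split(",")]
            if "rgmanager" in flags:
                found.append(name)
    return found


print(members_with_rgmanager(CLUSTAT_SAMPLE))
```

With the member table pasted above, the list comes back empty, which
matches what was observed by eye.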

My versions of the packages are different:

[martin at cp1edidbm001 ~]$ rpm -qa | egrep "rgmanager|cman|openais"
openais-0.80.6-16.el5_5.1
cman-2.0.115-34.el5
rgmanager-2.0.52-1.el5_4.3

There must be something happening in the init.d script that enables this
to work.  I'll explore the environment variables later.

regards,
Martin




From Martin.Waite at datacash.com  Wed Jun 23 13:45:06 2010
From: Martin.Waite at datacash.com (Martin Waite)
Date: Wed, 23 Jun 2010 14:45:06 +0100
Subject: [Linux-cluster] running clurgmgr directly
	causesclustatmalfunction
In-Reply-To: <A78DB34D00374344A0AB65B6523C05DC05A08216@marsden.win.datacash.com>
References: <A78DB34D00374344A0AB65B6523C05DC05A08126@marsden.win.datacash.com><0669A4DB-95CD-47D4-93DC-8C9D06626347@oneshoeco.com>
	<A78DB34D00374344A0AB65B6523C05DC05A08216@marsden.win.datacash.com>
Message-ID: <A78DB34D00374344A0AB65B6523C05DC05A0847D@marsden.win.datacash.com>




Hi,

I found the cause of the problem:  we run a sudo environment that
restricts exec permission to a specified list of programs.  

The init.d script was covered by this - and so worked fine - but
/usr/sbin/clurgmgrd was not.

The problem was solved by adding /usr/sbin/clurgmgrd to the list of
programs allowed to exec under sudo.

regards,
Martin





From jayfitzpatrick at gmail.com  Wed Jun 23 17:47:23 2010
From: jayfitzpatrick at gmail.com (jayfitzpatrick at gmail.com)
Date: Wed, 23 Jun 2010 17:47:23 +0000
Subject: [Linux-cluster] Basic Active Active File Server
In-Reply-To: <4c21b89d.6161e30a.3abf.5990@mx.google.com>
Message-ID: <0015174c34945c535d0489b621eb@google.com>

Hi All

Thanks for all the advice, guys. I will run through these tomorrow (Pearl
Jam last night, enough said!)

Jay

On Jun 23, 2010 8:32am, Kit Gerrits <kitgerrits at gmail.com> wrote:
> Hey Jay,
>
> I'm not at work at the moment, but this should get you started:
>
> 1/ The simplest test is to tell clustering to use broadcast instead of
> multicast.
> # In a single ESX server, you can use a host-only vSwitch for that.
> # Disable multicast by removing the multicast reference from the cluster
> # configuration and restart the cluster
>
> You can check multicast traffic between nodes in 2 ways:
>
> 2/ dumping packets
> # relatively simple
> tcpdump -i <interface> ip multicast
> # I'm not sure (can't test from here), else try:
> tcpdump -i <interface> ether multicast
> # That should show multicast packets traveling over the interface to and
> # from the hosts and the multicast IP
>
> 3/ ping tests
>
> # By enabling responses to broadcast pings in both host O/Ses and pinging
> # them on their multicast address:
> # http://kerneltrap.org/node/16225
> echo "0" > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts
>
> # find out the multicast IP of the cluster
> cman_tool status
> # Ex: Multicast addresses: 239.192.12.239
>
> # ping the multicast IP from each host
> ping -L 239.192.12.239
> # You should see ping replies.
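
As a complement to the tcpdump and ping tests quoted above, multicast
delivery can also be exercised with a small UDP probe. This is only a
sketch under stated assumptions: the port (50000) and the function names
are made up for illustration and have nothing to do with cman itself, and
the group 239.192.12.239 is simply the example address from the quoted
text.

```python
import socket


def membership_request(group, local_ip="0.0.0.0"):
    """Build the ip_mreq bytes for IP_ADD_MEMBERSHIP (group + local interface)."""
    return socket.inet_aton(group) + socket.inet_aton(local_ip)


def open_listener(group, port):
    """Join the multicast group and return a UDP socket bound to the port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock


def send_probe(group, port, payload=b"mcast-probe", ttl=1):
    """Send one datagram to the group; ttl=1 keeps it on the local segment."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sock.sendto(payload, (group, port))
    sock.close()


# Usage (hypothetical port), run by hand on the two nodes:
#   node A:  data, sender = open_listener("239.192.12.239", 50000).recvfrom(1024)
#   node B:  send_probe("239.192.12.239", 50000)
```

If node A never returns from recvfrom while node B is sending, multicast is
not getting through between the two hosts (or a firewall is dropping it).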





> About DNS:
> Clustering will use the IP that the hostname resolves to.
> The interface that carries that IP will be used for multicast traffic.
> If you need to use another interface, give the IP on that interface its
> own hostname and put that in the cluster config.
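
To see which address a node name resolves to on a given host, something
like the following Python sketch can help. The function names are made up
for this illustration, and treating a loopback result as a problem is my
own assumption (a name that resolves to 127.0.0.1 would not point at a
real network interface).

```python
import ipaddress
import socket


def resolved_address(node_name):
    """Return the IPv4 address the local resolver gives for a node name."""
    return socket.gethostbyname(node_name)


def resolves_to_loopback(node_name):
    """True if the name resolves to a loopback address such as 127.0.0.1."""
    return ipaddress.ip_address(resolved_address(node_name)).is_loopback


# Example with a literal address, which needs no DNS lookup:
print(resolved_address("192.0.2.10"), resolves_to_loopback("192.0.2.10"))
```

Run it on each node with the node names from the cluster config and
compare the results with the addresses configured on the interface you
expect the cluster to use.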



> About 224.0.0.1
> I'm not sure, either.
> I will try it out at the office.





> Footnote:
> http://sourceware.org/cluster/doc/usage.txt
> Advanced Network Configuration
> ------------------------------
>
> * UDP Port
>
> CMAN uses UDP port 6809 by default. A different port number can be used
> by:
>
>
> * Multicast
>
> CMAN can be configured to use multicast instead of broadcast (broadcast is
> used by default if no multicast parameters are given.) To configure
> multicast, add one line under the <cman> section and another under the
> <clusternode> section:
>
> <cman>
>     <multicast addr="224.0.0.1"/>
> </cman>
>
> <clusternode name="nd1">
>     <multicast addr="224.0.0.1" interface="eth0"/>
> </clusternode>
>
> The multicast addresses must match and the address must be usable on the
> interface name given for the node.
>
> Regards,
>
> Kit




-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100623/1ffe3860/attachment.htm>

From jayfitzpatrick at gmail.com  Wed Jun 23 19:23:16 2010
From: jayfitzpatrick at gmail.com (jayfitzpatrick at gmail.com)
Date: Wed, 23 Jun 2010 19:23:16 +0000
Subject: [Linux-cluster] Samba Statefull Failover
In-Reply-To: <406355108.698261277217224346.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Message-ID: <0014852e1ea047b93c0489b77894@google.com>

Hi Abhijith,

The planned setup for us is to host Citrix / Windows users' home directories
on the shared storage. Contained within these directories are some Lotus
Notes config files (address book etc.), and Notes throws a canary whenever
there is a break in the connection to these files and does not attempt a
reconnect, so I was hoping to be able to fail over the file locks. The
directory as presented to the Citrix clients fails over flawlessly: the OS
reconnects to the new server and you would never have known that the
backend has moved. Notes, however, errors all over the place: clicking out
of error messages (x12) is very visible, very irritating, and horrible in a
demo of your new file server with your manager!

I also seem to be getting clients disconnecting from the node that I am not
stopping the service on, but this may just be me; I will re-test tomorrow.

thanks for the feedback

Jay



On Jun 22, 2010 3:33pm, Abhijith Das <adas at redhat.com> wrote:
> ----- "Jason Fitzpatrick" <jayfitzpatrick at gmail.com> wrote:
>
> > From: "Jason Fitzpatrick" <jayfitzpatrick at gmail.com>
> > To: linux-cluster at redhat.com
> > Sent: Tuesday, June 22, 2010 6:43:36 AM GMT -06:00 US/Canada Central
> > Subject: [Linux-cluster] Samba Statefull Failover
> >
> > Hi all
> >
> > Just wondering if it is possible to statefully migrate smb
> > connections
> > between cluster nodes, I am running ctdb (Samba's Cluster software)
> > but all connections are dropped when the service is failed between
> > nodes
> >
> > Setup is as follows
> >
> > 2 node cluster
> > DRBD backend shared storage in Master Master configuration
> > cman presenting GFS2 /Storage folder
> > Samba + Winbind + CTDB used to present /Storage/Test_Share via
> > \\clustername\test_share (both nodes are AD integrated)
> >
> > Connections to replicated storage are working fine, AD accounts are
> > authenticated correctly and smbstatus shows that CTDB is load
> > balancing the cluster address between nodes correctly,
> >
> > When I run ctdb shutdown I expect existing connections to be migrated
>
> Also, I think "ctdb shutdown" is not the right command (Use "ctdb
> disable", it should all be in the man pages). Only one node should fail
> so that the other node can take over the IP address. If the IP address is
> not taken over, the clients will probably not be able to reconnect.
>
> Cheers!
> --Abhi

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100623/0c2ae27b/attachment.htm>

From jayfitzpatrick at gmail.com  Wed Jun 23 19:26:57 2010
From: jayfitzpatrick at gmail.com (jayfitzpatrick at gmail.com)
Date: Wed, 23 Jun 2010 19:26:57 +0000
Subject: [Linux-cluster] Samba Statefull Failover
In-Reply-To: <42A27C666486E746BD0C6A60142EC3AB03275B45FE@NTHVSEXCHMAIL01.interaccess.nl>
Message-ID: <00c09f8c1c556a9f120489b78531@google.com>

Hi Frank

I have been using this feature in an attempt to load balance the 2 servers.
Both servers are presenting the address and are contactable via the shared
address (load is balancing correctly).


Thanks

Jay


On Jun 22, 2010 3:53pm, Frank de Groodt <Frank.de.Groodt at interaccess.nl> wrote:
> Make sure you use virtual public ip addresses managed by CTDB, not the
> ones bound to your NICS.
>
> Frank.
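
The advice above can be turned into a simple check. As far as I know, CTDB
reads its public addresses from a file (usually
/etc/ctdb/public_addresses) with one "address/prefix interface" entry per
line. The Python sketch below parses text in that form and reports any
public address that is also statically bound to a NIC. The sample
addresses (documentation range 192.0.2.0/24) and the function names are
made up for this illustration.

```python
import ipaddress

# Hypothetical file content in "address/prefix interface" form.
SAMPLE_PUBLIC_ADDRESSES = """\
192.0.2.10/24 eth0
192.0.2.11/24 eth0
"""


def parse_public_addresses(text):
    """Return (address, prefix length, interface) tuples, one per entry."""
    entries = []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        iface = ipaddress.ip_interface(fields[0])  # ValueError if malformed
        device = fields[1] if len(fields) > 1 else None
        entries.append((str(iface.ip), iface.network.prefixlen, device))
    return entries


def statically_bound(entries, nic_addresses):
    """Return public addresses that also appear among the NIC addresses."""
    nic_set = {str(ipaddress.ip_address(a)) for a in nic_addresses}
    return [ip for ip, _prefix, _device in entries if ip in nic_set]


entries = parse_public_addresses(SAMPLE_PUBLIC_ADDRESSES)
print(statically_bound(entries, ["192.0.2.11"]))
```

A non-empty result would mean a supposedly floating address is also
configured directly on an interface, which is the situation the advice
above warns against.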


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100623/98b25b01/attachment.htm>

From jayfitzpatrick at gmail.com  Wed Jun 23 19:34:24 2010
From: jayfitzpatrick at gmail.com (Jason Fitzpatrick)
Date: Wed, 23 Jun 2010 20:34:24 +0100
Subject: [Linux-cluster] Samba Statefull Failover
In-Reply-To: <4C20E003.2060608@ubiqx.mn.org>
References: <AANLkTik3uHPcJm_M4PkVNEMZ24z2iHtYa_uU3s-Ol4Cp@mail.gmail.com>
	<406355108.698261277217224346.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
	<42A27C666486E746BD0C6A60142EC3AB03275B45FE@NTHVSEXCHMAIL01.interaccess.nl>
	<4C20E003.2060608@ubiqx.mn.org>
Message-ID: <AANLkTikXizYzAFZKosyl9OM7xuQTuHhnOZj2LipPyEoa@mail.gmail.com>

Hi Chris..

Having read up on the SMB2 protocol, it seems that this is
implemented within Vista and later MS clients, but I do not seem to
be able to track down support for SMB2 within Samba.

I see that there is a Samba4AD project but cannot find rpms for RHEL
and am a bit cagey about putting a dev version onto a critical file
server.

And the Citrix (client) servers are 2003 SMBv1 anyway so that kind of
kills that anyway.

Thanks

Jay

On 22 June 2010 17:08, Christopher R. Hertel <crh at ubiqx.mn.org> wrote:
> Please note that the SMB/CIFS protocol itself does not gracefully recover
> from a failover.  SMB2 is much better in this regard.  This is a client-side
> problem due to limitations in the protocol and client expectations.
>
> Chris -)-----
>
> Frank de Groodt wrote:
>> Make sure you use virtual public ip addresses managed by CTDB, not the ones bound to your NICS.
>>
>> Frank.
>> ________________________________________
>> From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com] On Behalf Of Abhijith Das [adas at redhat.com]
>> Sent: Tuesday, June 22, 2010 4:33 PM
>> To: linux clustering
>> Cc: Sumit Bose; Gunther Deschner; Simo Sorce
>> Subject: Re: [Linux-cluster] Samba Statefull Failover
>>
>> ----- "Jason Fitzpatrick" <jayfitzpatrick at gmail.com> wrote:
>>
>>> From: "Jason Fitzpatrick" <jayfitzpatrick at gmail.com>
>>> To: linux-cluster at redhat.com
>>> Sent: Tuesday, June 22, 2010 6:43:36 AM GMT -06:00 US/Canada Central
>>> Subject: [Linux-cluster] Samba Statefull Failover
>>>
>>> Hi all
>>>
>>> Just wondering if it is possible to statefully migrate smb
>>> connections
>>> between cluster nodes, I am running ctdb (Samba's Cluster software)
>>> but all connections are dropped when the service is failed between
>>> nodes
>>>
>>> Setup is as follows
>>>
>>> 2 node cluster
>>> DRBD backend shared storage in Master Master configuration
>>> cman presenting GFS2 /Storage folder
>>> Samba + Winbind + CTDB used to present /Storage/Test_Share via
>>> \\clustername\test_share (both nodes are AD integrated)
>>>
>>> Connections to replicated storage are working fine, AD accounts are
>>> authenticated correcly and smbstatus shows that CTDB is load
>>> ballancing the cluster address between nodes correctly,
>>>
>>> When I run ctdb shutdown I expect existing connections to be migrated
>>
>> Also, I think "ctdb shutdown" is not the right command (Use "ctdb disable", it
>> should all be in the man pages). Only one node should fail so that the
>> other node can take over the IP address. If the IP address is not taken
>> over, the clients will probably not be able to reconnect.
>>
>> Cheers!
>> --Abhi
>>
>
> --
> "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
> Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
> jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
> ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
> OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org
>
>



-- 

"The only difference between saints and sinners is that every saint
has a past while every sinner has a future. "
- Oscar Wilde



From crh at ubiqx.mn.org  Wed Jun 23 20:11:51 2010
From: crh at ubiqx.mn.org (Christopher R. Hertel)
Date: Wed, 23 Jun 2010 15:11:51 -0500
Subject: [Linux-cluster] Samba Statefull Failover
In-Reply-To: <AANLkTikXizYzAFZKosyl9OM7xuQTuHhnOZj2LipPyEoa@mail.gmail.com>
References: <AANLkTik3uHPcJm_M4PkVNEMZ24z2iHtYa_uU3s-Ol4Cp@mail.gmail.com>	<406355108.698261277217224346.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>	<42A27C666486E746BD0C6A60142EC3AB03275B45FE@NTHVSEXCHMAIL01.interaccess.nl>	<4C20E003.2060608@ubiqx.mn.org>
	<AANLkTikXizYzAFZKosyl9OM7xuQTuHhnOZj2LipPyEoa@mail.gmail.com>
Message-ID: <4C226A87.5000001@ubiqx.mn.org>

Jason,

You're correct.  Sorry.  I'm a protocol geek, and I was addressing the
protocol issues.

Samba 3 has SMB2 in the development tree, but not yet released as stable.
It should be pretty close to a testing release, however, so folks interested
in testing SMB2 support should watch the samba-technical list.

If you are running any pre-Vista Windows products, then you are also correct
that SMB2 won't be supported on those.

Having said all of that...  The problem is that the Windows SMB1 clients
have no mechanism for recovering if the TCP connection is lost.  If they are
using OpLocks, for instance, all cached updates are lost if the connection
goes down.  The clients simply throw away state and start over.  If the
applications running on those clients do not know how to re-establish the
correct state then they will lose data.

This has nothing to do with CTDB, since it all happens on the client side.

SMB2, however, has what are called "persistent file handles".  The Windows
SMB2 client can maintain state even if the TCP connection fails.  When the
connection is re-established, the client can resynchronize with the server
and all is well.
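
As a rough illustration of the difference described above, here is a toy
model in Python. It is only a sketch and does not implement any part of the
real SMB or SMB2 wire protocol: the class names, the list standing in for
the server, and the failover() helper are all hypothetical. It only shows
the two client behaviours: one throws its cached state away when the
connection drops, the other keeps it and replays it after reconnecting.

```python
class Smb1LikeClient:
    """Toy client that discards all cached state when the connection drops."""

    def __init__(self):
        self.connected = True
        self.cached_writes = []   # updates held locally, as under an oplock
        self.lost_writes = 0

    def write(self, data):
        self.cached_writes.append(data)

    def connection_lost(self):
        self.connected = False
        self.lost_writes += len(self.cached_writes)
        self.cached_writes.clear()          # state is thrown away

    def reconnect(self, server):
        self.connected = True               # starts over, nothing to replay


class Smb2LikeClient(Smb1LikeClient):
    """Toy client that keeps its state and resynchronizes on reconnect."""

    def connection_lost(self):
        self.connected = False              # cached writes are kept

    def reconnect(self, server):
        self.connected = True
        server.extend(self.cached_writes)   # replay what was pending
        self.cached_writes.clear()


def failover(client):
    """Hypothetical scenario: two cached writes, then a connection loss."""
    server = []                             # stands in for the file server
    client.write(b"a")
    client.write(b"b")
    client.connection_lost()
    client.reconnect(server)
    return server, client.lost_writes


if __name__ == "__main__":
    print(failover(Smb1LikeClient()))
    print(failover(Smb2LikeClient()))
```

In the first case the application would have to notice the loss and redo
its work itself, which is the situation described for the SMB1 clients.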

Chris -)-----

Jason Fitzpatrick wrote:
> Hi Chris..
> 
> From reading up on the SMB2 protocol it seems that this is
> implemented within Vista and greater MS clients, but I do not seem to
> be able to track support for SMB2 within SAMBA
> 
> I see that there is a Samba4AD project but cannot find rpms for RHEL
> and am a bit cagey about putting a dev version onto a critical file
> server.
> 
> And the Citrix (client) servers are 2003 SMBv1 anyway so that kind of
> kills that anyway.
> 
> Thanks
> 
> Jay
> 
> On 22 June 2010 17:08, Christopher R. Hertel <crh at ubiqx.mn.org> wrote:
>> Please note that the SMB/CIFS protocol itself does not gracefully recover
>> from a failover.  SMB2 is much better in this regard.  This is a client-side
>> problem due to limitations in the protocol and client expectations.
>>
>> Chris -)-----
>>
>> Frank de Groodt wrote:
>>> Make sure you use virtual public ip addresses managed by CTDB, not the ones bound to your NICS.
>>>
>>> Frank.
>>> ________________________________________
>>> From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com] On Behalf Of Abhijith Das [adas at redhat.com]
>>> Sent: Tuesday, June 22, 2010 4:33 PM
>>> To: linux clustering
>>> Cc: Sumit Bose; Gunther Deschner; Simo Sorce
>>> Subject: Re: [Linux-cluster] Samba Statefull Failover
>>>
>>> ----- "Jason Fitzpatrick" <jayfitzpatrick at gmail.com> wrote:
>>>
>>>> From: "Jason Fitzpatrick" <jayfitzpatrick at gmail.com>
>>>> To: linux-cluster at redhat.com
>>>> Sent: Tuesday, June 22, 2010 6:43:36 AM GMT -06:00 US/Canada Central
>>>> Subject: [Linux-cluster] Samba Statefull Failover
>>>>
>>>> Hi all
>>>>
>>>> Just wondering if it is possible to statefully migrate smb
>>>> connections
>>>> between cluster nodes, I am running ctdb (Samba's Cluster software)
>>>> but all connections are dropped when the service is failed between
>>>> nodes
>>>>
>>>> Setup is as follows
>>>>
>>>> 2 node cluster
>>>> DRBD backend shared storage in Master Master configuration
>>>> cman presenting GFS2 /Storage folder
>>>> Samba + Winbind + CTDB used to present /Storage/Test_Share via
>>>> \\clustername\test_share (both nodes are AD integrated)
>>>>
>>>> Connections to replicated storage are working fine, AD accounts are
>>>> authenticated correctly and smbstatus shows that CTDB is load
>>>> balancing the cluster address between nodes correctly,
>>>>
>>>> When I run ctdb shutdown I expect existing connections to be migrated
>>> Also, I think "ctdb shutdown" is not the right command (Use "ctdb disable", it
>>> should all be in the man pages). Only one node should fail so that the
>>> other node can take over the IP address. If the IP address is not taken
>>> over, the clients will probably not be able to reconnect.
>>>
>>> Cheers!
>>> --Abhi
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>> --
>> "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
>> Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
>> jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
>> ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
>> OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
> 
> 
> 

-- 
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org



From jayfitzpatrick at gmail.com  Wed Jun 23 20:43:00 2010
From: jayfitzpatrick at gmail.com (Jason Fitzpatrick)
Date: Wed, 23 Jun 2010 21:43:00 +0100
Subject: [Linux-cluster] Samba Statefull Failover
In-Reply-To: <4C226A87.5000001@ubiqx.mn.org>
References: <AANLkTik3uHPcJm_M4PkVNEMZ24z2iHtYa_uU3s-Ol4Cp@mail.gmail.com>
	<406355108.698261277217224346.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
	<42A27C666486E746BD0C6A60142EC3AB03275B45FE@NTHVSEXCHMAIL01.interaccess.nl>
	<4C20E003.2060608@ubiqx.mn.org>
	<AANLkTikXizYzAFZKosyl9OM7xuQTuHhnOZj2LipPyEoa@mail.gmail.com>
	<4C226A87.5000001@ubiqx.mn.org>
Message-ID: <AANLkTinBfFoNn6jYiMWsuw0GIS28Spsom2VHcklUjH0A@mail.gmail.com>

Hi Chris.

Sweet, I will give that a go (upgrading to a dev version of Samba and using
a later SMB client, Win7 or the like).

Thanks a mill

Jay




-- 

"The only difference between saints and sinners is that every saint
has a past while every sinner has a future. "
- Oscar Wilde



From tom+linux-cluster at oneshoeco.com  Wed Jun 23 20:43:39 2010
From: tom+linux-cluster at oneshoeco.com (Tom Lanyon)
Date: Thu, 24 Jun 2010 06:13:39 +0930
Subject: [Linux-cluster] running clurgmgr directly causes clustat malfunction
In-Reply-To: <A78DB34D00374344A0AB65B6523C05DC05A0847D@marsden.win.datacash.com>
References: <A78DB34D00374344A0AB65B6523C05DC05A08126@marsden.win.datacash.com><0669A4DB-95CD-47D4-93DC-8C9D06626347@oneshoeco.com>
	<A78DB34D00374344A0AB65B6523C05DC05A08216@marsden.win.datacash.com>
	<A78DB34D00374344A0AB65B6523C05DC05A0847D@marsden.win.datacash.com>
Message-ID: <A1D3C123-11EF-4C42-B813-EA4D5887AD5C@oneshoeco.com>

On 23/06/2010, at 11:15 PM, Martin Waite wrote:
> Hi,
> 
> I found the cause of the problem:  we run a sudo environment that
> restricts exec permission to a specified list of programs.  
> 
> The init.d script was covered by this - and so worked fine - but
> /usr/sbin/clurgmgrd was not.
> 
> The problem was solved by adding /usr/sbin/clurgmgrd to the list of
> programs allowed to exec under sudo.


Good to hear it's working - it's always the simple problems that are most confusing!

Tom



From anoop_rajkumar at merck.com  Sat Jun 26 15:39:31 2010
From: anoop_rajkumar at merck.com (Rajkumar, Anoop)
Date: Sat, 26 Jun 2010 11:39:31 -0400
Subject: [Linux-cluster] RHEL Cluster node fencing and cluster membership
Message-ID: <C651C3AA2A6A1D4980D35451DDE3F96B721452@usctmx1160.merck.com>

Hi

I have two DL585 servers with shared storage from an MSA 1000 in a two-node
RHEL 5.3 cluster. The priorities in cluster.conf are as follows:

<failoverdomainnode name="usrylxap237.merck.com" priority="1"/>
<failoverdomainnode name="usrylxap238.merck.com" priority="2"/>
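
For reference, a minimal Python sketch of how these priorities order the
nodes, assuming the usual rgmanager convention that a lower number means a
more preferred node. The enclosing <failoverdomain> element, its name and
its ordered="1" attribute are assumptions added so that the fragment parses
on its own; they are not taken from the actual cluster.conf.

```python
import xml.etree.ElementTree as ET

# The two <failoverdomainnode> lines are from the message; the wrapper
# element and its attributes are hypothetical.
FRAGMENT = """
<failoverdomain name="example" ordered="1">
  <failoverdomainnode name="usrylxap237.merck.com" priority="1"/>
  <failoverdomainnode name="usrylxap238.merck.com" priority="2"/>
</failoverdomain>
"""


def nodes_by_preference(xml_text):
    """Return node names with the most preferred (lowest number) first."""
    root = ET.fromstring(xml_text.strip())
    nodes = [(int(node.get("priority")), node.get("name"))
             for node in root.iter("failoverdomainnode")]
    return [name for _, name in sorted(nodes)]


if __name__ == "__main__":
    print(nodes_by_preference(FRAGMENT))
```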

Whenever the lower-priority node usrylxap238 is rebooted, it kills cman on
usrylxap237 (the higher-priority node) and fences it, causing it to reboot.
The messages I see in /var/log/messages on the higher-priority node are:

Jun 26 11:02:36 usrylxap237 openais[4750]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
Jun 26 11:03:57 usrylxap237 openais[27373]: [CMAN ] cman killed by node 1 because we were killed by cman_tool or other application

After the reboot, when the higher-priority node usrylxap237 comes up, it
transfers the services from the lower-priority node to itself and everything
works fine for some time. Then I see the following messages in
/var/log/messages on the higher-priority node, which is running the services.

Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] entering GATHER state from 2.
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Creating commit token because I am the rep.
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Saving state aru 17 high seq received 17
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id for ring 420
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] The token was lost in the COMMIT state.
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] entering GATHER state from 4.
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] Creating commit token because I am the rep.
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id for ring 424
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] The token was lost in the COMMIT state.
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] entering GATHER state from 4.
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] Creating commit token because I am the rep.
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id for ring 428
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] entering RECOVERY state.
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] position [0] member 54.3.254.237:
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] previous ring seq 1052 rep 54.3.254.237
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] aru 17 high delivered 17 received flag 1
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] position [1] member 54.3.254.238:
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] previous ring seq 1052 rep 54.3.254.237
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] aru 17 high delivered 17 received flag 1
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] Did not need to originate any messages in recovery.
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] Sending initial ORF token
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] New Configuration:
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0) ip(54.3.254.237)
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0) ip(54.3.254.238)
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Left:
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Joined:
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] New Configuration:
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0) ip(54.3.254.237)
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0) ip(54.3.254.238)
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Left:
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Joined:
Jun 26 09:24:54 usrylxap237 openais[5792]: [SYNC ] This node is within the primary component and will provide service.
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] entering OPERATIONAL state.
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] got nodejoin message 54.3.254.237
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] got nodejoin message 54.3.254.238
Jun 26 09:24:54 usrylxap237 openais[5792]: [CPG  ] got joinlist message from node 1
Jun 26 09:24:54 usrylxap237 openais[5792]: [CPG  ] got joinlist message from node 2
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] entering GATHER state from 2.
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Creating commit token because I am the rep.
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Saving state aru 17 high seq received 17
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id for ring 42c
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Creating commit token because I am the rep.
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id for ring 430
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering GATHER state from 13.
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Creating commit token because I am the rep.
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id for ring 434
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Creating commit token because I am the rep.
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id for ring 438
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] entering GATHER state from 13.
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Creating commit token because I am the rep.


On the second node I can see 

Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] entering GATHER state from 12.
Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] Saving state aru 17 high seq received 17
Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id for ring 420
Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] entering GATHER state from 13.
Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id for ring 424
Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] The token was lost in the COMMIT state.
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] entering GATHER state from 4.
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id for ring 428
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] entering RECOVERY state.
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] position [0] member 54.3.254.237:
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1052 rep 54.3.254.237
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17 received flag 1
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] position [1] member 54.3.254.238:
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1052 rep 54.3.254.237
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17 received flag 1
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] Did not need to originate any messages in recovery.
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] New Configuration:
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0) ip(54.3.254.237)
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0) ip(54.3.254.238)
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Left:
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Joined:
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] New Configuration:
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0) ip(54.3.254.237)
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0) ip(54.3.254.238)
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Left:
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Joined:
Jun 26 09:24:54 usrylxap238 openais[5725]: [SYNC ] This node is within the primary component and will provide service.
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] entering OPERATIONAL state.
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message 54.3.254.237
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message 54.3.254.238
Jun 26 09:24:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message from node 1
Jun 26 09:24:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message from node 2
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] entering GATHER state from 12.
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] Saving state aru 17 high seq received 17
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id for ring 42c
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] The token was lost in the COMMIT state.
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering GATHER state from 4.
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id for ring 430
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering GATHER state from 13.
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id for ring 434
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] The token was lost in the COMMIT state.
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering GATHER state from 4.
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id for ring 438
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering GATHER state from 13.
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id for ring 43c
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] The token was lost in the COMMIT state.
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] entering GATHER state from 4.
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id for ring 440
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] entering RECOVERY state.
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] position [0] member 54.3.254.237:
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1064 rep 54.3.254.237
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17 received flag 1
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] position [1] member 54.3.254.238:
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1064 rep 54.3.254.237
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17 received flag 1
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] Did not need to originate any messages in recovery.
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] New Configuration:
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0) ip(54.3.254.237)
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0) ip(54.3.254.238)
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Left:
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Joined:
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] New Configuration:
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0) ip(54.3.254.237)
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0) ip(54.3.254.238)
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Left:
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Joined:
Jun 26 09:25:54 usrylxap238 openais[5725]: [SYNC ] This node is within the primary component and will provide service.
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] entering OPERATIONAL state.
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message 54.3.254.237
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message 54.3.254.238
Jun 26 09:25:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message from node 1
Jun 26 09:25:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message from node 2
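
Log excerpts pasted into mail are often re-wrapped by the mail client, so a
quick way to compare the two nodes is to rejoin the entries and count the
token losses per host. The following Python sketch does that under the
assumption that every real entry starts with a syslog timestamp and host
name; the function names and the three-entry sample are illustrative only.

```python
import re
from collections import Counter

# A syslog entry in these excerpts starts with "Mon DD HH:MM:SS host ".
SYSLOG_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} (\S+) ")
TOKEN_LOST = re.compile(r"\[TOTEM\] The token was lost in the (\w+) state")


def rejoin(lines):
    """Merge continuation lines (no timestamp) into the preceding entry."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if SYSLOG_START.match(line) or not entries:
            entries.append(line)
        else:
            # Joining with a space is a guess: lines that were broken in the
            # middle of a word will end up with a stray space.
            entries[-1] += " " + line
    return entries


def token_losses(entries):
    """Count 'token was lost' events per (host, totem state)."""
    counts = Counter()
    for entry in entries:
        start = SYSLOG_START.match(entry)
        lost = TOKEN_LOST.search(entry)
        if start and lost:
            counts[(start.group(1), lost.group(1))] += 1
    return counts


SAMPLE = """\
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] The token was lost in
the OPERATIONAL state.
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] The token was lost in
the COMMIT state.
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] The token was lost in
the COMMIT state.
""".splitlines()

if __name__ == "__main__":
    for key, count in sorted(token_losses(rejoin(SAMPLE)).items()):
        print(key, count)
```

Feeding it the full /var/log/messages from both nodes would show how often
each side loses the token and in which state.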

Now my cluster is in a bad state, even though clustat and cman_tool show
that everything is fine: I cannot move services between the nodes (they
are running fine on their current node), and clusvcadm does not even give
an error message when I try to move them.

[root at usrylxap238 ~]# clustat
Cluster Status for cluster1 @ Sat Jun 26 11:25:12 2010
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 usrylxap237.merck.com                       1 Online, rgmanager
 usrylxap238.merck.com                       2 Online, Local, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:http-service           usrylxap237.merck.com          started
 service:mysql                  usrylxap237.merck.com          started
[root at usrylxap238 ~]# cman_tool status
Version: 6.1.0
Config Version: 32
Cluster Name: cluster1
Cluster Id: 26777
Cluster Member: Yes
Cluster Generation: 1276
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Active subsystems: 9
Flags: 2node Dirty
Ports Bound: 0 11 177
Node name: usrylxap238.merck.com
Node ID: 2
Multicast addresses: 239.192.104.2
Node addresses: 54.3.254.238
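
When comparing this output between nodes, it can help to reduce it to a few
fields. The Python sketch below parses the "Key: value" lines into a
dictionary and reports the vote and flag fields; the sample is a subset of
the output above. The quorate test (total votes at least equal to the quorum
value) is a simplification chosen for this sketch, and the sketch does not
decide what the Dirty flag means for this particular failure.

```python
SAMPLE = """\
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Flags: 2node Dirty
Node name: usrylxap238.merck.com
Multicast addresses: 239.192.104.2
"""


def parse_status(text):
    """Turn 'Key: value' lines from cman_tool status into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def summarize(status):
    """Pick out the vote and flag fields (simplified, see the note above)."""
    flags = status.get("Flags", "").split()
    return {
        "quorate": int(status["Total votes"]) >= int(status["Quorum"]),
        "two_node": "2node" in flags,
        "dirty": "Dirty" in flags,
    }


if __name__ == "__main__":
    print(summarize(parse_status(SAMPLE)))
```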

I have clvmd running with locking_type = 3 and a gfs2 file system mounted
(using dlm), which is now hanging on the higher-priority node but is fine on
the lower-priority node (which no longer seems to be part of the cluster).

[root at usrylxap237 ~]# service gfs2 status
Active GFS2 mountpoints:
/oracluster1

[root at usrylxap238 ~]# service gfs2 status
Configured GFS2 mountpoints:
/oracluster1
Active GFS2 mountpoints:
/oracluster1

I am not sure why the cluster is losing membership and going stale, or why
the GFS file system is not accessible.

Thanks
Anoop

From sakect at gmail.com  Sat Jun 26 16:54:33 2010
From: sakect at gmail.com (POWERBALL ONLINE)
Date: Sat, 26 Jun 2010 23:54:33 +0700
Subject: [Linux-cluster] RHEL Cluster node fencing and cluster membership
In-Reply-To: <C651C3AA2A6A1D4980D35451DDE3F96B721452@usctmx1160.merck.com>
References: <C651C3AA2A6A1D4980D35451DDE3F96B721452@usctmx1160.merck.com>
Message-ID: <AANLkTin7QaJ0qgh1uiTMMFputPQStnxgkXLNmoTAhqzA@mail.gmail.com>

Hi ,

Did you select the "do not fail back" option in the cluster policy?
Which tool did you use to create the cluster, luci or system-config-cluster?
Do you have a quorum disk?

Regards,

Somsak (Linux Specialist HP Thailand)

On Sat, Jun 26, 2010 at 10:39 PM, Rajkumar, Anoop
<anoop_rajkumar at merck.com>wrote:

>  Hi
>
> I have two dl585 with shared storage from MSA 1000 in a two node rhel 5.3
> cluster. Priority in cluster.conf are like below.
>
> <failoverdomainnode name="usrylxap237.merck.com" priority="1"/>
>                                 <failoverdomainnode name="
> usrylxap238.merck.com" priority="2"/>
>
> Whenever lower priority node usrylxap238 Is rebooted it kills cman on
> usrylxap237 (Higher priority node) and fence it causing reboot of it.
> Message I see in /var/log/messages of higher priority node is
>
> Jun 26 11:02:36 usrylxap237 openais[4750]: [CMAN ] cman killed by node 2
> because we rejoined the cluster without a full restart
>
> Jun 26 11:03:57 usrylxap237 openais[27373]: [CMAN ] cman killed by node 1
> because we were killed by cman_tool or other application
>
> After reboot when higher priority node usrylxap237 comes up it tranfers
> services from lower priority node to itself and everything works fine for
> some time. Then I see following message in /var/log/messages of higher
> priority node running services.
>
> Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] The token was lost in
> the OPERATIONAL state.
> Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Receive multicast socket
> recv buffer size (2880
> 00 bytes).
> Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Transmit multicast
> socket send buffer size (288
> 000 bytes).
> Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
> from 2.
> Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Creating commit token
> because I am the rep.
> Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Saving state aru 17 high
> seq received 17
> Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
> for ring 420
> Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
> Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] The token was lost in
> the COMMIT state.
> Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
> from 4.
> Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] Creating commit token
> because I am the rep.
> Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
> for ring 424
> Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
> Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] The token was lost in
> the COMMIT state.
> Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
> from 4.
> Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] Creating commit token
> because I am the rep.
> Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
> for ring 428
> Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] entering RECOVERY state.
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] position [0] member
> 54.3.254.237:
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] previous ring seq 1052
> rep 54.3.254.237
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] aru 17 high delivered 17
> received flag 1
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] position [1] member
> 54.3.254.238:
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] previous ring seq 1052
> rep 54.3.254.237
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] aru 17 high delivered 17
> received flag 1
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] Did not need to
> originate any messages in recovery.
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] Sending initial ORF
> token
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] New Configuration:
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
> ip(54.3.254.237)
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
> ip(54.3.254.238)
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Left:
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Joined:
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] New Configuration:
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
> ip(54.3.254.237)
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
> ip(54.3.254.238)
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Left:
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Joined:
> Jun 26 09:24:54 usrylxap237 openais[5792]: [SYNC ] This node is within the
> primary component and will provide service.
> Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] entering OPERATIONAL
> state.
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] got nodejoin message
> 54.3.254.237
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] got nodejoin message
> 54.3.254.238
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CPG  ] got joinlist message
> from node 1
> Jun 26 09:24:54 usrylxap237 openais[5792]: [CPG  ] got joinlist message
> from node 2
> Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] The token was lost in
> the OPERATIONAL state.
> Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Receive multicast socket
> recv buffer size (288000 bytes).
> Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Transmit multicast
> socket send buffer size (288000 bytes).
> Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
> from 2.
> Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Creating commit token
> because I am the rep.
> Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Saving state aru 17 high
> seq received 17
> Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
> for ring 42c
> Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Creating commit token
> because I am the rep.
> Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
> for ring 430
> Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
> from 13.
> Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Creating commit token
> because I am the rep.
> Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
> for ring 434
> Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Creating commit token
> because I am the rep.
> Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
> for ring 438
> Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
> from 13.
> Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Creating commit token
> because I am the rep.
>
> On the second node I can see
>
> Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
> from 12.
> Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] Saving state aru 17 high
> seq received 17
> Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
> for ring 420
> Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
> Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
> from 13.
> Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
> for ring 424
> Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
> Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] The token was lost in
> the COMMIT state.
> Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
> from 4.
> Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
> for ring 428
> Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
> Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] entering RECOVERY state.
> Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] position [0] member
> 54.3.254.237:
> Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1052
> rep 54.3.254.237
> Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17
> received flag 1
> Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] position [1] member
> 54.3.254.238:
> Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1052
> rep 54.3.254.237
> Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17
> received flag 1
> Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] Did not need to
> originate any messages in recovery.
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] New Configuration:
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
> ip(54.3.254.237)
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
> ip(54.3.254.238)
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Left:
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Joined:
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] New Configuration:
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
> ip(54.3.254.237)
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
> ip(54.3.254.238)
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Left:
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Joined:
> Jun 26 09:24:54 usrylxap238 openais[5725]: [SYNC ] This node is within the
> primary component and will provide service.
> Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] entering OPERATIONAL
> state.
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
> 54.3.254.237
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
> 54.3.254.238
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message
> from node 1
> Jun 26 09:24:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message
> from node 2
> Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
> from 12.
> Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] Saving state aru 17 high
> seq received 17
> Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
> for ring 42c
> Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] The token was lost in
> the COMMIT state.
> Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
> from 4.
> Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
> for ring 430
> Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
> from 13.
> Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
> for ring 434
> Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] The token was lost in
> the COMMIT state.
> Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
> from 4.
> Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
> for ring 438
> Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
> from 13.
> Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
> for ring 43c
> Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] The token was lost in
> the COMMIT state.
> Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
> from 4.
> Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
> for ring 440
> Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state.
> Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] entering RECOVERY state.
> Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] position [0] member
> 54.3.254.237:
> Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1064
> rep 54.3.254.237
> Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17
> received flag 1
> Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] position [1] member
> 54.3.254.238:
> Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1064
> rep 54.3.254.237
> Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17
> received flag 1
> Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] Did not need to
> originate any messages in recovery.
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] New Configuration:
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
> ip(54.3.254.237)
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
> ip(54.3.254.238)
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Left:
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Joined:
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] New Configuration:
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
> ip(54.3.254.237)
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
> ip(54.3.254.238)
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Left:
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Joined:
> Jun 26 09:25:54 usrylxap238 openais[5725]: [SYNC ] This node is within the
> primary component and will provide service.
> Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] entering OPERATIONAL
> state.
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
> 54.3.254.237
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
> 54.3.254.238
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message
> from node 1
> Jun 26 09:25:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message
> from node 2
>
> Now my cluster is messed up, even though clustat and cman_tool show that
> everything is fine: I cannot move services between the nodes (they are
> running fine on the present node). clusvcadm does not even give an error
> message when I try to move them.
>
> [root at usrylxap238 ~]# clustat
> Cluster Status for cluster1 @ Sat Jun 26 11:25:12 2010
> Member Status: Quorate
>
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
>  usrylxap237.merck.com                       1 Online, rgmanager
>  usrylxap238.merck.com                       2 Online, Local, rgmanager
>
>  Service Name                   Owner (Last)                   State
>  ------- ----                   ----- ------                   -----
>  service:http-service           usrylxap237.merck.com          started
>  service:mysql                  usrylxap237.merck.com          started
> [root at usrylxap238 ~]# cman_tool status
> Version: 6.1.0
> Config Version: 32
> Cluster Name: cluster1
> Cluster Id: 26777
> Cluster Member: Yes
> Cluster Generation: 1276
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 1
> Total votes: 2
> Quorum: 1
> Active subsystems: 9
> Flags: 2node Dirty
> Ports Bound: 0 11 177
> Node name: usrylxap238.merck.com
> Node ID: 2
> Multicast addresses: 239.192.104.2
> Node addresses: 54.3.254.238
>
> I have clvmd running with locking_type = 3 and a gfs2 file system mounted
> (using dlm), which is now hanging on the higher-priority node but is fine on
> the lower-priority node (which seems not to be part of the cluster now).
>
> [root at usrylxap237 ~]# service gfs2 status
> Active GFS2 mountpoints:
> /oracluster1
>
> [root at usrylxap238 ~]# service gfs2 status
> Configured GFS2 mountpoints:
> /oracluster1
> Active GFS2 mountpoints:
> /oracluster1
>
> Not sure why the cluster is losing membership and going stale, and why the
> GFS file system is not accessible.
>
> Thanks
> Anoop
>
> Notice:  This e-mail message, together with any attachments, contains
> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
> New Jersey, USA 08889), and/or its affiliates Direct contact information
> for affiliates is available at http://www.merck.com/contact/contacts.html) that may be confidential,
> proprietary copyrighted and/or legally privileged. It is intended solely
> for the use of the individual or entity named on this message. If you are
> not the intended recipient, and have received this message in error,
> please notify us immediately by reply e-mail and then delete it from
> your system.
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

From sakect at gmail.com  Sat Jun 26 17:00:10 2010
From: sakect at gmail.com (POWERBALL ONLINE)
Date: Sun, 27 Jun 2010 00:00:10 +0700
Subject: [Linux-cluster] cluster configuration
In-Reply-To: <AANLkTilMO39vU9z07xUjAM5YUa7JIwvI8y88GQJEPagj@mail.gmail.com>
References: <AANLkTilMO39vU9z07xUjAM5YUa7JIwvI8y88GQJEPagj@mail.gmail.com>
Message-ID: <AANLkTikj-vNZrECjTqdOylXLkPOGxm6qRPqdeUJliiqK@mail.gmail.com>

Hi,

I have a step-by-step procedure for creating a cluster, but how many nodes do
you want in the cluster?
I will prepare it for you.

Somsak ( Linux Specialist HP Thailand)

On Mon, Jun 21, 2010 at 6:46 PM, parshuram prasad <parshu001 at gmail.com> wrote:

> Hi All,
>
> Please provide me with step-by-step instructions for clustering on Linux EL 5.3.
>
>
> --
> Warm Regards
> Parshuram Prasad
> +91-9560170372
> Sr. System Administrator & Database Administrator
>
> Stratoshear Technology Pvt. Ltd.
>
> BPS House Green Park -16
> www.stratoshear.com
>

From kitgerrits at gmail.com  Sun Jun 27 10:34:36 2010
From: kitgerrits at gmail.com (Kit Gerrits)
Date: Sun, 27 Jun 2010 12:34:36 +0200
Subject: [Linux-cluster] RHEL Cluster node fencing and cluster membership
In-Reply-To: <C651C3AA2A6A1D4980D35451DDE3F96B721452@usctmx1160.merck.com>
Message-ID: <4c27293b.c8e8d80a.06c5.ffffe48d@mx.google.com>

Have you tried comparing the output of the cluster tools between the two
nodes?

Maybe the internal cluster services are not synchronised.
I have seen this on clusters with connection issues.
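
As a rough sketch of that comparison, the snippet below parses the "Key: value" lines printed by cman_tool status on each node and reports the fields that differ. The usrylxap238 sample is trimmed from the output posted in this thread; the usrylxap237 sample is made up purely for illustration, and the helper names parse_status and diff_status are hypothetical.

```python
def parse_status(text):
    """Turn 'Key: value' lines from `cman_tool status` into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diff_status(a, b, ignore=("Node name", "Node ID", "Node addresses")):
    """Return {key: (value_a, value_b)} for fields that differ between two nodes.

    Fields that are expected to differ per node are skipped via `ignore`.
    """
    keys = (set(a) | set(b)) - set(ignore)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


# Output posted from usrylxap238 (trimmed).
node238 = """\
Config Version: 32
Cluster Generation: 1276
Membership state: Cluster-Member
Nodes: 2
Node name: usrylxap238.merck.com
Node ID: 2
"""

# Hypothetical output from usrylxap237; NOT taken from the thread.
node237 = """\
Config Version: 32
Cluster Generation: 1272
Membership state: Cluster-Member
Nodes: 2
Node name: usrylxap237.merck.com
Node ID: 1
"""

differences = diff_status(parse_status(node237), parse_status(node238))
print(differences)
```

If the two nodes report different values for fields such as the cluster generation or the node count, that would be consistent with them not sharing the same view of the membership.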
 
I'm not familiar enough with the messages to understand them exactly,
but my gut instinct tells me you temporarily have 2 separate clusters with
1 vote each.
My guess:
1/ the secondary node fails to join the cluster on the first node
2/ the secondary node starts its own cluster
3/ the primary node sees the secondary node and says hello
4/ the secondary node then fences the primary node
 
Are both nodes running NTP? (timing issues, log timestamps)
Are there any firewalls or network issues? (multicast packets traveling only
one way)
 
Regards,
 
Kit
 
  _____  

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajkumar, Anoop
Sent: Saturday 26 June 2010 17:40
To: linux-cluster at redhat.com
Subject: [Linux-cluster] RHEL Cluster node fencing and cluster membership



Hi 

I have two DL585 servers with shared storage from an MSA 1000 in a two-node
RHEL 5.3 cluster. The priorities in cluster.conf are as below.

<failoverdomainnode name="usrylxap237.merck.com" priority="1"/>
<failoverdomainnode name="usrylxap238.merck.com" priority="2"/>

Whenever the lower-priority node usrylxap238 is rebooted, it kills cman on
usrylxap237 (the higher-priority node) and fences it, causing it to reboot.
The messages I see in /var/log/messages on the higher-priority node are:

Jun 26 11:02:36 usrylxap237 openais[4750]: [CMAN ] cman killed by node 2
because we rejoined the cluster without a full restart

Jun 26 11:03:57 usrylxap237 openais[27373]: [CMAN ] cman killed by node 1
because we were killed by cman_tool or other application

After reboot, when the higher-priority node usrylxap237 comes up, it transfers
services from the lower-priority node to itself and everything works fine for
some time. Then I see the following messages in /var/log/messages on the
higher-priority node running the services.

Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] The token was lost in the
OPERATIONAL state. 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Transmit multicast socket
send buffer size (288000 bytes).
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 2. 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Saving state aru 17 high
seq received 17 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
for ring 420 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state. 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] The token was lost in the
COMMIT state. 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
for ring 424 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state. 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] The token was lost in the
COMMIT state. 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
for ring 428 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state. 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] entering RECOVERY state. 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] position [0] member
54.3.254.237: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] previous ring seq 1052
rep 54.3.254.237 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] aru 17 high delivered 17
received flag 1 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] position [1] member
54.3.254.238: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] previous ring seq 1052
rep 54.3.254.237 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] aru 17 high delivered 17
received flag 1 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] Did not need to originate
any messages in recovery.
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] Sending initial ORF token

Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] New Configuration: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Left: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Joined: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] New Configuration: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Left: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Joined: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [SYNC ] This node is within the
primary component and will provide service.
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] entering OPERATIONAL
state. 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] got nodejoin message
54.3.254.237 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] got nodejoin message
54.3.254.238 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CPG  ] got joinlist message from
node 1 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CPG  ] got joinlist message from
node 2 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] The token was lost in the
OPERATIONAL state. 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Transmit multicast socket
send buffer size (288000 bytes).
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 2. 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Saving state aru 17 high
seq received 17 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
for ring 42c 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
for ring 430 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
for ring 434 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Storing new sequence id
for ring 438 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 


On the second node I can see 

Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 12. 
Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] Saving state aru 17 high
seq received 17 
Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
for ring 420 
Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state. 
Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
for ring 424 
Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state. 
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] The token was lost in the
COMMIT state. 
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
for ring 428 
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state. 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] entering RECOVERY state. 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] position [0] member
54.3.254.237: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1052
rep 54.3.254.237 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17
received flag 1 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] position [1] member
54.3.254.238: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1052
rep 54.3.254.237 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17
received flag 1 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] Did not need to originate
any messages in recovery.
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] New Configuration: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Left: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Joined: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] New Configuration: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Left: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Joined: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [SYNC ] This node is within the
primary component and will provide service.
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] entering OPERATIONAL
state. 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
54.3.254.237 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
54.3.254.238 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message from
node 1 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message from
node 2 
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 12. 
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] Saving state aru 17 high
seq received 17 
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
for ring 42c 
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] The token was lost in the
COMMIT state. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
for ring 430 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
for ring 434 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] The token was lost in the
COMMIT state. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
for ring 438 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
for ring 43c 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] The token was lost in the
COMMIT state. 
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] Storing new sequence id
for ring 440 
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] entering COMMIT state. 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] entering RECOVERY state. 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] position [0] member
54.3.254.237: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1064
rep 54.3.254.237 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17
received flag 1 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] position [1] member
54.3.254.238: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq 1064
rep 54.3.254.237 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered 17
received flag 1 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] Did not need to originate
any messages in recovery.
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] New Configuration: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Left: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Joined: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] New Configuration: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Left: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Joined: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [SYNC ] This node is within the
primary component and will provide service.
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] entering OPERATIONAL
state. 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
54.3.254.237 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
54.3.254.238 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message from
node 1 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message from
node 2 

Now my cluster is messed up, even though clustat and cman_tool show that
everything is fine: I cannot move services between the nodes (they are
running fine on the present node). clusvcadm does not even give an error
message when I try to move them.

[root at usrylxap238 ~]# clustat 
Cluster Status for cluster1 @ Sat Jun 26 11:25:12 2010 
Member Status: Quorate 

 Member Name                             ID   Status 
 ------ ----                             ---- ------ 
 usrylxap237.merck.com                       1 Online, rgmanager 
 usrylxap238.merck.com                       2 Online, Local, rgmanager 

 Service Name                   Owner (Last)                   State 
 ------- ----                   ----- ------                   ----- 
 service:http-service           usrylxap237.merck.com          started 
 service:mysql                  usrylxap237.merck.com          started 
[root at usrylxap238 ~]# cman_tool status 
Version: 6.1.0 
Config Version: 32 
Cluster Name: cluster1 
Cluster Id: 26777 
Cluster Member: Yes 
Cluster Generation: 1276 
Membership state: Cluster-Member 
Nodes: 2 
Expected votes: 1 
Total votes: 2 
Quorum: 1 
Active subsystems: 9 
Flags: 2node Dirty 
Ports Bound: 0 11 177 
Node name: usrylxap238.merck.com 
Node ID: 2 
Multicast addresses: 239.192.104.2 
Node addresses: 54.3.254.238 

I have clvmd running with locking_type = 3 and a gfs2 file system mounted
(using dlm), which is now hanging on the higher-priority node but is fine on
the lower-priority node (which seems not to be part of the cluster now).

[root at usrylxap237 ~]# service gfs2 status 
Active GFS2 mountpoints: 
/oracluster1 

[root at usrylxap238 ~]# service gfs2 status 
Configured GFS2 mountpoints: 
/oracluster1 
Active GFS2 mountpoints: 
/oracluster1 

I am not sure why the cluster is losing membership and going stale, or why
the GFS file system is not accessible.

Thanks 
Anoop 

Notice:  This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
New Jersey, USA 08889), and/or its affiliates Direct contact information
for affiliates is available at
http://www.merck.com/contact/contacts.html) that may be confidential,
proprietary copyrighted and/or legally privileged. It is intended solely
for the use of the individual or entity named on this message. If you are
not the intended recipient, and have received this message in error,
please notify us immediately by reply e-mail and then delete it from
your system.


From kethureddy78 at gmail.com  Sun Jun 27 10:18:59 2010
From: kethureddy78 at gmail.com (Kethu Sreenuvasulu Reddy)
Date: Sun, 27 Jun 2010 15:48:59 +0530
Subject: [Linux-cluster] cluster configuration
In-Reply-To: <AANLkTikj-vNZrECjTqdOylXLkPOGxm6qRPqdeUJliiqK@mail.gmail.com>
References: <AANLkTilMO39vU9z07xUjAM5YUa7JIwvI8y88GQJEPagj@mail.gmail.com>
	<AANLkTikj-vNZrECjTqdOylXLkPOGxm6qRPqdeUJliiqK@mail.gmail.com>
Message-ID: <AANLkTinBbgVeLQrIbVNPB3M-ND4MkkrYToXb-OOHajDa@mail.gmail.com>

Hi Prasad,

I am also looking for a step-by-step guide to cluster configuration.

Please prepare a three-node cluster configuration, including common
troubleshooting issues.

I will be waiting for your reply so that I can continue learning.

Thanks and Regards,
Kethu Reddy

On Sat, Jun 26, 2010 at 10:30 PM, POWERBALL ONLINE <sakect at gmail.com> wrote:

> Hi,
>
> I have a step-by-step guide for creating a cluster, but how many nodes do
> you want in the cluster?
> I will prepare it for you.
>
> Somsak ( Linux Specialist HP Thailand)
>
>   On Mon, Jun 21, 2010 at 6:46 PM, parshuram prasad <parshu001 at gmail.com> wrote:
>
>>  Hi All,
>>
>> Please provide me with step-by-step instructions for clustering on Linux EL 5.3.
>>
>>
>> --
>> Warm Regards
>> Parshuram Prasad
>> +91-9560170372
>> Sr. System Administrator & Database Administrator
>>
>> Stratoshear Technology Pvt. Ltd.
>>
>> BPS House Green Park -16
>> www.stratoshear.com
>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>

From anoop_rajkumar at merck.com  Sun Jun 27 15:44:55 2010
From: anoop_rajkumar at merck.com (Rajkumar, Anoop)
Date: Sun, 27 Jun 2010 11:44:55 -0400
Subject: [Linux-cluster] RHEL Cluster node fencing and cluster membership
In-Reply-To: <4c27293b.c8e8d80a.06c5.ffffe48d@mx.google.com>
References: <C651C3AA2A6A1D4980D35451DDE3F96B721452@usctmx1160.merck.com>
	<4c27293b.c8e8d80a.06c5.ffffe48d@mx.google.com>
Message-ID: <C651C3AA2A6A1D4980D35451DDE3F96B72145D@usctmx1160.merck.com>

Hi Kit
 
ntpd is running on both systems. I removed the following gfs and lvm
packages, and my cluster is working perfectly now.
 
gfs2-utils-0.1.53-1.el5
kmod-gfs-0.1.31-3.el5
lvm2-cluster-2.02.40-7.el5
 
So basically, as soon as the gfs process starts after the rpm is added, I
run into that problem.
 
Below is the ccs_tool configuration from both the servers.
 
[root at usrylxap237 ~]# ccs_tool lsnode
 
Cluster name: cluster1, config_version: 33
 
Nodename                        Votes Nodeid Fencetype
usrylxap237.merck.com              1    1    usrylxap237r
usrylxap238.merck.com              1    2    usrylxap238r
[root at usrylxap237 ~]# ccs_tool lsfence
Name             Agent
usrylxap237r     fence_ilo
usrylxap238r     fence_ilo
 
[root at usrylxap238 ~]# ccs_tool lsnode
 
Cluster name: cluster1, config_version: 33
 
Nodename                        Votes Nodeid Fencetype
usrylxap237.merck.com              1    1    usrylxap237r
usrylxap238.merck.com              1    2    usrylxap238r
[root at usrylxap238 ~]# ccs_tool lsfence
Name             Agent
usrylxap237r     fence_ilo
usrylxap238r     fence_ilo
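
Comparing tool output between the nodes, as suggested in this thread, can be
scripted. The sketch below is a minimal illustration using Python's difflib;
the helper name diff_node_outputs and the decision to drop the shell-prompt
lines before comparing are assumptions made for the example.

```python
import difflib


def diff_node_outputs(out_a, out_b, label_a="node-a", label_b="node-b"):
    """Return unified-diff lines between two captured command outputs.

    Shell prompt lines (which contain the hostname) and blank lines are
    dropped first, so only differences in the tool output itself remain.
    """
    def clean(text):
        return [ln.rstrip() for ln in text.splitlines()
                if ln.strip() and not ln.startswith("[root")]

    return list(difflib.unified_diff(clean(out_a), clean(out_b),
                                     fromfile=label_a, tofile=label_b,
                                     lineterm=""))


NODE_237 = """\
[root at usrylxap237 ~]# ccs_tool lsnode
Cluster name: cluster1, config_version: 33
Nodename                        Votes Nodeid Fencetype
usrylxap237.merck.com              1    1    usrylxap237r
usrylxap238.merck.com              1    2    usrylxap238r
"""
# The second capture differs only in its prompt line.
NODE_238 = NODE_237.replace("[root at usrylxap237", "[root at usrylxap238")

differences = diff_node_outputs(NODE_237, NODE_238, "usrylxap237", "usrylxap238")
print(differences or "outputs match")
```

An empty result means both nodes report the same node list and
config_version; any diff lines point at where their views diverge.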
 
Below is the firewall configuration on both the servers.
 
 
[root at usrylxap237 ~]# iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere            udp dpt:netsupport
ACCEPT     udp  --  anywhere             anywhere            udp spt:netsupport
ACCEPT     udp  --  anywhere             anywhere            udp dpt:50007
ACCEPT     udp  --  anywhere             anywhere            udp spt:50007
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:21064
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:21064
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:50009
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:50009
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:50008
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:50008
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:50006
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:50006
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:41969
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:41969
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:41968
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:41968
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:41967
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:41967
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:41966
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:41966
 
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
 
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere            udp spt:netsupport
ACCEPT     udp  --  anywhere             anywhere            udp dpt:netsupport
ACCEPT     udp  --  anywhere             anywhere            udp spt:50007
ACCEPT     udp  --  anywhere             anywhere            udp dpt:50007
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:21064
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:21064
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:50009
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:50009
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:50008
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:50008
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:50006
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:50006
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:41969
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:41969
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:41968
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:41968
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:41967
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:41967
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:41966
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:41966
 
Thanks
Anoop

________________________________

From: Kit Gerrits [mailto:kitgerrits at gmail.com] 
Sent: Sunday, June 27, 2010 6:35 AM
To: 'linux clustering'
Cc: Rajkumar, Anoop
Subject: RE: [Linux-cluster] RHEL Cluster node fencing and cluster
membership


Have you tried comparing the output of the cluster tools between the two
nodes?
 
Maybe the internal cluster services are not 'synchronised'
I have seen this on clusters with connection issues.
 
I'm not familiar enough with the messages to understand them exactly,
  but my gut instinct tells me you temporarily have 2 separate clusters
with 1 vote each.
My guess:
1/ the secondary node fails to join the cluster on the first node
2/ the secondary node starts its own cluster
3/ the primary node sees the secondary node and says hello
4/ the secondary node then fences the primary node
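
The "two clusters with one vote each" guess can be checked with a little
arithmetic. The sketch below assumes the usual majority rule,
quorum = expected_votes // 2 + 1, and assumes that two-node mode lets a single
vote be quorate, which is consistent with the "Expected votes: 1" and
"Quorum: 1" values reported by cman_tool in this thread. The function name is
made up for illustration.

```python
def quorum_threshold(expected_votes, two_node=False):
    """Votes a partition needs to be quorate (simple majority rule).

    Assumption: in two-node mode a single vote is enough, as reflected
    by 'Expected votes: 1' and 'Quorum: 1' in the cman_tool output.
    """
    if two_node:
        return 1
    return expected_votes // 2 + 1


# Each half of a split two-node cluster holds exactly one vote.
votes_per_partition = 1
needed = quorum_threshold(expected_votes=1, two_node=True)
both_quorate = votes_per_partition >= needed
print(needed, both_quorate)

# For contrast: with three expected votes a lone node is not quorate.
print(quorum_threshold(expected_votes=3))
```

If both halves consider themselves quorate, each may try to fence the other,
which would fit the reboot behaviour described in this thread.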
 
Are both nodes running NTP? (timing issues, log timestamps)
Are there any firewalls or network issues? (multicast packets traveling
only one way)
 
 
Regards,
 
Kit
 
________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajkumar, Anoop
Sent: Saturday, June 26, 2010 17:40
To: linux-cluster at redhat.com
Subject: [Linux-cluster] RHEL Cluster node fencing and cluster
membership



Hi 

I have two DL585 servers with shared storage from an MSA 1000 in a two-node
RHEL 5.3 cluster. The priorities in cluster.conf are as follows.

<failoverdomainnode name="usrylxap237.merck.com" priority="1"/>
<failoverdomainnode name="usrylxap238.merck.com" priority="2"/>

Whenever the lower-priority node usrylxap238 is rebooted, it kills cman on
usrylxap237 (the higher-priority node) and fences it, causing it to reboot.
The messages I see in /var/log/messages on the higher-priority node are:

Jun 26 11:02:36 usrylxap237 openais[4750]: [CMAN ] cman killed by node 2
because we rejoined the cluster without a full restart

Jun 26 11:03:57 usrylxap237 openais[27373]: [CMAN ] cman killed by node
1 because we were killed by cman_tool or other application

After the reboot, when the higher-priority node usrylxap237 comes up, it
transfers the services from the lower-priority node to itself and everything
works fine for some time. Then I see the following messages in
/var/log/messages on the higher-priority node, which is running the services.

Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] The token was lost in
the OPERATIONAL state. 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 2. 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Saving state aru 17
high seq received 17 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] Storing new sequence
id for ring 420 
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] entering COMMIT
state. 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] The token was lost in
the COMMIT state. 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] Storing new sequence
id for ring 424 
Jun 26 09:24:36 usrylxap237 openais[5792]: [TOTEM] entering COMMIT
state. 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] The token was lost in
the COMMIT state. 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] Storing new sequence
id for ring 428 
Jun 26 09:24:46 usrylxap237 openais[5792]: [TOTEM] entering COMMIT
state. 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] entering RECOVERY
state. 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] position [0] member
54.3.254.237: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] previous ring seq
1052 rep 54.3.254.237 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] aru 17 high delivered
17 received flag 1 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] position [1] member
54.3.254.238: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] previous ring seq
1052 rep 54.3.254.237 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] aru 17 high delivered
17 received flag 1 
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] Did not need to originate any messages in recovery.
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] Sending initial ORF
token 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] CLM CONFIGURATION
CHANGE 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] New Configuration: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Left: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Joined: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] CLM CONFIGURATION
CHANGE 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] New Configuration: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Left: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] Members Joined: 
Jun 26 09:24:54 usrylxap237 openais[5792]: [SYNC ] This node is within the primary component and will provide service.
Jun 26 09:24:54 usrylxap237 openais[5792]: [TOTEM] entering OPERATIONAL
state. 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] got nodejoin message
54.3.254.237 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CLM  ] got nodejoin message
54.3.254.238 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CPG  ] got joinlist message
from node 1 
Jun 26 09:24:54 usrylxap237 openais[5792]: [CPG  ] got joinlist message
from node 2 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] The token was lost in
the OPERATIONAL state. 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 2. 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Saving state aru 17
high seq received 17 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] Storing new sequence
id for ring 42c 
Jun 26 09:25:23 usrylxap237 openais[5792]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Storing new sequence
id for ring 430 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] Storing new sequence
id for ring 434 
Jun 26 09:25:33 usrylxap237 openais[5792]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Storing new sequence
id for ring 438 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:25:43 usrylxap237 openais[5792]: [TOTEM] Creating commit token
because I am the rep. 


On the second node I can see 

Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 12. 
Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] Saving state aru 17
high seq received 17 
Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] Storing new sequence
id for ring 420 
Jun 26 09:24:26 usrylxap238 openais[5725]: [TOTEM] entering COMMIT
state. 
Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] Storing new sequence
id for ring 424 
Jun 26 09:24:36 usrylxap238 openais[5725]: [TOTEM] entering COMMIT
state. 
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] The token was lost in
the COMMIT state. 
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] Storing new sequence
id for ring 428 
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] entering COMMIT
state. 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] entering RECOVERY
state. 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] position [0] member
54.3.254.237: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq
1052 rep 54.3.254.237 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered
17 received flag 1 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] position [1] member
54.3.254.238: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq
1052 rep 54.3.254.237 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered
17 received flag 1 
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] Did not need to originate any messages in recovery.
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION
CHANGE 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] New Configuration: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Left: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Joined: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION
CHANGE 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] New Configuration: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Left: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] Members Joined: 
Jun 26 09:24:54 usrylxap238 openais[5725]: [SYNC ] This node is within the primary component and will provide service.
Jun 26 09:24:54 usrylxap238 openais[5725]: [TOTEM] entering OPERATIONAL
state. 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
54.3.254.237 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
54.3.254.238 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message
from node 1 
Jun 26 09:24:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message
from node 2 
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 12. 
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] Saving state aru 17
high seq received 17 
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] Storing new sequence
id for ring 42c 
Jun 26 09:25:23 usrylxap238 openais[5725]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] The token was lost in
the COMMIT state. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] Storing new sequence
id for ring 430 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] Storing new sequence
id for ring 434 
Jun 26 09:25:33 usrylxap238 openais[5725]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] The token was lost in
the COMMIT state. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] Storing new sequence
id for ring 438 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 13. 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] Storing new sequence
id for ring 43c 
Jun 26 09:25:43 usrylxap238 openais[5725]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] The token was lost in
the COMMIT state. 
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] entering GATHER state
from 4. 
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] Storing new sequence
id for ring 440 
Jun 26 09:25:53 usrylxap238 openais[5725]: [TOTEM] entering COMMIT
state. 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] entering RECOVERY
state. 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] position [0] member
54.3.254.237: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq
1064 rep 54.3.254.237 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered
17 received flag 1 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] position [1] member
54.3.254.238: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] previous ring seq
1064 rep 54.3.254.237 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] aru 17 high delivered
17 received flag 1 
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] Did not need to originate any messages in recovery.
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION
CHANGE 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] New Configuration: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Left: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Joined: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] CLM CONFIGURATION
CHANGE 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] New Configuration: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.237) 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ]      r(0)
ip(54.3.254.238) 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Left: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] Members Joined: 
Jun 26 09:25:54 usrylxap238 openais[5725]: [SYNC ] This node is within the primary component and will provide service.
Jun 26 09:25:54 usrylxap238 openais[5725]: [TOTEM] entering OPERATIONAL
state. 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
54.3.254.237 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CLM  ] got nodejoin message
54.3.254.238 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message
from node 1 
Jun 26 09:25:54 usrylxap238 openais[5725]: [CPG  ] got joinlist message
from node 2 
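
To see how often each node reports a lost token, the wrapped syslog lines can
be rejoined and counted. This is a minimal sketch: the regular expression
assumes the "Mon DD HH:MM:SS host openais[pid]:" prefix seen in the excerpts
above, and the helper names are illustrative.

```python
import re
from collections import Counter

# A new syslog record starts with a timestamp such as "Jun 26 09:24:26".
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")


def join_wrapped(lines):
    """Merge continuation lines (mail-client wrapping) back into full records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def count_token_losses(lines):
    """Count 'The token was lost' records per hostname."""
    losses = Counter()
    for record in join_wrapped(lines):
        if "The token was lost" in record:
            host = record.split()[3]  # fields: month, day, time, host
            losses[host] += 1
    return losses


SAMPLE = """\
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] The token was lost in
the OPERATIONAL state.
Jun 26 09:24:26 usrylxap237 openais[5792]: [TOTEM] entering GATHER state
from 2.
Jun 26 09:24:46 usrylxap238 openais[5725]: [TOTEM] The token was lost in
the COMMIT state.
""".splitlines()

losses = count_token_losses(SAMPLE)
print(dict(losses))
```

Feeding it the full /var/log/messages from each node would show whether the
losses cluster around particular times on both nodes.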

Now my cluster is messed up, even though clustat and cman_tool show that
everything is fine: I cannot move services between the nodes (they are
running fine on the present node). clusvcadm does not even give an error
message when I try to move them.

[root at usrylxap238 ~]# clustat 
Cluster Status for cluster1 @ Sat Jun 26 11:25:12 2010 
Member Status: Quorate 

 Member Name                             ID   Status 
 ------ ----                             ---- ------ 
 usrylxap237.merck.com                       1 Online, rgmanager 
 usrylxap238.merck.com                       2 Online, Local, rgmanager 

 Service Name                   Owner (Last)                   State 
 ------- ----                   ----- ------                   ----- 
 service:http-service           usrylxap237.merck.com          started 
 service:mysql                  usrylxap237.merck.com          started 
[root at usrylxap238 ~]# cman_tool status 
Version: 6.1.0 
Config Version: 32 
Cluster Name: cluster1 
Cluster Id: 26777 
Cluster Member: Yes 
Cluster Generation: 1276 
Membership state: Cluster-Member 
Nodes: 2 
Expected votes: 1 
Total votes: 2 
Quorum: 1 
Active subsystems: 9 
Flags: 2node Dirty 
Ports Bound: 0 11 177 
Node name: usrylxap238.merck.com 
Node ID: 2 
Multicast addresses: 239.192.104.2 
Node addresses: 54.3.254.238 

I have clvmd running with locking_type = 3 and a gfs2 file system mounted
(using dlm). It is now hanging on the higher-priority node but is fine on
the lower-priority node (which no longer seems to be part of the cluster).

[root at usrylxap237 ~]# service gfs2 status 
Active GFS2 mountpoints: 
/oracluster1 

[root at usrylxap238 ~]# service gfs2 status 
Configured GFS2 mountpoints: 
/oracluster1 
Active GFS2 mountpoints: 
/oracluster1 

I am not sure why the cluster is losing membership and going stale, or why
the GFS file system is not accessible.

Thanks 
Anoop 


From jakov.sosic at srce.hr  Sun Jun 27 23:58:16 2010
From: jakov.sosic at srce.hr (Jakov Sosic)
Date: Mon, 28 Jun 2010 01:58:16 +0200
Subject: [Linux-cluster] RHEL Cluster node fencing and cluster membership
In-Reply-To: <C651C3AA2A6A1D4980D35451DDE3F96B72145D@usctmx1160.merck.com>
References: <C651C3AA2A6A1D4980D35451DDE3F96B721452@usctmx1160.merck.com>	<4c27293b.c8e8d80a.06c5.ffffe48d@mx.google.com>
	<C651C3AA2A6A1D4980D35451DDE3F96B72145D@usctmx1160.merck.com>
Message-ID: <4C27E598.4070206@srce.hr>

On 06/27/2010 05:44 PM, Rajkumar, Anoop wrote:

Hi Anoop


Could you post your whole cluster.conf? Maybe you've given both nodes the
same nodeid. That could cause something similar to your issues...
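
A duplicate nodeid can be checked for mechanically. The sketch below is a
minimal illustration that parses cluster.conf-style XML with Python's standard
library; the function name is made up, and the sample configuration is
hypothetical and deliberately broken (it is not the configuration from this
thread).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def duplicate_nodeids(cluster_conf_xml):
    """Map each nodeid used by more than one <clusternode> to the node names."""
    root = ET.fromstring(cluster_conf_xml)
    names_by_id = defaultdict(list)
    for node in root.iter("clusternode"):
        names_by_id[node.get("nodeid")].append(node.get("name"))
    return {nid: names for nid, names in names_by_id.items() if len(names) > 1}


# Hypothetical, deliberately broken configuration: both nodes share nodeid 1.
BROKEN = """\
<cluster name="cluster1" config_version="33">
  <clusternodes>
    <clusternode name="usrylxap237.merck.com" nodeid="1" votes="1"/>
    <clusternode name="usrylxap238.merck.com" nodeid="1" votes="1"/>
  </clusternodes>
</cluster>
"""

print(duplicate_nodeids(BROKEN))
```

On a real node the text would be read from /etc/cluster/cluster.conf instead
of the inline sample; an empty result means every nodeid is unique.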

Also, take a look at:
http://openais.org/doku.php?id=faq:cisco_switches

if you're using Cisco gear.




-- 
|    Jakov Sosic    |    ICQ: 28410271    |   PGP: 0x965CAE2D   |
=================================================================
| start fighting cancer -> http://www.worldcommunitygrid.org/   |



From esggrupos at gmail.com  Mon Jun 28 10:11:41 2010
From: esggrupos at gmail.com (ESGLinux)
Date: Mon, 28 Jun 2010 12:11:41 +0200
Subject: [Linux-cluster] is it possible an active-active NFS server?
Message-ID: <AANLkTimQWW1vQCdPXoIItzMw3S9eB7Z67q0QfwdiAmJC@mail.gmail.com>

Hi All,

I'm going to set up an active-active file server, and my first idea was to
set up an NFS service with luci, but now I doubt whether that is possible.
With luci I have always set up active-passive services. So that is my
question.

Is there any other approach to get an active-active file server?

Thanks in advance

ESG

From gordan at bobich.net  Mon Jun 28 10:23:36 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Mon, 28 Jun 2010 11:23:36 +0100
Subject: [Linux-cluster] is it possible an active-active NFS server?
In-Reply-To: <AANLkTimQWW1vQCdPXoIItzMw3S9eB7Z67q0QfwdiAmJC@mail.gmail.com>
References: <AANLkTimQWW1vQCdPXoIItzMw3S9eB7Z67q0QfwdiAmJC@mail.gmail.com>
Message-ID: <4C287828.3050605@bobich.net>

On 06/28/2010 11:11 AM, ESGLinux wrote:
> Hi All,
>
> I'm going to set up an active-active file server, and my first idea was to
> set up an NFS service with luci, but now I doubt whether that is
> possible. With luci I have always set up active-passive services. So
> that is my question.
>
> Is there any other approach to get an active-active file server?


Not with NFS, since NFS has no feature to have multiple servers/share. 
But there is no reason you can't connect half of the clients to the 
other server.
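
Splitting the clients between the two servers can be done with a static
mapping. The sketch below is only an illustration of that idea: the host
names are hypothetical, and the round-robin rule over the sorted client list
is an assumption chosen because it gives an even, repeatable split.

```python
def split_clients(clients, servers):
    """Assign clients to servers round-robin over the sorted client names.

    Sorting first makes the assignment repeatable regardless of input order.
    """
    ordered = sorted(clients)
    return {client: servers[i % len(servers)] for i, client in enumerate(ordered)}


# Hypothetical names, for illustration only.
SERVERS = ["nfs-a.example.com", "nfs-b.example.com"]
CLIENTS = [f"client{n:02d}.example.com" for n in range(1, 9)]

assignment = split_clients(CLIENTS, SERVERS)
for client, server in assignment.items():
    print(f"{client} -> {server}")
```

This only spreads load; it does not by itself provide failover, so each
server's export would still need its own high-availability arrangement.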

If you need client-side multi-homing, GlusterFS can do that.

Gordan



From rajatjpatel at gmail.com  Mon Jun 28 10:48:56 2010
From: rajatjpatel at gmail.com (rajatjpatel)
Date: Mon, 28 Jun 2010 16:18:56 +0530
Subject: [Linux-cluster] cluster step
Message-ID: <AANLkTimU_8f1du2wMflIgcm3mN_Ik3YG4U-eGTsXqvZz@mail.gmail.com>

http://studyhat.blogspot.com/2010/01/cluster-hp-ilo.html

http://studyhat.blogspot.com/2009/11/clustering-linux-ha.html

Try the links above; they will help you set up a cluster.


Regards,

Rajat J Patel

FIRST THEY IGNORE YOU...
THEN THEY LAUGH AT YOU...
THEN THEY FIGHT YOU...
THEN YOU WIN...

Skype rajatjpatel
AIM    rajatjpatel
yahoo rajatjpatel
msn    rajatjpatel

From esggrupos at gmail.com  Mon Jun 28 11:26:54 2010
From: esggrupos at gmail.com (ESGLinux)
Date: Mon, 28 Jun 2010 13:26:54 +0200
Subject: [Linux-cluster] is it possible an active-active NFS server?
In-Reply-To: <4C287828.3050605@bobich.net>
References: <AANLkTimQWW1vQCdPXoIItzMw3S9eB7Z67q0QfwdiAmJC@mail.gmail.com>
	<4C287828.3050605@bobich.net>
Message-ID: <AANLkTikWVqVmg0NUcQlVM7IAwAJ83zfGX2Wp2ffSSjru@mail.gmail.com>

2010/6/28 Gordan Bobic <gordan at bobich.net>

> On 06/28/2010 11:11 AM, ESGLinux wrote:
>
>> Hi All,
>>
>> I'm going to set up an active-active file server, and my first idea was to
>> set up an NFS service with luci, but now I doubt whether that is
>> possible. With luci I have always set up active-passive services. So
>> that is my question.
>>
>> Is there any other approach to get an active-active file server?
>>
>
>
> Not with NFS, since NFS has no feature to have multiple servers/share. But
> there is no reason you can't connect half of the clients to the other
> server.
>

I hadn't thought of that; it could be a solution.

One more thing: I have been looking into this, and I think it could be
possible using Linux Virtual Server (administered with piranha). What do
you think?




>
> If you need client-side multi-homing, GlusterFS can do that.
>
>
In this project I can only use Red Hat certified solutions. I suppose
GlusterFS isn't one.

Thanks for your answer



> Gordan
>
>
ESG



From gordan at bobich.net  Mon Jun 28 11:42:32 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Mon, 28 Jun 2010 12:42:32 +0100
Subject: [Linux-cluster] is it possible an active-active NFS server?
In-Reply-To: <AANLkTikWVqVmg0NUcQlVM7IAwAJ83zfGX2Wp2ffSSjru@mail.gmail.com>
References: <AANLkTimQWW1vQCdPXoIItzMw3S9eB7Z67q0QfwdiAmJC@mail.gmail.com>	<4C287828.3050605@bobich.net>
	<AANLkTikWVqVmg0NUcQlVM7IAwAJ83zfGX2Wp2ffSSjru@mail.gmail.com>
Message-ID: <4C288AA8.9040107@bobich.net>

On 06/28/2010 12:26 PM, ESGLinux wrote:
>
>
> 2010/6/28 Gordan Bobic <gordan at bobich.net>
>
>     On 06/28/2010 11:11 AM, ESGLinux wrote:
>
>         Hi All,
>
>         I'm going to set up an active-active file server, and my first
>         idea is to configure an NFS service with luci, but now I doubt
>         whether it is possible. With luci I have always set up
>         Active-Passive services. So that is my question.
>
>         Any other approach to get an Active-Active file server?
>
>
>
>     Not with NFS, since NFS has no feature to have multiple
>     servers/share. But there is no reason you can't connect half of the
>     clients to the other server.
>
>
> I hadn't thought of that; it could be a solution.
>
> One thing: I have been looking into this, and I think it might be
> possible using Linux Virtual Server (administered with piranha). What
> do you think?

I think you need to start to list your requirements in a coherent manner 
first, in terms of performance, features, and redundancy. The solution 
you should be looking for will be more obvious then.

>     If you need client-side multi-homing, GlusterFS can do that.
>
>
> In this project I can only use Red Hat certified solutions. I suppose
> GlusterFS isn't.

Hmm, personally I find sticking with only what ships with the distro 
too limiting as soon as you need to cover a use-case that isn't very 
basic and boring. But if you are looking for a "supported" solution, 
Gluster Inc. do have support contracts available.

But without more information on your use-case and expected access 
patterns, it is impossible to suggest a solution in a particularly 
meaningful way.

Gordan



From esggrupos at gmail.com  Mon Jun 28 12:28:41 2010
From: esggrupos at gmail.com (ESGLinux)
Date: Mon, 28 Jun 2010 14:28:41 +0200
Subject: [Linux-cluster] is it possible an active-active NFS server?
In-Reply-To: <4C288AA8.9040107@bobich.net>
References: <AANLkTimQWW1vQCdPXoIItzMw3S9eB7Z67q0QfwdiAmJC@mail.gmail.com>
	<4C287828.3050605@bobich.net>
	<AANLkTikWVqVmg0NUcQlVM7IAwAJ83zfGX2Wp2ffSSjru@mail.gmail.com>
	<4C288AA8.9040107@bobich.net>
Message-ID: <AANLkTildQiGqfJcusb6SqRs_rwgiXVP1CLi9S9BeJ9FT@mail.gmail.com>

2010/6/28 Gordan Bobic <gordan at bobich.net>

> On 06/28/2010 12:26 PM, ESGLinux wrote:
>
>>
>>
>> 2010/6/28 Gordan Bobic <gordan at bobich.net>
>>
>>
>>    On 06/28/2010 11:11 AM, ESGLinux wrote:
>>
>>        Hi All,
>>
>>        I'm going to set up an active-active file server, and my first
>>        idea is to configure an NFS service with luci, but now I doubt
>>        whether it is possible. With luci I have always set up
>>        Active-Passive services. So that is my question.
>>
>>        Any other approach to get an Active-Active file server?
>>
>>
>>
>>    Not with NFS, since NFS has no feature to have multiple
>>    servers/share. But there is no reason you can't connect half of the
>>    clients to the other server.
>>
>>
>> I hadn't thought of that; it could be a solution.
>>
>> One thing: I have been looking into this, and I think it might be
>> possible using Linux Virtual Server (administered with piranha). What
>> do you think?
>>
>
> I think you need to start to list your requirements in a coherent manner
> first, in terms of performance, features, and redundancy. The solution you
> should be looking for will be more obvious then.
>
>

Hi again,

You are right, this is a bit confusing (my customer told me: "I need a cluster
file server, it's your problem..." :-/ ). Now I'm investigating how to do it.

What I basically need is:

I need highly available access to the files, and it must be scalable. If load
becomes a problem, I should be able to add another node to solve it (so I
thought of Active-Active, because with Active-Passive only one node is
active, so the load problem is still there).




>
>     If you need client-side multi-homing, GlusterFS can do that.
>>
>>
>> In this project I can only use Red Hat certified solutions. I suppose
>> GlusterFS isn't.
>>
>
> Hmm, personally I find that sticking with only what ships with the distro
> to be too limiting as soon as you need to cover a use-case that isn't very
> basic and boring. But if you are looking for a "supported" solution, Gluster
> Inc. do have support contracts available.
>
>
I'll try it on my systems to test it, but I don't think I can use it. :-(



> But without more information on your use-case and expected access patterns,
> it is impossible to suggest a solution in a particularly meaningful way.
>
> Gordan
>
Thanks Gordan,

ESG




From gordan at bobich.net  Mon Jun 28 12:49:39 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Mon, 28 Jun 2010 13:49:39 +0100
Subject: [Linux-cluster] is it possible an active-active NFS server?
In-Reply-To: <AANLkTildQiGqfJcusb6SqRs_rwgiXVP1CLi9S9BeJ9FT@mail.gmail.com>
References: <AANLkTimQWW1vQCdPXoIItzMw3S9eB7Z67q0QfwdiAmJC@mail.gmail.com>	<4C287828.3050605@bobich.net>	<AANLkTikWVqVmg0NUcQlVM7IAwAJ83zfGX2Wp2ffSSjru@mail.gmail.com>	<4C288AA8.9040107@bobich.net>
	<AANLkTildQiGqfJcusb6SqRs_rwgiXVP1CLi9S9BeJ9FT@mail.gmail.com>
Message-ID: <4C289A63.2060500@bobich.net>

On 06/28/2010 01:28 PM, ESGLinux wrote:
>
>
> 2010/6/28 Gordan Bobic <gordan at bobich.net>
>
>     On 06/28/2010 12:26 PM, ESGLinux wrote:
>
>
>
>         2010/6/28 Gordan Bobic <gordan at bobich.net>
>
>
>             On 06/28/2010 11:11 AM, ESGLinux wrote:
>
>                 Hi All,
>
>                 I'm going to set up an active-active file server, and my
>                 first idea is to configure an NFS service with luci, but
>                 now I doubt whether it is possible. With luci I have
>                 always set up Active-Passive services. So that is my
>                 question.
>
>                 Any other approach to get an Active-Active file server?
>
>
>
>             Not with NFS, since NFS has no feature to have multiple
>             servers/share. But there is no reason you can't connect half
>             of the clients to the other server.
>
>
>         I hadn't thought of that; it could be a solution.
>
>         One thing: I have been looking into this, and I think it might
>         be possible using Linux Virtual Server (administered with
>         piranha). What do you think?
>
>
>     I think you need to start to list your requirements in a coherent
>     manner first, in terms of performance, features, and redundancy. The
>     solution you should be looking for will be more obvious then.
>
>
>
> Hi again,
>
> You are right, this is a bit confusing (my customer told me: "I need a
> cluster file server, it's your problem..." :-/ ). Now I'm investigating
> how to do it.
>
> What I basically need is:
>
> I need highly available access to the files, and it must be scalable. If
> load becomes a problem, I should be able to add another node to solve it
> (so I thought of Active-Active, because with Active-Passive only one
> node is active, so the load problem is still there).

Whether it will scale is dependent almost exclusively on your access 
pattern. If you can group your cluster file system accesses so that 
nodes hardly ever access the same file system subtrees then it will 
scale reasonably well. If you are going to have nodes randomly accessing 
the file system paths, then the performance will take a nosedive, and 
get progressively slower as you add nodes.

This will scale linearly:
Node 1 accessing /my/path/1/whatever
Node 2 accessing /my/path/2/whatever

This will scale inversely (get slower):
Node 1 accessing /my/path
Node 2 accessing /my/path

Cluster file systems are generally slower at random access than 
standalone file systems, so you are likely to find that having a 
standalone failover (active-passive) solution is faster than a clustered 
active-active solution, especially as you add nodes.

So the question really comes down to access patterns. If you are going 
to have random access to lots of small files (e.g. Maildir), the 
performance will be poor to start with and get worse as you add nodes 
unless you can engineer your solution so that access for a particular 
subtree always hits the same node. OTOH for large file operations, the 
bandwidth will be more dominant than random access lock acquisition 
time, so the performance will be OK and scale reasonably as you add nodes.

Note that this isn't something specific to GFS - pretty much all cluster 
file systems behave this way.
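
The idea of making access for a particular subtree always hit the same node can be sketched as a small routing function. This is only an illustration in Python, not a feature of GFS or any cluster file system: the function name `node_for_path`, the `depth` parameter and the node names are all hypothetical, and hashing is just one possible way to get a stable subtree-to-node mapping.

```python
import hashlib


def node_for_path(path, nodes, depth=3):
    """Return the node that should handle `path`.

    The choice depends only on the first `depth` path components, so every
    file below the same subtree is always routed to the same node.
    (Hypothetical helper; `depth` and the hashing scheme are assumptions.)
    """
    parts = [p for p in path.split("/") if p]
    subtree = "/".join(parts[:depth])
    digest = hashlib.sha1(subtree.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    nodes = ["node1", "node2"]  # hypothetical node names
    for p in ("/my/path/1/whatever", "/my/path/1/other", "/my/path/2/whatever"):
        print(p, "->", node_for_path(p, nodes))
```

Two different subtrees may still land on the same node; the only guarantee is that one subtree never moves between nodes, which is the property that keeps lock contention down.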

Gordan



From anoop_rajkumar at merck.com  Mon Jun 28 17:42:35 2010
From: anoop_rajkumar at merck.com (Rajkumar, Anoop)
Date: Mon, 28 Jun 2010 13:42:35 -0400
Subject: [Linux-cluster] RHEL Cluster node fencing and cluster
Message-ID: <C651C3AA2A6A1D4980D35451DDE3F96B721474@usctmx1160.merck.com>

Hi 

I am not getting into the problem now of the cluster getting stalled after I
created a gfs file system instead of gfs2. Here is my cluster.conf file.

[root at system1 cluster]# more cluster.conf
<?xml version="1.0"?>
<cluster alias="cluster1" config_version="33" name="cluster1">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="100"/>
        <clusternodes>
                <clusternode name="system1.merck.com" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="system1r"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="system2.merck.com" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="system2r"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ilo" hostname="system1r.merck.com" login="admin" name="system1r" passwd="Anwyccdfy57"/>
                <fencedevice agent="fence_ilo" hostname="system2r.merck.com" login="admin" name="system1r" passwd="Anwyccdfy57"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="webdomain" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="system1.merck.com" priority="1"/>
                                <failoverdomainnode name="system2.merck.com" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="54.3.xyz.abc" monitor_link="1"/>
                        <script file="/etc/init.d/orig.httpd" name="http startup script"/>
                        <fs device="/dev/sda2" force_fsck="0" force_unmount="0" fsid="6443" fstype="ext3" mountpoint="/var/www/html" name="httpd-content" options="" self_fence="0"/>
                        <fs device="/dev/sda1" force_fsck="0" force_unmount="0" fsid="30579" fstype="ext3" mountpoint="/var/lib/mysql" name="mysql-content" options="" self_fence="0"/>
                        <script file="/etc/init.d/mysqld" name="mysql startup script"/>
                        <ip address="192.168.0.3" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="webdomain" name="http-service" recovery="restart">
                        <script ref="http startup script"/>
                        <fs ref="httpd-content"/>
                        <ip ref="54.3.xyz.abc"/>
                </service>
                <service autostart="1" domain="webdomain" exclusive="0" name="mysql" recovery="disable">
                        <fs ref="mysql-content"/>
                        <script ref="mysql startup script"/>
                        <ip ref="192.168.0.3"/>
                </service>
        </rm>
</cluster>

Thanks
Anoop

From Bennie_R_Thomas at raytheon.com  Mon Jun 28 20:53:52 2010
From: Bennie_R_Thomas at raytheon.com (Bennie Thomas)
Date: Mon, 28 Jun 2010 15:53:52 -0500
Subject: [Linux-cluster] RedHat RHEL 5U4 NFS Cluster nodes randomly reboot
Message-ID: <4C290BE0.3090707@raytheon.com>

I currently have 2 DL380 G6 with an HP MSA2312 disk array, running 
Redhat 5u4 64bit.  I have a quorum disk.  I use the cluster as an 
Active/passive NFS cluster.
The problem I am having is that one or both of the nodes will randomly 
reboot. Has anyone experienced this problem?

-- 
Bennie Thomas
Sr. Information Systems Technologist II
Raytheon Company

972.205.4126
972.205.6363 fax
888.347.1660 pager
Bennie_R_Thomas at raytheon.com




From wrsturm at mtroyal.ca  Mon Jun 28 21:07:17 2010
From: wrsturm at mtroyal.ca (Warren Sturm)
Date: Mon, 28 Jun 2010 15:07:17 -0600
Subject: [Linux-cluster] RedHat RHEL 5U4 NFS Cluster nodes randomly
 reboot
In-Reply-To: <4C290BE0.3090707@raytheon.com>
References: <4C290BE0.3090707@raytheon.com>
Message-ID: <1277759237.28120.3.camel@mrdt215982.mtroyal.ca>

On Mon, 2010-06-28 at 15:53 -0500, Bennie Thomas wrote:
> I currently have 2 DL380 G6 with an HP MSA2312 disk array, running 
> Redhat 5u4 64bit.  I have a quorum disk.  I use the cluster as an 
> Active/passive NFS cluster.
> The problem I am having is that one or both of the nodes will randomly 
> reboot. Has anyone experienced this problem?
> 

see:

https://bugzilla.redhat.com/show_bug.cgi?id=502977

and

https://bugzilla.redhat.com/show_bug.cgi?id=580863

It seems at this point it cannot be done (safely).





From jeff.sturm at eprize.com  Mon Jun 28 21:07:55 2010
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Mon, 28 Jun 2010 17:07:55 -0400
Subject: [Linux-cluster] RedHat RHEL 5U4 NFS Cluster nodes randomly
	reboot
In-Reply-To: <4C290BE0.3090707@raytheon.com>
References: <4C290BE0.3090707@raytheon.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F055D963A@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Bennie Thomas
> Sent: Monday, June 28, 2010 4:54 PM
> To: linux clustering
> Subject: [Linux-cluster] RedHat RHEL 5U4 NFS Cluster nodes randomly reboot
> 
> I currently have 2 DL380 G6 with an HP MSA2312 disk array, running
> Redhat 5u4 64bit.  I have a quorum disk.  I use the cluster as an
> Active/passive NFS cluster.
> The problem I am having is that one or both of the nodes will randomly
> reboot. Has anyone experienced this problem?

Do you run a power-based fence device?  Are the hosts being fenced when
they reboot?  Any other clues in /var/log/messages?

-Jeff





From jjest at u.washington.edu  Mon Jun 28 21:28:25 2010
From: jjest at u.washington.edu (Jeremiah D. Jester)
Date: Mon, 28 Jun 2010 14:28:25 -0700
Subject: [Linux-cluster] gfs2 and SAN setup
Message-ID: <14C6C8313F1842459FAAD0B3FEB9E42C50EC7722@ads-mbx-02.exchange.washington.edu>

Hello,

I'm trying to get gfs2 configured on a SAN connected to two RHEL 5 servers via iSCSI. I have formatted the iSCSI partition with gfs2 from Server 1.

 mkfs.gfs2 -p lock_dlm -t ngs:gfs2 -j 8 /dev/sdd

From Server 1 I can mount the partition just fine.

mount -o acl -t gfs2 /dev/sdd /vol10

However, I get the following error when I attempt to mount the gfs partition from Server 2.

[root at coffee cluster]# mount -o acl -t gfs2 /dev/sdd /vol10
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
/sbin/mount.gfs2: gfs_controld not running
/sbin/mount.gfs2: error mounting lockproto lock_dlm

I have the iSCSI initiator name properly entered into the SAN access control list for both servers so this doesn't seem to be the issue.

I have also installed the RHEL cluster packages, including CMAN. Is this required to get gfs2 working?

Appreciate your help!


Jeremiah Jester
Senior Informatics Specialist
Microbiology, Katze Lab
Box 357242
P: 206-732-6185


From jeff.sturm at eprize.com  Mon Jun 28 21:46:21 2010
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Mon, 28 Jun 2010 17:46:21 -0400
Subject: [Linux-cluster] gfs2 and SAN setup
In-Reply-To: <14C6C8313F1842459FAAD0B3FEB9E42C50EC7722@ads-mbx-02.exchange.washington.edu>
References: <14C6C8313F1842459FAAD0B3FEB9E42C50EC7722@ads-mbx-02.exchange.washington.edu>
Message-ID: <64D0546C5EBBD147B75DE133D798665F055D963B@hugo.eprize.local>

> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremiah D. Jester
> Sent: Monday, June 28, 2010 5:28 PM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] gfs2 and SAN setup

> [root at coffee cluster]# mount -o acl -t gfs2 /dev/sdd /vol10
> /sbin/mount.gfs2: can't connect to gfs_controld: Connection refused

Make sure CMAN is running on all nodes, and all nodes have successfully
joined the cluster.

-Jeff





From jumanjiman at gmail.com  Mon Jun 28 22:16:19 2010
From: jumanjiman at gmail.com (Paul Morgan)
Date: Mon, 28 Jun 2010 18:16:19 -0400
Subject: [Linux-cluster] gfs2 and SAN setup
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F055D963B@hugo.eprize.local>
References: <14C6C8313F1842459FAAD0B3FEB9E42C50EC7722@ads-mbx-02.exchange.washington.edu>
	<64D0546C5EBBD147B75DE133D798665F055D963B@hugo.eprize.local>
Message-ID: <AANLkTilGYNwyB6mT3g6GEmlj6hIHX3A-FgdpICyuDwRn@mail.gmail.com>

Also: as a matter of best practice, prefer mounting by a /dev/disk/by-* path
to avoid device reordering issues later.
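
To find which persistent names point at a given device, one can walk one of the /dev/disk/by-* directories and compare symlink targets. The following is a minimal Python sketch under that assumption; the helper name `stable_aliases` is hypothetical, and the directory defaults to /dev/disk/by-id only as an example.

```python
import os


def stable_aliases(device, by_dir="/dev/disk/by-id"):
    """List the symlinks in `by_dir` that resolve to `device`.

    Hypothetical helper: returns an empty list if `by_dir` does not exist
    or no link in it points at the device.
    """
    target = os.path.realpath(device)
    aliases = []
    if not os.path.isdir(by_dir):
        return aliases
    for name in sorted(os.listdir(by_dir)):
        link = os.path.join(by_dir, name)
        if os.path.realpath(link) == target:
            aliases.append(link)
    return aliases


if __name__ == "__main__":
    # /dev/sdd is the device name used earlier in this thread.
    for alias in stable_aliases("/dev/sdd"):
        print(alias)
```

Any path it prints can be used in the mount command in place of /dev/sdd, so the mount keeps working if the kernel enumerates the disks in a different order after a reboot.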

-paul

--
Top-posted from gmail on android 2.2

On Jun 28, 2010 5:52 PM, "Jeff Sturm" <jeff.sturm at eprize.com> wrote:
>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremiah D. Jester
>> Sent: Monday, June 28, 2010 5:28 PM
>> To: linux-cluster at redhat.com
>> Subject: [Linux-cluster] gfs2 and SAN setup
>
>> [root at coffee cluster]# mount -o acl -t gfs2 /dev/sdd /vol10
>> /sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
>
> Make sure CMAN is running on all nodes, and all nodes have successfully
> joined the cluster.
>
> -Jeff

From jumanjiman at gmail.com  Mon Jun 28 22:20:48 2010
From: jumanjiman at gmail.com (Paul Morgan)
Date: Mon, 28 Jun 2010 18:20:48 -0400
Subject: [Linux-cluster] RedHat RHEL 5U4 NFS Cluster nodes randomly
	reboot
In-Reply-To: <4C290BE0.3090707@raytheon.com>
References: <4C290BE0.3090707@raytheon.com>
Message-ID: <AANLkTilnu133h5o9e-nrol_ZWQDY7_U8pWayWjn_eHQX@mail.gmail.com>

You don't mention whether you checked your logs. If there is nothing to
indicate a problem, it may be a BIOS issue.

At my client we recently updated our G6 BIOS to fix this exact behavior. Sorry
I don't have the HP link at the moment, but it's a known problem with a fix.


-paul

--
Top-posted from gmail on android 2.2

On Jun 28, 2010 5:04 PM, "Bennie Thomas" <Bennie_R_Thomas at raytheon.com>
wrote:
> I currently have 2 DL380 G6 with an HP MSA2312 disk array, running
> Redhat 5u4 64bit. I have a quorum disk. I use the cluster as an
> Active/passive NFS cluster.
> The problem I am having is that one or both of the nodes will randomly
> reboot. Has anyone experienced this problem?
>
> --
> Bennie Thomas
> Sr. Information Systems Technologist II
> Raytheon Company
>
> 972.205.4126
> 972.205.6363 fax
> 888.347.1660 pager
> Bennie_R_Thomas at raytheon.com
>

From jjest at u.washington.edu  Mon Jun 28 22:20:26 2010
From: jjest at u.washington.edu (Jeremiah D. Jester)
Date: Mon, 28 Jun 2010 15:20:26 -0700
Subject: [Linux-cluster] gfs2 and SAN setup
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F055D963B@hugo.eprize.local>
References: <14C6C8313F1842459FAAD0B3FEB9E42C50EC7722@ads-mbx-02.exchange.washington.edu>
	<64D0546C5EBBD147B75DE133D798665F055D963B@hugo.eprize.local>
Message-ID: <14C6C8313F1842459FAAD0B3FEB9E42C50EC7867@ads-mbx-02.exchange.washington.edu>

Hi Jeff,

Thanks for your reply. Our two servers, donuts (server1) and coffee (server2), have been configured with CMAN, but the outcome isn't quite what we expect. On donuts, 'clustat' gives us some errors.

[root at donuts ~]# clustat
Cluster Status for ngs @ Mon Jun 28 15:14:15 2010
Member Status: Quorate

Member Name                                                ID   Status
------ ----                                                ---- ------
donuts.microslu.washington.edu                                 2 Online, Local
Node1                                                          1 Offline, Estranged

My cluster.conf file reads as follows on this machine.

[root at donuts ~]# cat  /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="ngsCluster" config_version="7" name="ngs">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="coffee.microslu.washington.edu" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="donuts.microslu.washington.edu" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

However, when I try to do the same on coffee, I am unable to start cman. I've copied donuts' cluster.conf file to this machine, but it gets overwritten with a cluster.conf file that just has 'donuts' in it every time I try to restart CMAN.

[root at coffee cluster]# clustat
Could not connect to CMAN: Connection refused
[root at coffee cluster]# /etc/init.d/cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... failed
cman not started: Can't find local node name in cluster.conf /usr/sbin/cman_tool: aisexec daemon didn't start
                                                           [FAILED]

Thanks!
JJ

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Sturm
Sent: Monday, June 28, 2010 2:46 PM
To: linux clustering
Subject: Re: [Linux-cluster] gfs2 and SAN setup

> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremiah D. Jester
> Sent: Monday, June 28, 2010 5:28 PM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] gfs2 and SAN setup

> [root at coffee cluster]# mount -o acl -t gfs2 /dev/sdd /vol10
> /sbin/mount.gfs2: can't connect to gfs_controld: Connection refused

Make sure CMAN is running on all nodes, and all nodes have successfully joined the cluster.

-Jeff

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From xavier.montagutelli at unilim.fr  Tue Jun 29 06:51:24 2010
From: xavier.montagutelli at unilim.fr (Xavier Montagutelli)
Date: Tue, 29 Jun 2010 08:51:24 +0200
Subject: [Linux-cluster] gfs2 and SAN setup
In-Reply-To: <14C6C8313F1842459FAAD0B3FEB9E42C50EC7867@ads-mbx-02.exchange.washington.edu>
References: <14C6C8313F1842459FAAD0B3FEB9E42C50EC7722@ads-mbx-02.exchange.washington.edu>
	<64D0546C5EBBD147B75DE133D798665F055D963B@hugo.eprize.local>
	<14C6C8313F1842459FAAD0B3FEB9E42C50EC7867@ads-mbx-02.exchange.washington.edu>
Message-ID: <201006290851.25027.xavier.montagutelli@unilim.fr>

On Tuesday 29 June 2010 00:20:26 Jeremiah D. Jester wrote:
> Hi Jeff,
> 
> Thanks for your reply. Our two servers, donuts (server1) and coffee
> (server2), have been configured with CMAN, but the outcome isn't quite
> what we expect. On donuts, 'clustat' gives us some errors.
> 
> [root at donuts ~]# clustat
> Cluster Status for ngs @ Mon Jun 28 15:14:15 2010
> Member Status: Quorate
> 
> Member Name                                                ID   Status
> ------ ----                                                ---- ------
> donuts.microslu.washington.edu                                 2 Online, Local
> Node1                                                          1 Offline, Estranged
> 
> My cluster.conf file reads as follows on this machine.
> 
> [root at donuts ~]# cat  /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster alias="ngsCluster" config_version="7" name="ngs">
>         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="coffee.microslu.washington.edu" nodeid="1" votes="1">
>                         <fence/>
>                 </clusternode>
>                 <clusternode name="donuts.microslu.washington.edu" nodeid="2" votes="1">
>                         <fence/>
>                 </clusternode>
>         </clusternodes>
>         <cman/>
>         <fencedevices/>
>         <rm>
>                 <failoverdomains/>
>                 <resources/>
>         </rm>
> </cluster>
> 
> However, when I try to do the same on coffee, I am unable to start cman.
> I've copied donuts' cluster.conf file to this machine, but it gets
> overwritten with a cluster.conf file that just has 'donuts' in it every
> time I try to restart CMAN.
> 
> [root at coffee cluster]# clustat
> Could not connect to CMAN: Connection refused
> [root at coffee cluster]# /etc/init.d/cman start
> Starting cluster:
>    Loading modules... done
>    Mounting configfs... done
>    Starting ccsd... done
>    Starting cman... failed
> cman not started: Can't find local node name in cluster.conf

I don't understand why your local copy of cluster.conf gets overwritten on 
coffee when you start cman.

My 10 cents: make sure the "hostname" command returns the fully qualified
domain name, exactly as it appears in your cluster.conf file.

Also make sure everything is OK in the /etc/hosts file.

Perhaps you can also stop everything, increment the version number, copy the 
"good" file to your nodes, and restart cman on both nodes simultaneously?
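
The hostname check suggested above can be scripted. This is a minimal Python sketch, not part of the cluster tools: the function names are hypothetical, it only does a strict string comparison against the `name` attributes of the `clusternode` entries, and cman's own matching may be more lenient than that.

```python
import socket
import xml.etree.ElementTree as ET


def cluster_node_names(conf_xml):
    """Return the `name` attribute of every <clusternode> in the config text."""
    root = ET.fromstring(conf_xml)
    return [node.get("name") for node in root.iter("clusternode")]


def local_node_listed(conf_xml, hostname=None):
    """True if `hostname` (default: this machine's hostname) is listed
    verbatim as a cluster node. Strict comparison only; an assumption made
    for illustration."""
    if hostname is None:
        hostname = socket.gethostname()
    return hostname in cluster_node_names(conf_xml)


if __name__ == "__main__":
    with open("/etc/cluster/cluster.conf") as fh:
        conf = fh.read()
    print("hostname:", socket.gethostname())
    print("nodes in cluster.conf:", cluster_node_names(conf))
    print("listed:", local_node_listed(conf))
```

Run on coffee, a result of `listed: False` would be consistent with the "Can't find local node name in cluster.conf" message quoted above.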

>  /usr/sbin/cman_tool: aisexec daemon didn't start
>                                                            [FAILED]
> 
> Thanks!
> JJ
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
>  [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Sturm
> Sent: Monday, June 28, 2010 2:46 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] gfs2 and SAN setup
> 
> > From: linux-cluster-bounces at redhat.com
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremiah D. Jester
> > Sent: Monday, June 28, 2010 5:28 PM
> > To: linux-cluster at redhat.com
> > Subject: [Linux-cluster] gfs2 and SAN setup
> >
> > [root at coffee cluster]# mount -o acl -t gfs2 /dev/sdd /vol10
> > /sbin/mount.gfs2: can't connect to gfs_controld: Connection refused
> 
> Make sure CMAN is running on all nodes, and all nodes have successfully
> joined the cluster.
> 
> -Jeff
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Xavier Montagutelli                      Tel : +33 (0)5 55 45 77 20
Service Commun Informatique              Fax : +33 (0)5 55 45 75 95
Universite de Limoges
123, avenue Albert Thomas
87060 Limoges cedex



From jacob.ishak at gmail.com  Tue Jun 29 10:16:42 2010
From: jacob.ishak at gmail.com (jacob ishak)
Date: Tue, 29 Jun 2010 13:16:42 +0300
Subject: [Linux-cluster] RHEL Cluster node fencing and cluster
In-Reply-To: <C651C3AA2A6A1D4980D35451DDE3F96B721474@usctmx1160.merck.com>
References: <C651C3AA2A6A1D4980D35451DDE3F96B721474@usctmx1160.merck.com>
Message-ID: <AANLkTik2y7_oRHENLq-Z6shLwcV-AsBtB6zsENHm2LQh@mail.gmail.com>

In your cluster.conf you have

fstype="ext3"

It should be fstype="gfs" or fstype="gfs2".
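
To see at a glance which fs resources are affected, something like the following Python sketch could be used. It only reads the fstype attributes out of a cluster.conf string; it cannot tell what is actually on the devices, so whether a reported entry is wrong is still for you to judge. The sample is trimmed down from the configuration quoted below, and the function names are my own.

```python
import xml.etree.ElementTree as ET

CLUSTER_FSTYPES = {"gfs", "gfs2"}


def fs_types(conf_xml):
    """Map each named <fs> resource to its fstype attribute."""
    root = ET.fromstring(conf_xml)
    # <fs ref="..."/> entries inside <service> carry no name, so skip them.
    return {fs.get("name"): fs.get("fstype")
            for fs in root.iter("fs") if fs.get("name")}


def non_cluster_fs(conf_xml):
    """Sorted names of <fs> resources whose fstype is neither gfs nor gfs2."""
    return sorted(name for name, fstype in fs_types(conf_xml).items()
                  if fstype not in CLUSTER_FSTYPES)


# Trimmed-down excerpt of the quoted configuration, for illustration only.
SAMPLE_CONF = """<?xml version="1.0"?>
<cluster name="cluster1" config_version="33">
  <rm>
    <resources>
      <fs device="/dev/sda2" fstype="ext3" mountpoint="/var/www/html" name="httpd-content"/>
      <fs device="/dev/sda1" fstype="ext3" mountpoint="/var/lib/mysql" name="mysql-content"/>
    </resources>
    <service name="http-service">
      <fs ref="httpd-content"/>
    </service>
  </rm>
</cluster>"""

print(non_cluster_fs(SAMPLE_CONF))
```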

BR

On Mon, Jun 28, 2010 at 8:42 PM, Rajkumar, Anoop
<anoop_rajkumar at merck.com> wrote:

> Hi
>
> I am no longer running into the problem of the cluster stalling now that I
> create a gfs file system instead of gfs2. Here is my cluster.conf file.
>
> [root at system1 cluster]# more cluster.conf
> <?xml version="1.0"?>
> <cluster alias="cluster1" config_version="33" name="cluster1">
>         <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="100"/>
>         <clusternodes>
>                 <clusternode name="system1.merck.com" nodeid="1" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="system1r"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="system2.merck.com" nodeid="2" votes="1">
>                         <fence>
>                                 <method name="1">
>                                         <device name="system2r"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <fencedevices>
>                 <fencedevice agent="fence_ilo" hostname="system1r.merck.com" login="admin" name="system1r" passwd="Anwyccdfy57"/>
>                 <fencedevice agent="fence_ilo" hostname="system2r.merck.com" login="admin" name="system1r" passwd="Anwyccdfy57"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="webdomain" nofailback="0" ordered="1" restricted="1">
>                                 <failoverdomainnode name="system1.merck.com" priority="1"/>
>                                 <failoverdomainnode name="system2.merck.com" priority="2"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources>
>                         <ip address="54.3.xyz.abc" monitor_link="1"/>
>                         <script file="/etc/init.d/orig.httpd" name="http startup script"/>
>                         <fs device="/dev/sda2" force_fsck="0" force_unmount="0" fsid="6443" fstype="ext3" mountpoint="/var/www/html" name="httpd-content" options="" self_fence="0"/>
>                         <fs device="/dev/sda1" force_fsck="0" force_unmount="0" fsid="30579" fstype="ext3" mountpoint="/var/lib/mysql" name="mysql-content" options="" self_fence="0"/>
>                         <script file="/etc/init.d/mysqld" name="mysql startup script"/>
>                         <ip address="192.168.0.3" monitor_link="1"/>
>                 </resources>
>                 <service autostart="1" domain="webdomain" name="http-service" recovery="restart">
>                         <script ref="http startup script"/>
>                         <fs ref="httpd-content"/>
>                         <ip ref="54.3.xyz.abc"/>
>                 </service>
>                 <service autostart="1" domain="webdomain" exclusive="0" name="mysql" recovery="disable">
>                         <fs ref="mysql-content"/>
>                         <script ref="mysql startup script"/>
>                         <ip ref="192.168.0.3"/>
>                 </service>
>         </rm>
> </cluster>
>
> Thanks
> Anoop

From esggrupos at gmail.com  Tue Jun 29 12:06:01 2010
From: esggrupos at gmail.com (ESGLinux)
Date: Tue, 29 Jun 2010 14:06:01 +0200
Subject: [Linux-cluster] is it possible an active-active NFS server?
In-Reply-To: <4C289A63.2060500@bobich.net>
References: <AANLkTimQWW1vQCdPXoIItzMw3S9eB7Z67q0QfwdiAmJC@mail.gmail.com>
	<4C287828.3050605@bobich.net>
	<AANLkTikWVqVmg0NUcQlVM7IAwAJ83zfGX2Wp2ffSSjru@mail.gmail.com>
	<4C288AA8.9040107@bobich.net>
	<AANLkTildQiGqfJcusb6SqRs_rwgiXVP1CLi9S9BeJ9FT@mail.gmail.com>
	<4C289A63.2060500@bobich.net>
Message-ID: <AANLkTin-wrApNNS8aVX9VyBPH0gvVaOuImoT24CVJ66o@mail.gmail.com>

Hi Gordan,

First, thanks for your answer and your time.



>
> Whether it will scale is dependant almost exclusively on your access
> pattern. If you can group your cluster file system accesses so that nodes
> hardly ever access the same file system subtrees then it will scale
> reasonably well. If you are going to have nodes randomly accessing the file
> system paths, then the performance will take a nosedive, and get
> progressively slower as you add nodes.
>
> This will scale linearly:
> Node 1 accessing /my/path/1/whatever
> Node 2 accessing /my/path/2/whatever
>
> This will scale inversely (get slower):
> Node 1 accessing /my/path
> Node 2 accessing /my/path
>
> Cluster file systems are generally slower at random access than standalone
> file systems, so you are likely to find that having a standalone failover
> (active-passive) solution is faster than a clustered active-active solution,
> especially as you add nodes.
>

Interesting, I had supposed that active-active would be faster...



>
> So the question really comes down to access patterns. If you are going to
> have random access to lots of small files (e.g. Maildir), the performance
> will be poor to start with and get worse as you add nodes unless you can
> engineer your solution so that access for a particular subtree always hits
> the same node. OTOH for large file operations, the bandwidth will be more
> dominant than random access lock acquisition time, so the performance will
> be OK and scale reasonably as you add nodes.
>
>
OK, understood. I'll try to find out the access patterns to get the best solution.
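
As a note to myself, the idea of making access to a particular subtree always hit the same node could be sketched roughly as below. This is only an illustration in Python, not anything GFS provides: the function names, the node names and the choice of hashing the first three path components are all my own assumptions.

```python
import hashlib


def top_subtree(path, depth=3):
    """Return the first `depth` components of an absolute path."""
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts[:depth])


def node_for(path, nodes, depth=3):
    """Pick a node for a path so that every path in one subtree maps to the same node."""
    key = top_subtree(path, depth).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical node names.
NODES = ["node1", "node2"]

# Both paths live under /my/path/1, so they are routed to the same node.
print(node_for("/my/path/1/whatever", NODES))
print(node_for("/my/path/1/other/file", NODES))
```

A hash is used instead of a fixed table so that the mapping stays stable without having to list every subtree in advance. It does not guarantee an even spread across nodes, and the mapping changes if the node list changes.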


> Note that this isn't something specific to GFS - pretty much all cluster
> file systems behave this way.
>
>


> Gordan
>
>
Greetings,

ESG




From esggrupos at gmail.com  Tue Jun 29 12:26:56 2010
From: esggrupos at gmail.com (ESGLinux)
Date: Tue, 29 Jun 2010 14:26:56 +0200
Subject: [Linux-cluster] postgres cluster without shared storage
Message-ID: <AANLkTinsvVfyeBSsknOQeHc7WWBVNsl-FWmOh3wRlx2J@mail.gmail.com>

Hi all,

I need to set up a two-node cluster with postgres as a service. I have set one
up in the past, but with shared storage and GFS; now I don't have that
element.

The idea is to have a master node with all the data on its own disk and a
mechanism to replicate this data to the slave node's own disk. If the
master goes down, the slave starts providing the service and the data then
flows from this node to the other one (the slave node becomes the master).

Is it possible to do something like this?

Thanks in advance,

ESG

From cmaiolino at redhat.com  Tue Jun 29 12:43:00 2010
From: cmaiolino at redhat.com (Carlos Maiolino)
Date: Tue, 29 Jun 2010 09:43:00 -0300
Subject: [Linux-cluster] postgres cluster without shared storage
In-Reply-To: <AANLkTinsvVfyeBSsknOQeHc7WWBVNsl-FWmOh3wRlx2J@mail.gmail.com>
References: <AANLkTinsvVfyeBSsknOQeHc7WWBVNsl-FWmOh3wRlx2J@mail.gmail.com>
Message-ID: <20100629124300.GA2376@andromeda.usersys.redhat.com>

On Tue, Jun 29, 2010 at 02:26:56PM +0200, ESGLinux wrote:
> Hi all, 
> 
> I need to mount a two nodes cluster with postgres as service. I have mounted it
> in the past but with a shared storage and using GFS but now I don't have this
> element. 
> 
> The idea is to have a master node with all the data in its own disk and have a
> mechanism to replicate this data to the slave node in its own disk. If the
> master goes down the slave begin to give the service and the flow of data will
> go from this node to the other one. (the slave node becomes the master one)
> 
> Is it possible to do something like this?
> 
> Thanks in advance, 
> 
> ESG

Hi,

I'm not very familiar with postgres, but it should have its own replication system. I know MySQL does (I have built a master/slave database setup with MySQL, which replicates over the network).

The only thing I think you should check is whether the slave postgres needs to be read/write too. In MySQL's case, IIRC, the slave should be read-only.

-- 
---

Best Regards

Carlos Eduardo Maiolino



From gordan at bobich.net  Tue Jun 29 13:52:00 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Tue, 29 Jun 2010 14:52:00 +0100
Subject: [Linux-cluster] postgres cluster without shared storage
In-Reply-To: <AANLkTinsvVfyeBSsknOQeHc7WWBVNsl-FWmOh3wRlx2J@mail.gmail.com>
References: <AANLkTinsvVfyeBSsknOQeHc7WWBVNsl-FWmOh3wRlx2J@mail.gmail.com>
Message-ID: <4C29FA80.1080105@bobich.net>

On 06/29/2010 01:26 PM, ESGLinux wrote:
> Hi all,
>
> I need to mount a two nodes cluster with postgres as service. I have
> mounted it in the past but with a shared storage and using GFS but now I
> don't have this element.

Any particular reason why you cannot use DRBD to provide that element 
without a standalone SAN?

> The idea is to have a master node with all the data in its own disk and
> have a mechanism to replicate this data to the slave node in its own
> disk. If the master goes down the slave begin to give the service and
> the flow of data will go from this node to the other one. (the slave
> node becomes the master one)

Have a look at Bucardo replication system for PostgreSQL:
http://bucardo.org/

It is as similar a solution as PostgreSQL has available to MySQL's 
replication.

It's master-slave only, though, there is no provision for master-master 
replication like in MySQL (but master-master is riddled with race 
conditions anyway, and you shouldn't be using it if you don't understand 
the edge cases that are likely to break your app).

Have a look here for more info on PostgreSQL's replication/clustering 
functionality:

http://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling

You're probably better off asking about this stuff on the PostgreSQL 
mailing lists, though.

Gordan



From esggrupos at gmail.com  Wed Jun 30 06:39:28 2010
From: esggrupos at gmail.com (ESGLinux)
Date: Wed, 30 Jun 2010 08:39:28 +0200
Subject: [Linux-cluster] postgres cluster without shared storage
In-Reply-To: <4C29FA80.1080105@bobich.net>
References: <AANLkTinsvVfyeBSsknOQeHc7WWBVNsl-FWmOh3wRlx2J@mail.gmail.com>
	<4C29FA80.1080105@bobich.net>
Message-ID: <AANLkTimeku4gK8UzNCVZnkJVN9D2v5yYnCrK0j_GQbCO@mail.gmail.com>

2010/6/29 Gordan Bobic <gordan at bobich.net>

> On 06/29/2010 01:26 PM, ESGLinux wrote:
>
>> Hi all,
>>
>> I need to mount a two nodes cluster with postgres as service. I have
>> mounted it in the past but with a shared storage and using GFS but now I
>> don't have this element.
>>
>
> Any particular reason why you cannot use DRBD to provide that element
> without a standalone SAN?
>
>
The reason is that I haven't done it before ;-). Perhaps it's time to take a
look at DRBD...


>
>> The idea is to have a master node with all the data in its own disk and
>> have a mechanism to replicate this data to the slave node in its own
>> disk. If the master goes down the slave begin to give the service and
>> the flow of data will go from this node to the other one. (the slave
>> node becomes the master one)
>>
>
> Have a look at Bucardo replication system for PostgreSQL:
> http://bucardo.org/
>
>



> It is as similar a solution as PostgreSQL has available to MySQL's
> replication.
>
> It's master-slave only, though, there is no provision for master-master
> replication like in MySQL (but master-master is riddled with race conditions
> anyway, and you shouldn't be using it if you don't understand the edge cases
> that are likely to break your app).
>
> Have a look here for more info on PostgreSQL's replication/clustering
> functionality:
>
>
> http://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
>
>
Interesting link, I'm going to evaluate these solutions.


> You're probably better off asking about this stuff on the PostgreSQL
> mailing lists, though.
>
>
I'll do it.

Thank you all for your answers



> Gordan
>
>
>
ESG



From zagar at arlut.utexas.edu  Wed Jun 30 14:22:26 2010
From: zagar at arlut.utexas.edu (Randy Zagar)
Date: Wed, 30 Jun 2010 09:22:26 -0500
Subject: [Linux-cluster] RedHat RHEL 5U4 NFS Cluster nodes randomly
	reboot
In-Reply-To: <mailman.20241.1277763744.8372.linux-cluster@redhat.com>
References: <mailman.20241.1277763744.8372.linux-cluster@redhat.com>
Message-ID: <4C2B5322.9030305@arlut.utexas.edu>

Yes.  My experience is that you can't currently nfs-export *any* GFS or 
GFS2 filesystems.

Exporting EXT3/EXT4 filesystems, however, doesn't appear to be a problem.

-Randy Zagar <zagar at arlut.utexas.edu>

On 06/28/2010 05:22 PM, linux-cluster-request at redhat.com wrote:
> From: Bennie Thomas<Bennie_R_Thomas at raytheon.com>
> Subject: [Linux-cluster] RedHat RHEL 5U4 NFS Cluster nodes randomly
> 	reboot
> Message-ID:<4C290BE0.3090707 at raytheon.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> I currently have 2 DL380 G6 with an HP MSA2312 disk array, running
> Redhat 5u4 64bit.  I have a quorum disk.  I use the cluster as an
> active/passive NFS cluster.
> The problem I am having is that one or both of the nodes will randomly
> reboot. Has anyone experienced this problem?
>
>    

-- 
Randy Zagar                               Sr. Unix Systems Administrator
E-mail: zagar at arlut.utexas.edu            Applied Research Laboratories
Phone: 512 835-3131                       Univ. of Texas at Austin



From swhiteho at redhat.com  Wed Jun 30 14:35:21 2010
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 30 Jun 2010 15:35:21 +0100
Subject: [Linux-cluster] RedHat RHEL 5U4 NFS Cluster nodes randomly
 reboot
In-Reply-To: <4C2B5322.9030305@arlut.utexas.edu>
References: <mailman.20241.1277763744.8372.linux-cluster@redhat.com>
	<4C2B5322.9030305@arlut.utexas.edu>
Message-ID: <1277908521.3158.441.camel@localhost.localdomain>

Hi,

On Wed, 2010-06-30 at 09:22 -0500, Randy Zagar wrote:
> Yes.  My experience is that you can't currently nfs-export *any* GFS or 
> GFS2 filesystems.
> 
You can, but only a fairly small number of the possible configurations
will actually work. We do hope to expand that a bit in the future, but for
the time being it's best to stick to an active/passive failover export
which is not mixed with any other protocol (Samba) or any local
applications.

> Exporting EXT3/EXT4 filesystems, however, doesn't appear to be a problem.
> 
> -Randy Zagar <zagar at arlut.utexas.edu>
> 
> On 06/28/2010 05:22 PM, linux-cluster-request at redhat.com wrote:
> > From: Bennie Thomas<Bennie_R_Thomas at raytheon.com>
> > Subject: [Linux-cluster] RedHat RHEL 5U4 NFS Cluster nodes randomly
> > 	reboot
> > Message-ID:<4C290BE0.3090707 at raytheon.com>
> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> >
> > I currently have 2 DL380 G6 with an HP MSA2312 disk array, running
> > Redhat 5u4 64bit.  I have a quorum disk.  I use the cluster as an
> > active/passive NFS cluster.
> > The problem I am having is that one or both of the nodes will randomly
> > reboot. Has anyone experienced this problem?
> >
Is the node being fenced? This might be down to excessive network
traffic blocking the cluster traffic and making it appear as if the node
is down when it isn't, or something similar to that. Do you get any log
messages?

Steve.
