From jeff.sturm at eprize.com  Wed Jul  1 03:57:47 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Tue, 30 Jun 2009 23:57:47 -0400
Subject: [Linux-cluster] Did you use GFS with witch technology?
In-Reply-To: <1246378523.7787.12.camel@tuxkiller>
References: <1246378523.7787.12.camel@tuxkiller>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Tiago Cruz
> Sent: Tuesday, June 30, 2009 12:15 PM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Did you use GFS with witch technology?
> 
> Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not
Virtual?
> Witch version are you using? v1 or v2?

Xen here, with GFS1.  Works great.  Pay attention to the performance
optimizations (noatime, etc.) including statfs_fast if you are on GFS1.

We export LUNs from our SAN to each domU using tap:sync driver.
Performance seems to be limited by our SAN.  Each domU in our setup has
two vif's:  one for openais, another for everything else, though I can't
say if that helps or hurts performance.

Jeff


From agx at sigxcpu.org  Wed Jul  1 11:57:25 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Wed, 1 Jul 2009 13:57:25 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <1246306200.25867.86.camel@cerberus.int.fabbione.net>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
	<20090629184848.GA25796@bogon.sigxcpu.org>
	<1246306200.25867.86.camel@cerberus.int.fabbione.net>
Message-ID: <20090701115725.GA6565@bogon.sigxcpu.org>

On Mon, Jun 29, 2009 at 10:10:00PM +0200, Fabio M. Di Nitto wrote:
> > 1246297857 fenced 3.0.0.rc3 started
> > 1246297857 our_nodeid 1 our_name node2.foo.bar
> > 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
> > 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager
And it also leads to:

dlm_controld[14981]: fenced_domain_info error -1 

so it's not possible to get the node back without rebooting.

> It looks to me the node has not been shutdown properly and an attempt to
> restart it did fail. The fenced segfault shouldn't happen but I am
> CC'ing David. Maybe he has a better idea.
> 
> > 
> > when trying to restart fenced. Since this is not possible one has to
> > reboot the node.
> > 
> > We're also seeing:
> > 
> > Jun 29 19:29:03 node2 kernel: [   50.149855] dlm: no local IP address has been set
> > Jun 29 19:29:03 node2 kernel: [   50.150035] dlm: cannot start dlm lowcomms -107
> 
> hmm this looks like a bad configuration to me or bad startup.
> 
> IIRC dlm kernel is configured via configfs and probably it was not
> mounted by the init script.
It is. 

> > from time to time. Stopping/starting via cman's init script (as from the
> > Ubuntu package) several times makes this go away.
> > 
> > Any ideas what causes this?
> 
> Could you please try to use our upstream init scripts? They work just
> fine (unchanged) in ubuntu/debian environment and they are for sure a
> lot more robust than the ones I originally wrote for Ubuntu many years
> ago.
Tested that without any notable change.

> Could you also please summarize your setup and config? I assume you did
> the normal checks such as cman_tool status, cman_tool nodes and so on...
> 
> The usual extra things I'd check are:
> 
> - make sure the hostname doesn't resolve to localhost but to the real ip
> address of the cluster interface
> - cman_tool status
> - cman_tool nodes
These all do look o.k. However:

> - Before starting any kind of service, such as rgmanager or gfs*, make
> sure that the fencing configuration is correct. Test by using fence_node
> $nodename.
fence_node node1

gives the segfaults at the same locationo as described above which seems
to be the cause of the trouble. (Howvever "fence_ilo -z -l user -p pass
-a iloip" works as expected). 
The segfault happens in fence/libfence/agent.c's make_args where the
second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non NULL)
str. Doing this xpath lookup by hand looks fine. So it seems
ccs_get_list is returning corrupted pointers. I've attached the current
clluster.conf.
Cheers,
 -- Guido

-------------- next part --------------
?xml version="1.0"?>
<cluster config_version="5" name="cl">
  <cman two_node="1" expected_votes="2">
  </cman>
  <dlm log_debug="1"/>
  <clusternodes>
    <clusternode name="node1.foo.bar" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fence1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.foo.bar" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fence2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="rnode1.foo.bar" login="reboot" name="node1" passwd="pass"/>
    <fencedevice agent="fence_ilo" hostname="rnode2.foo.bar" login="reboot" name="node2" passwd="pass"/>
  </fencedevices>

  <rm log_level="7">
   <failoverdomains>
      <failoverdomain name="kvm-hosts" ordered="1">
        <failoverdomainnode name="node1.foo.bar"/>
        <failoverdomainnode name="node2.foo.bar"/>
      </failoverdomain>
   </failoverdomains>
   <resources>
       <virt name="test11" />
       <virt name="test12" />
   </resources>
   <service name="test11">
        <virt ref="test11"/>
   </service>
   <service name="test12">
        <virt ref="test12"/>
   </service>
  </rm>
</cluster>

From fdinitto at redhat.com  Wed Jul  1 13:23:56 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Wed, 01 Jul 2009 15:23:56 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <20090701115725.GA6565@bogon.sigxcpu.org>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
	<20090629184848.GA25796@bogon.sigxcpu.org>
	<1246306200.25867.86.camel@cerberus.int.fabbione.net>
	<20090701115725.GA6565@bogon.sigxcpu.org>
Message-ID: <1246454636.19414.30.camel@cerberus.int.fabbione.net>

Hi Guido,

On Wed, 2009-07-01 at 13:57 +0200, Guido G?nther wrote:

> > - Before starting any kind of service, such as rgmanager or gfs*, make
> > sure that the fencing configuration is correct. Test by using fence_node
> > $nodename.
> fence_node node1
> 
> gives the segfaults at the same locationo as described above which seems
> to be the cause of the trouble. (Howvever "fence_ilo -z -l user -p pass
> -a iloip" works as expected). 
> The segfault happens in fence/libfence/agent.c's make_args where the
> second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non NULL)
> str. Doing this xpath lookup by hand looks fine. So it seems
> ccs_get_list is returning corrupted pointers. I've attached the current
> clluster.conf.

I am having problems to reproduce this problem and I'll need your help.

First of all I replicated your configuration:

<?xml version="1.0"?>
<cluster name="fabbione" config_version="1" alias="fabbione">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="node1.foo.bar" votes="1" nodeid="1">
      <fence>
        <method name="1">
          <device name="fence1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.foo.bar" votes="1" nodeid="4">
      <fence>
        <method name="1">
          <device name="fence2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="node1" agent="fence_virsh" port="fedora-rh-node1"
ipaddr="daikengo.int.fabbione.net" login="root" secure="1"
identity_file="/root/.ssh/id_rsa"/>
    <fencedevice name="node2" agent="fence_virsh" port="fedora-rh-node4"
ipaddr="daikengo.int.fabbione.net" login="root" secure="1"
identity_file="/root/.ssh/id_rsa"/>
  </fencedevices>
</cluster>

as you can see node names and fencing methods are the same.

I don't have ilo but it shouldn't matter.

Now my question is: did you mangle the configuration you sent me
manually? because there is no matching entry between device to use for a
node and the fencedevices section and I get:

[root at node2]# fence_node -vv node1
fence node1 dev 0.0 agent none result: error config agent
agent args: 
fence node1 failed

Now if i change device name="fenceX" to name="nodeX" there is a matching
and:

[root at node2 cluster]# fence_node -vv node1
fence node1 dev 0.0 agent fence_virsh result: success
agent args: agent=fence_virsh port=fedora-rh-node1
ipaddr=daikengo.int.fabbione.net login=root secure=1
identity_file=/root/.ssh/id_rsa 
fence node1 success

and I still don't see the segfault...

Since you can reproduce the problem regularly I'd really like to see
some debugging output of libfence to start with. I'd really appreciate
if you could help us.

test 1:

Please add a bunch fprintf(stderr, to agents.c to see the created XPath
queries and the result coming back from libccs.

If you could please collect the output and send it to me.

test 2:

If you could please find:

cd = ccs_connect(); (line 287 in agent.c)
and right before that add:
fullxpath=1;

That change will ask libccs to use a different Xpath engine internally.

And then re-run test1.

This should be able to isolate pretty much the problem and give me
enough information to debug the issue.

the next question is: are you running on some fancy architecture? Maybe
something in that environment is not initialized properly (the garbage
string you get back from libccs sounds like that) but on more common
arches like x86/x86_64 gcc takes care of that for us.... (really wild
guessing but still something to fix!).

Thanks
Fabio


From jeff.sturm at eprize.com  Wed Jul  1 13:50:36 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 1 Jul 2009 09:50:36 -0400
Subject: [Linux-cluster] Recovering from "telling LM to withdraw"
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local>

Recently we had a cluster node fail with a failed assertion:

 
Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal:
assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)"
failed

Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8:   function =
gfs_trans_add_gl

Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8:   file =
/builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c, line
= 237

Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8:   time =
1246022619

Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to
withdraw from the cluster

Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM
to withdraw

 
This is with CentOS 5.2, GFS1.  The cluster had been operating
continuously for about 3 months.

 
My challenge isn't in preventing assertion failures entirely-I recognize
lurking software bugs and hardware anomalies can lead to a failed node.
Rather, I want to prevent one node from freezing the cluster.  When the
above was logged, all nodes in the cluster which access the tb2data
filesystem also froze and did not recover.  We recovered with a rolling
cluster restart and a precautionary gfs_fsck.

 
Most cluster problems can be quickly handled by the fence agents.  The
"telling LM to withdraw" does not trigger a fence operation, or any
other automated recovery.  I need a deployment strategy to fix that.
Should I write an agent to scan the syslog, match on the message above,
and fence the node?

 
Has anyone else encountered the same problem?  If so, how did you get
around it?

 
-Jeff


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090701/5b38bae7/attachment.htm>

From garromo at us.ibm.com  Wed Jul  1 14:21:26 2009
From: garromo at us.ibm.com (Gary Romo)
Date: Wed, 1 Jul 2009 08:21:26 -0600
Subject: [Linux-cluster] GFS on stand alone
Message-ID: <OFD984CCCF.19CF035C-ON872575E6.004EBFB1-872575E6.004EDE20@us.ibm.com>


Can GFS be used on a stand alone server without RHCS running?

Any pro's or con's to this type of setup?    Thanks.

-Gary Romo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090701/2c91e344/attachment.htm>

From cthulhucalling at gmail.com  Wed Jul  1 14:32:26 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Wed, 1 Jul 2009 07:32:26 -0700
Subject: [Linux-cluster] GFS on stand alone
In-Reply-To: <OFD984CCCF.19CF035C-ON872575E6.004EBFB1-872575E6.004EDE20@us.ibm.com>
References: <OFD984CCCF.19CF035C-ON872575E6.004EBFB1-872575E6.004EDE20@us.ibm.com>
Message-ID: <36df569a0907010732m450ae24eu8e5827ee3a37b93f@mail.gmail.com>

Yes it can. Use lock_nolock as your locking protocol.

On Wed, Jul 1, 2009 at 7:21 AM, Gary Romo <garromo at us.ibm.com> wrote:

> Can GFS be used on a stand alone server without RHCS running?
>
> Any pro's or con's to this type of setup? Thanks.
>
> -Gary Romo
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090701/efb13a1c/attachment.htm>

From Andrea.Giussani at nposistemi.it  Wed Jul  1 14:59:41 2009
From: Andrea.Giussani at nposistemi.it (Giussani Andrea)
Date: Wed, 1 Jul 2009 16:59:41 +0200
Subject: [Linux-cluster] Package Apache and Mysql Problem
Message-ID: <F0CC33D1F1858A4E8A01D8315C683E290227126E0353@NPOMAILLCR1.nposistemi.it>

Hi,

i have a little big problem with RH Cluster Suite.

I have 2 cluster nodes with 1 partition to share between the 2 node. There is no SAN.
The node have the same hardware and the same partition.
I have 1 partition with drbd to sycronize the 2 nodes Primary/Primary.

I try in a lot type of configuration of Apache and Mysql package but i have the same problem.
The error is:
Jul  1 18:50:39 nodo1 luci[2581]: Unable to retrieve batch 1072342062 status from nodo2.local:11111: clusvcadm start failed to start Httpd:

nodo1 and nodo2 is the 2 nodes and httpd is the apache service.

Any idea???

I try the configuration in this procedure: http://kbase.redhat.com/faq/docs/DOC-5648 for Mysql but the result is the same.

In attach my cluster.conf and drbd.conf

If we need more tell me please.

Thanks a lot

Andrea Giussani


AVVERTENZE AI SENSI DEL D.LGS. 196/2003    . 

Il contenuto di questo messaggio (ed eventuali allegati) e' strettamente confidenziale. L'utilizzo del contenuto del messaggio e' riservato esclusivamente al destinatario. La modifica, distribuzione, copia del messaggio da parte di altri e' severamente proibita. Se non siete i destinatari Vi invitiamo ad informare il mittente ed eliminare tutte le copie del suddetto messaggio     .               

The content of this message (and attachment) is closely confidentiality. Use of the content of the message is classified exclusively to the addressee. The modification, distribution, copy of the message from others are forbidden. If you are not the addressees, we invite You to inform the sender and to eliminate all the copies of the message.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cluster.txt
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090701/49706431/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: drbd.txt
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090701/49706431/attachment-0001.txt>

From brettcave at gmail.com  Wed Jul  1 15:24:44 2009
From: brettcave at gmail.com (Brett Cave)
Date: Wed, 1 Jul 2009 17:24:44 +0200
Subject: [Linux-cluster] problem with heartbeat + ipvs
Message-ID: <c0773fd30907010824p52cc5b69l8a700854865953f8@mail.gmail.com>

hi all,

have a problem with HA / LB system, using heartbeat for HA and ldirector /
ipvs for load balancing.

When the primary node is shut down or heartbeat is stopped, the migration of
services works fine, but the loadbalancing does not work (ipvs rules are
active, but connect connect to HA services). Configs on primary and
secondary are the same:


haresources:
primary 172.16.5.1/16/bond0 ldirectord::ldirectord.cf

ldirectord.cf:
virtual = 172.16.5.1:3306
       service = mysql
       real = 172.16.10.1:3306 gate 1000
       checktype, login, passwd, database, request values all set
       scheduler = sed

ip_forward is enabled (checked via /proc, configured via sysctl)


network configs are almost the same except for the IP address (using a
bonded interface in active/passive mode)
have set iptables policies to ACCEPT with rules that would not block the
traffic (99.99% sure on this).

if i try connect from a server such as 172.16.10.10, i cannot connect if the
secondary is up:
[user at someserver]$ mysql -h 172.16.5.1
ERROR 2003 (HY000): Can't connect to MySQL server on '172.16.5.1' (111)


perror shows that 111 is Connection Refused

running a sniffer on the secondary HA box, i dont see the tcp 3306 packets
coming in.

the arp_ignore / arp_announce kernel params are configured on teh real
server, HA ip address is added on a /32 subnet to the lo interface, etc,
etc.... (everything works 100% when primary is up).

sure it is something i have overlooked, any idea's?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090701/6e1ce1f9/attachment.htm>

From agx at sigxcpu.org  Wed Jul  1 16:40:07 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Wed, 1 Jul 2009 18:40:07 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <1246454636.19414.30.camel@cerberus.int.fabbione.net>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
	<20090629184848.GA25796@bogon.sigxcpu.org>
	<1246306200.25867.86.camel@cerberus.int.fabbione.net>
	<20090701115725.GA6565@bogon.sigxcpu.org>
	<1246454636.19414.30.camel@cerberus.int.fabbione.net>
Message-ID: <20090701164007.GA10680@bogon.sigxcpu.org>

Hi Fabio,
On Wed, Jul 01, 2009 at 03:23:56PM +0200, Fabio M. Di Nitto wrote:
> Now my question is: did you mangle the configuration you sent me
> manually? because there is no matching entry between device to use for a
> node and the fencedevices section and I get:
Yes, I had to get some internal names out. This is what went wrong:

-<fencedevice agent="fence_ilo" hostname="rnode1.foo.bar" login="reboot" name="node1" passwd="pass"/>
+<fencedevice agent="fence_ilo" hostname="rnode1.foo.bar" login="reboot" name="fence1" passwd="pass"/>
									       ^^^^^^

(same for node2/fence2).
> Since you can reproduce the problem regularly I'd really like to see
> some debugging output of libfence to start with. I'd really appreciate
> if you could help us.
> 
> test 1:
> 
> Please add a bunch fprintf(stderr, to agents.c to see the created XPath
> queries and the result coming back from libccs.

# fence_node -vv node2
make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@*
make_args(156)
Segmentation fault

> test 2:
> 
> If you could please find:
> 
> cd = ccs_connect(); (line 287 in agent.c)
> and right before that add:
> fullxpath=1;
>
> That change will ask libccs to use a different Xpath engine internally.
> 
> And then re-run test1.
# fence_node -vv node2
fence_node(289): fullxpath: 0
fence_node(291): fullxpath: 1
make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@*
make_args(156)
Segmentation fault

make_args(156) is just before the strncmp. Trying to print out str
results in a segfault too (that's why it's missing from the output).

[..snip..]  
> the next question is: are you running on some fancy architecture? Maybe
> something in that environment is not initialized properly (the garbage
> string you get back from libccs sounds like that) but on more common
> arches like x86/x86_64 gcc takes care of that for us.... (really wild
> guessing but still something to fix!).
Nothing fancy here:

# uname -a
Linux vm41 2.6.30-1-amd64 #1 SMP Sun Jun 14 15:00:29 UTC 2009 x86_64
GNU/Linux

Cheers,
 -- Guido


From adas at redhat.com  Wed Jul  1 16:43:26 2009
From: adas at redhat.com (Abhijith Das)
Date: Wed, 01 Jul 2009 11:43:26 -0500
Subject: [Linux-cluster] Recovering from "telling LM to withdraw"
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local>
References: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local>
Message-ID: <4A4B922E.5090301@redhat.com>

Jeff Sturm wrote:
>
> Recently we had a cluster node fail with a failed assertion:
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: fatal:
> assertion "gfs_glock_is_locked_by_me(gl) &&
> gfs_glock_is_held_excl(gl)" failed
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: function =
> gfs_trans_add_gl
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: file =
> /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_/src/gfs/trans.c,
> line = 237
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: time =
> 1246022619
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: about to
> withdraw from the cluster
>
> Jun 26 09:23:39 mqc02 kernel: GFS: fsid=dc1rhc:tb2data.1.8: telling LM
> to withdraw
>
> This is with CentOS 5.2, GFS1. The cluster had been operating
> continuously for about 3 months.
>
> My challenge isn't in preventing assertion failures entirely?I
> recognize lurking software bugs and hardware anomalies can lead to a
> failed node. Rather, I want to prevent one node from freezing the
> cluster. When the above was logged, all nodes in the cluster which
> access the tb2data filesystem also froze and did not recover. We
> recovered with a rolling cluster restart and a precautionary gfs_fsck.
>
> Most cluster problems can be quickly handled by the fence agents. The
> "telling LM to withdraw" does not trigger a fence operation, or any
> other automated recovery. I need a deployment strategy to fix that.
> Should I write an agent to scan the syslog, match on the message
> above, and fence the node?
>
> Has anyone else encountered the same problem? If so, how did you get
> around it?
>
> -Jeff
>
https://bugzilla.redhat.com/show_bug.cgi?id=471258

The assert+withdraw you're seeing seems to be this bug above. I've tried
to recreate this on my cluster and failed. If you have a recipe to
create this, could you please post it to the bugzilla?

Meanwhile, I'll look at the code again to see if I can spot anything.

Thanks!
--Abhi


From fdinitto at redhat.com  Wed Jul  1 17:12:07 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Wed, 01 Jul 2009 19:12:07 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <20090701164007.GA10680@bogon.sigxcpu.org>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
	<20090629184848.GA25796@bogon.sigxcpu.org>
	<1246306200.25867.86.camel@cerberus.int.fabbione.net>
	<20090701115725.GA6565@bogon.sigxcpu.org>
	<1246454636.19414.30.camel@cerberus.int.fabbione.net>
	<20090701164007.GA10680@bogon.sigxcpu.org>
Message-ID: <1246468327.19414.65.camel@cerberus.int.fabbione.net>

On Wed, 2009-07-01 at 18:40 +0200, Guido G?nther wrote:
> Hi Fabio,
> On Wed, Jul 01, 2009 at 03:23:56PM +0200, Fabio M. Di Nitto wrote:
> > Now my question is: did you mangle the configuration you sent me
> > manually? because there is no matching entry between device to use for a
> > node and the fencedevices section and I get:
> Yes, I had to get some internal names out. This is what went wrong:
> 
> -<fencedevice agent="fence_ilo" hostname="rnode1.foo.bar" login="reboot" name="node1" passwd="pass"/>
> +<fencedevice agent="fence_ilo" hostname="rnode1.foo.bar" login="reboot" name="fence1" passwd="pass"/>


Ok perfect thanks.

> 
> # fence_node -vv node2
> make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@*
> make_args(156)
> Segmentation fault
> 
> > test 2:
> > 
> > If you could please find:
> > 
> > cd = ccs_connect(); (line 287 in agent.c)
> > and right before that add:
> > fullxpath=1;
> >
> > That change will ask libccs to use a different Xpath engine internally.
> > 
> > And then re-run test1.
> # fence_node -vv node2
> fence_node(289): fullxpath: 0
> fence_node(291): fullxpath: 1
> make_args(149): /cluster/fencedevices/fencedevice[@name="fence2"]/@*
> make_args(156)
> Segmentation fault
> 
> make_args(156) is just before the strncmp. Trying to print out str
> results in a segfault too (that's why it's missing from the output).

No matter what, I can't trigger this segfault.

Do you have a build log for the package? and could you send me the
make/defines.mk in the build tree?

gcc versions and usual tool chain info.. maybe it's a gcc bug or maybe
it's an optimization that behaves differently between debian and fedora.

I have attached a small test case to simply test libccs. At this point I
don't believe it's a problem in libfence. Could you please run it for me
and send me the output? If the bug is in libccs this would start
isolating it.

[root at fedora-rh-node4 ~]# gcc -Wall -o testccs main.c -lccs
[root at fedora-rh-node4 ~]# ./testccs 
-hopefully some output-

and please check the XPath query at the top of main.c as it could be
slightly different given your config.

Thanks
Fabio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: main.c
Type: text/x-csrc
Size: 528 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090701/54cb4cf0/attachment.bin>

From Luis.Cerezo at pgs.com  Wed Jul  1 18:24:07 2009
From: Luis.Cerezo at pgs.com (Luis Cerezo)
Date: Wed, 1 Jul 2009 13:24:07 -0500
Subject: [Linux-cluster] qdisk best practices
Message-ID: <15D5002F61F31A45A82A153D2F73906760FBD3F011@HOUMS26.onshore.pgs.com>

Hi all-

i've got a RHEL 5.3 cluster, 2node with qdisk. All works fine, but the qdisk seems to beat on the SAN (I/Ops) I adjusted the interval from the default of 1 to 5 and it is still high (san admin is crying)

does anyone have best practices for this? its an LSI san and both nodes are mulitpathed to it via 4G FC.

thanks!


Luis E. Cerezo
PGS
Global IT

This e-mail, any attachments and response string may contain proprietary information, which are confidential and may be legally privileged.  It is for the intended recipient only and if you are not the intended recipient or transmission error has misdirected this e-mail, please notify the author by return e-mail and delete this message and any attachment immediately.  If you are not the intended recipient you must not use, disclose, distribute, forward, copy, print or rely in this e-mail in any way except as permitted by the author.


From tiagocruz at forumgdh.net  Wed Jul  1 19:00:15 2009
From: tiagocruz at forumgdh.net (Tiago Cruz)
Date: Wed, 01 Jul 2009 16:00:15 -0300
Subject: [Linux-cluster] Did you use GFS with witch technology?
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local>
References: <1246378523.7787.12.camel@tuxkiller>
	<64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local>
Message-ID: <1246474815.7192.148.camel@tuxkiller>

Thanks guys for all the comments!

Just one more question:

I have 10 VM inside a apache cluster, and I've compiled one httpd inside
GFS, some like (/gfs/httpd_servers/bin-2.2.9).

Did you see any problem with this? How do you use Apache with GFS?

-- 
Tiago Cruz <tiagocruz at forumgdh.net>


On Tue, 2009-06-30 at 23:57 -0400, Jeff Sturm wrote:
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com]
> > On Behalf Of Tiago Cruz
> > Sent: Tuesday, June 30, 2009 12:15 PM
> > To: linux-cluster at redhat.com
> > Subject: [Linux-cluster] Did you use GFS with witch technology?
> > 
> > Did you use GFS and witch technology? KVM? Xen? VirtualBox? Not
> Virtual?
> > Witch version are you using? v1 or v2?
> 
> Xen here, with GFS1.  Works great.  Pay attention to the performance
> optimizations (noatime, etc.) including statfs_fast if you are on GFS1.
> 
> We export LUNs from our SAN to each domU using tap:sync driver.
> Performance seems to be limited by our SAN.  Each domU in our setup has
> two vif's:  one for openais, another for everything else, though I can't
> say if that helps or hurts performance.
> 
> Jeff
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From brem.belguebli at gmail.com  Wed Jul  1 21:09:43 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Wed, 1 Jul 2009 23:09:43 +0200
Subject: [Linux-cluster] Did you use GFS with witch technology?
In-Reply-To: <1246397778.7787.67.camel@tuxkiller>
References: <1246378523.7787.12.camel@tuxkiller>
	<36df569a0906301254p5dcece20g1336aece80bcd708@mail.gmail.com>
	<1246396045.7787.45.camel@tuxkiller>
	<29ae894c0906301429g6550e907hfe633c28c75c08eb@mail.gmail.com>
	<1246397778.7787.67.camel@tuxkiller>
Message-ID: <29ae894c0907011409k470d1405qe23d7041b6c13dce@mail.gmail.com>

Hi,
Ok I understand, that should be supported.

If your problems (freeze, qdisk loose, etc..;) occur when your VMs are under
high load (CPU, RAM, disk IOs ?) why don't you configure more VCPUs on your
VMs, split the number of VMs accross more LUNs, etc..

 The thing to keep in mind, there is no way under ESX to limit the IO rate
per VM, and ESX 3.5 doesn't support multipathing, if the bottleneck is more
located on the disk subsystem.


2009/6/30 Tiago Cruz <tiagocruz at forumgdh.net>

> Not,
>
> What I did:
>
> I have 10 virtual machines.
>
> I have one LUN with 200 GB formated by ESX using VMFS.
>
> Inside this lun, I have a lot of small pieces of 10 GB (the "/" os each
> one virtual machine) formated by RHEL 5.x using EXT3.
>
> And my GFS was in another LUN, called DRM (some like Direct Raw Mapping)
> where the LUN was delivered to VM without pass "inside" of ESX.
>
> Can you understood of I've complicated ever more? :-p
> --
> Tiago Cruz <tiagocruz at forumgdh.net>
>
>
> On Tue, 2009-06-30 at 23:29 +0200, brem belguebli wrote:
> > Not really,
> >
> >
> > VMFS is the clustered filesystem shipped with ESX.
> >
> >
> > If I understand well, you got the source code of GFS that you did
> > recompile on your ESX host, is that it ?
> >
> >
> > I think you're already out of support from VMware if so.
> >
> >
> >
> >
> > 2009/6/30 Tiago Cruz <tiagocruz at forumgdh.net>
> >         Hello  Ian,
> >
> >         'cause AFAIK I can't format one block device with VMFS.
> >         You can think in VMFS in some like LVM - just one abstraction
> >         layer and
> >         not a FS itself :)
> >
> >         --
> >         Tiago Cruz <tiagocruz at forumgdh.net>
> >
> >
> >
> >
> >         On Tue, 2009-06-30 at 12:54 -0700, Ian Hayes wrote:
> >         >
> >         >
> >         > On Tue, Jun 30, 2009 at 9:15 AM, Tiago Cruz
> >         <tiagocruz at forumgdh.net>
> >         > wrote:
> >         >         Hello, guys.. please... I need to know a little
> >         thing:
> >         >
> >         >         I'm using GFS v1 with ESX 3.5 and I'm not very
> >         happy :)
> >         >         High load from vms, freeze and quorum lost, for
> >         example.
> >         >
> >         >         Did you use GFS and witch technology? KVM? Xen?
> >         VirtualBox?
> >         >         Not Virtual?
> >         >         Witch version are you using? v1 or v2?
> >         >
> >         >         Are you a happy people using this? =)
> >         >
> >         > If you're using ESX, why are you using GFS instead of VMFS?
> >         >
> >         >
> >
> >
> >         > --
> >         > Linux-cluster mailing list
> >         > Linux-cluster at redhat.com
> >         > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >         --
> >         Linux-cluster mailing list
> >         Linux-cluster at redhat.com
> >         https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090701/abeb47a8/attachment.htm>

From fdinitto at redhat.com  Wed Jul  1 23:16:30 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Thu, 02 Jul 2009 01:16:30 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
Message-ID: <1246490190.19414.93.camel@cerberus.int.fabbione.net>

The cluster team and its community are proud to announce the
3.0.0.rc4 release candidate from the STABLE3 branch.

This should be the last release candidate unless major problems will be
found during the final testing stage.

Everybody with test equipment and time to spare, is highly encouraged to
download, install and test this release candidate and more important
report problems. This is the time for people to make a difference and
help us testing as much as possible.

In order to build the 3.0.0.rc4 release you will need:

- corosync 0.100 (1.0.0.rc1)
- openais 0.100 (1.0.0.rc1)
- linux kernel 2.6.29

The new source tarball can be downloaded here:

ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.rc4.tar.gz
https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.rc5.tar.gz

At the same location is now possible to find separated tarballs for
fence-agents and resource-agents as previously announced
(http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm)

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Happy clustering,
Fabio

Under the hood (from 3.0.0.rc3):

Bob Peterson (4):
      GFS2: gfs2_convert, parameter not understood on ppc
      /sbin/mount.gfs2: can't find /proc/mounts entry for directory /
      Message printed to stderr instead of stdout
      gfs_fsck: Segfault in EA leaf repair

Christine Caulfield (3):
      cman: use api->shutdown_request instead of api->request_shutdown
      cman: Fix some compile-time warning
      dlm: Fix some compile warnings

Fabio M. Di Nitto (17):
      gfs: kill dead test code
      gfs2: drop dead test code
      build: enable fence_xvm by default
      config: fix warnings in confdb2ldif
      config: use HDB_X instead of _D
      gfs: add missing format attributes
      gfs2: handle output conversion properly
      gfs2: add missing casts
      gfs2: make functions static
      gfs2: backport coding format from master
      gfs2: resync internationalization support from master
      cman: port to the latest corosync API
      cman init: stop qdiskd only if enabled
      qdiskd: fix log file name
      cman init: don't stop fence_xvmd if we don't know the status
      cman init: readd support for fence_xvmd standalone operations
      Revert "gfs-kernel: enable FS_HAS_FREEZE"

Federico Simoncelli (1):
      rgmanager: Allow vm.sh use of libvirt XML file

Jim Meyering (5):
      src/clulib/ckpt_state.c (ds_key_init_nt): detect failed malloc
      dlm/tests: handle malloc failure
      cman: handle malloc failure (i.e., don't deref NULL)
      dlm_controld: handle heap allocation failure and plug leaks
      dlm_controld: add comments: mark memory problems

Lon Hohberger (42):
      rgmanager: Fix ptr arithmetic and C90 warnings
      rgmanager: Fix rg_locks.c build warnings
      rgmanager: Fix rg_strings.c build warnings
      rgmanager: Fix members.c and related build warnings
      rgmanager: Change ccs_read_old_logging to static
      rgmanager: Fix daemon_init related warnings
      rgmanager: Remove unused function
      rgmanager: Remove unused proof-of-concept code
      rgmanager: Fix build warnings in cman.c
      rgmanager: Fix build warnings in fdops.c
      rgmanager: Fix vft.c and related build warnings
      rgmanager: Fix msgtest.c build warnings
      rgmanager: Fix complier warnings in msg_cluster.c
      rgmanager: Fix build warnings in msg_socket.c
      rgmanager: Fix build warnings in msgtest.c
      rgmanager: Fix fo_domain.c build warnings
      rgmanager: Fix fo_domain.c build warnings (part 2)
      rgmanager: Fix clufindhostname.c build warnings
      rgmanager: Fix clustat.c build warnings
      rgmanager: Fix clusvcadm.c build warnings
      rgmanager: Fix clulog.c build warnings
      rgmanager: groups.c cleanup
      rgmanager: Cleanups around main.c
      rgmanager: Fix reslist.c complier warnings
      rgmanager: Fix resrules.c compiler warnings
      rgmanager: Fix restree.c compiler warnings
      rgmanager: Clean up rg_event.c and related build warnings
      rgmanager: Fix rg_forward.c build warnings
      rgmanager: Fix rg_queue.c build warnings
      rgmanager: Clean up rg_queue.c and related warnings
      rgmanager: Clean up slang_event.c and related warnings
      rgmanager: Fix last bits of compiler warnings
      rgmanager: Fix leaked context on queue fail
      rgmanager: Fix stop/start race
      rgmanager: Fix stack overflows on stress testing
      rgmanager: Fix small memory leak
      rgmanager: Don't push NULL on to the S/Lang stack
      rgmanager: Fix error message
      rgmanager: Fix --debug build
      fence: Make fence_node return 2 for no fencing
      rgmanager: follow-service.sl stack cleanup
      rgmanager: Allow exit while waiting for fencing

Marek 'marx' Grac (1):
      fence_wti: Fence agent for WTI ends with traceback when option is
missing

Steven Dake (1):
      fence: Fix missing case in switch statement

Steven Whitehouse (1):
      libgfs2: Use -o meta rather than gfs2meta fs type

 cman/daemon/ais.c                               |    7 +-
 cman/daemon/commands.c                          |    6 +-
 cman/daemon/daemon.c                            |    5 +-
 cman/daemon/daemon.h                            |    2 +-
 cman/init.d/cman.in                             |   27 +-
 cman/qdisk/main.c                               |    2 +-
 config/tools/ldap/confdb2ldif.c                 |    6 +-
 configure                                       |    8 -
 dlm/tests/usertest/alternate-lvb.c              |   10 +-
 dlm/tests/usertest/asttest.c                    |   14 +-
 dlm/tests/usertest/dlmtest.c                    |    6 +-
 dlm/tests/usertest/dlmtest2.c                   |    7 +-
 dlm/tests/usertest/flood.c                      |    7 +-
 dlm/tests/usertest/joinleave.c                  |    2 +-
 dlm/tests/usertest/lstest.c                     |   12 +-
 dlm/tests/usertest/lvb.c                        |   11 +-
 dlm/tests/usertest/pingtest.c                   |    8 +-
 dlm/tests/usertest/threads.c                    |   34 +-
 fence/agents/Makefile                           |   13 +-
 fence/agents/wti/fence_wti.py                   |   14 +-
 fence/agents/xvm/vm_states.c                    |    2 +
 fence/fence_node/fence_node.c                   |    6 +-
 fence/libfence/agent.c                          |    2 +-
 gfs-kernel/src/gfs/ops_fstype.c                 |    2 +-
 gfs/gfs_fsck/Makefile                           |    7 -
 gfs/gfs_fsck/log.c                              |    9 +-
 gfs/gfs_fsck/metawalk.c                         |    7 +-
 gfs/gfs_fsck/test_bitmap.c                      |   38 -
 gfs/gfs_fsck/test_block_list.c                  |   91 -
 gfs/libgfs/log.c                                |    9 +-
 gfs2/convert/gfs2_convert.c                     |    2 +-
 gfs2/fsck/Makefile                              |    6 -
 gfs2/fsck/fs_recovery.c                         |   34 +-
 gfs2/fsck/initialize.c                          |    6 +-
 gfs2/fsck/main.c                                |    2 +-
 gfs2/fsck/rgrepair.c                            |    2 +-
 gfs2/fsck/test_bitmap.c                         |   38 -
 gfs2/fsck/test_block_list.c                     |   91 -
 gfs2/libgfs2/misc.c                             |    2 +-
 gfs2/mkfs/main.c                                |    2 +-
 gfs2/mkfs/main_grow.c                           |    4 +-
 gfs2/mkfs/main_jadd.c                           |   11 +-
 gfs2/mkfs/main_mkfs.c                           |   10 +-
 gfs2/mount/util.c                               |   15 +-
 gfs2/tool/main.c                                |    2 +-
 group/dlm_controld/pacemaker.c                  |   15 +-
 make/defines.mk.input                           |    1 -
 rgmanager/include/daemon_init.h                 |    9 +
 rgmanager/include/depends.h                     |  134 --
 rgmanager/include/event.h                       |   10 +
 rgmanager/include/fo_domain.h                   |   48 +
 rgmanager/include/groups.h                      |   42 +
 rgmanager/include/lock.h                        |    4 +-
 rgmanager/include/members.h                     |    1 +
 rgmanager/include/message.h                     |   20 +-
 rgmanager/include/resgroup.h                    |   82 +-
 rgmanager/include/reslist.h                     |   51 +-
 rgmanager/include/restart_counter.h             |    2 +-
 rgmanager/include/rg_locks.h                    |    9 +
 rgmanager/include/rg_queue.h                    |    6 +-
 rgmanager/include/vf.h                          |   10 +-
 rgmanager/src/clulib/ckpt_state.c               |    1 +
 rgmanager/src/clulib/cman.c                     |    3 +-
 rgmanager/src/clulib/daemon_init.c              |    8 +-
 rgmanager/src/clulib/fdops.c                    |    5 +-
 rgmanager/src/clulib/lock.c                     |    4 +-
 rgmanager/src/clulib/logging.c                  |    4 +-
 rgmanager/src/clulib/members.c                  |   66 -
 rgmanager/src/clulib/message.c                  |   22 +-
 rgmanager/src/clulib/msg_cluster.c              |   13 +-
 rgmanager/src/clulib/msg_socket.c               |   12 +-
 rgmanager/src/clulib/msgtest.c                  |   19 +-
 rgmanager/src/clulib/rg_strings.c               |    2 +-
 rgmanager/src/clulib/vft.c                      |   53 +-
 rgmanager/src/daemons/Makefile                  |    6 +-
 rgmanager/src/daemons/depends.c                 | 2512
-----------------------
 rgmanager/src/daemons/dtest.c                   |  810 --------
 rgmanager/src/daemons/event_config.c            |   19 +-
 rgmanager/src/daemons/fo_domain.c               |   29 +-
 rgmanager/src/daemons/groups.c                  |   94 +-
 rgmanager/src/daemons/main.c                    |  173 +--
 rgmanager/src/daemons/reslist.c                 |   35 +-
 rgmanager/src/daemons/resrules.c                |   41 +-
 rgmanager/src/daemons/restree.c                 |   70 +-
 rgmanager/src/daemons/rg_event.c                |   30 +-
 rgmanager/src/daemons/rg_forward.c              |    6 +-
 rgmanager/src/daemons/rg_locks.c                |   12 +-
 rgmanager/src/daemons/rg_queue.c                |    8 +-
 rgmanager/src/daemons/rg_state.c                |  145 +-
 rgmanager/src/daemons/rg_thread.c               |   14 +-
 rgmanager/src/daemons/service_op.c              |   15 +-
 rgmanager/src/daemons/slang_event.c             |  266 ++--
 rgmanager/src/daemons/test.c                    |   72 +-
 rgmanager/src/daemons/watchdog.c                |    5 +
 rgmanager/src/resources/default_event_script.sl |   16 +-
 rgmanager/src/resources/follow-service.sl       |   10 +-
 rgmanager/src/resources/vm.sh                   |   17 +-
 rgmanager/src/utils/clufindhostname.c           |    2 +-
 rgmanager/src/utils/clulog.c                    |    4 +-
 rgmanager/src/utils/clustat.c                   |   67 +-
 rgmanager/src/utils/clusvcadm.c                 |   16 +-
 101 files changed, 939 insertions(+), 4812 deletions(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090702/aa53c18a/attachment.sig>

From jeff.sturm at eprize.com  Thu Jul  2 03:40:40 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 1 Jul 2009 23:40:40 -0400
Subject: [Linux-cluster] Did you use GFS with witch technology?
In-Reply-To: <1246474815.7192.148.camel@tuxkiller>
References: <1246378523.7787.12.camel@tuxkiller><64D0546C5EBBD147B75DE133D798665F02FDC1CF@hugo.eprize.local>
	<1246474815.7192.148.camel@tuxkiller>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC207@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Tiago Cruz
> Sent: Wednesday, July 01, 2009 3:00 PM
> To: linux clustering
> Subject: RE: [Linux-cluster] Did you use GFS with witch technology?
> 
> I have 10 VM inside a apache cluster, and I've compiled one httpd
inside
> GFS, some like (/gfs/httpd_servers/bin-2.2.9).

You can do that.  It sounds like most of the nodes may be accessing this
httpd instance read-only.  If that will be the case, consider using
spectator mounts on some of the nodes so you don't have to create 10
individual journals.

> Did you see any problem with this? How do you use Apache with GFS?

We actually use it for several purposes.  For one, we keep our document
root on GFS, so when web content is modified, the new content is
immediately visible to all web servers.  For another, we have a
file-based session implementation on a GFS mount.

The only real limitations I know of have to do with applications which
are not cluster-aware, and performance of heavy read-write loads.

-Jeff


From jeff.sturm at eprize.com  Thu Jul  2 03:45:00 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 1 Jul 2009 23:45:00 -0400
Subject: [Linux-cluster] Recovering from "telling LM to withdraw"
In-Reply-To: <4A4B922E.5090301@redhat.com>
References: <64D0546C5EBBD147B75DE133D798665F02FDC1E2@hugo.eprize.local>
	<4A4B922E.5090301@redhat.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC208@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Abhijith Das
> Sent: Wednesday, July 01, 2009 12:43 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Recovering from "telling LM to withdraw"
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=471258
> 
> The assert+withdraw you're seeing seems to be this bug above. I've
tried
> to recreate this on my cluster and failed. If you have a recipe to
> create this, could you please post it to the bugzilla?

Thank you for the link.  I'm not confident I can easily reproduce this
yet, as we've had months of continuous uptime without such an incident.

However if I do learn more about the circumstances leading up to our
crash, I'll certainly post information to the bugzilla page.

In the meantime I'll see if I can install a nagios agent to scan logs
for any GFS problems.  The sooner we know about it, the faster we can
recover if this happens again.

-Jeff


From Emmanuel.Thome at normalesup.org  Thu Jul  2 09:56:17 2009
From: Emmanuel.Thome at normalesup.org (Emmanuel =?iso-8859-1?Q?Thom=E9?=)
Date: Thu, 2 Jul 2009 11:56:17 +0200
Subject: [Linux-cluster] ipmi activates session, but no talk.
Message-ID: <20090702095617.GA24015@tiramisu.loria.fr>


Hi.

I'm trying to set up ipmi (1.5) management using the bmc on ibm
eserver326 machines. Yes, these machines are old.

So far, I've been able to access the bmc with ipmitool, and configure it
as correctly as I could for remote access.

When trying to access it from afar, I successfully activate a session,
but further requests are unanswered.

Some dumps of ipmitool commands are included below.

If anybody has an idea of what's going on, that would be greatly
appreciated.

I might also try to flash the bmc firmware, as it seems that ibm released
a newer firmware for these servers. But I'm already a bit puzzled by
the situation so far.

Thanks,

E.

I'm trying to access the bmc with IP 152.81.4.81 from the host with IP
152.81.3.83. The BMC piggies-back on the eth0 NIC which has IP
152.81.3.81 for the system. Thus the BMC and the system both have
different MAC and IPs. Seems to work fine as some kind of conversation
occurs.

Here's the output of a remote ipmi request:

[root at cassandre ~]# IPMI_PASSWORD=xxx ipmitool -vvI lan -L USER -H 152.81.4.81 -E mc info
ipmi_lan_send_cmd:opened=[0], open=[4490512]
IPMI LAN host 152.81.4.81 port 623
Sending IPMI/RMCP presence ping packet
ipmi_lan_send_cmd:opened=[1], open=[4490512]
Channel 01 Authentication Capabilities:
  Privilege Level : USER
  Auth Types      : MD5 
  Per-msg auth    : disabled
  User level auth : disabled
  Non-null users  : enabled
  Null users      : enabled
  Anonymous login : disabled

Proceeding with AuthType MD5
ipmi_lan_send_cmd:opened=[1], open=[4490512]
Opening Session
  Session ID      : 751168e4
  Challenge       : e44e37374801833f77701411992dae25
  Privilege Level : USER
  Auth Type       : MD5
ipmi_lan_send_cmd:opened=[1], open=[4490512]

Session Activated
  Auth Type       : MD5
  Max Priv Level  : USER
  Session ID      : 751168e4
  Inbound Seq     : 00000001

        opened=[1], open=[4490512]
  No response from remote controller
Get Device ID command failed
ipmi_lan_send_cmd:opened=[1], open=[4490512]
  No response from remote controller
Close Session command failed

On the machine I'm trying to talk to, I have in particular:

[root at achille ~]# ipmitool -I open session info all
[...]
session handle                : 255
slot count                    : 4
active sessions               : 1
user id                       : 1
privilege level               : USER
session type                  : IPMIv1.5
channel number                : 0x01
console ip                    : 152.81.3.83
console mac                   : 00:00:00:00:00:00
console port                  : 60599
[...]

[root at achille ~]# /usr/bin/ipmitool -I open lan print 
Set in Progress         : Set Complete
Auth Type Support       : NONE MD5 PASSWORD 
Auth Type Enable        : Callback : MD5 
                        : User     : MD5 
                        : Operator : MD5 
                        : Admin    : MD5 
                        : OEM      : NONE MD5 PASSWORD 
IP Address Source       : Static Address
IP Address              : 152.81.4.81
Subnet Mask             : 255.255.240.0
MAC Address             : 00:0d:60:18:7c:47
SNMP Community String   : public
IP Header               : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
Default Gateway IP      : 152.81.1.1
Default Gateway MAC     : 00:13:5f:89:14:00
Backup Gateway IP       : 192.168.0.2
Backup Gateway MAC      : 00:00:00:00:00:02
Cipher Suite Priv Max   : Not Available

[root at achille ~]# ipmitool user  list  1
ID  Name             Callin  Link Auth  IPMI Msg   Channel Priv Limit
1                    true    false      true       ADMINISTRATOR
2   root             true    true       true       OPERATOR
3   USERID           true    true       true       ADMINISTRATOR
4   OEM              true    true       true       OEM

[root at achille ~]# ipmitool -I open channel info 1
Channel 0x1 info:
  Channel Medium Type   : 802.3 LAN
  Channel Protocol Type : IPMB-1.0
  Session Support       : multi-session
  Active Session Count  : 1
  Protocol Vendor ID    : 7154
  Volatile(active) Settings
    Alerting            : disabled
    Per-message Auth    : disabled
    User Level Auth     : disabled
    Access Mode         : always available
  Non-Volatile Settings
    Alerting            : disabled
    Per-message Auth    : disabled
    User Level Auth     : disabled
    Access Mode         : always available


From j.buzzard at dundee.ac.uk  Thu Jul  2 10:15:36 2009
From: j.buzzard at dundee.ac.uk (Jonathan Buzzard)
Date: Thu, 02 Jul 2009 11:15:36 +0100
Subject: [Linux-cluster] ipmi activates session, but no talk.
In-Reply-To: <20090702095617.GA24015@tiramisu.loria.fr>
References: <20090702095617.GA24015@tiramisu.loria.fr>
Message-ID: <1246529736.23585.6.camel@penguin.lifesci.dundee.ac.uk>


On Thu, 2009-07-02 at 11:56 +0200, Emmanuel Thom? wrote:
> Hi.
> 
> I'm trying to set up ipmi (1.5) management using the bmc on ibm
> eserver326 machines. Yes, these machines are old.

they are cheap and nasty rebadged MSI boxes.

> So far, I've been able to access the bmc with ipmitool, and configure it
> as correctly as I could for remote access.
> 
> When trying to access it from afar, I successfully activate a session,
> but further requests are unanswered.
> 
> Some dumps of ipmitool commands are included below.

Well that's your problem, it don't work with ipmitools :-(

> If anybody has an idea of what's going on, that would be greatly
> appreciated.
> 

I suggest switching to freeipmi which does work.

> I might also try to flash the bmc firmware, as it seems that ibm released
> a newer firmware for these servers. But I'm already a bit puzzled by
> the situation so far.

I would if I where you. I would also do the BIOS, BMC and hard disk
firmware at a minimum if I where you. The diagnostics are option.

Note that you cannot configure bonding on eth0 and use the IPMI
interface.

Even when you get it working it is not reliable. I have seen boxes hang
and refuse to respond to IPMI commands to reboot.

I have also never been able to get the serial over LAN bit working
either.

They are just cheap and nasty.

JAB.

-- 
Jonathan A. Buzzard                      Tel: +441382-386998
Storage Administrator, College of Life Sciences
University of Dundee, DD1 5EH


From Emmanuel.Thome at normalesup.org  Thu Jul  2 10:47:45 2009
From: Emmanuel.Thome at normalesup.org (Emmanuel =?iso-8859-1?Q?Thom=E9?=)
Date: Thu, 2 Jul 2009 12:47:45 +0200
Subject: [Linux-cluster] ipmi activates session, but no talk.
In-Reply-To: <1246529736.23585.6.camel@penguin.lifesci.dundee.ac.uk>
References: <20090702095617.GA24015@tiramisu.loria.fr>
	<1246529736.23585.6.camel@penguin.lifesci.dundee.ac.uk>
Message-ID: <20090702104745.GA25283@tiramisu.loria.fr>

On Thu, Jul 02, 2009 at 11:15:36AM +0100, Jonathan Buzzard wrote:
> > Some dumps of ipmitool commands are included below.
> 
> Well that's your problem, it don't work with ipmitools :-(

thanks a lot. Indeed.

Regards,

E.


From brettcave at gmail.com  Thu Jul  2 10:54:35 2009
From: brettcave at gmail.com (Brett Cave)
Date: Thu, 2 Jul 2009 12:54:35 +0200
Subject: [Linux-cluster] Re: [SOLVED] problem with heartbeat + ipvs
In-Reply-To: <c0773fd30907010824p52cc5b69l8a700854865953f8@mail.gmail.com>
References: <c0773fd30907010824p52cc5b69l8a700854865953f8@mail.gmail.com>
Message-ID: <c0773fd30907020354tafdcfe0r441c0f0d4fd6fe6e@mail.gmail.com>

Was missing the DBD::mysql module so the connectioncheck was failing and
setting weight to 0.

only noticed this when i ran ldirector in debug mode.


On Wed, Jul 1, 2009 at 5:24 PM, Brett Cave <brettcave at gmail.com> wrote:

> hi all,
>
> have a problem with HA / LB system, using heartbeat for HA and ldirector /
> ipvs for load balancing.
>
> When the primary node is shut down or heartbeat is stopped, the migration
> of services works fine, but the loadbalancing does not work (ipvs rules are
> active, but connect connect to HA services). Configs on primary and
> secondary are the same:
>
>
> haresources:
> primary 172.16.5.1/16/bond0 ldirectord::ldirectord.cf
>
> ldirectord.cf:
> virtual = 172.16.5.1:3306
>        service = mysql
>        real = 172.16.10.1:3306 gate 1000
>        checktype, login, passwd, database, request values all set
>        scheduler = sed
>
> ip_forward is enabled (checked via /proc, configured via sysctl)
>
>
> network configs are almost the same except for the IP address (using a
> bonded interface in active/passive mode)
> have set iptables policies to ACCEPT with rules that would not block the
> traffic (99.99% sure on this).
>
> if i try connect from a server such as 172.16.10.10, i cannot connect if
> the secondary is up:
> [user at someserver]$ mysql -h 172.16.5.1
> ERROR 2003 (HY000): Can't connect to MySQL server on '172.16.5.1' (111)
>
>
> perror shows that 111 is Connection Refused
>
> running a sniffer on the secondary HA box, i dont see the tcp 3306 packets
> coming in.
>
> the arp_ignore / arp_announce kernel params are configured on teh real
> server, HA ip address is added on a /32 subnet to the lo interface, etc,
> etc.... (everything works 100% when primary is up).
>
> sure it is something i have overlooked, any idea's?
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090702/db03117f/attachment.htm>

From ironludo at free.fr  Thu Jul  2 12:09:01 2009
From: ironludo at free.fr (LEROUX Ludovic)
Date: Thu, 2 Jul 2009 14:09:01 +0200
Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g
References: <c0773fd30907010824p52cc5b69l8a700854865953f8@mail.gmail.com>
	<c0773fd30907020354tafdcfe0r441c0f0d4fd6fe6e@mail.gmail.com>
Message-ID: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>

I try to install Oracle 11g on a redhat 5 cluster with 2 nodes.
I have a gfs mount point for the shared datafiles.
Oracle binaries are installed on each node.
I want to create a failover instance (active/passive) but the service with the ressource oracle 10g failover instance doesn't start (see the logfile).
I think that the resource doesn't work with Oracle 11g.
Do you have any ideas?
Do you have any documents to set up a redhat cluster with Oracle but without Oracle RAC?
Thanks a lot.
Ludo

________________________________________________________________________________________________________

Jul  2 14:11:03 siimlinux13 luci[2956]: Unable to retrieve batch 1273662007 status from siimlinux13.siim:11111: Unable to disable failed service oracle before starting it: clusvcadm failed to stop oracle
Jul  2 14:11:11 siimlinux13 : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: Cluster. Please verify its path and try again
Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <notice> Starting disabled service service:oracle
Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <notice> start on script "serviceoracle" returned 5 (program not installed)
Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <warning> #68: Failed to start service:oracle; return value: 1
Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <notice> Stopping service service:oracle
Jul  2 14:11:55 siimlinux13 luci[2956]: Unable to retrieve batch 710454996 status from siimlinux13.siim:11111: module scheduled for execution
Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <notice> stop on script "serviceoracle" returned 5 (program not installed)
Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <crit> #12: RG service:oracle failed to stop; intervention required
Jul  2 14:11:56 siimlinux13 clurgmgrd[3045]: <notice> Service service:oracle is failed
Jul  2 14:11:56 siimlinux13 clurgmgrd[3045]: <crit> #13: Service service:oracle failed to stop cleanly
Jul  2 14:12:01 siimlinux13 luci[2956]: Unable to retrieve batch 710454996 status from siimlinux13.siim:11111: clusvcadm start failed to start oracle:
Jul  2 15:11:14 siimlinux13 : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: Cluster. Please verify its path and try again
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090702/3ece540e/attachment.htm>

From raju.rajsand at gmail.com  Thu Jul  2 12:17:02 2009
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Thu, 2 Jul 2009 17:47:02 +0530
Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g
In-Reply-To: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
References: <c0773fd30907010824p52cc5b69l8a700854865953f8@mail.gmail.com>
	<c0773fd30907020354tafdcfe0r441c0f0d4fd6fe6e@mail.gmail.com>
	<3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
Message-ID: <8786b91c0907020517s51ccd802pfa306b401ad3f07e@mail.gmail.com>

Greetings,

On Thu, Jul 2, 2009 at 5:39 PM, LEROUX Ludovic<ironludo at free.fr> wrote:
> I try to install Oracle 11g on a redhat 5 cluster with 2 nodes.
> I have a gfs mount point for the shared datafiles.
> Oracle binaries are installed on each node.
> I want to create a failover instance (active/passive) but the service with
> the ressource oracle 10g failover instance doesn't start (see the logfile).

Have you done chkconfig --off for the oracle script on both the nodes
and added Oracle to cluster managed service along with the listener
IP.

Regards,

Rajagopal


From esggrupos at gmail.com  Thu Jul  2 17:24:03 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Thu, 2 Jul 2009 19:24:03 +0200
Subject: [Linux-cluster] OFF TOPIC: cloud computing
Message-ID: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com>

Hi folks,
First sorry for the off topic but I?m sure you know a lot about the concept
cloud computing.

While I have been learning about clustering (with the help of this list..) I
have read about using clusters for cloud computing.

I?m totally newbie about that concept, so I want to ask you what you have to
say about it, is it real? is an abstract concept and it?s not going to be
interesting at all?

what do you think?

by the way, any web, book, magazine, article or any thing to profundice in
this concept

greetings

ESG
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090702/a83becee/attachment.htm>

From brettcave at gmail.com  Thu Jul  2 17:35:43 2009
From: brettcave at gmail.com (Brett Cave)
Date: Thu, 2 Jul 2009 19:35:43 +0200
Subject: [Linux-cluster] OFF TOPIC: cloud computing
In-Reply-To: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com>
References: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com>
Message-ID: <c0773fd30907021035p1a5f51cen888602f0d2dbeab8@mail.gmail.com>

On Thu, Jul 2, 2009 at 7:24 PM, ESGLinux <esggrupos at gmail.com> wrote:

> Hi folks,
> First sorry for the off topic but I?m sure you know a lot about the concept
> cloud computing.
>
> While I have been learning about clustering (with the help of this list..)
> I have read about using clusters for cloud computing.
>
> I?m totally newbie about that concept, so I want to ask you what you have
> to say about it, is it real? is an abstract concept and it?s not going to be
> interesting at all?
>

It is real, have a look at MPI for development of cloud computing (MPI CH as
an implementation). Its used for message passing to queue out components of
a job to various nodes.  We implemented sorting using this library last year
that allocated tasks on a per-core basis across multiple servers.


> what do you think?
>
> by the way, any web, book, magazine, article or any thing to profundice in
> this concept
>
> greetings
>
> ESG
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090702/909a32a5/attachment.htm>

From jruemker at redhat.com  Thu Jul  2 19:53:58 2009
From: jruemker at redhat.com (John Ruemker)
Date: Thu, 02 Jul 2009 15:53:58 -0400
Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g
In-Reply-To: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
References: <c0773fd30907010824p52cc5b69l8a700854865953f8@mail.gmail.com>	<c0773fd30907020354tafdcfe0r441c0f0d4fd6fe6e@mail.gmail.com>
	<3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
Message-ID: <4A4D1056.6090807@redhat.com>

On 07/02/2009 08:09 AM, LEROUX Ludovic wrote:
> Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <notice> start on script
> "serviceoracle" returned 5 (program not installed)
> Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <warning> #68: Failed to
> start service:oracle; return value: 1

The above error is why its failing, but unfortunately this is pretty 
generic.  Something returned a status code 5, but from these logs theres 
no way to be sure what since the oracle agent does a number of things 
during the startup sequence.

Usually the best way to troubleshoot these issues is with rg_test, as it 
will be much more verbose.  First disable your service

   # clusvcadm -d serviceoracle

Now do

   # rg_test test /etc/cluster/cluster.conf start service serviceoracle

You should see it logging each operation and it will tell you where it 
failed.  If this doesn't point you to your answer then post the output 
here as well as your cluster.conf.

Also there are some good guidelines and basic steps for setting up an 
oracle service here:

   http://people.redhat.com/lhh/oracle-rhel5-notes-0.6/oracle-notes.html

HTH

-John


From hlawatschek at atix.de  Fri Jul  3 09:25:55 2009
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Fri, 3 Jul 2009 11:25:55 +0200
Subject: [Linux-cluster] redhat 5 cluster suite with gfs and oracle 11g
In-Reply-To: <3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
References: <c0773fd30907010824p52cc5b69l8a700854865953f8@mail.gmail.com>
	<c0773fd30907020354tafdcfe0r441c0f0d4fd6fe6e@mail.gmail.com>
	<3AA7FB4A03854AE2953DA88714C2DBA4@siim94.local>
Message-ID: <200907031125.55435.hlawatschek@atix.de>

Hi Ludo,

could you please provide your cluster.conf file ?

-Mark

On Thursday 02 July 2009 14:09:01 LEROUX Ludovic wrote:
> I try to install Oracle 11g on a redhat 5 cluster with 2 nodes.
> I have a gfs mount point for the shared datafiles.
> Oracle binaries are installed on each node.
> I want to create a failover instance (active/passive) but the service with
> the ressource oracle 10g failover instance doesn't start (see the logfile).
> I think that the resource doesn't work with Oracle 11g.
> Do you have any ideas?
> Do you have any documents to set up a redhat cluster with Oracle but
> without Oracle RAC? Thanks a lot.
> Ludo
>
> ___________________________________________________________________________
>_____________________________
>
> Jul  2 14:11:03 siimlinux13 luci[2956]: Unable to retrieve batch 1273662007
> status from siimlinux13.siim:11111: Unable to disable failed service oracle
> before starting it: clusvcadm failed to stop oracle Jul  2 14:11:11
> siimlinux13 : error getting update info: Cannot retrieve repository
> metadata (repomd.xml) for repository: Cluster. Please verify its path and
> try again Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <notice> Starting
> disabled service service:oracle Jul  2 14:11:55 siimlinux13
> clurgmgrd[3045]: <notice> start on script "serviceoracle" returned 5
> (program not installed) Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]:
> <warning> #68: Failed to start service:oracle; return value: 1 Jul  2
> 14:11:55 siimlinux13 clurgmgrd[3045]: <notice> Stopping service
> service:oracle Jul  2 14:11:55 siimlinux13 luci[2956]: Unable to retrieve
> batch 710454996 status from siimlinux13.siim:11111: module scheduled for
> execution Jul  2 14:11:55 siimlinux13 clurgmgrd[3045]: <notice> stop on
> script "serviceoracle" returned 5 (program not installed) Jul  2 14:11:55
> siimlinux13 clurgmgrd[3045]: <crit> #12: RG service:oracle failed to stop;
> intervention required Jul  2 14:11:56 siimlinux13 clurgmgrd[3045]: <notice>
> Service service:oracle is failed Jul  2 14:11:56 siimlinux13
> clurgmgrd[3045]: <crit> #13: Service service:oracle failed to stop cleanly
> Jul  2 14:12:01 siimlinux13 luci[2956]: Unable to retrieve batch 710454996
> status from siimlinux13.siim:11111: clusvcadm start failed to start oracle:
> Jul  2 15:11:14 siimlinux13 : error getting update info: Cannot retrieve
> repository metadata (repomd.xml) for repository: Cluster. Please verify its
> path and try again


-- 
Dipl.-Ing. Mark Hlawatschek

ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de | www.open-sharedroot.org


From mech at meteo.uni-koeln.de  Fri Jul  3 15:06:13 2009
From: mech at meteo.uni-koeln.de (Mario Mech)
Date: Fri, 03 Jul 2009 17:06:13 +0200
Subject: [Linux-cluster] running services as non-root user
Message-ID: <4A4E1E65.2070908@meteo.uni-koeln.de>

Hi,

in my cluster environment some services need to run as a non-root user. What are the necessary settings?

Settings in my cluster.conf like

<script file="/data/cluster/scripts/cluster_get_grid_ecmwf_globe" name="Get grid/ECMWF_Globe" user="ninjoadm"/>

(not accepted by system-config-cluster) and in /usr/share/cluster/scripts.sh

<parameter name="user" unique="0" required="0">
<longdesc lang="en">
User name
</longdesc>
<shortdesc lang="en">
User name
</shortdesc>
<content type="string"/>
</parameter>

su - ${OCF_RESKEY_user} -c "${OCF_RESKEY_file} $1"

didn't succeed. The services are startet but as root.

Is it the wrong way?

Thank you

Mario
-- 


From billpp at gmail.com  Fri Jul  3 19:30:44 2009
From: billpp at gmail.com (Flavio Junior)
Date: Fri, 3 Jul 2009 16:30:44 -0300
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
Message-ID: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>

Hi folks....

I'm (trying to) using GFS2 with a mailserver scenario using:

- CentOS 5.3 updated
- Dovecot IMAP/Maildir
- Postfix

To make servers active/active i'm using CTDB (http://ctdb.samba.org).

Some info that could be relevant:
[root at pinky ~]# uname -a
Linux pinky 2.6.18-128.1.16.el5 #1 SMP Tue Jun 30 06:07:26 EDT 2009 x86_64
x86_64 x86_64 GNU/Linux
[root at pinky ~]# rpm -qa | grep -E 'gfs2|clust|kernel|cman|openais'
kernel-2.6.18-128.1.16.el5
gfs2-utils-0.1.53-1.el5_3.3
modcluster-0.12.1-2.el5.centos
cluster-cim-0.12.1-2.el5.centos
kernel-devel-2.6.18-128.1.10.el5
openais-0.80.3-22.el5_3.8
system-config-cluster-1.0.55-1.0
kernel-2.6.18-128.1.6.el5
kernel-2.6.18-128.1.10.el5
kernel-devel-2.6.18-128.1.16.el5
lvm2-cluster-2.02.40-7.el5
cluster-snmp-0.12.1-2.el5.centos
kernel-headers-2.6.18-128.1.16.el5
kernel-devel-2.6.18-128.1.6.el5
cman-2.0.98-1.el5_3.4
[root at pinky ~]# grep /home /etc/fstab
/dev/homeClusterVG/home_vmail   /home           gfs2
auto,noatime,quota=off,noexec,nodev,_netdev       0 0


Everything works fine for some time, but two or three times by day I get
some dovecot/deliver process hanged D state, so the only way to solve it is
rebooting node.

I'm not a developer and don't know much about debugging. As i've got other
problems ago I learn to use "sysrq-t" and here is the output related with
two of these process:

Pastebin: http://pastebin.ca/1483264

Jul  3 15:45:20 cerebro kernel: deliver       D ffff81007e442800     0
24420  23846                     (NOTLB)
Jul  3 15:45:20 cerebro kernel:  ffff810013885e08 0000000000000082
ffff810013885d68 0000000000000092
Jul  3 15:45:20 cerebro kernel:  ffff810013885e20 0000000000000001
ffff8100141870c0 ffff81000904b0c0
Jul  3 15:45:20 cerebro kernel:  0000052a72ff2a70 000000000000034a
ffff8100141872a8 000000036caf5000
Jul  3 15:45:20 cerebro kernel: Call Trace:
Jul  3 15:45:20 cerebro kernel:  [<ffffffff88562a7d>]
:dlm:dlm_posix_lock+0x172/0x210
Jul  3 15:45:20 cerebro kernel:  [<ffffffff8009eba4>]
autoremove_wake_function+0x0/0x2e
Jul  3 15:45:20 cerebro kernel:  [<ffffffff88591c7a>]
:gfs2:gfs2_lock+0xc3/0xcf
Jul  3 15:45:20 cerebro kernel:  [<ffffffff8003a39e>]
fcntl_setlk+0x11e/0x273
Jul  3 15:45:20 cerebro kernel:  [<ffffffff800b5659>]
audit_syscall_entry+0x16e/0x1a1
Jul  3 15:45:20 cerebro kernel:  [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
Jul  3 15:45:20 cerebro kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0


Jul  3 15:45:21 cerebro kernel: deliver       D ffff81000238f480     0
1358  32225                     (NOTLB)
Jul  3 15:45:21 cerebro kernel:  ffff8100086cfe08 0000000000000082
ffff8100086cfd68 0000000000000092
Jul  3 15:45:21 cerebro kernel:  ffff8100086cfe20 0000000000000001
ffff81000904b0c0 ffff81007ff28100
Jul  3 15:45:21 cerebro kernel:  0000052a72ff2ca2 0000000000000232
ffff81000904b2a8 000000037ed68a00
Jul  3 15:45:21 cerebro kernel: Call Trace:
Jul  3 15:45:21 cerebro kernel:  [<ffffffff88562a7d>]
:dlm:dlm_posix_lock+0x172/0x210
Jul  3 15:45:21 cerebro kernel:  [<ffffffff8009eba4>]
autoremove_wake_function+0x0/0x2e
Jul  3 15:45:21 cerebro kernel:  [<ffffffff88591c7a>]
:gfs2:gfs2_lock+0xc3/0xcf
Jul  3 15:45:21 cerebro kernel:  [<ffffffff8003a39e>]
fcntl_setlk+0x11e/0x273
Jul  3 15:45:21 cerebro kernel:  [<ffffffff800b5659>]
audit_syscall_entry+0x16e/0x1a1
Jul  3 15:45:21 cerebro kernel:  [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
Jul  3 15:45:21 cerebro kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0


Before reboot the node I went into the directory of this user and run some
"ls" and everything works as expected. I was pretty sure that command will
hang, but it don't.
Here is the "ps ax" output:
cicero   24420  0.0  0.0   8960  1220 ?        Ds   14:46   0:00
/usr/libexec/dovecot/deliver -f cicero -d cicero

I've already rebooted that node, but if there is someway more deeply to
perform a debug of this case, just let me know that probably till the end of
the day i'll get same situation.


Thanks in advance.

--

Fl?vio do Carmo J?nior aka waKKu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090703/87babf13/attachment.htm>

From gordan at bobich.net  Fri Jul  3 19:40:11 2009
From: gordan at bobich.net (Gordan Bobic)
Date: Fri, 03 Jul 2009 20:40:11 +0100
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
Message-ID: <4A4E5E9B.7060906@bobich.net>

Sounds like you are running into the same bug that I ran into with GFS2 
on a similar setup nearly 2 years ago, except I could produce a lock-up 
in under 2 seconds every time. Solution is to use GFS1 if you really 
want to stick with that setup, but bear in mind that, regardless of the 
cluster file system (GFS1, GFS2, OCFS2) the performance will scale 
_inversely_. Cluster file systems really don't work well with millions 
of small files.

You might, instead, want to look into something like DBMail with a MySQL 
proxy to serialize all writes to a single node.

You can, of course, still use GFS1 for the root file system to share the 
OS install. Look at Open Shared Root project if this is of interest.

Gordan

Flavio Junior wrote:
> Hi folks....
> 
> I'm (trying to) using GFS2 with a mailserver scenario using:
> 
> - CentOS 5.3 updated
> - Dovecot IMAP/Maildir
> - Postfix
> 
> To make servers active/active i'm using CTDB (http://ctdb.samba.org).
> 
> Some info that could be relevant:
> [root at pinky ~]# uname -a
> Linux pinky 2.6.18-128.1.16.el5 #1 SMP Tue Jun 30 06:07:26 EDT 2009 
> x86_64 x86_64 x86_64 GNU/Linux
> [root at pinky ~]# rpm -qa | grep -E 'gfs2|clust|kernel|cman|openais'
> kernel-2.6.18-128.1.16.el5
> gfs2-utils-0.1.53-1.el5_3.3
> modcluster-0.12.1-2.el5.centos
> cluster-cim-0.12.1-2.el5.centos
> kernel-devel-2.6.18-128.1.10.el5
> openais-0.80.3-22.el5_3.8
> system-config-cluster-1.0.55-1.0
> kernel-2.6.18-128.1.6.el5
> kernel-2.6.18-128.1.10.el5
> kernel-devel-2.6.18-128.1.16.el5
> lvm2-cluster-2.02.40-7.el5
> cluster-snmp-0.12.1-2.el5.centos
> kernel-headers-2.6.18-128.1.16.el5
> kernel-devel-2.6.18-128.1.6.el5
> cman-2.0.98-1.el5_3.4
> [root at pinky ~]# grep /home /etc/fstab
> /dev/homeClusterVG/home_vmail   /home           gfs2    
> auto,noatime,quota=off,noexec,nodev,_netdev       0 0
> 
> 
> Everything works fine for some time, but two or three times by day I get 
> some dovecot/deliver process hanged D state, so the only way to solve it 
> is rebooting node.
> 
> I'm not a developer and don't know much about debugging. As i've got 
> other problems ago I learn to use "sysrq-t" and here is the output 
> related with two of these process:
> 
> Pastebin: http://pastebin.ca/1483264
> 
> Jul  3 15:45:20 cerebro kernel: deliver       D ffff81007e442800     0 
> 24420  23846                     (NOTLB)
> Jul  3 15:45:20 cerebro kernel:  ffff810013885e08 0000000000000082 
> ffff810013885d68 0000000000000092
> Jul  3 15:45:20 cerebro kernel:  ffff810013885e20 0000000000000001 
> ffff8100141870c0 ffff81000904b0c0
> Jul  3 15:45:20 cerebro kernel:  0000052a72ff2a70 000000000000034a 
> ffff8100141872a8 000000036caf5000
> Jul  3 15:45:20 cerebro kernel: Call Trace:
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff88562a7d>] 
> :dlm:dlm_posix_lock+0x172/0x210
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff8009eba4>] 
> autoremove_wake_function+0x0/0x2e
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff88591c7a>] 
> :gfs2:gfs2_lock+0xc3/0xcf
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff8003a39e>] 
> fcntl_setlk+0x11e/0x273
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff800b5659>] 
> audit_syscall_entry+0x16e/0x1a1
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
> Jul  3 15:45:20 cerebro kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0
> 
> 
> Jul  3 15:45:21 cerebro kernel: deliver       D ffff81000238f480     0  
> 1358  32225                     (NOTLB)
> Jul  3 15:45:21 cerebro kernel:  ffff8100086cfe08 0000000000000082 
> ffff8100086cfd68 0000000000000092
> Jul  3 15:45:21 cerebro kernel:  ffff8100086cfe20 0000000000000001 
> ffff81000904b0c0 ffff81007ff28100
> Jul  3 15:45:21 cerebro kernel:  0000052a72ff2ca2 0000000000000232 
> ffff81000904b2a8 000000037ed68a00
> Jul  3 15:45:21 cerebro kernel: Call Trace:
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff88562a7d>] 
> :dlm:dlm_posix_lock+0x172/0x210
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff8009eba4>] 
> autoremove_wake_function+0x0/0x2e
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff88591c7a>] 
> :gfs2:gfs2_lock+0xc3/0xcf
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff8003a39e>] 
> fcntl_setlk+0x11e/0x273
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff800b5659>] 
> audit_syscall_entry+0x16e/0x1a1
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
> Jul  3 15:45:21 cerebro kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0
> 
> 
> Before reboot the node I went into the directory of this user and run 
> some "ls" and everything works as expected. I was pretty sure that 
> command will hang, but it don't.
> Here is the "ps ax" output:
> cicero   24420  0.0  0.0   8960  1220 ?        Ds   14:46   0:00 
> /usr/libexec/dovecot/deliver -f cicero -d cicero
> 
> I've already rebooted that node, but if there is someway more deeply to 
> perform a debug of this case, just let me know that probably till the 
> end of the day i'll get same situation.
> 
> 
> Thanks in advance.
> 
> --
> 
> Fl?vio do Carmo J?nior aka waKKu
> 
> 
> ------------------------------------------------------------------------
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From billpp at gmail.com  Fri Jul  3 20:02:29 2009
From: billpp at gmail.com (Flavio Junior)
Date: Fri, 3 Jul 2009 17:02:29 -0300
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <4A4E5E9B.7060906@bobich.net>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
	<4A4E5E9B.7060906@bobich.net>
Message-ID: <58aa8d780907031302x2cf76587had74673f962d1e61@mail.gmail.com>

On Fri, Jul 3, 2009 at 4:40 PM, Gordan Bobic <gordan at bobich.net> wrote:

> Sounds like you are running into the same bug that I ran into with GFS2 on
> a similar setup nearly 2 years ago, except I could produce a lock-up in
> under 2 seconds every time. Solution is to use GFS1 if you really want to
> stick with that setup, but bear in mind that, regardless of the cluster file
> system (GFS1, GFS2, OCFS2) the performance will scale _inversely_. Cluster
> file systems really don't work well with millions of small files.
>

Hi Gordan, thanks for answer.

But, if it is "possible" to be solved (as it was with GFS1) why is it not
feasible to GFS2?

Well, no problem at al to migrate to GFS1, actually I've already thinked
about it, but all those gfs1 tunning options and tests makes me a bit
apprehensive.

I'll wait a bit more for GFS2 community, if they say that it can't be done I
go to GFS1 or even ocfs2 (what is the third option, as I've already a RHCS
structure with clvmd).


--

Fl?vio do Carmo J?nior aka waKKu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090703/8918b55b/attachment.htm>

From gordan at bobich.net  Fri Jul  3 21:00:13 2009
From: gordan at bobich.net (Gordan Bobic)
Date: Fri, 03 Jul 2009 22:00:13 +0100
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <58aa8d780907031302x2cf76587had74673f962d1e61@mail.gmail.com>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>	<4A4E5E9B.7060906@bobich.net>
	<58aa8d780907031302x2cf76587had74673f962d1e61@mail.gmail.com>
Message-ID: <4A4E715D.5010204@bobich.net>

Flavio Junior wrote:
> On Fri, Jul 3, 2009 at 4:40 PM, Gordan Bobic <gordan at bobich.net 
> <mailto:gordan at bobich.net>> wrote:
> 
>     Sounds like you are running into the same bug that I ran into with
>     GFS2 on a similar setup nearly 2 years ago, except I could produce a
>     lock-up in under 2 seconds every time. Solution is to use GFS1 if
>     you really want to stick with that setup, but bear in mind that,
>     regardless of the cluster file system (GFS1, GFS2, OCFS2) the
>     performance will scale _inversely_. Cluster file systems really
>     don't work well with millions of small files.
> 
> 
> Hi Gordan, thanks for answer.
> 
> But, if it is "possible" to be solved (as it was with GFS1) why is it 
> not feasible to GFS2?

1) Performance will such regardless of whether it's GFS1 or GFS2. It's 
fine for 10-20 users, but if you have 10,000-20,000 users, it will grind 
to a halt.

2) The GFS2 clearly still isn't stable enough if this sort of crash 
still happens.

> Well, no problem at al to migrate to GFS1, actually I've already thinked 
> about it, but all those gfs1 tunning options and tests makes me a bit 
> apprehensive.

GFS1 doesn't have any more tuning options than GFS2 that I can think of. 
And besides, in practice, if the performance isn't in the right ball 
park out of the box, no amount of tweaking will help. Just about the 
only think that makes a significant difference is the noatime mount 
option. I wouldn't bother with the rest unless you really need those 
last few percent.

> I'll wait a bit more for GFS2 community, if they say that it can't be 
> done I go to GFS1 or even ocfs2 (what is the third option, as I've 
> already a RHCS structure with clvmd).

The problem with GFS2 is that it's still a bit buggy, as you've found. 
But there isn't that much difference in performance between various 
similar file systems. Sure, GFS2 is faster than GFS1, but it's not an 
order of magnitude faster.

Gordan


From cthulhucalling at gmail.com  Sat Jul  4 01:48:16 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Fri, 3 Jul 2009 18:48:16 -0700
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <4A4E715D.5010204@bobich.net>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
	<4A4E5E9B.7060906@bobich.net>
	<58aa8d780907031302x2cf76587had74673f962d1e61@mail.gmail.com>
	<4A4E715D.5010204@bobich.net>
Message-ID: <36df569a0907031848u52b19902s963d26ccea69abb4@mail.gmail.com>

On Fri, Jul 3, 2009 at 2:00 PM, Gordan Bobic <gordan at bobich.net> wrote:

> Flavio Junior wrote:
>
>> On Fri, Jul 3, 2009 at 4:40 PM, Gordan Bobic <gordan at bobich.net <mailto:
>> gordan at bobich.net>> wrote:
>>
>>
>
>  Well, no problem at al to migrate to GFS1, actually I've already thinked
>> about it, but all those gfs1 tunning options and tests makes me a bit
>> apprehensive.
>>
>
> GFS1 doesn't have any more tuning options than GFS2 that I can think of.
> And besides, in practice, if the performance isn't in the right ball park
> out of the box, no amount of tweaking will help. Just about the only think
> that makes a significant difference is the noatime mount option. I wouldn't
> bother with the rest unless you really need those last few percent.


Noatime helps, but where I've seen some really good performance boosts is in
tweaking glock_purge and demote_secs parameters. Of course, always start
with a modest setting and start tweaking from there. Playing around with
statfs_fast =1, noatime, nodiratime and playing withe glock settings, I've
seen a pretty significant jump in performance.


>
>  I'll wait a bit more for GFS2 community, if they say that it can't be done
>> I go to GFS1 or even ocfs2 (what is the third option, as I've already a RHCS
>> structure with clvmd).
>>
>
> The problem with GFS2 is that it's still a bit buggy, as you've found. But
> there isn't that much difference in performance between various similar file
> systems. Sure, GFS2 is faster than GFS1, but it's not an order of magnitude
> faster.


I've done some GFS vs GFS2 performance benchmarking for a cluster that I
will be putting in soon. I've found that GFS1 performance has been much much
better than GFS2. As far as I can tell, GFS2 lacks a lot of the tunability
that GFS1 has. All the documentation I've seen says that it's supposed to be
self-tuning, so there are fewer performance tuning options you have to play
with. From my tests, I've had almost a 50% reduction in performance using
GFS2.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090703/32771821/attachment.htm>

From brdvss at gmail.com  Sat Jul  4 10:19:32 2009
From: brdvss at gmail.com (Brady Vass)
Date: Sat, 4 Jul 2009 15:49:32 +0530
Subject: [Linux-cluster] Re: Commands for communicating among nodes?
In-Reply-To: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
References: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
Message-ID: <995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>

> Hi,
>
> I am trying to find out if there are any dedicated commands that allows for
> file copying or command execution among nodes in RHCS. i.e, these commands
> need to be exclusive with the RHCS s/w and the communication should be
> seamlessly without the need for password authentications.
>
> (PS: I dont want to use rsh/ssh genre of commands. Other HA solution comes
> with exclusive set of cluster commands. I am looking for something similar.)
>
>
> Thanks and regards,
>
> Brady
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090704/eb051367/attachment.htm>

From robejrm at gmail.com  Sat Jul  4 13:13:49 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Sat, 4 Jul 2009 15:13:49 +0200
Subject: [Linux-cluster] Re: Commands for communicating among nodes?
In-Reply-To: <995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
References: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
	<995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
Message-ID: <8a5668960907040613s34eb6046r5009f6bf24573e26@mail.gmail.com>

On Sat, Jul 4, 2009 at 12:19 PM, Brady Vass <brdvss at gmail.com> wrote:

>
>  Hi,
>>
>> I am trying to find out if there are any dedicated commands that allows
>> for file copying or command execution among nodes in RHCS. i.e, these
>> commands need to be exclusive with the RHCS s/w and the communication should
>> be seamlessly without the need for password authentications.
>>
>> (PS: I dont want to use rsh/ssh genre of commands. Other HA solution comes
>> with exclusive set of cluster commands. I am looking for something similar.)
>>
>>
> You can always use public key authentication with ssh and scp,
communication will be seamless. You can also use dsh (or any parallel shell
on top of ssh) to execute the same command on all the nodes at once.

Greetings,
Juanra

> Thanks and regards,
>>
>> Brady
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090704/12f11e4b/attachment.htm>

From grimme at atix.de  Sun Jul  5 11:29:54 2009
From: grimme at atix.de (Marc Grimme)
Date: Sun, 5 Jul 2009 13:29:54 +0200
Subject: [Linux-cluster] Re: Commands for communicating among nodes?
In-Reply-To: <8a5668960907040613s34eb6046r5009f6bf24573e26@mail.gmail.com>
References: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
	<995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
	<8a5668960907040613s34eb6046r5009f6bf24573e26@mail.gmail.com>
Message-ID: <200907051329.54722.grimme@atix.de>

You might also want to have a look at com-dsh as part of comoonics-cs-py 
(python based version of pydsh to be got from  
http://download.atix.de/yum/comoonics/redhat-el5/productive/noarch/RPMS/comoonics-cs-py-0.1-56.noarch.rpm). 
It should automatically detect all "online" nodes in the cluster and then 
issue the command on all nodes. It also detects the node you are working on 
and will issue this command directly. See below for an example.

[root at generix3 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   2   M     12   2009-05-05 10:06:24  generix2.local
   3   M      4   2009-05-05 09:37:02  generix3.local
   4   M     32   2009-05-05 10:14:49  generix4.local
[root at generix3 ~]# com-dsh hostname
Host           | Output:
---------------+-------------------------------------------------------------------------------------------------------------------------------------------
localhost      | generix3
generix2.local | generix2
generix4.local | generix4
[root at generix3 ~]# rpm -qf $(which com-dsh)
comoonics-cs-py-0.1-56

Hope this helps
Marc.
On Saturday 04 July 2009 15:13:49 Juan Ramon Martin Blanco wrote:
> On Sat, Jul 4, 2009 at 12:19 PM, Brady Vass <brdvss at gmail.com> wrote:
> >  Hi,
> >
> >> I am trying to find out if there are any dedicated commands that allows
> >> for file copying or command execution among nodes in RHCS. i.e, these
> >> commands need to be exclusive with the RHCS s/w and the communication
> >> should be seamlessly without the need for password authentications.
> >>
> >> (PS: I dont want to use rsh/ssh genre of commands. Other HA solution
> >> comes with exclusive set of cluster commands. I am looking for something
> >> similar.)
> >
> > You can always use public key authentication with ssh and scp,
>
> communication will be seamless. You can also use dsh (or any parallel shell
> on top of ssh) to execute the same command on all the nodes at once.
>
> Greetings,
> Juanra
>
> > Thanks and regards,
> >
> >> Brady
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster


-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de | www.open-sharedroot.org

Registergericht: Amtsgericht Muenchen, Registernummer: HRB 168930, USt.-Id.: 
DE209485962 | Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) |
Vorsitzender des Aufsichtsrats: Dr. Martin Buss


From armanets at ill.fr  Mon Jul  6 08:08:33 2009
From: armanets at ill.fr (Armanet Stephane)
Date: Mon, 06 Jul 2009 10:08:33 +0200
Subject: [Linux-cluster] force fencing
Message-ID: <4A51B101.2010500@ill.fr>

Hello list

I'm trying to setup a 3 nodes Cluster with 2 failover Domain for an HA
mail solution.
I want 1 run active for the Imap server in the Imap Failover domain , 1
node active for the Smtp in the Smtp Failover domain and the 3rd in the
2 failover domain as a backup node.

I run Centos 5.3
My fence device is a wti power switch

My cluster.conf is in attachement

My SMTP service is composed of:
	1 IP
	1 amavisd scritp
	1 postfix script
	2 NFS mount for postfix and amavis

If I manually kill the postfix master process (to simulate a crash), my
node is not fence and the logs said:

Jul  6 10:00:40 centos-smtp1 clurgmgrd: [4228]: <info> Executing
/etc/init.d/postfix status
Jul  6 10:00:40 centos-smtp1 clurgmgrd: [4228]: <err> script:postfix:
status of /etc/init.d/postfix failed (returned 3)
Jul  6 10:00:40 centos-smtp1 clurgmgrd[4228]: <notice> status on script
"postfix" returned 1 (generic error)
Jul  6 10:00:40 centos-smtp1 clurgmgrd[4228]: <notice> Stopping service
service:Postfix
Jul  6 10:00:40 centos-smtp1 clurgmgrd: [4228]: <info> Executing
/etc/init.d/amavisd stop
Jul  6 10:00:40 centos-smtp1 kernel: do_vfs_lock: VFS is out of sync
with lock manager!
Jul  6 10:00:40 centos-smtp1 last message repeated 8 times
Jul  6 10:00:41 centos-smtp1 clurgmgrd: [4228]: <info> Executing
/etc/init.d/postfix stop
Jul  6 10:00:41 centos-smtp1 clurgmgrd: [4228]: <err> script:postfix:
stop of /etc/init.d/postfix failed (returned 1)
Jul  6 10:00:41 centos-smtp1 clurgmgrd[4228]: <notice> stop on script
"postfix" returned 1 (generic error)
Jul  6 10:00:41 centos-smtp1 clurgmgrd: [4228]: <info> Removing IPv4
address 195.83.126.201/24 from bond0
Jul  6 10:00:41 centos-smtp1 avahi-daemon[3552]: Withdrawing address
record for 195.83.126.201 on bond0.
Jul  6 10:00:51 centos-smtp1 clurgmgrd: [4228]: <info> unmounting
/var/lib/amavis
Jul  6 10:00:51 centos-smtp1 clurgmgrd: [4228]: <info> unmounting
/var/spool/postfix
Jul  6 10:00:51 centos-smtp1 clurgmgrd[4228]: <crit> #12: RG
service:Postfix failed to stop; intervention required
Jul  6 10:00:51 centos-smtp1 clurgmgrd[4228]: <notice> Service
service:Postfix is failed
Jul  6 10:00:52 centos-smtp1 ntpd[3322]: synchronized to 195.83.126.119,
stratum 1

Clustat said:

Cluster Status for cluster-test @ Mon Jul  6 10:02:39 2009
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 centos-imap1.ill.fr                                                 1
Online, Local, rgmanager
 centos-imap2.ill.fr                                                 2
Online, rgmanager
 centos-smtp1.ill.fr                                                 3
Online, rgmanager
 /dev/disk/by-id/scsi-360a98000567247514634507447594661-part1        0
Online, Quorum Disk

 Service Name                                                   Owner
(Last)                                                   State
 ------- ----                                                   -----
------                                                   -----
 service:Imap
centos-imap2.ill.fr                                            started

 service:Postfix
(centos-smtp1.ill.fr)                                          failed


So I have to disable the Postfix servcie with:
	clusvcadm -d Postfix
and re-enable
	clusvcadm -e Postfix


Could you explain my why my original smtp node is not fenced and why my
service is not start on the 2nd node ???

Is there a way to force the fencing ???


-- 
ARMANET Stephane
Division Projet Technique
Service Informatique
  Groupe Infrastructure

Institut Laue langevin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: text/xml
Size: 3723 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090706/d4f9fa78/attachment.xml>

From robejrm at gmail.com  Mon Jul  6 08:22:23 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Mon, 6 Jul 2009 10:22:23 +0200
Subject: [Linux-cluster] force fencing
In-Reply-To: <4A51B101.2010500@ill.fr>
References: <4A51B101.2010500@ill.fr>
Message-ID: <8a5668960907060122s8a47dd6rb89f4dade8621efe@mail.gmail.com>

On Mon, Jul 6, 2009 at 10:08 AM, Armanet Stephane <armanets at ill.fr> wrote:

> Hello list
>
> I'm trying to setup a 3 nodes Cluster with 2 failover Domain for an HA
> mail solution.
> I want 1 run active for the Imap server in the Imap Failover domain , 1
> node active for the Smtp in the Smtp Failover domain and the 3rd in the
> 2 failover domain as a backup node.
>
> I run Centos 5.3
> My fence device is a wti power switch
>
> My cluster.conf is in attachement
>
> My SMTP service is composed of:
>        1 IP
>        1 amavisd scritp
>        1 postfix script
>        2 NFS mount for postfix and amavis
>
> If I manually kill the postfix master process (to simulate a crash), my
> node is not fence and the logs said:
>
> Jul  6 10:00:40 centos-smtp1 clurgmgrd: [4228]: <info> Executing
> /etc/init.d/postfix status
> Jul  6 10:00:40 centos-smtp1 clurgmgrd: [4228]: <err> script:postfix:
> status of /etc/init.d/postfix failed (returned 3)
> Jul  6 10:00:40 centos-smtp1 clurgmgrd[4228]: <notice> status on script
> "postfix" returned 1 (generic error)
> Jul  6 10:00:40 centos-smtp1 clurgmgrd[4228]: <notice> Stopping service
> service:Postfix
> Jul  6 10:00:40 centos-smtp1 clurgmgrd: [4228]: <info> Executing
> /etc/init.d/amavisd stop
> Jul  6 10:00:40 centos-smtp1 kernel: do_vfs_lock: VFS is out of sync
> with lock manager!
> Jul  6 10:00:40 centos-smtp1 last message repeated 8 times
> Jul  6 10:00:41 centos-smtp1 clurgmgrd: [4228]: <info> Executing
> /etc/init.d/postfix stop
> Jul  6 10:00:41 centos-smtp1 clurgmgrd: [4228]: <err> script:postfix:
> stop of /etc/init.d/postfix failed (returned 1)
> Jul  6 10:00:41 centos-smtp1 clurgmgrd[4228]: <notice> stop on script
> "postfix" returned 1 (generic error)
> Jul  6 10:00:41 centos-smtp1 clurgmgrd: [4228]: <info> Removing IPv4
> address 195.83.126.201/24 from bond0
> Jul  6 10:00:41 centos-smtp1 avahi-daemon[3552]: Withdrawing address
> record for 195.83.126.201 on bond0.
> Jul  6 10:00:51 centos-smtp1 clurgmgrd: [4228]: <info> unmounting
> /var/lib/amavis
> Jul  6 10:00:51 centos-smtp1 clurgmgrd: [4228]: <info> unmounting
> /var/spool/postfix
> Jul  6 10:00:51 centos-smtp1 clurgmgrd[4228]: <crit> #12: RG
> service:Postfix failed to stop; intervention required
> Jul  6 10:00:51 centos-smtp1 clurgmgrd[4228]: <notice> Service
> service:Postfix is failed
> Jul  6 10:00:52 centos-smtp1 ntpd[3322]: synchronized to 195.83.126.119,
> stratum 1
>
> Clustat said:
>
> Cluster Status for cluster-test @ Mon Jul  6 10:02:39 2009
> Member Status: Quorate
>
>  Member Name                                                     ID
> Status
>  ------ ----                                                     ----
> ------
>  centos-imap1.ill.fr                                                 1
> Online, Local, rgmanager
>  centos-imap2.ill.fr                                                 2
> Online, rgmanager
>  centos-smtp1.ill.fr                                                 3
> Online, rgmanager
>  /dev/disk/by-id/scsi-360a98000567247514634507447594661-part1        0
> Online, Quorum Disk
>
>  Service Name                                                   Owner
> (Last)                                                   State
>  ------- ----                                                   -----
> ------                                                   -----
>  service:Imap
> centos-imap2.ill.fr                                            started
>
>  service:Postfix
> (centos-smtp1.ill.fr)                                          failed
>
>
>
>
> So I have to disable the Postfix servcie with:
>        clusvcadm -d Postfix
> and re-enable
>        clusvcadm -e Postfix
>
>
>
> Could you explain my why my original smtp node is not fenced and why my
> service is not start on the 2nd node ???
>
Nodes are fenced only when they lost communications with the other nodes,
not when a service fails.
You should check the init scripts  to make sure it works fine outside the
cluster, return values are important. I think in your case is failing
because you killed postfix in a way it deleted the .pid file, and that made
the init script fail.
BTW you should configure the service as recovery="relocate" if you want them
to be started on a different node.

Greetings,
Juanra


> Is there a way to force the fencing ???
>
>
> --
> ARMANET Stephane
> Division Projet Technique
> Service Informatique
>  Groupe Infrastructure
>
> Institut Laue langevin
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090706/8c3595a9/attachment.htm>

From armanets at ill.fr  Mon Jul  6 10:49:42 2009
From: armanets at ill.fr (Armanet Stephane)
Date: Mon, 06 Jul 2009 12:49:42 +0200
Subject: [Linux-cluster] force fencing
In-Reply-To: <8a5668960907060122s8a47dd6rb89f4dade8621efe@mail.gmail.com>
References: <4A51B101.2010500@ill.fr>
	<8a5668960907060122s8a47dd6rb89f4dade8621efe@mail.gmail.com>
Message-ID: <4A51D6C6.2030006@ill.fr>

Juan Ramon Martin Blanco a ?crit :

>>
> Nodes are fenced only when they lost communications with the other nodes,
> not when a service fails.
> You should check the init scripts  to make sure it works fine outside the
> cluster, return values are important. I think in your case is failing
> because you killed postfix in a way it deleted the .pid file, and that made
> the init script fail.
> BTW you should configure the service as recovery="relocate" if you want them
> to be started on a different node.
> 
> Greetings,
> Juanra
> 
> 
> 


Thank's for the reply

I will check my init.d scripts

-- 
ARMANET Stephane
Division Projet Technique
Service Informatique
  Groupe Infrastructure

Institut Laue langevin
38042 Grenoble Cedex 9
France

Tel: 04.76.20.78.56 email: armanets at ill.fr


From esggrupos at gmail.com  Mon Jul  6 10:55:23 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Mon, 6 Jul 2009 12:55:23 +0200
Subject: [Linux-cluster] OFF TOPIC: cloud computing
In-Reply-To: <c0773fd30907021035p1a5f51cen888602f0d2dbeab8@mail.gmail.com>
References: <3128ba140907021024m4e48ff1fl992272d2239b2c0c@mail.gmail.com>
	<c0773fd30907021035p1a5f51cen888602f0d2dbeab8@mail.gmail.com>
Message-ID: <3128ba140907060355x7f110150v25492461b004ff22@mail.gmail.com>

>
> It is real, have a look at MPI for development of cloud computing (MPI CH
> as an implementation). Its used for message passing to queue out components
> of a job to various nodes.  We implemented sorting using this library last
> year that allocated tasks on a per-core basis across multiple servers.
>

Hi, thanks for your answer, it looks like interesting,

I?m still looking how to start to study this, For now I?m reading about it,
and watching videos in youtube ;-)

Thanks again,

ESG
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090706/768ffa3b/attachment.htm>

From agx at sigxcpu.org  Mon Jul  6 12:46:10 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Mon, 6 Jul 2009 14:46:10 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc3 release
In-Reply-To: <1246468327.19414.65.camel@cerberus.int.fabbione.net>
References: <1245496789.3665.328.camel@cerberus.int.fabbione.net>
	<20090629184848.GA25796@bogon.sigxcpu.org>
	<1246306200.25867.86.camel@cerberus.int.fabbione.net>
	<20090701115725.GA6565@bogon.sigxcpu.org>
	<1246454636.19414.30.camel@cerberus.int.fabbione.net>
	<20090701164007.GA10680@bogon.sigxcpu.org>
	<1246468327.19414.65.camel@cerberus.int.fabbione.net>
Message-ID: <20090706124610.GA2229@bogon.sigxcpu.org>

On Wed, Jul 01, 2009 at 07:12:07PM +0200, Fabio M. Di Nitto wrote:
> Do you have a build log for the package? and could you send me the
> make/defines.mk in the build tree?
No, not from that build we're currently using. I can rebuild though. But
from our libccss modifications:

gcc -Wall -Wformat=2 -Wshadow -Wmissing-prototypes -Wstrict-prototypes
-Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings
-Wcast-align
-Wbad-function-cast -Wmissing-format-attribute -Wformat-security
-Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing
-Wmissing-declarations -O2 -ggdb3 -MMD
-I/var/home/schmitz/3/redhat-cluster/make
-DDEFAULT_CONFIG_DIR=\"/etc/cluster\"
-DDEFAULT_CONFIG_FILE=\"cluster.conf\" -DENABLE_PACEMAKER=
-DLOGDIR=\"/var/log/cluster\" -DSYSLOGFACILITY=LOG_LOCAL4
-DSYSLOGLEVEL=LOG_INFO -DRELEASE_VERSION=\"3.0.0.rc3\" -fPIC
-D_GNU_SOURCE
-D_FILE_OFFSET_BITS=64 -I/usr/include
-I/var/home/schmitz/3/redhat-cluster/common/liblogthread `xml2-config
--cflags` -I/usr/include   -c -o libccs.o
/var/home/schmitz/3/redhat-cluster/config/libs/libccsconfdb/libccs.c
ar cru libccs.a libccs.o xpathlite.o fullxpath.o extras.o
ranlib libccs.a
gcc -shared -o libccs.so.3.0 -Wl,-soname=libccs.so.3 libccs.o
xpathlite.o
fullxpath.o extras.o  -L/usr/lib/corosync -lconfdb `xml2-config --libs`
-L/usr/lib
ln -sf libccs.so.3.0 libccs.so
ln -sf libccs.so.3.0 libccs.so.3

> gcc versions and usual tool chain info.. maybe it's a gcc bug or maybe
> it's an optimization that behaves differently between debian and fedora.
$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.3.2-1.1'

$ ld -v
GNU ld (GNU Binutils for Debian) 2.18.0.20080103

$ dpkg -s libc6-dev
Package: libc6-dev
Status: install ok installed
Priority: optional
Section: libdevel
Installed-Size: 11072
Maintainer: GNU Libc Maintainers <debian-glibc at lists.debian.org>
Architecture: amd64
Source: glibc
Version: 2.7-18
Replaces: man-db (<= 2.3.10-41), gettext (<= 0.10.26-1), ppp (<=
2.2.0f-24), libgdbmg1-dev (<= 1.7.3-24)
Provides: libc-dev
Depends: libc6 (= 2.7-18), linux-libc-dev
Recommends: gcc | c-compiler
Suggests: glibc-doc, manpages-dev
Conflicts: libstdc++2.10-dev (<< 1:2.95.2-15), gcc-2.95 (<< 1:2.95.3-8),
binutils (<< 2.17cvs20070426-1), libc-dev

> I have attached a small test case to simply test libccs. At this point I
> don't believe it's a problem in libfence. Could you please run it for me
> and send me the output? If the bug is in libccs this would start
> isolating it.
# ./testccs
xpathlite
agent=fence_ilo
Segmentation fault

# and if I prefer fullxpath over xpathlite:
# ./testccs
fullxpath
agent=fence_ilo
Segmentation fault

Cheers,
 -- Guido


From robejrm at gmail.com  Mon Jul  6 14:09:17 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Mon, 6 Jul 2009 16:09:17 +0200
Subject: [Linux-cluster] Problems with cluster-snmp rhel5.3 x86_64
Message-ID: <8a5668960907060709p169cf7fcsc60b39704d68fa29@mail.gmail.com>

Hi,

I would like to use snmp to monitor the services status in my clusters(rhel
5.3 x86_64), so I installed cluster-snmp and configure snmpd as it can be
seen on the cluster-snmp documentation with the public community "cluster".
The thing is that I cannot obtain any information from the community, only
this:
# snmpwalk -v 2c -c cluster localhost REDHAT-CLUSTER-MIB::RedHatCluster
REDHAT-CLUSTER-MIB::rhcMIBVersion.0 = INTEGER: 1

That's the only information that can be obtained from the MIB...

I.E if I query the services get this:
# snmpwalk -v 2c -c cluster localhost
REDHAT-CLUSTER-MIB::rhcClusterServicesNames
REDHAT-CLUSTER-MIB::rhcClusterServicesNames = No Such Instance currently
exists at this OID

Any clues? Is it a bug in the x86_64 version? I tested this also in rhel 5.1
32bits and worked fine.

Thanks in advance,
Juanra
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090706/9dfcd6e3/attachment.htm>

From agx at sigxcpu.org  Mon Jul  6 19:09:54 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Mon, 6 Jul 2009 21:09:54 +0200
Subject: [Linux-cluster] add "force-reload" targets to init scripts
Message-ID: <20090706190954.GA28021@bogon.sigxcpu.org>

Hi,
attached patch adds the force-reload targets to the init scripts as
expected by Debian based distros. Would be nice to have this applied for
3.0.
Cheers,
 -- Guido
-------------- next part --------------
A non-text attachment was scrubbed...
Name: force-reload.diff
Type: text/x-diff
Size: 1093 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090706/53bf0ef6/attachment.bin>

From fdinitto at redhat.com  Tue Jul  7 07:21:25 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 07 Jul 2009 09:21:25 +0200
Subject: [Linux-cluster] add "force-reload" targets to init scripts
In-Reply-To: <20090706190954.GA28021@bogon.sigxcpu.org>
References: <20090706190954.GA28021@bogon.sigxcpu.org>
Message-ID: <1246951285.7993.1.camel@cerberus.int.fabbione.net>

hi Guido,

On Mon, 2009-07-06 at 21:09 +0200, Guido G?nther wrote:
> Hi,
> attached patch adds the force-reload targets to the init scripts as
> expected by Debian based distros. Would be nice to have this applied for
> 3.0.
> Cheers,
>  -- Guido


http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=192d4e27c36fb263617ad726795f1c8dbc709497

thanks for the patch.

In future could you please send patches to cluster-devel mailing list?
It will be easier to notice them.

Fabio


From robejrm at gmail.com  Tue Jul  7 09:57:41 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 11:57:41 +0200
Subject: [Linux-cluster] qdisk best practices
In-Reply-To: <15D5002F61F31A45A82A153D2F73906760FBD3F011@HOUMS26.onshore.pgs.com>
References: <AQHJ+nlwdd65V2ETsEGsQRziDvlSfw==>
	<15D5002F61F31A45A82A153D2F73906760FBD3F011@HOUMS26.onshore.pgs.com>
Message-ID: <8a5668960907070257k27567349pbba385cb3329489c@mail.gmail.com>

On Wed, Jul 1, 2009 at 8:24 PM, Luis Cerezo <Luis.Cerezo at pgs.com> wrote:

> Hi all-
>
> i've got a RHEL 5.3 cluster, 2node with qdisk. All works fine, but the
> qdisk seems to beat on the SAN (I/Ops) I adjusted the interval from the
> default of 1 to 5 and it is still high (san admin is crying)
>
> does anyone have best practices for this? its an LSI san and both nodes are
> mulitpathed to it via 4G FC.
>
If it's really a big problem for the SAN, consider adding a third node to
the cluster and get rid of the qdisk.

Greetings,
Juanra

>
> thanks!
>
>
>
> Luis E. Cerezo
> PGS
> Global IT
>
> This e-mail, any attachments and response string may contain proprietary
> information, which are confidential and may be legally privileged.  It is
> for the intended recipient only and if you are not the intended recipient or
> transmission error has misdirected this e-mail, please notify the author by
> return e-mail and delete this message and any attachment immediately.  If
> you are not the intended recipient you must not use, disclose, distribute,
> forward, copy, print or rely in this e-mail in any way except as permitted
> by the author.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/75a77206/attachment.htm>

From robejrm at gmail.com  Tue Jul  7 10:10:51 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 12:10:51 +0200
Subject: [Linux-cluster] quorum disk size recommedation
In-Reply-To: <3128ba140906290248q620ad560m8700f65ab0bd63d8@mail.gmail.com>
References: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com>
	<200906291243.18175.harri.paivaniemi@tieto.com>
	<3128ba140906290248q620ad560m8700f65ab0bd63d8@mail.gmail.com>
Message-ID: <8a5668960907070310y4cf0924v16ccec3f892a67f5@mail.gmail.com>

On Mon, Jun 29, 2009 at 11:48 AM, ESGLinux <esggrupos at gmail.com> wrote:

> hi,
> Thanks for your quick answer.
>
> Just for curiosity, why this size? and with 10 MB, what happens if you need
> more? (the question is why can you need more? perhaps 1000 nodes? or it
> doesnt matter)
>
Correct me if I'm wrong, but Red Hat does not officially support clusters
with quorum disks, with more than 16 nodes.

Regards,
Juanra

>
> Greetings,
>
> ESG
>
> 2009/6/29 H.P?iv?niemi <harri.paivaniemi at tieto.com>
>
>
>> http://sources.redhat.com/cluster/wiki/FAQ/CMAN#quorumdisksize
>>
>> What's the minimum size of a quorum disk/partition?
>>
>> The official answer is 10MB. The real number is something like 100KB, but
>> we'd like to reserve 10MB for possible
>> future expansion and features.
>>
>>
>> -hjp
>>
>>
>>
>> On Monday 29 June 2009 12:38:39 ESGLinux wrote:
>> > Hi all,
>> >
>> > I?m planning a 2 nodes cluster and I?m going to use quorum disk. My
>> > question is which is the  best size of this kind of disk. It will be
>> > interesting to explain how calculate this size,
>> >
>> > Thanks in advance
>> >
>> > ESG
>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/374785de/attachment.htm>

From robejrm at gmail.com  Tue Jul  7 10:21:02 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 12:21:02 +0200
Subject: [Linux-cluster] cman + qdisk timeouts....
In-Reply-To: <FD716ADFEE797543AA8B2AB0B3F9148E07442044@rkamsem701.emea.roche.com>
References: <AcntxBO69ALhjGTVQJqDI+yBI6Opnw==>
	<FD716ADFEE797543AA8B2AB0B3F9148E07442044@rkamsem701.emea.roche.com>
Message-ID: <8a5668960907070321p5a082091oa7f83fff625dde47@mail.gmail.com>

On Mon, Jun 15, 2009 at 4:17 PM, Moralejo, Alfredo <
alfredo.moralejo at roche.com> wrote:

>  Hi,
>
>
>
> I?m having what I think is a timeouts issue in my cluster.
>
>
>
> I have a two node cluster using qdisk. Everytime the node that has the
> master role for qdisk becomes down (for failure or even stopping qdiskd
> manually), packages in the sane node are stopped because of the lack of
> quorum as the qdiskd becames unresponsive until second node becames master
> node and start working properly. Once qdiskd start working fine (usually 5-6
> seconds) packages are started again.
>
>
>
> I?ve read in the cluster manual section for ?CMAN membership timeout
> value? and I think this is the case. I?ve used RHEL 5.3 and I thought this
> parameter is the token that I set much longer that needed:
>
>
>
> <cluster alias="CLUSTER_ENG" config_version="75" name="CLUSTER_ENG">
>
>         <totem token="50000"/>
>
> ?
>
>
>
>         <quorumd device="/dev/mapper/mpathquorump1" interval="3"
> status_file="/tmp/qdisk" tko="3" votes="5" log_level="7"
> log_facility="local4"/>
>
>
>
>
>
> Totem token is much more that double of qdisk timeout, so I guess it should
> be enough but everytime qdisk dies in the master node I get same result,
> services restarted in the sane node:
>
>
>
> Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (2/3)
>
> Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (3/3)
>
> Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (4/3)
>
> Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 DOWN
>
> Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Making bid for master
>
> Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: <info> Executing
> /etc/init.d/watchdog status
>
> Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (5/3)
>
> Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (6/3)
>
> *Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <info> Assuming master role*
>
>
>
> Message from syslogd at rmamseslab07 at Jun 15 16:11:53 ...
>
>  clurgmgrd[18510]: <emerg> #1: Quorum Dissolved
>
> Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with
> quorum device
>
> Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking
> activity
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Membership Change
> Event
>
> *Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <emerg> #1: Quorum
> Dissolved*
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of
> service:Cluster_test_2
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of
> service:wdtcscript-rmamseslab05-ic
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of
> service:wdtcscript-rmamseslab07-ic
>
> Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of
> service:Logical volume 1
>
> Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (7/3)
>
> Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <notice> Writing eviction
> notice for node 1
>
> Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Telling CMAN to kill
> the node
>
> *Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained,
> resuming activity*
>
>
>
> I?ve just logged a case but? any idea????
>
>
>
> Regards,
>
> Hi!

Have you set two_node="0" in cman section?
Why don't you use any heuristics within the quorumd configuration? I.e:
pinging a router...
Could you paste us your cluster.conf?

Greetings,
Juanra


>
>
>
>
> *Alfredo Moralejo*
> Business Platforms Engineering - OS Servers - UNIX Senior Specialist
>
> F. Hoffmann-La Roche Ltd.
>
> Global Informatics Group Infrastructure
> Josefa Valc?rcel, 40
> 28027 Madrid SPAIN
>
> Phone: +34 91 305 97 87
>
> alfredo.moralejo at roche.com
>
> *Confidentiality Note:* This message is intended only for the use of the
> named recipient(s) and may contain confidential and/or proprietary
> information. If you are not the intended recipient, please contact the
> sender and delete this message. Any unauthorized use of the information
> contained in this message is prohibited.
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/bc0f54b1/attachment.htm>

From robejrm at gmail.com  Tue Jul  7 10:22:29 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 12:22:29 +0200
Subject: [Linux-cluster] System load at 1.00 for gfs2?
In-Reply-To: <13E5ADD5-B0C6-4339-8D86-5E46DA37B6A6@netspot.com.au>
References: <20090513173511.GA5992@esri.com>
	<8a5668960905180135p118312bfj6625f8513f477674@mail.gmail.com>
	<20090518140201.GA7429@esri.com>
	<1242655685.29604.345.camel@localhost.localdomain>
	<13E5ADD5-B0C6-4339-8D86-5E46DA37B6A6@netspot.com.au>
Message-ID: <8a5668960907070322x4bdbed49nd0ae3712a4069b0e@mail.gmail.com>

On Wed, Jun 10, 2009 at 3:29 AM, Tom Lanyon <tom at netspot.com.au> wrote:

> On 18/05/2009, at 11:38 PM, Steven Whitehouse wrote:
>
>  The fix has gone in to RHEL 5.4. I have a feeling that it might also go
>> into 5.3.z but I'm not 100% sure what the timescales are there. The bug
>> is known and fixed in upstream too.
>>
>> It isn't actually using any more CPU, its just that the LA is
>> incremented by 1. So a fix is already on its way,
>>
>> Steve.
>>
>
>
> Great, we experience this bug too. It doesn't cause any problems but
> confuses some of the administrators... :)
>
It's currently fixed in 5.3

Many thanks!

>
> Tom
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/8185f831/attachment.htm>

From esggrupos at gmail.com  Tue Jul  7 10:28:34 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Tue, 7 Jul 2009 12:28:34 +0200
Subject: [Linux-cluster] quorum disk size recommedation
In-Reply-To: <8a5668960907070310y4cf0924v16ccec3f892a67f5@mail.gmail.com>
References: <3128ba140906290238l3e98aac2s2b5ac2bf97418307@mail.gmail.com>
	<200906291243.18175.harri.paivaniemi@tieto.com>
	<3128ba140906290248q620ad560m8700f65ab0bd63d8@mail.gmail.com>
	<8a5668960907070310y4cf0924v16ccec3f892a67f5@mail.gmail.com>
Message-ID: <3128ba140907070328o3ce52d8au1e7c1934a38e0019@mail.gmail.com>

> Correct me if I'm wrong, but Red Hat does not officially support clusters
> with quorum disks, with more than 16 nodes.
>
> Regards,
> Juanra
>
>>
>>
Hi Juanra, no idea about this limit, my numbers was only to ask what happens
if you need more....

Greetings,

ESG
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/28bcea0e/attachment.htm>

From brdvss at gmail.com  Tue Jul  7 11:14:04 2009
From: brdvss at gmail.com (Brady Vass)
Date: Tue, 7 Jul 2009 16:44:04 +0530
Subject: [Linux-cluster] Re: Commands for communicating among nodes?
In-Reply-To: <995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
References: <995446330907030938t67b4c101y4dde2fbf2c51e8eb@mail.gmail.com>
	<995446330907040319j57c13bfbhfe8b92a02f2937e4@mail.gmail.com>
Message-ID: <995446330907070414j10a5a728r69689c6fb2da34a9@mail.gmail.com>

Thanks much for the responses. I will definitely try it out.

Thanks and regards,
Brady.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/0c721596/attachment.htm>

From brdvss at gmail.com  Tue Jul  7 11:21:02 2009
From: brdvss at gmail.com (Brady Vass)
Date: Tue, 7 Jul 2009 16:51:02 +0530
Subject: [Linux-cluster] Disk Monitoring in RHCS
Message-ID: <995446330907070421q5b395772j85ef8860bf7e2552@mail.gmail.com>

Hi,

I am trying to configure a cluster where the resource is on a SCSI disk and
I need to monitor the disk. The failover should happen depending on the
disk-monitoring result. Can someone point me in the right direction? How to
go about monitoring the disk ?

Thanks much in advance.

regards,
Brady.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/a904c203/attachment.htm>

From rsetchfield at xcalibre.co.uk  Tue Jul  7 12:35:24 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Tue, 07 Jul 2009 13:35:24 +0100
Subject: [Linux-cluster] Trying to locate the bottleneck
Message-ID: <4A53410C.3090704@xcalibre.co.uk>

Hi

I am trying to find a problem here with a setup which I am currently 
testing.

This is the current setup which I have at the moment

15 web farm servers which are running vhost-ldap module and also have 
ldap caching enabled. Which are behind 2 Load balancer servers which are 
in fail over. The software which it is currently running is Piranha on 
the load balancers.

I am using siege to get some benchmarking done on these to test 
basically their availability when pushing high concurrency.

At 100 (99.60 according to siege) Concurrent Connection it appears to be 
all ok with 99.89%. At 120 (119.52 according to siege) Concurrent 
connections I get 99.9%, and at 130 (129.51 according to siege) 
Concurrent Connections I get 100% availability.

However pushing it any further than this, for example 150 concurrent 
connections it is falling over and siege bails out with multiple 
connection time outs. I am trying to find the bottle neck here and I am 
wondering if it is software which I am using for the load balancers or a 
limitation with apache.

The command I am using for siege is pretty simple nothing special;

siege --concurrent=150 --internet --file=urls.txt --benchmark --time=60M

My lvs.cf file can be found here to show you guys the config which I am 
using.

http://pastebin.com/m52d6cc23

Any help would be greatly appreciated

Many Thanks

R.


From esggrupos at gmail.com  Tue Jul  7 12:52:54 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Tue, 7 Jul 2009 14:52:54 +0200
Subject: [Linux-cluster] Package Apache and Mysql Problem
In-Reply-To: <F0CC33D1F1858A4E8A01D8315C683E290227126E0353@NPOMAILLCR1.nposistemi.it>
References: <F0CC33D1F1858A4E8A01D8315C683E290227126E0353@NPOMAILLCR1.nposistemi.it>
Message-ID: <3128ba140907070552u71769fdci6201d9fd24d731a5@mail.gmail.com>

Hi,
How are you configuring the cluster? with conga? with system-config-cluster?

if you run clustat what does it shows?

If you use the command clusvcadm to start the services what happens?

any error in /var/log/messages?

Greetings

ESG

2009/7/1 Giussani Andrea <Andrea.Giussani at nposistemi.it>

> Hi,
>
> i have a little big problem with RH Cluster Suite.
>
> I have 2 cluster nodes with 1 partition to share between the 2 node. There
> is no SAN.
> The node have the same hardware and the same partition.
> I have 1 partition with drbd to sycronize the 2 nodes Primary/Primary.
>
> I try in a lot type of configuration of Apache and Mysql package but i have
> the same problem.
> The error is:
> Jul  1 18:50:39 nodo1 luci[2581]: Unable to retrieve batch 1072342062
> status from nodo2.local:11111: clusvcadm start failed to start Httpd:
>
> nodo1 and nodo2 is the 2 nodes and httpd is the apache service.
>
> Any idea???
>
> I try the configuration in this procedure:
> http://kbase.redhat.com/faq/docs/DOC-5648 for Mysql but the result is the
> same.
>
> In attach my cluster.conf and drbd.conf
>
> If we need more tell me please.
>
> Thanks a lot
>
> Andrea Giussani
>
>
> AVVERTENZE AI SENSI DEL D.LGS. 196/2003    .
>
> Il contenuto di questo messaggio (ed eventuali allegati) e' strettamente
> confidenziale. L'utilizzo del contenuto del messaggio e' riservato
> esclusivamente al destinatario. La modifica, distribuzione, copia del
> messaggio da parte di altri e' severamente proibita. Se non siete i
> destinatari Vi invitiamo ad informare il mittente ed eliminare tutte le
> copie del suddetto messaggio     .
>
> The content of this message (and attachment) is closely confidentiality.
> Use of the content of the message is classified exclusively to the
> addressee. The modification, distribution, copy of the message from others
> are forbidden. If you are not the addressees, we invite You to inform the
> sender and to eliminate all the copies of the message.
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/b9f6c798/attachment.htm>

From santosh.balan at linuxmail.org  Tue Jul  7 13:08:21 2009
From: santosh.balan at linuxmail.org (Santosh Balan)
Date: Tue, 7 Jul 2009 08:08:21 -0500
Subject: [Linux-cluster] Redhat Cluster Problem
Message-ID: <20090707130821.4B5AB10612@ws1-3.us4.outblaze.com>

Hi Friends,

I am getting the following errors on the cluster at my site. I am using
the RHEL 5 cluster suite on for HA of my database and web server. My
cluster service at one point in a day restarts the service automatically
as it cannot find the virtual ip. Can you please guide me on this issue.
On checking my logs i.e. /var/log/messages it shows me the following
info:

Jul? 6 18:18:11 DB01 clurgmgrd: [4350]: <warning> Failed to ping
xxx.xxx.xxx.xxx
Jul? 6 18:18:11 DB01 clurgmgrd[4350]: <notice> status on ip
"xxx.xxx.xxx.xxx" returned 1 (generic error)
Jul? 6 18:18:11 DB01 clurgmgrd[4350]: <notice> Stopping service
service:mysql
Jul? 6 18:18:11 DB01 clurgmgrd: [4350]: <info> Executing
/etc/init.d/mysql stop
Jul? 6 18:18:19 DB01 clurgmgrd: [4350]: <info> Removing IPv4 address
xxx.xxx.xxx.xxx from bond0
Jul? 6 18:18:19 DB01 snmpd[2238]: Connection from UDP: [127.0.0.1]:36318
Jul? 6 18:18:29 DB01 clurgmgrd: [4350]: <info> unmounting /data
Jul? 6 18:18:29 DB01 clurgmgrd[4350]: <notice> Service service:mysql is
recovering
Jul? 6 18:18:29 DB01 clurgmgrd[4350]: <notice> Recovering failed service
service:mysql
Jul? 6 18:18:30 DB01 clurgmgrd: [4350]: <info> mounting
/dev/mapper/vg01-DB on /data
Jul? 6 18:18:30 DB01 kernel: kjournald starting.? Commit interval 5
seconds
Jul? 6 18:18:30 DB01 kernel: EXT3 FS on dm-3, internal journal
Jul? 6 18:18:30 DB01 kernel: EXT3-fs: mounted filesystem with ordered
data mode.
Jul? 6 18:18:30 DB01 clurgmgrd: [4350]: <info> Adding IPv4 address
xxx.xxx.xxx.xxx to bond0
Jul? 6 18:18:31 DB01 clurgmgrd: [4350]: <info> Executing
/etc/init.d/mysql start
Jul? 6 18:18:33 DB01 clurgmgrd[4350]: <notice> Service service:mysql
started


Thanks in advance and expecting your reply at the earliest.

Thanks and Regards
Santosh Balan
9819419509

-- 
Be Yourself @ mail.com!
Choose From 200+ Email Addresses
Get a Free Account at www.mail.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/87265d4b/attachment.htm>

From robejrm at gmail.com  Tue Jul  7 13:12:52 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Tue, 7 Jul 2009 15:12:52 +0200
Subject: [Linux-cluster] Redhat Cluster Problem
In-Reply-To: <20090707130821.4B5AB10612@ws1-3.us4.outblaze.com>
References: <20090707130821.4B5AB10612@ws1-3.us4.outblaze.com>
Message-ID: <8a5668960907070612p788d154ap2a60755caa95ef9d@mail.gmail.com>

On Tue, Jul 7, 2009 at 3:08 PM, Santosh Balan
<santosh.balan at linuxmail.org>wrote:

> Hi Friends,
>
> I am getting the following errors on the cluster at my site. I am using the
> RHEL 5 cluster suite on for HA of my database and web server. My cluster
> service at one point in a day restarts the service automatically as it
> cannot find the virtual ip. Can you please guide me on this issue. On
> checking my logs i.e. /var/log/messages it shows me the following info:
>
> Jul  6 18:18:11 DB01 clurgmgrd: [4350]: <warning> Failed to ping
> xxx.xxx.xxx.xxx
> Jul  6 18:18:11 DB01 clurgmgrd[4350]: <notice> status on ip
> "xxx.xxx.xxx.xxx" returned 1 (generic error)
> Jul  6 18:18:11 DB01 clurgmgrd[4350]: <notice> Stopping service
> service:mysql
> Jul  6 18:18:11 DB01 clurgmgrd: [4350]: <info> Executing /etc/init.d/mysql
> stop
> Jul  6 18:18:19 DB01 clurgmgrd: [4350]: <info> Removing IPv4 address
> xxx.xxx.xxx.xxx from bond0
> Jul  6 18:18:19 DB01 snmpd[2238]: Connection from UDP: [127.0.0.1]:36318
> Jul  6 18:18:29 DB01 clurgmgrd: [4350]: <info> unmounting /data
> Jul  6 18:18:29 DB01 clurgmgrd[4350]: <notice> Service service:mysql is
> recovering
> Jul  6 18:18:29 DB01 clurgmgrd[4350]: <notice> Recovering failed service
> service:mysql
> Jul  6 18:18:30 DB01 clurgmgrd: [4350]: <info> mounting /dev/mapper/vg01-DB
> on /data
> Jul  6 18:18:30 DB01 kernel: kjournald starting.  Commit interval 5 seconds
> Jul  6 18:18:30 DB01 kernel: EXT3 FS on dm-3, internal journal
> Jul  6 18:18:30 DB01 kernel: EXT3-fs: mounted filesystem with ordered data
> mode.
> Jul  6 18:18:30 DB01 clurgmgrd: [4350]: <info> Adding IPv4 address
> xxx.xxx.xxx.xxx to bond0
> Jul  6 18:18:31 DB01 clurgmgrd: [4350]: <info> Executing /etc/init.d/mysql
> start
> Jul  6 18:18:33 DB01 clurgmgrd[4350]: <notice> Service service:mysql
> started
>
> The cluster is doing its job, isn't it? When the ip address is not
reachable, it restarts the service. Are you using a correct ip? Maybe
another machine in the network tries to put that ip up.

Regards,
Juanra

>
> Thanks in advance and expecting your reply at the earliest.
>
> Thanks and Regards
> Santosh Balan
> 9819419509
>
> -- Be Yourself @ mail.com!
> Choose From 200+ Email Addresses
> Get a *Free* Account at www.mail.com <http://www.mail.com/Product.aspx>!
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/0f58d5a4/attachment.htm>

From charlieb-linux-cluster at budge.apana.org.au  Tue Jul  7 14:38:39 2009
From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady)
Date: Tue, 7 Jul 2009 10:38:39 -0400 (EDT)
Subject: [Linux-cluster] GFS2 with IMAP Maildir server
In-Reply-To: <4A4E5E9B.7060906@bobich.net>
References: <58aa8d780907031230i6d78890fpdd8791521c421348@mail.gmail.com>
	<4A4E5E9B.7060906@bobich.net>
Message-ID: <Pine.LNX.4.64.0907071036490.25211@e-smith.charlieb.ott.istop.com>


On Fri, 3 Jul 2009, Gordan Bobic wrote:

> Sounds like you are running into the same bug that I ran into with GFS2 on a 
> similar setup nearly 2 years ago, except I could produce a lock-up in under 2 
> seconds every time. Solution is to use GFS1 if you really want to stick with 
> that setup, but bear in mind that, regardless of the cluster file system 
> (GFS1, GFS2, OCFS2) the performance will scale _inversely_. Cluster file 
> systems really don't work well with millions of small files.

Isn't Maildir designed to work reliably with NFS? Do you really need a 
cluster file system?


From agx at sigxcpu.org  Tue Jul  7 16:28:45 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Tue, 7 Jul 2009 18:28:45 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
In-Reply-To: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
References: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
Message-ID: <20090707162845.GA30094@bogon.sigxcpu.org>

Hi,
On Thu, Jul 02, 2009 at 01:16:30AM +0200, Fabio M. Di Nitto wrote:
> The cluster team and its community are proud to announce the
> 3.0.0.rc4 release candidate from the STABLE3 branch.
Based on earlier Debian and Ubuntu packages of corosync, openais and
cluster I have put prelimenary Debian packges (built against Debian
Squeeze) here:
 http://pkg-libvirt.alioth.debian.org/packages/unstable/
Here are the soruces.list entries:
 http://wiki.debian.org/Teams/DebianLibvirtTeam#Packages
Cheers,
 -- Guido


From fdinitto at redhat.com  Tue Jul  7 17:45:16 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 07 Jul 2009 19:45:16 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
In-Reply-To: <20090707162845.GA30094@bogon.sigxcpu.org>
References: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
	<20090707162845.GA30094@bogon.sigxcpu.org>
Message-ID: <1246988716.7993.13.camel@cerberus.int.fabbione.net>

Hi Guido,

On Tue, 2009-07-07 at 18:28 +0200, Guido G?nther wrote:
> Hi,
> On Thu, Jul 02, 2009 at 01:16:30AM +0200, Fabio M. Di Nitto wrote:
> > The cluster team and its community are proud to announce the
> > 3.0.0.rc4 release candidate from the STABLE3 branch.
> Based on earlier Debian and Ubuntu packages of corosync, openais and
> cluster I have put prelimenary Debian packges (built against Debian
> Squeeze) here:
>  http://pkg-libvirt.alioth.debian.org/packages/unstable/
> Here are the soruces.list entries:
>  http://wiki.debian.org/Teams/DebianLibvirtTeam#Packages
> Cheers,
>  -- Guido

awesome! thanks!

I didn't check them out... anyway I am adding the Ubuntu HA team in
CC... it's worth sharing the effort.

For sometime I have been thinking to pull in debian/ and .spec files
upstream and involve the maintainers to work directly with us.

This would happen for corosync/openais (they already have spec files)
and cluster.

Anybody in contact with the Debian team and see if they would like to
work more closely with us?

Cheers
Fabio


From jeff.sturm at eprize.com  Tue Jul  7 17:58:45 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Tue, 7 Jul 2009 13:58:45 -0400
Subject: [Linux-cluster] Trying to locate the bottleneck
In-Reply-To: <4A53410C.3090704@xcalibre.co.uk>
References: <4A53410C.3090704@xcalibre.co.uk>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC2C5@hugo.eprize.local>

Hi Raymond,

At those concurrency levels I would suspect network tuning may help.
Does dmesg show anything interesting on the load balancers during your
testing?

For high levels of concurrency on a NAT'd firewall or load balancer I
specifically remember having to adjust ip_conntrack_max upwards.
Perhaps network buffers as well.

-Jeff

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Raymond Setchfield
> Sent: Tuesday, July 07, 2009 8:35 AM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Trying to locate the bottleneck
> 
> Hi
> 
> I am trying to find a problem here with a setup which I am currently
> testing.
> 
> This is the current setup which I have at the moment
> 
> 15 web farm servers which are running vhost-ldap module and also have
> ldap caching enabled. Which are behind 2 Load balancer servers which
are
> in fail over. The software which it is currently running is Piranha on
> the load balancers.
> 
> I am using siege to get some benchmarking done on these to test
> basically their availability when pushing high concurrency.
> 
> At 100 (99.60 according to siege) Concurrent Connection it appears to
be
> all ok with 99.89%. At 120 (119.52 according to siege) Concurrent
> connections I get 99.9%, and at 130 (129.51 according to siege)
> Concurrent Connections I get 100% availability.
> 
> However pushing it any further than this, for example 150 concurrent
> connections it is falling over and siege bails out with multiple
> connection time outs. I am trying to find the bottle neck here and I
am
> wondering if it is software which I am using for the load balancers or
a
> limitation with apache.
> 
> The command I am using for siege is pretty simple nothing special;
> 
> siege --concurrent=150 --internet --file=urls.txt --benchmark
--time=60M
> 
> My lvs.cf file can be found here to show you guys the config which I
am
> using.
> 
> http://pastebin.com/m52d6cc23
> 
> Any help would be greatly appreciated
> 
> Many Thanks
> 
> R.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From abednegoyulo at yahoo.com  Wed Jul  8 06:53:37 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Tue, 7 Jul 2009 23:53:37 -0700 (PDT)
Subject: [Linux-cluster] Cannot make cluster after upgrade
Message-ID: <614645.81236.qm@web110403.mail.gq1.yahoo.com>


After an upgrade from 5.2 to 5.3, the cluster, named GFSCluster, seems to stop being a cluster. GFSCluster is a 2 node cluster using iscsi, cman, clvm, and gfs and it was working fine when it was on 5.2 The configuration on both of the nodes (passwords removed)

<?xml version="1.0"?>
<cluster name="GFSCluster" config_version="5">
<cman expected_votes="1" two_node="1"/>
  <clusternodes><clusternode name="node01.company.com" votes="1" nodeid="1"><fence><method name="single"><device name="node01_ipmi"/></method></fence></clusternode><clusternode name="node02.company.com" votes="1" nodeid="2"><fence><method name="single"><device name="node02_ipmi"/></method></fence></clusternode></clusternodes>
  <fencedevices><fencedevice name="node01_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.5" login="root" passwd="********"/><fencedevice name="node02_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.7" login="root" passwd="********"/></fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

When starting the service cman, they both hang on the part starting fencing

Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... 

After 5 minutes the task finishes with "done" but clustat says

==== As root on web01.company.com ====
  Cluster Status for GFSCluster @ Wed Jul  8 01:00:24 2009
  Member Status: Quorate
  
   Member Name                             ID   Status
   ------ ----                             ---- ------
   node01.company.com                         1 Online, Local
   node02.company.com                         2 Offline
  

==== As root on web02.company.com ====
  Cluster Status for GFSCluster @ Wed Jul  8 01:00:26 2009
  Member Status: Quorate
  
   Member Name                             ID   Status
   ------ ----                             ---- ------
   node01.company.com                         1 Offline
   node02.company.com                         2 Online, Local

They are both quorate with their own cluster

In the logs of web01 I found repeating messages

Jul  8 00:55:27 web01 fenced[21872]: node02.company.com not a cluster member after 6 sec post_join_delay
Jul  8 00:55:27 web01 fenced[21872]: fencing node "node02.company.com"
Jul  8 00:55:52 web01 fenced[21872]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:10.1.0.7...ipmilan: Failed to connect after 30 seconds Failed 


In the logs of web02 I also found the same repeating messages

Jul  8 00:55:27 web02 fenced[6363]: node01.company.com not a cluster member after 6 sec post_join_delay
Jul  8 00:55:27 web02 fenced[6363]: fencing node "node01.company.com"
Jul  8 00:55:53 web02 fenced[6363]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:10.1.0.5...ipmilan: Failed to connect after 30 seconds Failed


Is there a bug on 5.3 with regards to clustering?
Is there any workarounds?


      Feel safer online. Upgrade to the new, safer Internet Explorer 8 optimized for Yahoo! to put your mind at peace. It&#39;s free. Get IE8 here! http://downloads.yahoo.com/sg/internetexplorer/


From cthulhucalling at gmail.com  Wed Jul  8 06:59:24 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Tue, 7 Jul 2009 23:59:24 -0700
Subject: [Linux-cluster] Cannot make cluster after upgrade
In-Reply-To: <614645.81236.qm@web110403.mail.gq1.yahoo.com>
References: <614645.81236.qm@web110403.mail.gq1.yahoo.com>
Message-ID: <36df569a0907072359m2fce04d3h3c437b219eb73a9e@mail.gmail.com>

Sounds a little split-brainish....... have you tried the clean_start=1
option?

On Jul 7, 2009 11:54 PM, "Abed-nego G. Escobal, Jr." <abednegoyulo at yahoo.com>
wrote:


After an upgrade from 5.2 to 5.3, the cluster, named GFSCluster, seems to
stop being a cluster. GFSCluster is a 2 node cluster using iscsi, cman,
clvm, and gfs and it was working fine when it was on 5.2 The configuration
on both of the nodes (passwords removed)

<?xml version="1.0"?>
<cluster name="GFSCluster" config_version="5">
<cman expected_votes="1" two_node="1"/>
 <clusternodes><clusternode name="node01.company.com" votes="1"
nodeid="1"><fence><method name="single"><device
name="node01_ipmi"/></method></fence></clusternode><clusternode name="
node02.company.com" votes="1" nodeid="2"><fence><method
name="single"><device
name="node02_ipmi"/></method></fence></clusternode></clusternodes>
 <fencedevices><fencedevice name="node01_ipmi" agent="fence_ipmilan"
ipaddr="10.1.0.5" login="root" passwd="********"/><fencedevice
name="node02_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.7" login="root"
passwd="********"/></fencedevices>
 <rm>
   <failoverdomains/>
   <resources/>
 </rm>
</cluster>

When starting the service cman, they both hang on the part starting fencing

Starting cluster:
  Loading modules... done
  Mounting configfs... done
  Starting ccsd... done
  Starting cman... done
  Starting daemons... done
  Starting fencing...

After 5 minutes the task finishes with "done" but clustat says

==== As root on web01.company.com ====
 Cluster Status for GFSCluster @ Wed Jul  8 01:00:24 2009
 Member Status: Quorate

  Member Name                             ID   Status
  ------ ----                             ---- ------
  node01.company.com                         1 Online, Local
  node02.company.com                         2 Offline


==== As root on web02.company.com ====
 Cluster Status for GFSCluster @ Wed Jul  8 01:00:26 2009
 Member Status: Quorate

  Member Name                             ID   Status
  ------ ----                             ---- ------
  node01.company.com                         1 Offline
  node02.company.com                         2 Online, Local

They are both quorate with their own cluster

In the logs of web01 I found repeating messages

Jul  8 00:55:27 web01 fenced[21872]: node02.company.com not a cluster member
after 6 sec post_join_delay
Jul  8 00:55:27 web01 fenced[21872]: fencing node "node02.company.com"
Jul  8 00:55:52 web01 fenced[21872]: agent "fence_ipmilan" reports:
Rebooting machine @ IPMI:10.1.0.7...ipmilan: Failed to connect after 30
seconds Failed


In the logs of web02 I also found the same repeating messages

Jul  8 00:55:27 web02 fenced[6363]: node01.company.com not a cluster member
after 6 sec post_join_delay
Jul  8 00:55:27 web02 fenced[6363]: fencing node "node01.company.com"
Jul  8 00:55:53 web02 fenced[6363]: agent "fence_ipmilan" reports: Rebooting
machine @ IPMI:10.1.0.5...ipmilan: Failed to connect after 30 seconds Failed


Is there a bug on 5.3 with regards to clustering?
Is there any workarounds?


     Feel safer online. Upgrade to the new, safer Internet Explorer 8
optimized for Yahoo! to put your mind at peace. It's free. Get IE8 here!
http://downloads.yahoo.com/sg/internetexplorer/

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090707/f192b443/attachment.htm>

From alfredo.moralejo at roche.com  Wed Jul  8 08:40:21 2009
From: alfredo.moralejo at roche.com (Moralejo, Alfredo)
Date: Wed, 8 Jul 2009 10:40:21 +0200
Subject: [Linux-cluster] cman + qdisk timeouts....
In-Reply-To: <8a5668960907070321p5a082091oa7f83fff625dde47@mail.gmail.com>
References: <AcntxBO69ALhjGTVQJqDI+yBI6Opnw==>
	<FD716ADFEE797543AA8B2AB0B3F9148E07442044@rkamsem701.emea.roche.com>
	<8a5668960907070321p5a082091oa7f83fff625dde47@mail.gmail.com>
Message-ID: <FD716ADFEE797543AA8B2AB0B3F9148E6141B753@rkamsem701.emea.roche.com>

Hi,

I added a heuristic checking network status and help in network failure scenarios.

However, I still face the same problem as soon as I stop the services orderly in the node holding the qdisk master role or reboot it.

If I execute in master qdisk node:

# service rgmanager stop
# service clvmd stop
# service qdiskd stop
# service cman stop

As said by Red Hat, I get the quorum lost in the other node until get the master role (some seconds) and stop the services.

I'm managing that by adding a sleep after stopping qdiskd long enough for the other node to become master, and then stop cman.

I understand this is a bug.

My cluster.conf file:

<?xml version="1.0"?>
<cluster alias="clueng" config_version="13" name="clueng">
        <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="10"/>
        <clusternodes>
                <clusternode name="rmamseslab05" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="iLO_NODE1"/>
                                </method>
                                <method name="2">
                                        <device name="manual_fencing" nodename="rmamseslab05"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="rmamseslab07" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="iLO_NODE2"/>
                                </method>
                                <method name="2">
                                        <device name="manual_fencing" nodename="rmamseslab07"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <totem token="45000"/>
        <quorumd device="/dev/mapper/mpathquorump1" interval="5" status_file="/tmp/qdisk" tko="3" votes="1">
                <heuristic program="/usr/local/cmcluster/conf/admin/test_hb.sh" score="1" interval="3"/>
        </quorumd>
        <fencedevices>
                <fencedevice agent="fence_manual" name="manual_fencing"/>
                <fencedevice agent="fence_ilo" hostname="rbrmamseslab05" login="LANO" name="iLO_NODE1" passwd="**"/>
                <fencedevice agent="fence_ilo" hostname="rbrmamseslab07" login="LANO" name="iLO_NODE2" passwd="**"/>
        </fencedevices>
        <rm>
                <!-- Configuration of the resource group manager -->
                <failoverdomains>
                </failoverdomains>
                <service autostart="1" exclusive="0" max_restarts="1" name="pkg_test" recovery="restart" restart_expire_time="900">
                        <script file="/etc/cluster/pkg_test/startstop.sh" name="pkg_test"/>
                </service>
                <resources>
                    <nfsexport name="nfs_export"/>
                </resources>
        </rm>
</cluster>

Best regards,

Alfredo


________________________________
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Juan Ramon Martin Blanco
Sent: Tuesday, July 07, 2009 12:21 PM
To: linux clustering
Subject: Re: [Linux-cluster] cman + qdisk timeouts....


On Mon, Jun 15, 2009 at 4:17 PM, Moralejo, Alfredo <alfredo.moralejo at roche.com<mailto:alfredo.moralejo at roche.com>> wrote:

Hi,


I'm having what I think is a timeouts issue in my cluster.


I have a two node cluster using qdisk. Everytime the node that has the master role for qdisk becomes down (for failure or even stopping qdiskd manually), packages in the sane node are stopped because of the lack of quorum as the qdiskd becames unresponsive until second node becames master node and start working properly. Once qdiskd start working fine (usually 5-6 seconds) packages are started again.


I've read in the cluster manual section for "CMAN membership timeout value" and I think this is the case. I've used RHEL 5.3 and I thought this parameter is the token that I set much longer that needed:


<cluster alias="CLUSTER_ENG" config_version="75" name="CLUSTER_ENG">

        <totem token="50000"/>

...


        <quorumd device="/dev/mapper/mpathquorump1" interval="3" status_file="/tmp/qdisk" tko="3" votes="5" log_level="7" log_facility="local4"/>


Totem token is much more that double of qdisk timeout, so I guess it should be enough but everytime qdisk dies in the master node I get same result, services restarted in the sane node:


Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (2/3)

Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (3/3)

Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (4/3)

Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 DOWN

Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Making bid for master

Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: <info> Executing /etc/init.d/watchdog status

Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (5/3)

Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (6/3)

Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <info> Assuming master role


Message from syslogd at rmamseslab07 at Jun 15 16:11:53 ...

 clurgmgrd[18510]: <emerg> #1: Quorum Dissolved

Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with quorum device

Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Membership Change Event

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <emerg> #1: Quorum Dissolved

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Cluster_test_2

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab05-ic

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab07-ic

Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Logical volume 1

Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (7/3)

Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <notice> Writing eviction notice for node 1

Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Telling CMAN to kill the node

Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity


I've just logged a case but... any idea????


Regards,
Hi!

Have you set two_node="0" in cman section?
Why don't you use any heuristics within the quorumd configuration? I.e: pinging a router...
Could you paste us your cluster.conf?

Greetings,
Juanra


Alfredo Moralejo
Business Platforms Engineering - OS Servers - UNIX Senior Specialist

F. Hoffmann-La Roche Ltd.

Global Informatics Group Infrastructure
Josefa Valc?rcel, 40
28027 Madrid SPAIN

Phone: +34 91 305 97 87

alfredo.moralejo at roche.com<mailto:alfredo.moralejo at roche.com>

Confidentiality Note: This message is intended only for the use of the named recipient(s) and may contain confidential and/or proprietary information. If you are not the intended recipient, please contact the sender and delete this message. Any unauthorized use of the information contained in this message is prohibited.


--
Linux-cluster mailing list
Linux-cluster at redhat.com<mailto:Linux-cluster at redhat.com>
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/7854a173/attachment.htm>

From rsetchfield at xcalibre.co.uk  Wed Jul  8 09:21:23 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Wed, 08 Jul 2009 10:21:23 +0100
Subject: [Linux-cluster] Trying to locate the bottleneck
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC2C5@hugo.eprize.local>
References: <4A53410C.3090704@xcalibre.co.uk>
	<64D0546C5EBBD147B75DE133D798665F02FDC2C5@hugo.eprize.local>
Message-ID: <4A546513.1070408@xcalibre.co.uk>

Hi Jeff

Many Thanks for your reply.

I have had a look to see if there if there is anything suspicious within 
dmesg and within messages and unfortunately there isn't anything at all 
apart from one timeout.

Jul  8 10:15:51 loadbalancer-01 nanny[5427]: [inactive] shutting down 
192.168.10.36:80 due to connection failure
Jul  8 10:16:03 loadbalancer-01 nanny[5427]: [ active ] making 
192.168.10.36:80 available

I'll check out the possibility of any network related issues which may 
cause this problem though.

Thanks for all your help!

R.


Jeff Sturm wrote:
> Hi Raymond,
>
> At those concurrency levels I would suspect network tuning may help.
> Does dmesg show anything interesting on the load balancers during your
> testing?
>
> For high levels of concurrency on a NAT'd firewall or load balancer I
> specifically remember having to adjust ip_conntrack_max upwards.
> Perhaps network buffers as well.
>
> -Jeff
>
>   
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com
>>     
> [mailto:linux-cluster-bounces at redhat.com]
>   
>> On Behalf Of Raymond Setchfield
>> Sent: Tuesday, July 07, 2009 8:35 AM
>> To: linux-cluster at redhat.com
>> Subject: [Linux-cluster] Trying to locate the bottleneck
>>
>> Hi
>>
>> I am trying to find a problem here with a setup which I am currently
>> testing.
>>
>> This is the current setup which I have at the moment
>>
>> 15 web farm servers which are running vhost-ldap module and also have
>> ldap caching enabled. Which are behind 2 Load balancer servers which
>>     
> are
>   
>> in fail over. The software which it is currently running is Piranha on
>> the load balancers.
>>
>> I am using siege to get some benchmarking done on these to test
>> basically their availability when pushing high concurrency.
>>
>> At 100 (99.60 according to siege) Concurrent Connection it appears to
>>     
> be
>   
>> all ok with 99.89%. At 120 (119.52 according to siege) Concurrent
>> connections I get 99.9%, and at 130 (129.51 according to siege)
>> Concurrent Connections I get 100% availability.
>>
>> However pushing it any further than this, for example 150 concurrent
>> connections it is falling over and siege bails out with multiple
>> connection time outs. I am trying to find the bottle neck here and I
>>     
> am
>   
>> wondering if it is software which I am using for the load balancers or
>>     
> a
>   
>> limitation with apache.
>>
>> The command I am using for siege is pretty simple nothing special;
>>
>> siege --concurrent=150 --internet --file=urls.txt --benchmark
>>     
> --time=60M
>   
>> My lvs.cf file can be found here to show you guys the config which I
>>     
> am
>   
>> using.
>>
>> http://pastebin.com/m52d6cc23
>>
>> Any help would be greatly appreciated
>>
>> Many Thanks
>>
>> R.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>     
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>   


From abednegoyulo at yahoo.com  Wed Jul  8 09:50:35 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Wed, 8 Jul 2009 02:50:35 -0700 (PDT)
Subject: [Linux-cluster] Cannot make cluster after upgrade
Message-ID: <407893.33658.qm@web110415.mail.gq1.yahoo.com>


I haven't tried it yet. To which part of the cluster.conf should I be inserting clean_start=1 ?

--- On Wed, 7/8/09, Ian Hayes <cthulhucalling at gmail.com> wrote:

> From: Ian Hayes <cthulhucalling at gmail.com>
> Subject: Re: [Linux-cluster] Cannot make cluster after upgrade
> To: "linux clustering" <linux-cluster at redhat.com>
> Date: Wednesday, 8 July, 2009, 2:59 PM
> Sounds a little
> split-brainish....... have you tried the clean_start=1
> option?
> On Jul 7, 2009 11:54 PM,
> "Abed-nego G. Escobal, Jr." <abednegoyulo at yahoo.com>
> wrote:
> 
> 
> 
> After an upgrade from 5.2 to 5.3, the cluster, named
> GFSCluster, seems to stop being a cluster. GFSCluster is a 2
> node cluster using iscsi, cman, clvm, and gfs and it was
> working fine when it was on 5.2 The configuration on both of
> the nodes (passwords removed)
> 
> 
> 
> 
> <?xml version="1.0"?>
> 
> <cluster name="GFSCluster"
> config_version="5">
> 
> <cman expected_votes="1"
> two_node="1"/>
> 
>  ?<clusternodes><clusternode name="node01.company.com"
> votes="1"
> nodeid="1"><fence><method
> name="single"><device
> name="node01_ipmi"/></method></fence></clusternode><clusternode
> name="node02.company.com"
> votes="1"
> nodeid="2"><fence><method
> name="single"><device
> name="node02_ipmi"/></method></fence></clusternode></clusternodes>
> 
> 
>  ?<fencedevices><fencedevice
> name="node01_ipmi" agent="fence_ipmilan"
> ipaddr="10.1.0.5" login="root"
> passwd="********"/><fencedevice
> name="node02_ipmi" agent="fence_ipmilan"
> ipaddr="10.1.0.7" login="root"
> passwd="********"/></fencedevices>
> 
> 
>  ?<rm>
> 
>  ? ?<failoverdomains/>
> 
>  ? ?<resources/>
> 
>  ?</rm>
> 
> </cluster>
> 
> 
> 
> When starting the service cman, they both hang on the part
> starting fencing
> 
> 
> 
> Starting cluster:
> 
>  ? Loading modules... done
> 
>  ? Mounting configfs... done
> 
>  ? Starting ccsd... done
> 
>  ? Starting cman... done
> 
>  ? Starting daemons... done
> 
>  ? Starting fencing...
> 
> 
> 
> After 5 minutes the task finishes with "done" but
> clustat says
> 
> 
> 
> ==== As root on web01.company.com ====
> 
>  ?Cluster Status for GFSCluster @ Wed Jul ?8 01:00:24
> 2009
> 
>  ?Member Status: Quorate
> 
> 
> 
>  ? Member Name ? ? ? ? ? ? ? ? ? ? ? ? ? ?
> ID ? Status
> 
>  ? ------ ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ?
> ---- ------
> 
>  ? node01.company.com ?
> ? ? ? ? ? ? ? ? ? ? ? 1 Online, Local
> 
>  ? node02.company.com ?
> ? ? ? ? ? ? ? ? ? ? ? 2 Offline
> 
> 
> 
> 
> 
> ==== As root on web02.company.com ====
> 
>  ?Cluster Status for GFSCluster @ Wed Jul ?8 01:00:26
> 2009
> 
>  ?Member Status: Quorate
> 
> 
> 
>  ? Member Name ? ? ? ? ? ? ? ? ? ? ? ? ? ?
> ID ? Status
> 
>  ? ------ ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ?
> ---- ------
> 
>  ? node01.company.com ?
> ? ? ? ? ? ? ? ? ? ? ? 1 Offline
> 
>  ? node02.company.com ?
> ? ? ? ? ? ? ? ? ? ? ? 2 Online, Local
> 
> 
> 
> They are both quorate with their own cluster
> 
> 
> 
> In the logs of web01 I found repeating messages
> 
> 
> 
> Jul ?8 00:55:27 web01 fenced[21872]: node02.company.com not
> a cluster member after 6 sec post_join_delay
> 
> Jul ?8 00:55:27 web01 fenced[21872]: fencing node "node02.company.com"
> 
> Jul ?8 00:55:52 web01 fenced[21872]: agent
> "fence_ipmilan" reports: Rebooting machine @
> IPMI:10.1.0.7...ipmilan: Failed to connect after 30 seconds
> Failed
> 
> 
> 
> 
> 
> In the logs of web02 I also found the same repeating
> messages
> 
> 
> 
> Jul ?8 00:55:27 web02 fenced[6363]: node01.company.com not
> a cluster member after 6 sec post_join_delay
> 
> Jul ?8 00:55:27 web02 fenced[6363]: fencing node "node01.company.com"
> 
> Jul ?8 00:55:53 web02 fenced[6363]: agent
> "fence_ipmilan" reports: Rebooting machine @
> IPMI:10.1.0.5...ipmilan: Failed to connect after 30 seconds
> Failed
> 
> 
> 
> 
> 
> Is there a bug on 5.3 with regards to clustering?
> 
> Is there any workarounds?
> 
> 
> 
> 
> 
> 
> 
>  ? ? ?Feel safer online. Upgrade to the new, safer
> Internet Explorer 8 optimized for Yahoo! to put your mind at
> peace. It's free. Get IE8 here! http://downloads.yahoo.com/sg/internetexplorer/
> 
> 
> 
> 
> --
> 
> Linux-cluster mailing list
> 
> Linux-cluster at redhat.com
> 
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
> -----Inline Attachment Follows-----
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From svictor.titus at gmail.com  Wed Jul  8 11:05:52 2009
From: svictor.titus at gmail.com (victor titus)
Date: Wed, 8 Jul 2009 19:05:52 +0800
Subject: [Linux-cluster] " Inconsistent NVRAM detected" ERROR
Message-ID: <8374e0ba0907080405haee5317h3653bff38c1c21@mail.gmail.com>

Hi All,
    Below are the messages found from the Log "/var/log/messages".
Seems to be some problem with the release of NVRAM memory. Due to this
the LVM in the cluster are not detected by the server, commands like
lvdisplay, pvdisplay just show no output.

******************************************************************************
Jul  7 11:58:12 lxxxx kernel: QLogic Fibre Channel HBA Driver
Jul  7 11:58:12 lxxxx kernel: ACPI: PCI Interrupt 0000:08:00.0[A] ->
GSI 18 (level, high) -> IRQ 185
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Found an ISP2432,
irq 185, iobase 0xffffff0000006000
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Configuring PCI space...
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Configure NVRAM
parameters...
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Inconsistent NVRAM
detected: checksum=0xd46cae00 id=I version=0x1.
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Falling back to
functioning (yet invalid -- WWPN) defaults.
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Verifying loaded
RISC code...
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Allocated (64 KB) for EFT...
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Allocated (1413
KB) for firmware dump...
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Waiting for LIP to
complete...
Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Cable is unplugged...
Jul  7 11:58:12 lxxxx kernel: scsi0 : qla2xxx
Jul  7 11:58:12 lxxx kernel: qla2400 0000:08:00.0:
*********************************************************************************

Thanks,
Victor


From muruganlnx at gmail.com  Wed Jul  8 11:58:33 2009
From: muruganlnx at gmail.com (Murugan P)
Date: Wed, 8 Jul 2009 17:28:33 +0530
Subject: [Linux-cluster] RHCS with GFS2
Message-ID: <52868b3e0907080458ud37c4ffsc4d27a00e8e53d2d@mail.gmail.com>

Hi Friends,

I want to install the RHCS with GFS2 on Centos 5.3.

Kindly provide the list of packages(NAME) which is need for my requirement
and confirm whether DLM is inbuild with 5.3 Kernel.

Thanks & Regards,
P. Murugan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/07f1569a/attachment.htm>

From robejrm at gmail.com  Wed Jul  8 12:02:03 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 8 Jul 2009 14:02:03 +0200
Subject: [Linux-cluster] " Inconsistent NVRAM detected" ERROR
In-Reply-To: <8374e0ba0907080405haee5317h3653bff38c1c21@mail.gmail.com>
References: <8374e0ba0907080405haee5317h3653bff38c1c21@mail.gmail.com>
Message-ID: <8a5668960907080502x7371849ds15ec406729d48995@mail.gmail.com>

On Wed, Jul 8, 2009 at 1:05 PM, victor titus <svictor.titus at gmail.com>wrote:

> Hi All,
>    Below are the messages found from the Log "/var/log/messages".
> Seems to be some problem with the release of NVRAM memory. Due to this
> the LVM in the cluster are not detected by the server, commands like
> lvdisplay, pvdisplay just show no output.
>
>
> ******************************************************************************
> Jul  7 11:58:12 lxxxx kernel: QLogic Fibre Channel HBA Driver
> Jul  7 11:58:12 lxxxx kernel: ACPI: PCI Interrupt 0000:08:00.0[A] ->
> GSI 18 (level, high) -> IRQ 185
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Found an ISP2432,
> irq 185, iobase 0xffffff0000006000
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Configuring PCI
> space...
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Configure NVRAM
> parameters...
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Inconsistent NVRAM
> detected: checksum=0xd46cae00 id=I version=0x1.
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Falling back to
> functioning (yet invalid -- WWPN) defaults.
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Verifying loaded
> RISC code...
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Allocated (64 KB) for
> EFT...
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Allocated (1413
> KB) for firmware dump...
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Waiting for LIP to
> complete...
> Jul  7 11:58:12 lxxxx kernel: qla2400 0000:08:00.0: Cable is unplugged...
> Jul  7 11:58:12 lxxxx kernel: scsi0 : qla2xxx
> Jul  7 11:58:12 lxxx kernel: qla2400 0000:08:00.0:
>
> *********************************************************************************
>
Hi!

It seems that your fibre connection is failing, or maybe the HBA, or the
switch. Do you have a redundant path so the SAN? If so, have you configured
multipath?

Greetings,
Juanra

>
> Thanks,
> Victor
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/e03ff7fc/attachment.htm>

From giuseppe.fuggiano at gmail.com  Wed Jul  8 12:04:49 2009
From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano)
Date: Wed, 8 Jul 2009 14:04:49 +0200
Subject: [Linux-cluster] RHCS with GFS2
In-Reply-To: <52868b3e0907080458ud37c4ffsc4d27a00e8e53d2d@mail.gmail.com>
References: <52868b3e0907080458ud37c4ffsc4d27a00e8e53d2d@mail.gmail.com>
Message-ID: <1e09d9070907080504p588c2514q18b4a3f59b5fd62d@mail.gmail.com>

2009/7/8 Murugan P <muruganlnx at gmail.com>:
> Hi Friends,
>
> I want to install the RHCS with GFS2 on Centos 5.3.
>
> Kindly provide the list of packages(NAME) which is need for my requirement
> and confirm whether DLM is inbuild with 5.3 Kernel.

http://www.centos.org/docs/5/html/5.2/Cluster_Suite_Overview/

-- 
Giuseppe


From muruganlnx at gmail.com  Wed Jul  8 13:10:01 2009
From: muruganlnx at gmail.com (Murugan P)
Date: Wed, 8 Jul 2009 18:40:01 +0530
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
Message-ID: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>

HI Friends,

I need small clarification from u guys...

Whille installing the centos 5.3 which software needs to select for
RHCS(Cluster service) and clarify which is having the CMAN package.

Thanks & Regards,
P. Murugan
muruganlnx at gmail.com
9841705767
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/d350544b/attachment.htm>

From robejrm at gmail.com  Wed Jul  8 13:22:50 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 8 Jul 2009 15:22:50 +0200
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
References: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
Message-ID: <8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>

On Wed, Jul 8, 2009 at 3:10 PM, Murugan P <muruganlnx at gmail.com> wrote:

> HI Friends,
>
> I need small clarification from u guys...
>
> Whille installing the centos 5.3 which software needs to select for
> RHCS(Cluster service) and clarify which is having the CMAN package.

Hi,

rgmanager
cman
openais
and if you are using gfs2 and/or clustered lvm:
gfs2-utils
lvm2-cluster

#rpm -ql cman
In summary: fenced qdiskd ccsd groupd and associated tools

Greetings,
Juanra

P.S: I don't pretend to be rude, but read some documentation before
asking...
http://www.centos.org/docs/5/html/5.2/Cluster_Suite_Overview/s1-ha-components-CSO.html

>
> Thanks & Regards,
> P. Murugan
> muruganlnx at gmail.com
> 9841705767
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/58477a7d/attachment.htm>

From muruganlnx at gmail.com  Wed Jul  8 13:58:50 2009
From: muruganlnx at gmail.com (Murugan P)
Date: Wed, 8 Jul 2009 19:28:50 +0530
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
References: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
	<8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
Message-ID: <52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>

I have installed the OS(centos 5.3) with cluster software and after
installation i can able to see

[root at testgfs ~]# rpm -qa | grep cman
cman-2.0.98-1.el5

**********

My question is, while selecting the software at the installation time i
don't find the CMAN packages using F2 on the software
clusters/clusterStorage & base.

Kindly clarify friends , how to know that which software is having the CMAN
packages since i haven't seen the same in clustering/ClusterStorage.


On Wed, Jul 8, 2009 at 6:52 PM, Juan Ramon Martin Blanco
<robejrm at gmail.com>wrote:


>
> On Wed, Jul 8, 2009 at 3:10 PM, Murugan P <muruganlnx at gmail.com> wrote:
>
>> HI Friends,
>>
>> I need small clarification from u guys...
>>
>> Whille installing the centos 5.3 which software needs to select for
>> RHCS(Cluster service) and clarify which is having the CMAN package.
>
> Hi,
>
> rgmanager
> cman
> openais
> and if you are using gfs2 and/or clustered lvm:
> gfs2-utils
> lvm2-cluster
>
> #rpm -ql cman
> In summary: fenced qdiskd ccsd groupd and associated tools
>
> Greetings,
> Juanra
>
> P.S: I don't pretend to be rude, but read some documentation before
> asking...
> http://www.centos.org/docs/5/html/5.2/Cluster_Suite_Overview/s1-ha-components-CSO.html
>
>>
>> Thanks & Regards,
>> P. Murugan
>> muruganlnx at gmail.com
>> 9841705767
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/52467c14/attachment.htm>

From giuseppe.fuggiano at gmail.com  Wed Jul  8 14:10:56 2009
From: giuseppe.fuggiano at gmail.com (Giuseppe Fuggiano)
Date: Wed, 8 Jul 2009 16:10:56 +0200
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
References: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
	<8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
	<52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
Message-ID: <1e09d9070907080710l312e256elfbdd0e3f4b273840@mail.gmail.com>

2009/7/8 Murugan P <muruganlnx at gmail.com>:
> Kindly clarify friends , how to know that which software is having the CMAN
> packages since i haven't seen the same in clustering/ClusterStorage.

At installation time, you can install the software selecting "Groups"
of packages.  These groups can be tuned as you prefer by clicking on a
button to edit them ("Details", IIRC).  Doing so, a window with
detailed information is showed.

Cheers
-- 
Giuseppe


From robejrm at gmail.com  Wed Jul  8 14:17:04 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 8 Jul 2009 16:17:04 +0200
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
References: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
	<8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
	<52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
Message-ID: <8a5668960907080717i33bd5a4fxf1096c2129a2c11b@mail.gmail.com>

On Wed, Jul 8, 2009 at 3:58 PM, Murugan P <muruganlnx at gmail.com> wrote:

>
> I have installed the OS(centos 5.3) with cluster software and after
> installation i can able to see
>
> [root at testgfs ~]# rpm -qa | grep cman
> cman-2.0.98-1.el5
>
> **********
>
> My question is, while selecting the software at the installation time i
> don't find the CMAN packages using F2 on the software
> clusters/clusterStorage & base.
>
Those are "groups" (sorry for the spanish):
 # yum groupinfo "Clustering"
Loaded plugins: downloadonly, rhnplugin, security
Setting up Group Process

Group: Agrupamiento (clustering)
 Description: Soporte para clustering (agrupamiento).
 Default Packages:
   Cluster_Administration-en-US
   cluster-cim
   cluster-snmp
   clustermon
   conga-devel
   ipvsadm
   luci
   modcluster
   piranha
   rgmanager
   ricci
   ricci-modcluster
   system-config-cluster

#yum groupinfo "Cluster Storage"
Loaded plugins: downloadonly, rhnplugin, security
Setting up Group Process

Group: Almacenamiento del Cluster
 Description: Paquetes que proveen soporte para el almacenamiento de
cluster.
 Default Packages:
   Global_File_System-en-US
   gfs
   gfs-utils
   gnbd
   kmod-gfs
   kmod-gfs-kdump
   kmod-gnbd
   kmod-gnbd-kdump
   lvm2-cluster
 Optional Packages:
   kmod-gfs-PAE
   kmod-gfs-xen
   kmod-gnbd-PAE
   kmod-gnbd-xen

cman package is not in any group, but probably its installation was
triggered when you selected some of the groups above


> Kindly clarify friends , how to know that which software is having the CMAN
> packages since i haven't seen the same in clustering/ClusterStorage.
>
I think I don't understand you...
There are no CMAN packages, just one cman package.
In your installation process you should select the cman and rgmanager
packages, but they aren't in any group.


Regards,
Juanra

>
>
>
>
>
>
> On Wed, Jul 8, 2009 at 6:52 PM, Juan Ramon Martin Blanco <
> robejrm at gmail.com> wrote:
>
>
>
>>
>> On Wed, Jul 8, 2009 at 3:10 PM, Murugan P <muruganlnx at gmail.com> wrote:
>>
>>> HI Friends,
>>>
>>> I need small clarification from u guys...
>>>
>>> Whille installing the centos 5.3 which software needs to select for
>>> RHCS(Cluster service) and clarify which is having the CMAN package.
>>
>>  Hi,
>>
>> rgmanager
>> cman
>> openais
>> and if you are using gfs2 and/or clustered lvm:
>> gfs2-utils
>> lvm2-cluster
>>
>> #rpm -ql cman
>> In summary: fenced qdiskd ccsd groupd and associated tools
>>
>> Greetings,
>> Juanra
>>
>> P.S: I don't pretend to be rude, but read some documentation before
>> asking...
>> http://www.centos.org/docs/5/html/5.2/Cluster_Suite_Overview/s1-ha-components-CSO.html
>>
>>>
>>> Thanks & Regards,
>>> P. Murugan
>>> muruganlnx at gmail.com
>>> 9841705767
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/26186d74/attachment.htm>

From cthulhucalling at gmail.com  Wed Jul  8 14:45:38 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Wed, 8 Jul 2009 07:45:38 -0700
Subject: [Linux-cluster] Cannot make cluster after upgrade
In-Reply-To: <407893.33658.qm@web110415.mail.gq1.yahoo.com>
References: <407893.33658.qm@web110415.mail.gq1.yahoo.com>
Message-ID: <36df569a0907080745u1a498a96oc8853f37f093ea08@mail.gmail.com>

In the fence_daemon tag. Like this:

 <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>

On Wed, Jul 8, 2009 at 2:50 AM, Abed-nego G. Escobal, Jr. <
abednegoyulo at yahoo.com> wrote:

>
> I haven't tried it yet. To which part of the cluster.conf should I be
> inserting clean_start=1 ?
>
> --- On Wed, 7/8/09, Ian Hayes <cthulhucalling at gmail.com> wrote:
>
> > From: Ian Hayes <cthulhucalling at gmail.com>
> > Subject: Re: [Linux-cluster] Cannot make cluster after upgrade
> > To: "linux clustering" <linux-cluster at redhat.com>
> > Date: Wednesday, 8 July, 2009, 2:59 PM
> > Sounds a little
> > split-brainish....... have you tried the clean_start=1
> > option?
> > On Jul 7, 2009 11:54 PM,
> > "Abed-nego G. Escobal, Jr." <abednegoyulo at yahoo.com>
> > wrote:
> >
> >
> >
> > After an upgrade from 5.2 to 5.3, the cluster, named
> > GFSCluster, seems to stop being a cluster. GFSCluster is a 2
> > node cluster using iscsi, cman, clvm, and gfs and it was
> > working fine when it was on 5.2 The configuration on both of
> > the nodes (passwords removed)
> >
> >
> >
> >
> > <?xml version="1.0"?>
> >
> > <cluster name="GFSCluster"
> > config_version="5">
> >
> > <cman expected_votes="1"
> > two_node="1"/>
> >
> >   <clusternodes><clusternode name="node01.company.com"
> > votes="1"
> > nodeid="1"><fence><method
> > name="single"><device
> > name="node01_ipmi"/></method></fence></clusternode><clusternode
> > name="node02.company.com"
> > votes="1"
> > nodeid="2"><fence><method
> > name="single"><device
> > name="node02_ipmi"/></method></fence></clusternode></clusternodes>
> >
> >
> >   <fencedevices><fencedevice
> > name="node01_ipmi" agent="fence_ipmilan"
> > ipaddr="10.1.0.5" login="root"
> > passwd="********"/><fencedevice
> > name="node02_ipmi" agent="fence_ipmilan"
> > ipaddr="10.1.0.7" login="root"
> > passwd="********"/></fencedevices>
> >
> >
> >   <rm>
> >
> >     <failoverdomains/>
> >
> >     <resources/>
> >
> >   </rm>
> >
> > </cluster>
> >
> >
> >
> > When starting the service cman, they both hang on the part
> > starting fencing
> >
> >
> >
> > Starting cluster:
> >
> >    Loading modules... done
> >
> >    Mounting configfs... done
> >
> >    Starting ccsd... done
> >
> >    Starting cman... done
> >
> >    Starting daemons... done
> >
> >    Starting fencing...
> >
> >
> >
> > After 5 minutes the task finishes with "done" but
> > clustat says
> >
> >
> >
> > ==== As root on web01.company.com ====
> >
> >   Cluster Status for GFSCluster @ Wed Jul  8 01:00:24
> > 2009
> >
> >   Member Status: Quorate
> >
> >
> >
> >    Member Name
> > ID   Status
> >
> >    ------ ----
> > ---- ------
> >
> >    node01.company.com
> >                       1 Online, Local
> >
> >    node02.company.com
> >                       2 Offline
> >
> >
> >
> >
> >
> > ==== As root on web02.company.com ====
> >
> >   Cluster Status for GFSCluster @ Wed Jul  8 01:00:26
> > 2009
> >
> >   Member Status: Quorate
> >
> >
> >
> >    Member Name
> > ID   Status
> >
> >    ------ ----
> > ---- ------
> >
> >    node01.company.com
> >                       1 Offline
> >
> >    node02.company.com
> >                       2 Online, Local
> >
> >
> >
> > They are both quorate with their own cluster
> >
> >
> >
> > In the logs of web01 I found repeating messages
> >
> >
> >
> > Jul  8 00:55:27 web01 fenced[21872]: node02.company.com not
> > a cluster member after 6 sec post_join_delay
> >
> > Jul  8 00:55:27 web01 fenced[21872]: fencing node "node02.company.com"
> >
> > Jul  8 00:55:52 web01 fenced[21872]: agent
> > "fence_ipmilan" reports: Rebooting machine @
> > IPMI:10.1.0.7...ipmilan: Failed to connect after 30 seconds
> > Failed
> >
> >
> >
> >
> >
> > In the logs of web02 I also found the same repeating
> > messages
> >
> >
> >
> > Jul  8 00:55:27 web02 fenced[6363]: node01.company.com not
> > a cluster member after 6 sec post_join_delay
> >
> > Jul  8 00:55:27 web02 fenced[6363]: fencing node "node01.company.com"
> >
> > Jul  8 00:55:53 web02 fenced[6363]: agent
> > "fence_ipmilan" reports: Rebooting machine @
> > IPMI:10.1.0.5...ipmilan: Failed to connect after 30 seconds
> > Failed
> >
> >
> >
> >
> >
> > Is there a bug on 5.3 with regards to clustering?
> >
> > Is there any workarounds?
> >
> >
> >
> >
> >
> >
> >
> >       Feel safer online. Upgrade to the new, safer
> > Internet Explorer 8 optimized for Yahoo! to put your mind at
> > peace. It's free. Get IE8 here!
> http://downloads.yahoo.com/sg/internetexplorer/
> >
> >
> >
> >
> > --
> >
> > Linux-cluster mailing list
> >
> > Linux-cluster at redhat.com
> >
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> >
> > -----Inline Attachment Follows-----
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/eb317705/attachment.htm>

From robejrm at gmail.com  Wed Jul  8 15:03:29 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 8 Jul 2009 17:03:29 +0200
Subject: [Linux-cluster] gfs2_quotad kthread
Message-ID: <8a5668960907080803q5fc8d567v7f556e700677a8d6@mail.gmail.com>

Hi folks!

Now that developers (many thanks!) have corrected the issue that made
gfs2_quotad to stay in an uninterruptible state (so it was got into account
when calculating system load), I have another question: if quotas are
disabled by default, why is this kthread started? Is there any way of
avoiding it to start? It starts even if I mount the fs explicitely with the
quota=off option (even if it's supposed to be the default option).
It's a bit annoying, although I think it does not harm the system, but I
have as many gfs2_quotad kthreads as mounted filesystem (usually a lot, more
than 50).

Thanks in advance,
Juanra
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/8283e6c6/attachment.htm>

From swhiteho at redhat.com  Wed Jul  8 15:11:44 2009
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 08 Jul 2009 16:11:44 +0100
Subject: [Linux-cluster] gfs2_quotad kthread
In-Reply-To: <8a5668960907080803q5fc8d567v7f556e700677a8d6@mail.gmail.com>
References: <8a5668960907080803q5fc8d567v7f556e700677a8d6@mail.gmail.com>
Message-ID: <1247065904.3391.2.camel@localhost.localdomain>

Hi,

You can't avoid it starting. It does other things aside from quotas
although its workload is pretty light. It might be possible in the
future to avoid having quite so many threads (I've already got rid of a
number of them) but its not very high priority.

It maybe that some of the work could be pushed off to a workqueue, or
the slow-work subsystem instead,

Steve.

On Wed, 2009-07-08 at 17:03 +0200, Juan Ramon Martin Blanco wrote:
> Hi folks!
> 
> Now that developers (many thanks!) have corrected the issue that made
> gfs2_quotad to stay in an uninterruptible state (so it was got into
> account when calculating system load), I have another question: if
> quotas are disabled by default, why is this kthread started? Is there
> any way of avoiding it to start? It starts even if I mount the fs
> explicitely with the quota=off option (even if it's supposed to be the
> default option). 
> It's a bit annoying, although I think it does not harm the system, but
> I have as many gfs2_quotad kthreads as mounted filesystem (usually a
> lot, more than 50).
> 
> Thanks in advance,
> Juanra
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From muruganlnx at gmail.com  Wed Jul  8 15:11:12 2009
From: muruganlnx at gmail.com (Murugan P)
Date: Wed, 8 Jul 2009 20:41:12 +0530
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <8a5668960907080717i33bd5a4fxf1096c2129a2c11b@mail.gmail.com>
References: <52868b3e0907080610u30451c85n25d726d277259237@mail.gmail.com>
	<8a5668960907080622o2a5f47dfjd1c87abfcd6a404d@mail.gmail.com>
	<52868b3e0907080658r20ee64e3sab2285c67512fd36@mail.gmail.com>
	<8a5668960907080717i33bd5a4fxf1096c2129a2c11b@mail.gmail.com>
Message-ID: <52868b3e0907080811t7532010bqb6fae340ac34b8ff@mail.gmail.com>

Thanks Juanra got my point.

CMAN package is not present in any of the group including the
groups(Clustering/ClusterStorages/Base).

But, i want to know, from which group the CMAN package is installing.

One more thing want to share with you, If u selected the above mentioned
group after the installation the CMAN package also installed but not able to
see it.


On Wed, Jul 8, 2009 at 7:47 PM, Juan Ramon Martin Blanco
<robejrm at gmail.com>wrote:

>
>
> On Wed, Jul 8, 2009 at 3:58 PM, Murugan P <muruganlnx at gmail.com> wrote:
>
>>
>> I have installed the OS(centos 5.3) with cluster software and after
>> installation i can able to see
>>
>> [root at testgfs ~]# rpm -qa | grep cman
>> cman-2.0.98-1.el5
>>
>> **********
>>
>> My question is, while selecting the software at the installation time i
>> don't find the CMAN packages using F2 on the software
>> clusters/clusterStorage & base.
>>
> Those are "groups" (sorry for the spanish):
>  # yum groupinfo "Clustering"
> Loaded plugins: downloadonly, rhnplugin, security
> Setting up Group Process
>
> Group: Agrupamiento (clustering)
>  Description: Soporte para clustering (agrupamiento).
>  Default Packages:
>    Cluster_Administration-en-US
>    cluster-cim
>    cluster-snmp
>    clustermon
>    conga-devel
>    ipvsadm
>    luci
>    modcluster
>    piranha
>    rgmanager
>    ricci
>    ricci-modcluster
>    system-config-cluster
>
> #yum groupinfo "Cluster Storage"
> Loaded plugins: downloadonly, rhnplugin, security
> Setting up Group Process
>
> Group: Almacenamiento del Cluster
>  Description: Paquetes que proveen soporte para el almacenamiento de
> cluster.
>  Default Packages:
>    Global_File_System-en-US
>    gfs
>    gfs-utils
>    gnbd
>    kmod-gfs
>    kmod-gfs-kdump
>    kmod-gnbd
>    kmod-gnbd-kdump
>    lvm2-cluster
>  Optional Packages:
>    kmod-gfs-PAE
>    kmod-gfs-xen
>    kmod-gnbd-PAE
>    kmod-gnbd-xen
>
> cman package is not in any group, but probably its installation was
> triggered when you selected some of the groups above
>
>
>> Kindly clarify friends , how to know that which software is having the
>> CMAN packages since i haven't seen the same in clustering/ClusterStorage.
>>
> I think I don't understand you...
> There are no CMAN packages, just one cman package.
> In your installation process you should select the cman and rgmanager
> packages, but they aren't in any group.
>
>
> Regards,
> Juanra
>
>>
>>
>>
>>
>>
>>
>> On Wed, Jul 8, 2009 at 6:52 PM, Juan Ramon Martin Blanco <
>> robejrm at gmail.com> wrote:
>>
>>
>>
>>>
>>> On Wed, Jul 8, 2009 at 3:10 PM, Murugan P <muruganlnx at gmail.com> wrote:
>>>
>>>> HI Friends,
>>>>
>>>> I need small clarification from u guys...
>>>>
>>>> Whille installing the centos 5.3 which software needs to select for
>>>> RHCS(Cluster service) and clarify which is having the CMAN package.
>>>
>>>  Hi,
>>>
>>> rgmanager
>>> cman
>>> openais
>>> and if you are using gfs2 and/or clustered lvm:
>>> gfs2-utils
>>> lvm2-cluster
>>>
>>> #rpm -ql cman
>>> In summary: fenced qdiskd ccsd groupd and associated tools
>>>
>>> Greetings,
>>> Juanra
>>>
>>> P.S: I don't pretend to be rude, but read some documentation before
>>> asking...
>>> http://www.centos.org/docs/5/html/5.2/Cluster_Suite_Overview/s1-ha-components-CSO.html
>>>
>>>>
>>>> Thanks & Regards,
>>>> P. Murugan
>>>> muruganlnx at gmail.com
>>>> 9841705767
>>>>
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/bce9d0d2/attachment.htm>

From rsetchfield at xcalibre.co.uk  Wed Jul  8 15:13:23 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Wed, 08 Jul 2009 16:13:23 +0100
Subject: [Linux-cluster] Trying to locate the bottleneck
In-Reply-To: <4A546513.1070408@xcalibre.co.uk>
References: <4A53410C.3090704@xcalibre.co.uk>	<64D0546C5EBBD147B75DE133D798665F02FDC2C5@hugo.eprize.local>
	<4A546513.1070408@xcalibre.co.uk>
Message-ID: <4A54B793.8080500@xcalibre.co.uk>

Hi Guys

I am trying to locate ip_conntrack_max within CentOS 5.3 but it doesn't 
appear to be where I expect it to be. I have googled for this and from 
what I have read it should be located within

/proc/sys/net/ipv4/ip_conntrack_max

Which is where I thought it would be but unfortunately it isn't.

Here is some output

[root at loadbalancer-01 ~]# grep conn /proc/slabinfo
ip_vs_conn             0      0    128   30    1 : tunables  120   60    
8 : slabdata      0      0      0

[root at loadbalancer-01 ~]# rpm -qa | grep kernel
kernel-headers-2.6.18-53.1.14.el5
kernel-devel-2.6.18-53.1.14.el5
kernel-2.6.18-53.1.14.el5

[root at loadbalancer-01 ~]# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
cat: /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such file or 
directory
 

I have also checked within  /etc/sysctl.conf and nothing.

Can someone help me?

Thanks in advance

Raymond

Raymond Setchfield wrote:
> Hi Jeff
>
> Many Thanks for your reply.
>
> I have had a look to see if there if there is anything suspicious 
> within dmesg and within messages and unfortunately there isn't 
> anything at all apart from one timeout.
>
> Jul  8 10:15:51 loadbalancer-01 nanny[5427]: [inactive] shutting down 
> 192.168.10.36:80 due to connection failure
> Jul  8 10:16:03 loadbalancer-01 nanny[5427]: [ active ] making 
> 192.168.10.36:80 available
>
> I'll check out the possibility of any network related issues which may 
> cause this problem though.
>
> Thanks for all your help!
>
> R.
>
>
> Jeff Sturm wrote:
>> Hi Raymond,
>>
>> At those concurrency levels I would suspect network tuning may help.
>> Does dmesg show anything interesting on the load balancers during your
>> testing?
>>
>> For high levels of concurrency on a NAT'd firewall or load balancer I
>> specifically remember having to adjust ip_conntrack_max upwards.
>> Perhaps network buffers as well.
>>
>> -Jeff
>>
>>  
>>> -----Original Message-----
>>> From: linux-cluster-bounces at redhat.com
>>>     
>> [mailto:linux-cluster-bounces at redhat.com]
>>  
>>> On Behalf Of Raymond Setchfield
>>> Sent: Tuesday, July 07, 2009 8:35 AM
>>> To: linux-cluster at redhat.com
>>> Subject: [Linux-cluster] Trying to locate the bottleneck
>>>
>>> Hi
>>>
>>> I am trying to find a problem here with a setup which I am currently
>>> testing.
>>>
>>> This is the current setup which I have at the moment
>>>
>>> 15 web farm servers which are running vhost-ldap module and also have
>>> ldap caching enabled. Which are behind 2 Load balancer servers which
>>>     
>> are
>>  
>>> in fail over. The software which it is currently running is Piranha on
>>> the load balancers.
>>>
>>> I am using siege to get some benchmarking done on these to test
>>> basically their availability when pushing high concurrency.
>>>
>>> At 100 (99.60 according to siege) Concurrent Connection it appears to
>>>     
>> be
>>  
>>> all ok with 99.89%. At 120 (119.52 according to siege) Concurrent
>>> connections I get 99.9%, and at 130 (129.51 according to siege)
>>> Concurrent Connections I get 100% availability.
>>>
>>> However pushing it any further than this, for example 150 concurrent
>>> connections it is falling over and siege bails out with multiple
>>> connection time outs. I am trying to find the bottle neck here and I
>>>     
>> am
>>  
>>> wondering if it is software which I am using for the load balancers or
>>>     
>> a
>>  
>>> limitation with apache.
>>>
>>> The command I am using for siege is pretty simple nothing special;
>>>
>>> siege --concurrent=150 --internet --file=urls.txt --benchmark
>>>     
>> --time=60M
>>  
>>> My lvs.cf file can be found here to show you guys the config which I
>>>     
>> am
>>  
>>> using.
>>>
>>> http://pastebin.com/m52d6cc23
>>>
>>> Any help would be greatly appreciated
>>>
>>> Many Thanks
>>>
>>> R.
>>>
>>> -- 
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>     
>>
>>
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>>   
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>


From kkovachev at varna.net  Wed Jul  8 15:20:10 2009
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Wed, 8 Jul 2009 18:20:10 +0300
Subject: [Linux-cluster] gnbd with multipath
Message-ID: <20090708151056.M9047@varna.net>

Hello list,
 i am experimenting with a cluster based on RHCM, but to avoid single point of
failure i would like to have two storage machines, replicated via drbd. To be
able to access the data in case of failure of one of the storages there should
be multipath access to the data. So my question is: Is it possible (and how)
to configure multipath for a gnbd imported device, because i get a message
that the GFS2 filesytem is already imported from the first storage when i try
to import it from the second one.


From jeff.sturm at eprize.com  Wed Jul  8 17:20:02 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 8 Jul 2009 13:20:02 -0400
Subject: [Linux-cluster] Trying to locate the bottleneck
In-Reply-To: <4A54B793.8080500@xcalibre.co.uk>
References: <4A53410C.3090704@xcalibre.co.uk>	<64D0546C5EBBD147B75DE133D798665F02FDC2C5@hugo.eprize.local><4A546513.1070408@xcalibre.co.uk>
	<4A54B793.8080500@xcalibre.co.uk>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC2F8@hugo.eprize.local>

I think this is created when you first run iptables.  If you have no NAT
rules on the load balancer, the ip_conntrack_max setting won't exist,
and you'll need to look somewhere else for the problem.

-Jeff

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Raymond Setchfield
> Sent: Wednesday, July 08, 2009 11:13 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] Trying to locate the bottleneck
> 
> Hi Guys
> 
> I am trying to locate ip_conntrack_max within CentOS 5.3 but it
doesn't
> appear to be where I expect it to be. I have googled for this and from
> what I have read it should be located within
> 
> /proc/sys/net/ipv4/ip_conntrack_max
> 
> Which is where I thought it would be but unfortunately it isn't.
> 
> Here is some output
> 
> [root at loadbalancer-01 ~]# grep conn /proc/slabinfo
> ip_vs_conn             0      0    128   30    1 : tunables  120   60
> 8 : slabdata      0      0      0
> 
> [root at loadbalancer-01 ~]# rpm -qa | grep kernel
> kernel-headers-2.6.18-53.1.14.el5
> kernel-devel-2.6.18-53.1.14.el5
> kernel-2.6.18-53.1.14.el5
> 
> [root at loadbalancer-01 ~]# cat
/proc/sys/net/ipv4/netfilter/ip_conntrack_max
> cat: /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such file or
> directory
> 
> 
> I have also checked within  /etc/sysctl.conf and nothing.
> 
> Can someone help me?
> 
> Thanks in advance
> 
> Raymond
> 
> Raymond Setchfield wrote:
> > Hi Jeff
> >
> > Many Thanks for your reply.
> >
> > I have had a look to see if there if there is anything suspicious
> > within dmesg and within messages and unfortunately there isn't
> > anything at all apart from one timeout.
> >
> > Jul  8 10:15:51 loadbalancer-01 nanny[5427]: [inactive] shutting
down
> > 192.168.10.36:80 due to connection failure
> > Jul  8 10:16:03 loadbalancer-01 nanny[5427]: [ active ] making
> > 192.168.10.36:80 available
> >
> > I'll check out the possibility of any network related issues which
may
> > cause this problem though.
> >
> > Thanks for all your help!
> >
> > R.
> >
> >
> > Jeff Sturm wrote:
> >> Hi Raymond,
> >>
> >> At those concurrency levels I would suspect network tuning may
help.
> >> Does dmesg show anything interesting on the load balancers during
your
> >> testing?
> >>
> >> For high levels of concurrency on a NAT'd firewall or load balancer
I
> >> specifically remember having to adjust ip_conntrack_max upwards.
> >> Perhaps network buffers as well.
> >>
> >> -Jeff
> >>
> >>
> >>> -----Original Message-----
> >>> From: linux-cluster-bounces at redhat.com
> >>>
> >> [mailto:linux-cluster-bounces at redhat.com]
> >>
> >>> On Behalf Of Raymond Setchfield
> >>> Sent: Tuesday, July 07, 2009 8:35 AM
> >>> To: linux-cluster at redhat.com
> >>> Subject: [Linux-cluster] Trying to locate the bottleneck
> >>>
> >>> Hi
> >>>
> >>> I am trying to find a problem here with a setup which I am
currently
> >>> testing.
> >>>
> >>> This is the current setup which I have at the moment
> >>>
> >>> 15 web farm servers which are running vhost-ldap module and also
have
> >>> ldap caching enabled. Which are behind 2 Load balancer servers
which
> >>>
> >> are
> >>
> >>> in fail over. The software which it is currently running is
Piranha on
> >>> the load balancers.
> >>>
> >>> I am using siege to get some benchmarking done on these to test
> >>> basically their availability when pushing high concurrency.
> >>>
> >>> At 100 (99.60 according to siege) Concurrent Connection it appears
to
> >>>
> >> be
> >>
> >>> all ok with 99.89%. At 120 (119.52 according to siege) Concurrent
> >>> connections I get 99.9%, and at 130 (129.51 according to siege)
> >>> Concurrent Connections I get 100% availability.
> >>>
> >>> However pushing it any further than this, for example 150
concurrent
> >>> connections it is falling over and siege bails out with multiple
> >>> connection time outs. I am trying to find the bottle neck here and
I
> >>>
> >> am
> >>
> >>> wondering if it is software which I am using for the load
balancers or
> >>>
> >> a
> >>
> >>> limitation with apache.
> >>>
> >>> The command I am using for siege is pretty simple nothing special;
> >>>
> >>> siege --concurrent=150 --internet --file=urls.txt --benchmark
> >>>
> >> --time=60M
> >>
> >>> My lvs.cf file can be found here to show you guys the config which
I
> >>>
> >> am
> >>
> >>> using.
> >>>
> >>> http://pastebin.com/m52d6cc23
> >>>
> >>> Any help would be greatly appreciated
> >>>
> >>> Many Thanks
> >>>
> >>> R.
> >>>
> >>> --
> >>> Linux-cluster mailing list
> >>> Linux-cluster at redhat.com
> >>> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>>
> >>
> >>
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>
> >>
> >>
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From rsetchfield at xcalibre.co.uk  Wed Jul  8 17:29:38 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Wed, 08 Jul 2009 18:29:38 +0100
Subject: [Linux-cluster] Trying to locate the bottleneck
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC2F8@hugo.eprize.local>
References: <4A53410C.3090704@xcalibre.co.uk>	<64D0546C5EBBD147B75DE133D798665F02FDC2C5@hugo.eprize.local><4A546513.1070408@xcalibre.co.uk>	<4A54B793.8080500@xcalibre.co.uk>
	<64D0546C5EBBD147B75DE133D798665F02FDC2F8@hugo.eprize.local>
Message-ID: <4A54D782.3030308@xcalibre.co.uk>

Hi Jeff

iptables is disabled within this setup as this is basically being done 
within a development enviroment. Still on the hunt to see where this 
bottle neck is happening though. Trying alternative loadbalancing 
software to see if I get the same results

Thanks

R.

Jeff Sturm wrote:
> I think this is created when you first run iptables.  If you have no NAT
> rules on the load balancer, the ip_conntrack_max setting won't exist,
> and you'll need to look somewhere else for the problem.
>
> -Jeff
>
>   
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com
>>     
> [mailto:linux-cluster-bounces at redhat.com]
>   
>> On Behalf Of Raymond Setchfield
>> Sent: Wednesday, July 08, 2009 11:13 AM
>> To: linux clustering
>> Subject: Re: [Linux-cluster] Trying to locate the bottleneck
>>
>> Hi Guys
>>
>> I am trying to locate ip_conntrack_max within CentOS 5.3 but it
>>     
> doesn't
>   
>> appear to be where I expect it to be. I have googled for this and from
>> what I have read it should be located within
>>
>> /proc/sys/net/ipv4/ip_conntrack_max
>>
>> Which is where I thought it would be but unfortunately it isn't.
>>
>> Here is some output
>>
>> [root at loadbalancer-01 ~]# grep conn /proc/slabinfo
>> ip_vs_conn             0      0    128   30    1 : tunables  120   60
>> 8 : slabdata      0      0      0
>>
>> [root at loadbalancer-01 ~]# rpm -qa | grep kernel
>> kernel-headers-2.6.18-53.1.14.el5
>> kernel-devel-2.6.18-53.1.14.el5
>> kernel-2.6.18-53.1.14.el5
>>
>> [root at loadbalancer-01 ~]# cat
>>     
> /proc/sys/net/ipv4/netfilter/ip_conntrack_max
>   
>> cat: /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such file or
>> directory
>>
>>
>> I have also checked within  /etc/sysctl.conf and nothing.
>>
>> Can someone help me?
>>
>> Thanks in advance
>>
>> Raymond
>>
>> Raymond Setchfield wrote:
>>     
>>> Hi Jeff
>>>
>>> Many Thanks for your reply.
>>>
>>> I have had a look to see if there if there is anything suspicious
>>> within dmesg and within messages and unfortunately there isn't
>>> anything at all apart from one timeout.
>>>
>>> Jul  8 10:15:51 loadbalancer-01 nanny[5427]: [inactive] shutting
>>>       
> down
>   
>>> 192.168.10.36:80 due to connection failure
>>> Jul  8 10:16:03 loadbalancer-01 nanny[5427]: [ active ] making
>>> 192.168.10.36:80 available
>>>
>>> I'll check out the possibility of any network related issues which
>>>       
> may
>   
>>> cause this problem though.
>>>
>>> Thanks for all your help!
>>>
>>> R.
>>>
>>>
>>> Jeff Sturm wrote:
>>>       
>>>> Hi Raymond,
>>>>
>>>> At those concurrency levels I would suspect network tuning may
>>>>         
> help.
>   
>>>> Does dmesg show anything interesting on the load balancers during
>>>>         
> your
>   
>>>> testing?
>>>>
>>>> For high levels of concurrency on a NAT'd firewall or load balancer
>>>>         
> I
>   
>>>> specifically remember having to adjust ip_conntrack_max upwards.
>>>> Perhaps network buffers as well.
>>>>
>>>> -Jeff
>>>>
>>>>
>>>>         
>>>>> -----Original Message-----
>>>>> From: linux-cluster-bounces at redhat.com
>>>>>
>>>>>           
>>>> [mailto:linux-cluster-bounces at redhat.com]
>>>>
>>>>         
>>>>> On Behalf Of Raymond Setchfield
>>>>> Sent: Tuesday, July 07, 2009 8:35 AM
>>>>> To: linux-cluster at redhat.com
>>>>> Subject: [Linux-cluster] Trying to locate the bottleneck
>>>>>
>>>>> Hi
>>>>>
>>>>> I am trying to find a problem here with a setup which I am
>>>>>           
> currently
>   
>>>>> testing.
>>>>>
>>>>> This is the current setup which I have at the moment
>>>>>
>>>>> 15 web farm servers which are running vhost-ldap module and also
>>>>>           
> have
>   
>>>>> ldap caching enabled. Which are behind 2 Load balancer servers
>>>>>           
> which
>   
>>>> are
>>>>
>>>>         
>>>>> in fail over. The software which it is currently running is
>>>>>           
> Piranha on
>   
>>>>> the load balancers.
>>>>>
>>>>> I am using siege to get some benchmarking done on these to test
>>>>> basically their availability when pushing high concurrency.
>>>>>
>>>>> At 100 (99.60 according to siege) Concurrent Connection it appears
>>>>>           
> to
>   
>>>> be
>>>>
>>>>         
>>>>> all ok with 99.89%. At 120 (119.52 according to siege) Concurrent
>>>>> connections I get 99.9%, and at 130 (129.51 according to siege)
>>>>> Concurrent Connections I get 100% availability.
>>>>>
>>>>> However pushing it any further than this, for example 150
>>>>>           
> concurrent
>   
>>>>> connections it is falling over and siege bails out with multiple
>>>>> connection time outs. I am trying to find the bottle neck here and
>>>>>           
> I
>   
>>>> am
>>>>
>>>>         
>>>>> wondering if it is software which I am using for the load
>>>>>           
> balancers or
>   
>>>> a
>>>>
>>>>         
>>>>> limitation with apache.
>>>>>
>>>>> The command I am using for siege is pretty simple nothing special;
>>>>>
>>>>> siege --concurrent=150 --internet --file=urls.txt --benchmark
>>>>>
>>>>>           
>>>> --time=60M
>>>>
>>>>         
>>>>> My lvs.cf file can be found here to show you guys the config which
>>>>>           
> I
>   
>>>> am
>>>>
>>>>         
>>>>> using.
>>>>>
>>>>> http://pastebin.com/m52d6cc23
>>>>>
>>>>> Any help would be greatly appreciated
>>>>>
>>>>> Many Thanks
>>>>>
>>>>> R.
>>>>>
>>>>> --
>>>>> Linux-cluster mailing list
>>>>> Linux-cluster at redhat.com
>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>>
>>>>>           
>>>>
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>>
>>>>
>>>>         
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>>
>>>       
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>     
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>   


From muruganlnx at gmail.com  Wed Jul  8 18:51:47 2009
From: muruganlnx at gmail.com (Murugan P)
Date: Thu, 9 Jul 2009 00:21:47 +0530
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
Message-ID: <52868b3e0907081151y7078c715kb0f50b90bf01b7a0@mail.gmail.com>

HI Friends,


I want to get the clear understanding about the software (package) openasis
which is using for the RHCS 5.

Kindly give me your input or suggest any URL for the same.


--Muruga
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090709/79ac0d69/attachment.htm>

From robejrm at gmail.com  Wed Jul  8 18:56:07 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 8 Jul 2009 20:56:07 +0200
Subject: [Linux-cluster] RHCS with GFS2 on centos 5.3
In-Reply-To: <52868b3e0907081151y7078c715kb0f50b90bf01b7a0@mail.gmail.com>
References: <52868b3e0907081151y7078c715kb0f50b90bf01b7a0@mail.gmail.com>
Message-ID: <8a5668960907081156p1eff0cbajedc7b66ca60fdeb1@mail.gmail.com>

On Wed, Jul 8, 2009 at 8:51 PM, Murugan P <muruganlnx at gmail.com> wrote:

> HI Friends,
>
>
> I want to get the clear understanding about the software (package) openasis
> which is using for the RHCS 5.
>
> Kindly give me your input or suggest any URL for the same.
>
Hi,,

It's openais and a quick google search leads you to www.openais.org

Greetings,
Juanra

>
>
>
> --Muruga
>
>
>
>
>
>
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/7f5bef74/attachment.htm>

From pschobel at 1iopen.net  Wed Jul  8 20:58:30 2009
From: pschobel at 1iopen.net (Peter Schobel)
Date: Wed, 8 Jul 2009 13:58:30 -0700
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
Message-ID: <c8c8ded30907081358r5179cdd5m2bd63bb4d3fce6da@mail.gmail.com>

I am trying to set up a four node cluster but am getting very poor
performance when removing large directories. A directory approximately
1.6G  in size takes around 5 mins to remove from the gfs2 filesystem
but removes in around 10 seconds from the local disk.

I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.

The filesystem was formatted in the following manner: mkfs.gfs2 -t
wtl_build:dev_home00 -p lock_dlm -j 10
/dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
following options: _netdev,noatime,defaults.

If anyone knows what could be causing this please let me know. I'm
happy to provide any other information.

Regards,

-- 
Peter Schobel
~


From fdinitto at redhat.com  Wed Jul  8 21:10:12 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Wed, 08 Jul 2009 23:10:12 +0200
Subject: [Linux-cluster] Cluster 3.0.0 final stable release
Message-ID: <1247087412.7941.37.camel@cerberus.int.fabbione.net>

The cluster team and its community are proud to announce the 3.0.0 final
release from the STABLE3 branch.

"And now what?"

The STABLE3 branch will continue to receive bug fixes and improvements
as feedback from our community and users will flow in.
Regular update releases will be available to sync with corosync/openais
releases and new kernels (for gfs1-kernel module).

In order to build the 3.0.0 release you will need:

- corosync 1.0.0
- openais 1.0.0
- linux kernel 2.6.29

The new source tarball can be downloaded here:

ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.tar.gz
https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.tar.gz

At the same location is now possible to find separated tarballs for
fence-agents and resource-agents as previously announced
(http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm)

To report bugs or issues:

   https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

   Join us on IRC (irc.freenode.net #linux-cluster) and share your
   experience  with other sysadministrators or power users.

Thanks/congratulations to all people that contributed to achieve this
great milestone.

Happy clustering,
Fabio

Under the hood (from 3.0.0.rc4):

David Teigland (1):
      gfs_controld: fix default plock_ownership in compat mode

Fabio M. Di Nitto (3):
      contrib: Remove askant
      contrib: fix build warnings in libaislock
      libccs: fix yet another uint -> hdb_handle_t conversion

Guido G?nther (1):
      init scripts: add force-reload targets

Lon Hohberger (4):
      config: Update relaxng schema for broadcast options
      config: Update cluster.rng with new RA params
      config: Update cluster.ldif
      config: Fix misnamed statusmax -> status_child_max

 cman/init.d/cman.in                        |    2 +-
 config/libs/libccsconfdb/libccs.c          |    9 +-
 config/plugins/ldap/99cluster.ldif         |   84 ++++--
 config/plugins/ldap/ldap-base.csv          |   18 +-
 config/tools/xml/cluster.rng               |   70 ++++--
 contrib/Makefile                           |    2 +-
 contrib/askant/INSTALL                     |   42 ---
 contrib/askant/Makefile                    |   24 --
 contrib/askant/PLUGINAPI                   |   65 -----
 contrib/askant/README                      |   74 -----
 contrib/askant/askant/about.py             |    5 -
 contrib/askant/askant/askant.py            |   24 --
 contrib/askant/askant/blktrace.py          |   93 -------
 contrib/askant/askant/commands.py          |  333
-----------------------
 contrib/askant/askant/sysfs.py             |   86 ------
 contrib/askant/fsplugins/gfs2/gfs2.c       |  404
----------------------------
 contrib/askant/fsplugins/gfs2/gfs2.h       |    3 -
 contrib/askant/fsplugins/gfs2/gfs2module.c |  104 -------
 contrib/askant/scripts/askant              |    6 -
 contrib/askant/setup.py                    |   16 --
 contrib/libaislock/libaislock.c            |   10 +-
 doc/COPYRIGHT                              |    3 -
 gfs/init.d/gfs.in                          |    2 +-
 gfs2/init.d/gfs2.in                        |    2 +-
 group/gfs_controld/config.c                |   10 +-
 group/gfs_controld/gfs_daemon.h            |    1 +
 group/gfs_controld/main.c                  |    6 +
 rgmanager/init.d/rgmanager.in              |    2 +-
 28 files changed, 148 insertions(+), 1352 deletions(-)

Funny stats (STABLE2 vs STABLE3):

- Over 10000 lines of code have been killed from the latest (now old)
stable release.

- Almost 56000 lines of code have been rewritten or changed.

- Changes spread across over 730 files.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/e876dad3/attachment.sig>

From jos at xos.nl  Wed Jul  8 21:21:45 2009
From: jos at xos.nl (Jos Vos)
Date: Wed, 8 Jul 2009 23:21:45 +0200
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
In-Reply-To: <c8c8ded30907081358r5179cdd5m2bd63bb4d3fce6da@mail.gmail.com>
References: <c8c8ded30907081358r5179cdd5m2bd63bb4d3fce6da@mail.gmail.com>
Message-ID: <20090708212145.GA29922@jasmine.xos.nl>

On Wed, Jul 08, 2009 at 01:58:30PM -0700, Peter Schobel wrote:

> I am trying to set up a four node cluster but am getting very poor
> performance when removing large directories. A directory approximately
> 1.6G  in size takes around 5 mins to remove from the gfs2 filesystem
> but removes in around 10 seconds from the local disk.
> 
> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
> 
> The filesystem was formatted in the following manner: mkfs.gfs2 -t
> wtl_build:dev_home00 -p lock_dlm -j 10
> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
> following options: _netdev,noatime,defaults.

This is something you have to live with.  GFS(2) works great, but with
large(r) directories performance is extremely bad and for many
applications a real show-stopper.

There have been many discussions on this list, with GFS parameter tuning
suggestions that at least for me didn't result in any improvements, with
promises that the problems would be solved in GFS2 (I see no significant
performance improvements between GFS and GFS2), etc.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204


From cthulhucalling at gmail.com  Wed Jul  8 21:31:30 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Wed, 8 Jul 2009 14:31:30 -0700
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
In-Reply-To: <c8c8ded30907081358r5179cdd5m2bd63bb4d3fce6da@mail.gmail.com>
References: <c8c8ded30907081358r5179cdd5m2bd63bb4d3fce6da@mail.gmail.com>
Message-ID: <36df569a0907081431g14fc71f5uc4d10e0c9e4bba50@mail.gmail.com>

I've just had bad experience all around with GFS2. You may want to try GFS1
and play with the tunable parameters.

On Wed, Jul 8, 2009 at 1:58 PM, Peter Schobel <pschobel at 1iopen.net> wrote:

> I am trying to set up a four node cluster but am getting very poor
> performance when removing large directories. A directory approximately
> 1.6G  in size takes around 5 mins to remove from the gfs2 filesystem
> but removes in around 10 seconds from the local disk.
>
> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
>
> The filesystem was formatted in the following manner: mkfs.gfs2 -t
> wtl_build:dev_home00 -p lock_dlm -j 10
> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
> following options: _netdev,noatime,defaults.
>
> If anyone knows what could be causing this please let me know. I'm
> happy to provide any other information.
>
> Regards,
>
> --
> Peter Schobel
> ~
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/7a06f234/attachment.htm>

From fdinitto at redhat.com  Wed Jul  8 21:32:26 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Wed, 08 Jul 2009 23:32:26 +0200
Subject: [Linux-cluster] STABLE2 (2.03.xx) branch and release EOL the 7th of
	Jan 2010
Message-ID: <1247088746.7941.41.camel@cerberus.int.fabbione.net>

Hi all,

as previously announced here:

http://www.redhat.com/archives/cluster-devel/2009-January/msg00074.html

now that STABLE3 is in "production ready" status, the End-Of-Life date
for STABLE2 is set.

Regards
Fabio

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090708/5a0000dd/attachment.sig>

From tfrumbacher at gmail.com  Wed Jul  8 21:52:37 2009
From: tfrumbacher at gmail.com (Aaron Benner)
Date: Wed, 8 Jul 2009 15:52:37 -0600
Subject: [Linux-cluster] RHEL Cluster Suite + Xen Dom0 = infinite reboots
Message-ID: <78DABFDE-3087-4315-8D3B-D63D3B839E9D@gmail.com>

I have 3 xen Dom0 machines upon which I'm trying to build a cluster  
for HA DomUs.  At present the cluster config file simply lists the 3  
nodes.  No fencing, services, resources or failover domains have been  
defined.  I know that this is not what I will need moving to  
production.  I was using the most minimal cluster config I could to  
ensure that my problem was the interaction of Xen and the cluster suite.

The problem is this:  when a node reboots it joins the cluster  
successfully, then xen tears down the network to build xenbr0, vif0.0,  
and peth0 (standard /etc/xen/scripts/network-bridge).  When this  
happens the rebooting node "fails" in the cluster's eyes.  The active  
nodes try to fence it.  Originally I had power fencing enabled and  
this situation resulted in the shootout at the o.k. corral with the  
failed node booting, failing and getting fenced forever.

I did find the gem at the very bottom of the FAQ in the  
GeneralQuestions section (http://sources.redhat.com/cluster/wiki/FAQ/GeneralQuestions#xencluster 
) that mentions this situation.  The "workaround" which also mentions  
a "more permanent solution" seems, well, clunky so I thought I'd ping  
the list to see if the more permanent solution exists and is just not  
well documented or if others have found a solution that doesn't  
require override of the default xen script behavior?


From robejrm at gmail.com  Wed Jul  8 22:10:41 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Thu, 9 Jul 2009 00:10:41 +0200
Subject: [Linux-cluster] RHEL Cluster Suite + Xen Dom0 = infinite reboots
In-Reply-To: <78DABFDE-3087-4315-8D3B-D63D3B839E9D@gmail.com>
References: <78DABFDE-3087-4315-8D3B-D63D3B839E9D@gmail.com>
Message-ID: <8a5668960907081510w759e8911re114e9523649942@mail.gmail.com>

On Wed, Jul 8, 2009 at 11:52 PM, Aaron Benner <tfrumbacher at gmail.com> wrote:

> I have 3 xen Dom0 machines upon which I'm trying to build a cluster for HA
> DomUs.  At present the cluster config file simply lists the 3 nodes.  No
> fencing, services, resources or failover domains have been defined.  I know
> that this is not what I will need moving to production.  I was using the
> most minimal cluster config I could to ensure that my problem was the
> interaction of Xen and the cluster suite.
>
> The problem is this:  when a node reboots it joins the cluster
> successfully, then xen tears down the network to build xenbr0, vif0.0, and
> peth0 (standard /etc/xen/scripts/network-bridge).  When this happens the
> rebooting node "fails" in the cluster's eyes.  The active nodes try to fence
> it.  Originally I had power fencing enabled and this situation resulted in
> the shootout at the o.k. corral with the failed node booting, failing and
> getting fenced forever.
>
So, you have fencing configured among the domU's cluster but not in the dom0
cluster, haven't you? And this behavior happens in the dom0's cluster. Maybe
you should configure an additional physical network interface (or bonding of
interfaces) independent from the one used by xen to be used as the cluster
main comms interface.

Greetings,
Juanra

>
> I did find the gem at the very bottom of the FAQ in the GeneralQuestions
> section (
> http://sources.redhat.com/cluster/wiki/FAQ/GeneralQuestions#xencluster)
> that mentions this situation.  The "workaround" which also mentions a "more
> permanent solution" seems, well, clunky so I thought I'd ping the list to
> see if the more permanent solution exists and is just not well documented or
> if others have found a solution that doesn't require override of the default
> xen script behavior?
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090709/b54dabe2/attachment.htm>

From jeff.sturm at eprize.com  Thu Jul  9 00:33:21 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 8 Jul 2009 20:33:21 -0400
Subject: [Linux-cluster] RHEL Cluster Suite + Xen Dom0 = infinite reboots
In-Reply-To: <78DABFDE-3087-4315-8D3B-D63D3B839E9D@gmail.com>
References: <78DABFDE-3087-4315-8D3B-D63D3B839E9D@gmail.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC301@hugo.eprize.local>

This FAQ section gives good advice.  The Xen network-bridge scripts are
designed to work on hosts without any preconfigured bridge; however I
find it much more straightforward to configure the host for bridging
myself exactly as in the FAQ.  As a plus you have more complete control
over all your network settings.

You can still use the vif-bridge script as is.  I don't really know any
better way.

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Aaron Benner
> Sent: Wednesday, July 08, 2009 5:53 PM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] RHEL Cluster Suite + Xen Dom0 = infinite
reboots
> 
> I have 3 xen Dom0 machines upon which I'm trying to build a cluster
> for HA DomUs.  At present the cluster config file simply lists the 3
> nodes.  No fencing, services, resources or failover domains have been
> defined.  I know that this is not what I will need moving to
> production.  I was using the most minimal cluster config I could to
> ensure that my problem was the interaction of Xen and the cluster
suite.
> 
> The problem is this:  when a node reboots it joins the cluster
> successfully, then xen tears down the network to build xenbr0, vif0.0,
> and peth0 (standard /etc/xen/scripts/network-bridge).  When this
> happens the rebooting node "fails" in the cluster's eyes.  The active
> nodes try to fence it.  Originally I had power fencing enabled and
> this situation resulted in the shootout at the o.k. corral with the
> failed node booting, failing and getting fenced forever.
> 
> I did find the gem at the very bottom of the FAQ in the
> GeneralQuestions section
>
(http://sources.redhat.com/cluster/wiki/FAQ/GeneralQuestions#xencluster
> ) that mentions this situation.  The "workaround" which also mentions
> a "more permanent solution" seems, well, clunky so I thought I'd ping
> the list to see if the more permanent solution exists and is just not
> well documented or if others have found a solution that doesn't
> require override of the default xen script behavior?
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From henry.robertson at hjrconsulting.com  Thu Jul  9 04:28:38 2009
From: henry.robertson at hjrconsulting.com (Henry Robertson)
Date: Thu, 9 Jul 2009 00:28:38 -0400
Subject: [Linux-cluster] Re: Trying to locate the bottleneck
Message-ID: <c0b9f0730907082128o6722329do3b8f941f176ac4b9@mail.gmail.com>

>
>
> Message: 1
> Date: Wed, 8 Jul 2009 13:20:02 -0400
> From: Jeff Sturm <jeff.sturm at eprize.com>
> Subject: RE: [Linux-cluster] Trying to locate the bottleneck
> To: "linux clustering" <linux-cluster at redhat.com>
> Message-ID:
>        <64D0546C5EBBD147B75DE133D798665F02FDC2F8 at hugo.eprize.local>
> Content-Type: text/plain; charset="us-ascii"
>
> I think this is created when you first run iptables.  If you have no NAT
> rules on the load balancer, the ip_conntrack_max setting won't exist,
> and you'll need to look somewhere else for the problem.
>
> -Jeff
>
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com]
> > On Behalf Of Raymond Setchfield
> > Sent: Wednesday, July 08, 2009 11:13 AM
> > To: linux clustering
> > Subject: Re: [Linux-cluster] Trying to locate the bottleneck
> >
> > Hi Guys
> >
> > I am trying to locate ip_conntrack_max within CentOS 5.3 but it
> doesn't
> > appear to be where I expect it to be. I have googled for this and from
> > what I have read it should be located within
> >
> > /proc/sys/net/ipv4/ip_conntrack_max
> >
> > Which is where I thought it would be but unfortunately it isn't.
> >
> > Here is some output
> >
> > [root at loadbalancer-01 ~]# grep conn /proc/slabinfo
> > ip_vs_conn             0      0    128   30    1 : tunables  120   60
> > 8 : slabdata      0      0      0
> >
> > [root at loadbalancer-01 ~]# rpm -qa | grep kernel
> > kernel-headers-2.6.18-53.1.14.el5
> > kernel-devel-2.6.18-53.1.14.el5
> > kernel-2.6.18-53.1.14.el5
> >
> > [root at loadbalancer-01 ~]# cat
> /proc/sys/net/ipv4/netfilter/ip_conntrack_max
> > cat: /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such file or
> > directory
> >
> >
> > I have also checked within  /etc/sysctl.conf and nothing.
> >
> > Can someone help me?
> >
> > Thanks in advance
> >
> > Raymond
> >


Try 'modprobe ip_conntrack' and see if it shows up. After that I was able to
set the value in /etc/sysctl.conf

Good luck!

Henry
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090709/a6ed3fc9/attachment.htm>

From george.saji00 at gmail.com  Thu Jul  9 10:01:11 2009
From: george.saji00 at gmail.com (saji george)
Date: Thu, 9 Jul 2009 15:31:11 +0530
Subject: [Linux-cluster] Re: Trying to locate the bottleneck
In-Reply-To: <c0b9f0730907082128o6722329do3b8f941f176ac4b9@mail.gmail.com>
References: <c0b9f0730907082128o6722329do3b8f941f176ac4b9@mail.gmail.com>
Message-ID: <2460eaad0907090301x4f3ad33wdd763219dfb754d2@mail.gmail.com>

It may be a problem with the client machine also. Try to run the siege from
multiple machines at a time.

On Thu, Jul 9, 2009 at 9:58 AM, Henry Robertson <
henry.robertson at hjrconsulting.com> wrote:

>
>> Message: 1
>> Date: Wed, 8 Jul 2009 13:20:02 -0400
>> From: Jeff Sturm <jeff.sturm at eprize.com>
>> Subject: RE: [Linux-cluster] Trying to locate the bottleneck
>> To: "linux clustering" <linux-cluster at redhat.com>
>> Message-ID:
>>        <64D0546C5EBBD147B75DE133D798665F02FDC2F8 at hugo.eprize.local>
>> Content-Type: text/plain; charset="us-ascii"
>>
>>
>> I think this is created when you first run iptables.  If you have no NAT
>> rules on the load balancer, the ip_conntrack_max setting won't exist,
>> and you'll need to look somewhere else for the problem.
>>
>> -Jeff
>>
>> > -----Original Message-----
>> > From: linux-cluster-bounces at redhat.com
>> [mailto:linux-cluster-bounces at redhat.com]
>> > On Behalf Of Raymond Setchfield
>> > Sent: Wednesday, July 08, 2009 11:13 AM
>> > To: linux clustering
>> > Subject: Re: [Linux-cluster] Trying to locate the bottleneck
>> >
>> > Hi Guys
>> >
>> > I am trying to locate ip_conntrack_max within CentOS 5.3 but it
>> doesn't
>> > appear to be where I expect it to be. I have googled for this and from
>> > what I have read it should be located within
>> >
>> > /proc/sys/net/ipv4/ip_conntrack_max
>> >
>> > Which is where I thought it would be but unfortunately it isn't.
>> >
>> > Here is some output
>> >
>> > [root at loadbalancer-01 ~]# grep conn /proc/slabinfo
>> > ip_vs_conn             0      0    128   30    1 : tunables  120   60
>> > 8 : slabdata      0      0      0
>> >
>> > [root at loadbalancer-01 ~]# rpm -qa | grep kernel
>> > kernel-headers-2.6.18-53.1.14.el5
>> > kernel-devel-2.6.18-53.1.14.el5
>> > kernel-2.6.18-53.1.14.el5
>> >
>> > [root at loadbalancer-01 ~]# cat
>> /proc/sys/net/ipv4/netfilter/ip_conntrack_max
>> > cat: /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such file or
>> > directory
>> >
>> >
>> > I have also checked within  /etc/sysctl.conf and nothing.
>> >
>> > Can someone help me?
>> >
>> > Thanks in advance
>> >
>> > Raymond
>> >
>>
>
>
> Try 'modprobe ip_conntrack' and see if it shows up. After that I was able
> to set the value in /etc/sysctl.conf
>
> Good luck!
>
> Henry
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090709/5b2e08a2/attachment.htm>

From rsetchfield at xcalibre.co.uk  Thu Jul  9 10:47:19 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Thu, 09 Jul 2009 11:47:19 +0100
Subject: [Linux-cluster] Re: Trying to locate the bottleneck
In-Reply-To: <c0b9f0730907082128o6722329do3b8f941f176ac4b9@mail.gmail.com>
References: <c0b9f0730907082128o6722329do3b8f941f176ac4b9@mail.gmail.com>
Message-ID: <4A55CAB7.9000904@xcalibre.co.uk>

Hi Henry

Unfortunately not, however I did get this output from /var/log/messages

Jul  9 11:45:41 loadbalancer-01 kernel: Netfilter messages via NETLINK 
v0.30.
Jul  9 11:45:41 loadbalancer-01 kernel: ip_conntrack version 2.4 (12289 
buckets, 98312 max) - 228 bytes per conntrack

Thanks

R.

Henry Robertson wrote:
>
>
>     Message: 1
>     Date: Wed, 8 Jul 2009 13:20:02 -0400
>     From: Jeff Sturm <jeff.sturm at eprize.com
>     <mailto:jeff.sturm at eprize.com>>
>     Subject: RE: [Linux-cluster] Trying to locate the bottleneck
>     To: "linux clustering" <linux-cluster at redhat.com
>     <mailto:linux-cluster at redhat.com>>
>     Message-ID:
>            <64D0546C5EBBD147B75DE133D798665F02FDC2F8 at hugo.eprize.local>
>     Content-Type: text/plain; charset="us-ascii"
>
>     I think this is created when you first run iptables.  If you have
>     no NAT
>     rules on the load balancer, the ip_conntrack_max setting won't exist,
>     and you'll need to look somewhere else for the problem.
>
>     -Jeff
>
>     > -----Original Message-----
>     > From: linux-cluster-bounces at redhat.com
>     <mailto:linux-cluster-bounces at redhat.com>
>     [mailto:linux-cluster-bounces at redhat.com
>     <mailto:linux-cluster-bounces at redhat.com>]
>     > On Behalf Of Raymond Setchfield
>     > Sent: Wednesday, July 08, 2009 11:13 AM
>     > To: linux clustering
>     > Subject: Re: [Linux-cluster] Trying to locate the bottleneck
>     >
>     > Hi Guys
>     >
>     > I am trying to locate ip_conntrack_max within CentOS 5.3 but it
>     doesn't
>     > appear to be where I expect it to be. I have googled for this
>     and from
>     > what I have read it should be located within
>     >
>     > /proc/sys/net/ipv4/ip_conntrack_max
>     >
>     > Which is where I thought it would be but unfortunately it isn't.
>     >
>     > Here is some output
>     >
>     > [root at loadbalancer-01 ~]# grep conn /proc/slabinfo
>     > ip_vs_conn             0      0    128   30    1 : tunables  120
>       60
>     > 8 : slabdata      0      0      0
>     >
>     > [root at loadbalancer-01 ~]# rpm -qa | grep kernel
>     > kernel-headers-2.6.18-53.1.14.el5
>     > kernel-devel-2.6.18-53.1.14.el5
>     > kernel-2.6.18-53.1.14.el5
>     >
>     > [root at loadbalancer-01 ~]# cat
>     /proc/sys/net/ipv4/netfilter/ip_conntrack_max
>     > cat: /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such file or
>     > directory
>     >
>     >
>     > I have also checked within  /etc/sysctl.conf and nothing.
>     >
>     > Can someone help me?
>     >
>     > Thanks in advance
>     >
>     > Raymond
>     >
>
>
>
> Try 'modprobe ip_conntrack' and see if it shows up. After that I was 
> able to set the value in /etc/sysctl.conf
>
> Good luck!
>
> Henry 
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From rsetchfield at xcalibre.co.uk  Thu Jul  9 10:48:44 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Thu, 09 Jul 2009 11:48:44 +0100
Subject: [Linux-cluster] Re: Trying to locate the bottleneck
In-Reply-To: <2460eaad0907090301x4f3ad33wdd763219dfb754d2@mail.gmail.com>
References: <c0b9f0730907082128o6722329do3b8f941f176ac4b9@mail.gmail.com>
	<2460eaad0907090301x4f3ad33wdd763219dfb754d2@mail.gmail.com>
Message-ID: <4A55CB0C.1020901@xcalibre.co.uk>

Hi Saji

You may have a point, didn't think about blaming the client machine. I 
have another two servers which have siege on them. I will try an 
alternative one and let you know how I get on.

Thanks

R.

saji george wrote:
> It may be a problem with the client machine also. 
> Try to run the siege from multiple machines at a time.
>
> On Thu, Jul 9, 2009 at 9:58 AM, Henry Robertson 
> <henry.robertson at hjrconsulting.com 
> <mailto:henry.robertson at hjrconsulting.com>> wrote:
>
>
>         Message: 1
>         Date: Wed, 8 Jul 2009 13:20:02 -0400
>         From: Jeff Sturm <jeff.sturm at eprize.com
>         <mailto:jeff.sturm at eprize.com>>
>         Subject: RE: [Linux-cluster] Trying to locate the bottleneck
>         To: "linux clustering" <linux-cluster at redhat.com
>         <mailto:linux-cluster at redhat.com>>
>         Message-ID:
>              
>          <64D0546C5EBBD147B75DE133D798665F02FDC2F8 at hugo.eprize.local>
>         Content-Type: text/plain; charset="us-ascii"
>
>
>         I think this is created when you first run iptables.  If you
>         have no NAT
>         rules on the load balancer, the ip_conntrack_max setting won't
>         exist,
>         and you'll need to look somewhere else for the problem.
>
>         -Jeff
>
>         > -----Original Message-----
>         > From: linux-cluster-bounces at redhat.com
>         <mailto:linux-cluster-bounces at redhat.com>
>         [mailto:linux-cluster-bounces at redhat.com
>         <mailto:linux-cluster-bounces at redhat.com>]
>         > On Behalf Of Raymond Setchfield
>         > Sent: Wednesday, July 08, 2009 11:13 AM
>         > To: linux clustering
>         > Subject: Re: [Linux-cluster] Trying to locate the bottleneck
>         >
>         > Hi Guys
>         >
>         > I am trying to locate ip_conntrack_max within CentOS 5.3 but it
>         doesn't
>         > appear to be where I expect it to be. I have googled for
>         this and from
>         > what I have read it should be located within
>         >
>         > /proc/sys/net/ipv4/ip_conntrack_max
>         >
>         > Which is where I thought it would be but unfortunately it isn't.
>         >
>         > Here is some output
>         >
>         > [root at loadbalancer-01 ~]# grep conn /proc/slabinfo
>         > ip_vs_conn             0      0    128   30    1 : tunables
>          120   60
>         > 8 : slabdata      0      0      0
>         >
>         > [root at loadbalancer-01 ~]# rpm -qa | grep kernel
>         > kernel-headers-2.6.18-53.1.14.el5
>         > kernel-devel-2.6.18-53.1.14.el5
>         > kernel-2.6.18-53.1.14.el5
>         >
>         > [root at loadbalancer-01 ~]# cat
>         /proc/sys/net/ipv4/netfilter/ip_conntrack_max
>         > cat: /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such
>         file or
>         > directory
>         >
>         >
>         > I have also checked within  /etc/sysctl.conf and nothing.
>         >
>         > Can someone help me?
>         >
>         > Thanks in advance
>         >
>         > Raymond
>         >
>
>
>
>     Try 'modprobe ip_conntrack' and see if it shows up. After that I
>     was able to set the value in /etc/sysctl.conf
>
>     Good luck!
>
>     Henry 
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From rsetchfield at xcalibre.co.uk  Thu Jul  9 12:03:35 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Thu, 09 Jul 2009 13:03:35 +0100
Subject: [Linux-cluster] Re: Trying to locate the bottleneck
In-Reply-To: <4A55CB0C.1020901@xcalibre.co.uk>
References: <c0b9f0730907082128o6722329do3b8f941f176ac4b9@mail.gmail.com>	<2460eaad0907090301x4f3ad33wdd763219dfb754d2@mail.gmail.com>
	<4A55CB0C.1020901@xcalibre.co.uk>
Message-ID: <4A55DC97.6060603@xcalibre.co.uk>

Hi Saji

As you suggested. I tried three servers with a lower concurrency (50 
each) and it still having the same issue.

Thanks for the suggestion though

R.

Raymond Setchfield wrote:
> Hi Saji
>
> You may have a point, didn't think about blaming the client machine. I 
> have another two servers which have siege on them. I will try an 
> alternative one and let you know how I get on.
>
> Thanks
>
> R.
>
> saji george wrote:
>> It may be a problem with the client machine also. Try to run the 
>> siege from multiple machines at a time.
>>
>> On Thu, Jul 9, 2009 at 9:58 AM, Henry Robertson 
>> <henry.robertson at hjrconsulting.com 
>> <mailto:henry.robertson at hjrconsulting.com>> wrote:
>>
>>
>>         Message: 1
>>         Date: Wed, 8 Jul 2009 13:20:02 -0400
>>         From: Jeff Sturm <jeff.sturm at eprize.com
>>         <mailto:jeff.sturm at eprize.com>>
>>         Subject: RE: [Linux-cluster] Trying to locate the bottleneck
>>         To: "linux clustering" <linux-cluster at redhat.com
>>         <mailto:linux-cluster at redhat.com>>
>>         Message-ID:
>>                       
>> <64D0546C5EBBD147B75DE133D798665F02FDC2F8 at hugo.eprize.local>
>>         Content-Type: text/plain; charset="us-ascii"
>>
>>
>>         I think this is created when you first run iptables.  If you
>>         have no NAT
>>         rules on the load balancer, the ip_conntrack_max setting won't
>>         exist,
>>         and you'll need to look somewhere else for the problem.
>>
>>         -Jeff
>>
>>         > -----Original Message-----
>>         > From: linux-cluster-bounces at redhat.com
>>         <mailto:linux-cluster-bounces at redhat.com>
>>         [mailto:linux-cluster-bounces at redhat.com
>>         <mailto:linux-cluster-bounces at redhat.com>]
>>         > On Behalf Of Raymond Setchfield
>>         > Sent: Wednesday, July 08, 2009 11:13 AM
>>         > To: linux clustering
>>         > Subject: Re: [Linux-cluster] Trying to locate the bottleneck
>>         >
>>         > Hi Guys
>>         >
>>         > I am trying to locate ip_conntrack_max within CentOS 5.3 
>> but it
>>         doesn't
>>         > appear to be where I expect it to be. I have googled for
>>         this and from
>>         > what I have read it should be located within
>>         >
>>         > /proc/sys/net/ipv4/ip_conntrack_max
>>         >
>>         > Which is where I thought it would be but unfortunately it 
>> isn't.
>>         >
>>         > Here is some output
>>         >
>>         > [root at loadbalancer-01 ~]# grep conn /proc/slabinfo
>>         > ip_vs_conn             0      0    128   30    1 : tunables
>>          120   60
>>         > 8 : slabdata      0      0      0
>>         >
>>         > [root at loadbalancer-01 ~]# rpm -qa | grep kernel
>>         > kernel-headers-2.6.18-53.1.14.el5
>>         > kernel-devel-2.6.18-53.1.14.el5
>>         > kernel-2.6.18-53.1.14.el5
>>         >
>>         > [root at loadbalancer-01 ~]# cat
>>         /proc/sys/net/ipv4/netfilter/ip_conntrack_max
>>         > cat: /proc/sys/net/ipv4/netfilter/ip_conntrack_max: No such
>>         file or
>>         > directory
>>         >
>>         >
>>         > I have also checked within  /etc/sysctl.conf and nothing.
>>         >
>>         > Can someone help me?
>>         >
>>         > Thanks in advance
>>         >
>>         > Raymond
>>         >
>>
>>
>>
>>     Try 'modprobe ip_conntrack' and see if it shows up. After that I
>>     was able to set the value in /etc/sysctl.conf
>>
>>     Good luck!
>>
>>     Henry
>>     --
>>     Linux-cluster mailing list
>>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>>     https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>> ------------------------------------------------------------------------
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>


From agx at sigxcpu.org  Thu Jul  9 12:19:58 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Thu, 9 Jul 2009 14:19:58 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
In-Reply-To: <1246988716.7993.13.camel@cerberus.int.fabbione.net>
References: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
	<20090707162845.GA30094@bogon.sigxcpu.org>
	<1246988716.7993.13.camel@cerberus.int.fabbione.net>
Message-ID: <20090709121958.GB18140@bogon.sigxcpu.org>

On Tue, Jul 07, 2009 at 07:45:16PM +0200, Fabio M. Di Nitto wrote:
> Hi Guido,
> 
> On Tue, 2009-07-07 at 18:28 +0200, Guido G?nther wrote:
> > Hi,
> > On Thu, Jul 02, 2009 at 01:16:30AM +0200, Fabio M. Di Nitto wrote:
> > > The cluster team and its community are proud to announce the
> > > 3.0.0.rc4 release candidate from the STABLE3 branch.
> > Based on earlier Debian and Ubuntu packages of corosync, openais and
> > cluster I have put prelimenary Debian packges (built against Debian
> > Squeeze) here:
> >  http://pkg-libvirt.alioth.debian.org/packages/unstable/
> > Here are the soruces.list entries:
> >  http://wiki.debian.org/Teams/DebianLibvirtTeam#Packages
> > Cheers,
> >  -- Guido
> 
> awesome! thanks!
> 
> I didn't check them out... anyway I am adding the Ubuntu HA team in
> CC... it's worth sharing the effort.
This stuff needs soome more work to be uploadable but it's mostly there
I think. I've pushed the git archives here:

http://git.debian.org/?p=users/agx/redhat-cluster/redhat-cluster.git;a=summary
http://git.debian.org/?p=users/agx/redhat-cluster/openais.git;a=summary
http://git.debian.org/?p=users/agx/redhat-cluster/corosync.git;a=summary

> For sometime I have been thinking to pull in debian/ and .spec files
> upstream and involve the maintainers to work directly with us.
> 
> This would happen for corosync/openais (they already have spec files)
> and cluster.
> 
> Anybody in contact with the Debian team and see if they would like to
> work more closely with us?
Most of this looks a bit outdated in Debian:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=535335
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=528258

I intend to maintain things alongside for some time and eventually
upload things to experimental after pinging the maintainers again for
some time.
Cheers,
 -- Guido


From fdinitto at redhat.com  Thu Jul  9 17:17:24 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Thu, 09 Jul 2009 19:17:24 +0200
Subject: [Linux-cluster] Cluster 3.y.z: call for a more detailed schedule
	and timeline
Message-ID: <1247159844.7941.87.camel@cerberus.int.fabbione.net>

Hi everybody,

all of us had our heads down to get cluster 3.0.0 final out of the door
for a long time now. We achieved this milestone and it's now time to
look forward.

As you already know the 3.y.z series will receive bug fixes and
improvements as feedback from our community and users will flow in and
we will make available regular update releases to follow
corosync/openais and upstream kernels.

Let's try to define together the future of 3.y.z based on the following
concept:

- .z releases should contain only bug fixes.
  Exceptions: new fence agents that do not require changes to the 
  shared fence python library will be allowed. New resource agents will 
  also be allowed. Porting of gfs1-kernel module to a new kernel is 
  allowed.
  .z releases will happen depending on the amount of fixes that have
  been pushed in the tree. Fixes can go directly into the STABLE3 
  branch.

- .y releases can contain more substantial changes and will happen less
  often than .z releases. Default behaviour of the software should 
  never change. Introducing a new one has to be configurable and the 
  optional.
  .y releases need to be scheduled.
  All the development for a specific .y change needs to be 
  pre-published into a private branch for public review and testing and 
  land in STABLE3 only when the merge time is right (details on how to 
  merge or push or pull can be discussed later on as we have plenty of 
  options).

So let's take a look at possible time line for .y releases. Here is my
suggestion, nothing is set is stone, please add your comments and ideas
so we can start to lay down a more formal path:

3.1.z to be released by 15th of Sep:

- working and tested rolling upgrade from STABLE2 
  across the whole stack.
- enable make srpm and rpm targets to the build system.
- enable rgmanager checkpoints.
- add-your-items-here

3.2.z to be released by 15th of Dec:

- backport autotool build system from master.
- integrate debian/ubuntu packaging support upstream.
- internationalization and translation support.
- add-your-items-here

Goals can also be swapped around if you believe that they should have
higher or lower priority and if time allows more goals can be squeezed
within the previous release.

Please send your feedback now. In 2 weeks from now we will collect all
the response and write down the formal schedule.

Cheers
Fabio


From kbphillips80 at gmail.com  Thu Jul  9 18:26:52 2009
From: kbphillips80 at gmail.com (Kaerka Phillips)
Date: Thu, 9 Jul 2009 14:26:52 -0400
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
In-Reply-To: <c8c8ded30907081358r5179cdd5m2bd63bb4d3fce6da@mail.gmail.com>
References: <c8c8ded30907081358r5179cdd5m2bd63bb4d3fce6da@mail.gmail.com>
Message-ID: <e558921b0907091126m78eb45adw118b27c7227f4345@mail.gmail.com>

We found with GFS2 on RHEL5.3 the same issue, and although we did not
resolve it (and are migrating to NFS+SAN instead), we were able to partially
mitigate it with several suggestions from either RedHat's support or from
the cluster list:
1. disable LS color output (http://kbase.redhat.com/faq/docs/DOC-6533)
2. the "nodiratime,noatime" mount flags
3. Unlocking the plock rate to allow maximum locks generated and purged per
second with the "plock_rate_limit=0" option in cluster.conf

That said though, we found that the GFS2 was only about as fast as NFS4 on
GigE networks, and not the best solution for our setup.


On Wed, Jul 8, 2009 at 4:58 PM, Peter Schobel <pschobel at 1iopen.net> wrote:

> I am trying to set up a four node cluster but am getting very poor
> performance when removing large directories. A directory approximately
> 1.6G  in size takes around 5 mins to remove from the gfs2 filesystem
> but removes in around 10 seconds from the local disk.
>
> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
>
> The filesystem was formatted in the following manner: mkfs.gfs2 -t
> wtl_build:dev_home00 -p lock_dlm -j 10
> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
> following options: _netdev,noatime,defaults.
>
> If anyone knows what could be causing this please let me know. I'm
> happy to provide any other information.
>
> Regards,
>
> --
> Peter Schobel
> ~
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090709/aafd33b0/attachment.htm>

From pschobel at 1iopen.net  Fri Jul 10 14:42:21 2009
From: pschobel at 1iopen.net (Peter Schobel)
Date: Fri, 10 Jul 2009 07:42:21 -0700
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
Message-ID: <c8c8ded30907100742y21661bd3s977e61c996fe8a11@mail.gmail.com>

When we did our initial proof of concept, we did not notice any
performance problem of this magnitude. We were using OS release 2. Our
QA engineers passed approval on the performance stats of the gfs2
filesystem and now that we are in deployment phase they are calling it
unusable.

Have there been any recent software changes that could have caused
degraded performance or something I may have missed in configuration?
Are there any tunable parameters in gfs2 that may increase our
performance?

Our application is very write intensive. Basically we are compiling a
source tree and running a make clean between builds.

Thanks in advance,

Peter
~

On Wed, Jul 08, 2009 at 01:58:30PM -0700, Peter Schobel wrote:

>> I am trying to set up a four node cluster but am getting very poor
>> performance when removing large directories. A directory approximately
>> 1.6G  in size takes around 5 mins to remove from the gfs2 filesystem
>> but removes in around 10 seconds from the local disk.
>>
>> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
>>
>> The filesystem was formatted in the following manner: mkfs.gfs2 -t
>> wtl_build:dev_home00 -p lock_dlm -j 10
>> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
>> following options: _netdev,noatime,defaults.
>
> This is something you have to live with.  GFS(2) works great, but with
> large(r) directories performance is extremely bad and for many
> applications a real show-stopper.
>
> There have been many discussions on this list, with GFS parameter tuning
> suggestions that at least for me didn't result in any improvements, with
> promises that the problems would be solved in GFS2 (I see no significant
> performance improvements between GFS and GFS2), etc.

> --
> --    Jos Vos <jos at xos.nl>
> --    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
> --    Amsterdam, The Netherlands        |     Fax: +31 20 6948204

-- 
Peter Schobel
~


From swhiteho at redhat.com  Fri Jul 10 15:27:07 2009
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Fri, 10 Jul 2009 16:27:07 +0100
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
In-Reply-To: <c8c8ded30907100742y21661bd3s977e61c996fe8a11@mail.gmail.com>
References: <c8c8ded30907100742y21661bd3s977e61c996fe8a11@mail.gmail.com>
Message-ID: <1247239627.3384.38.camel@localhost.localdomain>

Hi,

On Fri, 2009-07-10 at 07:42 -0700, Peter Schobel wrote:
> When we did our initial proof of concept, we did not notice any
> performance problem of this magnitude. We were using OS release 2. Our
> QA engineers passed approval on the performance stats of the gfs2
> filesystem and now that we are in deployment phase they are calling it
> unusable.
> 
> Have there been any recent software changes that could have caused
> degraded performance or something I may have missed in configuration?
> Are there any tunable parameters in gfs2 that may increase our
> performance?
> 
Not that I'm aware of. There are no tunable parameters which might
affect this particular aspect of performance, but to be clear exactly
what the issue is, let me ask a few questions...

> Our application is very write intensive. Basically we are compiling a
> source tree and running a make clean between builds.
> 
> Thanks in advance,
> 
> Peter
> ~
> 
What is the nature of the writes? Are the different nodes writing into
different directories in the main?

GFS2 is pretty good at large directories, given certain conditions. Look
ups should be pretty fast. Once there is a writer into a particular
directory, then ideally one would take care not to read or write that
directory from other nodes until the writer is finished.

Directory listing of large directories can be slow, and counts as
reading the directory from a caching point of view. Look ups of
individual files should be fast though,

Steve.


> On Wed, Jul 08, 2009 at 01:58:30PM -0700, Peter Schobel wrote:
> 
> >> I am trying to set up a four node cluster but am getting very poor
> >> performance when removing large directories. A directory approximately
> >> 1.6G  in size takes around 5 mins to remove from the gfs2 filesystem
> >> but removes in around 10 seconds from the local disk.
> >>
> >> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
> >>
> >> The filesystem was formatted in the following manner: mkfs.gfs2 -t
> >> wtl_build:dev_home00 -p lock_dlm -j 10
> >> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
> >> following options: _netdev,noatime,defaults.
> >
> > This is something you have to live with.  GFS(2) works great, but with
> > large(r) directories performance is extremely bad and for many
> > applications a real show-stopper.
> >
> > There have been many discussions on this list, with GFS parameter tuning
> > suggestions that at least for me didn't result in any improvements, with
> > promises that the problems would be solved in GFS2 (I see no significant
> > performance improvements between GFS and GFS2), etc.
> 
> > --
> > --    Jos Vos <jos at xos.nl>
> > --    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
> > --    Amsterdam, The Netherlands        |     Fax: +31 20 6948204
> 


From pschobel at 1iopen.net  Fri Jul 10 15:49:02 2009
From: pschobel at 1iopen.net (Peter Schobel)
Date: Fri, 10 Jul 2009 08:49:02 -0700
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
In-Reply-To: <1247239627.3384.38.camel@localhost.localdomain>
References: <c8c8ded30907100742y21661bd3s977e61c996fe8a11@mail.gmail.com>
	<1247239627.3384.38.camel@localhost.localdomain>
Message-ID: <c8c8ded30907100849m3f541939w1a2768fab95faf5a@mail.gmail.com>

The initial writing is done via the network by checking out source
trees from a Perforce repository. Beyond that, source trees are
compiled causing the creation of many object files.

Multiple source trees will be compiled from the same node or from
multiple nodes.

This performance problem exhibits itself even when using a single
node. Writing to the filesystem seems to work fine. The time to do a
cp -r dir /gfs/dir is very comparable to writing to local disk
however, rm -r /gfs/dir takes considerably longer than it does on
local disk. I am guessing this is a feature of dlm checking for a lock
on each individual file but I'm not sure.

Peter
~

On Fri, Jul 10, 2009 at 8:27 AM, Steven Whitehouse<swhiteho at redhat.com> wrote:
> Hi,
>
> On Fri, 2009-07-10 at 07:42 -0700, Peter Schobel wrote:
>> When we did our initial proof of concept, we did not notice any
>> performance problem of this magnitude. We were using OS release 2. Our
>> QA engineers passed approval on the performance stats of the gfs2
>> filesystem and now that we are in deployment phase they are calling it
>> unusable.
>>
>> Have there been any recent software changes that could have caused
>> degraded performance or something I may have missed in configuration?
>> Are there any tunable parameters in gfs2 that may increase our
>> performance?
>>
> Not that I'm aware of. There are no tunable parameters which might
> affect this particular aspect of performance, but to be clear exactly
> what the issue is, let me ask a few questions...
>
>> Our application is very write intensive. Basically we are compiling a
>> source tree and running a make clean between builds.
>>
>> Thanks in advance,
>>
>> Peter
>> ~
>>
> What is the nature of the writes? Are the different nodes writing into
> different directories in the main?
>
> GFS2 is pretty good at large directories, given certain conditions. Look
> ups should be pretty fast. Once there is a writer into a particular
> directory, then ideally one would take care not to read or write that
> directory from other nodes until the writer is finished.
>
> Directory listing of large directories can be slow, and counts as
> reading the directory from a caching point of view. Look ups of
> individual files should be fast though,
>
> Steve.
>
>
>> On Wed, Jul 08, 2009 at 01:58:30PM -0700, Peter Schobel wrote:
>>
>> >> I am trying to set up a four node cluster but am getting very poor
>> >> performance when removing large directories. A directory approximately
>> >> 1.6G ?in size takes around 5 mins to remove from the gfs2 filesystem
>> >> but removes in around 10 seconds from the local disk.
>> >>
>> >> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
>> >>
>> >> The filesystem was formatted in the following manner: mkfs.gfs2 -t
>> >> wtl_build:dev_home00 -p lock_dlm -j 10
>> >> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
>> >> following options: _netdev,noatime,defaults.
>> >
>> > This is something you have to live with. ?GFS(2) works great, but with
>> > large(r) directories performance is extremely bad and for many
>> > applications a real show-stopper.
>> >
>> > There have been many discussions on this list, with GFS parameter tuning
>> > suggestions that at least for me didn't result in any improvements, with
>> > promises that the problems would be solved in GFS2 (I see no significant
>> > performance improvements between GFS and GFS2), etc.
>>
>> > --
>> > -- ? ?Jos Vos <jos at xos.nl>
>> > -- ? ?X/OS Experts in Open Systems BV ? | ? Phone: +31 20 6938364
>> > -- ? ?Amsterdam, The Netherlands ? ? ? ?| ? ? Fax: +31 20 6948204
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

-- 
Peter Schobel
~


From swhiteho at redhat.com  Fri Jul 10 15:56:56 2009
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Fri, 10 Jul 2009 16:56:56 +0100
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
In-Reply-To: <c8c8ded30907100849m3f541939w1a2768fab95faf5a@mail.gmail.com>
References: <c8c8ded30907100742y21661bd3s977e61c996fe8a11@mail.gmail.com>
	<1247239627.3384.38.camel@localhost.localdomain>
	<c8c8ded30907100849m3f541939w1a2768fab95faf5a@mail.gmail.com>
Message-ID: <1247241416.3384.42.camel@localhost.localdomain>

Hi,

On Fri, 2009-07-10 at 08:49 -0700, Peter Schobel wrote:
> The initial writing is done via the network by checking out source
> trees from a Perforce repository. Beyond that, source trees are
> compiled causing the creation of many object files.
> 
> Multiple source trees will be compiled from the same node or from
> multiple nodes.
> 
> This performance problem exhibits itself even when using a single
> node. Writing to the filesystem seems to work fine. The time to do a
> cp -r dir /gfs/dir is very comparable to writing to local disk
> however, rm -r /gfs/dir takes considerably longer than it does on
> local disk. I am guessing this is a feature of dlm checking for a lock
> on each individual file but I'm not sure.
> 
> Peter

Partly that is the case. There are some things which can be done to
improve performance in the deallocation area, and so that is likely to
improve in future. The main issue is to ensure that we continue to
maintain the correct locking order in that code. It can be complex since
it involves the inode lock, transaction lock, and (maybe) multiple
resource group locks,

Steve.

> ~
> 
> On Fri, Jul 10, 2009 at 8:27 AM, Steven Whitehouse<swhiteho at redhat.com> wrote:
> > Hi,
> >
> > On Fri, 2009-07-10 at 07:42 -0700, Peter Schobel wrote:
> >> When we did our initial proof of concept, we did not notice any
> >> performance problem of this magnitude. We were using OS release 2. Our
> >> QA engineers passed approval on the performance stats of the gfs2
> >> filesystem and now that we are in deployment phase they are calling it
> >> unusable.
> >>
> >> Have there been any recent software changes that could have caused
> >> degraded performance or something I may have missed in configuration?
> >> Are there any tunable parameters in gfs2 that may increase our
> >> performance?
> >>
> > Not that I'm aware of. There are no tunable parameters which might
> > affect this particular aspect of performance, but to be clear exactly
> > what the issue is, let me ask a few questions...
> >
> >> Our application is very write intensive. Basically we are compiling a
> >> source tree and running a make clean between builds.
> >>
> >> Thanks in advance,
> >>
> >> Peter
> >> ~
> >>
> > What is the nature of the writes? Are the different nodes writing into
> > different directories in the main?
> >
> > GFS2 is pretty good at large directories, given certain conditions. Look
> > ups should be pretty fast. Once there is a writer into a particular
> > directory, then ideally one would take care not to read or write that
> > directory from other nodes until the writer is finished.
> >
> > Directory listing of large directories can be slow, and counts as
> > reading the directory from a caching point of view. Look ups of
> > individual files should be fast though,
> >
> > Steve.
> >
> >
> >> On Wed, Jul 08, 2009 at 01:58:30PM -0700, Peter Schobel wrote:
> >>
> >> >> I am trying to set up a four node cluster but am getting very poor
> >> >> performance when removing large directories. A directory approximately
> >> >> 1.6G  in size takes around 5 mins to remove from the gfs2 filesystem
> >> >> but removes in around 10 seconds from the local disk.
> >> >>
> >> >> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
> >> >>
> >> >> The filesystem was formatted in the following manner: mkfs.gfs2 -t
> >> >> wtl_build:dev_home00 -p lock_dlm -j 10
> >> >> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
> >> >> following options: _netdev,noatime,defaults.
> >> >
> >> > This is something you have to live with.  GFS(2) works great, but with
> >> > large(r) directories performance is extremely bad and for many
> >> > applications a real show-stopper.
> >> >
> >> > There have been many discussions on this list, with GFS parameter tuning
> >> > suggestions that at least for me didn't result in any improvements, with
> >> > promises that the problems would be solved in GFS2 (I see no significant
> >> > performance improvements between GFS and GFS2), etc.
> >>
> >> > --
> >> > --    Jos Vos <jos at xos.nl>
> >> > --    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
> >> > --    Amsterdam, The Netherlands        |     Fax: +31 20 6948204
> >>
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> 


From pschobel at 1iopen.net  Fri Jul 10 16:07:58 2009
From: pschobel at 1iopen.net (Peter Schobel)
Date: Fri, 10 Jul 2009 09:07:58 -0700
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
In-Reply-To: <1247241416.3384.42.camel@localhost.localdomain>
References: <c8c8ded30907100742y21661bd3s977e61c996fe8a11@mail.gmail.com>
	<1247239627.3384.38.camel@localhost.localdomain>
	<c8c8ded30907100849m3f541939w1a2768fab95faf5a@mail.gmail.com>
	<1247241416.3384.42.camel@localhost.localdomain>
Message-ID: <c8c8ded30907100907i5000077cm6900a9570bfb12d6@mail.gmail.com>

So, in your opinion, there are no known issues which would cause this
particular problem and this poor performance while deleting files
should be considered normal?

Thanks,

Peter
~

On Fri, Jul 10, 2009 at 8:56 AM, Steven Whitehouse<swhiteho at redhat.com> wrote:
> Hi,
>
> On Fri, 2009-07-10 at 08:49 -0700, Peter Schobel wrote:
>> The initial writing is done via the network by checking out source
>> trees from a Perforce repository. Beyond that, source trees are
>> compiled causing the creation of many object files.
>>
>> Multiple source trees will be compiled from the same node or from
>> multiple nodes.
>>
>> This performance problem exhibits itself even when using a single
>> node. Writing to the filesystem seems to work fine. The time to do a
>> cp -r dir /gfs/dir is very comparable to writing to local disk
>> however, rm -r /gfs/dir takes considerably longer than it does on
>> local disk. I am guessing this is a feature of dlm checking for a lock
>> on each individual file but I'm not sure.
>>
>> Peter
>
> Partly that is the case. There are some things which can be done to
> improve performance in the deallocation area, and so that is likely to
> improve in future. The main issue is to ensure that we continue to
> maintain the correct locking order in that code. It can be complex since
> it involves the inode lock, transaction lock, and (maybe) multiple
> resource group locks,
>
> Steve.
>
>> ~
>>
>> On Fri, Jul 10, 2009 at 8:27 AM, Steven Whitehouse<swhiteho at redhat.com> wrote:
>> > Hi,
>> >
>> > On Fri, 2009-07-10 at 07:42 -0700, Peter Schobel wrote:
>> >> When we did our initial proof of concept, we did not notice any
>> >> performance problem of this magnitude. We were using OS release 2. Our
>> >> QA engineers passed approval on the performance stats of the gfs2
>> >> filesystem and now that we are in deployment phase they are calling it
>> >> unusable.
>> >>
>> >> Have there been any recent software changes that could have caused
>> >> degraded performance or something I may have missed in configuration?
>> >> Are there any tunable parameters in gfs2 that may increase our
>> >> performance?
>> >>
>> > Not that I'm aware of. There are no tunable parameters which might
>> > affect this particular aspect of performance, but to be clear exactly
>> > what the issue is, let me ask a few questions...
>> >
>> >> Our application is very write intensive. Basically we are compiling a
>> >> source tree and running a make clean between builds.
>> >>
>> >> Thanks in advance,
>> >>
>> >> Peter
>> >> ~
>> >>
>> > What is the nature of the writes? Are the different nodes writing into
>> > different directories in the main?
>> >
>> > GFS2 is pretty good at large directories, given certain conditions. Look
>> > ups should be pretty fast. Once there is a writer into a particular
>> > directory, then ideally one would take care not to read or write that
>> > directory from other nodes until the writer is finished.
>> >
>> > Directory listing of large directories can be slow, and counts as
>> > reading the directory from a caching point of view. Look ups of
>> > individual files should be fast though,
>> >
>> > Steve.
>> >
>> >
>> >> On Wed, Jul 08, 2009 at 01:58:30PM -0700, Peter Schobel wrote:
>> >>
>> >> >> I am trying to set up a four node cluster but am getting very poor
>> >> >> performance when removing large directories. A directory approximately
>> >> >> 1.6G ?in size takes around 5 mins to remove from the gfs2 filesystem
>> >> >> but removes in around 10 seconds from the local disk.
>> >> >>
>> >> >> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
>> >> >>
>> >> >> The filesystem was formatted in the following manner: mkfs.gfs2 -t
>> >> >> wtl_build:dev_home00 -p lock_dlm -j 10
>> >> >> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
>> >> >> following options: _netdev,noatime,defaults.
>> >> >
>> >> > This is something you have to live with. ?GFS(2) works great, but with
>> >> > large(r) directories performance is extremely bad and for many
>> >> > applications a real show-stopper.
>> >> >
>> >> > There have been many discussions on this list, with GFS parameter tuning
>> >> > suggestions that at least for me didn't result in any improvements, with
>> >> > promises that the problems would be solved in GFS2 (I see no significant
>> >> > performance improvements between GFS and GFS2), etc.
>> >>
>> >> > --
>> >> > -- ? ?Jos Vos <jos at xos.nl>
>> >> > -- ? ?X/OS Experts in Open Systems BV ? | ? Phone: +31 20 6938364
>> >> > -- ? ?Amsterdam, The Netherlands ? ? ? ?| ? ? Fax: +31 20 6948204
>> >>
>> >
>> > --
>> > Linux-cluster mailing list
>> > Linux-cluster at redhat.com
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>> >
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Peter Schobel
~


From jeff.sturm at eprize.com  Fri Jul 10 16:11:07 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Fri, 10 Jul 2009 12:11:07 -0400
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
In-Reply-To: <c8c8ded30907100849m3f541939w1a2768fab95faf5a@mail.gmail.com>
References: <c8c8ded30907100742y21661bd3s977e61c996fe8a11@mail.gmail.com><1247239627.3384.38.camel@localhost.localdomain>
	<c8c8ded30907100849m3f541939w1a2768fab95faf5a@mail.gmail.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC326@hugo.eprize.local>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Peter Schobel
> Sent: Friday, July 10, 2009 11:49 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] rm -r on gfs2 filesystem is very slow
> 
> This performance problem exhibits itself even when using a single
> node. Writing to the filesystem seems to work fine. The time to do a
> cp -r dir /gfs/dir is very comparable to writing to local disk
> however, rm -r /gfs/dir takes considerably longer than it does on
> local disk. I am guessing this is a feature of dlm checking for a lock
> on each individual file but I'm not sure.

We had similar results on GFS1, and obtained a large improvement by
increasing demote_secs in some cases.  I haven't tried GFS2 yet, and
don't know if would do the same.

One of our GFS1 filesystems contains files that don't normally live
longer than an hour.  They are created, expire after an hour, and get
reaped (rm'ed) within  2 hours.  Settings demote_secs to 7200 for this
filesystem made a huge difference, because (as I understand it) it
increased the likelihood that a lock would still be held on the inode by
the time it was to be removed.

-Jeff


From swhiteho at redhat.com  Fri Jul 10 16:47:19 2009
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Fri, 10 Jul 2009 17:47:19 +0100
Subject: [Linux-cluster] rm -r on gfs2 filesystem is very slow
In-Reply-To: <c8c8ded30907100907i5000077cm6900a9570bfb12d6@mail.gmail.com>
References: <c8c8ded30907100742y21661bd3s977e61c996fe8a11@mail.gmail.com>
	<1247239627.3384.38.camel@localhost.localdomain>
	<c8c8ded30907100849m3f541939w1a2768fab95faf5a@mail.gmail.com>
	<1247241416.3384.42.camel@localhost.localdomain>
	<c8c8ded30907100907i5000077cm6900a9570bfb12d6@mail.gmail.com>
Message-ID: <1247244439.3384.43.camel@localhost.localdomain>

Hi,

On Fri, 2009-07-10 at 09:07 -0700, Peter Schobel wrote:
> So, in your opinion, there are no known issues which would cause this
> particular problem and this poor performance while deleting files
> should be considered normal?
> 
> Thanks,
> 
> Peter
> ~
> 
It depends just how bad it is... it will be worse, but the question is
how much worse. The challenge of course is to minimise the amount by
which it slows down,

Steve.


> On Fri, Jul 10, 2009 at 8:56 AM, Steven Whitehouse<swhiteho at redhat.com> wrote:
> > Hi,
> >
> > On Fri, 2009-07-10 at 08:49 -0700, Peter Schobel wrote:
> >> The initial writing is done via the network by checking out source
> >> trees from a Perforce repository. Beyond that, source trees are
> >> compiled causing the creation of many object files.
> >>
> >> Multiple source trees will be compiled from the same node or from
> >> multiple nodes.
> >>
> >> This performance problem exhibits itself even when using a single
> >> node. Writing to the filesystem seems to work fine. The time to do a
> >> cp -r dir /gfs/dir is very comparable to writing to local disk
> >> however, rm -r /gfs/dir takes considerably longer than it does on
> >> local disk. I am guessing this is a feature of dlm checking for a lock
> >> on each individual file but I'm not sure.
> >>
> >> Peter
> >
> > Partly that is the case. There are some things which can be done to
> > improve performance in the deallocation area, and so that is likely to
> > improve in future. The main issue is to ensure that we continue to
> > maintain the correct locking order in that code. It can be complex since
> > it involves the inode lock, transaction lock, and (maybe) multiple
> > resource group locks,
> >
> > Steve.
> >
> >> ~
> >>
> >> On Fri, Jul 10, 2009 at 8:27 AM, Steven Whitehouse<swhiteho at redhat.com> wrote:
> >> > Hi,
> >> >
> >> > On Fri, 2009-07-10 at 07:42 -0700, Peter Schobel wrote:
> >> >> When we did our initial proof of concept, we did not notice any
> >> >> performance problem of this magnitude. We were using OS release 2. Our
> >> >> QA engineers passed approval on the performance stats of the gfs2
> >> >> filesystem and now that we are in deployment phase they are calling it
> >> >> unusable.
> >> >>
> >> >> Have there been any recent software changes that could have caused
> >> >> degraded performance or something I may have missed in configuration?
> >> >> Are there any tunable parameters in gfs2 that may increase our
> >> >> performance?
> >> >>
> >> > Not that I'm aware of. There are no tunable parameters which might
> >> > affect this particular aspect of performance, but to be clear exactly
> >> > what the issue is, let me ask a few questions...
> >> >
> >> >> Our application is very write intensive. Basically we are compiling a
> >> >> source tree and running a make clean between builds.
> >> >>
> >> >> Thanks in advance,
> >> >>
> >> >> Peter
> >> >> ~
> >> >>
> >> > What is the nature of the writes? Are the different nodes writing into
> >> > different directories in the main?
> >> >
> >> > GFS2 is pretty good at large directories, given certain conditions. Look
> >> > ups should be pretty fast. Once there is a writer into a particular
> >> > directory, then ideally one would take care not to read or write that
> >> > directory from other nodes until the writer is finished.
> >> >
> >> > Directory listing of large directories can be slow, and counts as
> >> > reading the directory from a caching point of view. Look ups of
> >> > individual files should be fast though,
> >> >
> >> > Steve.
> >> >
> >> >
> >> >> On Wed, Jul 08, 2009 at 01:58:30PM -0700, Peter Schobel wrote:
> >> >>
> >> >> >> I am trying to set up a four node cluster but am getting very poor
> >> >> >> performance when removing large directories. A directory approximately
> >> >> >> 1.6G  in size takes around 5 mins to remove from the gfs2 filesystem
> >> >> >> but removes in around 10 seconds from the local disk.
> >> >> >>
> >> >> >> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
> >> >> >>
> >> >> >> The filesystem was formatted in the following manner: mkfs.gfs2 -t
> >> >> >> wtl_build:dev_home00 -p lock_dlm -j 10
> >> >> >> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
> >> >> >> following options: _netdev,noatime,defaults.
> >> >> >
> >> >> > This is something you have to live with.  GFS(2) works great, but with
> >> >> > large(r) directories performance is extremely bad and for many
> >> >> > applications a real show-stopper.
> >> >> >
> >> >> > There have been many discussions on this list, with GFS parameter tuning
> >> >> > suggestions that at least for me didn't result in any improvements, with
> >> >> > promises that the problems would be solved in GFS2 (I see no significant
> >> >> > performance improvements between GFS and GFS2), etc.
> >> >>
> >> >> > --
> >> >> > --    Jos Vos <jos at xos.nl>
> >> >> > --    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
> >> >> > --    Amsterdam, The Netherlands        |     Fax: +31 20 6948204
> >> >>
> >> >
> >> > --
> >> > Linux-cluster mailing list
> >> > Linux-cluster at redhat.com
> >> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >> >
> >>
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> 
> 
> 


From tfrumbacher at gmail.com  Fri Jul 10 20:56:32 2009
From: tfrumbacher at gmail.com (Aaron Benner)
Date: Fri, 10 Jul 2009 14:56:32 -0600
Subject: [Linux-cluster] linux cluster and virtualization - shared
	xend-config.sxp
Message-ID: <A7966B92-48F8-4EA5-9733-A531F4EBA8AE@gmail.com>

All,

The documentation provided via the conga cluster recipe (Add a virtual  
machine as a clustered service) states that:

1) "Both xen config files [e.g. xend-config-sxp] and VM disk images  
must be located on shared storage, with identical mount paths for each  
node"

and then on the next slide states:

2) "Xend must be running on each node, and MUST BE STARTED BEFORE THE  
CMAN CLUSTER DAEMON"

However, it seems to me that this creates a chicken and egg problem.   
In particular if my xend-config.sxp is on a shared storage device (In  
my case a gfs2 formatted iSCSI disk) then doesn't that imply that I  
must have cluster services running to mount/access the filesystem.  If  
that is the case then it is impossible for me to start xend before  
cman as my xend-config.sxp will not exist until cman starts and I  
mount my gfs2 partition.

The only other documentation I found for this setup: http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/ 
  says absolutely nothing about system service startup order and  
includes an example of a cluster node reboot.

So, my question is this; is the default ordering of system service  
startup (cman < xend) workable with virtual machines defined as  
cluster services?

Thanks -- Aaron

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090710/6083a0d7/attachment.htm>

From jason at monsterjam.org  Fri Jul 10 23:35:32 2009
From: jason at monsterjam.org (jason at monsterjam.org)
Date: Fri, 10 Jul 2009 19:35:32 -0400
Subject: [Linux-cluster] service stuck in "starting" state
Message-ID: <20090710233532.GA6696@monsterjam.org>

hey cluster gurus..
I have a 2 node cluster thats been running without issue for quite a while.. all of a sudden one of the nodes will not 
completely start the apache webserver service.. it looks like this 

[root at tf1 ~]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  tf1                                      Online, Local, rgmanager
  tf2                                      Online, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  Apache Service       tf1                            starting        
  postfix service      tf1                            started         
[root at tf1 ~]# 

and I see that the httpd is NOT started. although, if I do 
/etc/init.d/httpd start
the service starts without issue.

grepping for apache and http in the logs, I see this..

Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
Jul 10 14:33:57 tf1 httpd: httpd startup failed
Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
Jul 10 14:34:06 tf1 httpd: httpd startup failed
Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd stop 
Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd stop 
Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
[root at tf1 log]# grep apache  messages
Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
Jul 10 10:04:33 tf1 clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
Jul 10 16:23:34 tf1 clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
[root at tf1 log]# 


Im guessing its the  stop on script "cluster_apache" returned 1 (generic error)
but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both the same size

[root at tf2 ~]# ls -al /etc/init.d/httpd
-rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd

[root at tf1 log]# ls -al /etc/init.d/httpd
-rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd

and the apache service starts/stops just fine on tf2 when the services get failed over to that machine.

any ideas on what can be wrong?

Jason


From cthulhucalling at gmail.com  Fri Jul 10 23:42:52 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Fri, 10 Jul 2009 16:42:52 -0700
Subject: [Linux-cluster] service stuck in "starting" state
In-Reply-To: <20090710233532.GA6696@monsterjam.org>
References: <20090710233532.GA6696@monsterjam.org>
Message-ID: <36df569a0907101642x41aa5661xfcb9a167162888a4@mail.gmail.com>

Well it looks like that you're missing a SSL Certificate, so Apache isn't
starting. It also looks like the server is trying to stop the service after
an unsuccessful startup and since it isn't started, the init script returns
an error. I had something similar with Squid a while back.

On Fri, Jul 10, 2009 at 4:35 PM, <jason at monsterjam.org> wrote:

> hey cluster gurus..
> I have a 2 node cluster thats been running without issue for quite a
> while.. all of a sudden one of the nodes will not
> completely start the apache webserver service.. it looks like this
>
> [root at tf1 ~]# clustat
> Member Status: Quorate
>
>  Member Name                              Status
>  ------ ----                              ------
>  tf1                                      Online, Local, rgmanager
>  tf2                                      Online, rgmanager
>
>  Service Name         Owner (Last)                   State
>  ------- ----         ----- ------                   -----
>  Apache Service       tf1                            starting
>  postfix service      tf1                            started
> [root at tf1 ~]#
>
> and I see that the httpd is NOT started. although, if I do
> /etc/init.d/httpd start
> the service starts without issue.
>
> grepping for apache and http in the logs, I see this..
>
> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of
> /etc/httpd/conf.d/ssl.conf:
> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file
> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
> Jul 10 14:33:57 tf1 httpd: httpd startup failed
> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of
> /etc/httpd/conf.d/ssl.conf:
> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file
> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
> Jul 10 14:34:06 tf1 httpd: httpd startup failed
> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd
> stop
> Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd
> stop
> Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
> [root at tf1 log]# grep apache  messages
> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script
> "cluster_apache" returned 1 (generic error)
> Jul 10 10:04:33 tf1 clurgmgrd[6149]: <notice> stop on script
> "cluster_apache" returned 1 (generic error)
> Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on script
> "cluster_apache" returned 1 (generic error)
> Jul 10 16:23:34 tf1 clurgmgrd[6168]: <notice> stop on script
> "cluster_apache" returned 1 (generic error)
> Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on script
> "cluster_apache" returned 1 (generic error)
> [root at tf1 log]#
>
>
> Im guessing its the  stop on script "cluster_apache" returned 1 (generic
> error)
> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both the
> same size
>
> [root at tf2 ~]# ls -al /etc/init.d/httpd
> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>
> [root at tf1 log]# ls -al /etc/init.d/httpd
> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>
> and the apache service starts/stops just fine on tf2 when the services get
> failed over to that machine.
>
> any ideas on what can be wrong?
>
> Jason
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090710/de9c28a9/attachment.htm>

From ricks at nerd.com  Fri Jul 10 23:50:12 2009
From: ricks at nerd.com (Rick Stevens)
Date: Fri, 10 Jul 2009 16:50:12 -0700
Subject: [Linux-cluster] service stuck in "starting" state
In-Reply-To: <20090710233532.GA6696@monsterjam.org>
References: <20090710233532.GA6696@monsterjam.org>
Message-ID: <4A57D3B4.6040004@nerd.com>

jason at monsterjam.org wrote:
> hey cluster gurus..
> I have a 2 node cluster thats been running without issue for quite a while.. all of a sudden one of the nodes will not 
> completely start the apache webserver service.. it looks like this 
> 
> [root at tf1 ~]# clustat
> Member Status: Quorate
> 
>   Member Name                              Status
>   ------ ----                              ------
>   tf1                                      Online, Local, rgmanager
>   tf2                                      Online, rgmanager
> 
>   Service Name         Owner (Last)                   State         
>   ------- ----         ----- ------                   -----         
>   Apache Service       tf1                            starting        
>   postfix service      tf1                            started         
> [root at tf1 ~]# 
> 
> and I see that the httpd is NOT started. although, if I do 
> /etc/init.d/httpd start
> the service starts without issue.
> 
> grepping for apache and http in the logs, I see this..
> 
> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
> Jul 10 14:33:57 tf1 httpd: httpd startup failed
> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
> Jul 10 14:34:06 tf1 httpd: httpd startup failed
> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd stop 
> Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd stop 
> Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
> [root at tf1 log]# grep apache  messages
> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
> Jul 10 10:04:33 tf1 clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
> Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
> Jul 10 16:23:34 tf1 clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
> Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on script "cluster_apache" returned 1 (generic error) 
> [root at tf1 log]# 
> 
> 
> Im guessing its the  stop on script "cluster_apache" returned 1 (generic error)
> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both the same size
> 
> [root at tf2 ~]# ls -al /etc/init.d/httpd
> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
> 
> [root at tf1 log]# ls -al /etc/init.d/httpd
> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
> 
> and the apache service starts/stops just fine on tf2 when the services get failed over to that machine.
> 
> any ideas on what can be wrong?

tf1 is complaining about a bad SSL cert.  The fact that it's complaining
when being started by clurgmgrd but not when started manually indicates
that clurgmgrd is starting it differently (specifying a different
httpd.conf file perhaps?).
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer                      ricks at nerd.com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
-     The trouble with troubleshooting is that trouble sometimes     -
-                             shoots back.                           -
----------------------------------------------------------------------


From jason at monsterjam.org  Sat Jul 11 00:59:04 2009
From: jason at monsterjam.org (jason at monsterjam.org)
Date: Fri, 10 Jul 2009 20:59:04 -0400
Subject: [Linux-cluster] service stuck in "starting" state
In-Reply-To: <4A57D3B4.6040004@nerd.com>
References: <20090710233532.GA6696@monsterjam.org> <4A57D3B4.6040004@nerd.com>
Message-ID: <20090711005904.GA15858@monsterjam.org>

On Fri, Jul 10, 2009 at 04:50:12PM -0700, Rick Stevens wrote:
> jason at monsterjam.org wrote:
>> hey cluster gurus..
>> I have a 2 node cluster thats been running without issue for quite a 
>> while.. all of a sudden one of the nodes will not completely start the 
>> apache webserver service.. it looks like this [root at tf1 ~]# clustat
>> Member Status: Quorate
>>   Member Name                              Status
>>   ------ ----                              ------
>>   tf1                                      Online, Local, rgmanager
>>   tf2                                      Online, rgmanager
>>   Service Name         Owner (Last)                   State           
>> ------- ----         ----- ------                   -----           Apache 
>> Service       tf1                            starting          postfix 
>> service      tf1                            started         [root at tf1 ~]# 
>> and I see that the httpd is NOT started. although, if I do 
>> /etc/init.d/httpd start
>> the service starts without issue.
>> grepping for apache and http in the logs, I see this..
>> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
>> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
>> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
>> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of 
>> /etc/httpd/conf.d/ssl.conf:
>> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file 
>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>> Jul 10 14:33:57 tf1 httpd: httpd startup failed
>> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of 
>> /etc/httpd/conf.d/ssl.conf:
>> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file 
>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>> Jul 10 14:34:06 tf1 httpd: httpd startup failed
>> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
>> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd 
>> stop Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
>> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
>> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
>> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
>> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd 
>> stop Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
>> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
>> [root at tf1 log]# grep apache  messages
>> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script 
>> "cluster_apache" returned 1 (generic error) Jul 10 10:04:33 tf1 
>> clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 
>> (generic error) Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on 
>> script "cluster_apache" returned 1 (generic error) Jul 10 16:23:34 tf1 
>> clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 
>> (generic error) Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on 
>> script "cluster_apache" returned 1 (generic error) [root at tf1 log]# Im 
>> guessing its the  stop on script "cluster_apache" returned 1 (generic 
>> error)
>> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both the 
>> same size
>> [root at tf2 ~]# ls -al /etc/init.d/httpd
>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>> [root at tf1 log]# ls -al /etc/init.d/httpd
>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>> and the apache service starts/stops just fine on tf2 when the services get 
>> failed over to that machine.
>> any ideas on what can be wrong?
>
> tf1 is complaining about a bad SSL cert.  The fact that it's complaining
> when being started by clurgmgrd but not when started manually indicates
> that clurgmgrd is starting it differently (specifying a different
> httpd.conf file perhaps?).

well, heres the relevant part of my config file
        <rm>
                <failoverdomains>
                        <failoverdomain name="httpd" ordered="1" restricted="1">
                                <failoverdomainnode name="tf1" priority="1"/>
                                <failoverdomainnode name="tf2" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <script file="/etc/init.d/httpd" name="cluster_apache"/>
                        <ip address="192.168.1.7" monitor_link="1"/>
                        <script file="/etc/init.d/postfix" name="cluster_posstfix"/>
                </resources>
                <service autostart="1" domain="httpd" name="Apache Service">
                        <ip ref="192.168.1.7"/>
                        <script ref="cluster_apache"/>
                </service>
                <service autostart="1" domain="httpd" name="postfix service">
                        <ip ref="192.168.1.7"/>
                        <script ref="cluster_posstfix"/>
                </service>
        </rm>

ive never seen that ssl error when starting the service manually.


the other thing that I noticed.. is that when I try to do 

[root at tf1 cluster]# clusvcadm -d "Apache Service"
Member tf1 disabling Apache Service...

it just hangs there and never returns.

Jason


From schlegel at riege.com  Sat Jul 11 08:02:45 2009
From: schlegel at riege.com (Gunther Schlegel)
Date: Sat, 11 Jul 2009 10:02:45 +0200
Subject: [Linux-cluster] linux cluster and virtualization -
	shared	xend-config.sxp
In-Reply-To: <A7966B92-48F8-4EA5-9733-A531F4EBA8AE@gmail.com>
References: <A7966B92-48F8-4EA5-9733-A531F4EBA8AE@gmail.com>
Message-ID: <4A584725.2010604@riege.com>

Aaron,

> The documentation provided via the conga cluster recipe (Add a virtual 
> machine as a clustered service 
> <http://sourceware.org/cluster/conga/cookbook/VMs_as_services>) states that:
> 
> 1) "Both xen config files [e.g. xend-config-sxp] and VM disk images must 
> be located on shared storage, with identical mount paths for each node"
> 
> and then on the next slide states:
> 
> 2) "Xend must be running on each node, and MUST BE STARTED BEFORE THE 
> CMAN CLUSTER DAEMON"
> 
 > ...
> 
> The only other documentation I found for this 
> setup: http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/ 
> says absolutely nothing about system service startup order and includes 
> an example of a cluster node reboot.
> 
> So, my question is this; is the default ordering of system service 
> startup (cman < xend) workable with virtual machines defined as cluster 
> services?

It works fine in the normal startup order. xend has to be started before 
rgmanager and after cman.

best regards, Gunther

-- 
Gunther Schlegel
Manager IT Infrastructure


.............................................................
Riege Software International GmbH  Fon: +49 (2159) 9148 0
Mollsfeld 10                       Fax: +49 (2159) 9148 11
40670 Meerbusch                    Web: www.riege.com
Germany                            E-Mail: schlegel at riege.com
---                                ---
Handelsregister:                   Managing Directors:
Amtsgericht Neuss HRB-NR 4207      Christian Riege
USt-ID-Nr.: DE120585842            Gabriele  Riege
                                   Johannes  Riege
.............................................................
           YOU CARE FOR FREIGHT, WE CARE FOR YOU          


-------------- next part --------------
A non-text attachment was scrubbed...
Name: schlegel.vcf
Type: text/x-vcard
Size: 346 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090711/3ee2f401/attachment.vcf>

From peter.tiggerdine at uq.edu.au  Mon Jul 13 03:33:29 2009
From: peter.tiggerdine at uq.edu.au (Peter Tiggerdine)
Date: Mon, 13 Jul 2009 13:33:29 +1000
Subject: [Linux-cluster] linux cluster and virtualization -
	sharedxend-config.sxp
In-Reply-To: <A7966B92-48F8-4EA5-9733-A531F4EBA8AE@gmail.com>
Message-ID: <DB208538359CE54C920E89A5038023EAE7FF2D@UQEXMB5.soe.uq.edu.au>

Aaron,

 
I think they meant the config file for the VM not the global config for
xen.

 
Regards,

 
Peter Tiggerdine

HPC - UQ

 
________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Aaron Benner
Sent: Saturday, 11 July 2009 6:57 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] linux cluster and virtualization -
sharedxend-config.sxp

 
All,

 
The documentation provided via the conga cluster recipe (Add a virtual
machine as a clustered service
<http://sourceware.org/cluster/conga/cookbook/VMs_as_services> ) states
that:

 
1) "Both xen config files [e.g. xend-config-sxp] and VM disk images must
be located on shared storage, with identical mount paths for each node"

 
and then on the next slide states:

 
2) "Xend must be running on each node, and MUST BE STARTED BEFORE THE
CMAN CLUSTER DAEMON"

 
However, it seems to me that this creates a chicken and egg problem.  In
particular if my xend-config.sxp is on a shared storage device (In my
case a gfs2 formatted iSCSI disk) then doesn't that imply that I must
have cluster services running to mount/access the filesystem.  If that
is the case then it is impossible for me to start xend before cman as my
xend-config.sxp will not exist until cman starts and I mount my gfs2
partition.

 
The only other documentation I found for this setup:
http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of
-virtualized-guests-in-advanced-platform/ says absolutely nothing about
system service startup order and includes an example of a cluster node
reboot.

 
So, my question is this; is the default ordering of system service
startup (cman < xend) workable with virtual machines defined as cluster
services?

 
Thanks -- Aaron

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090713/a967bfa4/attachment.htm>

From abednegoyulo at yahoo.com  Mon Jul 13 05:29:46 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Sun, 12 Jul 2009 22:29:46 -0700 (PDT)
Subject: [Linux-cluster] Cannot make cluster after upgrade
In-Reply-To: <36df569a0907080745u1a498a96oc8853f37f093ea08@mail.gmail.com>
Message-ID: <122541.1255.qm@web110413.mail.gq1.yahoo.com>


Unfortunately it did not work. Now I am currently running on a single node cluster just to make the site run. Is it possible to add a cluster member to a currently running cluster without taking down the cluster?

--- On Wed, 7/8/09, Ian Hayes <cthulhucalling at gmail.com> wrote:

> From: Ian Hayes <cthulhucalling at gmail.com>
> Subject: Re: [Linux-cluster] Cannot make cluster after upgrade
> To: "linux clustering" <linux-cluster at redhat.com>
> Date: Wednesday, 8 July, 2009, 10:45 PM
> In the fence_daemon tag. Like this:
> 
> ?<fence_daemon clean_start="1"
> post_fail_delay="0"
> post_join_delay="3"/>
> 
> On Wed, Jul 8, 2009 at 2:50 AM,
> Abed-nego G. Escobal, Jr. <abednegoyulo at yahoo.com>
> wrote:
> 
> 
> 
> I haven't tried it yet. To which part of the
> cluster.conf should I be inserting clean_start=1 ?
> 
> 
> 
> --- On Wed, 7/8/09, Ian Hayes <cthulhucalling at gmail.com>
> wrote:
> 
> 
> 
> > From: Ian Hayes <cthulhucalling at gmail.com>
> 
> > Subject: Re: [Linux-cluster] Cannot make cluster after
> upgrade
> 
> > To: "linux clustering" <linux-cluster at redhat.com>
> 
> > Date: Wednesday, 8 July, 2009, 2:59 PM
> 
> > Sounds a little
> 
> > split-brainish....... have you tried the
> clean_start=1
> 
> > option?
> 
> > On Jul 7, 2009 11:54 PM,
> 
> > "Abed-nego G. Escobal, Jr." <abednegoyulo at yahoo.com>
> 
> > wrote:
> 
> >
> 
> >
> 
> >
> 
> > After an upgrade from 5.2 to 5.3, the cluster, named
> 
> > GFSCluster, seems to stop being a cluster. GFSCluster
> is a 2
> 
> > node cluster using iscsi, cman, clvm, and gfs and it
> was
> 
> > working fine when it was on 5.2 The configuration on
> both of
> 
> > the nodes (passwords removed)
> 
> >
> 
> >
> 
> >
> 
> >
> 
> > <?xml version="1.0"?>
> 
> >
> 
> > <cluster name="GFSCluster"
> 
> > config_version="5">
> 
> >
> 
> > <cman expected_votes="1"
> 
> > two_node="1"/>
> 
> >
> 
> > ??<clusternodes><clusternode name="node01.company.com"
> 
> > votes="1"
> 
> > nodeid="1"><fence><method
> 
> > name="single"><device
> 
> >
> name="node01_ipmi"/></method></fence></clusternode><clusternode
> 
> > name="node02.company.com"
> 
> > votes="1"
> 
> > nodeid="2"><fence><method
> 
> > name="single"><device
> 
> >
> name="node02_ipmi"/></method></fence></clusternode></clusternodes>
> 
> >
> 
> >
> 
> > ??<fencedevices><fencedevice
> 
> > name="node01_ipmi"
> agent="fence_ipmilan"
> 
> > ipaddr="10.1.0.5" login="root"
> 
> > passwd="********"/><fencedevice
> 
> > name="node02_ipmi"
> agent="fence_ipmilan"
> 
> > ipaddr="10.1.0.7" login="root"
> 
> > passwd="********"/></fencedevices>
> 
> >
> 
> >
> 
> > ??<rm>
> 
> >
> 
> > ?? ?<failoverdomains/>
> 
> >
> 
> > ?? ?<resources/>
> 
> >
> 
> > ??</rm>
> 
> >
> 
> > </cluster>
> 
> >
> 
> >
> 
> >
> 
> > When starting the service cman, they both hang on the
> part
> 
> > starting fencing
> 
> >
> 
> >
> 
> >
> 
> > Starting cluster:
> 
> >
> 
> > ?? Loading modules... done
> 
> >
> 
> > ?? Mounting configfs... done
> 
> >
> 
> > ?? Starting ccsd... done
> 
> >
> 
> > ?? Starting cman... done
> 
> >
> 
> > ?? Starting daemons... done
> 
> >
> 
> > ?? Starting fencing...
> 
> >
> 
> >
> 
> >
> 
> > After 5 minutes the task finishes with
> "done" but
> 
> > clustat says
> 
> >
> 
> >
> 
> >
> 
> > ==== As root on web01.company.com ====
> 
> >
> 
> > ??Cluster Status for GFSCluster @ Wed Jul ?8
> 01:00:24
> 
> > 2009
> 
> >
> 
> > ??Member Status: Quorate
> 
> >
> 
> >
> 
> >
> 
> > ?? Member Name ? ? ? ? ? ? ? ? ? ? ? ?
> ? ?
> 
> > ID ? Status
> 
> >
> 
> > ?? ------ ---- ? ? ? ? ? ? ? ? ? ? ? ?
> ? ?
> 
> > ---- ------
> 
> >
> 
> > ?? node01.company.com ?
> 
> > ? ? ? ? ? ? ? ? ? ? ? 1 Online, Local
> 
> >
> 
> > ?? node02.company.com ?
> 
> > ? ? ? ? ? ? ? ? ? ? ? 2 Offline
> 
> >
> 
> >
> 
> >
> 
> >
> 
> >
> 
> > ==== As root on web02.company.com ====
> 
> >
> 
> > ??Cluster Status for GFSCluster @ Wed Jul ?8
> 01:00:26
> 
> > 2009
> 
> >
> 
> > ??Member Status: Quorate
> 
> >
> 
> >
> 
> >
> 
> > ?? Member Name ? ? ? ? ? ? ? ? ? ? ? ?
> ? ?
> 
> > ID ? Status
> 
> >
> 
> > ?? ------ ---- ? ? ? ? ? ? ? ? ? ? ? ?
> ? ?
> 
> > ---- ------
> 
> >
> 
> > ?? node01.company.com ?
> 
> > ? ? ? ? ? ? ? ? ? ? ? 1 Offline
> 
> >
> 
> > ?? node02.company.com ?
> 
> > ? ? ? ? ? ? ? ? ? ? ? 2 Online, Local
> 
> >
> 
> >
> 
> >
> 
> > They are both quorate with their own cluster
> 
> >
> 
> >
> 
> >
> 
> > In the logs of web01 I found repeating messages
> 
> >
> 
> >
> 
> >
> 
> > Jul ?8 00:55:27 web01 fenced[21872]: node02.company.com not
> 
> > a cluster member after 6 sec post_join_delay
> 
> >
> 
> > Jul ?8 00:55:27 web01 fenced[21872]: fencing node
> "node02.company.com"
> 
> >
> 
> > Jul ?8 00:55:52 web01 fenced[21872]: agent
> 
> > "fence_ipmilan" reports: Rebooting machine
> @
> 
> > IPMI:10.1.0.7...ipmilan: Failed to connect after 30
> seconds
> 
> > Failed
> 
> >
> 
> >
> 
> >
> 
> >
> 
> >
> 
> > In the logs of web02 I also found the same repeating
> 
> > messages
> 
> >
> 
> >
> 
> >
> 
> > Jul ?8 00:55:27 web02 fenced[6363]: node01.company.com not
> 
> > a cluster member after 6 sec post_join_delay
> 
> >
> 
> > Jul ?8 00:55:27 web02 fenced[6363]: fencing node
> "node01.company.com"
> 
> >
> 
> > Jul ?8 00:55:53 web02 fenced[6363]: agent
> 
> > "fence_ipmilan" reports: Rebooting machine
> @
> 
> > IPMI:10.1.0.5...ipmilan: Failed to connect after 30
> seconds
> 
> > Failed
> 
> >
> 
> >
> 
> >
> 
> >
> 
> >
> 
> > Is there a bug on 5.3 with regards to clustering?
> 
> >
> 
> > Is there any workarounds?
> 
> >
> 
> >
> 
> >
> 
> >
> 
> >
> 
> >
> 
> >
> 
> > ?? ? ?Feel safer online. Upgrade to the new,
> safer
> 
> > Internet Explorer 8 optimized for Yahoo! to put your
> mind at
> 
> > peace. It's free. Get IE8 here! http://downloads.yahoo.com/sg/internetexplorer/
> 
> >
> 
> >
> 
> >
> 
> >
> 
> > --
> 
> >
> 
> > Linux-cluster mailing list
> 
> >
> 
> > Linux-cluster at redhat.com
> 
> >
> 
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> >
> 
> >
> 
> >
> 
> > -----Inline Attachment Follows-----
> 
> >
> 
> > --
> 
> > Linux-cluster mailing list
> 
> > Linux-cluster at redhat.com
> 
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
> 
> 
> 
> 
> 
> 
> --
> 
> Linux-cluster mailing list
> 
> Linux-cluster at redhat.com
> 
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
> 
> -----Inline Attachment Follows-----
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


      Try the new Yahoo! Messenger. Now with all you love about messenger and more! http://ph.messenger.yahoo.com


From esggrupos at gmail.com  Mon Jul 13 10:33:33 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Mon, 13 Jul 2009 12:33:33 +0200
Subject: [Linux-cluster] two diferente gfs systems on same cluster?
Message-ID: <3128ba140907130333j6d117bdbof2a2d58b59a04985@mail.gmail.com>

Hi all,
I?m configuring a cluster and I want to have 2 gfs systems on the cluster.

One filesytem to use with NFS service and another one with PostgreSQL
service.

I try to create them this way:

for the bbdd:
 mkfs -t gfs2 -p lock_dlm -t CLUSTERXEN:bbdddata -j4 /dev/sdb1

for the nfs:

mkfs -t gfs2 -p lock_dlm -t CLUSTERXEN:nfsdata -j4 /dev/sdc1

When I create one of them, when I try to create the second I get this error:

mkfs.gfs2: device /dev/sdc1 is busy

so my question if what I?m doing is right or its impossible,


Thanks in advance

ESG
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090713/102bd69b/attachment.htm>

From esggrupos at gmail.com  Mon Jul 13 12:47:22 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Mon, 13 Jul 2009 14:47:22 +0200
Subject: [Linux-cluster] relocating all services
Message-ID: <3128ba140907130547m7e685e87pe854890651ab5af8@mail.gmail.com>

Hi all,
is there any way to relocate all the services that a node is running.

I have 2 services apache and mysql using the same VIP. If I try to relocate
first one service and then the second I get the error that the ip is already
used. So I think the problem is because in a moment the services are running
on diferent nodes.

If i can relocate the 2 services at one time I think I haven?t this
problem,

any suggestion with this problem (if there is a better way to have this
services running its welcome)

Thanks,

ESG
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090713/e979042e/attachment.htm>

From robejrm at gmail.com  Mon Jul 13 12:49:41 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Mon, 13 Jul 2009 14:49:41 +0200
Subject: [Linux-cluster] relocating all services
In-Reply-To: <3128ba140907130547m7e685e87pe854890651ab5af8@mail.gmail.com>
References: <3128ba140907130547m7e685e87pe854890651ab5af8@mail.gmail.com>
Message-ID: <8a5668960907130549n94d60bes13ea7b29ff1b227b@mail.gmail.com>

On Mon, Jul 13, 2009 at 2:47 PM, ESGLinux <esggrupos at gmail.com> wrote:

> Hi all,
> is there any way to relocate all the services that a node is running.
>
> I have 2 services apache and mysql using the same VIP. If I try to relocate
> first one service and then the second I get the error that the ip is already
> used. So I think the problem is because in a moment the services are running
> on diferent nodes.
>
You should have one IP for _each_ service

Greetings,
Juanra

>
> If i can relocate the 2 services at one time I think I haven?t this
> problem,
>
> any suggestion with this problem (if there is a better way to have this
> services running its welcome)
>
> Thanks,
>
> ESG
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090713/38a8b0ce/attachment.htm>

From robejrm at gmail.com  Mon Jul 13 12:51:51 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Mon, 13 Jul 2009 14:51:51 +0200
Subject: [Linux-cluster] two diferente gfs systems on same cluster?
In-Reply-To: <3128ba140907130333j6d117bdbof2a2d58b59a04985@mail.gmail.com>
References: <3128ba140907130333j6d117bdbof2a2d58b59a04985@mail.gmail.com>
Message-ID: <8a5668960907130551w7129b6a6gcdc8410a82dfe68d@mail.gmail.com>

On Mon, Jul 13, 2009 at 12:33 PM, ESGLinux <esggrupos at gmail.com> wrote:

> Hi all,
> I?m configuring a cluster and I want to have 2 gfs systems on the cluster.
>
> One filesytem to use with NFS service and another one with PostgreSQL
> service.
>
> I try to create them this way:
>
> for the bbdd:
>  mkfs -t gfs2 -p lock_dlm -t CLUSTERXEN:bbdddata -j4 /dev/sdb1
>
> for the nfs:
>
> mkfs -t gfs2 -p lock_dlm -t CLUSTERXEN:nfsdata -j4 /dev/sdc1
>
What is sdc1? A disk partition, an iscsi lun, a scsi lun? Is being already
used?

Regards,
Juanra

>
> When I create one of them, when I try to create the second I get this
> error:
>
> mkfs.gfs2: device /dev/sdc1 is busy
>
> so my question if what I?m doing is right or its impossible,
>
>
> Thanks in advance
>
> ESG
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090713/8046b478/attachment.htm>

From esggrupos at gmail.com  Mon Jul 13 12:54:41 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Mon, 13 Jul 2009 14:54:41 +0200
Subject: [Linux-cluster] two diferente gfs systems on same cluster?
In-Reply-To: <8a5668960907130551w7129b6a6gcdc8410a82dfe68d@mail.gmail.com>
References: <3128ba140907130333j6d117bdbof2a2d58b59a04985@mail.gmail.com>
	<8a5668960907130551w7129b6a6gcdc8410a82dfe68d@mail.gmail.com>
Message-ID: <3128ba140907130554j24f53b66wc6ef6785cb207780@mail.gmail.com>

Hi Juanra

2009/7/13 Juan Ramon Martin Blanco <robejrm at gmail.com>

>
>
> On Mon, Jul 13, 2009 at 12:33 PM, ESGLinux <esggrupos at gmail.com> wrote:
>
>> Hi all,
>> I?m configuring a cluster and I want to have 2 gfs systems on the
>> cluster.
>>
>> One filesytem to use with NFS service and another one with PostgreSQL
>> service.
>>
>> I try to create them this way:
>>
>> for the bbdd:
>>  mkfs -t gfs2 -p lock_dlm -t CLUSTERXEN:bbdddata -j4 /dev/sdb1
>>
>> for the nfs:
>>
>> mkfs -t gfs2 -p lock_dlm -t CLUSTERXEN:nfsdata -j4 /dev/sdc1
>>
>


> What is sdc1? A disk partition, an iscsi lun, a scsi lun? Is being already
> used?
>

both sdb1 and sdc1 are iscsi targets.

when I set one of them, the other gives me the error.


Greetings

ESG


>
>
> Regards,
> Juanra
>
>>
>> When I create one of them, when I try to create the second I get this
>> error:
>>
>> mkfs.gfs2: device /dev/sdc1 is busy
>>
>> so my question if what I?m doing is right or its impossible,
>>
>>
>> Thanks in advance
>>
>> ESG
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090713/a2158baf/attachment.htm>

From esggrupos at gmail.com  Mon Jul 13 12:55:52 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Mon, 13 Jul 2009 14:55:52 +0200
Subject: [Linux-cluster] relocating all services
In-Reply-To: <8a5668960907130549n94d60bes13ea7b29ff1b227b@mail.gmail.com>
References: <3128ba140907130547m7e685e87pe854890651ab5af8@mail.gmail.com>
	<8a5668960907130549n94d60bes13ea7b29ff1b227b@mail.gmail.com>
Message-ID: <3128ba140907130555g16e85229w30cfc06efcd3f91c@mail.gmail.com>

2009/7/13 Juan Ramon Martin Blanco <robejrm at gmail.com>

>
>
> On Mon, Jul 13, 2009 at 2:47 PM, ESGLinux <esggrupos at gmail.com> wrote:
>
>> Hi all,
>> is there any way to relocate all the services that a node is running.
>>
>> I have 2 services apache and mysql using the same VIP. If I try to
>> relocate first one service and then the second I get the error that the ip
>> is already used. So I think the problem is because in a moment the services
>> are running on diferent nodes.
>>
> You should have one IP for _each_ service
>
> Greetings,
> Juanra
>


I suposse it, I?m going to try it and I?ll post my results

thanks

ESG


>
>
>> If i can relocate the 2 services at one time I think I haven?t this
>> problem,
>>
>> any suggestion with this problem (if there is a better way to have this
>> services running its welcome)
>>
>> Thanks,
>>
>> ESG
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090713/434db896/attachment.htm>

From bbushby at maninvestments.com  Mon Jul 13 14:40:28 2009
From: bbushby at maninvestments.com (Bushby, Bruce (London)(c))
Date: Mon, 13 Jul 2009 15:40:28 +0100
Subject: [Linux-cluster] GNBD vs DRBD to mirror two disks via the network
Message-ID: <B510590D118044449CEB7415E870EF0D027D3A86@mildnpexmb02.maninvestments.ad.man.com>

 
Greetings!
 
I'm hoping a member could assist me in clearing up some understanding I
appear to be missing when it comes to GNBD.
 
Today I clustered two vmware machines (active/passive shared nothing)
and then configure GNBD....thats when I noticed it
wants to "export" a device from one node and "import" the device on the
other node...which I did and it all works....but that is
not what I was after.
 
Can GNBD be configured to keep two disks on different systems in sync
using both synchronous and asynchronous write options?
 
Further research appears to dictate DRDB would be more suited for what
I'm trying to do. (build active/passive clusters on cheap
storage..ie no SAN to replicate and no scsi to share)
 
Any help much appreciated!!!!
 
 
Bruce Bushby
Unix Engineering
 

**********************************************************************
 Please consider the environment before printing this email or its attachments.
The contents of this email are for the named addressees only.  It contains information which may be confidential and privileged.  If you are not the intended recipient, please notify the sender immediately, destroy this email and any attachments and do not otherwise disclose or use them. Email transmission is not a secure method of communication and Man Investments cannot accept responsibility for the completeness or accuracy of this email or any attachments. Whilst Man Investments makes every effort to keep its network free from viruses, it does not accept responsibility for any computer virus which might be transferred by way of this email or any attachments. This email does not constitute a request, offer, recommendation or solicitation of any kind to buy, subscribe, sell or redeem any investment instruments or to perform other such transactions of any kind. Man Investments reserves the right to monitor, record and retain all electronic communications through its network to ensure the integrity of its systems, for record keeping and regulatory purposes. 
Visit us at: www.maninvestments.com 
TG0908
**********************************************************************

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090713/31d9a8f4/attachment.htm>

From bmr at redhat.com  Mon Jul 13 15:03:00 2009
From: bmr at redhat.com (Bryn M. Reeves)
Date: Mon, 13 Jul 2009 16:03:00 +0100
Subject: [Linux-cluster] GNBD vs DRBD to mirror two disks via the network
In-Reply-To: <B510590D118044449CEB7415E870EF0D027D3A86@mildnpexmb02.maninvestments.ad.man.com>
References: <B510590D118044449CEB7415E870EF0D027D3A86@mildnpexmb02.maninvestments.ad.man.com>
Message-ID: <1247497380.20189.534.camel@breeves.fab.redhat.com>

On Mon, 2009-07-13 at 15:40 +0100, Bushby, Bruce (London)(c) wrote:
>  
> Greetings!
>  
> I'm hoping a member could assist me in clearing up some understanding
> I appear to be missing when it comes to GNBD.
>  
> Today I clustered two vmware machines (active/passive shared nothing)
> and then configure GNBD....thats when I noticed it
> wants to "export" a device from one node and "import" the device on
> the other node...which I did and it all works....but that is
> not what I was after.
>  
> Can GNBD be configured to keep two disks on different systems in sync
> using both synchronous and asynchronous write options?

No. GNBD is just a cluster-aware network block device (i.e. it
implements fencing so that we can cut off a failing node from the shared
storage).

It's only useful if you want shared storage for e.g. GFS but don't have
access to hardware shared storage resources (or only have shared storage
hardware on a subset of nodes and want to spread that out to more nodes
using IP).

Regards,
Bryn.


From kkovachev at varna.net  Mon Jul 13 14:57:38 2009
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Mon, 13 Jul 2009 17:57:38 +0300
Subject: [Linux-cluster] GNBD vs DRBD to mirror two disks via the network
In-Reply-To: <B510590D118044449CEB7415E870EF0D027D3A86@mildnpexmb02.maninvestments.ad.man.com>
References: <B510590D118044449CEB7415E870EF0D027D3A86@mildnpexmb02.maninvestments.ad.man.com>
Message-ID: <20090713145052.M45484@varna.net>

Hi,
 DRBD is what you would want to use here and will beter do the job. It is
possible to create software raid 1 over GNBD and local drive/partition, but
not recomended.

On Mon, 13 Jul 2009 15:40:28 +0100, Bushby, Bruce \(London\)\(c\) wrote 
>   
> Greetings! 
>   
> I'm hoping a member could assist me in clearing up some understanding I
appear to be missing when it comes to GNBD. 
>   
> Today I clustered two vmware machines (active/passive shared nothing) and
then configure GNBD....thats when I noticed it 
> wants to "export" a device from one node and "import" the device on the
other node...which I did and it all works....but that is 
> not what I was after. 
>   
> Can GNBD be configured to keep two disks on different systems in sync using
both synchronous and asynchronous write options? 
>   
> Further research appears to dictate DRDB would be more suited for what I'm
trying to do. (build active/passive clusters on cheap 
> storage..ie no SAN to replicate and no scsi to share) 
>   
> Any help much appreciated!!!! 
>   
>   
> Bruce Bushby 
> Unix Engineering 
>   
> ********************************************************************** 
> Please consider the environment before printing this email or its attachments. 
> The contents of this email are for the named addressees only. It contains
information which may be confidential and privileged. If you are not the
intended recipient, please notify the sender immediately, destroy this email
and any attachments and do not otherwise disclose or use them. Email
transmission is not a secure method of communication and Man Investments
cannot accept responsibility for the completeness or accuracy of this email or
any attachments. Whilst Man Investments makes every effort to keep its network
free from viruses, it does not accept responsibility for any computer virus
which might be transferred by way of this email or any attachments. This email
does not constitute a request, offer, recommendation or solicitation of any
kind to buy, subscribe, sell or redeem any investment instruments or to
perform other such transactions of any kind. Man Investments reserves the
right to monitor, record and retain all electronic communications through its
network to ensure the integrity of its systems, for record keeping and
regulatory purposes.  
> Visit us at: www.maninvestments.com  
> TG0908 
> ********************************************************************** 
>  


From kkovachev at varna.net  Mon Jul 13 15:06:01 2009
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Mon, 13 Jul 2009 18:06:01 +0300
Subject: [Linux-cluster] GNBD vs DRBD to mirror two disks via the network
In-Reply-To: <1247497380.20189.534.camel@breeves.fab.redhat.com>
References: <B510590D118044449CEB7415E870EF0D027D3A86@mildnpexmb02.maninvestments.ad.man.com>
	<1247497380.20189.534.camel@breeves.fab.redhat.com>
Message-ID: <20090713145917.M20643@varna.net>

On Mon, 13 Jul 2009 16:03:00 +0100, Bryn M. Reeves wrote
> On Mon, 2009-07-13 at 15:40 +0100, Bushby, Bruce (London)(c) wrote:
> >  
> > Greetings!
> >  
> > I'm hoping a member could assist me in clearing up some understanding
> > I appear to be missing when it comes to GNBD.
> >  
> > Today I clustered two vmware machines (active/passive shared nothing)
> > and then configure GNBD....thats when I noticed it
> > wants to "export" a device from one node and "import" the device on
> > the other node...which I did and it all works....but that is
> > not what I was after.
> >  
> > Can GNBD be configured to keep two disks on different systems in sync
> > using both synchronous and asynchronous write options?
> 
> No. GNBD is just a cluster-aware network block device (i.e. it
> implements fencing so that we can cut off a failing node from the shared
> storage).
> 
> It's only useful if you want shared storage for e.g. GFS but don't have
> access to hardware shared storage resources (or only have shared storage
> hardware on a subset of nodes and want to spread that out to more nodes
> using IP).
> 

and a question from me here (asked a week ago but with no answer, so will try
to be more clear now).

'only have shared storage hardware on a subset of nodes ...' - OK so if we
have 3 nodes and 2 of them have access to the storage and are exporting it via
GNBD - how the 3-rd node may access the data from both of them and not
interupting the services in case of failure of either of the first two?

> Regards,
> Bryn.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From agx at sigxcpu.org  Mon Jul 13 15:53:24 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Mon, 13 Jul 2009 17:53:24 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
In-Reply-To: <20090709121958.GB18140@bogon.sigxcpu.org>
References: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
	<20090707162845.GA30094@bogon.sigxcpu.org>
	<1246988716.7993.13.camel@cerberus.int.fabbione.net>
	<20090709121958.GB18140@bogon.sigxcpu.org>
Message-ID: <20090713155324.GA20030@bogon.sigxcpu.org>

On Thu, Jul 09, 2009 at 02:19:58PM +0200, Guido G?nther wrote:
> This stuff needs soome more work to be uploadable but it's mostly there
> I think. I've pushed the git archives here:

[..snip..] 
I've updated the packages at:

deb http://pkg-libvirt.alioth.debian.org/packages unstable/i386/
deb http://pkg-libvirt.alioth.debian.org/packages unstable/all/

to cluster 3.0.0.
Cheers,
 -- Guido


From sdake at redhat.com  Mon Jul 13 16:06:13 2009
From: sdake at redhat.com (Steven Dake)
Date: Mon, 13 Jul 2009 09:06:13 -0700
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
In-Reply-To: <20090713155324.GA20030@bogon.sigxcpu.org>
References: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
	<20090707162845.GA30094@bogon.sigxcpu.org>
	<1246988716.7993.13.camel@cerberus.int.fabbione.net>
	<20090709121958.GB18140@bogon.sigxcpu.org>
	<20090713155324.GA20030@bogon.sigxcpu.org>
Message-ID: <1247501173.10789.2.camel@localhost.localdomain>

On Mon, 2009-07-13 at 17:53 +0200, Guido G?nther wrote:
> On Thu, Jul 09, 2009 at 02:19:58PM +0200, Guido G?nther wrote:
> > This stuff needs soome more work to be uploadable but it's mostly there
> > I think. I've pushed the git archives here:
> 
> [..snip..] 
> I've updated the packages at:
> 
> deb http://pkg-libvirt.alioth.debian.org/packages unstable/i386/
> deb http://pkg-libvirt.alioth.debian.org/packages unstable/all/
> 
> to cluster 3.0.0.
> Cheers,
>  -- Guido
> 
Guido,

Thanks for the work!  If there is anything we can do in upstream
corosync or openais to help simplify these efforts in the future, please
share your ideas.

regards
-steve

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From ricks at nerd.com  Mon Jul 13 19:16:00 2009
From: ricks at nerd.com (Rick Stevens)
Date: Mon, 13 Jul 2009 12:16:00 -0700
Subject: [Linux-cluster] service stuck in "starting" state
In-Reply-To: <20090711005904.GA15858@monsterjam.org>
References: <20090710233532.GA6696@monsterjam.org> <4A57D3B4.6040004@nerd.com>
	<20090711005904.GA15858@monsterjam.org>
Message-ID: <4A5B87F0.3040203@nerd.com>

jason at monsterjam.org wrote:
> On Fri, Jul 10, 2009 at 04:50:12PM -0700, Rick Stevens wrote:
>> jason at monsterjam.org wrote:
>>> hey cluster gurus..
>>> I have a 2 node cluster thats been running without issue for quite a 
>>> while.. all of a sudden one of the nodes will not completely start the 
>>> apache webserver service.. it looks like this [root at tf1 ~]# clustat
>>> Member Status: Quorate
>>>   Member Name                              Status
>>>   ------ ----                              ------
>>>   tf1                                      Online, Local, rgmanager
>>>   tf2                                      Online, rgmanager
>>>   Service Name         Owner (Last)                   State           
>>> ------- ----         ----- ------                   -----           Apache 
>>> Service       tf1                            starting          postfix 
>>> service      tf1                            started         [root at tf1 ~]# 
>>> and I see that the httpd is NOT started. although, if I do 
>>> /etc/init.d/httpd start
>>> the service starts without issue.
>>> grepping for apache and http in the logs, I see this..
>>> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
>>> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
>>> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
>>> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of 
>>> /etc/httpd/conf.d/ssl.conf:
>>> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file 
>>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>> Jul 10 14:33:57 tf1 httpd: httpd startup failed
>>> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of 
>>> /etc/httpd/conf.d/ssl.conf:
>>> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file 
>>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>> Jul 10 14:34:06 tf1 httpd: httpd startup failed
>>> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd 
>>> stop Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
>>> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
>>> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd 
>>> stop Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
>>> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
>>> [root at tf1 log]# grep apache  messages
>>> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script 
>>> "cluster_apache" returned 1 (generic error) Jul 10 10:04:33 tf1 
>>> clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 
>>> (generic error) Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on 
>>> script "cluster_apache" returned 1 (generic error) Jul 10 16:23:34 tf1 
>>> clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 
>>> (generic error) Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on 
>>> script "cluster_apache" returned 1 (generic error) [root at tf1 log]# Im 
>>> guessing its the  stop on script "cluster_apache" returned 1 (generic 
>>> error)
>>> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both the 
>>> same size
>>> [root at tf2 ~]# ls -al /etc/init.d/httpd
>>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>>> [root at tf1 log]# ls -al /etc/init.d/httpd
>>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>>> and the apache service starts/stops just fine on tf2 when the services get 
>>> failed over to that machine.
>>> any ideas on what can be wrong?
>> tf1 is complaining about a bad SSL cert.  The fact that it's complaining
>> when being started by clurgmgrd but not when started manually indicates
>> that clurgmgrd is starting it differently (specifying a different
>> httpd.conf file perhaps?).
> 
> well, heres the relevant part of my config file
>         <rm>
>                 <failoverdomains>
>                         <failoverdomain name="httpd" ordered="1" restricted="1">
>                                 <failoverdomainnode name="tf1" priority="1"/>
>                                 <failoverdomainnode name="tf2" priority="2"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources>
>                         <script file="/etc/init.d/httpd" name="cluster_apache"/>
>                         <ip address="192.168.1.7" monitor_link="1"/>
>                         <script file="/etc/init.d/postfix" name="cluster_posstfix"/>
>                 </resources>
>                 <service autostart="1" domain="httpd" name="Apache Service">
>                         <ip ref="192.168.1.7"/>
>                         <script ref="cluster_apache"/>
>                 </service>
>                 <service autostart="1" domain="httpd" name="postfix service">
>                         <ip ref="192.168.1.7"/>
>                         <script ref="cluster_posstfix"/>
>                 </service>
>         </rm>
> 
> ive never seen that ssl error when starting the service manually.
> 
> 
> the other thing that I noticed.. is that when I try to do 
> 
> [root at tf1 cluster]# clusvcadm -d "Apache Service"
> Member tf1 disabling Apache Service...
> 
> it just hangs there and never returns.

Sorry about the delay in responding.  Was out of town for the weekend.

Does clusvcadm or clurgmgrd run as a different user...one that either
can't read the SSL certs or the directory containing them?  Normally
the stuff in /etc/init.d runs as root.  Running one of those scripts as
a different user can lead to lots of permissions issues.  It's bitten
me before.
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer                      ricks at nerd.com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
- Millihelen, adj: The amount of beauty required to launch one ship. -
----------------------------------------------------------------------


From ricks at nerd.com  Mon Jul 13 19:22:56 2009
From: ricks at nerd.com (Rick Stevens)
Date: Mon, 13 Jul 2009 12:22:56 -0700
Subject: [Linux-cluster] relocating all services
In-Reply-To: <3128ba140907130555g16e85229w30cfc06efcd3f91c@mail.gmail.com>
References: <3128ba140907130547m7e685e87pe854890651ab5af8@mail.gmail.com>	<8a5668960907130549n94d60bes13ea7b29ff1b227b@mail.gmail.com>
	<3128ba140907130555g16e85229w30cfc06efcd3f91c@mail.gmail.com>
Message-ID: <4A5B8990.7060606@nerd.com>

ESGLinux wrote:
> 2009/7/13 Juan Ramon Martin Blanco <robejrm at gmail.com>
> 
>>
>> On Mon, Jul 13, 2009 at 2:47 PM, ESGLinux <esggrupos at gmail.com> wrote:
>>
>>> Hi all,
>>> is there any way to relocate all the services that a node is running.
>>>
>>> I have 2 services apache and mysql using the same VIP. If I try to
>>> relocate first one service and then the second I get the error that the ip
>>> is already used. So I think the problem is because in a moment the services
>>> are running on diferent nodes.
>>>
>> You should have one IP for _each_ service
>>
>> Greetings,
>> Juanra
>>
> 
> 
> I suposse it, I?m going to try it and I?ll post my results

You could look at LinuxHA.  It creates a virtual IP (VIP) via aliases
depending on which node is the "master".  The service itself is tied to
this VIP.

We use it to flip MySQL between nodes.  sql1 is normally the master
node.  If it dies, LinuxHA on sql2 notices it, mounts the block device
holding the SQL database, starts MySQL and takes over the VIP.

Note that in this case we use DRBD to keep the mysql storage synced.
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer                      ricks at nerd.com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
-   UNIX is actually quite user friendly.  The problem is that it's  -
-              just very picky of who its friends are!               -
----------------------------------------------------------------------


From jason at monsterjam.org  Mon Jul 13 23:24:27 2009
From: jason at monsterjam.org (jason at monsterjam.org)
Date: Mon, 13 Jul 2009 19:24:27 -0400
Subject: [Linux-cluster] service stuck in "starting" state
In-Reply-To: <4A5B87F0.3040203@nerd.com>
References: <20090710233532.GA6696@monsterjam.org> <4A57D3B4.6040004@nerd.com>
	<20090711005904.GA15858@monsterjam.org> <4A5B87F0.3040203@nerd.com>
Message-ID: <20090713232427.GA29028@monsterjam.org>

On Mon, Jul 13, 2009 at 12:16:00PM -0700, Rick Stevens wrote:
> jason at monsterjam.org wrote:
>> On Fri, Jul 10, 2009 at 04:50:12PM -0700, Rick Stevens wrote:
>>> jason at monsterjam.org wrote:
>>>> hey cluster gurus..
>>>> I have a 2 node cluster thats been running without issue for quite a 
>>>> while.. all of a sudden one of the nodes will not completely start the 
>>>> apache webserver service.. it looks like this [root at tf1 ~]# clustat
>>>> Member Status: Quorate
>>>>   Member Name                              Status
>>>>   ------ ----                              ------
>>>>   tf1                                      Online, Local, rgmanager
>>>>   tf2                                      Online, rgmanager
>>>>   Service Name         Owner (Last)                   State           
>>>> ------- ----         ----- ------                   -----           
>>>> Apache Service       tf1                            starting          
>>>> postfix service      tf1                            started         
>>>> [root at tf1 ~]# and I see that the httpd is NOT started. although, if I do 
>>>> /etc/init.d/httpd start
>>>> the service starts without issue.
>>>> grepping for apache and http in the logs, I see this..
>>>> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
>>>> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
>>>> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
>>>> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of 
>>>> /etc/httpd/conf.d/ssl.conf:
>>>> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file 
>>>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>>> Jul 10 14:33:57 tf1 httpd: httpd startup failed
>>>> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of 
>>>> /etc/httpd/conf.d/ssl.conf:
>>>> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file 
>>>> '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>>>> Jul 10 14:34:06 tf1 httpd: httpd startup failed
>>>> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing 
>>>> /etc/init.d/httpd stop Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
>>>> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
>>>> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing 
>>>> /etc/init.d/httpd stop Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
>>>> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
>>>> [root at tf1 log]# grep apache  messages
>>>> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script 
>>>> "cluster_apache" returned 1 (generic error) Jul 10 10:04:33 tf1 
>>>> clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 
>>>> (generic error) Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on 
>>>> script "cluster_apache" returned 1 (generic error) Jul 10 16:23:34 tf1 
>>>> clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 
>>>> (generic error) Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on 
>>>> script "cluster_apache" returned 1 (generic error) [root at tf1 log]# Im 
>>>> guessing its the  stop on script "cluster_apache" returned 1 (generic 
>>>> error)
>>>> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both 
>>>> the same size
>>>> [root at tf2 ~]# ls -al /etc/init.d/httpd
>>>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>>>> [root at tf1 log]# ls -al /etc/init.d/httpd
>>>> -rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
>>>> and the apache service starts/stops just fine on tf2 when the services 
>>>> get failed over to that machine.
>>>> any ideas on what can be wrong?
>>> tf1 is complaining about a bad SSL cert.  The fact that it's complaining
>>> when being started by clurgmgrd but not when started manually indicates
>>> that clurgmgrd is starting it differently (specifying a different
>>> httpd.conf file perhaps?).
>> well, heres the relevant part of my config file
>>         <rm>
>>                 <failoverdomains>
>>                         <failoverdomain name="httpd" ordered="1" 
>> restricted="1">
>>                                 <failoverdomainnode name="tf1" 
>> priority="1"/>
>>                                 <failoverdomainnode name="tf2" 
>> priority="2"/>
>>                         </failoverdomain>
>>                 </failoverdomains>
>>                 <resources>
>>                         <script file="/etc/init.d/httpd" 
>> name="cluster_apache"/>
>>                         <ip address="192.168.1.7" monitor_link="1"/>
>>                         <script file="/etc/init.d/postfix" 
>> name="cluster_posstfix"/>
>>                 </resources>
>>                 <service autostart="1" domain="httpd" name="Apache 
>> Service">
>>                         <ip ref="192.168.1.7"/>
>>                         <script ref="cluster_apache"/>
>>                 </service>
>>                 <service autostart="1" domain="httpd" name="postfix 
>> service">
>>                         <ip ref="192.168.1.7"/>
>>                         <script ref="cluster_posstfix"/>
>>                 </service>
>>         </rm>
>> ive never seen that ssl error when starting the service manually.
>> the other thing that I noticed.. is that when I try to do [root at tf1 
>> cluster]# clusvcadm -d "Apache Service"
>> Member tf1 disabling Apache Service...
>> it just hangs there and never returns.
>
> Sorry about the delay in responding.  Was out of town for the weekend.
>
> Does clusvcadm or clurgmgrd run as a different user...one that either
> can't read the SSL certs or the directory containing them?  Normally
> the stuff in /etc/init.d runs as root.  Running one of those scripts as
> a different user can lead to lots of permissions issues.  It's bitten
> me before.

ok, so I think ive found out whats going on.. 
There is another custom program from some other 3rd party vendor that theyre trying to get going on 
this cluster.. It is somehow interfering with the apache service coming up.
If this 3rd party application is NOT started (from /etc/init.d/rc.local) when the server boots up, 
then the apache service comes up fine.. If the 3rd party application IS allowed to start when the 
server boots up, it somehow causes the apache service to not come up correctly, and I see the 
 Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  Apache Service       tf1                            starting
  postfix service      tf1                            started

but like I said earlier. 
/etc/init.d/httpd start 
still works fine either way..

If the 3rd party program is NOT running the cluster services come up fine on their own.

funky. 

thanks for the help,
Jason


From g.bagnoli at asidev.com  Mon Jul 13 23:36:50 2009
From: g.bagnoli at asidev.com (Giacomo Bagnoli)
Date: Tue, 14 Jul 2009 01:36:50 +0200
Subject: [Linux-cluster] Services not relocated after successful fencing
Message-ID: <1247528210.14169.68.camel@waste-bin>

Hi all, first mail to this mailing list.

I'm experimenting with the STABLE2 branch (using cluster-2.03.11
release) on a couple of gentoo servers (2 node cluster) using DRBD in
primary/primary.
I use rhcs for clvm, fencing, and failover of services (kvm with libvirt
and a primary/secondary drbd device used for backups). Every node has 3
gbit ethernet interfaces, two of them trunked in a bond device and used
for drbd replication and cluster communication, while the other as the
public interface.
cluster.conf is attached.

I've gone through all the step, configured cman, fenced using ipmi lan,
rgmanager (with vm.sh taken from git to use libvirt) and everything is
working as expected. At least issuing 

clusvcadm -M vm:vm01 -m node2 

makes the machine migrate to the other node. Similary
enabling/disabling/relocating a vm works too.

Obviously there's a problem :) While testing the failover I noticed a
behaviour similar to what reported on the ML in april
http://www.mail-archive.com/linux-cluster at redhat.com/msg05919.html

issuing a power off using ipmi on a node to simulate a failure I saw in
the log files:

fenced[9592]: node2 not a cluster member after 0 sec post_fail_delay
fenced[9592]: fencing node "node2"
fenced[9592]: can't get node number for node <garbage_here>
fenced[9592]: fence "node2" success

clustat then showed node2 as offline but its services were still marked
as "started" on the fenced node2. When node2 came back services did not
relocate back. 
I tried to trace the problem in the code, and found that in 
cluster-2.03.11/fence/fenced/agent.c

313         if (ccs_lookup_nodename(cd, victim, &victim_nodename) == 0)
314                 victim = victim_nodename;

then on line 358 victim_nodename is freed 

357                 if (victim_nodename)
358                         free(victim_nodename);

and than update_cman is called with "victim" as node name, failing as
the nodeid could not be retrieved (and garbage printed to syslog)

361                 if (!error) {
362                         update_cman(victim, good_device);
363                         break;

I admit that I miss why ccs_lookup_nodename returns 0, but delaying the
free call after the update_cman call makes everything works, services
relocate to the other node and when node2 comes back and rejoins the
cluster they migrate back to the original node, as expected.

Complete patch:
diff -Nuar a/fence/fenced/agent.c b/fence/fenced/agent.c
--- a/fence/fenced/agent.c        2009-01-22 13:33:51.000000000 +0100
+++ b/fence/fenced/agent.c        2009-07-14 01:19:26.385518781 +0200
@@ -354,14 +354,14 @@

                if (device)
                        free(device);
-               if (victim_nodename)
-                       free(victim_nodename);
                free(method);

                if (!error) {
                        update_cman(victim, good_device);
                        break;
                }
+               if (victim_nodename) 
+                       free(victim_nodename);
        }

        ccs_disconnect(cd);

The question is: should I open a bug on bugzilla? Or is my setup
(gentoo, vm.sh backported, etc) too unusual for this to being useful?
Or is it just a problem in the configuration?

Sorry for my English but I'm not a native speaker.

Regards,
	Giacomo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/xml
Size: 2707 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090714/f34e5748/attachment.wsdl>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090714/f34e5748/attachment.sig>

From T.Kumar at alcoa.com  Tue Jul 14 17:08:03 2009
From: T.Kumar at alcoa.com (Kumar, T Santhosh (TCS))
Date: Tue, 14 Jul 2009 13:08:03 -0400
Subject: [Linux-cluster] is it possible to configure TWO quorum disks on
	Redhat cluster suit?
Message-ID: <0C3FC6B507AF684199E57BFCA3EAB55333E7DC05@NOANDC-MXU11.NOA.Alcoa.com>


Recently we saw quorum lost situation and thought about adding another
one. is that supported? Any one currently using two quorum disks?

openais[17753]: [CMAN ] lost contact with quorum device
openais[17753]: [CMAN ] quorum lost, blocking activity


From fxmulder at gmail.com  Tue Jul 14 19:21:59 2009
From: fxmulder at gmail.com (James Devine)
Date: Tue, 14 Jul 2009 13:21:59 -0600
Subject: [Linux-cluster] GFS + VMWARE
Message-ID: <c30750500907141221s34f668ebs3c75bea37a03039@mail.gmail.com>

Does anybody have GFS setup on vsphere4?  I am not entirely sure how
to handle the fencing.  We plan to import an iscsi target and share
the same luns between multiple VMs.


From pradhanparas at gmail.com  Tue Jul 14 22:23:17 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 14 Jul 2009 17:23:17 -0500
Subject: [Linux-cluster] DRAC 4 fencing
Message-ID: <8b711df40907141523i430f2dd9k3fd5f98fa4903f3c@mail.gmail.com>

hi,

I believe fencing nodes using dell drac works using telnet which is
disabled by default. While trying to enable telnet using:

racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

I get the following error:

--
ERROR: RACADM is unable to process the requested subcommand because there is no
local RAC configuration to communicate with.

Local RACADM subcommand execution requires the following:

 1. A Remote Access Controller (RAC) must be present on the managed server
 2. Appropriate managed node software must be installed and running on the
    server
----

What is missing here?

And also,

what is racadm looking for  in the below case?

--
racadm -r 10.10.10.1 -u root -p calvin help getsysinfo
Failed to load SSL library
-------

I have openssl installed..

Thanks in Adv
Paras


From tharindub at millenniumit.com  Wed Jul 15 05:09:46 2009
From: tharindub at millenniumit.com (Tharindu Rukshan Bamunuarachchi)
Date: Wed, 15 Jul 2009 10:39:46 +0530
Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
Message-ID: <005301ca050a$7a469ba0$6ed3d2e0$@com>


hi All,

 
SuSE 11 kernel comes with GFS2 module.

Where I can find cluster software for GFS2 deployment.

 
In source.redhat.com there are few cluster versions. Which version should be
compatible with 2.6.27* kernel.

Do I have to build module from cluster source package ?

 
Have you guys tried GFS on SuSE 11 or openSuSE 11.1 ?

 
cheers

__

tharindu

 
*******************************************************************************************************************************************************************

"The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer."

*******************************************************************************************************************************************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/d521a091/attachment.htm>

From esggrupos at gmail.com  Wed Jul 15 07:00:15 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Wed, 15 Jul 2009 09:00:15 +0200
Subject: [Linux-cluster] EMC vs HP EVA
Message-ID: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>

Hi folks,
Finally, I?m decided to by a shared storage to use it for clustering.

I have seen in the web that this two solutions are well considered, so my
question is for all of you that have experience with them.

Which in your opinion is best? any web that makes an imparcial comparation
between them?

The final solution is for a medium size company, so the price is also a
factor.

Thanks for your help,

ESG
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/764a7294/attachment.htm>

From esggrupos at gmail.com  Wed Jul 15 07:35:16 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Wed, 15 Jul 2009 09:35:16 +0200
Subject: [Linux-cluster] Clustering service PostgreSQL
Message-ID: <3128ba140907150035k2650d0acu9cf183ae3ad7f163@mail.gmail.com>

Hi all,
Im trying to cluster the PostgreSQL database in active-passive mode with two
nodes.

I have tested it and If I fence one node, it fails over the other node, but
when I run the command to manually relocate the service:

clusvcadm -r BBDD -m NODO1

when I have open connections open on the database I get this  error messages
on /var/log/messages:

Jul 15 09:27:29 NODO2 clurgmgrd[2493]: <notice> Stopping service
service:BBDD
Jul 15 09:27:33 NODO2 clurgmgrd: [2493]: <err> Stopping Service
postgres-8:BBDD > Failed - Application Is Still Running
Jul 15 09:27:33 NODO2 clurgmgrd: [2493]: <err> Stopping Service
postgres-8:BBDD > Failed
Jul 15 09:27:33 NODO2 clurgmgrd[2493]: <notice> stop on postgres-8 "BBDD"
returned 1 (generic error)
Jul 15 09:27:33 NODO2 clurgmgrd: [2493]: <err> Stopping Service
postgres-8:BBDD > Failed
Jul 15 09:27:33 NODO2 clurgmgrd[2493]: <notice> stop on postgres-8 "BBDD"
returned 1 (generic error)
Jul 15 09:27:33 NODO2 avahi-daemon[2304]: Withdrawing address record for
192.168.1.183 on eth0.
Jul 15 09:27:43 NODO2 clurgmgrd[2493]: <crit> #12: RG service:BBDD failed to
stop; intervention required
Jul 15 09:27:43 NODO2 clurgmgrd[2493]: <notice> Service service:BBDD is
failed
Jul 15 09:27:43 NODO2 clurgmgrd[2493]: <warning> #70: Failed to relocate
service:BBDD; restarting locally
Jul 15 09:27:43 NODO2 clurgmgrd[2493]: <err> #43: Service service:BBDD has
failed; can not start.
Jul 15 09:27:44 NODO2 clurgmgrd[2493]: <alert> #2: Service service:BBDD
returned failure code.  Last Owner: NODO2.
Jul 15 09:27:44 NODO2 clurgmgrd[2493]: <alert> #4: Administrator
intervention required.

If I check the status of the service:
Service Name                                                     Owner
(Last)                                                     State
 ------- ----                                                     -----
------                                                     -----
 service:BBDD                                                     (NODO2)
                                        failed

and If I check with ps:

# ps aux | grep postgres
root     21552  0.0  0.2   2844  1120 ?        S<   09:27   0:00 su -
postgres -c /usr/bin/postmaster -c
config_file="/etc/cluster/postgres-8/postgres-8:BBDD/postgresql.conf" ??-D
/nfsvol/pgsql/data
postgres 21553  0.1  0.5  21504  3076 ?        S<s  09:27   0:00
/usr/bin/postmaster -c
config_file=/etc/cluster/postgres-8/postgres-8:BBDD/postgresql.conf -D
/nfsvol/pgsql/data
postgres 21591  0.0  0.1  11284   608 ?        S<   09:27   0:00 postgres:
logger process

postgres 21593  0.0  0.1  21504   896 ?        S<   09:27   0:00 postgres:
writer process

postgres 21594  0.0  0.1  12284   608 ?        S<   09:27   0:00 postgres:
stats buffer process

postgres 21595  0.0  0.1  11428   804 ?        S<   09:27   0:00 postgres:
stats collector process

postgres 21720  0.0  0.8  22280  4328 ?        S<   09:27   0:00 postgres:
postgres postgres 192.168.1.170(2849) idle


So, I think the problem is with the stop script that comes with the cluster
suite. I think it must close all the open connections or wait until they
finish...

Anyone has this problem? and how can it be solved

Thanks in advance

ESG
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/73189b34/attachment.htm>

From robejrm at gmail.com  Wed Jul 15 07:41:59 2009
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 15 Jul 2009 09:41:59 +0200
Subject: [Linux-cluster] EMC vs HP EVA
In-Reply-To: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>
References: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>
Message-ID: <8a5668960907150041m1162447by49981be3a0af3725@mail.gmail.com>

On Wed, Jul 15, 2009 at 9:00 AM, ESGLinux <esggrupos at gmail.com> wrote:

> Hi folks,
> Finally, I?m decided to by a shared storage to use it for clustering.
>
> I have seen in the web that this two solutions are well considered, so my
> question is for all of you that have experience with them.
>
> Which in your opinion is best? any web that makes an imparcial comparation
> between them?
>
> The final solution is for a medium size company, so the price is also a
> factor.
>
How many machines are you planning to form the cluster? If you are planning
to use 3 machines I recommend you some SAS DAS from IBM or SUN, are much
cheaper than EVA or  EMC. Even any of the fiber equivalents of them with a
fiber switch can be cheaper than the lasts (in case you are planning to
connect more than 3 machines to the cluster).

Regards,
Juanra

>
> Thanks for your help,
>
> ESG
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/02b201c3/attachment.htm>

From daniel.oberdick at barmenia.de  Wed Jul 15 08:02:10 2009
From: daniel.oberdick at barmenia.de (daniel.oberdick at barmenia.de)
Date: Wed, 15 Jul 2009 10:02:10 +0200
Subject: =?ISO-8859-1?Q?RE=3A_[Linux-cluster]_fence=5Fscsi_support_in_multipath?=
	=?ISO-8859-1?Q?_env_in_RHEL5=2E3_[Virenpr=FCfung_durchgef=FChrt]?=
In-Reply-To: <FD716ADFEE797543AA8B2AB0B3F9148E06C606CB@rkamsem701.emea.roche.com>
Message-ID: <OF68DF6E83.73C24B6B-ONC12575F4.002BA2E2-C12575F4.002C24DB@barmenia.de>

hi,
would like to know if anyone has configured  successfully  fence_scsi with 
mpio  and emc clariion.

we are getting a new storage (emc vmax) and i would like to know if 
fence_scsi is usable in productive environment.

thanks and greetings

Daniel

Barmenia Versicherungen

Abt. IT-Betrieb
Datenbank- & Systemadministration

Kronprinzenallee 12-18
42119 Wuppertal

E-Mail: daniel.oberdick at barmenia.de
Internet: http://barmenia.de
Telefon: +49 202 / 438 3195
Telefax: +49 202 / 438 03 3195


"Moralejo, Alfredo" <alfredo.moralejo at roche.com> 
Gesendet von: linux-cluster-bounces at redhat.com
11.06.2009 00:29
Bitte antworten an
linux clustering <linux-cluster at redhat.com>


An
linux clustering <linux-cluster at redhat.com>
Kopie

Thema
RE: [Linux-cluster] fence_scsi support in multipath env in RHEL5.3 
[Virenpr?fung durchgef?hrt]


I'm using the config provide by Red Hat by default:

       device {
               vendor                  "DGC"
                product                 ".*"
                product_blacklist       "LUN_Z"
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout            "/sbin/mpath_prio_emc /dev/%n"
                features                "1 queue_if_no_path"
                hardware_handler        "1 emc"
                path_grouping_policy    group_by_prio
                failback                immediate
                rr_weight               uniform
                no_path_retry           300
                rr_min_io               1000
                path_checker            emc_clariion
        }

-----Original Message-----
From: linux-cluster-bounces at redhat.com 
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara
Sent: Thursday, June 11, 2009 12:11 AM
To: linux clustering
Subject: Re: [Linux-cluster] fence_scsi support in multipath env in 
RHEL5.3

On Wed, Jun 10, 2009 at 11:33:14PM +0200, Moralejo, Alfredo wrote:

> Anyone is successfully using it?

What path checker are you using? I've heard that certain path checkers
cause problems, but I honestly don't know enough about dm-multipath
to understand the reason for this.

I have successfully used it with RDAC.

Ryan

> I'm testing it with a clariion storage frame on RHEL 5.3, and as soon as 
I enable scsi_reserve, multipath starts failing and a path goes good and 
bad in a loop and scsi fencing fails sometimes, should I configure in a 
specific way multipath.conf?:
> 
> Jun 10 23:31:58 rmamseslab07 multipathd: mpath0: remaining active paths: 
3
> Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: reservation conflict
> Jun 10 23:32:00 rmamseslab07 kernel: sd 1:0:0:1: SCSI error: return code 
= 0x00000018
> Jun 10 23:32:00 rmamseslab07 kernel: end_request: I/O error, dev sdm, 
sector 79
> Jun 10 23:32:00 rmamseslab07 kernel: device-mapper: multipath: Failing 
path 8:192.
> Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: mark as failed
> Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active 
paths: 3
> Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent)
> Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already 
registered
> Jun 10 23:32:00 rmamseslab07 multipathd: sdm: emc_clariion_checker: Path 
healthy
> Jun 10 23:32:00 rmamseslab07 multipathd: 8:192: reinstated
> Jun 10 23:32:00 rmamseslab07 multipathd: mpathquorum: remaining active 
paths: 4
> Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: add map (uevent)
> Jun 10 23:32:00 rmamseslab07 multipathd: dm-10: devmap already 
registered
> Jun 10 23:32:02 rmamseslab07 multipathd: sdl: emc_clariion_checker: Path 
healthy
> Jun 10 23:32:02 rmamseslab07 multipathd: 8:176: reinstated
> Jun 10 23:32:02 rmamseslab07 multipathd: mpath0: remaining active paths: 
4
> Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: add map (uevent)
> Jun 10 23:32:02 rmamseslab07 multipathd: dm-9: devmap already registered
> Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: reservation conflict
> Jun 10 23:32:03 rmamseslab07 kernel: sd 1:0:0:0: SCSI error: return code 
= 0x00000018
> Jun 10 23:32:03 rmamseslab07 kernel: end_request: I/O error, dev sdl, 
sector 25256
> Jun 10 23:32:03 rmamseslab07 kernel: device-mapper: multipath: Failing 
path 8:176.
> Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: add map (uevent)
> Jun 10 23:32:03 rmamseslab07 multipathd: dm-9: devmap already registered
> Jun 10 23:32:03 rmamseslab07 multipathd: 8:176: mark as failed
> Jun 10 23:32:03 rmamseslab07 multipathd: mpath0: remaining active paths: 
3
> 
> 
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara
> Sent: Wednesday, June 10, 2009 9:18 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] fence_scsi support in multipath env in 
RHEL5.3
> 
> On Wed, Jun 10, 2009 at 02:11:21PM +0530, Rajeev P wrote:
> >  I wanted to know if fence_scsi is supported in a multipath 
environment for
> > RHEL5.3 release.
> 
> Yes, it is supported.
> 
> > In earlier releases of RHEL5 fence_scsi was not supported in a 
multipath
> > environment for RHEL5.3 release. If I am not wrong, this was because 
the
> > DM-MPIO driver forwarded the registration/unregistration commands on 
only on
> > one of the physical paths of a LUN.  Ideally it should have passed the
> > commands on all physical paths.
> > 
> > For RHEL5.3, is this issue resolved so that I can fence_scsi in 
multipath
> > environment.
> > 
> > Thanks in advance.
> > 
> > Rajeev
> 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


----------------------------------------------------------------------------------
Barmenia Krankenversicherung a. G.
Vorstand: Dr. h. c. Josef Beutelmann (Vorsitzender) - Dr. Andreas Eurich - Norbert Lessmann - Heinz-Werner Richter - Martin Risse
Aufsichtsrats-Vors.: G?nter V?lker; Rechtsform des Unternehmens: Versicherungsverein auf Gegenseitigkeit; 
Sitz: Wuppertal; Amtsgericht Wuppertal HRB 3871; St.-Nr. 132/5906/0047
----------------------------------------------------------------------------------
Barmenia Allgemeine Versicherungs-AG
Vorstand: Dr. h. c. Josef Beutelmann (Vorsitzender) - Dr. Andreas Eurich - Norbert Lessmann - Heinz-Werner Richter - Martin Risse
Aufsichtsrats-Vors.: G?nter V?lker; Rechtsform des Unternehmens: Aktiengesellschaft;
Sitz: Wuppertal; Amtsgericht Wuppertal HRB 3033; St.-Nr. 132/5906/0025
----------------------------------------------------------------------------------
Barmenia Lebensversicherung a. G.
Vorstand: Dr. h. c. Josef Beutelmann (Vorsitzender) - Dr. Andreas Eurich - Norbert Lessmann - Heinz-Werner Richter - Martin Risse
Aufsichtsrats-Vors.: G?nter V?lker; Rechtsform des Unternehmens: Versicherungsverein auf Gegenseitigkeit;
Sitz: Wuppertal; Amtsgericht Wuppertal HRB 3854; St.-Nr. 132/5906/0058


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/1608dcc8/attachment.htm>

From esggrupos at gmail.com  Wed Jul 15 09:07:12 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Wed, 15 Jul 2009 11:07:12 +0200
Subject: [Linux-cluster] EMC vs HP EVA
In-Reply-To: <8a5668960907150041m1162447by49981be3a0af3725@mail.gmail.com>
References: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>
	<8a5668960907150041m1162447by49981be3a0af3725@mail.gmail.com>
Message-ID: <3128ba140907150207s56da0c38p9f368168f05e3aaf@mail.gmail.com>

Hi JuanRa,
for the moment I?m planning to use 2 nodes in the cluster.

I don?t want to use fiber to avoid buying a fiber switch. I?m going to use
iSCSI over ethernet with TOE support.

I?ll take a look at the SAS DAS that you told me,

Thanks for your answer

ESG

2009/7/15 Juan Ramon Martin Blanco <robejrm at gmail.com>

>
>
> On Wed, Jul 15, 2009 at 9:00 AM, ESGLinux <esggrupos at gmail.com> wrote:
>
>> Hi folks,
>> Finally, I?m decided to by a shared storage to use it for clustering.
>>
>> I have seen in the web that this two solutions are well considered, so my
>> question is for all of you that have experience with them.
>>
>> Which in your opinion is best? any web that makes an imparcial comparation
>> between them?
>>
>> The final solution is for a medium size company, so the price is also a
>> factor.
>>
> How many machines are you planning to form the cluster? If you are planning
> to use 3 machines I recommend you some SAS DAS from IBM or SUN, are much
> cheaper than EVA or  EMC. Even any of the fiber equivalents of them with a
> fiber switch can be cheaper than the lasts (in case you are planning to
> connect more than 3 machines to the cluster).
>
> Regards,
> Juanra
>
>>
>> Thanks for your help,
>>
>> ESG
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/e531f4a5/attachment.htm>

From tfrumbacher at gmail.com  Wed Jul 15 14:00:07 2009
From: tfrumbacher at gmail.com (Aaron Benner)
Date: Wed, 15 Jul 2009 08:00:07 -0600
Subject: [Linux-cluster] EMC vs HP EVA
In-Reply-To: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>
References: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>
Message-ID: <E920F707-A26F-4B94-A686-D97D49F3D61E@gmail.com>

We have an EMC AX150i here and it works very well, but EMC's policy  
makes it pretty expensive.  We just added 8 720 GB drives to our  
unit.  The drives could be purchased commodity for about $80-120, but  
to put them into the EMC box we had to purchase "certified" units with  
the mounting sleds for $500 per.  Seemed like a pretty steep markup at  
the time.

Otherwise, no complaints.

--AB

On Jul 15, 2009, at 1:00 AM, ESGLinux wrote:

> Hi folks,
>
> Finally, I?m decided to by a shared storage to use it for clustering.
>
> I have seen in the web that this two solutions are well considered,  
> so my question is for all of you that have experience with them.
>
> Which in your opinion is best? any web that makes an imparcial  
> comparation between them?
>
> The final solution is for a medium size company, so the price is  
> also a factor.
>
> Thanks for your help,
>
> ESG
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From Ed.Sanborn at genband.com  Wed Jul 15 14:12:51 2009
From: Ed.Sanborn at genband.com (Ed Sanborn)
Date: Wed, 15 Jul 2009 09:12:51 -0500
Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
In-Reply-To: <005301ca050a$7a469ba0$6ed3d2e0$@com>
References: <005301ca050a$7a469ba0$6ed3d2e0$@com>
Message-ID: <21E1652F6AA39A4B89489DCFC399C1FD171BDB@gbplamail.genband.com>

Tharindu,

 
 Did anyone answer you on this question?

 
Ed

 
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tharindu Rukshan
Bamunuarachchi
Sent: Wednesday, July 15, 2009 1:10 AM
To: Linux-cluster at redhat.com
Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

 
hi All,

 
SuSE 11 kernel comes with GFS2 module.

Where I can find cluster software for GFS2 deployment.

 
In source.redhat.com there are few cluster versions. Which version
should be compatible with 2.6.27* kernel.

Do I have to build module from cluster source package ?

 
Have you guys tried GFS on SuSE 11 or openSuSE 11.1 ?

 
cheers

__

tharindu

 
************************************************************************
************************************************************************
*******************

"The information contained in this email including in any attachment is
confidential and is meant to be read only by the person to whom it is
addressed. If you are not the intended recipient(s), you are prohibited
from printing, forwarding, saving or copying this email. If you have
received this e-mail in error, please immediately notify the sender and
delete this e-mail and its attachments from your computer."

************************************************************************
************************************************************************
*******************

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/e49bbaa5/attachment.htm>

From Ed.Sanborn at genband.com  Wed Jul 15 14:31:32 2009
From: Ed.Sanborn at genband.com (Ed Sanborn)
Date: Wed, 15 Jul 2009 09:31:32 -0500
Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
In-Reply-To: <fbfeff2669c7.4a5e2e2e@millenniumit.com>
References: <005301ca050a$7a469ba0$6ed3d2e0$@com>
	<21E1652F6AA39A4B89489DCFC399C1FD171BDB@gbplamail.genband.com>
	<fbfeff2669c7.4a5e2e2e@millenniumit.com>
Message-ID: <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com>

No, a great question and one I am interested in the answer along with
you.

 
From: Tharindu Bamunuarachchi [mailto:tharindub at millenniumit.com] 
Sent: Wednesday, July 15, 2009 10:30 AM
To: linux clustering
Cc: Ed Sanborn
Subject: Re: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

 
mmm ... not yet ...

is it a wrong question ??

----- Original Message -----
From: Ed Sanborn <Ed.Sanborn at genband.com>
Date: Wednesday, July 15, 2009 7:44 pm
Subject: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
To: linux clustering <linux-cluster at redhat.com>

> 


> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 


> 
> 


> 
> 

Tharindu,


> 
> 

 
> 
> 

 Did anyone answer you on this
> question?


> 
> 

 
> 
> 

Ed


> 
> 

 
> 
> 


> 
> 


> 
> 

From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tharindu
Rukshan
> Bamunuarachchi
> 
> Sent: Wednesday, July 15, 2009 1:10 AM
> 
> To: Linux-cluster at redhat.com
> 
> Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1


> 
> 


> 
> 


> 
> 

 
> 
> 

hi All,


> 
> 

 
> 
> 

SuSE 11 kernel comes with GFS2 module.


> 
> 

Where I can find cluster software for GFS2 deployment.


> 
> 

 
> 
> 

In source.redhat.com there are few cluster versions. Which
> version should be compatible with 2.6.27* kernel.


> 
> 

Do I have to build module from cluster source package ?


> 
> 

 
> 
> 

Have you guys tried GFS on SuSE 11 or openSuSE 11.1 ?


> 
> 

 
> 
> 

cheers


> 
> 

__


> 
> 

tharindu


> 
> 

 
> 
> 


> 
> 


> 

************************************************************************
************************************************************************
*******************
> 
> 
> 
> "The information contained in this email including in any attachment
is
> confidential and is meant to be read only by the person to whom it is
> addressed. If you are not the intended recipient(s), you are
prohibited from
> printing, forwarding, saving or copying this email. If you have
received this
> e-mail in error, please immediately notify the sender and delete this
e-mail
> and its attachments from your computer."
> 
> 
> 
>
************************************************************************
************************************************************************
*******************


> 
> 
> 


> 
> 

 
> 
> 


> 
> 
> 
> 
> > --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster 

************************************************************************
************************************************************************
*******************

"The information contained in this email including in any attachment is
confidential and is meant to be read only by the person to whom it is
addressed. If you are not the intended recipient(s), you are prohibited
from printing, forwarding, saving or copying this email. If you have
received this e-mail in error, please immediately notify the sender and
delete this e-mail and its attachments from your computer."

************************************************************************
************************************************************************
*******************

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/b8cf24ea/attachment.htm>

From cthulhucalling at gmail.com  Wed Jul 15 14:45:21 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Wed, 15 Jul 2009 07:45:21 -0700
Subject: [Linux-cluster] EMC vs HP EVA
In-Reply-To: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>
References: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>
Message-ID: <36df569a0907150745y57e2c554j5f209cc796089d5f@mail.gmail.com>

I don't know the whole particulars about your setup, but I'd consider
anything from EMC to be overkill as would an EVA unless you have some need
for such enterprise-class storage.

I'd look at an MSA from HP. You can go fiber, iSCSI or whatever with them.

On Wed, Jul 15, 2009 at 12:00 AM, ESGLinux <esggrupos at gmail.com> wrote:

> Hi folks,
> Finally, I?m decided to by a shared storage to use it for clustering.
>
> I have seen in the web that this two solutions are well considered, so my
> question is for all of you that have experience with them.
>
> Which in your opinion is best? any web that makes an imparcial comparation
> between them?
>
> The final solution is for a medium size company, so the price is also a
> factor.
>
> Thanks for your help,
>
> ESG
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/83343916/attachment.htm>

From lpleiman at redhat.com  Wed Jul 15 14:49:12 2009
From: lpleiman at redhat.com (Leo Pleiman)
Date: Wed, 15 Jul 2009 10:49:12 -0400 (EDT)
Subject: [Linux-cluster] EMC vs HP EVA
In-Reply-To: <1582998225.515531247669327530.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com>
Message-ID: <663526961.515591247669352379.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com>

Take a look at 3Par arrays. The price point is excellent. I have a customer who loves them, they just work.

http://www.3par.com


----- "Ian Hayes" <cthulhucalling at gmail.com> wrote:

> I don't know the whole particulars about your setup, but I'd consider
> anything from EMC to be overkill as would an EVA unless you have some
> need for such enterprise-class storage.
> 
> I'd look at an MSA from HP. You can go fiber, iSCSI or whatever with
> them.
> 
> 
> On Wed, Jul 15, 2009 at 12:00 AM, ESGLinux < esggrupos at gmail.com >
> wrote:
> 
> 
> Hi folks,
> 
> 
> Finally, I?m decided to by a shared storage to use it for clustering.
> 
> 
> I have seen in the web that this two solutions are well considered, so
> my question is for all of you that have experience with them.
> 
> 
> Which in your opinion is best? any web that makes an imparcial
> comparation between them?
> 
> 
> The final solution is for a medium size company, so the price is also
> a factor.
> 
> 
> Thanks for your help,
> 
> 
> ESG
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From tharindub at millenniumit.com  Wed Jul 15 14:29:50 2009
From: tharindub at millenniumit.com (Tharindu Bamunuarachchi)
Date: Wed, 15 Jul 2009 19:29:50 +0500
Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
In-Reply-To: <21E1652F6AA39A4B89489DCFC399C1FD171BDB@gbplamail.genband.com>
References: <005301ca050a$7a469ba0$6ed3d2e0$@com>
	<21E1652F6AA39A4B89489DCFC399C1FD171BDB@gbplamail.genband.com>
Message-ID: <fbfeff2669c7.4a5e2e2e@millenniumit.com>


mmm ... not yet ...is it a wrong question ??----- Original Message -----From: Ed Sanborn <Ed.Sanborn at genband.com>Date: Wednesday, July 15, 2009 7:44 pmSubject: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1To: linux clustering <linux-cluster at redhat.com>> > > > > > <BR>> <!-- /* Font Definitions */ @font-face	{font-family:Cambria;	panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face	{font-family:Calibri;	panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face	{font-family:Tahoma;	panose-1:2 11 6 4 3 5 4 4 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal	{margin:0in;	margin-bottom:.0001pt;	font-size:11.0pt;	font-family:"Calibri","sans-serif";}a:link, span.MsoHyperlink	{mso-style-priority:99;	color:blue;	text-decoration:underline;}a:visited, span.MsoHyperlinkFollowed	{mso-style-priority:99;	color:purple;	text-decoration:underline;}span.EmailStyle17	{mso-style-type:personal;	font-family:"Calibri","sans-serif";	color:windowtext;}span.EmailStyle18	{mso-style-type:personal-reply;	font-family:"Calibri","sans-serif";	color:#1F497D;}.MsoChpDefault	{mso-style-type:export-only;	font-size:10.0pt;}@page Section1	{size:8.5in 11.0in;	margin:1.0in 1.0in 1.0in 1.0in;}div.Section1	{page:Section1;}--><BR>> > > > > > > > > Tharindu,> > ?> > ?Did anyone answer you on this> question?> > ?> > Ed> > ?> > > > > > From: linux-cluster-bounces at redhat.com> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tharindu Rukshan> Bamunuarachchi> > Sent: Wednesday, July 15, 2009 1:10 AM> > To: Linux-cluster at redhat.com> > Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1> > > > > > ?> > hi All,> > ?> > SuSE 11 kernel comes with GFS2 module.> > Where I can find cluster software for GFS2 deployment.> > ?> > In source.redhat.com there are few cluster versions. Which> version should be compatible with 2.6.27* kernel.> > Do I have to build module from cluster source package ?> > ?> > Have you guys tried GFS on SuSE 11 or openSuSE 11.1 ?> > ?> > cheers> > __> > tharindu> > ?> > > >  >  *******************************************************************************************************************************************************************> >  > >  "The information contained in this email including in any attachment is>  confidential and is meant to be read only by the person to whom it is>  addressed. If you are not the intended recipient(s), you are prohibited from>  printing, forwarding, saving or copying this email. If you have received this>  e-mail in error, please immediately notify the sender and delete this e-mail>  and its attachments from your computer."> >  > >  *******************************************************************************************************************************************************************>  > > > > ?> > > > > > > > --> Linux-cluster mailing list> Linux-cluster at redhat.com> https://www.redhat.com/mailman/listinfo/linux-cluster


*******************************************************************************************************************************************************************

"The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer."

*******************************************************************************************************************************************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/880407f4/attachment.htm>

From j.buzzard at dundee.ac.uk  Wed Jul 15 15:05:05 2009
From: j.buzzard at dundee.ac.uk (Jonathan Buzzard)
Date: Wed, 15 Jul 2009 16:05:05 +0100
Subject: [Linux-cluster] EMC vs HP EVA
In-Reply-To: <E920F707-A26F-4B94-A686-D97D49F3D61E@gmail.com>
References: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>
	<E920F707-A26F-4B94-A686-D97D49F3D61E@gmail.com>
Message-ID: <1247670305.725.63.camel@penguin.lifesci.dundee.ac.uk>


On Wed, 2009-07-15 at 08:00 -0600, Aaron Benner wrote:
> We have an EMC AX150i here and it works very well, but EMC's policy  
> makes it pretty expensive.  We just added 8 720 GB drives to our  
> unit.  The drives could be purchased commodity for about $80-120, but  
> to put them into the EMC box we had to purchase "certified" units with  
> the mounting sleds for $500 per.  Seemed like a pretty steep markup at  
> the time.

That is par for the course, they all come in platinum sleds whether you
get EMC, HP, IBM etc.

Last time I looked list price for a 1TB SATA drive for an IBM DS5000 was
something like 2000 USD.

JAB.

-- 
Jonathan A. Buzzard                      Tel: +441382-386998
Storage Administrator, College of Life Sciences
University of Dundee, DD1 5EH


From teigland at redhat.com  Wed Jul 15 15:06:10 2009
From: teigland at redhat.com (David Teigland)
Date: Wed, 15 Jul 2009 10:06:10 -0500
Subject: [Linux-cluster] Problem with fenced on cluster with 2 BladeCenter
	machines: 1st machine is remove physically. The remaining one
	does not became Active (waiting for fenced)
In-Reply-To: <1184248234.4533.6.camel@xw9300.bidmc.harvard.edu>
References: <OFF001E015.F1058F8E-ONC1257316.00483F06@frcl.bull.fr>
	<1184248234.4533.6.camel@xw9300.bidmc.harvard.edu>
Message-ID: <20090715150610.GB29084@redhat.com>

On Thu, Jul 12, 2007 at 09:50:34AM -0400, Robert Hurst wrote:
> On a related note, what is the correct value for clean_start in
> cluster.conf?

0 (disabled) which is the default if you don't define it at all.

> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
> 
> The man page states it should be set to zero "0"... you have it set to
> "1", which intuitively, makes more sense.
> 
>        To disable fencing at domain-creation time entirely, the -c
> option can
>        be  used  to  declare  that  all nodes are in a clean or safe
> state to
>        start.  The clean_start cluster.conf option can  also  be  set
> to  do
>        this,  but automatically disabling startup fencing in
> cluster.conf can
>        risk file system corruption.
> 
>        Clean-start  is  used  to prevent any startup fencing the daemon
> might
>        do.  It indicates that the daemon should assume all  nodes  are
> in  a
>        clean state to start.
> 
>          <fence_daemon clean_start="0">
>          </fence_daemon>

Setting clean_start to 1 is a bad idea because of the corruption risk
mentioned above.

Dave


From pradhanparas at gmail.com  Wed Jul 15 20:04:38 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Wed, 15 Jul 2009 15:04:38 -0500
Subject: [Linux-cluster] DRAC 4
Message-ID: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com>

hi,

I am using centos 5.3 in dell poweredge 1850 servers . Drac is 4/i.  I
am working on to create a cluster using 3 poweredge nodes and I am in
need to use DRAC as fencing device.

While testing fencing using fence_drac it complains me as:

root at tst1 ~]# fence_drac -a 10.10.10.2 -l user -p calvin
failed: telnet open failed: problem connecting to "10.10.10.2", port
23: Connection refused

I tried to enable telnet using racadm but got the following error.

[root at tst1 ~]# racadm config -g cfgSerial -o cfgSerialSshEnable 1
ERROR: RACADM is unable to process the requested subcommand because there is no
local RAC configuration to communicate with.

Local RACADM subcommand execution requires the following:

 1. A Remote Access Controller (RAC) must be present on the managed server
 2. Appropriate managed node software must be installed and running on the
    server


I am really stuck here. Any one having the similar problem?

Thanks
Paras.


From ggfelix at gmail.com  Wed Jul 15 21:41:37 2009
From: ggfelix at gmail.com (Guilherme G. Felix)
Date: Wed, 15 Jul 2009 18:41:37 -0300
Subject: [Linux-cluster] clurgmgrd hang/stuck
Message-ID: <6b83838d0907151441w5fe9dd24ve7f45058e4c12dae@mail.gmail.com>

Hi all,

I'm having a odd problem with my 2 node cluster.

Everything starts up fine, I can relocate, restart, stop, start services
fine. I did perform all rg_test tests that were possible and everything is
working as designed.

However, for some weird reason rg_manager is freezing.

Yesterday, I raised the log_level to 7 and also started clurgmgrd with the
"-d" option. What happened is that is started to print all the status check
to my logging facility (which is what I expected to happen), however, around
8 hours later it stopped and clurgmgrd completely freezes and do not respond
to any "clusvcadm" commands. It also stopped to print anything to my log
files.

The only solution is to restart rg_manager on both nodes.

I tried to attach a strace to clurgmgrd process PID and got some timeout
errors, such as:

select(12, [10 11], NULL, NULL, {0, 908000}) = 0 (Timeout)

Although all the socks and FIFOs files are in place and with the correct
permissions. cman_tool, and ccs_tool are working just fine.

I noted that clurgmgrd isn't forking as it is expected to.

I also executed strace with the clusvcadm, here is the whole output.

[root at node1 ~]# strace clusvcadm -R "Service Web"
execve("/usr/sbin/clusvcadm", ["clusvcadm", "-R", "Service Web"], [/* 18
vars */]) = 0
brk(0)                                  = 0x83a4000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or
directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=40664, ...}) = 0
mmap2(NULL, 40664, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7fa6000
close(3)                                = 0
open("/usr/lib/libcman.so.2", O_RDONLY) = 3
read(3,
"\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000MA\0004\0\0\0"..., 512) =
512
fstat64(3, {st_mode=S_IFREG|0755, st_size=17368, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0xb7fa5000
mmap2(0x414000, 18952, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0)
= 0x414000
mmap2(0x418000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3) = 0x418000
close(3)                                = 0
open("/lib/libpthread.so.0", O_RDONLY)  = 3
read(3,
"\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000\330\265\0004\0\0\0"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=125612, ...}) = 0
mmap2(0xb59000, 90592, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0)
= 0xb59000
mmap2(0xb6c000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x12) = 0xb6c000
mmap2(0xb6e000, 4576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb6e000
close(3)                                = 0
open("/lib/libdl.so.2", O_RDONLY)       = 3
read(3,
"\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0P:\265\0004\0\0\0"..., 512)
= 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=16428, ...}) = 0
mmap2(0xb53000, 12408, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0)
= 0xb53000
mmap2(0xb55000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1) = 0xb55000
close(3)                                = 0
open("/usr/lib/libncurses.so.5", O_RDONLY) = 3
read(3,
"\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\204\252\0034\0\0\0"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=297464, ...}) = 0
mmap2(0x3a9a000, 297220, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x3a9a000
mmap2(0x3ada000, 32768, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x40) = 0x3ada000
mmap2(0x3ae2000, 2308, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3ae2000
close(3)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 3
read(3,
"\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320\237\237\0004\0\0\0"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1606808, ...}) = 0
mmap2(0x9e4000, 1324452, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x110000
mmap2(0x24e000, 12288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13e) = 0x24e000
mmap2(0x251000, 9636, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x251000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0xb7fa4000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0xb7fa3000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7fa36c0, limit:1048575,
seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1,
seg_not_present:0, useable:1}) = 0
mprotect(0x24e000, 8192, PROT_READ)     = 0
mprotect(0xb55000, 4096, PROT_READ)     = 0
mprotect(0xb6c000, 4096, PROT_READ)     = 0
mprotect(0x9e0000, 4096, PROT_READ)     = 0
munmap(0xb7fa6000, 40664)               = 0
set_tid_address(0xb7fa3708)             = 2488
set_robust_list(0xb7fa3710, 0xc)        = 0
futex(0xbfdb0074, FUTEX_WAKE_PRIVATE, 1) = 0
rt_sigaction(SIGRTMIN, {0xb5d3d0, [], SA_SIGINFO}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0xb5d2e0, [], SA_RESTART|SA_SIGINFO}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
uname({sys="Linux", node="node1", ...}) = 0
geteuid32()                             = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTART}, {SIG_DFL, [], 0}, 8) =
0
brk(0)                                  = 0x83a4000
brk(0x83c5000)                          = 0x83c5000
socket(PF_FILE, SOCK_STREAM, 0)         = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
connect(3, {sa_family=AF_FILE, path="/var/run/cman_client"}, 110) = 0
open("/dev/zero", O_RDONLY)             = 4
writev(3, [{"NAMC\3\0\0\20\24\0\0\0\7\0\0\0\0\0\0\0", 20}], 1) = 20
recv(3, "NAMC\24\0\0\0h\3\0\0\7\0\0@\0\0\0\0", 20, 0) = 20
read(3,
"\2\0\0\0\250\1\0\0\1\0\0\0\1\0\0\0\0\0\0\0\264\2\0\0\2\0\0\0hows"..., 852)
= 852
writev(3, [{"NAMC\3\0\0\20\24\0\0\0\7\0\0\0\0\0\0\0", 20}], 1) = 20
recv(3, "NAMC\24\0\0\0h\3\0\0\7\0\0@\0\0\0\0", 20, 0) = 20
read(3,
"\2\0\0\0\250\1\0\0\1\0\0\0\1\0\0\0\0\0\0\0\264\2\0\0\2\0\0\0hows"..., 852)
= 852
writev(3, [{"NAMC\3\0\0\20\274\1\0\0\220\0\0\0\0\0\0\0", 20},
{"x^\372\267\0\0\0\0\0P\372\267\304 at A\0008Z\372\267\300\17\236\0\0\373\332\2774\374\332\277"...,
424}], 2) = 444
recv(3, "NAMC\24\0\0\0\300\1\0\0\220\0\0@\0\0\0\0", 20, 0) = 20
read(3,
"\0\0\0\0\250\1\0\0\1\0\0\0\1\0\0\0\0\0\0\0\264\2\0\0\2\0\0\0hows"..., 428)
= 428
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0xb7faf000
write(1, "Local machine trying to restart "..., 54Local machine trying to
restart service:Service Web...) = 54
socket(PF_FILE, SOCK_STREAM, 0)         = 5
connect(5, {sa_family=AF_FILE, path="/var/run/cluster/rgmanager.sk"}, 110) =
0
select(6, NULL, [5], [5], NULL)         = 1 (out [5])
write(5,
"h\0\0\0\4\275\321?\22:\274\0\0\0\0h\0\23\205\202\0\0\0\0\0\0\0\0\0\0\0\0"...,
112) = 112
select(6, [5], NULL, [5], NULL <unfinished ...>

     As it wasn't executing what it was supposed to do I ctrl+c'ed it.

     Following are my servers info. All of them running the same kernel and
versions.

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
# uname -a
Linux hows001nex 2.6.18-128.1.10.el5PAE #1 SMP Wed Apr 29 14:24:53 EDT 2009
i686 i686 i386 GNU/Linux
# rpm -qa | egrep 'cman|rgm'
rgmanager-2.0.46-1.el5
cman-2.0.98-1.el5
# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="CLSCLU01" config_version="43" name="CLSCLU01">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="node1-rsa"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="node2-rsa"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_rsa" ipaddr="node1-rsa"
login="username" name="node1-rsa" passwd="password"/>
                <fencedevice agent="fence_rsa" ipaddr="node2-rsa"
login="username" name="node2-rsa" passwd="password"/>
        </fencedevices>
        <rm log_level="7">
                <failoverdomains>
                        <failoverdomain name="WEB" ordered="1"
restricted="1">
                                <failoverdomainnode name="node1"
priority="1"/>
                                <failoverdomainnode name="node2"
priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.9.16.40" monitor_link="1"/>
                        <ip address="10.9.16.41" monitor_link="1"/>
                        <ip address="10.9.16.45" monitor_link="1"/>
                        <ip address="10.9.16.46" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="WEB" name="Service Web"
recovery="relocate">
                        <ip ref="10.9.16.40"/>
                        <ip ref="10.9.16.41"/>
                        <ip ref="10.9.16.45"/>
                        <ip ref="10.9.16.46">
                                <script file="/etc/init.d/jboss423.sh"
name="Script Jboss423">
                                        <script file="/etc/init.d/httpd"
name="Script Apache2"/>
                                        <script file="/etc/init.d/xinetd"
name="Script Xinetd">
                                                <script
file="/etc/init.d/cron-user.sh" name="Script Crond User"/>
                                        </script>
                                </script>
                        </ip>
                </service>
        </rm>
</cluster>

Thank you,

- G. Felix
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/cfc3e3e0/attachment.htm>

From crosa at redhat.com  Wed Jul 15 22:27:08 2009
From: crosa at redhat.com (Cleber Rosa)
Date: Wed, 15 Jul 2009 18:27:08 -0400 (EDT)
Subject: [Linux-cluster] DRAC 4
In-Reply-To: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com>
Message-ID: <14860381.31247696672994.JavaMail.cleber@localhost.localdomain>

Paras, 

Maybe you should try using the "secure" (ssh) option. 

CR. 


--- 
Cleber Rodrigues < crosa at redhat.com > 
Solutions Architect - Red Hat, Inc. 
Mobile: +55 61 9185.3454 

----- Original Message ----- 
From: "Paras pradhan" <pradhanparas at gmail.com> 
To: linux-poweredge at lists.us.dell.com, "linux clustering" <linux-cluster at redhat.com> 
Sent: Wednesday, July 15, 2009 5:04:38 PM GMT -03:00 Brasilia 
Subject: [Linux-cluster] DRAC 4 

hi, 

I am using centos 5.3 in dell poweredge 1850 servers . Drac is 4/i. I 
am working on to create a cluster using 3 poweredge nodes and I am in 
need to use DRAC as fencing device. 

While testing fencing using fence_drac it complains me as: 

root at tst1 ~]# fence_drac -a 10.10.10.2 -l user -p calvin 
failed: telnet open failed: problem connecting to "10.10.10.2", port 
23: Connection refused 

I tried to enable telnet using racadm but got the following error. 

[root at tst1 ~]# racadm config -g cfgSerial -o cfgSerialSshEnable 1 
ERROR: RACADM is unable to process the requested subcommand because there is no 
local RAC configuration to communicate with. 

Local RACADM subcommand execution requires the following: 

1. A Remote Access Controller (RAC) must be present on the managed server 
2. Appropriate managed node software must be installed and running on the 
server 


I am really stuck here. Any one having the similar problem? 

Thanks 
Paras. 

-- 
Linux-cluster mailing list 
Linux-cluster at redhat.com 
https://www.redhat.com/mailman/listinfo/linux-cluster 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090715/f062dab6/attachment.htm>

From pradhanparas at gmail.com  Wed Jul 15 22:30:18 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Wed, 15 Jul 2009 17:30:18 -0500
Subject: [Linux-cluster] DRAC 4
In-Reply-To: <14860381.31247696672994.JavaMail.cleber@localhost.localdomain>
References: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com>
	<14860381.31247696672994.JavaMail.cleber@localhost.localdomain>
Message-ID: <8b711df40907151530g139f13e7wf70586e61494626c@mail.gmail.com>

On Wed, Jul 15, 2009 at 5:27 PM, Cleber Rosa<crosa at redhat.com> wrote:
> Paras,
>
> Maybe you should try using the "secure" (ssh) option.


Is it possible in DRAC 4?


Thanks
Paras.
>
> CR.
>
> ---
> Cleber Rodrigues <crosa at redhat.com>
> Solutions Architect - Red Hat, Inc.
> Mobile: +55 61 9185.3454
>
> ----- Original Message -----
> From: "Paras pradhan" <pradhanparas at gmail.com>
> To: linux-poweredge at lists.us.dell.com, "linux clustering"
> <linux-cluster at redhat.com>
> Sent: Wednesday, July 15, 2009 5:04:38 PM GMT -03:00 Brasilia
> Subject: [Linux-cluster] DRAC 4
>
> hi,
>
> I am using centos 5.3 in dell poweredge 1850 servers . Drac is 4/i. ?I
> am working on to create a cluster using 3 poweredge nodes and I am in
> need to use DRAC as fencing device.
>
> While testing fencing using fence_drac it complains me as:
>
> root at tst1 ~]# fence_drac -a 10.10.10.2 -l user -p calvin
> failed: telnet open failed: problem connecting to "10.10.10.2", port
> 23: Connection refused
>
> I tried to enable telnet using racadm but got the following error.
>
> [root at tst1 ~]# racadm config -g cfgSerial -o cfgSerialSshEnable 1
> ERROR: RACADM is unable to process the requested subcommand because there is
> no
> local RAC configuration to communicate with.
>
> Local RACADM subcommand execution requires the following:
>
> ?1. A Remote Access Controller (RAC) must be present on the managed server
> ?2. Appropriate managed node software must be installed and running on the
> ?? ?server
>
>
> I am really stuck here. Any one having the similar problem?
>
> Thanks
> Paras.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From abednegoyulo at yahoo.com  Thu Jul 16 02:46:24 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Wed, 15 Jul 2009 19:46:24 -0700 (PDT)
Subject: [Linux-cluster] Starting two-node cluster with only one node
Message-ID: <408681.82392.qm@web110416.mail.gq1.yahoo.com>


Using the config file below

<?xml version="1.0"?>
<cluster name="GFSCluster" config_version="5">
<cman expected_votes="1" two_node="1"/>
  <clusternodes><clusternode name="node01.company.com" votes="1" nodeid="1"><fence><method name="single"><device name="node01_ipmi"/></method></fence></clusternode><clusternode name="node02.company.com" votes="1" nodeid="2"><fence><method name="single"><device name="node02_ipmi"/></method></fence></clusternode></clusternodes>
  <fencedevices><fencedevice name="node01_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.5" login="root" passwd="********"/><fencedevice name="node02_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.7" login="root" passwd="********"/></fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

Is it possible to start the cluster by only bringing up one node? The reason why I asked is because currently bringing them up together produces a split brain, each of them member of the cluster GFSCluster of their own fencing each other. My plan is to bring up only one node to create a quorum then bring the other one up and manually join it to the existing cluster.

I have already don the start_clean approach but it seems it does not work.


From abednegoyulo at yahoo.com  Thu Jul 16 06:16:46 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Wed, 15 Jul 2009 23:16:46 -0700 (PDT)
Subject: [Linux-cluster] Starting two-node cluster with only one node
In-Reply-To: <408681.82392.qm@web110416.mail.gq1.yahoo.com>
Message-ID: <813634.56331.qm@web110409.mail.gq1.yahoo.com>


Tried it and now the two node cluster is running with only one node. My problem right now is how to force the second node to join the first node's cluster. Right now it is creating its own cluster and trying to fence the first node. I tried cman_tool leave on the second node but I got 

cman_tool: Error leaving cluster: Device or resource busy

clvmd and gfs are not running on the second node. What is running on the second node is cman. When I did 

service cman start

It took 5 approximately 5 minutes before I got the [ok] meassage. Am I missing something here? Not doing right? Should be doing something?


--- On Thu, 7/16/09, Abed-nego G. Escobal, Jr. <abednegoyulo at yahoo.com> wrote:

> From: Abed-nego G. Escobal, Jr. <abednegoyulo at yahoo.com>
> Subject: [Linux-cluster] Starting two-node cluster with only one node
> To: "linux clustering" <linux-cluster at redhat.com>
> Date: Thursday, 16 July, 2009, 10:46 AM
> 
> Using the config file below
> 
> <?xml version="1.0"?>
> <cluster name="GFSCluster" config_version="5">
> <cman expected_votes="1" two_node="1"/>
> ? <clusternodes><clusternode
> name="node01.company.com" votes="1"
> nodeid="1"><fence><method
> name="single"><device
> name="node01_ipmi"/></method></fence></clusternode><clusternode
> name="node02.company.com" votes="1"
> nodeid="2"><fence><method
> name="single"><device
> name="node02_ipmi"/></method></fence></clusternode></clusternodes>
> ? <fencedevices><fencedevice
> name="node01_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.5"
> login="root" passwd="********"/><fencedevice
> name="node02_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.7"
> login="root" passwd="********"/></fencedevices>
> ? <rm>
> ? ? <failoverdomains/>
> ? ? <resources/>
> ? </rm>
> </cluster>
> 
> Is it possible to start the cluster by only bringing up one
> node? The reason why I asked is because currently bringing
> them up together produces a split brain, each of them member
> of the cluster GFSCluster of their own fencing each other.
> My plan is to bring up only one node to create a quorum then
> bring the other one up and manually join it to the existing
> cluster.
> 
> I have already don the start_clean approach but it seems it
> does not work.
> 
> 
> ? ? ? 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


      Try the new Yahoo! Messenger. Now with all you love about messenger and more! http://ph.messenger.yahoo.com


From esggrupos at gmail.com  Thu Jul 16 08:27:09 2009
From: esggrupos at gmail.com (ESGLinux)
Date: Thu, 16 Jul 2009 10:27:09 +0200
Subject: [Linux-cluster] EMC vs HP EVA
In-Reply-To: <1247670305.725.63.camel@penguin.lifesci.dundee.ac.uk>
References: <3128ba140907150000l8475a7fx44374523df51b73d@mail.gmail.com>
	<E920F707-A26F-4B94-A686-D97D49F3D61E@gmail.com>
	<1247670305.725.63.camel@penguin.lifesci.dundee.ac.uk>
Message-ID: <3128ba140907160127p79279fd7uc7ebb662fa8831d9@mail.gmail.com>

Hi All,
First thanks for your answers,

My budget for buying the share storage is about 10.000 EUR

With this in mind, I want something good enough to make a goog cluster.

One of the things I want with this is get knowledge about the products used
in big implementations. This is the reason I?m thinking on EMC or EVA, ( I
don?t know with my budget I can afford this or I?m being a litle ingenious)

by the way anybody has told wich of this  family of products is best. (or if
there is something better. I have also seen Dell equallogic)

Greetings

ESG


2009/7/15 Jonathan Buzzard <j.buzzard at dundee.ac.uk>

>
> On Wed, 2009-07-15 at 08:00 -0600, Aaron Benner wrote:
> > We have an EMC AX150i here and it works very well, but EMC's policy
> > makes it pretty expensive.  We just added 8 720 GB drives to our
> > unit.  The drives could be purchased commodity for about $80-120, but
> > to put them into the EMC box we had to purchase "certified" units with
> > the mounting sleds for $500 per.  Seemed like a pretty steep markup at
> > the time.
>
> That is par for the course, they all come in platinum sleds whether you
> get EMC, HP, IBM etc.
>
> Last time I looked list price for a 1TB SATA drive for an IBM DS5000 was
> something like 2000 USD.
>
> JAB.
>
> --
> Jonathan A. Buzzard                      Tel: +441382-386998
> Storage Administrator, College of Life Sciences
> University of Dundee, DD1 5EH
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090716/8a820e6f/attachment.htm>

From tharindub at millenniumit.com  Thu Jul 16 08:06:53 2009
From: tharindub at millenniumit.com (Tharindu Rukshan Bamunuarachchi)
Date: Thu, 16 Jul 2009 14:06:53 +0600
Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
In-Reply-To: <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com>
References: <005301ca050a$7a469ba0$6ed3d2e0$@com>
	<21E1652F6AA39A4B89489DCFC399C1FD171BDB@gbplamail.genband.com>
	<fbfeff2669c7.4a5e2e2e@millenniumit.com>
	<21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com>
Message-ID: <000501ca05ec$633467c0$299d3740$@com>


Are you going to deploy SuSE 11 soon .

 
We could not use RedHat as RHEL 5.x do not support High Resolution Timers .

 
Now we can not continue, as GFS is not supported on SuSE 11 J

 
From: Ed Sanborn [mailto:Ed.Sanborn at genband.com] 
Sent: Wednesday, July 15, 2009 8:32 PM
To: Tharindu Bamunuarachchi; linux clustering
Subject: RE: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

 
No, a great question and one I am interested in the answer along with you.

 
From: Tharindu Bamunuarachchi [mailto:tharindub at millenniumit.com] 
Sent: Wednesday, July 15, 2009 10:30 AM
To: linux clustering
Cc: Ed Sanborn
Subject: Re: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

 
mmm ... not yet ...

is it a wrong question ??

----- Original Message -----
From: Ed Sanborn <Ed.Sanborn at genband.com>
Date: Wednesday, July 15, 2009 7:44 pm
Subject: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
To: linux clustering <linux-cluster at redhat.com>

> 


> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 


> 
> 


> 
> 

Tharindu,


> 
> 

 
> 
> 

 Did anyone answer you on this
> question?


> 
> 

 
> 
> 

Ed


> 
> 

 
> 
> 


> 
> 


> 
> 

From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tharindu Rukshan
> Bamunuarachchi
> 
> Sent: Wednesday, July 15, 2009 1:10 AM
> 
> To: Linux-cluster at redhat.com
> 
> Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1


> 
> 


> 
> 


> 
> 

 
> 
> 

hi All,


> 
> 

 
> 
> 

SuSE 11 kernel comes with GFS2 module.


> 
> 

Where I can find cluster software for GFS2 deployment.


> 
> 

 
> 
> 

In source.redhat.com there are few cluster versions. Which
> version should be compatible with 2.6.27* kernel.


> 
> 

Do I have to build module from cluster source package ?


> 
> 

 
> 
> 

Have you guys tried GFS on SuSE 11 or openSuSE 11.1 ?


> 
> 

 
> 
> 

cheers


> 
> 

__


> 
> 

tharindu


> 
> 

 
> 
> 


> 
> 


> 

****************************************************************************
****************************************************************************
***********
> 
> 
> 
> "The information contained in this email including in any attachment is
> confidential and is meant to be read only by the person to whom it is
> addressed. If you are not the intended recipient(s), you are prohibited
from
> printing, forwarding, saving or copying this email. If you have received
this
> e-mail in error, please immediately notify the sender and delete this
e-mail
> and its attachments from your computer."
> 
> 
> 
>
****************************************************************************
****************************************************************************
***********


> 
> 
> 


> 
> 

 
> 
> 


> 
> 
> 
> 
> > --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster 


****************************************************************************
****************************************************************************
***********

"The information contained in this email including in any attachment is
confidential and is meant to be read only by the person to whom it is
addressed. If you are not the intended recipient(s), you are prohibited from
printing, forwarding, saving or copying this email. If you have received
this e-mail in error, please immediately notify the sender and delete this
e-mail and its attachments from your computer."

****************************************************************************
****************************************************************************
***********

 
*******************************************************************************************************************************************************************

"The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer."

*******************************************************************************************************************************************************************


From rsetchfield at xcalibre.co.uk  Thu Jul 16 09:05:52 2009
From: rsetchfield at xcalibre.co.uk (Raymond Setchfield)
Date: Thu, 16 Jul 2009 10:05:52 +0100
Subject: [Linux-cluster] keepalived on centos
Message-ID: <4A5EED70.7080904@xcalibre.co.uk>

Hey Guys

Just looking to see if any of you know of good websites which will help 
on setting and troubleshooting keepalived? As I am coming across a 
problem which when having a look on google I am not getting very much 
help from. I dont know if this is a common occurrence with thin the logs 
or a configuration error or possibly a bug.


Getting errors like this showing in the logs;

Jul 16 10:01:56 loadbalancer-03 Keepalived: VRRP child process(5847) 
died: Respawning
Jul 16 10:01:56 loadbalancer-03 Keepalived: Remove a zombie pid file 
/var/run/vrrp.pid
Jul 16 10:01:56 loadbalancer-03 Keepalived: Starting VRRP child process, 
pid=5848
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Using MII-BMSR NIC 
polling thread...
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Registering Kernel 
netlink reflector
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Registering Kernel 
netlink command channel
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Registering gratutious 
ARP shared channel
Jul 16 10:01:56 loadbalancer-03 Keepalived_vrrp: Opening file 
'/etc/keepalived/keepalived.conf'

If anyone can help and point me in the right direction it would be 
greatly appreciated

Thanks in advance

Raymond


From abednegoyulo at yahoo.com  Thu Jul 16 10:07:54 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Thu, 16 Jul 2009 03:07:54 -0700 (PDT)
Subject: [Linux-cluster] Eliminating split brain
Message-ID: <472005.90546.qm@web110415.mail.gq1.yahoo.com>


Does converting a two node cluster to a three node cluster (adding one more member) eliminate the possibility of split brain?


      Feel safer online. Upgrade to the new, safer Internet Explorer 8 optimized for Yahoo! to put your mind at peace. It&#39;s free. Get IE8 here! http://downloads.yahoo.com/sg/internetexplorer/


From glaurence at networkenablers.com.au  Thu Jul 16 10:29:49 2009
From: glaurence at networkenablers.com.au (Geoffrey Laurence)
Date: Thu, 16 Jul 2009 20:29:49 +1000
Subject: [Linux-cluster] DRAC 4
In-Reply-To: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com>
References: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com>
Message-ID: <FB3C48D549F24844A81A51792F44AD690C1146@NE-SERVER02.networkenablers.local>

Hi,

I have had a simular problem with Fedora-10, I think some kernels have a
problem detecting the Drac.  Anyway I found that you can enable telnet
from the web interface.

>From the 'command box' on the Diagnostics tab you can type the following
commands,

'd3debug propget ENABLE_TELNET'  -  Prints if telnet is enabled
'd3debug propset ENABLE_TELNET=TRUE'  -  Enables telnet
'd3debug racadm racreset'  - Reboots the drac.

After you have enabled telnet and rebooted the drac, you should be able
to telnet to the drac card.

Hope this helps,
Geoffrey.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paras pradhan
Sent: Thursday, 16 July 2009 6:05 AM
To: linux-poweredge at lists.us.dell.com; linux clustering
Subject: [Linux-cluster] DRAC 4

hi,

I am using centos 5.3 in dell poweredge 1850 servers . Drac is 4/i.  I
am working on to create a cluster using 3 poweredge nodes and I am in
need to use DRAC as fencing device.

While testing fencing using fence_drac it complains me as:

root at tst1 ~]# fence_drac -a 10.10.10.2 -l user -p calvin
failed: telnet open failed: problem connecting to "10.10.10.2", port
23: Connection refused

I tried to enable telnet using racadm but got the following error.

[root at tst1 ~]# racadm config -g cfgSerial -o cfgSerialSshEnable 1
ERROR: RACADM is unable to process the requested subcommand because
there is no
local RAC configuration to communicate with.

Local RACADM subcommand execution requires the following:

 1. A Remote Access Controller (RAC) must be present on the managed
server
 2. Appropriate managed node software must be installed and running on
the
    server


I am really stuck here. Any one having the similar problem?

Thanks
Paras.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


From Ed.Sanborn at genband.com  Thu Jul 16 13:31:51 2009
From: Ed.Sanborn at genband.com (Ed Sanborn)
Date: Thu, 16 Jul 2009 08:31:51 -0500
Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
In-Reply-To: <000001ca05ec$63108c10$2931a430$@com>
References: <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com>
	<000001ca05ec$63108c10$2931a430$@com>
Message-ID: <21E1652F6AA39A4B89489DCFC399C1FD171BEA@gbplamail.genband.com>

Well, that's a bummer to here that GFS is not supported by SUSE 11.  I
think I'll end up 

installing SUSE 11 on another machine then and using NFS to access our
GFS filesystem.

 
Ed

 
From: Tharindu Rukshan Bamunuarachchi
[mailto:tharindub at millenniumit.com] 
Sent: Thursday, July 16, 2009 4:07 AM
To: Ed Sanborn; 'linux clustering'
Subject: RE: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

 
Are you going to deploy SuSE 11 soon ...

 
We could not use RedHat as RHEL 5.x do not support High Resolution
Timers ...

 
Now we can not continue, as GFS is not supported on SuSE 11 J

 
From: Ed Sanborn [mailto:Ed.Sanborn at genband.com] 
Sent: Wednesday, July 15, 2009 8:32 PM
To: Tharindu Bamunuarachchi; linux clustering
Subject: RE: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

 
No, a great question and one I am interested in the answer along with
you.

 
From: Tharindu Bamunuarachchi [mailto:tharindub at millenniumit.com] 
Sent: Wednesday, July 15, 2009 10:30 AM
To: linux clustering
Cc: Ed Sanborn
Subject: Re: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

 
mmm ... not yet ...

is it a wrong question ??

----- Original Message -----
From: Ed Sanborn <Ed.Sanborn at genband.com>
Date: Wednesday, July 15, 2009 7:44 pm
Subject: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
To: linux clustering <linux-cluster at redhat.com>

> 


> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 


> 
> 


> 
> 

Tharindu,


> 
> 

 
> 
> 

 Did anyone answer you on this
> question?


> 
> 

 
> 
> 

Ed


> 
> 

 
> 
> 


> 
> 


> 
> 

From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Tharindu
Rukshan
> Bamunuarachchi
> 
> Sent: Wednesday, July 15, 2009 1:10 AM
> 
> To: Linux-cluster at redhat.com
> 
> Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1


> 
> 


> 
> 


> 
> 

 
> 
> 

hi All,


> 
> 

 
> 
> 

SuSE 11 kernel comes with GFS2 module.


> 
> 

Where I can find cluster software for GFS2 deployment.


> 
> 

 
> 
> 

In source.redhat.com there are few cluster versions. Which
> version should be compatible with 2.6.27* kernel.


> 
> 

Do I have to build module from cluster source package ?


> 
> 

 
> 
> 

Have you guys tried GFS on SuSE 11 or openSuSE 11.1 ?


> 
> 

 
> 
> 

cheers


> 
> 

__


> 
> 

tharindu


> 
> 

 
> 
> 


> 
> 


> 

************************************************************************
************************************************************************
*******************
> 
> 
> 
> "The information contained in this email including in any attachment
is
> confidential and is meant to be read only by the person to whom it is
> addressed. If you are not the intended recipient(s), you are
prohibited from
> printing, forwarding, saving or copying this email. If you have
received this
> e-mail in error, please immediately notify the sender and delete this
e-mail
> and its attachments from your computer."
> 
> 
> 
>
************************************************************************
************************************************************************
*******************


> 
> 
> 


> 
> 

 
> 
> 


> 
> 
> 
> 
> > --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster 

************************************************************************
************************************************************************
*******************

"The information contained in this email including in any attachment is
confidential and is meant to be read only by the person to whom it is
addressed. If you are not the intended recipient(s), you are prohibited
from printing, forwarding, saving or copying this email. If you have
received this e-mail in error, please immediately notify the sender and
delete this e-mail and its attachments from your computer."

************************************************************************
************************************************************************
*******************

 
************************************************************************
************************************************************************
*******************

"The information contained in this email including in any attachment is
confidential and is meant to be read only by the person to whom it is
addressed. If you are not the intended recipient(s), you are prohibited
from printing, forwarding, saving or copying this email. If you have
received this e-mail in error, please immediately notify the sender and
delete this e-mail and its attachments from your computer."

************************************************************************
************************************************************************
*******************

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090716/94e8731d/attachment.htm>

From gordan at bobich.net  Thu Jul 16 13:50:05 2009
From: gordan at bobich.net (Gordan Bobic)
Date: Thu, 16 Jul 2009 14:50:05 +0100
Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
In-Reply-To: <21E1652F6AA39A4B89489DCFC399C1FD171BEA@gbplamail.genband.com>
References: <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com>	<000001ca05ec$63108c10$2931a430$@com>
	<21E1652F6AA39A4B89489DCFC399C1FD171BEA@gbplamail.genband.com>
Message-ID: <39426d35b502eb5ceccb659385880d6d@localhost>

> We could not use RedHat as RHEL 5.x do not support High Resolution
> Timers ...

Are you saying that changing the distribution is deemed less of a problem
than rebuilding the kernel with the high-res timer option ticked? The
kernel src.rpm package is there for a reason.

Gordan


From tharindub at millenniumit.com  Thu Jul 16 13:32:09 2009
From: tharindub at millenniumit.com (Tharindu Rukshan Bamunuarachchi)
Date: Thu, 16 Jul 2009 19:32:09 +0600
Subject: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1
In-Reply-To: <39426d35b502eb5ceccb659385880d6d@localhost>
References: <21E1652F6AA39A4B89489DCFC399C1FD171BDD@gbplamail.genband.com>
	<000001ca05ec$63108c10$2931a430$@com>
	<21E1652F6AA39A4B89489DCFC399C1FD171BEA@gbplamail.genband.com>
	<39426d35b502eb5ceccb659385880d6d@localhost>
Message-ID: <000801ca0619$d2febf60$78fc3e20$@com>


Why did not RedHat release 5.x kernel with High Resolution Timers enabled ?

We thought that particular kernel is not much stable with High Resolution Timers.

However, SuSE 11 supported High Resolution Timers AS-IS. 

Initially, we tried RedHat 5 on our system. But, In production environment we needed to have stable version.

I am not sure, but I think vendor does not support custom kernels.


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic
Sent: Thursday, July 16, 2009 7:50 PM
To: linux clustering
Subject: RE: RE: [Linux-cluster] GFS on SuSE 11 or OpenSuSE 11.1

> We could not use RedHat as RHEL 5.x do not support High Resolution
> Timers ...

Are you saying that changing the distribution is deemed less of a problem
than rebuilding the kernel with the high-res timer option ticked? The
kernel src.rpm package is there for a reason.

Gordan

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


*******************************************************************************************************************************************************************

"The information contained in this email including in any attachment is confidential and is meant to be read only by the person to whom it is addressed. If you are not the intended recipient(s), you are prohibited from printing, forwarding, saving or copying this email. If you have received this e-mail in error, please immediately notify the sender and delete this e-mail and its attachments from your computer."

*******************************************************************************************************************************************************************


From tfrumbacher at gmail.com  Thu Jul 16 14:04:45 2009
From: tfrumbacher at gmail.com (Aaron Benner)
Date: Thu, 16 Jul 2009 08:04:45 -0600
Subject: [Linux-cluster] Starting two-node cluster with only one node
In-Reply-To: <813634.56331.qm@web110409.mail.gq1.yahoo.com>
References: <813634.56331.qm@web110409.mail.gq1.yahoo.com>
Message-ID: <4D278A4A-3120-4114-A374-CC087A02897A@gmail.com>

Have you tried setting the "post_join_delay" value in the  
<fence_daemon ...> declaration to -1?

<fence_daemon clean_start="0" post_fail_delay="0"  
post_join_delay="-1" />

This is a hint I picked up from the fenced man page section on  
avoiding boot time fencing.  It tells fenced to wait until all of the  
nodes have joined the cluster before starting up.  We use this on a  
couple of 2 node clusters (with qdisk) to allow them to start up  
without the first node to grab the quorum disk fencing the other node.

--Aaron

On Jul 16, 2009, at 12:16 AM, Abed-nego G. Escobal, Jr. wrote:

>
>
> Tried it and now the two node cluster is running with only one node.  
> My problem right now is how to force the second node to join the  
> first node's cluster. Right now it is creating its own cluster and  
> trying to fence the first node. I tried cman_tool leave on the  
> second node but I got
>
> cman_tool: Error leaving cluster: Device or resource busy
>
> clvmd and gfs are not running on the second node. What is running on  
> the second node is cman. When I did
>
> service cman start
>
> It took 5 approximately 5 minutes before I got the [ok] meassage. Am  
> I missing something here? Not doing right? Should be doing something?
>
>
> --- On Thu, 7/16/09, Abed-nego G. Escobal, Jr.  
> <abednegoyulo at yahoo.com> wrote:
>
>> From: Abed-nego G. Escobal, Jr. <abednegoyulo at yahoo.com>
>> Subject: [Linux-cluster] Starting two-node cluster with only one node
>> To: "linux clustering" <linux-cluster at redhat.com>
>> Date: Thursday, 16 July, 2009, 10:46 AM
>>
>> Using the config file below
>>
>> <?xml version="1.0"?>
>> <cluster name="GFSCluster" config_version="5">
>> <cman expected_votes="1" two_node="1"/>
>>   <clusternodes><clusternode
>> name="node01.company.com" votes="1"
>> nodeid="1"><fence><method
>> name="single"><device
>> name="node01_ipmi"/></method></fence></clusternode><clusternode
>> name="node02.company.com" votes="1"
>> nodeid="2"><fence><method
>> name="single"><device
>> name="node02_ipmi"/></method></fence></clusternode></clusternodes>
>>   <fencedevices><fencedevice
>> name="node01_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.5"
>> login="root" passwd="********"/><fencedevice
>> name="node02_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.7"
>> login="root" passwd="********"/></fencedevices>
>>   <rm>
>>     <failoverdomains/>
>>     <resources/>
>>   </rm>
>> </cluster>
>>
>> Is it possible to start the cluster by only bringing up one
>> node? The reason why I asked is because currently bringing
>> them up together produces a split brain, each of them member
>> of the cluster GFSCluster of their own fencing each other.
>> My plan is to bring up only one node to create a quorum then
>> bring the other one up and manually join it to the existing
>> cluster.
>>
>> I have already don the start_clean approach but it seems it
>> does not work.
>>
>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>      Try the new Yahoo! Messenger. Now with all you love about  
> messenger and more! http://ph.messenger.yahoo.com
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From pradhanparas at gmail.com  Thu Jul 16 14:32:33 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Thu, 16 Jul 2009 09:32:33 -0500
Subject: [Linux-cluster] DRAC 4
In-Reply-To: <FB3C48D549F24844A81A51792F44AD690C1146@NE-SERVER02.networkenablers.local>
References: <8b711df40907151304g68d4fbdqf835a70aeee03f59@mail.gmail.com>
	<FB3C48D549F24844A81A51792F44AD690C1146@NE-SERVER02.networkenablers.local>
Message-ID: <8b711df40907160732n59486bd4tbecdffcd02fe02fb@mail.gmail.com>

On Thu, Jul 16, 2009 at 5:29 AM, Geoffrey
Laurence<glaurence at networkenablers.com.au> wrote:
> Hi,
>
> I have had a simular problem with Fedora-10, I think some kernels have a
> problem detecting the Drac. ?Anyway I found that you can enable telnet
> from the web interface.
>
> >From the 'command box' on the Diagnostics tab you can type the following
> commands,
>
> 'd3debug propget ENABLE_TELNET' ?- ?Prints if telnet is enabled
> 'd3debug propset ENABLE_TELNET=TRUE' ?- ?Enables telnet
> 'd3debug racadm racreset' ?- Reboots the drac.

Geoffrey,

Thanks a lot. It worked for me !

Paras.

>
> After you have enabled telnet and rebooted the drac, you should be able
> to telnet to the drac card.
>
> Hope this helps,
> Geoffrey.
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paras pradhan
> Sent: Thursday, 16 July 2009 6:05 AM
> To: linux-poweredge at lists.us.dell.com; linux clustering
> Subject: [Linux-cluster] DRAC 4
>
> hi,
>
> I am using centos 5.3 in dell poweredge 1850 servers . Drac is 4/i. ?I
> am working on to create a cluster using 3 poweredge nodes and I am in
> need to use DRAC as fencing device.
>
> While testing fencing using fence_drac it complains me as:
>
> root at tst1 ~]# fence_drac -a 10.10.10.2 -l user -p calvin
> failed: telnet open failed: problem connecting to "10.10.10.2", port
> 23: Connection refused
>
> I tried to enable telnet using racadm but got the following error.
>
> [root at tst1 ~]# racadm config -g cfgSerial -o cfgSerialSshEnable 1
> ERROR: RACADM is unable to process the requested subcommand because
> there is no
> local RAC configuration to communicate with.
>
> Local RACADM subcommand execution requires the following:
>
> ?1. A Remote Access Controller (RAC) must be present on the managed
> server
> ?2. Appropriate managed node software must be installed and running on
> the
> ? ?server
>
>
> I am really stuck here. Any one having the similar problem?
>
> Thanks
> Paras.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From lhh at redhat.com  Thu Jul 16 21:31:18 2009
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 16 Jul 2009 17:31:18 -0400
Subject: [Linux-cluster] clurgmgrd hang/stuck
In-Reply-To: <6b83838d0907151441w5fe9dd24ve7f45058e4c12dae@mail.gmail.com>
References: <6b83838d0907151441w5fe9dd24ve7f45058e4c12dae@mail.gmail.com>
Message-ID: <1247779878.5229.29.camel@localhost>

On Wed, 2009-07-15 at 18:41 -0300, Guilherme G. Felix wrote:
> Hi all,
> 
> I'm having a odd problem with my 2 node cluster.

What package n-v-r / git checkout are you using?

-- Lon


From garromo at us.ibm.com  Thu Jul 16 22:19:15 2009
From: garromo at us.ibm.com (Gary Romo)
Date: Thu, 16 Jul 2009 16:19:15 -0600
Subject: [Linux-cluster] RHCS config -  Conga or system-config-cluster
Message-ID: <OF883D227D.DBCE1AE2-ON872575F5.007A79F2-872575F5.007A9CD7@us.ibm.com>


Any known issues with configuring RHCS with both Conga and
system-config-cluster?

Gary Romo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090716/0bb0433c/attachment.htm>

From ggfelix at gmail.com  Thu Jul 16 22:47:18 2009
From: ggfelix at gmail.com (Guilherme G. Felix)
Date: Thu, 16 Jul 2009 19:47:18 -0300
Subject: [Linux-cluster] clurgmgrd hang/stuck
In-Reply-To: <1247779878.5229.29.camel@localhost>
References: <6b83838d0907151441w5fe9dd24ve7f45058e4c12dae@mail.gmail.com>
	<1247779878.5229.29.camel@localhost>
Message-ID: <6b83838d0907161547g6e521e1du9d944ad052a8634b@mail.gmail.com>

Howdy Lon,

I'm using the standard RHES 5.3 (Tikanga), with the following package
n-v-r-arch - no updates were applied to the system yet:

cman.2.0.98.1.el5-i386
rgmanager.2.0.46.1.el5-i386
modcluster.0.12.1.2.el5-i386
system-config-cluster.1.0.55.1.0-noarch
openais.0.80.3.22.el5-i386
cluster-cim.0.12.1.2.el5-i386

Thank you,

- G. Felix

On Thu, Jul 16, 2009 at 6:31 PM, Lon Hohberger <lhh at redhat.com> wrote:

> On Wed, 2009-07-15 at 18:41 -0300, Guilherme G. Felix wrote:
> > Hi all,
> >
> > I'm having a odd problem with my 2 node cluster.
>
> What package n-v-r / git checkout are you using?
>
> -- Lon
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090716/f389b947/attachment.htm>

From nagarurnaren at gmail.com  Fri Jul 17 05:04:38 2009
From: nagarurnaren at gmail.com (narendra reddy naren)
Date: Fri, 17 Jul 2009 10:34:38 +0530
Subject: [Linux-cluster] New to cluster
Message-ID: <b0e124170907162204g3546eabdn59ede23b9a5c25c@mail.gmail.com>

HI  :

I am new to clustering concept .Please send me some good PDF for HA and
HPC,,,,

-- 
Thanks & Regards,
G.Narendra Reddy ,
Mob : 9000500132
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090717/e5af8def/attachment.htm>

From abednegoyulo at yahoo.com  Fri Jul 17 06:41:41 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Thu, 16 Jul 2009 23:41:41 -0700 (PDT)
Subject: [Linux-cluster] Starting two-node cluster with only one node
In-Reply-To: <4D278A4A-3120-4114-A374-CC087A02897A@gmail.com>
Message-ID: <384614.10667.qm@web110407.mail.gq1.yahoo.com>


Thanks for the tip. It helped by stopping each node kicking each other, as per the logs, but still I have a split brain status. 

On node01

# /usr/sbin/cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    680   2009-07-17 00:30:42  node01.company.com
   2   X      0                        node02.company.com

# /usr/sbin/clustat 
Cluster Status for GFSCluster @ Fri Jul 17 01:01:09 2009
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node01.company.com                         1 Online, Local
 node02.company.com                         2 Offline


On node02

# /usr/sbin/cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        node01.company.com
   2   M    676   2009-07-17 00:30:43  node02.company.com
 

# /usr/sbin/clustat
Cluster Status for GFSCluster @ Fri Jul 17 01:01:22 2009
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node01.company.com                         1 Offline
 node02.company.com                         2 Online, Local


Another thing that I have noticed,

1. Start node01 with only itself as the member of the cluster
2. Update cluster.conf to have node02 as an additional member
3. Start node02

Yields both nodes being quorate (split brain) but only node02 tries to fence out node01. After some time, clustat will yield both of them being in the same cluster. Then I will be starting clvmd on node02 but will not be successful. After trying to start the clvmd service, clustat will yield split brain again. 

Are there some troubleshootings that I should be doing?


--- On Thu, 7/16/09, Aaron Benner <tfrumbacher at gmail.com> wrote:

> From: Aaron Benner <tfrumbacher at gmail.com>
> Subject: Re: [Linux-cluster] Starting two-node cluster with only one node
> To: "linux clustering" <linux-cluster at redhat.com>
> Date: Thursday, 16 July, 2009, 10:04 PM
> Have you tried setting the
> "post_join_delay" value in the <fence_daemon ...>
> declaration to -1?
> 
> <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="-1" />
> 
> This is a hint I picked up from the fenced man page section
> on avoiding boot time fencing.? It tells fenced to wait
> until all of the nodes have joined the cluster before
> starting up.? We use this on a couple of 2 node
> clusters (with qdisk) to allow them to start up without the
> first node to grab the quorum disk fencing the other node.
> 
> --Aaron
> 
> On Jul 16, 2009, at 12:16 AM, Abed-nego G. Escobal, Jr.
> wrote:
> 
> > 
> > 
> > Tried it and now the two node cluster is running with
> only one node. My problem right now is how to force the
> second node to join the first node's cluster. Right now it
> is creating its own cluster and trying to fence the first
> node. I tried cman_tool leave on the second node but I got
> > 
> > cman_tool: Error leaving cluster: Device or resource
> busy
> > 
> > clvmd and gfs are not running on the second node. What
> is running on the second node is cman. When I did
> > 
> > service cman start
> > 
> > It took 5 approximately 5 minutes before I got the
> [ok] meassage. Am I missing something here? Not doing right?
> Should be doing something?
> > 
> > 
> > --- On Thu, 7/16/09, Abed-nego G. Escobal, Jr. <abednegoyulo at yahoo.com>
> wrote:
> > 
> >> From: Abed-nego G. Escobal, Jr. <abednegoyulo at yahoo.com>
> >> Subject: [Linux-cluster] Starting two-node cluster
> with only one node
> >> To: "linux clustering" <linux-cluster at redhat.com>
> >> Date: Thursday, 16 July, 2009, 10:46 AM
> >> 
> >> Using the config file below
> >> 
> >> <?xml version="1.0"?>
> >> <cluster name="GFSCluster"
> config_version="5">
> >> <cman expected_votes="1" two_node="1"/>
> >>???<clusternodes><clusternode
> >> name="node01.company.com" votes="1"
> >> nodeid="1"><fence><method
> >> name="single"><device
> >>
> name="node01_ipmi"/></method></fence></clusternode><clusternode
> >> name="node02.company.com" votes="1"
> >> nodeid="2"><fence><method
> >> name="single"><device
> >>
> name="node02_ipmi"/></method></fence></clusternode></clusternodes>
> >>???<fencedevices><fencedevice
> >> name="node01_ipmi" agent="fence_ipmilan"
> ipaddr="10.1.0.5"
> >> login="root"
> passwd="********"/><fencedevice
> >> name="node02_ipmi" agent="fence_ipmilan"
> ipaddr="10.1.0.7"
> >> login="root"
> passwd="********"/></fencedevices>
> >>???<rm>
> >>? ???<failoverdomains/>
> >>? ???<resources/>
> >>???</rm>
> >> </cluster>
> >> 
> >> Is it possible to start the cluster by only
> bringing up one
> >> node? The reason why I asked is because currently
> bringing
> >> them up together produces a split brain, each of
> them member
> >> of the cluster GFSCluster of their own fencing
> each other.
> >> My plan is to bring up only one node to create a
> quorum then
> >> bring the other one up and manually join it to the
> existing
> >> cluster.
> >> 
> >> I have already don the start_clean approach but it
> seems it
> >> does not work.
> >> 
> >> 
> >> 
> >> 
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >> 
> > 
> > 
> >? ? ? Try the new Yahoo! Messenger. Now
> with all you love about messenger and more! http://ph.messenger.yahoo.com
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


      Get connected with chat on network profile, blog, or any personal website! Yahoo! allows you to IM with Pingbox. Check it out! http://ph.messenger.yahoo.com/pingbox


From agx at sigxcpu.org  Fri Jul 17 08:46:23 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Fri, 17 Jul 2009 10:46:23 +0200
Subject: [Linux-cluster] Cluster 3.0.0.rc4 release
In-Reply-To: <1247501173.10789.2.camel@localhost.localdomain>
References: <1246490190.19414.93.camel@cerberus.int.fabbione.net>
	<20090707162845.GA30094@bogon.sigxcpu.org>
	<1246988716.7993.13.camel@cerberus.int.fabbione.net>
	<20090709121958.GB18140@bogon.sigxcpu.org>
	<20090713155324.GA20030@bogon.sigxcpu.org>
	<1247501173.10789.2.camel@localhost.localdomain>
Message-ID: <20090717084623.GA16661@bogon.sigxcpu.org>

On Mon, Jul 13, 2009 at 09:06:13AM -0700, Steven Dake wrote:
> On Mon, 2009-07-13 at 17:53 +0200, Guido G?nther wrote:
> > On Thu, Jul 09, 2009 at 02:19:58PM +0200, Guido G?nther wrote:
> > > This stuff needs soome more work to be uploadable but it's mostly there
> > > I think. I've pushed the git archives here:
> > 
> > [..snip..] 
> > I've updated the packages at:
> > 
> > deb http://pkg-libvirt.alioth.debian.org/packages unstable/i386/
> > deb http://pkg-libvirt.alioth.debian.org/packages unstable/all/
> > 
> > to cluster 3.0.0.
> > Cheers,
> >  -- Guido
> > 
> Guido,
> 
> Thanks for the work!  If there is anything we can do in upstream
> corosync or openais to help simplify these efforts in the future, please
> share your ideas.
I think we're fine from the packaging point of view at the moment,
thanks! The switch to autoconf helped a lot.
Cheers,
 -- Guido


From cluster at xinet.it  Fri Jul 17 08:55:27 2009
From: cluster at xinet.it (Francesco Gallo)
Date: Fri, 17 Jul 2009 10:55:27 +0200
Subject: [Linux-cluster] Clone/Snapshot Lun based VM
Message-ID: <001001ca06bc$54c4ea90$fe4ebfb0$@it>

Hi all,

 
i would like to clone/make a snapshot of a lun-based VM. 

I use lvm2 clustered and an ISCSI storage (Dell MD3000i). 

 
Is there a particular software who could help me? Any suggestion? Any link?

 
Thanks a lot,

Francesco

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090717/ca816c8a/attachment.htm>

From mad at wol.de  Fri Jul 17 09:56:22 2009
From: mad at wol.de (Marc - A. Dahlhaus [ Administration | Westermann GmbH ])
Date: Fri, 17 Jul 2009 11:56:22 +0200
Subject: [Linux-cluster] Starting two-node cluster with only one node
In-Reply-To: <384614.10667.qm@web110407.mail.gq1.yahoo.com>
References: <384614.10667.qm@web110407.mail.gq1.yahoo.com>
Message-ID: <1247824582.973.48.camel@marc>

Hello,


can you give us some hard facts on what versions of cluster-suite
packages you are using in your environment and also the related logs?

Have you read the corresponding parts of the cluster suites manual, man
pages, FAQ and also searched the list-archives for similar problems
already? If not -> do it, there are may good hints to find there.


The nodes find each other and create a cluster very fast IF they can
talk to each other. As no cluster networking is involved in fencing a
remote node if the fencing node by itself is quorate this could be your
problem.

You should change to fence_manual and switch back to your real fencing
devices after you have debuged your problem. Also get rid of the
<fence_daemon ... /> tag in your cluster.conf as fenced does the right
thing by default if the remaining configuration is right and now it is
just hiding a part of the problem.

Also the 5 minute break on cman start smells like a DNS-lookup problem
or other network related problem to me.

Here is a short check-list to be sure the nodes can talk to each other:

Can the individual nodes ping each other?

Can the individual nodes dns-lookup the other node-names (which you used
in your cluster.conf)? (Try to add them to your etc/hosts file, that way
you have a working cluster even if your dns-system is going on
vacation.)

Is your switch allowing multicast communication on all ports that are
used for cluster communication? (This is a prerequisite for openais /
corosync based cman which would be anything >= RHEL 5. Search the
archives on this if you need more info...)

Can you trace (eg. with wiresharks tshark) incoming cluster
communication from remote nodes? (If you don't changed your fencing to
fence_manual your listening system will get fenced before you can get
any useful information out of it. Try with and without active firewall.)

If all above could be answered with "yes" your cluster should form just
fine. You could try to add a qdisk-device as tiebreaker after that and
test it just to be sure you have a working last man standing setup...

Hope that helps,

Marc

Am Donnerstag, den 16.07.2009, 23:41 -0700 schrieb Abed-nego G. Escobal,
Jr.: 
> 
> Thanks for the tip. It helped by stopping each node kicking each other, as per the logs, but still I have a split brain status. 
> 
> On node01
> 
> # /usr/sbin/cman_tool nodes
> Node  Sts   Inc   Joined               Name
>    1   M    680   2009-07-17 00:30:42  node01.company.com
>    2   X      0                        node02.company.com
> 
> # /usr/sbin/clustat 
> Cluster Status for GFSCluster @ Fri Jul 17 01:01:09 2009
> Member Status: Quorate
> 
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
>  node01.company.com                         1 Online, Local
>  node02.company.com                         2 Offline
> 
> 
> On node02
> 
> # /usr/sbin/cman_tool nodes
> Node  Sts   Inc   Joined               Name
>    1   X      0                        node01.company.com
>    2   M    676   2009-07-17 00:30:43  node02.company.com
>  
> 
> # /usr/sbin/clustat
> Cluster Status for GFSCluster @ Fri Jul 17 01:01:22 2009
> Member Status: Quorate
> 
>  Member Name                             ID   Status
>  ------ ----                             ---- ------
>  node01.company.com                         1 Offline
>  node02.company.com                         2 Online, Local
> 
> 
> Another thing that I have noticed,
> 
> 1. Start node01 with only itself as the member of the cluster
> 2. Update cluster.conf to have node02 as an additional member
> 3. Start node02
> 
> Yields both nodes being quorate (split brain) but only node02 tries to fence out node01. After some time, clustat will yield both of them being in the same cluster. Then I will be starting clvmd on node02 but will not be successful. After trying to start the clvmd service, clustat will yield split brain again. 
> 
> Are there some troubleshootings that I should be doing?
> 
> 
> --- On Thu, 7/16/09, Aaron Benner <tfrumbacher at gmail.com> wrote:
> 
> > From: Aaron Benner <tfrumbacher at gmail.com>
> > Subject: Re: [Linux-cluster] Starting two-node cluster with only one node
> > To: "linux clustering" <linux-cluster at redhat.com>
> > Date: Thursday, 16 July, 2009, 10:04 PM
> > Have you tried setting the
> > "post_join_delay" value in the <fence_daemon ...>
> > declaration to -1?
> > 
> > <fence_daemon clean_start="0" post_fail_delay="0"
> > post_join_delay="-1" />
> > 
> > This is a hint I picked up from the fenced man page section
> > on avoiding boot time fencing.  It tells fenced to wait
> > until all of the nodes have joined the cluster before
> > starting up.  We use this on a couple of 2 node
> > clusters (with qdisk) to allow them to start up without the
> > first node to grab the quorum disk fencing the other node.
> > 
> > --Aaron
> > 
> > On Jul 16, 2009, at 12:16 AM, Abed-nego G. Escobal, Jr.
> > wrote:
> > 
> > > 
> > > 
> > > Tried it and now the two node cluster is running with
> > only one node. My problem right now is how to force the
> > second node to join the first node's cluster. Right now it
> > is creating its own cluster and trying to fence the first
> > node. I tried cman_tool leave on the second node but I got
> > > 
> > > cman_tool: Error leaving cluster: Device or resource
> > busy
> > > 
> > > clvmd and gfs are not running on the second node. What
> > is running on the second node is cman. When I did
> > > 
> > > service cman start
> > > 
> > > It took 5 approximately 5 minutes before I got the
> > [ok] meassage. Am I missing something here? Not doing right?
> > Should be doing something?
> > > 
> > > 
> > > --- On Thu, 7/16/09, Abed-nego G. Escobal, Jr. <abednegoyulo at yahoo.com>
> > wrote:
> > > 
> > >> From: Abed-nego G. Escobal, Jr. <abednegoyulo at yahoo.com>
> > >> Subject: [Linux-cluster] Starting two-node cluster
> > with only one node
> > >> To: "linux clustering" <linux-cluster at redhat.com>
> > >> Date: Thursday, 16 July, 2009, 10:46 AM
> > >> 
> > >> Using the config file below
> > >> 
> > >> <?xml version="1.0"?>
> > >> <cluster name="GFSCluster"
> > config_version="5">
> > >> <cman expected_votes="1" two_node="1"/>
> > >>   <clusternodes><clusternode
> > >> name="node01.company.com" votes="1"
> > >> nodeid="1"><fence><method
> > >> name="single"><device
> > >>
> > name="node01_ipmi"/></method></fence></clusternode><clusternode
> > >> name="node02.company.com" votes="1"
> > >> nodeid="2"><fence><method
> > >> name="single"><device
> > >>
> > name="node02_ipmi"/></method></fence></clusternode></clusternodes>
> > >>   <fencedevices><fencedevice
> > >> name="node01_ipmi" agent="fence_ipmilan"
> > ipaddr="10.1.0.5"
> > >> login="root"
> > passwd="********"/><fencedevice
> > >> name="node02_ipmi" agent="fence_ipmilan"
> > ipaddr="10.1.0.7"
> > >> login="root"
> > passwd="********"/></fencedevices>
> > >>   <rm>
> > >>     <failoverdomains/>
> > >>     <resources/>
> > >>   </rm>
> > >> </cluster>
> > >> 
> > >> Is it possible to start the cluster by only
> > bringing up one
> > >> node? The reason why I asked is because currently
> > bringing
> > >> them up together produces a split brain, each of
> > them member
> > >> of the cluster GFSCluster of their own fencing
> > each other.
> > >> My plan is to bring up only one node to create a
> > quorum then
> > >> bring the other one up and manually join it to the
> > existing
> > >> cluster.
> > >> 
> > >> I have already don the start_clean approach but it
> > seems it
> > >> does not work.
> > >> 


From td3201 at gmail.com  Fri Jul 17 16:05:19 2009
From: td3201 at gmail.com (Terry)
Date: Fri, 17 Jul 2009 11:05:19 -0500
Subject: [Linux-cluster] determining fsid for fs resource
Message-ID: <8ee061010907170905x3468960foc938215865056a10@mail.gmail.com>

Hello,

When I create a fs resource using redhat's luci, it is able to find
the fsid for a fs and life is good.  However, I am not crazy about
luci and would prefer to manually create the resources from the
command line but how do I find the fsid for a filesystem?  Here's an
example of a fs resource created using luci:

<fs device="/dev/vg_data01i/lv_data01i" force_fsck="0"
force_unmount="1" fsid="49256" fstype="ext3" mountpoint="/data01i"
name="omadvnfs01-data01i"
options="noatime,nodiratime,data=writeback,commit=30" self_fence="0"/>

Thanks!


From fxmulder at gmail.com  Fri Jul 17 16:20:49 2009
From: fxmulder at gmail.com (James Devine)
Date: Fri, 17 Jul 2009 10:20:49 -0600
Subject: [Linux-cluster] disk fencing
Message-ID: <c30750500907170920n28f2de1yf7dc185fdf7e6d92@mail.gmail.com>

Has anybody looked into using the network for heartbeat only, and disk
for fencing in GFS?  i.e. using the disk to communicate quorum when
network heartbeat is lost between 1 or more nodes.  If the disk is
still accessible to all nodes, this should be a valid way to
communicate quorum, if not, then the remaining nodes, assuming enough
for quorum, should be able to continue knowing that nodes it can't
communicate with either have been fenced or can't read/write to disk
anyway.  Does this sound like a valid approach?


From ggfelix at gmail.com  Fri Jul 17 17:45:28 2009
From: ggfelix at gmail.com (Guilherme G. Felix)
Date: Fri, 17 Jul 2009 14:45:28 -0300
Subject: [Linux-cluster] clurgmgrd hang/stuck
In-Reply-To: <6b83838d0907161547g6e521e1du9d944ad052a8634b@mail.gmail.com>
References: <6b83838d0907151441w5fe9dd24ve7f45058e4c12dae@mail.gmail.com>
	<1247779878.5229.29.camel@localhost>
	<6b83838d0907161547g6e521e1du9d944ad052a8634b@mail.gmail.com>
Message-ID: <6b83838d0907171045q3c01eb89r8fd1ab2af6dd8201@mail.gmail.com>

Some more info....removed everything from the cluster and left only apache
and the shared IPs, I also made a huge change to cluster.conf and started to
use "Shared Resources".

Even after that rgmanager is freezing, when rgmanager enter in this state
system-config-cluster cannot show the service when it's opened, clusvcadm
won't work either, only solution is to restart rgmanager on both nodes.

Hardware vendor has done a full health check and upgraded firmwares to
eliminate this possibility.

Thank you,

- G. Felix

On Thu, Jul 16, 2009 at 7:47 PM, Guilherme G. Felix <ggfelix at gmail.com>wrote:

> Howdy Lon,
>
> I'm using the standard RHES 5.3 (Tikanga), with the following package
> n-v-r-arch - no updates were applied to the system yet:
>
> cman.2.0.98.1.el5-i386
> rgmanager.2.0.46.1.el5-i386
> modcluster.0.12.1.2.el5-i386
> system-config-cluster.1.0.55.1.0-noarch
> openais.0.80.3.22.el5-i386
> cluster-cim.0.12.1.2.el5-i386
>
> Thank you,
>
> - G. Felix
>
>
> On Thu, Jul 16, 2009 at 6:31 PM, Lon Hohberger <lhh at redhat.com> wrote:
>
>> On Wed, 2009-07-15 at 18:41 -0300, Guilherme G. Felix wrote:
>> > Hi all,
>> >
>> > I'm having a odd problem with my 2 node cluster.
>>
>> What package n-v-r / git checkout are you using?
>>
>> -- Lon
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090717/2491dcb5/attachment.htm>

From pradhanparas at gmail.com  Fri Jul 17 20:15:25 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Fri, 17 Jul 2009 15:15:25 -0500
Subject: [Linux-cluster] Cluster failover
Message-ID: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com>

hi,

I have 3 nodes of CentOS 5.3 running xen virtual machines as virtual
machine service. This cluster is working fine. One thing I would like
to know that how to make failover only to  third node. What I mean to
say is: I have 3 virtual machine  running on node 1 and 2 virtual
machines running on node 2. Now if node 1 fails I want my the node1
virtual machines  to be stared only on node 3 but not on node2.
Similary if node2 breaks, I want virtual machines to be started on
node3 but never on node 1.

Thanks !
Paras.


From raju.rajsand at gmail.com  Sat Jul 18 03:26:35 2009
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Sat, 18 Jul 2009 08:56:35 +0530
Subject: [Linux-cluster] Cluster failover
In-Reply-To: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com>
References: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com>
Message-ID: <8786b91c0907172026g3cdec94bw946187e2fdf3181a@mail.gmail.com>

Greetings,

You basically need to define multiple failover domains.

On Sat, Jul 18, 2009 at 1:45 AM, Paras pradhan<pradhanparas at gmail.com> wrote:
> I have 3 virtual machine ?running on node 1 and 2 virtual
> machines running on node 2.

> Now if node 1 fails I want my the node1
> virtual machines ?to be stared only on node 3 but not on node2.
Failover domain 1 consisting of Node 1 and Node 3

> Similary if node2 breaks, I want virtual machines to be started on
> node3 but never on node 1.
Failover domain 2 consisting of Node 2 and Node 3

HTH

Thanks and Regards

Rajagopal


From cthulhucalling at gmail.com  Sat Jul 18 06:50:01 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Fri, 17 Jul 2009 23:50:01 -0700
Subject: [Linux-cluster] disk fencing
In-Reply-To: <c30750500907170920n28f2de1yf7dc185fdf7e6d92@mail.gmail.com>
References: <c30750500907170920n28f2de1yf7dc185fdf7e6d92@mail.gmail.com>
Message-ID: <36df569a0907172350g5b381e3bt87abd7fbe6fe8677@mail.gmail.com>

I'm not sure what you're asking here, but it sounds like you're describing a
qdisk.

If a node loses heartbeat with the rest of the cluster, that's a fencin'.
Doesn't matter if it can still access the shared storage, and if it has lost
communication with the rest of the cluster, you probably don't want it
accessing your data anyway.

On Fri, Jul 17, 2009 at 9:20 AM, James Devine <fxmulder at gmail.com> wrote:

> Has anybody looked into using the network for heartbeat only, and disk
> for fencing in GFS?  i.e. using the disk to communicate quorum when
> network heartbeat is lost between 1 or more nodes.  If the disk is
> still accessible to all nodes, this should be a valid way to
> communicate quorum, if not, then the remaining nodes, assuming enough
> for quorum, should be able to continue knowing that nodes it can't
> communicate with either have been fenced or can't read/write to disk
> anyway.  Does this sound like a valid approach?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090717/127972c5/attachment.htm>

From cthulhucalling at gmail.com  Sat Jul 18 06:56:08 2009
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Fri, 17 Jul 2009 23:56:08 -0700
Subject: [Linux-cluster] Cluster failover
In-Reply-To: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com>
References: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com>
Message-ID: <36df569a0907172356g55aab93fwdd12a660f36e49f9@mail.gmail.com>

Specify 2 different failover domains for the services. I have a similar
setup for a project with a 3 node cluster. Node 1 runs Service A, Node 2
runs Service B and Node 3 is the floater

Failover Domain 1: Node 1, Node 3
Failover Domain 2: Node 2, Node 3


Service A: Failover Domain1
Service B: Failover Domain2

On Fri, Jul 17, 2009 at 1:15 PM, Paras pradhan <pradhanparas at gmail.com>wrote:

> hi,
>
> I have 3 nodes of CentOS 5.3 running xen virtual machines as virtual
> machine service. This cluster is working fine. One thing I would like
> to know that how to make failover only to  third node. What I mean to
> say is: I have 3 virtual machine  running on node 1 and 2 virtual
> machines running on node 2. Now if node 1 fails I want my the node1
> virtual machines  to be stared only on node 3 but not on node2.
> Similary if node2 breaks, I want virtual machines to be started on
> node3 but never on node 1.
>
> Thanks !
> Paras.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090717/85473b23/attachment.htm>

From abednegoyulo at yahoo.com  Sat Jul 18 07:38:14 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Sat, 18 Jul 2009 00:38:14 -0700 (PDT)
Subject: [Linux-cluster] Starting two-node cluster with only one node
In-Reply-To: <1247824582.973.48.camel@marc>
Message-ID: <273907.46020.qm@web110412.mail.gq1.yahoo.com>


Thanks for giving the pointers!

uname -r on both nodes

2.6.18-128.1.16.el5

on node01

rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
cman-2.0.98-2chrissie
gfs-utils-0.1.18-1.el5
kmod-gfs-0.1.23-5.el5_2.4
kmod-gfs-0.1.31-3.el5
modcluster-0.12.1-2.el5.centos
ricci-0.12.1-7.3.el5.centos.1
luci-0.12.1-7.3.el5.centos.1
cluster-snmp-0.12.1-2.el5.centos
iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
lvm2-cluster-2.02.40-7.el5
openais-0.80.3-22.el5_3.8
oddjob-0.27-9.el5
rgmanager-2.0.46-1.el5.centos.3

on node02

rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
cman-2.0.98-2chrissie
gfs-utils-0.1.18-1.el5
kmod-gfs-0.1.31-3.el5
modcluster-0.12.1-2.el5.centos
ricci-0.12.1-7.3.el5.centos.1
luci-0.12.1-7.3.el5.centos.1
cluster-snmp-0.12.1-2.el5.centos
iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
lvm2-cluster-2.02.40-7.el5
openais-0.80.3-22.el5_3.8
oddjob-0.27-9.el5
rgmanager-2.0.46-1.el5.centos.3

I used http://knowledgelayer.softlayer.com/questions/443/GFS+howto to configure my cluster. When it was still on 5.2 the cluster worked, but after the recent update to 5.3, it broke.

On one of the threads that I have found in the archive, it states that there is a problem with the most current official version of cman, bug id 485026. I replaced the most current cman package with cman-2.0.98-2chrissie because I tested if this was my problem, seems not so I will be moving back to the official package.
I also found on another thread that openais was the culprit, changed it back to openais-0.80.3-15.el5 even though the change log indicates a lot of bug fixes were done on the most current official package. After doing it, it still did not work. I tried clean_start="1" with caution. I unmounted the iscsi then started cman but still it did not work. The most recent is post_join_delay="-1", I did not noticed that there was a man for fenced, which is much safer than clean_start="1" but still it did not fixed it. The man pages that I have read over and over again is cman and cluster.conf. Some pages in the online manual is somewhat not suitable for my situation because I do not have X installed on the machines and some pages in the online manual used system-config-cluster.

As I understand in the online manual and FAQ, qdisk is not required if I have two_nodes="1" so I did not create any. I have removed the fence_daemon tag since I only used it for trying the solutions that were suggested. The hosts are present in each others hosts with correct ips.


The ping results

ping node02.company.com

--- node01.company.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.010/0.016/0.034/0.007 ms

ping node01.company.com

--- node01.company.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9003ms
rtt min/avg/max/mdev = 0.341/0.668/1.084/0.273 ms

According to the people in the data center, the switch supports multicast communication on all ports that are used for cluster communication because they are in the same VLAN.

For the logs, I will sending fresh logs as soon as possible. Currently I have not enough time window to bring down the machine.

For the wireshark, I will be reading the man pages on how to use it.

Please advise if any other information is needed to solve this. I am very grateful for the very detailed pointers. Thank you very much! 


--- On Fri, 7/17/09, Marc - A. Dahlhaus [ Administration | Westermann GmbH ] <mad at wol.de> wrote:

> From: Marc - A. Dahlhaus [ Administration | Westermann GmbH ] <mad at wol.de>
> Subject: Re: [Linux-cluster] Starting two-node cluster with only one node
> To: "linux clustering" <linux-cluster at redhat.com>
> Date: Friday, 17 July, 2009, 5:56 PM
> Hello,
> 
> 
> can you give us some hard facts on what versions of
> cluster-suite
> packages you are using in your environment and also the
> related logs?
> 
> Have you read the corresponding parts of the cluster suites
> manual, man
> pages, FAQ and also searched the list-archives for similar
> problems
> already? If not -> do it, there are may good hints to
> find there.
> 
> 
> The nodes find each other and create a cluster very fast IF
> they can
> talk to each other. As no cluster networking is involved in
> fencing a
> remote node if the fencing node by itself is quorate this
> could be your
> problem.
> 
> You should change to fence_manual and switch back to your
> real fencing
> devices after you have debuged your problem. Also get rid
> of the
> <fence_daemon ... /> tag in your cluster.conf as
> fenced does the right
> thing by default if the remaining configuration is right
> and now it is
> just hiding a part of the problem.
> 
> Also the 5 minute break on cman start smells like a
> DNS-lookup problem
> or other network related problem to me.
> 
> Here is a short check-list to be sure the nodes can talk to
> each other:
> 
> Can the individual nodes ping each other?
> 
> Can the individual nodes dns-lookup the other node-names
> (which you used
> in your cluster.conf)? (Try to add them to your etc/hosts
> file, that way
> you have a working cluster even if your dns-system is going
> on
> vacation.)
> 
> Is your switch allowing multicast communication on all
> ports that are
> used for cluster communication? (This is a prerequisite for
> openais /
> corosync based cman which would be anything >= RHEL 5.
> Search the
> archives on this if you need more info...)
> 
> Can you trace (eg. with wiresharks tshark) incoming
> cluster
> communication from remote nodes? (If you don't changed your
> fencing to
> fence_manual your listening system will get fenced before
> you can get
> any useful information out of it. Try with and without
> active firewall.)
> 
> If all above could be answered with "yes" your cluster
> should form just
> fine. You could try to add a qdisk-device as tiebreaker
> after that and
> test it just to be sure you have a working last man
> standing setup...
> 
> Hope that helps,
> 
> Marc
> 
> Am Donnerstag, den 16.07.2009, 23:41 -0700 schrieb
> Abed-nego G. Escobal,
> Jr.: 
> > 
> > Thanks for the tip. It helped by stopping each node
> kicking each other, as per the logs, but still I have a
> split brain status. 
> > 
> > On node01
> > 
> > # /usr/sbin/cman_tool nodes
> > Node?
> Sts???Inc???Joined?
> ? ? ? ? ? ???Name
> >? ? 1???M? ?
> 680???2009-07-17 00:30:42?
> node01.company.com
> >? ? 2???X? ? ?
> 0? ? ? ? ? ? ? ?
> ? ? ? ? node02.company.com
> > 
> > # /usr/sbin/clustat 
> > Cluster Status for GFSCluster @ Fri Jul 17 01:01:09
> 2009
> > Member Status: Quorate
> > 
> >? Member Name? ? ? ? ?
> ? ? ? ? ? ? ? ?
> ???ID???Status
> >? ------ ----? ? ? ? ?
> ? ? ? ? ? ? ? ?
> ???---- ------
> >? node01.company.com? ? ? ?
> ? ? ? ? ? ? ?
> ???1 Online, Local
> >? node02.company.com? ? ? ?
> ? ? ? ? ? ? ?
> ???2 Offline
> > 
> > 
> > On node02
> > 
> > # /usr/sbin/cman_tool nodes
> > Node?
> Sts???Inc???Joined?
> ? ? ? ? ? ???Name
> >? ? 1???X? ? ?
> 0? ? ? ? ? ? ? ?
> ? ? ? ? node01.company.com
> >? ? 2???M? ?
> 676???2009-07-17 00:30:43?
> node02.company.com
> >? 
> > 
> > # /usr/sbin/clustat
> > Cluster Status for GFSCluster @ Fri Jul 17 01:01:22
> 2009
> > Member Status: Quorate
> > 
> >? Member Name? ? ? ? ?
> ? ? ? ? ? ? ? ?
> ???ID???Status
> >? ------ ----? ? ? ? ?
> ? ? ? ? ? ? ? ?
> ???---- ------
> >? node01.company.com? ? ? ?
> ? ? ? ? ? ? ?
> ???1 Offline
> >? node02.company.com? ? ? ?
> ? ? ? ? ? ? ?
> ???2 Online, Local
> > 
> > 
> > Another thing that I have noticed,
> > 
> > 1. Start node01 with only itself as the member of the
> cluster
> > 2. Update cluster.conf to have node02 as an additional
> member
> > 3. Start node02
> > 
> > Yields both nodes being quorate (split brain) but only
> node02 tries to fence out node01. After some time, clustat
> will yield both of them being in the same cluster. Then I
> will be starting clvmd on node02 but will not be successful.
> After trying to start the clvmd service, clustat will yield
> split brain again. 
> > 
> > Are there some troubleshootings that I should be
> doing?
> > 
> > 
> > --- On Thu, 7/16/09, Aaron Benner <tfrumbacher at gmail.com>
> wrote:
> > 
> > > From: Aaron Benner <tfrumbacher at gmail.com>
> > > Subject: Re: [Linux-cluster] Starting two-node
> cluster with only one node
> > > To: "linux clustering" <linux-cluster at redhat.com>
> > > Date: Thursday, 16 July, 2009, 10:04 PM
> > > Have you tried setting the
> > > "post_join_delay" value in the <fence_daemon
> ...>
> > > declaration to -1?
> > > 
> > > <fence_daemon clean_start="0"
> post_fail_delay="0"
> > > post_join_delay="-1" />
> > > 
> > > This is a hint I picked up from the fenced man
> page section
> > > on avoiding boot time fencing.? It tells
> fenced to wait
> > > until all of the nodes have joined the cluster
> before
> > > starting up.? We use this on a couple of 2
> node
> > > clusters (with qdisk) to allow them to start up
> without the
> > > first node to grab the quorum disk fencing the
> other node.
> > > 
> > > --Aaron
> > > 
> > > On Jul 16, 2009, at 12:16 AM, Abed-nego G.
> Escobal, Jr.
> > > wrote:
> > > 
> > > > 
> > > > 
> > > > Tried it and now the two node cluster is
> running with
> > > only one node. My problem right now is how to
> force the
> > > second node to join the first node's cluster.
> Right now it
> > > is creating its own cluster and trying to fence
> the first
> > > node. I tried cman_tool leave on the second node
> but I got
> > > > 
> > > > cman_tool: Error leaving cluster: Device or
> resource
> > > busy
> > > > 
> > > > clvmd and gfs are not running on the second
> node. What
> > > is running on the second node is cman. When I
> did
> > > > 
> > > > service cman start
> > > > 
> > > > It took 5 approximately 5 minutes before I
> got the
> > > [ok] meassage. Am I missing something here? Not
> doing right?
> > > Should be doing something?
> > > > 
> > > > 
> > > > --- On Thu, 7/16/09, Abed-nego G. Escobal,
> Jr. <abednegoyulo at yahoo.com>
> > > wrote:
> > > > 
> > > >> From: Abed-nego G. Escobal, Jr. <abednegoyulo at yahoo.com>
> > > >> Subject: [Linux-cluster] Starting
> two-node cluster
> > > with only one node
> > > >> To: "linux clustering" <linux-cluster at redhat.com>
> > > >> Date: Thursday, 16 July, 2009, 10:46 AM
> > > >> 
> > > >> Using the config file below
> > > >> 
> > > >> <?xml version="1.0"?>
> > > >> <cluster name="GFSCluster"
> > > config_version="5">
> > > >> <cman expected_votes="1"
> two_node="1"/>
> > >
> >>???<clusternodes><clusternode
> > > >> name="node01.company.com" votes="1"
> > > >> nodeid="1"><fence><method
> > > >> name="single"><device
> > > >>
> > >
> name="node01_ipmi"/></method></fence></clusternode><clusternode
> > > >> name="node02.company.com" votes="1"
> > > >> nodeid="2"><fence><method
> > > >> name="single"><device
> > > >>
> > >
> name="node02_ipmi"/></method></fence></clusternode></clusternodes>
> > >
> >>???<fencedevices><fencedevice
> > > >> name="node01_ipmi"
> agent="fence_ipmilan"
> > > ipaddr="10.1.0.5"
> > > >> login="root"
> > > passwd="********"/><fencedevice
> > > >> name="node02_ipmi"
> agent="fence_ipmilan"
> > > ipaddr="10.1.0.7"
> > > >> login="root"
> > > passwd="********"/></fencedevices>
> > > >>???<rm>
> > > >>?
> ???<failoverdomains/>
> > > >>?
> ???<resources/>
> > > >>???</rm>
> > > >> </cluster>
> > > >> 
> > > >> Is it possible to start the cluster by
> only
> > > bringing up one
> > > >> node? The reason why I asked is because
> currently
> > > bringing
> > > >> them up together produces a split brain,
> each of
> > > them member
> > > >> of the cluster GFSCluster of their own
> fencing
> > > each other.
> > > >> My plan is to bring up only one node to
> create a
> > > quorum then
> > > >> bring the other one up and manually join
> it to the
> > > existing
> > > >> cluster.
> > > >> 
> > > >> I have already don the start_clean
> approach but it
> > > seems it
> > > >> does not work.
> > > >> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


From mad at wol.de  Sat Jul 18 14:02:13 2009
From: mad at wol.de (Marc - A. Dahlhaus)
Date: Sat, 18 Jul 2009 16:02:13 +0200
Subject: [Linux-cluster] Starting two-node cluster with only one node
In-Reply-To: <273907.46020.qm@web110412.mail.gq1.yahoo.com>
References: <273907.46020.qm@web110412.mail.gq1.yahoo.com>
Message-ID: <4A61D5E5.4030702@wol.de>

Hello,

as your cluster worked well on centos 5.2 the networking hardware 
components couldn't be the culprit in this case but is still think that 
it is an cluster communication related problem.

It could be your iptables ruleset... Try to disable the firewall and 
check again...

You can use tshark to check this as well in this case by using something 
like this:

tshark -i <interface cluster is useing> -f 'host <multicast-ip cluster 
is useing>' -V | less

Have you checked that openais is still chkconfig off after your upgrade?

Abed-nego G. Escobal, Jr. schrieb:
> Thanks for giving the pointers!
>
> uname -r on both nodes
>
> 2.6.18-128.1.16.el5
>
> on node01
>
> rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
> cman-2.0.98-2chrissie
> gfs-utils-0.1.18-1.el5
> kmod-gfs-0.1.23-5.el5_2.4
> kmod-gfs-0.1.31-3.el5
> modcluster-0.12.1-2.el5.centos
> ricci-0.12.1-7.3.el5.centos.1
> luci-0.12.1-7.3.el5.centos.1
> cluster-snmp-0.12.1-2.el5.centos
> iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
> lvm2-cluster-2.02.40-7.el5
> openais-0.80.3-22.el5_3.8
> oddjob-0.27-9.el5
> rgmanager-2.0.46-1.el5.centos.3
>
> on node02
>
> rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
> cman-2.0.98-2chrissie
> gfs-utils-0.1.18-1.el5
> kmod-gfs-0.1.31-3.el5
> modcluster-0.12.1-2.el5.centos
> ricci-0.12.1-7.3.el5.centos.1
> luci-0.12.1-7.3.el5.centos.1
> cluster-snmp-0.12.1-2.el5.centos
> iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
> lvm2-cluster-2.02.40-7.el5
> openais-0.80.3-22.el5_3.8
> oddjob-0.27-9.el5
> rgmanager-2.0.46-1.el5.centos.3
>
> I used http://knowledgelayer.softlayer.com/questions/443/GFS+howto to configure my cluster. When it was still on 5.2 the cluster worked, but after the recent update to 5.3, it broke.
>
> On one of the threads that I have found in the archive, it states that there is a problem with the most current official version of cman, bug id 485026. I replaced the most current cman package with cman-2.0.98-2chrissie because I tested if this was my problem, seems not so I will be moving back to the official package.
> I also found on another thread that openais was the culprit, changed it back to openais-0.80.3-15.el5 even though the change log indicates a lot of bug fixes were done on the most current official package. After doing it, it still did not work. I tried clean_start="1" with caution. I unmounted the iscsi then started cman but still it did not work. The most recent is post_join_delay="-1", I did not noticed that there was a man for fenced, which is much safer than clean_start="1" but still it did not fixed it. The man pages that I have read over and over again is cman and cluster.conf. Some pages in the online manual is somewhat not suitable for my situation because I do not have X installed on the machines and some pages in the online manual used system-config-cluster.
>
> As I understand in the online manual and FAQ, qdisk is not required if I have two_nodes="1" so I did not create any. I have removed the fence_daemon tag since I only used it for trying the solutions that were suggested. The hosts are present in each others hosts with correct ips.
>
>
> The ping results
>
> ping node02.company.com
>
> --- node01.company.com ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 8999ms
> rtt min/avg/max/mdev = 0.010/0.016/0.034/0.007 ms
>
> ping node01.company.com
>
> --- node01.company.com ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9003ms
> rtt min/avg/max/mdev = 0.341/0.668/1.084/0.273 ms
>
> According to the people in the data center, the switch supports multicast communication on all ports that are used for cluster communication because they are in the same VLAN.
>
> For the logs, I will sending fresh logs as soon as possible. Currently I have not enough time window to bring down the machine.
>
> For the wireshark, I will be reading the man pages on how to use it.
>
> Please advise if any other information is needed to solve this. I am very grateful for the very detailed pointers. Thank you very much! 
>
>
> --- On Fri, 7/17/09, Marc - A. Dahlhaus [ Administration | Westermann GmbH ] <mad at wol.de> wrote:
>
>   
>> From: Marc - A. Dahlhaus [ Administration | Westermann GmbH ] <mad at wol.de>
>> Subject: Re: [Linux-cluster] Starting two-node cluster with only one node
>> To: "linux clustering" <linux-cluster at redhat.com>
>> Date: Friday, 17 July, 2009, 5:56 PM
>> Hello,
>>
>>
>> can you give us some hard facts on what versions of
>> cluster-suite
>> packages you are using in your environment and also the
>> related logs?
>>
>> Have you read the corresponding parts of the cluster suites
>> manual, man
>> pages, FAQ and also searched the list-archives for similar
>> problems
>> already? If not -> do it, there are may good hints to
>> find there.
>>
>>
>> The nodes find each other and create a cluster very fast IF
>> they can
>> talk to each other. As no cluster networking is involved in
>> fencing a
>> remote node if the fencing node by itself is quorate this
>> could be your
>> problem.
>>
>> You should change to fence_manual and switch back to your
>> real fencing
>> devices after you have debuged your problem. Also get rid
>> of the
>> <fence_daemon ... /> tag in your cluster.conf as
>> fenced does the right
>> thing by default if the remaining configuration is right
>> and now it is
>> just hiding a part of the problem.
>>
>> Also the 5 minute break on cman start smells like a
>> DNS-lookup problem
>> or other network related problem to me.
>>
>> Here is a short check-list to be sure the nodes can talk to
>> each other:
>>
>> Can the individual nodes ping each other?
>>
>> Can the individual nodes dns-lookup the other node-names
>> (which you used
>> in your cluster.conf)? (Try to add them to your etc/hosts
>> file, that way
>> you have a working cluster even if your dns-system is going
>> on
>> vacation.)
>>
>> Is your switch allowing multicast communication on all
>> ports that are
>> used for cluster communication? (This is a prerequisite for
>> openais /
>> corosync based cman which would be anything >= RHEL 5.
>> Search the
>> archives on this if you need more info...)
>>
>> Can you trace (eg. with wiresharks tshark) incoming
>> cluster
>> communication from remote nodes? (If you don't changed your
>> fencing to
>> fence_manual your listening system will get fenced before
>> you can get
>> any useful information out of it. Try with and without
>> active firewall.)
>>
>> If all above could be answered with "yes" your cluster
>> should form just
>> fine. You could try to add a qdisk-device as tiebreaker
>> after that and
>> test it just to be sure you have a working last man
>> standing setup...
>>
>> Hope that helps,
>>
>> Marc
>>     


From abednegoyulo at yahoo.com  Sat Jul 18 15:47:28 2009
From: abednegoyulo at yahoo.com (Abed-nego G. Escobal, Jr.)
Date: Sat, 18 Jul 2009 08:47:28 -0700 (PDT)
Subject: [Linux-cluster] Starting two-node cluster with only one node
In-Reply-To: <4A61D5E5.4030702@wol.de>
Message-ID: <667406.91442.qm@web110403.mail.gq1.yahoo.com>


Hi!

I am very sorry that I did not mention that when I am testing different suggestions on solving this, I always temporarily disable the firewall. Then turning it back on after the testing.

Thank you very much on the tip for the tshark! I will post the output as as soon as I get a maintenance window to restart the cman service.

With regards to openais, it is still off on both servers. Should it be turned on on boot? I am very sorry but I haven't read it in the manuals that it should be "on".


--- On Sat, 7/18/09, Marc - A. Dahlhaus <mad at wol.de> wrote:

> From: Marc - A. Dahlhaus <mad at wol.de>
> Subject: Re: [Linux-cluster] Starting two-node cluster with only one node
> To: "linux clustering" <linux-cluster at redhat.com>
> Date: Saturday, 18 July, 2009, 10:02 PM
> Hello,
> 
> as your cluster worked well on centos 5.2 the networking
> hardware 
> components couldn't be the culprit in this case but is
> still think that 
> it is an cluster communication related problem.
> 
> It could be your iptables ruleset... Try to disable the
> firewall and 
> check again...
> 
> You can use tshark to check this as well in this case by
> using something 
> like this:
> 
> tshark -i <interface cluster is useing> -f 'host
> <multicast-ip cluster 
> is useing>' -V | less
> 
> Have you checked that openais is still chkconfig off after
> your upgrade?
> 
> Abed-nego G. Escobal, Jr. schrieb:
> > Thanks for giving the pointers!
> >
> > uname -r on both nodes
> >
> > 2.6.18-128.1.16.el5
> >
> > on node01
> >
> > rpm -q cman gfs-utils kmod-gfs modcluster ricci luci
> cluster-snmp iscsi-initiator-utils lvm2-cluster openais
> oddjob rgmanager
> > cman-2.0.98-2chrissie
> > gfs-utils-0.1.18-1.el5
> > kmod-gfs-0.1.23-5.el5_2.4
> > kmod-gfs-0.1.31-3.el5
> > modcluster-0.12.1-2.el5.centos
> > ricci-0.12.1-7.3.el5.centos.1
> > luci-0.12.1-7.3.el5.centos.1
> > cluster-snmp-0.12.1-2.el5.centos
> > iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
> > lvm2-cluster-2.02.40-7.el5
> > openais-0.80.3-22.el5_3.8
> > oddjob-0.27-9.el5
> > rgmanager-2.0.46-1.el5.centos.3
> >
> > on node02
> >
> > rpm -q cman gfs-utils kmod-gfs modcluster ricci luci
> cluster-snmp iscsi-initiator-utils lvm2-cluster openais
> oddjob rgmanager
> > cman-2.0.98-2chrissie
> > gfs-utils-0.1.18-1.el5
> > kmod-gfs-0.1.31-3.el5
> > modcluster-0.12.1-2.el5.centos
> > ricci-0.12.1-7.3.el5.centos.1
> > luci-0.12.1-7.3.el5.centos.1
> > cluster-snmp-0.12.1-2.el5.centos
> > iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
> > lvm2-cluster-2.02.40-7.el5
> > openais-0.80.3-22.el5_3.8
> > oddjob-0.27-9.el5
> > rgmanager-2.0.46-1.el5.centos.3
> >
> > I used http://knowledgelayer.softlayer.com/questions/443/GFS+howto
> to configure my cluster. When it was still on 5.2 the
> cluster worked, but after the recent update to 5.3, it
> broke.
> >
> > On one of the threads that I have found in the
> archive, it states that there is a problem with the most
> current official version of cman, bug id 485026. I replaced
> the most current cman package with cman-2.0.98-2chrissie
> because I tested if this was my problem, seems not so I will
> be moving back to the official package.
> > I also found on another thread that openais was the
> culprit, changed it back to openais-0.80.3-15.el5 even
> though the change log indicates a lot of bug fixes were done
> on the most current official package. After doing it, it
> still did not work. I tried clean_start="1" with caution. I
> unmounted the iscsi then started cman but still it did not
> work. The most recent is post_join_delay="-1", I did not
> noticed that there was a man for fenced, which is much safer
> than clean_start="1" but still it did not fixed it. The man
> pages that I have read over and over again is cman and
> cluster.conf. Some pages in the online manual is somewhat
> not suitable for my situation because I do not have X
> installed on the machines and some pages in the online
> manual used system-config-cluster.
> >
> > As I understand in the online manual and FAQ, qdisk is
> not required if I have two_nodes="1" so I did not create
> any. I have removed the fence_daemon tag since I only used
> it for trying the solutions that were suggested. The hosts
> are present in each others hosts with correct ips.
> >
> >
> > The ping results
> >
> > ping node02.company.com
> >
> > --- node01.company.com ping statistics ---
> > 10 packets transmitted, 10 received, 0% packet loss,
> time 8999ms
> > rtt min/avg/max/mdev = 0.010/0.016/0.034/0.007 ms
> >
> > ping node01.company.com
> >
> > --- node01.company.com ping statistics ---
> > 10 packets transmitted, 10 received, 0% packet loss,
> time 9003ms
> > rtt min/avg/max/mdev = 0.341/0.668/1.084/0.273 ms
> >
> > According to the people in the data center, the switch
> supports multicast communication on all ports that are used
> for cluster communication because they are in the same
> VLAN.
> >
> > For the logs, I will sending fresh logs as soon as
> possible. Currently I have not enough time window to bring
> down the machine.
> >
> > For the wireshark, I will be reading the man pages on
> how to use it.
> >
> > Please advise if any other information is needed to
> solve this. I am very grateful for the very detailed
> pointers. Thank you very much! 
> >
> >
> > --- On Fri, 7/17/09, Marc - A. Dahlhaus [
> Administration | Westermann GmbH ] <mad at wol.de> wrote:
> >
> >???
> >> From: Marc - A. Dahlhaus [ Administration |
> Westermann GmbH ] <mad at wol.de>
> >> Subject: Re: [Linux-cluster] Starting two-node
> cluster with only one node
> >> To: "linux clustering" <linux-cluster at redhat.com>
> >> Date: Friday, 17 July, 2009, 5:56 PM
> >> Hello,
> >>
> >>
> >> can you give us some hard facts on what versions
> of
> >> cluster-suite
> >> packages you are using in your environment and
> also the
> >> related logs?
> >>
> >> Have you read the corresponding parts of the
> cluster suites
> >> manual, man
> >> pages, FAQ and also searched the list-archives for
> similar
> >> problems
> >> already? If not -> do it, there are may good
> hints to
> >> find there.
> >>
> >>
> >> The nodes find each other and create a cluster
> very fast IF
> >> they can
> >> talk to each other. As no cluster networking is
> involved in
> >> fencing a
> >> remote node if the fencing node by itself is
> quorate this
> >> could be your
> >> problem.
> >>
> >> You should change to fence_manual and switch back
> to your
> >> real fencing
> >> devices after you have debuged your problem. Also
> get rid
> >> of the
> >> <fence_daemon ... /> tag in your
> cluster.conf as
> >> fenced does the right
> >> thing by default if the remaining configuration is
> right
> >> and now it is
> >> just hiding a part of the problem.
> >>
> >> Also the 5 minute break on cman start smells like
> a
> >> DNS-lookup problem
> >> or other network related problem to me.
> >>
> >> Here is a short check-list to be sure the nodes
> can talk to
> >> each other:
> >>
> >> Can the individual nodes ping each other?
> >>
> >> Can the individual nodes dns-lookup the other
> node-names
> >> (which you used
> >> in your cluster.conf)? (Try to add them to your
> etc/hosts
> >> file, that way
> >> you have a working cluster even if your dns-system
> is going
> >> on
> >> vacation.)
> >>
> >> Is your switch allowing multicast communication on
> all
> >> ports that are
> >> used for cluster communication? (This is a
> prerequisite for
> >> openais /
> >> corosync based cman which would be anything >=
> RHEL 5.
> >> Search the
> >> archives on this if you need more info...)
> >>
> >> Can you trace (eg. with wiresharks tshark)
> incoming
> >> cluster
> >> communication from remote nodes? (If you don't
> changed your
> >> fencing to
> >> fence_manual your listening system will get fenced
> before
> >> you can get
> >> any useful information out of it. Try with and
> without
> >> active firewall.)
> >>
> >> If all above could be answered with "yes" your
> cluster
> >> should form just
> >> fine. You could try to add a qdisk-device as
> tiebreaker
> >> after that and
> >> test it just to be sure you have a working last
> man
> >> standing setup...
> >>
> >> Hope that helps,
> >>
> >> Marc
> >>? ???
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 


      &quot;Try the new FASTER Yahoo! Mail. Experience it today at http://ph.mail.yahoo.com&quot;


From mad at wol.de  Sat Jul 18 23:03:46 2009
From: mad at wol.de (Marc - A. Dahlhaus)
Date: Sun, 19 Jul 2009 01:03:46 +0200
Subject: [Linux-cluster] Starting two-node cluster with only one node
In-Reply-To: <667406.91442.qm@web110403.mail.gq1.yahoo.com>
References: <667406.91442.qm@web110403.mail.gq1.yahoo.com>
Message-ID: <4A6254D2.1050003@wol.de>

Hello,

Abed-nego G. Escobal, Jr. schrieb:
> Hi!
>
> I am very sorry that I did not mention that when I am testing different suggestions on solving this, I always temporarily disable the firewall. Then turning it back on after the testing.
>
> Thank you very much on the tip for the tshark! I will post the output as as soon as I get a maintenance window to restart the cman service.
>
> With regards to openais, it is still off on both servers. Should it be turned on on boot? I am very sorry but I haven't read it in the manuals that it should be "on".
>   
No leave it off as cman starts openais with configuration based on your 
cluster.conf.
If openais starts via init your cluster will not work like it should...

Marc


From fxmulder at gmail.com  Mon Jul 20 15:43:14 2009
From: fxmulder at gmail.com (James Devine)
Date: Mon, 20 Jul 2009 09:43:14 -0600
Subject: [Linux-cluster] disk fencing
In-Reply-To: <36df569a0907172350g5b381e3bt87abd7fbe6fe8677@mail.gmail.com>
References: <c30750500907170920n28f2de1yf7dc185fdf7e6d92@mail.gmail.com>
	<36df569a0907172350g5b381e3bt87abd7fbe6fe8677@mail.gmail.com>
Message-ID: <c30750500907200843q77e8baf9sabd4555e5d775fc8@mail.gmail.com>

I had looked at qdisk, it looks like qdisk is just a way for the nodes
to share information about what it thinks the current cluster status
is.  It looked like external fencing still needed to take place, which
was normally some power or network intervention.

I was thinking of using the disk to do the fencing also.  It could use
the status information provided in the quorum disk for nodes to
determine if they are fenced off or not.  In the case of complete
cutoff from disk, the remaining nodes would have to work under the
assumption that the failed node(s) were no longer trying to access
disk as they were making no more status updates to the quorum disk for
a period of time.  So they would be "fenced off" as it were, and the
remaining nodes could continue on without them until the node(s) came
back and made further status updates to the quorum disk.  This way,
fencing could be done completely independent of the hardware running,
no need for network or power management.


On Sat, Jul 18, 2009 at 12:50 AM, Ian Hayes<cthulhucalling at gmail.com> wrote:
> I'm not sure what you're asking here, but it sounds like you're describing a
> qdisk.
>
> If a node loses heartbeat with the rest of the cluster, that's a fencin'.
> Doesn't matter if it can still access the shared storage, and if it has lost
> communication with the rest of the cluster, you probably don't want it
> accessing your data anyway.
>
> On Fri, Jul 17, 2009 at 9:20 AM, James Devine <fxmulder at gmail.com> wrote:
>>
>> Has anybody looked into using the network for heartbeat only, and disk
>> for fencing in GFS? ?i.e. using the disk to communicate quorum when
>> network heartbeat is lost between 1 or more nodes. ?If the disk is
>> still accessible to all nodes, this should be a valid way to
>> communicate quorum, if not, then the remaining nodes, assuming enough
>> for quorum, should be able to continue knowing that nodes it can't
>> communicate with either have been fenced or can't read/write to disk
>> anyway. ?Does this sound like a valid approach?
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From pradhanparas at gmail.com  Mon Jul 20 15:43:17 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Mon, 20 Jul 2009 10:43:17 -0500
Subject: [Linux-cluster] Cluster failover
In-Reply-To: <8786b91c0907172026g3cdec94bw946187e2fdf3181a@mail.gmail.com>
References: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com>
	<8786b91c0907172026g3cdec94bw946187e2fdf3181a@mail.gmail.com>
Message-ID: <8b711df40907200843k63a37d07l6adca62aceae0478@mail.gmail.com>

On Fri, Jul 17, 2009 at 10:26 PM, Rajagopal
Swaminathan<raju.rajsand at gmail.com> wrote:
> Greetings,
>
> You basically need to define multiple failover domains.
>
> On Sat, Jul 18, 2009 at 1:45 AM, Paras pradhan<pradhanparas at gmail.com> wrote:
>> I have 3 virtual machine ?running on node 1 and 2 virtual
>> machines running on node 2.
>
>> Now if node 1 fails I want my the node1
>> virtual machines ?to be stared only on node 3 but not on node2.
> Failover domain 1 consisting of Node 1 and Node 3
>
>> Similary if node2 breaks, I want virtual machines to be started on
>> node3 but never on node 1.
> Failover domain 2 consisting of Node 2 and Node 3
>
> HTH
>
> Thanks and Regards
>
> Rajagopal
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


Thanks ! Will try it out.

Paras.


From pradhanparas at gmail.com  Mon Jul 20 15:43:40 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Mon, 20 Jul 2009 10:43:40 -0500
Subject: [Linux-cluster] Cluster failover
In-Reply-To: <36df569a0907172356g55aab93fwdd12a660f36e49f9@mail.gmail.com>
References: <8b711df40907171315n70057804rbc44d7fb255523af@mail.gmail.com>
	<36df569a0907172356g55aab93fwdd12a660f36e49f9@mail.gmail.com>
Message-ID: <8b711df40907200843s20585d2fo3929baaea9f32125@mail.gmail.com>

On Sat, Jul 18, 2009 at 1:56 AM, Ian Hayes<cthulhucalling at gmail.com> wrote:
> Specify 2 different failover domains for the services. I have a similar
> setup for a project with a 3 node cluster. Node 1 runs Service A, Node 2
> runs Service B and Node 3 is the floater
>
> Failover Domain 1: Node 1, Node 3
> Failover Domain 2: Node 2, Node 3
>
>
> Service A: Failover Domain1
> Service B: Failover Domain2
>
> On Fri, Jul 17, 2009 at 1:15 PM, Paras pradhan <pradhanparas at gmail.com>
> wrote:
>>
>> hi,
>>
>> I have 3 nodes of CentOS 5.3 running xen virtual machines as virtual
>> machine service. This cluster is working fine. One thing I would like
>> to know that how to make failover only to ?third node. What I mean to
>> say is: I have 3 virtual machine ?running on node 1 and 2 virtual
>> machines running on node 2. Now if node 1 fails I want my the node1
>> virtual machines ?to be stared only on node 3 but not on node2.
>> Similary if node2 breaks, I want virtual machines to be started on
>> node3 but never on node 1.
>>
>> Thanks !
>> Paras.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

Thanks ! Will try it out.


From pschobel at 1iopen.net  Mon Jul 20 17:20:44 2009
From: pschobel at 1iopen.net (Peter Schobel)
Date: Mon, 20 Jul 2009 10:20:44 -0700
Subject: [Linux-cluster] Re: rm -r on gfs2 filesystem is very slow
In-Reply-To: <c8c8ded30907081358r5179cdd5m2bd63bb4d3fce6da@mail.gmail.com>
References: <c8c8ded30907081358r5179cdd5m2bd63bb4d3fce6da@mail.gmail.com>
Message-ID: <c8c8ded30907201020ye42398eqbe450de1ea9f5887@mail.gmail.com>

In followup on this thread,

We discovered that the reason why we didn't notice this performance
problem in our initial proof of concept was because some directories
were recently added to our source repository which contain a large
number of files and subdirectories. I tested a number of directories
and discovered that other similar sized directories (1 to 2 G) could
remove in 1 to 3 seconds but the 1.6 G directory referenced below took
~ 9 minutes. This particular directory contained ~ 10,000
subdirectories containing a total of 65,000+ files.

Further tests with touching empty files revealed that a directory with
65,000 empty files removed relatively quickly while a directory with
65,000 subdirectories containing 1 empty file each took a very long
time to remove. Using the find command to identify files and piping
the results to rm was much faster than using rm -r.

Also, mounting the fs with the nodiratime option and setting:

        <dlm plock_ownership="1" plock_rate_limit="0"/>
        <gfs_controld plock_rate_limit="0"/>

in cluster.conf reduced rm -r time on the directory referenced below
from ~ 9 mins to ~ 5 mins.

Since we now understand the problem a little bit better we are able to
implement some workarounds to get us by.

Peter
~

On Wed, Jul 8, 2009 at 1:58 PM, Peter Schobel<pschobel at 1iopen.net> wrote:
> I am trying to set up a four node cluster but am getting very poor
> performance when removing large directories. A directory approximately
> 1.6G ?in size takes around 5 mins to remove from the gfs2 filesystem
> but removes in around 10 seconds from the local disk.
>
> I am using CentOS 5.3 with kernel 2.6.18-128.1.16.el5PAE.
>
> The filesystem was formatted in the following manner: mkfs.gfs2 -t
> wtl_build:dev_home00 -p lock_dlm -j 10
> /dev/mapper/VolGroupGFS-LogVolDevHome00 and is being mounted with the
> following options: _netdev,noatime,defaults.
>
> If anyone knows what could be causing this please let me know. I'm
> happy to provide any other information.
>
> Regards,
>
> --
> Peter Schobel
> ~
>


-- 
Peter Schobel
~


From pschobel at 1iopen.net  Mon Jul 20 21:20:25 2009
From: pschobel at 1iopen.net (Peter Schobel)
Date: Mon, 20 Jul 2009 14:20:25 -0700
Subject: [Linux-cluster] kernel BUG at fs/gfs2/rgrp.c:1458!
Message-ID: <c8c8ded30907201420i3958481and569434a58a32043@mail.gmail.com>

We are experiencing fatal exceptions on a four node Linux cluster
using a gfs2 filesystem. Any help would be appreciated. Am happy to
provide additional info.

kernel BUG at fs/gfs2/rgrp.c:1458!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm
configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon
backlight sbs i2c_ec
i2c_cord
CPU:    0
EIP:    0060:[<f96e04da>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-128.1.16.el5PAE #1)
EIP is at gfs2_alloc_data+0x75/0x155 [gfs2]
eax: ffffffff   ebx: 00000000   ecx: 00000000   edx: 00000001
esi: 05ec1513   edi: 00000000   ebp: f51aa114   esp: edf96c74
ds: 007b   es: 007b   ss: 0068
Process p4v.bin (pid: 8895, ti=edf96000 task=ccfe0550 task.ti=edf96000)
Stack: d3b34548 f7242000 d9663380 00000000 d3b34548 00000000 d4575000 f96c4db2
       cf449378 d4575140 00001000 00000000 cf449378 d3b34548 edf96cf4 f96c50a0
       edf96cf4 00000001 edf96d18 edf96d10 00000000 0000000c 00000000 0000c000
Call Trace:
 [<f96c4db2>] lookup_block+0xb4/0x153 [gfs2]
 [<f96c50a0>] gfs2_block_map+0x24f/0x392 [gfs2]
 [<c047364e>] set_bh_page+0x43/0x4c
 [<c047371f>] alloc_page_buffers+0x74/0xba
 [<c0474455>] __block_prepare_write+0x1a2/0x439
 [<f96cc405>] do_promote+0xe8/0x10b [gfs2]
 [<c0474702>] block_prepare_write+0x16/0x23
 [<f96c4e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d5005>] gfs2_write_begin+0x2af/0x359 [gfs2]
 [<f96c4e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d6823>] gfs2_file_buffered_write+0x10d/0x287 [gfs2]
 [<c0428969>] current_fs_time+0x4a/0x55
 [<f96d6c71>] __gfs2_file_aio_write_nolock+0x2d4/0x32d [gfs2]
 [<c05ab197>] sock_aio_read+0x53/0x61
 [<f96d6e24>] gfs2_file_write_nolock+0xb0/0x111 [gfs2]
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<f96d6f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<f96d6f55>] gfs2_file_write+0x3a/0x94 [gfs2]
 [<f96d6f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<c04720ef>] vfs_write+0xa1/0x143
 [<c04726e1>] sys_write+0x3c/0x63
 [<c0404ead>] sysenter_past_esp+0x56/0x79
 =======================
Code: 16 31 d2 01 f0 11 fa 39 d3 77 0c 72 04 39 c1 73 06 89 ca 29 f2
eb 03 8b 55 70 31 c9 89 e8 6a 01 e8 39 e8 ff ff 5a 83 f8 ff 75 08 <0f>
0b b2 05
7c 5d 6
EIP: [<f96e04da>] gfs2_alloc_data+0x75/0x155 [gfs2] SS:ESP 0068:edf96c74
 <0>Kernel panic - not syncing: Fatal exception


kernel BUG at fs/gfs2/rgrp.c:1458!
invalid opcode: 0000 [#1]
SMP
last sysfs file:
/devices/pci0000:00/0000:00:02.0/0000:04:00.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/irq
Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm
configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon
backlight sbs i2c_ec i2c_cord
CPU:    1
EIP:    0060:[<f96df4da>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-128.1.16.el5PAE #1)
EIP is at gfs2_alloc_data+0x75/0x155 [gfs2]
eax: ffffffff   ebx: 00000000   ecx: 00000000   edx: 00000001
esi: 05ec1513   edi: 00000000   ebp: f0e11858   esp: f01f1c74
ds: 007b   es: 007b   ss: 0068
Process cp (pid: 31700, ti=f01f1000 task=f2e83550 task.ti=f01f1000)
Stack: f4361d68 f6cb1000 e45b7dc0 00000000 f4361d68 00000001 d25b8000 f96c3db2
       d12d53ac d25b8e10 00001000 00000001 e40686ec f4361d68 f01f1cf4 f96c40a0
       f01f1cf4 00000001 f01f1d18 f01f1d10 00000000 000013a5 00000000 013a5000
Call Trace:
 [<f96c3db2>] lookup_block+0xb4/0x153 [gfs2]
 [<f96c40a0>] gfs2_block_map+0x24f/0x392 [gfs2]
 [<c047364e>] set_bh_page+0x43/0x4c
 [<c047371f>] alloc_page_buffers+0x74/0xba
 [<c0474455>] __block_prepare_write+0x1a2/0x439
 [<f96cb405>] do_promote+0xe8/0x10b [gfs2]
 [<c0474702>] block_prepare_write+0x16/0x23
 [<f96c3e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d4005>] gfs2_write_begin+0x2af/0x359 [gfs2]
 [<f96c3e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d5823>] gfs2_file_buffered_write+0x10d/0x287 [gfs2]
 [<c0428969>] current_fs_time+0x4a/0x55
 [<f96d5c71>] __gfs2_file_aio_write_nolock+0x2d4/0x32d [gfs2]
 [<f96d5e24>] gfs2_file_write_nolock+0xb0/0x111 [gfs2]
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<f96d5f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<f96d5f55>] gfs2_file_write+0x3a/0x94 [gfs2]
 [<f96d5f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<c04720ef>] vfs_write+0xa1/0x143
 [<c04726e1>] sys_write+0x3c/0x63
 [<c0404ead>] sysenter_past_esp+0x56/0x79
 =======================
Code: 16 31 d2 01 f0 11 fa 39 d3 77 0c 72 04 39 c1 73 06 89 ca 29 f2
eb 03 8b 55 70 31 c9 89 e8 6a 01 e8 39 e8 ff ff 5a 83 f8 ff 75 08 <0f>
0b b2 05 7c 4d 6
EIP: [<f96df4da>] gfs2_alloc_data+0x75/0x155 [gfs2] SS:ESP 0068:f01f1c74
 <0>Kernel panic - not syncing: Fatal exception

Thanks in advance,

-- 
Peter Schobel
~


From brem.belguebli at gmail.com  Tue Jul 21 09:21:23 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Tue, 21 Jul 2009 11:21:23 +0200
Subject: [Linux-cluster] CLVMD without GFS
Message-ID: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>

Hi all,

I think there is something to clarify about using CLVM across a cluster in a
active/passive mode without GFS.

>From my understanding, CLVM keeps LVM metadata coherent among the cluster
nodes and provides a cluster wide locking mechanism that can prevent any
node from trying to activate a volume group if it has been activated
exclusively (vgchange -a e VGXXX)  by another node (which needs to be up).

I have been playing with it to check this behaviour but it doesn't seem to
make what is expected.

I have 2 nodes (RHEL 5.3 X86_64, cluster installed and configured) , A and B
using a SAN shared storage.

I  have a LUN from this SAN seen by both nodes, pvcreate'd /dev/mpath/mpath0
, vgcreate'd vg10 and lvcreate'd lvol1 (on one node), created an ext3 FS on
/dev/vg10/lvol1

CLVM is running in debug mode (clvmd -d2 ) (but it complains about locking
disabled though locking set to 3 on both nodes)

On node A:

          vgchange -c y vg10 returns OK (vgs -->  vg10     1   1   0 wz--nc)

          vgchange -a e --> OK

          lvs returns lvol1   vg10   -wi-a-

On node B (while things are active on A, A is UP and member of the
cluster ):

          vgchange -a e --> Error locking on node B: Volume is busy on
another node
                                   1 logical volume(s) in volume group
"vg10" now active

 It activates vg10 even if it sees it busy on another node .

on B, lvs returns lvol1   vg10   -wi-a-

as well as on A.

I think the main problem comes from the fact that, as it is said when
starting CLVM in debug mode,  WARNING: Locking disabled. Be careful! This
could corrupt your metadata.

IMHO, the algorithm should be as follows:

VG is tagged as clustered (vgchange -c y VGXXX)

if a node (node P) tries to activate the VG exclusively (vgchange -a VGXXX)

ask the lock manager to check if VG is not already locked by another node
(node X)

if so, check if node X is up

if node X is down, return OK to node P

else

return NOK to node P (explicitely that VG is held exclusively by node X)

Brem

PS: this shouldn't be a problem with GFS or other clustered FS (OCFS,
etc...) as no node should try to activate exclusively any VG.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090721/7d3d534b/attachment.htm>

From ccaulfie at redhat.com  Tue Jul 21 09:55:58 2009
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 21 Jul 2009 10:55:58 +0100
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
Message-ID: <4A6590AE.2040808@redhat.com>

Hiya,

I've just tried this on my cluster and it works fine.

What you need to remember is that lvcreate on one node will also 
activate the LV on all nodes in the cluster - it does an implicit 
lvchange -ay when you create it. What I can't explain is why vgchange 
-ae seemed to work fine on node A, it should give the same error as on 
node B because LVs are open shared on both nodes.

Its not clear to me when you tagged the VG as clustered, so that might 
be contributing to the problem. When I create a new VG on shared storage 
it automatically gets labelled clustered so I have never needed to do 
this explicitly. If you create a non-clustered VG you probably ought to 
deactivate it on all nodes first as it could mess up the locking 
otherwise. This *might* be the cause of your troubles.

The error on clvmd startup can be ignored. It's caused by clvmd ussing a 
background command with --no_locking so that it can check which LVs (if 
any) are already active and re-acquire locks for them

Sorry this isn't conclusive, The exact order in which things are 
happening is not clear to me.

Chrissie.

On 07/21/2009 10:21 AM, brem belguebli wrote:
> Hi all,
> I think there is something to clarify about using CLVM across a cluster
> in a active/passive mode without GFS.
>  From my understanding, CLVM keeps LVM metadata coherent among the
> cluster nodes and provides a cluster wide locking mechanism that can
> prevent any node from trying to activate a volume group if it has been
> activated exclusively (vgchange -a e VGXXX)  by another node (which
> needs to be up).
> I have been playing with it to check this behaviour but it doesn't seem
> to make what is expected.
> I have 2 nodes (RHEL 5.3 X86_64, cluster installed and configured) , A
> and B using a SAN shared storage.
> I  have a LUN from this SAN seen by both nodes, pvcreate'd
> /dev/mpath/mpath0 , vgcreate'd vg10 and lvcreate'd lvol1 (on one node),
> created an ext3 FS on /dev/vg10/lvol1
> CLVM is running in debug mode (clvmd -d2 ) (but it complains about
> locking disabled though locking set to 3 on both nodes)
> On node A:
>            vgchange -c y vg10 returns OK (vgs -->  vg10     1   1   0
> wz--nc)
>            vgchange -a e --> OK
>            lvs returns lvol1   vg10   -wi-a-
> On node B (while things are active on A, A is UP and member of the
> cluster ):
>            vgchange -a e --> Error locking on node B: Volume is busy on
> another node
>                                     1 logical volume(s) in volume group
> "vg10" now active
> It activates vg10 even if it sees it busy on another node .
> on B, lvs returns lvol1   vg10   -wi-a-
> as well as on A.
> I think the main problem comes from the fact that, as it is said when
> starting CLVM in debug mode,  WARNING: Locking disabled. Be careful!
> This could corrupt your metadata.
> IMHO, the algorithm should be as follows:
> VG is tagged as clustered (vgchange -c y VGXXX)
> if a node (node P) tries to activate the VG exclusively (vgchange -a VGXXX)
> ask the lock manager to check if VG is not already locked by another
> node (node X)
> if so, check if node X is up
> if node X is down, return OK to node P
> else
> return NOK to node P (explicitely that VG is held exclusively by node X)
> Brem
> PS: this shouldn't be a problem with GFS or other clustered FS (OCFS,
> etc...) as no node should try to activate exclusively any VG.
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From brem.belguebli at gmail.com  Tue Jul 21 11:16:19 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Tue, 21 Jul 2009 13:16:19 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <4A6590AE.2040808@redhat.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<4A6590AE.2040808@redhat.com>
Message-ID: <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>

Hi Chrissie,

Indeed, by default when creating the VG, it is clustered, thus when creating
the LV it is active on all nodes.

To avoid data corruption, I have re created the VG as non clustered
(vgcreate -c n vgXX) then created the LV which got activated only on the
node where it got created.

Then changed the VG to clustered (vgchange -c y VGXX) and activated it
exclusively on this node.

But, I could reproduce the behaviour of bypassing the exclusive flag:

On node B, re change the VG to non clustered though it is activated
exclusively on node A.

and then activate it on node B and it works.

The thing I'm trying to point is that simply by erasing the clustered flag
you can bypass the exclusive activation.

I think a barrier is necessary to prevent this to happen, removing the
clustered flag from a VG should be possible only if the node holding the VG
exclusively is down (does the lock manager  DLM report which node holds
exclusively a VG ?)

Thanks

Brem


2009/7/21, Christine Caulfield <ccaulfie at redhat.com>:
>
> Hiya,
>
> I've just tried this on my cluster and it works fine.
>
> What you need to remember is that lvcreate on one node will also activate
> the LV on all nodes in the cluster - it does an implicit lvchange -ay when
> you create it. What I can't explain is why vgchange -ae seemed to work fine
> on node A, it should give the same error as on node B because LVs are open
> shared on both nodes.
>
> Its not clear to me when you tagged the VG as clustered, so that might be
> contributing to the problem. When I create a new VG on shared storage it
> automatically gets labelled clustered so I have never needed to do this
> explicitly. If you create a non-clustered VG you probably ought to
> deactivate it on all nodes first as it could mess up the locking otherwise.
> This *might* be the cause of your troubles.
>
> The error on clvmd startup can be ignored. It's caused by clvmd ussing a
> background command with --no_locking so that it can check which LVs (if any)
> are already active and re-acquire locks for them
>
> Sorry this isn't conclusive, The exact order in which things are happening
> is not clear to me.
>
> Chrissie.
>
> On 07/21/2009 10:21 AM, brem belguebli wrote:
>
>> Hi all,
>> I think there is something to clarify about using CLVM across a cluster
>> in a active/passive mode without GFS.
>>  From my understanding, CLVM keeps LVM metadata coherent among the
>> cluster nodes and provides a cluster wide locking mechanism that can
>> prevent any node from trying to activate a volume group if it has been
>> activated exclusively (vgchange -a e VGXXX)  by another node (which
>> needs to be up).
>> I have been playing with it to check this behaviour but it doesn't seem
>> to make what is expected.
>> I have 2 nodes (RHEL 5.3 X86_64, cluster installed and configured) , A
>> and B using a SAN shared storage.
>> I  have a LUN from this SAN seen by both nodes, pvcreate'd
>> /dev/mpath/mpath0 , vgcreate'd vg10 and lvcreate'd lvol1 (on one node),
>> created an ext3 FS on /dev/vg10/lvol1
>> CLVM is running in debug mode (clvmd -d2 ) (but it complains about
>> locking disabled though locking set to 3 on both nodes)
>> On node A:
>>           vgchange -c y vg10 returns OK (vgs -->  vg10     1   1   0
>> wz--nc)
>>           vgchange -a e --> OK
>>           lvs returns lvol1   vg10   -wi-a-
>> On node B (while things are active on A, A is UP and member of the
>> cluster ):
>>           vgchange -a e --> Error locking on node B: Volume is busy on
>> another node
>>                                    1 logical volume(s) in volume group
>> "vg10" now active
>> It activates vg10 even if it sees it busy on another node .
>> on B, lvs returns lvol1   vg10   -wi-a-
>> as well as on A.
>> I think the main problem comes from the fact that, as it is said when
>> starting CLVM in debug mode,  WARNING: Locking disabled. Be careful!
>> This could corrupt your metadata.
>> IMHO, the algorithm should be as follows:
>> VG is tagged as clustered (vgchange -c y VGXXX)
>> if a node (node P) tries to activate the VG exclusively (vgchange -a
>> VGXXX)
>> ask the lock manager to check if VG is not already locked by another
>> node (node X)
>> if so, check if node X is up
>> if node X is down, return OK to node P
>> else
>> return NOK to node P (explicitely that VG is held exclusively by node X)
>> Brem
>> PS: this shouldn't be a problem with GFS or other clustered FS (OCFS,
>> etc...) as no node should try to activate exclusively any VG.
>>
>>
>> ------------------------------------------------------------------------
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090721/f2699fe9/attachment.htm>

From ccaulfie at redhat.com  Tue Jul 21 11:48:50 2009
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 21 Jul 2009 12:48:50 +0100
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>	<4A6590AE.2040808@redhat.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
Message-ID: <4A65AB22.1030601@redhat.com>

Hiya,

If you make a VG non-clustered then you, by definition, forfeit all 
cluster locking protection on all of the LVs in that group.

In practise you should always make shared volumes clustered and 
non-shared volumes non-clustered. If you make a shared volume 
non-clustered then it's up to you to manage the protection of it.

It might be possible to put some protection in for when a volume group 
is changed from clustered to non-clustered, but really, you're not 
supposed to do that!

Look at it this way, if you create a shared VG and mark it non-clustered 
to start with, then you can corrupt it as much as you like by mounting 
its filesystems on multiple nodes. It's just the same as a SAN volume then.


On 07/21/2009 12:16 PM, brem belguebli wrote:
> Hi Chrissie,
> Indeed, by default when creating the VG, it is clustered, thus when
> creating the LV it is active on all nodes.
> To avoid data corruption, I have re created the VG as non clustered
> (vgcreate -c n vgXX) then created the LV which got activated only on the
> node where it got created.
> Then changed the VG to clustered (vgchange -c y VGXX) and activated it
> exclusively on this node.
> But, I could reproduce the behaviour of bypassing the exclusive flag:
> On node B, re change the VG to non clustered though it is activated
> exclusively on node A.
> and then activate it on node B and it works.
> The thing I'm trying to point is that simply by erasing the clustered
> flag you can bypass the exclusive activation.
> I think a barrier is necessary to prevent this to happen, removing the
> clustered flag from a VG should be possible only if the node holding the
> VG exclusively is down (does the lock manager  DLM report which node
> holds exclusively a VG ?)
> Thanks
> Brem
>
>
> 2009/7/21, Christine Caulfield <ccaulfie at redhat.com
> <mailto:ccaulfie at redhat.com>>:
>
>     Hiya,
>
>     I've just tried this on my cluster and it works fine.
>
>     What you need to remember is that lvcreate on one node will also
>     activate the LV on all nodes in the cluster - it does an implicit
>     lvchange -ay when you create it. What I can't explain is why
>     vgchange -ae seemed to work fine on node A, it should give the same
>     error as on node B because LVs are open shared on both nodes.
>
>     Its not clear to me when you tagged the VG as clustered, so that
>     might be contributing to the problem. When I create a new VG on
>     shared storage it automatically gets labelled clustered so I have
>     never needed to do this explicitly. If you create a non-clustered VG
>     you probably ought to deactivate it on all nodes first as it could
>     mess up the locking otherwise. This *might* be the cause of your
>     troubles.
>
>     The error on clvmd startup can be ignored. It's caused by clvmd
>     ussing a background command with --no_locking so that it can check
>     which LVs (if any) are already active and re-acquire locks for them
>
>     Sorry this isn't conclusive, The exact order in which things are
>     happening is not clear to me.
>
>     Chrissie.
>
>
>     On 07/21/2009 10:21 AM, brem belguebli wrote:
>
>         Hi all,
>         I think there is something to clarify about using CLVM across a
>         cluster
>         in a active/passive mode without GFS.
>           From my understanding, CLVM keeps LVM metadata coherent among the
>         cluster nodes and provides a cluster wide locking mechanism that can
>         prevent any node from trying to activate a volume group if it
>         has been
>         activated exclusively (vgchange -a e VGXXX)  by another node (which
>         needs to be up).
>         I have been playing with it to check this behaviour but it
>         doesn't seem
>         to make what is expected.
>         I have 2 nodes (RHEL 5.3 X86_64, cluster installed and
>         configured) , A
>         and B using a SAN shared storage.
>         I  have a LUN from this SAN seen by both nodes, pvcreate'd
>         /dev/mpath/mpath0 , vgcreate'd vg10 and lvcreate'd lvol1 (on one
>         node),
>         created an ext3 FS on /dev/vg10/lvol1
>         CLVM is running in debug mode (clvmd -d2 ) (but it complains about
>         locking disabled though locking set to 3 on both nodes)
>         On node A:
>                    vgchange -c y vg10 returns OK (vgs -->  vg10     1
>         1   0
>         wz--nc)
>                    vgchange -a e --> OK
>                    lvs returns lvol1   vg10   -wi-a-
>         On node B (while things are active on A, A is UP and member of the
>         cluster ):
>                    vgchange -a e --> Error locking on node B: Volume is
>         busy on
>         another node
>                                             1 logical volume(s) in
>         volume group
>         "vg10" now active
>         It activates vg10 even if it sees it busy on another node .
>         on B, lvs returns lvol1   vg10   -wi-a-
>         as well as on A.
>         I think the main problem comes from the fact that, as it is said
>         when
>         starting CLVM in debug mode,  WARNING: Locking disabled. Be careful!
>         This could corrupt your metadata.
>         IMHO, the algorithm should be as follows:
>         VG is tagged as clustered (vgchange -c y VGXXX)
>         if a node (node P) tries to activate the VG exclusively
>         (vgchange -a VGXXX)
>         ask the lock manager to check if VG is not already locked by another
>         node (node X)
>         if so, check if node X is up
>         if node X is down, return OK to node P
>         else
>         return NOK to node P (explicitely that VG is held exclusively by
>         node X)
>         Brem
>         PS: this shouldn't be a problem with GFS or other clustered FS
>         (OCFS,
>         etc...) as no node should try to activate exclusively any VG.
>
>
>         ------------------------------------------------------------------------
>
>         --
>         Linux-cluster mailing list
>         Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>         https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From listener at may.co.at  Tue Jul 21 11:51:22 2009
From: listener at may.co.at (Wolfgang Hotwagner)
Date: Tue, 21 Jul 2009 13:51:22 +0200
Subject: [Linux-cluster] Waiting for fenced to join the fence group
Message-ID: <4A65ABBA.3030305@may.co.at>

Hello,

i am not able to make a gfs2-cluster on a drbd-device. I always have the
problem with joining the fence group. I am using a debian stable(lenny)
system. On eth0 there is also a ctdb-service which enables 2 additional
ip's. Maybe someone could help me to get it working..

Greetings
Wolfgang


dslin1:
eth0: 172.30.50.83
eth1: 10.13.13.2

/etc/hosts:
127.0.0.1       localhost
172.30.50.83    dslin1
172.30.50.84    dslin2
10.13.13.2      node1
10.13.13.3      node2


/proc/drbd:
version: 8.0.14 (api:86/proto:86)
GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by
phil at fat-tyre, 2008-11-12 16:40:33
 0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:12288 dw:12288 dr:0 al:0 bm:3 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:765 misses:3 starving:0 dirty:0 changed:3
        act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0


syslog:
Jul 21 13:38:27 dslin1 ccsd[14975]: Starting ccsd 2.03.09:
Jul 21 13:38:27 dslin1 ccsd[14975]:  Built: Nov  3 2008 18:22:21
Jul 21 13:38:27 dslin1 ccsd[14975]:  Copyright (C) Red Hat, Inc.
2004-2008  All rights reserved.
Jul 21 13:38:28 dslin1 ccsd[14975]: /etc/cluster/cluster.conf (cluster
name = cluster, version = 1) found.
Jul 21 13:38:31 dslin1 ccsd[14975]: Initial status:: Quorate
Jul 21 13:38:35 dslin1 openais[14980]: cman killed by node 2 because we
rejoined the cluster without a full restart
Jul 21 13:38:35 dslin1 groupd[14984]: cman_get_nodes error -1 104
Jul 21 13:38:35 dslin1 gfs_controld[14992]: cluster is down, exiting
Jul 21 13:39:00 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 30 seconds.
Jul 21 13:39:30 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 60 seconds.
Jul 21 13:40:00 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 90 seconds.
Jul 21 13:40:30 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 120 seconds.
and so on..


dslin2:
eth0: 172.30.50.84
eth1: 10.13.13.3

/etc/hosts:
127.0.0.1       localhost
172.30.50.83    dslin1
172.30.50.84    dslin2
10.13.13.2      node1
10.13.13.3      node2


/proc/drbd
version: 8.0.14 (api:86/proto:86)
GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by
phil at fat-tyre, 2008-11-12 16:40:33
 0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---
    ns:12292 nr:0 dw:0 dr:12296 al:0 bm:6 lo:0 pe:0 ua:0 ap:0
        resync: used:0/61 hits:765 misses:3 starving:0 dirty:0 changed:3
        act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0


syslog:
Jul 21 13:38:27 dslin1 ccsd[14975]: Starting ccsd 2.03.09:
Jul 21 13:38:27 dslin1 ccsd[14975]:  Built: Nov  3 2008 18:22:21
Jul 21 13:38:27 dslin1 ccsd[14975]:  Copyright (C) Red Hat, Inc.
2004-2008  All rights reserved.
Jul 21 13:38:28 dslin1 ccsd[14975]: /etc/cluster/cluster.conf (cluster
name = cluster, version = 1) found.
Jul 21 13:38:31 dslin1 ccsd[14975]: Initial status:: Quorate
Jul 21 13:38:35 dslin1 openais[14980]: cman killed by node 2 because we
rejoined the cluster without a full restart
Jul 21 13:38:35 dslin1 groupd[14984]: cman_get_nodes error -1 104
Jul 21 13:38:35 dslin1 gfs_controld[14992]: cluster is down, exiting
Jul 21 13:39:00 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 30 seconds.
Jul 21 13:39:30 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 60 seconds.
Jul 21 13:40:00 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 90 seconds.
Jul 21 13:40:30 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 120 seconds.
Jul 21 13:41:00 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 150 seconds.
Jul 21 13:41:30 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 180 seconds.
Jul 21 13:42:00 dslin1 ccsd[14975]: Unable to connect to cluster
infrastructure after 210 seconds.
and so on..


/etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster name="cluster" config_version="1">
  <!-- post_join_delay: number of seconds the daemon will wait before
                        fencing any victims after a node joins the domain
       post_fail_delay: number of seconds the daemon will wait before
                        fencing any victims after a domain member fails
       clean_start    : prevent any startup fencing the daemon might do.
                        It indicates that the daemon should assume all nodes
                        are in a clean state to start. -->
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="dslin1" votes="1" nodeid="1">
      <fence>
        <!-- Handle fencing manually -->
        <method name="human">
          <device name="human" nodename="dslin1" ipaddr="10.13.13.2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="dslin2" votes="1" nodeid="2">
      <fence>
        <!-- Handle fencing manually -->
        <method name="human">
          <device name="human" nodename="dslin2" ipaddr="10.13.13.3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <!-- cman two nodes specification -->
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <!-- Define manual fencing -->
    <fencedevice name="human" agent="fence_manual"/>
  </fencedevices>
</cluster>


From brem.belguebli at gmail.com  Tue Jul 21 12:11:47 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Tue, 21 Jul 2009 14:11:47 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <4A65AB22.1030601@redhat.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<4A6590AE.2040808@redhat.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
Message-ID: <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>

Hi,

When creating the VG by default clustered, you implicitely assume that it
will be used with a clustered FS on top of it (gfs, ocfs, etc...) that will
handle the active/active mode.

As I do not intend to use GFS in this particular case, but ext3 and raw
devices, I need to make sure the vg is exclusively activated on one node,
preventing the other nodes to access it unless it is the failover procedure
(node holding the VG crashed) and then re activate it exclusively on the
failover node.

Thanks

Brem


2009/7/21, Christine Caulfield <ccaulfie at redhat.com>:
>
> Hiya,
>
> If you make a VG non-clustered then you, by definition, forfeit all cluster
> locking protection on all of the LVs in that group.
>
> In practise you should always make shared volumes clustered and non-shared
> volumes non-clustered. If you make a shared volume non-clustered then it's
> up to you to manage the protection of it.
>
> It might be possible to put some protection in for when a volume group is
> changed from clustered to non-clustered, but really, you're not supposed to
> do that!
>
> Look at it this way, if you create a shared VG and mark it non-clustered to
> start with, then you can corrupt it as much as you like by mounting its
> filesystems on multiple nodes. It's just the same as a SAN volume then.
>
>
> On 07/21/2009 12:16 PM, brem belguebli wrote:
>
>> Hi Chrissie,
>> Indeed, by default when creating the VG, it is clustered, thus when
>> creating the LV it is active on all nodes.
>> To avoid data corruption, I have re created the VG as non clustered
>> (vgcreate -c n vgXX) then created the LV which got activated only on the
>> node where it got created.
>> Then changed the VG to clustered (vgchange -c y VGXX) and activated it
>> exclusively on this node.
>> But, I could reproduce the behaviour of bypassing the exclusive flag:
>> On node B, re change the VG to non clustered though it is activated
>> exclusively on node A.
>> and then activate it on node B and it works.
>> The thing I'm trying to point is that simply by erasing the clustered
>> flag you can bypass the exclusive activation.
>> I think a barrier is necessary to prevent this to happen, removing the
>> clustered flag from a VG should be possible only if the node holding the
>> VG exclusively is down (does the lock manager  DLM report which node
>> holds exclusively a VG ?)
>> Thanks
>> Brem
>>
>>
>> 2009/7/21, Christine Caulfield <ccaulfie at redhat.com
>> <mailto:ccaulfie at redhat.com>>:
>>
>>    Hiya,
>>
>>    I've just tried this on my cluster and it works fine.
>>
>>    What you need to remember is that lvcreate on one node will also
>>    activate the LV on all nodes in the cluster - it does an implicit
>>    lvchange -ay when you create it. What I can't explain is why
>>    vgchange -ae seemed to work fine on node A, it should give the same
>>    error as on node B because LVs are open shared on both nodes.
>>
>>    Its not clear to me when you tagged the VG as clustered, so that
>>    might be contributing to the problem. When I create a new VG on
>>    shared storage it automatically gets labelled clustered so I have
>>    never needed to do this explicitly. If you create a non-clustered VG
>>    you probably ought to deactivate it on all nodes first as it could
>>    mess up the locking otherwise. This *might* be the cause of your
>>    troubles.
>>
>>    The error on clvmd startup can be ignored. It's caused by clvmd
>>    ussing a background command with --no_locking so that it can check
>>    which LVs (if any) are already active and re-acquire locks for them
>>
>>    Sorry this isn't conclusive, The exact order in which things are
>>    happening is not clear to me.
>>
>>    Chrissie.
>>
>>
>>    On 07/21/2009 10:21 AM, brem belguebli wrote:
>>
>>        Hi all,
>>        I think there is something to clarify about using CLVM across a
>>        cluster
>>        in a active/passive mode without GFS.
>>          From my understanding, CLVM keeps LVM metadata coherent among the
>>        cluster nodes and provides a cluster wide locking mechanism that
>> can
>>        prevent any node from trying to activate a volume group if it
>>        has been
>>        activated exclusively (vgchange -a e VGXXX)  by another node (which
>>        needs to be up).
>>        I have been playing with it to check this behaviour but it
>>        doesn't seem
>>        to make what is expected.
>>        I have 2 nodes (RHEL 5.3 X86_64, cluster installed and
>>        configured) , A
>>        and B using a SAN shared storage.
>>        I  have a LUN from this SAN seen by both nodes, pvcreate'd
>>        /dev/mpath/mpath0 , vgcreate'd vg10 and lvcreate'd lvol1 (on one
>>        node),
>>        created an ext3 FS on /dev/vg10/lvol1
>>        CLVM is running in debug mode (clvmd -d2 ) (but it complains about
>>        locking disabled though locking set to 3 on both nodes)
>>        On node A:
>>                   vgchange -c y vg10 returns OK (vgs -->  vg10     1
>>        1   0
>>        wz--nc)
>>                   vgchange -a e --> OK
>>                   lvs returns lvol1   vg10   -wi-a-
>>        On node B (while things are active on A, A is UP and member of the
>>        cluster ):
>>                   vgchange -a e --> Error locking on node B: Volume is
>>        busy on
>>        another node
>>                                            1 logical volume(s) in
>>        volume group
>>        "vg10" now active
>>        It activates vg10 even if it sees it busy on another node .
>>        on B, lvs returns lvol1   vg10   -wi-a-
>>        as well as on A.
>>        I think the main problem comes from the fact that, as it is said
>>        when
>>        starting CLVM in debug mode,  WARNING: Locking disabled. Be
>> careful!
>>        This could corrupt your metadata.
>>        IMHO, the algorithm should be as follows:
>>        VG is tagged as clustered (vgchange -c y VGXXX)
>>        if a node (node P) tries to activate the VG exclusively
>>        (vgchange -a VGXXX)
>>        ask the lock manager to check if VG is not already locked by
>> another
>>        node (node X)
>>        if so, check if node X is up
>>        if node X is down, return OK to node P
>>        else
>>        return NOK to node P (explicitely that VG is held exclusively by
>>        node X)
>>        Brem
>>        PS: this shouldn't be a problem with GFS or other clustered FS
>>        (OCFS,
>>        etc...) as no node should try to activate exclusively any VG.
>>
>>
>>
>>  ------------------------------------------------------------------------
>>
>>        --
>>        Linux-cluster mailing list
>>        Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>>        https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>>    --
>>    Linux-cluster mailing list
>>    Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>>    https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090721/39338d57/attachment.htm>

From ccaulfie at redhat.com  Tue Jul 21 13:23:56 2009
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 21 Jul 2009 14:23:56 +0100
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>	<4A6590AE.2040808@redhat.com>	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
Message-ID: <4A65C16C.20104@redhat.com>

On 07/21/2009 01:11 PM, brem belguebli wrote:
> Hi,
> When creating the VG by default clustered, you implicitely assume that
> it will be used with a clustered FS on top of it (gfs, ocfs, etc...)
> that will handle the active/active mode.
> As I do not intend to use GFS in this particular case, but ext3 and raw
> devices, I need to make sure the vg is exclusively activated on one
> node, preventing the other nodes to access it unless it is the failover
> procedure (node holding the VG crashed) and then re activate it
> exclusively on the failover node.
> Thanks


In that case you probably ought to be using rgmanager to do the failover 
for you. It has a script for doing exactly this :-)

Chrissie


From brem.belguebli at gmail.com  Tue Jul 21 14:40:16 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Tue, 21 Jul 2009 16:40:16 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <4A65C16C.20104@redhat.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<4A6590AE.2040808@redhat.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
Message-ID: <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>

Hi,

That's what I 'm trying to do.

If you mean lvm.sh, well, I've been playing with it, but it does some
"sanity" checks that are wierd

   1. It expects HA LVM to be setup (why such check if we want to use CLVM).
   2. it exits if it finds a CLVM VG  (kind of funny !)
   3. it exits if the lvm.conf is newer than /boot/*.img (about this one, we
   tend to prevent the cluster from automatically starting ...)

I was looking to find some doc on how to write my own resources, ie CLVM
resource that checks if the vg is clustered, if so by which node is it
exclusively held, and if the node is down to activate exclusively the VG.

If you have some good links to provide me, that'll be great.

Thanks


2009/7/21, Christine Caulfield <ccaulfie at redhat.com>:

> On 07/21/2009 01:11 PM, brem belguebli wrote:
>
>> Hi,
>> When creating the VG by default clustered, you implicitely assume that
>> it will be used with a clustered FS on top of it (gfs, ocfs, etc...)
>> that will handle the active/active mode.
>> As I do not intend to use GFS in this particular case, but ext3 and raw
>> devices, I need to make sure the vg is exclusively activated on one
>> node, preventing the other nodes to access it unless it is the failover
>> procedure (node holding the VG crashed) and then re activate it
>> exclusively on the failover node.
>> Thanks
>>
>
>
> In that case you probably ought to be using rgmanager to do the failover
> for you. It has a script for doing exactly this :-)
>
> Chrissie
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090721/34dce422/attachment.htm>

From ccaulfie at redhat.com  Tue Jul 21 14:55:32 2009
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 21 Jul 2009 15:55:32 +0100
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>	<4A6590AE.2040808@redhat.com>	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>	<4A65AB22.1030601@redhat.com>	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
Message-ID: <4A65D6E4.1090905@redhat.com>

It seems a little pointless to integrate clvmd with a failover system. 
They're almost totally different ways of running a cluster. clvmd 
assumes a symmetrical cluster (as you've found out) and is designed so 
that the LVs are available on all nodes for a cluster filesystem. Trying 
to make that sort of system work for a failover installation is always 
going to be awkward, it's not what it was designed for.

That, in part I think, is why HA-LVM checks for a clustered VGs and 
declines to manage them. A resource should be controlled by one manager, 
not two, it's just asking for confusion.

Basically you either use clvmd or HA-LVM; not both together.

If you really want to write a resource manager to use clvmd then feel 
free, I don't have any references but others might. It's not an area I 
have ever had to go into.

Good luck ;-)

Chrissie


On 07/21/2009 03:40 PM, brem belguebli wrote:
> Hi,
> That's what I 'm trying to do.
> If you mean lvm.sh, well, I've been playing with it, but it does some
> "sanity" checks that are wierd
>
>    1. It expects HA LVM to be setup (why such check if we want to use CLVM).
>    2. it exits if it finds a CLVM VG  (kind of funny !)
>    3. it exits if the lvm.conf is newer than /boot/*.img (about this
>       one, we tend to prevent the cluster from automatically starting ...)
>
> I was looking to find some doc on how to write my own resources, ie CLVM
> resource that checks if the vg is clustered, if so by which node is it
> exclusively held, and if the node is down to activate exclusively the VG.
> If you have some good links to provide me, that'll be great.
> Thanks
>
>
> 2009/7/21, Christine Caulfield <ccaulfie at redhat.com
> <mailto:ccaulfie at redhat.com>>:
>
>     On 07/21/2009 01:11 PM, brem belguebli wrote:
>
>         Hi,
>         When creating the VG by default clustered, you implicitely
>         assume that
>         it will be used with a clustered FS on top of it (gfs, ocfs, etc...)
>         that will handle the active/active mode.
>         As I do not intend to use GFS in this particular case, but ext3
>         and raw
>         devices, I need to make sure the vg is exclusively activated on one
>         node, preventing the other nodes to access it unless it is the
>         failover
>         procedure (node holding the VG crashed) and then re activate it
>         exclusively on the failover node.
>         Thanks
>
>
>
>     In that case you probably ought to be using rgmanager to do the
>     failover for you. It has a script for doing exactly this :-)
>
>     Chrissie
>
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>     https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From rmicmirregs at gmail.com  Tue Jul 21 14:56:25 2009
From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda)
Date: Tue, 21 Jul 2009 16:56:25 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<4A6590AE.2040808@redhat.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
Message-ID: <1248188185.6464.2.camel@mecatol>

Hi Brem,

El mar, 21-07-2009 a las 16:40 +0200, brem belguebli escribi?:
> Hi,
>  
> That's what I 'm trying to do.
>  
> If you mean lvm.sh, well, I've been playing with it, but it does some
> "sanity" checks that are wierd
>      1. It expects HA LVM to be setup (why such check if we want to
>         use CLVM).
>      2. it exits if it finds a CLVM VG  (kind of funny !)
>      3. it exits if the lvm.conf is newer than /boot/*.img (about this
>         one, we tend to prevent the cluster from automatically
>         starting ...)
> I was looking to find some doc on how to write my own resources, ie
> CLVM resource that checks if the vg is clustered, if so by which node
> is it exclusively held, and if the node is down to activate
> exclusively the VG.
>  
> If you have some good links to provide me, that'll be great.
>  
> Thanks
> 
> 
> 2009/7/21, Christine Caulfield <ccaulfie at redhat.com>:
>         On 07/21/2009 01:11 PM, brem belguebli wrote:
>                 Hi,
>                 When creating the VG by default clustered, you
>                 implicitely assume that
>                 it will be used with a clustered FS on top of it (gfs,
>                 ocfs, etc...)
>                 that will handle the active/active mode.
>                 As I do not intend to use GFS in this particular case,
>                 but ext3 and raw
>                 devices, I need to make sure the vg is exclusively
>                 activated on one
>                 node, preventing the other nodes to access it unless
>                 it is the failover
>                 procedure (node holding the VG crashed) and then re
>                 activate it
>                 exclusively on the failover node.
>                 Thanks
>         
>         
>         In that case you probably ought to be using rgmanager to do
>         the failover for you. It has a script for doing exactly
>         this :-)
>         
>         Chrissie 
>         
>         
>         --
>         Linux-cluster mailing list
>         Linux-cluster at redhat.com
>         https://www.redhat.com/mailman/listinfo/linux-cluster
>         
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Please, check this link:

https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html

I found exactly the same problem as you, and i developed the
"lvm-cluster.sh" script to solve the needs I had. You can find the
script on the last message of the thread.

I submitted it to make it part of the main project, but i have no news
about that yet.

I hope this helps.

Cheers,

Rafael

-- 
Rafael Mic? Miranda


From brem.belguebli at gmail.com  Tue Jul 21 15:24:43 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Tue, 21 Jul 2009 17:24:43 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <1248188185.6464.2.camel@mecatol>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<4A6590AE.2040808@redhat.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
	<1248188185.6464.2.camel@mecatol>
Message-ID: <29ae894c0907210824m751038fak861f5f9d2a8ca76b@mail.gmail.com>

Hola Rafael,

Thanks a lot, that'll avoid me going from scratch.

I'll have a look at them and keep you updated.

Brem


2009/7/21, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
>
> Hi Brem,
>
> El mar, 21-07-2009 a las 16:40 +0200, brem belguebli escribi?:
> > Hi,
> >
> > That's what I 'm trying to do.
> >
> > If you mean lvm.sh, well, I've been playing with it, but it does some
> > "sanity" checks that are wierd
> >      1. It expects HA LVM to be setup (why such check if we want to
> >         use CLVM).
> >      2. it exits if it finds a CLVM VG  (kind of funny !)
> >      3. it exits if the lvm.conf is newer than /boot/*.img (about this
> >         one, we tend to prevent the cluster from automatically
> >         starting ...)
> > I was looking to find some doc on how to write my own resources, ie
> > CLVM resource that checks if the vg is clustered, if so by which node
> > is it exclusively held, and if the node is down to activate
> > exclusively the VG.
> >
> > If you have some good links to provide me, that'll be great.
> >
> > Thanks
> >
> >
> > 2009/7/21, Christine Caulfield <ccaulfie at redhat.com>:
> >         On 07/21/2009 01:11 PM, brem belguebli wrote:
> >                 Hi,
> >                 When creating the VG by default clustered, you
> >                 implicitely assume that
> >                 it will be used with a clustered FS on top of it (gfs,
> >                 ocfs, etc...)
> >                 that will handle the active/active mode.
> >                 As I do not intend to use GFS in this particular case,
> >                 but ext3 and raw
> >                 devices, I need to make sure the vg is exclusively
> >                 activated on one
> >                 node, preventing the other nodes to access it unless
> >                 it is the failover
> >                 procedure (node holding the VG crashed) and then re
> >                 activate it
> >                 exclusively on the failover node.
> >                 Thanks
> >
> >
> >         In that case you probably ought to be using rgmanager to do
> >         the failover for you. It has a script for doing exactly
> >         this :-)
> >
> >         Chrissie
> >
> >
> >         --
> >         Linux-cluster mailing list
> >         Linux-cluster at redhat.com
> >         https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> Please, check this link:
>
> https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
>
> I found exactly the same problem as you, and i developed the
> "lvm-cluster.sh" script to solve the needs I had. You can find the
> script on the last message of the thread.
>
> I submitted it to make it part of the main project, but i have no news
> about that yet.
>
> I hope this helps.
>
> Cheers,
>
> Rafael
>
> --
> Rafael Mic? Miranda
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090721/aec4e50a/attachment.htm>

From brem.belguebli at gmail.com  Tue Jul 21 15:19:56 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Tue, 21 Jul 2009 17:19:56 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <4A65D6E4.1090905@redhat.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<4A6590AE.2040808@redhat.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
	<4A65D6E4.1090905@redhat.com>
Message-ID: <29ae894c0907210819g6ccf00f0o79fc3d85c20dbe14@mail.gmail.com>

Well,

Pointless, I'm not sure as you take advantage of having all the other nodes
in the cluster updated if a LVM metadata is modified by the node holding the
VG.

Second point, HA-LVM aka hosttags has, IMHO, a security problem as anyone
could modify the hosttag on a VG without any problem (no locking mechanisms
as CLVM).

I have nothing against Clustered FS, but in my specific case, I have to host
serveral Sybase Dataservers on some clusters, and the only acceptable option
for my DBA's is to use raw devices.

I never meant to combine HA-LVM and CLVM, I consider them mutualy exclusive.

Regards

2009/7/21, Christine Caulfield <ccaulfie at redhat.com>:
>
> It seems a little pointless to integrate clvmd with a failover system.
> They're almost totally different ways of running a cluster. clvmd assumes a
> symmetrical cluster (as you've found out) and is designed so that the LVs
> are available on all nodes for a cluster filesystem. Trying to make that
> sort of system work for a failover installation is always going to be
> awkward, it's not what it was designed for.
>
> That, in part I think, is why HA-LVM checks for a clustered VGs and
> declines to manage them. A resource should be controlled by one manager, not
> two, it's just asking for confusion.
>
> Basically you either use clvmd or HA-LVM; not both together.
>
> If you really want to write a resource manager to use clvmd then feel free,
> I don't have any references but others might. It's not an area I have ever
> had to go into.
>
> Good luck ;-)
>
> Chrissie
>
>
>
> On 07/21/2009 03:40 PM, brem belguebli wrote:
>
>> Hi,
>> That's what I 'm trying to do.
>> If you mean lvm.sh, well, I've been playing with it, but it does some
>> "sanity" checks that are wierd
>>
>>   1. It expects HA LVM to be setup (why such check if we want to use
>> CLVM).
>>   2. it exits if it finds a CLVM VG  (kind of funny !)
>>   3. it exits if the lvm.conf is newer than /boot/*.img (about this
>>      one, we tend to prevent the cluster from automatically starting ...)
>>
>> I was looking to find some doc on how to write my own resources, ie CLVM
>> resource that checks if the vg is clustered, if so by which node is it
>> exclusively held, and if the node is down to activate exclusively the VG.
>> If you have some good links to provide me, that'll be great.
>> Thanks
>>
>>
>> 2009/7/21, Christine Caulfield <ccaulfie at redhat.com
>> <mailto:ccaulfie at redhat.com>>:
>>
>>    On 07/21/2009 01:11 PM, brem belguebli wrote:
>>
>>        Hi,
>>        When creating the VG by default clustered, you implicitely
>>        assume that
>>        it will be used with a clustered FS on top of it (gfs, ocfs,
>> etc...)
>>        that will handle the active/active mode.
>>        As I do not intend to use GFS in this particular case, but ext3
>>        and raw
>>        devices, I need to make sure the vg is exclusively activated on one
>>        node, preventing the other nodes to access it unless it is the
>>        failover
>>        procedure (node holding the VG crashed) and then re activate it
>>        exclusively on the failover node.
>>        Thanks
>>
>>
>>
>>    In that case you probably ought to be using rgmanager to do the
>>    failover for you. It has a script for doing exactly this :-)
>>
>>    Chrissie
>>
>>
>>    --
>>    Linux-cluster mailing list
>>    Linux-cluster at redhat.com <mailto:Linux-cluster at redhat.com>
>>    https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090721/1360d52f/attachment.htm>

From azzopardi at eib.org  Wed Jul 22 09:33:32 2009
From: azzopardi at eib.org (AZZOPARDI Konrad)
Date: Wed, 22 Jul 2009 11:33:32 +0200
Subject: [Linux-cluster] redhat cluster - veritas
Message-ID: <0KN600ED7FVXVID0@comexp1-srv.lux.eib.org>

Dear all,
 
I am new to RedHat cluster and I am looking into different architectural models. In my workplace we have two datacentres with a SAN in each DC. Now we do not have the luxury of SAN replication so I was looking into doing this using host based mirroring. From what I  read, the new RedHat 5.3 supports CLVM host based mirrors but to be honest I am quite reluctant to use something so new and I cannot find any references of anyone using it in PROD, which leads me to another question. Can RedHat cluster work with other volume managers such as veritas. Probably the related question would be whether GFS can with veritas volume manager.
 
Thank you for any responses.
 
konrad
--------------------------------------------------------------------

Les informations contenues dans ce message et/ou ses annexes sont 
reservees a l'attention et a l'utilisation de leur destinataire et peuvent etre 
confidentielles. Si vous n'etes pas destinataire de ce message, vous etes 
informes que vous l'avez recu par erreur et que toute utilisation en est 
interdite. Dans ce cas, vous etes pries de le detruire et d'en informer la 
Banque Europeenne d'Investissement.

The information in this message and/or attachments is intended solely for 
the attention and use of the named addressee and may be confidential. If 
you are not the intended recipient, you are hereby notified that you have 
received this transmittal in error and that any use of it is prohibited. In 
such a case please delete this message and kindly notify the European 
Investment Bank accordingly.
--------------------------------------------------------------------
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090722/2bf12d27/attachment.htm>

From pasik at iki.fi  Wed Jul 22 12:30:07 2009
From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=)
Date: Wed, 22 Jul 2009 15:30:07 +0300
Subject: [Linux-cluster] Virtual service using CLVM not migrating
In-Reply-To: <2B9A98D7-A9C6-4210-B1B6-9B6306E21362@netspot.com.au>
References: <2B9A98D7-A9C6-4210-B1B6-9B6306E21362@netspot.com.au>
Message-ID: <20090722123007.GW24960@edu.joroinen.fi>

On Mon, Aug 25, 2008 at 03:49:49PM +0930, Tom Lanyon wrote:
> Hi list,
> 
> (let me know if this should be on the xen list, but I think it's an  
> issue with clvm locking a logical volume)
> 
> I have a three node RHEL5 cluster running some virtual machines. The  
> virtual machines use a LVM LV as their root which is available cluster- 
> wide via clvmd.
> 
> Live migration between cluster nodes seems to work well when running  
> one-vm-per-node exclusively, but fails when a node is running more  
> than one virtual machine.
>

Hi!

Did you ever figure out the reason for this behaviour? 
Have you tried it again with RHEL 5.3 ?

-- Pasi


From gianluca.cecchi at gmail.com  Wed Jul 22 15:12:40 2009
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Wed, 22 Jul 2009 17:12:40 +0200
Subject: [Linux-cluster] lvm2: cluster request failed: Unknown error 65538
Message-ID: <561c252c0907220812l152462d9o27cd64ba37c017a4@mail.gmail.com>

Hello,
by mistake I previously sent this to fedora-list.
I resend to the appropriate list I wanted...
Excuse in advance for eventual cross-posting effects for anyone...

fedora11 x86_64 with lvm2, device-mapper and related packages updated at :
lvm2-2.02.48-1.fc11.x86_64
lvm2-cluster-2.02.48-1.fc11.x86_64
device-mapper-1.02.33-1.fc11.x86_64
device-mapper-libs-1.02.33-1.fc11.x86_64

I have 2 VGs: vg_virtfed that is a system vg (with root lv ans swap
lv) and vg_qemu01 that is a clustered vg.
At the moment only one node active

[root virtfed ~]# service clvmd status
clvmd (pid 2581) is running...
active volumes: centos53 centos53_cldisk1 centos53_cldisk2
centos53_disk2 centos53_node02 centos53_node02_disk2 centos53_qdisk
test_vm_drbd w2k3_01 lv_root lv_swap

- it is strange that between the active volumes I have also the ones
owned by vg_virtfed, that is not a clustered vg
vgs command gives

[root virtfed ~]# cman_tool status
Version: 6.2.0
Config Version: 3
Cluster Name: kvm
Cluster Id: 773
Cluster Member: Yes
Cluster Generation: 196
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1
Active subsystems: 10
Flags: 2node HaveState
Ports Bound: 0 177
Node name: kvm1
Node ID: 1
Multicast addresses: 239.192.3.8
Node addresses: 192.168.10.1

I have no services or filesystem on cluster. Only the infra with cman and clvmd
[root virtfed ~]# clustat
Cluster Status for kvm @ Wed Jul 22 16:00:14 2009
Member Status: Quorate

 Member Name                                             ID   Status
 ------ ----                                             ---- ------
 kvm1                                                        1 Online, Local
 kvm2                                                        2 Offline


[root virtfed ~]# vgs
  cluster request failed: Unknown error 65538
  cluster request failed: Unknown error 65538
  VG         #PV #LV #SN Attr   VSize  VFree
  vg_qemu01    1   9   0 wz--nc 52.12G 13.07G
  vg_virtfed   1   2   0 wz--n- 16.00G     0
  Internal error: Volume Group vg_qemu01 was not unlocked
  Internal error: Volume Group vg_virtfed was not unlocked
  Device '/dev/drbd0' has been left open.
  Device '/dev/block/104:2' has been left open.
  Device '/dev/drbd0' has been left open.
  Device '/dev/block/104:2' has been left open.
  Device '/dev/block/104:2' has been left open.
  Device '/dev/drbd0' has been left open.
  Device '/dev/block/104:2' has been left open.

so it seems that the cluster attribute is correctly present only for vg_qemu01
It seems that I can do normal operations such as adding an LV and so
on, but I don't understand the errors about
 Internal error: Volume Group vg_qemu01 was not unlocked
and
 Device '/dev/drbd0' has been left open.

Any clues?

dumpconfig from lvm> prompt gives:
  devices {
  	dir="/dev"
  	scan="/dev"
  	preferred_names=[]
  	filter=["a|^/dev/cciss/c0d0p2$|", "a|drbd.*|", "r|.*|"]
  	cache_dir="/etc/lvm/cache"
  	cache_file_prefix=""
  	write_cache_state=1
  	sysfs_scan=1
  	md_component_detection=1
  	md_chunk_alignment=1
  	data_alignment=0
  	ignore_suspended_devices=0
  }
  activation {
  	missing_stripe_filler="error"
  	reserved_stack=256
  	reserved_memory=8192
  	process_priority=-18
  	mirror_region_size=512
  	readahead="auto"
  	mirror_log_fault_policy="allocate"
  	mirror_device_fault_policy="remove"
  }
  global {
  	umask=63
  	test=0
  	units="h"
  	activation=1
  	proc="/proc"
  	locking_type=3
  	fallback_to_clustered_locking=1
  	fallback_to_local_locking=1
  	locking_dir="/var/lock/lvm"
  }
  shell {
  	history_size=100
  }
  backup {
  	backup=1
  	backup_dir="/etc/lvm/backup"
  	archive=1
  	archive_dir="/etc/lvm/archive"
  	retain_min=10
  	retain_days=30
  }
  log {
  	verbose=0
  	syslog=1
  	file="/var/log/lvm2.log"
  	overwrite=0
  	level=6
  	indent=1
  	command_names=0
  	prefix="  "
  }

I set log verbosse (level 6) and when I run vgs command i get inside
the file /var/log/lvm2.log:

config/config.c:955   log/activation not found in config: defaulting to 0
commands/toolcontext.c:189   Logging initialised at Wed Jul 22 15:58:19 2009
config/config.c:950   Setting global/umask to 63
commands/toolcontext.c:210   Set umask to 0077
config/config.c:927   Setting devices/dir to /dev
config/config.c:927   Setting global/proc to /proc
config/config.c:950   Setting global/activation to 1
config/config.c:955   global/suffix not found in config: defaulting to 1
config/config.c:927   Setting global/units to h
config/config.c:927   Setting activation/readahead to auto
config/config.c:927   Setting activation/missing_stripe_filler to error
device/dev-cache.c:485   devices/preferred_names not found in config
file: using built-in preferences
config/config.c:950   Setting devices/ignore_suspended_devices to 0
config/config.c:927   Setting devices/cache_dir to /etc/lvm/cache
config/config.c:950   Setting devices/write_cache_state to 1
filters/filter-persistent.c:131   Loaded persistent filter cache from
/etc/lvm/cache/.cache
config/config.c:950   Setting activation/reserved_stack to 256
config/config.c:950   Setting activation/reserved_memory to 8192
config/config.c:950   Setting activation/process_priority to -18
format1/format1.c:532   Initialised format: lvm1
format_pool/format_pool.c:333   Initialised format: pool
format_text/format-text.c:2015   Initialised format: lvm2
config/config.c:933   global/format not found in config: defaulting to lvm2
striped/striped.c:228   Initialised segtype: striped
zero/zero.c:110   Initialised segtype: zero
error/errseg.c:113   Initialised segtype: error
freeseg/freeseg.c:57   Initialised segtype: free
snapshot/snapshot.c:312   Initialised segtype: snapshot
mirror/mirrored.c:579   Initialised segtype: mirror
config/config.c:950   Setting backup/retain_days to 30
config/config.c:950   Setting backup/retain_min to 10
config/config.c:927   Setting backup/archive_dir to /etc/lvm/archive
config/config.c:927   Setting backup/backup_dir to /etc/lvm/backup
config/config.c:955   global/fallback_to_lvm1 not found in config:
defaulting to 1
config/config.c:950   Setting global/locking_type to 3
locking/locking.c:253   Cluster locking selected.
config/config.c:955   report/aligned not found in config: defaulting to 1
config/config.c:955   report/buffered not found in config: defaulting to 1
config/config.c:955   report/headings not found in config: defaulting to 1
config/config.c:933   report/separator not found in config: defaulting to
config/config.c:955   report/prefixes not found in config: defaulting to 0
config/config.c:955   report/quoted not found in config: defaulting to 1
config/config.c:955   report/columns_as_rows not found in config:
defaulting to 0
config/config.c:933   report/vgs_sort not found in config: defaulting to vg_name
config/config.c:933   report/vgs_cols not found in config: defaulting
to vg_name,pv_count,lv_count,snap_count,vg_attr,vg_size,vg_free
toollib.c:572   Finding all volume groups
label/label.c:160   /dev/drbd0: lvm2 label detected
label/label.c:160   /dev/block/104:2: lvm2 label detected
label/label.c:184   /dev/dm-10: No label detected
locking/cluster_locking.c:458   Locking VG V_vg_virtfed CR B (0x1)
toollib.c:468   Finding volume group "vg_virtfed"
label/label.c:160   /dev/block/104:2: lvm2 label detected
config/config.c:933   description not found in config: defaulting to
config/config.c:927   Setting description to Created *after* executing
'lvextend -l +100%FREE /dev/vg_virtfed/lv_root'
locking/cluster_locking.c:458   Locking VG V_vg_virtfed UN B (0x6)
locking/cluster_locking.c:159   cluster request failed: Unknown error 65538
locking/cluster_locking.c:458   Locking VG V_vg_qemu01 CR B (0x1)
toollib.c:468   Finding volume group "vg_qemu01"
label/label.c:160   /dev/drbd0: lvm2 label detected
config/config.c:933   description not found in config: defaulting to
config/config.c:927   Setting description to Created *after* executing
'/sbin/lvcreate --name test_vm_drbd -L 4194304K /dev/vg_qemu01'
locking/cluster_locking.c:458   Locking VG V_vg_qemu01 UN B (0x6)
locking/cluster_locking.c:159   cluster request failed: Unknown error 65538
libdm-report.c:805   VG         #PV #LV #SN Attr   VSize  VFree
libdm-report.c:1065   vg_qemu01    1   9   0 wz--nc 52.12G 13.07G
libdm-report.c:1065   vg_virtfed   1   2   0 wz--n- 16.00G     0
filters/filter-persistent.c:199   Dumping persistent device cache to
/etc/lvm/cache/.cache
misc/lvm-file.c:236   Locking /etc/lvm/cache/.cache (F_WRLCK, 1)
misc/lvm-file.c:265   Unlocking fd 5
cache/lvmcache.c:1241   Wiping internal VG cache
cache/lvmcache.c:1235   Internal error: Volume Group vg_qemu01 was not unlocked
cache/lvmcache.c:1235   Internal error: Volume Group vg_virtfed was not unlocked
device/dev-cache.c:564   Device '/dev/drbd0' has been left open.
device/dev-cache.c:564   Device '/dev/block/104:2' has been left open.
device/dev-cache.c:564   Device '/dev/drbd0' has been left open.
device/dev-cache.c:564   Device '/dev/block/104:2' has been left open.
device/dev-cache.c:564   Device '/dev/block/104:2' has been left open.
device/dev-cache.c:564   Device '/dev/drbd0' has been left open.
device/dev-cache.c:564   Device '/dev/block/104:2' has been left open.


Thanks,
Gianluca


From brem.belguebli at gmail.com  Wed Jul 22 17:40:40 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Wed, 22 Jul 2009 19:40:40 +0200
Subject: [Linux-cluster] redhat cluster - veritas
In-Reply-To: <0KN600ED7FVXVID0@comexp1-srv.lux.eib.org>
References: <0KN600ED7FVXVID0@comexp1-srv.lux.eib.org>
Message-ID: <29ae894c0907221040j3edd77dex37a8b42ed2cd2286@mail.gmail.com>

Hello Konrad,

LVM mirroring has some stuff to be aware of before using it in production:

1) till now, no live extension, you need to break the mirror before growing
it
2) the mirror metadata has to reside on a 3rd disk if you do not want to
resynchronise entirely your mirror at boot time, if it is not a problem for
you, you can put the metadata in memory (lost at reboot)

I'm not sure anyone has already experienced VX storage foundation on top of
RHCS, I'm ready to bet you would be the first.


2009/7/22, AZZOPARDI Konrad <azzopardi at eib.org>:
>
>  Dear all,
>
> I am new to RedHat cluster and I am looking into different architectural
> models. In my workplace we have two datacentres with a SAN in each DC. Now
> we do not have the luxury of SAN replication so I was looking into doing
> this using host based mirroring. From what I  read, the new RedHat 5.3
> supports CLVM host based mirrors but to be honest I am quite reluctant to
> use something so new and I cannot find any references of anyone using it in
> PROD, which leads me to another question. Can RedHat cluster work with other
> volume managers such as veritas. Probably the related question would be
> whether GFS can with veritas volume manager.
>
> Thank you for any responses.
>
> konrad
>
> --------------------------------------------------------------------
>
> Les informations contenues dans ce message et/ou ses annexes sont
> reservees a l'attention et a l'utilisation de leur destinataire et peuvent etre
> confidentielles. Si vous n'etes pas destinataire de ce message, vous etes
> informes que vous l'avez recu par erreur et que toute utilisation en est
> interdite. Dans ce cas, vous etes pries de le detruire et d'en informer la
> Banque Europeenne d'Investissement.
>
> The information in this message and/or attachments is intended solely for
> the attention and use of the named addressee and may be confidential. If
> you are not the intended recipient, you are hereby notified that you have
> received this transmittal in error and that any use of it is prohibited. In
> such a case please delete this message and kindly notify the European
> Investment Bank accordingly.
> --------------------------------------------------------------------
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090722/cb0d01f3/attachment.htm>

From td3201 at gmail.com  Wed Jul 22 18:26:15 2009
From: td3201 at gmail.com (Terry)
Date: Wed, 22 Jul 2009 13:26:15 -0500
Subject: [Linux-cluster] Re: determining fsid for fs resource
In-Reply-To: <8ee061010907170905x3468960foc938215865056a10@mail.gmail.com>
References: <8ee061010907170905x3468960foc938215865056a10@mail.gmail.com>
Message-ID: <8ee061010907221126r4329f2bbp19851aea365320ba@mail.gmail.com>

On Fri, Jul 17, 2009 at 11:05 AM, Terry<td3201 at gmail.com> wrote:
> Hello,
>
> When I create a fs resource using redhat's luci, it is able to find
> the fsid for a fs and life is good. ?However, I am not crazy about
> luci and would prefer to manually create the resources from the
> command line but how do I find the fsid for a filesystem? ?Here's an
> example of a fs resource created using luci:
>
> <fs device="/dev/vg_data01i/lv_data01i" force_fsck="0"
> force_unmount="1" fsid="49256" fstype="ext3" mountpoint="/data01i"
> name="omadvnfs01-data01i"
> options="noatime,nodiratime,data=writeback,commit=30" self_fence="0"/>
>
> Thanks!
>

Anyone have an idea for this?


From s.wendy.cheng at gmail.com  Wed Jul 22 19:53:31 2009
From: s.wendy.cheng at gmail.com (Wendy Cheng)
Date: Wed, 22 Jul 2009 12:53:31 -0700
Subject: [Linux-cluster] Re: determining fsid for fs resource
In-Reply-To: <8ee061010907221126r4329f2bbp19851aea365320ba@mail.gmail.com>
References: <8ee061010907170905x3468960foc938215865056a10@mail.gmail.com>
	<8ee061010907221126r4329f2bbp19851aea365320ba@mail.gmail.com>
Message-ID: <4A676E3B.3010507@gmail.com>

Terry wrote:
> On Fri, Jul 17, 2009 at 11:05 AM, Terry<td3201 at gmail.com> wrote:
>   
>> Hello,
>>
>> When I create a fs resource using redhat's luci, it is able to find
>> the fsid for a fs and life is good.  However, I am not crazy about
>> luci and would prefer to manually create the resources from the
>> command line but how do I find the fsid for a filesystem?  Here's an
>> example of a fs resource created using luci:
>>
>> <fs device="/dev/vg_data01i/lv_data01i" force_fsck="0"
>> force_unmount="1" fsid="49256" fstype="ext3" mountpoint="/data01i"
>> name="omadvnfs01-data01i"
>> options="noatime,nodiratime,data=writeback,commit=30" self_fence="0"/>
>>
>> Thanks!
>>
>>     
>
> Anyone have an idea for this?
>   

IIRC, you basically have to make up the key (fsid) by yourself. Just 
pick any number (integer) that is less then 2**32 - but make sure it is 
unique per-filesystem-per-export while NFS service is up and running. 
That is, if you plan to export the same filesystem via two export 
entries (or say, export two different directories from the very same 
filesystem) , you need two fsids. If you have x exports (regardless they 
are from the same filesystem or different filessytems) at the same time, 
you would need x fsid(s). This is mostly to do with NFS export 
(internally represented by an unsigned integer)  - don't confuse it with 
filesystem id (that is obtained via stat system call family).


-- Wendy


From jason at monsterjam.org  Wed Jul 22 23:15:40 2009
From: jason at monsterjam.org (Jason Welsh)
Date: Wed, 22 Jul 2009 19:15:40 -0400
Subject: [Linux-cluster] quickie GFS with lock_nolock question
Message-ID: <4A679D9C.4010208@monsterjam.org>

we have a funky custom application that doesnt seem to like
reading/writing to the shared GFS filesystem we have on our 2node
cluster. someone at the vendor suggested just having the main node mount
the GFS filesystem with "lock_nolock" to see if that would fix the problem.
What is the correct way to go through the motions to do this? because
currently they both mount up the gfs partition and I believe its a
dependency for the other services. feel free to direct me to the right
FM to RTFM.

Jason


From rhurst at bidmc.harvard.edu  Thu Jul 23 01:21:13 2009
From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu)
Date: Wed, 22 Jul 2009 21:21:13 -0400
Subject: [Linux-cluster] quickie GFS with lock_nolock question
References: <4A679D9C.4010208@monsterjam.org>
Message-ID: <C44AB05AAC2D694A907A91DA40CCF3730241D185@EVS6.its.caregroup.org>

For example, in /etc/fstab an entry could look like:
 
/dev/VGCCC/lvolshare    /cluster/share          gfs     lockproto=lock_nolock  0 0
An ad-hoc mount could look like:
 
mount -t gfs -o lockproto=lock_nolock /dev/VGCCC/lvolshare /cluster/share
Make certain that this filesystem is not mounted anywhere else before overriding, because it will corrupt.
 
 
Robert Hurst, Sr. Cach? Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.

________________________________

From: linux-cluster-bounces at redhat.com on behalf of Jason Welsh
Sent: Wed 7/22/2009 7:15 PM
To: linux clustering
Subject: [Linux-cluster] quickie GFS with lock_nolock question


we have a funky custom application that doesnt seem to like
reading/writing to the shared GFS filesystem we have on our 2node
cluster. someone at the vendor suggested just having the main node mount
the GFS filesystem with "lock_nolock" to see if that would fix the problem.
What is the correct way to go through the motions to do this? because
currently they both mount up the gfs partition and I believe its a
dependency for the other services. feel free to direct me to the right
FM to RTFM.

Jason


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090722/b4929c72/attachment.htm>

From charlieb-linux-cluster at budge.apana.org.au  Thu Jul 23 16:23:46 2009
From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady)
Date: Thu, 23 Jul 2009 12:23:46 -0400 (EDT)
Subject: [Linux-cluster] quickie GFS with lock_nolock question
In-Reply-To: <4A679D9C.4010208@monsterjam.org>
References: <4A679D9C.4010208@monsterjam.org>
Message-ID: <Pine.LNX.4.64.0907231222110.4806@e-smith.charlieb.ott.istop.com>


On Wed, 22 Jul 2009, Jason Welsh wrote:

> we have a funky custom application that doesnt seem to like
> reading/writing to the shared GFS filesystem ...

What is your definition of "doesn't seem to like reading/writing"?

> we have on our 2node
> cluster. someone at the vendor suggested just having the main node mount
> the GFS filesystem with "lock_nolock" to see if that would fix the problem.

"fix" the problem? Or just hide it?


From charlieb-linux-cluster at budge.apana.org.au  Thu Jul 23 16:25:18 2009
From: charlieb-linux-cluster at budge.apana.org.au (Charlie Brady)
Date: Thu, 23 Jul 2009 12:25:18 -0400 (EDT)
Subject: [Linux-cluster] kernel BUG at fs/gfs2/rgrp.c:1458!
In-Reply-To: <c8c8ded30907201420i3958481and569434a58a32043@mail.gmail.com>
References: <c8c8ded30907201420i3958481and569434a58a32043@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0907231224270.4806@e-smith.charlieb.ott.istop.com>


On Mon, 20 Jul 2009, Peter Schobel wrote:

> We are experiencing fatal exceptions on a four node Linux cluster
> using a gfs2 filesystem. Any help would be appreciated. Am happy to
> provide additional info.
>
> kernel BUG at fs/gfs2/rgrp.c:1458!
> invalid opcode: 0000 [#1]
> SMP
> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
> Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm
> configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon
> backlight sbs i2c_ec
> i2c_cord
> CPU:    0
> EIP:    0060:[<f96e04da>]    Not tainted VLI
> EFLAGS: 00010246   (2.6.18-128.1.16.el5PAE #1)
> EIP is at gfs2_alloc_data+0x75/0x155 [gfs2]

I think you should open a bug at https://bugzilla.redhat.com/.


From jason at monsterjam.org  Thu Jul 23 17:08:36 2009
From: jason at monsterjam.org (Jason Welsh)
Date: Thu, 23 Jul 2009 13:08:36 -0400
Subject: [Linux-cluster] quickie GFS with lock_nolock question
In-Reply-To: <Pine.LNX.4.64.0907231222110.4806@e-smith.charlieb.ott.istop.com>
References: <4A679D9C.4010208@monsterjam.org>
	<Pine.LNX.4.64.0907231222110.4806@e-smith.charlieb.ott.istop.com>
Message-ID: <4A689914.7050104@monsterjam.org>


Charlie Brady wrote:
>
> On Wed, 22 Jul 2009, Jason Welsh wrote:
>
>> we have a funky custom application that doesnt seem to like
>> reading/writing to the shared GFS filesystem ...
>
> What is your definition of "doesn't seem to like reading/writing"?
>
>> we have on our 2node
>> cluster. someone at the vendor suggested just having the main node mount
>> the GFS filesystem with "lock_nolock" to see if that would fix the
>> problem.
>
> "fix" the problem? Or just hide it?

well, we are trying to test to see if thats the real problem or not.. It
might not be.
to make it simple, I just added that argument in the fstab on the first
server and shutdown the second server.
the question  I was trying to ask was without shutting down and
rebooting the servers, whats the right way to back both servers out of
the clustered service and manually mount up the GFS partition on the
first node.. I tried doing the "cman_tool leave remove"
but I got the error that there were still services running, I tried
shutting down everything, but still had like 3 lingering..
at this point, I just shut one down and rebooted the other with the
correct mount options and everything is fine now for our testing..
I will hopefully know later today if the application will behave with
its data on GFS volume mounted by the single node.

thanks/regards,
Jason


From pschobel at 1iopen.net  Thu Jul 23 17:19:38 2009
From: pschobel at 1iopen.net (Peter Schobel)
Date: Thu, 23 Jul 2009 10:19:38 -0700
Subject: [Linux-cluster] kernel BUG at fs/gfs2/rgrp.c:1458!
In-Reply-To: <Pine.LNX.4.64.0907231224270.4806@e-smith.charlieb.ott.istop.com>
References: <c8c8ded30907201420i3958481and569434a58a32043@mail.gmail.com>
	<Pine.LNX.4.64.0907231224270.4806@e-smith.charlieb.ott.istop.com>
Message-ID: <c8c8ded30907231019y5ee08465p15512999cb7ebe25@mail.gmail.com>

I actually have found the bug filed at
https://bugzilla.redhat.com/show_bug.cgi?id=499333 and posted my
comments.

Regards,

Peter Schobel
~

On Thu, Jul 23, 2009 at 9:25 AM, Charlie
Brady<charlieb-linux-cluster at budge.apana.org.au> wrote:
>
> On Mon, 20 Jul 2009, Peter Schobel wrote:
>
>> We are experiencing fatal exceptions on a four node Linux cluster
>> using a gfs2 filesystem. Any help would be appreciated. Am happy to
>> provide additional info.
>>
>> kernel BUG at fs/gfs2/rgrp.c:1458!
>> invalid opcode: 0000 [#1]
>> SMP
>> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
>> Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm
>> configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon
>> backlight sbs i2c_ec
>> i2c_cord
>> CPU: ? ?0
>> EIP: ? ?0060:[<f96e04da>] ? ?Not tainted VLI
>> EFLAGS: 00010246 ? (2.6.18-128.1.16.el5PAE #1)
>> EIP is at gfs2_alloc_data+0x75/0x155 [gfs2]
>
> I think you should open a bug at https://bugzilla.redhat.com/.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Peter Schobel
~


From scooter at cgl.ucsf.edu  Thu Jul 23 21:08:21 2009
From: scooter at cgl.ucsf.edu (Scooter Morris)
Date: Thu, 23 Jul 2009 14:08:21 -0700
Subject: [Linux-cluster] Dependencies in resources
Message-ID: <4A68D145.6070706@cgl.ucsf.edu>

Hi all,
    I saw in a message on the net about a *depends="service:xxxx" 
*option for services in cluster.conf for 5.3beta.  Did this single-level 
dependency support make it into the 5.3 release?  It would be really, 
really useful if it did!  If not, can anyone suggest a way that I can 
have a multiple services depend on a single IP?

Thanks in advance!

-- scooter

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090723/30416ddb/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scooter.vcf
Type: text/x-vcard
Size: 378 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090723/30416ddb/attachment.vcf>

From alfredo.moralejo at roche.com  Thu Jul 23 21:15:21 2009
From: alfredo.moralejo at roche.com (Moralejo, Alfredo)
Date: Thu, 23 Jul 2009 23:15:21 +0200
Subject: [Linux-cluster] Dependencies in resources
In-Reply-To: <4A68D145.6070706@cgl.ucsf.edu>
References: <4A68D145.6070706@cgl.ucsf.edu>
Message-ID: <FD716ADFEE797543AA8B2AB0B3F9148E019E4AB303@rkamsem701.emea.roche.com>

Take a look into

https://inquiries.redhat.com/go/redhat/ReferenceArchitectureSAPClusterSuite

Deploying Highly Available SAP Servers using Red Hat Cluster Suite

Gives some example about dependencies on services etc..

A very good document


________________________________
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Scooter Morris
Sent: Thursday, July 23, 2009 11:08 PM
To: linux clustering
Subject: [Linux-cluster] Dependencies in resources

Hi all,
    I saw in a message on the net about a depends="service:xxxx" option for services in cluster.conf for 5.3beta.  Did this single-level dependency support make it into the 5.3 release?  It would be really, really useful if it did!  If not, can anyone suggest a way that I can have a multiple services depend on a single IP?

Thanks in advance!

-- scooter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090723/0ef0209a/attachment.htm>

From agx at sigxcpu.org  Fri Jul 24 00:35:43 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Fri, 24 Jul 2009 02:35:43 +0200
Subject: [Linux-cluster] updating cluster.conf on one node,
	when the other is down
Message-ID: <20090724003543.GA19353@bogon.sigxcpu.org>

Hi,
taking node2 down, updating the cluster configuration on node1 then
using "cman_tool version -r 7" on node1 and then booting node2 gives the
error below:

corosync[1790]:   [QUORUM] This node is within the primary component and will provide service.
corosync[1790]:   [QUORUM] Members[1]: 
corosync[1790]:   [QUORUM]     2 
corosync[1790]:   [CLM   ] CLM CONFIGURATION CHANGE
corosync[1790]:   [CLM   ] New Configuration:
corosync[1790]:   [CLM   ] 	r(0) ip(192.168.122.228) 
corosync[1790]:   [CLM   ] Members Left:
corosync[1790]:   [CLM   ] Members Joined:
corosync[1790]:   [CLM   ] CLM CONFIGURATION CHANGE
corosync[1790]:   [CLM   ] New Configuration:
corosync[1790]:   [CLM   ] 	r(0) ip(192.168.122.82) 
corosync[1790]:   [CLM   ] 	r(0) ip(192.168.122.228) 
corosync[1790]:   [CLM   ] Members Left:
corosync[1790]:   [CLM   ] Members Joined:
corosync[1790]:   [CLM   ] 	r(0) ip(192.168.122.82) 
corosync[1790]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
corosync[1790]:   [CMAN  ] Can't get updated config version 7, config file is version 5.
corosync[1790]:   [QUORUM] This node is within the primary component and will provide service.
corosync[1790]:   [QUORUM] Members[1]: 
corosync[1790]:   [QUORUM]     2 
corosync[1790]:   [CMAN  ] Node 1 conflict, remote config version id=7, local=5
corosync[1790]:   [MAIN  ] Completed service synchronization, ready to provide service.
corosync[1790]:   [CMAN  ] Can't get updated config version 7, config file is version 5.

Afterwards corosync is spinning with 100% cpu usage. This is cluster
3.0.0 with corosync/openais 1.0.0. cluster.conf is attached. Any ideas?
Cheers,
 -- Guido
-------------- next part --------------
<?xml version="1.0"?>
<cluster config_version="7" name="agx">
  <cman two_node="1" expected_votes="2">
  </cman>
  <dlm log_debug="1"/>
  <clusternodes>
    <clusternode name="node1.foo.bar" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fence1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.foo.bar" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fence2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
    <fencedevice agent="fence_xvm" domain="node1" name="fence1"/>
    <fencedevice agent="fence_xvm" domain="node2" name="fence2"/>
  </fencedevices>

  <rm log_level="7">
   <failoverdomains>
      <failoverdomain name="kvm-hosts" ordered="1">
        <failoverdomainnode name="node1.foo.bar"/>
        <failoverdomainnode name="node2.foo.bar"/>
      </failoverdomain>
   </failoverdomains>
   <resources>
       <virt name="test11" />
       <virt name="test12" />
   </resources>
   <service name="test11">
        <virt ref="test11"/>
   </service>
   <service name="test12">
        <virt ref="test12"/>
   </service>
  </rm>
</cluster>

From agx at sigxcpu.org  Fri Jul 24 00:27:40 2009
From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=)
Date: Fri, 24 Jul 2009 02:27:40 +0200
Subject: [Linux-cluster] Cluster 3.0.0 final stable release
In-Reply-To: <1247087412.7941.37.camel@cerberus.int.fabbione.net>
References: <1247087412.7941.37.camel@cerberus.int.fabbione.net>
Message-ID: <20090724002740.GA19266@bogon.sigxcpu.org>

On Wed, Jul 08, 2009 at 11:10:12PM +0200, Fabio M. Di Nitto wrote:
> The cluster team and its community are proud to announce the 3.0.0 final
> release from the STABLE3 branch.
> 
> "And now what?"
> 
> The STABLE3 branch will continue to receive bug fixes and improvements
> as feedback from our community and users will flow in.
> Regular update releases will be available to sync with corosync/openais
> releases and new kernels (for gfs1-kernel module).
> 
> In order to build the 3.0.0 release you will need:
> 
> - corosync 1.0.0
> - openais 1.0.0
> - linux kernel 2.6.29
> 
> The new source tarball can be downloaded here:
> 
> ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.tar.gz
> https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.tar.gz
> 
> At the same location is now possible to find separated tarballs for
> fence-agents and resource-agents as previously announced
> (http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm)
> 
> To report bugs or issues:
> 
>    https://bugzilla.redhat.com/
What would be the right component if not running on RHEL?
 -- Guido


From scooter at cgl.ucsf.edu  Fri Jul 24 16:59:46 2009
From: scooter at cgl.ucsf.edu (Scooter Morris)
Date: Fri, 24 Jul 2009 09:59:46 -0700
Subject: [Linux-cluster] Dependencies in resources
In-Reply-To: <FD716ADFEE797543AA8B2AB0B3F9148E019E4AB303@rkamsem701.emea.roche.com>
References: <4A68D145.6070706@cgl.ucsf.edu>
	<FD716ADFEE797543AA8B2AB0B3F9148E019E4AB303@rkamsem701.emea.roche.com>
Message-ID: <4A69E882.9080300@cgl.ucsf.edu>

Alfredo,
    Thanks!  That was very helpful, although I have a couple more 
questions for the list.  What I am trying to do seems well suited for 
the existing "depend" and "depend-mode" capability.  Essentially, I'm 
trying to use this to "group" services.  So, assume I have 3 services: 
A, B, and C (in my case, A is an IP service, B and C both have an fs and 
a script resource).  I want B and C to depend on A such that B and C 
will only start on the node where A is running.  I assume I can do this by:

   <service autostart="1" name="B" recovery="relocate" 
depend="service:A" depend-mode="hard">
      <fs ... />
      <script .../>
    </service>

    <service autostart="1" name="C" depend="service:A" depend-mode="hard">
      <fs .../>
      <script .../>
    </service>

    <service autostart="1" name="A" recovery="relocate">
      <ip ref="xxx.xxx.xxx.xxx"/>
    </service>

Unfortunately, this didn't work as expected.  If I start A on node 1, 
services B and C seem perfectly happy to run on any other node.  If a 
coerce everybody on the same node, and relocate (or disable A), services 
B and C continue on blissfully ignoring the status of A.  Do I need to 
write and event handler for this situation (which seems much simpler 
than the follow-service script)?

Again, thanks in advance!

-- scooter

Moralejo, Alfredo wrote:
>
> Take a look into
>
>  
>
> https://inquiries.redhat.com/go/redhat/ReferenceArchitectureSAPClusterSuite
>
>  
>
> Deploying Highly Available SAP Servers using Red Hat Cluster Suite
>
>  
>
> Gives some example about dependencies on services etc..
>
>  
>
> A very good document
>
>  
>
>  
>
>  
>
> ------------------------------------------------------------------------
>
> *From:* linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] *On Behalf Of *Scooter Morris
> *Sent:* Thursday, July 23, 2009 11:08 PM
> *To:* linux clustering
> *Subject:* [Linux-cluster] Dependencies in resources
>
>  
>
> Hi all,
>     I saw in a message on the net about a *depends="service:xxxx" 
> *option for services in cluster.conf for 5.3beta.  Did this 
> single-level dependency support make it into the 5.3 release?  It 
> would be really, really useful if it did!  If not, can anyone suggest 
> a way that I can have a multiple services depend on a single IP?
>
> Thanks in advance!
>
> -- scooter
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090724/c4530bfa/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scooter.vcf
Type: text/x-vcard
Size: 378 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090724/c4530bfa/attachment.vcf>

From pradhanparas at gmail.com  Fri Jul 24 22:54:32 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Fri, 24 Jul 2009 17:54:32 -0500
Subject: [Linux-cluster] Hardware recommendation
Message-ID: <8b711df40907241554w70238fc4m2c08d661f78ea4e6@mail.gmail.com>

hi,

I will be creating a cluster using red hat EL 5.3 consisting of 3
nodes which will run Xen virtual machines. The first two nodes will
host around 30 virtual machines ( all paravirt Linux and some Solaris
if possible) . The third node will be basically a fail over node and
will host virtual machines if there is a problem at node 1 or node 2.
I do not know if this is the right way to ask but I need
recommendation on which CPU to choose (i7, phenom, quad core or dual
core) and how much RAM do I need on node 1 and node 2 and on also node
3. And also which server hardware is recommended for my purpose. My
test cluster is based on Dell Poweredge 1800 machines with DRAC 4
(which I am using for fencing) and looks like DELL Poweregde server
can be a good candidate.

Any help is highly appreciated.
Thanks
Paras.


From jeff.sturm at eprize.com  Fri Jul 24 23:21:19 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Fri, 24 Jul 2009 19:21:19 -0400
Subject: [Linux-cluster] Hardware recommendation
In-Reply-To: <8b711df40907241554w70238fc4m2c08d661f78ea4e6@mail.gmail.com>
References: <8b711df40907241554w70238fc4m2c08d661f78ea4e6@mail.gmail.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC531@hugo.eprize.local>

Don't go small.  If you're running 30-60 virtual machines you'll likely
need systems with at least 8 cores at 32GB of RAM, or more depending on
the needs of your virtual hosts.  There's really no way to guess at how
much RAM you'll need without understanding the needs of your virtual
hosts, however.

Dell PowerEdge hardware should work well for this.  We have some old
6950 systems we are now using for Xen.  These appear to have been
replaced by the R905 series, according to Dell's web site.

CPU isn't important.  All modern Intel (Core) and AMD processors work
well with Xen.  To save money I tend to opt for more cores at slower
speeds rather than buy the fastest speed processors available.

What are you using for storage?  If you want node failover you'll likely
want some sort of SAN for central storage.

I also recommend at least two network switches for redundancy, else
you'll regret it the first time you need to reboot a switch without
bringing down your cluster.

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Paras pradhan
> Sent: Friday, July 24, 2009 6:55 PM
> To: linux clustering
> Subject: [Linux-cluster] Hardware recommendation
> 
> hi,
> 
> I will be creating a cluster using red hat EL 5.3 consisting of 3
> nodes which will run Xen virtual machines. The first two nodes will
> host around 30 virtual machines ( all paravirt Linux and some Solaris
> if possible) . The third node will be basically a fail over node and
> will host virtual machines if there is a problem at node 1 or node 2.
> I do not know if this is the right way to ask but I need
> recommendation on which CPU to choose (i7, phenom, quad core or dual
> core) and how much RAM do I need on node 1 and node 2 and on also node
> 3. And also which server hardware is recommended for my purpose. My
> test cluster is based on Dell Poweredge 1800 machines with DRAC 4
> (which I am using for fencing) and looks like DELL Poweregde server
> can be a good candidate.
> 
> Any help is highly appreciated.
> Thanks
> Paras.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From dougbunger at yahoo.com  Sat Jul 25 16:39:06 2009
From: dougbunger at yahoo.com (Doug Bunger)
Date: Sat, 25 Jul 2009 09:39:06 -0700 (PDT)
Subject: [Linux-cluster] Hardware recommendation
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC531@hugo.eprize.local>
Message-ID: <257968.54862.qm@web110211.mail.gq1.yahoo.com>

If you remission in house systems or (gag) get used systems, than CPU ***IS*** important.? You need to make sure you are running x86_64 with CPU virtualization technology (VT) extensions.? You'll need x86_64 to get past the i386 Xen 32meg memory limit, and you'll need VT to provide an update path to RHEL6, vSphere, or even Micro$oft Hyper-V.

Many of my customers are running older equipment and are not wanting to upgrade.? We are finding that in an HP shop, anything older than DL380 G5 won't cut it.? On the Dell side it seems the 1950 is the break point.

--- On Fri, 7/24/09, Jeff Sturm <jeff.sturm at eprize.com> wrote:

From: Jeff Sturm <jeff.sturm at eprize.com>
Subject: RE: [Linux-cluster] Hardware recommendation
To: "linux clustering" <linux-cluster at redhat.com>
Date: Friday, July 24, 2009, 6:21 PM

Don't go small.? If you're running 30-60 virtual machines you'll likely
need systems with at least 8 cores at 32GB of RAM, or more depending on
the needs of your virtual hosts.? There's really no way to guess at how
much RAM you'll need without understanding the needs of your virtual
hosts, however.

Dell PowerEdge hardware should work well for this.? We have some old
6950 systems we are now using for Xen.? These appear to have been
replaced by the R905 series, according to Dell's web site.

CPU isn't important.? All modern Intel (Core) and AMD processors work
well with Xen.? To save money I tend to opt for more cores at slower
speeds rather than buy the fastest speed processors available.

What are you using for storage?? If you want node failover you'll likely
want some sort of SAN for central storage.

I also recommend at least two network switches for redundancy, else
you'll regret it the first time you need to reboot a switch without
bringing down your cluster.

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Paras pradhan
> Sent: Friday, July 24, 2009 6:55 PM
> To: linux clustering
> Subject: [Linux-cluster] Hardware recommendation
> 
> hi,
> 
> I will be creating a cluster using red hat EL 5.3 consisting of 3
> nodes which will run Xen virtual machines. The first two nodes will
> host around 30 virtual machines ( all paravirt Linux and some Solaris
> if possible) . The third node will be basically a fail over node and
> will host virtual machines if there is a problem at node 1 or node 2.
> I do not know if this is the right way to ask but I need
> recommendation on which CPU to choose (i7, phenom, quad core or dual
> core) and how much RAM do I need on node 1 and node 2 and on also node
> 3. And also which server hardware is recommended for my purpose. My
> test cluster is based on Dell Poweredge 1800 machines with DRAC 4
> (which I am using for fencing) and looks like DELL Poweregde server
> can be a good candidate.
> 
> Any help is highly appreciated.
> Thanks
> Paras.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090725/bdf136cf/attachment.htm>

From jeff.sturm at eprize.com  Sat Jul 25 22:36:27 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Sat, 25 Jul 2009 18:36:27 -0400
Subject: [Linux-cluster] Hardware recommendation
In-Reply-To: <257968.54862.qm@web110211.mail.gq1.yahoo.com>
References: <64D0546C5EBBD147B75DE133D798665F02FDC531@hugo.eprize.local>
	<257968.54862.qm@web110211.mail.gq1.yahoo.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC534@hugo.eprize.local>

Good point-the requisite CPU technology has been available for at least
3 years, so I tend to assume most users are running 64-bit processors
with VT by now.  However, I suppose it's possible there are many older
systems still in production.

 
We happen to use several 1950-class machines which have been repurposed
for this.  They work very well with Xen hypervisors.  The 1950 can be
inexpensively upgraded to 16GB RAM, and have two sockets which can
accommodate 8 cores.  With shared storage, the 1U chassis gives very
good density where space is a premium.  If 1950's are available, they
can be a bargain for virtualization.

 
And, as you say, VT is required for some deployments.  Though I'd like
to think most non-VT capable hardware will be long gone by the time
RHEL5 reaches EOL.

 
Power consumption should be a consideration too.  It might be worth
retiring some systems to save on power alone.  I don't have good numbers
handy comparing power consumption vs. performance of different processor
generations-perhaps someone else does.

 
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Doug Bunger
Sent: Saturday, July 25, 2009 12:39 PM
To: linux clustering
Subject: RE: [Linux-cluster] Hardware recommendation

 
If you remission in house systems or (gag) get used systems, than CPU
***IS*** important.  You need to make sure you are running x86_64 with
CPU virtualization technology (VT) extensions.  You'll need x86_64 to
get past the i386 Xen 32meg memory limit, and you'll need VT to provide
an update path to RHEL6, vSphere, or even Micro$oft Hyper-V.

Many of my customers are running older equipment and are not wanting to
upgrade.  We are finding that in an HP shop, anything older than DL380
G5 won't cut it.  On the Dell side it seems the 1950 is the break point.

--- On Fri, 7/24/09, Jeff Sturm <jeff.sturm at eprize.com> wrote:


From: Jeff Sturm <jeff.sturm at eprize.com>
Subject: RE: [Linux-cluster] Hardware recommendation
To: "linux clustering" <linux-cluster at redhat.com>
Date: Friday, July 24, 2009, 6:21 PM

Don't go small.  If you're running 30-60 virtual machines you'll likely
need systems with at least 8 cores at 32GB of RAM, or more depending on
the needs of your virtual hosts.  There's really no way to guess at how
much RAM you'll need without understanding the needs of your virtual
hosts, however.

Dell PowerEdge hardware should work well for this.  We have some old
6950 systems we are now using for Xen.  These appear to have been
replaced by the R905 series, according to Dell's web site.

CPU isn't important.  All modern Intel (Core) and AMD processors work
well with Xen.  To save money I tend to opt for more cores at slower
speeds rather than buy the fastest speed processors available.

What are you using for storage?  If you want node failover you'll likely
want some sort of SAN for central storage.

I also recommend at least two network switches for redundancy, else
you'll regret it the first time you need to reboot a switch without
bringing down your cluster.

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Paras pradhan
> Sent: Friday, July 24, 2009 6:55 PM
> To: linux clustering
> Subject: [Linux-cluster] Hardware recommendation
> 
> hi,
> 
> I will be creating a cluster using red hat EL 5.3 consisting of 3
> nodes which will run Xen virtual machines. The first two nodes will
> host around 30 virtual machines ( all paravirt Linux and some Solaris
> if possible) . The third node will be basically a fail over node and
> will host virtual machines if there is a problem at node 1 or node 2.
> I do not know if this is the right way to ask but I need
> recommendation on which CPU to choose (i7, phenom, quad core or dual
> core) and how much RAM do I need on node 1 and node 2 and on also node
> 3. And also which server hardware is recommended for my purpose. My
> test cluster is based on Dell Poweredge 1800 machines with DRAC 4
> (which I am using for fencing) and looks like DELL Poweregde server
> can be a good candidate.
> 
> Any help is highly appreciated.
> Thanks
> Paras.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090725/edb8fd9d/attachment.htm>

From thomas at sjolshagen.net  Sun Jul 26 17:00:16 2009
From: thomas at sjolshagen.net (Thomas Sjolshagen)
Date: Sun, 26 Jul 2009 13:00:16 -0400
Subject: [Linux-cluster] lvm2: cluster request failed: Unknown error 65538
In-Reply-To: <561c252c0907220812l152462d9o27cd64ba37c017a4@mail.gmail.com>
References: <561c252c0907220812l152462d9o27cd64ba37c017a4@mail.gmail.com>
Message-ID: <20090726130016.19271lj1w10stbj4@www.sjolshagen.net>


Quoting Gianluca Cecchi <gianluca.cecchi at gmail.com>:

> Hello,
> by mistake I previously sent this to fedora-list.
> I resend to the appropriate list I wanted...
> Excuse in advance for eventual cross-posting effects for anyone...

Hi,

Just wanted to add a +1 to this report. Had it since the day I  
upgraded to Fedora 11 (from Fedora 10).

// Thomas


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.


From fdinitto at redhat.com  Mon Jul 27 06:43:02 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Mon, 27 Jul 2009 08:43:02 +0200
Subject: [Linux-cluster] Cluster 3.0.0 final stable release
In-Reply-To: <20090724002740.GA19266@bogon.sigxcpu.org>
References: <1247087412.7941.37.camel@cerberus.int.fabbione.net>
	<20090724002740.GA19266@bogon.sigxcpu.org>
Message-ID: <1248676982.7930.0.camel@cerberus.int.fabbione.net>

On Fri, 2009-07-24 at 02:27 +0200, Guido G?nther wrote:
> On Wed, Jul 08, 2009 at 11:10:12PM +0200, Fabio M. Di Nitto wrote:
> > The cluster team and its community are proud to announce the 3.0.0 final
> > release from the STABLE3 branch.
> > 
> > "And now what?"
> > 
> > The STABLE3 branch will continue to receive bug fixes and improvements
> > as feedback from our community and users will flow in.
> > Regular update releases will be available to sync with corosync/openais
> > releases and new kernels (for gfs1-kernel module).
> > 
> > In order to build the 3.0.0 release you will need:
> > 
> > - corosync 1.0.0
> > - openais 1.0.0
> > - linux kernel 2.6.29
> > 
> > The new source tarball can be downloaded here:
> > 
> > ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.tar.gz
> > https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.tar.gz
> > 
> > At the same location is now possible to find separated tarballs for
> > fence-agents and resource-agents as previously announced
> > (http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm)
> > 
> > To report bugs or issues:
> > 
> >    https://bugzilla.redhat.com/
> What would be the right component if not running on RHEL?

Simply use Fedora Rawhide. If it's not the right one, we will take care
to adjust the bug.

fabio


From fdinitto at redhat.com  Mon Jul 27 06:49:15 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Mon, 27 Jul 2009 08:49:15 +0200
Subject: [Linux-cluster] updating cluster.conf on one node, when the
	other is down
In-Reply-To: <20090724003543.GA19353@bogon.sigxcpu.org>
References: <20090724003543.GA19353@bogon.sigxcpu.org>
Message-ID: <1248677355.7930.5.camel@cerberus.int.fabbione.net>

On Fri, 2009-07-24 at 02:35 +0200, Guido G?nther wrote:
> Hi,
> taking node2 down, updating the cluster configuration on node1 then
> using "cman_tool version -r 7" on node1 and then booting node2 gives the
> error below:
> 
> corosync[1790]:   [QUORUM] This node is within the primary component and will provide service.
> corosync[1790]:   [QUORUM] Members[1]: 
> corosync[1790]:   [QUORUM]     2 
> corosync[1790]:   [CLM   ] CLM CONFIGURATION CHANGE
> corosync[1790]:   [CLM   ] New Configuration:
> corosync[1790]:   [CLM   ] 	r(0) ip(192.168.122.228) 
> corosync[1790]:   [CLM   ] Members Left:
> corosync[1790]:   [CLM   ] Members Joined:
> corosync[1790]:   [CLM   ] CLM CONFIGURATION CHANGE
> corosync[1790]:   [CLM   ] New Configuration:
> corosync[1790]:   [CLM   ] 	r(0) ip(192.168.122.82) 
> corosync[1790]:   [CLM   ] 	r(0) ip(192.168.122.228) 
> corosync[1790]:   [CLM   ] Members Left:
> corosync[1790]:   [CLM   ] Members Joined:
> corosync[1790]:   [CLM   ] 	r(0) ip(192.168.122.82) 
> corosync[1790]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> corosync[1790]:   [CMAN  ] Can't get updated config version 7, config file is version 5.
> corosync[1790]:   [QUORUM] This node is within the primary component and will provide service.
> corosync[1790]:   [QUORUM] Members[1]: 
> corosync[1790]:   [QUORUM]     2 
> corosync[1790]:   [CMAN  ] Node 1 conflict, remote config version id=7, local=5
> corosync[1790]:   [MAIN  ] Completed service synchronization, ready to provide service.
> corosync[1790]:   [CMAN  ] Can't get updated config version 7, config file is version 5.
> 
> Afterwards corosync is spinning with 100% cpu usage. This is cluster
> 3.0.0 with corosync/openais 1.0.0. cluster.conf is attached. Any ideas?

Yes, the configuration sync is now delegated to conga/luci via ccs_sync
or needs to be done manually. The configuration is expected to be
syncronized before cluster startup.

cman will try constantly to reload the config (probably too fast and
that's why you see cpu spinning) till it finds the right version so that
the cluster operations can start again.

I'll check with Chrissie about the spinning.

Fabio


From fdinitto at redhat.com  Mon Jul 27 06:53:48 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Mon, 27 Jul 2009 08:53:48 +0200
Subject: [Linux-cluster] lvm2: cluster request failed: Unknown error 65538
In-Reply-To: <561c252c0907220812l152462d9o27cd64ba37c017a4@mail.gmail.com>
References: <561c252c0907220812l152462d9o27cd64ba37c017a4@mail.gmail.com>
Message-ID: <1248677628.7930.8.camel@cerberus.int.fabbione.net>

On Wed, 2009-07-22 at 17:12 +0200, Gianluca Cecchi wrote:
> Hello,
> by mistake I previously sent this to fedora-list.
> I resend to the appropriate list I wanted...
> Excuse in advance for eventual cross-posting effects for anyone...
> 
> fedora11 x86_64 with lvm2, device-mapper and related packages updated at :
> lvm2-2.02.48-1.fc11.x86_64
> lvm2-cluster-2.02.48-1.fc11.x86_64
> device-mapper-1.02.33-1.fc11.x86_64
> device-mapper-libs-1.02.33-1.fc11.x86_64
> 
> I have 2 VGs: vg_virtfed that is a system vg (with root lv ans swap
> lv) and vg_qemu01 that is a clustered vg.
> At the moment only one node active

[SNIP]

known issue... I am working on updating packages in F11 and F10.

Due to some delays from upstream, this task has been delayed a lot.

It will be sorted soon.

Fabio


From wendell at bisonline.com  Mon Jul 27 17:23:59 2009
From: wendell at bisonline.com (Wendell Dingus)
Date: Mon, 27 Jul 2009 13:23:59 -0400 (EDT)
Subject: [Linux-cluster] GFS2 corruption/withdrawal/crash
In-Reply-To: <4190901.1991248715204185.JavaMail.wendell@philco>
Message-ID: <6903172.2011248715437038.JavaMail.wendell@philco>

This recently happened to us and I'm wondering if there's anything else we can do to prevent it and/or more fully recover from it.

3 physical nodes, 64-bit, 2.6.18-128.2.1.el5xen, 3 GFS2 filesystems mounted by each. 

A zero-byte file in a subdir about 3 levels deep that when touched in any way causes total meltdown. Details below...

We took the filesystem offline (all nodes) and ran gfs2_fsck against it. The FS is 6.2TB in size, living on 2gb/sec fibrechannel array. It took 9 hours to complete which was not as bad as I had feared it would be. Afterwards the filesystem was remounted and that zero-byte file was attempted to be removed and the same thing happened again. So it appears gfs2_fsck did not fix it. Since there are 3 GFS filesystems the bad part was that access to all 3 went away when one of them had an issue because GFS itself appears to have crashed. That's the part I don't understand and am pretty sure was not what should have happened.

After a full reboot we renamed the directory holding the "bad" zero-byte file to a directory in the root of that GFS filesystem and are simply avoiding it at this point. 

Thanks...

Description from a co-worker on what he found from researching this:

While hitting something on the filesystem, it runs in to an invalid metadata block, realizes the error & problem, and attempts to take the FS offline because it's bad. (To not risk additional corruption) 

Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: fatal: invalid metadata block 
Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: bh = 1633350398 (magic number) 
Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: function = gfs2_meta_indirect_buffer, file = /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_xen/meta_io.c, line = 33 
4 
Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: about to withdraw this file system 
Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: telling LM to withdraw 
Jul 22 04:11:57 srvname kernel: GFS2: fsid=cluname:raid1.1: withdrawn 
Jul 22 04:11:57 srvname kernel: 

Unfortunately... For some reason, when it completes the withdrawal process, gfs crashes... I'm sure it's not supposed to do that... It should continue allowing access to all of the other GFS filesystems, but since the gfs module is dieing, it kills access to any gfs filesystems. 

Jul 22 04:11:57 srvname kernel: Call Trace: 
Jul 22 04:11:57 srvname kernel: [<ffffffff8854891a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0 
Jul 22 04:11:57 srvname kernel: [<ffffffff80262907>] __wait_on_bit+0x60/0x6e 
Jul 22 04:11:57 srvname kernel: [<ffffffff80215788>] sync_buffer+0x0/0x3f 
Jul 22 04:11:57 srvname kernel: [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78 
Jul 22 04:11:57 srvname kernel: [<ffffffff8029a01a>] wake_bit_function+0x0/0x23 
Jul 22 04:11:57 srvname kernel: [<ffffffff8021a7f1>] submit_bh+0x10a/0x111 
Jul 22 04:11:57 srvname kernel: [<ffffffff8855a627>] :gfs2:gfs2_meta_check_ii+0x2c/0x38 
Jul 22 04:11:57 srvname kernel: [<ffffffff8854c168>] :gfs2:gfs2_meta_indirect_buffer+0x104/0x160 
Jul 22 04:11:57 srvname kernel: [<ffffffff8853c786>] :gfs2:recursive_scan+0x96/0x175 
Jul 22 04:11:57 srvname kernel: [<ffffffff8853c82c>] :gfs2:recursive_scan+0x13c/0x175 
Jul 22 04:11:57 srvname kernel: [<ffffffff8853d65a>] :gfs2:do_strip+0x0/0x358 
Jul 22 04:11:57 srvname kernel: [<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14 
Jul 22 04:11:57 srvname kernel: [<ffffffff8853c8fe>] :gfs2:trunc_dealloc+0x99/0xe7 
Jul 22 04:11:57 srvname kernel: [<ffffffff8853d65a>] :gfs2:do_strip+0x0/0x358 
Jul 22 04:11:57 srvname kernel: [<ffffffff88545149>] :gfs2:gfs2_glock_dq+0x1e/0x132 
Jul 22 04:11:57 srvname kernel: [<ffffffff8020b7bf>] kfree+0x15/0xc5 
Jul 22 04:11:57 srvname kernel: [<ffffffff8853df97>] :gfs2:gfs2_truncatei+0x5e5/0x70d 
Jul 22 04:11:57 srvname kernel: [<ffffffff88545149>] :gfs2:gfs2_glock_dq+0x1e/0x132 
Jul 22 04:11:57 srvname kernel: [<ffffffff88544b28>] :gfs2:gfs2_glock_put+0x1a/0xe2 
Jul 22 04:11:57 srvname kernel: [<ffffffff88550b83>] :gfs2:gfs2_setattr+0xe6/0x335 
Jul 22 04:11:57 srvname kernel: [<ffffffff88550acd>] :gfs2:gfs2_setattr+0x30/0x335 
Jul 22 04:11:57 srvname kernel: [<ffffffff8026349f>] __down_write_nested+0x35/0x9a 
Jul 22 04:11:57 srvname kernel: [<ffffffff8022caf2>] notify_change+0x145/0x2e0 
Jul 22 04:11:57 srvname kernel: [<ffffffff802ce6ae>] do_truncate+0x5e/0x79 
Jul 22 04:11:57 srvname kernel: [<ffffffff8020db96>] permission+0x81/0xc8 
Jul 22 04:11:57 srvname kernel: [<ffffffff80212b01>] may_open+0x1d3/0x22e 
Jul 22 04:11:57 srvname kernel: [<ffffffff8021b1c2>] open_namei+0x2c4/0x6d5 
Jul 22 04:11:57 srvname kernel: [<ffffffff802275c9>] do_filp_open+0x1c/0x38 
Jul 22 04:11:57 srvname kernel: [<ffffffff80219d14>] do_sys_open+0x44/0xbe 
Jul 22 04:11:57 srvname kernel: [<ffffffff8025f2f9>] tracesys+0xab/0xb6 
Jul 22 04:11:57 srvname kernel: 

Tail end of the gfs2_fsck run:
Ondisk and fsck bitmaps differ at block 1633350906 (0x615af4fa)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 1633350907 (0x615af4fb)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
RG #1633288608 (0x615a01a0) free count inconsistent: is 48931 should be 49441
Resource group counts updated
Pass5 complete
Writing changes to disk
gfs2_fsck complete

PS. Access to this file which we know has caused us to crash for sure: mv file1 file2 and rm -f file1 and echo "asdf" >file1 


From rpeterso at redhat.com  Mon Jul 27 18:10:42 2009
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 27 Jul 2009 14:10:42 -0400 (EDT)
Subject: [Linux-cluster] GFS2 corruption/withdrawal/crash
In-Reply-To: <825488907.953591248718120999.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
Message-ID: <1714694551.953751248718242229.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>

----- "Wendell Dingus" <wendell at bisonline.com> wrote:
| This recently happened to us and I'm wondering if there's anything
| else we can do to prevent it and/or more fully recover from it.
| 
| 3 physical nodes, 64-bit, 2.6.18-128.2.1.el5xen, 3 GFS2 filesystems
| mounted by each. 
| 
| A zero-byte file in a subdir about 3 levels deep that when touched in
| any way causes total meltdown. Details below...
| 
| We took the filesystem offline (all nodes) and ran gfs2_fsck against
| it. The FS is 6.2TB in size, living on 2gb/sec fibrechannel array. It
| took 9 hours to complete which was not as bad as I had feared it would
| be. Afterwards the filesystem was remounted and that zero-byte file
| was attempted to be removed and the same thing happened again. So it
| appears gfs2_fsck did not fix it. Since there are 3 GFS filesystems
| the bad part was that access to all 3 went away when one of them had
| an issue because GFS itself appears to have crashed. That's the part I
| don't understand and am pretty sure was not what should have
| happened.
| 
| After a full reboot we renamed the directory holding the "bad"
| zero-byte file to a directory in the root of that GFS filesystem and
| are simply avoiding it at this point. 
| 
| Thanks...
| 
| Description from a co-worker on what he found from researching this:
| 
| While hitting something on the filesystem, it runs in to an invalid
| metadata block, realizes the error & problem, and attempts to take the
| FS offline because it's bad. (To not risk additional corruption) 
| 
| Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: fatal:
| invalid metadata block 
| Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: bh =
| 1633350398 (magic number) 
| Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: function =
| gfs2_meta_indirect_buffer, file =
| /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_xen/meta_io.c, line =
| 33 
| 4 
| Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: about to
| withdraw this file system 
| Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: telling LM
| to withdraw 
| Jul 22 04:11:57 srvname kernel: GFS2: fsid=cluname:raid1.1: withdrawn
| 
| Jul 22 04:11:57 srvname kernel: 
| 
| Unfortunately... For some reason, when it completes the withdrawal
| process, gfs crashes... I'm sure it's not supposed to do that... It
| should continue allowing access to all of the other GFS filesystems,
| but since the gfs module is dieing, it kills access to any gfs
| filesystems. 
| 
| Jul 22 04:11:57 srvname kernel: Call Trace: 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8854891a>]
| :gfs2:gfs2_lm_withdraw+0xc1/0xd0 
| Jul 22 04:11:57 srvname kernel: [<ffffffff80262907>]
| __wait_on_bit+0x60/0x6e 
| Jul 22 04:11:57 srvname kernel: [<ffffffff80215788>]
| sync_buffer+0x0/0x3f 
| Jul 22 04:11:57 srvname kernel: [<ffffffff80262981>]
| out_of_line_wait_on_bit+0x6c/0x78 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8029a01a>]
| wake_bit_function+0x0/0x23 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8021a7f1>]
| submit_bh+0x10a/0x111 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8855a627>]
| :gfs2:gfs2_meta_check_ii+0x2c/0x38 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8854c168>]
| :gfs2:gfs2_meta_indirect_buffer+0x104/0x160 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853c786>]
| :gfs2:recursive_scan+0x96/0x175 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853c82c>]
| :gfs2:recursive_scan+0x13c/0x175 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853d65a>]
| :gfs2:do_strip+0x0/0x358 
| Jul 22 04:11:57 srvname kernel: [<ffffffff802639f9>]
| _spin_lock_irqsave+0x9/0x14 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853c8fe>]
| :gfs2:trunc_dealloc+0x99/0xe7 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853d65a>]
| :gfs2:do_strip+0x0/0x358 
| Jul 22 04:11:57 srvname kernel: [<ffffffff88545149>]
| :gfs2:gfs2_glock_dq+0x1e/0x132 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8020b7bf>] kfree+0x15/0xc5 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853df97>]
| :gfs2:gfs2_truncatei+0x5e5/0x70d 
| Jul 22 04:11:57 srvname kernel: [<ffffffff88545149>]
| :gfs2:gfs2_glock_dq+0x1e/0x132 
| Jul 22 04:11:57 srvname kernel: [<ffffffff88544b28>]
| :gfs2:gfs2_glock_put+0x1a/0xe2 
| Jul 22 04:11:57 srvname kernel: [<ffffffff88550b83>]
| :gfs2:gfs2_setattr+0xe6/0x335 
| Jul 22 04:11:57 srvname kernel: [<ffffffff88550acd>]
| :gfs2:gfs2_setattr+0x30/0x335 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8026349f>]
| __down_write_nested+0x35/0x9a 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8022caf2>]
| notify_change+0x145/0x2e0 
| Jul 22 04:11:57 srvname kernel: [<ffffffff802ce6ae>]
| do_truncate+0x5e/0x79 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8020db96>]
| permission+0x81/0xc8 
| Jul 22 04:11:57 srvname kernel: [<ffffffff80212b01>]
| may_open+0x1d3/0x22e 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8021b1c2>]
| open_namei+0x2c4/0x6d5 
| Jul 22 04:11:57 srvname kernel: [<ffffffff802275c9>]
| do_filp_open+0x1c/0x38 
| Jul 22 04:11:57 srvname kernel: [<ffffffff80219d14>]
| do_sys_open+0x44/0xbe 
| Jul 22 04:11:57 srvname kernel: [<ffffffff8025f2f9>]
| tracesys+0xab/0xb6 
| Jul 22 04:11:57 srvname kernel: 
| 
| Tail end of the gfs2_fsck run:
| Ondisk and fsck bitmaps differ at block 1633350906 (0x615af4fa)
| Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
| Metadata type is 0 (free)
| Succeeded.
| Ondisk and fsck bitmaps differ at block 1633350907 (0x615af4fb)
| Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
| Metadata type is 0 (free)
| Succeeded.
| RG #1633288608 (0x615a01a0) free count inconsistent: is 48931 should
| be 49441
| Resource group counts updated
| Pass5 complete
| Writing changes to disk
| gfs2_fsck complete
| 
| PS. Access to this file which we know has caused us to crash for sure:
| mv file1 file2 and rm -f file1 and echo "asdf" >file1 

Hi Wendell,

The fsck.gfs2 program should fix this kind of error, but there
are some known fsck bugs.  I've been working on a big fix for
fsck.gfs and fsck.gfs2 lately that solves many problems, and there
is a chance it will solve yours.  If you don't mind, perhaps you
can send me a copy of your file system metadata, and I will
run my latest-and-greatest against it to see if it will detect
and fix the problem.  Even if it doesn't, perhaps I can even adjust
my patch to fix your file system.

Regards,

Bob Peterson
Red Hat File Systems


From brem.belguebli at gmail.com  Mon Jul 27 19:02:24 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Mon, 27 Jul 2009 21:02:24 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <29ae894c0907210824m751038fak861f5f9d2a8ca76b@mail.gmail.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<4A6590AE.2040808@redhat.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
	<1248188185.6464.2.camel@mecatol>
	<29ae894c0907210824m751038fak861f5f9d2a8ca76b@mail.gmail.com>
Message-ID: <29ae894c0907271202y6b34afe1w3bfa53ab6a04c291@mail.gmail.com>

Hi Rafael,

It works fine, well at least when not hiting some CLVM strange behaviours,
that I'm able to replay by hand, so your script is allright.

I'll post to linux-lvm what I could see.

Brem


2009/7/21, brem belguebli <brem.belguebli at gmail.com>:
>
> Hola Rafael,
>
> Thanks a lot, that'll avoid me going from scratch.
>
> I'll have a look at them and keep you updated.
>
> Brem
>
>
>
> 2009/7/21, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
>>
>> Hi Brem,
>>
>> El mar, 21-07-2009 a las 16:40 +0200, brem belguebli escribi?:
>> > Hi,
>> >
>> > That's what I 'm trying to do.
>> >
>> > If you mean lvm.sh, well, I've been playing with it, but it does some
>> > "sanity" checks that are wierd
>> >      1. It expects HA LVM to be setup (why such check if we want to
>> >         use CLVM).
>> >      2. it exits if it finds a CLVM VG  (kind of funny !)
>> >      3. it exits if the lvm.conf is newer than /boot/*.img (about this
>> >         one, we tend to prevent the cluster from automatically
>> >         starting ...)
>> > I was looking to find some doc on how to write my own resources, ie
>> > CLVM resource that checks if the vg is clustered, if so by which node
>> > is it exclusively held, and if the node is down to activate
>> > exclusively the VG.
>> >
>> > If you have some good links to provide me, that'll be great.
>> >
>> > Thanks
>> >
>> >
>> > 2009/7/21, Christine Caulfield <ccaulfie at redhat.com>:
>> >         On 07/21/2009 01:11 PM, brem belguebli wrote:
>> >                 Hi,
>> >                 When creating the VG by default clustered, you
>> >                 implicitely assume that
>> >                 it will be used with a clustered FS on top of it (gfs,
>> >                 ocfs, etc...)
>> >                 that will handle the active/active mode.
>> >                 As I do not intend to use GFS in this particular case,
>> >                 but ext3 and raw
>> >                 devices, I need to make sure the vg is exclusively
>> >                 activated on one
>> >                 node, preventing the other nodes to access it unless
>> >                 it is the failover
>> >                 procedure (node holding the VG crashed) and then re
>> >                 activate it
>> >                 exclusively on the failover node.
>> >                 Thanks
>> >
>> >
>> >         In that case you probably ought to be using rgmanager to do
>> >         the failover for you. It has a script for doing exactly
>> >         this :-)
>> >
>> >         Chrissie
>> >
>> >
>> >         --
>> >         Linux-cluster mailing list
>> >         Linux-cluster at redhat.com
>> >         https://www.redhat.com/mailman/listinfo/linux-cluster
>> >
>> >
>> > --
>> > Linux-cluster mailing list
>> > Linux-cluster at redhat.com
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>> Please, check this link:
>>
>> https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
>>
>> I found exactly the same problem as you, and i developed the
>> "lvm-cluster.sh" script to solve the needs I had. You can find the
>> script on the last message of the thread.
>>
>> I submitted it to make it part of the main project, but i have no news
>> about that yet.
>>
>> I hope this helps.
>>
>> Cheers,
>>
>> Rafael
>>
>> --
>> Rafael Mic? Miranda
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090727/4f8fbc28/attachment.htm>

From rmicmirregs at gmail.com  Mon Jul 27 20:59:15 2009
From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda)
Date: Mon, 27 Jul 2009 22:59:15 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <29ae894c0907271202y6b34afe1w3bfa53ab6a04c291@mail.gmail.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<4A6590AE.2040808@redhat.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
	<1248188185.6464.2.camel@mecatol>
	<29ae894c0907210824m751038fak861f5f9d2a8ca76b@mail.gmail.com>
	<29ae894c0907271202y6b34afe1w3bfa53ab6a04c291@mail.gmail.com>
Message-ID: <1248728355.7027.3.camel@mecatol>

Hi Brem

So, does it work successfully? I made some testing before I submitted it
to the list and AFAIK i found no errors.

What do you mean exactly with "some CLVM strange behaviours"? Could you
be more specific?

I'm not subscribed to linux-lvm, please keep us informed through this
list.

Thanks in advance. Cheers,

Rafael

El lun, 27-07-2009 a las 21:02 +0200, brem belguebli escribi?:
> Hi Rafael,
>  
> It works fine, well at least when not hiting some CLVM strange
> behaviours, that I'm able to replay by hand, so your script is
> allright.
>  
> I'll post to linux-lvm what I could see.
>  
> Brem
> 
>  
> 2009/7/21, brem belguebli <brem.belguebli at gmail.com>: 
>         Hola Rafael,
>          
>         Thanks a lot, that'll avoid me going from scratch.
>          
>         I'll have a look at them and keep you updated.
>          
>         Brem
>          
>         
>          
>         2009/7/21, Rafael Mic? Miranda <rmicmirregs at gmail.com>: 
>                 Hi Brem,
>                 
>                 El mar, 21-07-2009 a las 16:40 +0200, brem belguebli
>                 escribi?:
>                 > Hi,
>                 >
>                 > That's what I 'm trying to do.
>                 >
>                 > If you mean lvm.sh, well, I've been playing with it,
>                 but it does some
>                 > "sanity" checks that are wierd
>                 >      1. It expects HA LVM to be setup (why such
>                 check if we want to
>                 >         use CLVM).
>                 >      2. it exits if it finds a CLVM VG  (kind of
>                 funny !)
>                 >      3. it exits if the lvm.conf is newer
>                 than /boot/*.img (about this
>                 >         one, we tend to prevent the cluster from
>                 automatically
>                 >         starting ...)
>                 > I was looking to find some doc on how to write my
>                 own resources, ie
>                 > CLVM resource that checks if the vg is clustered, if
>                 so by which node
>                 > is it exclusively held, and if the node is down to
>                 activate
>                 > exclusively the VG.
>                 >
>                 > If you have some good links to provide me, that'll
>                 be great.
>                 >
>                 > Thanks
>                 >
>                 >
>                 > 2009/7/21, Christine Caulfield
>                 <ccaulfie at redhat.com>:
>                 >         On 07/21/2009 01:11 PM, brem belguebli
>                 wrote:
>                 >                 Hi,
>                 >                 When creating the VG by default
>                 clustered, you
>                 >                 implicitely assume that
>                 >                 it will be used with a clustered FS
>                 on top of it (gfs,
>                 >                 ocfs, etc...)
>                 >                 that will handle the active/active
>                 mode.
>                 >                 As I do not intend to use GFS in
>                 this particular case,
>                 >                 but ext3 and raw
>                 >                 devices, I need to make sure the vg
>                 is exclusively
>                 >                 activated on one
>                 >                 node, preventing the other nodes to
>                 access it unless
>                 >                 it is the failover
>                 >                 procedure (node holding the VG
>                 crashed) and then re
>                 >                 activate it
>                 >                 exclusively on the failover node.
>                 >                 Thanks
>                 >
>                 >
>                 >         In that case you probably ought to be using
>                 rgmanager to do
>                 >         the failover for you. It has a script for
>                 doing exactly
>                 >         this :-)
>                 >
>                 >         Chrissie
>                 >
>                 >
>                 >         --
>                 >         Linux-cluster mailing list
>                 >         Linux-cluster at redhat.com
>                 >
>                 https://www.redhat.com/mailman/listinfo/linux-cluster
>                 >
>                 >
>                 > --
>                 > Linux-cluster mailing list
>                 > Linux-cluster at redhat.com
>                 >
>                 https://www.redhat.com/mailman/listinfo/linux-cluster
>                 
>                 Please, check this link:
>                 
>                 https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
>                 
>                 I found exactly the same problem as you, and i
>                 developed the
>                 "lvm-cluster.sh" script to solve the needs I had. You
>                 can find the
>                 script on the last message of the thread.
>                 
>                 I submitted it to make it part of the main project,
>                 but i have no news
>                 about that yet.
>                 
>                 I hope this helps.
>                 
>                 Cheers,
>                 
>                 Rafael
>                 
>                 --
>                 Rafael Mic? Miranda
>                 
>                 --
>                 Linux-cluster mailing list
>                 Linux-cluster at redhat.com
>                 https://www.redhat.com/mailman/listinfo/linux-cluster
>         
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
-- 
Rafael Mic? Miranda


From brem.belguebli at gmail.com  Mon Jul 27 21:26:37 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Mon, 27 Jul 2009 23:26:37 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <1248728355.7027.3.camel@mecatol>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
	<1248188185.6464.2.camel@mecatol>
	<29ae894c0907210824m751038fak861f5f9d2a8ca76b@mail.gmail.com>
	<29ae894c0907271202y6b34afe1w3bfa53ab6a04c291@mail.gmail.com>
	<1248728355.7027.3.camel@mecatol>
Message-ID: <29ae894c0907271426l251a7b92gcb402894e3742258@mail.gmail.com>

Hi Rafael,

On RHEL 5.3, locks thru DLM aren't reliable all the time, I've been able to
activate a VG on a node of the cluster even if it was already activated
exclusively on another node.
Also, I've been able to activate a LV on a node not holding the exclusive
lock on the VG by typing 2 times lvchange -a y /dev/VGX/lvoly .

First try lvchange tells you there is a lock, second try activates it.

Checking dlm debug (mount -t debugfs debug /debug) /debug/dlm/clvmd_locks
gives a lock for nodeid 0 ....

Brem

2009/7/27 Rafael Mic? Miranda <rmicmirregs at gmail.com>

> Hi Brem
>
> So, does it work successfully? I made some testing before I submitted it
> to the list and AFAIK i found no errors.
>
> What do you mean exactly with "some CLVM strange behaviours"? Could you
> be more specific?
>
> I'm not subscribed to linux-lvm, please keep us informed through this
> list.
>
> Thanks in advance. Cheers,
>
> Rafael
>
> El lun, 27-07-2009 a las 21:02 +0200, brem belguebli escribi?:
> > Hi Rafael,
> >
> > It works fine, well at least when not hiting some CLVM strange
> > behaviours, that I'm able to replay by hand, so your script is
> > allright.
> >
> > I'll post to linux-lvm what I could see.
> >
> > Brem
> >
> >
> > 2009/7/21, brem belguebli <brem.belguebli at gmail.com>:
> >         Hola Rafael,
> >
> >         Thanks a lot, that'll avoid me going from scratch.
> >
> >         I'll have a look at them and keep you updated.
> >
> >         Brem
> >
> >
> >
> >         2009/7/21, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
> >                 Hi Brem,
> >
> >                 El mar, 21-07-2009 a las 16:40 +0200, brem belguebli
> >                 escribi?:
> >                 > Hi,
> >                 >
> >                 > That's what I 'm trying to do.
> >                 >
> >                 > If you mean lvm.sh, well, I've been playing with it,
> >                 but it does some
> >                 > "sanity" checks that are wierd
> >                 >      1. It expects HA LVM to be setup (why such
> >                 check if we want to
> >                 >         use CLVM).
> >                 >      2. it exits if it finds a CLVM VG  (kind of
> >                 funny !)
> >                 >      3. it exits if the lvm.conf is newer
> >                 than /boot/*.img (about this
> >                 >         one, we tend to prevent the cluster from
> >                 automatically
> >                 >         starting ...)
> >                 > I was looking to find some doc on how to write my
> >                 own resources, ie
> >                 > CLVM resource that checks if the vg is clustered, if
> >                 so by which node
> >                 > is it exclusively held, and if the node is down to
> >                 activate
> >                 > exclusively the VG.
> >                 >
> >                 > If you have some good links to provide me, that'll
> >                 be great.
> >                 >
> >                 > Thanks
> >                 >
> >                 >
> >                 > 2009/7/21, Christine Caulfield
> >                 <ccaulfie at redhat.com>:
> >                 >         On 07/21/2009 01:11 PM, brem belguebli
> >                 wrote:
> >                 >                 Hi,
> >                 >                 When creating the VG by default
> >                 clustered, you
> >                 >                 implicitely assume that
> >                 >                 it will be used with a clustered FS
> >                 on top of it (gfs,
> >                 >                 ocfs, etc...)
> >                 >                 that will handle the active/active
> >                 mode.
> >                 >                 As I do not intend to use GFS in
> >                 this particular case,
> >                 >                 but ext3 and raw
> >                 >                 devices, I need to make sure the vg
> >                 is exclusively
> >                 >                 activated on one
> >                 >                 node, preventing the other nodes to
> >                 access it unless
> >                 >                 it is the failover
> >                 >                 procedure (node holding the VG
> >                 crashed) and then re
> >                 >                 activate it
> >                 >                 exclusively on the failover node.
> >                 >                 Thanks
> >                 >
> >                 >
> >                 >         In that case you probably ought to be using
> >                 rgmanager to do
> >                 >         the failover for you. It has a script for
> >                 doing exactly
> >                 >         this :-)
> >                 >
> >                 >         Chrissie
> >                 >
> >                 >
> >                 >         --
> >                 >         Linux-cluster mailing list
> >                 >         Linux-cluster at redhat.com
> >                 >
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >                 >
> >                 >
> >                 > --
> >                 > Linux-cluster mailing list
> >                 > Linux-cluster at redhat.com
> >                 >
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >                 Please, check this link:
> >
> >
> https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
> >
> >                 I found exactly the same problem as you, and i
> >                 developed the
> >                 "lvm-cluster.sh" script to solve the needs I had. You
> >                 can find the
> >                 script on the last message of the thread.
> >
> >                 I submitted it to make it part of the main project,
> >                 but i have no news
> >                 about that yet.
> >
> >                 I hope this helps.
> >
> >                 Cheers,
> >
> >                 Rafael
> >
> >                 --
> >                 Rafael Mic? Miranda
> >
> >                 --
> >                 Linux-cluster mailing list
> >                 Linux-cluster at redhat.com
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> --
> Rafael Mic? Miranda
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090727/33d90687/attachment.htm>

From sendtosumathi at gmail.com  Tue Jul 28 05:47:52 2009
From: sendtosumathi at gmail.com (Sumathi Kuppusamy)
Date: Tue, 28 Jul 2009 11:17:52 +0530
Subject: [Linux-cluster] Configuring multiple instances of Tomcat in RHCS
Message-ID: <e26deecd0907272247r5744ea8bsb8ff501e1179c5cd@mail.gmail.com>

Hi,

I am investigatong on Tomcat Resource Agent on RHCS.
RedHat cluster suite Tomcat Resource Agent supports multiple instances of
Tomcat?

I could see that the Tomcat service is started by the script like,
"org.apache.catalina.startup.Bootstrap start" and not by "catalina.sh
start".
So not sure which isntance of Tomcat it will be starting.

And when we try to configure Tomcat RA in RHCS, the configuration file is
taken by default as /usr/share/tomcat5/tomcat5.conf. Incase of multiple
tomcat instances, how do we maintain different tomcat config files and
startup scripts.
I am not clear on the supporting multiple instances of Tomcat in RHCS.
Please help me in this regard.

Thanks,
Sumathi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090728/78aad9a4/attachment.htm>

From listener at may.co.at  Tue Jul 28 08:54:11 2009
From: listener at may.co.at (Wolfgang Hotwagner)
Date: Tue, 28 Jul 2009 10:54:11 +0200
Subject: [Linux-cluster] Waiting for fenced to join the fence group
In-Reply-To: <4A65ABBA.3030305@may.co.at>
References: <4A65ABBA.3030305@may.co.at>
Message-ID: <4A6EBCB3.9090904@may.co.at>

No ideas?

Wolfgang Hotwagner wrote:
> Hello,
>
> i am not able to make a gfs2-cluster on a drbd-device. I always have the
> problem with joining the fence group. I am using a debian stable(lenny)
> system. On eth0 there is also a ctdb-service which enables 2 additional
> ip's. Maybe someone could help me to get it working..
>
> Greetings
> Wolfgang
>
>
>
>
>
>
>
> dslin1:
> eth0: 172.30.50.83
> eth1: 10.13.13.2
>
> /etc/hosts:
> 127.0.0.1       localhost
> 172.30.50.83    dslin1
> 172.30.50.84    dslin2
> 10.13.13.2      node1
> 10.13.13.3      node2
>
>
> /proc/drbd:
> version: 8.0.14 (api:86/proto:86)
> GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by
> phil at fat-tyre, 2008-11-12 16:40:33
>  0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---
>     ns:0 nr:12288 dw:12288 dr:0 al:0 bm:3 lo:0 pe:0 ua:0 ap:0
>         resync: used:0/61 hits:765 misses:3 starving:0 dirty:0 changed:3
>         act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
>
>
>
> syslog:
> Jul 21 13:38:27 dslin1 ccsd[14975]: Starting ccsd 2.03.09:
> Jul 21 13:38:27 dslin1 ccsd[14975]:  Built: Nov  3 2008 18:22:21
> Jul 21 13:38:27 dslin1 ccsd[14975]:  Copyright (C) Red Hat, Inc.
> 2004-2008  All rights reserved.
> Jul 21 13:38:28 dslin1 ccsd[14975]: /etc/cluster/cluster.conf (cluster
> name = cluster, version = 1) found.
> Jul 21 13:38:31 dslin1 ccsd[14975]: Initial status:: Quorate
> Jul 21 13:38:35 dslin1 openais[14980]: cman killed by node 2 because we
> rejoined the cluster without a full restart
> Jul 21 13:38:35 dslin1 groupd[14984]: cman_get_nodes error -1 104
> Jul 21 13:38:35 dslin1 gfs_controld[14992]: cluster is down, exiting
> Jul 21 13:39:00 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 30 seconds.
> Jul 21 13:39:30 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 60 seconds.
> Jul 21 13:40:00 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 90 seconds.
> Jul 21 13:40:30 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 120 seconds.
> and so on..
>
>
>
>
>
> dslin2:
> eth0: 172.30.50.84
> eth1: 10.13.13.3
>
> /etc/hosts:
> 127.0.0.1       localhost
> 172.30.50.83    dslin1
> 172.30.50.84    dslin2
> 10.13.13.2      node1
> 10.13.13.3      node2
>
>
> /proc/drbd
> version: 8.0.14 (api:86/proto:86)
> GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by
> phil at fat-tyre, 2008-11-12 16:40:33
>  0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---
>     ns:12292 nr:0 dw:0 dr:12296 al:0 bm:6 lo:0 pe:0 ua:0 ap:0
>         resync: used:0/61 hits:765 misses:3 starving:0 dirty:0 changed:3
>         act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
>
>
> syslog:
> Jul 21 13:38:27 dslin1 ccsd[14975]: Starting ccsd 2.03.09:
> Jul 21 13:38:27 dslin1 ccsd[14975]:  Built: Nov  3 2008 18:22:21
> Jul 21 13:38:27 dslin1 ccsd[14975]:  Copyright (C) Red Hat, Inc.
> 2004-2008  All rights reserved.
> Jul 21 13:38:28 dslin1 ccsd[14975]: /etc/cluster/cluster.conf (cluster
> name = cluster, version = 1) found.
> Jul 21 13:38:31 dslin1 ccsd[14975]: Initial status:: Quorate
> Jul 21 13:38:35 dslin1 openais[14980]: cman killed by node 2 because we
> rejoined the cluster without a full restart
> Jul 21 13:38:35 dslin1 groupd[14984]: cman_get_nodes error -1 104
> Jul 21 13:38:35 dslin1 gfs_controld[14992]: cluster is down, exiting
> Jul 21 13:39:00 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 30 seconds.
> Jul 21 13:39:30 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 60 seconds.
> Jul 21 13:40:00 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 90 seconds.
> Jul 21 13:40:30 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 120 seconds.
> Jul 21 13:41:00 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 150 seconds.
> Jul 21 13:41:30 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 180 seconds.
> Jul 21 13:42:00 dslin1 ccsd[14975]: Unable to connect to cluster
> infrastructure after 210 seconds.
> and so on..
>
>
>
>
>
>
>
> /etc/cluster/cluster.conf:
> <?xml version="1.0"?>
> <cluster name="cluster" config_version="1">
>   <!-- post_join_delay: number of seconds the daemon will wait before
>                         fencing any victims after a node joins the domain
>        post_fail_delay: number of seconds the daemon will wait before
>                         fencing any victims after a domain member fails
>        clean_start    : prevent any startup fencing the daemon might do.
>                         It indicates that the daemon should assume all nodes
>                         are in a clean state to start. -->
>   <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>   <clusternodes>
>     <clusternode name="dslin1" votes="1" nodeid="1">
>       <fence>
>         <!-- Handle fencing manually -->
>         <method name="human">
>           <device name="human" nodename="dslin1" ipaddr="10.13.13.2"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="dslin2" votes="1" nodeid="2">
>       <fence>
>         <!-- Handle fencing manually -->
>         <method name="human">
>           <device name="human" nodename="dslin2" ipaddr="10.13.13.3"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <!-- cman two nodes specification -->
>   <cman expected_votes="1" two_node="1"/>
>   <fencedevices>
>     <!-- Define manual fencing -->
>     <fencedevice name="human" agent="fence_manual"/>
>   </fencedevices>
> </cluster>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   


From swhiteho at redhat.com  Tue Jul 28 10:15:40 2009
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Tue, 28 Jul 2009 11:15:40 +0100
Subject: [Linux-cluster] GFS2 corruption/withdrawal/crash
In-Reply-To: <6903172.2011248715437038.JavaMail.wendell@philco>
References: <6903172.2011248715437038.JavaMail.wendell@philco>
Message-ID: <1248776140.3411.10.camel@localhost.localdomain>

Hi,

On Mon, 2009-07-27 at 13:23 -0400, Wendell Dingus wrote:
> This recently happened to us and I'm wondering if there's anything else we can do to prevent it and/or more fully recover from it.
> 
> 3 physical nodes, 64-bit, 2.6.18-128.2.1.el5xen, 3 GFS2 filesystems mounted by each. 
> 
> A zero-byte file in a subdir about 3 levels deep that when touched in any way causes total meltdown. Details below...
> 
I'd be very interested to know about the circumstances which led up to
this file getting into that state. Was it always a zero byte file, or
has it been truncated at some stage from some larger size? Was there a
prior fs crash at some time, or has the fs been otherwise reliable since
mkfs time?

Anything that you can tell us about the history of this file would be
very interesting to know.

> We took the filesystem offline (all nodes) and ran gfs2_fsck against it. The FS is 6.2TB in size, living on 2gb/sec fibrechannel array. It took 9 hours to complete which was not as bad as I had feared it would be. Afterwards the filesystem was remounted and that zero-byte file was attempted to be removed and the same thing happened again. So it appears gfs2_fsck did not fix it. Since there are 3 GFS filesystems the bad part was that access to all 3 went away when one of them had an issue because GFS itself appears to have crashed. That's the part I don't understand and am pretty sure was not what should have happened.
> 
> After a full reboot we renamed the directory holding the "bad" zero-byte file to a directory in the root of that GFS filesystem and are simply avoiding it at this point. 
> 
> Thanks...
> 
> Description from a co-worker on what he found from researching this:
> 
> While hitting something on the filesystem, it runs in to an invalid metadata block, realizes the error & problem, and attempts to take the FS offline because it's bad. (To not risk additional corruption) 
> 
> Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: fatal: invalid metadata block 
> Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: bh = 1633350398 (magic number) 
> Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: function = gfs2_meta_indirect_buffer, file = /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_xen/meta_io.c, line = 33 
> 4 
> Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: about to withdraw this file system 
> Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: telling LM to withdraw 
> Jul 22 04:11:57 srvname kernel: GFS2: fsid=cluname:raid1.1: withdrawn 
> Jul 22 04:11:57 srvname kernel: 
> 
> Unfortunately... For some reason, when it completes the withdrawal process, gfs crashes... I'm sure it's not supposed to do that... It should continue allowing access to all of the other GFS filesystems, but since the gfs module is dieing, it kills access to any gfs filesystems. 
> 
Ideally it wouldn't crash. In reality there are cases where what we need
to do in order to recover from an error gracefully cannot be done in the
context in which the error has occurred. The context in this case
usually means the locks which are being held at the time. There is some
ongoing work to try and improve on this, particularly wrt to corrupt
on-disk structures. In some cases we can now just return -EIO to the
user and carry on rather than withdrawing from the cluster.

The interesting thing in this case is that if the file is zero length,
it shouldn't have any indirect blocks at all, so it looks like the inode
height might have become corrupt. If you are able to save the metadata
from this fs, then that is something which we would find very helpful to
have a look at,

Steve.


From hxinwei at gmail.com  Tue Jul 28 09:30:18 2009
From: hxinwei at gmail.com (Xinwei Hu)
Date: Tue, 28 Jul 2009 17:30:18 +0800
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <29ae894c0907271426l251a7b92gcb402894e3742258@mail.gmail.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
	<1248188185.6464.2.camel@mecatol>
	<29ae894c0907210824m751038fak861f5f9d2a8ca76b@mail.gmail.com>
	<29ae894c0907271202y6b34afe1w3bfa53ab6a04c291@mail.gmail.com>
	<1248728355.7027.3.camel@mecatol>
	<29ae894c0907271426l251a7b92gcb402894e3742258@mail.gmail.com>
Message-ID: <1cafab770907280230t7922d166rde506bb84abc5905@mail.gmail.com>

Hi Brem,

 I guess the cause of the problem is using 'lvchange -ay'.

 Clvmd actually does lock conversion underlying. So when you tried
'lvchange -ay',
the exclusive lock will be converted to an non-exclusive lock. And that means
the second try will success anyway.

2009/7/28 brem belguebli <brem.belguebli at gmail.com>:
> Hi Rafael,
> On RHEL 5.3, locks thru DLM aren't reliable all the time, I've been able to
> activate a VG on a node of the cluster even if it was already activated
> exclusively on another node.
> Also, I've been able to activate a LV on a node not holding the exclusive
> lock on the VG by typing 2 times lvchange -a y /dev/VGX/lvoly .
> First try lvchange tells you there is a lock, second try activates it.
> Checking dlm debug (mount -t debugfs debug /debug) /debug/dlm/clvmd_locks
> gives a lock for nodeid 0 ....
> Brem
>
> 2009/7/27 Rafael Mic? Miranda <rmicmirregs at gmail.com>
>>
>> Hi Brem
>>
>> So, does it work successfully? I made some testing before I submitted it
>> to the list and AFAIK i found no errors.
>>
>> What do you mean exactly with "some CLVM strange behaviours"? Could you
>> be more specific?
>>
>> I'm not subscribed to linux-lvm, please keep us informed through this
>> list.
>>
>> Thanks in advance. Cheers,
>>
>> Rafael
>>
>> El lun, 27-07-2009 a las 21:02 +0200, brem belguebli escribi?:
>> > Hi Rafael,
>> >
>> > It works fine, well at least when not hiting some CLVM strange
>> > behaviours, that I'm able to replay by hand, so your script is
>> > allright.
>> >
>> > I'll post to linux-lvm what I could see.
>> >
>> > Brem
>> >
>> >
>> > 2009/7/21, brem belguebli <brem.belguebli at gmail.com>:
>> > ? ? ? ? Hola Rafael,
>> >
>> > ? ? ? ? Thanks a lot, that'll avoid me going from scratch.
>> >
>> > ? ? ? ? I'll have a look at them and keep you updated.
>> >
>> > ? ? ? ? Brem
>> >
>> >
>> >
>> > ? ? ? ? 2009/7/21, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
>> > ? ? ? ? ? ? ? ? Hi Brem,
>> >
>> > ? ? ? ? ? ? ? ? El mar, 21-07-2009 a las 16:40 +0200, brem belguebli
>> > ? ? ? ? ? ? ? ? escribi?:
>> > ? ? ? ? ? ? ? ? > Hi,
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? > That's what I 'm trying to do.
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? > If you mean lvm.sh, well, I've been playing with it,
>> > ? ? ? ? ? ? ? ? but it does some
>> > ? ? ? ? ? ? ? ? > "sanity" checks that are wierd
>> > ? ? ? ? ? ? ? ? > ? ? ?1. It expects HA LVM to be setup (why such
>> > ? ? ? ? ? ? ? ? check if we want to
>> > ? ? ? ? ? ? ? ? > ? ? ? ? use CLVM).
>> > ? ? ? ? ? ? ? ? > ? ? ?2. it exits if it finds a CLVM VG ?(kind of
>> > ? ? ? ? ? ? ? ? funny !)
>> > ? ? ? ? ? ? ? ? > ? ? ?3. it exits if the lvm.conf is newer
>> > ? ? ? ? ? ? ? ? than /boot/*.img (about this
>> > ? ? ? ? ? ? ? ? > ? ? ? ? one, we tend to prevent the cluster from
>> > ? ? ? ? ? ? ? ? automatically
>> > ? ? ? ? ? ? ? ? > ? ? ? ? starting ...)
>> > ? ? ? ? ? ? ? ? > I was looking to find some doc on how to write my
>> > ? ? ? ? ? ? ? ? own resources, ie
>> > ? ? ? ? ? ? ? ? > CLVM resource that checks if the vg is clustered, if
>> > ? ? ? ? ? ? ? ? so by which node
>> > ? ? ? ? ? ? ? ? > is it exclusively held, and if the node is down to
>> > ? ? ? ? ? ? ? ? activate
>> > ? ? ? ? ? ? ? ? > exclusively the VG.
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? > If you have some good links to provide me, that'll
>> > ? ? ? ? ? ? ? ? be great.
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? > Thanks
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? > 2009/7/21, Christine Caulfield
>> > ? ? ? ? ? ? ? ? <ccaulfie at redhat.com>:
>> > ? ? ? ? ? ? ? ? > ? ? ? ? On 07/21/2009 01:11 PM, brem belguebli
>> > ? ? ? ? ? ? ? ? wrote:
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? Hi,
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? When creating the VG by default
>> > ? ? ? ? ? ? ? ? clustered, you
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? implicitely assume that
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? it will be used with a clustered FS
>> > ? ? ? ? ? ? ? ? on top of it (gfs,
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ocfs, etc...)
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? that will handle the active/active
>> > ? ? ? ? ? ? ? ? mode.
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? As I do not intend to use GFS in
>> > ? ? ? ? ? ? ? ? this particular case,
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? but ext3 and raw
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? devices, I need to make sure the vg
>> > ? ? ? ? ? ? ? ? is exclusively
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? activated on one
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? node, preventing the other nodes to
>> > ? ? ? ? ? ? ? ? access it unless
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? it is the failover
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? procedure (node holding the VG
>> > ? ? ? ? ? ? ? ? crashed) and then re
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? activate it
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? exclusively on the failover node.
>> > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? Thanks
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? > ? ? ? ? In that case you probably ought to be using
>> > ? ? ? ? ? ? ? ? rgmanager to do
>> > ? ? ? ? ? ? ? ? > ? ? ? ? the failover for you. It has a script for
>> > ? ? ? ? ? ? ? ? doing exactly
>> > ? ? ? ? ? ? ? ? > ? ? ? ? this :-)
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? > ? ? ? ? Chrissie
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? > ? ? ? ? --
>> > ? ? ? ? ? ? ? ? > ? ? ? ? Linux-cluster mailing list
>> > ? ? ? ? ? ? ? ? > ? ? ? ? Linux-cluster at redhat.com
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? https://www.redhat.com/mailman/listinfo/linux-cluster
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? > --
>> > ? ? ? ? ? ? ? ? > Linux-cluster mailing list
>> > ? ? ? ? ? ? ? ? > Linux-cluster at redhat.com
>> > ? ? ? ? ? ? ? ? >
>> > ? ? ? ? ? ? ? ? https://www.redhat.com/mailman/listinfo/linux-cluster
>> >
>> > ? ? ? ? ? ? ? ? Please, check this link:
>> >
>> >
>> > https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
>> >
>> > ? ? ? ? ? ? ? ? I found exactly the same problem as you, and i
>> > ? ? ? ? ? ? ? ? developed the
>> > ? ? ? ? ? ? ? ? "lvm-cluster.sh" script to solve the needs I had. You
>> > ? ? ? ? ? ? ? ? can find the
>> > ? ? ? ? ? ? ? ? script on the last message of the thread.
>> >
>> > ? ? ? ? ? ? ? ? I submitted it to make it part of the main project,
>> > ? ? ? ? ? ? ? ? but i have no news
>> > ? ? ? ? ? ? ? ? about that yet.
>> >
>> > ? ? ? ? ? ? ? ? I hope this helps.
>> >
>> > ? ? ? ? ? ? ? ? Cheers,
>> >
>> > ? ? ? ? ? ? ? ? Rafael
>> >
>> > ? ? ? ? ? ? ? ? --
>> > ? ? ? ? ? ? ? ? Rafael Mic? Miranda
>> >
>> > ? ? ? ? ? ? ? ? --
>> > ? ? ? ? ? ? ? ? Linux-cluster mailing list
>> > ? ? ? ? ? ? ? ? Linux-cluster at redhat.com
>> > ? ? ? ? ? ? ? ? https://www.redhat.com/mailman/listinfo/linux-cluster
>> >
>> >
>> > --
>> > Linux-cluster mailing list
>> > Linux-cluster at redhat.com
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>> --
>> Rafael Mic? Miranda
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From brem.belguebli at gmail.com  Tue Jul 28 14:07:10 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Tue, 28 Jul 2009 16:07:10 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <1cafab770907280230t7922d166rde506bb84abc5905@mail.gmail.com>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
	<1248188185.6464.2.camel@mecatol>
	<29ae894c0907210824m751038fak861f5f9d2a8ca76b@mail.gmail.com>
	<29ae894c0907271202y6b34afe1w3bfa53ab6a04c291@mail.gmail.com>
	<1248728355.7027.3.camel@mecatol>
	<29ae894c0907271426l251a7b92gcb402894e3742258@mail.gmail.com>
	<1cafab770907280230t7922d166rde506bb84abc5905@mail.gmail.com>
Message-ID: <29ae894c0907280707y124632daj6b18c82a5861c38@mail.gmail.com>

Hi,

>From my understanding, any change request (lv, vg, pv...) should be blocked
as long as a lock is held by another alive node in the cluster.

I mean the "exclusive flag" was here at the origin to address this.

I think there is a sort of misconcept in LVM2 may be due to the fact that
for many people assume a cluster is necessarly a "share-everything"
 infrastructure (a la VMS).

This is the good approach for clustering file servers (NFS, CIFS), web
servers, etc... where one would take advantage of load balancing user
sessions accross multiple nodes.

The other need for clustering is to run "mono instance" databases that are
not by concept (except Oracle RAC) designed to run accross multiple nodes.

In this case, only one node is holding the database instance with all of its
storage, and putting it on a "share everything" cluster based on a Cluster
filesystem (GFS) would imply :

     - performance penalty: each storage IO having to request the FS lock
manager to be executed
         --> managing the locks at a lower level (VG and LV) wouldn't imply
so as one the VG is exclusively activated no other node can remove the lock,
every IO being done on a regular FS (ext3 for instance) without having to
manage any kind of locks
     - Security issue :  To bypass this lock mechanism, one should start the
clustered FS without locks, but this will certainly lead to FS corruption.

Brem


2009/7/28, Xinwei Hu <hxinwei at gmail.com>:

> Hi Brem,
>
> I guess the cause of the problem is using 'lvchange -ay'.
>
> Clvmd actually does lock conversion underlying. So when you tried
> 'lvchange -ay',
> the exclusive lock will be converted to an non-exclusive lock. And that
> means
> the second try will success anyway.
>
> 2009/7/28 brem belguebli <brem.belguebli at gmail.com>:
> > Hi Rafael,
> > On RHEL 5.3, locks thru DLM aren't reliable all the time, I've been able
> to
> > activate a VG on a node of the cluster even if it was already activated
> > exclusively on another node.
> > Also, I've been able to activate a LV on a node not holding the exclusive
> > lock on the VG by typing 2 times lvchange -a y /dev/VGX/lvoly .
> > First try lvchange tells you there is a lock, second try activates it.
> > Checking dlm debug (mount -t debugfs debug /debug) /debug/dlm/clvmd_locks
> > gives a lock for nodeid 0 ....
> > Brem
> >
> > 2009/7/27 Rafael Mic? Miranda <rmicmirregs at gmail.com>
> >>
> >> Hi Brem
> >>
> >> So, does it work successfully? I made some testing before I submitted it
> >> to the list and AFAIK i found no errors.
> >>
> >> What do you mean exactly with "some CLVM strange behaviours"? Could you
> >> be more specific?
> >>
> >> I'm not subscribed to linux-lvm, please keep us informed through this
> >> list.
> >>
> >> Thanks in advance. Cheers,
> >>
> >> Rafael
> >>
> >> El lun, 27-07-2009 a las 21:02 +0200, brem belguebli escribi?:
> >> > Hi Rafael,
> >> >
> >> > It works fine, well at least when not hiting some CLVM strange
> >> > behaviours, that I'm able to replay by hand, so your script is
> >> > allright.
> >> >
> >> > I'll post to linux-lvm what I could see.
> >> >
> >> > Brem
> >> >
> >> >
> >> > 2009/7/21, brem belguebli <brem.belguebli at gmail.com>:
> >> >         Hola Rafael,
> >> >
> >> >         Thanks a lot, that'll avoid me going from scratch.
> >> >
> >> >         I'll have a look at them and keep you updated.
> >> >
> >> >         Brem
> >> >
> >> >
> >> >
> >> >         2009/7/21, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
> >> >                 Hi Brem,
> >> >
> >> >                 El mar, 21-07-2009 a las 16:40 +0200, brem belguebli
> >> >                 escribi?:
> >> >                 > Hi,
> >> >                 >
> >> >                 > That's what I 'm trying to do.
> >> >                 >
> >> >                 > If you mean lvm.sh, well, I've been playing with it,
> >> >                 but it does some
> >> >                 > "sanity" checks that are wierd
> >> >                 >      1. It expects HA LVM to be setup (why such
> >> >                 check if we want to
> >> >                 >         use CLVM).
> >> >                 >      2. it exits if it finds a CLVM VG  (kind of
> >> >                 funny !)
> >> >                 >      3. it exits if the lvm.conf is newer
> >> >                 than /boot/*.img (about this
> >> >                 >         one, we tend to prevent the cluster from
> >> >                 automatically
> >> >                 >         starting ...)
> >> >                 > I was looking to find some doc on how to write my
> >> >                 own resources, ie
> >> >                 > CLVM resource that checks if the vg is clustered, if
> >> >                 so by which node
> >> >                 > is it exclusively held, and if the node is down to
> >> >                 activate
> >> >                 > exclusively the VG.
> >> >                 >
> >> >                 > If you have some good links to provide me, that'll
> >> >                 be great.
> >> >                 >
> >> >                 > Thanks
> >> >                 >
> >> >                 >
> >> >                 > 2009/7/21, Christine Caulfield
> >> >                 <ccaulfie at redhat.com>:
> >> >                 >         On 07/21/2009 01:11 PM, brem belguebli
> >> >                 wrote:
> >> >                 >                 Hi,
> >> >                 >                 When creating the VG by default
> >> >                 clustered, you
> >> >                 >                 implicitely assume that
> >> >                 >                 it will be used with a clustered FS
> >> >                 on top of it (gfs,
> >> >                 >                 ocfs, etc...)
> >> >                 >                 that will handle the active/active
> >> >                 mode.
> >> >                 >                 As I do not intend to use GFS in
> >> >                 this particular case,
> >> >                 >                 but ext3 and raw
> >> >                 >                 devices, I need to make sure the vg
> >> >                 is exclusively
> >> >                 >                 activated on one
> >> >                 >                 node, preventing the other nodes to
> >> >                 access it unless
> >> >                 >                 it is the failover
> >> >                 >                 procedure (node holding the VG
> >> >                 crashed) and then re
> >> >                 >                 activate it
> >> >                 >                 exclusively on the failover node.
> >> >                 >                 Thanks
> >> >                 >
> >> >                 >
> >> >                 >         In that case you probably ought to be using
> >> >                 rgmanager to do
> >> >                 >         the failover for you. It has a script for
> >> >                 doing exactly
> >> >                 >         this :-)
> >> >                 >
> >> >                 >         Chrissie
> >> >                 >
> >> >                 >
> >> >                 >         --
> >> >                 >         Linux-cluster mailing list
> >> >                 >         Linux-cluster at redhat.com
> >> >                 >
> >> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >> >                 >
> >> >                 >
> >> >                 > --
> >> >                 > Linux-cluster mailing list
> >> >                 > Linux-cluster at redhat.com
> >> >                 >
> >> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >> >
> >> >                 Please, check this link:
> >> >
> >> >
> >> > https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
> >> >
> >> >                 I found exactly the same problem as you, and i
> >> >                 developed the
> >> >                 "lvm-cluster.sh" script to solve the needs I had. You
> >> >                 can find the
> >> >                 script on the last message of the thread.
> >> >
> >> >                 I submitted it to make it part of the main project,
> >> >                 but i have no news
> >> >                 about that yet.
> >> >
> >> >                 I hope this helps.
> >> >
> >> >                 Cheers,
> >> >
> >> >                 Rafael
> >> >
> >> >                 --
> >> >                 Rafael Mic? Miranda
> >> >
> >> >                 --
> >> >                 Linux-cluster mailing list
> >> >                 Linux-cluster at redhat.com
> >> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >> >
> >> >
> >> > --
> >> > Linux-cluster mailing list
> >> > Linux-cluster at redhat.com
> >> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >> --
> >> Rafael Mic? Miranda
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090728/5005c1ea/attachment.htm>

From muruganlnx at gmail.com  Tue Jul 28 14:44:53 2009
From: muruganlnx at gmail.com (Murugan P)
Date: Tue, 28 Jul 2009 20:14:53 +0530
Subject: [Linux-cluster] RHCS with nfs
Message-ID: <52868b3e0907280744n237b59a4yefe84bda2202184e@mail.gmail.com>

HI,

I have configured the cluster with nfs export service & 2 node on centos
5.3.

While starting the service, am getting as follow:

Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... failed
/usr/sbin/cman_tool: Node is already active
                                                           [FAILED]
Starting Cluster Service Manager:                          [  OK  ]
Starting portmap:                                          [  OK  ]
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]


CLUSTAT:
*************[root at test1 ~]# clustat
Cluster Status for gfs2_cluster @ Tue Jul 28 21:32:56 2009
Member Status: Inquorate

Resource Group Manager not running; no service information available.

Membership information not available


Kindly let us know,  what are the things needs to be tested after
configuring RHCS with nfs exports.

Thanks
P. Murugan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090728/2f3ff0b3/attachment.htm>

From wendell at bisonline.com  Tue Jul 28 15:34:04 2009
From: wendell at bisonline.com (Wendell Dingus)
Date: Tue, 28 Jul 2009 11:34:04 -0400 (EDT)
Subject: [Linux-cluster] GFS2 corruption/withdrawal/crash
In-Reply-To: <7226923.3051248795216312.JavaMail.wendell@philco>
Message-ID: <20227042.3071248795238944.JavaMail.wendell@philco>

> I'd be very interested to know about the circumstances which led up to 
> this file getting into that state. Was it always a zero byte file, or 
> has it been truncated at some stage from some larger size? Was there a 
> prior fs crash at some time, or has the fs been otherwise reliable since 
> mkfs time? 
> 
> Anything that you can tell us about the history of this file would be 
> very interesting to know. 

Sure... This particular file was one that is recreated each night. It's the
result of a find /path -print >file that starts back at zero bytes each
night and ends up with a list of filenames found in a particular directory.
It is ~150MB in size currently.

> Ideally it wouldn't crash. In reality there are cases where what we need 
> to do in order to recover from an error gracefully cannot be done in the 
> context in which the error has occurred. The context in this case 
> usually means the locks which are being held at the time. There is some 
> ongoing work to try and improve on this, particularly wrt to corrupt 
> on-disk structures. In some cases we can now just return -EIO to the 
> user and carry on rather than withdrawing from the cluster. 
> 
> The interesting thing in this case is that if the file is zero length, 
> it shouldn't have any indirect blocks at all, so it looks like the inode 
> height might have become corrupt. If you are able to save the metadata 
> from this fs, then that is something which we would find very helpful to 
> have a look at, 
> 
> Steve. 

This equipment was far less stable initially than I'd have liked. Except for
this oops we appear to be in pretty good shape now though. Physical hard 
power cycles happened many times during our initial setup. Which, while I'm 
on the subject, might be valuable to explain here along with what we did to
stabilize things. 

Three servers, a few large GFS2 filesystems, Xen kernels, CLVMD, RHCS 
controlling a bunch of VMs. We were having lots of problems with the cluster
becoming inquorate and nodes being fenced every time a non-member node booted
and joined. 

We figured out that as a machine booted and started networking, started cman
and related components, and started xend there were more pauses than there 
should be and the delay was long enough to trip RHCS into thinking a node
had died. So we did this, we renamed /etc/rd3.d/S98xend to S17xend. So that
xend would fire up and do the NIC moving/renaming before cluster suite.

Thanks.

From wendell at bisonline.com  Tue Jul 28 15:39:17 2009
From: wendell at bisonline.com (Wendell Dingus)
Date: Tue, 28 Jul 2009 11:39:17 -0400 (EDT)
Subject: [Linux-cluster] Re: GFS2 corruption/withdrawal/crash (Bob Peterson)
In-Reply-To: <928464.3091248795510986.JavaMail.wendell@philco>
Message-ID: <8934043.3111248795554443.JavaMail.wendell@philco>

>Hi Wendell, 
> 
>The fsck.gfs2 program should fix this kind of error, but there 
>are some known fsck bugs. I've been working on a big fix for 
>fsck.gfs and fsck.gfs2 lately that solves many problems, and there 
>is a chance it will solve yours. If you don't mind, perhaps you 
>can send me a copy of your file system metadata, and I will 
>run my latest-and-greatest against it to see if it will detect 
>and fix the problem. Even if it doesn't, perhaps I can even adjust 
>my patch to fix your file system. 
> 
>Regards, 
> 
>Bob Peterson 
>Red Hat File Systems 


Absolutely.. We're doing a gfs2_edit savemeta right now and at 22%
it's already over 4.5GB in size. So how can I get that to you when
it completes? 

Thanks!

From rpeterso at redhat.com  Tue Jul 28 16:07:39 2009
From: rpeterso at redhat.com (Bob Peterson)
Date: Tue, 28 Jul 2009 12:07:39 -0400 (EDT)
Subject: [Linux-cluster] Re: GFS2 corruption/withdrawal/crash (Bob
	Peterson)
In-Reply-To: <1444520702.1003321248797195510.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
Message-ID: <2040479433.1003341248797259502.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>

----- "Wendell Dingus" <wendell at bisonline.com> wrote:
| Absolutely.. We're doing a gfs2_edit savemeta right now and at 22%
| it's already over 4.5GB in size. So how can I get that to you when
| it completes? 
| 
| Thanks!

Hi Wendell,

What would probably work best is if you use bzip2 to compress the
metadata then upload it to some web or ftp server for me to download.
There are a couple free ones out there.

Regards,

Bob Peterson
Red Hat File Systems


From pradhanparas at gmail.com  Tue Jul 28 17:54:25 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 28 Jul 2009 12:54:25 -0500
Subject: [Linux-cluster] Snapshot and backups
Message-ID: <8b711df40907281054l73ac65d1mc9c85c20a23f789e@mail.gmail.com>

hi,

I have a shared GFS partition that holds Xen virtual machines (ex:
/mygfs/guest1.img, /myfgs/guest2.img) . I need to backup my virtual
machines regularly. I can not do snapshot since clustered snapshots
are not yet supported. So what is the recommended way to backup my
virtual machines if someone is using GFS.

Thanks
Paras.


From jeff.sturm at eprize.com  Tue Jul 28 19:55:12 2009
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Tue, 28 Jul 2009 15:55:12 -0400
Subject: [Linux-cluster] Snapshot and backups
In-Reply-To: <8b711df40907281054l73ac65d1mc9c85c20a23f789e@mail.gmail.com>
References: <8b711df40907281054l73ac65d1mc9c85c20a23f789e@mail.gmail.com>
Message-ID: <64D0546C5EBBD147B75DE133D798665F02FDC59F@hugo.eprize.local>

We've set up things a little differently than you.  Our domU images are
CLVM partitions, e.g. /dev/cvg/guest1, /dev/cvg/guest2, not image files
on a GFS filesystem.

This design will allow us to take advantage of single-host snapshots,
once they are supported.

Bear in mind the ramifications of taking dirty snapshots of a domU host.
At any point in time the domU has unflushed buffers.  If you do snapshot
these you'll likely have to fsck the snapshot filesystem before it is
usable.

Many Xen users probably stick with traditional backup solutions--if you
want to know more you may want to ask on the xen-users list.

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Paras pradhan
> Sent: Tuesday, July 28, 2009 1:54 PM
> To: linux clustering
> Subject: [Linux-cluster] Snapshot and backups
> 
> hi,
> 
> I have a shared GFS partition that holds Xen virtual machines (ex:
> /mygfs/guest1.img, /myfgs/guest2.img) . I need to backup my virtual
> machines regularly. I can not do snapshot since clustered snapshots
> are not yet supported. So what is the recommended way to backup my
> virtual machines if someone is using GFS.
> 
> Thanks
> Paras.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From pradhanparas at gmail.com  Tue Jul 28 22:01:10 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 28 Jul 2009 17:01:10 -0500
Subject: [Linux-cluster] Hardware recommendation
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F02FDC534@hugo.eprize.local>
References: <64D0546C5EBBD147B75DE133D798665F02FDC531@hugo.eprize.local>
	<257968.54862.qm@web110211.mail.gq1.yahoo.com>
	<64D0546C5EBBD147B75DE133D798665F02FDC534@hugo.eprize.local>
Message-ID: <8b711df40907281501o4c0271e4u10766172ef7a0a12@mail.gmail.com>

On Sat, Jul 25, 2009 at 5:36 PM, Jeff Sturm<jeff.sturm at eprize.com> wrote:
> Good point?the requisite CPU technology has been available for at least 3
> years, so I tend to assume most users are running 64-bit processors with VT
> by now.? However, I suppose it's possible there are many older systems still
> in production.
>
>
>
> We happen to use several 1950-class machines which have been repurposed for
> this.? They work very well with Xen hypervisors.? The 1950 can be
> inexpensively upgraded to 16GB RAM, and have two sockets which can
> accommodate 8 cores.? With shared storage, the 1U chassis gives very good
> density where space is a premium.? If 1950's are available, they can be a
> bargain for virtualization.
>
>
>
> And, as you say, VT is required for some deployments.? Though I'd like to
> think most non-VT capable hardware will be long gone by the time RHEL5
> reaches EOL.
>
>
>
> Power consumption should be a consideration too.? It might be worth retiring
> some systems to save on power alone.? I don't have good numbers handy
> comparing power consumption vs. performance of different processor
> generations?perhaps someone else does.
>
>
>
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Doug Bunger
> Sent: Saturday, July 25, 2009 12:39 PM
> To: linux clustering
>
> Subject: RE: [Linux-cluster] Hardware recommendation
>
>
>
> If you remission in house systems or (gag) get used systems, than CPU
> ***IS*** important.? You need to make sure you are running x86_64 with CPU
> virtualization technology (VT) extensions.? You'll need x86_64 to get past
> the i386 Xen 32meg memory limit, and you'll need VT to provide an update
> path to RHEL6, vSphere, or even Micro$oft Hyper-V.
>
> Many of my customers are running older equipment and are not wanting to
> upgrade.? We are finding that in an HP shop, anything older than DL380 G5
> won't cut it.? On the Dell side it seems the 1950 is the break point.
>
> --- On Fri, 7/24/09, Jeff Sturm <jeff.sturm at eprize.com> wrote:
>
> From: Jeff Sturm <jeff.sturm at eprize.com>
> Subject: RE: [Linux-cluster] Hardware recommendation
> To: "linux clustering" <linux-cluster at redhat.com>
> Date: Friday, July 24, 2009, 6:21 PM
>
> Don't go small.? If you're running 30-60 virtual machines you'll likely
> need systems with at least 8 cores at 32GB of RAM, or more depending on
> the needs of your virtual hosts.? There's really no way to guess at how
> much RAM you'll need without understanding the needs of your virtual
> hosts, however.
>
> Dell PowerEdge hardware should work well for this.? We have some old
> 6950 systems we are now using for Xen.? These appear to have been
> replaced by the R905 series, according to Dell's web site.
>
> CPU isn't important.? All modern Intel (Core) and AMD processors work
> well with Xen.? To save money I tend to opt for more cores at slower
> speeds rather than buy the fastest speed processors available.
>
> What are you using for storage?? If you want node failover you'll likely
> want some sort of SAN for central storage.
>
> I also recommend at least two network switches for redundancy, else
> you'll regret it the first time you need to reboot a switch without
> bringing down your cluster.
>
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com]
>> On Behalf Of Paras pradhan
>> Sent: Friday, July 24, 2009 6:55 PM
>> To: linux clustering
>> Subject: [Linux-cluster] Hardware recommendation
>>
>> hi,
>>
>> I will be creating a cluster using red hat EL 5.3 consisting of 3
>> nodes which will run Xen virtual machines. The first two nodes will
>> host around 30 virtual machines ( all paravirt Linux and some Solaris
>> if possible) . The third node will be basically a fail over node and
>> will host virtual machines if there is a problem at node 1 or node 2.
>> I do not know if this is the right way to ask but I need
>> recommendation on which CPU to choose (i7, phenom, quad core or dual
>> core) and how much RAM do I need on node 1 and node 2 and on also node
>> 3. And also which server hardware is recommended for my purpose. My
>> test cluster is based on Dell Poweredge 1800 machines with DRAC 4
>> (which I am using for fencing) and looks like DELL Poweregde server
>> can be a good candidate.
>>
>> Any help is highly appreciated.
>> Thanks
>> Paras.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

Guys ! thanks. R905 is a good candidate.

Paras.


From consult.itlinux at gmail.com  Wed Jul 29 00:16:09 2009
From: consult.itlinux at gmail.com (Ahmed Taha)
Date: Tue, 28 Jul 2009 19:16:09 -0500
Subject: [Linux-cluster] Hardware recommendation
In-Reply-To: <8b711df40907241554w70238fc4m2c08d661f78ea4e6@mail.gmail.com>
References: <8b711df40907241554w70238fc4m2c08d661f78ea4e6@mail.gmail.com>
Message-ID: <d8befd80907281716l35f46834xefe827a5266c335b@mail.gmail.com>

Hi,

Dell PowerEdge is a good choice. I would go for the PowerEdge 2850 if I
could.  For such a number of virtual machines (30), you will definitely need
not only high RAM but CPU power as well. First, let us consider the memory.
Each RHEL5 guest instance needs at least 512M (eventhough I have tested few
instances and I was able, with some tuning, to make it work down to 384M,
but definitely not 256M). The same min of 512M for Solaris10 will be
essential. Now for 30 VMs, you are talking about 15G of physical memory (not
including page cache and buffers).

For CPUs, theoretically, performance hit starts when the number of VCPUs
goes above the number of physical CPUs, but the exact nonlinear behavior has
a lot of debate about it. What I have actually tested is two AMD64 dual-core
(4 cpu's) while I was able to build up 18 VM's with one VCPU per each,
however, performance was very bad. I repeated the experiment on just one
Intel dual-core processor (2 cpu's) and I was able to build up to 10 VM's.
But once again, I could not run except 4 or 5 with a production-level
performance. So, you may scale accordingly for these 30 machines. I would go
for 4 dual-core or two quad-core at the least.

Don't forget dom0 may be picky about how much memory it gets, specially if
you will run X. Check the forums for further discussions if you encountered
X issues when xend starts.

Hope this helps,


On Fri, Jul 24, 2009 at 5:54 PM, Paras pradhan <pradhanparas at gmail.com>wrote:

> hi,
>
> I will be creating a cluster using red hat EL 5.3 consisting of 3
> nodes which will run Xen virtual machines. The first two nodes will
> host around 30 virtual machines ( all paravirt Linux and some Solaris
> if possible) . The third node will be basically a fail over node and
> will host virtual machines if there is a problem at node 1 or node 2.
> I do not know if this is the right way to ask but I need
> recommendation on which CPU to choose (i7, phenom, quad core or dual
> core) and how much RAM do I need on node 1 and node 2 and on also node
> 3. And also which server hardware is recommended for my purpose. My
> test cluster is based on Dell Poweredge 1800 machines with DRAC 4
> (which I am using for fencing) and looks like DELL Poweregde server
> can be a good candidate.
>
> Any help is highly appreciated.
> Thanks
> Paras.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


-- 
Ahmed G. Taha
Linux Systems Consultant RHCE &
Operating System Programmer
(Cell): (573) 289-7300
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090728/94893825/attachment.htm>

From pradhanparas at gmail.com  Wed Jul 29 04:13:03 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 28 Jul 2009 23:13:03 -0500
Subject: [Linux-cluster] Hardware recommendation
In-Reply-To: <d8befd80907281716l35f46834xefe827a5266c335b@mail.gmail.com>
References: <8b711df40907241554w70238fc4m2c08d661f78ea4e6@mail.gmail.com>
	<d8befd80907281716l35f46834xefe827a5266c335b@mail.gmail.com>
Message-ID: <8b711df40907282113k740480bet96d3ff05ea8c3da9@mail.gmail.com>

Thanks Ahmed for your reply.

On Tue, Jul 28, 2009 at 7:16 PM, Ahmed Taha<consult.itlinux at gmail.com> wrote:
> Hi,
>
> Dell PowerEdge is a good choice. I would go for the PowerEdge 2850 if I
> could.? For such a number of virtual machines (30), you will definitely need
> not only high RAM but CPU power as well. First, let us consider the memory.
> Each RHEL5 guest instance needs at least 512M (eventhough I have tested few
> instances and I was able, with some tuning, to make it work down to 384M,
> but definitely not 256M). The same min of 512M for Solaris10 will be
> essential. Now for 30 VMs, you are talking about 15G of physical memory (not
> including page cache and buffers).
>
Are you talking about fully virt VMs or Para Virt in this case? WIth
Para Virt the allocation is dynamic. Isn't it?


> For CPUs, theoretically, performance hit starts when the number of VCPUs
> goes above the number of physical CPUs, but the exact nonlinear behavior has
> a lot of debate about it. What I have actually tested is two AMD64 dual-core
> (4 cpu's) while I was able to build up 18 VM's with one VCPU per each,
> however, performance was very bad. I repeated the experiment on just one
> Intel dual-core processor (2 cpu's) and I was able to build up to 10 VM's.
> But once again, I could not run except 4 or 5 with a production-level
> performance. So, you may scale accordingly for these 30 machines. I would go
> for 4 dual-core or two quad-core at the least.

If I understand correctly, to get a production level performance even
with 2 quad core, I can run only around 8 virtual machines?

>
> Don't forget dom0 may be picky about how much memory it gets, specially if
> you will run X. Check the forums for further discussions if you encountered
> X issues when xend starts.
>
> Hope this helps,
>

Thanks !
Paras.

>
>
> On Fri, Jul 24, 2009 at 5:54 PM, Paras pradhan <pradhanparas at gmail.com>
> wrote:
>>
>> hi,
>>
>> I will be creating a cluster using red hat EL 5.3 consisting of 3
>> nodes which will run Xen virtual machines. The first two nodes will
>> host around 30 virtual machines ( all paravirt Linux and some Solaris
>> if possible) . The third node will be basically a fail over node and
>> will host virtual machines if there is a problem at node 1 or node 2.
>> I do not know if this is the right way to ask but I need
>> recommendation on which CPU to choose (i7, phenom, quad core or dual
>> core) and how much RAM do I need on node 1 and node 2 and on also node
>> 3. And also which server hardware is recommended for my purpose. My
>> test cluster is based on Dell Poweredge 1800 machines with DRAC 4
>> (which I am using for fencing) and looks like DELL Poweregde server
>> can be a good candidate.
>>
>> Any help is highly appreciated.
>> Thanks
>> Paras.
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> --
> Ahmed G. Taha
> Linux Systems Consultant RHCE &
> Operating System Programmer
> (Cell): (573) 289-7300
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From gianluca.cecchi at gmail.com  Wed Jul 29 13:12:38 2009
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Wed, 29 Jul 2009 15:12:38 +0200
Subject: [Linux-cluster] Cluster 3.0.0 final stable release
Message-ID: <561c252c0907290612q4d0210dfmeba2da9432f931f8@mail.gmail.com>

On Wed, 08 Jul 2009 23:10:12 +0200 Fabio M. Di Nitto wrote:
> The cluster team and its community are proud to announce the 3.0.0 final
> release from the STABLE3 branch.

hello,
where to find docs for 3.0.0 final?
Thanks in advance,
Gianluca


From brem.belguebli at gmail.com  Wed Jul 29 13:55:06 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Wed, 29 Jul 2009 15:55:06 +0200
Subject: [Linux-cluster] Fwd: CLVM exclusive mode
In-Reply-To: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>
References: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>
Message-ID: <29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>

Hello,

I've been playing on RHEL 5.3 with CLVM and exclusive activation but the
results I'm getting are not what I'm expecting.

- My cluster is a freshly 2 nodes (node1 and node2) installed cluster with
the packages shipped by RHEL 5.3 X86_64.

- LVM locking type =  3

- a San LUN (/dev/mpath/mpath2) visible from both nodes

- dlm used as lock_manager

Everything starts normally from cman to clvmd.

Below what I'm doing

On node1:

 pvcreate /dev/mpath/mpath2
 vgcreate -c n vg11 /dev/mpath/mpath2

! nothing in /debug/dlm/clvmd_locks on both nodes

  vgchange -a n vg11

! nothing in /debug/dlm/clvmd_locks on both nodes

  vgchange -c y vg11

! nothing in /debug/dlm/clvmd_locks on both nodes, vg seen on both nodes as
clustered.

  vgchange -a ey vg11

! nothing in /debug/dlm/clvmd_locks on both nodes

 lvcreate -n lvol1 -L 6G /dev/vg11

On node1 cat /debug/dlm/clvmd_locks gives:

  6f0001 2 3da0001 2204 0 1 10001 2 1 -1 0 0 64
"iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
  38a0001 0 0 434 0 1 1 2 1 -1 0 0 64
"iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"

On node2:

  3da0001 1 6f0001 2204 0 1 1 2 1 -1 0 1 64
"iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"

Is there something I'm doing wrong or misunderstand?
I understand that node1 (which actually activated exclusively the vg) sees a
lock on /dev/vg11/lvol1 (uuid corresponding to it) from node id 2 wich is
node2
plus a lock from node id 0 (which seems to be the quorun disk id which is
not configured in my case).

Plus, node2 seems to see the right lock from node1.

I go on:

on both nodes, lvdisplay -v /dev/vg11/lvol1 gives:

...
 LV UUID                r3Xrp1-prEG-ceCk-A2dh-SA2E-NWoc-unEfdf
  LV Write Access        read/write
  LV Status              available
...

Shouldn't it be seen NOT available on node2 ?

Now, on node2:

vgchange -a y vg11 :

 1 logical volume(s) in volume group "vg11" now active <-- vg was supposed
to be held exlusively by node1

cat  /debug/dlm/clvmd_locks gives:

3da0001 1 6f0001 2204 0 1 1 2 1 -1 0 1 64
"iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"

on node1:

6f0001 2 3da0001 2204 0 1 10001 2 1 -1 0 0 64
"iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
38a0001 0 0 434 0 1 1 2 1 -1 0 0 64
"iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"


I may be missing something in my procedure that makes it do everything
except what I'm expecting.

Any ideas ?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090729/f724aac1/attachment.htm>

From brem.belguebli at gmail.com  Wed Jul 29 14:00:58 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Wed, 29 Jul 2009 16:00:58 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <1248728355.7027.3.camel@mecatol>
References: <29ae894c0907210221y395ab4bew95f78769a84eb141@mail.gmail.com>
	<29ae894c0907210416l523d8cf4t71f9b46adfbd4adf@mail.gmail.com>
	<4A65AB22.1030601@redhat.com>
	<29ae894c0907210511o28fa6a80j2b403d2ae6866d79@mail.gmail.com>
	<4A65C16C.20104@redhat.com>
	<29ae894c0907210740n35ce71d6r2c3ba7478118c6a6@mail.gmail.com>
	<1248188185.6464.2.camel@mecatol>
	<29ae894c0907210824m751038fak861f5f9d2a8ca76b@mail.gmail.com>
	<29ae894c0907271202y6b34afe1w3bfa53ab6a04c291@mail.gmail.com>
	<1248728355.7027.3.camel@mecatol>
Message-ID: <29ae894c0907290700x5eaf0c42oadd8cab7a4c334be@mail.gmail.com>

Hi Rafael,

Just posted the basic tests I'm doing on both linux-cluster and linux-lvm.

I can't get exclusive activation to work properly, I may be missing some
step in my process.

Brem


2009/7/27, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
>
> Hi Brem
>
> So, does it work successfully? I made some testing before I submitted it
> to the list and AFAIK i found no errors.
>
> What do you mean exactly with "some CLVM strange behaviours"? Could you
> be more specific?
>
> I'm not subscribed to linux-lvm, please keep us informed through this
> list.
>
> Thanks in advance. Cheers,
>
> Rafael
>
> El lun, 27-07-2009 a las 21:02 +0200, brem belguebli escribi?:
> > Hi Rafael,
> >
> > It works fine, well at least when not hiting some CLVM strange
> > behaviours, that I'm able to replay by hand, so your script is
> > allright.
> >
> > I'll post to linux-lvm what I could see.
> >
> > Brem
> >
> >
> > 2009/7/21, brem belguebli <brem.belguebli at gmail.com>:
> >         Hola Rafael,
> >
> >         Thanks a lot, that'll avoid me going from scratch.
> >
> >         I'll have a look at them and keep you updated.
> >
> >         Brem
> >
> >
> >
> >         2009/7/21, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
> >                 Hi Brem,
> >
> >                 El mar, 21-07-2009 a las 16:40 +0200, brem belguebli
> >                 escribi?:
> >                 > Hi,
> >                 >
> >                 > That's what I 'm trying to do.
> >                 >
> >                 > If you mean lvm.sh, well, I've been playing with it,
> >                 but it does some
> >                 > "sanity" checks that are wierd
> >                 >      1. It expects HA LVM to be setup (why such
> >                 check if we want to
> >                 >         use CLVM).
> >                 >      2. it exits if it finds a CLVM VG  (kind of
> >                 funny !)
> >                 >      3. it exits if the lvm.conf is newer
> >                 than /boot/*.img (about this
> >                 >         one, we tend to prevent the cluster from
> >                 automatically
> >                 >         starting ...)
> >                 > I was looking to find some doc on how to write my
> >                 own resources, ie
> >                 > CLVM resource that checks if the vg is clustered, if
> >                 so by which node
> >                 > is it exclusively held, and if the node is down to
> >                 activate
> >                 > exclusively the VG.
> >                 >
> >                 > If you have some good links to provide me, that'll
> >                 be great.
> >                 >
> >                 > Thanks
> >                 >
> >                 >
> >                 > 2009/7/21, Christine Caulfield
> >                 <ccaulfie at redhat.com>:
> >                 >         On 07/21/2009 01:11 PM, brem belguebli
> >                 wrote:
> >                 >                 Hi,
> >                 >                 When creating the VG by default
> >                 clustered, you
> >                 >                 implicitely assume that
> >                 >                 it will be used with a clustered FS
> >                 on top of it (gfs,
> >                 >                 ocfs, etc...)
> >                 >                 that will handle the active/active
> >                 mode.
> >                 >                 As I do not intend to use GFS in
> >                 this particular case,
> >                 >                 but ext3 and raw
> >                 >                 devices, I need to make sure the vg
> >                 is exclusively
> >                 >                 activated on one
> >                 >                 node, preventing the other nodes to
> >                 access it unless
> >                 >                 it is the failover
> >                 >                 procedure (node holding the VG
> >                 crashed) and then re
> >                 >                 activate it
> >                 >                 exclusively on the failover node.
> >                 >                 Thanks
> >                 >
> >                 >
> >                 >         In that case you probably ought to be using
> >                 rgmanager to do
> >                 >         the failover for you. It has a script for
> >                 doing exactly
> >                 >         this :-)
> >                 >
> >                 >         Chrissie
> >                 >
> >                 >
> >                 >         --
> >                 >         Linux-cluster mailing list
> >                 >         Linux-cluster at redhat.com
> >                 >
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >                 >
> >                 >
> >                 > --
> >                 > Linux-cluster mailing list
> >                 > Linux-cluster at redhat.com
> >                 >
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >                 Please, check this link:
> >
> >
> https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
> >
> >                 I found exactly the same problem as you, and i
> >                 developed the
> >                 "lvm-cluster.sh" script to solve the needs I had. You
> >                 can find the
> >                 script on the last message of the thread.
> >
> >                 I submitted it to make it part of the main project,
> >                 but i have no news
> >                 about that yet.
> >
> >                 I hope this helps.
> >
> >                 Cheers,
> >
> >                 Rafael
> >
> >                 --
> >                 Rafael Mic? Miranda
> >
> >                 --
> >                 Linux-cluster mailing list
> >                 Linux-cluster at redhat.com
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> --
> Rafael Mic? Miranda
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090729/e2117b76/attachment.htm>

From ccaulfie at redhat.com  Wed Jul 29 14:15:21 2009
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Wed, 29 Jul 2009 15:15:21 +0100
Subject: [Linux-cluster] Fwd: CLVM exclusive mode
In-Reply-To: <29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>
References: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>
	<29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>
Message-ID: <4A705979.1040605@redhat.com>

I think you've misunderstood what vgchange -aey does,

It activates all the currently existing LVs in that VG exclusively on 
that node. If you create another LV in that VG then it's activated 
normally on all nodes in the cluster.

Chrissie

On 29/07/09 14:55, brem belguebli wrote:
> Hello,
>
> I've been playing on RHEL 5.3 with CLVM and exclusive activation but the
> results I'm getting are not what I'm expecting.
>
> - My cluster is a freshly 2 nodes (node1 and node2) installed cluster
> with the packages shipped by RHEL 5.3 X86_64.
>
> - LVM locking type =  3
>
> - a San LUN (/dev/mpath/mpath2) visible from both nodes
>
> - dlm used as lock_manager
>
> Everything starts normally from cman to clvmd.
>
> Below what I'm doing
>
> On node1:
>
>   pvcreate /dev/mpath/mpath2
>   vgcreate -c n vg11 /dev/mpath/mpath2
>
> ! nothing in /debug/dlm/clvmd_locks on both nodes
>
>    vgchange -a n vg11
>
> ! nothing in /debug/dlm/clvmd_locks on both nodes
>
>    vgchange -c y vg11
>
> ! nothing in /debug/dlm/clvmd_locks on both nodes, vg seen on both nodes
> as clustered.
>
>    vgchange -a ey vg11
>
> ! nothing in /debug/dlm/clvmd_locks on both nodes
>
>   lvcreate -n lvol1 -L 6G /dev/vg11
>
> On node1 cat /debug/dlm/clvmd_locks gives:
>
>    6f0001 2 3da0001 2204 0 1 10001 2 1 -1 0 0 64
> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>    38a0001 0 0 434 0 1 1 2 1 -1 0 0 64
> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>
> On node2:
>
>    3da0001 1 6f0001 2204 0 1 1 2 1 -1 0 1 64
> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>
> Is there something I'm doing wrong or misunderstand?
> I understand that node1 (which actually activated exclusively the vg)
> sees a lock on /dev/vg11/lvol1 (uuid corresponding to it) from node id 2
> wich is node2
> plus a lock from node id 0 (which seems to be the quorun disk id which
> is not configured in my case).
>
> Plus, node2 seems to see the right lock from node1.
>
> I go on:
>
> on both nodes, lvdisplay -v /dev/vg11/lvol1 gives:
>
> ...
>   LV UUID                r3Xrp1-prEG-ceCk-A2dh-SA2E-NWoc-unEfdf
>    LV Write Access        read/write
>    LV Status              available
> ...
>
> Shouldn't it be seen NOT available on node2 ?
>
> Now, on node2:
>
> vgchange -a y vg11 :
>
>   1 logical volume(s) in volume group "vg11" now active <-- vg was
> supposed to be held exlusively by node1
>
> cat  /debug/dlm/clvmd_locks gives:
>
> 3da0001 1 6f0001 2204 0 1 1 2 1 -1 0 1 64
> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>
> on node1:
>
> 6f0001 2 3da0001 2204 0 1 10001 2 1 -1 0 0 64
> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
> 38a0001 0 0 434 0 1 1 2 1 -1 0 0 64
> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>
> I may be missing something in my procedure that makes it do everything
> except what I'm expecting.
>
> Any ideas ?
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From brem.belguebli at gmail.com  Wed Jul 29 16:28:11 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Wed, 29 Jul 2009 18:28:11 +0200
Subject: [Linux-cluster] Fwd: CLVM exclusive mode
In-Reply-To: <4A705979.1040605@redhat.com>
References: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>
	<29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>
	<4A705979.1040605@redhat.com>
Message-ID: <29ae894c0907290928k7928cac6l68a714ad21cc91dc@mail.gmail.com>

Hello Chrissie,

To make things clear, when I activate exclusively vg11 on node1, the
existing lv's are locked on node1, right ?

The things that confuse me about that are:

- clvmd_locks on node1 shows node id 0
- clvmd_locks on node2 shows nothing
- I'm able on node2 to bypass the exclusive lock by issuing (node2) vgchange
-a y vg11(or even vgchange -a ey vg11).

My understanding of exclusive flag is to forbid any change on another node
while lock holding node (node1) is alive and member of the cluster.

Is it where I'm wrong ?

Brem


2009/7/29, Christine Caulfield <ccaulfie at redhat.com>:
>
> I think you've misunderstood what vgchange -aey does,
>
> It activates all the currently existing LVs in that VG exclusively on that
> node. If you create another LV in that VG then it's activated normally on
> all nodes in the cluster.
>
> Chrissie
>
> On 29/07/09 14:55, brem belguebli wrote:
>
>> Hello,
>>
>> I've been playing on RHEL 5.3 with CLVM and exclusive activation but the
>> results I'm getting are not what I'm expecting.
>>
>> - My cluster is a freshly 2 nodes (node1 and node2) installed cluster
>> with the packages shipped by RHEL 5.3 X86_64.
>>
>> - LVM locking type =  3
>>
>> - a San LUN (/dev/mpath/mpath2) visible from both nodes
>>
>> - dlm used as lock_manager
>>
>> Everything starts normally from cman to clvmd.
>>
>> Below what I'm doing
>>
>> On node1:
>>
>>  pvcreate /dev/mpath/mpath2
>>  vgcreate -c n vg11 /dev/mpath/mpath2
>>
>> ! nothing in /debug/dlm/clvmd_locks on both nodes
>>
>>   vgchange -a n vg11
>>
>> ! nothing in /debug/dlm/clvmd_locks on both nodes
>>
>>   vgchange -c y vg11
>>
>> ! nothing in /debug/dlm/clvmd_locks on both nodes, vg seen on both nodes
>> as clustered.
>>
>>   vgchange -a ey vg11
>>
>> ! nothing in /debug/dlm/clvmd_locks on both nodes
>>
>>  lvcreate -n lvol1 -L 6G /dev/vg11
>>
>> On node1 cat /debug/dlm/clvmd_locks gives:
>>
>>   6f0001 2 3da0001 2204 0 1 10001 2 1 -1 0 0 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>>   38a0001 0 0 434 0 1 1 2 1 -1 0 0 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>>
>> On node2:
>>
>>   3da0001 1 6f0001 2204 0 1 1 2 1 -1 0 1 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>>
>> Is there something I'm doing wrong or misunderstand?
>> I understand that node1 (which actually activated exclusively the vg)
>> sees a lock on /dev/vg11/lvol1 (uuid corresponding to it) from node id 2
>> wich is node2
>> plus a lock from node id 0 (which seems to be the quorun disk id which
>> is not configured in my case).
>>
>> Plus, node2 seems to see the right lock from node1.
>>
>> I go on:
>>
>> on both nodes, lvdisplay -v /dev/vg11/lvol1 gives:
>>
>> ...
>>  LV UUID                r3Xrp1-prEG-ceCk-A2dh-SA2E-NWoc-unEfdf
>>   LV Write Access        read/write
>>   LV Status              available
>> ...
>>
>> Shouldn't it be seen NOT available on node2 ?
>>
>> Now, on node2:
>>
>> vgchange -a y vg11 :
>>
>>  1 logical volume(s) in volume group "vg11" now active <-- vg was
>> supposed to be held exlusively by node1
>>
>> cat  /debug/dlm/clvmd_locks gives:
>>
>> 3da0001 1 6f0001 2204 0 1 1 2 1 -1 0 1 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>>
>> on node1:
>>
>> 6f0001 2 3da0001 2204 0 1 10001 2 1 -1 0 0 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>> 38a0001 0 0 434 0 1 1 2 1 -1 0 0 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>>
>> I may be missing something in my procedure that makes it do everything
>> except what I'm expecting.
>>
>> Any ideas ?
>>
>>
>> ------------------------------------------------------------------------
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090729/898ef8c9/attachment.htm>

From hxinwei at gmail.com  Thu Jul 30 05:42:43 2009
From: hxinwei at gmail.com (Xinwei Hu)
Date: Thu, 30 Jul 2009 13:42:43 +0800
Subject: [Linux-cluster] Fwd: CLVM exclusive mode
In-Reply-To: <4A705979.1040605@redhat.com>
References: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>
	<29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>
	<4A705979.1040605@redhat.com>
Message-ID: <1cafab770907292242m1c4253b1o60be8a58856365db@mail.gmail.com>

Hi Christine,

  I can always activate the vg via 'vgchange -ay', regardless to the
exclusive activation on other nodes.
The reason is that the do_lock_lv is actually called by
process_remote_command, and then the EX
lock is convert to CR.

  I'm not so convinced this is the desired behavior. Is there any
reason behind this ?

  Thanks.

2009/7/29 Christine Caulfield <ccaulfie at redhat.com>:
> I think you've misunderstood what vgchange -aey does,
>
> It activates all the currently existing LVs in that VG exclusively on that
> node. If you create another LV in that VG then it's activated normally on
> all nodes in the cluster.
>
> Chrissie
>
> On 29/07/09 14:55, brem belguebli wrote:
>>
>> Hello,
>>
>> I've been playing on RHEL 5.3 with CLVM and exclusive activation but the
>> results I'm getting are not what I'm expecting.
>>
>> - My cluster is a freshly 2 nodes (node1 and node2) installed cluster
>> with the packages shipped by RHEL 5.3 X86_64.
>>
>> - LVM locking type = ?3
>>
>> - a San LUN (/dev/mpath/mpath2) visible from both nodes
>>
>> - dlm used as lock_manager
>>
>> Everything starts normally from cman to clvmd.
>>
>> Below what I'm doing
>>
>> On node1:
>>
>> ?pvcreate /dev/mpath/mpath2
>> ?vgcreate -c n vg11 /dev/mpath/mpath2
>>
>> ! nothing in /debug/dlm/clvmd_locks on both nodes
>>
>> ? vgchange -a n vg11
>>
>> ! nothing in /debug/dlm/clvmd_locks on both nodes
>>
>> ? vgchange -c y vg11
>>
>> ! nothing in /debug/dlm/clvmd_locks on both nodes, vg seen on both nodes
>> as clustered.
>>
>> ? vgchange -a ey vg11
>>
>> ! nothing in /debug/dlm/clvmd_locks on both nodes
>>
>> ?lvcreate -n lvol1 -L 6G /dev/vg11
>>
>> On node1 cat /debug/dlm/clvmd_locks gives:
>>
>> ? 6f0001 2 3da0001 2204 0 1 10001 2 1 -1 0 0 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>> ? 38a0001 0 0 434 0 1 1 2 1 -1 0 0 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>>
>> On node2:
>>
>> ? 3da0001 1 6f0001 2204 0 1 1 2 1 -1 0 1 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>>
>> Is there something I'm doing wrong or misunderstand?
>> I understand that node1 (which actually activated exclusively the vg)
>> sees a lock on /dev/vg11/lvol1 (uuid corresponding to it) from node id 2
>> wich is node2
>> plus a lock from node id 0 (which seems to be the quorun disk id which
>> is not configured in my case).
>>
>> Plus, node2 seems to see the right lock from node1.
>>
>> I go on:
>>
>> on both nodes, lvdisplay -v /dev/vg11/lvol1 gives:
>>
>> ...
>> ?LV UUID ? ? ? ? ? ? ? ?r3Xrp1-prEG-ceCk-A2dh-SA2E-NWoc-unEfdf
>> ? LV Write Access ? ? ? ?read/write
>> ? LV Status ? ? ? ? ? ? ?available
>> ...
>>
>> Shouldn't it be seen NOT available on node2 ?
>>
>> Now, on node2:
>>
>> vgchange -a y vg11 :
>>
>> ?1 logical volume(s) in volume group "vg11" now active <-- vg was
>> supposed to be held exlusively by node1
>>
>> cat ?/debug/dlm/clvmd_locks gives:
>>
>> 3da0001 1 6f0001 2204 0 1 1 2 1 -1 0 1 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>>
>> on node1:
>>
>> 6f0001 2 3da0001 2204 0 1 10001 2 1 -1 0 0 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>> 38a0001 0 0 434 0 1 1 2 1 -1 0 0 64
>> "iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
>>
>> I may be missing something in my procedure that makes it do everything
>> except what I'm expecting.
>>
>> Any ideas ?
>>
>>
>> ------------------------------------------------------------------------
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From ccaulfie at redhat.com  Thu Jul 30 06:25:57 2009
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Thu, 30 Jul 2009 07:25:57 +0100
Subject: [Linux-cluster] Fwd: CLVM exclusive mode
In-Reply-To: <1cafab770907292242m1c4253b1o60be8a58856365db@mail.gmail.com>
References: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>	<29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>	<4A705979.1040605@redhat.com>
	<1cafab770907292242m1c4253b1o60be8a58856365db@mail.gmail.com>
Message-ID: <4A713CF5.707@redhat.com>

On 30/07/09 06:42, Xinwei Hu wrote:
> Hi Christine,
>
>    I can always activate the vg via 'vgchange -ay', regardless to the
> exclusive activation on other nodes.
> The reason is that the do_lock_lv is actually called by
> process_remote_command, and then the EX
> lock is convert to CR.
>
>    I'm not so convinced this is the desired behavior. Is there any
> reason behind this ?
>

I have no idea. It seems wrong to me too. :S

BTW The zero nodeid shown in the DLM lock dump means "this node", and 
has nothing to do with qdisk as some people seem to think.


CHrissie


From brem.belguebli at gmail.com  Thu Jul 30 07:15:11 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Thu, 30 Jul 2009 09:15:11 +0200
Subject: [Linux-cluster] Fwd: CLVM exclusive mode
In-Reply-To: <4A713CF5.707@redhat.com>
References: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>
	<29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>
	<4A705979.1040605@redhat.com>
	<1cafab770907292242m1c4253b1o60be8a58856365db@mail.gmail.com>
	<4A713CF5.707@redhat.com>
Message-ID: <29ae894c0907300015r7ee0d114q8dca432e9491496b@mail.gmail.com>

Hi,

does it look like we're hiting some "undesired feature" ;-)

Concerning the 0 nodeid, I think I read that on some Redhat documents or
bugzilla report, I could find it out.

Brem


2009/7/30, Christine Caulfield <ccaulfie at redhat.com>:
>
> On 30/07/09 06:42, Xinwei Hu wrote:
>
>> Hi Christine,
>>
>>   I can always activate the vg via 'vgchange -ay', regardless to the
>> exclusive activation on other nodes.
>> The reason is that the do_lock_lv is actually called by
>> process_remote_command, and then the EX
>> lock is convert to CR.
>>
>>   I'm not so convinced this is the desired behavior. Is there any
>> reason behind this ?
>>
>>
> I have no idea. It seems wrong to me too. :S
>
> BTW The zero nodeid shown in the DLM lock dump means "this node", and has
> nothing to do with qdisk as some people seem to think.
>
>
> CHrissie
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090730/cc2942ed/attachment.htm>

From fdinitto at redhat.com  Thu Jul 30 12:15:46 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Thu, 30 Jul 2009 14:15:46 +0200
Subject: [Linux-cluster] lvm2: cluster request failed: Unknown error 65538
In-Reply-To: <1248677628.7930.8.camel@cerberus.int.fabbione.net>
References: <561c252c0907220812l152462d9o27cd64ba37c017a4@mail.gmail.com>
	<1248677628.7930.8.camel@cerberus.int.fabbione.net>
Message-ID: <1248956146.7962.5.camel@cerberus.int.fabbione.net>

On Mon, 2009-07-27 at 08:53 +0200, Fabio M. Di Nitto wrote:
> On Wed, 2009-07-22 at 17:12 +0200, Gianluca Cecchi wrote:
> > Hello,
> > by mistake I previously sent this to fedora-list.
> > I resend to the appropriate list I wanted...
> > Excuse in advance for eventual cross-posting effects for anyone...
> > 
> > fedora11 x86_64 with lvm2, device-mapper and related packages updated at :
> > lvm2-2.02.48-1.fc11.x86_64
> > lvm2-cluster-2.02.48-1.fc11.x86_64
> > device-mapper-1.02.33-1.fc11.x86_64
> > device-mapper-libs-1.02.33-1.fc11.x86_64
> > 
> > I have 2 VGs: vg_virtfed that is a system vg (with root lv ans swap
> > lv) and vg_qemu01 that is a clustered vg.
> > At the moment only one node active
> 
> [SNIP]
> 
> known issue... I am working on updating packages in F11 and F10.
> 
> Due to some delays from upstream, this task has been delayed a lot.
> 
> It will be sorted soon.

The update packages should be available today or tomorrow depending on
your local mirror.

Fabio


From gianluca.cecchi at gmail.com  Thu Jul 30 13:04:10 2009
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Thu, 30 Jul 2009 15:04:10 +0200
Subject: [Linux-cluster] lvm2: cluster request failed: Unknown error 65538
Message-ID: <561c252c0907300604w67c79c37jda6af24c5937168e@mail.gmail.com>

On Thu, 30 Jul 2009 14:15:46 +0200 Fabio M. Di Nitto wrote:
> The update packages should be available today or tomorrow depending on
> your local mirror.
>
> Fabio

Hi,
Is this version with the fix the one that I would pick up enabling
updates-testing repo?
[root at tekkafedora ~]# yum --enablerepo updates-testing update lvm2
[snip]
Dependencies Resolved

================================================================================
 Package                Arch       Version            Repository           Size
================================================================================
Updating:
 lvm2                   x86_64     2.02.48-2.fc11     updates-testing     437 k
Updating for dependencies:
 device-mapper          x86_64     1.02.33-2.fc11     updates-testing      87 k
 device-mapper-libs     x86_64     1.02.33-2.fc11     updates-testing      86 k

Transaction Summary
================================================================================
Install      0 Package(s)
Update       3 Package(s)
Remove       0 Package(s)

Total download size: 610 k
Is this ok [y/N]:

In case I can install and test right now.
BTW: what are the effects of the known bug? Any bugzilla?

Thanks,
Gianluca


From rmicmirregs at gmail.com  Thu Jul 30 14:15:04 2009
From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda)
Date: Thu, 30 Jul 2009 16:15:04 +0200
Subject: [Linux-cluster] Fwd: CLVM exclusive mode
In-Reply-To: <29ae894c0907300015r7ee0d114q8dca432e9491496b@mail.gmail.com>
References: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>
	<29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>
	<4A705979.1040605@redhat.com>
	<1cafab770907292242m1c4253b1o60be8a58856365db@mail.gmail.com>
	<4A713CF5.707@redhat.com>
	<29ae894c0907300015r7ee0d114q8dca432e9491496b@mail.gmail.com>
Message-ID: <1248963304.6459.3.camel@mecatol>

Hi Brem

El jue, 30-07-2009 a las 09:15 +0200, brem belguebli escribi?:
> Hi,
>  
> does it look like we're hiting some "undesired feature" ;-)
>  
> Concerning the 0 nodeid, I think I read that on some Redhat documents
> or bugzilla report, I could find it out.
>  
> Brem 
> 
>  
>         
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

I made some test on my lab environment too, i attach the results in the
TXT file.

My conclusions:

1.- lovgols with exclusive flag must be used over clustered volume
groups (obvious and already known)
2.- logvols activated with exclusive flag must be handled EXCLUSIVELY
with the exclusive flag

---> as part of my lvm-cluster.sh resource script, the exclusive flag is
part of the resource definition in cluster.conf so this is correctly
handled
	
3.- you can activate an already active exclusive logvol on any node if
you dont take into accout, during the activation, the exclusive flag
4.- in use (opened) logvols are protected from deactivation from
secondary nodes, even from main node
5.- after a node failure (hang-up, fencing...) logvol is not open
anymore, so it can be exclusively activated on a new node

All this was tested manually, but this is the expected behaviour on
lvm-cluster.sh resource script.

Link to lvm-cluster.sh resource script:

https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html

Cheers,

Rafael

-- 
Rafael Mic? Miranda
-------------- next part --------------
# Nodes are ccopru08 and ccopru17
# ccopru17 will act as the main node, where i try to get exclusive access to logvols
# ccopru08 will act as the secondary node, trying to get resources and trying to produce "bad things"

# volgrp = volume group
# logvol = logical volume over a volume group

# general system information

[root at ccopru08 dlm]# cat /etc/issue
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Kernel \r on an \m


[root at ccopru08 ~]# rpm -qa | grep -e "\(cman\|openais\|rgmanager\|lvm\)"
cman-2.0.98-1.el5
rgmanager-2.0.46-1.el5
lvm2-2.02.40-6.el5
openais-0.80.3-22.el5
lvm2-cluster-2.02.40-7.el5
system-config-lvm-1.1.5-1.0.el5


[root at ccopru08 etc]# greplimpio.sh /etc/multipath.conf 
blacklist {
        devnode         "sda"
}
blacklist_exceptions {
        devnode "sda[a-z]"
}
defaults {
        user_friendly_names yes
}
multipaths {
        multipath {
                wwid                    360014380024d2e260000900000770000
                alias                   quorum
        }
        multipath {
                wwid                    360014380024d2e2600009000007d0000
                alias                   clvm01
        }
        multipath {
                wwid                    360014380024d2e260000900000830000
                alias                   clvm02
        }
        multipath {
                wwid                    360014380024d2e260000900000890000
                alias                   clvm03
        }
        multipath {
                wwid                    360014380024d2e2600009000008f0000
                alias                   vol31
        }
        multipath {
                wwid                    360014380024d2e260000900000950000
                alias                   vol33
        }
        multipath {
                wwid                    360014380024d2e2600009000009b0000
                alias                   vol34
        }
        multipath {
                wwid                    360014380024d2e260000900000a10000
                alias                   vol55
        }
}
devices {
        device {
                vendor "(COMPAQ|HP)"
                product "HSV1[01]1|HSV2[01]0|HSV300|HSV4[05]0"
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout "/sbin/mpath_prio_alua /dev/%n"
                hardware_handler "0"
                path_selector "round-robin 0"
                path_grouping_policy group_by_prio
                failback immediate
                rr_weight uniform
                no_path_retry 12
                rr_min_io 100
                path_checker tur
        }
}

[root at ccopru08 etc]# greplimpio.sh /etc/lvm/lvm.conf
devices {
    dir = "/dev"
    scan = [ "/dev" ]
    preferred_names = [ "^/dev/mpath/", "^/dev/mapper/mpath", "^/dev/[hs]d" ]
    filter = [ "r|/dev/sd|", "a|/dev/mpath/.*|", "r|.*|" ]
    cache_dir = "/etc/lvm/cache"
    cache_file_prefix = ""
    write_cache_state = 1
    sysfs_scan = 1
    md_component_detection = 1
    md_chunk_alignment = 1
    ignore_suspended_devices = 0
}
log {
    verbose = 0
    syslog = 1
    file = "/var/log/lvm2.log"
    overwrite = 0
    level = 0
    indent = 1
    command_names = 0
    prefix = "  "
}
backup {
    backup = 1
    backup_dir = "/etc/lvm/backup"
    archive = 1
    archive_dir = "/etc/lvm/archive"
    retain_min = 10
    retain_days = 30
}
shell {
    history_size = 100
}
global {
    library_dir = "/usr/lib64"
    umask = 077
    test = 0
    units = "h"
    activation = 1
    proc = "/proc"
    locking_type = 3
    fallback_to_clustered_locking = 1
    fallback_to_local_locking = 1
    locking_dir = "/var/lock/lvm"
}
activation {
    missing_stripe_filler = "error"
    reserved_stack = 256
    reserved_memory = 8192
    process_priority = -18
    mirror_region_size = 512
    readahead = "auto"
    mirror_log_fault_policy = "allocate"
    mirror_device_fault_policy = "remove"
}
dmeventd {
    mirror_library = "libdevmapper-event-lvm2mirror.so"
    snapshot_library = "libdevmapper-event-lvm2snapshot.so"
}

[root at ccopru08 etc]# cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster alias="cluster01" config_version="1" name="cluster01">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="ccopru08" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="ccopru17" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1">
        </cman>
        <fencedevices>
                <fencedevice agent="fence_drac" hostname="ccopru08-ilo" login="***" name="ccopru08-fence" passwd="***"/>
                <fencedevice agent="fence_drac" hostname="ccopru17-ilo" login="***" name="ccopru17-fence" passwd="***"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

# end of general system information

# start of tests

# creation of pvs and vgs

[root at ccopru08 etc]# pvcreate /dev/mpath/clvm01 
  Physical volume "/dev/mpath/clvm01" successfully created
 
[root at ccopru17 dlm]# pvs
  PV                VG   Fmt  Attr PSize   PFree  
  /dev/mpath/clvm01      lvm2 --   269,00G 269,00G

[root at ccopru17 dlm]# vgcreate -c n volgrp01 /dev/mpath/clvm01 
  Non-clustered volume group "volgrp01" successfully created

[root at ccopru08 dlm]# vgs
  VG       #PV #LV #SN Attr   VSize   VFree  
  volgrp01   1   0   0 wz--n- 269,00G 269,00G

[root at ccopru17 dlm]# vgchange -a n volgrp01
  0 logical volume(s) in volume group "volgrp01" now active
  
# change to clustered vg

[root at ccopru17 dlm]# vgchange -c y volgrp01 
  Volume group "volgrp01" successfully changed
  
[root at ccopru17 dlm]# vgs
  VG       #PV #LV #SN Attr   VSize   VFree  
  volgrp01   1   0   0 wz--nc 269,00G 269,00G

# creation of logvol01

[root at ccopru17 dlm]# lvcreate -n logvol01 -L 5G volgrp01
  Logical volume "logvol01" created
  
# logvol is reachable on all nodes

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2


[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2


[root at ccopru08 dlm]# lvdisplay -v /dev/volgrp01/logvol01
    Using logical volume(s) on command line
  --- Logical volume ---
  LV Name                /dev/volgrp01/logvol01
  VG Name                volgrp01
  LV UUID                oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                5,00 GB
  Current LE             1280
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:8

[root at ccopru17 dlm]# lvdisplay -v /dev/volgrp01/logvol01
    Using logical volume(s) on command line
  --- Logical volume ---
  LV Name                /dev/volgrp01/logvol01
  VG Name                volgrp01
  LV UUID                oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                5,00 GB
  Current LE             1280
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:8


Node ccopru08
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
26a0001 2 1a70001 12991 0 1 10001 2 1 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
2790001 0 0 7844 0 1 1 2 1 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"


Contents of /debug/dlm/clvmd

Resource ffff81083a61a600 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
Master Copy
Granted Queue
026a0001 CR Remote:   2 01a70001
02790001 CR
Conversion Queue
Waiting Queue


Node ccopru17
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
1a70001 1 26a0001 12991 0 1 1 2 1 -1 0 1 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"


Contents of /debug/dlm/clvmd

Resource ffff81011cfc0600 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
Local Copy, Master is node 1
Granted Queue
01a70001 CR Master:     026a0001
Conversion Queue
Waiting Queue

# deactivation of logvol01

[root at ccopru17 dlm]# lvchange -an volgrp01/logvol01

Node ccopru08
Contents of /debug/dlm/clvmd_locks


Contents of /debug/dlm/clvmd

Node ccopru17
Contents of /debug/dlm/clvmd_locks


Contents of /debug/dlm/clvmd

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2

[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2

# activation with exclusive flag of logvol01 as part of full volgrp01

[root at ccopru17 dlm]# lvchange -aey volgrp01
[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2

Node ccopru08
Contents of /debug/dlm/clvmd_locks


Contents of /debug/dlm/clvmd

Node ccopru17
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
3bd0001 0 0 12991 0 1 1 2 5 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"


Contents of /debug/dlm/clvmd

Resource ffff81083db55e00 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
Master Copy
Granted Queue
03bd0001 EX
Conversion Queue
Waiting Queue

# activation of logvol01 on the other node, without using the exclusive flag

[root at ccopru08 dlm]# lvchange -ay volgrp01/logvol01
[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2

[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2

Node ccopru08
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
1790001 2 3720001 7844 0 1 1 2 1 -1 0 2 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"


Contents of /debug/dlm/clvmd

Resource ffff81076e0ca800 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
Local Copy, Master is node 2
Granted Queue
01790001 CR Master:     03720001
Conversion Queue
Waiting Queue


Node ccopru17
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
3720001 1 1790001 7844 0 1 10001 2 1 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
3bd0001 0 0 12991 0 5 1 2 1 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"


Contents of /debug/dlm/clvmd

Resource ffff81083db55e00 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
Master Copy
Granted Queue
03720001 CR Remote:   1 01790001
03bd0001 CR
Conversion Queue
Waiting Queue

# deactivation, again, of logvol01

[root at ccopru17 dlm]# lvchange -an volgrp01/logvol01
[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2

Node ccopru08
Contents of /debug/dlm/clvmd_locks


Contents of /debug/dlm/clvmd

Node ccopru17
Contents of /debug/dlm/clvmd_locks


Contents of /debug/dlm/clvmd

# creation of a second logvol, logvol02

[root at ccopru17 dlm]# lvcreate -n logvol02 -L 5G volgrp01
  Logical volume "logvol02" created
[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

Node ccopru08
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
31e0001 2 3680001 7844 0 1 1 2 1 -1 0 2 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88tfIeS8eMkJIehccljM0vSW56gdaLYujYV"


Contents of /debug/dlm/clvmd

Resource ffff8107c196de00 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88tfIeS8eMkJIehccljM0vSW56gdaLYujYV"
Local Copy, Master is node 2
Granted Queue
031e0001 CR Master:     03680001
Conversion Queue
Waiting Queue

Node ccopru17
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
3680001 1 31e0001 7844 0 1 10001 2 1 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88tfIeS8eMkJIehccljM0vSW56gdaLYujYV"
3940001 0 0 12991 0 1 1 2 1 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88tfIeS8eMkJIehccljM0vSW56gdaLYujYV"


Contents of /debug/dlm/clvmd

Resource ffff81083f9b1000 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88tfIeS8eMkJIehccljM0vSW56gdaLYujYV"
Master Copy
Granted Queue
03680001 CR Remote:   1 031e0001
03940001 CR
Conversion Queue
Waiting Queue

# deactivation of logvol02

[root at ccopru17 dlm]# lvchange -an volgrp01/logvol02
[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

Node ccopru08
Contents of /debug/dlm/clvmd_locks


Contents of /debug/dlm/clvmd

Node ccopru17
Contents of /debug/dlm/clvmd_locks


Contents of /debug/dlm/clvmd

# activation of logvol01 with exclusive flag, no activation of logvol02

[root at ccopru17 dlm]# lvchange -aey volgrp01/logvol01
[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV


Node ccopru08
Contents of /debug/dlm/clvmd_locks


Contents of /debug/dlm/clvmd


Node ccopru17
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
e80001 0 0 12991 0 1 1 2 5 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"


Contents of /debug/dlm/clvmd

Resource ffff810825026400 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
Master Copy
Granted Queue
00e80001 EX
Conversion Queue
Waiting Queue


[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

# try to activate logvol01 on other node, with exclusive flag

[root at ccopru08 dlm]# lvchange -aey volgrp01/logvol01 
  Error locking on node ccopru08: Volume is busy on another node

# so this is OK

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

[root at ccopru08 dlm]# lvchange -aly volgrp01/logvol01 
  Error locking on node ccopru08: Volume is busy on another node

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

# second try, activation without using the exclusive flag

[root at ccopru08 dlm]# lvchange -ay volgrp01/logvol01

# success, undesired behaviour

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

Node ccopru08
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
2b00001 2 23b0001 7844 0 1 1 2 1 -1 0 2 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"


Contents of /debug/dlm/clvmd

Resource ffff81083b9f4200 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
Local Copy, Master is node 2
Granted Queue
02b00001 CR Master:     023b0001
Conversion Queue
Waiting Queue

Node ccopru17
Contents of /debug/dlm/clvmd_locks
id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
23b0001 1 2b00001 7844 0 1 10001 2 1 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
e80001 0 0 12991 0 5 1 2 1 -1 0 0 64 "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"


Contents of /debug/dlm/clvmd

Resource ffff810825026400 Name (len=64) "q3J161vB2DUWbzULDf2LIA5iotmPU88toYLF252lZuSSFzIw9N1fJQ1dHJTATNp2"
Master Copy
Granted Queue
023b0001 CR Remote:   1 02b00001
00e80001 CR
Conversion Queue
Waiting Queue

# deactivation of logvol01 on both nodes

[root at ccopru17 dlm]# lvchange -an volgrp01/logvol01

[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV
  
# activation of logvol01 with exclusive flag
 
[root at ccopru17 dlm]# lvchange -aey volgrp01/logvol01

[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV


[root at ccopru17 dlm]# mkfs.ext3 /dev/volgrp01/logvol01

[root at ccopru17 dlm]# mkdir /media/aux
[root at ccopru17 dlm]# mount /dev/volgrp01/logvol01 /media/aux/
[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-ao 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

# deactivation of logvol01, in use (opened), from the active node

[root at ccopru17 dlm]# lvchange -aen volgrp01/logvol01
  Error locking on node ccopru17: LV volgrp01/logvol01 in use: not deactivating
# good behaviour

# try to deactivate from the other node

[root at ccopru08 dlm]# lvchange -aen volgrp01/logvol01
  Error locking on node ccopru17: LV volgrp01/logvol01 in use: not deactivating

[root at ccopru08 dlm]# lvchange -an volgrp01/logvol01
  Error locking on node ccopru17: LV volgrp01/logvol01 in use: not deactivating

# good behaviour

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

# try to activate in the other node with exclusive flag

[root at ccopru08 dlm]# lvchange -aey volgrp01/logvol01
  Error locking on node ccopru08: Volume is busy on another node
#good behaviour

# try to activate in the other node without exclusive flag

[root at ccopru08 dlm]# lvchange -ay volgrp01/logvol01

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-a- 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

# try to deactivate from the other node

[root at ccopru08 dlm]# lvchange -an volgrp01/logvol01
  Error locking on node ccopru17: LV volgrp01/logvol01 in use: not deactivating

[root at ccopru08 dlm]# lvchange -aen volgrp01/logvol01
  Error locking on node ccopru17: LV volgrp01/logvol01 in use: not deactivating

[root at ccopru08 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

[root at ccopru17 dlm]# lvs -v 
    Finding all logical volumes
  LV       VG       #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move Copy%  Log Convert LV UUID                               
  logvol01 volgrp01    1 -wi-ao 5,00G  -1  -1 253  8                                          oYLF25-2lZu-SSFz-Iw9N-1fJQ-1dHJ-TATNp2
  logvol02 volgrp01    1 -wi--- 5,00G  -1  -1 -1   -1                                         fIeS8e-MkJI-ehcc-ljM0-vSW5-6gda-LYujYV

[root at ccopru08 dlm]# lvchange -aen volgrp01/logvol01
  Error locking on node ccopru17: LV volgrp01/logvol01 in use: not deactivating


# volume is yet in the main node


Conclusions:

1.- lovgols with exclusive flag must be used over clustered volume groups (obvious and already known)
2.- logvols activated with exclusive flag must be handled EXCLUSIVELY with the exclusive flag

---> as part of my lvm-cluster.sh resource script, the exclusive flag is part of the resource definition in cluster.conf so this is correctly handled
	
3.- you can activate an already active exclusive logvol on any node if you dont take into accout, during the activation, the exclusive flag
4.- in use (opened) logvols are protected from deactivation from secondary nodes, even from main node
5.- after a node faiule (hang-up, fencing...) logvol is not open anymore, so it can be exclusively activated on a new node

All this was tested manually, but this is the expected behaviour on lvm-cluster.sh resurce script.

Link to lvm-cluster.sh resoruce script:

https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html

Cheers,

Rafael

From fdinitto at redhat.com  Thu Jul 30 14:25:33 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Thu, 30 Jul 2009 16:25:33 +0200
Subject: [Linux-cluster] lvm2: cluster request failed: Unknown error 65538
In-Reply-To: <561c252c0907300604w67c79c37jda6af24c5937168e@mail.gmail.com>
References: <561c252c0907300604w67c79c37jda6af24c5937168e@mail.gmail.com>
Message-ID: <1248963933.7962.9.camel@cerberus.int.fabbione.net>

On Thu, 2009-07-30 at 15:04 +0200, Gianluca Cecchi wrote:
> On Thu, 30 Jul 2009 14:15:46 +0200 Fabio M. Di Nitto wrote:
> > The update packages should be available today or tomorrow depending on
> > your local mirror.
> >
> > Fabio
> 
> Hi,
> Is this version with the fix the one that I would pick up enabling
> updates-testing repo?
> [root at tekkafedora ~]# yum --enablerepo updates-testing update lvm2

lvm2 is not enough. you will need to pull in also corosync/openais/cman
and all cluster components that have been updated.

Probably easier to just do:

yum --enablerepo updates-testing update lvm2-cluster

> BTW: what are the effects of the known bug? Any bugzilla?

BZ tons... effects == working cluster and clvmd.

Fabio


From brem.belguebli at gmail.com  Fri Jul 31 00:02:50 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Fri, 31 Jul 2009 02:02:50 +0200
Subject: [Linux-cluster] Fwd: CLVM exclusive mode
In-Reply-To: <1248963304.6459.3.camel@mecatol>
References: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>
	<29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>
	<4A705979.1040605@redhat.com>
	<1cafab770907292242m1c4253b1o60be8a58856365db@mail.gmail.com>
	<4A713CF5.707@redhat.com>
	<29ae894c0907300015r7ee0d114q8dca432e9491496b@mail.gmail.com>
	<1248963304.6459.3.camel@mecatol>
Message-ID: <29ae894c0907301702m452bbfb5u2178565cc127b4da@mail.gmail.com>

Hi Rafael,
Good testing, it confirms that some additional barriers are necessary to
prevent undesired behaviours.

I'll test by tomorrow the same procedure at VG level.


2009/7/30 Rafael Mic? Miranda <rmicmirregs at gmail.com>

> Hi Brem
>
> El jue, 30-07-2009 a las 09:15 +0200, brem belguebli escribi?:
> > Hi,
> >
> > does it look like we're hiting some "undesired feature" ;-)
> >
> > Concerning the 0 nodeid, I think I read that on some Redhat documents
> > or bugzilla report, I could find it out.
> >
> > Brem
> >
> >
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> I made some test on my lab environment too, i attach the results in the
> TXT file.
>
> My conclusions:
>
> 1.- lovgols with exclusive flag must be used over clustered volume
> groups (obvious and already known)
> 2.- logvols activated with exclusive flag must be handled EXCLUSIVELY
> with the exclusive flag
>
> ---> as part of my lvm-cluster.sh resource script, the exclusive flag is
> part of the resource definition in cluster.conf so this is correctly
> handled
>
> 3.- you can activate an already active exclusive logvol on any node if
> you dont take into accout, during the activation, the exclusive flag
> 4.- in use (opened) logvols are protected from deactivation from
> secondary nodes, even from main node
> 5.- after a node failure (hang-up, fencing...) logvol is not open
> anymore, so it can be exclusively activated on a new node
>
> All this was tested manually, but this is the expected behaviour on
> lvm-cluster.sh resource script.
>
> Link to lvm-cluster.sh resource script:
>
> https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
>
> Cheers,
>
> Rafael
>
> --
> Rafael Mic? Miranda
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090731/cf974ffa/attachment.htm>

From crosa at redhat.com  Fri Jul 31 10:08:55 2009
From: crosa at redhat.com (crosa at redhat.com)
Date: Fri, 31 Jul 2009 06:08:55 -0400 (EDT)
Subject: [Linux-cluster] CLVMD without GFS
Message-ID: <Gmxg0kQpGzcT@fhdsb2B3>

A

--- mensagem original ---
De: brem belguebli <brem.belguebli at gmail.com>
Assunto: Re: [Linux-cluster] CLVMD without GFS
Data: 29 de Julho de 2009
Hora: 10:1:29

Hi Rafael,

Just posted the basic tests I'm doing on both linux-cluster and linux-lvm.

I can't get exclusive activation to work properly, I may be missing some
step in my process.

Brem


2009/7/27, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
>
> Hi Brem
>
> So, does it work successfully? I made some testing before I submitted it
> to the list and AFAIK i found no errors.
>
> What do you mean exactly with "some CLVM strange behaviours"? Could you
> be more specific?
>
> I'm not subscribed to linux-lvm, please keep us informed through this
> list.
>
> Thanks in advance. Cheers,
>
> Rafael
>
> El lun, 27-07-2009 a las 21:02 +0200, brem belguebli escribi?:
> > Hi Rafael,
> >
> > It works fine, well at least when not hiting some CLVM strange
> > behaviours, that I'm able to replay by hand, so your script is
> > allright.
> >
> > I'll post to linux-lvm what I could see.
> >
> > Brem
> >
> >
> > 2009/7/21, brem belguebli <brem.belguebli at gmail.com>:
> >         Hola Rafael,
> >
> >         Thanks a lot, that'll avoid me going from scratch.
> >
> >         I'll have a look at them and keep you updated.
> >
> >         Brem
> >
> >
> >
> >         2009/7/21, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
> >                 Hi Brem,
> >
> >                 El mar, 21-07-2009 a las 16:40 +0200, brem belguebli
> >                 escribi?:
> >                 > Hi,
> >                 >
> >                 > That's what I 'm trying to do.
> >                 >
> >                 > If you mean lvm.sh, well, I've been playing with it,
> >                 but it does some
> >                 > "sanity" checks that are wierd
> >                 >      1. It expects HA LVM to be setup (why such
> >                 check if we want to
> >                 >         use CLVM).
> >                 >      2. it exits if it finds a CLVM VG  (kind of
> >                 funny !)
> >                 >      3. it exits if the lvm.conf is newer
> >                 than /boot/*.img (about this
> >                 >         one, we tend to prevent the cluster from
> >                 automatically
> >                 >         starting ...)
> >                 > I was looking to find some doc on how to write my
> >                 own resources, ie
> >                 > CLVM resource that checks if the vg is clustered, if
> >                 so by which node
> >                 > is it exclusively held, and if the node is down to
> >                 activate
> >                 > exclusively the VG.
> >                 >
> >                 > If you have some good links to provide me, that'll
> >                 be great.
> >                 >
> >                 > Thanks
> >                 >
> >                 >
> >                 > 2009/7/21, Christine Caulfield
> >                 <ccaulfie at redhat.com>:
> >                 >         On 07/21/2009 01:11 PM, brem belguebli
> >                 wrote:
> >                 >                 Hi,
> >                 >                 When creating the VG by default
> >                 clustered, you
> >                 >                 implicitely assume that
> >                 >                 it will be used with a clustered FS
> >                 on top of it (gfs,
> >                 >                 ocfs, etc...)
> >                 >                 that will handle the active/active
> >                 mode.
> >                 >                 As I do not intend to use GFS in
> >                 this particular case,
> >                 >                 but ext3 and raw
> >                 >                 devices, I need to make sure the vg
> >                 is exclusively
> >                 >                 activated on one
> >                 >                 node, preventing the other nodes to
> >                 access it unless
> >                 >                 it is the failover
> >                 >                 procedure (node holding the VG
> >                 crashed) and then re
> >                 >                 activate it
> >                 >                 exclusively on the failover node.
> >                 >                 Thanks
> >                 >
> >                 >
> >                 >         In that case you probably ought to be using
> >                 rgmanager to do
> >                 >         the failover for you. It has a script for
> >                 doing exactly
> >                 >         this :-)
> >                 >
> >                 >         Chrissie
> >                 >
> >                 >
> >                 >         --
> >                 >         Linux-cluster mailing list
> >                 >         Linux-cluster at redhat.com
> >                 >
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >                 >
> >                 >
> >                 > --
> >                 > Linux-cluster mailing list
> >                 > Linux-cluster at redhat.com
> >                 >
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >                 Please, check this link:
> >
> >
> https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
> >
> >                 I found exactly the same problem as you, and i
> >                 developed the
> >                 "lvm-cluster.sh" script to solve the needs I had. You
> >                 can find the
> >                 script on the last message of the thread.
> >
> >                 I submitted it to make it part of the main project,
> >                 but i have no news
> >                 about that yet.
> >
> >                 I hope this helps.
> >
> >                 Cheers,
> >
> >                 Rafael
> >
> >                 --
> >                 Rafael Mic? Miranda
> >
> >                 --
> >                 Linux-cluster mailing list
> >                 Linux-cluster at redhat.com
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> --
> Rafael Mic? Miranda
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From crosa at redhat.com  Fri Jul 31 10:09:02 2009
From: crosa at redhat.com (crosa at redhat.com)
Date: Fri, 31 Jul 2009 06:09:02 -0400 (EDT)
Subject: [Linux-cluster] CLVMD without GFS
Message-ID: <MJ2mY0lYjZGc@NBjvyXzB>


--- mensagem original ---
De: brem belguebli <brem.belguebli at gmail.com>
Assunto: Re: [Linux-cluster] CLVMD without GFS
Data: 29 de Julho de 2009
Hora: 10:1:29

Hi Rafael,

Just posted the basic tests I'm doing on both linux-cluster and linux-lvm.

I can't get exclusive activation to work properly, I may be missing some
step in my process.

Brem


2009/7/27, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
>
> Hi Brem
>
> So, does it work successfully? I made some testing before I submitted it
> to the list and AFAIK i found no errors.
>
> What do you mean exactly with "some CLVM strange behaviours"? Could you
> be more specific?
>
> I'm not subscribed to linux-lvm, please keep us informed through this
> list.
>
> Thanks in advance. Cheers,
>
> Rafael
>
> El lun, 27-07-2009 a las 21:02 +0200, brem belguebli escribi?:
> > Hi Rafael,
> >
> > It works fine, well at least when not hiting some CLVM strange
> > behaviours, that I'm able to replay by hand, so your script is
> > allright.
> >
> > I'll post to linux-lvm what I could see.
> >
> > Brem
> >
> >
> > 2009/7/21, brem belguebli <brem.belguebli at gmail.com>:
> >         Hola Rafael,
> >
> >         Thanks a lot, that'll avoid me going from scratch.
> >
> >         I'll have a look at them and keep you updated.
> >
> >         Brem
> >
> >
> >
> >         2009/7/21, Rafael Mic? Miranda <rmicmirregs at gmail.com>:
> >                 Hi Brem,
> >
> >                 El mar, 21-07-2009 a las 16:40 +0200, brem belguebli
> >                 escribi?:
> >                 > Hi,
> >                 >
> >                 > That's what I 'm trying to do.
> >                 >
> >                 > If you mean lvm.sh, well, I've been playing with it,
> >                 but it does some
> >                 > "sanity" checks that are wierd
> >                 >      1. It expects HA LVM to be setup (why such
> >                 check if we want to
> >                 >         use CLVM).
> >                 >      2. it exits if it finds a CLVM VG  (kind of
> >                 funny !)
> >                 >      3. it exits if the lvm.conf is newer
> >                 than /boot/*.img (about this
> >                 >         one, we tend to prevent the cluster from
> >                 automatically
> >                 >         starting ...)
> >                 > I was looking to find some doc on how to write my
> >                 own resources, ie
> >                 > CLVM resource that checks if the vg is clustered, if
> >                 so by which node
> >                 > is it exclusively held, and if the node is down to
> >                 activate
> >                 > exclusively the VG.
> >                 >
> >                 > If you have some good links to provide me, that'll
> >                 be great.
> >                 >
> >                 > Thanks
> >                 >
> >                 >
> >                 > 2009/7/21, Christine Caulfield
> >                 <ccaulfie at redhat.com>:
> >                 >         On 07/21/2009 01:11 PM, brem belguebli
> >                 wrote:
> >                 >                 Hi,
> >                 >                 When creating the VG by default
> >                 clustered, you
> >                 >                 implicitely assume that
> >                 >                 it will be used with a clustered FS
> >                 on top of it (gfs,
> >                 >                 ocfs, etc...)
> >                 >                 that will handle the active/active
> >                 mode.
> >                 >                 As I do not intend to use GFS in
> >                 this particular case,
> >                 >                 but ext3 and raw
> >                 >                 devices, I need to make sure the vg
> >                 is exclusively
> >                 >                 activated on one
> >                 >                 node, preventing the other nodes to
> >                 access it unless
> >                 >                 it is the failover
> >                 >                 procedure (node holding the VG
> >                 crashed) and then re
> >                 >                 activate it
> >                 >                 exclusively on the failover node.
> >                 >                 Thanks
> >                 >
> >                 >
> >                 >         In that case you probably ought to be using
> >                 rgmanager to do
> >                 >         the failover for you. It has a script for
> >                 doing exactly
> >                 >         this :-)
> >                 >
> >                 >         Chrissie
> >                 >
> >                 >
> >                 >         --
> >                 >         Linux-cluster mailing list
> >                 >         Linux-cluster at redhat.com
> >                 >
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >                 >
> >                 >
> >                 > --
> >                 > Linux-cluster mailing list
> >                 > Linux-cluster at redhat.com
> >                 >
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >                 Please, check this link:
> >
> >
> https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
> >
> >                 I found exactly the same problem as you, and i
> >                 developed the
> >                 "lvm-cluster.sh" script to solve the needs I had. You
> >                 can find the
> >                 script on the last message of the thread.
> >
> >                 I submitted it to make it part of the main project,
> >                 but i have no news
> >                 about that yet.
> >
> >                 I hope this helps.
> >
> >                 Cheers,
> >
> >                 Rafael
> >
> >                 --
> >                 Rafael Mic? Miranda
> >
> >                 --
> >                 Linux-cluster mailing list
> >                 Linux-cluster at redhat.com
> >                 https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> --
> Rafael Mic? Miranda
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


From brem.belguebli at gmail.com  Fri Jul 31 19:29:49 2009
From: brem.belguebli at gmail.com (brem belguebli)
Date: Fri, 31 Jul 2009 21:29:49 +0200
Subject: [Linux-cluster] Fwd: CLVM exclusive mode
In-Reply-To: <29ae894c0907301702m452bbfb5u2178565cc127b4da@mail.gmail.com>
References: <29ae894c0907290553r5c51b938v4da9a0f944d75af2@mail.gmail.com>
	<29ae894c0907290655l17053b05j22a5e942985d714a@mail.gmail.com>
	<4A705979.1040605@redhat.com>
	<1cafab770907292242m1c4253b1o60be8a58856365db@mail.gmail.com>
	<4A713CF5.707@redhat.com>
	<29ae894c0907300015r7ee0d114q8dca432e9491496b@mail.gmail.com>
	<1248963304.6459.3.camel@mecatol>
	<29ae894c0907301702m452bbfb5u2178565cc127b4da@mail.gmail.com>
Message-ID: <29ae894c0907311229w7921e7e5o8c37e058a1e01d8e@mail.gmail.com>

Hi,

Same behaviour as the one from Rafael.

Everything is coherent as long as you use the exclusive flag from the rogue
node, the locking does the job. Deactivating an already opened VG (mounted
lvol) is not possible either. How could this behave in case one used raw
devices instead of FS ?

But when you come to ignore the exclusive flag on the rogue node (vgchange
-a y vgXX) the locking is completely bypassed. It's definitely here that the
watchdog has to be (within the tools lvchange, vgchange, or at dlm level).

 below the output of the test:

node1 = nodeid 1
node2 = nodeid 2

node1:

vgchange -a ey vg11
  1 logical volume(s) in volume group "vg11" now active

[root at node1 ~]# lvs
  LV      VG     Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  lvol1   vg11   -wi-a-  6.00G

[root at node1 ~]# ldebug

id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid
r_len r_name
39a0001 0 0 434 0 1 1 2 5 -1 0 0 64
"iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"

[root at node1 ~]# cdebug

Resource ffff81010abd6e00 Name (len=64)
"iZ8vgn7nBm05aMSo5cfpy63rflTqL2ryr3Xrp1prEGceCkA2dhSA2ENWocunEfdf"
Master Copy
Granted Queue
039a0001 EX
Conversion Queue
Waiting Queue

[root at node1 ~]# mount /dev/vg11/lvol1 /mnt

node2:

[root at node2 ~]# vgchange -a ey vg11
  Error locking on node node2: Volume is busy on another node
  0 logical volume(s) in volume group "vg11" now active

ldebug
 nothing
cdebug
 nothing


[root at node2 ~]# vgchange -a n vg11
  Error locking on node node1: LV vg11/lvol1 in use: not deactivating
  0 logical volume(s) in volume group "vg11" now active

# vg11/lvol1 is already mounted on node1 !

 [root at node2 ~]# vgchange -a y vg11
  1 logical volume(s) in volume group "vg11" now active

[root at node2 ~]# mount /dev/vg11/lvol1 /mnt
success
# ..it happens ! ;-)
Brem


2009/7/31, brem belguebli <brem.belguebli at gmail.com>:
>
> Hi Rafael,
>
> Good testing, it confirms that some additional barriers are necessary to
> prevent undesired behaviours.
>
> I'll test by tomorrow the same procedure at VG level.
>
>
>
>
>
> 2009/7/30 Rafael Mic? Miranda <rmicmirregs at gmail.com>
>
>> Hi Brem
>>
>> El jue, 30-07-2009 a las 09:15 +0200, brem belguebli escribi?:
>> > Hi,
>> >
>> > does it look like we're hiting some "undesired feature" ;-)
>> >
>> > Concerning the 0 nodeid, I think I read that on some Redhat documents
>> > or bugzilla report, I could find it out.
>> >
>> > Brem
>> >
>> >
>> >
>> >
>>
>> > --
>> > Linux-cluster mailing list
>> > Linux-cluster at redhat.com
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>> I made some test on my lab environment too, i attach the results in the
>> TXT file.
>>
>> My conclusions:
>>
>> 1.- lovgols with exclusive flag must be used over clustered volume
>> groups (obvious and already known)
>> 2.- logvols activated with exclusive flag must be handled EXCLUSIVELY
>> with the exclusive flag
>>
>> ---> as part of my lvm-cluster.sh resource script, the exclusive flag is
>> part of the resource definition in cluster.conf so this is correctly
>> handled
>>
>> 3.- you can activate an already active exclusive logvol on any node if
>> you dont take into accout, during the activation, the exclusive flag
>> 4.- in use (opened) logvols are protected from deactivation from
>> secondary nodes, even from main node
>> 5.- after a node failure (hang-up, fencing...) logvol is not open
>> anymore, so it can be exclusively activated on a new node
>>
>> All this was tested manually, but this is the expected behaviour on
>> lvm-cluster.sh resource script.
>>
>> Link to lvm-cluster.sh resource script:
>>
>> https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html
>>
>> Cheers,
>>
>> Rafael
>>
>> --
>> Rafael Mic? Miranda
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20090731/54ee4942/attachment.htm>

From rmicmirregs at gmail.com  Fri Jul 31 22:28:57 2009
From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda)
Date: Sat, 01 Aug 2009 00:28:57 +0200
Subject: [Linux-cluster] CLVMD without GFS
In-Reply-To: <MJ2mY0lYjZGc@NBjvyXzB>
References: <MJ2mY0lYjZGc@NBjvyXzB>
Message-ID: <1249079337.6494.10.camel@mecatol>

Hi Brem,

El vie, 31-07-2009 a las 06:09 -0400, crosa at redhat.com escribi?:
> 
> --- mensagem original ---
> De: brem belguebli <brem.belguebli at gmail.com>
> Assunto: Re: [Linux-cluster] CLVMD without GFS
> Data: 29 de Julho de 2009
> Hora: 10:1:29
> 
> Hi Rafael,
> 
> Just posted the basic tests I'm doing on both linux-cluster and linux-lvm.
> 
> I can't get exclusive activation to work properly, I may be missing some
> step in my process.
> 
> Brem
> 
> 

> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Sorry, i'm not sure if I missed this mail.

Maybe i need to explain a couple of things first, just to make them
clear.

As part of the use of the lvm-cluster.sh resource script, you need to:

1.- Configure properly the lock type into lvm.conf to type 3 (example
into this link:

https://www.redhat.com/archives/linux-cluster/2009-July/msg00253.html

)

2.- Configure your multipathing software

3.- Create your Physical Volumes

4.- Configure your Volume Groups as clustered Volume Groups

5.- Create your Logical Volumens into the clustered Volume Groups

6.- And, and the not so obvious, de-activate all the Logical Volumes you
plan to use as exclusive Logical Volumes (using exclusive flag).

De-activation can be done with "vgchange -an volgrp01/logvol01" or
similar command.

If you don't do step 6, you will receive an error message when executing
"lvchange -aey volgrp01/logvol01". This command is executed (with the
proper volume group and logical volume names) internally in the resource
script.

I designed the lvm-cluster.sh resource script to be verbose, maybe you
can copy here your logs (default to /var/log/messages on RHEL systems).

Cheers and thanks,

Rafael
 
-- 
Rafael Mic? Miranda