From andrew at beekhof.net  Mon Oct  3 00:10:13 2011
From: andrew at beekhof.net (Andrew Beekhof)
Date: Mon, 3 Oct 2011 11:10:13 +1100
Subject: [Linux-cluster] [Linux-ha-dev] [ha-wg] CFP: HA Mini-Conference
	in Prague on Oct 25th
In-Reply-To: <4E85D87D.4000906@alteeve.com>
References: <20110814193045.GP5299@suse.de> <20110927145802.GB3713@suse.de>
	<4E85D87D.4000906@alteeve.com>
Message-ID: <CAEDLWG1kLGGdtZJcueQY2FSicZcCZvk5Rm66-KvvUAyxbhyRDA@mail.gmail.com>

On Sat, Oct 1, 2011 at 12:55 AM, Digimer <linux at alteeve.com> wrote:
> On 09/27/2011 07:58 AM, Lars Marowsky-Bree wrote:
>> Hi all,
>>
>> it turns out that there was zero feedback about people wanting to
>> present, only some about travel budget being too tight to come. So we
>> had some discussions about whether to cancel this completely, as this
>> made planning rather difficult.
>>
>> But just in the last few days, I got a fair share of e-mails asking if
>> this still takes place, and who is going to be there. ;-)
>>
>> So: we have the room. I will be there, and it seems so will at least a
>> few other people, including Andrew. I suggest we do it in an
>> "unconference" style and draw up the agenda as we go along; you're
>> welcome to stop by and discuss HA/clustering topics that are important
>> to you. ?It is going to be as successful as we all make it out to be.
>>
>> We share the venue with LinuxCon Europe: Clarion Congress Hotel ?
>> Prague, Czech Republic, on Oct 25th.
>>
>> I suggest we start at 9:30 in the morning and go from there.
>>
>>
>> Regards,
>> ? ? Lars
>>
>
> Is it possible, if this isn't set in stone, to push back to later in the
> day? I don't fly in until the 25th, and I think there is one other
> person who wants to attend in the same boat.

Based on Boston last year, I imagine the conversations will last right
up until Lars starts presenting his talk on Friday afternoon.
People came and went at random, and if someone essential was missing
for a conversation we deferred it until later.

Very informal, but it seemed to work ok.



From linux at alteeve.com  Tue Oct  4 04:49:35 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 04 Oct 2011 00:49:35 -0400
Subject: [Linux-cluster] Can you build a cluster without fencing?
Message-ID: <4E8A905F.2080503@alteeve.com>

Here is the answer;

http://www.youtube.com/watch?v=oKI-tD0L18A

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"



From christian.masopust at siemens.com  Tue Oct  4 05:48:53 2011
From: christian.masopust at siemens.com (Masopust, Christian)
Date: Tue, 4 Oct 2011 07:48:53 +0200
Subject: [Linux-cluster] Can you build a cluster without fencing?
In-Reply-To: <4E8A905F.2080503@alteeve.com>
References: <4E8A905F.2080503@alteeve.com>
Message-ID: <C3B6F57F6F0CE34093FF52B3FFBEFA7C0100FA3BAA2B@ATVIES9917VMSX.ww300.siemens.net>

> 
> Here is the answer;
> 
> http://www.youtube.com/watch?v=oKI-tD0L18A
> 
> -- 
> Digimer
> E-Mail:              digimer at alteeve.com

You've lighted up my day :-)))))



From fdinitto at redhat.com  Tue Oct  4 06:35:33 2011
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 04 Oct 2011 08:35:33 +0200
Subject: [Linux-cluster] Can you build a cluster without fencing?
In-Reply-To: <4E8A905F.2080503@alteeve.com>
References: <4E8A905F.2080503@alteeve.com>
Message-ID: <4E8AA935.3090202@redhat.com>

On 10/04/2011 06:49 AM, Digimer wrote:
> Here is the answer;
> 
> http://www.youtube.com/watch?v=oKI-tD0L18A
> 

ROFL... finally a bit humor on this mailing list ;)

Fabio



From mij at irwan.name  Tue Oct  4 07:05:48 2011
From: mij at irwan.name (Mohd Irwan Jamaluddin)
Date: Tue, 4 Oct 2011 15:05:48 +0800
Subject: [Linux-cluster] Can you build a cluster without fencing?
In-Reply-To: <C3B6F57F6F0CE34093FF52B3FFBEFA7C0100FA3BAA2B@ATVIES9917VMSX.ww300.siemens.net>
References: <4E8A905F.2080503@alteeve.com>
	<C3B6F57F6F0CE34093FF52B3FFBEFA7C0100FA3BAA2B@ATVIES9917VMSX.ww300.siemens.net>
Message-ID: <CALG-EqqsMDGRk+J4ESP7-A1iuBN7Wjcu9VGT0Y-8-vDkgNLN+Q@mail.gmail.com>

On Tue, Oct 4, 2011 at 1:48 PM, Masopust, Christian
<christian.masopust at siemens.com> wrote:
>
> >
> > Here is the answer;
> >
> > http://www.youtube.com/watch?v=oKI-tD0L18A
> >
> > --
> > Digimer
> > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com
>
> You've lighted up my day :-)))))
>

On a serious note, you can use manual fencing (as if no fencing at
all) but it won't be supported by Red Hat.



From Mdukhan at nds.com  Tue Oct  4 12:05:41 2011
From: Mdukhan at nds.com (Dukhan, Meir)
Date: Tue, 4 Oct 2011 14:05:41 +0200
Subject: [Linux-cluster] Killing node XXX because it has rejoined
 thecluster with existing state
Message-ID: <6DAE69EA69F39E4B9DA073B8C848A27C60E7553B35@ILMA1.IL.NDS.COM>

Hi Jean-Daniel,

Sorry to disappoint you, I'm not replying to your post to linux-cluster: I just have the same problem :(
Did you receive any answer or maybe did you solve the problem?

Merci beaucoup :)

Best Regards,
Meir R. Dukhan

________________________________
This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes.
To protect the environment please do not print this e-mail unless necessary.

An NDS Group Limited company. www.nds.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111004/d96cdbe7/attachment.htm>

From ext.thales.jean-daniel.bonnetot at sncf.fr  Tue Oct  4 12:47:23 2011
From: ext.thales.jean-daniel.bonnetot at sncf.fr (BONNETOT Jean-Daniel (EXT THALES))
Date: Tue, 4 Oct 2011 14:47:23 +0200
Subject: [Linux-cluster] Killing node XXX because it has rejoined
	thecluster with existing state
In-Reply-To: <6DAE69EA69F39E4B9DA073B8C848A27C60E7553B35@ILMA1.IL.NDS.COM>
References: <6DAE69EA69F39E4B9DA073B8C848A27C60E7553B35@ILMA1.IL.NDS.COM>
Message-ID: <C088D3516432C643AC828162A5164A7F0AB0617A@se3lmwbibaw.COMMUN.AD.SNCF.FR>

Hi,

No, no answer. It's the good time to ask one more time ;)

this is my last email :
I have problem with two node cluster. When I force a node to faile, second node fences first one. When first one rejoin my cluster, cman shutdown on both nodes saying : 

Sep 28 17:29:36 s64lmwbig3c openais[7273]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state
Sep 28 17:29:36 s64lmwbig3c openais[7273]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart


Logs :
See attached

Conf :
<?xml version="1.0"?>
<cluster config_version="12" name="u64lmwbig8r">
        <cman expected_votes="1" two_node="1">
                <multicast addr="239.192.0.11"/>
        </cman>
        <clusternodes>
                <clusternode name="s64lmwbig3b" nodeid="1" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="fenceHP_g3b"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="s64lmwbig3c" nodeid="2" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="fenceHP_g3c"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" ipaddr="XXXXX" lanplus="1" login="user" name="fenceHP_g3b" passwd="password" verbose="yes"/>
                <fencedevice agent="fence_ipmilan" ipaddr="XXXXX" lanplus="1" login="user" name="fenceHP_g3c" passwd="password" verbose="yes"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
        <fence_daemon clean_start="0" post_fail_delay="20" post_join_delay="60"/>
</cluster>

Do you know what I missed ?

Thanks
Regards,



Jean-Daniel BONNETOT

De?: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Dukhan, Meir
Envoy??: mardi 4 octobre 2011 14:06
??: linux-cluster at redhat.com
Objet?: Re: [Linux-cluster] Killing node XXX because it has rejoined thecluster with existing state

Hi Jean-Daniel, 

Sorry to disappoint you, I?m not replying to your post to linux-cluster: I just have the same problem ?
Did you receive any answer or maybe did you solve the problem? 

Merci beaucoup ?

Best Regards, 
Meir R. Dukhan

________________________________________
This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes.
To protect the environment please do not print this e-mail unless necessary.

An NDS Group Limited company. www.nds.com
-------
Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire.
-------
This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it. 



From hlawatschek at atix.de  Tue Oct  4 13:45:42 2011
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Tue, 4 Oct 2011 15:45:42 +0200 (CEST)
Subject: [Linux-cluster] Fencing agent for cisco nexus 5k
In-Reply-To: <1172525370.1599.1317735544624.JavaMail.root@axgroupware01-1.gallien.atix>
Message-ID: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix>

Hi,

we are currently building up a Cisco Nexus FCoE infrastructure together with Red Hat Clusters. 

I'd like to use the Nexus 5ks for I/O fencing operations and I'm looking for a fencing agent to be used together with the Nexus 5k.
The basic idea would be to disable the network ports of the cluster node inside the Nexus that is supposed to be fenced.

Any pointers or ideas? 

Thanks a lot!

Mark

-- 
Mark Hlawatschek 

ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de 

http://www.linux-subscriptions.com



From linux at alteeve.com  Tue Oct  4 14:00:08 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 04 Oct 2011 10:00:08 -0400
Subject: [Linux-cluster] Fencing agent for cisco nexus 5k
In-Reply-To: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix>
References: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix>
Message-ID: <4E8B1168.2060906@alteeve.com>

On 10/04/2011 09:45 AM, Mark Hlawatschek wrote:
> Hi,
> 
> we are currently building up a Cisco Nexus FCoE infrastructure together with Red Hat Clusters. 
> 
> I'd like to use the Nexus 5ks for I/O fencing operations and I'm looking for a fencing agent to be used together with the Nexus 5k.
> The basic idea would be to disable the network ports of the cluster node inside the Nexus that is supposed to be fenced.
> 
> Any pointers or ideas? 
> 
> Thanks a lot!
> 
> Mark

I don't have experience with that switch/equipment. However, writing a
fabric-type fence agent should be pretty straight forward. I assume the
device has telnet or ssh access?

Install the fence-agents package and then look for the 'fence_*' files.
They will make for great examples to base a new agent on. The actual API
is defined here;

https://fedorahosted.org/cluster/wiki/FenceAgentAPI

The main task is to ensure disconnection of the node. So after calling
the switch, be sure to confirm that the port is logically disconnected
before returning a success to the fenced caller.

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"



From fdinitto at redhat.com  Tue Oct  4 15:09:43 2011
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 04 Oct 2011 17:09:43 +0200
Subject: [Linux-cluster] Fencing agent for cisco nexus 5k
In-Reply-To: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix>
References: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix>
Message-ID: <4E8B21B7.5030208@redhat.com>

On 10/04/2011 03:45 PM, Mark Hlawatschek wrote:
> Hi,
> 
> we are currently building up a Cisco Nexus FCoE infrastructure together with Red Hat Clusters. 
> 
> I'd like to use the Nexus 5ks for I/O fencing operations and I'm looking for a fencing agent to be used together with the Nexus 5k.
> The basic idea would be to disable the network ports of the cluster node inside the Nexus that is supposed to be fenced.
> 
> Any pointers or ideas? 
> 
> Thanks a lot!
> 
> Mark
> 

Unless they changed the MIB, you can probably use fence_ifmib.

Fabio



From cos at aaaaa.org  Wed Oct  5 02:23:04 2011
From: cos at aaaaa.org (Ofer Inbar)
Date: Tue, 4 Oct 2011 22:23:04 -0400
Subject: [Linux-cluster] service stuck in "recovering", no attempt to restart
Message-ID: <20111005022304.GD7753@mip.aaaaa.org>

On a 3 node cluster running:
  cman-2.0.115-34.el5_5.3
  rgmanager-2.0.52-6.el5.centos.8
  openais-0.80.6-16.el5_5.9

We have a custom resource, "dn", for which I wrote the resource agent.
Service has three resources: a virtual IP (using ip.sh), and two dn children.

Normally, when one of the dn instances fails its status check,
rgmanager stops the service (stops dn_a and dn_b, then stops the IP),
then relocates to another node and starts the service there.

Several hours ago, one of the dn instances failed its status check,
rgmanager stopped it, marked the service "recovering", but then did
not seem to try to start it on any node.  It just stayed down for
hours until logged in to look at it.

Until 17:22 today, service was running on node1.  Here's what it logged:

Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <err> Monitoring Service dn:dn_b > Service Is Not Running
Oct  4 17:22:12 clustnode1 clurgmgrd[517]: <notice> status on dn "dn_b" returned 1 (generic error)
Oct  4 17:22:12 clustnode1 clurgmgrd[517]: <notice> Stopping service service:dn
Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_b
Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <notice> Checking if stopped: check_pid_file /dn/dn_b/dn_b.pid
Oct  4 17:22:14 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_b > Succeed
Oct  4 17:22:14 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_a
Oct  4 17:22:15 clustnode1 clurgmgrd: [517]: <notice> Checking if stopped: check_pid_file /dn/dn_a/dn_a.pid
Oct  4 17:22:17 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_a > Succeed
Oct  4 17:22:17 clustnode1 clurgmgrd: [517]: <info> Removing IPv4 address 10.6.9.136/23 from eth0
Oct  4 17:22:27 clustnode1 clurgmgrd[517]: <notice> Service service:dn is recovering

At around that time, node2 also logged this:

Oct  4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.
Oct  4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.

[Cluster name and node names anonymized with simple search and replace]

There are no other log entries in /var/log/messages on any node around
that time, that relate to cluster suite.

Currently, the service is still "recovering", with cluster status
otherwise apparently fine.  clustat -x output on all three nodes is
identical except for which node has local="1".  It looks like this:

<?xml version="1.0"?>
<clustat version="4.1.1">
  <cluster name="clustnode" id="23048" generation="12"/>
  <quorum quorate="1" groupmember="1"/>
  <nodes>
    <node name="clustnode1" state="1" local="0" estranged="0" rgmanager="1" rgmanager_master="0" qdisk="0" nodeid="0x00000001"/>
    <node name="clustnode2" state="1" local="1" estranged="0" rgmanager="1" rgmanager_master="0" qdisk="0" nodeid="0x00000002"/>
    <node name="clustnode3" state="1" local="0" estranged="0" rgmanager="1" rgmanager_master="0" qdisk="0" nodeid="0x00000003"/>
  </nodes>
  <groups>
    <group name="service:dn" state="118" state_str="recovering" flags="0" flags_str="" owner="none" last_owner="clustnode1" restarts="0" last_transition="1317763347" last_transition_str="Tue Oct  4 17:22:27 2011"/>
  </groups>
</clustat>

And cman_tool status shows all three nodes voting and in the quorum:

Version: 6.2.0
Config Version: 2
Cluster Name: clustnode
Cluster Id: 23048
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2  
Active subsystems: 8
Flags: Dirty 
Ports Bound: 0 177  
Node name: clustnode2
Node ID: 2
Multicast addresses: 239.245.0.84 
Node addresses: 10.6.8.208 

Again, this looks the same on all three nodes.

Here's the resource section of cluster.conf (with the values of some
of the arguments to my custom resource modified so as not to expose
actual username, path, or port number):

<rm log_level="6">
  <service autostart="1" name="dn" recovery="relocate">
    <ip address="10.6.9.136" monitor_link="1">
      <dn user="username" dninstall="/dn/path" name="dn_a" monitoringport="portnum"/>
      <dn user="username" dninstall="/dn/path" name="dn_b" monitoringport="portnum"/>
    </ip>
  </service>
</rm>

Any ideas why it might be in this state, where everything is
apparently fine except that the service is "recovering" and rgmanager
isn't trying to do anything about it and isn't logging any complaints?

Attached: strace -fp output of clurgmrgd processes on node1 and node2
  -- Cos
-------------- next part --------------
Process 517 attached with 4 threads - interrupt to quit
[pid  9842] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1001] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1000] select(6, [3 5], NULL, NULL, {0, 935000} <unfinished ...>
[pid   517] select(12, [10 11], NULL, NULL, {8, 177000} <unfinished ...>
[pid  9842] <... clock_gettime resumed> {1317781205, 661864000}) = 0
[pid  1001] <... clock_gettime resumed> {1317781205, 661864000}) = 0
[pid  9842] futex(0x432a5cbc, FUTEX_WAIT_PRIVATE, 3573, {7, 357519000} <unfinished ...>
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81853, {0, 867658000}) = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781206, 530851000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81855, {2, 999711000} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2}) = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781209, 532508000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81857, {3, 0} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781212, 534580000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81859, {3, 0} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  9842] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  9842] futex(0x432a5c90, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  9842] clock_gettime(CLOCK_REALTIME, {1317781213, 21497000}) = 0
[pid  9842] futex(0x432a5cbc, FUTEX_WAIT_PRIVATE, 3575, {10, 0} <unfinished ...>
[pid   517] <... select resumed> )      = 0 (Timeout)
[pid   517] socket(PF_FILE, SOCK_STREAM, 0) = 13
[pid   517] connect(13, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid   517] write(13, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20
[pid   517] read(13, "\1\0\0\0\0\0\0\0\350D9\3\0\0\0\0\0\0\0\0", 20) = 20
[pid   517] close(13)                   = 0
[pid   517] socket(PF_FILE, SOCK_STREAM, 0) = 13
[pid   517] connect(13, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid   517] write(13, "\3\0\0\0\0\0\0\0\350D9\3\0\0\0\0\31\0\0\0/cluster/@co"..., 45) = 45
[pid   517] read(13, "\3\0\0\0\0\0\0\0\350D9\3\0\0\0\0\2\0\0\0", 20) = 20
[pid   517] read(13, "2\0", 2)          = 2
[pid   517] close(13)                   = 0
[pid   517] socket(PF_FILE, SOCK_STREAM, 0) = 13
[pid   517] connect(13, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid   517] write(13, "\2\0\0\0\0\0\0\0\350D9\3\0\0\0\0\0\0\0\0", 20) = 20
[pid   517] read(13, "\2\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0", 20) = 20
[pid   517] close(13)                   = 0
[pid   517] clone(Process 18772 attached
child_stack=0x40ebc240, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x40ebc9c0, tls=0x40ebc930, child_tidptr=0x40ebc9c0) = 18772
[pid   517] select(12, [10 11], NULL, NULL, {10, 0} <unfinished ...>
[pid 18772] set_robust_list(0x40ebc9d0, 0x18) = 0
[pid 18772] rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 TERM], NULL, 8) = 0
[pid 18772] _exit(0)                    = ?
Process 18772 detached
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781215, 536718000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81861, {3, 0} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781218, 538706000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81863, {3, 0} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2}) = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781221, 540821000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81865, {3, 0}
-------------- next part --------------
Process 28445 attached with 4 threads - interrupt to quit
[pid 28962] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 28931] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 28930] select(6, [3 5], NULL, NULL, {1, 725000} <unfinished ...>
[pid 28445] select(12, [10 11], NULL, NULL, {6, 894000} <unfinished ...>
[pid 28962] <... clock_gettime resumed> {1317781260, 477926000}) = 0
[pid 28931] <... clock_gettime resumed> {1317781260, 477926000}) = 0
[pid 28962] futex(0x429dacbc, FUTEX_WAIT_PRIVATE, 24531, {4, 991782000} <unfinished ...>
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81869, {2, 613666000} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781263, 93684000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81871, {3, 0} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28962] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28962] futex(0x429dac90, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28962] clock_gettime(CLOCK_REALTIME, {1317781265, 471616000}) = 0
[pid 28962] futex(0x429dacbc, FUTEX_WAIT_PRIVATE, 24533, {10, 0} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781266, 95446000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81873, {3, 0} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28445] <... select resumed> )      = 0 (Timeout)
[pid 28445] socket(PF_FILE, SOCK_STREAM, 0) = 14
[pid 28445] connect(14, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid 28445] write(14, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20
[pid 28445] read(14, "\1\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\0\0\0\0", 20) = 20
[pid 28445] close(14)                   = 0
[pid 28445] socket(PF_FILE, SOCK_STREAM, 0) = 14
[pid 28445] connect(14, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid 28445] write(14, "\3\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\31\0\0\0/cluster/@co"..., 45) = 45
[pid 28445] read(14, "\3\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\2\0\0\0", 20) = 20
[pid 28445] read(14, "2\0", 2)          = 2
[pid 28445] close(14)                   = 0
[pid 28445] socket(PF_FILE, SOCK_STREAM, 0) = 14
[pid 28445] connect(14, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid 28445] write(14, "\2\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\0\0\0\0", 20) = 20
[pid 28445] read(14, "\2\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0", 20) = 20
[pid 28445] close(14)                   = 0
[pid 28445] clone(Process 29968 attached
child_stack=0x40705240, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x407059c0, tls=0x40705930, child_tidptr=0x407059c0) = 29968
[pid 28445] select(12, [10 11], NULL, NULL, {10, 0} <unfinished ...>
[pid 29968] set_robust_list(0x407059d0, 0x18) = 0
[pid 29968] rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 TERM], NULL, 8) = 0
[pid 29968] _exit(0)                    = ?
Process 29968 detached
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781269, 97451000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81875, {3, 0} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28445] <... select resumed> )      = 1 (in [10], left {6, 643000})
[pid 28445] accept(10, 0, NULL)         = 14
[pid 28445] fcntl(14, F_GETFD)          = 0
[pid 28445] fcntl(14, F_SETFD, FD_CLOEXEC) = 0
[pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0})
[pid 28445] read(14, "\30\0\0\0\4\0\0\0", 8) = 8
[pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0})
[pid 28445] read(14, "\22:\274\0\0\0\0\30\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0", 24) = 24
[pid 28445] clone(Process 29977 attached
child_stack=0x40705240, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x407059c0, tls=0x40705930, child_tidptr=0x407059c0) = 29977
[pid 28445] select(12, [10 11], NULL, NULL, {6, 643000} <unfinished ...>
[pid 29977] set_robust_list(0x407059d0, 0x18) = 0
[pid 29977] rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 TERM], NULL, 8) = 0
[pid 29977] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 29977] write(14, "x\0\0\0\4\0\0\0\22:\274\0\0\0\0x\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
[pid 29977] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 29977] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0N\213y\23", 32) = 32
[pid 29977] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0})
[pid 29977] read(14, "\30\0\0\0\4\0\0\0", 8) = 8
[pid 29977] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0})
[pid 29977] read(14, "\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\30\0\0\0", 24) = 24
[pid 29977] close(14)                   = 0
[pid 29977] _exit(0)                    = ?
Process 29977 detached
[pid 28445] <... select resumed> )      = 1 (in [10], left {6, 642000})
[pid 28445] accept(10, 0, NULL)         = 14
[pid 28445] fcntl(14, F_GETFD)          = 0
[pid 28445] fcntl(14, F_SETFD, FD_CLOEXEC) = 0
[pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0})
[pid 28445] read(14, "\30\0\0\0\4\0\0\0", 8) = 8
[pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0})
[pid 28445] read(14, "\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\0", 24) = 24
[pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\1\0\0\0\0\377\377\377\377", 32) = 32
[pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\2\0\0\0\0\377\377\377\377", 32) = 32
[pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\3\0\0\0\0\377\377\377\377", 32) = 32
[pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377", 32) = 32
[pid 28445] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0})
[pid 28445] read(14, "\30\0\0\0\4\0\0\0", 8) = 8
[pid 28445] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0})
[pid 28445] read(14, "\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\30\0\0\0", 24) = 24
[pid 28445] close(14)                   = 0
[pid 28445] select(12, [10 11], NULL, NULL, {6, 642000} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781272, 99389000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81877, {3, 0} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2}) = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781275, 101463000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81879, {3, 0} <unfinished ...>
[pid 28962] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28962] futex(0x429dac90, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28962] clock_gettime(CLOCK_REALTIME, {1317781275, 474705000}) = 0
[pid 28962] futex(0x429dacbc, FUTEX_WAIT_PRIVATE, 24535, {9, 999904000} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2}

From cos at aaaaa.org  Wed Oct  5 02:43:57 2011
From: cos at aaaaa.org (Ofer Inbar)
Date: Tue, 4 Oct 2011 22:43:57 -0400
Subject: [Linux-cluster] service stuck in "recovering",
	no attempt to restart
In-Reply-To: <20111005022304.GD7753@mip.aaaaa.org>
References: <20111005022304.GD7753@mip.aaaaa.org>
Message-ID: <20111005024357.GL341@mip.aaaaa.org>

After collecting all of the information in my previous mailing,
I then tried restarting the service using clusvcadm -R, to no avail:

| $ sudo clusvcadm -R dn
| Local machine trying to restart service:dn...

And so it stood for over a minute, with no evidence that it was
actually trying to start anything, so I hit ^C.

Next, I restarted rgmanager on all three nodes simultaneously,
using "sudo service rgmanager restart".  When rgmanager came
back up, the service was in status "recoverable" and then soon
after, it got started successully on node2.

So now the service is running, but it's still a complete mystery
to me why it never got restarted before, and why I had to restart
rgmanager to get it to bring the service up.  I also don't know
what, if anything, I need to do to prevent this from happening again.

[I did try killing processes a few times and observed successful
relocations and restarts, so the cluster seems to be in a good
state for now...]
  -- Cos



From lhh at redhat.com  Wed Oct  5 14:39:05 2011
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 05 Oct 2011 10:39:05 -0400
Subject: [Linux-cluster] service stuck in "recovering",
 no attempt to restart
In-Reply-To: <20111005022304.GD7753@mip.aaaaa.org>
References: <20111005022304.GD7753@mip.aaaaa.org>
Message-ID: <4E8C6C09.1080806@redhat.com>

On 10/04/2011 10:23 PM, Ofer Inbar wrote:
> On a 3 node cluster running:
>    cman-2.0.115-34.el5_5.3
>    rgmanager-2.0.52-6.el5.centos.8
>    openais-0.80.6-16.el5_5.9
>
> We have a custom resource, "dn", for which I wrote the resource agent.
> Service has three resources: a virtual IP (using ip.sh), and two dn children.

You should be able to disable then re-enable - that is, you shouldn't 
need to restart rgmanager to break the recovering state.

There's this related bug, but it should have been fixed in 2.0.52-6:

   https://bugzilla.redhat.com/show_bug.cgi?id=530409

> Normally, when one of the dn instances fails its status check,
> rgmanager stops the service (stops dn_a and dn_b, then stops the IP),
> then relocates to another node and starts the service there.

That's what I'd expect to happen.

> Several hours ago, one of the dn instances failed its status check,
> rgmanager stopped it, marked the service "recovering", but then did
> not seem to try to start it on any node.  It just stayed down for
> hours until logged in to look at it.
>
> Until 17:22 today, service was running on node1.  Here's what it logged:
>
> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]:<err>  Monitoring Service dn:dn_b>  Service Is Not Running
> Oct  4 17:22:12 clustnode1 clurgmgrd[517]:<notice>  status on dn "dn_b" returned 1 (generic error)
> Oct  4 17:22:12 clustnode1 clurgmgrd[517]:<notice>  Stopping service service:dn
> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]:<info>  Stopping Service dn:dn_b
> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]:<notice>  Checking if stopped: check_pid_file /dn/dn_b/dn_b.pid
> Oct  4 17:22:14 clustnode1 clurgmgrd: [517]:<info>  Stopping Service dn:dn_b>  Succeed
> Oct  4 17:22:14 clustnode1 clurgmgrd: [517]:<info>  Stopping Service dn:dn_a
> Oct  4 17:22:15 clustnode1 clurgmgrd: [517]:<notice>  Checking if stopped: check_pid_file /dn/dn_a/dn_a.pid
> Oct  4 17:22:17 clustnode1 clurgmgrd: [517]:<info>  Stopping Service dn:dn_a>  Succeed
> Oct  4 17:22:17 clustnode1 clurgmgrd: [517]:<info>  Removing IPv4 address 10.6.9.136/23 from eth0
> Oct  4 17:22:27 clustnode1 clurgmgrd[517]:<notice>  Service service:dn is recovering
>
> At around that time, node2 also logged this:
>
> Oct  4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.
> Oct  4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.

It may be related; I doubt it.


> Again, this looks the same on all three nodes.
>
> Here's the resource section of cluster.conf (with the values of some
> of the arguments to my custom resource modified so as not to expose
> actual username, path, or port number):
>
> <rm log_level="6">
>    <service autostart="1" name="dn" recovery="relocate">
>      <ip address="10.6.9.136" monitor_link="1">
>        <dn user="username" dninstall="/dn/path" name="dn_a" monitoringport="portnum"/>
>        <dn user="username" dninstall="/dn/path" name="dn_b" monitoringport="portnum"/>
>      </ip>
>    </service>
> </rm>
>
> Any ideas why it might be in this state, where everything is
> apparently fine except that the service is "recovering" and rgmanager
> isn't trying to do anything about it and isn't logging any complaints?

The only cause for this is if we send a message but it either doesn't 
make it or we get a weird return code -- I think rgmanager logs it, 
though, so this could be a new issue.

> Attached: strace -fp output of clurgmrgd processes on node1 and node2

The strace data is not likely to be useful, but a dump from rgmanager 
would.  If you get in to this state again, do this:

    kill -USR1 `pidof -s clurgmgrd`

Then look at /tmp/rgmanager-dump* (2.0.x) or 
/var/lib/cluster/rgmanager-dump (3.x.y)

-- Lon



From robejrm at gmail.com  Wed Oct  5 15:01:46 2011
From: robejrm at gmail.com (Juan Ramon Martin Blanco)
Date: Wed, 5 Oct 2011 17:01:46 +0200
Subject: [Linux-cluster] service stuck in "recovering",
	no attempt to restart
In-Reply-To: <4E8C6C09.1080806@redhat.com>
References: <20111005022304.GD7753@mip.aaaaa.org> <4E8C6C09.1080806@redhat.com>
Message-ID: <CAJfd-EB+nWxrWVT7mkJx3S6gxJGq8NvN2D9Wi_637ur4WT0+WA@mail.gmail.com>

On Wed, Oct 5, 2011 at 4:39 PM, Lon Hohberger <lhh at redhat.com> wrote:
> On 10/04/2011 10:23 PM, Ofer Inbar wrote:
>>
>> On a 3 node cluster running:
>> ? cman-2.0.115-34.el5_5.3
>> ? rgmanager-2.0.52-6.el5.centos.8
>> ? openais-0.80.6-16.el5_5.9
>>
>> We have a custom resource, "dn", for which I wrote the resource agent.
>> Service has three resources: a virtual IP (using ip.sh), and two dn
>> children.
>
> You should be able to disable then re-enable - that is, you shouldn't need
> to restart rgmanager to break the recovering state.
>
> There's this related bug, but it should have been fixed in 2.0.52-6:
>
> ?https://bugzilla.redhat.com/show_bug.cgi?id=530409
>
I have the same problem with version 2.0.52-6 on rhel5, I'll try to
get a dump when it happens again (didn't know the USR1 signal thing)
# rpm -aq | grep -e rgmanager -e openais -e cman
cman-2.0.115-34.el5_5.4
rgmanager-2.0.52-6.el5_5.8
openais-0.80.6-16.el5_5.9

Thanks,
Juanra
>> Normally, when one of the dn instances fails its status check,
>> rgmanager stops the service (stops dn_a and dn_b, then stops the IP),
>> then relocates to another node and starts the service there.
>
> That's what I'd expect to happen.
>
>> Several hours ago, one of the dn instances failed its status check,
>> rgmanager stopped it, marked the service "recovering", but then did
>> not seem to try to start it on any node. ?It just stayed down for
>> hours until logged in to look at it.
>>
>> Until 17:22 today, service was running on node1. ?Here's what it logged:
>>
>> Oct ?4 17:22:12 clustnode1 clurgmgrd: [517]:<err> ?Monitoring Service
>> dn:dn_b> ?Service Is Not Running
>> Oct ?4 17:22:12 clustnode1 clurgmgrd[517]:<notice> ?status on dn "dn_b"
>> returned 1 (generic error)
>> Oct ?4 17:22:12 clustnode1 clurgmgrd[517]:<notice> ?Stopping service
>> service:dn
>> Oct ?4 17:22:12 clustnode1 clurgmgrd: [517]:<info> ?Stopping Service
>> dn:dn_b
>> Oct ?4 17:22:12 clustnode1 clurgmgrd: [517]:<notice> ?Checking if stopped:
>> check_pid_file /dn/dn_b/dn_b.pid
>> Oct ?4 17:22:14 clustnode1 clurgmgrd: [517]:<info> ?Stopping Service
>> dn:dn_b> ?Succeed
>> Oct ?4 17:22:14 clustnode1 clurgmgrd: [517]:<info> ?Stopping Service
>> dn:dn_a
>> Oct ?4 17:22:15 clustnode1 clurgmgrd: [517]:<notice> ?Checking if stopped:
>> check_pid_file /dn/dn_a/dn_a.pid
>> Oct ?4 17:22:17 clustnode1 clurgmgrd: [517]:<info> ?Stopping Service
>> dn:dn_a> ?Succeed
>> Oct ?4 17:22:17 clustnode1 clurgmgrd: [517]:<info> ?Removing IPv4 address
>> 10.6.9.136/23 from eth0
>> Oct ?4 17:22:27 clustnode1 clurgmgrd[517]:<notice> ?Service service:dn is
>> recovering
>>
>> At around that time, node2 also logged this:
>>
>> Oct ?4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete
>> comm_header_t.
>> Oct ?4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete
>> comm_header_t.
>
> It may be related; I doubt it.
>
>
>> Again, this looks the same on all three nodes.
>>
>> Here's the resource section of cluster.conf (with the values of some
>> of the arguments to my custom resource modified so as not to expose
>> actual username, path, or port number):
>>
>> <rm log_level="6">
>> ? <service autostart="1" name="dn" recovery="relocate">
>> ? ? <ip address="10.6.9.136" monitor_link="1">
>> ? ? ? <dn user="username" dninstall="/dn/path" name="dn_a"
>> monitoringport="portnum"/>
>> ? ? ? <dn user="username" dninstall="/dn/path" name="dn_b"
>> monitoringport="portnum"/>
>> ? ? </ip>
>> ? </service>
>> </rm>
>>
>> Any ideas why it might be in this state, where everything is
>> apparently fine except that the service is "recovering" and rgmanager
>> isn't trying to do anything about it and isn't logging any complaints?
>
> The only cause for this is if we send a message but it either doesn't make
> it or we get a weird return code -- I think rgmanager logs it, though, so
> this could be a new issue.
>
>> Attached: strace -fp output of clurgmrgd processes on node1 and node2
>
> The strace data is not likely to be useful, but a dump from rgmanager would.
> ?If you get in to this state again, do this:
>
> ? kill -USR1 `pidof -s clurgmgrd`
>
> Then look at /tmp/rgmanager-dump* (2.0.x) or /var/lib/cluster/rgmanager-dump
> (3.x.y)
>
> -- Lon
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From andrew at beekhof.net  Thu Oct  6 22:05:38 2011
From: andrew at beekhof.net (Andrew Beekhof)
Date: Fri, 7 Oct 2011 09:05:38 +1100
Subject: [Linux-cluster] [Linux-HA] [Linux-ha-dev] [ha-wg] CFP: HA
 Mini-Conference in Prague on Oct 25th
In-Reply-To: <20111005145310.GB3724@suse.de>
References: <20110814193045.GP5299@suse.de> <20110927145802.GB3713@suse.de>
	<4E85D87D.4000906@alteeve.com>
	<CAEDLWG1kLGGdtZJcueQY2FSicZcCZvk5Rm66-KvvUAyxbhyRDA@mail.gmail.com>
	<20111005145310.GB3724@suse.de>
Message-ID: <CAEDLWG2SZLSqsNamRT2Wmkte9Bg+Bsq=XGRgRtGGtJ-WztTh4w@mail.gmail.com>

On Thu, Oct 6, 2011 at 1:53 AM, Lars Marowsky-Bree <lmb at suse.com> wrote:
> On 2011-10-03T11:10:13, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>> Based on Boston last year, I imagine the conversations will last right
>> up until Lars starts presenting his talk on Friday afternoon.
>> People came and went at random, and if someone essential was missing
>> for a conversation we deferred it until later.
>
> Oh, then we're going to not stop, ever - because I don't have a talk at
> the main conference this time ;-)

The schedule has you in a friday afternoon slot iirc.

>
>> Very informal, but it seemed to work ok.
>
> yes, and given that the ha mailing lists are still down, probably the
> best we can hope for ...

indeed



From andy.speagle at wichita.edu  Fri Oct  7 18:38:32 2011
From: andy.speagle at wichita.edu (Speagle, Andy)
Date: Fri, 7 Oct 2011 18:38:32 +0000
Subject: [Linux-cluster] Multiple HA-LVM Resources
Message-ID: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu>

Hi Team,

I'm having an issue with RHCS on RHEL 6.1 ... I have multiple HA-LVM resources in my cluster which are being used by two different service groups.  I'm having an issue when I try to start the second service group on the same cluster node running the first service group.  I get this immediately.

Local machine trying to enable service:<servicename>...Invalid operation for resource

However, I can startup both service groups just fine as long as it's on different nodes in the cluster.  I've got logging turned up to debug... but I can't seem to get anything meaningful in the logs regarding this issue.

Can someone clue me in?

Andy Speagle
System & Storage Administrator
UCATS - Wichita State University

P: 316.978.3869
C: 316.617.2431

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111007/7b75fbe1/attachment.htm>

From tc3driver at gmail.com  Fri Oct  7 19:08:57 2011
From: tc3driver at gmail.com (Bill G.)
Date: Fri, 7 Oct 2011 12:08:57 -0700
Subject: [Linux-cluster] Multiple HA-LVM Resources
In-Reply-To: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu>
References: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu>
Message-ID: <CABQafzjfSq8=s6jVtkYsN1Tj9eNy3CwH_kGSic=iMfLv8N=3gw@mail.gmail.com>

Hi Andy,

What do your failover domains look like?

If you have failover domains set up, and that service group is not listed as
being able to run on that node, you will get that error message when trying
to start a service on that node.

Thanks,
Bill

On Fri, Oct 7, 2011 at 11:38 AM, Speagle, Andy <andy.speagle at wichita.edu>wrote:

>  Hi Team,****
>
> ** **
>
> I?m having an issue with RHCS on RHEL 6.1 ? I have multiple HA-LVM
> resources in my cluster which are being used by two different service
> groups.  I?m having an issue when I try to start the second service group on
> the same cluster node running the first service group.  I get this
> immediately.****
>
> ** **
>
> Local machine trying to enable service:<servicename>...Invalid operation
> for resource****
>
> ** **
>
> However, I can startup both service groups just fine as long as it?s on
> different nodes in the cluster.  I?ve got logging turned up to debug? but I
> can?t seem to get anything meaningful in the logs regarding this issue.***
> *
>
> ** **
>
> Can someone clue me in?****
>
> ** **
>
> Andy Speagle****
>
> System & Storage Administrator****
>
> UCATS - Wichita State University****
>
> ** **
>
> P: 316.978.3869****
>
> C: 316.617.2431****
>
> ** **
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Thanks,
Bill G.
tc3driver at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111007/11fe16ff/attachment.htm>

From mmorgan at dca.net  Fri Oct  7 19:09:13 2011
From: mmorgan at dca.net (Michael Morgan)
Date: Fri, 7 Oct 2011 15:09:13 -0400
Subject: [Linux-cluster] cluster-snmp in EL6 reporting rhcMIBVersion and
	nothing else
Message-ID: <20111007190913.GA27724@staff.dca.net>

Hello,

 I'm having a problem with cluster-snmp output in our SL6.1 clusters with
cluster-snmp-0.16.2-10.el6.x86_64 installed.

[root at node2 ~]# snmpwalk -m REDHAT-CLUSTER-MIB -v2c -cpublic localhost REDHAT-CLUSTER-MIB::redhatCluster
REDHAT-CLUSTER-MIB::rhcMIBVersion.0 = INTEGER: 2

 I don't get any other objects and thus can't monitor the cluster through SNMP.
Our 5.7 clusters work properly with a matching snmpd config. I have the
required "dlmod RedHatCluster /usr/lib64/cluster-snmp/libClusterMonitorSnmp.so"
and have made sure it is included in the view. Am I missing something?
Thanks in advance.

-Mike

--
Michael Morgan
mmorgan at dca.net



From mmorgan at dca.net  Fri Oct  7 19:14:54 2011
From: mmorgan at dca.net (Michael Morgan)
Date: Fri, 7 Oct 2011 15:14:54 -0400
Subject: [Linux-cluster] Multiple HA-LVM Resources
In-Reply-To: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu>
References: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu>
Message-ID: <20111007191454.GB27724@staff.dca.net>

On Fri, Oct 07, 2011 at 06:38:32PM +0000, Speagle, Andy wrote:
>    Local machine trying to enable service:<servicename>...Invalid operation
>    for resource

I ran into a similar problem recently. It turned out that I mistakenly had
exclusive="1" in the service definition which prevents the service from running
on a node that already has any active services. The error is incredibly vague
and it took a while before I realized the cause.

-Mike

--
Michael Morgan
mmorgan at dca.net



From andy.speagle at wichita.edu  Fri Oct  7 19:19:07 2011
From: andy.speagle at wichita.edu (Speagle, Andy)
Date: Fri, 7 Oct 2011 19:19:07 +0000
Subject: [Linux-cluster] Multiple HA-LVM Resources
In-Reply-To: <20111007191454.GB27724@staff.dca.net>
References: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu>
	<20111007191454.GB27724@staff.dca.net>
Message-ID: <188F4C6C277F4843A5E712D5458E82770218D5@mbxsvc-300.ad.wichita.edu>

> I ran into a similar problem recently. It turned out that I mistakenly had
> exclusive="1" in the service definition which prevents the service from
> running on a node that already has any active services. The error is
> incredibly vague and it took a while before I realized the cause.

Ah... that's precisely the problem. I don't even have to look to know that's the issue.

Thanks for loaning me your brain.

-Andy



From tserong at suse.com  Mon Oct 10 01:46:38 2011
From: tserong at suse.com (Tim Serong)
Date: Mon, 10 Oct 2011 12:46:38 +1100
Subject: [Linux-cluster] CFP: High Availability and Distributed Storage
	miniconf at LCA 2012
Message-ID: <4E924E7E.2090507@suse.com>

Hi All,

I'm pleased to announce that we will be hosting a one day High 
Availability and Distributed Storage mini conference on January 16 2012, 
as part of linux.conf.au in Ballarat, Australia.

We would like to invite proposals for presentations to be delivered at 
the miniconf.  Please feel free to forward this CFP to your colleagues 
and other relevant mailing lists.

Suggested topics for presentations include (but are not limited to):

  - Cluster resource management
  - Cluster membership/messaging
  - Clustered filesystems
  - Distributed storage
  - SQL and NoSQL databases
  - Caching layers

The CFP is open until November 6.  Proposals can be submitted at:

   http://tinyurl.com/ha-lca2012-cfp

If you have any questions, please feel free to contact me directly 
(please do not group reply to this announcement).

Note that as this miniconf is part of linux.conf.au, you will need to 
register to attend the main conference 
(http://linux.conf.au/register/prices).  Unfortunately we have no 
sponsorship budget for speakers, so presenting at the miniconf does not 
entitle you to discounted or free registration.  If you need help 
convincing your employer to fund your travel, please see 
http://linux.conf.au/register/business-case - this is *the* Linux 
conference to be at in the southern hemisphere!

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tserong at suse.com



From ext.thales.jean-daniel.bonnetot at sncf.fr  Wed Oct 12 14:17:41 2011
From: ext.thales.jean-daniel.bonnetot at sncf.fr (BONNETOT Jean-Daniel (EXT THALES))
Date: Wed, 12 Oct 2011 16:17:41 +0200
Subject: [Linux-cluster] NTP sync cause CNAM shutdown
Message-ID: <C088D3516432C643AC828162A5164A7F0ACBCD8D@se3lmwbibaw.COMMUN.AD.SNCF.FR>

Hi,

I post previous email asking what was wrong in my two nodes
cluster.conf. I think I found it and have some question.

The problem was two nodes boot, join then cman shutdown with :
Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node
s64lmwbig3b because it has rejoined the cluster with existing state
Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1
because we rejoined the cluster without a full restart

Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours,
my timzone is GMT + 2).

My questions are:
Which date do you set up in your bios (GMT, your time zone)?
Do you use ntpd ? all documentations say to use it.
What are best practices about ntp and RHCS?

Jean-Daniel BONNETOT

-------
Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire.
-------
This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it. 



From sdake at redhat.com  Wed Oct 12 15:43:05 2011
From: sdake at redhat.com (Steven Dake)
Date: Wed, 12 Oct 2011 08:43:05 -0700
Subject: [Linux-cluster] NTP sync cause CNAM shutdown
In-Reply-To: <C088D3516432C643AC828162A5164A7F0ACBCD8D@se3lmwbibaw.COMMUN.AD.SNCF.FR>
References: <C088D3516432C643AC828162A5164A7F0ACBCD8D@se3lmwbibaw.COMMUN.AD.SNCF.FR>
Message-ID: <4E95B589.2060302@redhat.com>

On 10/12/2011 07:17 AM, BONNETOT Jean-Daniel (EXT THALES) wrote:
> Hi,
> 
> I post previous email asking what was wrong in my two nodes
> cluster.conf. I think I found it and have some question.
> 
> The problem was two nodes boot, join then cman shutdown with :
> Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node
> s64lmwbig3b because it has rejoined the cluster with existing state
> Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1
> because we rejoined the cluster without a full restart
> 
> Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours,
> my timzone is GMT + 2).
> 
> My questions are:
> Which date do you set up in your bios (GMT, your time zone)?
> Do you use ntpd ? all documentations say to use it.
> What are best practices about ntp and RHCS?
> 
> Jean-Daniel BONNETOT
> 
> -------
> Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire.
> -------
> This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it. 
> 
https://bugzilla.redhat.com/show_bug.cgi?id=738468

RHEL6 does not have this problem.

Regards
-steve

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From alvaro.fernandez at sivsa.com  Wed Oct 12 15:52:11 2011
From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez)
Date: Wed, 12 Oct 2011 17:52:11 +0200
Subject: [Linux-cluster] NTP sync cause CNAM shutdown
References: <C088D3516432C643AC828162A5164A7F0ACBCD8D@se3lmwbibaw.COMMUN.AD.SNCF.FR>
Message-ID: <607D6181D9919041BE792D70EF2AEC4801DDD3F4@LIMENS.sivsa.int>

Jean,

I too suffered the same issue, opened a case with support, etc. The best option running ntpd and RHCS are:

-First, start the cman, rgmanager, etc. (I mean, all the RHCS daemons) always after ntpd startup. In RHEL5 at least the default is the other way around. 

You can do that if you disable all RHCS daemons (via chkconfig off) from automatic startup, and then, starting them explicitly via your rc.local init script, as the last init sequence action (ie, after the network, basic systems, and most importantly after ntpd initially adjusted the clock, via it's "ntpdate" call.

Be aware that if you do the above, you must explicitly (manually) stop them if you need to shutdown the cluster or the nodes, as with this hack, the init scripts of cman, rgmanager, etc , won't run for the "kill"/shutdown sequence.

-Start the ntpd using the "slew" mode ( -x startup flag), in the configuration file. Running it in slew mode makes ntpd adjust the time over a large time span, enough to assure that CMAN internal timings won't get messed.

Using that hack was Ok for me, no more node evictions or unexpected problems since.

There is a FAQ and best practices document in Redhat Network for NTPD and RHCS, updated few months ago as I recall. Just search for it in the Redhat Network website (sorry, I don't have the link for the DOC at the moment)

regards,


?lvaro Fern?ndez 
 Departamento de Sistemas_

-------
Hi,

I post previous email asking what was wrong in my two nodes cluster.conf. I think I found it and have some question.

The problem was two nodes boot, join then cman shutdown with :
Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart

Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours, my timzone is GMT + 2).

My questions are:
Which date do you set up in your bios (GMT, your time zone)?
Do you use ntpd ? all documentations say to use it.
What are best practices about ntp and RHCS?

Jean-Daniel BONNETOT



From ext.thales.jean-daniel.bonnetot at sncf.fr  Thu Oct 13 16:14:45 2011
From: ext.thales.jean-daniel.bonnetot at sncf.fr (BONNETOT Jean-Daniel (EXT THALES))
Date: Thu, 13 Oct 2011 18:14:45 +0200
Subject: [Linux-cluster] NTP sync cause CNAM shutdown
In-Reply-To: <607D6181D9919041BE792D70EF2AEC4801DDD3F4@LIMENS.sivsa.int>
References: <C088D3516432C643AC828162A5164A7F0ACBCD8D@se3lmwbibaw.COMMUN.AD.SNCF.FR>
	<607D6181D9919041BE792D70EF2AEC4801DDD3F4@LIMENS.sivsa.int>
Message-ID: <C088D3516432C643AC828162A5164A7F0ACF3793@se3lmwbibaw.COMMUN.AD.SNCF.FR>

Thanks for your answer, it help me to find my way ;)
I saw "-x" option fot ntpd, but it's not the only things to apply.

First, I had to solve my timezone problem. 
-> Hwclock set on GMT int BIOS (UTC if you prefer)
-> timezone --utc Europe/Paris in kickstart, or set ZONE="Europe/Paris" and UTC=true in /etc/sysconfig/clock
This two settings make my time boot kernel in the right place, kernel get time from hwclock and know that it has to apply my timezone over it.

Then, I add "-x" option in /etc/syscinfig/ntp to say ntpd to not make big step.

As a result, boot time before:
Oct 13 12:02:20 s64lmwbig3b ntpd[7996]: ntpd 4.2.2p1 at 1.1570-o Thu Nov 26 11:34:34 UTC 2009 (1)
Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: precision = 1.000 usec
Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: Listening on interface wildcard, 0.0.0.0#123 Disabled
...
Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: Listening on interface bond0, 10.151.231.215#123 Enabled <== 2H TIME JUMP
Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] The token was lost in the OPERATIONAL state.
Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] entering GATHER state from 2.
=> CMAN crashed

Boot time now:
Oct 13 16:10:08 s64lmwbig3b clvmd: Cluster LVM daemon started - connected to CMAN
...
Oct 13 16:10:27 s64lmwbig3b ntpdate[7971]: step time server 10.151.156.87 offset 1.306150 sec <== 1S TIME JUMP
Oct 13 16:10:29 s64lmwbig3b ntpd[7975]: ntpd 4.2.2p1 at 1.1570-o Thu Nov 26 11:34:34 UTC 2009 (1)
Oct 13 16:10:29 s64lmwbig3b ntpd[7976]: precision = 1.000 usec
...
Oct 13 16:10:40 s64lmwbig3b modclusterd: startup succeeded
=> CMAN up and running

I looked for the FAQ you talked about but nothing, if you can post it when you have time ;)

Jean-Daniel BONNETOT

-----Message d'origine-----
De?: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Alvaro Jose Fernandez
Envoy??: mercredi 12 octobre 2011 17:52
??: linux clustering
Objet?: Re: [Linux-cluster] NTP sync cause CNAM shutdown

Jean,

I too suffered the same issue, opened a case with support, etc. The best option running ntpd and RHCS are:

-First, start the cman, rgmanager, etc. (I mean, all the RHCS daemons) always after ntpd startup. In RHEL5 at least the default is the other way around. 

You can do that if you disable all RHCS daemons (via chkconfig off) from automatic startup, and then, starting them explicitly via your rc.local init script, as the last init sequence action (ie, after the network, basic systems, and most importantly after ntpd initially adjusted the clock, via it's "ntpdate" call.

Be aware that if you do the above, you must explicitly (manually) stop them if you need to shutdown the cluster or the nodes, as with this hack, the init scripts of cman, rgmanager, etc , won't run for the "kill"/shutdown sequence.

-Start the ntpd using the "slew" mode ( -x startup flag), in the configuration file. Running it in slew mode makes ntpd adjust the time over a large time span, enough to assure that CMAN internal timings won't get messed.

Using that hack was Ok for me, no more node evictions or unexpected problems since.

There is a FAQ and best practices document in Redhat Network for NTPD and RHCS, updated few months ago as I recall. Just search for it in the Redhat Network website (sorry, I don't have the link for the DOC at the moment)

regards,


?lvaro Fern?ndez 
 Departamento de Sistemas_

-------
Hi,

I post previous email asking what was wrong in my two nodes cluster.conf. I think I found it and have some question.

The problem was two nodes boot, join then cman shutdown with :
Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart

Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours, my timzone is GMT + 2).

My questions are:
Which date do you set up in your bios (GMT, your time zone)?
Do you use ntpd ? all documentations say to use it.
What are best practices about ntp and RHCS?

Jean-Daniel BONNETOT

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------
Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire.
-------
This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it. 




From alvaro.fernandez at sivsa.com  Thu Oct 13 16:58:42 2011
From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez)
Date: Thu, 13 Oct 2011 18:58:42 +0200
Subject: [Linux-cluster] NTP sync cause CNAM shutdown
References: <C088D3516432C643AC828162A5164A7F0ACBCD8D@se3lmwbibaw.COMMUN.AD.SNCF.FR><607D6181D9919041BE792D70EF2AEC4801DDD3F4@LIMENS.sivsa.int>
	<C088D3516432C643AC828162A5164A7F0ACF3793@se3lmwbibaw.COMMUN.AD.SNCF.FR>
Message-ID: <607D6181D9919041BE792D70EF2AEC4801DDD4A9@LIMENS.sivsa.int>

Hi Jean,

The DOC is https://access.redhat.com/kb/docs/DOC-42471 .

But, at Steven Drake said in a previous email, if you *can* upgrade to RHEL6, sure that would be the best option (I just cannot upgrade my customer, he will die on RHEL 5.x). 

In RHEL6, the cluster daemons are different and use other API, unlike openais.

Best regards.

?lvaro Fern?ndez 
 Departamento de Sistemas_
  
________________________________

SIVSA, Soluciones Inform?ticas S.A. 
Arenal n? 18 ? 3? Planta ? 36201 ? Vigo 
Tel?fono: (+34)  986 092 100  
Fax: (+34)  986 092 219
e-mail: alvaro.fernandez at sivsa.com
www.sivsa.com
Espa?a_
 
******************************  ADVERTENCIA LEGAL  ****************************
En cumplimiento de la Ley de Servicios de la Sociedad de la Informaci?n y de Comercio Electr?nico (LSSI-CE), y de la vigente Ley Org?nica 15/1999 de 13 de Diciembre de Protecci?n de Datos de Car?cter Personal (LOPD), le informamos que su direcci?n de correo electr?nico figura en este momento en la base de datos  de SIVSA, Soluciones Inform?ticas, S.A,  con domicilio en la calle Areal n? 18 - 3? planta, Vigo (Pontevedra),  que, como responsable del fichero, le garantiza el ejercicio de sus derechos de acceso, rectificaci?n, cancelaci?n y oposici?n de los datos facilitados, en los t?rminos y condiciones previstos en la propia LOPD, mediante una comunicaci?n por escrito dirigida a la direcci?n indicada, a la atenci?n del "Departamento de Administraci?n".  De no ser as?, se entiende que usted consiente expresamente que sus datos puedan ser utilizados por SIVSA con fines publicitarios, promocionales y de marketing, en relaci?n con sus propios productos y servicios. 

Este mensaje va dirigido, de manera exclusiva, a su destinatario y contiene informaci?n confidencial y sujeta al secreto profesional, cuya divulgaci?n no est? permitida por la ley. En caso de haber recibido este mensaje por error, le rogamos que, de forma inmediata, nos lo comunique mediante correo electr?nico remitido a nuestra atenci?n o a trav?s del tel?fono (+ 34) 986 092 100 y proceda a su eliminaci?n, as? como a la de cualquier documento adjunto al mismo. Asimismo, le comunicamos que la distribuci?n, copia o utilizaci?n de este mensaje, o de cualquier documento adjunto al mismo, cualquiera que fuera su finalidad, est?n prohibidas por la ley."


-----Mensaje original-----
De: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] En nombre de BONNETOT Jean-Daniel (EXT THALES)
Enviado el: jueves, 13 de octubre de 2011 18:15
Para: linux clustering
Asunto: Re: [Linux-cluster] NTP sync cause CNAM shutdown

Thanks for your answer, it help me to find my way ;) I saw "-x" option fot ntpd, but it's not the only things to apply.

First, I had to solve my timezone problem. 
-> Hwclock set on GMT int BIOS (UTC if you prefer) timezone --utc 
-> Europe/Paris in kickstart, or set ZONE="Europe/Paris" and UTC=true in 
-> /etc/sysconfig/clock
This two settings make my time boot kernel in the right place, kernel get time from hwclock and know that it has to apply my timezone over it.

Then, I add "-x" option in /etc/syscinfig/ntp to say ntpd to not make big step.

As a result, boot time before:
Oct 13 12:02:20 s64lmwbig3b ntpd[7996]: ntpd 4.2.2p1 at 1.1570-o Thu Nov 26 11:34:34 UTC 2009 (1) Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: precision = 1.000 usec Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: Listening on interface wildcard, 0.0.0.0#123 Disabled ...
Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: Listening on interface bond0, 10.151.231.215#123 Enabled <== 2H TIME JUMP Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] The token was lost in the OPERATIONAL state.
Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] entering GATHER state from 2.
=> CMAN crashed

Boot time now:
Oct 13 16:10:08 s64lmwbig3b clvmd: Cluster LVM daemon started - connected to CMAN ...
Oct 13 16:10:27 s64lmwbig3b ntpdate[7971]: step time server 10.151.156.87 offset 1.306150 sec <== 1S TIME JUMP Oct 13 16:10:29 s64lmwbig3b ntpd[7975]: ntpd 4.2.2p1 at 1.1570-o Thu Nov 26 11:34:34 UTC 2009 (1) Oct 13 16:10:29 s64lmwbig3b ntpd[7976]: precision = 1.000 usec ...
Oct 13 16:10:40 s64lmwbig3b modclusterd: startup succeeded => CMAN up and running

I looked for the FAQ you talked about but nothing, if you can post it when you have time ;)

Jean-Daniel BONNETOT

-----Message d'origine-----
De?: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Alvaro Jose Fernandez Envoy??: mercredi 12 octobre 2011 17:52 ??: linux clustering Objet?: Re: [Linux-cluster] NTP sync cause CNAM shutdown

Jean,

I too suffered the same issue, opened a case with support, etc. The best option running ntpd and RHCS are:

-First, start the cman, rgmanager, etc. (I mean, all the RHCS daemons) always after ntpd startup. In RHEL5 at least the default is the other way around. 

You can do that if you disable all RHCS daemons (via chkconfig off) from automatic startup, and then, starting them explicitly via your rc.local init script, as the last init sequence action (ie, after the network, basic systems, and most importantly after ntpd initially adjusted the clock, via it's "ntpdate" call.

Be aware that if you do the above, you must explicitly (manually) stop them if you need to shutdown the cluster or the nodes, as with this hack, the init scripts of cman, rgmanager, etc , won't run for the "kill"/shutdown sequence.

-Start the ntpd using the "slew" mode ( -x startup flag), in the configuration file. Running it in slew mode makes ntpd adjust the time over a large time span, enough to assure that CMAN internal timings won't get messed.

Using that hack was Ok for me, no more node evictions or unexpected problems since.

There is a FAQ and best practices document in Redhat Network for NTPD and RHCS, updated few months ago as I recall. Just search for it in the Redhat Network website (sorry, I don't have the link for the DOC at the moment)

regards,


?lvaro Fern?ndez
 Departamento de Sistemas_

-------
Hi,

I post previous email asking what was wrong in my two nodes cluster.conf. I think I found it and have some question.

The problem was two nodes boot, join then cman shutdown with :
Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart

Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours, my timzone is GMT + 2).

My questions are:
Which date do you set up in your bios (GMT, your time zone)?
Do you use ntpd ? all documentations say to use it.
What are best practices about ntp and RHCS?

Jean-Daniel BONNETOT

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------
Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire.
-------
This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it. 


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From daniele at retaggio.net  Thu Oct 13 22:17:58 2011
From: daniele at retaggio.net (Daniele Palumbo)
Date: Fri, 14 Oct 2011 00:17:58 +0200
Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open
	attribute
Message-ID: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>

hi,

first of all, sorry for the long subject...
i do not know how to explain myself, so any faq/manual will be of course appreciated.

now,
i have a test cluster, gentoo based.
cluster 3.1.7, corosync 1.4.2, lvm2 2.02.88.
clvm is built with following flags:
./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --enable-readline --disable-selinux --enable-pkgconfig --with-confdir=/etc --sbindir=/sbin --with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64 --enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d/ --enable-dmeventd --enable-cmdlib --enable-applib --enable-fsadm --enable-static_link --with-mirrors=internal --with-snapshots=internal --with-lvm1=internal --with-cluster=internal --enable-cmirrord --with-clvmd=cman --with-pool=internal --with-dmeventd-path=/sbin/dmeventd CLDFLAGS=-Wl,-O1 -Wl,--as-needed

cluster.conf:
<cluster name="CEMCluster" config_version="2">
       <clusternodes>
               <clusternode name="pvsrv01" nodeid="1">
               </clusternode>
               <clusternode name="pvsrv07" nodeid="2">
               </clusternode>
               <clusternode name="pvsrv08" nodeid="3">
               </clusternode>
       </clusternodes>
</cluster>


now, i have the cluster working, i can see and create volumes and so on.

but when i mount a volume on node pvsrv07, i cannot see it as open in pvsrv08.
also, clvmd (running in debug mode) does not show me anything when i mount or unmount volumes.

any hints?

thanks
Daniele



From fdinitto at redhat.com  Fri Oct 14 03:56:21 2011
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Fri, 14 Oct 2011 05:56:21 +0200
Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see
 open attribute
In-Reply-To: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>
References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>
Message-ID: <4E97B2E5.5000509@redhat.com>

On 10/14/2011 12:17 AM, Daniele Palumbo wrote:
> hi,
> 
> first of all, sorry for the long subject...
> i do not know how to explain myself, so any faq/manual will be of course appreciated.
> 
> now,
> i have a test cluster, gentoo based.
> cluster 3.1.7, corosync 1.4.2, lvm2 2.02.88.
> clvm is built with following flags:
> ./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --enable-readline --disable-selinux --enable-pkgconfig --with-confdir=/etc --sbindir=/sbin --with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64 --enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d/ --enable-dmeventd --enable-cmdlib --enable-applib --enable-fsadm --enable-static_link --with-mirrors=internal --with-snapshots=internal --with-lvm1=internal --with-cluster=internal --enable-cmirrord --with-clvmd=cman --with-pool=internal --with-dmeventd-path=/sbin/dmeventd CLDFLAGS=-Wl,-O1 -Wl,--as-needed
> 
> cluster.conf:
> <cluster name="CEMCluster" config_version="2">
>        <clusternodes>
>                <clusternode name="pvsrv01" nodeid="1">
>                </clusternode>
>                <clusternode name="pvsrv07" nodeid="2">
>                </clusternode>
>                <clusternode name="pvsrv08" nodeid="3">
>                </clusternode>
>        </clusternodes>
> </cluster>
> 
> 
> now, i have the cluster working, i can see and create volumes and so on.
> 
> but when i mount a volume on node pvsrv07, i cannot see it as open in pvsrv08.
> also, clvmd (running in debug mode) does not show me anything when i mount or unmount volumes.
> 
> any hints?

What kind of shared storage are you using? Filesystem on top of lvm?

Fabio



From Sagar.Shimpi at tieto.com  Fri Oct 14 08:17:04 2011
From: Sagar.Shimpi at tieto.com (Sagar.Shimpi at tieto.com)
Date: Fri, 14 Oct 2011 11:17:04 +0300
Subject: [Linux-cluster] Apache active-active cluster
Message-ID: <F9FFD5DEC2163F498E703E7F9CBB3A350F756F9607@EXMB01.eu.tieto.com>

Hi,

Can I configure Apache Active -Active cluster using Redhat Cluster Suit in RHEL6?

If yes, can someone please pass me the link for the same.


Regards,

Sagar Shimpi, Senior Technical Specialist, OSS Labs

Tieto
email sagar.shimpi at tieto.com<mailto:aniruddha.khadkikar at tieto.com>,
Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77,
MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in

TIETO. Knowledge. Passion. Results.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111014/5a41ae74/attachment.htm>

From daniele at retaggio.net  Fri Oct 14 09:07:00 2011
From: daniele at retaggio.net (Daniele Palumbo)
Date: Fri, 14 Oct 2011 11:07:00 +0200
Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see
	open attribute
In-Reply-To: <4E97B2E5.5000509@redhat.com>
References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>
	<4E97B2E5.5000509@redhat.com>
Message-ID: <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net>

Il giorno 14/ott/2011, alle ore 05.56, Fabio M. Di Nitto ha scritto:
> What kind of shared storage are you using? Filesystem on top of lvm?

i am using 2 local disk, exported via vblade (i have an aoe storage, this is the test environment).

on top of that right does not matter for me which fs, cause it will be used also by xen hvm (let say also ntfs).
anyway, tested with ext3 and reiserfs.
(but i think i will get a RTFM as you write down your questions ;( )

pvsrv07 ~ # pvs
  PV               VG        Fmt  Attr PSize PFree
  /dev/etherd/e8.0 vgPvSrv08 lvm2 a--  3.90g 3.90g
  /dev/sdb1        vgPvSrv07 lvm2 a--  3.90g 3.80g
pvsrv07 ~ # vgs
  VG        #PV #LV #SN Attr   VSize VFree
  vgPvSrv07   1   1   0 wz--nc 3.90g 3.80g
  vgPvSrv08   1   0   0 wz--nc 3.90g 3.90g
pvsrv07 ~ # lvs 
  LV     VG        Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  test07 vgPvSrv07 -wi-ao 100.00m                                      
pvsrv07 ~ #
pvsrv07 ~ # mount|grep test07
/dev/mapper/vgPvSrv07-test07 on /mnt type reiserfs (rw)
pvsrv07 ~ #

pvsrv08 ~ # pvs
  PV               VG        Fmt  Attr PSize PFree
  /dev/etherd/e7.0 vgPvSrv07 lvm2 a--  3.90g 3.80g
  /dev/sdb1        vgPvSrv08 lvm2 a--  3.90g 3.90g
pvsrv08 ~ # vgs
  VG        #PV #LV #SN Attr   VSize VFree
  vgPvSrv07   1   1   0 wz--nc 3.90g 3.80g
  vgPvSrv08   1   0   0 wz--nc 3.90g 3.90g
pvsrv08 ~ # lvs  
  LV     VG        Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  test07 vgPvSrv07 -wi-a- 100.00m                                      
pvsrv08 ~ # 

the tricky part is in lvs, i can see an open volume in pvsrv07 and not in pvsrv08.

pvsrv07 ~ # cman_tool status
Version: 6.2.0
Config Version: 2
Cluster Name: CEMCluster
Cluster Id: 7604
Cluster Member: Yes
Cluster Generation: 240
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 8
Flags: 
Ports Bound: 0 11  
Node name: pvsrv07
Node ID: 2
Multicast addresses: 239.192.29.209 
Node addresses: 192.168.1.107 
pvsrv07 ~ # 

pvsrv08 ~ # cman_tool status
Version: 6.2.0
Config Version: 2
Cluster Name: CEMCluster
Cluster Id: 7604
Cluster Member: Yes
Cluster Generation: 240
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 8
Flags: 
Ports Bound: 0 11  
Node name: pvsrv08
Node ID: 3
Multicast addresses: 239.192.29.209 
Node addresses: 192.168.1.108 
pvsrv08 ~ # 




From fdinitto at redhat.com  Fri Oct 14 12:01:35 2011
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Fri, 14 Oct 2011 14:01:35 +0200
Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see
 open attribute
In-Reply-To: <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net>
References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>
	<4E97B2E5.5000509@redhat.com>
	<1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net>
Message-ID: <4E98249F.2060302@redhat.com>

On 10/14/2011 11:07 AM, Daniele Palumbo wrote:
> Il giorno 14/ott/2011, alle ore 05.56, Fabio M. Di Nitto ha scritto:
>> What kind of shared storage are you using? Filesystem on top of lvm?
> 
> i am using 2 local disk, exported via vblade (i have an aoe storage, this is the test environment).
> 
> on top of that right does not matter for me which fs, cause it will be used also by xen hvm (let say also ntfs).
> anyway, tested with ext3 and reiserfs.
> (but i think i will get a RTFM as you write down your questions ;( )

Well yes you have a lot of RTFM to do.

This setup looks very wrong and there is a lot of work on the storage
side you need to do.

I am not even sure where to start, but a few simple points:

1) you need to use proper shared storage. AOE is fine, but use the real
one. not 2 local disks exported, because that's never going to work.

2) if you want to use clvmd, all nodes *must* see the same storage

3) you need a cluster filesystem such as GFS2. ext3 and reiserfs are not
cluster fs. If you mounted any of those on both nodes, I strongly
recommend recreateing the fs and restore data from a backup.

Fabio



From daniele at retaggio.net  Fri Oct 14 12:38:25 2011
From: daniele at retaggio.net (Daniele Palumbo)
Date: Fri, 14 Oct 2011 14:38:25 +0200
Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see
	open attribute
In-Reply-To: <4E98249F.2060302@redhat.com>
References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>
	<4E97B2E5.5000509@redhat.com>
	<1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net>
	<4E98249F.2060302@redhat.com>
Message-ID: <714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net>

Il giorno 14/ott/2011, alle ore 14.01, Fabio M. Di Nitto ha scritto:
> This setup looks very wrong and there is a lot of work on the storage
> side you need to do.
> 
> I am not even sure where to start, but a few simple points:
> 
> 1) you need to use proper shared storage. AOE is fine, but use the real
> one. not 2 local disks exported, because that's never going to work.

why not?

> 2) if you want to use clvmd, all nodes *must* see the same storage

that is :)
different name (in pvs) but same storage.
anyway, i will add a third machine and setup the storage over there, then i will setup 3 machines in cluster that see the same device name.

can that help?

> 3) you need a cluster filesystem such as GFS2. ext3 and reiserfs are not
> cluster fs. If you mounted any of those on both nodes, I strongly
> recommend recreateing the fs and restore data from a backup.

yes but that could be a problem... cause i need to have (at least) a volume for each virtual machine.
or am i missing something like do a cluster filesystem and on top of that re-create the vg?

but anyway, i just need to see the filesystem as open...
do i need a clustered one to see the correct open attribute with lvs?
how does the open attribute is set?

thanks a lot!
d.



From fdinitto at redhat.com  Fri Oct 14 13:08:37 2011
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Fri, 14 Oct 2011 15:08:37 +0200
Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see
 open attribute
In-Reply-To: <714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net>
References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>
	<4E97B2E5.5000509@redhat.com>
	<1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net>
	<4E98249F.2060302@redhat.com>
	<714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net>
Message-ID: <4E983455.1000205@redhat.com>

On 10/14/2011 02:38 PM, Daniele Palumbo wrote:
> Il giorno 14/ott/2011, alle ore 14.01, Fabio M. Di Nitto ha scritto:
>> This setup looks very wrong and there is a lot of work on the storage
>> side you need to do.
>>
>> I am not even sure where to start, but a few simple points:
>>
>> 1) you need to use proper shared storage. AOE is fine, but use the real
>> one. not 2 local disks exported, because that's never going to work.
> 
> why not?

this is the kind of answer that boils down to RTFM.....

more seriously, the technical explanation is very long and complex.

> 
>> 2) if you want to use clvmd, all nodes *must* see the same storage
> 
> that is :)
> different name (in pvs) but same storage.
> anyway, i will add a third machine and setup the storage over there, then i will setup 3 machines in cluster that see the same device name.
> 
> can that help?

Yes, keep the 3rd machine out of the cluster and use it to export an AOE
device for testing.

> 
>> 3) you need a cluster filesystem such as GFS2. ext3 and reiserfs are not
>> cluster fs. If you mounted any of those on both nodes, I strongly
>> recommend recreateing the fs and restore data from a backup.
> 
> yes but that could be a problem... cause i need to have (at least) a volume for each virtual machine.
> or am i missing something like do a cluster filesystem and on top of that re-create the vg?

You were talking about mounting reiserfs and such... as long as you make
sure that the fs is mounted on one node at a time then it's fine.

No you don't need to create a vg on top of gfs2. gfs2 is not so
different from any other filesystems, except that you can mount it and
use it simultaneously on all nodes read/write.

> 
> but anyway, i just need to see the filesystem as open...
> do i need a clustered one to see the correct open attribute with lvs?
> how does the open attribute is set?

Ok can you please explain better what you mean here??

clustered lvs are available on all nodes at the same time. You probably
see a metadata sync issue due to "interesting" storage.

Fabio



From daniele at retaggio.net  Fri Oct 14 15:30:42 2011
From: daniele at retaggio.net (Daniele Palumbo)
Date: Fri, 14 Oct 2011 17:30:42 +0200
Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see
	open attribute
In-Reply-To: <4E983455.1000205@redhat.com>
References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>
	<4E97B2E5.5000509@redhat.com>
	<1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net>
	<4E98249F.2060302@redhat.com>
	<714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net>
	<4E983455.1000205@redhat.com>
Message-ID: <C97BB1FB-F294-4C48-B830-C2D75258641E@retaggio.net>

Il giorno 14/ott/2011, alle ore 15.08, Fabio M. Di Nitto ha scritto:
> You were talking about mounting reiserfs and such... as long as you make
> sure that the fs is mounted on one node at a time then it's fine.

that is, i just need to sync open devices so i can see on server #2 if the device is mounted in server #1.

> Ok can you please explain better what you mean here??

pvsrv07 ~ # lvs 
 LV     VG        Attr   LSize   Origin Snap%  Move Log Copy%  Convert
 test07 vgPvSrv07 -wi-ao 100.00m                                      
pvsrv07 ~ #

pvsrv08 ~ # lvs  
 LV     VG        Attr   LSize   Origin Snap%  Move Log Copy%  Convert
 test07 vgPvSrv07 -wi-a- 100.00m                                      
pvsrv08 ~ #

as you see in one server the device is *marked* as open, in the other is not.

i need just to see open in both server,
if it is marked as open, xen refuse to boot vm with that device open, and i will not care about the filesystem that is contained in lvm.

> clustered lvs are available on all nodes at the same time. You probably
> see a metadata sync issue due to "interesting" storage.

ok so i will try with one storage and i will report back.

bye
d.



From mjh2000 at gmail.com  Fri Oct 14 20:58:01 2011
From: mjh2000 at gmail.com (Joey L)
Date: Fri, 14 Oct 2011 16:58:01 -0400
Subject: [Linux-cluster] redhat cluster running on debian 6.
Message-ID: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>

I am new to redhat cluster and i am having some issues.

1. I am looking for a simple cluster.conf that I can use for :
A. failing over an ip address.
B. failing over apache.
C. failing over mysql
D. failing over asterisk.
E. failing over a nfs mount.

I have created the following cluster.conf using system-config-cluster :

cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster config_version="2" name="picky_cluster">
	<fence_daemon post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="192.168.2.110" nodeid="1" votes="1">
			<fence/>
		</clusternode>
		<clusternode name="192.168.2.111" nodeid="2" votes="1">
			<fence/>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices/>
	<rm>
		<failoverdomains/>
		<resources>
			<ip address="192.168.2.112" monitor_link="1"/>
		</resources>
	</rm>
</cluster>


And I am getting the following errors :

/etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity
error : Invalid attribute name for element clusternode
/etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity
error : Element clusternodes has extra content: clusternode
/etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity
error : Type ID doesn't allow value '192.168.2.110'
Relax-NG validity error : Element clusternode failed to validate attributes
/etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error
: Invalid sequence in interleave
/etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error
: Element cluster failed to validate content
/etc/cluster/cluster.conf fails to validate

Also get this message when trying to open cluster.conf with
system-config-cluster:

Because this node is not currently part of a cluster, the management
tab for this application is not available.

Can anyone give me a pointer as what to do ?
thanks



From mjh2000 at gmail.com  Fri Oct 14 21:05:43 2011
From: mjh2000 at gmail.com (Joey L)
Date: Fri, 14 Oct 2011 17:05:43 -0400
Subject: [Linux-cluster] redhat cluster running on debian 6.
In-Reply-To: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>
References: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>
Message-ID: <CAK3ER7sNjxdgKr-iJGA7n6yvDMY5Uqh4caQrUEohK8vmHi0i4g@mail.gmail.com>

On Fri, Oct 14, 2011 at 4:58 PM, Joey L <mjh2000 at gmail.com> wrote:
> I am new to redhat cluster and i am having some issues.
>
> 1. I am looking for a simple cluster.conf that I can use for :
> A. failing over an ip address.
> B. failing over apache.
> C. failing over mysql
> D. failing over asterisk.
> E. failing over a nfs mount.
>
> I have created the following cluster.conf using system-config-cluster :
>
> cat /etc/cluster/cluster.conf
> <?xml version="1.0" ?>
> <cluster config_version="2" name="picky_cluster">
> ? ? ? ?<fence_daemon post_fail_delay="0" post_join_delay="3"/>
> ? ? ? ?<clusternodes>
> ? ? ? ? ? ? ? ?<clusternode name="192.168.2.110" nodeid="1" votes="1">
> ? ? ? ? ? ? ? ? ? ? ? ?<fence/>
> ? ? ? ? ? ? ? ?</clusternode>
> ? ? ? ? ? ? ? ?<clusternode name="192.168.2.111" nodeid="2" votes="1">
> ? ? ? ? ? ? ? ? ? ? ? ?<fence/>
> ? ? ? ? ? ? ? ?</clusternode>
> ? ? ? ?</clusternodes>
> ? ? ? ?<cman expected_votes="1" two_node="1"/>
> ? ? ? ?<fencedevices/>
> ? ? ? ?<rm>
> ? ? ? ? ? ? ? ?<failoverdomains/>
> ? ? ? ? ? ? ? ?<resources>
> ? ? ? ? ? ? ? ? ? ? ? ?<ip address="192.168.2.112" monitor_link="1"/>
> ? ? ? ? ? ? ? ?</resources>
> ? ? ? ?</rm>
> </cluster>
>
>
> And I am getting the following errors :
>
> /etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity
> error : Invalid attribute name for element clusternode
> /etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity
> error : Element clusternodes has extra content: clusternode
> /etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity
> error : Type ID doesn't allow value '192.168.2.110'
> Relax-NG validity error : Element clusternode failed to validate attributes
> /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error
> : Invalid sequence in interleave
> /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error
> : Element cluster failed to validate content
> /etc/cluster/cluster.conf fails to validate
>
> Also get this message when trying to open cluster.conf with
> system-config-cluster:
>
> Because this node is not currently part of a cluster, the management
> tab for this application is not available.
>
> Can anyone give me a pointer as what to do ?
> thanks
>
I have updated my cluster.conf with the following config - can anyone
tell me if this is correct ?
thanks

cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster config_version="5" name="picky_cluster">
	<fence_daemon post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="192.168.2.110" nodeid="1" votes="1">
			<fence/>
		</clusternode>
		<clusternode name="192.168.2.111" nodeid="2" votes="1">
			<fence/>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices/>
	<rm>
		<failoverdomains>
			<failoverdomain name="fail1" ordered="0" restricted="0">
				<failoverdomainnode name="192.168.2.110" priority="1"/>
				<failoverdomainnode name="192.168.2.111" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="192.168.2.112" monitor_link="1"/>
			<apache config_file="/etc/apache2/apache2.conf" httpd_options=""
name="debby" server_root="/var/www/html" shutdown_wait=""/>
		</resources>
		<service autostart="1" domain="fail1" name="web">
			<apache config_file="/etc/apache2/apache2.conf" httpd_options=""
name="deb1" server_root="/var/www/html" shutdown_wait=""/>
			<apache config_file="/etc/apache2/apache2.conf" httpd_options=""
name="deb2" server_root="/var/www/html" shutdown_wait=""/>
			<ip ref="192.168.2.112"/>
			<apache ref="debby"/>
		</service>
	</rm>
</cluster>



From linux at alteeve.com  Fri Oct 14 21:12:46 2011
From: linux at alteeve.com (Digimer)
Date: Fri, 14 Oct 2011 17:12:46 -0400
Subject: [Linux-cluster] redhat cluster running on debian 6.
In-Reply-To: <CAK3ER7sNjxdgKr-iJGA7n6yvDMY5Uqh4caQrUEohK8vmHi0i4g@mail.gmail.com>
References: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>
	<CAK3ER7sNjxdgKr-iJGA7n6yvDMY5Uqh4caQrUEohK8vmHi0i4g@mail.gmail.com>
Message-ID: <4E98A5CE.9050706@alteeve.com>

On 10/14/2011 05:05 PM, Joey L wrote:
> <?xml version="1.0" ?>
> <cluster config_version="5" name="picky_cluster">
> 	<fence_daemon post_fail_delay="0" post_join_delay="3"/>
> 	<clusternodes>
> 		<clusternode name="192.168.2.110" nodeid="1" votes="1">
> 			<fence/>
> 		</clusternode>
> 		<clusternode name="192.168.2.111" nodeid="2" votes="1">
> 			<fence/>
> 		</clusternode>
> 	</clusternodes>
> 	<cman expected_votes="1" two_node="1"/>
> 	<fencedevices/>
> 	<rm>
> 		<failoverdomains>
> 			<failoverdomain name="fail1" ordered="0" restricted="0">
> 				<failoverdomainnode name="192.168.2.110" priority="1"/>
> 				<failoverdomainnode name="192.168.2.111" priority="1"/>
> 			</failoverdomain>
> 		</failoverdomains>
> 		<resources>
> 			<ip address="192.168.2.112" monitor_link="1"/>
> 			<apache config_file="/etc/apache2/apache2.conf" httpd_options=""
> name="debby" server_root="/var/www/html" shutdown_wait=""/>
> 		</resources>
> 		<service autostart="1" domain="fail1" name="web">
> 			<apache config_file="/etc/apache2/apache2.conf" httpd_options=""
> name="deb1" server_root="/var/www/html" shutdown_wait=""/>
> 			<apache config_file="/etc/apache2/apache2.conf" httpd_options=""
> name="deb2" server_root="/var/www/html" shutdown_wait=""/>
> 			<ip ref="192.168.2.112"/>
> 			<apache ref="debby"/>
> 		</service>
> 	</rm>
> </cluster>

I don't use apache, so I can't speak to that resource agent's config. I 
can say though that overall it looks okay with two exceptions.

You *must* configure fencing for the cluster to work properly. Even 
without shared storage, a node failure will trigger a fence call which, 
because it can't succeed, will leave your cluster hung hard.

Change the cluster names to the output of `uname -n` (should be the FQDN).

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"



From mjh2000 at gmail.com  Sat Oct 15 13:30:34 2011
From: mjh2000 at gmail.com (Joey L)
Date: Sat, 15 Oct 2011 09:30:34 -0400
Subject: [Linux-cluster] redhat cluster running on debian 6.
In-Reply-To: <4E98A5CE.9050706@alteeve.com>
References: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>
	<CAK3ER7sNjxdgKr-iJGA7n6yvDMY5Uqh4caQrUEohK8vmHi0i4g@mail.gmail.com>
	<4E98A5CE.9050706@alteeve.com>
Message-ID: <CAK3ER7tfz+8Qd5UrdnYMnSZUMqgRKRuU-p_yfnyx8bWwuOUhmg@mail.gmail.com>

>
> I don't use apache, so I can't speak to that resource agent's config. I can
> say though that overall it looks okay with two exceptions.
>
> You *must* configure fencing for the cluster to work properly. Even without
> shared storage, a node failure will trigger a fence call which, because it
> can't succeed, will leave your cluster hung hard.
>
> Change the cluster names to the output of `uname -n` (should be the FQDN).
>

i looked at the docs - i think it says fencing is not required.
I do not have any sofisticated fencing devices on my network - so it
would not help anyways ??
Or do you have a solution in that scenerio ??
Have you used Heartbeat ?
It looks a lot less complicated then RH Cluster and seems like there
are more docs and vidoes on the net.
Do you have a simple cluster.conf file that I can use to see if I am
setting this up correctly?
I do not see the any of my shared services when i look at my node in
the members tab.


thanks
mjh



From mjh2000 at gmail.com  Sat Oct 15 13:33:49 2011
From: mjh2000 at gmail.com (Joey L)
Date: Sat, 15 Oct 2011 09:33:49 -0400
Subject: [Linux-cluster] redhat cluster running on debian 6.
In-Reply-To: <j7agdk$vrp$1@dough.gmane.org>
References: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>
	<j7agdk$vrp$1@dough.gmane.org>
Message-ID: <CAK3ER7sbhqFNzTRu9RQ4=oqqHPqmVhkRBOdu6L50qrGhfNY+nw@mail.gmail.com>

On Fri, Oct 14, 2011 at 7:25 PM, Walter Hurry <walterhurry at lavabit.com> wrote:
> On Fri, 14 Oct 2011 16:58:01 -0400, Joey L wrote:
>
>> I am new to redhat cluster and i am having some issues.
>
> Why do you keep starting new threads for the same basic question? Clearly
> you are some kind of "architect" who is trying to put together a proposal
> for some client or other. Why do you expect us to prepare a ready made
> cookbook for you?
>
> Admit that it is beyond your level of competence, and give the job to
> someone who knows what they are doing.
>

Walter apparently you were born knowing it all and thanks for sharing
your profound knowledge by your reply.
Apparently you are trying to sell some questionable solution at
www.lavabit.com -- which looks really bad by the way.
This is no way to get customers
I will not waste my time with you - apparently you are a man with
little things...technology and otherwise.
thanks



From linux at alteeve.com  Sat Oct 15 15:47:51 2011
From: linux at alteeve.com (Digimer)
Date: Sat, 15 Oct 2011 11:47:51 -0400
Subject: [Linux-cluster] redhat cluster running on debian 6.
In-Reply-To: <CAK3ER7tfz+8Qd5UrdnYMnSZUMqgRKRuU-p_yfnyx8bWwuOUhmg@mail.gmail.com>
References: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>
	<CAK3ER7sNjxdgKr-iJGA7n6yvDMY5Uqh4caQrUEohK8vmHi0i4g@mail.gmail.com>
	<4E98A5CE.9050706@alteeve.com>
	<CAK3ER7tfz+8Qd5UrdnYMnSZUMqgRKRuU-p_yfnyx8bWwuOUhmg@mail.gmail.com>
Message-ID: <4E99AB27.5080908@alteeve.com>

On 10/15/2011 09:30 AM, Joey L wrote:
>>
>> I don't use apache, so I can't speak to that resource agent's config. I can
>> say though that overall it looks okay with two exceptions.
>>
>> You *must* configure fencing for the cluster to work properly. Even without
>> shared storage, a node failure will trigger a fence call which, because it
>> can't succeed, will leave your cluster hung hard.
>>
>> Change the cluster names to the output of `uname -n` (should be the FQDN).
>>
> 
> i looked at the docs - i think it says fencing is not required.

What docs? That is misleading.

Consider;
* Node 1 wants to start Service A.
* Node 1 requests a DLM lock, gets it, starts Service A.
* Meanwhile, Node 2 wants to start Service A.
* Node 2 requests a DLM lock, is refused because the lock is out to Node A.
* Node 1 finishes starting Service A and tells the cluster.
* Node 1 releases the lock.
* Node 2, having seen now that Service A is running, no longer tries to
start Service A.

Time pases, and suddenly Node A fails.

* After a short period of time, the cluster will detect Node 1's death.
* The cluster enters an unknown state (is Node 1 dead, hung, ?).
* The clyster will call a fence and DLM will block. With DLM blocked,
nothing can get a lack and, without a lock, services can not be recovered.
* The fence call completes successfully and tells the cluster that
things are back into a known state.
* DLM unblocks.
* RGManager sorts out what services where lost (Service A), figures out
who can recover the lost service (Node 2).
* Node 2 requests a lock from DLM and you know the rest of the story.

> I do not have any sofisticated fencing devices on my network - so it
> would not help anyways ??

Wrong. Without fencing, you can not have a stable cluster. In clustering;

 "The only thing you know is what you don't know."

As soon as a node goes silent, you can only know that it has stopped
responding. Has it hung and will it come back? Has it completely powered
off? You can't guess.

Fencing puts the silent node into a known state. That is, it is either
disconnected from the cluster's network or forced off. Only then can the
state of the silent node be known. Until it's state is known, the
cluster can not operate safely.

This is a general high-availability cluster concept.

> Or do you have a solution in that scenerio ??

Yup, you can get a switched PDU. The APC brand PDUs are very good and
very well supported using the 'fence_apc_snmp' fence agent.

I've used this one in many clusters (as a backup to iLO/IPMI based
fencing). It's just fine as a primary fence agent as well.

http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900

> Have you used Heartbeat ?

A long time ago. The heartbeat project is effectively deprecated.
Linbit, the company behind DRBD, has taken over the project and have
announced that they plan no further development. They are maintaining it
as bugs are found, but that is it.

Both RHCS and the Pacemaker project are primarily on corosync know.

> It looks a lot less complicated then RH Cluster and seems like there
> are more docs and vidoes on the net.

I think you are referring to Pacemaker. That is the resource management
layer. Whether you think it is simpler or not is, of course, up to the
user's perspective. That said, Pacemaker is a perfectly good clustered
resource manager and I have no reason to argue against it. I just can't
help with it as I'm mostly familiar with Red Hat's current cluster suite.

> Do you have a simple cluster.conf file that I can use to see if I am
> setting this up correctly?

"Simple", no. I do have an extensive tutorial though.

https://alteeve.com/w/Red_Hat_Cluster_Service_2_Tutorial

It's for EL5 and RHCS Stable 2, and the current version is stable 3, but
the configuration is all but the same. The only (visible) change are the
way the config file is validated (ccs_config_validate instead of the
xmllint call) and how updated versions are pushed out to the rest of the
cluster ('cman_tool version -r' instead of 'ccs_tool update
/etc/cluster/cluster.conf').

> I do not see the any of my shared services when i look at my node in
> the members tab.

Is rgmanager running? What does 'clustat' show?

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"



From linux at alteeve.com  Sat Oct 15 15:48:26 2011
From: linux at alteeve.com (Digimer)
Date: Sat, 15 Oct 2011 11:48:26 -0400
Subject: [Linux-cluster] redhat cluster running on debian 6.
In-Reply-To: <CAK3ER7sbhqFNzTRu9RQ4=oqqHPqmVhkRBOdu6L50qrGhfNY+nw@mail.gmail.com>
References: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>
	<j7agdk$vrp$1@dough.gmane.org>
	<CAK3ER7sbhqFNzTRu9RQ4=oqqHPqmVhkRBOdu6L50qrGhfNY+nw@mail.gmail.com>
Message-ID: <4E99AB4A.8060109@alteeve.com>

On 10/15/2011 09:33 AM, Joey L wrote:
> On Fri, Oct 14, 2011 at 7:25 PM, Walter Hurry <walterhurry at lavabit.com> wrote:
>> On Fri, 14 Oct 2011 16:58:01 -0400, Joey L wrote:
>>
>>> I am new to redhat cluster and i am having some issues.
>>
>> Why do you keep starting new threads for the same basic question? Clearly
>> you are some kind of "architect" who is trying to put together a proposal
>> for some client or other. Why do you expect us to prepare a ready made
>> cookbook for you?
>>
>> Admit that it is beyond your level of competence, and give the job to
>> someone who knows what they are doing.
>>
> 
> Walter apparently you were born knowing it all and thanks for sharing
> your profound knowledge by your reply.
> Apparently you are trying to sell some questionable solution at
> www.lavabit.com -- which looks really bad by the way.
> This is no way to get customers
> I will not waste my time with you - apparently you are a man with
> little things...technology and otherwise.
> thanks
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Please keep this petty bickering off of this mailing list.

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"



From raju.rajsand at gmail.com  Sun Oct 16 04:20:37 2011
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Sun, 16 Oct 2011 09:50:37 +0530
Subject: [Linux-cluster] Apache active-active cluster
In-Reply-To: <F9FFD5DEC2163F498E703E7F9CBB3A350F756F9607@EXMB01.eu.tieto.com>
References: <F9FFD5DEC2163F498E703E7F9CBB3A350F756F9607@EXMB01.eu.tieto.com>
Message-ID: <CA+Ydgaoiv_ji+KAO-XoFoN4Rq3YaUE4q5K=C0GuGCB_-L0_qog@mail.gmail.com>

Greetings,

On Fri, Oct 14, 2011 at 1:47 PM,  <Sagar.Shimpi at tieto.com> wrote:
> Hi,
>
> Can I configure Apache Active ?Active cluster using Redhat Cluster Suit in
> RHEL6?
>
> If yes, can someone please pass me the link for the same.
>

If you are using htpasswd for authentication, perhaps you may not be
able to build it.

Look at LVS at the same time. That might solve your problem.

and oh! never forget that without fencing (storage/in-band/out-of-band)

-- 
Regards,

Rajagopal
+91 9930 633 852



From mjh2000 at gmail.com  Sun Oct 16 11:21:43 2011
From: mjh2000 at gmail.com (Joey L)
Date: Sun, 16 Oct 2011 07:21:43 -0400
Subject: [Linux-cluster] redhat cluster running on debian 6.
In-Reply-To: <4E99AB27.5080908@alteeve.com>
References: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>
	<CAK3ER7sNjxdgKr-iJGA7n6yvDMY5Uqh4caQrUEohK8vmHi0i4g@mail.gmail.com>
	<4E98A5CE.9050706@alteeve.com>
	<CAK3ER7tfz+8Qd5UrdnYMnSZUMqgRKRuU-p_yfnyx8bWwuOUhmg@mail.gmail.com>
	<4E99AB27.5080908@alteeve.com>
Message-ID: <CAK3ER7u0PDz1wD0SKPuDb5ANd9Kqvene1uJChTZQZtt6XKhrvg@mail.gmail.com>

Digimer - thanks for you input - you saved me a ton of time!!!
I did look at your tutorial -- great stuff BTW.

I thought fencing was an option because I setup RH cluster about 5
years ago and I thought I did not do it then..and further in the RHEL
Cluster Administrator had points that it was optional -can not find
the url right now. but have pdf and it says:

"If a cluster node is configured to be fenced by an integrated fence
device, disable ACPI Soft-Off for
that node. Disabling ACPI Soft-Off allows an integrated fence device
to turn off a node immediately
and completely rather than attempting a clean shutdown (for example,
shutdown -h now).
Otherwise, if ACPI Soft-Off is enabled, an integrated fence device can
take four or more seconds to
turn off a node (refer to note that follows). In addition, if ACPI
Soft-Off is enabled and a node panics
or freezes during shutdown, an integrated fence device may not be able
to turn off the node. Under
those circumstances, fencing is delayed or unsuccessful. Consequently,
when a node is fenced
with an integrated fence device and ACPI Soft-Off is enabled, a
cluster recovers slowly or requires
administrative intervention to recover "


But not looking to argue this point at all - i remember that when i
did set it up it was indeed more stable like you state.
My Memory is getting old :)

About pacemaker --
Do I need fencing hardware as well ??
I just got 2 servers and a regular switch - i think it netgear.
Like I said earlier - just want the 2 boxes to back up each other.
I have mysql, apache, asterisk, dns and nfs client running on them -
can i do anything with pacemaker ??
I should mention that I am using software raid but will probably need
to change to hardware raid in near future.
I would like to use the mysql replication feature -- if possible.

thanks for your insight and help.

mjh



From linux at alteeve.com  Sun Oct 16 14:47:23 2011
From: linux at alteeve.com (Digimer)
Date: Sun, 16 Oct 2011 10:47:23 -0400
Subject: [Linux-cluster] redhat cluster running on debian 6.
In-Reply-To: <CAK3ER7u0PDz1wD0SKPuDb5ANd9Kqvene1uJChTZQZtt6XKhrvg@mail.gmail.com>
References: <CAK3ER7svf1K9VR6ohbupZit96QL+waOwhWWoLGoobcdGMWwPbA@mail.gmail.com>
	<CAK3ER7sNjxdgKr-iJGA7n6yvDMY5Uqh4caQrUEohK8vmHi0i4g@mail.gmail.com>
	<4E98A5CE.9050706@alteeve.com>
	<CAK3ER7tfz+8Qd5UrdnYMnSZUMqgRKRuU-p_yfnyx8bWwuOUhmg@mail.gmail.com>
	<4E99AB27.5080908@alteeve.com>
	<CAK3ER7u0PDz1wD0SKPuDb5ANd9Kqvene1uJChTZQZtt6XKhrvg@mail.gmail.com>
Message-ID: <4E9AEE7B.9050803@alteeve.com>

On 10/16/2011 07:21 AM, Joey L wrote:
> Digimer - thanks for you input - you saved me a ton of time!!!
> I did look at your tutorial -- great stuff BTW.

Thank you. :)

> I thought fencing was an option because I setup RH cluster about 5
> years ago and I thought I did not do it then..and further in the RHEL
> Cluster Administrator had points that it was optional -can not find
> the url right now. but have pdf and it says:

Maybe in very old versions, but not under EL5 or EL6. Like I said,
RGManager uses DLM and DLM blocks on fence and fence triggers as soon as
a node goes quiet, regardless of if there is a fence device or not.

> "If a cluster node is configured to be fenced by an integrated fence
> device, disable ACPI Soft-Off for
> that node. Disabling ACPI Soft-Off allows an integrated fence device
> to turn off a node immediately
> and completely rather than attempting a clean shutdown (for example,
> shutdown -h now).
> Otherwise, if ACPI Soft-Off is enabled, an integrated fence device can
> take four or more seconds to
> turn off a node (refer to note that follows). In addition, if ACPI
> Soft-Off is enabled and a node panics
> or freezes during shutdown, an integrated fence device may not be able
> to turn off the node. Under
> those circumstances, fencing is delayed or unsuccessful. Consequently,
> when a node is fenced
> with an integrated fence device and ACPI Soft-Off is enabled, a
> cluster recovers slowly or requires
> administrative intervention to recover "

This helps speed up recovery. However, I prefer to leave it on an accept
the 4 seconds delta because I find ACPI soft-off is very handy. In the
end though, the decision depends on your needs.

> But not looking to argue this point at all - i remember that when i
> did set it up it was indeed more stable like you state.
> My Memory is getting old :)

Fencing <3

> About pacemaker --
> Do I need fencing hardware as well ??

Short answer; Yes.

Long answer, you can disable stonith. However, that is always a
tremendous risk because you set yourself up for a bad day if the cluster
is allowed to reconfigure around a blocked node that eventually
unblocks. I *strongly* advice against it. Strictly speaking though; No.

> I just got 2 servers and a regular switch - i think it netgear.
> Like I said earlier - just want the 2 boxes to back up each other.

What you want doesn't really preclude the need to build a proper cluster. :)

> I have mysql, apache, asterisk, dns and nfs client running on them -
> can i do anything with pacemaker ??

Pacemaker and Red Hat cluster will both do this just fine.

> I should mention that I am using software raid but will probably need
> to change to hardware raid in near future.

The cluster doesn't care.

> I would like to use the mysql replication feature -- if possible.

This is a well tested and used configuration. You should have no trouble
getting help.

> thanks for your insight and help.
> 
> mjh

My pleasure. :)

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"



From mjh2000 at gmail.com  Mon Oct 17 02:02:47 2011
From: mjh2000 at gmail.com (Joey L)
Date: Sun, 16 Oct 2011 22:02:47 -0400
Subject: [Linux-cluster] new to pacemaker and heartbeat on debian...getting
	error..
Message-ID: <CAK3ER7tu9LDBe=nR7C_OFVnAaAiyXXan_zgBxWHo3=iE=VAYHQ@mail.gmail.com>

Hi - New to heartbeat and pacemaker on debian.
Followed a tutorial online at:
http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo


and now getting this error -


root at deb2:/home/mjh# sudo crm_mon --one-shot
============
Last updated: Sun Oct 16 21:56:43 2011
Stack: openais
Current DC: deb1 - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ deb1 deb2 ]


Failed actions:
    failover-ip_start_0 (node=deb2, call=35, rc=1, status=complete):
unknown error
    failover-ip_start_0 (node=deb1, call=35, rc=1, status=complete):
unknown error
root at deb2:/home/mjh#


Tried googling - nothing --
I stopped avahi-daemon because i was getting a strange error:
I was getting a naming conflict error do not think i am getting it any more.

Any thoughts on this ?
Any way i can turn logs on for heartbeat or pacemaker ?

thanks
mjh



From Sagar.Shimpi at tieto.com  Mon Oct 17 08:00:58 2011
From: Sagar.Shimpi at tieto.com (Sagar.Shimpi at tieto.com)
Date: Mon, 17 Oct 2011 11:00:58 +0300
Subject: [Linux-cluster] Apache active-active cluster
In-Reply-To: <CA+Ydgaoiv_ji+KAO-XoFoN4Rq3YaUE4q5K=C0GuGCB_-L0_qog@mail.gmail.com>
References: <F9FFD5DEC2163F498E703E7F9CBB3A350F756F9607@EXMB01.eu.tieto.com>
	<CA+Ydgaoiv_ji+KAO-XoFoN4Rq3YaUE4q5K=C0GuGCB_-L0_qog@mail.gmail.com>
Message-ID: <F9FFD5DEC2163F498E703E7F9CBB3A350F756F9D92@EXMB01.eu.tieto.com>

Hi,

Thanks for the info.

I need to ask one for thing - Is LVS possible to implement and test on Virtual Machines[I mean Vmware workstation -Desktop] ??


Regards,

Sagar Shimpi, Senior Technical Specialist, OSS Labs

Tieto
email sagar.shimpi at tieto.com,
Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77, 
MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in 

TIETO. Knowledge. Passion. Results.


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan
Sent: Sunday, October 16, 2011 9:51 AM
To: linux clustering
Subject: Re: [Linux-cluster] Apache active-active cluster

Greetings,

On Fri, Oct 14, 2011 at 1:47 PM,  <Sagar.Shimpi at tieto.com> wrote:
> Hi,
>
> Can I configure Apache Active -Active cluster using Redhat Cluster Suit in
> RHEL6?
>
> If yes, can someone please pass me the link for the same.
>

If you are using htpasswd for authentication, perhaps you may not be
able to build it.

Look at LVS at the same time. That might solve your problem.

and oh! never forget that without fencing (storage/in-band/out-of-band)

-- 
Regards,

Rajagopal
+91 9930 633 852

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From mgrac at redhat.com  Mon Oct 17 08:10:09 2011
From: mgrac at redhat.com (Marek Grac)
Date: Mon, 17 Oct 2011 10:10:09 +0200
Subject: [Linux-cluster] Apache active-active cluster
In-Reply-To: <F9FFD5DEC2163F498E703E7F9CBB3A350F756F9D92@EXMB01.eu.tieto.com>
References: <F9FFD5DEC2163F498E703E7F9CBB3A350F756F9607@EXMB01.eu.tieto.com>	<CA+Ydgaoiv_ji+KAO-XoFoN4Rq3YaUE4q5K=C0GuGCB_-L0_qog@mail.gmail.com>
	<F9FFD5DEC2163F498E703E7F9CBB3A350F756F9D92@EXMB01.eu.tieto.com>
Message-ID: <4E9BE2E1.1060301@redhat.com>

On 10/17/2011 10:00 AM, Sagar.Shimpi at tieto.com wrote:
> Hi,
> 
> Thanks for the info.
> 
> I need to ask one for thing - Is LVS possible to implement and test on Virtual Machines[I mean Vmware workstation -Desktop] ??

In general it is possible. I have tried to run LVS on xen/kvm and it
works. But I have no experience with running it on VMWare

m,



From daniele at retaggio.net  Mon Oct 17 10:22:13 2011
From: daniele at retaggio.net (Daniele Palumbo)
Date: Mon, 17 Oct 2011 12:22:13 +0200
Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see
 open attribute
In-Reply-To: <C97BB1FB-F294-4C48-B830-C2D75258641E@retaggio.net>
References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>	<4E97B2E5.5000509@redhat.com>	<1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net>	<4E98249F.2060302@redhat.com>	<714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net>	<4E983455.1000205@redhat.com>
	<C97BB1FB-F294-4C48-B830-C2D75258641E@retaggio.net>
Message-ID: <4E9C01D5.4040401@retaggio.net>

On 14/10/2011 17:30, Daniele Palumbo wrote:
> as you see in one server the device is *marked* as open, in the other is not.
>
> i need just to see open in both server,
> if it is marked as open, xen refuse to boot vm with that device open, and i will not care about the filesystem that is contained in lvm.
[...]
 > ok so i will try with one storage and i will report back.

here i am :)

the issue is still present:
pvsrv08 ~ # mount | grep vgTest
pvsrv08 ~ # lvs
   LV     VG     Attr   LSize   Origin Snap%  Move Log Copy%  Convert
   lvTest vgTest -wi-a- 500.00m
pvsrv08 ~ #

pvsrv07 ~ # mount | grep vgTest
pvsrv07 ~ # lvs
   LV     VG     Attr   LSize   Origin Snap%  Move Log Copy%  Convert
   lvTest vgTest -wi-a- 500.00m
pvsrv07 ~ #

pvsrv06 ~ # mount | grep vgTest
/dev/mapper/vgTest-lvTest on /mnt type reiserfs (rw)
pvsrv06 ~ # lvs
   LV     VG     Attr   LSize   Origin Snap%  Move Log Copy%  Convert
   lvTest vgTest -wi-ao 500.00m
pvsrv06 ~ # vgs
   VG     #PV #LV #SN Attr   VSize VFree
   vgTest   1   1   0 wz--nc 3.90g 3.41g
pvsrv06 ~ # pvs
   PV               VG     Fmt  Attr PSize PFree
   /dev/etherd/e7.0 vgTest lvm2 a--  3.90g 3.41g
pvsrv06 ~ #

pvsrv06 ~ # lvdisplay vgTest/lvTest
   --- Logical volume ---
   LV Name                /dev/vgTest/lvTest
   VG Name                vgTest
   LV UUID                2nRkxV-JhDe-oDW1-SIZT-dHwH-9x2W-XMvHDr
   LV Write Access        read/write
   LV Status              available
   # open                 2
   LV Size                500.00 MiB
   Current LE             125
   Segments               1
   Allocation             inherit
   Read ahead sectors     auto
   - currently set to     4096
   Block device           252:0

pvsrv06 ~ #


pvsrv07 ~ # lvdisplay /dev/vgTest/lvTest
   --- Logical volume ---
   LV Name                /dev/vgTest/lvTest
   VG Name                vgTest
   LV UUID                2nRkxV-JhDe-oDW1-SIZT-dHwH-9x2W-XMvHDr
   LV Write Access        read/write
   LV Status              available
   # open                 0
   LV Size                500.00 MiB
   Current LE             125
   Segments               1
   Allocation             inherit
   Read ahead sectors     auto
   - currently set to     4096
   Block device           252:0

pvsrv07 ~ #

when a device is mounted i can see the open status only in one server, 
so the open status is different between server.

the 3 servers see the same storage with same device path (as seen in 
pvsrv06).

pvsrv06 ~ # cman_tool status
Version: 6.2.0
Config Version: 5
Cluster Name: CEMCluster
Cluster Id: 7604
Cluster Member: Yes
Cluster Generation: 288
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: pvsrv06
Node ID: 1
Multicast addresses: 239.192.29.209
Node addresses: 192.168.1.106
pvsrv06 ~ #

pvsrv06 ~ # fence_tool ls
fence domain
member count  3
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 3

pvsrv06 ~ #


pvsrv06 ~ # cat /etc/cluster/cluster.conf
<cluster name="CEMCluster" config_version="5">
         <clusternodes>
                 <clusternode name="pvsrv06" nodeid="1">
                 </clusternode>
                 <clusternode name="pvsrv07" nodeid="2">
                 </clusternode>
                 <clusternode name="pvsrv08" nodeid="3">
                 </clusternode>
         </clusternodes>
</cluster>
pvsrv06 ~ #


bye
d.



From balla at staff.spin.it  Mon Oct 17 10:42:07 2011
From: balla at staff.spin.it (Emanuele Balla)
Date: Mon, 17 Oct 2011 12:42:07 +0200
Subject: [Linux-cluster] Apache active-active cluster
In-Reply-To: <4E9BE2E1.1060301@redhat.com>
References: <F9FFD5DEC2163F498E703E7F9CBB3A350F756F9607@EXMB01.eu.tieto.com>	<CA+Ydgaoiv_ji+KAO-XoFoN4Rq3YaUE4q5K=C0GuGCB_-L0_qog@mail.gmail.com>
	<F9FFD5DEC2163F498E703E7F9CBB3A350F756F9D92@EXMB01.eu.tieto.com>
	<4E9BE2E1.1060301@redhat.com>
Message-ID: <4E9C067F.3000005@staff.spin.it>

On 10/17/11 10:10 AM, Marek Grac wrote:
> On 10/17/2011 10:00 AM, Sagar.Shimpi at tieto.com wrote:
>> Hi,
>>
>> Thanks for the info.
>>
>> I need to ask one for thing - Is LVS possible to implement and test on Virtual Machines[I mean Vmware workstation -Desktop] ??
> 
> In general it is possible. I have tried to run LVS on xen/kvm and it
> works. But I have no experience with running it on VMWare

Works exactly the same.
I have customers running production websites >70Mbps balanced by LVS on
vmware (direct routing, FWIW).
On ESXi, obviously, but I wouldn't expect troubles running it on some
other virtual environment...

-- 
# Emanuele Balla            #                            #
# System & Network Engineer #                            #
# Spin s.r.l. - AS6734      # Phone: +39 040 9869090     #
# Trieste                   # Email: balla at staff.spin.it #



From Sagar.Shimpi at tieto.com  Mon Oct 17 11:25:35 2011
From: Sagar.Shimpi at tieto.com (Sagar.Shimpi at tieto.com)
Date: Mon, 17 Oct 2011 14:25:35 +0300
Subject: [Linux-cluster] Apache active-active cluster
In-Reply-To: <4E9C067F.3000005@staff.spin.it>
References: <F9FFD5DEC2163F498E703E7F9CBB3A350F756F9607@EXMB01.eu.tieto.com>
	<CA+Ydgaoiv_ji+KAO-XoFoN4Rq3YaUE4q5K=C0GuGCB_-L0_qog@mail.gmail.com>
	<F9FFD5DEC2163F498E703E7F9CBB3A350F756F9D92@EXMB01.eu.tieto.com>
	<4E9BE2E1.1060301@redhat.com> <4E9C067F.3000005@staff.spin.it>
Message-ID: <F9FFD5DEC2163F498E703E7F9CBB3A350F756FA09D@EXMB01.eu.tieto.com>

Hi,

I want to know if I can configure LVS on my home network - Vmware Workstation [and not Vmware ESX server]...!!!!


Regards,

Sagar Shimpi, Senior Technical Specialist, OSS Labs

Tieto
email sagar.shimpi at tieto.com,
Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77, 
MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in 

TIETO. Knowledge. Passion. Results.


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Emanuele Balla
Sent: Monday, October 17, 2011 4:12 PM
To: linux clustering
Subject: Re: [Linux-cluster] Apache active-active cluster

On 10/17/11 10:10 AM, Marek Grac wrote:
> On 10/17/2011 10:00 AM, Sagar.Shimpi at tieto.com wrote:
>> Hi,
>>
>> Thanks for the info.
>>
>> I need to ask one for thing - Is LVS possible to implement and test on Virtual Machines[I mean Vmware workstation -Desktop] ??
> 
> In general it is possible. I have tried to run LVS on xen/kvm and it
> works. But I have no experience with running it on VMWare

Works exactly the same.
I have customers running production websites >70Mbps balanced by LVS on
vmware (direct routing, FWIW).
On ESXi, obviously, but I wouldn't expect troubles running it on some
other virtual environment...

-- 
# Emanuele Balla            #                            #
# System & Network Engineer #                            #
# Spin s.r.l. - AS6734      # Phone: +39 040 9869090     #
# Trieste                   # Email: balla at staff.spin.it #

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From andreas at hastexo.com  Mon Oct 17 14:07:43 2011
From: andreas at hastexo.com (Andreas Kurz)
Date: Mon, 17 Oct 2011 16:07:43 +0200
Subject: [Linux-cluster] new to pacemaker and heartbeat on
 debian...getting error..
In-Reply-To: <CAK3ER7tu9LDBe=nR7C_OFVnAaAiyXXan_zgBxWHo3=iE=VAYHQ@mail.gmail.com>
References: <CAK3ER7tu9LDBe=nR7C_OFVnAaAiyXXan_zgBxWHo3=iE=VAYHQ@mail.gmail.com>
Message-ID: <4E9C36AF.2060302@hastexo.com>

On 10/17/2011 04:02 AM, Joey L wrote:
> Hi - New to heartbeat and pacemaker on debian.
> Followed a tutorial online at:
> http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
> 
> 
> and now getting this error -
> 
> 
> root at deb2:/home/mjh# sudo crm_mon --one-shot
> ============
> Last updated: Sun Oct 16 21:56:43 2011
> Stack: openais
> Current DC: deb1 - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
> 
> Online: [ deb1 deb2 ]
> 
> 
> Failed actions:
>     failover-ip_start_0 (node=deb2, call=35, rc=1, status=complete):
> unknown error
>     failover-ip_start_0 (node=deb1, call=35, rc=1, status=complete):
> unknown error
> root at deb2:/home/mjh#
> 

Please provide your config ... best is the ouput of "cibadmin -Q".
Reading the logs should also give you valuable hints.

One shot into the dark: there is no interface up with an IP in the same
subnet you configured your failover-ip and you did not explicitly
defined an interface.

And there is a dedicated Pacemaker mailinglist, I set it on cc for this
thread.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Tried googling - nothing --
> I stopped avahi-daemon because i was getting a strange error:
> I was getting a naming conflict error do not think i am getting it any more.
> 
> Any thoughts on this ?
> Any way i can turn logs on for heartbeat or pacemaker ?
> 
> thanks
> mjh
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 286 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111017/1a9812cf/attachment.sig>

From daniele at retaggio.net  Tue Oct 18 08:27:26 2011
From: daniele at retaggio.net (Daniele Palumbo)
Date: Tue, 18 Oct 2011 10:27:26 +0200
Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see
 open attribute
In-Reply-To: <4E9C01D5.4040401@retaggio.net>
References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net>	<4E97B2E5.5000509@redhat.com>	<1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net>	<4E98249F.2060302@redhat.com>	<714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net>	<4E983455.1000205@redhat.com>	<C97BB1FB-F294-4C48-B830-C2D75258641E@retaggio.net>
	<4E9C01D5.4040401@retaggio.net>
Message-ID: <4E9D386E.9050803@retaggio.net>

any hint?
i'm almost desperate :(

On 17/10/2011 12:22, Daniele Palumbo wrote:
> On 14/10/2011 17:30, Daniele Palumbo wrote:
>> as you see in one server the device is *marked* as open, in the other
>> is not.
>>
>> i need just to see open in both server,
>> if it is marked as open, xen refuse to boot vm with that device open,
>> and i will not care about the filesystem that is contained in lvm.
> [...]
>  > ok so i will try with one storage and i will report back.
>
> here i am :)
>
> the issue is still present:
> pvsrv08 ~ # mount | grep vgTest
> pvsrv08 ~ # lvs
> LV VG Attr LSize Origin Snap% Move Log Copy% Convert
> lvTest vgTest -wi-a- 500.00m
> pvsrv08 ~ #
>
> pvsrv07 ~ # mount | grep vgTest
> pvsrv07 ~ # lvs
> LV VG Attr LSize Origin Snap% Move Log Copy% Convert
> lvTest vgTest -wi-a- 500.00m
> pvsrv07 ~ #
>
> pvsrv06 ~ # mount | grep vgTest
> /dev/mapper/vgTest-lvTest on /mnt type reiserfs (rw)
> pvsrv06 ~ # lvs
> LV VG Attr LSize Origin Snap% Move Log Copy% Convert
> lvTest vgTest -wi-ao 500.00m
> pvsrv06 ~ # vgs
> VG #PV #LV #SN Attr VSize VFree
> vgTest 1 1 0 wz--nc 3.90g 3.41g
> pvsrv06 ~ # pvs
> PV VG Fmt Attr PSize PFree
> /dev/etherd/e7.0 vgTest lvm2 a-- 3.90g 3.41g
> pvsrv06 ~ #
>
> pvsrv06 ~ # lvdisplay vgTest/lvTest
> --- Logical volume ---
> LV Name /dev/vgTest/lvTest
> VG Name vgTest
> LV UUID 2nRkxV-JhDe-oDW1-SIZT-dHwH-9x2W-XMvHDr
> LV Write Access read/write
> LV Status available
> # open 2
> LV Size 500.00 MiB
> Current LE 125
> Segments 1
> Allocation inherit
> Read ahead sectors auto
> - currently set to 4096
> Block device 252:0
>
> pvsrv06 ~ #
>
>
> pvsrv07 ~ # lvdisplay /dev/vgTest/lvTest
> --- Logical volume ---
> LV Name /dev/vgTest/lvTest
> VG Name vgTest
> LV UUID 2nRkxV-JhDe-oDW1-SIZT-dHwH-9x2W-XMvHDr
> LV Write Access read/write
> LV Status available
> # open 0
> LV Size 500.00 MiB
> Current LE 125
> Segments 1
> Allocation inherit
> Read ahead sectors auto
> - currently set to 4096
> Block device 252:0
>
> pvsrv07 ~ #
>
> when a device is mounted i can see the open status only in one server,
> so the open status is different between server.
>
> the 3 servers see the same storage with same device path (as seen in
> pvsrv06).
>
> pvsrv06 ~ # cman_tool status
> Version: 6.2.0
> Config Version: 5
> Cluster Name: CEMCluster
> Cluster Id: 7604
> Cluster Member: Yes
> Cluster Generation: 288
> Membership state: Cluster-Member
> Nodes: 3
> Expected votes: 3
> Total votes: 3
> Node votes: 1
> Quorum: 2
> Active subsystems: 8
> Flags:
> Ports Bound: 0 11
> Node name: pvsrv06
> Node ID: 1
> Multicast addresses: 239.192.29.209
> Node addresses: 192.168.1.106
> pvsrv06 ~ #
>
> pvsrv06 ~ # fence_tool ls
> fence domain
> member count 3
> victim count 0
> victim now 0
> master nodeid 1
> wait state none
> members 1 2 3
>
> pvsrv06 ~ #
>
>
> pvsrv06 ~ # cat /etc/cluster/cluster.conf
> <cluster name="CEMCluster" config_version="5">
> <clusternodes>
> <clusternode name="pvsrv06" nodeid="1">
> </clusternode>
> <clusternode name="pvsrv07" nodeid="2">
> </clusternode>
> <clusternode name="pvsrv08" nodeid="3">
> </clusternode>
> </clusternodes>
> </cluster>
> pvsrv06 ~ #
>
>
> bye
> d.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From christian.masopust at siemens.com  Tue Oct 18 16:47:29 2011
From: christian.masopust at siemens.com (Masopust, Christian)
Date: Tue, 18 Oct 2011 18:47:29 +0200
Subject: [Linux-cluster] manually stopping system-service (part of a cluster
 service) and rgmanager doesn't start it again
Message-ID: <C3B6F57F6F0CE34093FF52B3FFBEFA7C010311E62F2E@ATVIES9917VMSX.ww300.siemens.net>

Hi all,

I have the following rgmanager configuration:

                <resources>
                        <ip address="<my-cluster-ip>" monitor_link="on" sleeptime="1"/>
                        <script file="/etc/init.d/clearcase" name="clearcase"/>
                        <script file="/etc/init.d/flexlm" name="flexlm"/>
                </resources>

                <service autostart="1" domain="lic2_primary" exclusive="0" name="Licenses" recovery="relocate">
                        <ip ref="<my-cluster-ip>">
                                <script ref="clearcase"/>
                                <script ref="flexlm"/>
                        </ip>
                </service>
Today one of our (not cluster-aware :) ) colleagues manually stopped flexlm by "service flexlm stop".

What I've expected to happen is, that rgmanager detects the stopped flexlm-service and either restarts it
or relocates the complete cluster-service to my other node, but nothing happened.  In rgmanager.log I can
see lots of calls of "/etc/init.d/flexlm status", but no action of restart or relocate.

What's going wrong here?   Or is this behaviour ok?

Thanks a lot,
Christian

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111018/64246f48/attachment.htm>

From jumanjiman at gmail.com  Tue Oct 18 22:04:26 2011
From: jumanjiman at gmail.com (Paul Morgan)
Date: Tue, 18 Oct 2011 18:04:26 -0400
Subject: [Linux-cluster] manually stopping system-service (part of a
 cluster service) and rgmanager doesn't start it again
In-Reply-To: <C3B6F57F6F0CE34093FF52B3FFBEFA7C010311E62F2E@ATVIES9917VMSX.ww300.siemens.net>
References: <C3B6F57F6F0CE34093FF52B3FFBEFA7C010311E62F2E@ATVIES9917VMSX.ww300.siemens.net>
Message-ID: <20111018220425.GA7419@gmail.com>

On Tue, Oct 18, 2011 at 06:47:29PM +0200, Masopust, Christian wrote:
-snip-
> Today one of our (not cluster-aware :) ) colleagues manually
> stopped flexlm by "service flexlm stop".
> 
> What I've expected to happen is, that rgmanager detects the
> stopped flexlm-service and either restarts it
> or relocates the complete cluster-service to my other node,
> but nothing happened.  In rgmanager.log I can
> see lots of calls of "/etc/init.d/flexlm status", but no
> action of restart or relocate.
>
> What's going wrong here?   Or is this behaviour ok?

My guess is that `service flexlm status' is returning
exit code 0 even if the service is not running.
Manually check the exit code while it's stopped.
$?==0 means the service is deemed good by rgmanager.

I hacked together an init script for our flexlm server,
and I've just posted it at:

  https://gist.github.com/1296847

In the comments of the script, I point out the importance
of error codes and provide URL references to LSB compliance
notes. The comments also mention how to override the script variables
(in /etc/sysconfig/flexlm).

Feel free to critique the init script.

hth,
-paul

-- 
Paul Morgan <jumanjiman at gmail.com>
RHCE, RHCDS, RHCVA, RHCSS, RHCA
http://github.com/jumanjiman
GPG Public Key ID: 0xf59e77c2
Fingerprint = 3248 D0C8 4B42 2F7C D92A  AEA0 7D20 6D66 F59E 77C2



From songyu555 at gmail.com  Wed Oct 19 00:00:20 2011
From: songyu555 at gmail.com (yu song)
Date: Wed, 19 Oct 2011 11:00:20 +1100
Subject: [Linux-cluster] non-critical resource in redhat cluster
Message-ID: <CADJUD+byGttNxdAnmAnMZL6VH=O+yURbFbTjQFz69MCmJ888oA@mail.gmail.com>

Dear all,

Does anyone have experience to setup non-critical resource in redhat
cluster?  in my case, we would like a cluster behavior like this.

1. infrastructure level resource (such as file system, IP) fails can trigger
an entire service fail over.

2. Application level resource fails will NOT trigger an entire service fail
over.

obviously I would like set application resource as non-critical. but have
not seen any place to explain how to.

thanks

Yu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111019/05c34a9e/attachment.htm>

From Ralph.Grothe at itdz-berlin.de  Wed Oct 19 06:56:02 2011
From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de)
Date: Wed, 19 Oct 2011 08:56:02 +0200
Subject: [Linux-cluster] non-critical resource in redhat cluster
In-Reply-To: <CADJUD+byGttNxdAnmAnMZL6VH=O+yURbFbTjQFz69MCmJ888oA@mail.gmail.com>
References: <CADJUD+byGttNxdAnmAnMZL6VH=O+yURbFbTjQFz69MCmJ888oA@mail.gmail.com>
Message-ID: <A789DDB53ED7E94396E842EE2AC9B5FF01432A99@itdzex101.ITDZ.verwalt-berlin.de>

Hi Yu,

maybe you could make use of the __independent_subtree resource
tag attribute?

See
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html
/Cluster_Administration/s1-clust-rsc-failure-rec-CA.html


Regards
Ralph


________________________________

	From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of yu song
	Sent: Wednesday, October 19, 2011 02:00
	To: linux clustering
	Subject: [Linux-cluster] non-critical resource in redhat
cluster
	
	
	Dear all,
	
	Does anyone have experience to setup non-critical
resource in redhat cluster?  in my case, we would like a cluster
behavior like this. 
	
	1. infrastructure level resource (such as file system,
IP) fails can trigger an entire service fail over. 
	
	2. Application level resource fails will NOT trigger an
entire service fail over. 
	
	obviously I would like set application resource as
non-critical. but have not seen any place to explain how to. 
	
	thanks
	
	Yu
	




From mjh2000 at gmail.com  Thu Oct 20 19:57:23 2011
From: mjh2000 at gmail.com (Joey L)
Date: Thu, 20 Oct 2011 15:57:23 -0400
Subject: [Linux-cluster] new to pacemaker and heartbeat on
 debian...getting error..
In-Reply-To: <4E9C36AF.2060302@hastexo.com>
References: <CAK3ER7tu9LDBe=nR7C_OFVnAaAiyXXan_zgBxWHo3=iE=VAYHQ@mail.gmail.com>
	<4E9C36AF.2060302@hastexo.com>
Message-ID: <CAK3ER7vqR2+3At-9cE2ik3Eg9h+dPmHZM1_MfPPdt7xduU_Z9w@mail.gmail.com>

>
> Please provide your config ... best is the ouput of "cibadmin -Q".
> Reading the logs should also give you valuable hints.
>

sorry for the late response - i was trying a few things and really
messed things up.

I think you were right in your previous email about another adapter on
same subnet -
I am using vbox and cloned the vm --- i think this caused the
duplicate ethernet hardware issue ---something like you are saying.

So i tried to remove all packages and start out with a new nic or
hardware address...now on installing pacemaker or heartbeat - i get
the following error:

Done.
Warning: The home dir /var/lib/heartbeat you specified already exists.
Adding system user `hacluster' (UID 117) ...
Adding new user `hacluster' (UID 117) with group `haclient' ...
The home directory `/var/lib/heartbeat' already exists.  Not copying
from `/etc/skel'.
adduser: Warning: The home directory `/var/lib/heartbeat' does not
belong to the user you are currently creating.
Processing triggers for python-central ...


I tried to clean it up - removed /etc/ha.d and /etc/corosync but the
those directories and others are not coming back as it did on a fresh
install.
Can you help me with that ??? are there other files i need to remove ???
How can i turn on logging for heartbeat and pacemaker ???
thanks
mjh



From brent.bolin at gmail.com  Fri Oct 21 23:29:16 2011
From: brent.bolin at gmail.com (Brent Bolin)
Date: Fri, 21 Oct 2011 18:29:16 -0500
Subject: [Linux-cluster] Problems adding application resources for samba or
 apache. Failover IP works fine
Message-ID: <CAE9HCWaDpNXSpJxTm_gJXK-Sfss0qTp=zEstQDoGDLBVZn5cWw@mail.gmail.com>

Seeing errors -

rgmanager[2758]: Checking Existence Of File
/var/run/cluster/apache/apache:web_apache.pid [apache:web_apache] >
Failed - File Doesn't Exist



From jpokorny at redhat.com  Mon Oct 24 14:33:40 2011
From: jpokorny at redhat.com (Jan Pokorny)
Date: Mon, 24 Oct 2011 16:33:40 +0200
Subject: [Linux-cluster] Luci can't install packages
In-Reply-To: <C088D3516432C643AC828162A5164A7F0A82BDFC@se3lmwbibaw.COMMUN.AD.SNCF.FR>
References: <C088D3516432C643AC828162A5164A7F0A82BDFC@se3lmwbibaw.COMMUN.AD.SNCF.FR>
Message-ID: <4EA57744.9030300@redhat.com>

Hi and pardon me for late reply,

On 09/16/2011 02:54 PM, BONNETOT Jean-Daniel (EXT THALES) wrote:
> Hello,
>
> Usually I used manal installation but I need to process throu Luci. My
> problem is present with RHEL 5.7 and RHEL 6.0 (luci and ricci), with
> RHEL 5.6 it works correctly.
>
> I used "Create" new cluster and add my nodes (options arenot important,
> the problem is always here) and submit?
>
> "Please wait..."
>
> Creating node "node1" for cluster "clutest": installing packages
>
> Creating node "node2" for cluster "clutest": installing packages
>
> I waited ;) but nothing. My process list on nodes says :
>
> 4166 ? Ss 0:00 /usr/sbin/oddjobd -p /var/run/oddjobd.pid -t 300
>
> 22343 ? S 0:00 \_ ricci-modrpm
>
> 22355 ? S 0:01 \_ /usr/bin/python /usr/bin/yum -y list all
>
> 4221 ? S<s 0:09 ricci -u 236
>
> 22342 ? S<s 0:00 /usr/libexec/ricci/ricci-worker -f
> /var/lib/ricci/queue/1952735127
>
> Nothing append, yum -y list all stay blocked? this command works well
> manually.
>
> I found some people with same problem on centos lists but no answers :(
>
> Do you know what can trouble ricci ? Have someone already same problem ?

This is a known issue (I contributed to the fix) that should fixed in
upcoming ricci/conga RHEL 5.7 (and also 5.6, see below) package update
and the fix should be present in RHEL 5.8 as well.

I haven't tried RHEL 5.6 version regarding this issue but with 5.7,
I was reproducing it reliably.  The same part identified as problematic
with ricci is present also in RHEL 5.6 package(s), so it's interesting
you haven't run into this with 5.6 in the same scenario as with 5.7.
Can you please provide which package version have you been using there
and which platform it was?


Regarding RHEL 6.x, I could reproduce the problem only extremely rarely
(i.e., once in tens of tries of two-nodes cluster creation, and it was
only a single node exposing that buggy behavior, the other was fine).

Therefore, I would be glad if you could provide details of your
reproducer (platform, ricci package version, the rate of successful
reproducing the issue, whether it is in "create cluster" scenario
only and maybe other circumstances such as if SELinux is used in
enforcing mode).


Thanks for your feedback (either on the list or via PM),
will be appreciated,

Jan



From symack at gmail.com  Tue Oct 25 21:10:45 2011
From: symack at gmail.com (Nick Khamis)
Date: Tue, 25 Oct 2011 17:10:45 -0400
Subject: [Linux-cluster] Compiling DLM for OpenAIS/Corosync
Message-ID: <CAGWRaZYuw8DLWv6muYwN9U=eCu9i_uc_NtfzzcNTuUUxTndJbQ@mail.gmail.com>

Hello Everyone,

I have downloaded the latest version of the cman stack to enable
Pacemaker+OCFS2 support on our test environment.
For production we will obviously use fencing however, it's not
required for our prototype. That being said, is it ok to
compile cman without group, rgmanager? And as mentioned before, I have
also excluded cman and fence. For a
configure:

 ./configure --prefix=/usr/local --without_cman --without_group
--without_fence --without_rgmanager
--corosyncincdir=/usr/include/corocync
--corosynclibdir=/usr/var/lib/corosync
openaisincdir=/usr/include/openais --corosyncbin=/usr/sbin/corosync

Does this exclude anything needed for dlm, and o2cb support for pcmk?

A slightly off topic question. What network filesystems are your guys
comfortable with using in a production system (Lusture, GFS2, OCFS2).
I hope
I am not opening up a can of worms, just a simple question.

Thanks in Advance,

Nick from Toronto.



From kclo2000 at gmail.com  Wed Oct 26 18:21:54 2011
From: kclo2000 at gmail.com (KC LO)
Date: Thu, 27 Oct 2011 02:21:54 +0800
Subject: [Linux-cluster] application startup timeout
Message-ID: <CADxp4OgaG5N+qSNv17CkTWWY9URu5J6zyiBC5YxjhU=azT6rNg@mail.gmail.com>

Dear all,

I have configured an application startup script in the redhat cluster
service. Sometimes, the application startup takes long time to complete and
the cluster will consider that it is failed as it consider that the return
code is "1".  When I use manual method to startup the mountpoint,
application, IP address, they can all complete successfully.

Is there any method to configure the timeout value of the application
startup script?

Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111027/a3bc9ac0/attachment.htm>

From danielgore at yaktech.com  Thu Oct 27 12:26:31 2011
From: danielgore at yaktech.com (Daniel R. Gore)
Date: Thu, 27 Oct 2011 08:26:31 -0400
Subject: [Linux-cluster] file system ownership/permissions on HA nfs and IPA
	client.
Message-ID: <1319718391.5980.34.camel@varun.yaktech.com>

My systems are build on Redhat Server 6.1.
I set up a pair of nodes with common (FC) storage as HA servers.  I have
an IPA server and have made the nodes IPA clients to the IPA server.  I
am serving an xfs file systems through HA nfs from an IP resource
associated to a common service name (fserv) from the HA service.

It all appears to work fine, except I cannot get the correct file
ownership/permissions when I connect through the NFS service name.  In
other words, when I mount "fserv:/export_home /home" from a IPA client
system, file ownership is "nobody;nobody".  If I look to see what node
is serving the file system and check permission where it is mounted, the
file ownership and permissions are correct.

Obviously, the NFS service provided from the HA systems has no ability
to reference LDAP from IPA even though both nodes are clients.

Permissions and ownership are a basic requirement that I need to meet.
Preferably, I would like to serve NFS4 kerberized access.  At this point
I am at a loss as how to do this.  Does anyone have a good solution?

One possible way might be to uses DRDB to mirror a file system on each
node and server nfs direct from one node at a time with fail-over to the
other node.  Then setup autofs on the clients to decide which node to
use.  I imagine I would need to have both nodes mount the
same /var/lib/nfs directory to ensure nfs recovery.

Looking for ideas!

In the long run, it would be best if you could setup a pseudo OS for the
NFS service that would have it's own kerberos certificates and LDAP
access that both (all) nodes have access to.  That way the service would
have correct ownership/permissions. 

Thanks.

Dan 


-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



From Ralph.Grothe at itdz-berlin.de  Thu Oct 27 12:30:12 2011
From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de)
Date: Thu, 27 Oct 2011 14:30:12 +0200
Subject: [Linux-cluster] Issue with NFS exports and "loopback" NFS mounts as
	RHCS resources on attempted resource release
Message-ID: <A789DDB53ED7E94396E842EE2AC9B5FF01432ABD@itdzex101.ITDZ.verwalt-berlin.de>

Hi Cluster Wizards,

I have a severe issue with NFS exports and mounts on an RHCS
2-node cluster (active/passive) within a clustered SAP service.

The migration of a clustered SAP from an HP-UX MC/ServiceGuard
cluster to an RHCS cluster,  what I thought to be an easy job,
now turns out to become a nightmare because I should provide the
clustered service ready with no leeway for further testing and
booting the nodes for me after 17:00 today.
In the old HP MC/SC environment this works smoothly and the
cluster doesn't care at all whether there are still claims from
clients on the NFS exports or dangling locks etc, as it should
be.
In RHCS clurgmrd or the agents seem to be far more capricious in
this respect.


Unfortunately, this kind of reference doc 
https://access.redhat.com/sites/default/files/sap_ha_reference_ar
chitecture.pdf
doesn't address NFS at all, but according to our SAP admins these
are absolutely required in our SAP setup.
What makes it even worse is the fact that the NFS shares should
be exported out of the cluster as well to a myriad of external
SAP servers that aren't cluster nodes.
However, to make testing easier I so far have those three
required NFS shares only exported to the cluster node IPs and the
floating VIP through the nfsexport/nfsclient RAs of standard
RHCS.
But later it should be exported to the whole LAN.
Also the Node that is running the SAP service needs to mount
those shares itself locally, what I thus called "loopback" NFS
mounts in the subject line.



I hope my ordering of resources is in order.
But I cannot imagine how else they should be ordered for it to
work.

For now I let alone SAPDatabase and SAPInstance (which I know are
managed correctly by their respective agents as I tested it prior
without the NFS resources).


This is the starting sequence of lvm resource and its child
resources which contain the crappy NFS stuff:

[root at alsterneu:/etc/cluster]
# rg_test noop /etc/cluster/cluster.conf start lvm vg_san0
Running in test mode.
Starting vg_san0...
[start] lvm:vg_san0
[start] fs:oracle_Z01
[start] fs:oracle_Z01_sapreorg
[start] fs:oracle_Z01_oraarch
[start] fs:oracle_Z01_origlogA
[start] fs:oracle_Z01_origlogB
[start] fs:oracle_Z01_mirrlogA
[start] fs:oracle_Z01_mirrlogB
[start] fs:oracle_Z01_sapdata1
[start] ip:10.25.101.244
[start] fs:export_Z01
[start] nfsexport:exports
[start] nfsclient:client_alster
[start] nfsclient:client_warnow
[start] nfsclient:client_lena
[start] netfs:nfs_Z01
[start] fs:export_sapmnt_Z01
[start] nfsexport:exports
[start] nfsclient:client_alster
[start] nfsclient:client_warnow
[start] nfsclient:client_lena
[start] netfs:nfs_sapmnt_Z01
[start] fs:usr_sap_Z01_DVEBMGS01
[start] fs:export_audit
[start] nfsexport:exports
[start] nfsclient:client_alster
[start] nfsclient:client_warnow
[start] nfsclient:client_lena
[start] netfs:nfs_audit
Start of vg_san0 complete


and this is the stopping sequence (in reverse order, what a
surprise):

[root at alsterneu:/etc/cluster]
# rg_test noop /etc/cluster/cluster.conf stop lvm vg_san0
Running in test mode.
Stopping vg_san0...
[stop] netfs:nfs_audit
[stop] nfsclient:client_lena
[stop] nfsclient:client_warnow
[stop] nfsclient:client_alster
[stop] nfsexport:exports
[stop] fs:export_audit
[stop] fs:usr_sap_Z01_DVEBMGS01
[stop] netfs:nfs_sapmnt_Z01
[stop] nfsclient:client_lena
[stop] nfsclient:client_warnow
[stop] nfsclient:client_alster
[stop] nfsexport:exports
[stop] fs:export_sapmnt_Z01
[stop] netfs:nfs_Z01
[stop] nfsclient:client_lena
[stop] nfsclient:client_warnow
[stop] nfsclient:client_alster
[stop] nfsexport:exports
[stop] fs:export_Z01
[stop] ip:10.25.101.244
[stop] fs:oracle_Z01_sapdata1
[stop] fs:oracle_Z01_mirrlogB
[stop] fs:oracle_Z01_mirrlogA
[stop] fs:oracle_Z01_origlogB
[stop] fs:oracle_Z01_origlogA
[stop] fs:oracle_Z01_oraarch
[stop] fs:oracle_Z01_sapreorg
[stop] fs:oracle_Z01
[stop] lvm:vg_san0
Stop of vg_san0 complete


First make sure the service really is disabled

[root at alsterneu:/etc/cluster]
# clustat -s z01
 Service Name                                   Owner (Last)
State         
 ------- ----                                   ----- ------
-----         
 service:z01                                    (none)
disabled      


and no HA-LVM tag on the shared VG

[root at alsterneu:/etc/cluster]
# vgs -o +tags vg_san0
  VG      #PV #LV #SN Attr   VSize VFree  VG Tags
  vg_san0   5  14   0 wz--n- 2.13T 12.00M        



Now while the start works fine...


[root at alsterneu:/etc/cluster]
# rg_test test /etc/cluster/cluster.conf start lvm vg_san0
Running in test mode.
Starting vg_san0...
  volume_list=["vg_root", "vg_local", "@alstera"]
<info> Starting volume group, vg_san0
<info> I can claim this volume group
  Volume group "vg_san0" successfully changed
<info> New tag "alstera" added to vg_san0
  Internal error: Maps lock 14598144 < unlock 14868480
  14 logical volume(s) in volume group "vg_san0" now active
<info> mounting /dev/mapper/vg_san0-lv_ora_z01 on /oracle/Z01
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_ora_z01 /oracle/Z01
<info> mounting /dev/mapper/vg_san0-lv_ora_z01_sapreorg on
/oracle/Z01/sapreorg
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_ora_z01_sapreorg
/oracle/Z01/sapreorg
<info> mounting /dev/mapper/vg_san0-lv_ora_z01_oraarch on
/oracle/Z01/oraarch
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_ora_z01_oraarch
/oracle/Z01/oraarch
<info> mounting /dev/mapper/vg_san0-lv_ora_z01_origloga on
/oracle/Z01/origlogA
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_ora_z01_origloga
/oracle/Z01/origlogA
<info> mounting /dev/mapper/vg_san0-lv_ora_z01_origlogb on
/oracle/Z01/origlogB
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_ora_z01_origlogb
/oracle/Z01/origlogB
<info> mounting /dev/mapper/vg_san0-lv_ora_z01_mirrloga on
/oracle/Z01/mirrlogA
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_ora_z01_mirrloga
/oracle/Z01/mirrlogA
<info> mounting /dev/mapper/vg_san0-lv_ora_z01_mirrlogb on
/oracle/Z01/mirrlogB
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_ora_z01_mirrlogb
/oracle/Z01/mirrlogB
<info> mounting /dev/mapper/vg_san0-lv_ora_z01_sapdata1 on
/oracle/Z01/sapdata1
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_ora_z01_sapdata1
/oracle/Z01/sapdata1
<debug> Link for bond2: Detected
<info> Adding IPv4 address 10.25.101.244/24 to bond2
<debug> Pinging addr 10.25.101.244 from dev bond2
<debug> Sending gratuitous ARP: 10.25.101.244 00:17:a4:77:d0:c4
brd ff:ff:ff:ff:ff:ff
<info> mounting /dev/mapper/vg_san0-lv_export_z01 on /export/Z01
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_export_z01
/export/Z01
<info> Adding export: 10.25.101.231:/export/Z01
(fsid=110,rw,sync,wdelay,insecure,no_root_squash)
<info> Adding export: 10.25.101.232:/export/Z01
(fsid=110,rw,sync,wdelay,insecure,no_root_squash)
<info> Adding export: 10.25.101.244:/export/Z01
(fsid=110,rw,sync,wdelay,insecure,no_root_squash)
<debug> mount  -o rw,fg,hard,intr 10.25.101.244:/export/Z01 /Z01
<info> mounting /dev/mapper/vg_san0-lv_export_sapmnt_z01 on
/export/sapmnt/Z01
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_export_sapmnt_z01
/export/sapmnt/Z01
<info> Adding export: 10.25.101.231:/export/sapmnt/Z01
(fsid=111,rw,sync,wdelay,insecure,no_root_squash)
<info> Adding export: 10.25.101.232:/export/sapmnt/Z01
(fsid=111,rw,sync,wdelay,insecure,no_root_squash)
<info> Adding export: 10.25.101.244:/export/sapmnt/Z01
(fsid=111,rw,sync,wdelay,insecure,no_root_squash)
<debug> mount  -o rw,fg,hard,intr
10.25.101.244:/export/sapmnt/Z01 /sapmnt/Z01
<info> mounting /dev/mapper/vg_san0-lv_usr_sap_z01_dvebmgs01 on
/usr/sap/Z01/DVEBMGS01
<debug> mount -t ext3
/dev/mapper/vg_san0-lv_usr_sap_z01_dvebmgs01
/usr/sap/Z01/DVEBMGS01
<info> mounting /dev/mapper/vg_san0-lv_export_audit on
/export/audit
<debug> mount -t ext3  /dev/mapper/vg_san0-lv_export_audit
/export/audit
<info> Adding export: 10.25.101.231:/export/audit
(fsid=114,rw,sync,wdelay,insecure,no_root_squash)
<info> Adding export: 10.25.101.232:/export/audit
(fsid=114,rw,sync,wdelay,insecure,no_root_squash)
<info> Adding export: 10.25.101.244:/export/audit
(fsid=114,rw,sync,wdelay,insecure,no_root_squash)
<debug> mount  -o rw,fg,hard,intr 10.25.101.244:/export/audit
/audit
Start of vg_san0 complete


(n.b. verify the mounts are present)


[root at alsterneu:/etc/cluster]
# df -Ph|grep vg_san0
/dev/mapper/vg_san0-lv_ora_z01  7.9G  4.7G  2.9G  63% /oracle/Z01
/dev/mapper/vg_san0-lv_ora_z01_sapreorg  5.0G  158M  4.6G   4%
/oracle/Z01/sapreorg
/dev/mapper/vg_san0-lv_ora_z01_oraarch   20G  1.7G   18G   9%
/oracle/Z01/oraarch
/dev/mapper/vg_san0-lv_ora_z01_origloga 1008M  935M   23M  98%
/oracle/Z01/origlogA
/dev/mapper/vg_san0-lv_ora_z01_origlogb 1008M  935M   23M  98%
/oracle/Z01/origlogB
/dev/mapper/vg_san0-lv_ora_z01_mirrloga 1008M  935M   23M  98%
/oracle/Z01/mirrlogA
/dev/mapper/vg_san0-lv_ora_z01_mirrlogb 1008M  935M   23M  98%
/oracle/Z01/mirrlogB
/dev/mapper/vg_san0-lv_ora_z01_sapdata1  2.1T  1.3T  696G  65%
/oracle/Z01/sapdata1
/dev/mapper/vg_san0-lv_export_z01   37G  177M   35G   1%
/export/Z01
/dev/mapper/vg_san0-lv_export_sapmnt_z01  3.0G 1010M  1.9G  36%
/export/sapmnt/Z01
/dev/mapper/vg_san0-lv_usr_sap_z01_dvebmgs01  3.0G  130M  2.7G
5% /usr/sap/Z01/DVEBMGS01
/dev/mapper/vg_san0-lv_export_audit 1008M   34M  924M   4%
/export/audit


[root at alsterneu:/etc/cluster]
# exportfs 
/export/sapmnt/Z01
                10.25.101.244
/export/sapmnt/Z01
                10.25.101.232
/export/sapmnt/Z01
                10.25.101.231
/export/audit   10.25.101.244
/export/audit   10.25.101.232
/export/audit   10.25.101.231
/oracle/stage   10.25.101.0/24
/export/Z01     10.25.101.244
/export/Z01     10.25.101.232
/export/Z01     10.25.101.231

[root at alsterneu:/etc/cluster]
# df -Ph -t nfs
Filesystem            Size  Used Avail Use% Mounted on
10.25.101.244:/export/Z01   37G  176M   35G   1% /Z01
10.25.101.244:/export/sapmnt/Z01  3.0G 1010M  1.9G  36%
/sapmnt/Z01
10.25.101.244:/export/audit 1008M   34M  924M   4% /audit



...the shutdown fails miserably:


[root at alsterneu:/etc/cluster]
# rg_test test /etc/cluster/cluster.conf stop lvm vg_san0
Running in test mode.
Stopping vg_san0...
<info> unmounting /audit
<info> Removing export: 10.25.101.244:/export/audit
<info> Removing export: 10.25.101.232:/export/audit
<info> Removing export: 10.25.101.231:/export/audit
<info> unmounting /export/audit
<info> unmounting /usr/sap/Z01/DVEBMGS01
<info> unmounting /sapmnt/Z01
<info> Removing export: 10.25.101.244:/export/sapmnt/Z01
<info> Removing export: 10.25.101.232:/export/sapmnt/Z01
<info> Removing export: 10.25.101.231:/export/sapmnt/Z01
<info> unmounting /export/sapmnt/Z01
<info> unmounting /Z01
<info> Removing export: 10.25.101.244:/export/Z01
<info> Removing export: 10.25.101.232:/export/Z01
<info> Removing export: 10.25.101.231:/export/Z01
<info> unmounting /export/Z01
umount: /export/Z01: device is busy
umount: /export/Z01: device is busy
<warning> Dropping node-wide NFS locks
<info> unmounting /export/Z01
umount: /export/Z01: device is busy
umount: /export/Z01: device is busy
<info> unmounting /export/Z01
umount: /export/Z01: device is busy
umount: /export/Z01: device is busy
<info> Asking lockd to drop locks (pid 6607)
<debug> No hosts to notify
<debug> No hosts to notify
<debug> No hosts to notify
<err> 'umount /export/Z01' failed, error=0
<info> Removing IPv4 address 10.25.101.244/24 from bond2
<info> unmounting /oracle/Z01/sapdata1
<info> unmounting /oracle/Z01/mirrlogB
<info> unmounting /oracle/Z01/mirrlogA
<info> unmounting /oracle/Z01/origlogB
<info> unmounting /oracle/Z01/origlogA
<info> unmounting /oracle/Z01/oraarch
<info> unmounting /oracle/Z01/sapreorg
<info> unmounting /oracle/Z01
  volume_list=["vg_root", "vg_local", "@alstera"]
  Can't deactivate volume group "vg_san0" with 1 open logical
volume(s)
<err> Logical volume vg_san0/lv_export_audit failed to shutdown


[root at alsterneu:/etc/cluster]
# df -Ph|grep vg_san0
/dev/mapper/vg_san0-lv_export_z01   37G  177M   35G   1%
/export/Z01
[root at alsterneu:/etc/cluster]
# df -Ph -t nfs
Filesystem            Size  Used Avail Use% Mounted on
[root at alsterneu:/etc/cluster]
# fuser -m /export/Z01
[root at alsterneu:/etc/cluster]
# lsof +D /export/Z01
[root at alsterneu:/etc/cluster]
# umount /export/Z01
umount: /export/Z01: device is busy
umount: /export/Z01: device is busy
[root at alsterneu:/etc/cluster]
# umount -f /export/Z01
umount2: Device or resource busy
umount: /export/Z01: device is busy
umount2: Device or resource busy
umount: /export/Z01: device is busy


How on earth can I discover what is keeping this stale mount
busy?
Neither fuser nor lsof report any procs accessing the filesystem
(see above)
And also showmount shows no NFS client who could have a lock on
it.

[root at alsterneu:/etc/cluster]
# showmount -a|grep -c /export/Z01
0

After an hour or so the busy devices aren't so anymore and
unmountable
So that finally the shared VG can be deactivated and untagged.


[root at alsterneu:/etc/cluster]
# while ! umount -f /export/Z01; do sleep 15;done
umount2: Device or resource busy
umount: /export/Z01: device is busy
umount2: Device or resource busy
umount: /export/Z01: device is busy
umount2: Device or resource busy
umount: /export/Z01: device is busy
umount2: Device or resource busy
umount: /export/Z01: device is busy

...


[root at alsterneu:/etc/cluster]
# umount /export/Z01
[root at alsterneu:/etc/cluster]
# vgs -o +tags vg_san0
  VG      #PV #LV #SN Attr   VSize VFree  VG Tags
  vg_san0   5  14   0 wz--n- 2.13T 12.00M alstera
[root at alsterneu:/etc/cluster]
# vgchange -an vg_san0
  0 logical volume(s) in volume group "vg_san0" now active
[root at alsterneu:/etc/cluster]
# vgchange --deltag alstera vg_san0
  Volume group "vg_san0" successfully changed


I wonder if someone of you has encountered similar problems when
providing NFS exports as clustered resource and hopefully knows
of a workaround.


Many thanks for your patience and kind notice

Ralph





From raju.rajsand at gmail.com  Thu Oct 27 13:07:07 2011
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Thu, 27 Oct 2011 18:37:07 +0530
Subject: [Linux-cluster] Issue with NFS exports and "loopback" NFS
 mounts as RHCS resources on attempted resource release
In-Reply-To: <A789DDB53ED7E94396E842EE2AC9B5FF01432ABD@itdzex101.ITDZ.verwalt-berlin.de>
References: <A789DDB53ED7E94396E842EE2AC9B5FF01432ABD@itdzex101.ITDZ.verwalt-berlin.de>
Message-ID: <CA+YdgaoVtsTd1-s0aPb1sTh8OuGXbJpmY+MdAG=WYWvyAvg=Ow@mail.gmail.com>

Greetings,

On Thu, Oct 27, 2011 at 6:00 PM,  <Ralph.Grothe at itdz-berlin.de> wrote:
> 2-node cluster (active/passive) within a clustered SAP service.
>

IIRC, There is a document for Redhat subscribers (I have not been one)
about configuring HA SAP services in the KB.

Have you checked that out?

Well, I apologise if the above info is wrong.

But I had downloaded it for few partners and subscribers sometime (>
1year) back.

-- 
Regards,

Rajagopal



From Ralph.Grothe at itdz-berlin.de  Thu Oct 27 14:26:18 2011
From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de)
Date: Thu, 27 Oct 2011 16:26:18 +0200
Subject: [Linux-cluster] Issue with NFS exports and "loopback" NFS
	mounts as RHCS resources on attempted resource release
In-Reply-To: <CA+YdgaoVtsTd1-s0aPb1sTh8OuGXbJpmY+MdAG=WYWvyAvg=Ow@mail.gmail.com>
References: <A789DDB53ED7E94396E842EE2AC9B5FF01432ABD@itdzex101.ITDZ.verwalt-berlin.de>
	<CA+YdgaoVtsTd1-s0aPb1sTh8OuGXbJpmY+MdAG=WYWvyAvg=Ow@mail.gmail.com>
Message-ID: <A789DDB53ED7E94396E842EE2AC9B5FF01432ABE@itdzex101.ITDZ.verwalt-berlin.de>

Hi Raj,

yes, I know this SAP on RHCS document and have thouroughly read
it (I myself supplied the link in my initial posting).
But as I already wrote NFS isn't addressed in this document.
Instead it reads on page 32, Section 8.3.3 "NFS Mounted File
Systems", 2nd paragraph:

"It is possible but not recommended or supported to run an NFS
server within the cluster. This
leads to a re-mount scenario in which the cluster node exporting
the NFS filesystems remounts
the same NFS exports. In low memory situations, the NFS server
and client can
negatively impact system stability."

Besides, I don't have a "low memory situation", especially not
now during my setting up and testing, and this box has almost
48GB RAM.
Instead it simply doesn't seem to work at all!
I know I should show them SAP admins this paragraph, and ask them
why they are trying to hold me accountable for not working
features or being deemed inapt to configure a cluster properly
for requested layouts the clustering software (viz. its agents)
isn't fit to deliver.
But this is the usual battle with users and customers, and I
already was coerced into violating cluster software just to make
it work as the customer thinks it ought to work a, couple of
times.


[root at alsterneu:/etc/cluster]
# free -g
             total       used       free     shared    buffers
cached
Mem:            47          4         42          0          0
3
-/+ buffers/cache:          0         46
Swap:           19          0         19






> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Rajagopal Swaminathan
> Sent: Thursday, October 27, 2011 15:07
> To: linux clustering
> Subject: Re: [Linux-cluster] Issue with NFS exports and 
> "loopback" NFS mounts as RHCS resources on attempted resource
release
> 
> Greetings,
> 
> On Thu, Oct 27, 2011 at 6:00 PM,  <Ralph.Grothe at itdz-berlin.de>
wrote:
> > 2-node cluster (active/passive) within a clustered SAP
service.
> >
> 
> IIRC, There is a document for Redhat subscribers (I have not
been one)
> about configuring HA SAP services in the KB.
> 
> Have you checked that out?
> 
> Well, I apologise if the above info is wrong.
> 
> But I had downloaded it for few partners and subscribers
sometime (>
> 1year) back.
> 
> -- 
> Regards,
> 
> Rajagopal
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From raju.rajsand at gmail.com  Thu Oct 27 18:02:34 2011
From: raju.rajsand at gmail.com (Rajagopal Swaminathan)
Date: Thu, 27 Oct 2011 23:32:34 +0530
Subject: [Linux-cluster] Issue with NFS exports and "loopback" NFS
 mounts as RHCS resources on attempted resource release
In-Reply-To: <A789DDB53ED7E94396E842EE2AC9B5FF01432ABE@itdzex101.ITDZ.verwalt-berlin.de>
References: <A789DDB53ED7E94396E842EE2AC9B5FF01432ABD@itdzex101.ITDZ.verwalt-berlin.de>
	<CA+YdgaoVtsTd1-s0aPb1sTh8OuGXbJpmY+MdAG=WYWvyAvg=Ow@mail.gmail.com>
	<A789DDB53ED7E94396E842EE2AC9B5FF01432ABE@itdzex101.ITDZ.verwalt-berlin.de>
Message-ID: <CA+Ydgaq4sy9CmTjD=y_z=iiO5yO3BUDN2jTWMrjViSpND+UBEg@mail.gmail.com>

Greetings,

On Thu, Oct 27, 2011 at 7:56 PM,  <Ralph.Grothe at itdz-berlin.de> wrote:
> Hi Raj,
>
> yes, I know this SAP on RHCS document and have thouroughly read
> it (I myself supplied the link in my initial posting).
> But as I already wrote NFS isn't addressed in this document.
> Instead it reads on page 32, Section 8.3.3 "NFS Mounted File
> Systems", 2nd paragraph:
>
> "It is possible but not recommended or supported to run an NFS
> server within the cluster. This
> leads to a re-mount scenario in which the cluster node exporting
> the NFS filesystems remounts
> the same NFS exports. In low memory situations, the NFS server
> and client can
> negatively impact system stability."
>
> I know I should show them SAP admins this paragraph, and ask them
> why they are trying to hold me accountable for not working
> features or being deemed inapt to configure a cluster properly
> for requested layouts the clustering software (viz. its agents)
> isn't fit to deliver.

I would understand SAP's uppity attitude if they are approached.

> But this is the usual battle with users and customers, and I
> already was coerced into violating cluster software just to make
> it work as the customer thinks it ought to work a, couple of
> times.
>

I can understand that too, been there done that.

I apologise for I neither have a cluster nor have aSAP implementation
in front of me to give you a more usable answer.

But then, IIRC NFS can operate on UDP (connectionless) as well as TCP
(connection-oriented and default)

But I will keep quite on this thread further. Till I hear from the
experts from Redhat or SAP. I get paid from neither.

-- 
Regards,

Rajagopal



From chris.alexander at kusiri.com  Sat Oct 29 13:19:27 2011
From: chris.alexander at kusiri.com (Chris Alexander)
Date: Sat, 29 Oct 2011 14:19:27 +0100
Subject: [Linux-cluster] Corosync node disconnects occasionally
Message-ID: <CAOJZNMdnEhHO-dhtyiHTb+v3yfjKH-2gscGJWmdT4tUJQ9seyQ@mail.gmail.com>

Hi all,

Had a bit of an issue with our test cluster configuration after a network
hardware change (which we unfortunately can't reverse easily) and I wanted
to make sure there isn't any known issues with the Cluster software before
we go though another large hardware swapout.

We changed the switching hardware which the cluster software uses for its
administrative data, and now we are occasionally receiving the following
Corosync notification before the node in question gets fenced:

DISMAN-EVENT-MIB::sysUpTimeInstance = <uptime>, SNMPv2-MIB::snmpTrapOID.0 =
COROSYNC-MIB::corosyncNoticesNodeStatus,
COROSYNC-MIB::corosyncObjectsNodeName.0 = "<nodename>",
COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "<local IP>",
COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "left"

We suspect that the switch is misconfigured or just broken (offsite hardware
hence the difficulties swapping mentioned above) but wanted to check nobody
has had something like this before which might be software related rather
than hardware.

Cheers for any ideas

Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111029/5481c7e5/attachment.htm>

From kozmozg2 at gmail.com  Mon Oct 31 10:06:15 2011
From: kozmozg2 at gmail.com (Misha Mal)
Date: Mon, 31 Oct 2011 05:06:15 -0500
Subject: [Linux-cluster] Linux-cluster Digest, Vol 90, Issue 20
In-Reply-To: <mailman.33.1319817606.12762.linux-cluster@redhat.com>
References: <mailman.33.1319817606.12762.linux-cluster@redhat.com>
Message-ID: <CAKaZsur0-dE8mvKhNJbHUs0us3hpDoRbBcJjATivwLFN7BdN=w@mail.gmail.com>

trash draft

On Fri, Oct 28, 2011 at 11:00 AM, <linux-cluster-request at redhat.com> wrote:

> Send Linux-cluster mailing list submissions to
>        linux-cluster at redhat.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        https://www.redhat.com/mailman/listinfo/linux-cluster
> or, via email, send a message with subject or body 'help' to
>        linux-cluster-request at redhat.com
>
> You can reach the person managing the list at
>        linux-cluster-owner at redhat.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Linux-cluster digest..."
>
>
> Today's Topics:
>
>   1. Re: Issue with NFS exports and "loopback" NFS mounts as RHCS
>      resources on attempted resource release (Rajagopal Swaminathan)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 27 Oct 2011 23:32:34 +0530
> From: Rajagopal Swaminathan <raju.rajsand at gmail.com>
> To: linux clustering <linux-cluster at redhat.com>
> Subject: Re: [Linux-cluster] Issue with NFS exports and "loopback" NFS
>        mounts as RHCS resources on attempted resource release
> Message-ID:
>        <CA+Ydgaq4sy9CmTjD=y_z=iiO5yO3BUDN2jTWMrjViSpND+UBEg at mail.gmail.com
> >
> Content-Type: text/plain; charset=ISO-8859-1
>
> Greetings,
>
> On Thu, Oct 27, 2011 at 7:56 PM,  <Ralph.Grothe at itdz-berlin.de> wrote:
> > Hi Raj,
> >
> > yes, I know this SAP on RHCS document and have thouroughly read
> > it (I myself supplied the link in my initial posting).
> > But as I already wrote NFS isn't addressed in this document.
> > Instead it reads on page 32, Section 8.3.3 "NFS Mounted File
> > Systems", 2nd paragraph:
> >
> > "It is possible but not recommended or supported to run an NFS
> > server within the cluster. This
> > leads to a re-mount scenario in which the cluster node exporting
> > the NFS filesystems remounts
> > the same NFS exports. In low memory situations, the NFS server
> > and client can
> > negatively impact system stability."
> >
> > I know I should show them SAP admins this paragraph, and ask them
> > why they are trying to hold me accountable for not working
> > features or being deemed inapt to configure a cluster properly
> > for requested layouts the clustering software (viz. its agents)
> > isn't fit to deliver.
>
> I would understand SAP's uppity attitude if they are approached.
>
> > But this is the usual battle with users and customers, and I
> > already was coerced into violating cluster software just to make
> > it work as the customer thinks it ought to work a, couple of
> > times.
> >
>
> I can understand that too, been there done that.
>
> I apologise for I neither have a cluster nor have aSAP implementation
> in front of me to give you a more usable answer.
>
> But then, IIRC NFS can operate on UDP (connectionless) as well as TCP
> (connection-oriented and default)
>
> But I will keep quite on this thread further. Till I hear from the
> experts from Redhat or SAP. I get paid from neither.
>
> --
> Regards,
>
> Rajagopal
>
>
>
> ------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> End of Linux-cluster Digest, Vol 90, Issue 20
> *********************************************
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111031/0612d24e/attachment.htm>

From Sagar.Shimpi at tieto.com  Mon Oct 31 10:27:47 2011
From: Sagar.Shimpi at tieto.com (Sagar.Shimpi at tieto.com)
Date: Mon, 31 Oct 2011 12:27:47 +0200
Subject: [Linux-cluster] mysql active-active or Master-Master or Load
	Balancing Cluster
Message-ID: <F9FFD5DEC2163F498E703E7F9CBB3A350F7591A463@EXMB01.eu.tieto.com>

Hi Team,

Can anyone please let me know that weather is it possible to implement MYSQL Load Balancing [i.e. active-active or Master-Master] cluster in Redhat Linux 6.0 using GFS2 as a shared storage ???
Any Step by Step guide will be appreciated.

Regards,

Sagar Shimpi, Senior Technical Specialist, OSS Labs

Tieto
email sagar.shimpi at tieto.com<mailto:aniruddha.khadkikar at tieto.com>,
Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77,
MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in

TIETO. Knowledge. Passion. Results.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20111031/58331234/attachment.htm>

From omerfsen at gmail.com  Mon Oct 31 10:51:17 2011
From: omerfsen at gmail.com (omerfsen at gmail.com)
Date: Mon, 31 Oct 2011 12:51:17 +0200
Subject: [Linux-cluster] mysql active-active or Master-Master or Load
	Balancing Cluster
References: <F9FFD5DEC2163F498E703E7F9CBB3A350F7591A463@EXMB01.eu.tieto.com>
Message-ID: <8272874662.1004532677@gmail.com>

For active active mysql i suggest to use mysql cluster which is a share nothing cluster. Or you can use mysql multi master setup

------- Original message -------
> From:  <Sagar.Shimpi at tieto.com>
> To: linux-cluster at redhat.com
> Sent: 31.10.'11,  12:27
> 
> Hi Team,
> 
> ?
> 
> Can anyone please let me know that weather is it possible to implement MYSQL Load Balancing [i.e. active-active or Master-Master] cluster in Redhat Linux 6.0 using GFS2 as a shared storage ???
> 
> Any Step by Step guide will be appreciated.
> 
> ?
> 
> Regards,
> 
> ?
> 
> Sagar Shimpi, Senior Technical Specialist, OSS Labs
> 
> ?
> 
> Tieto
> 
> email sagar.shimpi at tieto.com,
> 
> Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77,
> 
> MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in
> 
> 
> TIETO. Knowledge. Passion. Results.
> 
> ?




From kkovachev at varna.net  Mon Oct 31 11:04:52 2011
From: kkovachev at varna.net (Kaloyan Kovachev)
Date: Mon, 31 Oct 2011 13:04:52 +0200
Subject: [Linux-cluster] mysql active-active or Master-Master or Load
 Balancing Cluster
In-Reply-To: <F9FFD5DEC2163F498E703E7F9CBB3A350F7591A463@EXMB01.eu.tieto.com>
References: <F9FFD5DEC2163F498E703E7F9CBB3A350F7591A463@EXMB01.eu.tieto.com>
Message-ID: <4660b307ff1b11ccf569b6ca2fe334e2@mx.varna.net>

Hi,

On Mon, 31 Oct 2011 12:27:47 +0200, <Sagar.Shimpi at tieto.com> wrote:
> Hi Team,
> 
> Can anyone please let me know that weather is it possible to implement
> MYSQL Load Balancing [i.e. active-active or Master-Master] cluster in
> Redhat Linux 6.0 using GFS2 as a shared storage ???
> Any Step by Step guide will be appreciated.
> 

yes, it is possible, but i wouldn't recommend it - such setup will be much
slower than using a local (non-clustered) file system and Master/Master
replication provided from MySQL itself



From Sagar.Shimpi at tieto.com  Mon Oct 31 11:07:46 2011
From: Sagar.Shimpi at tieto.com (Sagar.Shimpi at tieto.com)
Date: Mon, 31 Oct 2011 13:07:46 +0200
Subject: [Linux-cluster] mysql active-active or Master-Master or
	Load	Balancing Cluster
In-Reply-To: <8272874662.1004532677@gmail.com>
References: <F9FFD5DEC2163F498E703E7F9CBB3A350F7591A463@EXMB01.eu.tieto.com>
	<8272874662.1004532677@gmail.com>
Message-ID: <F9FFD5DEC2163F498E703E7F9CBB3A350F7591A4DB@EXMB01.eu.tieto.com>

Thanks for the reply. But can you pass me any specific link as I am not able to get the exact configuration.


Regards,

Sagar Shimpi, Senior Technical Specialist, OSS Labs

Tieto
email sagar.shimpi at tieto.com,
Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77, 
MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in 

TIETO. Knowledge. Passion. Results.

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of omerfsen at gmail.com
Sent: Monday, October 31, 2011 4:21 PM
To: linux-cluster at redhat.com
Cc: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] mysql active-active or Master-Master or Load Balancing Cluster

For active active mysql i suggest to use mysql cluster which is a share nothing cluster. Or you can use mysql multi master setup

------- Original message -------
> From:  <Sagar.Shimpi at tieto.com>
> To: linux-cluster at redhat.com
> Sent: 31.10.'11,  12:27
> 
> Hi Team,
> 
> ?
> 
> Can anyone please let me know that weather is it possible to implement MYSQL Load Balancing [i.e. active-active or Master-Master] cluster in Redhat Linux 6.0 using GFS2 as a shared storage ???
> 
> Any Step by Step guide will be appreciated.
> 
> ?
> 
> Regards,
> 
> ?
> 
> Sagar Shimpi, Senior Technical Specialist, OSS Labs
> 
> ?
> 
> Tieto
> 
> email sagar.shimpi at tieto.com,
> 
> Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77,
> 
> MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in
> 
> 
> TIETO. Knowledge. Passion. Results.
> 
> ?


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster