From andrew at beekhof.net Mon Oct 3 00:10:13 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Mon, 3 Oct 2011 11:10:13 +1100 Subject: [Linux-cluster] [Linux-ha-dev] [ha-wg] CFP: HA Mini-Conference in Prague on Oct 25th In-Reply-To: <4E85D87D.4000906@alteeve.com> References: <20110814193045.GP5299@suse.de> <20110927145802.GB3713@suse.de> <4E85D87D.4000906@alteeve.com> Message-ID: On Sat, Oct 1, 2011 at 12:55 AM, Digimer wrote: > On 09/27/2011 07:58 AM, Lars Marowsky-Bree wrote: >> Hi all, >> >> it turns out that there was zero feedback about people wanting to >> present, only some about travel budget being too tight to come. So we >> had some discussions about whether to cancel this completely, as this >> made planning rather difficult. >> >> But just in the last few days, I got a fair share of e-mails asking if >> this still takes place, and who is going to be there. ;-) >> >> So: we have the room. I will be there, and it seems so will at least a >> few other people, including Andrew. I suggest we do it in an >> "unconference" style and draw up the agenda as we go along; you're >> welcome to stop by and discuss HA/clustering topics that are important >> to you. ?It is going to be as successful as we all make it out to be. >> >> We share the venue with LinuxCon Europe: Clarion Congress Hotel ? >> Prague, Czech Republic, on Oct 25th. >> >> I suggest we start at 9:30 in the morning and go from there. >> >> >> Regards, >> ? ? Lars >> > > Is it possible, if this isn't set in stone, to push back to later in the > day? I don't fly in until the 25th, and I think there is one other > person who wants to attend in the same boat. Based on Boston last year, I imagine the conversations will last right up until Lars starts presenting his talk on Friday afternoon. People came and went at random, and if someone essential was missing for a conversation we deferred it until later. Very informal, but it seemed to work ok. From linux at alteeve.com Tue Oct 4 04:49:35 2011 From: linux at alteeve.com (Digimer) Date: Tue, 04 Oct 2011 00:49:35 -0400 Subject: [Linux-cluster] Can you build a cluster without fencing? Message-ID: <4E8A905F.2080503@alteeve.com> Here is the answer; http://www.youtube.com/watch?v=oKI-tD0L18A -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?" From christian.masopust at siemens.com Tue Oct 4 05:48:53 2011 From: christian.masopust at siemens.com (Masopust, Christian) Date: Tue, 4 Oct 2011 07:48:53 +0200 Subject: [Linux-cluster] Can you build a cluster without fencing? In-Reply-To: <4E8A905F.2080503@alteeve.com> References: <4E8A905F.2080503@alteeve.com> Message-ID: > > Here is the answer; > > http://www.youtube.com/watch?v=oKI-tD0L18A > > -- > Digimer > E-Mail: digimer at alteeve.com You've lighted up my day :-))))) From fdinitto at redhat.com Tue Oct 4 06:35:33 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 04 Oct 2011 08:35:33 +0200 Subject: [Linux-cluster] Can you build a cluster without fencing? In-Reply-To: <4E8A905F.2080503@alteeve.com> References: <4E8A905F.2080503@alteeve.com> Message-ID: <4E8AA935.3090202@redhat.com> On 10/04/2011 06:49 AM, Digimer wrote: > Here is the answer; > > http://www.youtube.com/watch?v=oKI-tD0L18A > ROFL... 
finally a bit humor on this mailing list ;) Fabio From mij at irwan.name Tue Oct 4 07:05:48 2011 From: mij at irwan.name (Mohd Irwan Jamaluddin) Date: Tue, 4 Oct 2011 15:05:48 +0800 Subject: [Linux-cluster] Can you build a cluster without fencing? In-Reply-To: References: <4E8A905F.2080503@alteeve.com> Message-ID: On Tue, Oct 4, 2011 at 1:48 PM, Masopust, Christian wrote: > > > > > Here is the answer; > > > > http://www.youtube.com/watch?v=oKI-tD0L18A > > > > -- > > Digimer > > E-Mail: ? ? ? ? ? ? ?digimer at alteeve.com > > You've lighted up my day :-))))) > On a serious note, you can use manual fencing (as if no fencing at all) but it won't be supported by Red Hat. From Mdukhan at nds.com Tue Oct 4 12:05:41 2011 From: Mdukhan at nds.com (Dukhan, Meir) Date: Tue, 4 Oct 2011 14:05:41 +0200 Subject: [Linux-cluster] Killing node XXX because it has rejoined thecluster with existing state Message-ID: <6DAE69EA69F39E4B9DA073B8C848A27C60E7553B35@ILMA1.IL.NDS.COM> Hi Jean-Daniel, Sorry to disappoint you, I'm not replying to your post to linux-cluster: I just have the same problem :( Did you receive any answer or maybe did you solve the problem? Merci beaucoup :) Best Regards, Meir R. Dukhan ________________________________ This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. An NDS Group Limited company. www.nds.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ext.thales.jean-daniel.bonnetot at sncf.fr Tue Oct 4 12:47:23 2011 From: ext.thales.jean-daniel.bonnetot at sncf.fr (BONNETOT Jean-Daniel (EXT THALES)) Date: Tue, 4 Oct 2011 14:47:23 +0200 Subject: [Linux-cluster] Killing node XXX because it has rejoined thecluster with existing state In-Reply-To: <6DAE69EA69F39E4B9DA073B8C848A27C60E7553B35@ILMA1.IL.NDS.COM> References: <6DAE69EA69F39E4B9DA073B8C848A27C60E7553B35@ILMA1.IL.NDS.COM> Message-ID: Hi, No, no answer. It's the good time to ask one more time ;) this is my last email : I have problem with two node cluster. When I force a node to faile, second node fences first one. When first one rejoin my cluster, cman shutdown on both nodes saying : Sep 28 17:29:36 s64lmwbig3c openais[7273]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state Sep 28 17:29:36 s64lmwbig3c openais[7273]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart Logs : See attached Conf : Do you know what I missed ? Thanks Regards, Jean-Daniel BONNETOT De?: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Dukhan, Meir Envoy??: mardi 4 octobre 2011 14:06 ??: linux-cluster at redhat.com Objet?: Re: [Linux-cluster] Killing node XXX because it has rejoined thecluster with existing state Hi Jean-Daniel, Sorry to disappoint you, I?m not replying to your post to linux-cluster: I just have the same problem ? Did you receive any answer or maybe did you solve the problem? Merci beaucoup ? Best Regards, Meir R. Dukhan ________________________________________ This message is confidential and intended only for the addressee. 
If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary. An NDS Group Limited company. www.nds.com ------- Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire. ------- This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it. From hlawatschek at atix.de Tue Oct 4 13:45:42 2011 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Tue, 4 Oct 2011 15:45:42 +0200 (CEST) Subject: [Linux-cluster] Fencing agent for cisco nexus 5k In-Reply-To: <1172525370.1599.1317735544624.JavaMail.root@axgroupware01-1.gallien.atix> Message-ID: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix> Hi, we are currently building up a Cisco Nexus FCoE infrastructure together with Red Hat Clusters. I'd like to use the Nexus 5ks for I/O fencing operations and I'm looking for a fencing agent to be used together with the Nexus 5k. The basic idea would be to disable the network ports of the cluster node inside the Nexus that is supposed to be fenced. Any pointers or ideas? Thanks a lot! Mark -- Mark Hlawatschek ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 | 85716 Unterschleissheim | www.atix.de http://www.linux-subscriptions.com From linux at alteeve.com Tue Oct 4 14:00:08 2011 From: linux at alteeve.com (Digimer) Date: Tue, 04 Oct 2011 10:00:08 -0400 Subject: [Linux-cluster] Fencing agent for cisco nexus 5k In-Reply-To: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix> References: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix> Message-ID: <4E8B1168.2060906@alteeve.com> On 10/04/2011 09:45 AM, Mark Hlawatschek wrote: > Hi, > > we are currently building up a Cisco Nexus FCoE infrastructure together with Red Hat Clusters. > > I'd like to use the Nexus 5ks for I/O fencing operations and I'm looking for a fencing agent to be used together with the Nexus 5k. > The basic idea would be to disable the network ports of the cluster node inside the Nexus that is supposed to be fenced. > > Any pointers or ideas? > > Thanks a lot! > > Mark I don't have experience with that switch/equipment. However, writing a fabric-type fence agent should be pretty straight forward. I assume the device has telnet or ssh access? Install the fence-agents package and then look for the 'fence_*' files. They will make for great examples to base a new agent on. 
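As a very rough sketch only (this is not one of the shipped agents; the option names, host names and switch CLI strings below are illustrative and assume ssh access to the switch), such an agent is a small script that reads name=value options on stdin and flips the node's port:

  #!/bin/bash
  # Illustrative fabric fence agent skeleton (not a real fence_* agent).
  # Options arrive as name=value pairs on stdin, for example:
  #   action=off
  #   ipaddr=nexus5k-a.example.com    (hypothetical switch address)
  #   login=fenceuser                 (hypothetical account)
  #   port=Ethernet1/5                (hypothetical node-facing interface)

  ACTION=off
  while IFS= read -r line; do
      case "$line" in
          action=*) ACTION=${line#action=} ;;
          ipaddr=*) SWITCH=${line#ipaddr=} ;;
          login=*)  LOGIN=${line#login=} ;;
          port=*)   IFACE=${line#port=} ;;
      esac
  done

  cli() {   # run one command line on the switch over ssh
      ssh "$LOGIN@$SWITCH" "$1"
  }

  port_is_down() {
      cli "show interface $IFACE" | grep -qi "administratively down"
  }

  case "$ACTION" in
      off)
          cli "configure terminal ; interface $IFACE ; shutdown"
          port_is_down || exit 1    # only claim success once the port really is cut
          ;;
      on)
          cli "configure terminal ; interface $IFACE ; no shutdown"
          ;;
      status)
          if port_is_down; then exit 2; else exit 0; fi   # 0 = on, 2 = off (see the API page)
          ;;
      *)
          exit 1
          ;;
  esac
  exit 0

A real agent would also handle command-line options, metadata output and a proper password mechanism, which is exactly what the existing fence_* scripts demonstrate.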
The actual API is defined here; https://fedorahosted.org/cluster/wiki/FenceAgentAPI The main task is to ensure disconnection of the node. So after calling the switch, be sure to confirm that the port is logically disconnected before returning a success to the fenced caller. -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?" From fdinitto at redhat.com Tue Oct 4 15:09:43 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 04 Oct 2011 17:09:43 +0200 Subject: [Linux-cluster] Fencing agent for cisco nexus 5k In-Reply-To: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix> References: <2029629951.1608.1317735942643.JavaMail.root@axgroupware01-1.gallien.atix> Message-ID: <4E8B21B7.5030208@redhat.com> On 10/04/2011 03:45 PM, Mark Hlawatschek wrote: > Hi, > > we are currently building up a Cisco Nexus FCoE infrastructure together with Red Hat Clusters. > > I'd like to use the Nexus 5ks for I/O fencing operations and I'm looking for a fencing agent to be used together with the Nexus 5k. > The basic idea would be to disable the network ports of the cluster node inside the Nexus that is supposed to be fenced. > > Any pointers or ideas? > > Thanks a lot! > > Mark > Unless they changed the MIB, you can probably use fence_ifmib. Fabio From cos at aaaaa.org Wed Oct 5 02:23:04 2011 From: cos at aaaaa.org (Ofer Inbar) Date: Tue, 4 Oct 2011 22:23:04 -0400 Subject: [Linux-cluster] service stuck in "recovering", no attempt to restart Message-ID: <20111005022304.GD7753@mip.aaaaa.org> On a 3 node cluster running: cman-2.0.115-34.el5_5.3 rgmanager-2.0.52-6.el5.centos.8 openais-0.80.6-16.el5_5.9 We have a custom resource, "dn", for which I wrote the resource agent. Service has three resources: a virtual IP (using ip.sh), and two dn children. Normally, when one of the dn instances fails its status check, rgmanager stops the service (stops dn_a and dn_b, then stops the IP), then relocates to another node and starts the service there. Several hours ago, one of the dn instances failed its status check, rgmanager stopped it, marked the service "recovering", but then did not seem to try to start it on any node. It just stayed down for hours until logged in to look at it. Until 17:22 today, service was running on node1. 
Here's what it logged: Oct 4 17:22:12 clustnode1 clurgmgrd: [517]: Monitoring Service dn:dn_b > Service Is Not Running Oct 4 17:22:12 clustnode1 clurgmgrd[517]: status on dn "dn_b" returned 1 (generic error) Oct 4 17:22:12 clustnode1 clurgmgrd[517]: Stopping service service:dn Oct 4 17:22:12 clustnode1 clurgmgrd: [517]: Stopping Service dn:dn_b Oct 4 17:22:12 clustnode1 clurgmgrd: [517]: Checking if stopped: check_pid_file /dn/dn_b/dn_b.pid Oct 4 17:22:14 clustnode1 clurgmgrd: [517]: Stopping Service dn:dn_b > Succeed Oct 4 17:22:14 clustnode1 clurgmgrd: [517]: Stopping Service dn:dn_a Oct 4 17:22:15 clustnode1 clurgmgrd: [517]: Checking if stopped: check_pid_file /dn/dn_a/dn_a.pid Oct 4 17:22:17 clustnode1 clurgmgrd: [517]: Stopping Service dn:dn_a > Succeed Oct 4 17:22:17 clustnode1 clurgmgrd: [517]: Removing IPv4 address 10.6.9.136/23 from eth0 Oct 4 17:22:27 clustnode1 clurgmgrd[517]: Service service:dn is recovering At around that time, node2 also logged this: Oct 4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete comm_header_t. Oct 4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete comm_header_t. [Cluster name and node names anonymized with simple search and replace] There are no other log entries in /var/log/messages on any node around that time, that relate to cluster suite. Currently, the service is still "recovering", with cluster status otherwise apparently fine. clustat -x output on all three nodes is identical except for which node has local="1". It looks like this: And cman_tool status shows all three nodes voting and in the quorum: Version: 6.2.0 Config Version: 2 Cluster Name: clustnode Cluster Id: 23048 Cluster Member: Yes Cluster Generation: 12 Membership state: Cluster-Member Nodes: 3 Expected votes: 3 Total votes: 3 Quorum: 2 Active subsystems: 8 Flags: Dirty Ports Bound: 0 177 Node name: clustnode2 Node ID: 2 Multicast addresses: 239.245.0.84 Node addresses: 10.6.8.208 Again, this looks the same on all three nodes. Here's the resource section of cluster.conf (with the values of some of the arguments to my custom resource modified so as not to expose actual username, path, or port number): Any ideas why it might be in this state, where everything is apparently fine except that the service is "recovering" and rgmanager isn't trying to do anything about it and isn't logging any complaints? Attached: strace -fp output of clurgmrgd processes on node1 and node2 -- Cos -------------- next part -------------- Process 517 attached with 4 threads - interrupt to quit [pid 9842] clock_gettime(CLOCK_REALTIME, [pid 1001] clock_gettime(CLOCK_REALTIME, [pid 1000] select(6, [3 5], NULL, NULL, {0, 935000} [pid 517] select(12, [10 11], NULL, NULL, {8, 177000} [pid 9842] <... clock_gettime resumed> {1317781205, 661864000}) = 0 [pid 1001] <... clock_gettime resumed> {1317781205, 661864000}) = 0 [pid 9842] futex(0x432a5cbc, FUTEX_WAIT_PRIVATE, 3573, {7, 357519000} [pid 1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81853, {0, 867658000}) = -1 ETIMEDOUT (Connection timed out) [pid 1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 1001] clock_gettime(CLOCK_REALTIME, {1317781206, 530851000}) = 0 [pid 1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81855, {2, 999711000} [pid 1000] <... 
select resumed> ) = 0 (Timeout) [pid 1000] read(5, 0x428a4f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 1000] select(6, [3 5], NULL, NULL, {2, 2}) = 0 (Timeout) [pid 1000] read(5, 0x428a4f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 1000] select(6, [3 5], NULL, NULL, {2, 2} [pid 1001] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 1001] clock_gettime(CLOCK_REALTIME, {1317781209, 532508000}) = 0 [pid 1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81857, {3, 0} [pid 1000] <... select resumed> ) = 0 (Timeout) [pid 1000] read(5, 0x428a4f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 1000] select(6, [3 5], NULL, NULL, {2, 2} [pid 1001] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 1001] clock_gettime(CLOCK_REALTIME, {1317781212, 534580000}) = 0 [pid 1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81859, {3, 0} [pid 1000] <... select resumed> ) = 0 (Timeout) [pid 1000] read(5, 0x428a4f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 1000] select(6, [3 5], NULL, NULL, {2, 2} [pid 9842] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 9842] futex(0x432a5c90, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 9842] clock_gettime(CLOCK_REALTIME, {1317781213, 21497000}) = 0 [pid 9842] futex(0x432a5cbc, FUTEX_WAIT_PRIVATE, 3575, {10, 0} [pid 517] <... select resumed> ) = 0 (Timeout) [pid 517] socket(PF_FILE, SOCK_STREAM, 0) = 13 [pid 517] connect(13, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0 [pid 517] write(13, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20 [pid 517] read(13, "\1\0\0\0\0\0\0\0\350D9\3\0\0\0\0\0\0\0\0", 20) = 20 [pid 517] close(13) = 0 [pid 517] socket(PF_FILE, SOCK_STREAM, 0) = 13 [pid 517] connect(13, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0 [pid 517] write(13, "\3\0\0\0\0\0\0\0\350D9\3\0\0\0\0\31\0\0\0/cluster/@co"..., 45) = 45 [pid 517] read(13, "\3\0\0\0\0\0\0\0\350D9\3\0\0\0\0\2\0\0\0", 20) = 20 [pid 517] read(13, "2\0", 2) = 2 [pid 517] close(13) = 0 [pid 517] socket(PF_FILE, SOCK_STREAM, 0) = 13 [pid 517] connect(13, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0 [pid 517] write(13, "\2\0\0\0\0\0\0\0\350D9\3\0\0\0\0\0\0\0\0", 20) = 20 [pid 517] read(13, "\2\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0", 20) = 20 [pid 517] close(13) = 0 [pid 517] clone(Process 18772 attached child_stack=0x40ebc240, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x40ebc9c0, tls=0x40ebc930, child_tidptr=0x40ebc9c0) = 18772 [pid 517] select(12, [10 11], NULL, NULL, {10, 0} [pid 18772] set_robust_list(0x40ebc9d0, 0x18) = 0 [pid 18772] rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 TERM], NULL, 8) = 0 [pid 18772] _exit(0) = ? Process 18772 detached [pid 1000] <... select resumed> ) = 0 (Timeout) [pid 1000] read(5, 0x428a4f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 1000] select(6, [3 5], NULL, NULL, {2, 2} [pid 1001] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 1001] clock_gettime(CLOCK_REALTIME, {1317781215, 536718000}) = 0 [pid 1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81861, {3, 0} [pid 1000] <... 
select resumed> ) = 0 (Timeout) [pid 1000] read(5, 0x428a4f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 1000] select(6, [3 5], NULL, NULL, {2, 2} [pid 1001] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 1001] clock_gettime(CLOCK_REALTIME, {1317781218, 538706000}) = 0 [pid 1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81863, {3, 0} [pid 1000] <... select resumed> ) = 0 (Timeout) [pid 1000] read(5, 0x428a4f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 1000] select(6, [3 5], NULL, NULL, {2, 2}) = 0 (Timeout) [pid 1000] read(5, 0x428a4f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 1000] select(6, [3 5], NULL, NULL, {2, 2} [pid 1001] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 1001] clock_gettime(CLOCK_REALTIME, {1317781221, 540821000}) = 0 [pid 1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81865, {3, 0} -------------- next part -------------- Process 28445 attached with 4 threads - interrupt to quit [pid 28962] clock_gettime(CLOCK_REALTIME, [pid 28931] clock_gettime(CLOCK_REALTIME, [pid 28930] select(6, [3 5], NULL, NULL, {1, 725000} [pid 28445] select(12, [10 11], NULL, NULL, {6, 894000} [pid 28962] <... clock_gettime resumed> {1317781260, 477926000}) = 0 [pid 28931] <... clock_gettime resumed> {1317781260, 477926000}) = 0 [pid 28962] futex(0x429dacbc, FUTEX_WAIT_PRIVATE, 24531, {4, 991782000} [pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81869, {2, 613666000} [pid 28930] <... select resumed> ) = 0 (Timeout) [pid 28930] read(5, 0x41587f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 28930] select(6, [3 5], NULL, NULL, {2, 2} [pid 28931] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 28931] clock_gettime(CLOCK_REALTIME, {1317781263, 93684000}) = 0 [pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81871, {3, 0} [pid 28930] <... select resumed> ) = 0 (Timeout) [pid 28930] read(5, 0x41587f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 28930] select(6, [3 5], NULL, NULL, {2, 2} [pid 28962] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 28962] futex(0x429dac90, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 28962] clock_gettime(CLOCK_REALTIME, {1317781265, 471616000}) = 0 [pid 28962] futex(0x429dacbc, FUTEX_WAIT_PRIVATE, 24533, {10, 0} [pid 28931] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 28931] clock_gettime(CLOCK_REALTIME, {1317781266, 95446000}) = 0 [pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81873, {3, 0} [pid 28930] <... select resumed> ) = 0 (Timeout) [pid 28930] read(5, 0x41587f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 28930] select(6, [3 5], NULL, NULL, {2, 2} [pid 28445] <... 
select resumed> ) = 0 (Timeout) [pid 28445] socket(PF_FILE, SOCK_STREAM, 0) = 14 [pid 28445] connect(14, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0 [pid 28445] write(14, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20 [pid 28445] read(14, "\1\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\0\0\0\0", 20) = 20 [pid 28445] close(14) = 0 [pid 28445] socket(PF_FILE, SOCK_STREAM, 0) = 14 [pid 28445] connect(14, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0 [pid 28445] write(14, "\3\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\31\0\0\0/cluster/@co"..., 45) = 45 [pid 28445] read(14, "\3\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\2\0\0\0", 20) = 20 [pid 28445] read(14, "2\0", 2) = 2 [pid 28445] close(14) = 0 [pid 28445] socket(PF_FILE, SOCK_STREAM, 0) = 14 [pid 28445] connect(14, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0 [pid 28445] write(14, "\2\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\0\0\0\0", 20) = 20 [pid 28445] read(14, "\2\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0", 20) = 20 [pid 28445] close(14) = 0 [pid 28445] clone(Process 29968 attached child_stack=0x40705240, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x407059c0, tls=0x40705930, child_tidptr=0x407059c0) = 29968 [pid 28445] select(12, [10 11], NULL, NULL, {10, 0} [pid 29968] set_robust_list(0x407059d0, 0x18) = 0 [pid 29968] rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 TERM], NULL, 8) = 0 [pid 29968] _exit(0) = ? Process 29968 detached [pid 28930] <... select resumed> ) = 0 (Timeout) [pid 28930] read(5, 0x41587f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 28930] select(6, [3 5], NULL, NULL, {2, 2} [pid 28931] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 28931] clock_gettime(CLOCK_REALTIME, {1317781269, 97451000}) = 0 [pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81875, {3, 0} [pid 28930] <... select resumed> ) = 0 (Timeout) [pid 28930] read(5, 0x41587f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 28930] select(6, [3 5], NULL, NULL, {2, 2} [pid 28445] <... 
select resumed> ) = 1 (in [10], left {6, 643000}) [pid 28445] accept(10, 0, NULL) = 14 [pid 28445] fcntl(14, F_GETFD) = 0 [pid 28445] fcntl(14, F_SETFD, FD_CLOEXEC) = 0 [pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0}) [pid 28445] read(14, "\30\0\0\0\4\0\0\0", 8) = 8 [pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0}) [pid 28445] read(14, "\22:\274\0\0\0\0\30\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0", 24) = 24 [pid 28445] clone(Process 29977 attached child_stack=0x40705240, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x407059c0, tls=0x40705930, child_tidptr=0x407059c0) = 29977 [pid 28445] select(12, [10 11], NULL, NULL, {6, 643000} [pid 29977] set_robust_list(0x407059d0, 0x18) = 0 [pid 29977] rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 TERM], NULL, 8) = 0 [pid 29977] select(15, NULL, [14], [14], NULL) = 1 (out [14]) [pid 29977] write(14, "x\0\0\0\4\0\0\0\22:\274\0\0\0\0x\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 [pid 29977] select(15, NULL, [14], [14], NULL) = 1 (out [14]) [pid 29977] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0N\213y\23", 32) = 32 [pid 29977] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0}) [pid 29977] read(14, "\30\0\0\0\4\0\0\0", 8) = 8 [pid 29977] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0}) [pid 29977] read(14, "\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\30\0\0\0", 24) = 24 [pid 29977] close(14) = 0 [pid 29977] _exit(0) = ? Process 29977 detached [pid 28445] <... select resumed> ) = 1 (in [10], left {6, 642000}) [pid 28445] accept(10, 0, NULL) = 14 [pid 28445] fcntl(14, F_GETFD) = 0 [pid 28445] fcntl(14, F_SETFD, FD_CLOEXEC) = 0 [pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0}) [pid 28445] read(14, "\30\0\0\0\4\0\0\0", 8) = 8 [pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0}) [pid 28445] read(14, "\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\0", 24) = 24 [pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14]) [pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\1\0\0\0\0\377\377\377\377", 32) = 32 [pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14]) [pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\2\0\0\0\0\377\377\377\377", 32) = 32 [pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14]) [pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\3\0\0\0\0\377\377\377\377", 32) = 32 [pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14]) [pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377", 32) = 32 [pid 28445] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0}) [pid 28445] read(14, "\30\0\0\0\4\0\0\0", 8) = 8 [pid 28445] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0}) [pid 28445] read(14, "\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\30\0\0\0", 24) = 24 [pid 28445] close(14) = 0 [pid 28445] select(12, [10 11], NULL, NULL, {6, 642000} [pid 28931] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 28931] clock_gettime(CLOCK_REALTIME, {1317781272, 99389000}) = 0 [pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81877, {3, 0} [pid 28930] <... 
select resumed> ) = 0 (Timeout) [pid 28930] read(5, 0x41587f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 28930] select(6, [3 5], NULL, NULL, {2, 2}) = 0 (Timeout) [pid 28930] read(5, 0x41587f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 28930] select(6, [3 5], NULL, NULL, {2, 2} [pid 28931] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 28931] clock_gettime(CLOCK_REALTIME, {1317781275, 101463000}) = 0 [pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81879, {3, 0} [pid 28962] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 28962] futex(0x429dac90, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 28962] clock_gettime(CLOCK_REALTIME, {1317781275, 474705000}) = 0 [pid 28962] futex(0x429dacbc, FUTEX_WAIT_PRIVATE, 24535, {9, 999904000} [pid 28930] <... select resumed> ) = 0 (Timeout) [pid 28930] read(5, 0x41587f6b, 1) = -1 EAGAIN (Resource temporarily unavailable) [pid 28930] select(6, [3 5], NULL, NULL, {2, 2} From cos at aaaaa.org Wed Oct 5 02:43:57 2011 From: cos at aaaaa.org (Ofer Inbar) Date: Tue, 4 Oct 2011 22:43:57 -0400 Subject: [Linux-cluster] service stuck in "recovering", no attempt to restart In-Reply-To: <20111005022304.GD7753@mip.aaaaa.org> References: <20111005022304.GD7753@mip.aaaaa.org> Message-ID: <20111005024357.GL341@mip.aaaaa.org> After collecting all of the information in my previous mailing, I then tried restarting the service using clusvcadm -R, to no avail: | $ sudo clusvcadm -R dn | Local machine trying to restart service:dn... And so it stood for over a minute, with no evidence that it was actually trying to start anything, so I hit ^C. Next, I restarted rgmanager on all three nodes simultaneously, using "sudo service rgmanager restart". When rgmanager came back up, the service was in status "recoverable" and then soon after, it got started successully on node2. So now the service is running, but it's still a complete mystery to me why it never got restarted before, and why I had to restart rgmanager to get it to bring the service up. I also don't know what, if anything, I need to do to prevent this from happening again. [I did try killing processes a few times and observed successful relocations and restarts, so the cluster seems to be in a good state for now...] -- Cos From lhh at redhat.com Wed Oct 5 14:39:05 2011 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 05 Oct 2011 10:39:05 -0400 Subject: [Linux-cluster] service stuck in "recovering", no attempt to restart In-Reply-To: <20111005022304.GD7753@mip.aaaaa.org> References: <20111005022304.GD7753@mip.aaaaa.org> Message-ID: <4E8C6C09.1080806@redhat.com> On 10/04/2011 10:23 PM, Ofer Inbar wrote: > On a 3 node cluster running: > cman-2.0.115-34.el5_5.3 > rgmanager-2.0.52-6.el5.centos.8 > openais-0.80.6-16.el5_5.9 > > We have a custom resource, "dn", for which I wrote the resource agent. > Service has three resources: a virtual IP (using ip.sh), and two dn children. You should be able to disable then re-enable - that is, you shouldn't need to restart rgmanager to break the recovering state. There's this related bug, but it should have been fixed in 2.0.52-6: https://bugzilla.redhat.com/show_bug.cgi?id=530409 > Normally, when one of the dn instances fails its status check, > rgmanager stops the service (stops dn_a and dn_b, then stops the IP), > then relocates to another node and starts the service there. That's what I'd expect to happen. 
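For completeness, the disable/re-enable cycle mentioned above is just the following, using the service name from this thread:

  clusvcadm -d dn                  # disable: clears the stuck "recovering" state
  clusvcadm -e dn                  # enable again and let rgmanager pick a node
  clusvcadm -e dn -m clustnode2    # or enable it on a specific member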
> Several hours ago, one of the dn instances failed its status check, > rgmanager stopped it, marked the service "recovering", but then did > not seem to try to start it on any node. It just stayed down for > hours until logged in to look at it. > > Until 17:22 today, service was running on node1. Here's what it logged: > > Oct 4 17:22:12 clustnode1 clurgmgrd: [517]: Monitoring Service dn:dn_b> Service Is Not Running > Oct 4 17:22:12 clustnode1 clurgmgrd[517]: status on dn "dn_b" returned 1 (generic error) > Oct 4 17:22:12 clustnode1 clurgmgrd[517]: Stopping service service:dn > Oct 4 17:22:12 clustnode1 clurgmgrd: [517]: Stopping Service dn:dn_b > Oct 4 17:22:12 clustnode1 clurgmgrd: [517]: Checking if stopped: check_pid_file /dn/dn_b/dn_b.pid > Oct 4 17:22:14 clustnode1 clurgmgrd: [517]: Stopping Service dn:dn_b> Succeed > Oct 4 17:22:14 clustnode1 clurgmgrd: [517]: Stopping Service dn:dn_a > Oct 4 17:22:15 clustnode1 clurgmgrd: [517]: Checking if stopped: check_pid_file /dn/dn_a/dn_a.pid > Oct 4 17:22:17 clustnode1 clurgmgrd: [517]: Stopping Service dn:dn_a> Succeed > Oct 4 17:22:17 clustnode1 clurgmgrd: [517]: Removing IPv4 address 10.6.9.136/23 from eth0 > Oct 4 17:22:27 clustnode1 clurgmgrd[517]: Service service:dn is recovering > > At around that time, node2 also logged this: > > Oct 4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete comm_header_t. > Oct 4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete comm_header_t. It may be related; I doubt it. > Again, this looks the same on all three nodes. > > Here's the resource section of cluster.conf (with the values of some > of the arguments to my custom resource modified so as not to expose > actual username, path, or port number): > > > > > > > > > > > Any ideas why it might be in this state, where everything is > apparently fine except that the service is "recovering" and rgmanager > isn't trying to do anything about it and isn't logging any complaints? The only cause for this is if we send a message but it either doesn't make it or we get a weird return code -- I think rgmanager logs it, though, so this could be a new issue. > Attached: strace -fp output of clurgmrgd processes on node1 and node2 The strace data is not likely to be useful, but a dump from rgmanager would. If you get in to this state again, do this: kill -USR1 `pidof -s clurgmgrd` Then look at /tmp/rgmanager-dump* (2.0.x) or /var/lib/cluster/rgmanager-dump (3.x.y) -- Lon From robejrm at gmail.com Wed Oct 5 15:01:46 2011 From: robejrm at gmail.com (Juan Ramon Martin Blanco) Date: Wed, 5 Oct 2011 17:01:46 +0200 Subject: [Linux-cluster] service stuck in "recovering", no attempt to restart In-Reply-To: <4E8C6C09.1080806@redhat.com> References: <20111005022304.GD7753@mip.aaaaa.org> <4E8C6C09.1080806@redhat.com> Message-ID: On Wed, Oct 5, 2011 at 4:39 PM, Lon Hohberger wrote: > On 10/04/2011 10:23 PM, Ofer Inbar wrote: >> >> On a 3 node cluster running: >> ? cman-2.0.115-34.el5_5.3 >> ? rgmanager-2.0.52-6.el5.centos.8 >> ? openais-0.80.6-16.el5_5.9 >> >> We have a custom resource, "dn", for which I wrote the resource agent. >> Service has three resources: a virtual IP (using ip.sh), and two dn >> children. > > You should be able to disable then re-enable - that is, you shouldn't need > to restart rgmanager to break the recovering state. 
> > There's this related bug, but it should have been fixed in 2.0.52-6: > > ?https://bugzilla.redhat.com/show_bug.cgi?id=530409 > I have the same problem with version 2.0.52-6 on rhel5, I'll try to get a dump when it happens again (didn't know the USR1 signal thing) # rpm -aq | grep -e rgmanager -e openais -e cman cman-2.0.115-34.el5_5.4 rgmanager-2.0.52-6.el5_5.8 openais-0.80.6-16.el5_5.9 Thanks, Juanra >> Normally, when one of the dn instances fails its status check, >> rgmanager stops the service (stops dn_a and dn_b, then stops the IP), >> then relocates to another node and starts the service there. > > That's what I'd expect to happen. > >> Several hours ago, one of the dn instances failed its status check, >> rgmanager stopped it, marked the service "recovering", but then did >> not seem to try to start it on any node. ?It just stayed down for >> hours until logged in to look at it. >> >> Until 17:22 today, service was running on node1. ?Here's what it logged: >> >> Oct ?4 17:22:12 clustnode1 clurgmgrd: [517]: ?Monitoring Service >> dn:dn_b> ?Service Is Not Running >> Oct ?4 17:22:12 clustnode1 clurgmgrd[517]: ?status on dn "dn_b" >> returned 1 (generic error) >> Oct ?4 17:22:12 clustnode1 clurgmgrd[517]: ?Stopping service >> service:dn >> Oct ?4 17:22:12 clustnode1 clurgmgrd: [517]: ?Stopping Service >> dn:dn_b >> Oct ?4 17:22:12 clustnode1 clurgmgrd: [517]: ?Checking if stopped: >> check_pid_file /dn/dn_b/dn_b.pid >> Oct ?4 17:22:14 clustnode1 clurgmgrd: [517]: ?Stopping Service >> dn:dn_b> ?Succeed >> Oct ?4 17:22:14 clustnode1 clurgmgrd: [517]: ?Stopping Service >> dn:dn_a >> Oct ?4 17:22:15 clustnode1 clurgmgrd: [517]: ?Checking if stopped: >> check_pid_file /dn/dn_a/dn_a.pid >> Oct ?4 17:22:17 clustnode1 clurgmgrd: [517]: ?Stopping Service >> dn:dn_a> ?Succeed >> Oct ?4 17:22:17 clustnode1 clurgmgrd: [517]: ?Removing IPv4 address >> 10.6.9.136/23 from eth0 >> Oct ?4 17:22:27 clustnode1 clurgmgrd[517]: ?Service service:dn is >> recovering >> >> At around that time, node2 also logged this: >> >> Oct ?4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete >> comm_header_t. >> Oct ?4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete >> comm_header_t. > > It may be related; I doubt it. > > >> Again, this looks the same on all three nodes. >> >> Here's the resource section of cluster.conf (with the values of some >> of the arguments to my custom resource modified so as not to expose >> actual username, path, or port number): >> >> >> ? >> ? ? >> ? ? ? > monitoringport="portnum"/> >> ? ? ? > monitoringport="portnum"/> >> ? ? >> ? >> >> >> Any ideas why it might be in this state, where everything is >> apparently fine except that the service is "recovering" and rgmanager >> isn't trying to do anything about it and isn't logging any complaints? > > The only cause for this is if we send a message but it either doesn't make > it or we get a weird return code -- I think rgmanager logs it, though, so > this could be a new issue. > >> Attached: strace -fp output of clurgmrgd processes on node1 and node2 > > The strace data is not likely to be useful, but a dump from rgmanager would. > ?If you get in to this state again, do this: > > ? 
kill -USR1 `pidof -s clurgmgrd` > > Then look at /tmp/rgmanager-dump* (2.0.x) or /var/lib/cluster/rgmanager-dump > (3.x.y) > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From andrew at beekhof.net Thu Oct 6 22:05:38 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Fri, 7 Oct 2011 09:05:38 +1100 Subject: [Linux-cluster] [Linux-HA] [Linux-ha-dev] [ha-wg] CFP: HA Mini-Conference in Prague on Oct 25th In-Reply-To: <20111005145310.GB3724@suse.de> References: <20110814193045.GP5299@suse.de> <20110927145802.GB3713@suse.de> <4E85D87D.4000906@alteeve.com> <20111005145310.GB3724@suse.de> Message-ID: On Thu, Oct 6, 2011 at 1:53 AM, Lars Marowsky-Bree wrote: > On 2011-10-03T11:10:13, Andrew Beekhof wrote: > >> Based on Boston last year, I imagine the conversations will last right >> up until Lars starts presenting his talk on Friday afternoon. >> People came and went at random, and if someone essential was missing >> for a conversation we deferred it until later. > > Oh, then we're going to not stop, ever - because I don't have a talk at > the main conference this time ;-) The schedule has you in a friday afternoon slot iirc. > >> Very informal, but it seemed to work ok. > > yes, and given that the ha mailing lists are still down, probably the > best we can hope for ... indeed From andy.speagle at wichita.edu Fri Oct 7 18:38:32 2011 From: andy.speagle at wichita.edu (Speagle, Andy) Date: Fri, 7 Oct 2011 18:38:32 +0000 Subject: [Linux-cluster] Multiple HA-LVM Resources Message-ID: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu> Hi Team, I'm having an issue with RHCS on RHEL 6.1 ... I have multiple HA-LVM resources in my cluster which are being used by two different service groups. I'm having an issue when I try to start the second service group on the same cluster node running the first service group. I get this immediately. Local machine trying to enable service:...Invalid operation for resource However, I can startup both service groups just fine as long as it's on different nodes in the cluster. I've got logging turned up to debug... but I can't seem to get anything meaningful in the logs regarding this issue. Can someone clue me in? Andy Speagle System & Storage Administrator UCATS - Wichita State University P: 316.978.3869 C: 316.617.2431 -------------- next part -------------- An HTML attachment was scrubbed... URL: From tc3driver at gmail.com Fri Oct 7 19:08:57 2011 From: tc3driver at gmail.com (Bill G.) Date: Fri, 7 Oct 2011 12:08:57 -0700 Subject: [Linux-cluster] Multiple HA-LVM Resources In-Reply-To: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu> References: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu> Message-ID: Hi Andy, What do your failover domains look like? If you have failover domains set up, and that service group is not listed as being able to run on that node, you will get that error message when trying to start a service on that node. Thanks, Bill On Fri, Oct 7, 2011 at 11:38 AM, Speagle, Andy wrote: > Hi Team,**** > > ** ** > > I?m having an issue with RHCS on RHEL 6.1 ? I have multiple HA-LVM > resources in my cluster which are being used by two different service > groups. I?m having an issue when I try to start the second service group on > the same cluster node running the first service group. 
I get this > immediately.**** > > ** ** > > Local machine trying to enable service:...Invalid operation > for resource**** > > ** ** > > However, I can startup both service groups just fine as long as it?s on > different nodes in the cluster. I?ve got logging turned up to debug? but I > can?t seem to get anything meaningful in the logs regarding this issue.*** > * > > ** ** > > Can someone clue me in?**** > > ** ** > > Andy Speagle**** > > System & Storage Administrator**** > > UCATS - Wichita State University**** > > ** ** > > P: 316.978.3869**** > > C: 316.617.2431**** > > ** ** > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Thanks, Bill G. tc3driver at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mmorgan at dca.net Fri Oct 7 19:09:13 2011 From: mmorgan at dca.net (Michael Morgan) Date: Fri, 7 Oct 2011 15:09:13 -0400 Subject: [Linux-cluster] cluster-snmp in EL6 reporting rhcMIBVersion and nothing else Message-ID: <20111007190913.GA27724@staff.dca.net> Hello, I'm having a problem with cluster-snmp output in our SL6.1 clusters with cluster-snmp-0.16.2-10.el6.x86_64 installed. [root at node2 ~]# snmpwalk -m REDHAT-CLUSTER-MIB -v2c -cpublic localhost REDHAT-CLUSTER-MIB::redhatCluster REDHAT-CLUSTER-MIB::rhcMIBVersion.0 = INTEGER: 2 I don't get any other objects and thus can't monitor the cluster through SNMP. Our 5.7 clusters work properly with a matching snmpd config. I have the required "dlmod RedHatCluster /usr/lib64/cluster-snmp/libClusterMonitorSnmp.so" and have made sure it is included in the view. Am I missing something? Thanks in advance. -Mike -- Michael Morgan mmorgan at dca.net From mmorgan at dca.net Fri Oct 7 19:14:54 2011 From: mmorgan at dca.net (Michael Morgan) Date: Fri, 7 Oct 2011 15:14:54 -0400 Subject: [Linux-cluster] Multiple HA-LVM Resources In-Reply-To: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu> References: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu> Message-ID: <20111007191454.GB27724@staff.dca.net> On Fri, Oct 07, 2011 at 06:38:32PM +0000, Speagle, Andy wrote: > Local machine trying to enable service:...Invalid operation > for resource I ran into a similar problem recently. It turned out that I mistakenly had exclusive="1" in the service definition which prevents the service from running on a node that already has any active services. The error is incredibly vague and it took a while before I realized the cause. -Mike -- Michael Morgan mmorgan at dca.net From andy.speagle at wichita.edu Fri Oct 7 19:19:07 2011 From: andy.speagle at wichita.edu (Speagle, Andy) Date: Fri, 7 Oct 2011 19:19:07 +0000 Subject: [Linux-cluster] Multiple HA-LVM Resources In-Reply-To: <20111007191454.GB27724@staff.dca.net> References: <188F4C6C277F4843A5E712D5458E82770215E9@mbxsvc-300.ad.wichita.edu> <20111007191454.GB27724@staff.dca.net> Message-ID: <188F4C6C277F4843A5E712D5458E82770218D5@mbxsvc-300.ad.wichita.edu> > I ran into a similar problem recently. It turned out that I mistakenly had > exclusive="1" in the service definition which prevents the service from > running on a node that already has any active services. The error is > incredibly vague and it took a while before I realized the cause. Ah... that's precisely the problem. I don't even have to look to know that's the issue. Thanks for loaning me your brain. 
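For the archives, a quick way to spot the attribute (the service names in the comment are made up):

  grep -n 'exclusive=' /etc/cluster/cluster.conf
  # e.g.  <service name="svc1" domain="dom1" exclusive="1" recovery="relocate">
  # exclusive="1" marks the service as unwilling to share a node with any
  # other running service; set exclusive="0" (or drop the attribute) if the
  # two service groups are meant to run on the same node.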
-Andy From tserong at suse.com Mon Oct 10 01:46:38 2011 From: tserong at suse.com (Tim Serong) Date: Mon, 10 Oct 2011 12:46:38 +1100 Subject: [Linux-cluster] CFP: High Availability and Distributed Storage miniconf at LCA 2012 Message-ID: <4E924E7E.2090507@suse.com> Hi All, I'm pleased to announce that we will be hosting a one day High Availability and Distributed Storage mini conference on January 16 2012, as part of linux.conf.au in Ballarat, Australia. We would like to invite proposals for presentations to be delivered at the miniconf. Please feel free to forward this CFP to your colleagues and other relevant mailing lists. Suggested topics for presentations include (but are not limited to): - Cluster resource management - Cluster membership/messaging - Clustered filesystems - Distributed storage - SQL and NoSQL databases - Caching layers The CFP is open until November 6. Proposals can be submitted at: http://tinyurl.com/ha-lca2012-cfp If you have any questions, please feel free to contact me directly (please do not group reply to this announcement). Note that as this miniconf is part of linux.conf.au, you will need to register to attend the main conference (http://linux.conf.au/register/prices). Unfortunately we have no sponsorship budget for speakers, so presenting at the miniconf does not entitle you to discounted or free registration. If you need help convincing your employer to fund your travel, please see http://linux.conf.au/register/business-case - this is *the* Linux conference to be at in the southern hemisphere! Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tserong at suse.com From ext.thales.jean-daniel.bonnetot at sncf.fr Wed Oct 12 14:17:41 2011 From: ext.thales.jean-daniel.bonnetot at sncf.fr (BONNETOT Jean-Daniel (EXT THALES)) Date: Wed, 12 Oct 2011 16:17:41 +0200 Subject: [Linux-cluster] NTP sync cause CNAM shutdown Message-ID: Hi, I post previous email asking what was wrong in my two nodes cluster.conf. I think I found it and have some question. The problem was two nodes boot, join then cman shutdown with : Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours, my timzone is GMT + 2). My questions are: Which date do you set up in your bios (GMT, your time zone)? Do you use ntpd ? all documentations say to use it. What are best practices about ntp and RHCS? Jean-Daniel BONNETOT ------- Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire. ------- This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. 
If you are not the intended recipient of this message, please notify the sender immediately and delete it. From sdake at redhat.com Wed Oct 12 15:43:05 2011 From: sdake at redhat.com (Steven Dake) Date: Wed, 12 Oct 2011 08:43:05 -0700 Subject: [Linux-cluster] NTP sync cause CNAM shutdown In-Reply-To: References: Message-ID: <4E95B589.2060302@redhat.com> On 10/12/2011 07:17 AM, BONNETOT Jean-Daniel (EXT THALES) wrote: > Hi, > > I post previous email asking what was wrong in my two nodes > cluster.conf. I think I found it and have some question. > > The problem was two nodes boot, join then cman shutdown with : > Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node > s64lmwbig3b because it has rejoined the cluster with existing state > Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1 > because we rejoined the cluster without a full restart > > Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours, > my timzone is GMT + 2). > > My questions are: > Which date do you set up in your bios (GMT, your time zone)? > Do you use ntpd ? all documentations say to use it. > What are best practices about ntp and RHCS? > > Jean-Daniel BONNETOT > > ------- > Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire. > ------- > This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it. > https://bugzilla.redhat.com/show_bug.cgi?id=738468 RHEL6 does not have this problem. Regards -steve > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From alvaro.fernandez at sivsa.com Wed Oct 12 15:52:11 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Wed, 12 Oct 2011 17:52:11 +0200 Subject: [Linux-cluster] NTP sync cause CNAM shutdown References: Message-ID: <607D6181D9919041BE792D70EF2AEC4801DDD3F4@LIMENS.sivsa.int> Jean, I too suffered the same issue, opened a case with support, etc. The best option running ntpd and RHCS are: -First, start the cman, rgmanager, etc. (I mean, all the RHCS daemons) always after ntpd startup. In RHEL5 at least the default is the other way around. You can do that if you disable all RHCS daemons (via chkconfig off) from automatic startup, and then, starting them explicitly via your rc.local init script, as the last init sequence action (ie, after the network, basic systems, and most importantly after ntpd initially adjusted the clock, via it's "ntpdate" call. Be aware that if you do the above, you must explicitly (manually) stop them if you need to shutdown the cluster or the nodes, as with this hack, the init scripts of cman, rgmanager, etc , won't run for the "kill"/shutdown sequence. 
-Start the ntpd using the "slew" mode ( -x startup flag), in the configuration file. Running it in slew mode makes ntpd adjust the time over a large time span, enough to assure that CMAN internal timings won't get messed. Using that hack was Ok for me, no more node evictions or unexpected problems since. There is a FAQ and best practices document in Redhat Network for NTPD and RHCS, updated few months ago as I recall. Just search for it in the Redhat Network website (sorry, I don't have the link for the DOC at the moment) regards, ?lvaro Fern?ndez Departamento de Sistemas_ ------- Hi, I post previous email asking what was wrong in my two nodes cluster.conf. I think I found it and have some question. The problem was two nodes boot, join then cman shutdown with : Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours, my timzone is GMT + 2). My questions are: Which date do you set up in your bios (GMT, your time zone)? Do you use ntpd ? all documentations say to use it. What are best practices about ntp and RHCS? Jean-Daniel BONNETOT From ext.thales.jean-daniel.bonnetot at sncf.fr Thu Oct 13 16:14:45 2011 From: ext.thales.jean-daniel.bonnetot at sncf.fr (BONNETOT Jean-Daniel (EXT THALES)) Date: Thu, 13 Oct 2011 18:14:45 +0200 Subject: [Linux-cluster] NTP sync cause CNAM shutdown In-Reply-To: <607D6181D9919041BE792D70EF2AEC4801DDD3F4@LIMENS.sivsa.int> References: <607D6181D9919041BE792D70EF2AEC4801DDD3F4@LIMENS.sivsa.int> Message-ID: Thanks for your answer, it help me to find my way ;) I saw "-x" option fot ntpd, but it's not the only things to apply. First, I had to solve my timezone problem. -> Hwclock set on GMT int BIOS (UTC if you prefer) -> timezone --utc Europe/Paris in kickstart, or set ZONE="Europe/Paris" and UTC=true in /etc/sysconfig/clock This two settings make my time boot kernel in the right place, kernel get time from hwclock and know that it has to apply my timezone over it. Then, I add "-x" option in /etc/syscinfig/ntp to say ntpd to not make big step. As a result, boot time before: Oct 13 12:02:20 s64lmwbig3b ntpd[7996]: ntpd 4.2.2p1 at 1.1570-o Thu Nov 26 11:34:34 UTC 2009 (1) Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: precision = 1.000 usec Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: Listening on interface wildcard, 0.0.0.0#123 Disabled ... Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: Listening on interface bond0, 10.151.231.215#123 Enabled <== 2H TIME JUMP Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] The token was lost in the OPERATIONAL state. Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] entering GATHER state from 2. => CMAN crashed Boot time now: Oct 13 16:10:08 s64lmwbig3b clvmd: Cluster LVM daemon started - connected to CMAN ... Oct 13 16:10:27 s64lmwbig3b ntpdate[7971]: step time server 10.151.156.87 offset 1.306150 sec <== 1S TIME JUMP Oct 13 16:10:29 s64lmwbig3b ntpd[7975]: ntpd 4.2.2p1 at 1.1570-o Thu Nov 26 11:34:34 UTC 2009 (1) Oct 13 16:10:29 s64lmwbig3b ntpd[7976]: precision = 1.000 usec ... 
Oct 13 16:10:40 s64lmwbig3b modclusterd: startup succeeded => CMAN up and running I looked for the FAQ you talked about but nothing, if you can post it when you have time ;) Jean-Daniel BONNETOT -----Message d'origine----- De?: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Alvaro Jose Fernandez Envoy??: mercredi 12 octobre 2011 17:52 ??: linux clustering Objet?: Re: [Linux-cluster] NTP sync cause CNAM shutdown Jean, I too suffered the same issue, opened a case with support, etc. The best option running ntpd and RHCS are: -First, start the cman, rgmanager, etc. (I mean, all the RHCS daemons) always after ntpd startup. In RHEL5 at least the default is the other way around. You can do that if you disable all RHCS daemons (via chkconfig off) from automatic startup, and then, starting them explicitly via your rc.local init script, as the last init sequence action (ie, after the network, basic systems, and most importantly after ntpd initially adjusted the clock, via it's "ntpdate" call. Be aware that if you do the above, you must explicitly (manually) stop them if you need to shutdown the cluster or the nodes, as with this hack, the init scripts of cman, rgmanager, etc , won't run for the "kill"/shutdown sequence. -Start the ntpd using the "slew" mode ( -x startup flag), in the configuration file. Running it in slew mode makes ntpd adjust the time over a large time span, enough to assure that CMAN internal timings won't get messed. Using that hack was Ok for me, no more node evictions or unexpected problems since. There is a FAQ and best practices document in Redhat Network for NTPD and RHCS, updated few months ago as I recall. Just search for it in the Redhat Network website (sorry, I don't have the link for the DOC at the moment) regards, ?lvaro Fern?ndez Departamento de Sistemas_ ------- Hi, I post previous email asking what was wrong in my two nodes cluster.conf. I think I found it and have some question. The problem was two nodes boot, join then cman shutdown with : Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours, my timzone is GMT + 2). My questions are: Which date do you set up in your bios (GMT, your time zone)? Do you use ntpd ? all documentations say to use it. What are best practices about ntp and RHCS? Jean-Daniel BONNETOT -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster ------- Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire. ------- This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. 
-------
Hi,
I posted a previous email asking what was wrong in my two-node cluster.conf. I think I found it and have some questions. The problem was that the two nodes boot and join, then cman shuts down with:
Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state
Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart
A few seconds before, ntpd synced and jumped forward by 7200 sec (2 hours, my timezone is GMT + 2).
My questions are:
Which date do you set up in your BIOS (GMT, or your time zone)?
Do you use ntpd? All the documentation says to use it.
What are best practices about NTP and RHCS?
Jean-Daniel BONNETOT

-- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster

------- This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it.

From alvaro.fernandez at sivsa.com Thu Oct 13 16:58:42 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Thu, 13 Oct 2011 18:58:42 +0200 Subject: [Linux-cluster] NTP sync cause CNAM shutdown References: <607D6181D9919041BE792D70EF2AEC4801DDD3F4@LIMENS.sivsa.int> Message-ID: <607D6181D9919041BE792D70EF2AEC4801DDD4A9@LIMENS.sivsa.int>

Hi Jean,
The DOC is https://access.redhat.com/kb/docs/DOC-42471 . But, as Steven Drake said in a previous email, if you *can* upgrade to RHEL6, that would surely be the best option (I just cannot upgrade my customer; he will die on RHEL 5.x). In RHEL6 the cluster daemons are different and use a different API, unlike openais.
Best regards.
Álvaro Fernández
Departamento de Sistemas_
________________________________
SIVSA, Soluciones Informáticas S.A. Arenal nº 18 - 3ª Planta - 36201 - Vigo Teléfono: (+34) 986 092 100 Fax: (+34) 986 092 219 e-mail: alvaro.fernandez at sivsa.com www.sivsa.com España_

-----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On behalf of BONNETOT Jean-Daniel (EXT THALES) Sent: Thursday, 13 October 2011 18:15 To: linux clustering Subject: Re: [Linux-cluster] NTP sync cause CNAM shutdown

Thanks for your answer, it helped me find my way ;)
I saw the "-x" option for ntpd, but it's not the only thing to apply. First, I had to solve my timezone problem.
-> Hwclock set on GMT int BIOS (UTC if you prefer) timezone --utc -> Europe/Paris in kickstart, or set ZONE="Europe/Paris" and UTC=true in -> /etc/sysconfig/clock This two settings make my time boot kernel in the right place, kernel get time from hwclock and know that it has to apply my timezone over it. Then, I add "-x" option in /etc/syscinfig/ntp to say ntpd to not make big step. As a result, boot time before: Oct 13 12:02:20 s64lmwbig3b ntpd[7996]: ntpd 4.2.2p1 at 1.1570-o Thu Nov 26 11:34:34 UTC 2009 (1) Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: precision = 1.000 usec Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: Listening on interface wildcard, 0.0.0.0#123 Disabled ... Oct 13 12:02:20 s64lmwbig3b ntpd[7997]: Listening on interface bond0, 10.151.231.215#123 Enabled <== 2H TIME JUMP Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] The token was lost in the OPERATIONAL state. Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes). Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Oct 13 14:02:31 s64lmwbig3b openais[7701]: [TOTEM] entering GATHER state from 2. => CMAN crashed Boot time now: Oct 13 16:10:08 s64lmwbig3b clvmd: Cluster LVM daemon started - connected to CMAN ... Oct 13 16:10:27 s64lmwbig3b ntpdate[7971]: step time server 10.151.156.87 offset 1.306150 sec <== 1S TIME JUMP Oct 13 16:10:29 s64lmwbig3b ntpd[7975]: ntpd 4.2.2p1 at 1.1570-o Thu Nov 26 11:34:34 UTC 2009 (1) Oct 13 16:10:29 s64lmwbig3b ntpd[7976]: precision = 1.000 usec ... Oct 13 16:10:40 s64lmwbig3b modclusterd: startup succeeded => CMAN up and running I looked for the FAQ you talked about but nothing, if you can post it when you have time ;) Jean-Daniel BONNETOT -----Message d'origine----- De?: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Alvaro Jose Fernandez Envoy??: mercredi 12 octobre 2011 17:52 ??: linux clustering Objet?: Re: [Linux-cluster] NTP sync cause CNAM shutdown Jean, I too suffered the same issue, opened a case with support, etc. The best option running ntpd and RHCS are: -First, start the cman, rgmanager, etc. (I mean, all the RHCS daemons) always after ntpd startup. In RHEL5 at least the default is the other way around. You can do that if you disable all RHCS daemons (via chkconfig off) from automatic startup, and then, starting them explicitly via your rc.local init script, as the last init sequence action (ie, after the network, basic systems, and most importantly after ntpd initially adjusted the clock, via it's "ntpdate" call. Be aware that if you do the above, you must explicitly (manually) stop them if you need to shutdown the cluster or the nodes, as with this hack, the init scripts of cman, rgmanager, etc , won't run for the "kill"/shutdown sequence. -Start the ntpd using the "slew" mode ( -x startup flag), in the configuration file. Running it in slew mode makes ntpd adjust the time over a large time span, enough to assure that CMAN internal timings won't get messed. Using that hack was Ok for me, no more node evictions or unexpected problems since. There is a FAQ and best practices document in Redhat Network for NTPD and RHCS, updated few months ago as I recall. Just search for it in the Redhat Network website (sorry, I don't have the link for the DOC at the moment) regards, ?lvaro Fern?ndez Departamento de Sistemas_ ------- Hi, I post previous email asking what was wrong in my two nodes cluster.conf. 
I think I found it and have some question. The problem was two nodes boot, join then cman shutdown with : Oct 12 15:55:30 s64lmwbig3c openais[7672]: [MAIN ] Killing node s64lmwbig3b because it has rejoined the cluster with existing state Oct 12 15:55:30 s64lmwbig3c openais[7672]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart Few seconds before, ntpd sync and jump forward with 7200 sec (2 hours, my timzone is GMT + 2). My questions are: Which date do you set up in your bios (GMT, your time zone)? Do you use ntpd ? all documentations say to use it. What are best practices about ntp and RHCS? Jean-Daniel BONNETOT -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster ------- Ce message et toutes les pi?ces jointes sont ?tablis ? l'intention exclusive de ses destinataires et sont confidentiels. L'int?grit? de ce message n'?tant pas assur?e sur Internet, la SNCF ne peut ?tre tenue responsable des alt?rations qui pourraient se produire sur son contenu. Toute publication, utilisation, reproduction, ou diffusion, m?me partielle, non autoris?e pr?alablement par la SNCF, est strictement interdite. Si vous n'?tes pas le destinataire de ce message, merci d'en avertir imm?diatement l'exp?diteur et de le d?truire. ------- This message and any attachments are intended solely for the addressees and are confidential. SNCF may not be held responsible for their contents whose accuracy and completeness cannot be guaranteed over the Internet. Unauthorized use, disclosure, distribution, copying, or any part thereof is strictly prohibited. If you are not the intended recipient of this message, please notify the sender immediately and delete it. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From daniele at retaggio.net Thu Oct 13 22:17:58 2011 From: daniele at retaggio.net (Daniele Palumbo) Date: Fri, 14 Oct 2011 00:17:58 +0200 Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open attribute Message-ID: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> hi, first of all, sorry for the long subject... i do not know how to explain myself, so any faq/manual will be of course appreciated. now, i have a test cluster, gentoo based. cluster 3.1.7, corosync 1.4.2, lvm2 2.02.88. clvm is built with following flags: ./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --enable-readline --disable-selinux --enable-pkgconfig --with-confdir=/etc --sbindir=/sbin --with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64 --enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d/ --enable-dmeventd --enable-cmdlib --enable-applib --enable-fsadm --enable-static_link --with-mirrors=internal --with-snapshots=internal --with-lvm1=internal --with-cluster=internal --enable-cmirrord --with-clvmd=cman --with-pool=internal --with-dmeventd-path=/sbin/dmeventd CLDFLAGS=-Wl,-O1 -Wl,--as-needed cluster.conf: now, i have the cluster working, i can see and create volumes and so on. but when i mount a volume on node pvsrv07, i cannot see it as open in pvsrv08. also, clvmd (running in debug mode) does not show me anything when i mount or unmount volumes. any hints? thanks Daniele From fdinitto at redhat.com Fri Oct 14 03:56:21 2011 From: fdinitto at redhat.com (Fabio M. 
Di Nitto) Date: Fri, 14 Oct 2011 05:56:21 +0200 Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open attribute In-Reply-To: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> Message-ID: <4E97B2E5.5000509@redhat.com> On 10/14/2011 12:17 AM, Daniele Palumbo wrote: > hi, > > first of all, sorry for the long subject... > i do not know how to explain myself, so any faq/manual will be of course appreciated. > > now, > i have a test cluster, gentoo based. > cluster 3.1.7, corosync 1.4.2, lvm2 2.02.88. > clvm is built with following flags: > ./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --enable-readline --disable-selinux --enable-pkgconfig --with-confdir=/etc --sbindir=/sbin --with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64 --enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d/ --enable-dmeventd --enable-cmdlib --enable-applib --enable-fsadm --enable-static_link --with-mirrors=internal --with-snapshots=internal --with-lvm1=internal --with-cluster=internal --enable-cmirrord --with-clvmd=cman --with-pool=internal --with-dmeventd-path=/sbin/dmeventd CLDFLAGS=-Wl,-O1 -Wl,--as-needed > > cluster.conf: > > > > > > > > > > > > > now, i have the cluster working, i can see and create volumes and so on. > > but when i mount a volume on node pvsrv07, i cannot see it as open in pvsrv08. > also, clvmd (running in debug mode) does not show me anything when i mount or unmount volumes. > > any hints? What kind of shared storage are you using? Filesystem on top of lvm? Fabio From Sagar.Shimpi at tieto.com Fri Oct 14 08:17:04 2011 From: Sagar.Shimpi at tieto.com (Sagar.Shimpi at tieto.com) Date: Fri, 14 Oct 2011 11:17:04 +0300 Subject: [Linux-cluster] Apache active-active cluster Message-ID: Hi, Can I configure Apache Active -Active cluster using Redhat Cluster Suit in RHEL6? If yes, can someone please pass me the link for the same. Regards, Sagar Shimpi, Senior Technical Specialist, OSS Labs Tieto email sagar.shimpi at tieto.com, Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77, MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in TIETO. Knowledge. Passion. Results. -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniele at retaggio.net Fri Oct 14 09:07:00 2011 From: daniele at retaggio.net (Daniele Palumbo) Date: Fri, 14 Oct 2011 11:07:00 +0200 Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open attribute In-Reply-To: <4E97B2E5.5000509@redhat.com> References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> <4E97B2E5.5000509@redhat.com> Message-ID: <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net> Il giorno 14/ott/2011, alle ore 05.56, Fabio M. Di Nitto ha scritto: > What kind of shared storage are you using? Filesystem on top of lvm? i am using 2 local disk, exported via vblade (i have an aoe storage, this is the test environment). on top of that right does not matter for me which fs, cause it will be used also by xen hvm (let say also ntfs). anyway, tested with ext3 and reiserfs. 
(but i think i will get a RTFM as you write down your questions ;( ) pvsrv07 ~ # pvs PV VG Fmt Attr PSize PFree /dev/etherd/e8.0 vgPvSrv08 lvm2 a-- 3.90g 3.90g /dev/sdb1 vgPvSrv07 lvm2 a-- 3.90g 3.80g pvsrv07 ~ # vgs VG #PV #LV #SN Attr VSize VFree vgPvSrv07 1 1 0 wz--nc 3.90g 3.80g vgPvSrv08 1 0 0 wz--nc 3.90g 3.90g pvsrv07 ~ # lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert test07 vgPvSrv07 -wi-ao 100.00m pvsrv07 ~ # pvsrv07 ~ # mount|grep test07 /dev/mapper/vgPvSrv07-test07 on /mnt type reiserfs (rw) pvsrv07 ~ # pvsrv08 ~ # pvs PV VG Fmt Attr PSize PFree /dev/etherd/e7.0 vgPvSrv07 lvm2 a-- 3.90g 3.80g /dev/sdb1 vgPvSrv08 lvm2 a-- 3.90g 3.90g pvsrv08 ~ # vgs VG #PV #LV #SN Attr VSize VFree vgPvSrv07 1 1 0 wz--nc 3.90g 3.80g vgPvSrv08 1 0 0 wz--nc 3.90g 3.90g pvsrv08 ~ # lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert test07 vgPvSrv07 -wi-a- 100.00m pvsrv08 ~ # the tricky part is in lvs, i can see an open volume in pvsrv07 and not in pvsrv08. pvsrv07 ~ # cman_tool status Version: 6.2.0 Config Version: 2 Cluster Name: CEMCluster Cluster Id: 7604 Cluster Member: Yes Cluster Generation: 240 Membership state: Cluster-Member Nodes: 2 Expected votes: 2 Total votes: 2 Node votes: 1 Quorum: 2 Active subsystems: 8 Flags: Ports Bound: 0 11 Node name: pvsrv07 Node ID: 2 Multicast addresses: 239.192.29.209 Node addresses: 192.168.1.107 pvsrv07 ~ # pvsrv08 ~ # cman_tool status Version: 6.2.0 Config Version: 2 Cluster Name: CEMCluster Cluster Id: 7604 Cluster Member: Yes Cluster Generation: 240 Membership state: Cluster-Member Nodes: 2 Expected votes: 2 Total votes: 2 Node votes: 1 Quorum: 2 Active subsystems: 8 Flags: Ports Bound: 0 11 Node name: pvsrv08 Node ID: 3 Multicast addresses: 239.192.29.209 Node addresses: 192.168.1.108 pvsrv08 ~ # From fdinitto at redhat.com Fri Oct 14 12:01:35 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 14 Oct 2011 14:01:35 +0200 Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open attribute In-Reply-To: <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net> References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> <4E97B2E5.5000509@redhat.com> <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net> Message-ID: <4E98249F.2060302@redhat.com> On 10/14/2011 11:07 AM, Daniele Palumbo wrote: > Il giorno 14/ott/2011, alle ore 05.56, Fabio M. Di Nitto ha scritto: >> What kind of shared storage are you using? Filesystem on top of lvm? > > i am using 2 local disk, exported via vblade (i have an aoe storage, this is the test environment). > > on top of that right does not matter for me which fs, cause it will be used also by xen hvm (let say also ntfs). > anyway, tested with ext3 and reiserfs. > (but i think i will get a RTFM as you write down your questions ;( ) Well yes you have a lot of RTFM to do. This setup looks very wrong and there is a lot of work on the storage side you need to do. I am not even sure where to start, but a few simple points: 1) you need to use proper shared storage. AOE is fine, but use the real one. not 2 local disks exported, because that's never going to work. 2) if you want to use clvmd, all nodes *must* see the same storage 3) you need a cluster filesystem such as GFS2. ext3 and reiserfs are not cluster fs. If you mounted any of those on both nodes, I strongly recommend recreateing the fs and restore data from a backup. 
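For what it's worth, the cluster-aware variant of that stack looks roughly like this once every node sees the same AoE device (a sketch only -- volume, filesystem and mount names are made up, -j is one journal per node, and the -t value must match the cluster name in cluster.conf, which the cman_tool output above shows as "CEMCluster"):

vgcreate -cy vgShared /dev/etherd/e7.0      # -c y marks the VG as clustered, managed via clvmd
lvcreate -L 10G -n lvVMs vgShared
mkfs.gfs2 -p lock_dlm -t CEMCluster:vmstore -j 3 /dev/vgShared/lvVMs
mount -t gfs2 /dev/vgShared/lvVMs /mnt      # with gfs2 this is safe on all nodes at once

Plain ext3/reiserfs volumes, by contrast, must only ever be mounted on one node at a time.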
Fabio From daniele at retaggio.net Fri Oct 14 12:38:25 2011 From: daniele at retaggio.net (Daniele Palumbo) Date: Fri, 14 Oct 2011 14:38:25 +0200 Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open attribute In-Reply-To: <4E98249F.2060302@redhat.com> References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> <4E97B2E5.5000509@redhat.com> <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net> <4E98249F.2060302@redhat.com> Message-ID: <714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net> Il giorno 14/ott/2011, alle ore 14.01, Fabio M. Di Nitto ha scritto: > This setup looks very wrong and there is a lot of work on the storage > side you need to do. > > I am not even sure where to start, but a few simple points: > > 1) you need to use proper shared storage. AOE is fine, but use the real > one. not 2 local disks exported, because that's never going to work. why not? > 2) if you want to use clvmd, all nodes *must* see the same storage that is :) different name (in pvs) but same storage. anyway, i will add a third machine and setup the storage over there, then i will setup 3 machines in cluster that see the same device name. can that help? > 3) you need a cluster filesystem such as GFS2. ext3 and reiserfs are not > cluster fs. If you mounted any of those on both nodes, I strongly > recommend recreateing the fs and restore data from a backup. yes but that could be a problem... cause i need to have (at least) a volume for each virtual machine. or am i missing something like do a cluster filesystem and on top of that re-create the vg? but anyway, i just need to see the filesystem as open... do i need a clustered one to see the correct open attribute with lvs? how does the open attribute is set? thanks a lot! d. From fdinitto at redhat.com Fri Oct 14 13:08:37 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 14 Oct 2011 15:08:37 +0200 Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open attribute In-Reply-To: <714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net> References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> <4E97B2E5.5000509@redhat.com> <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net> <4E98249F.2060302@redhat.com> <714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net> Message-ID: <4E983455.1000205@redhat.com> On 10/14/2011 02:38 PM, Daniele Palumbo wrote: > Il giorno 14/ott/2011, alle ore 14.01, Fabio M. Di Nitto ha scritto: >> This setup looks very wrong and there is a lot of work on the storage >> side you need to do. >> >> I am not even sure where to start, but a few simple points: >> >> 1) you need to use proper shared storage. AOE is fine, but use the real >> one. not 2 local disks exported, because that's never going to work. > > why not? this is the kind of answer that boils down to RTFM..... more seriously, the technical explanation is very long and complex. > >> 2) if you want to use clvmd, all nodes *must* see the same storage > > that is :) > different name (in pvs) but same storage. > anyway, i will add a third machine and setup the storage over there, then i will setup 3 machines in cluster that see the same device name. > > can that help? Yes, keep the 3rd machine out of the cluster and use it to export an AOE device for testing. > >> 3) you need a cluster filesystem such as GFS2. ext3 and reiserfs are not >> cluster fs. If you mounted any of those on both nodes, I strongly >> recommend recreateing the fs and restore data from a backup. > > yes but that could be a problem... 
cause i need to have (at least) a volume for each virtual machine. > or am i missing something like do a cluster filesystem and on top of that re-create the vg? You were talking about mounting reiserfs and such... as long as you make sure that the fs is mounted on one node at a time then it's fine. No you don't need to create a vg on top of gfs2. gfs2 is not so different from any other filesystems, except that you can mount it and use it simultaneously on all nodes read/write. > > but anyway, i just need to see the filesystem as open... > do i need a clustered one to see the correct open attribute with lvs? > how does the open attribute is set? Ok can you please explain better what you mean here?? clustered lvs are available on all nodes at the same time. You probably see a metadata sync issue due to "interesting" storage. Fabio From daniele at retaggio.net Fri Oct 14 15:30:42 2011 From: daniele at retaggio.net (Daniele Palumbo) Date: Fri, 14 Oct 2011 17:30:42 +0200 Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open attribute In-Reply-To: <4E983455.1000205@redhat.com> References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> <4E97B2E5.5000509@redhat.com> <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net> <4E98249F.2060302@redhat.com> <714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net> <4E983455.1000205@redhat.com> Message-ID: Il giorno 14/ott/2011, alle ore 15.08, Fabio M. Di Nitto ha scritto: > You were talking about mounting reiserfs and such... as long as you make > sure that the fs is mounted on one node at a time then it's fine. that is, i just need to sync open devices so i can see on server #2 if the device is mounted in server #1. > Ok can you please explain better what you mean here?? pvsrv07 ~ # lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert test07 vgPvSrv07 -wi-ao 100.00m pvsrv07 ~ # pvsrv08 ~ # lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert test07 vgPvSrv07 -wi-a- 100.00m pvsrv08 ~ # as you see in one server the device is *marked* as open, in the other is not. i need just to see open in both server, if it is marked as open, xen refuse to boot vm with that device open, and i will not care about the filesystem that is contained in lvm. > clustered lvs are available on all nodes at the same time. You probably > see a metadata sync issue due to "interesting" storage. ok so i will try with one storage and i will report back. bye d. From mjh2000 at gmail.com Fri Oct 14 20:58:01 2011 From: mjh2000 at gmail.com (Joey L) Date: Fri, 14 Oct 2011 16:58:01 -0400 Subject: [Linux-cluster] redhat cluster running on debian 6. Message-ID: I am new to redhat cluster and i am having some issues. 1. I am looking for a simple cluster.conf that I can use for : A. failing over an ip address. B. failing over apache. C. failing over mysql D. failing over asterisk. E. failing over a nfs mount. 
I have created the following cluster.conf using system-config-cluster : cat /etc/cluster/cluster.conf And I am getting the following errors : /etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity error : Invalid attribute name for element clusternode /etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity error : Element clusternodes has extra content: clusternode /etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity error : Type ID doesn't allow value '192.168.2.110' Relax-NG validity error : Element clusternode failed to validate attributes /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error : Invalid sequence in interleave /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error : Element cluster failed to validate content /etc/cluster/cluster.conf fails to validate Also get this message when trying to open cluster.conf with system-config-cluster: Because this node is not currently part of a cluster, the management tab for this application is not available. Can anyone give me a pointer as what to do ? thanks From mjh2000 at gmail.com Fri Oct 14 21:05:43 2011 From: mjh2000 at gmail.com (Joey L) Date: Fri, 14 Oct 2011 17:05:43 -0400 Subject: [Linux-cluster] redhat cluster running on debian 6. In-Reply-To: References: Message-ID: On Fri, Oct 14, 2011 at 4:58 PM, Joey L wrote: > I am new to redhat cluster and i am having some issues. > > 1. I am looking for a simple cluster.conf that I can use for : > A. failing over an ip address. > B. failing over apache. > C. failing over mysql > D. failing over asterisk. > E. failing over a nfs mount. > > I have created the following cluster.conf using system-config-cluster : > > cat /etc/cluster/cluster.conf > > > ? ? ? ? > ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? > > > > And I am getting the following errors : > > /etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity > error : Invalid attribute name for element clusternode > /etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity > error : Element clusternodes has extra content: clusternode > /etc/cluster/cluster.conf:5: element clusternode: Relax-NG validity > error : Type ID doesn't allow value '192.168.2.110' > Relax-NG validity error : Element clusternode failed to validate attributes > /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error > : Invalid sequence in interleave > /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error > : Element cluster failed to validate content > /etc/cluster/cluster.conf fails to validate > > Also get this message when trying to open cluster.conf with > system-config-cluster: > > Because this node is not currently part of a cluster, the management > tab for this application is not available. > > Can anyone give me a pointer as what to do ? > thanks > I have updated my cluster.conf with the following config - can anyone tell me if this is correct ? thanks cat /etc/cluster/cluster.conf From linux at alteeve.com Fri Oct 14 21:12:46 2011 From: linux at alteeve.com (Digimer) Date: Fri, 14 Oct 2011 17:12:46 -0400 Subject: [Linux-cluster] redhat cluster running on debian 6. 
In-Reply-To: References: Message-ID: <4E98A5CE.9050706@alteeve.com> On 10/14/2011 05:05 PM, Joey L wrote: > > > > > > > > > > > > > > > > > > > > > > > name="debby" server_root="/var/www/html" shutdown_wait=""/> > > > name="deb1" server_root="/var/www/html" shutdown_wait=""/> > name="deb2" server_root="/var/www/html" shutdown_wait=""/> > > > > > I don't use apache, so I can't speak to that resource agent's config. I can say though that overall it looks okay with two exceptions. You *must* configure fencing for the cluster to work properly. Even without shared storage, a node failure will trigger a fence call which, because it can't succeed, will leave your cluster hung hard. Change the cluster names to the output of `uname -n` (should be the FQDN). -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?" From mjh2000 at gmail.com Sat Oct 15 13:30:34 2011 From: mjh2000 at gmail.com (Joey L) Date: Sat, 15 Oct 2011 09:30:34 -0400 Subject: [Linux-cluster] redhat cluster running on debian 6. In-Reply-To: <4E98A5CE.9050706@alteeve.com> References: <4E98A5CE.9050706@alteeve.com> Message-ID: > > I don't use apache, so I can't speak to that resource agent's config. I can > say though that overall it looks okay with two exceptions. > > You *must* configure fencing for the cluster to work properly. Even without > shared storage, a node failure will trigger a fence call which, because it > can't succeed, will leave your cluster hung hard. > > Change the cluster names to the output of `uname -n` (should be the FQDN). > i looked at the docs - i think it says fencing is not required. I do not have any sofisticated fencing devices on my network - so it would not help anyways ?? Or do you have a solution in that scenerio ?? Have you used Heartbeat ? It looks a lot less complicated then RH Cluster and seems like there are more docs and vidoes on the net. Do you have a simple cluster.conf file that I can use to see if I am setting this up correctly? I do not see the any of my shared services when i look at my node in the members tab. thanks mjh From mjh2000 at gmail.com Sat Oct 15 13:33:49 2011 From: mjh2000 at gmail.com (Joey L) Date: Sat, 15 Oct 2011 09:33:49 -0400 Subject: [Linux-cluster] redhat cluster running on debian 6. In-Reply-To: References: Message-ID: On Fri, Oct 14, 2011 at 7:25 PM, Walter Hurry wrote: > On Fri, 14 Oct 2011 16:58:01 -0400, Joey L wrote: > >> I am new to redhat cluster and i am having some issues. > > Why do you keep starting new threads for the same basic question? Clearly > you are some kind of "architect" who is trying to put together a proposal > for some client or other. Why do you expect us to prepare a ready made > cookbook for you? > > Admit that it is beyond your level of competence, and give the job to > someone who knows what they are doing. > Walter apparently you were born knowing it all and thanks for sharing your profound knowledge by your reply. Apparently you are trying to sell some questionable solution at www.lavabit.com -- which looks really bad by the way. This is no way to get customers I will not waste my time with you - apparently you are a man with little things...technology and otherwise. 
thanks

From linux at alteeve.com Sat Oct 15 15:47:51 2011 From: linux at alteeve.com (Digimer) Date: Sat, 15 Oct 2011 11:47:51 -0400 Subject: [Linux-cluster] redhat cluster running on debian 6. In-Reply-To: References: <4E98A5CE.9050706@alteeve.com> Message-ID: <4E99AB27.5080908@alteeve.com>

On 10/15/2011 09:30 AM, Joey L wrote:
>> I don't use apache, so I can't speak to that resource agent's config. I can say though that overall it looks okay with two exceptions.
>> You *must* configure fencing for the cluster to work properly. Even without shared storage, a node failure will trigger a fence call which, because it can't succeed, will leave your cluster hung hard.
>> Change the cluster names to the output of `uname -n` (should be the FQDN).
>
> i looked at the docs - i think it says fencing is not required.

What docs? That is misleading. Consider:
* Node 1 wants to start Service A.
* Node 1 requests a DLM lock, gets it, starts Service A.
* Meanwhile, Node 2 wants to start Service A.
* Node 2 requests a DLM lock, is refused because the lock is out to Node 1.
* Node 1 finishes starting Service A and tells the cluster.
* Node 1 releases the lock.
* Node 2, having seen now that Service A is running, no longer tries to start Service A.
Time passes, and suddenly Node 1 fails.
* After a short period of time, the cluster will detect Node 1's death.
* The cluster enters an unknown state (is Node 1 dead, or hung?).
* The cluster will call a fence and DLM will block. With DLM blocked, nothing can get a lock and, without a lock, services cannot be recovered.
* The fence call completes successfully and tells the cluster that things are back into a known state.
* DLM unblocks.
* RGManager sorts out what services were lost (Service A), figures out who can recover the lost service (Node 2).
* Node 2 requests a lock from DLM and you know the rest of the story.

> I do not have any sophisticated fencing devices on my network - so it would not help anyways ??

Wrong. Without fencing, you can not have a stable cluster. In clustering: "The only thing you know is what you don't know." As soon as a node goes silent, you can only know that it has stopped responding. Has it hung and will it come back? Has it completely powered off? You can't guess. Fencing puts the silent node into a known state. That is, it is either disconnected from the cluster's network or forced off. Only then can the state of the silent node be known. Until its state is known, the cluster can not operate safely. This is a general high-availability cluster concept.

> Or do you have a solution in that scenario ??

Yup, you can get a switched PDU. The APC brand PDUs are very good and very well supported using the 'fence_apc_snmp' fence agent. I've used this one in many clusters (as a backup to iLO/IPMI based fencing). It's just fine as a primary fence agent as well. http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900
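To make that concrete, a switched-PDU fence definition comes down to a couple of extra blocks in cluster.conf, along these lines (node name, address, SNMP community and outlet number below are placeholders, and attribute names can vary between fence-agent versions, so check the fence_apc_snmp man page before copying):

<clusternode name="node1.example.com" nodeid="1">
  <fence>
    <method name="pdu">
      <device name="pdu1" port="1"/>
    </method>
  </fence>
</clusternode>

<fencedevices>
  <fencedevice agent="fence_apc_snmp" name="pdu1" ipaddr="10.0.0.10" community="private"/>
</fencedevices>

Each node references the shared fence device and names the PDU outlet it is plugged into, so fenced has something it can actually call when a node goes silent.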
> Have you used Heartbeat ?

A long time ago. The Heartbeat project is effectively deprecated. Linbit, the company behind DRBD, has taken over the project and has announced that they plan no further development. They are maintaining it as bugs are found, but that is it. Both RHCS and the Pacemaker project are primarily on corosync now.

> It looks a lot less complicated than RH Cluster and seems like there are more docs and videos on the net.

I think you are referring to Pacemaker. That is the resource management layer. Whether you think it is simpler or not is, of course, up to the user's perspective. That said, Pacemaker is a perfectly good clustered resource manager and I have no reason to argue against it. I just can't help with it, as I'm mostly familiar with Red Hat's current cluster suite.

> Do you have a simple cluster.conf file that I can use to see if I am setting this up correctly?

"Simple", no. I do have an extensive tutorial though. https://alteeve.com/w/Red_Hat_Cluster_Service_2_Tutorial It's for EL5 and RHCS Stable 2, and the current version is Stable 3, but the configuration is essentially the same. The only (visible) changes are the way the config file is validated (ccs_config_validate instead of the xmllint call) and how updated versions are pushed out to the rest of the cluster ('cman_tool version -r' instead of 'ccs_tool update /etc/cluster/cluster.conf').

> I do not see any of my shared services when i look at my node in the members tab.

Is rgmanager running? What does 'clustat' show?

-- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?"

From linux at alteeve.com Sat Oct 15 15:48:26 2011 From: linux at alteeve.com (Digimer) Date: Sat, 15 Oct 2011 11:48:26 -0400 Subject: [Linux-cluster] redhat cluster running on debian 6. In-Reply-To: References: Message-ID: <4E99AB4A.8060109@alteeve.com>

On 10/15/2011 09:33 AM, Joey L wrote:
> On Fri, Oct 14, 2011 at 7:25 PM, Walter Hurry wrote:
>> On Fri, 14 Oct 2011 16:58:01 -0400, Joey L wrote:
>>> I am new to redhat cluster and i am having some issues.
>> Why do you keep starting new threads for the same basic question? Clearly you are some kind of "architect" who is trying to put together a proposal for some client or other. Why do you expect us to prepare a ready-made cookbook for you?
>> Admit that it is beyond your level of competence, and give the job to someone who knows what they are doing.
>
> Walter apparently you were born knowing it all and thanks for sharing your profound knowledge by your reply. Apparently you are trying to sell some questionable solution at www.lavabit.com -- which looks really bad by the way. This is no way to get customers. I will not waste my time with you - apparently you are a man with little things...technology and otherwise.
> thanks
>
> -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster

Please keep this petty bickering off of this mailing list.

-- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?"

From raju.rajsand at gmail.com Sun Oct 16 04:20:37 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Sun, 16 Oct 2011 09:50:37 +0530 Subject: [Linux-cluster] Apache active-active cluster In-Reply-To: References: Message-ID:

Greetings,
On Fri, Oct 14, 2011 at 1:47 PM, wrote:
> Hi,
> Can I configure an Apache Active-Active cluster using Red Hat Cluster Suite in RHEL6?
> If yes, can someone please pass me the link for the same.
> If you are using htpasswd for authentication, perhaps you may not be able to build it. Look at LVS at the same time. That might solve your problem. and oh! never forget that without fencing (storage/in-band/out-of-band) -- Regards, Rajagopal +91 9930 633 852 From mjh2000 at gmail.com Sun Oct 16 11:21:43 2011 From: mjh2000 at gmail.com (Joey L) Date: Sun, 16 Oct 2011 07:21:43 -0400 Subject: [Linux-cluster] redhat cluster running on debian 6. In-Reply-To: <4E99AB27.5080908@alteeve.com> References: <4E98A5CE.9050706@alteeve.com> <4E99AB27.5080908@alteeve.com> Message-ID: Digimer - thanks for you input - you saved me a ton of time!!! I did look at your tutorial -- great stuff BTW. I thought fencing was an option because I setup RH cluster about 5 years ago and I thought I did not do it then..and further in the RHEL Cluster Administrator had points that it was optional -can not find the url right now. but have pdf and it says: "If a cluster node is configured to be fenced by an integrated fence device, disable ACPI Soft-Off for that node. Disabling ACPI Soft-Off allows an integrated fence device to turn off a node immediately and completely rather than attempting a clean shutdown (for example, shutdown -h now). Otherwise, if ACPI Soft-Off is enabled, an integrated fence device can take four or more seconds to turn off a node (refer to note that follows). In addition, if ACPI Soft-Off is enabled and a node panics or freezes during shutdown, an integrated fence device may not be able to turn off the node. Under those circumstances, fencing is delayed or unsuccessful. Consequently, when a node is fenced with an integrated fence device and ACPI Soft-Off is enabled, a cluster recovers slowly or requires administrative intervention to recover " But not looking to argue this point at all - i remember that when i did set it up it was indeed more stable like you state. My Memory is getting old :) About pacemaker -- Do I need fencing hardware as well ?? I just got 2 servers and a regular switch - i think it netgear. Like I said earlier - just want the 2 boxes to back up each other. I have mysql, apache, asterisk, dns and nfs client running on them - can i do anything with pacemaker ?? I should mention that I am using software raid but will probably need to change to hardware raid in near future. I would like to use the mysql replication feature -- if possible. thanks for your insight and help. mjh From linux at alteeve.com Sun Oct 16 14:47:23 2011 From: linux at alteeve.com (Digimer) Date: Sun, 16 Oct 2011 10:47:23 -0400 Subject: [Linux-cluster] redhat cluster running on debian 6. In-Reply-To: References: <4E98A5CE.9050706@alteeve.com> <4E99AB27.5080908@alteeve.com> Message-ID: <4E9AEE7B.9050803@alteeve.com> On 10/16/2011 07:21 AM, Joey L wrote: > Digimer - thanks for you input - you saved me a ton of time!!! > I did look at your tutorial -- great stuff BTW. Thank you. :) > I thought fencing was an option because I setup RH cluster about 5 > years ago and I thought I did not do it then..and further in the RHEL > Cluster Administrator had points that it was optional -can not find > the url right now. but have pdf and it says: Maybe in very old versions, but not under EL5 or EL6. Like I said, RGManager uses DLM and DLM blocks on fence and fence triggers as soon as a node goes quiet, regardless of if there is a fence device or not. > "If a cluster node is configured to be fenced by an integrated fence > device, disable ACPI Soft-Off for > that node. 
Disabling ACPI Soft-Off allows an integrated fence device > to turn off a node immediately > and completely rather than attempting a clean shutdown (for example, > shutdown -h now). > Otherwise, if ACPI Soft-Off is enabled, an integrated fence device can > take four or more seconds to > turn off a node (refer to note that follows). In addition, if ACPI > Soft-Off is enabled and a node panics > or freezes during shutdown, an integrated fence device may not be able > to turn off the node. Under > those circumstances, fencing is delayed or unsuccessful. Consequently, > when a node is fenced > with an integrated fence device and ACPI Soft-Off is enabled, a > cluster recovers slowly or requires > administrative intervention to recover " This helps speed up recovery. However, I prefer to leave it on an accept the 4 seconds delta because I find ACPI soft-off is very handy. In the end though, the decision depends on your needs. > But not looking to argue this point at all - i remember that when i > did set it up it was indeed more stable like you state. > My Memory is getting old :) Fencing <3 > About pacemaker -- > Do I need fencing hardware as well ?? Short answer; Yes. Long answer, you can disable stonith. However, that is always a tremendous risk because you set yourself up for a bad day if the cluster is allowed to reconfigure around a blocked node that eventually unblocks. I *strongly* advice against it. Strictly speaking though; No. > I just got 2 servers and a regular switch - i think it netgear. > Like I said earlier - just want the 2 boxes to back up each other. What you want doesn't really preclude the need to build a proper cluster. :) > I have mysql, apache, asterisk, dns and nfs client running on them - > can i do anything with pacemaker ?? Pacemaker and Red Hat cluster will both do this just fine. > I should mention that I am using software raid but will probably need > to change to hardware raid in near future. The cluster doesn't care. > I would like to use the mysql replication feature -- if possible. This is a well tested and used configuration. You should have no trouble getting help. > thanks for your insight and help. > > mjh My pleasure. :) -- Digimer E-Mail: digimer at alteeve.com Freenode handle: digimer Papers and Projects: http://alteeve.com Node Assassin: http://nodeassassin.org "At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?" From mjh2000 at gmail.com Mon Oct 17 02:02:47 2011 From: mjh2000 at gmail.com (Joey L) Date: Sun, 16 Oct 2011 22:02:47 -0400 Subject: [Linux-cluster] new to pacemaker and heartbeat on debian...getting error.. Message-ID: Hi - New to heartbeat and pacemaker on debian. Followed a tutorial online at: http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo and now getting this error - root at deb2:/home/mjh# sudo crm_mon --one-shot ============ Last updated: Sun Oct 16 21:56:43 2011 Stack: openais Current DC: deb1 - partition with quorum Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b 2 Nodes configured, 2 expected votes 1 Resources configured. 
============ Online: [ deb1 deb2 ] Failed actions: failover-ip_start_0 (node=deb2, call=35, rc=1, status=complete): unknown error failover-ip_start_0 (node=deb1, call=35, rc=1, status=complete): unknown error root at deb2:/home/mjh# Tried googling - nothing -- I stopped avahi-daemon because i was getting a strange error: I was getting a naming conflict error do not think i am getting it any more. Any thoughts on this ? Any way i can turn logs on for heartbeat or pacemaker ? thanks mjh From Sagar.Shimpi at tieto.com Mon Oct 17 08:00:58 2011 From: Sagar.Shimpi at tieto.com (Sagar.Shimpi at tieto.com) Date: Mon, 17 Oct 2011 11:00:58 +0300 Subject: [Linux-cluster] Apache active-active cluster In-Reply-To: References: Message-ID: Hi, Thanks for the info. I need to ask one for thing - Is LVS possible to implement and test on Virtual Machines[I mean Vmware workstation -Desktop] ?? Regards, Sagar Shimpi, Senior Technical Specialist, OSS Labs Tieto email sagar.shimpi at tieto.com, Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77, MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in TIETO. Knowledge. Passion. Results. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rajagopal Swaminathan Sent: Sunday, October 16, 2011 9:51 AM To: linux clustering Subject: Re: [Linux-cluster] Apache active-active cluster Greetings, On Fri, Oct 14, 2011 at 1:47 PM, wrote: > Hi, > > Can I configure Apache Active -Active cluster using Redhat Cluster Suit in > RHEL6? > > If yes, can someone please pass me the link for the same. > If you are using htpasswd for authentication, perhaps you may not be able to build it. Look at LVS at the same time. That might solve your problem. and oh! never forget that without fencing (storage/in-band/out-of-band) -- Regards, Rajagopal +91 9930 633 852 -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From mgrac at redhat.com Mon Oct 17 08:10:09 2011 From: mgrac at redhat.com (Marek Grac) Date: Mon, 17 Oct 2011 10:10:09 +0200 Subject: [Linux-cluster] Apache active-active cluster In-Reply-To: References: Message-ID: <4E9BE2E1.1060301@redhat.com> On 10/17/2011 10:00 AM, Sagar.Shimpi at tieto.com wrote: > Hi, > > Thanks for the info. > > I need to ask one for thing - Is LVS possible to implement and test on Virtual Machines[I mean Vmware workstation -Desktop] ?? In general it is possible. I have tried to run LVS on xen/kvm and it works. But I have no experience with running it on VMWare m, From daniele at retaggio.net Mon Oct 17 10:22:13 2011 From: daniele at retaggio.net (Daniele Palumbo) Date: Mon, 17 Oct 2011 12:22:13 +0200 Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open attribute In-Reply-To: References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> <4E97B2E5.5000509@redhat.com> <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net> <4E98249F.2060302@redhat.com> <714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net> <4E983455.1000205@redhat.com> Message-ID: <4E9C01D5.4040401@retaggio.net> On 14/10/2011 17:30, Daniele Palumbo wrote: > as you see in one server the device is *marked* as open, in the other is not. > > i need just to see open in both server, > if it is marked as open, xen refuse to boot vm with that device open, and i will not care about the filesystem that is contained in lvm. [...] > ok so i will try with one storage and i will report back. 
here i am :) the issue is still present: pvsrv08 ~ # mount | grep vgTest pvsrv08 ~ # lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert lvTest vgTest -wi-a- 500.00m pvsrv08 ~ # pvsrv07 ~ # mount | grep vgTest pvsrv07 ~ # lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert lvTest vgTest -wi-a- 500.00m pvsrv07 ~ # pvsrv06 ~ # mount | grep vgTest /dev/mapper/vgTest-lvTest on /mnt type reiserfs (rw) pvsrv06 ~ # lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert lvTest vgTest -wi-ao 500.00m pvsrv06 ~ # vgs VG #PV #LV #SN Attr VSize VFree vgTest 1 1 0 wz--nc 3.90g 3.41g pvsrv06 ~ # pvs PV VG Fmt Attr PSize PFree /dev/etherd/e7.0 vgTest lvm2 a-- 3.90g 3.41g pvsrv06 ~ # pvsrv06 ~ # lvdisplay vgTest/lvTest --- Logical volume --- LV Name /dev/vgTest/lvTest VG Name vgTest LV UUID 2nRkxV-JhDe-oDW1-SIZT-dHwH-9x2W-XMvHDr LV Write Access read/write LV Status available # open 2 LV Size 500.00 MiB Current LE 125 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 4096 Block device 252:0 pvsrv06 ~ # pvsrv07 ~ # lvdisplay /dev/vgTest/lvTest --- Logical volume --- LV Name /dev/vgTest/lvTest VG Name vgTest LV UUID 2nRkxV-JhDe-oDW1-SIZT-dHwH-9x2W-XMvHDr LV Write Access read/write LV Status available # open 0 LV Size 500.00 MiB Current LE 125 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 4096 Block device 252:0 pvsrv07 ~ # when a device is mounted i can see the open status only in one server, so the open status is different between server. the 3 servers see the same storage with same device path (as seen in pvsrv06). pvsrv06 ~ # cman_tool status Version: 6.2.0 Config Version: 5 Cluster Name: CEMCluster Cluster Id: 7604 Cluster Member: Yes Cluster Generation: 288 Membership state: Cluster-Member Nodes: 3 Expected votes: 3 Total votes: 3 Node votes: 1 Quorum: 2 Active subsystems: 8 Flags: Ports Bound: 0 11 Node name: pvsrv06 Node ID: 1 Multicast addresses: 239.192.29.209 Node addresses: 192.168.1.106 pvsrv06 ~ # pvsrv06 ~ # fence_tool ls fence domain member count 3 victim count 0 victim now 0 master nodeid 1 wait state none members 1 2 3 pvsrv06 ~ # pvsrv06 ~ # cat /etc/cluster/cluster.conf pvsrv06 ~ # bye d. From balla at staff.spin.it Mon Oct 17 10:42:07 2011 From: balla at staff.spin.it (Emanuele Balla) Date: Mon, 17 Oct 2011 12:42:07 +0200 Subject: [Linux-cluster] Apache active-active cluster In-Reply-To: <4E9BE2E1.1060301@redhat.com> References: <4E9BE2E1.1060301@redhat.com> Message-ID: <4E9C067F.3000005@staff.spin.it> On 10/17/11 10:10 AM, Marek Grac wrote: > On 10/17/2011 10:00 AM, Sagar.Shimpi at tieto.com wrote: >> Hi, >> >> Thanks for the info. >> >> I need to ask one for thing - Is LVS possible to implement and test on Virtual Machines[I mean Vmware workstation -Desktop] ?? > > In general it is possible. I have tried to run LVS on xen/kvm and it > works. But I have no experience with running it on VMWare Works exactly the same. I have customers running production websites >70Mbps balanced by LVS on vmware (direct routing, FWIW). On ESXi, obviously, but I wouldn't expect troubles running it on some other virtual environment... -- # Emanuele Balla # # # System & Network Engineer # # # Spin s.r.l. 
- AS6734 # Phone: +39 040 9869090 # # Trieste # Email: balla at staff.spin.it # From Sagar.Shimpi at tieto.com Mon Oct 17 11:25:35 2011 From: Sagar.Shimpi at tieto.com (Sagar.Shimpi at tieto.com) Date: Mon, 17 Oct 2011 14:25:35 +0300 Subject: [Linux-cluster] Apache active-active cluster In-Reply-To: <4E9C067F.3000005@staff.spin.it> References: <4E9BE2E1.1060301@redhat.com> <4E9C067F.3000005@staff.spin.it> Message-ID: Hi, I want to know if I can configure LVS on my home network - Vmware Workstation [and not Vmware ESX server]...!!!! Regards, Sagar Shimpi, Senior Technical Specialist, OSS Labs Tieto email sagar.shimpi at tieto.com, Wing 1, Cluster D, EON Free Zone, Plot No. 1, Survery # 77, MIDC Kharadi Knowledge Park, Pune 411014, India, www.tieto.com www.tieto.in TIETO. Knowledge. Passion. Results. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Emanuele Balla Sent: Monday, October 17, 2011 4:12 PM To: linux clustering Subject: Re: [Linux-cluster] Apache active-active cluster On 10/17/11 10:10 AM, Marek Grac wrote: > On 10/17/2011 10:00 AM, Sagar.Shimpi at tieto.com wrote: >> Hi, >> >> Thanks for the info. >> >> I need to ask one for thing - Is LVS possible to implement and test on Virtual Machines[I mean Vmware workstation -Desktop] ?? > > In general it is possible. I have tried to run LVS on xen/kvm and it > works. But I have no experience with running it on VMWare Works exactly the same. I have customers running production websites >70Mbps balanced by LVS on vmware (direct routing, FWIW). On ESXi, obviously, but I wouldn't expect troubles running it on some other virtual environment... -- # Emanuele Balla # # # System & Network Engineer # # # Spin s.r.l. - AS6734 # Phone: +39 040 9869090 # # Trieste # Email: balla at staff.spin.it # -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From andreas at hastexo.com Mon Oct 17 14:07:43 2011 From: andreas at hastexo.com (Andreas Kurz) Date: Mon, 17 Oct 2011 16:07:43 +0200 Subject: [Linux-cluster] new to pacemaker and heartbeat on debian...getting error.. In-Reply-To: References: Message-ID: <4E9C36AF.2060302@hastexo.com> On 10/17/2011 04:02 AM, Joey L wrote: > Hi - New to heartbeat and pacemaker on debian. > Followed a tutorial online at: > http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo > > > and now getting this error - > > > root at deb2:/home/mjh# sudo crm_mon --one-shot > ============ > Last updated: Sun Oct 16 21:56:43 2011 > Stack: openais > Current DC: deb1 - partition with quorum > Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b > 2 Nodes configured, 2 expected votes > 1 Resources configured. > ============ > > Online: [ deb1 deb2 ] > > > Failed actions: > failover-ip_start_0 (node=deb2, call=35, rc=1, status=complete): > unknown error > failover-ip_start_0 (node=deb1, call=35, rc=1, status=complete): > unknown error > root at deb2:/home/mjh# > Please provide your config ... best is the ouput of "cibadmin -Q". Reading the logs should also give you valuable hints. One shot into the dark: there is no interface up with an IP in the same subnet you configured your failover-ip and you did not explicitly defined an interface. And there is a dedicated Pacemaker mailinglist, I set it on cc for this thread. Regards, Andreas -- Need help with Pacemaker? 
http://www.hastexo.com/now > > Tried googling - nothing -- > I stopped avahi-daemon because i was getting a strange error: > I was getting a naming conflict error do not think i am getting it any more. > > Any thoughts on this ? > Any way i can turn logs on for heartbeat or pacemaker ? > > thanks > mjh > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 286 bytes Desc: OpenPGP digital signature URL: From daniele at retaggio.net Tue Oct 18 08:27:26 2011 From: daniele at retaggio.net (Daniele Palumbo) Date: Tue, 18 Oct 2011 10:27:26 +0200 Subject: [Linux-cluster] sharing attr on clustered volume -- cannot see open attribute In-Reply-To: <4E9C01D5.4040401@retaggio.net> References: <479B9EE9-B9C2-4FCA-A95D-63D2A02E11AE@retaggio.net> <4E97B2E5.5000509@redhat.com> <1F31A599-8893-4749-A49B-49CD51D41C5E@retaggio.net> <4E98249F.2060302@redhat.com> <714FCFA3-5311-4479-BE57-49FF9002E48C@retaggio.net> <4E983455.1000205@redhat.com> <4E9C01D5.4040401@retaggio.net> Message-ID: <4E9D386E.9050803@retaggio.net> any hint? i'm almost desperate :( On 17/10/2011 12:22, Daniele Palumbo wrote: > On 14/10/2011 17:30, Daniele Palumbo wrote: >> as you see in one server the device is *marked* as open, in the other >> is not. >> >> i need just to see open in both server, >> if it is marked as open, xen refuse to boot vm with that device open, >> and i will not care about the filesystem that is contained in lvm. > [...] > > ok so i will try with one storage and i will report back. > > here i am :) > > the issue is still present: > pvsrv08 ~ # mount | grep vgTest > pvsrv08 ~ # lvs > LV VG Attr LSize Origin Snap% Move Log Copy% Convert > lvTest vgTest -wi-a- 500.00m > pvsrv08 ~ # > > pvsrv07 ~ # mount | grep vgTest > pvsrv07 ~ # lvs > LV VG Attr LSize Origin Snap% Move Log Copy% Convert > lvTest vgTest -wi-a- 500.00m > pvsrv07 ~ # > > pvsrv06 ~ # mount | grep vgTest > /dev/mapper/vgTest-lvTest on /mnt type reiserfs (rw) > pvsrv06 ~ # lvs > LV VG Attr LSize Origin Snap% Move Log Copy% Convert > lvTest vgTest -wi-ao 500.00m > pvsrv06 ~ # vgs > VG #PV #LV #SN Attr VSize VFree > vgTest 1 1 0 wz--nc 3.90g 3.41g > pvsrv06 ~ # pvs > PV VG Fmt Attr PSize PFree > /dev/etherd/e7.0 vgTest lvm2 a-- 3.90g 3.41g > pvsrv06 ~ # > > pvsrv06 ~ # lvdisplay vgTest/lvTest > --- Logical volume --- > LV Name /dev/vgTest/lvTest > VG Name vgTest > LV UUID 2nRkxV-JhDe-oDW1-SIZT-dHwH-9x2W-XMvHDr > LV Write Access read/write > LV Status available > # open 2 > LV Size 500.00 MiB > Current LE 125 > Segments 1 > Allocation inherit > Read ahead sectors auto > - currently set to 4096 > Block device 252:0 > > pvsrv06 ~ # > > > pvsrv07 ~ # lvdisplay /dev/vgTest/lvTest > --- Logical volume --- > LV Name /dev/vgTest/lvTest > VG Name vgTest > LV UUID 2nRkxV-JhDe-oDW1-SIZT-dHwH-9x2W-XMvHDr > LV Write Access read/write > LV Status available > # open 0 > LV Size 500.00 MiB > Current LE 125 > Segments 1 > Allocation inherit > Read ahead sectors auto > - currently set to 4096 > Block device 252:0 > > pvsrv07 ~ # > > when a device is mounted i can see the open status only in one server, > so the open status is different between server. > > the 3 servers see the same storage with same device path (as seen in > pvsrv06). 
> > pvsrv06 ~ # cman_tool status > Version: 6.2.0 > Config Version: 5 > Cluster Name: CEMCluster > Cluster Id: 7604 > Cluster Member: Yes > Cluster Generation: 288 > Membership state: Cluster-Member > Nodes: 3 > Expected votes: 3 > Total votes: 3 > Node votes: 1 > Quorum: 2 > Active subsystems: 8 > Flags: > Ports Bound: 0 11 > Node name: pvsrv06 > Node ID: 1 > Multicast addresses: 239.192.29.209 > Node addresses: 192.168.1.106 > pvsrv06 ~ # > > pvsrv06 ~ # fence_tool ls > fence domain > member count 3 > victim count 0 > victim now 0 > master nodeid 1 > wait state none > members 1 2 3 > > pvsrv06 ~ # > > > pvsrv06 ~ # cat /etc/cluster/cluster.conf > > > > > > > > > > > pvsrv06 ~ # > > > bye > d. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From christian.masopust at siemens.com Tue Oct 18 16:47:29 2011 From: christian.masopust at siemens.com (Masopust, Christian) Date: Tue, 18 Oct 2011 18:47:29 +0200 Subject: [Linux-cluster] manually stopping system-service (part of a cluster service) and rgmanager doesn't start it again Message-ID: Hi all, I have the following rgmanager configuration: