From lipson12 at yahoo.com  Sun Dec  4 11:38:01 2011
From: lipson12 at yahoo.com (Kaisar Ahmed Khan)
Date: Sun, 4 Dec 2011 03:38:01 -0800 (PST)
Subject: [Linux-cluster] fencing problem
Message-ID: <1322998681.56234.YahooMailClassic@web36502.mail.mud.yahoo.com>

Dear all,

When I try to fence the node with

  fence_xvm -H station2.example.com

it shows a request timeout:

[root at station1 ~]# fence_xvm -H station2.example.com -ddd -o null
Debugging threshold is now 3
-- args @ 0xbfa22738 --
  args->addr = 225.0.0.12
  args->domain = station2.example.com
  args->key_file = /etc/cluster/fence_xvm.key
  args->op = 0
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 0
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 0
  args->debug = 3
-- end args --
Reading in key file /etc/cluster/fence_xvm.key into 0xbfa216ec (4096 max size)
Actual key length = 4096 bytes
Sending to 225.0.0.12 via 127.0.0.1
Waiting for connection from XVM host daemon.
Timed out waiting for response

Can anybody help me?

Thanks,
Md. Kaisar Ahmed Khan

From xubinbin2004 at gmail.com  Mon Dec  5 00:18:45 2011
From: xubinbin2004 at gmail.com (Bin)
Date: Sun, 4 Dec 2011 17:18:45 -0700
Subject: [Linux-cluster] Can I build a computer cluster based on RedHat Desktop edition?
Message-ID:

I am a beginner :-) and want to build a PC cluster based on Red Hat Linux for running my parallel codes. Please help...

Thanks

-- 
Best regards,
Bin

From linux at alteeve.com  Mon Dec  5 01:30:49 2011
From: linux at alteeve.com (Digimer)
Date: Sun, 04 Dec 2011 20:30:49 -0500
Subject: [Linux-cluster] Can I build a computer cluster based on RedHat Desktop edition?
In-Reply-To:
References:
Message-ID: <4EDC1EC9.7020003@alteeve.com>

On 12/04/2011 07:18 PM, Bin wrote:
> I am a beginner :-) and want to build a PC cluster based on Red Hat
> Linux for running my parallel codes. Please help...
>
> Thanks

Performance clustering is most often a per-application question, not one that can be generalized very well. These setups tend to be pretty distro-agnostic and generally rely on specialized tools running on the nodes.

A classic example is a video render farm: a master node cuts up a series of frames, hands them off to nodes in the farm to render, repeats for the various other parts of the movie, then collects the finished frames and stitches them together into a single movie. Similar concepts can be applied to decryption, compilation and so on.

So, tell us what you are trying to do, specifically.

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron

From linux at alteeve.com  Mon Dec  5 01:36:16 2011
From: linux at alteeve.com (Digimer)
Date: Sun, 04 Dec 2011 20:36:16 -0500
Subject: [Linux-cluster] Can I build a computer cluster based on RedHat Desktop edition?
In-Reply-To: <1855549946-1323048837-cardhu_decombobulator_blackberry.rim.net-1003587105-@b11.c21.bise6.blackberry>
References: <1855549946-1323048837-cardhu_decombobulator_blackberry.rim.net-1003587105-@b11.c21.bise6.blackberry>
Message-ID: <4EDC2010.3020508@alteeve.com>

On 12/04/2011 08:33 PM, bin xu wrote:
> Thanks. I just want to run some MPI based computing codes.
> Thanks
>
> Bin
> ------Original Message------
> From: Digimer
> To: linux clustering
> Cc: Bin
> Subject: Re: [Linux-cluster] Can I build a computer cluster based on RedHat Desktop edition?
> Sent: Dec 4, 2011 6:30 PM
>
> On 12/04/2011 07:18 PM, Bin wrote:
>> I am a beginner :-) and want to build a PC cluster based on Red Hat
>> Linux for running my parallel codes. Please help...
>>
>> Thanks
>
> Performance clustering is most often a per-application question, not one
> that can be generalized very well. These setups tend to be pretty
> distro-agnostic and generally rely on specialized tools running on the nodes.
>
> A classic example is a video render farm: a master node cuts up a
> series of frames, hands them off to nodes in the farm to render,
> repeats for the various other parts of the movie, then collects the
> finished frames and stitches them together into a single movie. Similar
> concepts can be applied to decryption, compilation and so on.
>
> So, tell us what you are trying to do, specifically.

Please reply to the mailing list. Discussions like this can help other people later when they're archived and searchable.

You will want to take a look at the Open MPI project. I've not used it myself, but it should give you what you need to get started; a rough sketch of the workflow follows below.

http://www.open-mpi.org/
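
For the archives, a minimal sketch of that Open MPI workflow on a RHEL-family machine. This is an untested illustration: the package names, host names and the hello.c source (any stock MPI "hello world" from the Open MPI documentation) are assumptions, and on some releases you may need to load an environment module before mpicc appears in $PATH.

  yum install openmpi openmpi-devel   # package names vary by distro/release
  mpicc hello.c -o hello              # compile against the MPI headers/libs
  # launch 8 ranks across two machines; node1/node2 are placeholder host
  # names that must be reachable over passwordless SSH:
  mpirun -np 8 --host node1,node2 ./hello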
-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron

From mgrac at redhat.com  Mon Dec  5 10:36:41 2011
From: mgrac at redhat.com (Marek Grac)
Date: Mon, 05 Dec 2011 11:36:41 +0100
Subject: [Linux-cluster] Fence_vmware_soap
In-Reply-To:
References:
Message-ID: <4EDC9EB9.6010309@redhat.com>

Hi,

On 11/28/2011 10:59 PM, Geovanis, Nicholas wrote:
> RH Cluster Services on RHEL 5.7, all nodes running in VMware VMs:
> The only doc I can find on the fence_vmware_soap fencing agent is the
> script itself and the man page for it. There is no background info in
> either and no examples. I can get my vCenter server to respond to a
> "list" subcommand but anything else receives "Failed: Unable to obtain
> correct plug status or plug is not available". Sadly, the "successful"
> list only retrieves exactly 100 entries from one of the 8 (VMware)
> clusters running (same cluster every time, same 100 VMs and templates
> every time).

The 'list' option was not tested with that many entries, but its functionality is not used anywhere yet, so it does not impact the proper function of the fence agent.

As for 'Plug is not available / ...': in which format do you enter it? The proper one is explained in the manual page (/datacenter/vm/Discovered virtual machine/myMachine), where myMachine is your machine's name. Alternatively you can use the option -U / uuid. A sketch of both forms follows below.

m,
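
For illustration, a status query with an explicit inventory path might look like the sketch below. The vCenter host name and credentials are placeholders, not values from this thread, and the options should be double-checked against the man page of the fence_vmware_soap version actually installed:

  # by inventory path, over SSL:
  fence_vmware_soap -z -a vcenter.example.com -l fenceuser -p secret \
      -n "/datacenter/vm/Discovered virtual machine/myMachine" -o status
  # or, addressing the VM by UUID as mentioned above:
  fence_vmware_soap -z -a vcenter.example.com -l fenceuser -p secret \
      -U <uuid> -o status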
From davegu1 at hotmail.com  Mon Dec  5 18:37:24 2011
From: davegu1 at hotmail.com (David F. Gutierrez)
Date: Mon, 5 Dec 2011 12:37:24 -0600
Subject: [Linux-cluster] Can I build a computer cluster based on RedHat Desktop edition?
In-Reply-To: <4EDC2010.3020508@alteeve.com>
References: <1855549946-1323048837-cardhu_decombobulator_blackberry.rim.net-1003587105-@b11.c21.bise6.blackberry> <4EDC2010.3020508@alteeve.com>
Message-ID:

Running MPI code can be done, but it could be costly. Read these articles, and do other searches in Google on the same topic:

http://na-inet.jp/na/pccluster/fc5_x8664-en.html
http://www.webstreet.com/super_computer.htm
http://www.divms.uiowa.edu/~jni/HowTo/HowToBuildAClusterG.pdf
http://blizzard.rwic.und.edu/~nordlie/deuce/

Good luck,

David

-----Original Message-----
From: Digimer
Sent: Sunday, December 04, 2011 7:36 PM
To: linux clustering
Subject: Re: [Linux-cluster] Can I build a computer cluster based on RedHat Desktop edition?

On 12/04/2011 08:33 PM, bin xu wrote:
> Thanks. I just want to run some MPI based computing codes.
> Thanks
>
> Bin
> ------Original Message------
> From: Digimer
> To: linux clustering
> Cc: Bin
> Subject: Re: [Linux-cluster] Can I build a computer cluster based on
> RedHat Desktop edition?
> Sent: Dec 4, 2011 6:30 PM
>
> On 12/04/2011 07:18 PM, Bin wrote:
>> I am a beginner :-) and want to build a PC cluster based on Red Hat
>> Linux for running my parallel codes. Please help...
>>
>> Thanks
>
> Performance clustering is most often a per-application question, not one
> that can be generalized very well. These setups tend to be pretty
> distro-agnostic and generally rely on specialized tools running on the nodes.
>
> A classic example is a video render farm: a master node cuts up a
> series of frames, hands them off to nodes in the farm to render,
> repeats for the various other parts of the movie, then collects the
> finished frames and stitches them together into a single movie. Similar
> concepts can be applied to decryption, compilation and so on.
>
> So, tell us what you are trying to do, specifically.

Please reply to the mailing list. Discussions like this can help other people later when they're archived and searchable.

You will want to take a look at the Open MPI project. I've not used it myself, but it should give you what you need to get started.

http://www.open-mpi.org/

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron

-- 
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From rossnick-lists at cybercat.ca  Tue Dec  6 18:39:39 2011
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Tue, 6 Dec 2011 13:39:39 -0500
Subject: [Linux-cluster] Design question about VG / LV in a clustered environment
Message-ID: <6C60C76401934E22A74971C46A813064@versa>

Hi!

Over the last couple of months we have had a few problems with the manner in which we designed our clustered filesystems, and we are planning a re-design of the filesystems and how they are used.

Our cluster is composed of 8 nodes, connected via Fibre Channel to a RAID enclosure where we have six pairs of 1 TB drives in mirror, so six 1 TB physical volumes.

First of all, the services run from the cluster live inside directories. For example, a web server for a given application runs from /CyberCat/WebServer/(...). That directory contains all executables (Apache and PHP, for example) and the related data, except for the databases. /CyberCat is a single GFS partition containing several other services. This filesystem, and another one like it containing services for some other clients, occupy a single VG composed of 2 PVs (2 TB total). The remaining 4 PVs are each used in their own 1 TB VG, and each of those VGs contains a single LV used for database servers.

For availability reasons, we are planning to split the /CyberCat (and the other one like it) FS into several smaller filesystems, one for each service. The reason is that if we ever need to run a filesystem check on any one filesystem, or take it offline for any other unplanned reason, it won't affect the other services.

So, now come the questions I have:

1. First of all, is this a bad idea?

2. Is there any disadvantage to doing a single volume group composed of many physical volumes, enabling us to move the extents of a logical volume from one physical volume to another one, so that load is more balanced in the event we need it?

Thanks for the input.

From jeff.sturm at eprize.com  Wed Dec  7 03:33:35 2011
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Wed, 7 Dec 2011 03:33:35 +0000
Subject: [Linux-cluster] Design question about VG / LV in a clustered environment
In-Reply-To: <6C60C76401934E22A74971C46A813064@versa>
References: <6C60C76401934E22A74971C46A813064@versa>
Message-ID:

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Nicolas Ross
> Sent: Tuesday, December 06, 2011 1:40 PM
>
> For availability reasons, we are planning to split the /CyberCat (and the other one like
> it) FS into several smaller filesystems, one for each service.

[snip]

> 1. First of all, is this a bad idea?

Right or wrong, that's how we do it. Apart from availability, you can tune each fs appropriately depending on how you use it. GFS2 dropped some tunables, I think, but you can still mount with "noatime" (assuming your application doesn't rely on atime) and tune some things like block size. Some of our GFS filesystems are also read-only on certain nodes, so we take advantage of spectator mounts for those. (A sketch of these options follows below.)

> 2. Is there any disadvantage to doing a single volume group composed of many
> physical volumes, enabling us to move the extents of a logical volume from one
> physical volume to another one, so that load is more balanced in the event we need it?

Can't say, really. We ditched CLVM but kept GFS. It felt like CLVM had too many limitations to make it worthwhile. It was straightforward to just export a LUN from our SAN for each file system, and that allows us to take advantage of the SAN's native snapshot facility.

-Jeff
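
To make those suggestions concrete, a rough sketch (device names, mount points and journal counts are placeholders, and "spectator" only makes sense on nodes that never write to the filesystem):

  # atime updates disabled at mount time:
  mount -o noatime /dev/myvg/service_lv /CyberCat
  # read-only spectator mount on a node that never writes (no journal needed):
  mount -o spectator /dev/myvg/service_lv /CyberCat
  # block size is chosen at mkfs time, not at mount time, e.g.:
  mkfs.gfs2 -b 4096 -p lock_dlm -t mycluster:service1 -j 8 /dev/myvg/service_lv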
From linux at alteeve.com  Wed Dec  7 04:45:32 2011
From: linux at alteeve.com (Digimer)
Date: Tue, 06 Dec 2011 23:45:32 -0500
Subject: [Linux-cluster] cluster 3.1.8 released
Message-ID: <4EDEEF6C.2040002@alteeve.com>

Welcome to the cluster 3.1.8 release.

This release addresses several bugs and includes a patch to improve RRP configuration handling. DLM+SCTP (the kernel counterpart of RRP) is still under testing; feedback is always appreciated. (A sketch of a redundant-ring configuration follows below.)

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.8.tar.xz

ChangeLog:

https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.8

To report bugs or issues:

https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other system administrators or power users.

Thanks and congratulations to all the people who contributed to this milestone.

Happy clustering,
Digimer
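
As background, RRP (redundant ring protocol) lets corosync run its heartbeat over two independent networks. A minimal sketch of what that looks like in a plain corosync.conf; the addresses are placeholders, this is illustrative only, and cman-based clusters configure the second ring through cluster.conf instead:

  totem {
      version: 2
      rrp_mode: passive
      interface {
          ringnumber: 0
          bindnetaddr: 192.168.1.0
          mcastaddr: 239.192.1.1
          mcastport: 5405
      }
      interface {
          ringnumber: 1
          bindnetaddr: 192.168.2.0
          mcastaddr: 239.192.2.1
          mcastport: 5405
      }
  }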
From Nicholas.Geovanis at uscellular.com  Wed Dec  7 17:22:54 2011
From: Nicholas.Geovanis at uscellular.com (Geovanis, Nicholas)
Date: Wed, 7 Dec 2011 11:22:54 -0600
Subject: [Linux-cluster] Design question about VG / LV in a clustered environment
In-Reply-To:
References:
Message-ID:

Jeff Sturm wrote:
>> We ditched CLVM but kept GFS. It felt like CLVM had too many limitations to make it worthwhile.

Would you elaborate on this for me please? I understand the "damn, forgot to start clvmd on that node...." type of annoyance, but what were your burning issues? I'm not convinced that there's a performance drawback which is specifically clvmd-related, but maybe I'm naïve.

Thanks....Nick G

Nick Geovanis
US Cellular/Kforce Inc
e. Nicholas.Geovanis at uscellular.com
From jeff.sturm at eprize.com  Thu Dec  8 19:48:27 2011
From: jeff.sturm at eprize.com (Jeff Sturm)
Date: Thu, 8 Dec 2011 19:48:27 +0000
Subject: [Linux-cluster] Design question about VG / LV in a clustered environment
In-Reply-To:
References:
Message-ID:

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Geovanis, Nicholas
> Sent: Wednesday, December 07, 2011 12:23 PM
>
> >> We ditched CLVM but kept GFS. It felt like CLVM had too many limitations to make
> it worthwhile.
>
> Would you elaborate on this for me please? I understand the "damn, forgot to start
> clvmd on that node...." type of annoyance, but what were your burning issues? I'm not
> convinced that there's a performance drawback which is specifically clvmd-related, but
> maybe I'm naïve. Thanks....Nick G

At the time there was no snapshot support. That was the big missing feature for us.

We also tried using pvmove, and had problems with it. It was very slow, and eventually stopped altogether, complaining about a lock. I tried activating LVs exclusively and it didn't help. Later I found that in drastic situations I could remove the "clustered" bit temporarily, make my changes, then revert to a clustered volume (recognizing this is dangerous when the volume is shared; a sketch follows below).

At times, running simple commands like "lvs" became very slow, or stopped completely. Restarting the node would clear it up. When this occurred, the cluster would otherwise appear normal.

To be fair, it was a few years ago when we were evaluating this, on 5.2 I think. It's likely some of the bugs have been worked out. We didn't have a lot of motivation to work through them as long as we could fall back on the SAN for the functionality we needed.

Red Hat has a tendency, I think, to release features just a little before they are ready (sometimes with caveats, like the GFS2 preview release). This is good for users who are evaluating the technology. For production, however, we need stability above all else. Since about 5.3, GFS has worked very well for us. Based on our experience with early 5.x releases, I'm not in any hurry to move to 6.x.

-Jeff
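
For reference, the "clustered bit" workaround described above looks roughly like the sketch below. The volume group and PV names are placeholders, and this is only safe when the VG is deactivated on every other node:

  vgchange -cn myvg              # temporarily clear the clustered flag
  pvmove /dev/sdb1 /dev/sdc1     # do the maintenance that clvmd was blocking
  vgchange -cy myvg              # restore the clustered flag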
From matthew.painter at kusiri.com  Sat Dec 10 20:32:05 2011
From: matthew.painter at kusiri.com (Matthew Painter)
Date: Sat, 10 Dec 2011 20:32:05 +0000
Subject: [Linux-cluster] Nodes leaving and re-joining intermittently
Message-ID:

Hi all,

We are trying to get to the bottom of some odd intermittent behavior on a cluster. We are intermittently seeing nodes leave and rejoin the cluster without being fenced. Further, the gap between leaving and re-joining is 8 minutes. We are monitoring the latency between boxes, and it is acceptable (<5ms).

How can nodes exhibit this behavior? There seems to be no impact on the services running on the box, just this leaving and re-joining. The SNMP messages are below.

All help decoding this gratefully received! :)

Thanks,

Matt

Sat Dec 10 15:22:00 GMT 2011: cluster3.localdomain
DISMAN-EVENT-MIB::sysUpTimeInstance = 3:2:52:23.35,
SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "left"

Sat Dec 10 15:30:25 GMT 2011: cluster3.localdomain
DISMAN-EVENT-MIB::sysUpTimeInstance = 3:3:00:48.75,
SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "joined"
From linux at alteeve.com  Sat Dec 10 20:55:38 2011
From: linux at alteeve.com (Digimer)
Date: Sat, 10 Dec 2011 15:55:38 -0500
Subject: [Linux-cluster] Nodes leaving and re-joining intermittently
In-Reply-To:
References:
Message-ID: <4EE3C74A.3000206@alteeve.com>

On 12/10/2011 03:32 PM, Matthew Painter wrote:
> Hi all,
>
> We are trying to get to the bottom of some odd intermittent behavior on
> a cluster. We are intermittently seeing nodes leave and rejoin the cluster
> without being fenced. Further, the gap between leaving and re-joining is 8
> minutes. We are monitoring the latency between boxes, and it is
> acceptable (<5ms).
>
> How can nodes exhibit this behavior? There seems to be no impact on the
> services running on the box, just this leaving and re-joining. The SNMP
> messages are below.
>
> All help decoding this gratefully received! :)
>
> Thanks,
>
> Matt
>
> Sat Dec 10 15:22:00 GMT 2011: cluster3.localdomain
> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:2:52:23.35,
> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "left"
>
> Sat Dec 10 15:30:25 GMT 2011: cluster3.localdomain
> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:3:00:48.75,
> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "joined"

My first instinct is to point to multicast issues in your switch (one way to test that is sketched below), but then I'd expect the node to get fenced. That said, any unexpected disconnect should fire a fence, so it would seem like the node is cleanly stopping/restarting corosync.

Can you share your configuration and, ideally, anything in syslog from all involved nodes, starting from just before the disconnect and continuing through to after the node rejoins?
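
One common way to rule multicast in or out is omping, run on all the nodes at the same time. A sketch; only 10.79.202.1 comes from the traps above, the other addresses are assumed, and the tool has to be installed separately:

  # run simultaneously on every node, listing all cluster members:
  omping -c 60 -i 1 10.79.202.1 10.79.202.2 10.79.202.3
  # healthy multicast shows ~0% loss on both the unicast and multicast lines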
-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron

From matthew.painter at kusiri.com  Sat Dec 10 22:00:12 2011
From: matthew.painter at kusiri.com (Matthew Painter)
Date: Sat, 10 Dec 2011 22:00:12 +0000
Subject: [Linux-cluster] Nodes leaving and re-joining intermittently
In-Reply-To: <4EE3C74A.3000206@alteeve.com>
References: <4EE3C74A.3000206@alteeve.com>
Message-ID:

The switch was our first thought, but that has been swapped, and while we are no longer having nodes fenced (we were, daily), this anomaly remains.

I will ask for those logs and the conf on Monday.

I think it might be worth reinstalling corosync on this box anyway? It can't be healthy if it is exiting uncleanly. I have had reports of rgmanager dying on this box (PID file present, but not running). Could that be related?

Thanks :)

On Saturday, December 10, 2011, Digimer wrote:
> On 12/10/2011 03:32 PM, Matthew Painter wrote:
>> Hi all,
>>
>> We are trying to get to the bottom of some odd intermittent behavior on
>> a cluster. We are intermittently seeing nodes leave and rejoin the cluster
>> without being fenced. Further, the gap between leaving and re-joining is 8
>> minutes. We are monitoring the latency between boxes, and it is
>> acceptable (<5ms).
>>
>> How can nodes exhibit this behavior? There seems to be no impact on the
>> services running on the box, just this leaving and re-joining. The SNMP
>> messages are below.
>>
>> All help decoding this gratefully received! :)
>>
>> Thanks,
>>
>> Matt
>>
>> Sat Dec 10 15:22:00 GMT 2011: cluster3.localdomain
>> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:2:52:23.35,
>> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
>> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
>> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
>> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
>> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "left"
>>
>> Sat Dec 10 15:30:25 GMT 2011: cluster3.localdomain
>> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:3:00:48.75,
>> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
>> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
>> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
>> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
>> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "joined"
>
> My first instinct is to point to multicast issues in your switch, but
> then I'd expect the node to get fenced. That said, any unexpected
> disconnect should fire a fence, so it would seem like the node is
> cleanly stopping/restarting corosync.
>
> Can you share your configuration and, ideally, anything in syslog from
> all involved nodes, starting from just before the disconnect and
> continuing through to after the node rejoins?
>
> --
> Digimer
> E-Mail: digimer at alteeve.com
> Freenode handle: digimer
> Papers and Projects: http://alteeve.com
> Node Assassin: http://nodeassassin.org
> "omg my singularity battery is dead again.
> stupid hawking radiation." - epitron

From linux at alteeve.com  Sat Dec 10 22:22:55 2011
From: linux at alteeve.com (Digimer)
Date: Sat, 10 Dec 2011 17:22:55 -0500
Subject: [Linux-cluster] Nodes leaving and re-joining intermittently
In-Reply-To:
References: <4EE3C74A.3000206@alteeve.com>
Message-ID: <4EE3DBBF.9080006@alteeve.com>

On 12/10/2011 05:00 PM, Matthew Painter wrote:
> The switch was our first thought, but that has been swapped, and while
> we are no longer having nodes fenced (we were, daily), this anomaly
> remains.
>
> I will ask for those logs and the conf on Monday.
>
> I think it might be worth reinstalling corosync on this box anyway?
> It can't be healthy if it is exiting uncleanly. I have had reports of
> rgmanager dying on this box (PID file present, but not running). Could
> that be related?
>
> Thanks :)

It's impossible to say without knowing your configuration. Please share the cluster.conf (only obfuscate passwords, please) along with the log files. The more detail, the better: versions, distros, network config, etc.

Uninstalling corosync is not likely to help. RGManager is fairly high up in the stack, so it's not likely the cause either.

Did you configure the timeouts to be very high, by chance? I'm finding it difficult to fathom how the node can withdraw without being fenced, short of cleanly stopping the cluster stack. I suspect there is something important not being said, which the configuration information, versions and logs will hopefully expose.

-- 
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron
From Mdukhan at nds.com  Sun Dec 11 07:16:49 2011
From: Mdukhan at nds.com (Dukhan, Meir)
Date: Sun, 11 Dec 2011 09:16:49 +0200
Subject: [Linux-cluster] Nodes leaving and re-joining intermittently
In-Reply-To: <4EE3DBBF.9080006@alteeve.com>
References: <4EE3C74A.3000206@alteeve.com> <4EE3DBBF.9080006@alteeve.com>
Message-ID: <6DAE69EA69F39E4B9DA073B8C848A27C60E82DE826@ILMA1.IL.NDS.COM>

Are your nodes time-synced, and how?

We ran into problems with nodes being fenced because of an NTP problem. The solution (AFAIR, from the Red Hat knowledge base) was to start ntpd _before_ cman. I'm not sure, but there may be an update of openais or ntpd regarding this issue. (A quick way to check the init ordering is sketched below.)

For those of you who have a Red Hat account, see the Red Hat KB article:

Does cman need to have the time of nodes in sync?
https://access.redhat.com/kb/docs/DOC-42471

Hope this helps,

Regards,
-- Meir R. Dukhan
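
A quick sketch of checking that ordering on a RHEL 5/6-style init system; the runlevel and start numbers vary by installation:

  # list the runlevels each service starts in:
  chkconfig --list ntpd
  chkconfig --list cman
  # the S## prefixes in the runlevel directory define the actual start order:
  ls /etc/rc3.d/ | grep -E 'ntpd|cman'
  # and confirm the clocks really are in sync on every node:
  ntpq -p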
|-----Original Message-----
|From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer
|Sent: Sunday, December 11, 2011 0:23 AM
|To: Matthew Painter
|Cc: linux clustering
|Subject: Re: [Linux-cluster] Nodes leaving and re-joining intermittently
|
|On 12/10/2011 05:00 PM, Matthew Painter wrote:
|> The switch was our first thought, but that has been swapped, and while
|> we are no longer having nodes fenced (we were, daily), this anomaly
|> remains.
|>
|> I will ask for those logs and the conf on Monday.
|>
|> I think it might be worth reinstalling corosync on this box anyway?
|> It can't be healthy if it is exiting uncleanly. I have had reports of
|> rgmanager dying on this box (PID file present, but not running). Could
|> that be related?
|>
|> Thanks :)
|
|It's impossible to say without knowing your configuration. Please share the
|cluster.conf (only obfuscate passwords, please) along with the log files.
|The more detail, the better: versions, distros, network config, etc.
|
|Uninstalling corosync is not likely to help. RGManager is fairly
|high up in the stack, so it's not likely the cause either.
|
|Did you configure the timeouts to be very high, by chance? I'm finding it
|difficult to fathom how the node can withdraw without being fenced, short
|of cleanly stopping the cluster stack. I suspect there is something
|important not being said, which the configuration information, versions and
|logs will hopefully expose.
|
|--
|Digimer
|E-Mail: digimer at alteeve.com
|Freenode handle: digimer
|Papers and Projects: http://alteeve.com
|Node Assassin: http://nodeassassin.org
|"omg my singularity battery is dead again.
|stupid hawking radiation." - epitron
|
|--
|Linux-cluster mailing list
|Linux-cluster at redhat.com
|https://www.redhat.com/mailman/listinfo/linux-cluster

This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postmaster at nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes. To protect the environment please do not print this e-mail unless necessary.

An NDS Group Limited company. www.nds.com

From matthew.painter at kusiri.com  Sun Dec 11 11:12:51 2011
From: matthew.painter at kusiri.com (Matthew Painter)
Date: Sun, 11 Dec 2011 11:12:51 +0000
Subject: [Linux-cluster] Nodes leaving and re-joining intermittently
In-Reply-To: <6DAE69EA69F39E4B9DA073B8C848A27C60E82DE826@ILMA1.IL.NDS.COM>
References: <4EE3C74A.3000206@alteeve.com> <4EE3DBBF.9080006@alteeve.com> <6DAE69EA69F39E4B9DA073B8C848A27C60E82DE826@ILMA1.IL.NDS.COM>
Message-ID:

Thank you for your input :)

The nodes are synced using NTP, although I am unsure about the respective run levels. I will look into this, thank you.

On Sun, Dec 11, 2011 at 7:16 AM, Dukhan, Meir wrote:
> Are your nodes time-synced, and how?
>
> We ran into problems with nodes being fenced because of an NTP problem.
>
> The solution (AFAIR, from the Red Hat knowledge base) was to start ntpd
> _before_ cman. I'm not sure, but there may be an update of openais or
> ntpd regarding this issue.
>
> For those of you who have a Red Hat account, see the Red Hat KB article:
>
> Does cman need to have the time of nodes in sync?
> https://access.redhat.com/kb/docs/DOC-42471
>
> Hope this helps,
>
> Regards,
> -- Meir R. Dukhan
>
> This message is confidential and intended only for the addressee. If you
> have received this message in error, please immediately notify the
> postmaster at nds.com and delete it from your system as well as any copies.
> The content of e-mails as well as traffic data may be monitored by NDS for
> employment and security purposes.
> To protect the environment please do not print this e-mail unless
> necessary.
> An NDS Group Limited company. www.nds.com

From chris.alexander at kusiri.com  Sun Dec 11 15:26:23 2011
From: chris.alexander at kusiri.com (Chris Alexander)
Date: Sun, 11 Dec 2011 15:26:23 +0000
Subject: [Linux-cluster] Nodes leaving and re-joining intermittently
In-Reply-To:
References: <4EE3C74A.3000206@alteeve.com> <4EE3DBBF.9080006@alteeve.com> <6DAE69EA69F39E4B9DA073B8C848A27C60E82DE826@ILMA1.IL.NDS.COM>
Message-ID:

Please find below the cluster.conf Matt mentioned.

Regarding logs, I have verified that the two SNMP trap notifications Matt posted in his first message are the only ones our script processed anywhere near this event window (days until the previous one, none since). I will have a look at the on-disk logging tomorrow and see if there is anything of worth over that time period on any of the cluster nodes; a sketch of where I plan to look is below.

Thanks,
Chris
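
For anyone following along, the log locations worth grepping on each node; the paths are typical RHEL-cluster defaults and may differ on other setups:

  grep -iE 'totem|membership|fenc' /var/log/messages
  # cluster 3.x daemons frequently log here as well:
  grep -iE 'totem|membership' /var/log/cluster/corosync.log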