From Micah.Schaefer at jhuapl.edu Wed Jun 4 14:59:01 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 4 Jun 2014 10:59:01 -0400 Subject: [Linux-cluster] Node is randomly fenced Message-ID: I have a 4 node cluster, running a single service group. I have been seeing node1 fence node3 while node3 is actively running the service group at random intervals. Rgmanager logs show no failures in service checks, and no other logs provide any useful information. How can I go about finding out why node1 is fencing node3? I currently set up the failover domain to be restricted and not include node3. cluster.conf : http://pastebin.com/xYy6xp6N From emi2fast at gmail.com Wed Jun 4 15:11:12 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 4 Jun 2014 17:11:12 +0200 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: Message-ID: logs? 2014-06-04 16:59 GMT+02:00 Schaefer, Micah : > I have a 4 node cluster, running a single service group. I have been > seeing node1 fence node3 while node3 is actively running the service group > at random intervals. > > Rgmanager logs show no failures in service checks, and no other logs > provide any useful information. How can I go about finding out why node1 > is fencing node3? > > I currently set up the failover domain to be restricted and not include > node3. > > cluster.conf : http://pastebin.com/xYy6xp6N > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jun 4 15:13:15 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 04 Jun 2014 11:13:15 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: Message-ID: <538F378B.8030407@alteeve.ca> On 04/06/14 10:59 AM, Schaefer, Micah wrote: > I have a 4 node cluster, running a single service group. I have been > seeing node1 fence node3 while node3 is actively running the service group > at random intervals. > > Rgmanager logs show no failures in service checks, and no other logs > provide any useful information. How can I go about finding out why node1 > is fencing node3? > > I currently set up the failover domain to be restricted and not include > node3. > > cluster.conf : http://pastebin.com/xYy6xp6N Random fencing is almost always caused by network failures. Can you look are the system logs, starting a little before the fence and continuing until after the fence completes, and paste them here? I suspect you will see corosync complaining. If this is true, do your switches support persistent multicast? Do you use active/passive bonding? Have you tried different switch/cable/NIC? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Wed Jun 4 15:32:45 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 4 Jun 2014 11:32:45 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <538F378B.8030407@alteeve.ca> References: <538F378B.8030407@alteeve.ca> Message-ID: Logs: http://pastebin.com/QCh5FzZu I have one 10gb nic connected Here is the corosync log from node1, I see that is says ? A processor failed, forming new configuration.?, I need to dig deeper though. 
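(For reference, the excerpt below is the kind of thing you can pull with something along these lines; /var/log/cluster/corosync.log is the usual RHEL 6 location, but the path may differ depending on the <logging> settings in cluster.conf:

  # membership changes and fence events, from a bit before the fence until after it completes
  grep -E 'TOTEM|QUORUM|CPG|fenc' /var/log/cluster/corosync.log /var/log/messages

)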
May 27 10:03:49 corosync [QUORUM] Members[4]: 1 2 3 4 May 27 10:05:04 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 03 13:52:34 corosync [TOTEM ] A processor failed, forming new configuration. Jun 03 13:52:46 corosync [QUORUM] Members[3]: 1 2 4 Jun 03 13:52:46 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:52:46 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) Jun 03 13:52:46 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 13:56:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:56:14 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 13:56:14 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 13:56:28 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:56:28 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 13:56:28 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 13:56:41 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:56:41 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 13:56:41 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 13:57:04 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:57:04 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 13:57:04 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 15:12:09 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 03 15:12:09 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 15:12:09 corosync [MAIN ] Completed service synchronization, ready to provide service. Regards, ------- Micah Schaefer JHU/ APL ITSD/ ITC 240-228-1148 (x81148) On 6/4/14, 11:13 AM, "Digimer" wrote: >On 04/06/14 10:59 AM, Schaefer, Micah wrote: >> I have a 4 node cluster, running a single service group. I have been >> seeing node1 fence node3 while node3 is actively running the service >>group >> at random intervals. >> >> Rgmanager logs show no failures in service checks, and no other logs >> provide any useful information. How can I go about finding out why node1 >> is fencing node3? >> >> I currently set up the failover domain to be restricted and not include >> node3. >> >> cluster.conf : http://pastebin.com/xYy6xp6N > >Random fencing is almost always caused by network failures. Can you look >are the system logs, starting a little before the fence and continuing >until after the fence completes, and paste them here? I suspect you will >see corosync complaining. > >If this is true, do your switches support persistent multicast? Do you >use active/passive bonding? Have you tried different switch/cable/NIC? > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? 
> >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From alkol6 at gmail.com Wed Jun 4 15:48:31 2014 From: alkol6 at gmail.com (Senol Erdogan) Date: Wed, 4 Jun 2014 11:48:31 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: Message-ID: The problem looks like it is in the failover domains' node priorities. The same nodes are given different priorities in the different domains, and that can trigger unexpected fences. Maybe you can narrow it down by re-activating your failover domains step by step. (Of course, only after all the network and firewall settings are right and problem-free.) Senol Erdogan On Jun 4, 2014 11:06 AM, "Schaefer, Micah" wrote: > I have a 4 node cluster, running a single service group. I have been > seeing node1 fence node3 while node3 is actively running the service group > at random intervals. > > Rgmanager logs show no failures in service checks, and no other logs > provide any useful information. How can I go about finding out why node1 > is fencing node3? > > I currently set up the failover domain to be restricted and not include > node3. > > cluster.conf : http://pastebin.com/xYy6xp6N > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jfriesse at redhat.com Thu Jun 5 08:36:58 2014 From: jfriesse at redhat.com (Jan Friesse) Date: Thu, 05 Jun 2014 10:36:58 +0200 Subject: [Linux-cluster] [Openais] Newbie clustering questions In-Reply-To: References: Message-ID: <53902C2A.3060108@redhat.com> Per, it looks like none of your questions is really corosync related (so I'm CC'ing linux clustering, which is really the better list), but I will try to answer at least some of your questions. > Hi all > > I have redhat clustering running on a 3 VMware vm's 2 nodes and 1 > management server I can join the nodes without any problems but I got a > couple of questions that I hope someone here can shed some lights on for me. > > If I want to add a ip resource to the cluster must both nodes be configured > with a interface with that ip or is there a better way of doing it? If not You must make sure that NO node has this address assigned. The IPAddr resource will take care of adding the ip to an interface. > then can one of the nodes have the nic in standby? > I don't think this is supported by any resource script. > How do I add fencing for a VMware vm's I notice that there is the VMware > soa must the each vm be configured with its individual VMware soa fencing > or is fencing not needed? From what I can read fencing is needed. Every node must be able to fence any other node. So you have to configure a fencing method for every node. In theory fencing is not needed as long as you are not using shared storage, but it's still better to have it. > > I am using Centos .6.5 with ESXI 4.1 > > Many thanks for your time > > Regards > Regards, Honza > > > _______________________________________________ > Openais mailing list > Openais at lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/openais > From jfriesse at redhat.com Fri Jun 6 07:37:50 2014 From: jfriesse at redhat.com (Jan Friesse) Date: Fri, 06 Jun 2014 09:37:50 +0200 Subject: [Linux-cluster] [Openais] Newbie clustering questions In-Reply-To: References: <53902C2A.3060108@redhat.com> Message-ID: <53916FCE.2000007@redhat.com> Per, > Hi Jan > > Many thanks for your response.
> > I spent some more time on this yesterday so I found out that the nodes > needs really to have 2 nics, 2 ip's and the resource ip gets assigned to > the node that becomes the running node. > You don't need two nics. Even tho it's better, because you have separated cluster traffic from app traffic. > I have setup vmware fencing for each node, but I could not see anything in > the configuration to allow or disallow one node to fence of the other or > does this happen automagically? apologies if the question seem a bit stupid Yes, it is happening automatically as long as you've configured fencing for every node. > but 1 week ago I started this project with very little experience in > clustering :) > That's why I'm recommending you to ask linux-cluster at redhat.com and read some docs (https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/High_Availability_Add-On_Overview/index.html). Regards, Honza > Regards > Per Qvindesland > > > On Thu, Jun 5, 2014 at 9:36 AM, Jan Friesse wrote: > >> Per, >> it looks like none of your question is really corosync related (so I'm >> CC'ing linux clustering (this is really >> better list) but I will try to answer at least some of your questions. >> >>> Hi all >>> >>> I have redhat clustering running on a 3 VMware vm's 2 nodes and 1 >>> management server I can join the nodes without any problems but I got a >>> couple of questions that I hope someone here can shed some lights on for >> me. >>> >>> If I want to add a ip resource to the cluster must both nodes be >> configured >>> with a interface with that ip or is there a better way of doing it? If >> not >> >> You must make sure that NO nodes has this address assigned. IPAddr >> resource will take care to add ip to interface. >> >>> then can one of the nodes have the nic in standby? >>> >> >> I don't think this is supported by any resource script. >> >>> How do I add fencing for a VMware vm's I notice that there is the VMware >>> soa must the each vm be configured with its individual VMware soa fencing >>> or is fencing not needed? From what I can read fencing is needed. >> >> Every node must be able to fence any other node. So you have to >> configure fencing method for every node. >> >> In theory fencing is not needed as long as you are not using shared >> storage, but it's still better to have it. >> >>> >>> I am using Centos .6.5 with ESXI 4.1 >>> >>> Many thanks for your time >>> >>> Regards >>> >> >> Regards, >> Honza >> >>> >>> >>> _______________________________________________ >>> Openais mailing list >>> Openais at lists.linux-foundation.org >>> https://lists.linuxfoundation.org/mailman/listinfo/openais >>> >> >> > From arun.nair at dimensiondata.com Wed Jun 11 14:48:37 2014 From: arun.nair at dimensiondata.com (Arun G Nair) Date: Wed, 11 Jun 2014 20:18:37 +0530 Subject: [Linux-cluster] 2-node cluster fence loop Message-ID: Hello, What are the reasons for fence loops when only cman is started ? We have an RHEL 6.5 2-node cluster which goes in to a fence loop and every time we start cman on both nodes. Either one fences the other. Multicast seems to be working properly. My understanding is that without rgmanager running there won't be a multicast group subscription ? I don't see the multicast address in 'netstat -g' unless rgmanager is running. I've tried to increase the fence post_join_delay but one of the nodes still gets fenced. The cluster works fine if we use unicast UDP. 
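(To be clear, by unicast UDP I mean pointing cman at the udpu transport in cluster.conf, roughly like the sketch below; the cluster and node names are placeholders rather than our real config, and the fencing section is omitted:

  <cluster name="twonode" config_version="2">
    <!-- udpu = corosync sends to each node's address directly, so no multicast group is needed -->
    <cman transport="udpu" two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="node1" nodeid="1"/>
      <clusternode name="node2" nodeid="2"/>
    </clusternodes>
  </cluster>

)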
Thanks, -- Arun G Nair -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jun 11 15:03:48 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 11 Jun 2014 11:03:48 -0400 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: Message-ID: <53986FD4.6010902@alteeve.ca> On 11/06/14 10:48 AM, Arun G Nair wrote: > Hello, > > What are the reasons for fence loops when only cman is started ? We > have an RHEL 6.5 2-node cluster which goes in to a fence loop and every > time we start cman on both nodes. Either one fences the other. Multicast > seems to be working properly. My understanding is that without rgmanager > running there won't be a multicast group subscription ? I don't see the > multicast address in 'netstat -g' unless rgmanager is running. I've > tried to increase the fence post_join_delay but one of the nodes still > gets fenced. > > The cluster works fine if we use unicast UDP. > > Thanks, Hi, When cman starts, it waits post_join_delay seconds for the peer to connect. If, after that time expires (6 seconds by default, iirc), it gives up and calls a fence against the peer to put it into a known state. Corosync is what determines membership, and it is started by cman. The rgmanager only handles resource start/stop/relocate/recovery and has nothing to do with fencing directly. Corosync is what uses multicast. So as you seem to have already surmised, multicast is probably not working in your environment. Have you enabled multicast traffic on the firewall? Do your switches support multicast properly? digimer -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From muthukumar.t at hp.com Wed Jun 11 18:11:51 2014 From: muthukumar.t at hp.com (T, Muthukumar) Date: Wed, 11 Jun 2014 18:11:51 +0000 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: Message-ID: <8C558298378D604B9DB3536AFF81F04D1ECB5903@G5W2718.americas.hpqcorp.net> Hi all, When your cluster nodes get fenced while the cman service is starting, that can't really be called a fence loop; it usually comes down to a misconfiguration of the post_join_delay setting. By default post_join_delay is 3 seconds. While cman starts on a cluster node, it tries to get the status of the other cluster nodes to confirm the integrity of the cluster services; if the other nodes have not responded by the time post_join_delay expires, they are fenced by this node to ensure integrity (there is a chance that a node has already formed the cluster and started the cluster services). Fence looping is a different thing: it happens when there is a long-lasting failure in the heartbeat switch. Thanks & Regards Muthukumar T Production Engineering - UNIX 9790907286 From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Arun G Nair Sent: Wednesday, June 11, 2014 8:19 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] 2-node cluster fence loop Hello, What are the reasons for fence loops when only cman is started ? We have an RHEL 6.5 2-node cluster which goes in to a fence loop and every time we start cman on both nodes. Either one fences the other. Multicast seems to be working properly. My understanding is that without rgmanager running there won't be a multicast group subscription ? I don't see the multicast address in 'netstat -g' unless rgmanager is running. I've tried to increase the fence post_join_delay but one of the nodes still gets fenced.
The cluster works fine if we use unicast UDP. Thanks, -- Arun G Nair -------------- next part -------------- An HTML attachment was scrubbed... URL: From Micah.Schaefer at jhuapl.edu Wed Jun 11 18:21:59 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 11 Jun 2014 14:21:59 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> Message-ID: It failed again, even after deleting all the other failover domains. Cluster conf http://pastebin.com/jUXkwKS4 I turned corosync output to debug. How can I go about troubleshooting if it really is a network issue or something else? Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration. Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) Jun 11 14:10:29 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 11 14:13:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:13:54 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 11 14:13:54 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 11 14:14:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:14:08 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 11 14:14:08 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 11 14:14:21 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:14:21 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 11 14:14:21 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 11 14:14:43 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:14:43 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 11 14:14:43 corosync [MAIN ] Completed service synchronization, ready to provide service. On 6/4/14, 11:32 AM, "Schaefer, Micah" wrote: >Logs: http://pastebin.com/QCh5FzZu > >I have one 10gb nic connected > > >Here is the corosync log from node1, I see that is says ? A processor >failed, forming new configuration.?, I need to dig deeper though. > > >May 27 10:03:49 corosync [QUORUM] Members[4]: 1 2 3 4 >May 27 10:05:04 corosync [QUORUM] Members[4]: 1 2 3 4 >Jun 03 13:52:34 corosync [TOTEM ] A processor failed, forming new >configuration. >Jun 03 13:52:46 corosync [QUORUM] Members[3]: 1 2 4 >Jun 03 13:52:46 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:52:46 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:4 left:1) >Jun 03 13:52:46 corosync [MAIN ] Completed service synchronization, ready >to provide service. >Jun 03 13:56:14 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:56:14 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 13:56:14 corosync [MAIN ] Completed service synchronization, ready >to provide service. 
>Jun 03 13:56:28 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:56:28 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 13:56:28 corosync [MAIN ] Completed service synchronization, ready >to provide service. >Jun 03 13:56:41 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:56:41 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 13:56:41 corosync [MAIN ] Completed service synchronization, ready >to provide service. >Jun 03 13:57:04 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:57:04 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 13:57:04 corosync [MAIN ] Completed service synchronization, ready >to provide service. >Jun 03 15:12:09 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4 >Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4 >Jun 03 15:12:09 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 15:12:09 corosync [MAIN ] Completed service synchronization, ready >to provide service. > > > > > > > > > > > > >On 6/4/14, 11:13 AM, "Digimer" wrote: > >>On 04/06/14 10:59 AM, Schaefer, Micah wrote: >>> I have a 4 node cluster, running a single service group. I have been >>> seeing node1 fence node3 while node3 is actively running the service >>>group >>> at random intervals. >>> >>> Rgmanager logs show no failures in service checks, and no other logs >>> provide any useful information. How can I go about finding out why >>>node1 >>> is fencing node3? >>> >>> I currently set up the failover domain to be restricted and not include >>> node3. >>> >>> cluster.conf : http://pastebin.com/xYy6xp6N >> >>Random fencing is almost always caused by network failures. Can you look >>are the system logs, starting a little before the fence and continuing >>until after the fence completes, and paste them here? I suspect you will >>see corosync complaining. >> >>If this is true, do your switches support persistent multicast? Do you >>use active/passive bonding? Have you tried different switch/cable/NIC? >> >>-- >>Digimer >>Papers and Projects: https://alteeve.ca/w/ >>What if the cure for cancer is trapped in the mind of a person without >>access to education? >> >>-- >>Linux-cluster mailing list >>Linux-cluster at redhat.com >>https://www.redhat.com/mailman/listinfo/linux-cluster > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Wed Jun 11 18:29:30 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 11 Jun 2014 14:29:30 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> Message-ID: <5398A00A.4020802@alteeve.ca> On 11/06/14 02:21 PM, Schaefer, Micah wrote: > It failed again, even after deleting all the other failover domains. > > Cluster conf > http://pastebin.com/jUXkwKS4 > > I turned corosync output to debug. How can I go about troubleshooting if > it really is a network issue or something else? > > > > Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 > Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new > configuration. 
> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 > Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) > ip(10.70.100.101) ; members(old:4 left:1) This is, to me, *strongly* indicative of a network issue. It's not likely switch-wide as only one member was lost, but I would certainly put my money on a network problem somewhere, some how. Do you use bonding? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Wed Jun 11 18:55:07 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 11 Jun 2014 14:55:07 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5398A00A.4020802@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> Message-ID: I have the issue on two of my nodes. Each node has 1ea 10gb connection. No bonding, single link. What else can I look at? I manage the network too. I don?t see any link down notifications, don?t see any errors on the ports. On 6/11/14, 2:29 PM, "Digimer" wrote: >On 11/06/14 02:21 PM, Schaefer, Micah wrote: >> It failed again, even after deleting all the other failover domains. >> >> Cluster conf >> http://pastebin.com/jUXkwKS4 >> >> I turned corosync output to debug. How can I go about troubleshooting if >> it really is a network issue or something else? >> >> >> >> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 >> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new >> configuration. >> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 >> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >> ip(10.70.100.101) ; members(old:4 left:1) > >This is, to me, *strongly* indicative of a network issue. It's not >likely switch-wide as only one member was lost, but I would certainly >put my money on a network problem somewhere, some how. > >Do you use bonding? > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Wed Jun 11 19:28:28 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 11 Jun 2014 15:28:28 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> Message-ID: <5398ADDC.80501@alteeve.ca> The first thing I would do is get a second NIC and configure active-passive bonding. network issues are too common to ignore in HA setups. Ideally, I would span the links across separate stacked switches. As for debugging the issue, I can only recommend to look closely at the system and switch logs for clues. On 11/06/14 02:55 PM, Schaefer, Micah wrote: > I have the issue on two of my nodes. Each node has 1ea 10gb connection. No > bonding, single link. What else can I look at? I manage the network too. I > don?t see any link down notifications, don?t see any errors on the ports. > > > > > On 6/11/14, 2:29 PM, "Digimer" wrote: > >> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>> It failed again, even after deleting all the other failover domains. 
>>> >>> Cluster conf >>> http://pastebin.com/jUXkwKS4 >>> >>> I turned corosync output to debug. How can I go about troubleshooting if >>> it really is a network issue or something else? >>> >>> >>> >>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 >>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new >>> configuration. >>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 >>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the >>> membership and a new membership was formed. >>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>> ip(10.70.100.101) ; members(old:4 left:1) >> >> This is, to me, *strongly* indicative of a network issue. It's not >> likely switch-wide as only one member was lost, but I would certainly >> put my money on a network problem somewhere, some how. >> >> Do you use bonding? >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Wed Jun 11 19:50:14 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 11 Jun 2014 15:50:14 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5398ADDC.80501@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> Message-ID: Okay, I set up active/ backup bonding and will watch for any change. This is the network side: 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored 0 output errors, 0 collisions, 0 interface resets This is the server side: em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 GiB) Interrupt:34 Memory:d5000000-d57fffff I need to run some fiber, but for now two nodes are plugged into one switch and the other two nodes into a separate switch that are on the same subnet. I?ll work on cross connecting the bonded interfaces to different switches. On 6/11/14, 3:28 PM, "Digimer" wrote: >The first thing I would do is get a second NIC and configure >active-passive bonding. network issues are too common to ignore in HA >setups. Ideally, I would span the links across separate stacked switches. > >As for debugging the issue, I can only recommend to look closely at the >system and switch logs for clues. > >On 11/06/14 02:55 PM, Schaefer, Micah wrote: >> I have the issue on two of my nodes. Each node has 1ea 10gb connection. >>No >> bonding, single link. What else can I look at? I manage the network >>too. I >> don?t see any link down notifications, don?t see any errors on the >>ports. >> >> >> >> >> On 6/11/14, 2:29 PM, "Digimer" wrote: >> >>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>> It failed again, even after deleting all the other failover domains. >>>> >>>> Cluster conf >>>> http://pastebin.com/jUXkwKS4 >>>> >>>> I turned corosync output to debug. 
How can I go about troubleshooting >>>>if >>>> it really is a network issue or something else? >>>> >>>> >>>> >>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 >>>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>> configuration. >>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 >>>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the >>>> membership and a new membership was formed. >>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>> ip(10.70.100.101) ; members(old:4 left:1) >>> >>> This is, to me, *strongly* indicative of a network issue. It's not >>> likely switch-wide as only one member was lost, but I would certainly >>> put my money on a network problem somewhere, some how. >>> >>> Do you use bonding? >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ >>> What if the cure for cancer is trapped in the mind of a person without >>> access to education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From gnetravali at sonusnet.com Thu Jun 12 04:12:01 2014 From: gnetravali at sonusnet.com (Netravali, Ganesh) Date: Thu, 12 Jun 2014 04:12:01 +0000 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> Message-ID: <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> Make sure multicast is enabled across the switches. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Schaefer, Micah Sent: Thursday, June 12, 2014 1:20 AM To: linux clustering Subject: Re: [Linux-cluster] Node is randomly fenced Okay, I set up active/ backup bonding and will watch for any change. This is the network side: 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored 0 output errors, 0 collisions, 0 interface resets This is the server side: em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 GiB) Interrupt:34 Memory:d5000000-d57fffff I need to run some fiber, but for now two nodes are plugged into one switch and the other two nodes into a separate switch that are on the same subnet. I'll work on cross connecting the bonded interfaces to different switches. On 6/11/14, 3:28 PM, "Digimer" wrote: >The first thing I would do is get a second NIC and configure >active-passive bonding. network issues are too common to ignore in HA >setups. Ideally, I would span the links across separate stacked switches. > >As for debugging the issue, I can only recommend to look closely at the >system and switch logs for clues. > >On 11/06/14 02:55 PM, Schaefer, Micah wrote: >> I have the issue on two of my nodes. Each node has 1ea 10gb connection. >>No >> bonding, single link. What else can I look at? I manage the network >>too. 
I don?t see any link down notifications, don?t see any errors on >>the ports. >> >> >> >> >> On 6/11/14, 2:29 PM, "Digimer" wrote: >> >>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>> It failed again, even after deleting all the other failover domains. >>>> >>>> Cluster conf >>>> http://pastebin.com/jUXkwKS4 >>>> >>>> I turned corosync output to debug. How can I go about >>>>troubleshooting if it really is a network issue or something else? >>>> >>>> >>>> >>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 >>>> 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>> configuration. >>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 >>>> corosync [TOTEM ] A processor joined or left the membership and a >>>> new membership was formed. >>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>> ip(10.70.100.101) ; members(old:4 left:1) >>> >>> This is, to me, *strongly* indicative of a network issue. It's not >>> likely switch-wide as only one member was lost, but I would >>> certainly put my money on a network problem somewhere, some how. >>> >>> Do you use bonding? >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ What if the cure for >>> cancer is trapped in the mind of a person without access to >>> education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer >is trapped in the mind of a person without access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 04:19:50 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 00:19:50 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> Message-ID: <53992A66.4070109@alteeve.ca> I considered that, but I would expect more nodes to be lost. On 12/06/14 12:12 AM, Netravali, Ganesh wrote: > Make sure multicast is enabled across the switches. > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Schaefer, Micah > Sent: Thursday, June 12, 2014 1:20 AM > To: linux clustering > Subject: Re: [Linux-cluster] Node is randomly fenced > > Okay, I set up active/ backup bonding and will watch for any change. 
> > This is the network side: > 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored > 0 output errors, 0 collisions, 0 interface resets > > > > This is the server side: > > em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD > inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 > inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 > TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 GiB) > Interrupt:34 Memory:d5000000-d57fffff > > > > I need to run some fiber, but for now two nodes are plugged into one switch and the other two nodes into a separate switch that are on the same subnet. I'll work on cross connecting the bonded interfaces to different switches. > > > > On 6/11/14, 3:28 PM, "Digimer" wrote: > >> The first thing I would do is get a second NIC and configure >> active-passive bonding. network issues are too common to ignore in HA >> setups. Ideally, I would span the links across separate stacked switches. >> >> As for debugging the issue, I can only recommend to look closely at the >> system and switch logs for clues. >> >> On 11/06/14 02:55 PM, Schaefer, Micah wrote: >>> I have the issue on two of my nodes. Each node has 1ea 10gb connection. >>> No >>> bonding, single link. What else can I look at? I manage the network >>> too. I don?t see any link down notifications, don?t see any errors on >>> the ports. >>> >>> >>> >>> >>> On 6/11/14, 2:29 PM, "Digimer" wrote: >>> >>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>>> It failed again, even after deleting all the other failover domains. >>>>> >>>>> Cluster conf >>>>> http://pastebin.com/jUXkwKS4 >>>>> >>>>> I turned corosync output to debug. How can I go about >>>>> troubleshooting if it really is a network issue or something else? >>>>> >>>>> >>>>> >>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 >>>>> 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>>> configuration. >>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 >>>>> corosync [TOTEM ] A processor joined or left the membership and a >>>>> new membership was formed. >>>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>>> ip(10.70.100.101) ; members(old:4 left:1) >>>> >>>> This is, to me, *strongly* indicative of a network issue. It's not >>>> likely switch-wide as only one member was lost, but I would >>>> certainly put my money on a network problem somewhere, some how. >>>> >>>> Do you use bonding? >>>> >>>> -- >>>> Digimer >>>> Papers and Projects: https://alteeve.ca/w/ What if the cure for >>>> cancer is trapped in the mind of a person without access to >>>> education? >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >> >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer >> is trapped in the mind of a person without access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
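For completeness, the active/backup bond being described in this thread is the standard RHEL 6 sysconfig setup, roughly like the sketch below; the device names and the redacted address are examples rather than the actual config from these nodes:

  /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    ONBOOT=yes
    BOOTPROTO=none
    IPADDR=x.x.x.x
    NETMASK=255.255.255.0
    # mode=1 is active-backup; miimon=100 polls link state every 100 ms
    BONDING_OPTS="mode=1 miimon=100"

  /etc/sysconfig/network-scripts/ifcfg-em1 (and the same for the second slave NIC)
    DEVICE=em1
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none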
From lzhong at suse.com Thu Jun 12 06:42:58 2014 From: lzhong at suse.com (Lidong Zhong) Date: Thu, 12 Jun 2014 14:42:58 +0800 Subject: [Linux-cluster] [RFC] dlm: keep listening connection alive with sctp mode Message-ID: <1402555378-5220-1-git-send-email-lzhong@suse.com> Currently, when a node closes a connection, it sends a user-initiated ABORT instead of shutting down gracefully (ece35848c184). Sadly it can also close the listening connection, so the node will fail to rejoin the cluster. I set up a two-node cluster to test this. While the cluster is working fine, the connections look like this: clt-n2-sles12b7-2:~ # netstat -apn|grep sctp sctp 147.2.208.197:21064 LISTEN - sctp 0 4 0.0.82.72:62887 147.2.208.197:21064 ESTABLISHED - and if I reboot the other node or stop dlm, all the connections are lost: clt-n2-sles12b7-2:~ # netstat -apn | grep sctp clt-n2-sles12b7-2:~ # so when the other node tries to rejoin the cluster, the following messages flood the log because there is no listening port any more. dlm: Trying to connect to 192.168.3.4 dlm: Can't start SCTP association - retrying dlm: Retry sending 64 bytes to node id 318951621 dlm: Retrying SCTP association init for node 318951621 Signed-off-by: Lidong Zhong --- fs/dlm/lowcomms.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c index 1e5b453..d08e079 100644 --- a/fs/dlm/lowcomms.c +++ b/fs/dlm/lowcomms.c @@ -617,6 +617,11 @@ static void retry_failed_sctp_send(struct connection *recv_con, int nodeid = sn_send_failed->ssf_info.sinfo_ppid; log_print("Retry sending %d bytes to node id %d", len, nodeid); + + if (!nodeid) { + log_print("Shouldn't resend data via listening connection."); + return; + } con = nodeid2con(nodeid, 0); if (!con) { -- 1.8.1.4 From rpeterso at redhat.com Thu Jun 12 12:29:44 2014 From: rpeterso at redhat.com (Bob Peterson) Date: Thu, 12 Jun 2014 08:29:44 -0400 (EDT) Subject: [Linux-cluster] [RFC] dlm: keep listening connection alive with sctp mode In-Reply-To: <1402555378-5220-1-git-send-email-lzhong@suse.com> References: <1402555378-5220-1-git-send-email-lzhong@suse.com> Message-ID: <742486000.20595916.1402576184717.JavaMail.zimbra@redhat.com> ----- Original Message ----- (snip) > Signed-off-by: Lidong Zhong Hi Lidong, There is a special public mailing list for patches like this and other cluster-related development. The mailing list is called cluster-devel. Here is a link where you can subscribe to it: https://www.redhat.com/mailman/listinfo/cluster-devel I recommend you send your patch to cluster-devel at redhat.com. Regards, Bob Peterson Red Hat File Systems From arun.nair at dimensiondata.com Thu Jun 12 14:29:06 2014 From: arun.nair at dimensiondata.com (Arun G Nair) Date: Thu, 12 Jun 2014 19:59:06 +0530 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: <53986FD4.6010902@alteeve.ca> References: <53986FD4.6010902@alteeve.ca> Message-ID: We have multicast enabled on the switch. I've also tried the multicast.py tool from RH's knowledge base to test multicast and I see the expected output, though the tool uses a different multicast IP (I guess that shouldn't matter). I've tried increasing the post_join_delay to 360 seconds to give me enough time to check everything on both the nodes. One node still gets fenced. `clustat` output says the other node is offline on both servers. So one node can't see the other one ? This again points to an issue with multicast. Any other clues as to what/where to look ?
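(If it matters, I can run something like the following on both nodes at the same time and post the output; omping would need to be installed, and the interface name here is just an example:

  # watch the totem traffic itself -- packets from both nodes should show up
  tcpdump -ni eth0 udp port 5404 or udp port 5405

  # omping exercises both the multicast and the unicast path between the listed hosts
  omping node1 node2

  # and check that iptables isn't quietly dropping the cluster traffic
  iptables -L -n -v | grep -iE '5404|5405|igmp'

)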
On Wed, Jun 11, 2014 at 8:33 PM, Digimer wrote: > On 11/06/14 10:48 AM, Arun G Nair wrote: > >> Hello, >> >> What are the reasons for fence loops when only cman is started ? We >> have an RHEL 6.5 2-node cluster which goes in to a fence loop and every >> time we start cman on both nodes. Either one fences the other. Multicast >> seems to be working properly. My understanding is that without rgmanager >> running there won't be a multicast group subscription ? I don't see the >> multicast address in 'netstat -g' unless rgmanager is running. I've >> tried to increase the fence post_join_delay but one of the nodes still >> gets fenced. >> >> The cluster works fine if we use unicast UDP. >> >> Thanks, >> > > Hi, > > When cman starts, it waits post_join_delay seconds for the peer to > connect. If, after that time expires (6 seconds by default, iirc), it gives > up and calls a fence against the peer to put it into a known state. > > Corosync is what determines membership, and it is started by cman. The > rgmanager only handles resource start/stop/relocate/recovery and has > nothing to do with fencing directly. Corosync is what uses multicast. > > So as you seem to have already surmised, multicast is probably not > working in your environment. Have you enabled multicast traffic on the > firewall? Do your switches support multicast properly? > > digimer > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Arun G Nair Sr. Sysadmin Dimension Data | Ph: (800) 664-9973 Feedback? We're listening -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkovachev at varna.net Thu Jun 12 14:43:06 2014 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Thu, 12 Jun 2014 17:43:06 +0300 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: <53986FD4.6010902@alteeve.ca> Message-ID: Do you have a different auth key on each node by any chance? On 2014-06-12 17:29, Arun G Nair wrote: > We have multicast enabled on the switch. I've also tried the > multicast.py tool from RH's knowledge base to test multicast and I see > the expected output, though the tool uses a different multicast IP( > guess that shouldn't matter). I've tried increasing the post_join_delay > to 360 seconds to give me enough time to check everything on both the > nodes. One node still gets fenced. `clustat` output says the other node > is offline on both servers. So one node can't see the other one ? This > again points to issue with multicast. Any other clues as to what/where > to look ? > > On Wed, Jun 11, 2014 at 8:33 PM, Digimer wrote: > > On 11/06/14 10:48 AM, Arun G Nair wrote: > Hello, > > What are the reasons for fence loops when only cman is started ? We > have an RHEL 6.5 2-node cluster which goes in to a fence loop and every > time we start cman on both nodes. Either one fences the other. > Multicast > seems to be working properly. My understanding is that without > rgmanager > running there won't be a multicast group subscription ? I don't see the > multicast address in 'netstat -g' unless rgmanager is running. I've > tried to increase the fence post_join_delay but one of the nodes still > gets fenced. > > The cluster works fine if we use unicast UDP. > > Thanks, Hi, > > When cman starts, it waits post_join_delay seconds for the peer to > connect. 
If, after that time expires (6 seconds by default, iirc), it > gives up and calls a fence against the peer to put it into a known > state. > > Corosync is what determines membership, and it is started by cman. The > rgmanager only handles resource start/stop/relocate/recovery and has > nothing to do with fencing directly. Corosync is what uses multicast. > > So as you seem to have already surmised, multicast is probably not > working in your environment. Have you enabled multicast traffic on the > firewall? Do your switches support multicast properly? > > digimer > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ [1] > What if the cure for cancer is trapped in the mind of a person without > access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster [2] -- Arun G Nair Sr. Sysadmin Dimension Data | Ph: (800) 664-9973 Feedback? We're listening [3] Links: ------ [1] https://alteeve.ca/w/ [2] https://www.redhat.com/mailman/listinfo/linux-cluster [3] http://www.surveymonkey.com/s/XRCYXBH From emi2fast at gmail.com Thu Jun 12 15:05:03 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Thu, 12 Jun 2014 17:05:03 +0200 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: <53986FD4.6010902@alteeve.ca> Message-ID: I always used "tcpdump -ni bond1 port 5405" to check if both nodes are involved in the comunication, if isn't like that, that would say is multicast problem 2014-06-12 16:43 GMT+02:00 Kaloyan Kovachev : > Do you have a different auth key on each node by any chance? > > > On 2014-06-12 17:29, Arun G Nair wrote: > >> We have multicast enabled on the switch. I've also tried the multicast.py >> tool from RH's knowledge base to test multicast and I see the expected >> output, though the tool uses a different multicast IP( guess that shouldn't >> matter). I've tried increasing the post_join_delay to 360 seconds to give me >> enough time to check everything on both the nodes. One node still gets >> fenced. `clustat` output says the other node is offline on both servers. So >> one node can't see the other one ? This again points to issue with >> multicast. Any other clues as to what/where to look ? >> >> On Wed, Jun 11, 2014 at 8:33 PM, Digimer wrote: >> >> On 11/06/14 10:48 AM, Arun G Nair wrote: >> Hello, >> >> What are the reasons for fence loops when only cman is started ? We >> have an RHEL 6.5 2-node cluster which goes in to a fence loop and every >> time we start cman on both nodes. Either one fences the other. Multicast >> seems to be working properly. My understanding is that without rgmanager >> running there won't be a multicast group subscription ? I don't see the >> multicast address in 'netstat -g' unless rgmanager is running. I've >> tried to increase the fence post_join_delay but one of the nodes still >> gets fenced. >> >> The cluster works fine if we use unicast UDP. >> >> Thanks, Hi, >> >> When cman starts, it waits post_join_delay seconds for the peer to >> connect. If, after that time expires (6 seconds by default, iirc), it gives >> up and calls a fence against the peer to put it into a known state. >> >> Corosync is what determines membership, and it is started by cman. The >> rgmanager only handles resource start/stop/relocate/recovery and has nothing >> to do with fencing directly. Corosync is what uses multicast. >> >> So as you seem to have already surmised, multicast is probably not working >> in your environment. 
Have you enabled multicast traffic on the firewall? Do >> your switches support multicast properly? >> >> digimer >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ [1] >> >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster [2] > > > -- > Arun G Nair > Sr. Sysadmin > Dimension Data | Ph: (800) 664-9973 > Feedback? We're listening [3] > > > > Links: > ------ > [1] https://alteeve.ca/w/ > [2] https://www.redhat.com/mailman/listinfo/linux-cluster > [3] http://www.surveymonkey.com/s/XRCYXBH > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- esta es mi vida e me la vivo hasta que dios quiera From Micah.Schaefer at jhuapl.edu Thu Jun 12 15:32:57 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 11:32:57 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <53992A66.4070109@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> Message-ID: Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and fenced, then node3 was fenced when node4 came back online. The network topology is as follows: switch1: node1, node3 (two connections) switch2: node2, node4 (two connections) switch1 switch2 All on the same subnet I set up monitoring at 100 millisecond of the nics in active-backup mode, and saw no messages about link problems before the fence. I see multicast between the servers using tcpdump. Any more ideas? On 6/12/14, 12:19 AM, "Digimer" wrote: >I considered that, but I would expect more nodes to be lost. > >On 12/06/14 12:12 AM, Netravali, Ganesh wrote: >> Make sure multicast is enabled across the switches. >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com >>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Schaefer, Micah >> Sent: Thursday, June 12, 2014 1:20 AM >> To: linux clustering >> Subject: Re: [Linux-cluster] Node is randomly fenced >> >> Okay, I set up active/ backup bonding and will watch for any change. >> >> This is the network side: >> 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored >> 0 output errors, 0 collisions, 0 interface resets >> >> >> >> This is the server side: >> >> em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD >> inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 >> inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link >> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 >> RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:1000 >> RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 >>GiB) >> Interrupt:34 Memory:d5000000-d57fffff >> >> >> >> I need to run some fiber, but for now two nodes are plugged into one >>switch and the other two nodes into a separate switch that are on the >>same subnet. I'll work on cross connecting the bonded interfaces to >>different switches. >> >> >> >> On 6/11/14, 3:28 PM, "Digimer" wrote: >> >>> The first thing I would do is get a second NIC and configure >>> active-passive bonding. network issues are too common to ignore in HA >>> setups. Ideally, I would span the links across separate stacked >>>switches. 
>>> >>> As for debugging the issue, I can only recommend to look closely at the >>> system and switch logs for clues. >>> >>> On 11/06/14 02:55 PM, Schaefer, Micah wrote: >>>> I have the issue on two of my nodes. Each node has 1ea 10gb >>>>connection. >>>> No >>>> bonding, single link. What else can I look at? I manage the network >>>> too. I don?t see any link down notifications, don?t see any errors on >>>> the ports. >>>> >>>> >>>> >>>> >>>> On 6/11/14, 2:29 PM, "Digimer" wrote: >>>> >>>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>>>> It failed again, even after deleting all the other failover domains. >>>>>> >>>>>> Cluster conf >>>>>> http://pastebin.com/jUXkwKS4 >>>>>> >>>>>> I turned corosync output to debug. How can I go about >>>>>> troubleshooting if it really is a network issue or something else? >>>>>> >>>>>> >>>>>> >>>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 >>>>>> 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>>>> configuration. >>>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 >>>>>> corosync [TOTEM ] A processor joined or left the membership and a >>>>>> new membership was formed. >>>>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>>>> ip(10.70.100.101) ; members(old:4 left:1) >>>>> >>>>> This is, to me, *strongly* indicative of a network issue. It's not >>>>> likely switch-wide as only one member was lost, but I would >>>>> certainly put my money on a network problem somewhere, some how. >>>>> >>>>> Do you use bonding? >>>>> >>>>> -- >>>>> Digimer >>>>> Papers and Projects: https://alteeve.ca/w/ What if the cure for >>>>> cancer is trapped in the mind of a person without access to >>>>> education? >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>> >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer >>> is trapped in the mind of a person without access to education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 16:25:14 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 12:25:14 -0400 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: <53986FD4.6010902@alteeve.ca> Message-ID: <5399D46A.6080205@alteeve.ca> Have you tried simple things like disabling iptables or selinux, just to test? If that doesn't work, and it's a small cluster, try unicast and see if that helps (again, even if just to test). On 12/06/14 10:29 AM, Arun G Nair wrote: > We have multicast enabled on the switch. I've also tried the > multicast.py tool from RH's knowledge base to test multicast and I see > the expected output, though the tool uses a different multicast IP( > guess that shouldn't matter). I've tried increasing the post_join_delay > to 360 seconds to give me enough time to check everything on both the > nodes. One node still gets fenced. `clustat` output says the other node > is offline on both servers. So one node can't see the other one ? This > again points to issue with multicast. Any other clues as to what/where > to look ? 
> > > On Wed, Jun 11, 2014 at 8:33 PM, Digimer > wrote: > > On 11/06/14 10:48 AM, Arun G Nair wrote: > > Hello, > > What are the reasons for fence loops when only cman is > started ? We > have an RHEL 6.5 2-node cluster which goes in to a fence loop > and every > time we start cman on both nodes. Either one fences the other. > Multicast > seems to be working properly. My understanding is that without > rgmanager > running there won't be a multicast group subscription ? I don't > see the > multicast address in 'netstat -g' unless rgmanager is running. I've > tried to increase the fence post_join_delay but one of the nodes > still > gets fenced. > > The cluster works fine if we use unicast UDP. > > Thanks, > > > Hi, > > When cman starts, it waits post_join_delay seconds for the peer > to connect. If, after that time expires (6 seconds by default, > iirc), it gives up and calls a fence against the peer to put it into > a known state. > > Corosync is what determines membership, and it is started by > cman. The rgmanager only handles resource > start/stop/relocate/recovery and has nothing to do with fencing > directly. Corosync is what uses multicast. > > So as you seem to have already surmised, multicast is probably > not working in your environment. Have you enabled multicast traffic > on the firewall? Do your switches support multicast properly? > > digimer > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person > without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/__mailman/listinfo/linux-cluster > > > > > > -- > Arun G Nair > Sr. Sysadmin > Dimension Data | Ph: (800) 664-9973 > Feedback? We're listening > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From lists at alteeve.ca Thu Jun 12 16:31:43 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 12:31:43 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> Message-ID: <5399D5EF.9050605@alteeve.ca> To confirm; Have you tried with the bonds setup where each node has one link into either switch? I just want to be sure you've ruled out all the network hardware. Also please confirm that you used mode=1 (active-passive) bonding. Assuming this doesn't help, then I would say that I was wrong in assuming it was network related. The next thing I would look at is corosync. Do you see any messages about totem retransmit? On 12/06/14 11:32 AM, Schaefer, Micah wrote: > Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and > fenced, then node3 was fenced when node4 came back online. The network > topology is as follows: > switch1: node1, node3 (two connections) > switch2: node2, node4 (two connections) > switch1 switch2 > All on the same subnet > > I set up monitoring at 100 millisecond of the nics in active-backup mode, > and saw no messages about link problems before the fence. > > I see multicast between the servers using tcpdump. > > > Any more ideas? > > > > > > On 6/12/14, 12:19 AM, "Digimer" wrote: > >> I considered that, but I would expect more nodes to be lost. >> >> On 12/06/14 12:12 AM, Netravali, Ganesh wrote: >>> Make sure multicast is enabled across the switches. 
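Verifying that multicast really flows between the cluster interfaces, across both switches, is reasonably quick with omping, which is packaged for RHEL 6; run it on every node at the same time, listing each node's cluster address (addresses on this cluster's 10.70.100.x subnet are used purely for illustration):

  # run on all four nodes simultaneously
  omping 10.70.100.101 10.70.100.102 10.70.100.103 10.70.100.104

If the multicast lines show heavy or total loss while the unicast lines are clean, the usual culprit is IGMP snooping on the switches without a working querier, which would match a cluster that only behaves once it is moved to unicast. While it runs, corosync's own traffic (UDP ports 5404-5405 by default) can also be watched with tcpdump to confirm nothing is being filtered on the hosts.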
>>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Schaefer, Micah >>> Sent: Thursday, June 12, 2014 1:20 AM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] Node is randomly fenced >>> >>> Okay, I set up active/ backup bonding and will watch for any change. >>> >>> This is the network side: >>> 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored >>> 0 output errors, 0 collisions, 0 interface resets >>> >>> >>> >>> This is the server side: >>> >>> em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD >>> inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 >>> inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link >>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 >>> RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 >>> TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 >>> collisions:0 txqueuelen:1000 >>> RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 >>> GiB) >>> Interrupt:34 Memory:d5000000-d57fffff >>> >>> >>> >>> I need to run some fiber, but for now two nodes are plugged into one >>> switch and the other two nodes into a separate switch that are on the >>> same subnet. I'll work on cross connecting the bonded interfaces to >>> different switches. >>> >>> >>> >>> On 6/11/14, 3:28 PM, "Digimer" wrote: >>> >>>> The first thing I would do is get a second NIC and configure >>>> active-passive bonding. network issues are too common to ignore in HA >>>> setups. Ideally, I would span the links across separate stacked >>>> switches. >>>> >>>> As for debugging the issue, I can only recommend to look closely at the >>>> system and switch logs for clues. >>>> >>>> On 11/06/14 02:55 PM, Schaefer, Micah wrote: >>>>> I have the issue on two of my nodes. Each node has 1ea 10gb >>>>> connection. >>>>> No >>>>> bonding, single link. What else can I look at? I manage the network >>>>> too. I don?t see any link down notifications, don?t see any errors on >>>>> the ports. >>>>> >>>>> >>>>> >>>>> >>>>> On 6/11/14, 2:29 PM, "Digimer" wrote: >>>>> >>>>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>>>>> It failed again, even after deleting all the other failover domains. >>>>>>> >>>>>>> Cluster conf >>>>>>> http://pastebin.com/jUXkwKS4 >>>>>>> >>>>>>> I turned corosync output to debug. How can I go about >>>>>>> troubleshooting if it really is a network issue or something else? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 >>>>>>> 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>>>>> configuration. >>>>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 >>>>>>> corosync [TOTEM ] A processor joined or left the membership and a >>>>>>> new membership was formed. >>>>>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>>>>> ip(10.70.100.101) ; members(old:4 left:1) >>>>>> >>>>>> This is, to me, *strongly* indicative of a network issue. It's not >>>>>> likely switch-wide as only one member was lost, but I would >>>>>> certainly put my money on a network problem somewhere, some how. >>>>>> >>>>>> Do you use bonding? >>>>>> >>>>>> -- >>>>>> Digimer >>>>>> Papers and Projects: https://alteeve.ca/w/ What if the cure for >>>>>> cancer is trapped in the mind of a person without access to >>>>>> education? 
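Once a bond like the one being set up above is in place, it is worth confirming from the OS side that the mode and failover behave before trusting it under corosync; a quick check, assuming the bond is called bond0 with slaves em1/em2:

  cat /proc/net/bonding/bond0
  # expect "Bonding Mode: fault-tolerance (active-backup)", miimon of 100,
  # both slaves listed with "MII Status: up", and a "Currently Active Slave"

  # force a failover to the standby slave and watch /var/log/messages for the switchover
  echo em2 > /sys/class/net/bond0/bonding/active_slave

If a forced failover is enough to make corosync report a lost token, that points back at the switch port the traffic moves to (MAC learning, STP state on that port) rather than at the host itself.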
>>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>>> >>>> >>>> >>>> -- >>>> Digimer >>>> Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer >>>> is trapped in the mind of a person without access to education? >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >> >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From yvette at dbtgroup.com Thu Jun 12 16:33:17 2014 From: yvette at dbtgroup.com (yvette hirth) Date: Thu, 12 Jun 2014 09:33:17 -0700 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> Message-ID: <5399D64D.8080301@dbtgroup.com> On 06/12/2014 08:32 AM, Schaefer, Micah wrote: > Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and > fenced, then node3 was fenced when node4 came back online. The network > topology is as follows: > switch1: node1, node3 (two connections) > switch2: node2, node4 (two connections) > switch1 switch2 > All on the same subnet > > I set up monitoring at 100 millisecond of the nics in active-backup mode, > and saw no messages about link problems before the fence. > > I see multicast between the servers using tcpdump. > > Any more ideas? spanning-tree scans/rebuilds happen on 10Gb circuits just like they do on 1Gb circuits, and when they happen, traffic on the switches *can* come to a grinding halt, depending upon the switch firmware and the type of spanning-tree scan/rebuild being done. you may want to check your switch logs to see if any spanning-tree rebuilds were being done at the time of the fence. just an idea, and hth yvette hirth From lists at alteeve.ca Thu Jun 12 16:36:12 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 12:36:12 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399D64D.8080301@dbtgroup.com> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> Message-ID: <5399D6FC.8030800@alteeve.ca> On 12/06/14 12:33 PM, yvette hirth wrote: > On 06/12/2014 08:32 AM, Schaefer, Micah wrote: > >> Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and >> fenced, then node3 was fenced when node4 came back online. The network >> topology is as follows: >> switch1: node1, node3 (two connections) >> switch2: node2, node4 (two connections) >> switch1 switch2 >> All on the same subnet >> >> I set up monitoring at 100 millisecond of the nics in active-backup mode, >> and saw no messages about link problems before the fence. >> >> I see multicast between the servers using tcpdump. >> >> Any more ideas? 
> > spanning-tree scans/rebuilds happen on 10Gb circuits just like they do > on 1Gb circuits, and when they happen, traffic on the switches *can* > come to a grinding halt, depending upon the switch firmware and the type > of spanning-tree scan/rebuild being done. > > you may want to check your switch logs to see if any spanning-tree > rebuilds were being done at the time of the fence. > > just an idea, and hth > yvette hirth When I've seen this (I now disable STP entirely), it blocks all traffic so I would expect multiple/all nodes to partition off on their own. Still, worth looking into. :) -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Thu Jun 12 16:48:17 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 12:48:17 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399D6FC.8030800@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> Message-ID: This is all I see for TOTEM from node1 Jun 12 11:07:10 corosync [TOTEM ] A processor failed, forming new configuration. Jun 12 11:07:22 corosync [QUORUM] Members[3]: 1 2 3 Jun 12 11:07:22 corosync [TOTEM ] A processor joined or left the membership" and a new membership was formed. Jun 12 11:07:22 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) Jun 12 11:07:22 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:10:49 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:10:49 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:10:49 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:11:02 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:11:02 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:11:06 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:11:06 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 12 11:11:06 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 12 11:11:06 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:11:06 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:11:35 corosync [TOTEM ] A processor failed, forming new configuration. Jun 12 11:11:47 corosync [QUORUM] Members[3]: 1 2 4 Jun 12 11:11:47 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:11:47 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) Jun 12 11:11:47 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:15:18 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:15:18 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:15:18 corosync [MAIN ] Completed service synchronization, ready to provide service. 
Jun 12 11:15:31 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:15:31 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:15:31 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:15:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:15:33 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 12 11:15:33 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 12 11:15:33 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:15:33 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 12:36:20 corosync [QUORUM] Members[4]: 1 2 3 4 As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning tree changes are happening and all the ports have port-fast enabled for these servers. My switch logging level is very high and I have no messages in relation to the time frames or ports. TOTEM reports that ?A processor joined or left the membership??, but that isn?t enough detail. Also note that I did not have these issues until adding new servers: node3 and node4 to the cluster. Node1 and node2 do not fence each other (unless a real issue is there), and they are on different switches. On 6/12/14, 12:36 PM, "Digimer" wrote: >On 12/06/14 12:33 PM, yvette hirth wrote: >> On 06/12/2014 08:32 AM, Schaefer, Micah wrote: >> >>> Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and >>> fenced, then node3 was fenced when node4 came back online. The network >>> topology is as follows: >>> switch1: node1, node3 (two connections) >>> switch2: node2, node4 (two connections) >>> switch1 switch2 >>> All on the same subnet >>> >>> I set up monitoring at 100 millisecond of the nics in active-backup >>>mode, >>> and saw no messages about link problems before the fence. >>> >>> I see multicast between the servers using tcpdump. >>> >>> Any more ideas? >> >> spanning-tree scans/rebuilds happen on 10Gb circuits just like they do >> on 1Gb circuits, and when they happen, traffic on the switches *can* >> come to a grinding halt, depending upon the switch firmware and the type >> of spanning-tree scan/rebuild being done. >> >> you may want to check your switch logs to see if any spanning-tree >> rebuilds were being done at the time of the fence. >> >> just an idea, and hth >> yvette hirth > >When I've seen this (I now disable STP entirely), it blocks all traffic >so I would expect multiple/all nodes to partition off on their own. >Still, worth looking into. :) > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? 
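For anyone who wants to double-check the spanning-tree angle from the Catalyst side, something like the following IOS commands (the interface name is a placeholder) show whether a topology change lines up with a fence:

  show spanning-tree summary
  show spanning-tree detail | include ieee|occurr|from
  show spanning-tree interface TenGigabitEthernet1/1 detail

The second command prints, per VLAN, how many topology changes have occurred, when the last one happened and from which port; the third confirms the host-facing port really is in portfast mode. If the last topology change is nowhere near the fence timestamps, STP can reasonably be ruled out.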
> >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 17:08:07 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 13:08:07 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> Message-ID: <5399DE77.1030302@alteeve.ca> On 12/06/14 12:48 PM, Schaefer, Micah wrote: > As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning > tree changes are happening and all the ports have port-fast enabled for > these servers. My switch logging level is very high and I have no messages > in relation to the time frames or ports. > > TOTEM reports that ?A processor joined or left the membership??, but that > isn?t enough detail. > > Also note that I did not have these issues until adding new servers: node3 > and node4 to the cluster. Node1 and node2 do not fence each other (unless > a real issue is there), and they are on different switches. Then I can't imagine it being network anymore. Seeing as both node 3 and 4 get fenced, it's likely not hardware either. Are the workloads on 3 and 4 much higher (or are the computers much slower) than 1 and 2? I'm wondering if the nodes are simply not keeping up with corosync traffic. You might try adjusting the corosync token timeout and retransmit counts to see if that reduces the node loses. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Thu Jun 12 17:24:03 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 13:24:03 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399DE77.1030302@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> Message-ID: The servers do not run any tasks other than the tasks in the cluster service group. Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 and 2 are virtual machines with much less resources available. I adjusted the token settings and will watch for any change. On 6/12/14, 1:08 PM, "Digimer" wrote: >On 12/06/14 12:48 PM, Schaefer, Micah wrote: >> As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning >> tree changes are happening and all the ports have port-fast enabled for >> these servers. My switch logging level is very high and I have no >>messages >> in relation to the time frames or ports. >> >> TOTEM reports that ?A processor joined or left the membership??, but >>that >> isn?t enough detail. >> >> Also note that I did not have these issues until adding new servers: >>node3 >> and node4 to the cluster. Node1 and node2 do not fence each other >>(unless >> a real issue is there), and they are on different switches. > >Then I can't imagine it being network anymore. Seeing as both node 3 and >4 get fenced, it's likely not hardware either. Are the workloads on 3 >and 4 much higher (or are the computers much slower) than 1 and 2? 
I'm >wondering if the nodes are simply not keeping up with corosync traffic. >You might try adjusting the corosync token timeout and retransmit counts >to see if that reduces the node loses. > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 17:29:53 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 13:29:53 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> Message-ID: <5399E391.3060701@alteeve.ca> Even if the token changes stop the immediate fencing, don't leave it please. There is something fundamentally wrong that you need to identify/fix. Keep us posted! On 12/06/14 01:24 PM, Schaefer, Micah wrote: > The servers do not run any tasks other than the tasks in the cluster > service group. > > Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 > and 2 are virtual machines with much less resources available. > > I adjusted the token settings and will watch for any change. > > > > > > > > > On 6/12/14, 1:08 PM, "Digimer" wrote: > >> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>> As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning >>> tree changes are happening and all the ports have port-fast enabled for >>> these servers. My switch logging level is very high and I have no >>> messages >>> in relation to the time frames or ports. >>> >>> TOTEM reports that ?A processor joined or left the membership??, but >>> that >>> isn?t enough detail. >>> >>> Also note that I did not have these issues until adding new servers: >>> node3 >>> and node4 to the cluster. Node1 and node2 do not fence each other >>> (unless >>> a real issue is there), and they are on different switches. >> >> Then I can't imagine it being network anymore. Seeing as both node 3 and >> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >> and 4 much higher (or are the computers much slower) than 1 and 2? I'm >> wondering if the nodes are simply not keeping up with corosync traffic. >> You might try adjusting the corosync token timeout and retransmit counts >> to see if that reduces the node loses. >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
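For reference, on a cman cluster the token settings mentioned above are carried in cluster.conf rather than a hand-edited corosync.conf; a sketch with purely illustrative numbers:

  <cluster name="example" config_version="3">
    <!-- raise the totem token timeout (ms) and the retransmit count before a node is declared lost -->
    <totem token="30000" token_retransmits_before_loss_const="10"/>
    <!-- clusternodes, fencedevices, rm sections as before -->
  </cluster>

cman's default token on RHEL 6 is 10000 ms, so a member already has to stay silent for ten seconds (plus the consensus window) before "A processor failed, forming new configuration" is logged and fencing follows; raising it further only hides whatever is stalling the node, which is the point being made above.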
From Micah.Schaefer at jhuapl.edu Thu Jun 12 17:55:35 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 13:55:35 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399E391.3060701@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> Message-ID: I just found that the clock on node1 was off by about a minute and a half compared to the rest of the nodes. I am running ntp, so not sure why the time wasn?t synced up. Wonder if node1 being behind, would think it was not receiving updates from the other nodes? On 6/12/14, 1:29 PM, "Digimer" wrote: >Even if the token changes stop the immediate fencing, don't leave it >please. There is something fundamentally wrong that you need to >identify/fix. > >Keep us posted! > >On 12/06/14 01:24 PM, Schaefer, Micah wrote: >> The servers do not run any tasks other than the tasks in the cluster >> service group. >> >> Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 >> and 2 are virtual machines with much less resources available. >> >> I adjusted the token settings and will watch for any change. >> >> >> >> >> >> >> >> >> On 6/12/14, 1:08 PM, "Digimer" wrote: >> >>> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>>> As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning >>>> tree changes are happening and all the ports have port-fast enabled >>>>for >>>> these servers. My switch logging level is very high and I have no >>>> messages >>>> in relation to the time frames or ports. >>>> >>>> TOTEM reports that ?A processor joined or left the membership??, but >>>> that >>>> isn?t enough detail. >>>> >>>> Also note that I did not have these issues until adding new servers: >>>> node3 >>>> and node4 to the cluster. Node1 and node2 do not fence each other >>>> (unless >>>> a real issue is there), and they are on different switches. >>> >>> Then I can't imagine it being network anymore. Seeing as both node 3 >>>and >>> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >>> and 4 much higher (or are the computers much slower) than 1 and 2? I'm >>> wondering if the nodes are simply not keeping up with corosync traffic. >>> You might try adjusting the corosync token timeout and retransmit >>>counts >>> to see if that reduces the node loses. >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ >>> What if the cure for cancer is trapped in the mind of a person without >>> access to education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? 
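A couple of quick checks usually show why ntpd is letting a clock sit that far out (a 90 second offset is well beyond anything it will quietly slew away):

  ntpq -p              # '*' marks the selected peer; reach 377 means all recent polls answered,
                       # offset is reported in milliseconds
  ntpstat              # one-line summary of sync state
  service ntpd status

ntpd slews small offsets, steps larger ones only after a hold-off, and refuses offsets above its 1000 second panic threshold unless started with -g, so the reach and offset columns in ntpq are the quickest way to see whether it is still syncing at all. On a virtual machine, a clock that keeps falling behind often means the guest is losing ticks or being descheduled on the host, which is worth checking independently of the cluster.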
> >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From Micah.Schaefer at jhuapl.edu Thu Jun 12 19:02:43 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 15:02:43 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> Message-ID: Node4 was fenced again, I was able to get some debug logs (below), a new message : "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL state.? Rest of corosync logs http://pastebin.com/iYFbkbhb Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, flushing membership messages. 
Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33494 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] got commit token Jun 12 14:44:50 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:50 corosync [TOTEM ] Storing new sequence id for ring 6324 Jun 12 14:44:50 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:50 corosync [TOTEM ] got commit token Jun 12 14:44:50 corosync [TOTEM ] entering RECOVERY state. Jun 12 14:44:50 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:50 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:50 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:50 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:50 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:50 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:50 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:50 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:50 corosync [TOTEM ] Did not need to originate any messages in recovery. 
Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:50 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:50 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:50 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:50 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:50 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] got commit token Jun 12 14:44:51 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:51 corosync [TOTEM ] Storing new sequence id for ring 6328 Jun 12 14:44:51 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:51 corosync [TOTEM ] got commit token Jun 12 14:44:51 corosync [TOTEM ] entering RECOVERY state. 
Jun 12 14:44:51 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:51 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:51 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:51 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:51 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:51 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:51 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:51 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:51 corosync [TOTEM ] Did not need to originate any messages in recovery. Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:51 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:51 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:51 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:51 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:51 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, flushing membership messages. 
Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35455 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] got commit token Jun 12 14:44:52 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:52 corosync [TOTEM ] Storing new sequence id for ring 632c Jun 12 14:44:52 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:52 corosync [TOTEM ] got commit token Jun 12 14:44:52 corosync [TOTEM ] entering RECOVERY state. Jun 12 14:44:52 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:52 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:52 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:52 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:52 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:52 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:52 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:52 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:52 corosync [TOTEM ] Did not need to originate any messages in recovery. Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:52 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:52 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:52 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:52 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:52 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36223 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36224 ms, flushing membership messages. 
Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] got commit token Jun 12 14:44:53 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:53 corosync [TOTEM ] Storing new sequence id for ring 6330 Jun 12 14:44:53 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:53 corosync [TOTEM ] got commit token Jun 12 14:44:53 corosync [TOTEM ] entering RECOVERY state. Jun 12 14:44:53 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:53 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:53 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:53 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:53 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:53 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:53 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:53 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:53 corosync [TOTEM ] Did not need to originate any messages in recovery. 
Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:53 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:53 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:53 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:53 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:53 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] got commit token Jun 12 14:44:54 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:54 corosync [TOTEM ] Storing new sequence id for ring 6334 Jun 12 14:44:54 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:54 corosync [TOTEM ] got commit token Jun 12 14:44:54 corosync [TOTEM ] entering RECOVERY state. Jun 12 14:44:54 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:54 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:54 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:54 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:54 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:54 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:54 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:54 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:54 corosync [TOTEM ] Did not need to originate any messages in recovery. 
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:54 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:54 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:54 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:54 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms, flushing membership messages. On 6/12/14, 1:55 PM, "Schaefer, Micah" wrote: >I just found that the clock on node1 was off by about a minute and a half >compared to the rest of the nodes. > >I am running ntp, so not sure why the time wasn?t synced up. Wonder if >node1 being behind, would think it was not receiving updates from the >other nodes? > > > > > > > >On 6/12/14, 1:29 PM, "Digimer" wrote: > >>Even if the token changes stop the immediate fencing, don't leave it >>please. There is something fundamentally wrong that you need to >>identify/fix. >> >>Keep us posted! >> >>On 12/06/14 01:24 PM, Schaefer, Micah wrote: >>> The servers do not run any tasks other than the tasks in the cluster >>> service group. >>> >>> Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 >>> and 2 are virtual machines with much less resources available. >>> >>> I adjusted the token settings and will watch for any change. >>> >>> >>> >>> >>> >>> >>> >>> >>> On 6/12/14, 1:08 PM, "Digimer" wrote: >>> >>>> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>>>> As far as the switch goes, both are Cisco Catalyst 6509-E, no >>>>>spanning >>>>> tree changes are happening and all the ports have port-fast enabled >>>>>for >>>>> these servers. My switch logging level is very high and I have no >>>>> messages >>>>> in relation to the time frames or ports. >>>>> >>>>> TOTEM reports that ?A processor joined or left the membership??, but >>>>> that >>>>> isn?t enough detail. >>>>> >>>>> Also note that I did not have these issues until adding new servers: >>>>> node3 >>>>> and node4 to the cluster. Node1 and node2 do not fence each other >>>>> (unless >>>>> a real issue is there), and they are on different switches. >>>> >>>> Then I can't imagine it being network anymore. 
Seeing as both node 3 >>>>and >>>> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >>>> and 4 much higher (or are the computers much slower) than 1 and 2? I'm >>>> wondering if the nodes are simply not keeping up with corosync >>>>traffic. >>>> You might try adjusting the corosync token timeout and retransmit >>>>counts >>>> to see if that reduces the node loses. >>>> >>>> -- >>>> Digimer >>>> Papers and Projects: https://alteeve.ca/w/ >>>> What if the cure for cancer is trapped in the mind of a person without >>>> access to education? >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >> >> >>-- >>Digimer >>Papers and Projects: https://alteeve.ca/w/ >>What if the cure for cancer is trapped in the mind of a person without >>access to education? >> >>-- >>Linux-cluster mailing list >>Linux-cluster at redhat.com >>https://www.redhat.com/mailman/listinfo/linux-cluster > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 19:06:57 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 15:06:57 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> Message-ID: <5399FA51.2020808@alteeve.ca> Hrm, I'm not really sure that I am able to interpret this without making guesses. I'm cc'ing one of the devs (who I hope will poke the right person if he's not able to help at the moment). Lets see what he has to say. I am curious now, too. :) On 12/06/14 03:02 PM, Schaefer, Micah wrote: > Node4 was fenced again, I was able to get some debug logs (below), a new > message : > > "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL > state.? > > > Rest of corosync logs > > http://pastebin.com/iYFbkbhb > > > Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, > flushing membership messages. 
> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33494 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] got commit token > Jun 12 14:44:50 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:50 corosync [TOTEM ] Storing new sequence id for ring 6324 > Jun 12 14:44:50 corosync [TOTEM ] entering COMMIT state. > Jun 12 14:44:50 corosync [TOTEM ] got commit token > Jun 12 14:44:50 corosync [TOTEM ] entering RECOVERY state. 
> Jun 12 14:44:50 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:50 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:50 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:50 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:50 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 > Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:50 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 > Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:50 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 > Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:50 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 > Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:50 corosync [TOTEM ] Did not need to originate any messages > in recovery. > Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:50 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:50 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:50 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:50 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:50 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, > flushing membership messages. 
> Jun 12 14:44:51 corosync [TOTEM ] got commit token > Jun 12 14:44:51 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:51 corosync [TOTEM ] Storing new sequence id for ring 6328 > Jun 12 14:44:51 corosync [TOTEM ] entering COMMIT state. > Jun 12 14:44:51 corosync [TOTEM ] got commit token > Jun 12 14:44:51 corosync [TOTEM ] entering RECOVERY state. > Jun 12 14:44:51 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:51 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:51 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:51 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:51 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 > Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:51 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 > Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:51 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 > Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:51 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 > Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:51 corosync [TOTEM ] Did not need to originate any messages > in recovery. > Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:51 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:51 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:51 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:51 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:51 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, > flushing membership messages. 
> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35455 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] got commit token > Jun 12 14:44:52 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:52 corosync [TOTEM ] Storing new sequence id for ring 632c > Jun 12 14:44:52 corosync [TOTEM ] entering COMMIT state. > Jun 12 14:44:52 corosync [TOTEM ] got commit token > Jun 12 14:44:52 corosync [TOTEM ] entering RECOVERY state. > Jun 12 14:44:52 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:52 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:52 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:52 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:52 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 > Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:52 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 > Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:52 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 > Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:52 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 > Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:52 corosync [TOTEM ] Did not need to originate any messages > in recovery. 
> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:52 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:52 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:52 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:52 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:52 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36223 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36224 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] got commit token > Jun 12 14:44:53 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:53 corosync [TOTEM ] Storing new sequence id for ring 6330 > Jun 12 14:44:53 corosync [TOTEM ] entering COMMIT state. 
> Jun 12 14:44:53 corosync [TOTEM ] got commit token > Jun 12 14:44:53 corosync [TOTEM ] entering RECOVERY state. > Jun 12 14:44:53 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:53 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:53 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:53 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:53 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 > Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:53 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 > Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:53 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 > Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:53 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 > Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:53 corosync [TOTEM ] Did not need to originate any messages > in recovery. > Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:53 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:53 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:53 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:53 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:53 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, > flushing membership messages. 
> Jun 12 14:44:54 corosync [TOTEM ] got commit token > Jun 12 14:44:54 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:54 corosync [TOTEM ] Storing new sequence id for ring 6334 > Jun 12 14:44:54 corosync [TOTEM ] entering COMMIT state. > Jun 12 14:44:54 corosync [TOTEM ] got commit token > Jun 12 14:44:54 corosync [TOTEM ] entering RECOVERY state. > Jun 12 14:44:54 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:54 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:54 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:54 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:54 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 > Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:54 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 > Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:54 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 > Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:54 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 > Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:54 corosync [TOTEM ] Did not need to originate any messages > in recovery. > Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:54 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:54 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:54 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:54 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:54 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms, > flushing membership messages. 
> > > > > > > > > > On 6/12/14, 1:55 PM, "Schaefer, Micah" wrote: > >> I just found that the clock on node1 was off by about a minute and a half >> compared to the rest of the nodes. >> >> I am running ntp, so not sure why the time wasn?t synced up. Wonder if >> node1 being behind, would think it was not receiving updates from the >> other nodes? >> >> >> >> >> >> >> >> On 6/12/14, 1:29 PM, "Digimer" wrote: >> >>> Even if the token changes stop the immediate fencing, don't leave it >>> please. There is something fundamentally wrong that you need to >>> identify/fix. >>> >>> Keep us posted! >>> >>> On 12/06/14 01:24 PM, Schaefer, Micah wrote: >>>> The servers do not run any tasks other than the tasks in the cluster >>>> service group. >>>> >>>> Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 >>>> and 2 are virtual machines with much less resources available. >>>> >>>> I adjusted the token settings and will watch for any change. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 6/12/14, 1:08 PM, "Digimer" wrote: >>>> >>>>> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>>>>> As far as the switch goes, both are Cisco Catalyst 6509-E, no >>>>>> spanning >>>>>> tree changes are happening and all the ports have port-fast enabled >>>>>> for >>>>>> these servers. My switch logging level is very high and I have no >>>>>> messages >>>>>> in relation to the time frames or ports. >>>>>> >>>>>> TOTEM reports that ?A processor joined or left the membership??, but >>>>>> that >>>>>> isn?t enough detail. >>>>>> >>>>>> Also note that I did not have these issues until adding new servers: >>>>>> node3 >>>>>> and node4 to the cluster. Node1 and node2 do not fence each other >>>>>> (unless >>>>>> a real issue is there), and they are on different switches. >>>>> >>>>> Then I can't imagine it being network anymore. Seeing as both node 3 >>>>> and >>>>> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >>>>> and 4 much higher (or are the computers much slower) than 1 and 2? I'm >>>>> wondering if the nodes are simply not keeping up with corosync >>>>> traffic. >>>>> You might try adjusting the corosync token timeout and retransmit >>>>> counts >>>>> to see if that reduces the node loses. >>>>> >>>>> -- >>>>> Digimer >>>>> Papers and Projects: https://alteeve.ca/w/ >>>>> What if the cure for cancer is trapped in the mind of a person without >>>>> access to education? >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>> >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ >>> What if the cure for cancer is trapped in the mind of a person without >>> access to education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
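For anyone who wants to try the same tuning: on a cman-based cluster such as this one, the corosync token timeout and retransmit count that Digimer suggests adjusting are set through the <totem> element of /etc/cluster/cluster.conf (cman passes the attributes through to corosync). A minimal sketch follows; the values are purely illustrative, and the thread does not record which values Micah actually used:

    <cluster name="example" config_version="3">
      <!-- token: milliseconds before a lost token is declared;
           token_retransmits_before_loss_const: retransmit attempts before
           the membership is re-formed -->
      <totem token="20000" token_retransmits_before_loss_const="10"/>
      <!-- clusternodes, fencedevices and rm sections left unchanged -->
    </cluster>

After bumping config_version, a command along the lines of "cman_tool version -r" can propagate the new configuration to the running cluster, though the exact reload procedure depends on how cluster.conf is distributed.
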
From lzhong at suse.com Fri Jun 13 01:59:29 2014 From: lzhong at suse.com (Lidong Zhong) Date: Fri, 13 Jun 2014 09:59:29 +0800 Subject: [Linux-cluster] [RFC] dlm: keep listening connection alive with sctp mode In-Reply-To: <742486000.20595916.1402576184717.JavaMail.zimbra@redhat.com> References: <1402555378-5220-1-git-send-email-lzhong@suse.com> <742486000.20595916.1402576184717.JavaMail.zimbra@redhat.com> Message-ID: <1402624769.1407.0.camel@suse.site> Hi Bob, > ----- Original Message ----- > (snip) > > Signed-off-by: Lidong Zhong > > Hi Lidong, > > There is a special public mailing list for patches like this > and other cluster-related development. The mailing list is called > cluster-devel. Here is a link where you can subscribe to it: > > https://www.redhat.com/mailman/listinfo/cluster-devel > > I recommend you send your patch to cluster-devel at redhat.com. > OK, thank you very much. > Regards, > > Bob Peterson > Red Hat File Systems > -- Best regards, Lidong From fdinitto at redhat.com Fri Jun 13 04:02:34 2014 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 13 Jun 2014 06:02:34 +0200 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399FA51.2020808@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> Message-ID: <539A77DA.6010407@redhat.com> On 06/12/2014 09:06 PM, Digimer wrote: > Hrm, I'm not really sure that I am able to interpret this without making > guesses. I'm cc'ing one of the devs (who I hope will poke the right > person if he's not able to help at the moment). Lets see what he has to > say. > > I am curious now, too. :) Chrissie/Honza: can you please take a look at this thread and see if there is a latent bug? I find it odd that the Process pause detected is kicking in so many times without a fencing action. Fabio > > On 12/06/14 03:02 PM, Schaefer, Micah wrote: >> Node4 was fenced again, I was able to get some debug logs (below), a new >> message : >> >> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL >> state.? >> >> >> Rest of corosync logs >> >> http://pastebin.com/iYFbkbhb >> >> >> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. 
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33494 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] got commit token >> Jun 12 14:44:50 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:50 corosync [TOTEM ] Storing new sequence id for ring 6324 >> Jun 12 14:44:50 corosync [TOTEM ] entering COMMIT state. 
>> Jun 12 14:44:50 corosync [TOTEM ] got commit token >> Jun 12 14:44:50 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:50 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:50 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:50 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:50 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:50 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep >> 10.70.100.101 >> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:50 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep >> 10.70.100.101 >> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:50 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep >> 10.70.100.101 >> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:50 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep >> 10.70.100.101 >> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:50 corosync [TOTEM ] Did not need to originate any messages >> in recovery. >> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:50 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:50 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:50 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:50 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:50 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, >> flushing membership messages. 
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] got commit token >> Jun 12 14:44:51 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:51 corosync [TOTEM ] Storing new sequence id for ring 6328 >> Jun 12 14:44:51 corosync [TOTEM ] entering COMMIT state. >> Jun 12 14:44:51 corosync [TOTEM ] got commit token >> Jun 12 14:44:51 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:51 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:51 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:51 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:51 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:51 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep >> 10.70.100.101 >> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:51 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep >> 10.70.100.101 >> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:51 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep >> 10.70.100.101 >> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:51 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep >> 10.70.100.101 >> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:51 corosync [TOTEM ] Did not need to originate any messages >> in recovery. >> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:51 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:51 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:51 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:51 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:51 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, >> flushing membership messages. 
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35455 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] got commit token >> Jun 12 14:44:52 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:52 corosync [TOTEM ] Storing new sequence id for ring 632c >> Jun 12 14:44:52 corosync [TOTEM ] entering COMMIT state. >> Jun 12 14:44:52 corosync [TOTEM ] got commit token >> Jun 12 14:44:52 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:52 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:52 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:52 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:52 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:52 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep >> 10.70.100.101 >> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:52 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep >> 10.70.100.101 >> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:52 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep >> 10.70.100.101 >> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:52 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep >> 10.70.100.101 >> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:52 corosync [TOTEM ] Did not need to originate any messages >> in recovery. 
>> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:52 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:52 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:52 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:52 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:52 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36223 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36224 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, >> flushing membership messages. 
>> Jun 12 14:44:53 corosync [TOTEM ] got commit token >> Jun 12 14:44:53 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:53 corosync [TOTEM ] Storing new sequence id for ring 6330 >> Jun 12 14:44:53 corosync [TOTEM ] entering COMMIT state. >> Jun 12 14:44:53 corosync [TOTEM ] got commit token >> Jun 12 14:44:53 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:53 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:53 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:53 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:53 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:53 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep >> 10.70.100.101 >> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:53 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep >> 10.70.100.101 >> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:53 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep >> 10.70.100.101 >> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:53 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep >> 10.70.100.101 >> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:53 corosync [TOTEM ] Did not need to originate any messages >> in recovery. >> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:53 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:53 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:53 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:53 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:53 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, >> flushing membership messages. 
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] got commit token >> Jun 12 14:44:54 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:54 corosync [TOTEM ] Storing new sequence id for ring 6334 >> Jun 12 14:44:54 corosync [TOTEM ] entering COMMIT state. >> Jun 12 14:44:54 corosync [TOTEM ] got commit token >> Jun 12 14:44:54 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:54 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:54 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:54 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:54 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:54 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep >> 10.70.100.101 >> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:54 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep >> 10.70.100.101 >> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:54 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep >> 10.70.100.101 >> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:54 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep >> 10.70.100.101 >> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:54 corosync [TOTEM ] Did not need to originate any messages >> in recovery. >> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:54 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:54 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:54 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:54 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:54 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, >> flushing membership messages. 
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms, >> flushing membership messages. >> >> >> >> >> >> >> >> >> >> On 6/12/14, 1:55 PM, "Schaefer, Micah" wrote: >> >>> I just found that the clock on node1 was off by about a minute and a >>> half >>> compared to the rest of the nodes. >>> >>> I am running ntp, so not sure why the time wasn?t synced up. Wonder if >>> node1 being behind, would think it was not receiving updates from the >>> other nodes? >>> >>> >>> >>> >>> >>> >>> >>> On 6/12/14, 1:29 PM, "Digimer" wrote: >>> >>>> Even if the token changes stop the immediate fencing, don't leave it >>>> please. There is something fundamentally wrong that you need to >>>> identify/fix. >>>> >>>> Keep us posted! >>>> >>>> On 12/06/14 01:24 PM, Schaefer, Micah wrote: >>>>> The servers do not run any tasks other than the tasks in the cluster >>>>> service group. >>>>> >>>>> Nodes 3 and 4 are physical servers with a lot of horsepower and >>>>> nodes 1 >>>>> and 2 are virtual machines with much less resources available. >>>>> >>>>> I adjusted the token settings and will watch for any change. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 6/12/14, 1:08 PM, "Digimer" wrote: >>>>> >>>>>> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>>>>>> As far as the switch goes, both are Cisco Catalyst 6509-E, no >>>>>>> spanning >>>>>>> tree changes are happening and all the ports have port-fast enabled >>>>>>> for >>>>>>> these servers. My switch logging level is very high and I have no >>>>>>> messages >>>>>>> in relation to the time frames or ports. >>>>>>> >>>>>>> TOTEM reports that ?A processor joined or left the membership??, but >>>>>>> that >>>>>>> isn?t enough detail. >>>>>>> >>>>>>> Also note that I did not have these issues until adding new servers: >>>>>>> node3 >>>>>>> and node4 to the cluster. Node1 and node2 do not fence each other >>>>>>> (unless >>>>>>> a real issue is there), and they are on different switches. >>>>>> >>>>>> Then I can't imagine it being network anymore. Seeing as both node 3 >>>>>> and >>>>>> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >>>>>> and 4 much higher (or are the computers much slower) than 1 and 2? >>>>>> I'm >>>>>> wondering if the nodes are simply not keeping up with corosync >>>>>> traffic. >>>>>> You might try adjusting the corosync token timeout and retransmit >>>>>> counts >>>>>> to see if that reduces the node loses. >>>>>> >>>>>> -- >>>>>> Digimer >>>>>> Papers and Projects: https://alteeve.ca/w/ >>>>>> What if the cure for cancer is trapped in the mind of a person >>>>>> without >>>>>> access to education? >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>>> >>>> >>>> >>>> -- >>>> Digimer >>>> Papers and Projects: https://alteeve.ca/w/ >>>> What if the cure for cancer is trapped in the mind of a person without >>>> access to education? 
>>>> 
>>>> -- 
>>>> Linux-cluster mailing list 
>>>> Linux-cluster at redhat.com 
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster 
>>> 
>>> 
>>> -- 
>>> Linux-cluster mailing list 
>>> Linux-cluster at redhat.com 
>>> https://www.redhat.com/mailman/listinfo/linux-cluster 
>> 
>> 
> 
> 

From kienlt at mbbank.com.vn Mon Jun 16 11:43:33 2014
From: kienlt at mbbank.com.vn (Le Trung Kien)
Date: Mon, 16 Jun 2014 11:43:33 +0000
Subject: [Linux-cluster] Two-node cluster GFS2 confusing
Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP>

Hello everyone,

I'm new to Linux clustering. I have built a two-node cluster (without
qdisk) that includes:

Redhat 6.4
cman
pacemaker
gfs2

My cluster can fail over (back and forth) between the two nodes for these 3
resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on
/mnt/gfs2_storage), WebSite (apache service)

My problem occurs when I stop/start the nodes in the following order: (when
both nodes started)

1. Stop: node1 (shutdown) -> all resources fail over to node2 -> all resources
still working on node2
2. Stop: node2 (stop services: pacemaker then cman) -> all resources stop (of
course)
3. Start: node1 (start services: cman then pacemaker) -> only ClusterIP
started, WebFS failed, WebSite not started

Status:

Last updated: Mon Jun 16 18:34:56 2014
Last change: Mon Jun 16 14:24:54 2014 via cibadmin on server1
Stack: cman
Current DC: server1 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 1 expected votes
4 Resources configured.

Online: [ server1 ]
OFFLINE: [ server2 ]

ClusterIP (ocf::heartbeat:IPaddr2): Started server1
WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED

Failed actions:
WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): unknown error

Here is my /etc/cluster/cluster.conf

Here is my "crm configure show" output:

node server1
node server2
primitive ClusterIP IPaddr2 \
    params ip=192.168.117.130 cidr_netmask=32 \
    op monitor interval=10s
primitive WebFS Filesystem \
    params device="/dev/sdc" directory="/mnt/gfs2_datastore" fstype=gfs2 \
    meta target-role=Started
primitive WebSite1 apache \
    params configfile="/mnt/nfs_datastore/httpd/conf/httpd.conf" statusurl="http://localhost/server-status" \
    op monitor interval=40s \
    meta target-role=Stopped
primitive WebSite2 apache \
    params configfile="/mnt/gfs2_datastore/httpd/conf/httpd.conf" statusurl="http://localhost/server-status" \
    op monitor interval=40s \
    meta target-role=Started
colocation webfs-with-ip inf: WebFS ClusterIP
colocation website-with-webfs inf: WebSite2 WebFS
order webfs-after-clusterip inf: ClusterIP WebFS
order website-after-webfs inf: WebFS WebSite2
property cib-bootstrap-options: \
    dc-version=1.1.8-7.el6-394e906 \
    cluster-infrastructure=cman \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    expected-quorum-votes=1 \
    last-lrm-refresh=1402374391
rsc_defaults rsc-options: \
    resource-stickiness=100
rsc_defaults rsc_defaults-options: \
    resource-stickiness=100
op_defaults op_defaults-options: \
    migration-threshold=1

I don't have any clues to trace down this case; I just guess the problem
comes from the locking file system. Please suggest some advice.

Thank you.

Kien Le.
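A side note on the configuration above, not necessarily the cause of this particular failure: GFS2 relies on the DLM, and the DLM blocks lock recovery until a failed node has been fenced, so running a GFS2 mount with stonith-enabled=false and no fence devices can leave Filesystem stop and start operations hanging in much the way the Timed Out WebFS_stop_0 action looks here. A rough sketch of what adding fencing through the crm shell might look like; the fence_ipmilan agent and all parameter values below are placeholders rather than details taken from this cluster:

    # one stonith resource per node; keep each off the node it is meant to fence
    primitive st-server1 stonith:fence_ipmilan \
        params pcmk_host_list="server1" ipaddr="10.0.0.1" login="admin" passwd="secret" \
        op monitor interval=60s
    primitive st-server2 stonith:fence_ipmilan \
        params pcmk_host_list="server2" ipaddr="10.0.0.2" login="admin" passwd="secret" \
        op monitor interval=60s
    location l-st-server1 st-server1 -inf: server1
    location l-st-server2 st-server2 -inf: server2
    property stonith-enabled=true

On RHEL 6, where cman runs underneath pacemaker, the clusterlabs "Clusters from Scratch" guide (if I recall it correctly) also has cluster.conf delegate fencing to pacemaker via the fence_pcmk agent, so that cman-initiated fencing and pacemaker's stonith devices stay in agreement.
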
From rpeterso at redhat.com Mon Jun 16 12:20:44 2014
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 16 Jun 2014 08:20:44 -0400 (EDT)
Subject: [Linux-cluster] Two-node cluster GFS2 confusing
In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP>
References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP>
Message-ID: <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com>

----- Original Message -----
> Hello everyone,
>
> I'm a new man on linux cluster. I have built a two-node cluster (without
> qdisk), includes:
>
> Redhat 6.4
> cman
> pacemaker
> gfs2
>
> My cluster could fail-over (back and forth) between two nodes for these 3
> resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on
> /mnt/gfs2_storage), WebSite ( apache service)
>
> My problem occurs when I stop/start node in the following order: (when both
> nodes started)
>
> 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all resources
> still working on node2
> 2. Stop: node2 (stop service: pacemaker then cman) -> all resources stop (of
> course)
> 3. Start: node1 (start service: cman then pacemaker) -> only ClusterIP
> started, WebFS failed, WebSite not started
(snip)
> I don't have any glues to trace down this case, I just guess this problem
> comes from locking file system, please suggest me some advices.

Hi,

Some thoughts on your problem:
(1) If this is truly Redhat 6.4, and you have a support contract with Red
Hat, you should call the support number with Global Support Services and
file a ticket. They'll be able to help.
(2) You didn't explain what your symptoms were. In what way does it fail?
(3) Why do you suspect "this problem comes from locking file system"?
Do you mean from GFS2? What is the symptom that causes you to think it
might be the file system? Were there messages on the console or dmesg to
indicate a kernel issue?
(4) I thought RHEL6.4 has cman/rgmanager, not pacemaker.

Regards,

Bob Peterson
Red Hat File Systems

From kienlt at mbbank.com.vn Mon Jun 16 12:50:52 2014
From: kienlt at mbbank.com.vn (Le Trung Kien)
Date: Mon, 16 Jun 2014 12:50:52 +0000
Subject: [Linux-cluster] Two-node cluster GFS2 confusing
In-Reply-To: <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com>
References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP>
 <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com>
Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP>

Hi,

I don't have an active support contract with Redhat right now, and I am
trying to work through the Redhat cluster stack on my own to understand the
solution first.

I followed the step-by-step guide from clusterlabs.org and configured the
cluster using CMAN and Pacemaker (of course there is rgmanager in 6.4, but I
don't know how to use it yet because I'm still in the middle of figuring
things out from the start).

I think the problem comes from GFS2 because, with NFS (no locking), my
cluster has no problem at all. The problem only appears when I configure a
shared GFS2 file system for my cluster.

Thank you for your concerns :)

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson
Sent: Monday, June 16, 2014 7:21 PM
To: linux clustering
Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing

----- Original Message -----
> Hello everyone,
>
> I'm a new man on linux cluster. 
I have built a two-node cluster > (without qdisk), includes: > > Redhat 6.4 > cman > pacemaker > gfs2 > > My cluster could fail-over (back and forth) between two nodes for > these 3 > resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on > /mnt/gfs2_storage), WebSite ( apache service) > > My problem occurs when I stop/start node in the following order: (when > both nodes started) > > 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all > resources still working on node2 2. Stop: node2 (stop service: > pacemaker then cman) -> all resources stop (of > course) > 3. Start: node1 (start service: cman then pacemaker) -> only ClusterIP > started, WebFS failed, WebSite not started (snip) > I don't have any glues to trace down this case, I just guess this > problem comes from locking file system, please suggest me some advices. Hi, Some thoughts on your problem: (1) If this is truly Redhat 6.4, and you have a support contract with Red Hat, you should call the support number with Global Support Services and file a ticket. They'll be able to help. (2) You didn't explain what your symptoms were? In what way does it fail? (3) Why do you suspect "this problem comes from locking file system"? Do you mean from GFS2? What is the symptom that causes you to think it might be the file system? Were there messages on the console or dmesg to indicate a kernel issue? (4) I thought RHEL6.4 has cman/rgmanager, not pacemaker. Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From rpeterso at redhat.com Mon Jun 16 12:56:14 2014 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 16 Jun 2014 08:56:14 -0400 (EDT) Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> Message-ID: <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Hi, > > I don't have an active support contract with Redhat right now. And try to > work around with Redhat cluster to understand the solution first. > > I followed the steps guide from clusterlabs.org, configure cluster using: > CMAN, Pacemaker (of course there is rgmanager in 6.4 but I don't know how to > use it right now because I'm in the middle of messing thing from start) > > I think the problem was from GFS2 because, with a NFS (no locking) has no > problem with my cluster at all. This problem just come when I configure a > shared GFS2 for my cluster. > > Thank you for your concerns :) Hi, Do you see any kernel messages in dmesg or on the console, after the failure? 
Regards, Bob Peterson Red Hat File Systems From kienlt at mbbank.com.vn Tue Jun 17 04:07:21 2014 From: kienlt at mbbank.com.vn (Le Trung Kien) Date: Tue, 17 Jun 2014 04:07:21 +0000 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> Hi, here is my dmesg after failed: GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" dlm: Using TCP for communications GFS2: fsid=mycluster:web.0: Joined cluster. Now mounting FS... GFS2: fsid=mycluster:web.0: jid=0, already locked for use GFS2: fsid=mycluster:web.0: jid=0: Looking at journal... GFS2: fsid=mycluster:web.0: jid=0: Acquiring the transaction lock... GFS2: fsid=mycluster:web.0: jid=0: Replaying journal... GFS2: fsid=mycluster:web.0: jid=0: Replayed 1 of 1 blocks GFS2: fsid=mycluster:web.0: jid=0: Found 0 revoke tags GFS2: fsid=mycluster:web.0: jid=0: Journal replayed in 1s GFS2: fsid=mycluster:web.0: jid=0: Done GFS2: fsid=mycluster:web.0: jid=1: Trying to acquire journal lock... GFS2: fsid=mycluster:web.0: jid=1: Looking at journal... GFS2: fsid=mycluster:web.0: jid=1: Done hrtimer: interrupt took 4149483 ns dlm: closing connection to node 2 dlm: closing connection to node 1 GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" dlm: Using TCP for communications -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson Sent: Monday, June 16, 2014 7:56 PM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing ----- Original Message ----- > Hi, > > I don't have an active support contract with Redhat right now. And try > to work around with Redhat cluster to understand the solution first. > > I followed the steps guide from clusterlabs.org, configure cluster using: > CMAN, Pacemaker (of course there is rgmanager in 6.4 but I don't know > how to use it right now because I'm in the middle of messing thing > from start) > > I think the problem was from GFS2 because, with a NFS (no locking) has > no problem with my cluster at all. This problem just come when I > configure a shared GFS2 for my cluster. > > Thank you for your concerns :) Hi, Do you see any kernel messages in dmesg or on the console, after the failure? 
Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From rpeterso at redhat.com Tue Jun 17 12:08:54 2014 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 17 Jun 2014 08:08:54 -0400 (EDT) Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> Message-ID: <1432751001.23019030.1403006934583.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Hi, here is my dmesg after failed: > > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > GFS2: fsid=mycluster:web.0: Joined cluster. Now mounting FS... > GFS2: fsid=mycluster:web.0: jid=0, already locked for use > GFS2: fsid=mycluster:web.0: jid=0: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=0: Acquiring the transaction lock... > GFS2: fsid=mycluster:web.0: jid=0: Replaying journal... > GFS2: fsid=mycluster:web.0: jid=0: Replayed 1 of 1 blocks > GFS2: fsid=mycluster:web.0: jid=0: Found 0 revoke tags > GFS2: fsid=mycluster:web.0: jid=0: Journal replayed in 1s > GFS2: fsid=mycluster:web.0: jid=0: Done > GFS2: fsid=mycluster:web.0: jid=1: Trying to acquire journal lock... > GFS2: fsid=mycluster:web.0: jid=1: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=1: Done > hrtimer: interrupt took 4149483 ns > dlm: closing connection to node 2 > dlm: closing connection to node 1 > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > Hi, If there was a GFS2 problem, you would ordinarily see errors there, and these messages are all pretty normal. Regards, Bob Peterson Red Hat File Systems From ccaulfie at redhat.com Tue Jun 17 12:41:07 2014 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 17 Jun 2014 13:41:07 +0100 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399FA51.2020808@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> Message-ID: <53A03763.4080905@redhat.com> On 12/06/14 20:06, Digimer wrote: > Hrm, I'm not really sure that I am able to interpret this without making > guesses. I'm cc'ing one of the devs (who I hope will poke the right > person if he's not able to help at the moment). Lets see what he has to > say. > > I am curious now, too. :) > > On 12/06/14 03:02 PM, Schaefer, Micah wrote: >> Node4 was fenced again, I was able to get some debug logs (below), a new >> message : >> >> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL >> state.? >> >> >> Rest of corosync logs >> >> http://pastebin.com/iYFbkbhb >> >> >> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. 
>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >> flushing membership messages. I'm concerned that the pause messages are repeating like that, it looks like it might be a fixed bug. What version of corosync do you have? Chrissie From Micah.Schaefer at jhuapl.edu Tue Jun 17 14:27:29 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Tue, 17 Jun 2014 10:27:29 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <53A03763.4080905@redhat.com> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> <53A03763.4080905@redhat.com> Message-ID: I am running Red Hat 6.4 with the HA/ load balancing packages from the install DVD. -bash-4.1$ cat /etc/redhat-release Red Hat Enterprise Linux Server release 6.4 (Santiago) -bash-4.1$ corosync -v Corosync Cluster Engine, version '1.4.1' Copyright (c) 2006-2009 Red Hat, Inc. On 6/17/14, 8:41 AM, "Christine Caulfield" wrote: >On 12/06/14 20:06, Digimer wrote: >> Hrm, I'm not really sure that I am able to interpret this without making >> guesses. I'm cc'ing one of the devs (who I hope will poke the right >> person if he's not able to help at the moment). Lets see what he has to >> say. >> >> I am curious now, too. 
:) >> >> On 12/06/14 03:02 PM, Schaefer, Micah wrote: >>> Node4 was fenced again, I was able to get some debug logs (below), a >>>new >>> message : >>> >>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the >>>OPERATIONAL >>> state.? >>> >>> >>> Rest of corosync logs >>> >>> http://pastebin.com/iYFbkbhb >>> >>> >>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >>> membership and a new membership was formed. >>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>> flushing membership messages. >>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >>> flushing membership messages. >>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >>> flushing membership messages. >>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >>> flushing membership messages. > > >I'm concerned that the pause messages are repeating like that, it looks >like it might be a fixed bug. What version of corosync do you have? 
> >Chrissie > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From kienlt at mbbank.com.vn Tue Jun 17 15:48:40 2014 From: kienlt at mbbank.com.vn (Le Trung Kien) Date: Tue, 17 Jun 2014 15:48:40 +0000 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <1432751001.23019030.1403006934583.JavaMail.zimbra@redhat.com> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> <1432751001.23019030.1403006934583.JavaMail.zimbra@redhat.com> Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F1E@HN-MBX-02.BANK.MB.GROUP> I reproduced my cluster problem again and got this error from /var/log/message So, I think the reason is fencing wrongly configured. And I may have to focus on Configure Fencing Device. Here is my log: Jun 17 22:32:36 server2 fenced[6559]: fenced 3.0.12.1 started Jun 17 22:32:36 server2 dlm_controld[6573]: dlm_controld 3.0.12.1 started Jun 17 22:32:37 server2 gfs_controld[6634]: gfs_controld 3.0.12.1 started Jun 17 22:33:29 server2 fenced[6559]: fencing node server1 Jun 17 22:33:29 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:29 server2 fenced[6559]: fence server1 failed Jun 17 22:33:32 server2 fenced[6559]: fencing node server1 Jun 17 22:33:32 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:32 server2 fenced[6559]: fence server1 failed Jun 17 22:33:35 server2 fenced[6559]: fencing node server1 Jun 17 22:33:35 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:35 server2 fenced[6559]: fence server1 failed -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson Sent: Tuesday, June 17, 2014 7:09 PM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing ----- Original Message ----- > Hi, here is my dmesg after failed: > > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > GFS2: fsid=mycluster:web.0: Joined cluster. Now mounting FS... > GFS2: fsid=mycluster:web.0: jid=0, already locked for use > GFS2: fsid=mycluster:web.0: jid=0: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=0: Acquiring the transaction lock... > GFS2: fsid=mycluster:web.0: jid=0: Replaying journal... > GFS2: fsid=mycluster:web.0: jid=0: Replayed 1 of 1 blocks > GFS2: fsid=mycluster:web.0: jid=0: Found 0 revoke tags > GFS2: fsid=mycluster:web.0: jid=0: Journal replayed in 1s > GFS2: fsid=mycluster:web.0: jid=0: Done > GFS2: fsid=mycluster:web.0: jid=1: Trying to acquire journal lock... > GFS2: fsid=mycluster:web.0: jid=1: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=1: Done > hrtimer: interrupt took 4149483 ns > dlm: closing connection to node 2 > dlm: closing connection to node 1 > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > Hi, If there was a GFS2 problem, you would ordinarily see errors there, and these messages are all pretty normal. 
Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From kienlt at mbbank.com.vn Tue Jun 17 16:16:50 2014 From: kienlt at mbbank.com.vn (Le Trung Kien) Date: Tue, 17 Jun 2014 16:16:50 +0000 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F1E@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> <1432751001.23019030.1403006934583.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F1E@HN-MBX-02.BANK.MB.GROUP> Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F33@HN-MBX-02.BANK.MB.GROUP> Sorry, I reformat my log to easy for reading: Jun 17 22:32:36 server2 fenced[6559]: fenced 3.0.12.1 started Jun 17 22:32:36 server2 dlm_controld[6573]: dlm_controld 3.0.12.1 started Jun 17 22:32:37 server2 gfs_controld[6634]: gfs_controld 3.0.12.1 started Jun 17 22:33:29 server2 fenced[6559]: fencing node server1 Jun 17 22:33:29 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:29 server2 fenced[6559]: fence server1 failed Jun 17 22:33:32 server2 fenced[6559]: fencing node server1 Jun 17 22:33:32 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:32 server2 fenced[6559]: fence server1 failed Jun 17 22:33:35 server2 fenced[6559]: fencing node server1 Jun 17 22:33:35 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:35 server2 fenced[6559]: fence server1 failed -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Le Trung Kien Sent: Tuesday, June 17, 2014 10:49 PM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing I reproduced my cluster problem again and got this error from /var/log/message So, I think the reason is fencing wrongly configured. And I may have to focus on Configure Fencing Device. 
Here is my log: Jun 17 22:32:36 server2 fenced[6559]: fenced 3.0.12.1 started Jun 17 22:32:36 server2 dlm_controld[6573]: dlm_controld 3.0.12.1 started Jun 17 22:32:37 server2 gfs_controld[6634]: gfs_controld 3.0.12.1 started Jun 17 22:33:29 server2 fenced[6559]: fencing node server1 Jun 17 22:33:29 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:29 server2 fenced[6559]: fence server1 failed Jun 17 22:33:32 server2 fenced[6559]: fencing node server1 Jun 17 22:33:32 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:32 server2 fenced[6559]: fence server1 failed Jun 17 22:33:35 server2 fenced[6559]: fencing node server1 Jun 17 22:33:35 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:35 server2 fenced[6559]: fence server1 failed -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson Sent: Tuesday, June 17, 2014 7:09 PM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing ----- Original Message ----- > Hi, here is my dmesg after failed: > > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > GFS2: fsid=mycluster:web.0: Joined cluster. Now mounting FS... > GFS2: fsid=mycluster:web.0: jid=0, already locked for use > GFS2: fsid=mycluster:web.0: jid=0: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=0: Acquiring the transaction lock... > GFS2: fsid=mycluster:web.0: jid=0: Replaying journal... > GFS2: fsid=mycluster:web.0: jid=0: Replayed 1 of 1 blocks > GFS2: fsid=mycluster:web.0: jid=0: Found 0 revoke tags > GFS2: fsid=mycluster:web.0: jid=0: Journal replayed in 1s > GFS2: fsid=mycluster:web.0: jid=0: Done > GFS2: fsid=mycluster:web.0: jid=1: Trying to acquire journal lock... > GFS2: fsid=mycluster:web.0: jid=1: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=1: Done > hrtimer: interrupt took 4149483 ns > dlm: closing connection to node 2 > dlm: closing connection to node 1 > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > Hi, If there was a GFS2 problem, you would ordinarily see errors there, and these messages are all pretty normal. Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From white.heron at yahoo.com Wed Jun 18 03:56:09 2014 From: white.heron at yahoo.com (YB Tan Sri Dato Sri' Adli a.k.a Dell) Date: Tue, 17 Jun 2014 20:56:09 -0700 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: <5399D46A.6080205@alteeve.ca> Message-ID: <1403063769.79975.YahooMailIosMobile@web163503.mail.gq1.yahoo.com> The clustering will only works if you run same operating systems on top of same hardware platform ppc, intel!

Sent from Yahoo Mail for iPhone
-------------- next part -------------- An HTML attachment was scrubbed... URL: From white.heron at yahoo.com Wed Jun 18 04:08:54 2014 From: white.heron at yahoo.com (YB Tan Sri Dato Sri' Adli a.k.a Dell) Date: Tue, 17 Jun 2014 21:08:54 -0700 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F33@HN-MBX-02.BANK.MB.GROUP> Message-ID: <1403064534.18434.YahooMailIosMobile@web163505.mail.gq1.yahoo.com> The clustering will only works if you enable ssl between two nodes and allow root access persistent connection.

Sent from Yahoo Mail for iPhone
-------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jun 18 04:18:16 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 18 Jun 2014 00:18:16 -0400 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> Message-ID: <53A11308.2040504@alteeve.ca> On 16/06/14 07:43 AM, Le Trung Kien wrote: > Hello everyone, > > I'm a new man on linux cluster. I have built a two-node cluster (without qdisk), includes: > > Redhat 6.4 > cman > pacemaker > gfs2 > > My cluster could fail-over (back and forth) between two nodes for these 3 resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on /mnt/gfs2_storage), WebSite ( apache service) > > My problem occurs when I stop/start node in the following order: (when both nodes started) > > 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all resources still working on node2 > 2. Stop: node2 (stop service: pacemaker then cman) -> all resources stop (of course) > 3. Start: node1 (start service: cman then pacemaker) -> only ClusterIP started, WebFS failed, WebSite not started > > Status: > > Last updated: Mon Jun 16 18:34:56 2014 > Last change: Mon Jun 16 14:24:54 2014 via cibadmin on server1 > Stack: cman > Current DC: server1 - partition WITHOUT quorum > Version: 1.1.8-7.el6-394e906 > 2 Nodes configured, 1 expected votes > 4 Resources configured. > > Online: [ server1 ] > OFFLINE: [ server2 ] > > ClusterIP (ocf::heartbeat:IPaddr2): Started server1 > WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED > > Failed actions: > WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): unknown error > > Here is my /etc/cluster/cluster.conf > > > > > > > > > > > > > > > > > > > > > > > > > Here is my: crm configure show > > stonith-enabled=false \ Well this is a problem. When cman detects a failure (well corosync, but cman is told), it initiates a fence request. The fence daemon informs DLM with blocks. Then fenced calls the configured 'fence_pcmk', which just passes the request up to pacemaker. Without stonith configured in fencing, pacemaker will fail to fence, of course. Thus, DLM sits blocked, so DRBD (and clustered LVM) hang, by design. If configure proper fencing in pacemaker (and test it to make sure it works), then pacemaker *would* succeed in fencing and return a success to fence_pcmk. Then fenced is told that the fence succeeds, DLM cleans up lost locks and returns to normal operation. So please configure and test real stonith in pacemaker and see if your problem is resolved. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From white.heron at yahoo.com Wed Jun 18 18:20:05 2014 From: white.heron at yahoo.com (YB Tan Sri Dato Sri' Adli a.k.a Dell) Date: Wed, 18 Jun 2014 11:20:05 -0700 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: Message-ID: <1403115605.19689.YahooMailIosMobile@web163502.mail.gq1.yahoo.com> Hi,

The linux clustering will be only working perfectly if you run the linux operating systems between nodes. Allow root ssh persistent connection on top of same specifications hardware platform.

To perform test or proof of concept, you may allow to run and configure between two nodes.

The databases for clustering will be configure right after the two nodes linux operating systems run with persistent root access ssh connection.

Sent from Yahoo Mail for iPhone
-------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jun 18 18:32:39 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 18 Jun 2014 14:32:39 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <1403115605.19689.YahooMailIosMobile@web163502.mail.gq1.yahoo.com> References: <1403115605.19689.YahooMailIosMobile@web163502.mail.gq1.yahoo.com> Message-ID: <53A1DB47.5040101@alteeve.ca> On 18/06/14 02:20 PM, YB Tan Sri Dato Sri' Adli a.k.a Dell wrote: > Hi, > > The linux clustering will be only working perfectly if you run the linux > operating systems between nodes. Allow root ssh persistent connection on > top of same specifications hardware platform. > > To perform test or proof of concept, you may allow to run and configure > between two nodes. > > The databases for clustering will be configure right after the two nodes > linux operating systems run with persistent root access ssh connection. > > Sent from Yahoo Mail for iPhone You have said this a couple times now, and I am not sure why. There is no need to have persistent, root access SSH between nodes. It's helpful in some cases, sure, but certainly not required. Corosync, which provides cluster membership and communication, handles internode traffic itself, on it's own TCP port (using multicast by default or unicast if configured). There is also nothing restricting you to two nodes. It's a good configuration, and one I use personally, but there are many 3+ node clusters out there. As for a database cluster, that would depend entirely on which database you are using and whether you are using tools specific for that DB or a more generic HA stack like corosync + pacemaker. Cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From kienlt at mbbank.com.vn Thu Jun 19 01:51:12 2014 From: kienlt at mbbank.com.vn (Le Trung Kien) Date: Thu, 19 Jun 2014 01:51:12 +0000 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <53A11308.2040504@alteeve.ca> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <53A11308.2040504@alteeve.ca> Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0FA5CB@HN-MBX-02.BANK.MB.GROUP> Hi, As Digimer suggested, I change property stonith-enabled=true But now I don't know which fencing method I should use, because my two Redhat nodes running on VMWare Workstation, OpenFiler as SCSI shared LUN storage. I attempted to use "fence_scsi", but no luck, I got this error: Jun 19 08:35:58 server1 stonith_admin[3837]: notice: crm_log_args: Invoked: stonith_admin --reboot server2 --tolerance 5s Jun 19 08:36:08 server1 root: fence_pcmk[3836]: Call to fence server2 (reset) failed with rc=255 Here is my fencing configuration: And the log: /tmp/fence_scsi.log show: Jun 18 19:49:40 fence_scsi: [error] no devices found I will try "vmware_soap" to see if it works. Kien Le -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer Sent: Wednesday, June 18, 2014 11:18 AM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing On 16/06/14 07:43 AM, Le Trung Kien wrote: > Hello everyone, > > I'm a new man on linux cluster. 
I have built a two-node cluster (without qdisk), includes: > > Redhat 6.4 > cman > pacemaker > gfs2 > > My cluster could fail-over (back and forth) between two nodes for > these 3 resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on > /mnt/gfs2_storage), WebSite ( apache service) > > My problem occurs when I stop/start node in the following order: (when > both nodes started) > > 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all > resources still working on node2 2. Stop: node2 (stop service: > pacemaker then cman) -> all resources stop (of course) 3. Start: node1 > (start service: cman then pacemaker) -> only ClusterIP started, WebFS > failed, WebSite not started > > Status: > > Last updated: Mon Jun 16 18:34:56 2014 Last change: Mon Jun 16 > 14:24:54 2014 via cibadmin on server1 > Stack: cman > Current DC: server1 - partition WITHOUT quorum > Version: 1.1.8-7.el6-394e906 > 2 Nodes configured, 1 expected votes > 4 Resources configured. > > Online: [ server1 ] > OFFLINE: [ server2 ] > > ClusterIP (ocf::heartbeat:IPaddr2): Started server1 > WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED > > Failed actions: > WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): > unknown error > > Here is my /etc/cluster/cluster.conf > > > > > > > > > > > > > > > > > > > > > > > > > Here is my: crm configure show > > stonith-enabled=false \ Well this is a problem. When cman detects a failure (well corosync, but cman is told), it initiates a fence request. The fence daemon informs DLM with blocks. Then fenced calls the configured 'fence_pcmk', which just passes the request up to pacemaker. Without stonith configured in fencing, pacemaker will fail to fence, of course. Thus, DLM sits blocked, so DRBD (and clustered LVM) hang, by design. If configure proper fencing in pacemaker (and test it to make sure it works), then pacemaker *would* succeed in fencing and return a success to fence_pcmk. Then fenced is told that the fence succeeds, DLM cleans up lost locks and returns to normal operation. So please configure and test real stonith in pacemaker and see if your problem is resolved. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 19 02:01:35 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 18 Jun 2014 22:01:35 -0400 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0FA5CB@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <53A11308.2040504@alteeve.ca> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0FA5CB@HN-MBX-02.BANK.MB.GROUP> Message-ID: <53A2447F.7060705@alteeve.ca> I don't use VMware myself, but I think fence_vmware will work for you. Please note that simply enabling stonith is not enough. As you realize, you need a configured and working fence method. If you try using the command line, you can play with the command's switched asking for 'status'. When that returns properly, you will then just need to convert the switches into arguments for pacemaker. Read the man page for 'fence_vmware', and then try calling: fence_vmware ... -o status Fill in the switches and values you need based on the instructions in 'man fence_vmware'. 
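As a rough sketch only (the vCenter address, credentials and VM name below are placeholders, not values from this thread), a status test with the SOAP-based agent shipped in fence-agents might look like:

    fence_vmware_soap -a vcenter.example.com -z -l fenceuser -p secret -n kien-node2 -o status

If that prints the virtual machine's power state, the same parameters should carry over into the pacemaker stonith resource definition.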
digimer On 18/06/14 09:51 PM, Le Trung Kien wrote: > Hi, > > As Digimer suggested, I change property > > stonith-enabled=true > > But now I don't know which fencing method I should use, because my two Redhat nodes running on VMWare Workstation, OpenFiler as SCSI shared LUN storage. > > I attempted to use "fence_scsi", but no luck, I got this error: > > Jun 19 08:35:58 server1 stonith_admin[3837]: notice: crm_log_args: Invoked: stonith_admin --reboot server2 --tolerance 5s > Jun 19 08:36:08 server1 root: fence_pcmk[3836]: Call to fence server2 (reset) failed with rc=255 > > Here is my fencing configuration: > > > > > > > > > > > > > > > > > > > > > > > > > > > And the log: /tmp/fence_scsi.log show: > > Jun 18 19:49:40 fence_scsi: [error] no devices found > > I will try "vmware_soap" to see if it works. > > Kien Le > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer > Sent: Wednesday, June 18, 2014 11:18 AM > To: linux clustering > Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing > > On 16/06/14 07:43 AM, Le Trung Kien wrote: >> Hello everyone, >> >> I'm a new man on linux cluster. I have built a two-node cluster (without qdisk), includes: >> >> Redhat 6.4 >> cman >> pacemaker >> gfs2 >> >> My cluster could fail-over (back and forth) between two nodes for >> these 3 resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on >> /mnt/gfs2_storage), WebSite ( apache service) >> >> My problem occurs when I stop/start node in the following order: (when >> both nodes started) >> >> 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all >> resources still working on node2 2. Stop: node2 (stop service: >> pacemaker then cman) -> all resources stop (of course) 3. Start: node1 >> (start service: cman then pacemaker) -> only ClusterIP started, WebFS >> failed, WebSite not started >> >> Status: >> >> Last updated: Mon Jun 16 18:34:56 2014 Last change: Mon Jun 16 >> 14:24:54 2014 via cibadmin on server1 >> Stack: cman >> Current DC: server1 - partition WITHOUT quorum >> Version: 1.1.8-7.el6-394e906 >> 2 Nodes configured, 1 expected votes >> 4 Resources configured. >> >> Online: [ server1 ] >> OFFLINE: [ server2 ] >> >> ClusterIP (ocf::heartbeat:IPaddr2): Started server1 >> WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED >> >> Failed actions: >> WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): >> unknown error >> >> Here is my /etc/cluster/cluster.conf >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Here is my: crm configure show >> > > > >> stonith-enabled=false \ > > Well this is a problem. > > When cman detects a failure (well corosync, but cman is told), it initiates a fence request. The fence daemon informs DLM with blocks. > Then fenced calls the configured 'fence_pcmk', which just passes the request up to pacemaker. > > Without stonith configured in fencing, pacemaker will fail to fence, of course. Thus, DLM sits blocked, so DRBD (and clustered LVM) hang, by design. > > If configure proper fencing in pacemaker (and test it to make sure it works), then pacemaker *would* succeed in fencing and return a success to fence_pcmk. Then fenced is told that the fence succeeds, DLM cleans up lost locks and returns to normal operation. > > So please configure and test real stonith in pacemaker and see if your problem is resolved. 
> > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From ccaulfie at redhat.com Thu Jun 19 10:02:58 2014 From: ccaulfie at redhat.com (Christine Caulfield) Date: Thu, 19 Jun 2014 11:02:58 +0100 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> <53A03763.4080905@redhat.com> Message-ID: <53A2B552.1000609@redhat.com> On 17/06/14 15:27, Schaefer, Micah wrote: > I am running Red Hat 6.4 with the HA/ load balancing packages from the > install DVD. > > > -bash-4.1$ cat /etc/redhat-release > Red Hat Enterprise Linux Server release 6.4 (Santiago) > > -bash-4.1$ corosync -v > Corosync Cluster Engine, version '1.4.1' > Copyright (c) 2006-2009 Red Hat, Inc. > > Thanks. 6.5 has better pause detection in it but I don't think that's the issue here actually. It looks to me like some messages are getting through but not others. So I'm back to seriously wondering if multicast traffic is being forwarded correctly and reliably. Having a mix of virtual and physical systems can cause these sorts of issues with real and software switches being mixed. Though I haven't seen anything quite as odd as this to be honest. Can you try either UDPU (preferred) or broadcast transport please and see if that helps or changes the symptoms at all? Broadcast could be problematic itself with the real/virtual mix so UDPU will be a more reliable option. Annoyingly, you'll need to take down the whole cluster to do this, and add to /etc/cluster/cluster.conf on all nodes. Chrissie > > On 6/17/14, 8:41 AM, "Christine Caulfield" wrote: > >> On 12/06/14 20:06, Digimer wrote: >>> Hrm, I'm not really sure that I am able to interpret this without making >>> guesses. I'm cc'ing one of the devs (who I hope will poke the right >>> person if he's not able to help at the moment). Lets see what he has to >>> say. >>> >>> I am curious now, too. :) >>> >>> On 12/06/14 03:02 PM, Schaefer, Micah wrote: >>>> Node4 was fenced again, I was able to get some debug logs (below), a >>>> new >>>> message : >>>> >>>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the >>>> OPERATIONAL >>>> state.? >>>> >>>> >>>> Rest of corosync logs >>>> >>>> http://pastebin.com/iYFbkbhb >>>> >>>> >>>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >>>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >>>> membership and a new membership was formed. >>>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>>> flushing membership messages. 
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >>>> flushing membership messages. >> >> >> I'm concerned that the pause messages are repeating like that, it looks >> like it might be a fixed bug. What version of corosync do you have? >> >> Chrissie >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > From Micah.Schaefer at jhuapl.edu Thu Jun 19 12:39:20 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 19 Jun 2014 08:39:20 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <53A2B552.1000609@redhat.com> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> <53A03763.4080905@redhat.com> <53A2B552.1000609@redhat.com> Message-ID: I have set the network to udpu. The physical nodes are to replace the virtual nodes. I was planning on decommissioning the virtual nodes when the cluster was stable with the physical nodes. I will also remove the virtual nodes from the cluster and see if it makes any difference. When I was only running the two virtual nodes I did not have any of these issues. On 6/19/14, 6:02 AM, "Christine Caulfield" wrote: >On 17/06/14 15:27, Schaefer, Micah wrote: >> I am running Red Hat 6.4 with the HA/ load balancing packages from the >> install DVD. >> >> >> -bash-4.1$ cat /etc/redhat-release >> Red Hat Enterprise Linux Server release 6.4 (Santiago) >> >> -bash-4.1$ corosync -v >> Corosync Cluster Engine, version '1.4.1' >> Copyright (c) 2006-2009 Red Hat, Inc. >> >> > > >Thanks. 6.5 has better pause detection in it but I don't think that's >the issue here actually. 
It looks to me like some messages are getting >through but not others. So I'm back to seriously wondering if multicast >traffic is being forwarded correctly and reliably. Having a mix of >virtual and physical systems can cause these sorts of issues with real >and software switches being mixed. Though I haven't seen anything quite >as odd as this to be honest. > >Can you try either UDPU (preferred) or broadcast transport please and >see if that helps or changes the symptoms at all? Broadcast could be >problematic itself with the real/virtual mix so UDPU will be a more >reliable option. > >Annoyingly, you'll need to take down the whole cluster to do this, and add > > > >to /etc/cluster/cluster.conf on all nodes. > >Chrissie > > > >> >> On 6/17/14, 8:41 AM, "Christine Caulfield" wrote: >> >>> On 12/06/14 20:06, Digimer wrote: >>>> Hrm, I'm not really sure that I am able to interpret this without >>>>making >>>> guesses. I'm cc'ing one of the devs (who I hope will poke the right >>>> person if he's not able to help at the moment). Lets see what he has >>>>to >>>> say. >>>> >>>> I am curious now, too. :) >>>> >>>> On 12/06/14 03:02 PM, Schaefer, Micah wrote: >>>>> Node4 was fenced again, I was able to get some debug logs (below), a >>>>> new >>>>> message : >>>>> >>>>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the >>>>> OPERATIONAL >>>>> state.? >>>>> >>>>> >>>>> Rest of corosync logs >>>>> >>>>> http://pastebin.com/iYFbkbhb >>>>> >>>>> >>>>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >>>>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >>>>> membership and a new membership was formed. >>>>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 >>>>>ms, >>>>> flushing membership messages. 
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 >>>>>ms, >>>>> flushing membership messages. >>> >>> >>> I'm concerned that the pause messages are repeating like that, it looks >>> like it might be a fixed bug. What version of corosync do you have? >>> >>> Chrissie >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From haralambop at gmail.com Thu Jun 19 14:08:11 2014 From: haralambop at gmail.com (Andreas Haralambopoulos) Date: Thu, 19 Jun 2014 17:08:11 +0300 Subject: [Linux-cluster] Openvpn as a service in RGManager Message-ID: <9FA47F25-F865-4577-87B9-BEC1D73079C9@gmail.com> Is it possible to tun in rgmanager a VPN service only in one node? something like this in pacemaker primitive p_openvpn ocf:heartbeat:anything \ params binfile="/usr/sbin/openvpn" cmdline_options="--daemon --writepid /var/run/openvpn.pid --config /data/openvpn/server.conf --cd /data/openvpn" pidfile="/var/run/openvpn.pid" \ op start timeout="20" \ op stop timeout="30" \ op monitor interval="20" \ meta target-role="Started" From yamato at redhat.com Fri Jun 20 02:07:55 2014 From: yamato at redhat.com (Masatake YAMATO) Date: Fri, 20 Jun 2014 11:07:55 +0900 (JST) Subject: [Linux-cluster] Fw: [corosync] wireshark dissector for corosync 1.x srp Message-ID: <20140620.110755.698885983983758743.yamato@redhat.com> If you have a trouble in lower layer communication in cluster 3, wireshark can help you understand it. Masatake YAMATO -------------- next part -------------- An embedded message was scrubbed... From: Masatake YAMATO Subject: [corosync] wireshark dissector for corosync 1.x srp Date: Fri, 20 Jun 2014 11:03:36 +0900 (JST) Size: 4305 URL: From amjadcsu at gmail.com Sun Jun 22 07:55:40 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Sun, 22 Jun 2014 10:55:40 +0300 Subject: [Linux-cluster] fence Agent Message-ID: Hello, I am trying to setup a simple 2 node cluster in active/passive mode for oracle high availability We are using one INSPUR server and one HP proliant (Management decision based on hardware availability) and we are seeing if we can use IPMI as fencing method CCHS though supports HP ILO, DELL IPMI, IBM , but not INSPUR. So the basic question i have is what if we can use fence_ILO (for HP) and fence_ipmilan (For INSPUR)? IF any one have any experience with fence_ipmilan or point to resources , it would really be appreciated. Sincerely, Amjad -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lists at alteeve.ca Sun Jun 22 08:31:23 2014 From: lists at alteeve.ca (Digimer) Date: Sun, 22 Jun 2014 04:31:23 -0400 Subject: [Linux-cluster] fence Agent In-Reply-To: References: Message-ID: <53A6945B.4050804@alteeve.ca> On 22/06/14 03:55 AM, Amjad Syed wrote: > Hello, > > I am trying to setup a simple 2 node cluster in active/passive mode for > oracle high availability > > We are using one INSPUR server and one HP proliant (Management decision > based on hardware availability) and we are seeing if we can use IPMI > as fencing method > > CCHS though supports HP ILO, DELL IPMI, IBM , but not INSPUR. > > So the basic question i have is what if we can use fence_ILO (for HP) > and fence_ipmilan (For INSPUR)? > > IF any one have any experience with fence_ipmilan or point to resources > , it would really be appreciated. > > Sincerely, > Amjad fence_ipmilan works with just about every IPMI-based out of band management interface. Most of those branded ones, like DRAC, RSA, iLO, etc are fundamentally based on IPMI. I've used fence_ipmilan on iLO personally and it's fine. If you can show what 'ipmitool' command you use that can show if the peer is powered on or off, then you should be able to translate it quite easily to a matching fence_ipmilan call (check man fence_ipmilan for the switches). Once you can check the power status of the peer(s) with fence_ipmilan, you're 95% of the way there. cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From amjadcsu at gmail.com Sun Jun 22 14:32:50 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Sun, 22 Jun 2014 17:32:50 +0300 Subject: [Linux-cluster] fence Agent In-Reply-To: <53A6945B.4050804@alteeve.ca> References: <53A6945B.4050804@alteeve.ca> Message-ID: Well , i am running RHEL 6.3 on INSPUR NFS5280 . For some reason the ipmitool and drivers stopped working. While restarting /etc/init.d/ipmi , it would just hang. Is it that ipmitool is not communicating with BMC .? What is the best way to tackle this issue ? Thanks On Sun, Jun 22, 2014 at 11:31 AM, Digimer wrote: > On 22/06/14 03:55 AM, Amjad Syed wrote: > >> Hello, >> >> I am trying to setup a simple 2 node cluster in active/passive mode for >> oracle high availability >> >> We are using one INSPUR server and one HP proliant (Management decision >> based on hardware availability) and we are seeing if we can use IPMI >> as fencing method >> >> CCHS though supports HP ILO, DELL IPMI, IBM , but not INSPUR. >> >> So the basic question i have is what if we can use fence_ILO (for HP) >> and fence_ipmilan (For INSPUR)? >> >> IF any one have any experience with fence_ipmilan or point to resources >> , it would really be appreciated. >> >> Sincerely, >> Amjad >> > > fence_ipmilan works with just about every IPMI-based out of band > management interface. Most of those branded ones, like DRAC, RSA, iLO, etc > are fundamentally based on IPMI. I've used fence_ipmilan on iLO personally > and it's fine. > > If you can show what 'ipmitool' command you use that can show if the peer > is powered on or off, then you should be able to translate it quite easily > to a matching fence_ipmilan call (check man fence_ipmilan for the > switches). Once you can check the power status of the peer(s) with > fence_ipmilan, you're 95% of the way there. > > cheers > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? 
> > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vasil.val at gmail.com Mon Jun 23 18:09:48 2014 From: vasil.val at gmail.com (Vasil Valchev) Date: Mon, 23 Jun 2014 21:09:48 +0300 Subject: [Linux-cluster] Online change of fence device options - possible? Message-ID: Hello, I have a RHEL 6.5 cluster, using rgmanager. The fence devices are fence_ipmilan - fencing through HP iLO4. The issue is the fence devices weren't configured entirely correct - recently after a node failure, the fence agent was returning failures (even though it was fencing the node successfully), which apparently can be avoided by setting the power_wait option to the fence dev configuration. My question is - after changing the fence device (I think directly through the .conf will be fine?), iterating the config version, and syncing the .conf through the cluster software - is something else necessary to apply the change (eg. cman reload)? Will the new fence option be used the next time a fencing action is performed? And lastly can all of this be performed while the cluster and services are operational or they have to be stopped/restarted? Regards, Vasil -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Mon Jun 23 18:16:37 2014 From: lists at alteeve.ca (Digimer) Date: Mon, 23 Jun 2014 14:16:37 -0400 Subject: [Linux-cluster] Online change of fence device options - possible? In-Reply-To: References: Message-ID: <53A86F05.6090901@alteeve.ca> On 23/06/14 02:09 PM, Vasil Valchev wrote: > Hello, > > I have a RHEL 6.5 cluster, using rgmanager. > The fence devices are fence_ipmilan - fencing through HP iLO4. > > The issue is the fence devices weren't configured entirely correct - > recently after a node failure, the fence agent was returning failures > (even though it was fencing the node successfully), which apparently can > be avoided by setting the power_wait option to the fence dev configuration. > > My question is - after changing the fence device (I think directly > through the .conf will be fine?), iterating the config version, and > syncing the .conf through the cluster software - is something else > necessary to apply the change (eg. cman reload)? > > Will the new fence option be used the next time a fencing action is > performed? > > And lastly can all of this be performed while the cluster and services > are operational or they have to be stopped/restarted? > > > Regards, > Vasil This should be fine. As you said; Update the fence config, increment the config_version, save and exit. Run 'ccs_config_validate' and if that passes, 'cman_tool version -r'. Note that for this to work, you need to have set the 'ricci' user's shell password as well as have the 'ricci' and 'modclusterd' daemons running. Once done, run 'fence_check'[1] to verify that the fence config works (it makes a status call to check). If that works, you're good to go. You can also crontab the fence_check call and have it email you or something so that you can catch fence failures earlier. digimer 1. https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_Fence_check_to_Verify_our_Fencing_Config -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
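A condensed sketch of the update sequence described above, with the power_wait value, device name and version number purely illustrative:

  # on one node, edit /etc/cluster/cluster.conf:
  #   - add power_wait="5" to the fence_ipmilan <fencedevice .../> entry
  #   - bump the version, e.g. <cluster name="prodclu" config_version="29">
  ccs_config_validate
  cman_tool version -r    # pushes the new config to the other nodes via ricci
  fence_check             # status-only check of every configured fence device

cman_tool version -r relies on the ricci password being set and the ricci and modclusterd daemons running on every node, as noted above; fence_check only issues status calls, so nothing is actually fenced while you verify.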
From lists at alteeve.ca Mon Jun 23 18:19:56 2014 From: lists at alteeve.ca (Digimer) Date: Mon, 23 Jun 2014 14:19:56 -0400 Subject: [Linux-cluster] Online change of fence device options - possible? In-Reply-To: <53A86F05.6090901@alteeve.ca> References: <53A86F05.6090901@alteeve.ca> Message-ID: <53A86FCC.3010607@alteeve.ca> On 23/06/14 02:16 PM, Digimer wrote: > On 23/06/14 02:09 PM, Vasil Valchev wrote: >> Hello, >> >> I have a RHEL 6.5 cluster, using rgmanager. >> The fence devices are fence_ipmilan - fencing through HP iLO4. >> >> The issue is the fence devices weren't configured entirely correct - >> recently after a node failure, the fence agent was returning failures >> (even though it was fencing the node successfully), which apparently can >> be avoided by setting the power_wait option to the fence dev >> configuration. >> >> My question is - after changing the fence device (I think directly >> through the .conf will be fine?), iterating the config version, and >> syncing the .conf through the cluster software - is something else >> necessary to apply the change (eg. cman reload)? >> >> Will the new fence option be used the next time a fencing action is >> performed? >> >> And lastly can all of this be performed while the cluster and services >> are operational or they have to be stopped/restarted? >> >> >> Regards, >> Vasil > > This should be fine. As you said; Update the fence config, increment the > config_version, save and exit. Run 'ccs_config_validate' and if that > passes, 'cman_tool version -r'. Note that for this to work, you need to > have set the 'ricci' user's shell password as well as have the 'ricci' > and 'modclusterd' daemons running. > > Once done, run 'fence_check'[1] to verify that the fence config works > (it makes a status call to check). If that works, you're good to go. > > You can also crontab the fence_check call and have it email you or > something so that you can catch fence failures earlier. > > digimer > > 1. > https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_Fence_check_to_Verify_our_Fencing_Config I should clarify; You can update the config while the cluster is online. No fences will be called and you do not need to restart anything. cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From amjadcsu at gmail.com Tue Jun 24 10:32:30 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Tue, 24 Jun 2014 13:32:30 +0300 Subject: [Linux-cluster] Error in Cluster.conf Message-ID: Hello I am getting the following error when i run ccs_config_Validate ccs_config_validate Relax-NG validity error : Extra element clusternodes in interleave tempfile:12: element clusternodes: Relax-NG validity error : Element cluster failed to validate content Configuration fails to validate Here is my cluster.conf file Any help would be appreciated -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Tue Jun 24 11:56:52 2014 From: fdinitto at redhat.com (Fabio M. 
Di Nitto) Date: Tue, 24 Jun 2014 13:56:52 +0200 Subject: [Linux-cluster] Error in Cluster.conf In-Reply-To: References: Message-ID: <53A96784.3030009@redhat.com> On 6/24/2014 12:32 PM, Amjad Syed wrote: > Hello > > I am getting the following error when i run ccs_config_Validate > > ccs_config_validate > Relax-NG validity error : Extra element clusternodes in interleave You defined tempfile:12: element clusternodes: Relax-NG validity error : Element > cluster failed to validate content > Configuration fails to validate > > Here is my cluster.conf file > > > > > > > > > > > login="ADMIN" name="inspuripmi" passwd="abc123"/> > login="test" name="hpipmi" passwd="abc12345"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Any help would be appreciated > > > > From jpokorny at redhat.com Tue Jun 24 12:55:00 2014 From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=) Date: Tue, 24 Jun 2014 14:55:00 +0200 Subject: [Linux-cluster] Error in Cluster.conf In-Reply-To: <53A96784.3030009@redhat.com> References: <53A96784.3030009@redhat.com> Message-ID: <20140624125500.GA1425@redhat.com> On 24/06/14 13:56 +0200, Fabio M. Di Nitto wrote: > On 6/24/2014 12:32 PM, Amjad Syed wrote: >> Hello >> >> I am getting the following error when i run ccs_config_Validate >> >> ccs_config_validate >> Relax-NG validity error : Extra element clusternodes in interleave > > You defined cluster.conf:13:47: error: > element "fencedvice" not allowed anywhere; expected the element > end-tag or element "fencedevice" > cluster.conf:15:23: error: > element "clusternodes" not allowed here; expected the element > end-tag or element "clvmd", "dlm", "fence_daemon", "fence_xvmd", > "gfs_controld", "group", "logging", "quorumd", "rm", "totem" or > "uidgid" > cluster.conf:26:76: error: > IDREF "fence_node2" without matching ID > cluster.conf:19:77: error: > IDREF "fence_node1" without matching ID So it spotted also: - a typo in "fencedvice" - broken referential integrity; it is prescribed "name" attribute of "device" tag should match a "name" of a defined "fencedevice" Hope this helps. -- Jan > Fabio > >> tempfile:12: element clusternodes: Relax-NG validity error : Element >> cluster failed to validate content >> Configuration fails to validate >> >> Here is my cluster.conf file >> >> >> >> >> >> >> >> >> >> >> > login="ADMIN" name="inspuripmi" passwd="abc123"/> >> > login="test" name="hpipmi" passwd="abc12345"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Any help would be appreciated -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 819 bytes Desc: not available URL: From lists at alteeve.ca Tue Jun 24 15:46:38 2014 From: lists at alteeve.ca (Digimer) Date: Tue, 24 Jun 2014 11:46:38 -0400 Subject: [Linux-cluster] Error in Cluster.conf In-Reply-To: <20140624125500.GA1425@redhat.com> References: <53A96784.3030009@redhat.com> <20140624125500.GA1425@redhat.com> Message-ID: <53A99D5E.2080508@alteeve.ca> On 24/06/14 08:55 AM, Jan Pokorn? wrote: > On 24/06/14 13:56 +0200, Fabio M. 
Di Nitto wrote: >> On 6/24/2014 12:32 PM, Amjad Syed wrote: >>> Hello >>> >>> I am getting the following error when i run ccs_config_Validate >>> >>> ccs_config_validate >>> Relax-NG validity error : Extra element clusternodes in interleave >> >> You defined > That + the are more issues discoverable by more powerful validator > jing (packaged in Fedora and RHEL 7, for instance, admittedly not > for RHEL 6/EPEL): > > $ jing cluster.rng cluster.conf >> cluster.conf:13:47: error: >> element "fencedvice" not allowed anywhere; expected the element >> end-tag or element "fencedevice" >> cluster.conf:15:23: error: >> element "clusternodes" not allowed here; expected the element >> end-tag or element "clvmd", "dlm", "fence_daemon", "fence_xvmd", >> "gfs_controld", "group", "logging", "quorumd", "rm", "totem" or >> "uidgid" >> cluster.conf:26:76: error: >> IDREF "fence_node2" without matching ID >> cluster.conf:19:77: error: >> IDREF "fence_node1" without matching ID > > So it spotted also: > - a typo in "fencedvice" > - broken referential integrity; it is prescribed "name" attribute > of "device" tag should match a "name" of a defined "fencedevice" > > Hope this helps. > > -- Jan Also, without fence methods defined for the nodes, rgmanager will block the first time there is an issue. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From amjadcsu at gmail.com Tue Jun 24 18:44:02 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Tue, 24 Jun 2014 21:44:02 +0300 Subject: [Linux-cluster] Error in Cluster.conf In-Reply-To: <53A99D5E.2080508@alteeve.ca> References: <53A96784.3030009@redhat.com> <20140624125500.GA1425@redhat.com> <53A99D5E.2080508@alteeve.ca> Message-ID: I have updated the config file , validated by ccs_config_validate Added the fence_daemon and post_join_delay. I am using bonding using ethernet coaxial cable. But for some reason whenever i start CMAN on node, it fences (kicks the other node). As a result at a time only one node is online . Do i need to use multicast to get both nodes online at same instance ?. or i am missing something here ? Now the file looks like this : ?xml version="1.0"?> Thanks On Tue, Jun 24, 2014 at 6:46 PM, Digimer wrote: > On 24/06/14 08:55 AM, Jan Pokorn? wrote: > >> On 24/06/14 13:56 +0200, Fabio M. 
Di Nitto wrote: >> >>> On 6/24/2014 12:32 PM, Amjad Syed wrote: >>> >>>> Hello >>>> >>>> I am getting the following error when i run ccs_config_Validate >>>> >>>> ccs_config_validate >>>> Relax-NG validity error : Extra element clusternodes in interleave >>>> >>> >>> You defined >> >> >> That + the are more issues discoverable by more powerful validator >> jing (packaged in Fedora and RHEL 7, for instance, admittedly not >> for RHEL 6/EPEL): >> >> $ jing cluster.rng cluster.conf >> >>> cluster.conf:13:47: error: >>> element "fencedvice" not allowed anywhere; expected the element >>> end-tag or element "fencedevice" >>> cluster.conf:15:23: error: >>> element "clusternodes" not allowed here; expected the element >>> end-tag or element "clvmd", "dlm", "fence_daemon", "fence_xvmd", >>> "gfs_controld", "group", "logging", "quorumd", "rm", "totem" or >>> "uidgid" >>> cluster.conf:26:76: error: >>> IDREF "fence_node2" without matching ID >>> cluster.conf:19:77: error: >>> IDREF "fence_node1" without matching ID >>> >> >> So it spotted also: >> - a typo in "fencedvice" >> - broken referential integrity; it is prescribed "name" attribute >> of "device" tag should match a "name" of a defined "fencedevice" >> >> Hope this helps. >> >> -- Jan >> > > Also, without fence methods defined for the nodes, rgmanager will block > the first time there is an issue. > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eivind at aminor.no Sun Jun 29 23:48:43 2014 From: eivind at aminor.no (Eivind Olsen) Date: Mon, 30 Jun 2014 01:48:43 +0200 Subject: [Linux-cluster] Which fence agents to use? Message-ID: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no> Hello. I am currently planning a 2-node cluster based on RHEL 6.5 and the high availability addon, with the goal of running Oracle 11g in active/passive failover mode. The cluster nodes will be physical HP blades, and they will have shared storage for the Oracle data-files on a FC-SAN. That is, shared block device but using HA LVM so only mounting the filesystem on one node at a time. The way I see it, my fence options are fence_ipmilan but I could also look at fence_scsi. Should I use only one or both of these? If both: in what order? Regards Eivind Olsen From raju.rajsand at gmail.com Mon Jun 30 06:22:37 2014 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Mon, 30 Jun 2014 11:52:37 +0530 Subject: [Linux-cluster] Which fence agents to use? In-Reply-To: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no> References: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no> Message-ID: Greetings, On Mon, Jun 30, 2014 at 5:18 AM, Eivind Olsen wrote: > The cluster nodes will be physical HP blades, and they will have shared storage for the Oracle data-files on a FC-SAN. That is, shared block device but using HA LVM so only mounting the filesystem on one node at a time. > HP ILO should help. As a secondary you can use the FC-SAN Fencing HTH Regards -- Regards, Rajagopal From amjadcsu at gmail.com Mon Jun 30 06:47:20 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Mon, 30 Jun 2014 09:47:20 +0300 Subject: [Linux-cluster] Which fence agents to use? 
In-Reply-To:
References: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no>
Message-ID:

Hi, last week I implemented a 2-node cluster on HP ProLiant servers. I used HP iLO; power-based fencing agents are preferred.

On 30 Jun 2014 09:28, "Rajagopal Swaminathan" wrote:

> Greetings,
>
> On Mon, Jun 30, 2014 at 5:18 AM, Eivind Olsen wrote:
> > The cluster nodes will be physical HP blades, and they will have shared storage for the Oracle data-files on a FC-SAN. That is, shared block device but using HA LVM so only mounting the filesystem on one node at a time.
>
> HP ILO should help.
>
> As a secondary you can use the FC-SAN Fencing
>
> HTH
>
> Regards
>
> --
> Regards,
>
> Rajagopal
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ekuric at redhat.com Mon Jun 30 08:14:54 2014
From: ekuric at redhat.com (Elvir Kuric)
Date: Mon, 30 Jun 2014 10:14:54 +0200
Subject: [Linux-cluster] Which fence agents to use?
In-Reply-To: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no>
References: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no>
Message-ID: <53B11C7E.4070903@redhat.com>

On 06/30/2014 01:48 AM, Eivind Olsen wrote:
> Hello.
>
> I am currently planning a 2-node cluster based on RHEL 6.5 and the high availability addon, with the goal of running Oracle 11g in active/passive failover mode.
> The cluster nodes will be physical HP blades

Which model, which generation, and which iLO version? If iLO3/iLO4 (which is the case on most recent blade/ProLiant models), then fence_ipmilan is the recommended way. Here is the full list of fencing agents supported with RHEL: https://access.redhat.com/site/articles/28603 (you have to have Red Hat customer portal access to see it - I guess you have it as a Red Hat customer).

> , and they will have shared storage for the Oracle data-files on a FC-SAN. That is, shared block device but using HA LVM so only mounting the filesystem on one node at a time.

OK!

> The way I see it, my fence options are fence_ipmilan but I could also look at fence_scsi. Should I use only one or both of these?

You can use either, or a combination of them. Be aware that with fence_scsi (if used as the only fencing method) the cluster node is only cut off from the shared storage - there is no power restart, and you will need to restart it manually. This can be overcome by implementing power fencing (fence_ipmilan) to restart the machine once it has an issue.

> If both: in what order?

When configured properly, fence_ipmilan will do the job; adding a second fencing method (fence_scsi) introduces additional complexity in the cluster configuration - harder to debug and maintain. IMHO, fence_ipmilan is a good choice.

> Regards
> Eivind Olsen
>

From eivind at aminor.no Mon Jun 30 10:19:47 2014
From: eivind at aminor.no (Eivind Olsen)
Date: Mon, 30 Jun 2014 12:19:47 +0200
Subject: [Linux-cluster] Which fence agents to use?
In-Reply-To: <53B11C7E.4070903@redhat.com>
References: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no> <53B11C7E.4070903@redhat.com>
Message-ID:

Elvir Kuric wrote:
> Which model, which generation, and which iLO version?
> If iLO3/iLO4 (which is the case on most recent blade/ProLiant models),
> then fence_ipmilan is the recommended way.

ProLiant BL460c Gen8, with some version of iLO 4.
> Here is the full list of fencing agents supported with RHEL:
> https://access.redhat.com/site/articles/28603 (you have to have Red Hat
> customer portal access to see it - I guess you have it as a Red Hat
> customer).

Yes, I've seen that article, and that's why I thought I'd go with fence_ipmilan and not fence_hpblade (which exists but isn't on the list in that article).

> When configured properly, fence_ipmilan will do the job; adding a second
> fencing method (fence_scsi) introduces additional complexity in the
> cluster configuration - harder to debug and maintain. IMHO, fence_ipmilan
> is a good choice.

Ah, OK. I'll keep the configuration simpler then, and not bother with fence_scsi :)

Thanks!

Regards
Eivind Olsen
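For reference, a sketch of what the fencing part of cluster.conf might look like for an iLO4-based setup with fence_ipmilan; node names, addresses and credentials below are placeholders, not values from this thread:

  <clusternode name="node1.example.com" nodeid="1">
    <fence>
      <method name="ipmi">
        <device name="ipmi_node1" action="reboot"/>
      </method>
    </fence>
  </clusternode>

  <fencedevices>
    <!-- lanplus="1" is what fence_ipmilan needs for iLO3/iLO4-class BMCs;
         power_wait makes the agent pause after the power action before re-checking status -->
    <fencedevice agent="fence_ipmilan" name="ipmi_node1" ipaddr="10.0.0.11" login="fenceuser" passwd="secret" lanplus="1" power_wait="5"/>
  </fencedevices>

The name in each node's <device .../> entry must match a <fencedevice> name exactly (the IDREF validation errors earlier in this archive came from that kind of mismatch), and the setup can be exercised non-destructively with fence_check or a manual fence_ipmilan -o status call before it is relied on.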