From Micah.Schaefer at jhuapl.edu Wed Jun 4 14:59:01 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 4 Jun 2014 10:59:01 -0400 Subject: [Linux-cluster] Node is randomly fenced Message-ID: I have a 4 node cluster, running a single service group. I have been seeing node1 fence node3 while node3 is actively running the service group at random intervals. Rgmanager logs show no failures in service checks, and no other logs provide any useful information. How can I go about finding out why node1 is fencing node3? I currently set up the failover domain to be restricted and not include node3. cluster.conf : http://pastebin.com/xYy6xp6N From emi2fast at gmail.com Wed Jun 4 15:11:12 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 4 Jun 2014 17:11:12 +0200 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: Message-ID: logs? 2014-06-04 16:59 GMT+02:00 Schaefer, Micah : > I have a 4 node cluster, running a single service group. I have been > seeing node1 fence node3 while node3 is actively running the service group > at random intervals. > > Rgmanager logs show no failures in service checks, and no other logs > provide any useful information. How can I go about finding out why node1 > is fencing node3? > > I currently set up the failover domain to be restricted and not include > node3. > > cluster.conf : http://pastebin.com/xYy6xp6N > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jun 4 15:13:15 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 04 Jun 2014 11:13:15 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: Message-ID: <538F378B.8030407@alteeve.ca> On 04/06/14 10:59 AM, Schaefer, Micah wrote: > I have a 4 node cluster, running a single service group. I have been > seeing node1 fence node3 while node3 is actively running the service group > at random intervals. > > Rgmanager logs show no failures in service checks, and no other logs > provide any useful information. How can I go about finding out why node1 > is fencing node3? > > I currently set up the failover domain to be restricted and not include > node3. > > cluster.conf : http://pastebin.com/xYy6xp6N Random fencing is almost always caused by network failures. Can you look are the system logs, starting a little before the fence and continuing until after the fence completes, and paste them here? I suspect you will see corosync complaining. If this is true, do your switches support persistent multicast? Do you use active/passive bonding? Have you tried different switch/cable/NIC? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Wed Jun 4 15:32:45 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 4 Jun 2014 11:32:45 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <538F378B.8030407@alteeve.ca> References: <538F378B.8030407@alteeve.ca> Message-ID: Logs: http://pastebin.com/QCh5FzZu I have one 10gb nic connected Here is the corosync log from node1, I see that is says ? A processor failed, forming new configuration.?, I need to dig deeper though. 
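(For reference, the excerpt below is the kind of thing you can pull with something along these lines; /var/log/cluster/corosync.log is the usual RHEL 6 location, but the path may differ depending on the <logging> settings in cluster.conf:

  # membership changes and fence events, from a bit before the fence until after it completes
  grep -E 'TOTEM|QUORUM|CPG|fenc' /var/log/cluster/corosync.log /var/log/messages

)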
May 27 10:03:49 corosync [QUORUM] Members[4]: 1 2 3 4 May 27 10:05:04 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 03 13:52:34 corosync [TOTEM ] A processor failed, forming new configuration. Jun 03 13:52:46 corosync [QUORUM] Members[3]: 1 2 4 Jun 03 13:52:46 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:52:46 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) Jun 03 13:52:46 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 13:56:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:56:14 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 13:56:14 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 13:56:28 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:56:28 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 13:56:28 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 13:56:41 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:56:41 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 13:56:41 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 13:57:04 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 13:57:04 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 13:57:04 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 03 15:12:09 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 03 15:12:09 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 03 15:12:09 corosync [MAIN ] Completed service synchronization, ready to provide service. Regards, ------- Micah Schaefer JHU/ APL ITSD/ ITC 240-228-1148 (x81148) On 6/4/14, 11:13 AM, "Digimer" wrote: >On 04/06/14 10:59 AM, Schaefer, Micah wrote: >> I have a 4 node cluster, running a single service group. I have been >> seeing node1 fence node3 while node3 is actively running the service >>group >> at random intervals. >> >> Rgmanager logs show no failures in service checks, and no other logs >> provide any useful information. How can I go about finding out why node1 >> is fencing node3? >> >> I currently set up the failover domain to be restricted and not include >> node3. >> >> cluster.conf : http://pastebin.com/xYy6xp6N > >Random fencing is almost always caused by network failures. Can you look >are the system logs, starting a little before the fence and continuing >until after the fence completes, and paste them here? I suspect you will >see corosync complaining. > >If this is true, do your switches support persistent multicast? Do you >use active/passive bonding? Have you tried different switch/cable/NIC? > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? 
> >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From alkol6 at gmail.com Wed Jun 4 15:48:31 2014 From: alkol6 at gmail.com (Senol Erdogan) Date: Wed, 4 Jun 2014 11:48:31 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: Message-ID: The problem looks like it is in the failover domains' node priorities. The same nodes are given different priorities in the different domains, and that can trigger unexpected fences. Maybe you can narrow it down by re-activating your failover domains step by step. (Of course, only after all the network and firewall settings are right and problem-free.) Senol Erdogan On Jun 4, 2014 11:06 AM, "Schaefer, Micah" wrote: > I have a 4 node cluster, running a single service group. I have been > seeing node1 fence node3 while node3 is actively running the service group > at random intervals. > > Rgmanager logs show no failures in service checks, and no other logs > provide any useful information. How can I go about finding out why node1 > is fencing node3? > > I currently set up the failover domain to be restricted and not include > node3. > > cluster.conf : http://pastebin.com/xYy6xp6N > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jfriesse at redhat.com Thu Jun 5 08:36:58 2014 From: jfriesse at redhat.com (Jan Friesse) Date: Thu, 05 Jun 2014 10:36:58 +0200 Subject: [Linux-cluster] [Openais] Newbie clustering questions In-Reply-To: References: Message-ID: <53902C2A.3060108@redhat.com> Per, it looks like none of your questions is really corosync related (so I'm CC'ing linux clustering, which is really the better list), but I will try to answer at least some of your questions. > Hi all > > I have redhat clustering running on a 3 VMware vm's 2 nodes and 1 > management server I can join the nodes without any problems but I got a > couple of questions that I hope someone here can shed some lights on for me. > > If I want to add a ip resource to the cluster must both nodes be configured > with a interface with that ip or is there a better way of doing it? If not You must make sure that NO node has this address assigned. The IPAddr resource will take care of adding the ip to an interface. > then can one of the nodes have the nic in standby? > I don't think this is supported by any resource script. > How do I add fencing for a VMware vm's I notice that there is the VMware > soa must the each vm be configured with its individual VMware soa fencing > or is fencing not needed? From what I can read fencing is needed. Every node must be able to fence any other node. So you have to configure a fencing method for every node. In theory fencing is not needed as long as you are not using shared storage, but it's still better to have it. > > I am using Centos .6.5 with ESXI 4.1 > > Many thanks for your time > > Regards > Regards, Honza > > > _______________________________________________ > Openais mailing list > Openais at lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/openais > From jfriesse at redhat.com Fri Jun 6 07:37:50 2014 From: jfriesse at redhat.com (Jan Friesse) Date: Fri, 06 Jun 2014 09:37:50 +0200 Subject: [Linux-cluster] [Openais] Newbie clustering questions In-Reply-To: References: <53902C2A.3060108@redhat.com> Message-ID: <53916FCE.2000007@redhat.com> Per, > Hi Jan > > Many thanks for your response.
> > I spent some more time on this yesterday so I found out that the nodes > needs really to have 2 nics, 2 ip's and the resource ip gets assigned to > the node that becomes the running node. > You don't need two nics. Even tho it's better, because you have separated cluster traffic from app traffic. > I have setup vmware fencing for each node, but I could not see anything in > the configuration to allow or disallow one node to fence of the other or > does this happen automagically? apologies if the question seem a bit stupid Yes, it is happening automatically as long as you've configured fencing for every node. > but 1 week ago I started this project with very little experience in > clustering :) > That's why I'm recommending you to ask linux-cluster at redhat.com and read some docs (https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/High_Availability_Add-On_Overview/index.html). Regards, Honza > Regards > Per Qvindesland > > > On Thu, Jun 5, 2014 at 9:36 AM, Jan Friesse wrote: > >> Per, >> it looks like none of your question is really corosync related (so I'm >> CC'ing linux clustering (this is really >> better list) but I will try to answer at least some of your questions. >> >>> Hi all >>> >>> I have redhat clustering running on a 3 VMware vm's 2 nodes and 1 >>> management server I can join the nodes without any problems but I got a >>> couple of questions that I hope someone here can shed some lights on for >> me. >>> >>> If I want to add a ip resource to the cluster must both nodes be >> configured >>> with a interface with that ip or is there a better way of doing it? If >> not >> >> You must make sure that NO nodes has this address assigned. IPAddr >> resource will take care to add ip to interface. >> >>> then can one of the nodes have the nic in standby? >>> >> >> I don't think this is supported by any resource script. >> >>> How do I add fencing for a VMware vm's I notice that there is the VMware >>> soa must the each vm be configured with its individual VMware soa fencing >>> or is fencing not needed? From what I can read fencing is needed. >> >> Every node must be able to fence any other node. So you have to >> configure fencing method for every node. >> >> In theory fencing is not needed as long as you are not using shared >> storage, but it's still better to have it. >> >>> >>> I am using Centos .6.5 with ESXI 4.1 >>> >>> Many thanks for your time >>> >>> Regards >>> >> >> Regards, >> Honza >> >>> >>> >>> _______________________________________________ >>> Openais mailing list >>> Openais at lists.linux-foundation.org >>> https://lists.linuxfoundation.org/mailman/listinfo/openais >>> >> >> > From arun.nair at dimensiondata.com Wed Jun 11 14:48:37 2014 From: arun.nair at dimensiondata.com (Arun G Nair) Date: Wed, 11 Jun 2014 20:18:37 +0530 Subject: [Linux-cluster] 2-node cluster fence loop Message-ID: Hello, What are the reasons for fence loops when only cman is started ? We have an RHEL 6.5 2-node cluster which goes in to a fence loop and every time we start cman on both nodes. Either one fences the other. Multicast seems to be working properly. My understanding is that without rgmanager running there won't be a multicast group subscription ? I don't see the multicast address in 'netstat -g' unless rgmanager is running. I've tried to increase the fence post_join_delay but one of the nodes still gets fenced. The cluster works fine if we use unicast UDP. 
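(To be clear, by unicast UDP I mean pointing cman at the udpu transport in cluster.conf, roughly like the sketch below; the cluster and node names are placeholders rather than our real config, and the fencing section is omitted:

  <cluster name="twonode" config_version="2">
    <!-- udpu = corosync sends to each node's address directly, so no multicast group is needed -->
    <cman transport="udpu" two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="node1" nodeid="1"/>
      <clusternode name="node2" nodeid="2"/>
    </clusternodes>
  </cluster>

)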
Thanks, -- Arun G Nair -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jun 11 15:03:48 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 11 Jun 2014 11:03:48 -0400 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: Message-ID: <53986FD4.6010902@alteeve.ca> On 11/06/14 10:48 AM, Arun G Nair wrote: > Hello, > > What are the reasons for fence loops when only cman is started ? We > have an RHEL 6.5 2-node cluster which goes in to a fence loop and every > time we start cman on both nodes. Either one fences the other. Multicast > seems to be working properly. My understanding is that without rgmanager > running there won't be a multicast group subscription ? I don't see the > multicast address in 'netstat -g' unless rgmanager is running. I've > tried to increase the fence post_join_delay but one of the nodes still > gets fenced. > > The cluster works fine if we use unicast UDP. > > Thanks, Hi, When cman starts, it waits post_join_delay seconds for the peer to connect. If, after that time expires (6 seconds by default, iirc), it gives up and calls a fence against the peer to put it into a known state. Corosync is what determines membership, and it is started by cman. The rgmanager only handles resource start/stop/relocate/recovery and has nothing to do with fencing directly. Corosync is what uses multicast. So as you seem to have already surmised, multicast is probably not working in your environment. Have you enabled multicast traffic on the firewall? Do your switches support multicast properly? digimer -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From muthukumar.t at hp.com Wed Jun 11 18:11:51 2014 From: muthukumar.t at hp.com (T, Muthukumar) Date: Wed, 11 Jun 2014 18:11:51 +0000 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: Message-ID: <8C558298378D604B9DB3536AFF81F04D1ECB5903@G5W2718.americas.hpqcorp.net> Hi all, When your cluster nodes get fenced while the cman service is starting, that can't really be called a fence loop; it usually comes down to a misconfiguration of the post_join_delay setting. By default post_join_delay is 3 seconds. While cman starts on a cluster node, it tries to get the status of the other cluster nodes to confirm the integrity of the cluster services; if the other nodes have not responded by the time post_join_delay expires, they are fenced by this node to ensure integrity (there is a chance that a node has already formed the cluster and started the cluster services). Fence looping is a different thing: it happens when there is a long-lasting failure in the heartbeat switch. Thanks & Regards Muthukumar T Production Engineering - UNIX 9790907286 From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Arun G Nair Sent: Wednesday, June 11, 2014 8:19 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] 2-node cluster fence loop Hello, What are the reasons for fence loops when only cman is started ? We have an RHEL 6.5 2-node cluster which goes in to a fence loop and every time we start cman on both nodes. Either one fences the other. Multicast seems to be working properly. My understanding is that without rgmanager running there won't be a multicast group subscription ? I don't see the multicast address in 'netstat -g' unless rgmanager is running. I've tried to increase the fence post_join_delay but one of the nodes still gets fenced.
The cluster works fine if we use unicast UDP. Thanks, -- Arun G Nair -------------- next part -------------- An HTML attachment was scrubbed... URL: From Micah.Schaefer at jhuapl.edu Wed Jun 11 18:21:59 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 11 Jun 2014 14:21:59 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> Message-ID: It failed again, even after deleting all the other failover domains. Cluster conf http://pastebin.com/jUXkwKS4 I turned corosync output to debug. How can I go about troubleshooting if it really is a network issue or something else? Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration. Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) Jun 11 14:10:29 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 11 14:13:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:13:54 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 11 14:13:54 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 11 14:14:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:14:08 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 11 14:14:08 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 11 14:14:21 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:14:21 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 11 14:14:21 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 11 14:14:43 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:14:43 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 11 14:14:43 corosync [MAIN ] Completed service synchronization, ready to provide service. On 6/4/14, 11:32 AM, "Schaefer, Micah" wrote: >Logs: http://pastebin.com/QCh5FzZu > >I have one 10gb nic connected > > >Here is the corosync log from node1, I see that is says ? A processor >failed, forming new configuration.?, I need to dig deeper though. > > >May 27 10:03:49 corosync [QUORUM] Members[4]: 1 2 3 4 >May 27 10:05:04 corosync [QUORUM] Members[4]: 1 2 3 4 >Jun 03 13:52:34 corosync [TOTEM ] A processor failed, forming new >configuration. >Jun 03 13:52:46 corosync [QUORUM] Members[3]: 1 2 4 >Jun 03 13:52:46 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:52:46 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:4 left:1) >Jun 03 13:52:46 corosync [MAIN ] Completed service synchronization, ready >to provide service. >Jun 03 13:56:14 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:56:14 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 13:56:14 corosync [MAIN ] Completed service synchronization, ready >to provide service. 
>Jun 03 13:56:28 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:56:28 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 13:56:28 corosync [MAIN ] Completed service synchronization, ready >to provide service. >Jun 03 13:56:41 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:56:41 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 13:56:41 corosync [MAIN ] Completed service synchronization, ready >to provide service. >Jun 03 13:57:04 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 13:57:04 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 13:57:04 corosync [MAIN ] Completed service synchronization, ready >to provide service. >Jun 03 15:12:09 corosync [TOTEM ] A processor joined or left the >membership and a new membership was formed. >Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4 >Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4 >Jun 03 15:12:09 corosync [CPG ] chosen downlist: sender r(0) >ip(10.70.100.101) ; members(old:3 left:0) >Jun 03 15:12:09 corosync [MAIN ] Completed service synchronization, ready >to provide service. > > > > > > > > > > > > >On 6/4/14, 11:13 AM, "Digimer" wrote: > >>On 04/06/14 10:59 AM, Schaefer, Micah wrote: >>> I have a 4 node cluster, running a single service group. I have been >>> seeing node1 fence node3 while node3 is actively running the service >>>group >>> at random intervals. >>> >>> Rgmanager logs show no failures in service checks, and no other logs >>> provide any useful information. How can I go about finding out why >>>node1 >>> is fencing node3? >>> >>> I currently set up the failover domain to be restricted and not include >>> node3. >>> >>> cluster.conf : http://pastebin.com/xYy6xp6N >> >>Random fencing is almost always caused by network failures. Can you look >>are the system logs, starting a little before the fence and continuing >>until after the fence completes, and paste them here? I suspect you will >>see corosync complaining. >> >>If this is true, do your switches support persistent multicast? Do you >>use active/passive bonding? Have you tried different switch/cable/NIC? >> >>-- >>Digimer >>Papers and Projects: https://alteeve.ca/w/ >>What if the cure for cancer is trapped in the mind of a person without >>access to education? >> >>-- >>Linux-cluster mailing list >>Linux-cluster at redhat.com >>https://www.redhat.com/mailman/listinfo/linux-cluster > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Wed Jun 11 18:29:30 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 11 Jun 2014 14:29:30 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> Message-ID: <5398A00A.4020802@alteeve.ca> On 11/06/14 02:21 PM, Schaefer, Micah wrote: > It failed again, even after deleting all the other failover domains. > > Cluster conf > http://pastebin.com/jUXkwKS4 > > I turned corosync output to debug. How can I go about troubleshooting if > it really is a network issue or something else? > > > > Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 > Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new > configuration. 
> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 > Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) > ip(10.70.100.101) ; members(old:4 left:1) This is, to me, *strongly* indicative of a network issue. It's not likely switch-wide as only one member was lost, but I would certainly put my money on a network problem somewhere, some how. Do you use bonding? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Wed Jun 11 18:55:07 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 11 Jun 2014 14:55:07 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5398A00A.4020802@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> Message-ID: I have the issue on two of my nodes. Each node has 1ea 10gb connection. No bonding, single link. What else can I look at? I manage the network too. I don?t see any link down notifications, don?t see any errors on the ports. On 6/11/14, 2:29 PM, "Digimer" wrote: >On 11/06/14 02:21 PM, Schaefer, Micah wrote: >> It failed again, even after deleting all the other failover domains. >> >> Cluster conf >> http://pastebin.com/jUXkwKS4 >> >> I turned corosync output to debug. How can I go about troubleshooting if >> it really is a network issue or something else? >> >> >> >> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 >> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new >> configuration. >> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 >> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >> ip(10.70.100.101) ; members(old:4 left:1) > >This is, to me, *strongly* indicative of a network issue. It's not >likely switch-wide as only one member was lost, but I would certainly >put my money on a network problem somewhere, some how. > >Do you use bonding? > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Wed Jun 11 19:28:28 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 11 Jun 2014 15:28:28 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> Message-ID: <5398ADDC.80501@alteeve.ca> The first thing I would do is get a second NIC and configure active-passive bonding. network issues are too common to ignore in HA setups. Ideally, I would span the links across separate stacked switches. As for debugging the issue, I can only recommend to look closely at the system and switch logs for clues. On 11/06/14 02:55 PM, Schaefer, Micah wrote: > I have the issue on two of my nodes. Each node has 1ea 10gb connection. No > bonding, single link. What else can I look at? I manage the network too. I > don?t see any link down notifications, don?t see any errors on the ports. > > > > > On 6/11/14, 2:29 PM, "Digimer" wrote: > >> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>> It failed again, even after deleting all the other failover domains. 
>>> >>> Cluster conf >>> http://pastebin.com/jUXkwKS4 >>> >>> I turned corosync output to debug. How can I go about troubleshooting if >>> it really is a network issue or something else? >>> >>> >>> >>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 >>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new >>> configuration. >>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 >>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the >>> membership and a new membership was formed. >>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>> ip(10.70.100.101) ; members(old:4 left:1) >> >> This is, to me, *strongly* indicative of a network issue. It's not >> likely switch-wide as only one member was lost, but I would certainly >> put my money on a network problem somewhere, some how. >> >> Do you use bonding? >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Wed Jun 11 19:50:14 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Wed, 11 Jun 2014 15:50:14 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5398ADDC.80501@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> Message-ID: Okay, I set up active/ backup bonding and will watch for any change. This is the network side: 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored 0 output errors, 0 collisions, 0 interface resets This is the server side: em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 GiB) Interrupt:34 Memory:d5000000-d57fffff I need to run some fiber, but for now two nodes are plugged into one switch and the other two nodes into a separate switch that are on the same subnet. I?ll work on cross connecting the bonded interfaces to different switches. On 6/11/14, 3:28 PM, "Digimer" wrote: >The first thing I would do is get a second NIC and configure >active-passive bonding. network issues are too common to ignore in HA >setups. Ideally, I would span the links across separate stacked switches. > >As for debugging the issue, I can only recommend to look closely at the >system and switch logs for clues. > >On 11/06/14 02:55 PM, Schaefer, Micah wrote: >> I have the issue on two of my nodes. Each node has 1ea 10gb connection. >>No >> bonding, single link. What else can I look at? I manage the network >>too. I >> don?t see any link down notifications, don?t see any errors on the >>ports. >> >> >> >> >> On 6/11/14, 2:29 PM, "Digimer" wrote: >> >>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>> It failed again, even after deleting all the other failover domains. >>>> >>>> Cluster conf >>>> http://pastebin.com/jUXkwKS4 >>>> >>>> I turned corosync output to debug. 
How can I go about troubleshooting >>>>if >>>> it really is a network issue or something else? >>>> >>>> >>>> >>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 >>>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>> configuration. >>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 >>>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the >>>> membership and a new membership was formed. >>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>> ip(10.70.100.101) ; members(old:4 left:1) >>> >>> This is, to me, *strongly* indicative of a network issue. It's not >>> likely switch-wide as only one member was lost, but I would certainly >>> put my money on a network problem somewhere, some how. >>> >>> Do you use bonding? >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ >>> What if the cure for cancer is trapped in the mind of a person without >>> access to education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From gnetravali at sonusnet.com Thu Jun 12 04:12:01 2014 From: gnetravali at sonusnet.com (Netravali, Ganesh) Date: Thu, 12 Jun 2014 04:12:01 +0000 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> Message-ID: <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> Make sure multicast is enabled across the switches. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Schaefer, Micah Sent: Thursday, June 12, 2014 1:20 AM To: linux clustering Subject: Re: [Linux-cluster] Node is randomly fenced Okay, I set up active/ backup bonding and will watch for any change. This is the network side: 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored 0 output errors, 0 collisions, 0 interface resets This is the server side: em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 GiB) Interrupt:34 Memory:d5000000-d57fffff I need to run some fiber, but for now two nodes are plugged into one switch and the other two nodes into a separate switch that are on the same subnet. I'll work on cross connecting the bonded interfaces to different switches. On 6/11/14, 3:28 PM, "Digimer" wrote: >The first thing I would do is get a second NIC and configure >active-passive bonding. network issues are too common to ignore in HA >setups. Ideally, I would span the links across separate stacked switches. > >As for debugging the issue, I can only recommend to look closely at the >system and switch logs for clues. > >On 11/06/14 02:55 PM, Schaefer, Micah wrote: >> I have the issue on two of my nodes. Each node has 1ea 10gb connection. >>No >> bonding, single link. What else can I look at? I manage the network >>too. 
I don?t see any link down notifications, don?t see any errors on >>the ports. >> >> >> >> >> On 6/11/14, 2:29 PM, "Digimer" wrote: >> >>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>> It failed again, even after deleting all the other failover domains. >>>> >>>> Cluster conf >>>> http://pastebin.com/jUXkwKS4 >>>> >>>> I turned corosync output to debug. How can I go about >>>>troubleshooting if it really is a network issue or something else? >>>> >>>> >>>> >>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 >>>> 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>> configuration. >>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 >>>> corosync [TOTEM ] A processor joined or left the membership and a >>>> new membership was formed. >>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>> ip(10.70.100.101) ; members(old:4 left:1) >>> >>> This is, to me, *strongly* indicative of a network issue. It's not >>> likely switch-wide as only one member was lost, but I would >>> certainly put my money on a network problem somewhere, some how. >>> >>> Do you use bonding? >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ What if the cure for >>> cancer is trapped in the mind of a person without access to >>> education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer >is trapped in the mind of a person without access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 04:19:50 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 00:19:50 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> Message-ID: <53992A66.4070109@alteeve.ca> I considered that, but I would expect more nodes to be lost. On 12/06/14 12:12 AM, Netravali, Ganesh wrote: > Make sure multicast is enabled across the switches. > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Schaefer, Micah > Sent: Thursday, June 12, 2014 1:20 AM > To: linux clustering > Subject: Re: [Linux-cluster] Node is randomly fenced > > Okay, I set up active/ backup bonding and will watch for any change. 
> > This is the network side: > 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored > 0 output errors, 0 collisions, 0 interface resets > > > > This is the server side: > > em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD > inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 > inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 > TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 GiB) > Interrupt:34 Memory:d5000000-d57fffff > > > > I need to run some fiber, but for now two nodes are plugged into one switch and the other two nodes into a separate switch that are on the same subnet. I'll work on cross connecting the bonded interfaces to different switches. > > > > On 6/11/14, 3:28 PM, "Digimer" wrote: > >> The first thing I would do is get a second NIC and configure >> active-passive bonding. network issues are too common to ignore in HA >> setups. Ideally, I would span the links across separate stacked switches. >> >> As for debugging the issue, I can only recommend to look closely at the >> system and switch logs for clues. >> >> On 11/06/14 02:55 PM, Schaefer, Micah wrote: >>> I have the issue on two of my nodes. Each node has 1ea 10gb connection. >>> No >>> bonding, single link. What else can I look at? I manage the network >>> too. I don?t see any link down notifications, don?t see any errors on >>> the ports. >>> >>> >>> >>> >>> On 6/11/14, 2:29 PM, "Digimer" wrote: >>> >>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>>> It failed again, even after deleting all the other failover domains. >>>>> >>>>> Cluster conf >>>>> http://pastebin.com/jUXkwKS4 >>>>> >>>>> I turned corosync output to debug. How can I go about >>>>> troubleshooting if it really is a network issue or something else? >>>>> >>>>> >>>>> >>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 >>>>> 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>>> configuration. >>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 >>>>> corosync [TOTEM ] A processor joined or left the membership and a >>>>> new membership was formed. >>>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>>> ip(10.70.100.101) ; members(old:4 left:1) >>>> >>>> This is, to me, *strongly* indicative of a network issue. It's not >>>> likely switch-wide as only one member was lost, but I would >>>> certainly put my money on a network problem somewhere, some how. >>>> >>>> Do you use bonding? >>>> >>>> -- >>>> Digimer >>>> Papers and Projects: https://alteeve.ca/w/ What if the cure for >>>> cancer is trapped in the mind of a person without access to >>>> education? >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >> >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer >> is trapped in the mind of a person without access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
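For completeness, the active/backup bond being described in this thread is the standard RHEL 6 sysconfig setup, roughly like the sketch below; the device names and the redacted address are examples rather than the actual config from these nodes:

  /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    ONBOOT=yes
    BOOTPROTO=none
    IPADDR=x.x.x.x
    NETMASK=255.255.255.0
    # mode=1 is active-backup; miimon=100 polls link state every 100 ms
    BONDING_OPTS="mode=1 miimon=100"

  /etc/sysconfig/network-scripts/ifcfg-em1 (and the same for the second slave NIC)
    DEVICE=em1
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none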
From lzhong at suse.com Thu Jun 12 06:42:58 2014 From: lzhong at suse.com (Lidong Zhong) Date: Thu, 12 Jun 2014 14:42:58 +0800 Subject: [Linux-cluster] [RFC] dlm: keep listening connection alive with sctp mode Message-ID: <1402555378-5220-1-git-send-email-lzhong@suse.com> Currently, when a node closes a connection, it sends a user-initiated ABORT instead of shutting down gracefully (ece35848c184). Sadly it can also close the listening connection, so the node will fail to rejoin the cluster. I set up a two-node cluster to test this. While the cluster is working fine, the connections look like this: clt-n2-sles12b7-2:~ # netstat -apn|grep sctp sctp 147.2.208.197:21064 LISTEN - sctp 0 4 0.0.82.72:62887 147.2.208.197:21064 ESTABLISHED - and if I reboot the other node or stop dlm, all the connections are lost: clt-n2-sles12b7-2:~ # netstat -apn | grep sctp clt-n2-sles12b7-2:~ # so when the other node tries to rejoin the cluster, the following messages flood the log because there is no listening port any more. dlm: Trying to connect to 192.168.3.4 dlm: Can't start SCTP association - retrying dlm: Retry sending 64 bytes to node id 318951621 dlm: Retrying SCTP association init for node 318951621 Signed-off-by: Lidong Zhong --- fs/dlm/lowcomms.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c index 1e5b453..d08e079 100644 --- a/fs/dlm/lowcomms.c +++ b/fs/dlm/lowcomms.c @@ -617,6 +617,11 @@ static void retry_failed_sctp_send(struct connection *recv_con, int nodeid = sn_send_failed->ssf_info.sinfo_ppid; log_print("Retry sending %d bytes to node id %d", len, nodeid); + + if (!nodeid) { + log_print("Shouldn't resend data via listening connection."); + return; + } con = nodeid2con(nodeid, 0); if (!con) { -- 1.8.1.4 From rpeterso at redhat.com Thu Jun 12 12:29:44 2014 From: rpeterso at redhat.com (Bob Peterson) Date: Thu, 12 Jun 2014 08:29:44 -0400 (EDT) Subject: [Linux-cluster] [RFC] dlm: keep listening connection alive with sctp mode In-Reply-To: <1402555378-5220-1-git-send-email-lzhong@suse.com> References: <1402555378-5220-1-git-send-email-lzhong@suse.com> Message-ID: <742486000.20595916.1402576184717.JavaMail.zimbra@redhat.com> ----- Original Message ----- (snip) > Signed-off-by: Lidong Zhong Hi Lidong, There is a special public mailing list for patches like this and other cluster-related development. The mailing list is called cluster-devel. Here is a link where you can subscribe to it: https://www.redhat.com/mailman/listinfo/cluster-devel I recommend you send your patch to cluster-devel at redhat.com. Regards, Bob Peterson Red Hat File Systems From arun.nair at dimensiondata.com Thu Jun 12 14:29:06 2014 From: arun.nair at dimensiondata.com (Arun G Nair) Date: Thu, 12 Jun 2014 19:59:06 +0530 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: <53986FD4.6010902@alteeve.ca> References: <53986FD4.6010902@alteeve.ca> Message-ID: We have multicast enabled on the switch. I've also tried the multicast.py tool from RH's knowledge base to test multicast and I see the expected output, though the tool uses a different multicast IP (I guess that shouldn't matter). I've tried increasing the post_join_delay to 360 seconds to give me enough time to check everything on both the nodes. One node still gets fenced. `clustat` output says the other node is offline on both servers. So one node can't see the other one ? This again points to an issue with multicast. Any other clues as to what/where to look ?
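(If it matters, I can run something like the following on both nodes at the same time and post the output; omping would need to be installed, and the interface name here is just an example:

  # watch the totem traffic itself -- packets from both nodes should show up
  tcpdump -ni eth0 udp port 5404 or udp port 5405

  # omping exercises both the multicast and the unicast path between the listed hosts
  omping node1 node2

  # and check that iptables isn't quietly dropping the cluster traffic
  iptables -L -n -v | grep -iE '5404|5405|igmp'

)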
On Wed, Jun 11, 2014 at 8:33 PM, Digimer wrote: > On 11/06/14 10:48 AM, Arun G Nair wrote: > >> Hello, >> >> What are the reasons for fence loops when only cman is started ? We >> have an RHEL 6.5 2-node cluster which goes in to a fence loop and every >> time we start cman on both nodes. Either one fences the other. Multicast >> seems to be working properly. My understanding is that without rgmanager >> running there won't be a multicast group subscription ? I don't see the >> multicast address in 'netstat -g' unless rgmanager is running. I've >> tried to increase the fence post_join_delay but one of the nodes still >> gets fenced. >> >> The cluster works fine if we use unicast UDP. >> >> Thanks, >> > > Hi, > > When cman starts, it waits post_join_delay seconds for the peer to > connect. If, after that time expires (6 seconds by default, iirc), it gives > up and calls a fence against the peer to put it into a known state. > > Corosync is what determines membership, and it is started by cman. The > rgmanager only handles resource start/stop/relocate/recovery and has > nothing to do with fencing directly. Corosync is what uses multicast. > > So as you seem to have already surmised, multicast is probably not > working in your environment. Have you enabled multicast traffic on the > firewall? Do your switches support multicast properly? > > digimer > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Arun G Nair Sr. Sysadmin Dimension Data | Ph: (800) 664-9973 Feedback? We're listening -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkovachev at varna.net Thu Jun 12 14:43:06 2014 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Thu, 12 Jun 2014 17:43:06 +0300 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: <53986FD4.6010902@alteeve.ca> Message-ID: Do you have a different auth key on each node by any chance? On 2014-06-12 17:29, Arun G Nair wrote: > We have multicast enabled on the switch. I've also tried the > multicast.py tool from RH's knowledge base to test multicast and I see > the expected output, though the tool uses a different multicast IP( > guess that shouldn't matter). I've tried increasing the post_join_delay > to 360 seconds to give me enough time to check everything on both the > nodes. One node still gets fenced. `clustat` output says the other node > is offline on both servers. So one node can't see the other one ? This > again points to issue with multicast. Any other clues as to what/where > to look ? > > On Wed, Jun 11, 2014 at 8:33 PM, Digimer wrote: > > On 11/06/14 10:48 AM, Arun G Nair wrote: > Hello, > > What are the reasons for fence loops when only cman is started ? We > have an RHEL 6.5 2-node cluster which goes in to a fence loop and every > time we start cman on both nodes. Either one fences the other. > Multicast > seems to be working properly. My understanding is that without > rgmanager > running there won't be a multicast group subscription ? I don't see the > multicast address in 'netstat -g' unless rgmanager is running. I've > tried to increase the fence post_join_delay but one of the nodes still > gets fenced. > > The cluster works fine if we use unicast UDP. > > Thanks, Hi, > > When cman starts, it waits post_join_delay seconds for the peer to > connect. 
If, after that time expires (6 seconds by default, iirc), it > gives up and calls a fence against the peer to put it into a known > state. > > Corosync is what determines membership, and it is started by cman. The > rgmanager only handles resource start/stop/relocate/recovery and has > nothing to do with fencing directly. Corosync is what uses multicast. > > So as you seem to have already surmised, multicast is probably not > working in your environment. Have you enabled multicast traffic on the > firewall? Do your switches support multicast properly? > > digimer > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ [1] > What if the cure for cancer is trapped in the mind of a person without > access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster [2] -- Arun G Nair Sr. Sysadmin Dimension Data | Ph: (800) 664-9973 Feedback? We're listening [3] Links: ------ [1] https://alteeve.ca/w/ [2] https://www.redhat.com/mailman/listinfo/linux-cluster [3] http://www.surveymonkey.com/s/XRCYXBH From emi2fast at gmail.com Thu Jun 12 15:05:03 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Thu, 12 Jun 2014 17:05:03 +0200 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: <53986FD4.6010902@alteeve.ca> Message-ID: I always used "tcpdump -ni bond1 port 5405" to check if both nodes are involved in the comunication, if isn't like that, that would say is multicast problem 2014-06-12 16:43 GMT+02:00 Kaloyan Kovachev : > Do you have a different auth key on each node by any chance? > > > On 2014-06-12 17:29, Arun G Nair wrote: > >> We have multicast enabled on the switch. I've also tried the multicast.py >> tool from RH's knowledge base to test multicast and I see the expected >> output, though the tool uses a different multicast IP( guess that shouldn't >> matter). I've tried increasing the post_join_delay to 360 seconds to give me >> enough time to check everything on both the nodes. One node still gets >> fenced. `clustat` output says the other node is offline on both servers. So >> one node can't see the other one ? This again points to issue with >> multicast. Any other clues as to what/where to look ? >> >> On Wed, Jun 11, 2014 at 8:33 PM, Digimer wrote: >> >> On 11/06/14 10:48 AM, Arun G Nair wrote: >> Hello, >> >> What are the reasons for fence loops when only cman is started ? We >> have an RHEL 6.5 2-node cluster which goes in to a fence loop and every >> time we start cman on both nodes. Either one fences the other. Multicast >> seems to be working properly. My understanding is that without rgmanager >> running there won't be a multicast group subscription ? I don't see the >> multicast address in 'netstat -g' unless rgmanager is running. I've >> tried to increase the fence post_join_delay but one of the nodes still >> gets fenced. >> >> The cluster works fine if we use unicast UDP. >> >> Thanks, Hi, >> >> When cman starts, it waits post_join_delay seconds for the peer to >> connect. If, after that time expires (6 seconds by default, iirc), it gives >> up and calls a fence against the peer to put it into a known state. >> >> Corosync is what determines membership, and it is started by cman. The >> rgmanager only handles resource start/stop/relocate/recovery and has nothing >> to do with fencing directly. Corosync is what uses multicast. >> >> So as you seem to have already surmised, multicast is probably not working >> in your environment. 
Have you enabled multicast traffic on the firewall? Do >> your switches support multicast properly? >> >> digimer >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ [1] >> >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster [2] > > > -- > Arun G Nair > Sr. Sysadmin > Dimension Data | Ph: (800) 664-9973 > Feedback? We're listening [3] > > > > Links: > ------ > [1] https://alteeve.ca/w/ > [2] https://www.redhat.com/mailman/listinfo/linux-cluster > [3] http://www.surveymonkey.com/s/XRCYXBH > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- esta es mi vida e me la vivo hasta que dios quiera From Micah.Schaefer at jhuapl.edu Thu Jun 12 15:32:57 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 11:32:57 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <53992A66.4070109@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> Message-ID: Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and fenced, then node3 was fenced when node4 came back online. The network topology is as follows: switch1: node1, node3 (two connections) switch2: node2, node4 (two connections) switch1 switch2 All on the same subnet I set up monitoring at 100 millisecond of the nics in active-backup mode, and saw no messages about link problems before the fence. I see multicast between the servers using tcpdump. Any more ideas? On 6/12/14, 12:19 AM, "Digimer" wrote: >I considered that, but I would expect more nodes to be lost. > >On 12/06/14 12:12 AM, Netravali, Ganesh wrote: >> Make sure multicast is enabled across the switches. >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com >>[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Schaefer, Micah >> Sent: Thursday, June 12, 2014 1:20 AM >> To: linux clustering >> Subject: Re: [Linux-cluster] Node is randomly fenced >> >> Okay, I set up active/ backup bonding and will watch for any change. >> >> This is the network side: >> 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored >> 0 output errors, 0 collisions, 0 interface resets >> >> >> >> This is the server side: >> >> em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD >> inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 >> inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link >> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 >> RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:1000 >> RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 >>GiB) >> Interrupt:34 Memory:d5000000-d57fffff >> >> >> >> I need to run some fiber, but for now two nodes are plugged into one >>switch and the other two nodes into a separate switch that are on the >>same subnet. I'll work on cross connecting the bonded interfaces to >>different switches. >> >> >> >> On 6/11/14, 3:28 PM, "Digimer" wrote: >> >>> The first thing I would do is get a second NIC and configure >>> active-passive bonding. network issues are too common to ignore in HA >>> setups. Ideally, I would span the links across separate stacked >>>switches. 
>>> >>> As for debugging the issue, I can only recommend to look closely at the >>> system and switch logs for clues. >>> >>> On 11/06/14 02:55 PM, Schaefer, Micah wrote: >>>> I have the issue on two of my nodes. Each node has 1ea 10gb >>>>connection. >>>> No >>>> bonding, single link. What else can I look at? I manage the network >>>> too. I don?t see any link down notifications, don?t see any errors on >>>> the ports. >>>> >>>> >>>> >>>> >>>> On 6/11/14, 2:29 PM, "Digimer" wrote: >>>> >>>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>>>> It failed again, even after deleting all the other failover domains. >>>>>> >>>>>> Cluster conf >>>>>> http://pastebin.com/jUXkwKS4 >>>>>> >>>>>> I turned corosync output to debug. How can I go about >>>>>> troubleshooting if it really is a network issue or something else? >>>>>> >>>>>> >>>>>> >>>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 >>>>>> 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>>>> configuration. >>>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 >>>>>> corosync [TOTEM ] A processor joined or left the membership and a >>>>>> new membership was formed. >>>>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>>>> ip(10.70.100.101) ; members(old:4 left:1) >>>>> >>>>> This is, to me, *strongly* indicative of a network issue. It's not >>>>> likely switch-wide as only one member was lost, but I would >>>>> certainly put my money on a network problem somewhere, some how. >>>>> >>>>> Do you use bonding? >>>>> >>>>> -- >>>>> Digimer >>>>> Papers and Projects: https://alteeve.ca/w/ What if the cure for >>>>> cancer is trapped in the mind of a person without access to >>>>> education? >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>> >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer >>> is trapped in the mind of a person without access to education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 16:25:14 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 12:25:14 -0400 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: References: <53986FD4.6010902@alteeve.ca> Message-ID: <5399D46A.6080205@alteeve.ca> Have you tried simple things like disabling iptables or selinux, just to test? If that doesn't work, and it's a small cluster, try unicast and see if that helps (again, even if just to test). On 12/06/14 10:29 AM, Arun G Nair wrote: > We have multicast enabled on the switch. I've also tried the > multicast.py tool from RH's knowledge base to test multicast and I see > the expected output, though the tool uses a different multicast IP( > guess that shouldn't matter). I've tried increasing the post_join_delay > to 360 seconds to give me enough time to check everything on both the > nodes. One node still gets fenced. `clustat` output says the other node > is offline on both servers. So one node can't see the other one ? This > again points to issue with multicast. Any other clues as to what/where > to look ? 
> > > On Wed, Jun 11, 2014 at 8:33 PM, Digimer > wrote: > > On 11/06/14 10:48 AM, Arun G Nair wrote: > > Hello, > > What are the reasons for fence loops when only cman is > started ? We > have an RHEL 6.5 2-node cluster which goes in to a fence loop > and every > time we start cman on both nodes. Either one fences the other. > Multicast > seems to be working properly. My understanding is that without > rgmanager > running there won't be a multicast group subscription ? I don't > see the > multicast address in 'netstat -g' unless rgmanager is running. I've > tried to increase the fence post_join_delay but one of the nodes > still > gets fenced. > > The cluster works fine if we use unicast UDP. > > Thanks, > > > Hi, > > When cman starts, it waits post_join_delay seconds for the peer > to connect. If, after that time expires (6 seconds by default, > iirc), it gives up and calls a fence against the peer to put it into > a known state. > > Corosync is what determines membership, and it is started by > cman. The rgmanager only handles resource > start/stop/relocate/recovery and has nothing to do with fencing > directly. Corosync is what uses multicast. > > So as you seem to have already surmised, multicast is probably > not working in your environment. Have you enabled multicast traffic > on the firewall? Do your switches support multicast properly? > > digimer > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person > without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/__mailman/listinfo/linux-cluster > > > > > > -- > Arun G Nair > Sr. Sysadmin > Dimension Data | Ph: (800) 664-9973 > Feedback? We're listening > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From lists at alteeve.ca Thu Jun 12 16:31:43 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 12:31:43 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> Message-ID: <5399D5EF.9050605@alteeve.ca> To confirm; Have you tried with the bonds setup where each node has one link into either switch? I just want to be sure you've ruled out all the network hardware. Also please confirm that you used mode=1 (active-passive) bonding. Assuming this doesn't help, then I would say that I was wrong in assuming it was network related. The next thing I would look at is corosync. Do you see any messages about totem retransmit? On 12/06/14 11:32 AM, Schaefer, Micah wrote: > Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and > fenced, then node3 was fenced when node4 came back online. The network > topology is as follows: > switch1: node1, node3 (two connections) > switch2: node2, node4 (two connections) > switch1 switch2 > All on the same subnet > > I set up monitoring at 100 millisecond of the nics in active-backup mode, > and saw no messages about link problems before the fence. > > I see multicast between the servers using tcpdump. > > > Any more ideas? > > > > > > On 6/12/14, 12:19 AM, "Digimer" wrote: > >> I considered that, but I would expect more nodes to be lost. >> >> On 12/06/14 12:12 AM, Netravali, Ganesh wrote: >>> Make sure multicast is enabled across the switches. 
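Verifying that multicast really flows between the cluster interfaces, across both switches, is reasonably quick with omping, which is packaged for RHEL 6; run it on every node at the same time, listing each node's cluster address (addresses on this cluster's 10.70.100.x subnet are used purely for illustration):

  # run on all four nodes simultaneously
  omping 10.70.100.101 10.70.100.102 10.70.100.103 10.70.100.104

If the multicast lines show heavy or total loss while the unicast lines are clean, the usual culprit is IGMP snooping on the switches without a working querier, which would match a cluster that only behaves once it is moved to unicast. While it runs, corosync's own traffic (UDP ports 5404-5405 by default) can also be watched with tcpdump to confirm nothing is being filtered on the hosts.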
>>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Schaefer, Micah >>> Sent: Thursday, June 12, 2014 1:20 AM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] Node is randomly fenced >>> >>> Okay, I set up active/ backup bonding and will watch for any change. >>> >>> This is the network side: >>> 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored >>> 0 output errors, 0 collisions, 0 interface resets >>> >>> >>> >>> This is the server side: >>> >>> em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD >>> inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 >>> inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link >>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 >>> RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 >>> TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 >>> collisions:0 txqueuelen:1000 >>> RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 >>> GiB) >>> Interrupt:34 Memory:d5000000-d57fffff >>> >>> >>> >>> I need to run some fiber, but for now two nodes are plugged into one >>> switch and the other two nodes into a separate switch that are on the >>> same subnet. I'll work on cross connecting the bonded interfaces to >>> different switches. >>> >>> >>> >>> On 6/11/14, 3:28 PM, "Digimer" wrote: >>> >>>> The first thing I would do is get a second NIC and configure >>>> active-passive bonding. network issues are too common to ignore in HA >>>> setups. Ideally, I would span the links across separate stacked >>>> switches. >>>> >>>> As for debugging the issue, I can only recommend to look closely at the >>>> system and switch logs for clues. >>>> >>>> On 11/06/14 02:55 PM, Schaefer, Micah wrote: >>>>> I have the issue on two of my nodes. Each node has 1ea 10gb >>>>> connection. >>>>> No >>>>> bonding, single link. What else can I look at? I manage the network >>>>> too. I don?t see any link down notifications, don?t see any errors on >>>>> the ports. >>>>> >>>>> >>>>> >>>>> >>>>> On 6/11/14, 2:29 PM, "Digimer" wrote: >>>>> >>>>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote: >>>>>>> It failed again, even after deleting all the other failover domains. >>>>>>> >>>>>>> Cluster conf >>>>>>> http://pastebin.com/jUXkwKS4 >>>>>>> >>>>>>> I turned corosync output to debug. How can I go about >>>>>>> troubleshooting if it really is a network issue or something else? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 >>>>>>> 14:10:17 corosync [TOTEM ] A processor failed, forming new >>>>>>> configuration. >>>>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 >>>>>>> corosync [TOTEM ] A processor joined or left the membership and a >>>>>>> new membership was formed. >>>>>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) >>>>>>> ip(10.70.100.101) ; members(old:4 left:1) >>>>>> >>>>>> This is, to me, *strongly* indicative of a network issue. It's not >>>>>> likely switch-wide as only one member was lost, but I would >>>>>> certainly put my money on a network problem somewhere, some how. >>>>>> >>>>>> Do you use bonding? >>>>>> >>>>>> -- >>>>>> Digimer >>>>>> Papers and Projects: https://alteeve.ca/w/ What if the cure for >>>>>> cancer is trapped in the mind of a person without access to >>>>>> education? 
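Once a bond like the one being set up above is in place, it is worth confirming from the OS side that the mode and failover behave before trusting it under corosync; a quick check, assuming the bond is called bond0 with slaves em1/em2:

  cat /proc/net/bonding/bond0
  # expect "Bonding Mode: fault-tolerance (active-backup)", miimon of 100,
  # both slaves listed with "MII Status: up", and a "Currently Active Slave"

  # force a failover to the standby slave and watch /var/log/messages for the switchover
  echo em2 > /sys/class/net/bond0/bonding/active_slave

If a forced failover is enough to make corosync report a lost token, that points back at the switch port the traffic moves to (MAC learning, STP state on that port) rather than at the host itself.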
>>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>>> >>>> >>>> >>>> -- >>>> Digimer >>>> Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer >>>> is trapped in the mind of a person without access to education? >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >> >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From yvette at dbtgroup.com Thu Jun 12 16:33:17 2014 From: yvette at dbtgroup.com (yvette hirth) Date: Thu, 12 Jun 2014 09:33:17 -0700 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> Message-ID: <5399D64D.8080301@dbtgroup.com> On 06/12/2014 08:32 AM, Schaefer, Micah wrote: > Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and > fenced, then node3 was fenced when node4 came back online. The network > topology is as follows: > switch1: node1, node3 (two connections) > switch2: node2, node4 (two connections) > switch1 switch2 > All on the same subnet > > I set up monitoring at 100 millisecond of the nics in active-backup mode, > and saw no messages about link problems before the fence. > > I see multicast between the servers using tcpdump. > > Any more ideas? spanning-tree scans/rebuilds happen on 10Gb circuits just like they do on 1Gb circuits, and when they happen, traffic on the switches *can* come to a grinding halt, depending upon the switch firmware and the type of spanning-tree scan/rebuild being done. you may want to check your switch logs to see if any spanning-tree rebuilds were being done at the time of the fence. just an idea, and hth yvette hirth From lists at alteeve.ca Thu Jun 12 16:36:12 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 12:36:12 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399D64D.8080301@dbtgroup.com> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> Message-ID: <5399D6FC.8030800@alteeve.ca> On 12/06/14 12:33 PM, yvette hirth wrote: > On 06/12/2014 08:32 AM, Schaefer, Micah wrote: > >> Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and >> fenced, then node3 was fenced when node4 came back online. The network >> topology is as follows: >> switch1: node1, node3 (two connections) >> switch2: node2, node4 (two connections) >> switch1 switch2 >> All on the same subnet >> >> I set up monitoring at 100 millisecond of the nics in active-backup mode, >> and saw no messages about link problems before the fence. >> >> I see multicast between the servers using tcpdump. >> >> Any more ideas? 
> > spanning-tree scans/rebuilds happen on 10Gb circuits just like they do > on 1Gb circuits, and when they happen, traffic on the switches *can* > come to a grinding halt, depending upon the switch firmware and the type > of spanning-tree scan/rebuild being done. > > you may want to check your switch logs to see if any spanning-tree > rebuilds were being done at the time of the fence. > > just an idea, and hth > yvette hirth When I've seen this (I now disable STP entirely), it blocks all traffic so I would expect multiple/all nodes to partition off on their own. Still, worth looking into. :) -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Thu Jun 12 16:48:17 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 12:48:17 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399D6FC.8030800@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> Message-ID: This is all I see for TOTEM from node1 Jun 12 11:07:10 corosync [TOTEM ] A processor failed, forming new configuration. Jun 12 11:07:22 corosync [QUORUM] Members[3]: 1 2 3 Jun 12 11:07:22 corosync [TOTEM ] A processor joined or left the membership" and a new membership was formed. Jun 12 11:07:22 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) Jun 12 11:07:22 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:10:49 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:10:49 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:10:49 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:11:02 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:11:02 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:11:06 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:11:06 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 12 11:11:06 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 12 11:11:06 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:11:06 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:11:35 corosync [TOTEM ] A processor failed, forming new configuration. Jun 12 11:11:47 corosync [QUORUM] Members[3]: 1 2 4 Jun 12 11:11:47 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:11:47 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) Jun 12 11:11:47 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:15:18 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:15:18 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:15:18 corosync [MAIN ] Completed service synchronization, ready to provide service. 
Jun 12 11:15:31 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:15:31 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:15:31 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 11:15:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 11:15:33 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 12 11:15:33 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 12 11:15:33 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0) Jun 12 11:15:33 corosync [MAIN ] Completed service synchronization, ready to provide service. Jun 12 12:36:20 corosync [QUORUM] Members[4]: 1 2 3 4 As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning tree changes are happening and all the ports have port-fast enabled for these servers. My switch logging level is very high and I have no messages in relation to the time frames or ports. TOTEM reports that ?A processor joined or left the membership??, but that isn?t enough detail. Also note that I did not have these issues until adding new servers: node3 and node4 to the cluster. Node1 and node2 do not fence each other (unless a real issue is there), and they are on different switches. On 6/12/14, 12:36 PM, "Digimer" wrote: >On 12/06/14 12:33 PM, yvette hirth wrote: >> On 06/12/2014 08:32 AM, Schaefer, Micah wrote: >> >>> Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and >>> fenced, then node3 was fenced when node4 came back online. The network >>> topology is as follows: >>> switch1: node1, node3 (two connections) >>> switch2: node2, node4 (two connections) >>> switch1 switch2 >>> All on the same subnet >>> >>> I set up monitoring at 100 millisecond of the nics in active-backup >>>mode, >>> and saw no messages about link problems before the fence. >>> >>> I see multicast between the servers using tcpdump. >>> >>> Any more ideas? >> >> spanning-tree scans/rebuilds happen on 10Gb circuits just like they do >> on 1Gb circuits, and when they happen, traffic on the switches *can* >> come to a grinding halt, depending upon the switch firmware and the type >> of spanning-tree scan/rebuild being done. >> >> you may want to check your switch logs to see if any spanning-tree >> rebuilds were being done at the time of the fence. >> >> just an idea, and hth >> yvette hirth > >When I've seen this (I now disable STP entirely), it blocks all traffic >so I would expect multiple/all nodes to partition off on their own. >Still, worth looking into. :) > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? 
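For anyone who wants to double-check the spanning-tree angle from the Catalyst side, something like the following IOS commands (the interface name is a placeholder) show whether a topology change lines up with a fence:

  show spanning-tree summary
  show spanning-tree detail | include ieee|occurr|from
  show spanning-tree interface TenGigabitEthernet1/1 detail

The second command prints, per VLAN, how many topology changes have occurred, when the last one happened and from which port; the third confirms the host-facing port really is in portfast mode. If the last topology change is nowhere near the fence timestamps, STP can reasonably be ruled out.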
> >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 17:08:07 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 13:08:07 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> Message-ID: <5399DE77.1030302@alteeve.ca> On 12/06/14 12:48 PM, Schaefer, Micah wrote: > As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning > tree changes are happening and all the ports have port-fast enabled for > these servers. My switch logging level is very high and I have no messages > in relation to the time frames or ports. > > TOTEM reports that ?A processor joined or left the membership??, but that > isn?t enough detail. > > Also note that I did not have these issues until adding new servers: node3 > and node4 to the cluster. Node1 and node2 do not fence each other (unless > a real issue is there), and they are on different switches. Then I can't imagine it being network anymore. Seeing as both node 3 and 4 get fenced, it's likely not hardware either. Are the workloads on 3 and 4 much higher (or are the computers much slower) than 1 and 2? I'm wondering if the nodes are simply not keeping up with corosync traffic. You might try adjusting the corosync token timeout and retransmit counts to see if that reduces the node loses. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From Micah.Schaefer at jhuapl.edu Thu Jun 12 17:24:03 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 13:24:03 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399DE77.1030302@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> Message-ID: The servers do not run any tasks other than the tasks in the cluster service group. Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 and 2 are virtual machines with much less resources available. I adjusted the token settings and will watch for any change. On 6/12/14, 1:08 PM, "Digimer" wrote: >On 12/06/14 12:48 PM, Schaefer, Micah wrote: >> As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning >> tree changes are happening and all the ports have port-fast enabled for >> these servers. My switch logging level is very high and I have no >>messages >> in relation to the time frames or ports. >> >> TOTEM reports that ?A processor joined or left the membership??, but >>that >> isn?t enough detail. >> >> Also note that I did not have these issues until adding new servers: >>node3 >> and node4 to the cluster. Node1 and node2 do not fence each other >>(unless >> a real issue is there), and they are on different switches. > >Then I can't imagine it being network anymore. Seeing as both node 3 and >4 get fenced, it's likely not hardware either. Are the workloads on 3 >and 4 much higher (or are the computers much slower) than 1 and 2? 
I'm >wondering if the nodes are simply not keeping up with corosync traffic. >You might try adjusting the corosync token timeout and retransmit counts >to see if that reduces the node loses. > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 17:29:53 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 13:29:53 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> Message-ID: <5399E391.3060701@alteeve.ca> Even if the token changes stop the immediate fencing, don't leave it please. There is something fundamentally wrong that you need to identify/fix. Keep us posted! On 12/06/14 01:24 PM, Schaefer, Micah wrote: > The servers do not run any tasks other than the tasks in the cluster > service group. > > Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 > and 2 are virtual machines with much less resources available. > > I adjusted the token settings and will watch for any change. > > > > > > > > > On 6/12/14, 1:08 PM, "Digimer" wrote: > >> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>> As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning >>> tree changes are happening and all the ports have port-fast enabled for >>> these servers. My switch logging level is very high and I have no >>> messages >>> in relation to the time frames or ports. >>> >>> TOTEM reports that ?A processor joined or left the membership??, but >>> that >>> isn?t enough detail. >>> >>> Also note that I did not have these issues until adding new servers: >>> node3 >>> and node4 to the cluster. Node1 and node2 do not fence each other >>> (unless >>> a real issue is there), and they are on different switches. >> >> Then I can't imagine it being network anymore. Seeing as both node 3 and >> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >> and 4 much higher (or are the computers much slower) than 1 and 2? I'm >> wondering if the nodes are simply not keeping up with corosync traffic. >> You might try adjusting the corosync token timeout and retransmit counts >> to see if that reduces the node loses. >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
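For reference, on a cman cluster the token settings mentioned above are carried in cluster.conf rather than a hand-edited corosync.conf; a sketch with purely illustrative numbers:

  <cluster name="example" config_version="3">
    <!-- raise the totem token timeout (ms) and the retransmit count before a node is declared lost -->
    <totem token="30000" token_retransmits_before_loss_const="10"/>
    <!-- clusternodes, fencedevices, rm sections as before -->
  </cluster>

cman's default token on RHEL 6 is 10000 ms, so a member already has to stay silent for ten seconds (plus the consensus window) before "A processor failed, forming new configuration" is logged and fencing follows; raising it further only hides whatever is stalling the node, which is the point being made above.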
From Micah.Schaefer at jhuapl.edu Thu Jun 12 17:55:35 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 13:55:35 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399E391.3060701@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> Message-ID: I just found that the clock on node1 was off by about a minute and a half compared to the rest of the nodes. I am running ntp, so not sure why the time wasn?t synced up. Wonder if node1 being behind, would think it was not receiving updates from the other nodes? On 6/12/14, 1:29 PM, "Digimer" wrote: >Even if the token changes stop the immediate fencing, don't leave it >please. There is something fundamentally wrong that you need to >identify/fix. > >Keep us posted! > >On 12/06/14 01:24 PM, Schaefer, Micah wrote: >> The servers do not run any tasks other than the tasks in the cluster >> service group. >> >> Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 >> and 2 are virtual machines with much less resources available. >> >> I adjusted the token settings and will watch for any change. >> >> >> >> >> >> >> >> >> On 6/12/14, 1:08 PM, "Digimer" wrote: >> >>> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>>> As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning >>>> tree changes are happening and all the ports have port-fast enabled >>>>for >>>> these servers. My switch logging level is very high and I have no >>>> messages >>>> in relation to the time frames or ports. >>>> >>>> TOTEM reports that ?A processor joined or left the membership??, but >>>> that >>>> isn?t enough detail. >>>> >>>> Also note that I did not have these issues until adding new servers: >>>> node3 >>>> and node4 to the cluster. Node1 and node2 do not fence each other >>>> (unless >>>> a real issue is there), and they are on different switches. >>> >>> Then I can't imagine it being network anymore. Seeing as both node 3 >>>and >>> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >>> and 4 much higher (or are the computers much slower) than 1 and 2? I'm >>> wondering if the nodes are simply not keeping up with corosync traffic. >>> You might try adjusting the corosync token timeout and retransmit >>>counts >>> to see if that reduces the node loses. >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ >>> What if the cure for cancer is trapped in the mind of a person without >>> access to education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > >-- >Digimer >Papers and Projects: https://alteeve.ca/w/ >What if the cure for cancer is trapped in the mind of a person without >access to education? 
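A couple of quick checks usually show why ntpd is letting a clock sit that far out (a 90 second offset is well beyond anything it will quietly slew away):

  ntpq -p              # '*' marks the selected peer; reach 377 means all recent polls answered,
                       # offset is reported in milliseconds
  ntpstat              # one-line summary of sync state
  service ntpd status

ntpd slews small offsets, steps larger ones only after a hold-off, and refuses offsets above its 1000 second panic threshold unless started with -g, so the reach and offset columns in ntpq are the quickest way to see whether it is still syncing at all. On a virtual machine, a clock that keeps falling behind often means the guest is losing ticks or being descheduled on the host, which is worth checking independently of the cluster.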
> >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From Micah.Schaefer at jhuapl.edu Thu Jun 12 19:02:43 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 12 Jun 2014 15:02:43 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> Message-ID: Node4 was fenced again, I was able to get some debug logs (below), a new message : "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL state.? Rest of corosync logs http://pastebin.com/iYFbkbhb Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages. Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, flushing membership messages. 
Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33494 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, flushing membership messages. Jun 12 14:44:50 corosync [TOTEM ] got commit token Jun 12 14:44:50 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:50 corosync [TOTEM ] Storing new sequence id for ring 6324 Jun 12 14:44:50 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:50 corosync [TOTEM ] got commit token Jun 12 14:44:50 corosync [TOTEM ] entering RECOVERY state. Jun 12 14:44:50 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:50 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:50 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:50 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:50 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:50 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:50 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:50 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:50 corosync [TOTEM ] Did not need to originate any messages in recovery. 
Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:50 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:50 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:50 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:50 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:50 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, flushing membership messages. Jun 12 14:44:51 corosync [TOTEM ] got commit token Jun 12 14:44:51 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:51 corosync [TOTEM ] Storing new sequence id for ring 6328 Jun 12 14:44:51 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:51 corosync [TOTEM ] got commit token Jun 12 14:44:51 corosync [TOTEM ] entering RECOVERY state. 
Jun 12 14:44:51 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:51 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:51 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:51 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:51 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:51 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:51 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:51 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:51 corosync [TOTEM ] Did not need to originate any messages in recovery. Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:51 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:51 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:51 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:51 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:51 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, flushing membership messages. 
Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35455 ms, flushing membership messages. Jun 12 14:44:52 corosync [TOTEM ] got commit token Jun 12 14:44:52 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:52 corosync [TOTEM ] Storing new sequence id for ring 632c Jun 12 14:44:52 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:52 corosync [TOTEM ] got commit token Jun 12 14:44:52 corosync [TOTEM ] entering RECOVERY state. Jun 12 14:44:52 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:52 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:52 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:52 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:52 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:52 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:52 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:52 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:52 corosync [TOTEM ] Did not need to originate any messages in recovery. Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:52 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:52 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:52 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:52 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:52 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36223 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36224 ms, flushing membership messages. 
Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, flushing membership messages. Jun 12 14:44:53 corosync [TOTEM ] got commit token Jun 12 14:44:53 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:53 corosync [TOTEM ] Storing new sequence id for ring 6330 Jun 12 14:44:53 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:53 corosync [TOTEM ] got commit token Jun 12 14:44:53 corosync [TOTEM ] entering RECOVERY state. Jun 12 14:44:53 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:53 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:53 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:53 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:53 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:53 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:53 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:53 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:53 corosync [TOTEM ] Did not need to originate any messages in recovery. 
Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:53 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:53 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:53 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:53 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:53 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] got commit token Jun 12 14:44:54 corosync [TOTEM ] Saving state aru 86 high seq received 86 Jun 12 14:44:54 corosync [TOTEM ] Storing new sequence id for ring 6334 Jun 12 14:44:54 corosync [TOTEM ] entering COMMIT state. Jun 12 14:44:54 corosync [TOTEM ] got commit token Jun 12 14:44:54 corosync [TOTEM ] entering RECOVERY state. Jun 12 14:44:54 corosync [TOTEM ] TRANS [0] member 10.70.100.101: Jun 12 14:44:54 corosync [TOTEM ] TRANS [1] member 10.70.100.102: Jun 12 14:44:54 corosync [TOTEM ] TRANS [2] member 10.70.100.103: Jun 12 14:44:54 corosync [TOTEM ] TRANS [3] member 10.70.100.104: Jun 12 14:44:54 corosync [TOTEM ] position [0] member 10.70.100.101: Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:54 corosync [TOTEM ] position [1] member 10.70.100.102: Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:54 corosync [TOTEM ] position [2] member 10.70.100.103: Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:54 corosync [TOTEM ] position [3] member 10.70.100.104: Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 Jun 12 14:44:54 corosync [TOTEM ] Did not need to originate any messages in recovery. 
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Jun 12 14:44:54 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Jun 12 14:44:54 corosync [TOTEM ] Resetting old ring state Jun 12 14:44:54 corosync [TOTEM ] recovery to regular 1-0 Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 1 Jun 12 14:44:54 corosync [TOTEM ] entering OPERATIONAL state. Jun 12 14:44:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 0 Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, flushing membership messages. Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms, flushing membership messages. On 6/12/14, 1:55 PM, "Schaefer, Micah" wrote: >I just found that the clock on node1 was off by about a minute and a half >compared to the rest of the nodes. > >I am running ntp, so not sure why the time wasn?t synced up. Wonder if >node1 being behind, would think it was not receiving updates from the >other nodes? > > > > > > > >On 6/12/14, 1:29 PM, "Digimer" wrote: > >>Even if the token changes stop the immediate fencing, don't leave it >>please. There is something fundamentally wrong that you need to >>identify/fix. >> >>Keep us posted! >> >>On 12/06/14 01:24 PM, Schaefer, Micah wrote: >>> The servers do not run any tasks other than the tasks in the cluster >>> service group. >>> >>> Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 >>> and 2 are virtual machines with much less resources available. >>> >>> I adjusted the token settings and will watch for any change. >>> >>> >>> >>> >>> >>> >>> >>> >>> On 6/12/14, 1:08 PM, "Digimer" wrote: >>> >>>> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>>>> As far as the switch goes, both are Cisco Catalyst 6509-E, no >>>>>spanning >>>>> tree changes are happening and all the ports have port-fast enabled >>>>>for >>>>> these servers. My switch logging level is very high and I have no >>>>> messages >>>>> in relation to the time frames or ports. >>>>> >>>>> TOTEM reports that ?A processor joined or left the membership??, but >>>>> that >>>>> isn?t enough detail. >>>>> >>>>> Also note that I did not have these issues until adding new servers: >>>>> node3 >>>>> and node4 to the cluster. Node1 and node2 do not fence each other >>>>> (unless >>>>> a real issue is there), and they are on different switches. >>>> >>>> Then I can't imagine it being network anymore. 
Seeing as both node 3 >>>>and >>>> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >>>> and 4 much higher (or are the computers much slower) than 1 and 2? I'm >>>> wondering if the nodes are simply not keeping up with corosync >>>>traffic. >>>> You might try adjusting the corosync token timeout and retransmit >>>>counts >>>> to see if that reduces the node loses. >>>> >>>> -- >>>> Digimer >>>> Papers and Projects: https://alteeve.ca/w/ >>>> What if the cure for cancer is trapped in the mind of a person without >>>> access to education? >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >> >> >>-- >>Digimer >>Papers and Projects: https://alteeve.ca/w/ >>What if the cure for cancer is trapped in the mind of a person without >>access to education? >> >>-- >>Linux-cluster mailing list >>Linux-cluster at redhat.com >>https://www.redhat.com/mailman/listinfo/linux-cluster > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 12 19:06:57 2014 From: lists at alteeve.ca (Digimer) Date: Thu, 12 Jun 2014 15:06:57 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> Message-ID: <5399FA51.2020808@alteeve.ca> Hrm, I'm not really sure that I am able to interpret this without making guesses. I'm cc'ing one of the devs (who I hope will poke the right person if he's not able to help at the moment). Lets see what he has to say. I am curious now, too. :) On 12/06/14 03:02 PM, Schaefer, Micah wrote: > Node4 was fenced again, I was able to get some debug logs (below), a new > message : > > "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL > state.? > > > Rest of corosync logs > > http://pastebin.com/iYFbkbhb > > > Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, > flushing membership messages. 
> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, > flushing membership messages. > Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33494 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, > flushing membership messages. > Jun 12 14:44:50 corosync [TOTEM ] got commit token > Jun 12 14:44:50 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:50 corosync [TOTEM ] Storing new sequence id for ring 6324 > Jun 12 14:44:50 corosync [TOTEM ] entering COMMIT state. > Jun 12 14:44:50 corosync [TOTEM ] got commit token > Jun 12 14:44:50 corosync [TOTEM ] entering RECOVERY state. 
> Jun 12 14:44:50 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:50 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:50 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:50 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:50 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 > Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:50 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 > Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:50 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 > Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:50 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep 10.70.100.101 > Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:50 corosync [TOTEM ] Did not need to originate any messages > in recovery. > Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:50 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:50 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:50 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:50 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:50 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, > flushing membership messages. > Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, > flushing membership messages. 
> Jun 12 14:44:51 corosync [TOTEM ] got commit token > Jun 12 14:44:51 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:51 corosync [TOTEM ] Storing new sequence id for ring 6328 > Jun 12 14:44:51 corosync [TOTEM ] entering COMMIT state. > Jun 12 14:44:51 corosync [TOTEM ] got commit token > Jun 12 14:44:51 corosync [TOTEM ] entering RECOVERY state. > Jun 12 14:44:51 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:51 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:51 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:51 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:51 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 > Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:51 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 > Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:51 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 > Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:51 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep 10.70.100.101 > Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:51 corosync [TOTEM ] Did not need to originate any messages > in recovery. > Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:51 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:51 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:51 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:51 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:51 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, > flushing membership messages. 
> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35455 ms, > flushing membership messages. > Jun 12 14:44:52 corosync [TOTEM ] got commit token > Jun 12 14:44:52 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:52 corosync [TOTEM ] Storing new sequence id for ring 632c > Jun 12 14:44:52 corosync [TOTEM ] entering COMMIT state. > Jun 12 14:44:52 corosync [TOTEM ] got commit token > Jun 12 14:44:52 corosync [TOTEM ] entering RECOVERY state. > Jun 12 14:44:52 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:52 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:52 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:52 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:52 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 > Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:52 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 > Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:52 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 > Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:52 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep 10.70.100.101 > Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:52 corosync [TOTEM ] Did not need to originate any messages > in recovery. 
> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:52 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:52 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:52 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:52 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:52 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36223 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36224 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, > flushing membership messages. > Jun 12 14:44:53 corosync [TOTEM ] got commit token > Jun 12 14:44:53 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:53 corosync [TOTEM ] Storing new sequence id for ring 6330 > Jun 12 14:44:53 corosync [TOTEM ] entering COMMIT state. 
> Jun 12 14:44:53 corosync [TOTEM ] got commit token > Jun 12 14:44:53 corosync [TOTEM ] entering RECOVERY state. > Jun 12 14:44:53 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:53 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:53 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:53 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:53 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 > Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:53 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 > Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:53 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 > Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:53 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep 10.70.100.101 > Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:53 corosync [TOTEM ] Did not need to originate any messages > in recovery. > Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:53 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:53 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:53 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:53 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:53 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, > flushing membership messages. 
> Jun 12 14:44:54 corosync [TOTEM ] got commit token > Jun 12 14:44:54 corosync [TOTEM ] Saving state aru 86 high seq received 86 > Jun 12 14:44:54 corosync [TOTEM ] Storing new sequence id for ring 6334 > Jun 12 14:44:54 corosync [TOTEM ] entering COMMIT state. > Jun 12 14:44:54 corosync [TOTEM ] got commit token > Jun 12 14:44:54 corosync [TOTEM ] entering RECOVERY state. > Jun 12 14:44:54 corosync [TOTEM ] TRANS [0] member 10.70.100.101: > Jun 12 14:44:54 corosync [TOTEM ] TRANS [1] member 10.70.100.102: > Jun 12 14:44:54 corosync [TOTEM ] TRANS [2] member 10.70.100.103: > Jun 12 14:44:54 corosync [TOTEM ] TRANS [3] member 10.70.100.104: > Jun 12 14:44:54 corosync [TOTEM ] position [0] member 10.70.100.101: > Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 > Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:54 corosync [TOTEM ] position [1] member 10.70.100.102: > Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 > Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:54 corosync [TOTEM ] position [2] member 10.70.100.103: > Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 > Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:54 corosync [TOTEM ] position [3] member 10.70.100.104: > Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101 > Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1 > Jun 12 14:44:54 corosync [TOTEM ] Did not need to originate any messages > in recovery. > Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 0, aru ffffffff > Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 1, aru 0 > Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 2, aru 0 > Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans > flag0 retrans queue empty 1 count 3, aru 0 > Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Jun 12 14:44:54 corosync [TOTEM ] retrans flag count 4 token aru 0 install > seq 0 aru 0 0 > Jun 12 14:44:54 corosync [TOTEM ] Resetting old ring state > Jun 12 14:44:54 corosync [TOTEM ] recovery to regular 1-0 > Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 1 > Jun 12 14:44:54 corosync [TOTEM ] entering OPERATIONAL state. > Jun 12 14:44:54 corosync [TOTEM ] A processor joined or left the > membership and a new membership was formed. > Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 0 > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, > flushing membership messages. > Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms, > flushing membership messages. 
> > > > > > > > > > On 6/12/14, 1:55 PM, "Schaefer, Micah" wrote: > >> I just found that the clock on node1 was off by about a minute and a half >> compared to the rest of the nodes. >> >> I am running ntp, so not sure why the time wasn?t synced up. Wonder if >> node1 being behind, would think it was not receiving updates from the >> other nodes? >> >> >> >> >> >> >> >> On 6/12/14, 1:29 PM, "Digimer" wrote: >> >>> Even if the token changes stop the immediate fencing, don't leave it >>> please. There is something fundamentally wrong that you need to >>> identify/fix. >>> >>> Keep us posted! >>> >>> On 12/06/14 01:24 PM, Schaefer, Micah wrote: >>>> The servers do not run any tasks other than the tasks in the cluster >>>> service group. >>>> >>>> Nodes 3 and 4 are physical servers with a lot of horsepower and nodes 1 >>>> and 2 are virtual machines with much less resources available. >>>> >>>> I adjusted the token settings and will watch for any change. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 6/12/14, 1:08 PM, "Digimer" wrote: >>>> >>>>> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>>>>> As far as the switch goes, both are Cisco Catalyst 6509-E, no >>>>>> spanning >>>>>> tree changes are happening and all the ports have port-fast enabled >>>>>> for >>>>>> these servers. My switch logging level is very high and I have no >>>>>> messages >>>>>> in relation to the time frames or ports. >>>>>> >>>>>> TOTEM reports that ?A processor joined or left the membership??, but >>>>>> that >>>>>> isn?t enough detail. >>>>>> >>>>>> Also note that I did not have these issues until adding new servers: >>>>>> node3 >>>>>> and node4 to the cluster. Node1 and node2 do not fence each other >>>>>> (unless >>>>>> a real issue is there), and they are on different switches. >>>>> >>>>> Then I can't imagine it being network anymore. Seeing as both node 3 >>>>> and >>>>> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >>>>> and 4 much higher (or are the computers much slower) than 1 and 2? I'm >>>>> wondering if the nodes are simply not keeping up with corosync >>>>> traffic. >>>>> You might try adjusting the corosync token timeout and retransmit >>>>> counts >>>>> to see if that reduces the node loses. >>>>> >>>>> -- >>>>> Digimer >>>>> Papers and Projects: https://alteeve.ca/w/ >>>>> What if the cure for cancer is trapped in the mind of a person without >>>>> access to education? >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>> >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ >>> What if the cure for cancer is trapped in the mind of a person without >>> access to education? >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
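For anyone who wants to try the same tuning: on a cman-based cluster such as this one, the corosync token timeout and retransmit count that Digimer suggests adjusting are set through the <totem> element of /etc/cluster/cluster.conf (cman passes the attributes through to corosync). A minimal sketch follows; the values are purely illustrative, and the thread does not record which values Micah actually used:

    <cluster name="example" config_version="3">
      <!-- token: milliseconds before a lost token is declared;
           token_retransmits_before_loss_const: retransmit attempts before
           the membership is re-formed -->
      <totem token="20000" token_retransmits_before_loss_const="10"/>
      <!-- clusternodes, fencedevices and rm sections left unchanged -->
    </cluster>

After bumping config_version, a command along the lines of "cman_tool version -r" can propagate the new configuration to the running cluster, though the exact reload procedure depends on how cluster.conf is distributed.
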
From lzhong at suse.com Fri Jun 13 01:59:29 2014 From: lzhong at suse.com (Lidong Zhong) Date: Fri, 13 Jun 2014 09:59:29 +0800 Subject: [Linux-cluster] [RFC] dlm: keep listening connection alive with sctp mode In-Reply-To: <742486000.20595916.1402576184717.JavaMail.zimbra@redhat.com> References: <1402555378-5220-1-git-send-email-lzhong@suse.com> <742486000.20595916.1402576184717.JavaMail.zimbra@redhat.com> Message-ID: <1402624769.1407.0.camel@suse.site> Hi Bob, > ----- Original Message ----- > (snip) > > Signed-off-by: Lidong Zhong > > Hi Lidong, > > There is a special public mailing list for patches like this > and other cluster-related development. The mailing list is called > cluster-devel. Here is a link where you can subscribe to it: > > https://www.redhat.com/mailman/listinfo/cluster-devel > > I recommend you send your patch to cluster-devel at redhat.com. > OK, thank you very much. > Regards, > > Bob Peterson > Red Hat File Systems > -- Best regards, Lidong From fdinitto at redhat.com Fri Jun 13 04:02:34 2014 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 13 Jun 2014 06:02:34 +0200 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399FA51.2020808@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> Message-ID: <539A77DA.6010407@redhat.com> On 06/12/2014 09:06 PM, Digimer wrote: > Hrm, I'm not really sure that I am able to interpret this without making > guesses. I'm cc'ing one of the devs (who I hope will poke the right > person if he's not able to help at the moment). Lets see what he has to > say. > > I am curious now, too. :) Chrissie/Honza: can you please take a look at this thread and see if there is a latent bug? I find it odd that the Process pause detected is kicking in so many times without a fencing action. Fabio > > On 12/06/14 03:02 PM, Schaefer, Micah wrote: >> Node4 was fenced again, I was able to get some debug logs (below), a new >> message : >> >> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL >> state.? >> >> >> Rest of corosync logs >> >> http://pastebin.com/iYFbkbhb >> >> >> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. 
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33494 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] got commit token >> Jun 12 14:44:50 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:50 corosync [TOTEM ] Storing new sequence id for ring 6324 >> Jun 12 14:44:50 corosync [TOTEM ] entering COMMIT state. 
>> Jun 12 14:44:50 corosync [TOTEM ] got commit token >> Jun 12 14:44:50 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:50 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:50 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:50 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:50 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:50 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep >> 10.70.100.101 >> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:50 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep >> 10.70.100.101 >> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:50 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep >> 10.70.100.101 >> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:50 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep >> 10.70.100.101 >> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:50 corosync [TOTEM ] Did not need to originate any messages >> in recovery. >> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:50 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:50 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:50 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:50 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:50 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, >> flushing membership messages. 
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms, >> flushing membership messages. >> Jun 12 14:44:51 corosync [TOTEM ] got commit token >> Jun 12 14:44:51 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:51 corosync [TOTEM ] Storing new sequence id for ring 6328 >> Jun 12 14:44:51 corosync [TOTEM ] entering COMMIT state. >> Jun 12 14:44:51 corosync [TOTEM ] got commit token >> Jun 12 14:44:51 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:51 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:51 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:51 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:51 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:51 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep >> 10.70.100.101 >> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:51 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep >> 10.70.100.101 >> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:51 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep >> 10.70.100.101 >> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:51 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep >> 10.70.100.101 >> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:51 corosync [TOTEM ] Did not need to originate any messages >> in recovery. >> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:51 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:51 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:51 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:51 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:51 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms, >> flushing membership messages. 
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35455 ms, >> flushing membership messages. >> Jun 12 14:44:52 corosync [TOTEM ] got commit token >> Jun 12 14:44:52 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:52 corosync [TOTEM ] Storing new sequence id for ring 632c >> Jun 12 14:44:52 corosync [TOTEM ] entering COMMIT state. >> Jun 12 14:44:52 corosync [TOTEM ] got commit token >> Jun 12 14:44:52 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:52 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:52 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:52 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:52 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:52 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep >> 10.70.100.101 >> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:52 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep >> 10.70.100.101 >> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:52 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep >> 10.70.100.101 >> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:52 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep >> 10.70.100.101 >> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:52 corosync [TOTEM ] Did not need to originate any messages >> in recovery. 
>> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:52 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:52 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:52 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:52 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:52 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36223 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36224 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, >> flushing membership messages. >> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms, >> flushing membership messages. 
>> Jun 12 14:44:53 corosync [TOTEM ] got commit token >> Jun 12 14:44:53 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:53 corosync [TOTEM ] Storing new sequence id for ring 6330 >> Jun 12 14:44:53 corosync [TOTEM ] entering COMMIT state. >> Jun 12 14:44:53 corosync [TOTEM ] got commit token >> Jun 12 14:44:53 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:53 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:53 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:53 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:53 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:53 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep >> 10.70.100.101 >> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:53 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep >> 10.70.100.101 >> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:53 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep >> 10.70.100.101 >> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:53 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep >> 10.70.100.101 >> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:53 corosync [TOTEM ] Did not need to originate any messages >> in recovery. >> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:53 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:53 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:53 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:53 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:53 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, >> flushing membership messages. 
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] got commit token >> Jun 12 14:44:54 corosync [TOTEM ] Saving state aru 86 high seq >> received 86 >> Jun 12 14:44:54 corosync [TOTEM ] Storing new sequence id for ring 6334 >> Jun 12 14:44:54 corosync [TOTEM ] entering COMMIT state. >> Jun 12 14:44:54 corosync [TOTEM ] got commit token >> Jun 12 14:44:54 corosync [TOTEM ] entering RECOVERY state. >> Jun 12 14:44:54 corosync [TOTEM ] TRANS [0] member 10.70.100.101: >> Jun 12 14:44:54 corosync [TOTEM ] TRANS [1] member 10.70.100.102: >> Jun 12 14:44:54 corosync [TOTEM ] TRANS [2] member 10.70.100.103: >> Jun 12 14:44:54 corosync [TOTEM ] TRANS [3] member 10.70.100.104: >> Jun 12 14:44:54 corosync [TOTEM ] position [0] member 10.70.100.101: >> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep >> 10.70.100.101 >> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:54 corosync [TOTEM ] position [1] member 10.70.100.102: >> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep >> 10.70.100.101 >> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:54 corosync [TOTEM ] position [2] member 10.70.100.103: >> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep >> 10.70.100.101 >> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:54 corosync [TOTEM ] position [3] member 10.70.100.104: >> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep >> 10.70.100.101 >> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received >> flag 1 >> Jun 12 14:44:54 corosync [TOTEM ] Did not need to originate any messages >> in recovery. >> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 0, aru ffffffff >> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 1, aru 0 >> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 2, aru 0 >> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans >> flag0 retrans queue empty 1 count 3, aru 0 >> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 >> Jun 12 14:44:54 corosync [TOTEM ] retrans flag count 4 token aru 0 >> install >> seq 0 aru 0 0 >> Jun 12 14:44:54 corosync [TOTEM ] Resetting old ring state >> Jun 12 14:44:54 corosync [TOTEM ] recovery to regular 1-0 >> Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 1 >> Jun 12 14:44:54 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:54 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. >> Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, >> flushing membership messages. 
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, >> flushing membership messages. >> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms, >> flushing membership messages. >> >> >> >> >> >> >> >> >> >> On 6/12/14, 1:55 PM, "Schaefer, Micah" wrote: >> >>> I just found that the clock on node1 was off by about a minute and a >>> half >>> compared to the rest of the nodes. >>> >>> I am running ntp, so not sure why the time wasn?t synced up. Wonder if >>> node1 being behind, would think it was not receiving updates from the >>> other nodes? >>> >>> >>> >>> >>> >>> >>> >>> On 6/12/14, 1:29 PM, "Digimer" wrote: >>> >>>> Even if the token changes stop the immediate fencing, don't leave it >>>> please. There is something fundamentally wrong that you need to >>>> identify/fix. >>>> >>>> Keep us posted! >>>> >>>> On 12/06/14 01:24 PM, Schaefer, Micah wrote: >>>>> The servers do not run any tasks other than the tasks in the cluster >>>>> service group. >>>>> >>>>> Nodes 3 and 4 are physical servers with a lot of horsepower and >>>>> nodes 1 >>>>> and 2 are virtual machines with much less resources available. >>>>> >>>>> I adjusted the token settings and will watch for any change. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 6/12/14, 1:08 PM, "Digimer" wrote: >>>>> >>>>>> On 12/06/14 12:48 PM, Schaefer, Micah wrote: >>>>>>> As far as the switch goes, both are Cisco Catalyst 6509-E, no >>>>>>> spanning >>>>>>> tree changes are happening and all the ports have port-fast enabled >>>>>>> for >>>>>>> these servers. My switch logging level is very high and I have no >>>>>>> messages >>>>>>> in relation to the time frames or ports. >>>>>>> >>>>>>> TOTEM reports that ?A processor joined or left the membership??, but >>>>>>> that >>>>>>> isn?t enough detail. >>>>>>> >>>>>>> Also note that I did not have these issues until adding new servers: >>>>>>> node3 >>>>>>> and node4 to the cluster. Node1 and node2 do not fence each other >>>>>>> (unless >>>>>>> a real issue is there), and they are on different switches. >>>>>> >>>>>> Then I can't imagine it being network anymore. Seeing as both node 3 >>>>>> and >>>>>> 4 get fenced, it's likely not hardware either. Are the workloads on 3 >>>>>> and 4 much higher (or are the computers much slower) than 1 and 2? >>>>>> I'm >>>>>> wondering if the nodes are simply not keeping up with corosync >>>>>> traffic. >>>>>> You might try adjusting the corosync token timeout and retransmit >>>>>> counts >>>>>> to see if that reduces the node loses. >>>>>> >>>>>> -- >>>>>> Digimer >>>>>> Papers and Projects: https://alteeve.ca/w/ >>>>>> What if the cure for cancer is trapped in the mind of a person >>>>>> without >>>>>> access to education? >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>>> >>>> >>>> >>>> -- >>>> Digimer >>>> Papers and Projects: https://alteeve.ca/w/ >>>> What if the cure for cancer is trapped in the mind of a person without >>>> access to education? 
>>>> 
>>>> -- 
>>>> Linux-cluster mailing list 
>>>> Linux-cluster at redhat.com 
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster 
>>> 
>>> 
>>> -- 
>>> Linux-cluster mailing list 
>>> Linux-cluster at redhat.com 
>>> https://www.redhat.com/mailman/listinfo/linux-cluster 
>> 
>> 
> 
> 

From kienlt at mbbank.com.vn Mon Jun 16 11:43:33 2014
From: kienlt at mbbank.com.vn (Le Trung Kien)
Date: Mon, 16 Jun 2014 11:43:33 +0000
Subject: [Linux-cluster] Two-node cluster GFS2 confusing
Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP>

Hello everyone,

I'm new to Linux clustering. I have built a two-node cluster (without
qdisk) that includes:

Redhat 6.4
cman
pacemaker
gfs2

My cluster can fail over (back and forth) between the two nodes for these 3
resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on
/mnt/gfs2_storage), WebSite (apache service)

My problem occurs when I stop/start the nodes in the following order: (when
both nodes started)

1. Stop: node1 (shutdown) -> all resources fail over to node2 -> all resources
still working on node2
2. Stop: node2 (stop services: pacemaker then cman) -> all resources stop (of
course)
3. Start: node1 (start services: cman then pacemaker) -> only ClusterIP
started, WebFS failed, WebSite not started

Status:

Last updated: Mon Jun 16 18:34:56 2014
Last change: Mon Jun 16 14:24:54 2014 via cibadmin on server1
Stack: cman
Current DC: server1 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 1 expected votes
4 Resources configured.

Online: [ server1 ]
OFFLINE: [ server2 ]

ClusterIP (ocf::heartbeat:IPaddr2): Started server1
WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED

Failed actions:
WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): unknown error

Here is my /etc/cluster/cluster.conf

Here is my "crm configure show" output:

node server1
node server2
primitive ClusterIP IPaddr2 \
    params ip=192.168.117.130 cidr_netmask=32 \
    op monitor interval=10s
primitive WebFS Filesystem \
    params device="/dev/sdc" directory="/mnt/gfs2_datastore" fstype=gfs2 \
    meta target-role=Started
primitive WebSite1 apache \
    params configfile="/mnt/nfs_datastore/httpd/conf/httpd.conf" statusurl="http://localhost/server-status" \
    op monitor interval=40s \
    meta target-role=Stopped
primitive WebSite2 apache \
    params configfile="/mnt/gfs2_datastore/httpd/conf/httpd.conf" statusurl="http://localhost/server-status" \
    op monitor interval=40s \
    meta target-role=Started
colocation webfs-with-ip inf: WebFS ClusterIP
colocation website-with-webfs inf: WebSite2 WebFS
order webfs-after-clusterip inf: ClusterIP WebFS
order website-after-webfs inf: WebFS WebSite2
property cib-bootstrap-options: \
    dc-version=1.1.8-7.el6-394e906 \
    cluster-infrastructure=cman \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    expected-quorum-votes=1 \
    last-lrm-refresh=1402374391
rsc_defaults rsc-options: \
    resource-stickiness=100
rsc_defaults rsc_defaults-options: \
    resource-stickiness=100
op_defaults op_defaults-options: \
    migration-threshold=1

I don't have any clues to trace down this case; I just guess the problem
comes from the locking file system. Please suggest some advice.

Thank you.

Kien Le.
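A side note on the configuration above, not necessarily the cause of this particular failure: GFS2 relies on the DLM, and the DLM blocks lock recovery until a failed node has been fenced, so running a GFS2 mount with stonith-enabled=false and no fence devices can leave Filesystem stop and start operations hanging in much the way the Timed Out WebFS_stop_0 action looks here. A rough sketch of what adding fencing through the crm shell might look like; the fence_ipmilan agent and all parameter values below are placeholders rather than details taken from this cluster:

    # one stonith resource per node; keep each off the node it is meant to fence
    primitive st-server1 stonith:fence_ipmilan \
        params pcmk_host_list="server1" ipaddr="10.0.0.1" login="admin" passwd="secret" \
        op monitor interval=60s
    primitive st-server2 stonith:fence_ipmilan \
        params pcmk_host_list="server2" ipaddr="10.0.0.2" login="admin" passwd="secret" \
        op monitor interval=60s
    location l-st-server1 st-server1 -inf: server1
    location l-st-server2 st-server2 -inf: server2
    property stonith-enabled=true

On RHEL 6, where cman runs underneath pacemaker, the clusterlabs "Clusters from Scratch" guide (if I recall it correctly) also has cluster.conf delegate fencing to pacemaker via the fence_pcmk agent, so that cman-initiated fencing and pacemaker's stonith devices stay in agreement.
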
From rpeterso at redhat.com Mon Jun 16 12:20:44 2014
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 16 Jun 2014 08:20:44 -0400 (EDT)
Subject: [Linux-cluster] Two-node cluster GFS2 confusing
In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP>
References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP>
Message-ID: <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com>

----- Original Message -----
> Hello everyone,
>
> I'm a new man on linux cluster. I have built a two-node cluster (without
> qdisk), includes:
>
> Redhat 6.4
> cman
> pacemaker
> gfs2
>
> My cluster could fail-over (back and forth) between two nodes for these 3
> resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on
> /mnt/gfs2_storage), WebSite ( apache service)
>
> My problem occurs when I stop/start node in the following order: (when both
> nodes started)
>
> 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all resources
> still working on node2
> 2. Stop: node2 (stop service: pacemaker then cman) -> all resources stop (of
> course)
> 3. Start: node1 (start service: cman then pacemaker) -> only ClusterIP
> started, WebFS failed, WebSite not started
(snip)
> I don't have any glues to trace down this case, I just guess this problem
> comes from locking file system, please suggest me some advices.

Hi,

Some thoughts on your problem:
(1) If this is truly Redhat 6.4, and you have a support contract with Red
Hat, you should call the support number with Global Support Services and
file a ticket. They'll be able to help.
(2) You didn't explain what your symptoms were. In what way does it fail?
(3) Why do you suspect "this problem comes from locking file system"?
Do you mean from GFS2? What is the symptom that causes you to think it
might be the file system? Were there messages on the console or dmesg to
indicate a kernel issue?
(4) I thought RHEL6.4 has cman/rgmanager, not pacemaker.

Regards,

Bob Peterson
Red Hat File Systems

From kienlt at mbbank.com.vn Mon Jun 16 12:50:52 2014
From: kienlt at mbbank.com.vn (Le Trung Kien)
Date: Mon, 16 Jun 2014 12:50:52 +0000
Subject: [Linux-cluster] Two-node cluster GFS2 confusing
In-Reply-To: <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com>
References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP>
 <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com>
Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP>

Hi,

I don't have an active support contract with Redhat right now, and I am
trying to work through the Redhat cluster stack on my own to understand the
solution first.

I followed the step-by-step guide from clusterlabs.org and configured the
cluster using CMAN and Pacemaker (of course there is rgmanager in 6.4, but I
don't know how to use it yet because I'm still in the middle of figuring
things out from the start).

I think the problem comes from GFS2 because, with NFS (no locking), my
cluster has no problem at all. The problem only appears when I configure a
shared GFS2 file system for my cluster.

Thank you for your concerns :)

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson
Sent: Monday, June 16, 2014 7:21 PM
To: linux clustering
Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing

----- Original Message -----
> Hello everyone,
>
> I'm a new man on linux cluster. 
I have built a two-node cluster > (without qdisk), includes: > > Redhat 6.4 > cman > pacemaker > gfs2 > > My cluster could fail-over (back and forth) between two nodes for > these 3 > resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on > /mnt/gfs2_storage), WebSite ( apache service) > > My problem occurs when I stop/start node in the following order: (when > both nodes started) > > 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all > resources still working on node2 2. Stop: node2 (stop service: > pacemaker then cman) -> all resources stop (of > course) > 3. Start: node1 (start service: cman then pacemaker) -> only ClusterIP > started, WebFS failed, WebSite not started (snip) > I don't have any glues to trace down this case, I just guess this > problem comes from locking file system, please suggest me some advices. Hi, Some thoughts on your problem: (1) If this is truly Redhat 6.4, and you have a support contract with Red Hat, you should call the support number with Global Support Services and file a ticket. They'll be able to help. (2) You didn't explain what your symptoms were? In what way does it fail? (3) Why do you suspect "this problem comes from locking file system"? Do you mean from GFS2? What is the symptom that causes you to think it might be the file system? Were there messages on the console or dmesg to indicate a kernel issue? (4) I thought RHEL6.4 has cman/rgmanager, not pacemaker. Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From rpeterso at redhat.com Mon Jun 16 12:56:14 2014 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 16 Jun 2014 08:56:14 -0400 (EDT) Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> Message-ID: <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Hi, > > I don't have an active support contract with Redhat right now. And try to > work around with Redhat cluster to understand the solution first. > > I followed the steps guide from clusterlabs.org, configure cluster using: > CMAN, Pacemaker (of course there is rgmanager in 6.4 but I don't know how to > use it right now because I'm in the middle of messing thing from start) > > I think the problem was from GFS2 because, with a NFS (no locking) has no > problem with my cluster at all. This problem just come when I configure a > shared GFS2 for my cluster. > > Thank you for your concerns :) Hi, Do you see any kernel messages in dmesg or on the console, after the failure? 
Regards, Bob Peterson Red Hat File Systems From kienlt at mbbank.com.vn Tue Jun 17 04:07:21 2014 From: kienlt at mbbank.com.vn (Le Trung Kien) Date: Tue, 17 Jun 2014 04:07:21 +0000 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> Hi, here is my dmesg after failed: GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" dlm: Using TCP for communications GFS2: fsid=mycluster:web.0: Joined cluster. Now mounting FS... GFS2: fsid=mycluster:web.0: jid=0, already locked for use GFS2: fsid=mycluster:web.0: jid=0: Looking at journal... GFS2: fsid=mycluster:web.0: jid=0: Acquiring the transaction lock... GFS2: fsid=mycluster:web.0: jid=0: Replaying journal... GFS2: fsid=mycluster:web.0: jid=0: Replayed 1 of 1 blocks GFS2: fsid=mycluster:web.0: jid=0: Found 0 revoke tags GFS2: fsid=mycluster:web.0: jid=0: Journal replayed in 1s GFS2: fsid=mycluster:web.0: jid=0: Done GFS2: fsid=mycluster:web.0: jid=1: Trying to acquire journal lock... GFS2: fsid=mycluster:web.0: jid=1: Looking at journal... GFS2: fsid=mycluster:web.0: jid=1: Done hrtimer: interrupt took 4149483 ns dlm: closing connection to node 2 dlm: closing connection to node 1 GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" dlm: Using TCP for communications -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson Sent: Monday, June 16, 2014 7:56 PM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing ----- Original Message ----- > Hi, > > I don't have an active support contract with Redhat right now. And try > to work around with Redhat cluster to understand the solution first. > > I followed the steps guide from clusterlabs.org, configure cluster using: > CMAN, Pacemaker (of course there is rgmanager in 6.4 but I don't know > how to use it right now because I'm in the middle of messing thing > from start) > > I think the problem was from GFS2 because, with a NFS (no locking) has > no problem with my cluster at all. This problem just come when I > configure a shared GFS2 for my cluster. > > Thank you for your concerns :) Hi, Do you see any kernel messages in dmesg or on the console, after the failure? 
Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From rpeterso at redhat.com Tue Jun 17 12:08:54 2014 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 17 Jun 2014 08:08:54 -0400 (EDT) Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> Message-ID: <1432751001.23019030.1403006934583.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Hi, here is my dmesg after failed: > > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > GFS2: fsid=mycluster:web.0: Joined cluster. Now mounting FS... > GFS2: fsid=mycluster:web.0: jid=0, already locked for use > GFS2: fsid=mycluster:web.0: jid=0: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=0: Acquiring the transaction lock... > GFS2: fsid=mycluster:web.0: jid=0: Replaying journal... > GFS2: fsid=mycluster:web.0: jid=0: Replayed 1 of 1 blocks > GFS2: fsid=mycluster:web.0: jid=0: Found 0 revoke tags > GFS2: fsid=mycluster:web.0: jid=0: Journal replayed in 1s > GFS2: fsid=mycluster:web.0: jid=0: Done > GFS2: fsid=mycluster:web.0: jid=1: Trying to acquire journal lock... > GFS2: fsid=mycluster:web.0: jid=1: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=1: Done > hrtimer: interrupt took 4149483 ns > dlm: closing connection to node 2 > dlm: closing connection to node 1 > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > Hi, If there was a GFS2 problem, you would ordinarily see errors there, and these messages are all pretty normal. Regards, Bob Peterson Red Hat File Systems From ccaulfie at redhat.com Tue Jun 17 12:41:07 2014 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 17 Jun 2014 13:41:07 +0100 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <5399FA51.2020808@alteeve.ca> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> Message-ID: <53A03763.4080905@redhat.com> On 12/06/14 20:06, Digimer wrote: > Hrm, I'm not really sure that I am able to interpret this without making > guesses. I'm cc'ing one of the devs (who I hope will poke the right > person if he's not able to help at the moment). Lets see what he has to > say. > > I am curious now, too. :) > > On 12/06/14 03:02 PM, Schaefer, Micah wrote: >> Node4 was fenced again, I was able to get some debug logs (below), a new >> message : >> >> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL >> state.? >> >> >> Rest of corosync logs >> >> http://pastebin.com/iYFbkbhb >> >> >> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >> membership and a new membership was formed. 
>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >> flushing membership messages. >> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >> flushing membership messages. I'm concerned that the pause messages are repeating like that, it looks like it might be a fixed bug. What version of corosync do you have? Chrissie From Micah.Schaefer at jhuapl.edu Tue Jun 17 14:27:29 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Tue, 17 Jun 2014 10:27:29 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <53A03763.4080905@redhat.com> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> <53A03763.4080905@redhat.com> Message-ID: I am running Red Hat 6.4 with the HA/ load balancing packages from the install DVD. -bash-4.1$ cat /etc/redhat-release Red Hat Enterprise Linux Server release 6.4 (Santiago) -bash-4.1$ corosync -v Corosync Cluster Engine, version '1.4.1' Copyright (c) 2006-2009 Red Hat, Inc. On 6/17/14, 8:41 AM, "Christine Caulfield" wrote: >On 12/06/14 20:06, Digimer wrote: >> Hrm, I'm not really sure that I am able to interpret this without making >> guesses. I'm cc'ing one of the devs (who I hope will poke the right >> person if he's not able to help at the moment). Lets see what he has to >> say. >> >> I am curious now, too. 
:) >> >> On 12/06/14 03:02 PM, Schaefer, Micah wrote: >>> Node4 was fenced again, I was able to get some debug logs (below), a >>>new >>> message : >>> >>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the >>>OPERATIONAL >>> state.? >>> >>> >>> Rest of corosync logs >>> >>> http://pastebin.com/iYFbkbhb >>> >>> >>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >>> membership and a new membership was formed. >>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>> flushing membership messages. >>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>> flushing membership messages. >>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >>> flushing membership messages. >>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >>> flushing membership messages. >>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >>> flushing membership messages. > > >I'm concerned that the pause messages are repeating like that, it looks >like it might be a fixed bug. What version of corosync do you have? 
> >Chrissie > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From kienlt at mbbank.com.vn Tue Jun 17 15:48:40 2014 From: kienlt at mbbank.com.vn (Le Trung Kien) Date: Tue, 17 Jun 2014 15:48:40 +0000 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <1432751001.23019030.1403006934583.JavaMail.zimbra@redhat.com> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> <1432751001.23019030.1403006934583.JavaMail.zimbra@redhat.com> Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F1E@HN-MBX-02.BANK.MB.GROUP> I reproduced my cluster problem again and got this error from /var/log/message So, I think the reason is fencing wrongly configured. And I may have to focus on Configure Fencing Device. Here is my log: Jun 17 22:32:36 server2 fenced[6559]: fenced 3.0.12.1 started Jun 17 22:32:36 server2 dlm_controld[6573]: dlm_controld 3.0.12.1 started Jun 17 22:32:37 server2 gfs_controld[6634]: gfs_controld 3.0.12.1 started Jun 17 22:33:29 server2 fenced[6559]: fencing node server1 Jun 17 22:33:29 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:29 server2 fenced[6559]: fence server1 failed Jun 17 22:33:32 server2 fenced[6559]: fencing node server1 Jun 17 22:33:32 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:32 server2 fenced[6559]: fence server1 failed Jun 17 22:33:35 server2 fenced[6559]: fencing node server1 Jun 17 22:33:35 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:35 server2 fenced[6559]: fence server1 failed -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson Sent: Tuesday, June 17, 2014 7:09 PM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing ----- Original Message ----- > Hi, here is my dmesg after failed: > > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > GFS2: fsid=mycluster:web.0: Joined cluster. Now mounting FS... > GFS2: fsid=mycluster:web.0: jid=0, already locked for use > GFS2: fsid=mycluster:web.0: jid=0: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=0: Acquiring the transaction lock... > GFS2: fsid=mycluster:web.0: jid=0: Replaying journal... > GFS2: fsid=mycluster:web.0: jid=0: Replayed 1 of 1 blocks > GFS2: fsid=mycluster:web.0: jid=0: Found 0 revoke tags > GFS2: fsid=mycluster:web.0: jid=0: Journal replayed in 1s > GFS2: fsid=mycluster:web.0: jid=0: Done > GFS2: fsid=mycluster:web.0: jid=1: Trying to acquire journal lock... > GFS2: fsid=mycluster:web.0: jid=1: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=1: Done > hrtimer: interrupt took 4149483 ns > dlm: closing connection to node 2 > dlm: closing connection to node 1 > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > Hi, If there was a GFS2 problem, you would ordinarily see errors there, and these messages are all pretty normal. 
Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From kienlt at mbbank.com.vn Tue Jun 17 16:16:50 2014 From: kienlt at mbbank.com.vn (Le Trung Kien) Date: Tue, 17 Jun 2014 16:16:50 +0000 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F1E@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <250852778.22209689.1402921244733.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9BBA@HN-MBX-02.BANK.MB.GROUP> <339851360.22295947.1402923374425.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9D0F@HN-MBX-02.BANK.MB.GROUP> <1432751001.23019030.1403006934583.JavaMail.zimbra@redhat.com> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F1E@HN-MBX-02.BANK.MB.GROUP> Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F33@HN-MBX-02.BANK.MB.GROUP> Sorry, I reformat my log to easy for reading: Jun 17 22:32:36 server2 fenced[6559]: fenced 3.0.12.1 started Jun 17 22:32:36 server2 dlm_controld[6573]: dlm_controld 3.0.12.1 started Jun 17 22:32:37 server2 gfs_controld[6634]: gfs_controld 3.0.12.1 started Jun 17 22:33:29 server2 fenced[6559]: fencing node server1 Jun 17 22:33:29 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:29 server2 fenced[6559]: fence server1 failed Jun 17 22:33:32 server2 fenced[6559]: fencing node server1 Jun 17 22:33:32 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:32 server2 fenced[6559]: fence server1 failed Jun 17 22:33:35 server2 fenced[6559]: fencing node server1 Jun 17 22:33:35 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:35 server2 fenced[6559]: fence server1 failed -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Le Trung Kien Sent: Tuesday, June 17, 2014 10:49 PM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing I reproduced my cluster problem again and got this error from /var/log/message So, I think the reason is fencing wrongly configured. And I may have to focus on Configure Fencing Device. 
Here is my log: Jun 17 22:32:36 server2 fenced[6559]: fenced 3.0.12.1 started Jun 17 22:32:36 server2 dlm_controld[6573]: dlm_controld 3.0.12.1 started Jun 17 22:32:37 server2 gfs_controld[6634]: gfs_controld 3.0.12.1 started Jun 17 22:33:29 server2 fenced[6559]: fencing node server1 Jun 17 22:33:29 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:29 server2 fenced[6559]: fence server1 failed Jun 17 22:33:32 server2 fenced[6559]: fencing node server1 Jun 17 22:33:32 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:32 server2 fenced[6559]: fence server1 failed Jun 17 22:33:35 server2 fenced[6559]: fencing node server1 Jun 17 22:33:35 server2 fenced[6559]: fence server1 dev 0.0 agent none result: error config agent Jun 17 22:33:35 server2 fenced[6559]: fence server1 failed -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson Sent: Tuesday, June 17, 2014 7:09 PM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing ----- Original Message ----- > Hi, here is my dmesg after failed: > > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > GFS2: fsid=mycluster:web.0: Joined cluster. Now mounting FS... > GFS2: fsid=mycluster:web.0: jid=0, already locked for use > GFS2: fsid=mycluster:web.0: jid=0: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=0: Acquiring the transaction lock... > GFS2: fsid=mycluster:web.0: jid=0: Replaying journal... > GFS2: fsid=mycluster:web.0: jid=0: Replayed 1 of 1 blocks > GFS2: fsid=mycluster:web.0: jid=0: Found 0 revoke tags > GFS2: fsid=mycluster:web.0: jid=0: Journal replayed in 1s > GFS2: fsid=mycluster:web.0: jid=0: Done > GFS2: fsid=mycluster:web.0: jid=1: Trying to acquire journal lock... > GFS2: fsid=mycluster:web.0: jid=1: Looking at journal... > GFS2: fsid=mycluster:web.0: jid=1: Done > hrtimer: interrupt took 4149483 ns > dlm: closing connection to node 2 > dlm: closing connection to node 1 > GFS2: fsid=: Trying to join cluster "lock_dlm", "mycluster:web" > dlm: Using TCP for communications > Hi, If there was a GFS2 problem, you would ordinarily see errors there, and these messages are all pretty normal. Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From white.heron at yahoo.com Wed Jun 18 03:56:09 2014 From: white.heron at yahoo.com (YB Tan Sri Dato Sri' Adli a.k.a Dell) Date: Tue, 17 Jun 2014 20:56:09 -0700 Subject: [Linux-cluster] 2-node cluster fence loop In-Reply-To: <5399D46A.6080205@alteeve.ca> Message-ID: <1403063769.79975.YahooMailIosMobile@web163503.mail.gq1.yahoo.com> The clustering will only works if you run same operating systems on top of same hardware platform ppc, intel!

Sent from Yahoo Mail for iPhone
-------------- next part -------------- An HTML attachment was scrubbed... URL: From white.heron at yahoo.com Wed Jun 18 04:08:54 2014 From: white.heron at yahoo.com (YB Tan Sri Dato Sri' Adli a.k.a Dell) Date: Tue, 17 Jun 2014 21:08:54 -0700 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9F33@HN-MBX-02.BANK.MB.GROUP> Message-ID: <1403064534.18434.YahooMailIosMobile@web163505.mail.gq1.yahoo.com> The clustering will only works if you enable ssl between two nodes and allow root access persistent connection.

Sent from Yahoo Mail for iPhone
-------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jun 18 04:18:16 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 18 Jun 2014 00:18:16 -0400 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> Message-ID: <53A11308.2040504@alteeve.ca> On 16/06/14 07:43 AM, Le Trung Kien wrote: > Hello everyone, > > I'm a new man on linux cluster. I have built a two-node cluster (without qdisk), includes: > > Redhat 6.4 > cman > pacemaker > gfs2 > > My cluster could fail-over (back and forth) between two nodes for these 3 resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on /mnt/gfs2_storage), WebSite ( apache service) > > My problem occurs when I stop/start node in the following order: (when both nodes started) > > 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all resources still working on node2 > 2. Stop: node2 (stop service: pacemaker then cman) -> all resources stop (of course) > 3. Start: node1 (start service: cman then pacemaker) -> only ClusterIP started, WebFS failed, WebSite not started > > Status: > > Last updated: Mon Jun 16 18:34:56 2014 > Last change: Mon Jun 16 14:24:54 2014 via cibadmin on server1 > Stack: cman > Current DC: server1 - partition WITHOUT quorum > Version: 1.1.8-7.el6-394e906 > 2 Nodes configured, 1 expected votes > 4 Resources configured. > > Online: [ server1 ] > OFFLINE: [ server2 ] > > ClusterIP (ocf::heartbeat:IPaddr2): Started server1 > WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED > > Failed actions: > WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): unknown error > > Here is my /etc/cluster/cluster.conf > > > > > > > > > > > > > > > > > > > > > > > > > Here is my: crm configure show > > stonith-enabled=false \ Well this is a problem. When cman detects a failure (well corosync, but cman is told), it initiates a fence request. The fence daemon informs DLM with blocks. Then fenced calls the configured 'fence_pcmk', which just passes the request up to pacemaker. Without stonith configured in fencing, pacemaker will fail to fence, of course. Thus, DLM sits blocked, so DRBD (and clustered LVM) hang, by design. If configure proper fencing in pacemaker (and test it to make sure it works), then pacemaker *would* succeed in fencing and return a success to fence_pcmk. Then fenced is told that the fence succeeds, DLM cleans up lost locks and returns to normal operation. So please configure and test real stonith in pacemaker and see if your problem is resolved. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From white.heron at yahoo.com Wed Jun 18 18:20:05 2014 From: white.heron at yahoo.com (YB Tan Sri Dato Sri' Adli a.k.a Dell) Date: Wed, 18 Jun 2014 11:20:05 -0700 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: Message-ID: <1403115605.19689.YahooMailIosMobile@web163502.mail.gq1.yahoo.com> Hi,

The linux clustering will be only working perfectly if you run the linux operating systems between nodes. Allow root ssh persistent connection on top of same specifications hardware platform.

To perform test or proof of concept, you may allow to run and configure between two nodes.

The databases for clustering will be configure right after the two nodes linux operating systems run with persistent root access ssh connection.

Sent from Yahoo Mail for iPhone
-------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jun 18 18:32:39 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 18 Jun 2014 14:32:39 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <1403115605.19689.YahooMailIosMobile@web163502.mail.gq1.yahoo.com> References: <1403115605.19689.YahooMailIosMobile@web163502.mail.gq1.yahoo.com> Message-ID: <53A1DB47.5040101@alteeve.ca> On 18/06/14 02:20 PM, YB Tan Sri Dato Sri' Adli a.k.a Dell wrote: > Hi, > > The linux clustering will be only working perfectly if you run the linux > operating systems between nodes. Allow root ssh persistent connection on > top of same specifications hardware platform. > > To perform test or proof of concept, you may allow to run and configure > between two nodes. > > The databases for clustering will be configure right after the two nodes > linux operating systems run with persistent root access ssh connection. > > Sent from Yahoo Mail for iPhone You have said this a couple times now, and I am not sure why. There is no need to have persistent, root access SSH between nodes. It's helpful in some cases, sure, but certainly not required. Corosync, which provides cluster membership and communication, handles internode traffic itself, on it's own TCP port (using multicast by default or unicast if configured). There is also nothing restricting you to two nodes. It's a good configuration, and one I use personally, but there are many 3+ node clusters out there. As for a database cluster, that would depend entirely on which database you are using and whether you are using tools specific for that DB or a more generic HA stack like corosync + pacemaker. Cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From kienlt at mbbank.com.vn Thu Jun 19 01:51:12 2014 From: kienlt at mbbank.com.vn (Le Trung Kien) Date: Thu, 19 Jun 2014 01:51:12 +0000 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <53A11308.2040504@alteeve.ca> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <53A11308.2040504@alteeve.ca> Message-ID: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0FA5CB@HN-MBX-02.BANK.MB.GROUP> Hi, As Digimer suggested, I change property stonith-enabled=true But now I don't know which fencing method I should use, because my two Redhat nodes running on VMWare Workstation, OpenFiler as SCSI shared LUN storage. I attempted to use "fence_scsi", but no luck, I got this error: Jun 19 08:35:58 server1 stonith_admin[3837]: notice: crm_log_args: Invoked: stonith_admin --reboot server2 --tolerance 5s Jun 19 08:36:08 server1 root: fence_pcmk[3836]: Call to fence server2 (reset) failed with rc=255 Here is my fencing configuration: And the log: /tmp/fence_scsi.log show: Jun 18 19:49:40 fence_scsi: [error] no devices found I will try "vmware_soap" to see if it works. Kien Le -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer Sent: Wednesday, June 18, 2014 11:18 AM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing On 16/06/14 07:43 AM, Le Trung Kien wrote: > Hello everyone, > > I'm a new man on linux cluster. 
I have built a two-node cluster (without qdisk), includes: > > Redhat 6.4 > cman > pacemaker > gfs2 > > My cluster could fail-over (back and forth) between two nodes for > these 3 resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on > /mnt/gfs2_storage), WebSite ( apache service) > > My problem occurs when I stop/start node in the following order: (when > both nodes started) > > 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all > resources still working on node2 2. Stop: node2 (stop service: > pacemaker then cman) -> all resources stop (of course) 3. Start: node1 > (start service: cman then pacemaker) -> only ClusterIP started, WebFS > failed, WebSite not started > > Status: > > Last updated: Mon Jun 16 18:34:56 2014 Last change: Mon Jun 16 > 14:24:54 2014 via cibadmin on server1 > Stack: cman > Current DC: server1 - partition WITHOUT quorum > Version: 1.1.8-7.el6-394e906 > 2 Nodes configured, 1 expected votes > 4 Resources configured. > > Online: [ server1 ] > OFFLINE: [ server2 ] > > ClusterIP (ocf::heartbeat:IPaddr2): Started server1 > WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED > > Failed actions: > WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): > unknown error > > Here is my /etc/cluster/cluster.conf > > > > > > > > > > > > > > > > > > > > > > > > > Here is my: crm configure show > > stonith-enabled=false \ Well this is a problem. When cman detects a failure (well corosync, but cman is told), it initiates a fence request. The fence daemon informs DLM with blocks. Then fenced calls the configured 'fence_pcmk', which just passes the request up to pacemaker. Without stonith configured in fencing, pacemaker will fail to fence, of course. Thus, DLM sits blocked, so DRBD (and clustered LVM) hang, by design. If configure proper fencing in pacemaker (and test it to make sure it works), then pacemaker *would* succeed in fencing and return a success to fence_pcmk. Then fenced is told that the fence succeeds, DLM cleans up lost locks and returns to normal operation. So please configure and test real stonith in pacemaker and see if your problem is resolved. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Thu Jun 19 02:01:35 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 18 Jun 2014 22:01:35 -0400 Subject: [Linux-cluster] Two-node cluster GFS2 confusing In-Reply-To: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0FA5CB@HN-MBX-02.BANK.MB.GROUP> References: <3D6C1B8E3C47614AAE3227D507C4ECCD3A0F9B22@HN-MBX-02.BANK.MB.GROUP> <53A11308.2040504@alteeve.ca> <3D6C1B8E3C47614AAE3227D507C4ECCD3A0FA5CB@HN-MBX-02.BANK.MB.GROUP> Message-ID: <53A2447F.7060705@alteeve.ca> I don't use VMware myself, but I think fence_vmware will work for you. Please note that simply enabling stonith is not enough. As you realize, you need a configured and working fence method. If you try using the command line, you can play with the command's switched asking for 'status'. When that returns properly, you will then just need to convert the switches into arguments for pacemaker. Read the man page for 'fence_vmware', and then try calling: fence_vmware ... -o status Fill in the switches and values you need based on the instructions in 'man fence_vmware'. 
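As a rough sketch only (the vCenter address, credentials and VM name below are placeholders, not values from this thread), a status test with the SOAP-based agent shipped in fence-agents might look like:

    fence_vmware_soap -a vcenter.example.com -z -l fenceuser -p secret -n kien-node2 -o status

If that prints the virtual machine's power state, the same parameters should carry over into the pacemaker stonith resource definition.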
digimer On 18/06/14 09:51 PM, Le Trung Kien wrote: > Hi, > > As Digimer suggested, I change property > > stonith-enabled=true > > But now I don't know which fencing method I should use, because my two Redhat nodes running on VMWare Workstation, OpenFiler as SCSI shared LUN storage. > > I attempted to use "fence_scsi", but no luck, I got this error: > > Jun 19 08:35:58 server1 stonith_admin[3837]: notice: crm_log_args: Invoked: stonith_admin --reboot server2 --tolerance 5s > Jun 19 08:36:08 server1 root: fence_pcmk[3836]: Call to fence server2 (reset) failed with rc=255 > > Here is my fencing configuration: > > > > > > > > > > > > > > > > > > > > > > > > > > > And the log: /tmp/fence_scsi.log show: > > Jun 18 19:49:40 fence_scsi: [error] no devices found > > I will try "vmware_soap" to see if it works. > > Kien Le > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer > Sent: Wednesday, June 18, 2014 11:18 AM > To: linux clustering > Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing > > On 16/06/14 07:43 AM, Le Trung Kien wrote: >> Hello everyone, >> >> I'm a new man on linux cluster. I have built a two-node cluster (without qdisk), includes: >> >> Redhat 6.4 >> cman >> pacemaker >> gfs2 >> >> My cluster could fail-over (back and forth) between two nodes for >> these 3 resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on >> /mnt/gfs2_storage), WebSite ( apache service) >> >> My problem occurs when I stop/start node in the following order: (when >> both nodes started) >> >> 1. Stop: node1 (shutdown) -> all resource fail-over on node2 -> all >> resources still working on node2 2. Stop: node2 (stop service: >> pacemaker then cman) -> all resources stop (of course) 3. Start: node1 >> (start service: cman then pacemaker) -> only ClusterIP started, WebFS >> failed, WebSite not started >> >> Status: >> >> Last updated: Mon Jun 16 18:34:56 2014 Last change: Mon Jun 16 >> 14:24:54 2014 via cibadmin on server1 >> Stack: cman >> Current DC: server1 - partition WITHOUT quorum >> Version: 1.1.8-7.el6-394e906 >> 2 Nodes configured, 1 expected votes >> 4 Resources configured. >> >> Online: [ server1 ] >> OFFLINE: [ server2 ] >> >> ClusterIP (ocf::heartbeat:IPaddr2): Started server1 >> WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED >> >> Failed actions: >> WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): >> unknown error >> >> Here is my /etc/cluster/cluster.conf >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Here is my: crm configure show >> > > > >> stonith-enabled=false \ > > Well this is a problem. > > When cman detects a failure (well corosync, but cman is told), it initiates a fence request. The fence daemon informs DLM with blocks. > Then fenced calls the configured 'fence_pcmk', which just passes the request up to pacemaker. > > Without stonith configured in fencing, pacemaker will fail to fence, of course. Thus, DLM sits blocked, so DRBD (and clustered LVM) hang, by design. > > If configure proper fencing in pacemaker (and test it to make sure it works), then pacemaker *would* succeed in fencing and return a success to fence_pcmk. Then fenced is told that the fence succeeds, DLM cleans up lost locks and returns to normal operation. > > So please configure and test real stonith in pacemaker and see if your problem is resolved. 
> > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From ccaulfie at redhat.com Thu Jun 19 10:02:58 2014 From: ccaulfie at redhat.com (Christine Caulfield) Date: Thu, 19 Jun 2014 11:02:58 +0100 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> <53A03763.4080905@redhat.com> Message-ID: <53A2B552.1000609@redhat.com> On 17/06/14 15:27, Schaefer, Micah wrote: > I am running Red Hat 6.4 with the HA/ load balancing packages from the > install DVD. > > > -bash-4.1$ cat /etc/redhat-release > Red Hat Enterprise Linux Server release 6.4 (Santiago) > > -bash-4.1$ corosync -v > Corosync Cluster Engine, version '1.4.1' > Copyright (c) 2006-2009 Red Hat, Inc. > > Thanks. 6.5 has better pause detection in it but I don't think that's the issue here actually. It looks to me like some messages are getting through but not others. So I'm back to seriously wondering if multicast traffic is being forwarded correctly and reliably. Having a mix of virtual and physical systems can cause these sorts of issues with real and software switches being mixed. Though I haven't seen anything quite as odd as this to be honest. Can you try either UDPU (preferred) or broadcast transport please and see if that helps or changes the symptoms at all? Broadcast could be problematic itself with the real/virtual mix so UDPU will be a more reliable option. Annoyingly, you'll need to take down the whole cluster to do this, and add to /etc/cluster/cluster.conf on all nodes. Chrissie > > On 6/17/14, 8:41 AM, "Christine Caulfield" wrote: > >> On 12/06/14 20:06, Digimer wrote: >>> Hrm, I'm not really sure that I am able to interpret this without making >>> guesses. I'm cc'ing one of the devs (who I hope will poke the right >>> person if he's not able to help at the moment). Lets see what he has to >>> say. >>> >>> I am curious now, too. :) >>> >>> On 12/06/14 03:02 PM, Schaefer, Micah wrote: >>>> Node4 was fenced again, I was able to get some debug logs (below), a >>>> new >>>> message : >>>> >>>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the >>>> OPERATIONAL >>>> state.? >>>> >>>> >>>> Rest of corosync logs >>>> >>>> http://pastebin.com/iYFbkbhb >>>> >>>> >>>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >>>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >>>> membership and a new membership was formed. >>>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, >>>> flushing membership messages. 
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, >>>> flushing membership messages. >>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, >>>> flushing membership messages. >> >> >> I'm concerned that the pause messages are repeating like that, it looks >> like it might be a fixed bug. What version of corosync do you have? >> >> Chrissie >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > From Micah.Schaefer at jhuapl.edu Thu Jun 19 12:39:20 2014 From: Micah.Schaefer at jhuapl.edu (Schaefer, Micah) Date: Thu, 19 Jun 2014 08:39:20 -0400 Subject: [Linux-cluster] Node is randomly fenced In-Reply-To: <53A2B552.1000609@redhat.com> References: <538F378B.8030407@alteeve.ca> <5398A00A.4020802@alteeve.ca> <5398ADDC.80501@alteeve.ca> <68B234C500D5C34EBCE58D977272260E566C15@inba-mail01.sonusnet.com> <53992A66.4070109@alteeve.ca> <5399D64D.8080301@dbtgroup.com> <5399D6FC.8030800@alteeve.ca> <5399DE77.1030302@alteeve.ca> <5399E391.3060701@alteeve.ca> <5399FA51.2020808@alteeve.ca> <53A03763.4080905@redhat.com> <53A2B552.1000609@redhat.com> Message-ID: I have set the network to udpu. The physical nodes are to replace the virtual nodes. I was planning on decommissioning the virtual nodes when the cluster was stable with the physical nodes. I will also remove the virtual nodes from the cluster and see if it makes any difference. When I was only running the two virtual nodes I did not have any of these issues. On 6/19/14, 6:02 AM, "Christine Caulfield" wrote: >On 17/06/14 15:27, Schaefer, Micah wrote: >> I am running Red Hat 6.4 with the HA/ load balancing packages from the >> install DVD. >> >> >> -bash-4.1$ cat /etc/redhat-release >> Red Hat Enterprise Linux Server release 6.4 (Santiago) >> >> -bash-4.1$ corosync -v >> Corosync Cluster Engine, version '1.4.1' >> Copyright (c) 2006-2009 Red Hat, Inc. >> >> > > >Thanks. 6.5 has better pause detection in it but I don't think that's >the issue here actually. 
It looks to me like some messages are getting >through but not others. So I'm back to seriously wondering if multicast >traffic is being forwarded correctly and reliably. Having a mix of >virtual and physical systems can cause these sorts of issues with real >and software switches being mixed. Though I haven't seen anything quite >as odd as this to be honest. > >Can you try either UDPU (preferred) or broadcast transport please and >see if that helps or changes the symptoms at all? Broadcast could be >problematic itself with the real/virtual mix so UDPU will be a more >reliable option. > >Annoyingly, you'll need to take down the whole cluster to do this, and add > > > >to /etc/cluster/cluster.conf on all nodes. > >Chrissie > > > >> >> On 6/17/14, 8:41 AM, "Christine Caulfield" wrote: >> >>> On 12/06/14 20:06, Digimer wrote: >>>> Hrm, I'm not really sure that I am able to interpret this without >>>>making >>>> guesses. I'm cc'ing one of the devs (who I hope will poke the right >>>> person if he's not able to help at the moment). Lets see what he has >>>>to >>>> say. >>>> >>>> I am curious now, too. :) >>>> >>>> On 12/06/14 03:02 PM, Schaefer, Micah wrote: >>>>> Node4 was fenced again, I was able to get some debug logs (below), a >>>>> new >>>>> message : >>>>> >>>>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the >>>>> OPERATIONAL >>>>> state.? >>>>> >>>>> >>>>> Rest of corosync logs >>>>> >>>>> http://pastebin.com/iYFbkbhb >>>>> >>>>> >>>>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state. >>>>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the >>>>> membership and a new membership was formed. >>>>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0 >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 >>>>>ms, >>>>> flushing membership messages. 
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 >>>>>ms, >>>>> flushing membership messages. >>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 >>>>>ms, >>>>> flushing membership messages. >>> >>> >>> I'm concerned that the pause messages are repeating like that, it looks >>> like it might be a fixed bug. What version of corosync do you have? >>> >>> Chrissie >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From haralambop at gmail.com Thu Jun 19 14:08:11 2014 From: haralambop at gmail.com (Andreas Haralambopoulos) Date: Thu, 19 Jun 2014 17:08:11 +0300 Subject: [Linux-cluster] Openvpn as a service in RGManager Message-ID: <9FA47F25-F865-4577-87B9-BEC1D73079C9@gmail.com> Is it possible to tun in rgmanager a VPN service only in one node? something like this in pacemaker primitive p_openvpn ocf:heartbeat:anything \ params binfile="/usr/sbin/openvpn" cmdline_options="--daemon --writepid /var/run/openvpn.pid --config /data/openvpn/server.conf --cd /data/openvpn" pidfile="/var/run/openvpn.pid" \ op start timeout="20" \ op stop timeout="30" \ op monitor interval="20" \ meta target-role="Started" From yamato at redhat.com Fri Jun 20 02:07:55 2014 From: yamato at redhat.com (Masatake YAMATO) Date: Fri, 20 Jun 2014 11:07:55 +0900 (JST) Subject: [Linux-cluster] Fw: [corosync] wireshark dissector for corosync 1.x srp Message-ID: <20140620.110755.698885983983758743.yamato@redhat.com> If you have a trouble in lower layer communication in cluster 3, wireshark can help you understand it. Masatake YAMATO -------------- next part -------------- An embedded message was scrubbed... From: Masatake YAMATO Subject: [corosync] wireshark dissector for corosync 1.x srp Date: Fri, 20 Jun 2014 11:03:36 +0900 (JST) Size: 4305 URL: From amjadcsu at gmail.com Sun Jun 22 07:55:40 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Sun, 22 Jun 2014 10:55:40 +0300 Subject: [Linux-cluster] fence Agent Message-ID: Hello, I am trying to setup a simple 2 node cluster in active/passive mode for oracle high availability We are using one INSPUR server and one HP proliant (Management decision based on hardware availability) and we are seeing if we can use IPMI as fencing method CCHS though supports HP ILO, DELL IPMI, IBM , but not INSPUR. So the basic question i have is what if we can use fence_ILO (for HP) and fence_ipmilan (For INSPUR)? IF any one have any experience with fence_ipmilan or point to resources , it would really be appreciated. Sincerely, Amjad -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lists at alteeve.ca Sun Jun 22 08:31:23 2014 From: lists at alteeve.ca (Digimer) Date: Sun, 22 Jun 2014 04:31:23 -0400 Subject: [Linux-cluster] fence Agent In-Reply-To: References: Message-ID: <53A6945B.4050804@alteeve.ca> On 22/06/14 03:55 AM, Amjad Syed wrote: > Hello, > > I am trying to setup a simple 2 node cluster in active/passive mode for > oracle high availability > > We are using one INSPUR server and one HP proliant (Management decision > based on hardware availability) and we are seeing if we can use IPMI > as fencing method > > CCHS though supports HP ILO, DELL IPMI, IBM , but not INSPUR. > > So the basic question i have is what if we can use fence_ILO (for HP) > and fence_ipmilan (For INSPUR)? > > IF any one have any experience with fence_ipmilan or point to resources > , it would really be appreciated. > > Sincerely, > Amjad fence_ipmilan works with just about every IPMI-based out of band management interface. Most of those branded ones, like DRAC, RSA, iLO, etc are fundamentally based on IPMI. I've used fence_ipmilan on iLO personally and it's fine. If you can show what 'ipmitool' command you use that can show if the peer is powered on or off, then you should be able to translate it quite easily to a matching fence_ipmilan call (check man fence_ipmilan for the switches). Once you can check the power status of the peer(s) with fence_ipmilan, you're 95% of the way there. cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From amjadcsu at gmail.com Sun Jun 22 14:32:50 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Sun, 22 Jun 2014 17:32:50 +0300 Subject: [Linux-cluster] fence Agent In-Reply-To: <53A6945B.4050804@alteeve.ca> References: <53A6945B.4050804@alteeve.ca> Message-ID: Well , i am running RHEL 6.3 on INSPUR NFS5280 . For some reason the ipmitool and drivers stopped working. While restarting /etc/init.d/ipmi , it would just hang. Is it that ipmitool is not communicating with BMC .? What is the best way to tackle this issue ? Thanks On Sun, Jun 22, 2014 at 11:31 AM, Digimer wrote: > On 22/06/14 03:55 AM, Amjad Syed wrote: > >> Hello, >> >> I am trying to setup a simple 2 node cluster in active/passive mode for >> oracle high availability >> >> We are using one INSPUR server and one HP proliant (Management decision >> based on hardware availability) and we are seeing if we can use IPMI >> as fencing method >> >> CCHS though supports HP ILO, DELL IPMI, IBM , but not INSPUR. >> >> So the basic question i have is what if we can use fence_ILO (for HP) >> and fence_ipmilan (For INSPUR)? >> >> IF any one have any experience with fence_ipmilan or point to resources >> , it would really be appreciated. >> >> Sincerely, >> Amjad >> > > fence_ipmilan works with just about every IPMI-based out of band > management interface. Most of those branded ones, like DRAC, RSA, iLO, etc > are fundamentally based on IPMI. I've used fence_ipmilan on iLO personally > and it's fine. > > If you can show what 'ipmitool' command you use that can show if the peer > is powered on or off, then you should be able to translate it quite easily > to a matching fence_ipmilan call (check man fence_ipmilan for the > switches). Once you can check the power status of the peer(s) with > fence_ipmilan, you're 95% of the way there. > > cheers > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? 
> > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vasil.val at gmail.com Mon Jun 23 18:09:48 2014 From: vasil.val at gmail.com (Vasil Valchev) Date: Mon, 23 Jun 2014 21:09:48 +0300 Subject: [Linux-cluster] Online change of fence device options - possible? Message-ID: Hello, I have a RHEL 6.5 cluster, using rgmanager. The fence devices are fence_ipmilan - fencing through HP iLO4. The issue is the fence devices weren't configured entirely correct - recently after a node failure, the fence agent was returning failures (even though it was fencing the node successfully), which apparently can be avoided by setting the power_wait option to the fence dev configuration. My question is - after changing the fence device (I think directly through the .conf will be fine?), iterating the config version, and syncing the .conf through the cluster software - is something else necessary to apply the change (eg. cman reload)? Will the new fence option be used the next time a fencing action is performed? And lastly can all of this be performed while the cluster and services are operational or they have to be stopped/restarted? Regards, Vasil -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Mon Jun 23 18:16:37 2014 From: lists at alteeve.ca (Digimer) Date: Mon, 23 Jun 2014 14:16:37 -0400 Subject: [Linux-cluster] Online change of fence device options - possible? In-Reply-To: References: Message-ID: <53A86F05.6090901@alteeve.ca> On 23/06/14 02:09 PM, Vasil Valchev wrote: > Hello, > > I have a RHEL 6.5 cluster, using rgmanager. > The fence devices are fence_ipmilan - fencing through HP iLO4. > > The issue is the fence devices weren't configured entirely correct - > recently after a node failure, the fence agent was returning failures > (even though it was fencing the node successfully), which apparently can > be avoided by setting the power_wait option to the fence dev configuration. > > My question is - after changing the fence device (I think directly > through the .conf will be fine?), iterating the config version, and > syncing the .conf through the cluster software - is something else > necessary to apply the change (eg. cman reload)? > > Will the new fence option be used the next time a fencing action is > performed? > > And lastly can all of this be performed while the cluster and services > are operational or they have to be stopped/restarted? > > > Regards, > Vasil This should be fine. As you said; Update the fence config, increment the config_version, save and exit. Run 'ccs_config_validate' and if that passes, 'cman_tool version -r'. Note that for this to work, you need to have set the 'ricci' user's shell password as well as have the 'ricci' and 'modclusterd' daemons running. Once done, run 'fence_check'[1] to verify that the fence config works (it makes a status call to check). If that works, you're good to go. You can also crontab the fence_check call and have it email you or something so that you can catch fence failures earlier. digimer 1. https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_Fence_check_to_Verify_our_Fencing_Config -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
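A condensed sketch of the update sequence described above, with the power_wait value, device name and version number purely illustrative:

  # on one node, edit /etc/cluster/cluster.conf:
  #   - add power_wait="5" to the fence_ipmilan <fencedevice .../> entry
  #   - bump the version, e.g. <cluster name="prodclu" config_version="29">
  ccs_config_validate
  cman_tool version -r    # pushes the new config to the other nodes via ricci
  fence_check             # status-only check of every configured fence device

cman_tool version -r relies on the ricci password being set and the ricci and modclusterd daemons running on every node, as noted above; fence_check only issues status calls, so nothing is actually fenced while you verify.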
From lists at alteeve.ca Mon Jun 23 18:19:56 2014 From: lists at alteeve.ca (Digimer) Date: Mon, 23 Jun 2014 14:19:56 -0400 Subject: [Linux-cluster] Online change of fence device options - possible? In-Reply-To: <53A86F05.6090901@alteeve.ca> References: <53A86F05.6090901@alteeve.ca> Message-ID: <53A86FCC.3010607@alteeve.ca> On 23/06/14 02:16 PM, Digimer wrote: > On 23/06/14 02:09 PM, Vasil Valchev wrote: >> Hello, >> >> I have a RHEL 6.5 cluster, using rgmanager. >> The fence devices are fence_ipmilan - fencing through HP iLO4. >> >> The issue is the fence devices weren't configured entirely correct - >> recently after a node failure, the fence agent was returning failures >> (even though it was fencing the node successfully), which apparently can >> be avoided by setting the power_wait option to the fence dev >> configuration. >> >> My question is - after changing the fence device (I think directly >> through the .conf will be fine?), iterating the config version, and >> syncing the .conf through the cluster software - is something else >> necessary to apply the change (eg. cman reload)? >> >> Will the new fence option be used the next time a fencing action is >> performed? >> >> And lastly can all of this be performed while the cluster and services >> are operational or they have to be stopped/restarted? >> >> >> Regards, >> Vasil > > This should be fine. As you said; Update the fence config, increment the > config_version, save and exit. Run 'ccs_config_validate' and if that > passes, 'cman_tool version -r'. Note that for this to work, you need to > have set the 'ricci' user's shell password as well as have the 'ricci' > and 'modclusterd' daemons running. > > Once done, run 'fence_check'[1] to verify that the fence config works > (it makes a status call to check). If that works, you're good to go. > > You can also crontab the fence_check call and have it email you or > something so that you can catch fence failures earlier. > > digimer > > 1. > https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_Fence_check_to_Verify_our_Fencing_Config I should clarify; You can update the config while the cluster is online. No fences will be called and you do not need to restart anything. cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From amjadcsu at gmail.com Tue Jun 24 10:32:30 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Tue, 24 Jun 2014 13:32:30 +0300 Subject: [Linux-cluster] Error in Cluster.conf Message-ID: Hello I am getting the following error when i run ccs_config_Validate ccs_config_validate Relax-NG validity error : Extra element clusternodes in interleave tempfile:12: element clusternodes: Relax-NG validity error : Element cluster failed to validate content Configuration fails to validate Here is my cluster.conf file Any help would be appreciated -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Tue Jun 24 11:56:52 2014 From: fdinitto at redhat.com (Fabio M. 
Di Nitto) Date: Tue, 24 Jun 2014 13:56:52 +0200 Subject: [Linux-cluster] Error in Cluster.conf In-Reply-To: References: Message-ID: <53A96784.3030009@redhat.com> On 6/24/2014 12:32 PM, Amjad Syed wrote: > Hello > > I am getting the following error when i run ccs_config_Validate > > ccs_config_validate > Relax-NG validity error : Extra element clusternodes in interleave You defined tempfile:12: element clusternodes: Relax-NG validity error : Element > cluster failed to validate content > Configuration fails to validate > > Here is my cluster.conf file > > > > > > > > > > > login="ADMIN" name="inspuripmi" passwd="abc123"/> > login="test" name="hpipmi" passwd="abc12345"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Any help would be appreciated > > > > From jpokorny at redhat.com Tue Jun 24 12:55:00 2014 From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=) Date: Tue, 24 Jun 2014 14:55:00 +0200 Subject: [Linux-cluster] Error in Cluster.conf In-Reply-To: <53A96784.3030009@redhat.com> References: <53A96784.3030009@redhat.com> Message-ID: <20140624125500.GA1425@redhat.com> On 24/06/14 13:56 +0200, Fabio M. Di Nitto wrote: > On 6/24/2014 12:32 PM, Amjad Syed wrote: >> Hello >> >> I am getting the following error when i run ccs_config_Validate >> >> ccs_config_validate >> Relax-NG validity error : Extra element clusternodes in interleave > > You defined cluster.conf:13:47: error: > element "fencedvice" not allowed anywhere; expected the element > end-tag or element "fencedevice" > cluster.conf:15:23: error: > element "clusternodes" not allowed here; expected the element > end-tag or element "clvmd", "dlm", "fence_daemon", "fence_xvmd", > "gfs_controld", "group", "logging", "quorumd", "rm", "totem" or > "uidgid" > cluster.conf:26:76: error: > IDREF "fence_node2" without matching ID > cluster.conf:19:77: error: > IDREF "fence_node1" without matching ID So it spotted also: - a typo in "fencedvice" - broken referential integrity; it is prescribed "name" attribute of "device" tag should match a "name" of a defined "fencedevice" Hope this helps. -- Jan > Fabio > >> tempfile:12: element clusternodes: Relax-NG validity error : Element >> cluster failed to validate content >> Configuration fails to validate >> >> Here is my cluster.conf file >> >> >> >> >> >> >> >> >> >> >> > login="ADMIN" name="inspuripmi" passwd="abc123"/> >> > login="test" name="hpipmi" passwd="abc12345"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Any help would be appreciated -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 819 bytes Desc: not available URL: From lists at alteeve.ca Tue Jun 24 15:46:38 2014 From: lists at alteeve.ca (Digimer) Date: Tue, 24 Jun 2014 11:46:38 -0400 Subject: [Linux-cluster] Error in Cluster.conf In-Reply-To: <20140624125500.GA1425@redhat.com> References: <53A96784.3030009@redhat.com> <20140624125500.GA1425@redhat.com> Message-ID: <53A99D5E.2080508@alteeve.ca> On 24/06/14 08:55 AM, Jan Pokorn? wrote: > On 24/06/14 13:56 +0200, Fabio M. 
Di Nitto wrote: >> On 6/24/2014 12:32 PM, Amjad Syed wrote: >>> Hello >>> >>> I am getting the following error when i run ccs_config_Validate >>> >>> ccs_config_validate >>> Relax-NG validity error : Extra element clusternodes in interleave >> >> You defined > That + the are more issues discoverable by more powerful validator > jing (packaged in Fedora and RHEL 7, for instance, admittedly not > for RHEL 6/EPEL): > > $ jing cluster.rng cluster.conf >> cluster.conf:13:47: error: >> element "fencedvice" not allowed anywhere; expected the element >> end-tag or element "fencedevice" >> cluster.conf:15:23: error: >> element "clusternodes" not allowed here; expected the element >> end-tag or element "clvmd", "dlm", "fence_daemon", "fence_xvmd", >> "gfs_controld", "group", "logging", "quorumd", "rm", "totem" or >> "uidgid" >> cluster.conf:26:76: error: >> IDREF "fence_node2" without matching ID >> cluster.conf:19:77: error: >> IDREF "fence_node1" without matching ID > > So it spotted also: > - a typo in "fencedvice" > - broken referential integrity; it is prescribed "name" attribute > of "device" tag should match a "name" of a defined "fencedevice" > > Hope this helps. > > -- Jan Also, without fence methods defined for the nodes, rgmanager will block the first time there is an issue. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From amjadcsu at gmail.com Tue Jun 24 18:44:02 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Tue, 24 Jun 2014 21:44:02 +0300 Subject: [Linux-cluster] Error in Cluster.conf In-Reply-To: <53A99D5E.2080508@alteeve.ca> References: <53A96784.3030009@redhat.com> <20140624125500.GA1425@redhat.com> <53A99D5E.2080508@alteeve.ca> Message-ID: I have updated the config file , validated by ccs_config_validate Added the fence_daemon and post_join_delay. I am using bonding using ethernet coaxial cable. But for some reason whenever i start CMAN on node, it fences (kicks the other node). As a result at a time only one node is online . Do i need to use multicast to get both nodes online at same instance ?. or i am missing something here ? Now the file looks like this : ?xml version="1.0"?> Thanks On Tue, Jun 24, 2014 at 6:46 PM, Digimer wrote: > On 24/06/14 08:55 AM, Jan Pokorn? wrote: > >> On 24/06/14 13:56 +0200, Fabio M. 
Di Nitto wrote: >> >>> On 6/24/2014 12:32 PM, Amjad Syed wrote: >>> >>>> Hello >>>> >>>> I am getting the following error when i run ccs_config_Validate >>>> >>>> ccs_config_validate >>>> Relax-NG validity error : Extra element clusternodes in interleave >>>> >>> >>> You defined >> >> >> That + the are more issues discoverable by more powerful validator >> jing (packaged in Fedora and RHEL 7, for instance, admittedly not >> for RHEL 6/EPEL): >> >> $ jing cluster.rng cluster.conf >> >>> cluster.conf:13:47: error: >>> element "fencedvice" not allowed anywhere; expected the element >>> end-tag or element "fencedevice" >>> cluster.conf:15:23: error: >>> element "clusternodes" not allowed here; expected the element >>> end-tag or element "clvmd", "dlm", "fence_daemon", "fence_xvmd", >>> "gfs_controld", "group", "logging", "quorumd", "rm", "totem" or >>> "uidgid" >>> cluster.conf:26:76: error: >>> IDREF "fence_node2" without matching ID >>> cluster.conf:19:77: error: >>> IDREF "fence_node1" without matching ID >>> >> >> So it spotted also: >> - a typo in "fencedvice" >> - broken referential integrity; it is prescribed "name" attribute >> of "device" tag should match a "name" of a defined "fencedevice" >> >> Hope this helps. >> >> -- Jan >> > > Also, without fence methods defined for the nodes, rgmanager will block > the first time there is an issue. > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eivind at aminor.no Sun Jun 29 23:48:43 2014 From: eivind at aminor.no (Eivind Olsen) Date: Mon, 30 Jun 2014 01:48:43 +0200 Subject: [Linux-cluster] Which fence agents to use? Message-ID: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no> Hello. I am currently planning a 2-node cluster based on RHEL 6.5 and the high availability addon, with the goal of running Oracle 11g in active/passive failover mode. The cluster nodes will be physical HP blades, and they will have shared storage for the Oracle data-files on a FC-SAN. That is, shared block device but using HA LVM so only mounting the filesystem on one node at a time. The way I see it, my fence options are fence_ipmilan but I could also look at fence_scsi. Should I use only one or both of these? If both: in what order? Regards Eivind Olsen From raju.rajsand at gmail.com Mon Jun 30 06:22:37 2014 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Mon, 30 Jun 2014 11:52:37 +0530 Subject: [Linux-cluster] Which fence agents to use? In-Reply-To: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no> References: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no> Message-ID: Greetings, On Mon, Jun 30, 2014 at 5:18 AM, Eivind Olsen wrote: > The cluster nodes will be physical HP blades, and they will have shared storage for the Oracle data-files on a FC-SAN. That is, shared block device but using HA LVM so only mounting the filesystem on one node at a time. > HP ILO should help. As a secondary you can use the FC-SAN Fencing HTH Regards -- Regards, Rajagopal From amjadcsu at gmail.com Mon Jun 30 06:47:20 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Mon, 30 Jun 2014 09:47:20 +0300 Subject: [Linux-cluster] Which fence agents to use? 
In-Reply-To:
References: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no>
Message-ID:

Hi, last week I implemented a 2-node cluster on HP ProLiant servers. I used HP iLO; power-based fencing agents are preferred.

On 30 Jun 2014 09:28, "Rajagopal Swaminathan" wrote:

> Greetings,
>
> On Mon, Jun 30, 2014 at 5:18 AM, Eivind Olsen wrote:
> > The cluster nodes will be physical HP blades, and they will have shared storage for the Oracle data-files on a FC-SAN. That is, shared block device but using HA LVM so only mounting the filesystem on one node at a time.
>
> HP ILO should help.
>
> As a secondary you can use the FC-SAN Fencing
>
> HTH
>
> Regards
>
> --
> Regards,
>
> Rajagopal
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ekuric at redhat.com Mon Jun 30 08:14:54 2014
From: ekuric at redhat.com (Elvir Kuric)
Date: Mon, 30 Jun 2014 10:14:54 +0200
Subject: [Linux-cluster] Which fence agents to use?
In-Reply-To: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no>
References: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no>
Message-ID: <53B11C7E.4070903@redhat.com>

On 06/30/2014 01:48 AM, Eivind Olsen wrote:
> Hello.
>
> I am currently planning a 2-node cluster based on RHEL 6.5 and the high availability addon, with the goal of running Oracle 11g in active/passive failover mode.
> The cluster nodes will be physical HP blades

Which model, which generation, and which iLO version? If iLO3/iLO4 (which is the case on most recent blade/ProLiant models), then fence_ipmilan is the recommended way. Here is the full list of fencing agents supported with RHEL: https://access.redhat.com/site/articles/28603 (you have to have Red Hat customer portal access to see it - I guess you have it as a Red Hat customer).

> , and they will have shared storage for the Oracle data-files on a FC-SAN. That is, shared block device but using HA LVM so only mounting the filesystem on one node at a time.

OK!

> The way I see it, my fence options are fence_ipmilan but I could also look at fence_scsi. Should I use only one or both of these?

You can use either, or a combination of them. Be aware that with fence_scsi (if used as the only fencing method) the cluster node is only cut off from the shared storage - there is no power restart, and you will need to restart it manually. This can be overcome by implementing power fencing (fence_ipmilan) to restart the machine once it has an issue.

> If both: in what order?

When configured properly, fence_ipmilan will do the job; adding a second fencing method (fence_scsi) introduces additional complexity in the cluster configuration - harder to debug and maintain. IMHO, fence_ipmilan is a good choice.

> Regards
> Eivind Olsen
>

From eivind at aminor.no Mon Jun 30 10:19:47 2014
From: eivind at aminor.no (Eivind Olsen)
Date: Mon, 30 Jun 2014 12:19:47 +0200
Subject: [Linux-cluster] Which fence agents to use?
In-Reply-To: <53B11C7E.4070903@redhat.com>
References: <183CC201-289D-4406-945C-7FB836FDC0BA@aminor.no> <53B11C7E.4070903@redhat.com>
Message-ID:

Elvir Kuric wrote:
> Which model, which generation, and which iLO version?
> If iLO3/iLO4 (which is the case on most recent blade/ProLiant models),
> then fence_ipmilan is the recommended way.

ProLiant BL460c Gen8, with some version of iLO 4.
> Here is the full list of fencing agents supported with RHEL:
> https://access.redhat.com/site/articles/28603 (you have to have Red Hat
> customer portal access to see it - I guess you have it as a Red Hat
> customer).

Yes, I've seen that article, and that's why I thought I'd go with fence_ipmilan and not fence_hpblade (which exists but isn't on the list in that article).

> When configured properly, fence_ipmilan will do the job; adding a second
> fencing method (fence_scsi) introduces additional complexity in the
> cluster configuration - harder to debug and maintain. IMHO, fence_ipmilan
> is a good choice.

Ah, OK. I'll keep the configuration simpler then, and not bother with fence_scsi :)

Thanks!

Regards
Eivind Olsen
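For reference, a sketch of what the fencing part of cluster.conf might look like for an iLO4-based setup with fence_ipmilan; node names, addresses and credentials below are placeholders, not values from this thread:

  <clusternode name="node1.example.com" nodeid="1">
    <fence>
      <method name="ipmi">
        <device name="ipmi_node1" action="reboot"/>
      </method>
    </fence>
  </clusternode>

  <fencedevices>
    <!-- lanplus="1" is what fence_ipmilan needs for iLO3/iLO4-class BMCs;
         power_wait makes the agent pause after the power action before re-checking status -->
    <fencedevice agent="fence_ipmilan" name="ipmi_node1" ipaddr="10.0.0.11" login="fenceuser" passwd="secret" lanplus="1" power_wait="5"/>
  </fencedevices>

The name in each node's <device .../> entry must match a <fencedevice> name exactly (the IDREF validation errors earlier in this archive came from that kind of mismatch), and the setup can be exercised non-destructively with fence_check or a manual fence_ipmilan -o status call before it is relied on.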