[Linux-cluster] Node is randomly fenced

Schaefer, Micah Micah.Schaefer at jhuapl.edu
Thu Jun 19 12:39:20 UTC 2014


I have set the cluster transport to UDPU. The physical nodes are meant to
replace the virtual nodes; I was planning to decommission the virtual
nodes once the cluster was stable with the physical nodes.
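
For anyone wanting to confirm the transport setting on each node, here is
a minimal sketch in Python. It assumes the setting is the `transport`
attribute on the `<cman>` element of cluster.conf, as suggested in the
quoted reply below. The function name `cman_transport` and the sample
configuration (cluster name, node names, config_version) are hypothetical
and only there for illustration.

```python
import xml.etree.ElementTree as ET


def cman_transport(conf_xml: str):
    """Return the transport attribute of the <cman> element, or None if unset."""
    root = ET.fromstring(conf_xml)
    cman = root.find("cman")  # direct child of <cluster>
    return None if cman is None else cman.get("transport")


# Hypothetical, stripped-down cluster.conf; a real file also carries
# fencing and resource configuration.
SAMPLE = """<?xml version="1.0"?>
<cluster name="example" config_version="2">
  <cman transport="udpu"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
  </clusternodes>
</cluster>"""

if __name__ == "__main__":
    print(cman_transport(SAMPLE))
```

To check a real node, read the contents of /etc/cluster/cluster.conf and
pass them to the function; a result of None would mean no transport
attribute is set on `<cman>`.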

I will also remove the virtual nodes from the cluster and see if it makes
any difference. When I was only running the two virtual nodes I did not
have any of these issues.




On 6/19/14, 6:02 AM, "Christine Caulfield" <ccaulfie at redhat.com> wrote:

>On 17/06/14 15:27, Schaefer, Micah wrote:
>> I am running Red Hat 6.4 with the HA/load balancing packages from the
>> install DVD.
>>
>>
>> -bash-4.1$ cat /etc/redhat-release
>> Red Hat Enterprise Linux Server release 6.4 (Santiago)
>>
>> -bash-4.1$ corosync -v
>> Corosync Cluster Engine, version '1.4.1'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>>
>>
>
>
>Thanks. 6.5 has better pause detection in it but I don't think that's
>the issue here actually. It looks to me like some messages are getting
>through but not others. So I'm back to seriously wondering if multicast
>traffic is being forwarded correctly and reliably. Having a mix of
>virtual and physical systems can cause these sorts of issues with real
>and software switches being mixed. Though I haven't seen anything quite
>as odd as this to be honest.
>
>Can you try either UDPU (preferred) or broadcast transport please and
>see if that helps or changes the symptoms at all? Broadcast could be
>problematic itself with the real/virtual mix so UDPU will be a more
>reliable option.
>
>Annoyingly, you'll need to take down the whole cluster to do this, and add
>
><cman transport="udpu"/>
>
>to /etc/cluster/cluster.conf on all nodes.
>
>Chrissie
>
>
>
>>
>> On 6/17/14, 8:41 AM, "Christine Caulfield" <ccaulfie at redhat.com> wrote:
>>
>>> On 12/06/14 20:06, Digimer wrote:
>>>> Hrm, I'm not really sure that I am able to interpret this without
>>>> making guesses. I'm cc'ing one of the devs (who I hope will poke the
>>>> right person if he's not able to help at the moment). Let's see what
>>>> he has to say.
>>>>
>>>> I am curious now, too. :)
>>>>
>>>> On 12/06/14 03:02 PM, Schaefer, Micah wrote:
>>>>> Node4 was fenced again. I was able to get some debug logs (below),
>>>>> including a new message:
>>>>>
>>>>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL state."
>>>>>
>>>>>
>>>>> Rest of corosync logs
>>>>>
>>>>> http://pastebin.com/iYFbkbhb
>>>>>
>>>>>
>>>>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
>>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages.
>>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages.
>>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, flushing membership messages.
>>>
>>>
>>> I'm concerned that the pause messages are repeating like that; it looks
>>> like it might be a bug that has since been fixed. What version of
>>> corosync do you have?
>>>
>>> Chrissie
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>




