[Linux-cluster] Node is randomly fenced
Christine Caulfield
ccaulfie at redhat.com
Thu Jun 19 10:02:58 UTC 2014
On 17/06/14 15:27, Schaefer, Micah wrote:
> I am running Red Hat 6.4 with the HA/load balancing packages from the
> install DVD.
>
>
> -bash-4.1$ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 6.4 (Santiago)
>
> -bash-4.1$ corosync -v
> Corosync Cluster Engine, version '1.4.1'
> Copyright (c) 2006-2009 Red Hat, Inc.
>
>
Thanks. 6.5 has better pause detection in it but I don't think that's
the issue here actually. It looks to me like some messages are getting
through but not others. So I'm back to seriously wondering if multicast
traffic is being forwarded correctly and reliably. A mix of virtual and
physical systems can cause these sorts of issues, because real and
software switches are being mixed. Though I haven't seen anything quite
as odd as this, to be honest.
Can you try either UDPU (preferred) or broadcast transport please and
see if that helps or changes the symptoms at all? Broadcast could be
problematic itself with the real/virtual mix so UDPU will be a more
reliable option.
Annoyingly, you'll need to take down the whole cluster to do this, and add
<cman transport="udpu"/>
to /etc/cluster/cluster.conf on all nodes.
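
As a rough illustration of that edit, here is a minimal Python sketch that sets the transport attribute on the cman element of a cluster.conf document. The cluster name, node names and version number in the sample are made up for the example, and bumping config_version is an assumption about the usual practice when changing cluster.conf, not something stated above. Treat it as a sketch to adapt, not as the supported way to edit the file.

```python
import xml.etree.ElementTree as ET


def set_cman_transport(conf_xml: str, transport: str = "udpu") -> str:
    """Return cluster.conf text with <cman transport="..."/> set.

    Creates the <cman> element if it is missing and increments
    config_version (assumed convention for any cluster.conf change).
    """
    root = ET.fromstring(conf_xml)
    if root.tag != "cluster":
        raise ValueError("expected <cluster> as the root element")

    cman = root.find("cman")
    if cman is None:
        # No <cman> element yet: add one as the first child of <cluster>.
        cman = ET.Element("cman")
        cman.tail = root.text  # keep the existing indentation
        root.insert(0, cman)
    cman.set("transport", transport)

    # Assumption: the version is an integer attribute on <cluster>.
    version = int(root.get("config_version", "0"))
    root.set("config_version", str(version + 1))
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample configuration, for illustration only.
SAMPLE = """<cluster name="example" config_version="7">
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
  </clusternodes>
</cluster>"""

if __name__ == "__main__":
    print(set_cman_transport(SAMPLE))
```

The function works on a string rather than on the file itself, so the result can be inspected before it is copied to every node while the cluster is down.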
Chrissie
>
> On 6/17/14, 8:41 AM, "Christine Caulfield" <ccaulfie at redhat.com> wrote:
>
>> On 12/06/14 20:06, Digimer wrote:
>>> Hrm, I'm not really sure that I am able to interpret this without making
>>> guesses. I'm cc'ing one of the devs (who I hope will poke the right
>>> person if he's not able to help at the moment). Let's see what he has to
>>> say.
>>>
>>> I am curious now, too. :)
>>>
>>> On 12/06/14 03:02 PM, Schaefer, Micah wrote:
>>>> Node4 was fenced again. I was able to get some debug logs (below),
>>>> including a new message:
>>>>
>>>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the
>>>> OPERATIONAL
>>>> state."
>>>>
>>>>
>>>> Rest of corosync logs
>>>>
>>>> http://pastebin.com/iYFbkbhb
>>>>
>>>>
>>>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state.
>>>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the
>>>> membership and a new membership was formed.
>>>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms,
>>>> flushing membership messages.
>>
>>
>> I'm concerned that the pause messages are repeating like that; it looks
>> like it might be a bug that has since been fixed. What version of
>> corosync do you have?
>>
>> Chrissie
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>