[Linux-cluster] Node is randomly fenced

Christine Caulfield ccaulfie at redhat.com
Thu Jun 19 10:02:58 UTC 2014


On 17/06/14 15:27, Schaefer, Micah wrote:
> I am running Red Hat 6.4 with the HA/load balancing packages from the
> install DVD.
>
>
> -bash-4.1$ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 6.4 (Santiago)
>
> -bash-4.1$ corosync -v
> Corosync Cluster Engine, version '1.4.1'
> Copyright (c) 2006-2009 Red Hat, Inc.
>
>


Thanks. 6.5 has better pause detection in it, but I don't think that's
the issue here. It looks to me like some messages are getting through
but not others, so I'm back to seriously wondering whether multicast
traffic is being forwarded correctly and reliably. Having a mix of
virtual and physical systems can cause these sorts of issues when real
and software switches are mixed. To be honest, though, I haven't seen
anything quite as odd as this.

Can you try either UDPU (preferred) or broadcast transport, please, and
see if that helps or changes the symptoms at all? Broadcast could itself
be problematic with the real/virtual mix, so UDPU will be the more
reliable option.

Annoyingly, you'll need to take down the whole cluster to do this, and add

<cman transport="udpu"/>

to /etc/cluster/cluster.conf on all nodes.
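
For illustration, here is a minimal sketch of where that element sits. The
cluster name and config_version value are placeholders, not taken from your
configuration, and the comments stand in for whatever sections you already
have. If your file already contains a <cman .../> element (for example one
setting two_node or expected_votes), add the transport attribute to that
element rather than creating a second one. Incrementing config_version, as
with any other cluster.conf edit, is assumed here.

```xml
<?xml version="1.0"?>
<!-- Placeholder name and version; bump config_version from its current value -->
<cluster name="examplecluster" config_version="2">
  <!-- Switch the cluster transport from multicast to UDP unicast -->
  <cman transport="udpu"/>
  <clusternodes>
    <!-- existing clusternode entries stay unchanged -->
  </clusternodes>
  <!-- existing fencedevices, rm, etc. stay unchanged -->
</cluster>
```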

Chrissie



>
> On 6/17/14, 8:41 AM, "Christine Caulfield" <ccaulfie at redhat.com> wrote:
>
>> On 12/06/14 20:06, Digimer wrote:
>>> Hrm, I'm not really sure that I am able to interpret this without making
>>> guesses. I'm cc'ing one of the devs (who I hope will poke the right
>>> person if he's not able to help at the moment). Let's see what he has to
>>> say.
>>>
>>> I am curious now, too. :)
>>>
>>> On 12/06/14 03:02 PM, Schaefer, Micah wrote:
>>>> Node4 was fenced again. I was able to get some debug logs (below),
>>>> including a new message:
>>>>
>>>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the
>>>> OPERATIONAL state."
>>>>
>>>>
>>>> Rest of corosync logs
>>>>
>>>> http://pastebin.com/iYFbkbhb
>>>>
>>>>
>>>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state.
>>>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the
>>>> membership and a new membership was formed.
>>>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms,
>>>> flushing membership messages.
>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms,
>>>> flushing membership messages.
>>
>>
>> I'm concerned that the pause messages are repeating like that; it looks
>> like it might be a bug that has since been fixed. What version of
>> corosync do you have?
>>
>> Chrissie
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>



