[Linux-cluster] Node is randomly fenced

Schaefer, Micah Micah.Schaefer at jhuapl.edu
Thu Jun 12 16:48:17 UTC 2014


This is all I see for TOTEM from node1

Jun 12 11:07:10 corosync [TOTEM ] A processor failed, forming new
configuration.
Jun 12 11:07:22 corosync [QUORUM] Members[3]: 1 2 3
Jun 12 11:07:22 corosync [TOTEM ] A processor joined or left the
membership" and a new membership was formed.
Jun 12 11:07:22 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:4 left:1)
Jun 12 11:07:22 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 12 11:10:49 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 12 11:10:49 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:10:49 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 12 11:11:02 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 12 11:11:02 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:11:02 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 12 11:11:06 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 12 11:11:06 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 12 11:11:06 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 12 11:11:06 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:11:06 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 12 11:11:35 corosync [TOTEM ] A processor failed, forming new
configuration.
Jun 12 11:11:47 corosync [QUORUM] Members[3]: 1 2 4
Jun 12 11:11:47 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 12 11:11:47 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:4 left:1)
Jun 12 11:11:47 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 12 11:15:18 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 12 11:15:18 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:15:18 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 12 11:15:31 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 12 11:15:31 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:15:31 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 12 11:15:33 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 12 11:15:33 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 12 11:15:33 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 12 11:15:33 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:15:33 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 12 12:36:20 corosync [QUORUM] Members[4]: 1 2 3 4



As far as the switches go, both are Cisco Catalyst 6509-Es. No spanning-tree
changes are happening, and all of the ports for these servers have PortFast
enabled. My switch logging level is set very high, and I have no messages
relating to these time frames or ports.
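
For reference, these are roughly the checks I ran on each 6509 (the
interface name below is just a placeholder for an actual server port):

! confirm PortFast and the STP state of a server-facing port
show spanning-tree interface GigabitEthernet1/1 detail
! look for recent topology changes and anything logged around the fences
show spanning-tree detail | include occurred|from
show logging | include Jun 12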

TOTEM only reports that "A processor joined or left the membership…",
which doesn't tell me which node dropped or why it was declared failed.
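
To get more detail out of TOTEM, the next step is to turn up corosync
debug logging and give the totem token a bit more slack. A minimal
sketch of what that looks like in cluster.conf (cman-style cluster; the
log file name and token value are just examples):

<!-- corosync debug output to its own file -->
<logging debug="on" to_logfile="yes"
         logfile="/var/log/cluster/corosync-debug.log"/>
<!-- raise the token timeout so a short hiccup doesn't immediately form
     a new configuration; 30000 ms is only an example value -->
<totem token="30000"/>

After bumping config_version, cman_tool version -r should push the
updated cluster.conf out to the other nodes.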

Also note that I did not have these issues until I added the new servers,
node3 and node4, to the cluster. Node1 and node2 do not fence each other
(unless there is a real problem), and they are on different switches.
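
Since node3 and node4 hang off the other switch, I am also keeping an
eye on the corosync multicast path itself. A quick check to run on each
node (bond0 is a placeholder for the cluster interface; 5405 is
corosync's default mcastport):

# watch corosync's multicast/UDP traffic on the cluster interface
tcpdump -n -i bond0 udp port 5405

# or exercise multicast between all four nodes at once with omping,
# started on every node with the same host list
omping node1 node2 node3 node4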






On 6/12/14, 12:36 PM, "Digimer" <lists at alteeve.ca> wrote:

>On 12/06/14 12:33 PM, yvette hirth wrote:
>> On 06/12/2014 08:32 AM, Schaefer, Micah wrote:
>> 
>>> Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and
>>> fenced, then node3 was fenced when node4 came back online. The network
>>> topology is as follows:
>>> switch1: node1, node3 (two connections)
>>> switch2: node2, node4 (two connections)
>>> switch1 <-> switch2
>>> All on the same subnet
>>>
>>> I set up monitoring at 100 milliseconds on the NICs in active-backup
>>> mode,
>>> and saw no messages about link problems before the fence.
>>>
>>> I see multicast between the servers using tcpdump.
>>>
>>> Any more ideas?
>> 
>> spanning-tree scans/rebuilds happen on 10Gb circuits just like they do
>> on 1Gb circuits, and when they happen, traffic on the switches *can*
>> come to a grinding halt, depending upon the switch firmware and the type
>> of spanning-tree scan/rebuild being done.
>> 
>> you may want to check your switch logs to see if any spanning-tree
>> rebuilds were being done at the time of the fence.
>> 
>> just an idea, and hth
>> yvette hirth
>
>When I've seen this (I now disable STP entirely), it blocks all traffic
>so I would expect multiple/all nodes to partition off on their own.
>Still, worth looking into. :)
>
>-- 
>Digimer
>Papers and Projects: https://alteeve.ca/w/
>What if the cure for cancer is trapped in the mind of a person without
>access to education?
>
>-- 
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster




