[Linux-cluster] Node is randomly fenced
Digimer
lists at alteeve.ca
Thu Jun 12 16:31:43 UTC 2014
To confirm: have you tried the bond setup where each node has one link
into each switch? I just want to be sure you've ruled out all the
network hardware. Also, please confirm that you used mode=1
(active-passive) bonding.
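
For reference, a minimal sketch of the kind of mode=1 bond I mean, in
RHEL-style ifcfg files (the interface names em1/em2, the address, and
the miimon/primary values are just placeholders, adjust to your setup):

   # /etc/sysconfig/network-scripts/ifcfg-bond0
   DEVICE=bond0
   ONBOOT=yes
   BOOTPROTO=none
   IPADDR=10.70.100.101        # each node's own cluster address
   NETMASK=255.255.255.0
   # mode=1 is active-backup, miimon=100 is 100 ms link monitoring
   BONDING_OPTS="mode=1 miimon=100 primary=em1"

   # /etc/sysconfig/network-scripts/ifcfg-em1  (and the same for em2)
   DEVICE=em1
   MASTER=bond0
   SLAVE=yes
   ONBOOT=yes
   BOOTPROTO=none

With that in place, /proc/net/bonding/bond0 will show which slave is
currently active and any link failure counts.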
If that doesn't help, then I would say I was wrong in assuming it was
network related. The next thing I would look at is corosync. Do you see
any messages about totem retransmits?
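
If it is hitting retransmits, corosync logs "[TOTEM ] Retransmit List:"
lines. A quick check, assuming the default cman/corosync log locations
on RHEL 6 (adjust the paths if your cluster.conf logs elsewhere):

   # any totem retransmit noise around the time of the fence?
   grep -i "retransmit" /var/log/cluster/corosync.log /var/log/messages

   # confirm the ring corosync is bound to and its current fault status
   corosync-cfgtool -s

If retransmit lists show up shortly before the "A processor failed"
message, that points back at the network path between the nodes.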
On 12/06/14 11:32 AM, Schaefer, Micah wrote:
> Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and
> fenced, then node3 was fenced when node4 came back online. The network
> topology is as follows:
> switch1: node1, node3 (two connections)
> switch2: node2, node4 (two connections)
> switch1 <―> switch2
> All on the same subnet
>
> I set up link monitoring on the NICs at 100 milliseconds in
> active-backup mode, and saw no messages about link problems before the
> fence.
>
> I see multicast between the servers using tcpdump.
>
>
> Any more ideas?
>
>
>
>
>
> On 6/12/14, 12:19 AM, "Digimer" <lists at alteeve.ca> wrote:
>
>> I considered that, but I would expect more nodes to be lost.
>>
>> On 12/06/14 12:12 AM, Netravali, Ganesh wrote:
>>> Make sure multicast is enabled across the switches.
>>>
>>> -----Original Message-----
>>> From: linux-cluster-bounces at redhat.com
>>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Schaefer, Micah
>>> Sent: Thursday, June 12, 2014 1:20 AM
>>> To: linux clustering
>>> Subject: Re: [Linux-cluster] Node is randomly fenced
>>>
>>> Okay, I set up active/backup bonding and will watch for any change.
>>>
>>> This is the network side:
>>> 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
>>> 0 output errors, 0 collisions, 0 interface resets
>>>
>>>
>>>
>>> This is the server side:
>>>
>>> em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD
>>> inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0
>>> inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>> RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:1000
>>> RX bytes:18866207931 (17.5 GiB)  TX bytes:1135415651 (1.0 GiB)
>>> Interrupt:34 Memory:d5000000-d57fffff
>>>
>>>
>>>
>>> I need to run some fiber, but for now two nodes are plugged into one
>>> switch and the other two nodes into a separate switch on the same
>>> subnet. I'll work on cross-connecting the bonded interfaces to
>>> different switches.
>>>
>>>
>>>
>>> On 6/11/14, 3:28 PM, "Digimer" <lists at alteeve.ca> wrote:
>>>
>>>> The first thing I would do is get a second NIC and configure
>>>> active-passive bonding. Network issues are too common to ignore in HA
>>>> setups. Ideally, I would span the links across separate stacked
>>>> switches.
>>>>
>>>> As for debugging the issue, I can only recommend to look closely at the
>>>> system and switch logs for clues.
>>>>
>>>> On 11/06/14 02:55 PM, Schaefer, Micah wrote:
>>>>> I have the issue on two of my nodes. Each node has one 10 Gb
>>>>> connection. No bonding, single link. What else can I look at? I
>>>>> manage the network too. I don't see any link-down notifications and
>>>>> don't see any errors on the ports.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 6/11/14, 2:29 PM, "Digimer" <lists at alteeve.ca> wrote:
>>>>>
>>>>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote:
>>>>>>> It failed again, even after deleting all the other failover domains.
>>>>>>>
>>>>>>> Cluster conf
>>>>>>> http://pastebin.com/jUXkwKS4
>>>>>>>
>>>>>>> I turned corosync output up to debug. How can I go about determining
>>>>>>> whether it really is a network issue or something else?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
>>>>>>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration.
>>>>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
>>>>>>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1)
>>>>>>
>>>>>> This is, to me, *strongly* indicative of a network issue. It's not
>>>>>> likely switch-wide, as only one member was lost, but I would
>>>>>> certainly put my money on a network problem somewhere, somehow.
>>>>>>
>>>>>> Do you use bonding?
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?