[Linux-cluster] Node is randomly fenced
Micah.Schaefer at jhuapl.edu
Wed Jun 11 19:50:14 UTC 2014
Okay, I set up active/backup bonding and will watch for any change.
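For reference, an active/backup bond with RHEL 6-style ifcfg files looks
roughly like the sketch below; mode=1 is active-backup and miimon=100 polls
link state every 100 ms (the em2 slave and the x.x.x.x address are
placeholders for whatever the second NIC and real address end up being):

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BOOTPROTO=none
  ONBOOT=yes
  IPADDR=x.x.x.x
  NETMASK=255.255.255.0
  BONDING_OPTS="mode=1 miimon=100"

  # /etc/sysconfig/network-scripts/ifcfg-em1 (em2 the same, DEVICE aside)
  DEVICE=em1
  BOOTPROTO=none
  ONBOOT=yes
  MASTER=bond0
  SLAVE=yes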
This is the network side:
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 output errors, 0 collisions, 0 interface resets
This is the server side:
em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD
inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0
inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 GiB)
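To actually catch a flap when it happens, a few standard server-side checks
(assuming the bond0/em1 names above):

  cat /proc/net/bonding/bond0    # active slave and per-slave link failure counts
  ip -s link show em1            # RX/TX errors, drops, carrier transitions
  ethtool -S em1                 # driver-level counters (CRC, missed frames, etc.)
  grep -iE 'bond0|em1' /var/log/messages   # kernel link up/down messages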
I need to run some fiber, but for now two nodes are plugged into one
switch and the other two nodes into a separate switch; both switches are on
the same subnet. I’ll work on cross-connecting the bonded interfaces to
different switches.
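To rule the network in or out for corosync specifically, one option is
running omping on all four nodes at the same time and comparing unicast vs.
multicast loss (node names are placeholders; this assumes corosync is using
multicast, which is the default):

  omping -c 600 -i 1 node1 node2 node3 node4

If multicast shows loss where unicast doesn't, that usually points at the
switches (IGMP snooping/querier settings) rather than the NICs or cabling.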
On 6/11/14, 3:28 PM, "Digimer" <lists at alteeve.ca> wrote:
>The first thing I would do is get a second NIC and configure
>active-passive bonding. Network issues are too common to ignore in HA
>setups. Ideally, I would span the links across separate stacked switches.
>As for debugging the issue, I can only recommend looking closely at the
>system and switch logs for clues.
>On 11/06/14 02:55 PM, Schaefer, Micah wrote:
>> I have the issue on two of my nodes. Each node has one 10 Gb connection:
>> no bonding, single link. What else can I look at? I manage the network and
>> don't see any link down notifications, don't see any errors on the
>> interfaces.
>> On 6/11/14, 2:29 PM, "Digimer" <lists at alteeve.ca> wrote:
>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote:
>>>> It failed again, even after deleting all the other failover domains.
>>>> Cluster conf
>>>> I turned corosync output to debug. How can I go about troubleshooting
>>>> whether it really is a network issue or something else?
>>>> Jun 09 13:06:59 corosync [QUORUM] Members: 1 2 3 4
>>>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new
>>>> configuration.
>>>> Jun 11 14:10:29 corosync [QUORUM] Members: 1 2 3
>>>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the
>>>> membership and a new membership was formed.
>>>> Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0)
>>>> ip(10.70.100.101) ; members(old:4 left:1)
>>> This is, to me, *strongly* indicative of a network issue. It's not
>>> likely switch-wide, as only one member was lost, but I would certainly
>>> put my money on a network problem somewhere, somehow.
>>> Do you use bonding?