[Linux-cluster] Packet loss after configuring Ethernet bonding

Sat Jan 5 07:23:36 UTC 2013

On Sat, Nov 10, 2012 at 9:52 AM, Digimer <lists at alteeve.ca> wrote:
> On 11/09/2012 11:12 PM, Zama Ques wrote:

>>> I need help resolving an issue related to implementing high availability at the network level. I understand that this is not the right forum for this question, but since it is related to HA and Linux, I am asking here in the hope that somebody will have an answer to the issues I am facing.
>>>
>>> I am trying to implement Ethernet bonding. The two interfaces in my server are connected to two different network switches.
>>>
>>> My configuration is as follows:
>>>
>>> ========
>>> # cat /proc/net/bonding/bond0
>>>
>>> Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
>>>
>>> Bonding Mode: adaptive load balancing
>>> Primary Slave: None
>>> Currently Active Slave: eth0
>>> MII Status: up
>>> MII Polling Interval (ms): 0
>>> Up Delay (ms): 0
>>> Down Delay (ms): 0
>>>
>>> Slave Interface: eth0
>>> MII Status: up
>>> Speed: 1000 Mbps
>>> Duplex: full
>>> Link Failure Count: 0
>>> Permanent HW addr: e4:e1:5b:d0:11:10
>>> Slave queue ID: 0
>>>
>>> Slave Interface: eth1
>>> MII Status: up
>>> Speed: 1000 Mbps
>>> Duplex: full
>>> Link Failure Count: 0
>>> Permanent HW addr: e4:e1:5b:d0:11:14
>>> Slave queue ID: 0
>>> ------------
>>> # cat /sys/class/net/bond0/bonding/mode
>>>
>>>    balance-alb 6
>>>
>>>
>>> # cat /sys/class/net/bond0/bonding/miimon
>>>     0
>>>
>>> ============
>>>
>>>
>>> The issue is that I am seeing packet loss after configuring bonding. I tried connecting both interfaces to the same switch, and I also tried changing the miimon value to 100, but I still see packet loss.
>>>
>>> What am I missing in the configuration? Any help in resolving the problem will be highly appreciated.
>>>
>>>
>>>
>>> Thanks
>>> Zaman
>>
>>> You didn't share any details on your configuration, but I will assume
>>> you are using corosync.
>>
>>> The only supported bonding mode is Active/Passive (mode=1). I've
>>> personally tried all modes, out of curiosity, and all had problems. The
>>> short of it is that if you need more than 1 Gbit of performance, buy
>>> faster cards.
>>
>>> If you are interested in what I use, it's documented here:
>>
>>>   https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network
>>
>>>   I've used this setup in several production clusters and have tested
>>>   failure and recovery extensively. It's proven very stable. :)
>>
>>
>> Thanks, Digimer, for the quick response and for pointing me to the link. I have not reached cluster configuration yet; I am first trying to understand Ethernet bonding. So the only option for me in a clustered environment is the Active/Passive bonding mode.
>> A few more clarifications: can we use other bonding modes in a non-clustered environment? I am seeing packet loss in other modes. Also, is the restriction to mode=1 in a cluster environment specific to the RHEL Cluster Suite, or is it by design?
>>
>> It would be great if you could clarify these queries.
>>
>> Thanks in Advance
>> Zaman
>
> Corosync is the only actively developed/supported (HA) cluster
> communications and membership tool. It's used on all modern distros for
> clustering, and the requirement for mode=1 comes from it. As such, it
> doesn't matter which OS you are on; it's the only mode that will work
> (reliably).
>
> The problem is that corosync needs to detect state changes quickly. It
> does this using the totem protocol (which serves other purposes), which
> passes a token around the nodes in the cluster. If a node is sent a
> token and the token is not returned within a time-out period, it is
> declared lost and a new token is dispatched. Once too many failures
> occur in a row, the node is declared lost and it is ejected from the
> cluster. This process is detailed in the link above under the "Concept;
> Fencing" section.
>
> With all modes other than mode=1, the failure recovery and/or the
> restoration of a link in the bond causes a sufficient disruption to
> cause a node to be declared lost. As I mentioned, this matches my
> experience in testing the other modes. It isn't an arbitrary rule.
>
> As for non-clustered traffic, the usefulness of other bond modes depends
> entirely on the traffic you are pushing over it. Personally, I am
> focused on HA in clusters, so I only use mode=1, regardless of the
> traffic the bond is meant to carry.
>
> digimer
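
The token-loss behaviour described in the quoted reply can be illustrated with a small sketch. This is not corosync's implementation; the function name, the timeout of 1000 ms and the limit of 4 consecutive losses are assumptions chosen only to show why a short bond disruption can be survived while a longer one gets a node ejected.

```python
def round_when_node_is_ejected(token_round_trips_ms,
                               token_timeout_ms=1000,       # assumed value
                               max_consecutive_losses=4):   # assumed value
    """Return the index of the round at which the node would be declared
    lost, or None if it stays in the cluster.

    Each entry is the time the token took to come back, in milliseconds,
    or None if it never came back at all.
    """
    consecutive_losses = 0
    for round_index, round_trip in enumerate(token_round_trips_ms):
        token_lost = round_trip is None or round_trip > token_timeout_ms
        if token_lost:
            # The token is declared lost and a new one is dispatched.
            consecutive_losses += 1
            if consecutive_losses >= max_consecutive_losses:
                # Too many failures in a row: the node is ejected.
                return round_index
        else:
            # A token that returns in time resets the failure streak.
            consecutive_losses = 0
    return None


# A brief disruption (two lost tokens) is survived.
print(round_when_node_is_ejected([10, None, None, 12, 11]))
# A longer disruption (four lost tokens in a row) ejects the node.
print(round_when_node_is_ejected([10, None, None, None, None, 12]))
```

In these terms, the observation in the quoted reply is that failover or link restoration in modes other than mode=1 lasts long enough to exhaust the allowed streak of lost tokens.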

I was dealing with an issue where network performance had to be
improved in a high availability cluster and while going through the
archives I saw this thread.

Would this requirement for bonding mode 1 (active-backup) also
apply when we have different interfaces for the cluster communication
and service networks? In such a scenario, can't we use mode 1 for the
bond on the cluster communication network and mode 0 or 5 (or any
other suitable mode) for the bond on the service network?
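
To make the proposed split concrete, here is a minimal sketch of how one could check which mode each bond is actually running, using the same /sys/class/net/<bond>/bonding/mode file shown earlier in the thread (which printed "balance-alb 6"). The bond names bond0 (cluster communication) and bond1 (service network) and the helper names are assumptions for illustration only.

```python
from pathlib import Path


def parse_bond_mode(text):
    """Parse the sysfs mode string, e.g. 'balance-alb 6' -> ('balance-alb', 6)."""
    name, number = text.split()
    return name, int(number)


def read_bond_mode(bond, sysfs_root="/sys/class/net"):
    """Read and parse the mode of a bond device such as 'bond0'."""
    mode_file = Path(sysfs_root, bond, "bonding", "mode")
    return parse_bond_mode(mode_file.read_text())


def is_active_backup(mode):
    """True if the parsed mode is mode=1 (active-backup)."""
    return mode == ("active-backup", 1)


# Usage on a host with the bonding driver loaded (device names assumed):
#   is_active_backup(read_bond_mode("bond0"))   # cluster communication bond
#   read_bond_mode("bond1")                     # service network bond
```

This only reports the configured modes; it does not answer whether a non-mode-1 bond on the service network is safe.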

Thanks,
--
Manish