[Linux-cluster] bonding

Scott McClanahan scott.mcclanahan at trnswrks.com
Thu Apr 12 14:19:43 UTC 2007


I don't think I need to increase max_bonds, since I only have one
bond on each node, but I have considered resorting to the old MII or
ETHTOOL ioctl method to determine link state.  You are running a newer
kernel, and I haven't checked the changelog to see which differences
might be pertinent, but the main difference is that you are using the
e1000 driver while I am using e100.  I just can't associate the link
status failures with any other events on the box; it's really strange.
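
If I do end up switching detection methods, I believe the only change
needed is the use_carrier parameter in the bonding options; as I
understand it, use_carrier=0 tells miimon to poll the link via the
MII/ETHTOOL ioctls instead of the driver's netif_carrier flag.  In
/etc/modprobe.conf that would look something like:

options bonding miimon=100 mode=1 use_carrier=0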

On Thu, 2007-04-12 at 09:52 -0400, rhurst at bidmc.harvard.edu wrote:
> I have the same hardware configuration for 11 nodes, but without any
> of the spurious failover events.  The main thing I did differently was
> to increase the bond device count to 2 (the driver defaults to only
> 1), since I have mine teamed between dual tg3/e1000 ports from the
> mobo and the PCI card.  bond0 is on a gigabit switch, while bond1 is
> on 100 Mb.  In /etc/modprobe.conf:
> 
> alias bond0 bonding
> alias bond1 bonding
> options bonding max_bonds=2 mode=1 miimon=100 updelay=200
> alias eth0 e1000
> alias eth1 e1000
> alias eth2 tg3
> alias eth3 tg3
> 
> So eth0/eth2 are teamed, and eth1/eth3 are teamed.  In dmesg:
> 
> e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
> bonding: bond0: making interface eth0 the new active one 0 ms earlier.
> bonding: bond0: enslaving eth0 as an active interface with an up link.
> bonding: bond0: enslaving eth2 as a backup interface with a down link.
> tg3: eth2: Link is up at 1000 Mbps, full duplex.
> tg3: eth2: Flow control is on for TX and on for RX.
> bonding: bond0: link status up for interface eth2, enabling it in 200 ms.
> bonding: bond0: link status definitely up for interface eth2.
> e1000: eth1: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
> bonding: bond1: making interface eth1 the new active one 0 ms earlier.
> bonding: bond1: enslaving eth1 as an active interface with an up link.
> bonding: bond1: enslaving eth3 as a backup interface with a down link.
> bond0: duplicate address detected!
> tg3: eth3: Link is up at 100 Mbps, full duplex.
> tg3: eth3: Flow control is off for TX and off for RX.
> bonding: bond1: link status up for interface eth3, enabling it in 200 ms.
> bonding: bond1: link status definitely up for interface eth3.
> 
> $ uname -srvmpio
> Linux 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:13:42 EST 2007 x86_64
> x86_64 x86_64 GNU/Linux
> 
> $ cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)
> 
> Bonding Mode: fault-tolerance (active-backup)
> Primary Slave: None
> Currently Active Slave: eth0
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 200
> Down Delay (ms): 0
> 
> Slave Interface: eth0
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:11:0a:5f:1e:0a
> 
> Slave Interface: eth2
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:17:a4:a7:9a:54
> 
> $ cat /proc/net/bonding/bond1
> Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)
> 
> Bonding Mode: fault-tolerance (active-backup)
> Primary Slave: None
> Currently Active Slave: eth1
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 200
> Down Delay (ms): 0
> 
> Slave Interface: eth1
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:11:0a:5f:1e:0b
> 
> Slave Interface: eth3
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:17:a4:a7:9a:53
> 
> 
> On Thu, 2007-04-12 at 08:45 -0400, Scott McClanahan wrote: 
> > I have every node in my four-node cluster set up to do active-backup
> > bonding, and the drivers loaded for the bonded network interfaces
> > vary between tg3 and e100.  All interfaces using the e100 driver
> > report errors much like these:
> > 
> > bonding: bond0: link status definitely down for interface eth2, disabling it
> > e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
> > bonding: bond0: link status definitely up for interface eth2.
> > 
> > This happens all day on every node.  I have configured the bonding
> > module to do MII link monitoring at a frequency of 100 milliseconds,
> > and it uses basic carrier link detection to test whether the
> > interface is alive.  No modules were custom-built on these nodes,
> > and the OS is CentOS 4.3.
> > 
> > Some more relevant information is below (this output is consistent
> > across all nodes):
> > 
> > [smccl at tf35 ~]$uname -srvmpio
> > Linux 2.6.9-34.ELhugemem #1 SMP Wed Mar 8 00:47:12 CST 2006 i686 i686
> > i386 GNU/Linux
> > 
> > [smccl at tf35 ~]$head -5 /etc/modprobe.conf
> > alias bond0 bonding
> > options bonding miimon=100 mode=1
> > alias eth0 tg3
> > alias eth1 tg3
> > alias eth2 e100
> > 
> > [smccl at tf35 ~]$cat /proc/net/bonding/bond0 
> > Ethernet Channel Bonding Driver: v2.6.1 (October 29, 2004)
> > 
> > Bonding Mode: fault-tolerance (active-backup)
> > Primary Slave: None
> > Currently Active Slave: eth0
> > MII Status: up
> > MII Polling Interval (ms): 100
> > Up Delay (ms): 0
> > Down Delay (ms): 0
> > 
> > Slave Interface: eth0
> > MII Status: up
> > Link Failure Count: 0
> > Permanent HW addr: 00:10:18:0c:86:a4
> > 
> > Slave Interface: eth2
> > MII Status: up
> > Link Failure Count: 12
> > Permanent HW addr: 00:02:55:ac:a2:ea
> > 
> > Any idea why these e100 links report failures so often?  They are
> > directly plugged into a Cisco Catalyst 4506.  Thanks.
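> > 
> > In case it helps, a host-side cross-check of what the driver and the
> > PHY report (assuming ethtool and mii-tool are available on these
> > nodes) would look something like:
> > 
> > ethtool eth2       # speed, duplex and "Link detected" as seen by e100
> > ethtool -S eth2    # NIC statistics, e.g. carrier and receive errors
> > mii-tool -v eth2   # raw MII registers and the negotiated link partner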
> > 
> Robert Hurst, Sr. Caché Administrator
> Beth Israel Deaconess Medical Center
> 1135 Tremont Street, REN-7
> Boston, Massachusetts   02120-2140
> 617-754-8754 ∙ Fax: 617-754-8730 ∙ Cell: 401-787-3154
> Any technology distinguishable from magic is insufficiently advanced.





More information about the Linux-cluster mailing list