[Linux-cluster] rhel 6.2 network bonding interface in cluster environment

SATHYA - IT sathyanarayanan.varadharajan at precisionit.co.in
Mon Jan 9 05:12:43 UTC 2012


Hi,

Thanks for your mail. I am attaching the bonding and eth configuration files
herewith. During the fence operation, the network-related entries in
/var/log/messages appear only on the node that fences the other.


Server 1 NIC 1:  (eth2)

/etc/sysconfig/network-scripts/ifcfg-eth2

DEVICE="eth2"
HWADDR="3C:D9:2B:04:2D:7A"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER=bond0
SLAVE=yes
USERCTL=no
BOOTPROTO=none

Server 1 NIC 4: (eth5)

/etc/sysconfig/network-scripts/ifcfg-eth5

DEVICE="eth5"
HWADDR="3C:D9:2B:04:2D:80"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER=bond0
SLAVE=yes
USERCTL=no
BOOTPROTO=none

Server 1 NIC 2: (eth3)

/etc/sysconfig/network-scripts/ifcfg-eth3

DEVICE="eth3"
HWADDR="3C:D9:2B:04:2D:7C"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER=bond1
SLAVE=yes
USERCTL=no
BOOTPROTO=none


Server 1 NIC 3: (eth4)

/etc/sysconfig/network-scripts/ifcfg-eth4

DEVICE="eth4"
HWADDR="3C:D9:2B:04:2D:7E"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER=bond1
SLAVE=yes
USERCTL=no
BOOTPROTO=none

Server 1 Bond0: (Public Access)

/etc/sysconfig/network-scripts/ifcfg-bond0

DEVICE=bond0
BOOTPROTO=static
IPADDR=192.168.129.10
NETMASK=255.255.255.0
GATEWAY=192.168.129.1
USERCTL=no
ONBOOT=yes
BONDING_OPTS="miimon=100 mode=0"


Server 1 Bond1: (Heartbeat)

/etc/sysconfig/network-scripts/ifcfg-bond1

DEVICE=bond1
BOOTPROTO=static
IPADDR=10.0.0.10
NETMASK=255.0.0.0
USERCTL=no
ONBOOT=yes
BONDING_OPTS="miimon=100 mode=1"
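
For reference, mode=0 is balance-rr and mode=1 is active-backup. The runtime
state of each bond (currently active slave, MII status, link failure counters)
can be checked on either node; the interface names below are simply the ones
configured above:

cat /proc/net/bonding/bond1   # heartbeat bond - shows which slave is active
cat /proc/net/bonding/bond0   # public bond - watch the link failure counters

A rising "Link Failure Count" there should line up with the flaps recorded in
the messages below.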

From the log messages:

Jan  3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down
Jan  3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down
Jan  3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Jan  3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any active interface !
Jan  3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Jan  3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan  3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex.
Jan  3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one.
Jan  3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up!
Jan  3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan  3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex.
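
When the next flap occurs, the driver's own view of those ports can be checked
as well (a suggestion only; the interface names are taken from the messages
above):

ethtool eth3     # speed, duplex and "Link detected" as reported by the bnx2 driver
ethtool -S eth3  # per-NIC counters; look for errors accumulating around the flap
ethtool eth4
ethtool -S eth4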


Thanks

Sathya Narayanan V
Solution Architect	

-----Original Message-----
From: Digimer [mailto:linux at alteeve.com] 
Sent: Monday, January 09, 2012 10:27 AM
To: linux clustering
Cc: SATHYA - IT
Subject: SPAM - Re: [Linux-cluster] rhel 6.2 network bonding interface in
cluster environment

On 01/08/2012 11:37 PM, SATHYA - IT wrote:
> Hi,
> 
> We have configured a RHEL 6.2 two-node cluster with clvmd + gfs2 + cman + 
> smb. We have 4 NIC cards in the servers, of which 2 are bonded for the 
> heartbeat (with mode=1) and 2 are bonded for public access (with mode=0). 
> The heartbeat network is connected directly from server to server. Once 
> every 3 - 4 days, the heartbeat goes down and comes back up automatically 
> within 2 to 3 seconds. We are not sure why this down and up occurs. Because 
> of this, one node in the cluster gets fenced by the other.
> 
> Is there any way we can increase the time the cluster waits for the 
> heartbeat? I.e., if the cluster can wait for 5-6 seconds, then even if the 
> heartbeat fails for 5-6 seconds the node won't get fenced. Kindly 
> advise.
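
The interval the cluster waits before declaring a member dead is the totem
token timeout, which cman reads from cluster.conf. A minimal sketch of the
relevant line, with an illustrative value (milliseconds), not a tuned
recommendation:

  <totem token="30000"/>

After bumping config_version in cluster.conf, 'ccs_config_validate' can check
the syntax and 'cman_tool version -r' can be used to activate the new
configuration; the exact distribution steps depend on how the cluster is set
up.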

"mode=1" is Active/Passive and I use it extensively with no trouble. I'm not
sure where "heartbeat" comes from, but I might be missing the obvious. Can
you share your bond and eth configuration files here please (as plain-text
attachments)?

Secondly, make sure that you are actually using that interface/bond. Run
'gethostip -d <nodename>', where "nodename" is what you set in cluster.conf.
The returned IP will be the one used by the cluster.
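
For example (the node name here is illustrative):

gethostip -d node1.example.com
# prints the dotted-quad address for that name; if the cluster should be
# using the heartbeat bond, this ought to return the bond1 address (10.0.0.x)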

Back to the bond: a failed link would transfer to the backup link nearly
instantly, so if both links are going down for 2~3 seconds, something else is
happening. Look at syslog on both nodes around the time of the last fence and
see what is logged just prior to it. That might give you a clue.
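
A starting point for that (the pattern is only a suggestion):

grep -iE 'fence|bonding' /var/log/messages   # run on both nodes, compare timestamps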

--
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron
