[Linux-cluster] rhel 6.2 network bonding interface in cluster environment

SATHYA - IT sathyanarayanan.varadharajan at precisionit.co.in
Mon Jan 9 05:51:08 UTC 2012


Hi,

Herewith attaching the /var/log/messages of both the servers. Yesterday
(08th Jan) one of the servers got fenced by the other at around 10:48 AM.
I am also attaching the cluster.conf file for your reference.

On a related note, regarding the heartbeat - I am referring to the channel
used by corosync. The node name configured in the cluster.conf file
resolves to bond1 only.
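
To illustrate what I mean (the addresses and the /etc/hosts layout below
are only an example of the shape, not our exact configuration):

  # /etc/hosts - example only, placeholder addresses
  10.10.10.1   filesrv1    # bond1 address (corosync / "heartbeat" network)
  10.10.10.2   filesrv2    # bond1 address
  # the bond0 addresses are not what the cluster node names resolve to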

Regarding the network cards, we are using 2 dual-port cards, with one port
from each card in bond0 and the other port from each card in bond1. So it
does not seem to be a network card related issue. Moreover, we are not
seeing any errors related to bond0.
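
To make the layout clearer, the bond1 side is along these lines (the files
below are a simplified sketch - the bonding mode and monitoring options
shown are examples, not necessarily our exact settings; per the kernel
logs, eth3 (0000:03:00.1) and eth4 (0000:04:00.0) are on different cards):

  # /etc/sysconfig/network-scripts/ifcfg-bond1 (sketch)
  DEVICE=bond1
  ONBOOT=yes
  BOOTPROTO=none
  BONDING_OPTS="mode=1 miimon=100"    # example options only
  IPADDR=10.10.10.1                   # placeholder address
  NETMASK=255.255.255.0

  # /etc/sysconfig/network-scripts/ifcfg-eth3 (one port of card 1;
  # ifcfg-eth4 is the same but for the port on card 2)
  DEVICE=eth3
  MASTER=bond1
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none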

Thanks

Sathya Narayanan V
Solution Architect	


-----Original Message-----
From: Digimer [mailto:linux at alteeve.com] 
Sent: Monday, January 09, 2012 10:54 AM
To: SATHYA - IT
Cc: 'linux clustering'
Subject: SPAM - Re: [Linux-cluster] rhel 6.2 network bonding interface in
cluster environment

On 01/09/2012 12:12 AM, SATHYA - IT wrote:
> Hi,
> 
> Thanks for your mail. I am herewith attaching the bonding and eth
> configuration files. And in /var/log/messages during the fence
> operation, the network-related log entries appear only on the node
> which fences the other.

What IPs do the node names resolve to? I'm assuming bond1, but I would like
you to confirm.
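
For example, something along these lines on each node would confirm it
(using the node names exactly as they appear in your cluster.conf):

  getent hosts filesrv1 filesrv2
  # the addresses returned should be the ones assigned to bond1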

> Server 1 Bond1: (Heartbeat)

I'm still not sure what you mean by heartbeat. Do you mean the channel
corosync is using?
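
If so, you can confirm which addresses corosync is actually using with
something like the following (output details vary a bit by version):

  corosync-cfgtool -s    # ring status and the local address bound per ring
  cman_tool status       # cluster name, node name and node addresses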

> On the log messages,
> 
> Jan  3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down
> Jan  3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down

This tells me both links dropped at the same time. These messages are coming
from below the cluster though.

> Jan  3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
> Jan  3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any active interface !
> Jan  3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it

With both of the bond's NICs down, the bond itself is going to drop.
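
You can watch this from the bonding driver's side as well, for example:

  cat /proc/net/bonding/bond1
  # shows the bonding mode, the currently active slave, and per-slave
  # MII status and link failure counts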

> Jan  3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
> Jan  3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex.
> Jan  3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one.
> Jan  3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up!
> Jan  3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
> Jan  3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex.

I don't see any messages about the cluster in here, which I assume you
cropped out. In this case, it doesn't matter as the problem is well below
the cluster, but in general, please provide more data, not less.
You never know what might help. :)

Anyway, you need to sort out what is happening here. Bad drivers? Bad card
(assuming dual-port)? Something is taking the NICs down, as though they were
actually unplugged.
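
As a starting point, something like this on the node that logged the link
drops may help narrow it down (eth3/eth4 taken from your logs):

  ethtool -i eth3    # driver (bnx2) and firmware version
  ethtool eth3       # current link state, speed and duplex
  ethtool -S eth3    # NIC statistics, including error counters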

If you can run them through a switch, it might help isolate which node is
causing the problems, as then you would only see one node record "NIC Copper
Link is Down" and could then focus on just that node.

--
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/octet-stream
Size: 1043 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120109/bc973723/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages_filesrv1
Type: application/octet-stream
Size: 117290 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120109/bc973723/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages_filesrv2
Type: application/octet-stream
Size: 15302 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20120109/bc973723/attachment-0002.obj>

