[Linux-cluster] rhel 6.2 network bonding interface in cluster environment

Digimer linux at alteeve.com
Mon Jan 9 04:56:38 UTC 2012


On 01/08/2012 11:37 PM, SATHYA - IT wrote:
> Hi,
> 
> We have configured a RHEL 6.2 two-node cluster with clvmd + gfs2 +
> cman + smb. Each server has 4 NICs: 2 are bonded for the heartbeat
> network (mode=1) and 2 are bonded for public access (mode=0). The
> heartbeat network is connected directly from server to server. Once
> every 3-4 days, the heartbeat goes down and comes back up on its own
> within 2 to 3 seconds. We are not sure why this happens. Because of
> it, one node gets fenced by the other.
> 
> Is there any way we can increase the time the cluster waits for the
> heartbeat? I.e., if the cluster could wait 5-6 seconds, then even if
> the heartbeat fails for 5-6 seconds the node won't get fenced. Kindly
> advise.

"mode=1" is Active/Passive and I use it extensively with no trouble. I'm
not sure where "heartbeat" comes from, but I might be missing the
obvious. Can you share your bond and eth configuration files here please
(as plain-text attachments)?
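
For comparison, a typical mode=1 bond on RHEL 6 looks something like
this (the device names, IPs and option values below are purely
illustrative):

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BOOTPROTO=none
  ONBOOT=yes
  IPADDR=10.20.0.1
  NETMASK=255.255.255.0
  BONDING_OPTS="mode=1 miimon=100"

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and similarly for eth1)
  DEVICE=eth0
  BOOTPROTO=none
  ONBOOT=yes
  MASTER=bond0
  SLAVE=yes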

Secondly, make sure that you are actually using that interface/bond. Run
'gethostip -d <nodename>', where "nodename" is what you set in
cluster.conf. The returned IP will be the one used by the cluster.
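
For example (the hostname and addresses here are made up):

  $ gethostip -d node1.example.com
  10.20.0.1
  $ ip addr show bond0 | grep 'inet '
      inet 10.20.0.1/24 brd 10.20.0.255 scope global bond0

If the address gethostip returns does not live on the bond you meant
for cluster traffic, fix /etc/hosts (or DNS) and cluster.conf first.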

Back to the bond; a failed link should fail over to the backup link
almost instantly. So if you are losing both links at once for 2-3
seconds, something else is happening. Look at syslog on both nodes
around the time of the last fence and see what was logged just prior
to the fence. That might give you a clue.
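
Something like this on each node is a reasonable place to start (the
log path is the RHEL default; the bond name is an assumption):

  # the bond driver's current view of its slaves
  cat /proc/net/bonding/bond0

  # bonding/link/fence messages around the incident
  grep -iE 'bond0|link|fence' /var/log/messages | less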
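
As for your actual question: yes, with cman you can raise the corosync
token timeout in cluster.conf, which is how long the cluster waits
before declaring a node dead. A sketch (the value is in milliseconds
and only an example; remember to bump config_version and push the new
config with 'cman_tool version -r'):

  <!-- in /etc/cluster/cluster.conf, as a child of <cluster> -->
  <totem token="15000"/>

That said, raising the timeout only hides the symptom. I would still
find out why both links drop at the same time.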

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron



