[Linux-cluster] Bonded heartbeat channels on RH Cluster Suite v3

Celso K. Webber celso at webbertek.com.br
Mon Aug 21 19:38:38 UTC 2006


Hello all,

I'm experiencing some weird behaviour on RHCS v3, and I don't know whether 
it is my mistake.

The configuration is like this:
* 2-node RHCS Cluster (not GFS);
* two onboard NICs are channel bonded (bond0) for corporate network access;
* one offboard NIC is used for cluster network heartbeating.

Since the two nodes are located in separate buildings, the customer 
wanted to channel bond the heartbeat channel as well; there have been 
Ethernet switch problems on the heartbeat channel before.

So we tried to add another bonded channel (bond1) to the setup, so that 
we have a redundant heartbeat channel.

The setup went like this (sorry for the ASCII art):
+---------+      +----------+      +---------+
|         |----->|          |<-----|         |
| server1 |bond0 | ethernet | bond0| server2 |
|         |----->| switch   |<-----|         |
|         |      |          |      |         |
|         |----->+----------+<-----|         |
|         |bond1              bond1|         |
|         |<----crossover cable--->|         |
+---------+                        +---------+

For bond1 the customer wanted the following:
* to use the same Ethernet switch as the corporate network, since it is 
fully redundant (each cable plugged into a different physical switch);
* to use a crossover cable for the redundant connection of bond1, just 
in case the whole Ethernet switch solution goes down. The crossover 
cable here is an optical fiber run between the buildings;
* the heartbeat IP addresses are 10.1.1.3 (clu_server1) and 
10.1.1.4 (clu_server2); see the /etc/hosts sketch below.
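
The member names clu_server1 and clu_server2 are expected to resolve to 
these heartbeat addresses on both nodes. This is only a sketch, assuming 
resolution through /etc/hosts (corporate-network entries omitted):

   - /etc/hosts (sketch):
10.1.1.3    clu_server1
10.1.1.4    clu_server2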

   - /etc/modules.conf:
alias bond0 bonding
options bond0 -o bond0 mode=1 miimon=100
alias bond1 bonding
options bond1 -o bond1 mode=1 miimon=100
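
With the "-o bondX" options above, the bonding driver should load as two 
independent instances, one per bond. A quick sanity check would be 
something like (just a sketch of the idea, not literal output):

# lsmod | grep bond        (should list both the bond0 and bond1 instances)
# ls /proc/net/bonding/    (should show entries for bond0 and bond1)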

   - /etc/sysconfig/network-scripts/ifcfg-bond1:
DEVICE=bond1
ONBOOT=yes
IPADDR=10.1.1.XXX
NETMASK=255.255.255.0
BOOTPROTO=none
TYPE=Bonding

   - /etc/sysconfig/network-scripts/ifcfg-eth2:
DEVICE=eth2
ONBOOT=yes
BOOTPROTO=none
MASTER=bond1
SLAVE=yes
TYPE=Ethernet

   - /etc/sysconfig/network-scripts/ifcfg-eth3:
DEVICE=eth3
ONBOOT=yes
BOOTPROTO=none
MASTER=bond1
SLAVE=yes
TYPE=Ethernet
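
For completeness, after writing these files the bond should come up and 
be checkable with something like the following (a sketch, from memory of 
the RHEL3 initscripts):

# service network restart          (or ifup bond1; eth2/eth3 are tied in via the MASTER=/SLAVE= lines)
# ifconfig bond1                   (should show the 10.1.1.x address with netmask 255.255.255.0)
# cat /proc/net/bonding/bond1      (lists both slaves and the currently active one; mode=1 is active-backup)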


Now the problem is that the cluster didn't come up, and we got some 
warnings in the logs:
00:19:25 server1 clumembd[11041]: <notice> Member clu_server1 UP
00:19:28 server1 clumembd[11041]: <warning> Dropping connect from 10.1.1.4: Not in subnet!
00:19:29 server1 cluquorumd[11039]: <warning> Dropping connect from 10.1.1.4: Not in subnet!
00:19:31 server1 cluquorumd[11039]: <notice> IPv4 TB @ 10.0.4.196 Online

00:18:59 server2 clumembd[17634]: <notice> Member clu_server1 UP
00:19:09 server2 clumembd[17634]: <warning> Dropping connect from 10.1.1.3: Not in subnet!
00:19:11 server2 clumembd[17634]: <notice> Member clu_server2 UP
00:19:19 server2 cluquorumd[17632]: <notice> IPv4 TB @ 10.0.4.196 Online

It seems that both servers "see" each other, and both "see" the IPv4 
Tiebreaker as Online, but they refuse to form quorum.
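
For what it's worth, these are the kinds of checks we can still run to 
see where the 10.1.1.x traffic actually flows (just a sketch, suggestions 
welcome):

# ifconfig bond1 | grep "inet addr"    (confirm 10.1.1.3 / 10.1.1.4 with netmask 255.255.255.0 on each node)
# ping -I bond1 10.1.1.4               (from server1: is the peer reachable over the heartbeat bond?)
# tcpdump -i bond1 host 10.1.1.4       (does the cluster traffic really arrive on bond1, or through bond0?)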

Removing the "bond1" configuration brought the cluster back to normal 
operation, but we can't understand what we did wrong here.


Is this mixed switch+crossover setup for channel bonding wrong?

Please tell me if anyone can spot a mistake on our part, OK?

Thank you all.

Regards,

Celso.
-- 
*Celso Kopp Webber*

celso at webbertek.com.br

*Webbertek - Opensource Knowledge*
(41) 8813-1919
(41) 3284-3035



