[Linux-cluster] Bonded heartbeat channels on RH Cluster Suite v3

Lon Hohberger lhh at redhat.com
Mon Aug 21 20:40:55 UTC 2006


On Mon, 2006-08-21 at 16:38 -0300, Celso K. Webber wrote:
> Hello all,
> 
> I'm experiencing some weird behaviour on RHCSv3, and I don't know whether 
> it is my mistake.
> 
> The configuration is like this:
> * 2-node RHCS Cluster (not GFS);
> * two onboard NICs are channel bonded (bond0) for corporate network access;
> * one offboard NIC is used for the cluster heartbeat network.
> 
> Since the two nodes are located in separate buildings, the customer 
> wanted to channel bond the heartbeat channel as well. There have been some 
> Ethernet switch problems in the heartbeat channel before.
> 
> So we tried to add another bonded channel (bond1) to the setup so that we 
> have a redundant heartbeat channel.
> 
> The setup went like this (sorry for the ASCII art):
> +---------+      +----------+      +---------+
> |         |----->|          |<-----|         |
> | server1 |bond0 | ethernet | bond0| server2 |
> |         |----->| switch   |<-----|         |
> |         |      |          |      |         |
> |         |----->+----------+<-----|         |
> |         |bond1              bond1|         |
> |         |<----crossover cable--->|         |
> +---------+                        +---------+
> 
> For bond1 the customer wanted the following:
> * to use the same Ethernet switch as the corporate network, since it is 
> fully redundant (each cable is plugged into a different physical switch);
> * to use a crossover cable for the redundant connection of bond1, just 
> in case the whole Ethernet switch solution goes down. The crossover 
> cable here is an optical fiber run between the buildings;
> * the heartbeat IP addresses for the servers are 10.1.1.3 (clu_server1) and 
> 10.1.1.4 (clu_server2).
> 
>    - /etc/modules.conf:
> alias bond0 bonding
> options bond0 -o bond0 mode=1 miimon=100
> alias bond1 bonding
> options bond1 -o bond1 mode=1 miimon=100
> 
>    - /etc/sysconfig/network-scripts/ifcfg-bond1:
> DEVICE=bond1
> ONBOOT=yes
> IPADDR=10.1.1.XXX
> NETMASK=255.255.255.0
> BOOTPROTO=none
> TYPE=Bonding
> 
>    - /etc/sysconfig/network-scripts/ifcfg-eth2:
> DEVICE=eth2
> ONBOOT=yes
> BOOTPROTO=none
> MASTER=bond1
> SLAVE=yes
> TYPE=Ethernet
> 
>    - /etc/sysconfig/network-scripts/ifcfg-eth3:
> DEVICE=eth3
> ONBOOT=yes
> BOOTPROTO=none
> MASTER=bond1
> SLAVE=yes
> TYPE=Ethernet
> 
> 
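> Independently of the cluster software, the bond1 link itself can be 
> sanity-checked with something along these lines (just a rough sketch; the 
> device name and peer address are the ones from the configuration above):
> 
>    cat /proc/net/bonding/bond1    # MII status and currently active slave
>    ping -I bond1 10.1.1.4         # from clu_server1, check that the peer answers
> 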
> Now the problem is that the cluster didn't come up, and we got 
> some warnings in the logs:
> 00:19:25 server1 clumembd[11041]: <notice> Member clu_server1 UP
> 00:19:28 server1 clumembd[11041]: <warning> Dropping connect from 
> 10.1.1.4: Not in subnet!
> 00:19:29 server1 cluquorumd[11039]: <warning> Dropping connect from 
> 10.1.1.4: Not in subnet!
> 00:19:31 server1 cluquorumd[11039]: <notice> IPv4 TB @ 10.0.4.196 Online
> 
> 00:18:59 server2 clumembd[17634]: <notice> Member clu_server1 UP
> 00:19:09 server2 clumembd[17634]: <warning> Dropping connect from 
> 10.1.1.3: Not in subnet!
> 00:19:11 server2 clumembd[17634]: <notice> Member clu_server2 UP
> 00:19:19 server2 cluquorumd[17632]: <notice> IPv4 TB @ 10.0.4.196 Online
> 
> It seems that both servers "see" each other, and both "see" the IPv4 
> Tiebreaker as Online, but they refuse to form a quorum.
> 
> Removing the "bond1" configuration made the cluster come back to normal 
> operation, but now we can't understand what we did wrong here.

It's probably just the ARP code causing problems, which is why you can
turn it off.  Try running the following with the cluster stopped:

  cludb -p cluster%msgsvc_noarp 1

...and copying the configuration to the other node.
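
In case it helps, the whole sequence would look roughly like this
(assuming a stock clumanager install with the configuration database in
/etc/cluster.xml; adjust the path if yours lives elsewhere):

  # on both nodes: stop the cluster first
  service clumanager stop

  # on one node: disable the ARP check in the messaging code
  cludb -p cluster%msgsvc_noarp 1

  # copy the updated configuration to the other node, then restart both
  scp /etc/cluster.xml clu_server2:/etc/cluster.xml
  service clumanager start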

-- Lon



