[Linux-cluster] Clumembd heartbeat problem.

Wed Aug 11 03:14:44 UTC 2004

I am not sure whether this is the correct place to post this. If it is not,
and you know where I should post, could you please tell me?

(This affects the RedHat ES 3.0 clumanager software.)
I have found a problem with the clumembd daemon where the heartbeat
message is rejected by other nodes causing the node to be powered off.

If you have an Ethernet interface with an alias and are using multicast, the
source address of the heartbeat message may be either the main IP address or
the alias address. If it is the alias address, the message is rejected by
all other nodes because it carries the wrong IP address.

The software correctly creates a socket on the main interface, and at first
the correct IP address is sent. Some time later, on the same socket, the
alias address seems to get into the packets.

I have extracted the relevant parts of my log file, showing the output
from the debugging lines I inserted into the code.

Computer has  
   Interfaces: bond0   addr 10.10.197.11
               bond0:0 addr 10.10.197.6

         Multicast set up
clumembd[2]: <debug> add_interface fd:4 name:bond0
clumembd[2]: <debug> Interface IP is 10.10.197.11
clumembd[2]: <debug> Setting up multicast 225.0.0.11 on 10.10.197.11
clumembd[2]: <debug> Multicast send fd:5 (10.10.197.11)
clumembd[2]: <debug> Multicast receive fd:6

	   Sending and receiving message (Correct behaviour)
clumembd[2]: <debug> sending multicast message fd:5 ,nodeid:1
            ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: <debug> update_seen new msg nodeid:1 token:0x0002881d4119638e

After a while you get the following (sinp = source address, nsp = expected address):

clumembd[2]: <debug> sending multicast message fd:5 ,nodeid:1
             ,addr:225.0.0.11,token:0x0002881d4119638e
clumembd[2]: <debug> update_seen new msg nodeid:1 token:0x0002881d4119638e
clumembd[2]: <debug> IP/NodeID mismatch: Probably another cluster on our
             subnet... msg from nodeid:1 sinp:10.10.197.6 nsp:10.10.197.11

The source address is now bond0:0's address, where before it was bond0's
address. The socket has not changed.

This looks to me like a bug in the sending routine (it uses sendto() from the
standard library).
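
One way to test this would be to take source-address selection away from the
kernel: a UDP socket that has been bound to a specific local address sends its
datagrams with that address as the source. The sketch below is an assumption
about how that could be done, not how clumembd sets up its socket; the helper
name open_pinned_mcast_sender is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper (not clumembd code): opens a UDP socket whose
 * source address is pinned to local_ip, e.g. the primary address of
 * bond0. Returns the socket fd, or -1 on error. */
static int open_pinned_mcast_sender(const char *local_ip)
{
    struct sockaddr_in local;
    int fd;

    assert(local_ip != NULL);

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = htons(0);              /* let the kernel pick a port */
    if (inet_pton(AF_INET, local_ip, &local.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* Bind to the primary address so later sendto() calls use it as
     * the source instead of leaving the choice to the kernel. */
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
        goto fail;

    /* Send multicast out of the interface that owns local_ip. */
    if (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF,
                   &local.sin_addr, sizeof(local.sin_addr)) < 0)
        goto fail;

    return fd;

fail:
    close(fd);
    return -1;
}
```

In the setup described here the call would be
open_pinned_mcast_sender("10.10.197.11"), and the returned descriptor would
then be used with sendto() towards 225.0.0.11.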

Has anyone else noticed this sort of behaviour when sending multicast messages
on an Ethernet device with multiple addresses?

Cheers
Royce