[Linux-cluster] node fails to join cluster after it was fenced

Wed Feb 14 14:06:30 UTC 2007

Frederik Ferner wrote:
> Hi All,
> 
> I've recently run into the problem that in one of my clusters the second
> node doesn't join the cluster anymore.
> 
> First some background on my setup here. I have a couple of two-node
> clusters, each connected to common storage. They're basically identical
> setups running RHEL4U4 and the corresponding cluster suite.
> Everything was running fine until yesterday, when in one cluster one node
> (i04-storage2) was fenced and can't seem to join the cluster anymore.
> All I could find were messages in the log files of i04-storage2 telling
> me "kernel: CMAN: sending membership request" over and over again. On
> the node still in the cluster (i04-storage1) I could see nothing in any
> log files. 
> 
> To get i04-storage2 back into my cluster, I tried to fence it again
> using fence_tool on i04-storage1 without success. The node gets fenced,
> as I can see on i04-storage1 in the log. When I increased the version of
> the cluster config on the working node, the join request was rejected
> directly but the same timeout occurred when I copied the new
> configuration and tried to start the cluster suite again. 
> 
> There's no firewall on any computer involved, both are connected to the
> same switch. Using wireshark I can see UDP packets with source and
> destination port 6809 going from i04-storage2 to i04-storage1 and from
> i04-storage1 to the network broadcast address. No other network traffic
> seems to be going between these two hosts.
> 
> The same setup used to work fine. All other clusters are supposed to be
> identical to that one and I don't see that kind of behaviour. If there's
> a difference, I can't spot it.
> 
> Does anyone have any suggestions as to what else I could look for? What could
> be wrong here?
> 
> If you need any other bits of information that I haven't supplied,
> please ask.

The main reason a node would repeatedly try to rejoin a cluster is that it gets
told to "wait" by the remaining nodes. This happens when the remaining cluster
nodes are still in a transition state (i.e. they haven't sorted out the cluster
after the node has left). Normally this state only lasts a fraction of a second,
or maybe a handful of seconds for a very large cluster.

As you only have one node in the cluster, it sounds like the remaining node may
be in some strange state that it can't get out of. I'm not sure what that would
be off-hand...

- it must be able to see the fenced node's 'joinreq' messages, because if you
increment the config version it rejects them.
- it can't even be in transition here, for the same reason: the transition
state is checked before the validity of the joinreq message, so if it were in
transition the request with the new config version would have been told to
wait rather than being rejected!
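
The ordering argument in the two points above can be sketched roughly as
follows. This is only an illustration of the reasoning, not the actual CMAN
code: the names handle_joinreq, Reply, in_transition and the two version
parameters are all made up for the example, and the only thing taken from the
description above is that the transition check comes before the validity check.

```python
from enum import Enum


class Reply(Enum):
    """Hypothetical replies a remaining node could give to a joinreq."""
    WAIT = "wait"
    REJECT = "reject"
    ACCEPT = "accept"


def handle_joinreq(in_transition: bool,
                   local_config_version: int,
                   joinreq_config_version: int) -> Reply:
    """Illustrative decision order only; not taken from the CMAN source."""
    # The transition state is checked first: a node that is still sorting
    # out the cluster tells the joiner to wait, whatever the request says.
    if in_transition:
        return Reply.WAIT
    # Only after that is the joinreq itself validated (here reduced to a
    # config version comparison, which is an assumption for the sketch).
    if joinreq_config_version != local_config_version:
        return Reply.REJECT
    return Reply.ACCEPT
```

With this ordering, a request carrying a mismatched config version can only be
rejected outright when in_transition is false, which is why the direct
rejection you saw suggests the remaining node is not in transition.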

Can you check the output of 'cman_tool status' and see what state the remaining
node is in? It might also be worth sending me the 'tcpdump -s0 -x port 6809'
output in case that shows anything useful.

-- 

patrick