[Linux-cluster] node fails to join cluster after it was fenced

Frederik Ferner frederik.ferner at diamond.ac.uk
Wed Feb 14 16:03:48 UTC 2007


Hi Patrick,

thanks for your reply.

I've just discovered that I seem to have the same problem on one more
cluster, so maybe I've changed something that causes this but did not
affect a running cluster. I'll append the cluster.conf for the original
cluster as well.

On Wed, 2007-02-14 at 14:06 +0000, Patrick Caulfield wrote:
> Frederik Ferner wrote:
> > I've recently run into the problem that in one of my clusters the second
> > node doesn't join the cluster anymore.
> > 
> > First some background on my setup here. I have a couple of two node
> > clusters connected to a common storage each. They're basically identical
> > setups running basically RHEL4U4 and corresponding cluster suite.
> > Everything was running fine until yesterday, when in one cluster one node
> > (i04-storage2) was fenced and can't seem to join the cluster anymore,
> > all I could find was messages in the log files of i04-storage2 telling
> > me "kernel: CMAN: sending membership request" over and over again. On
> > the node still in the cluster (i04-storage1) I could see nothing in any
> > log files. 

> The main reason a node would repeatedly try to rejoin a cluster is that it gets
> told to "wait" by the remaining nodes. This happens when the remaining cluster
> nodes are still in transition state (i.e. they haven't sorted out the cluster
> after the node has left). Normally this state only lasts a fraction of a second
> or maybe a handful of seconds for a very large cluster.
> 
> As you only have one node in the cluster, it sounds like the remaining node may
> be in some strange state that it can't get out of. I'm not sure what that would
> be off-hand...
> 
> - it must be able to see the fenced node's 'joinreq' messages because if you
> increment the config version it rejects them.

That's what I assumed.

> - it can't even be in transition here for the same reason ... the transition
> state is checked before the validity of the joinreq message so the former case
> would also fail!
> 
> Can you check the output of 'cman_tool status' and see what state the remaining
> node is in? It might also be worth sending me the 'tcpdump -s0 -x port 6809'
> output in case that shows anything useful.

See attached file for tcpdump output.
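
To make the ordering argument above easier to follow, here is a minimal
Python sketch of the checks as Patrick describes them: the transition
state is tested first, and only then the validity of the join request.
This is not CMAN's actual code; the names MemberState and handle_joinreq
and the "wait"/"reject"/"accept" return values are made up for
illustration, and the config-version comparison stands in for whatever
validity checks the real code performs.

```python
from dataclasses import dataclass


@dataclass
class MemberState:
    """Hypothetical view of the node that is still in the cluster."""
    in_transition: bool
    config_version: int


def handle_joinreq(member: MemberState, joiner_config_version: int) -> str:
    """Decide how to answer a join request (illustrative only)."""
    # Checked first: while in transition, every joiner is told to wait,
    # whether or not its request is valid.
    if member.in_transition:
        return "wait"
    # Checked second: a request with a different config version is rejected.
    if joiner_config_version != member.config_version:
        return "reject"
    return "accept"


# A bumped config version on the joiner gets a "reject" only when the
# member is not in transition; otherwise the answer would be "wait".
print(handle_joinreq(MemberState(in_transition=False, config_version=20), 21))
print(handle_joinreq(MemberState(in_transition=True, config_version=20), 21))
```

Under these assumptions, a rejection after incrementing the config
version implies that the remaining node saw the request and was not in
transition at that moment, which is why the repeated "wait" behaviour
is puzzling.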

<snip>
[bnh65367@i04-storage1 log]$ cman_tool status
Protocol version: 5.0.1
Config version: 20
Cluster name: i04-cluster
Cluster ID: 33460
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 2
Total_votes: 4
Quorum: 3
Active subsystems: 8
Node name: i04-storage1.diamond.ac.uk
Node ID: 1
Node addresses: 172.23.104.33

[bnh65367@i04-storage1 log]$
</snip>
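
In case it helps when comparing the two affected clusters, the status
output above can be turned into key/value pairs with a few lines of
Python. This is only a sketch: parse_cman_status is a made-up helper,
not part of any cluster tool, and it assumes every line has the
"Key: value" shape shown above. The sample text is a subset of the
output pasted above.

```python
def parse_cman_status(text: str) -> dict:
    """Parse 'Key: value' lines into a dict (hypothetical helper)."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            status[key.strip()] = value.strip()
    return status


SAMPLE = """\
Config version: 20
Cluster name: i04-cluster
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 2
Total_votes: 4
Quorum: 3
Node name: i04-storage1.diamond.ac.uk
"""

status = parse_cman_status(SAMPLE)
# Print the fields most relevant to the membership question.
for field in ("Membership state", "Nodes", "Expected_votes",
              "Total_votes", "Quorum", "Config version"):
    print(f"{field}: {status[field]}")
```

Running the same helper over the output from the second cluster would
make it easy to diff the two sets of values.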

Thanks,
Frederik

-- 
Frederik Ferner 
Systems Administrator                  Phone: +44 (0)1235-778624
Diamond Light Source                   Fax:   +44 (0)1235-778468
-------------- next part --------------
A non-text attachment was scrubbed...
Name: i04_tcpdump_s0_port_6809
Type: application/octet-stream
Size: 1299 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070214/df045c74/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: i04-cluster.conf
Type: text/xml
Size: 2738 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070214/df045c74/attachment.xml>

