[Linux-cluster] node fails to join cluster after it was fenced
Frederik Ferner
frederik.ferner at diamond.ac.uk
Wed Feb 14 16:03:48 UTC 2007
Hi Patrick,
thanks for your reply.
I've just discovered that I seem to have the same problem on one more
cluster, so maybe I've changed something that causes this but did not
affect a running cluster. I'll append the cluster.conf for the original
cluster as well.
On Wed, 2007-02-14 at 14:06 +0000, Patrick Caulfield wrote:
> Frederik Ferner wrote:
> > I've recently run into the problem that in one of my clusters the second
> > node doesn't join the cluster anymore.
> >
> > First some background on my setup here. I have a couple of two node
> > clusters connected to a common storage each. They're basically identical
> > setups running basically RHEL4U4 and corresponding cluster suite.
> > Everything was running fine until yesterday, when in one cluster one node
> > (i04-storage2) was fenced and can't seem to join the cluster anymore,
> > all I could find was messages in the log files of i04-storage2 telling
> > me "kernel: CMAN: sending membership request" over and over again. On
> > the node still in the cluster (i04-storage1) I could see nothing in any
> > log files.
> The main reason a node would repeatedly try to rejoin a cluster is that it gets
> told to "wait" by the remaining nodes. This happens when the remaining cluster
> nodes are still in transition state (ie they haven't sorted out the cluster
> after the node has left). Normally this state only lasts a fraction of a second
> or maybe a handful of seconds for a very large cluster.
>
> As you only have one node in the cluster it sounds like the remaining node may
> be in some strange state that it can't get out of. I'm not sure what that would
> be off-hand...
>
> - it must be able to see the fenced node's 'joinreq' messages because if you
> increment the config version it rejects them.
That's what I assumed.
> - it can't even be in transition here for the same reason ... the transition
> state is checked before the validity of the joinreq message so the former case
> would also fail!
>
> Can you check the output of 'cman_tool status' and see what state the remaining
> node is in. It might also be worth sending me the 'tcpdump -s0 -x port 6809'
> output in case that shows anything useful.
See attached file for tcpdump output.
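The ordering Patrick describes (transition state checked first, joinreq validity second) can be written down as a toy model to make the contradiction explicit. This is only a sketch of the reasoning in this thread, not the CMAN source: the function name `handle_joinreq`, its parameters, and the "ACCEPT" outcome are my own assumptions for illustration.

```python
def handle_joinreq(in_transition, local_config_version, request_config_version):
    """Toy model of the check order described in the thread (hypothetical names)."""
    # Checked first: a node that is still in transition tells the joiner to wait.
    if in_transition:
        return "WAIT"
    # Checked second: a join request with a different config version is rejected.
    if request_config_version != local_config_version:
        return "REJECT"
    # Assumed outcome when neither check fires.
    return "ACCEPT"


if __name__ == "__main__":
    # A mismatched config version is only rejected when not in transition ...
    print(handle_joinreq(False, 20, 21))
    # ... because a node in transition answers "WAIT" before looking at it.
    print(handle_joinreq(True, 20, 21))
```

In this model a rejected join request with a bumped config version implies the remaining node is not in transition, which is why the repeated membership requests are puzzling.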
<snip>
[bnh65367 at i04-storage1 log]$ cman_tool status
Protocol version: 5.0.1
Config version: 20
Cluster name: i04-cluster
Cluster ID: 33460
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 2
Total_votes: 4
Quorum: 3
Active subsystems: 8
Node name: i04-storage1.diamond.ac.uk
Node ID: 1
Node addresses: 172.23.104.33
[bnh65367 at i04-storage1 log]$
</snip>
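To compare this output between nodes or clusters, a small parser may help. The following is a minimal sketch that assumes the `Key: value` layout shown above; `parse_cman_status` and `summarize` are hypothetical helper names, and treating the node as quorate when Total_votes is at least Quorum is my reading of the fields rather than something taken from the cman documentation.

```python
# Sample copied from the `cman_tool status` output above.
SAMPLE = """\
Protocol version: 5.0.1
Config version: 20
Cluster name: i04-cluster
Cluster ID: 33460
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 2
Total_votes: 4
Quorum: 3
Active subsystems: 8
Node name: i04-storage1.diamond.ac.uk
Node ID: 1
Node addresses: 172.23.104.33
"""


def parse_cman_status(text):
    """Split 'Key: value' lines into a dict; values are kept as strings."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            status[key.strip()] = value.strip()
    return status


def summarize(status):
    """Hypothetical helper: pull out the fields relevant to a join problem."""
    total_votes = int(status["Total_votes"])
    quorum = int(status["Quorum"])
    return {
        "member": status["Membership state"] == "Cluster-Member",
        "nodes": int(status["Nodes"]),
        "config_version": int(status["Config version"]),
        # Assumption: quorate means total votes reach the quorum value.
        "quorate": total_votes >= quorum,
    }


if __name__ == "__main__":
    print(summarize(parse_cman_status(SAMPLE)))
```

On a live node the text could come from running `cman_tool status` and passing its standard output to `parse_cman_status` instead of `SAMPLE`.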
Thanks,
Frederik
--
Frederik Ferner
Systems Administrator Phone: +44 (0)1235-778624
Diamond Light Source Fax: +44 (0)1235-778468
-------------- next part --------------
A non-text attachment was scrubbed...
Name: i04_tcpdump_s0_port_6809
Type: application/octet-stream
Size: 1299 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070214/df045c74/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: i04-cluster.conf
Type: text/xml
Size: 2738 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070214/df045c74/attachment.xml>