[Linux-cluster] CMAN: sending membership request, unable to join cluster.

Tue Mar 3 14:37:53 UTC 2009

I'm seeing the same problem in a 4.7 cluster.
Chrissie, is there a solution or another bz for the problem?

-Mark

On Wednesday 11 February 2009 10:17:30 Chrissie Caulfield wrote:
> thijn wrote:
> > Hi,
> >
> > I have the following problem.
> > CMAN: removing node [server1] from the cluster : Missed too many
> > heartbeats
> > When the server comes back up:
> > Feb 10 14:43:58 server1 kernel: CMAN: sending membership request
> > after which it will try to join until the end of time.
> >
> > In the current problem, server2 is active and server1 has the problem
> > of not being able to join the cluster.
> >
> > The setup is a two-server cluster.
> > We have had the problem on several clusters.
> > We usually "fixed" it by rebooting the other node, at which point the
> > cluster would repair itself and all ran smoothly from then on.
> > Naturally this disrupts any services running on the cluster, and it's
> > not really a solution that will win prizes.
> > The problem is that server1 (the problem one) is in an inquorate state
> > and we are unable to get it to a quorate state, nor do we see why this
> > is the case.
> > We tried to reproduce the problem in a test setup, but were unable to.
> >
> > So we decided to try to find a way to fix the state of the cluster using
> > the tools the system provides.
> >
> > The problem we see presents itself after a fence action by either node.
> > When we bring down both nodes to stabilize things, the cluster becomes
> > healthy, and after that we can reboot either node and it will rejoin
> > the cluster.
> > It seems the problem presents itself when "pulling the plug" out of the
> > server.
> > We run on IBM Xservers using the SA-adapter as a fence device.
> > The fence device is in a different subnet than the subnet on which the
> > cluster communicates.
> > Both fence devices are on the same subnet/vlan.
> >
> > CentOS release 4.6 (Final)
> > Linux server2 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:48:03 EST 2007 i686
> > i686 i386 GNU/Linux
> > cman_tool 1.0.17 (built Mar 20 2007 17:10:52)
> > Copyright (C) Red Hat, Inc.  2004  All rights reserved.
> >
> > The versions of all libraries, packages, kernel modules and everything
> > else the GFS cluster depends on are identical on both nodes.
> >
> > Cluster.conf
> > [root@server1 log]# cat /etc/cluster/cluster.conf
> > <?xml version="1.0"?>
> > <cluster config_version="3" name="NAME_cluster">
> > <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> > <clusternodes>
> > <clusternode name="server1.production.loc" votes="1">
> > <fence>
> > <method name="1">
> > <device name="saserver1"/>
> > </method>
> > </fence>
> > </clusternode>
> > <clusternode name="server2.production.loc" votes="1">
> > <fence>
> > <method name="1">
> > <device name="saserver2"/>
> > </method>
> > </fence>
> > </clusternode>
> > </clusternodes>
> > <cman expected_votes="1" two_node="1"/>
> > <fencedevices>
> > <fencedevice agent="fence_rsa" ipaddr="10.13.110.114" login="saadapter"
> > name="saserver1" passwd="XXXXXXX"/>
> > <fencedevice agent="fence_rsa" ipaddr="10.13.110.115" login="saadapter"
> > name="saserver2" passwd="XXXXXXX"/>
> > </fencedevices>
> > <rm>
> > <failoverdomains/>
> > <resources/>
> > </rm>
> > </cluster>
> >
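A quick way to sanity-check the two-node settings in a cluster.conf like the one quoted here is a small script. This is only a rough sketch, assuming Python 3 is available on some admin machine; check_two_node and SAMPLE_CONF are made-up names, not part of the cluster suite, and the checks cover only attributes visible in the posted file (the fencedevices section is left out of the sample for brevity).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the posted cluster.conf (fencedevices and rm sections omitted).
SAMPLE_CONF = """<?xml version="1.0"?>
<cluster config_version="3" name="NAME_cluster">
<clusternodes>
<clusternode name="server1.production.loc" votes="1">
<fence><method name="1"><device name="saserver1"/></method></fence>
</clusternode>
<clusternode name="server2.production.loc" votes="1">
<fence><method name="1"><device name="saserver2"/></method></fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
</cluster>
"""

def check_two_node(conf_text):
    """Return a list of problems found in a two-node cluster.conf (hypothetical helper)."""
    root = ET.fromstring(conf_text)
    problems = []

    nodes = root.findall("clusternodes/clusternode")
    if len(nodes) != 2:
        problems.append("expected 2 clusternode entries, found %d" % len(nodes))

    # The posted config uses two_node="1" together with expected_votes="1".
    cman = root.find("cman")
    if cman is None or cman.get("two_node") != "1":
        problems.append('cman two_node is not "1"')
    if cman is None or cman.get("expected_votes") != "1":
        problems.append('cman expected_votes is not "1"')

    for node in nodes:
        name = node.get("name", "?")
        if node.get("votes") != "1":
            problems.append("%s: votes is not 1" % name)
        if node.find("fence/method/device") is None:
            problems.append("%s: no fence device configured" % name)
    return problems

if __name__ == "__main__":
    for problem in check_two_node(SAMPLE_CONF):
        print(problem)
```

An empty result only means the file matches the shape of the posted config; it says nothing about why a node fails to join.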
> > [root@server1 log]# cat /etc/hosts
> > 127.0.0.1 localhost.localdomain localhost
> >
> > Both servers are able to ping each other and also the broadcast
> > address, so there is no firewall filtering UDP packets.
> > When I tcpdump the line I see traffic going both ways.
> >
> > Both servers are in the same vlan
> > 14:51:28.703240 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
> > 17, length: 56) server2.production.loc.6809 >
> > broadcast.production.loc.6809: UDP, length 28
> > 14:51:28.703277 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
> > 17, length: 140) server1.production.loc.6809 >
> > server2.production.loc.6809: UDP, length 112
> > 14:51:33.703240 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
> > 17, length: 56) server2.production.loc.6809 >
> > broadcast.production.loc.6809: UDP, length 28
> > 14:51:33.703310 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
> > 17, length: 140) server1.production.loc.6809 >
> > server2.production.loc.6809: UDP, length 112
> >
> > Is this normal network behavior when a cluster is inquorate?
> > I see that server1 is talking to server2, but server2 is only talking in
> > broadcasts.
> >
> > When I start or try to join the cluster:
> > Feb 10 09:36:06 server1 cman: cman_tool: Node is already active failed
> >
> > [root@server1 ~]# cman_tool status
> > Protocol version: 5.0.1
> > Config version: 3
> > Cluster name: NAME_cluster
> > Cluster ID: 64692
> > Cluster Member: No
> > Membership state: Joining
> >
> > [root@server2 log]# cman_tool status
> > Protocol version: 5.0.1
> > Config version: 3
> > Cluster name: RWSEems_cluster
> > Cluster ID: 64692
> > Cluster Member: Yes
> > Membership state: Cluster-Member
> > Nodes: 1
> > Expected_votes: 1
> > Total_votes: 1
> > Quorum: 1
> > Active subsystems: 7
> > Node name: server2.production.loc
> > Node ID: 2
> > Node addresses: server1.production.loc
> >
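To compare the two status dumps field by field rather than by eye, something like the following could be used. It is a minimal sketch in Python 3; parse_status and differing_fields are hypothetical helper names, the sample strings are trimmed from the output quoted here, and the "Key: value" line format is assumed from that output rather than from any cman documentation.

```python
# Trimmed from the `cman_tool status` output quoted in the post.
SERVER1_STATUS = """Protocol version: 5.0.1
Config version: 3
Cluster ID: 64692
Cluster Member: No
Membership state: Joining
"""

SERVER2_STATUS = """Protocol version: 5.0.1
Config version: 3
Cluster ID: 64692
Cluster Member: Yes
Membership state: Cluster-Member
"""

def parse_status(text):
    """Turn 'Key: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields

def differing_fields(a, b, keys=("Protocol version", "Config version", "Cluster ID")):
    """Return {key: (value_a, value_b)} for the given keys that do not match."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

if __name__ == "__main__":
    s1 = parse_status(SERVER1_STATUS)
    s2 = parse_status(SERVER2_STATUS)
    print("server1 state:", s1.get("Membership state"))
    print("server2 state:", s2.get("Membership state"))
    print("mismatches:", differing_fields(s1, s2))
```

Matching protocol version, config version and cluster ID on both nodes rules out the simplest configuration mismatches, which fits the description of the problem as a join that never completes rather than one that is refused.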
> > [root@server1 ~]# cman_tool nodes
> > Node  Votes Exp Sts  Name
> >
> > [root@server2 log]# cman_tool nodes
> > Node  Votes Exp Sts  Name
> >    1    1    1   X   server1.production.loc
> >    2    1    1   M   server2.production.loc
> >
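For monitoring, the node table can be parsed to flag nodes that are not members. The sketch below is Python 3 with made-up helper names (parse_nodes, non_members); the five-column layout and the "M" status for a member are taken from the output quoted here, and any other status letter is simply treated as "not a member".

```python
# Copied from the `cman_tool nodes` output quoted in the post.
SERVER1_NODES = """Node  Votes Exp Sts  Name
"""

SERVER2_NODES = """Node  Votes Exp Sts  Name
   1    1    1   X   server1.production.loc
   2    1    1   M   server2.production.loc
"""

def parse_nodes(text):
    """Parse the node table into a list of dicts (hypothetical helper)."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have five columns and start with a numeric node ID;
        # this skips the header line and any blank lines.
        if len(parts) == 5 and parts[0].isdigit():
            nodes.append({
                "id": int(parts[0]),
                "votes": int(parts[1]),
                "exp": int(parts[2]),
                "sts": parts[3],
                "name": parts[4],
            })
    return nodes

def non_members(nodes):
    """Names of nodes whose status column is anything other than 'M'."""
    return [n["name"] for n in nodes if n["sts"] != "M"]

if __name__ == "__main__":
    print("server2 sees as non-members:", non_members(parse_nodes(SERVER2_NODES)))
    print("server1 knows of", len(parse_nodes(SERVER1_NODES)), "nodes")
```

Note that an empty table, as on server1 here, yields no non-members at all, so a check built on this should also alert when the node list itself is empty.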
> > When I start cman:
> > service cman start
> >
> > Feb 10 14:06:30 server1 kernel: CMAN: Waiting to join or form a
> > Linux-cluster
> > Feb 10 14:06:30 server1 ccsd[21964]: Connected to cluster infrastruture
> > via: CMAN/SM Plugin v1.1.7.4
> > Feb 10 14:06:30 server1 ccsd[21964]: Initial status:: Inquorate
> >
> >
> > It seems to me that this should be fixable with the tools as provided
> > with the RedHat Cluster Suite, without disturbing the running cluster.
> > It seems quite insane if I need to restart my cluster to have it all
> > working again.. kinda spoils the idea of running a cluster.
> > This setup is running in an HA environment and we can afford next to no
> > downtime.
> >
> > The logs on the healthy server (server2) do not mention or complain
> > about any errors when rebooting, restarting cman, or when server1 wants
> > to join the cluster.
> > We see no "disallowed", "refused" or anything else suggesting that
> > server2 is not willing to play with server1.
> >
> > I have been looking at this thing for a while now.. am I missing
> > anything?
>
> This is a known bug, see
>
> https://bugzilla.redhat.com/show_bug.cgi?id=475293
>
> It's fixed in 4.7 or you can run a program to set up a workaround.
>
> Having said that, I have heard reports of it still happening in some
> circumstances ... but I don't have any more detail.
>
> --
>
> Chrissie

-- 
Dipl.-Ing. Mark Hlawatschek