[Linux-cluster] cman kickout out nodes for no good reason

Olivier Crête ocrete at max-t.com
Tue Apr 11 14:06:40 UTC 2006


On Tue, 2006-11-04 at 08:47 +0100, Patrick Caulfield wrote:
> Olivier Crête wrote:
> > On Thu, 2006-06-04 at 12:34 -0400, Olivier Crête wrote:
> >> I have a strange problem where cman suddenly starts kicking out members
> >> of the cluster with "Inconsistent cluster view" when I join a new node
> >> (sometimes).  It takes a few minutes between each kicking. I'm using a
> >> snapshot from March 12th of the STABLE branch on 2.6.16. The cluster is
> >> in transition state at that point and I can't stop/start services or do
> >> anything else. It did not do that with a snapshot I took a few months
> >> ago.
> > 
> > It's still happening: the node that joins says "Transition master
> > unknown", while all of the other nodes know who the master is, then the
> > master gets kicked out. Then a new master is selected, all of the nodes
> > seem to know who the master is, but refuse to act on it. After a while,
> > the new master is kicked out and the process restarts. I guess it's
> > related to the changes with the timestamps to prevent master desync; I
> > don't see any other recent change that could have caused it.
> > 
> 
> That's very peculiar behaviour, and it's going to be hard to pin down. How
> consistently does it happen?

Often, but I haven't found the exact sequence to reproduce it.

> It could be caused by extreme network packet loss, or something blocking the
> progress of cman processes. Are the already joined nodes very busy when you
> bring the new node into the cluster (if so, doing what)?

I doubt it's packet loss since cman is running over myrinet's ethernet/ip
layer and it's the only user of that port (so it shouldn't be affected by
the rest of the traffic over the myrinet). The other nodes may be busy,
but the CPU isn't at 100% us on any of them, although the PCIX bus may
be used a lot.

> I think the best way to try and track this down is to get a tcpdump of the
> cluster traffic (port 6809/udp) happening at the time of the join - make sure
> that all nodes are included in the dump and that all of the packet is captured.

I will try to get a tcpdump.
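
A minimal sketch of the capture described above, to be run on every node
while the new node joins. The interface name "myri0" and the output file
name are assumptions for illustration only; the port (6809/udp) is the one
given above. The helper only builds the tcpdump command line: -s 0 asks
for whole packets rather than just the headers, and -w writes the raw
packets to a file.

```python
import shlex


def build_capture_command(interface, outfile, port=6809):
    """Return the tcpdump argv for a full-packet capture of cman UDP traffic."""
    return [
        "tcpdump",
        "-i", interface,  # interface carrying the cluster traffic (assumed name)
        "-s", "0",        # snaplen 0: capture each packet in full
        "-w", outfile,    # save raw packets to a file for later inspection
        "udp", "and", "port", str(port),  # filter: cman traffic only
    ]


if __name__ == "__main__":
    # "myri0" and "cman-join.pcap" are hypothetical placeholders.
    print(shlex.join(build_capture_command("myri0", "cman-join.pcap")))
```

The printed command can be pasted into a root shell on each node; using a
different output file name per node keeps the dumps easy to tell apart.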

Thanks for your help,

-- 
Olivier Crête
ocrete at max-t.com
Maximum Throughput Inc.



