[Linux-cluster] Possible cman init script race condition

David Teigland teigland at redhat.com
Mon Oct 1 16:21:46 UTC 2007


> Strangely enough adding a "sleep 30" line directly below the "echo
> "Starting cluster: "" line seems to make this problem go away every
> time. Note that this is before any daemon is started. It works, but I'm
> not sure why.

Have you tried numbers less than 30?  I forget if I've asked yet, but do
you have the xend init script disabled?
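
For reference, the workaround being discussed could be sketched roughly as
below, written so that smaller delays are easy to try.  This is only an
illustration: the start() function name and the STARTUP_DELAY variable are
assumptions made for the example and are not taken from the shipped cman
init script.  Only the echo line and the position of the sleep come from
the description quoted above.

```shell
#!/bin/sh
# Hypothetical excerpt modelled on the workaround described in this thread.
# STARTUP_DELAY is an assumed, illustrative variable (seconds); it defaults
# to the 30 seconds mentioned above and can be lowered for experiments.
STARTUP_DELAY=${STARTUP_DELAY:-30}

start() {
    echo "Starting cluster: "
    # Delay placed directly below the echo, before any daemon is started.
    sleep "$STARTUP_DELAY"
    # ... the real script would start its daemons from here on ...
}
```

Setting STARTUP_DELAY to, say, 5 or 10 on one node and retesting would show
how small the delay can get before the problem comes back.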


> > Also, how often are you seeing the nodes not merge together right
> > away?  If it's frequent, then we need to fix that.
> 
> This happens every time on this hardware (2 Dell 1955 blades). I never
> got fenced to work correctly until I figured out that I need to add a
> sleep 30 to the cman init script. So I'm obviously very interested in
> seeing this fixed in a 5.0 errata or in 5.1 at the very latest. I can't
> really wait until 5.2 is out...

Remember, there are two problems we're talking about here.  The first is
why openais doesn't merge together for many seconds when both nodes start
up in parallel.  This should be a rare occurrence.  The fact that you're
seeing it every time implies there's an openais problem, or there could be
a problem related to the networking between your nodes.  We don't have any
idea at this point.  Maybe Steve Dake could help you more with this.  Your
sleep 30 workaround is a clue -- it forces openais to start 30 seconds
apart on the two nodes.

The second problem is how we deal with the eventual merging of the two
clusters.  After we fix the first problem, you will probably never see
this second problem again.


> And as I mentioned before, the really scary part is that I am able to
> mount gfs filesystems during this kind of cluster split. And if one
> node is shot, the other node replays the gfs journal and makes the
> filesystem writable again without first fencing the shot/missing node.

I would need to see the logs from the exact scenario you're talking about
here to determine if this is a new problem or an effect of the other one.

Dave



