[Linux-cluster] Possible cman init script race condition
teigland at redhat.com
Fri Sep 28 16:45:47 UTC 2007
On Fri, Sep 28, 2007 at 09:58:18AM -0500, David Teigland wrote:
> On Fri, Sep 28, 2007 at 04:48:18PM +0200, Borgström Jonas wrote:
> > I must have misunderstood you or something, but didn't I already include
> > that info in the message I sent a few days ago?
> > http://permalink.gmane.org/gmane.linux.redhat.cluster/9999
> > (The archive inlines the "group_tool dump" output making it a bit hard
> > to read, but hopefully your email client shows them as attachments).
> I missed that, I'll take a look, thanks.
You've hit a known bug that's been fixed:
We may have to move up the release of that fix since people are seeing the
problem. Be careful when reading that bz because there's a lot of
incorrect diagnosis that was recorded before we figured out what the real
bug was. Here's the problem, which is very complex:
1. when the nodes start up, they each form a 1-node openais cluster
independent of the other
[This shouldn't really happen, but in reality we can't prevent it
100% of the time. We try to make it rare, and then deal with it
sensibly on the rare occasion when it does happen. You've hit
the "rare" occasion -- if you're actually seeing this regularly
then we probably need to fix or adjust something at the openais
level to make it less common.]
2. fence_tool join is run on each node, which creates group state in both
clusters
3. The two clusters then merge together. We could handle this merging
*if* there had been no group activity yet (in this case from fenced).
But, in this case, divergent group state exists in the two clusters
that we can't combine. Cman (above openais) should recognize this [*]
and continue to treat the nodes separately, even though openais has
merged them together.
[*] In RHEL5.0, cman/groupd are *not* smart enough to recognize this.
The fix in bz 251966 makes cman/groupd recognize this condition by
introducing a "dirty flag". What you observe is groupd trying to
merge the divergent state, getting confused, and getting stuck.
After the bug is fixed, what you should observe is that the two nodes will
stay separate (in cman) and will try to fence each other. One will
win the fencing race and reboot the other. When the rebooted node
returns, it should properly join the existing cluster.
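To make the merge rule above concrete, here is a rough toy model in Python. It is only a sketch of the decision described in this message, not the actual cman/groupd code: the Partition class, the dirty attribute, fence_join() and on_merge() are all hypothetical names invented for illustration. It assumes that a partition becomes "dirty" as soon as any group state exists in it, that two dirty partitions must stay separate, and that a clean node (such as one that has just rebooted) can join a dirty partition normally.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Toy stand-in for one openais cluster before a merge (hypothetical)."""
    nodes: frozenset
    dirty: bool = False  # assumed meaning: some group state already exists here

    def fence_join(self):
        # Models the effect of `fence_tool join`: group state now exists.
        self.dirty = True


def on_merge(a, b):
    """Decide how the layer above openais treats a merge of two partitions.

    Returns ("stay_separate", None) when both sides hold group state that
    cannot be combined, otherwise ("merge", combined_nodes).
    """
    if a.dirty and b.dirty:
        # Divergent group state on both sides: keep the nodes apart and
        # let fencing decide which one survives.
        return ("stay_separate", None)
    # At most one side has state, so the other side simply joins it.
    return ("merge", a.nodes | b.nodes)


# Startup race: each node formed its own cluster and ran fence_tool join.
node1 = Partition(frozenset({"node1"}))
node2 = Partition(frozenset({"node2"}))
node1.fence_join()
node2.fence_join()
race_result = on_merge(node1, node2)

# After node2 loses the fencing race and reboots, it comes back clean.
rebooted = Partition(frozenset({"node2"}))
rejoin_result = on_merge(node1, rebooted)
```

In this model race_result is the "stay separate" outcome and rejoin_result is a normal merge, which matches the sequence described above: the nodes fence each other, one reboots, and it then joins the existing cluster.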