[Linux-cluster] Possible cman init script race condition

David Teigland teigland at redhat.com
Fri Sep 28 17:03:09 UTC 2007


On Fri, Sep 28, 2007 at 11:45:47AM -0500, David Teigland wrote:
> On Fri, Sep 28, 2007 at 09:58:18AM -0500, David Teigland wrote:
> > On Fri, Sep 28, 2007 at 04:48:18PM +0200, Borgström Jonas wrote:
> > > I must have misunderstood you or something, but didn't I already include
> > > that info in the message I sent a few days ago?
> > > 
> > > http://permalink.gmane.org/gmane.linux.redhat.cluster/9999
> > > 
> > > (The archive inlines the "group_tool dump" output making it a bit hard
> > > to read, but hopefully your email client shows them as attachments).
> > 
> > I missed that, I'll take a look, thanks.
> 
> You've hit a known bug that's been fixed:
>   https://bugzilla.redhat.com/show_bug.cgi?id=251966
> 
> We may have to move up the release of that fix since people are seeing the
> problem.  Be careful when reading that bz because there's a lot of
> incorrect diagnosis that was recorded before we figured out what the real
> bug was.  Here's the problem; it's very complex:
> 
> 1. when the nodes start up, they each form a 1-node openais cluster
>    independent of the other
> 
>    [This shouldn't really happen, but in reality we can't prevent it
>     100% of the time.  We try to make it rare, and then deal with it
>     sensibly on the rare occasion when it does happen.  You've hit
>     the "rare" occasion -- if you're actually seeing this regularly
>     then we probably need to fix or adjust something at the openais
>     level to make it less common.]

As a work-around, I'd try adding some sleeps here, before running
fence_tool join on either node.  The goal is to give both nodes time to
merge into a single cluster before they do anything else.

Also, how often are you seeing the nodes not merge together right away?
If it's frequent, then we need to fix that.
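
To make the sleep work-around above a bit less blind, the init script could
poll until both nodes are visible as members and only then run
fence_tool join.  What follows is only a minimal sketch in Python, not
anything shipped with cman: wait_for_members and its count_members
callable are hypothetical names, and how the member count is obtained
(for example from the output of cman_tool nodes) is left as an assumption.

```python
import time


def wait_for_members(count_members, expected, timeout=60.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until at least `expected` cluster members are visible.

    count_members is a hypothetical callable supplied by the caller that
    returns how many members this node currently sees.  Returns True if
    the expected count was reached, or False if `timeout` seconds passed.
    """
    deadline = clock() + timeout
    while True:
        if count_members() >= expected:
            return True
        if clock() >= deadline:
            return False
        # Give the 1-node clusters more time to merge before checking again.
        sleep(interval)
```

On a two-node cluster the script would call this with expected=2 and run
fence_tool join only after it returns True.  A timeout is still needed so
that a node can start alone when its peer really is down.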

> 2. fence_tool join is run on each node which creates group state in both
>    clusters
> 
> 3. The two clusters then merge together.  We could handle this merging
>    *if* there had been no group activity yet (in this case from fenced).
>    But, in this case, divergent group state exists in the two clusters
>    that we can't combine.  Cman (above openais) should recognize this [*]
>    and continue to treat the nodes separately, even though openais has
>    merged them together.
> 
>    [*] In RHEL5.0, cman/groupd are *not* smart enough to recognize this.
>    The fix in bz 251966 makes cman/groupd recognize this condition by
>    introducing a "dirty flag".  What you observe is groupd trying to
>    merge the divergent state, getting confused, and getting stuck.

There's no work-around once you've gotten to this point.

>    After the bug is fixed, what you should observe is that the two nodes will
>    stay separate (in cman) and will try to fence each other.  One will
>    win the fencing race and reboot the other.  When the rebooted node
>    returns, it should properly join the existing cluster.
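
The merge rule described in steps 1-3 can be illustrated with a toy model.
This is only a sketch of the idea, not the groupd code: the Partition class
and the merge function are hypothetical, and treating "dirty" as a single
boolean that is set by any group activity is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """A set of nodes that cman treats as one cluster (toy model)."""
    nodes: frozenset
    dirty: bool = False  # True once any group state exists, e.g. from fenced

    def join_group(self):
        # Corresponds to running "fence_tool join": group state now exists.
        self.dirty = True


def merge(a, b):
    """Return the partitions as cman would see them after openais merges a and b.

    Assumed rule: two partitions that both carry group state cannot be
    combined, so they stay separate (and would go on to fence each other).
    Otherwise they combine, and the result is dirty if either side was.
    """
    if a.dirty and b.dirty:
        return [a, b]
    return [Partition(a.nodes | b.nodes, a.dirty or b.dirty)]
```

In this model the failure case is both nodes calling join_group() before
merge(), which leaves two partitions.  A freshly rebooted node has no group
state, so merging it with the surviving partition yields a single cluster.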



