[Linux-cluster] Possible cman init script race condition
Borgström Jonas
jobot at wmdata.com
Mon Oct 1 14:33:49 UTC 2007
-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com]
Sent: 28 September 2007 19:03
To: Borgström Jonas
Cc: linux clustering
Subject: Re: [Linux-cluster] Possible cman init script race condition
> On Fri, Sep 28, 2007 at 11:45:47AM -0500, David Teigland wrote:
> > On Fri, Sep 28, 2007 at 09:58:18AM -0500, David Teigland wrote:
> > > On Fri, Sep 28, 2007 at 04:48:18PM +0200, Borgström Jonas wrote:
> > > > I must have misunderstood you or something, but didn't I already include
> > > > that info in the message I sent a few days ago?
> > > >
> > > > http://permalink.gmane.org/gmane.linux.redhat.cluster/9999
> > > >
> > > > (The archive inlines the "group_tool dump" output making it a bit hard
> > > > to read, but hopefully your email client shows them as attachments).
> > >
> > > I missed that, I'll take a look, thanks.
> >
> > You've hit a known bug that's been fixed:
> > https://bugzilla.redhat.com/show_bug.cgi?id=251966
> >
> > We may have to move up the release of that fix since people are seeing the
> > problem. Be careful when reading that bz because there's a lot of
> > incorrect diagnosis that was recorded before we figured out what the real
> > bug was. Here's the problem, it's very complex:
> >
> > 1. when the nodes start up, they each form a 1-node openais cluster
> > independent of the other
> >
> > [This shouldn't really happen, but in reality we can't prevent it
> > 100% of the time. We try to make it rare, and then deal with it
> > sensibly on the rare occasion when it does happen. You've hit
> > the "rare" occasion -- if you're actually seeing this regularly
> > then we probably need to fix or adjust something at the openais
> > level to make it less common.]
>
> I'd try to use some sleeps here, before running fence_tool join on either
> node, as a work-around. We're trying to get both nodes merged together
> before they do anything else.
Strangely enough, adding a "sleep 30" line directly below the 'echo "Starting cluster: "' line in the cman init script seems to make this problem go away every time. Note that this is before any daemon is started. It works, but I'm not sure why.
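
A fixed sleep could in principle be replaced by a bounded wait that polls for a condition, for example before running fence_tool join as suggested above. The following is only a minimal sketch: wait_until is a hypothetical helper, not part of the cman init script, and the command that checks whether both nodes have merged is deliberately left to the caller.

```shell
# Hypothetical helper (not part of the cman init script):
# poll a command once per second until it succeeds or the timeout expires.
# Usage: wait_until <timeout_seconds> <command> [args...]
wait_until() {
    timeout=$1
    shift
    elapsed=0
    until "$@"; do
        # Give up once the allowed time has been used.
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

It returns 0 as soon as the command succeeds and 1 if the timeout is reached first, so the caller can decide whether to continue with the join or abort.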
>
> Also, how often are you seeing the nodes not merge together right away?
> If it's frequent, then we need to fix that.
This happens every time on this hardware (two Dell 1955 blades). I never got fenced to work correctly until I figured out that I needed to add a "sleep 30" to the cman init script. So I'm obviously very interested in seeing this fixed in a 5.0 erratum or in 5.1 at the very latest. I can't really wait until 5.2 is out...
And as I mentioned before, the really scary part is that I am able to mount gfs filesystems during this kind of cluster split. And if one node is shot, the other node replays the gfs journal and makes the filesystem writable again without first fencing the shot/missing node.
Here is some "group_tool -v" output with a mounted filesystem:
[root@prod-db2 pgsql]# group_tool -v
type   level  name     id        state            node  id         local_done
fence  0      default  00010002  JOIN_START_WAIT  1     100020001  1
[1 2]
dlm    1      clvmd    00020001  JOIN_START_WAIT  1     100020001  1
[1 2]
dlm    1      pg_fs    00060001  JOIN_START_WAIT  1     100020001  1
[1 2]
gfs    2      pg_fs    00050001  JOIN_START_WAIT  1     100020001  1
[1 2]
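
To spot this condition quickly, the output above can be filtered for groups still in JOIN_START_WAIT. This is a minimal sketch: stuck_groups is a hypothetical name, and it assumes the column order shown above (type, level, name, id, state, ...).

```shell
# Hypothetical helper: read "group_tool -v" output on stdin and print the
# type and name of every group whose state column is JOIN_START_WAIT.
# The header line and member-list lines such as "[1 2]" do not match
# and are skipped.
stuck_groups() {
    awk '$5 == "JOIN_START_WAIT" { print $1, $3 }'
}
```

It would be used as "group_tool -v | stuck_groups"; empty output means no group is waiting in that state.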
Regards,
Jonas