[Linux-cluster] Possible cman init script race condition

Tue Oct 2 15:51:40 UTC 2007

-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: 1 October 2007 18:22
To: Borgström Jonas
Cc: linux clustering
Subject: Re: [Linux-cluster] Possible cman init script race condition
> 
> > Strangely enough adding a "sleep 30" line directly below the "echo
> > "Starting cluster: "" line seems to make this problem go away every
> > time. Note that this is before any daemon is started. It works, but I'm
> > not sure why.
>
> Have you tried numbers less than 30?  I forget if I've asked yet, but do
> you have the xend init script disabled?

I did try "sleep 15" but that was not enough. Maybe it's the HBA/LUN initialization that's taking too long, or something like that.

And no, xen is not installed on these servers.

>
>
> > > Also, how often are you seeing the nodes not merge together right
> > > away?  If it's frequent, then we need to fix that.
> > 
> > This happens every time on this hardware (2 Dell 1955 blades). I never
> > got fenced to work correctly until I figured out that I need to add a
> > sleep 30 to the cman init script. So I'm obviously very interested in
> > seeing this fixed in a 5.0 errata or in 5.1 at the very latest. I can't
> > really wait until 5.2 is out...
>
> Remember, there are two problems we're talking about here.  The first is
> why openais doesn't merge together for many seconds when both nodes start
> up in parallel.  This should be a rare occurrence.  The fact that you're
> seeing it every time implies there's an openais problem, or there could be
> a problem related to the networking between your nodes.  We don't have any
> idea at this point.  Maybe Steve Dake could help you more with this.  Your
> sleep 30 workaround is a clue -- it forces openais to start 30 seconds
> apart on the two nodes.

No, I think the cman daemons are started at pretty much the same time on both nodes. At least if I reboot both machines at the same time. "sleep 30" gives the kernel and the programs started before "cman" an extra 30 seconds to do their stuff before the bulk of the cman init script is executed.
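
For reference, the workaround amounts to something like the sketch below. It only shows the top of start() in /etc/rc.d/init.d/cman; the rest of the function is left out, and CMAN_START_DELAY is a hypothetical variable added here so the delay is easy to change, not something the shipped script has.

```shell
#!/bin/sh
# Sketch of the top of start() in /etc/rc.d/init.d/cman with the workaround.
# CMAN_START_DELAY is a hypothetical knob, not part of the shipped script.
start() {
    echo "Starting cluster: "
    # Workaround: wait before anything else in the script runs, so that
    # whatever was started earlier in the boot has time to finish.
    sleep "${CMAN_START_DELAY:-30}"
    # ... the rest of the original start() would continue here unchanged ...
}
```

With the default of 30 this behaves like the plain "sleep 30" line; 15 was not enough on this hardware.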

Another workaround is to run "chkconfig cman off" and start it from /etc/rc.d/rc.local. That also works, and does not require any "sleep". This probably works because rc.local is the very last thing executed by the boot-up process, and that is probably at least 30 seconds later.
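
A minimal sketch of that second workaround follows. The helper name add_cman_to_rc_local is made up for illustration, and using "/sbin/service cman start" as the line in rc.local is an assumption about how cman gets started from there; the helper only appends the line if it is not already present.

```shell
#!/bin/sh
# Append a cman start line to an rc.local-style file, once only.
# add_cman_to_rc_local is a hypothetical helper name; the default path
# is the rc.local mentioned in this thread.
add_cman_to_rc_local() {
    rc_local="${1:-/etc/rc.d/rc.local}"
    line="/sbin/service cman start"
    # grep -qxF: quiet, whole-line, fixed-string match
    grep -qxF "$line" "$rc_local" 2>/dev/null || echo "$line" >> "$rc_local"
}

# On the real system, as root, this would be used roughly as:
#   chkconfig cman off
#   add_cman_to_rc_local
```

Running the helper twice leaves a single start line in the file, so it should be safe to repeat.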

>
> The second problem is how we deal with the eventual merging of the two
> clusters.  After we fix the first problem, you will probably never see
> this second problem again.
>
>
> > And as I mentioned before, the really scary part is that I am able to
> > mount gfs filesystems during this kind of cluster split. And if one
> > node is shot, the other node replays the gfs journal and makes the
> > filesystem writable again without first fencing the shot/missing node.
>
> I would need to see the logs from the exact scenario you're talking about
> here to determine if this is a new problem or an effect of the other one.

Ok, here's some log output:

Scenario: A gfs filesystem is mounted on two nodes in a "split cluster"

cluster.conf: http://jonas.borgstrom.se/gfs/cluster.conf

Node: prod-db1:
group_tool -v: http://jonas.borgstrom.se/gfs/prod_db1_group_tool_v.txt
group_tool dump: http://jonas.borgstrom.se/gfs/prod_db1_group_tool_dump.txt

Node: prod-db2:
group_tool -v: http://jonas.borgstrom.se/gfs/prod_db2_group_tool_v.txt
group_tool dump: http://jonas.borgstrom.se/gfs/prod_db2_group_tool_dump.txt

Node prod-db1 is now shot and prod-db2 happily replays the gfs journal without first fencing the failed node:

Node: prod-db2:
group_tool -v: http://jonas.borgstrom.se/gfs/prod_db2_group_tool_v_after_prod_db1_is_shot.txt
group_tool dump: http://jonas.borgstrom.se/gfs/prod_db2_group_tool_dump_after_prod_db1_is_shot.txt
/var/log/messages: http://jonas.borgstrom.se/gfs/prod_db2_messages_after_prod_db1_is_shot.txt

So gfs is still mounted and writable on prod-db2 even though prod-db1 was never fenced.

Expected behavior: prod-db1 should be fenced before the gfs journal is replayed. (Which happens if I add "sleep 30" to /etc/rc.d/init.d/cman).
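
In case someone wants to capture the same state on their own nodes, the two group_tool outputs can be saved with a small helper along these lines. This is only a sketch: collect_group_state, its arguments and the file naming are hypothetical, and it assumes group_tool is in the PATH.

```shell
#!/bin/sh
# Save "group_tool -v" and "group_tool dump" output for the local node.
# collect_group_state and its arguments are hypothetical names.
collect_group_state() {
    outdir="$1"   # directory to write the files into
    tag="$2"      # label for the file names, e.g. "after_prod_db1_is_shot"
    mkdir -p "$outdir" || return 1
    host=$(uname -n)
    group_tool -v   > "$outdir/${host}_group_tool_v_${tag}.txt"
    group_tool dump > "$outdir/${host}_group_tool_dump_${tag}.txt"
}
```

For example, "collect_group_state /tmp/gfs before" on each node while the cluster is split, and again with a different tag after one node is shot.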

Regards,
Jonas