[Linux-cluster] Possible cman init script race condition

David Teigland teigland at redhat.com
Mon Sep 24 16:10:12 UTC 2007


On Mon, Sep 24, 2007 at 05:33:30PM +0200, Borgström Jonas wrote:
> Hi,
> 
> I think there might be some race condition in the cman init script
> causing fenced to stop working correctly.
> I'm able to reliably reproduce the problem using a minimal
> cluster.conf with two nodes and fence_manual fencing.
> 
> Steps to reproduce:
> 1. Install cluster.conf on two nodes, enable the "cman" service and
> reboot both nodes.
> 2. The cluster boots successfully and clustat lists both nodes as online.
> 3. Power-cycle node prod-db1.
> 4. On prod-db2 openais detects the missing node, but fenced decides to do
> nothing about it and logs nothing to /var/log/messages (the fenced
> process is still running, though).
> 
> Output from "group_tool dump fence" after the test:
> 
> [root@prod-db2 ~]# group_tool dump fence
> 1190645583 our_nodeid 2 our_name prod-db2
> 1190645583 listen 4 member 5 groupd 7
> 1190645584 client 3: join default
> 1190645584 delay post_join 120s post_fail 0s
> 1190645584 added 2 nodes from ccs
> 1190645584 setid default 65538
> 1190645584 start default 1 members 2 
> 1190645584 do_recovery stop 0 start 1 finish 0
> 1190645584 node "prod-db1" not a cman member, cn 1
> 1190645584 add first victim prod-db1
> 1190645585 node "prod-db1" not a cman member, cn 1
> 1190645586 node "prod-db1" not a cman member, cn 1
> 1190645587 node "prod-db1" not a cman member, cn 1
> 1190645588 node "prod-db1" not a cman member, cn 1
> 1190645589 node "prod-db1" not a cman member, cn 1
> 1190645590 node "prod-db1" not a cman member, cn 1
> 1190645591 node "prod-db1" not a cman member, cn 1
> 1190645592 node "prod-db1" not a cman member, cn 1
> 1190645593 node "prod-db1" not a cman member, cn 1
> 1190645594 node "prod-db1" not a cman member, cn 1
> 1190645595 node "prod-db1" not a cman member, cn 1
> 1190645596 node "prod-db1" not a cman member, cn 1
> 1190645597 node "prod-db1" not a cman member, cn 1
> 1190645598 node "prod-db1" not a cman member, cn 1
> 1190645599 node "prod-db1" not a cman member, cn 1
> 1190645600 reduce victim prod-db1
> 1190645600 delay of 16s leaves 0 victims
> 1190645600 finish default 1
> 1190645600 stop default
> 1190645600 start default 2 members 1 2 
> 1190645600 do_recovery stop 1 start 2 finish 1

I think something has gone wrong here, either in groupd or in fenced, that's
preventing this start from finishing: we expect a 'finish default 2' after
'start default 2', and it never appears.  A 'group_tool -v' here should
show the state of the fence group still in transition.  Could you run that,
plus a 'group_tool dump' at this point, in addition to the 'dump fence' you
already have?  Please run those commands on both nodes.
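
The check described above (every 'start default N' should eventually be
followed by a matching 'finish default N') can be sketched in a few lines of
Python.  This is only a rough illustration: the line layout is inferred from
the 'dump fence' output quoted in this thread, and the function name
unfinished_starts is hypothetical, not part of any cluster tool.

```python
import re

# Assumed layout, taken from the dump lines quoted in this thread:
#   "<timestamp> start <group> <event_id> members ..."
#   "<timestamp> finish <group> <event_id>"
# Lines like "... do_recovery stop 1 start 2 finish 1" are skipped because
# the keyword must come right after the timestamp.
EVENT_RE = re.compile(r"^\d+\s+(start|finish)\s+(\S+)\s+(\d+)\b")


def unfinished_starts(dump_text):
    """Return (group, event_id) pairs that were started but never finished."""
    started = []
    finished = set()
    for line in dump_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if not match:
            continue
        kind, group, event_id = match.group(1), match.group(2), int(match.group(3))
        if kind == "start":
            started.append((group, event_id))
        else:
            finished.add((group, event_id))
    return [event for event in started if event not in finished]


# Excerpt of the start/stop/finish lines from the dump quoted above.
SAMPLE = """\
1190645584 start default 1 members 2
1190645600 finish default 1
1190645600 stop default
1190645600 start default 2 members 1 2
1190645600 do_recovery stop 1 start 2 finish 1
1190645985 stop default
1190645985 start default 3 members 2
1190645985 do_recovery stop 2 start 3 finish 1
1190645985 finish default 3
"""

if __name__ == "__main__":
    print(unfinished_starts(SAMPLE))
```

On the excerpt above this reports event 2 of the 'default' group as started
but never finished, which is the start that added node 1 to the members list.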

> 1190645954 client 3: dump    <--- Before killing prod-db1
> 1190645985 stop default
> 1190645985 start default 3 members 2 
> 1190645985 do_recovery stop 2 start 3 finish 1
> 1190645985 finish default 3
> 1190646008 client 3: dump    <--- After killing prod-db1

Node 1 (prod-db1) isn't fenced here because it never completed joining the
fence group above: the 'start default 2' that added it never finished.

> The scary part is that as far as I can tell fenced is the only cman
> daemon being affected by this. So your cluster appears to work fine. But
> when a node needs to be fenced the operation isn't carried out and
> that can cause gfs filesystem corruption.

You shouldn't be able to mount gfs on the node where joining the fence
group is stuck.

Thanks for the informative report.

Dave



