[Linux-cluster] Possible cman init script race condition

Mon Sep 24 16:10:12 UTC 2007

On Mon, Sep 24, 2007 at 05:33:30PM +0200, Borgström Jonas wrote:
> Hi,
> 
> I think there might be some race condition in the cman init script
> causing fenced to stop working correctly.
> I'm able to reliably reproduce the problem using a minimal
> cluster.conf with two nodes and fence_manual fencing.
> 
> Steps to reproduce:
> 1. Install cluster.conf on two nodes, enable the "cman" service and
> reboot both nodes.
> 2. The cluster boots successfully and clustat lists both nodes as online.
> 3. Power-cycle node prod-db1.
> 4. On prod-db2 openais detects the missing node, but fenced does nothing
> about it and logs nothing to /var/log/messages (the fenced process is
> still running).
> 
> Output from "group_tool dump fence" after the test:
> 
> [root@prod-db2 ~]# group_tool dump fence
> 1190645583 our_nodeid 2 our_name prod-db2
> 1190645583 listen 4 member 5 groupd 7
> 1190645584 client 3: join default
> 1190645584 delay post_join 120s post_fail 0s
> 1190645584 added 2 nodes from ccs
> 1190645584 setid default 65538
> 1190645584 start default 1 members 2 
> 1190645584 do_recovery stop 0 start 1 finish 0
> 1190645584 node "prod-db1" not a cman member, cn 1
> 1190645584 add first victim prod-db1
> 1190645585 node "prod-db1" not a cman member, cn 1
> 1190645586 node "prod-db1" not a cman member, cn 1
> 1190645587 node "prod-db1" not a cman member, cn 1
> 1190645588 node "prod-db1" not a cman member, cn 1
> 1190645589 node "prod-db1" not a cman member, cn 1
> 1190645590 node "prod-db1" not a cman member, cn 1
> 1190645591 node "prod-db1" not a cman member, cn 1
> 1190645592 node "prod-db1" not a cman member, cn 1
> 1190645593 node "prod-db1" not a cman member, cn 1
> 1190645594 node "prod-db1" not a cman member, cn 1
> 1190645595 node "prod-db1" not a cman member, cn 1
> 1190645596 node "prod-db1" not a cman member, cn 1
> 1190645597 node "prod-db1" not a cman member, cn 1
> 1190645598 node "prod-db1" not a cman member, cn 1
> 1190645599 node "prod-db1" not a cman member, cn 1
> 1190645600 reduce victim prod-db1
> 1190645600 delay of 16s leaves 0 victims
> 1190645600 finish default 1
> 1190645600 stop default
> 1190645600 start default 2 members 1 2 
> 1190645600 do_recovery stop 1 start 2 finish 1

I think something has gone wrong here, either in groupd or fenced, that's
preventing this start from finishing (we never get the 'finish default 2'
we would expect).  A 'group_tool -v' at this point should show the fence
group still in transition.  Could you run that, plus a 'group_tool dump',
in addition to the 'dump fence' you already have?  Please run those
commands on both nodes.
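
As a rough illustration of what "not finishing" looks like in the dump, here
is a minimal Python sketch that pairs each 'start <group> <n>' line with its
'finish <group> <n>' line and reports any start left unpaired.  The line
layout is only inferred from the output quoted above, not from any documented
format, and the names unfinished_starts and SAMPLE are made up for this
example.

```python
import re

# Line shapes inferred from the quoted 'group_tool dump fence' output.
# This is an assumption, not a documented log format.
START_RE = re.compile(r"^\d+ start (\S+) (\d+) members((?: \d+)*)\s*$")
FINISH_RE = re.compile(r"^\d+ finish (\S+) (\d+)\s*$")


def unfinished_starts(dump_text):
    """Return (group, event_nr, members) for each start with no matching finish."""
    pending = {}
    for line in dump_text.splitlines():
        m = START_RE.match(line)
        if m:
            members = [int(x) for x in m.group(3).split()]
            pending[(m.group(1), int(m.group(2)))] = members
            continue
        m = FINISH_RE.match(line)
        if m:
            # A finish closes the start with the same group and event number.
            pending.pop((m.group(1), int(m.group(2))), None)
    return [(g, nr, mem) for (g, nr), mem in sorted(pending.items())]


# Lines copied from the first part of the dump quoted above.
SAMPLE = """\
1190645584 start default 1 members 2
1190645584 do_recovery stop 0 start 1 finish 0
1190645600 finish default 1
1190645600 stop default
1190645600 start default 2 members 1 2
1190645600 do_recovery stop 1 start 2 finish 1
"""

if __name__ == "__main__":
    for group, nr, members in unfinished_starts(SAMPLE):
        print(f"start {group} {nr} (members {members}) never finished")
```

Fed the lines above, the only entry it returns is start 2 of the default
group with members 1 and 2, which is the start I'm referring to.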

> 1190645954 client 3: dump    <--- Before killing prod-db1
> 1190645985 stop default
> 1190645985 start default 3 members 2 
> 1190645985 do_recovery stop 2 start 3 finish 1
> 1190645985 finish default 3
> 1190646008 client 3: dump    <--- After killing prod-db1

Node 1 isn't fenced here because it never completed joining the fence
group above.
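
To make that concrete, here is a small Python sketch that collects the member
lists of only those starts that reached a matching finish, then asks whether
a given nodeid ever appears in one.  It assumes the field positions seen in
the dump quoted above and a single fence group (only "default" appears
there); finished_memberships, ever_member and SAMPLE_LINES are hypothetical
names used for illustration.

```python
def finished_memberships(dump_lines):
    """Map event number -> set of member nodeids for starts that finished."""
    started, finished = {}, {}
    for line in dump_lines:
        fields = line.split()
        # Assumed shape: <time> start <group> <nr> members <id> <id> ...
        if len(fields) >= 5 and fields[1] == "start" and fields[4] == "members":
            started[int(fields[3])] = {int(f) for f in fields[5:]}
        # Assumed shape: <time> finish <group> <nr>
        elif len(fields) == 4 and fields[1] == "finish":
            nr = int(fields[3])
            if nr in started:
                finished[nr] = started[nr]
    return finished


def ever_member(dump_lines, nodeid):
    """True if nodeid belongs to at least one membership that finished."""
    return any(nodeid in m for m in finished_memberships(dump_lines).values())


# start/finish lines copied from the dump quoted in this thread.
SAMPLE_LINES = [
    "1190645584 start default 1 members 2",
    "1190645600 finish default 1",
    "1190645600 start default 2 members 1 2",
    "1190645985 start default 3 members 2",
    "1190645985 finish default 3",
]

if __name__ == "__main__":
    for nodeid in (1, 2):
        print(nodeid, ever_member(SAMPLE_LINES, nodeid))
```

On those lines nodeid 1 is never part of a finished membership, since the
only start that lists it (start 2) has no finish, while nodeid 2 is in both
finished ones.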

> The scary part is that, as far as I can tell, fenced is the only cman
> daemon affected by this, so your cluster appears to work fine. But
> when a node needs to be fenced the operation isn't carried out, and
> that can cause gfs filesystem corruption.

You shouldn't be able to mount gfs on the node where joining the fence
group is stuck.

Thanks for the informative report.

Dave