[Linux-cluster] Unkillable clurgmgrd

Wed Nov 14 15:17:39 UTC 2007

On Tue, 2007-11-13 at 10:04 +0100, Jos Vos wrote:
> On Mon, Nov 12, 2007 at 02:47:18PM -0800, Alex Kompel wrote:
> 
> > I observed a similar problem on the test cluster. It appears that clurgmgrd
> > deadlocks in some cases in groups.c:count_resource_groups(). It does not
> > happen every time, but it is reproducible. The surviving node calls
> > rg_lock(service:mysql) @ groups.c:101 and gets stuck. The other node's
> > resource manager waits indefinitely for the lock:
> 
> [...]
> 
> > To the original poster: clurgmgrd on the surviving node is "unkillable" as
> > well.
> > You can try rebooting the surviving node: it will release the lock, and the
> > resource manager on the fenced node will be unblocked and start just fine.
> > Unfortunately, once you reboot the node, the situation may reverse (the
> > resource manager will hang on the rebooted node).
> 
> Yes, rebooting ended in a "locking war" and neither node came up
> properly.  I finally (a) used chkconfig to turn off all cluster subsystems
> and (b) modified the cluster.conf on both nodes to turn off autostart, and
> then powered off both nodes (shutting down didn't work, of course).  Then I
> brought the nodes up in sequence, manually brought up the cluster
> subsystems, manually started the cluster services, and finally
> reverted (a) and (b).
> 
> Is this problem solved in 5.1?

I'm not aware of what might be causing that, unless it's the same as
#338511 but in rhel5-land.  Someone else might know.

-- Lon
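
For anyone who needs to repeat the recovery sequence Jos describes in the
quoted text, it can be written down as an ordered checklist. The following is
only a dry-run sketch in Python: it runs nothing and just builds the list of
steps. The subsystem names (cman, clvmd, gfs, rgmanager), the node names
node1/node2, and the use of "clusvcadm -e" to start a service are assumptions
for illustration; the post does not name them. Whether the subsystems are
started on each node right after it boots, or only once both nodes are up, is
also not stated; the sketch assumes the former.

```python
# Assumed init-script names for a RHEL 5 cluster stack. The post only says
# "all cluster subsystems", so adjust this list to what is actually installed.
SUBSYSTEMS = ["cman", "clvmd", "gfs", "rgmanager"]


def recovery_plan(nodes, services, subsystems=SUBSYSTEMS):
    """Return the recovery steps as ordered (host, action) pairs.

    Nothing is executed; the result is a checklist to review and run by hand.
    """
    plan = []

    # (a) Keep the cluster subsystems from starting at boot.
    for node in nodes:
        for sub in subsystems:
            plan.append((node, f"chkconfig {sub} off"))

    # (b) Turn off service autostart in cluster.conf (a manual edit).
    for node in nodes:
        plan.append((node, "edit cluster.conf: turn off autostart"))

    # A clean shutdown hangs in this state, so both nodes are powered off.
    for node in nodes:
        plan.append((node, "power off"))

    # Bring the nodes up one at a time, starting the subsystems by hand.
    for node in nodes:
        plan.append((node, "power on"))
        for sub in subsystems:
            plan.append((node, f"service {sub} start"))

    # Start the cluster services manually (clusvcadm -e is an assumption).
    for svc in services:
        plan.append((nodes[0], f"clusvcadm -e {svc}"))

    # Revert (a) and (b) once everything is running.
    for node in nodes:
        for sub in subsystems:
            plan.append((node, f"chkconfig {sub} on"))
        plan.append((node, "edit cluster.conf: restore autostart"))

    return plan


if __name__ == "__main__":
    for host, action in recovery_plan(["node1", "node2"], ["mysql"]):
        print(f"{host}: {action}")
```

Keeping the steps as data rather than executing them is deliberate: with
clurgmgrd hung, each step should be checked by hand before moving on.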