[Linux-cluster] Unkillable clurgmgrd

Tue Nov 13 09:04:10 UTC 2007

On Mon, Nov 12, 2007 at 02:47:18PM -0800, Alex Kompel wrote:

> I observed a similar problem on the test cluster. It appears that clurgmgrd
> deadlocks in some cases in groups.c:count_resource_groups(). It does not
> happen every time but it is reproducible. The surviving node calls
> rg_lock(service:mysql) @ groups.c:101 and gets stuck. The other node's
> resource manager waits indefinitely for the lock:

[...]

> To the original poster: the surviving node's clurgmgrd is "unkillable" as
> well.
> You can try to reboot the surviving node - it will release the lock, and the
> resource manager on the fenced node will be unblocked and start just fine.
> Unfortunately, once you reboot the node the situation may reverse (the
> resource manager will hang on the rebooted node).

Yes, rebooting ended up in a "locking war" and neither node came up
properly.  In the end I (a) disabled all cluster subsystems with
chkconfig and (b) modified cluster.conf on both nodes to turn off
autostart, and then powered off both nodes (a clean shutdown didn't
work, of course).  Then I brought the nodes up in sequence, manually
started the cluster subsystems, manually started the cluster services,
and finally reverted (a) and (b).
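
For reference, the sequence above can be sketched as a dry-run shell
script.  It only prints the commands and runs nothing.  The subsystem
names (cman, rgmanager) and the use of "clusvcadm -e" to start the
service are assumptions about a typical setup, not something stated
above; the service name mysql is taken from the quoted rg_lock() call.
Step (b), the cluster.conf edit, stays manual.

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery commands, does not execute them.
# ASSUMPTION: init-script names; adjust to what is installed on your nodes.
SUBSYSTEMS="cman rgmanager"
# Service name as it appears in the quoted rg_lock(service:mysql).
SERVICE="mysql"

recovery_plan() {
    # (a) keep the cluster subsystems from starting at boot
    for s in $SUBSYSTEMS; do echo "chkconfig $s off"; done

    # (b) is a manual edit of cluster.conf (autostart off), not shown here.
    # After powering both nodes off and on again, on each node in turn:
    for s in $SUBSYSTEMS; do echo "service $s start"; done

    # ASSUMPTION: the cluster service is started with clusvcadm -e
    echo "clusvcadm -e $SERVICE"

    # once everything is stable, revert (a); undo the (b) edit by hand
    for s in $SUBSYSTEMS; do echo "chkconfig $s on"; done
}

recovery_plan
```

The commands are printed in the order they would be run by hand, so
the output can be reviewed before typing anything on a live node.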

Is this problem solved in 5.1?

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204