[Linux-cluster] node fails to stop when inquorate

Wed Oct 18 20:20:06 UTC 2006

On Wed, 2006-10-18 at 21:38 +0200, Katriel Traum wrote:

> The (ugly) workaround I've been using is killing the process manually
> and then manually removing /var/lock/subsys/rgmanager, which causes "rc"
> to skip it.

> Is there a better way to restart a failed node? Shouldn't a failed node
> be "hard booted" by cman?

Nodes don't "know" they're fenced with fabric-level fencing; it's a
deficiency in the model itself.

The easiest thing to do is 'reboot -fn'.  A fenced node may have
outstanding buffers which never get cleaned up - so you can't "un-fence"
it until it has been rebooted anyway.
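
For reference, a rough sketch of what those flags do, wrapped in a
function so it can be tried without actually rebooting.  The names
hard_reboot and DRY_RUN are made up for this example; only the
'reboot -fn' call itself comes from the advice above, and the flag
descriptions assume the sysvinit version of reboot.

```shell
#!/bin/sh
# Sketch only: force an immediate reboot of a node that has been fenced.
# hard_reboot and DRY_RUN are hypothetical names for this example.
#   -f  force the reboot without calling shutdown, so no init "stop"
#       scripts (including a hung rgmanager stop) are run
#   -n  do not sync before rebooting; a sync could block on fenced storage
hard_reboot() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: only report what would be executed.
        echo "would run: reboot -fn"
        return 0
    fi
    reboot -fn
}
```

Setting DRY_RUN=1 before calling hard_reboot prints the command
instead of running it.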

Rgmanager's child processes are probably trying to umount a file
system that has been fenced and are stuck in disk-wait - which may be
"forever", depending on the storage configuration.
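
One way to check for this, as a minimal sketch: processes stuck in
disk-wait show up in ps with a state code beginning with "D"
(uninterruptible sleep).  The function names below are made up for
illustration, and the filter assumes input lines of the form
"PID STAT COMMAND".

```shell
#!/bin/sh
# Sketch only: list processes in uninterruptible sleep ("D" state),
# which is where a umount against fenced storage would be stuck.
# filter_disk_wait and list_disk_wait are hypothetical helper names.

# Reads "PID STAT COMMAND" lines on stdin and prints "PID COMMAND"
# for every line whose state starts with D.  Only the first word of
# the command is printed.
filter_disk_wait() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Live usage: the trailing "=" on each column suppresses the ps header.
list_disk_wait() {
    ps -eo pid=,stat=,comm= | filter_disk_wait
}
```

Running list_disk_wait on the stuck node should show whether the
umount children are among the processes listed.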

There's a patch outstanding for qdiskd which makes the node reboot on
loss of score.  However, I don't think this is your problem.

-- Lon