[Linux-cluster] DLM locks with 1 node on 2 node cluster

David Teigland teigland at redhat.com
Mon Aug 28 19:46:42 UTC 2006


On Mon, Aug 28, 2006 at 03:33:47PM -0400, Zelikov_Mikhail at emc.com wrote:
> Dave, I guess we are confused here by "the failed node is actually reset" -
> does this mean "the system is down/has been shut down" or does it mean
> "the system has been rebooted and is now up and running"? In the first case
> I am getting errors in /var/log/messages; in the second I do not need to do
> anything, since the cluster will recover by itself.

The idea behind fence_manual is that when you see that message, you go and
manually fence the failed machine yourself.  That means doing what one of
the normal fencing agents would otherwise do, e.g. powering it off or
disabling its SAN connection.  After you've done this, you run
fence_ack_manual to tell the system that the failed node has been properly
fenced (by you).
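For reference, a minimal sketch of how fence_manual is typically wired up
in cluster.conf (the node name "node2" and device name "human" here are
just placeholders, not anything from your setup):

    <clusternode name="node2" votes="1">
            <fence>
                    <method name="1">
                            <!-- manual fencing: a human must reset the
                                 node and then ack it -->
                            <device name="human" nodename="node2"/>
                    </method>
            </fence>
    </clusternode>

    <fencedevices>
            <fencedevice agent="fence_manual" name="human"/>
    </fencedevices>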

If you reset the failed node, you just need to make sure the power is off
before doing the ack command; you don't need to wait for it to be up and
running again.
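So the sequence on the surviving node looks roughly like this (again
assuming the failed node is called "node2"; check the exact syntax your
fence_ack_manual version prints in the log message):

    # 1. make sure node2 has really been reset/powered off, then:
    fence_ack_manual -n node2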

If you reset the failed node and it comes back up and rejoins the cluster
before you happen to run fence_ack_manual, then the fence_manual agent
that's waiting on the non-failed node will recognize this and effectively
do the fence_ack_manual step for you: a node can only rejoin the cluster
after it has been rebooted, so the rejoin itself proves the failed node
was reset.

Dave



