[Linux-cluster] DLM locks with 1 node on 2 node cluster

Mon Aug 28 20:07:18 UTC 2006

Yes, that makes sense :) I changed cluster.conf so that it no longer
includes two fencing mechanisms, only manual (since I do not have any
gnbd devices yet) ... and it worked :)
So it might be (WARNING - speculation here) that a tmp file used for
fencing is shared by both the manual and gnbd fence agents and opened by
one of them in exclusive mode, so the other cannot open it and waits
on it ...

You mentioned that gnbd fencing has multiple options: hw, manual
... I tried to change the configuration of my gnbd fence resource via
the GUI (system-config-cluster), and the only options are the name and
the servers ...
	Mike

-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Monday, August 28, 2006 3:47 PM
To: Zelikov, Mikhail
Cc: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] DLM locks with 1 node on 2 node cluster

On Mon, Aug 28, 2006 at 03:33:47PM -0400, Zelikov_Mikhail at emc.com wrote:
> Dave, I guess we are confused here by "the failed node is actually
> reset" - does this mean that "the system is down/has been shutdown" or
> does this mean "the system has been rebooted and now is up and
> running"? In the first case I am getting errors in /var/log/messages
> in the second I do not need to do anything since the cluster will
> recover by itself.

The idea behind fence_manual is that you need to go and manually fence
the failed machine somehow when you see that message.  That means doing
yourself what one of the normal fencing agents would otherwise do, e.g.
power it off, disable its SAN connection.  After you've done this, you
run fence_ack_manual to tell the system that the failed node has been
properly fenced (by you).

If you reset the failed node, you just need to make sure the power is
off before doing the ack command; you don't need to wait for it to be up
and running again.

If you reset the failed node and it comes back up and rejoins the
cluster before you happen to run the fence_ack_manual command, then the
fence_manual agent that's waiting on the non-failed node will recognize
this and effectively do the fence_ack_manual step for you since it knows
the failed node has been rebooted if it rejoins the cluster.

Dave
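
The two ways a pending manual fence gets resolved in the reply above
(an operator acknowledgement, or the failed node rejoining the cluster)
can be modeled with a small Python sketch. This is only a toy model of
the described behavior, not the actual fence_manual code; the class
ManualFenceWait and its methods on_ack and on_rejoin are hypothetical
names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ManualFenceWait:
    """Toy model (hypothetical) of a manual fence waiting on one failed node."""
    failed_node: str
    done: bool = False
    reason: str = ""

    def on_ack(self, node: str) -> None:
        # Models the operator acknowledging that the node was fenced by hand.
        if not self.done and node == self.failed_node:
            self.done, self.reason = True, "operator ack"

    def on_rejoin(self, node: str) -> None:
        # Models the failed node rejoining: it must have been rebooted,
        # so the wait ends without an explicit acknowledgement.
        if not self.done and node == self.failed_node:
            self.done, self.reason = True, "node rejoined"


wait = ManualFenceWait("node2")
wait.on_rejoin("node2")   # node came back before anyone acked
wait.on_ack("node2")      # a late ack changes nothing
print(wait.done, wait.reason)
```

Whichever event arrives first ends the wait, and a later event for the
same node is ignored, which matches the description that the ack step
is effectively done for you once the node rejoins.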