[Linux-cluster] DLM locks with 1 node on 2 node cluster

Zelikov_Mikhail at emc.com Zelikov_Mikhail at emc.com
Mon Aug 28 19:33:47 UTC 2006


Dave, I guess we are confused here by "the failed node is actually reset" -
does this mean "the system is down/has been shut down," or does it mean
"the system has been rebooted and is now up and running"? In the first case
I am getting errors in /var/log/messages; in the second I do not need to do
anything, since the cluster will recover by itself.
	Mike

-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Monday, August 28, 2006 2:52 PM
To: Zelikov, Mikhail
Subject: Re: [Linux-cluster] DLM locks with 1 node on 2 node cluster

On Mon, Aug 28, 2006 at 02:52:48PM -0400, Zelikov_Mikhail at emc.com wrote:
> I am using manual fencing with gnbd fencing. Here is the tail of
> /var/log/messages:
> 
> Aug 28 14:17:06 bof227 fenced[2497]: bof226 not a cluster member after 0 sec post_fail_delay
> Aug 28 14:17:06 bof227 kernel: CMAN: removing node bof226 from the cluster : Missed too many heartbeats
> Aug 28 14:17:06 bof227 fenced[2497]: fencing node "bof226"
> Aug 28 14:17:06 bof227 fence_manual: Node bof226 needs to be reset before recovery can procede.  Waiting for bof226 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n bof226)

Follow what the message says:
- make sure the failed node is actually reset, then
- run "fence_ack_manual -n bof226" on the remaining node

then recovery will continue.

Dave
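
For anyone who hits the same hang: a minimal sketch of that recovery
sequence on the surviving node, using the node names from the log above.
The cman_tool membership check is an extra sanity step of my own, not
something fenced requires:

  # 1. Reset the failed node (bof226) out-of-band, e.g. power-cycle it,
  #    so it cannot touch shared storage while recovery runs.

  # 2. On the surviving node (bof227), confirm bof226 has left the cluster:
  cman_tool nodes

  # 3. Acknowledge the manual fence so fenced can finish recovery:
  fence_ack_manual -n bof226

  # 4. Watch /var/log/messages for DLM/GFS recovery to proceed:
  tail -f /var/log/messages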



