[Linux-cluster] DLM locks with 1 node on 2 node cluster

Mon Aug 28 19:18:24 UTC 2006

While the node (bof226) is down, I run fence_ack_manual -n bof226. I then
start getting the following messages in /var/log/messages:

Aug 28 15:08:30 bof227 fence_manual: Node bof226 needs to be reset
before recovery can procede.  Waiting for bof226 to rejoin the cluster
or for manual acknowledgement that it has been reset (i.e.
fence_ack_manual -n bof226)
Aug 28 15:10:33 bof227 ccsd[2433]: process_get: Invalid connection
descriptor received.
Aug 28 15:10:33 bof227 ccsd[2433]: Error while processing get: Invalid
request descriptor
Aug 28 15:10:33 bof227 fenced[2497]: fence "bof226" failed
Aug 28 15:10:38 bof227 fenced[2497]: fencing node "bof226"
Aug 28 15:10:38 bof227 ccsd[2433]: process_get: Invalid connection
descriptor received.
Aug 28 15:10:38 bof227 ccsd[2433]: Error while processing get: Invalid
request descriptor
Aug 28 15:10:38 bof227 fenced[2497]: fence "bof226" failed
Aug 28 15:10:43 bof227 fenced[2497]: fencing node "bof226"
Aug 28 15:10:43 bof227 ccsd[2433]: process_get: Invalid connection
descriptor received.
Aug 28 15:10:43 bof227 ccsd[2433]: Error while processing get: Invalid
request descriptor
Aug 28 15:10:43 bof227 fenced[2497]: fence "bof226" failed
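
To make the retry pattern in the excerpt above easier to see, here is a
minimal Python sketch that pulls the fenced failure lines out of a syslog
excerpt and reports the gap between them. The function names and the
hard-coded year are assumptions for illustration only (syslog timestamps
carry no year); this is not part of any cluster tool.

```python
import re
from datetime import datetime

# Failure lines copied from the excerpt above.
LOG = """\
Aug 28 15:10:33 bof227 fenced[2497]: fence "bof226" failed
Aug 28 15:10:38 bof227 fenced[2497]: fencing node "bof226"
Aug 28 15:10:38 bof227 fenced[2497]: fence "bof226" failed
Aug 28 15:10:43 bof227 fenced[2497]: fencing node "bof226"
Aug 28 15:10:43 bof227 fenced[2497]: fence "bof226" failed
"""

# Matches: <timestamp> <host> fenced[<pid>]: fence "<node>" failed
FAILED = re.compile(
    r'^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ fenced\[\d+\]: fence "([^"]+)" failed'
)


def fence_failures(text, year=2006):
    """Return (timestamp, node) for every 'fence ... failed' line.

    The year is an assumption supplied by the caller.
    """
    found = []
    for line in text.splitlines():
        match = FAILED.match(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
            )
            found.append((stamp, match.group(2)))
    return found


def retry_gaps(failures):
    """Seconds between consecutive failed fence attempts."""
    times = [stamp for stamp, _ in failures]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(retry_gaps(fence_failures(LOG)))
```

On the lines shown, the failures are five seconds apart, and each one
follows a ccsd "Invalid connection descriptor" error in the full log.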

>>> Is there a special reason you're using both gnbd and manual fencing?
>>> I've never seen that done before and can't think of a reason you'd want
>>> to.
I was under the impression that if there is no hardware fencing device
then manual fencing is required. It was also my understanding that if I
use gnbd devices then explicit gnbd fencing is required as well.
	Mike

-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Monday, August 28, 2006 3:04 PM
To: Zelikov, Mikhail
Cc: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] DLM locks with 1 node on 2 node cluster

On Mon, Aug 28, 2006 at 02:58:32PM -0400, Zelikov_Mikhail at emc.com wrote:
> I am using manual fencing with gnbd fencing.

Is there a special reason you're using both gnbd and manual fencing?
I've never seen that done before and can't think of a reason you'd want
to.
(I'd just use gnbd, not manual.)  That said, I suspect what you have
configured should still work.

> Here is the tail of /var/log/messages:
> 
> Aug 28 14:17:06 bof227 fenced[2497]: bof226 not a cluster member after
> 0 sec post_fail_delay
> Aug 28 14:17:06 bof227 kernel: CMAN: removing node bof226 from the
> cluster : Missed too many heartbeats
> Aug 28 14:17:06 bof227 fenced[2497]: fencing node "bof226"
> Aug 28 14:17:06 bof227 fence_manual: Node bof226 needs to be reset
> before recovery can procede.  Waiting for bof226 to rejoin the cluster
> or for manual acknowledgement that it has been reset (i.e.
> fence_ack_manual -n bof226)

Follow what the message says and run "fence_ack_manual -n bof226" on the
remaining node after verifying the failed node has been reset or
otherwise fenced.

Dave
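
The order of operations in the advice above (verify the reset first,
acknowledge second) could be wrapped roughly as follows. This is only a
sketch: the helper names and the confirmed_reset flag are hypothetical,
and the only thing taken from the thread is the "fence_ack_manual -n
<node>" invocation. How the reset is verified is left to the operator;
the script does not check it.

```python
import subprocess


def ack_command(node):
    """Build the acknowledgement command quoted in the thread."""
    return ["fence_ack_manual", "-n", node]


def acknowledge_manual_fence(node, confirmed_reset, run=subprocess.check_call):
    """Acknowledge a manual fence only after the operator confirms the reset.

    confirmed_reset is a hypothetical flag: the operator sets it to True
    after checking by hand that the failed node was reset or otherwise
    fenced. `run` is injectable so the logic can be exercised without
    the cluster tools installed.
    """
    if not confirmed_reset:
        raise RuntimeError(
            f"refusing to acknowledge {node}: reset has not been confirmed"
        )
    cmd = ack_command(node)
    run(cmd)
    return cmd


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    acknowledge_manual_fence("bof226", confirmed_reset=True, run=print)
```

Run on the surviving node (bof227 in this thread), it refuses to
acknowledge anything unless the operator has explicitly confirmed that
the failed node is really down.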