[Linux-cluster] fence_manual node failure clarification

Thu May 12 22:49:33 UTC 2005

...answered my own question...or rather, the helpful message below answered
it.  Once the node has been reset, I can acknowledge that manually using
fence_ack_manual.

Node blade09 needs to be reset before recovery can procede.  Waiting for
blade09 to rejoin the cluster or for manual acknowledgement that it has
been reset (i.e. fence_ack_manual -n blade09)
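
A minimal shell sketch of that acknowledgement step, in case it helps anyone
else. Only the "fence_ack_manual -n <node>" invocation comes from the message
above; the helper name ack_manual_fence, its "confirmed" argument, and the
DRY_RUN variable are hypothetical names made up for illustration. It assumes
the operator has already power-cycled the failed node by hand, since
acknowledging a node that is still running risks file system corruption.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cluster tools): acknowledge a manual
# fence only after the operator states that the node really has been reset.
ack_manual_fence() {
    node="$1"
    confirmed="$2"   # pass "yes" only after verifying the reset yourself

    if [ -z "$node" ]; then
        echo "usage: ack_manual_fence <node> yes" >&2
        return 2
    fi

    if [ "$confirmed" != "yes" ]; then
        echo "refusing to acknowledge: $node not confirmed as reset" >&2
        return 1
    fi

    # DRY_RUN defaults to 1, so the command is only printed, not executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "fence_ack_manual -n $node"
    else
        fence_ack_manual -n "$node"
    fi
}
```

With the default dry run, "ack_manual_fence blade09 yes" only prints the
command it would run; setting DRY_RUN=0 on a surviving cluster node runs it
for real.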

On 12, May, 2005, Dan B. Phung declared:

> My question is in reference to node failures using fence_manual
> From 'man fenced':
> 
>   Node failure
>   When a domain member fails, the actual fencing must be completed before
>   GFS recovery can begin.  This means any delay in carrying out the 
>   fencing operation will also delay the completion of GFS file system
>   operations; most file system operations will hang during this period.
> 
> So this is what I'm seeing now when a node fails, i.e. the rest of the
> nodes notice that the heartbeats of a certain node A have timed out. Node A
> is fenced by the remaining nodes, and the file system is hung.  My
> questions are:
> 
> 1) can I call fence_ack_manual right when I see that node A is fenced, or
> do I have to wait for node A to reboot, come back, and join the cluster?
> 
> 2) if I set the post_fail_delay to -1, the fence daemon waits indefinitely
> for the failed node to rejoin the cluster, which it seems to be doing, 
> so is this the default?  The man page shows:
>   <fence_daemon post_fail_delay="0">
> 
> So with my assumption of the delay being 0, I expected the node to be
> fenced instantly on timeout, recovery to begin and complete, and my file
> system for the rest of the nodes to be usable in a relatively short time.
> I guess if the answer to 1) is that this recovery is done manually with
> the fence_ack_manual, then it all makes sense.
> 
> thanks,
> dan
> 
> 
