[Linux-cluster] fence_manual node failure clarification
Dan B. Phung
phung at cs.columbia.edu
Thu May 12 22:49:33 UTC 2005
...answered my own question...or rather, the helpful log message answered
it. I can acknowledge the reset manually using fence_ack_manual.
Node blade09 needs to be reset before recovery can procede. Waiting for
blade09 to rejoin the cluster or for manual acknowledgement that it has
been reset (i.e. fence_ack_manual -n blade09)
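
As a rough sketch of acting on that message: the helper below (suggest_ack is a hypothetical name, not part of the cluster tools) pulls the node name out of the log line and prints the matching acknowledgement command without running it. It assumes the log wording shown above, and the command should only be run once the node really has been reset.

```shell
# Hypothetical helper: extract the node name from a fence_manual log line
# and print (not execute) the acknowledgement command the message suggests.
suggest_ack() {
    # $1: one log line, assumed to contain "Node <name> needs to be reset"
    node=$(printf '%s\n' "$1" | sed -n 's/^.*Node \([^ ]*\) needs to be reset.*$/\1/p')
    # No match means this is not a fence_manual reset request.
    [ -n "$node" ] || return 1
    printf 'fence_ack_manual -n %s\n' "$node"
}

suggest_ack "Node blade09 needs to be reset before recovery can procede."
```

Printing the command rather than executing it keeps the human check in place, since the acknowledgement tells the cluster it is safe to begin recovery.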
On 12, May, 2005, Dan B. Phung declared:
> My question is in reference to node failures using fence_manual
> From 'man fenced':
>
> Node failure
> When a domain member fails, the actual fencing must be completed before
> GFS recovery can begin. This means any delay in carrying out the
> fencing operation will also delay the completion of GFS file system
> operations; most file system operations will hang during this period.
>
> So this is what I'm seeing now when a node fails, i.e. the rest of the
> nodes notice that the heartbeats of a certain node A have timed out. Node A
> is fenced by the remaining nodes, and the file system is hung. My
> questions are:
>
> 1) can I call fence_ack_manual right when I see that node A is fenced, or
> do I have to wait for node A to reboot, come back, and join the cluster?
>
> 2) if I set the post_fail_delay to -1, the fence daemon waits indefinitely
> for the failed node to rejoin the cluster, which it seems to be doing,
> so is this the default? The man page shows:
> <fence_daemon post_fail_delay="0">
>
> So with my assumption of the delay being 0, I expected the node to be
> fenced instantly on timeout, recovery to begin and complete, and my file
> system for the rest of the nodes to be usable in a relatively short time.
> I guess if the answer to 1) is that this recovery is done manually with
> the fence_ack_manual, then it all makes sense.
>
> thanks,
> dan
>
>