[Cluster-devel] unfencing
Fabio M. Di Nitto
fabbione at fabbione.net
Mon Feb 23 06:27:20 UTC 2009
Hi David,
On Fri, 2009-02-20 at 15:44 -0600, David Teigland wrote:
> Fencing devices that do not reboot a node, but just cut off storage have
> always required the impractical step of re-enabling storage access after the
> node has been reset. We've never provided a mechanism to automate this
> unfencing.
>
> Below is an outline of how we might automate unfencing with some simple
> extensions to the existing fencing library, config scheme and agents. It does
> not involve the fencing daemon (fenced). Nodes would unfence themselves when
> they start up. We might also consider a scheme where a node is unfenced by
> *other* nodes when it starts up, if that has any advantage over
> self-unfencing.
One use case where we need remote unfencing is recovering nodes that boot
from the shared storage, and those are not that uncommon.
I personally don't like the idea of exposing a -U option to users. It's
a shortcut that could easily be misused in an attempt to recover a node,
doing more damage than good, but I can't see another solution either.
> cluster3 is the context, but a similar thing would apply to a next generation
> unified fencing system, e.g.
> https://www.redhat.com/archives/cluster-devel/2008-October/msg00005.html
>
> init.d/cman would run:
> cman_tool join
> fence_node -U <ourname>
> qdiskd
> groupd
> fenced
> dlm_controld
> gfs_controld
> fence_tool join
>
> The new step fence_node -U <name> would call libfence:fence_node_undo(name).
> [fence_node <name> currently calls libfence:fence_node(name) to fence a node.]
>
> libfence:fence_node_undo(node_name) logic:
> for each device_name under given node_name,
> if an unfencedevice exists with name=device_name, then
> run the unfencedevice agent with first arg of "undo"
> and other args the normal combination of node and device args
> (any agent used with unfencing must recognize/support "undo")
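The lookup outlined above could be sketched roughly as follows. This is
Python purely for illustration (the real libfence is C), and the function
name, sample config and return shape are all made up here:

```python
# Sketch of the fence_node_undo(node_name) lookup: for each device under
# the node's fence methods, check whether a matching unfencedevice exists,
# and if so collect the agent plus the normal node/device args, to be run
# with a leading "undo" action.
import xml.etree.ElementTree as ET

CLUSTER_CONF = """
<cluster>
  <clusternodes>
    <clusternode name="foo" nodeid="3">
      <fence>
        <method name="1">
          <device name="san" node="foo"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="san" agent="fence_scsi"/>
  </fencedevices>
  <unfencedevices>
    <unfencedevice name="san" agent="fence_scsi"/>
  </unfencedevices>
</cluster>
"""

def unfence_actions(conf_xml, node_name):
    """Return (agent, device_args) pairs to run with first arg "undo"."""
    root = ET.fromstring(conf_xml)
    # Index the unfencedevices by name for the existence check.
    undo = {u.get("name"): u.get("agent")
            for u in root.findall("unfencedevices/unfencedevice")}
    actions = []
    node = root.find(f"clusternodes/clusternode[@name='{node_name}']")
    if node is None:
        return actions
    for dev in node.findall("fence/method/device"):
        agent = undo.get(dev.get("name"))
        if agent:
            actions.append((agent, dict(dev.attrib)))
    return actions
```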
All our agents already support on/off (enable/disable) operations. It's
probably best to align them on the same config options rather than
adding a new one across the board.
>
> [logic derived from cluster.conf structure and similar to fence_node logic]
>
> Example 1:
>
> <clusternode name="foo" nodeid="3">
> <fence>
> <method name="1">
> <device name="san" node="foo"/>
> </method>
> </fence>
> </clusternode>
>
> <fencedevices>
> <fencedevice name="san" agent="fence_scsi"/>
> </fencedevices>
>
> <unfencedevices>
> <unfencedevice name="san" agent="fence_scsi"/>
> </unfencedevices>
I think we can avoid the whole <unfence* structure entirely, either by
overriding the default action="" for that fence method or by treating
unfencing as a special-case method. The idea is to keep the whole fence
config for a node within the <clusternode> object rather than
spreading it out even more.
For example:
<method name="1">
<device name="san" node="foo"/>
</method>
<method name="unfence">
...
</method>
OR
<method name="1">
<device name="san" node="foo"/>
</method>
<method name="2" operation="unfence">
...
</method>
(clearly names and format are up for discussion)
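A rough illustration of the second variant, splitting a node's methods by
the proposed operation="unfence" attribute (attribute name and format
purely for discussion, as noted; Python only for illustration):

```python
# Walk the <fence> methods of a single clusternode and separate the
# normal fence methods from any marked operation="unfence".
import xml.etree.ElementTree as ET

NODE_XML = """
<clusternode name="foo" nodeid="3">
  <fence>
    <method name="1">
      <device name="san" node="foo"/>
    </method>
    <method name="2" operation="unfence">
      <device name="san" node="foo"/>
    </method>
  </fence>
</clusternode>
"""

def split_methods(node_xml):
    """Return ([fence method names], [unfence method names])."""
    node = ET.fromstring(node_xml)
    fence, unfence = [], []
    for m in node.findall("fence/method"):
        target = unfence if m.get("operation") == "unfence" else fence
        target.append(m.get("name"))
    return fence, unfence
```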
>
> [Note: we've talked about fence_scsi getting a device list from
> /etc/cluster/fence_scsi.conf instead of from clvm. It would require
> more user configuration, but would create fewer problems and should
> be more robust.]
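Such a device list might look something like this (the format is purely
hypothetical, for discussion only; device paths are made up):

```
# /etc/cluster/fence_scsi.conf -- hypothetical format:
# one shared device per line, instead of discovering them via clvm
/dev/mapper/san-lun0
/dev/mapper/san-lun1
```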
I think we should really consider starting a separate thread for this.
It seems to be an increasingly recurring issue.
Fabio