[Cluster-devel] unfencing

Mon Feb 23 06:27:20 UTC 2009

Hi David,

On Fri, 2009-02-20 at 15:44 -0600, David Teigland wrote:
> Fencing devices that do not reboot a node, but just cut off storage have
> always required the impractical step of re-enabling storage access after the
> node has been reset.  We've never provided a mechanism to automate this
> unfencing.
> 
> Below is an outline of how we might automate unfencing with some simple
> extensions to the existing fencing library, config scheme and agents.  It does
> not involve the fencing daemon (fenced).  Nodes would unfence themselves when
> they start up.  We might also consider a scheme where a node is unfenced by
> *other* nodes when it starts up, if that has any advantage over
> self-unfencing.

One use case where we need remote unfencing is recovering nodes that boot
from shared storage, and those are not that uncommon.

I personally don't like the idea of exposing a -U option to users. It's
a shortcut that could easily be misused in an attempt to recover a node
and do more harm than good, but I can't see another solution either.

> cluster3 is the context, but a similar thing would apply to a next generation
> unified fencing system, e.g.
> https://www.redhat.com/archives/cluster-devel/2008-October/msg00005.html
> 
> init.d/cman would run:
> 	cman_tool join
> 	fence_node -U <ourname>
> 	qdiskd
> 	groupd
> 	fenced
> 	dlm_controld
> 	gfs_controld
> 	fence_tool join
> 
> The new step fence_node -U <name> would call libfence:fence_node_undo(name).
> [fence_node <name> currently calls libfence:fence_node(name) to fence a node.]
> 
> libfence:fence_node_undo(node_name) logic:
> 	for each device_name under given node_name,
> 	if an unfencedevice exists with name=device_name, then
> 	run the unfencedevice agent with first arg of "undo"
> 	and other args the normal combination of node and device args
> 	(any agent used with unfencing must recognize/support "undo")

All our agents already support on/off or enable/disable operations. It's
probably best to align them on the same config options rather than
adding a new one across the board.

> 
> [logic derived from cluster.conf structure and similar to fence_node logic]
> 
> Example 1:
> 
> <clusternode name="foo" nodeid="3">
> 	<fence>
> 	<method name="1">
> 		<device name="san" node="foo"/>
> 	</method>
> 	</fence>
> </clusternode>
> 
> <fencedevices>
> 	<fencedevice name="san" agent="fence_scsi"/>
> </fencedevices>
> 
> <unfencedevices>
> 	<unfencedevice name="san" agent="fence_scsi"/>
> </unfencedevices>
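
To make the quoted fence_node_undo() lookup concrete, here is a minimal
Python sketch run against Example 1. It is an illustration only, not
libfence code: the function name plan_undo, the <cluster>/<clusternodes>
wrapper, the name="1" attribute on <method>, and the rule for merging
node and device args are all assumptions, and nothing is executed.

```python
import xml.etree.ElementTree as ET

# Example 1 from the proposal, wrapped in a <cluster> root so that it parses
# as a single document. The wrapper elements and name="1" are assumptions.
CLUSTER_CONF = """
<cluster>
  <clusternodes>
    <clusternode name="foo" nodeid="3">
      <fence>
        <method name="1">
          <device name="san" node="foo"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="san" agent="fence_scsi"/>
  </fencedevices>
  <unfencedevices>
    <unfencedevice name="san" agent="fence_scsi"/>
  </unfencedevices>
</cluster>
"""


def plan_undo(conf_xml, node_name):
    """Return the (agent, args) pairs that would unfence node_name.

    Hypothetical helper: it only resolves which agents would be called
    and with which arguments; it never runs an agent.
    """
    root = ET.fromstring(conf_xml)
    # Index the unfencedevice entries by name for the lookup step.
    unfence = {u.get("name"): u for u in root.iter("unfencedevice")}
    plan = []
    for node in root.iter("clusternode"):
        if node.get("name") != node_name:
            continue
        for dev in node.iter("device"):
            entry = unfence.get(dev.get("name"))
            if entry is None:
                continue  # no unfencedevice with this name: nothing to undo
            # Assumed merge rule: device-level args first, per-node args
            # override them; "name" and "agent" are not passed to the agent.
            args = {k: v for k, v in entry.attrib.items()
                    if k not in ("name", "agent")}
            args.update({k: v for k, v in dev.attrib.items() if k != "name"})
            argv = ["undo"] + [f"{k}={v}" for k, v in sorted(args.items())]
            plan.append((entry.get("agent"), argv))
    return plan


if __name__ == "__main__":
    print(plan_undo(CLUSTER_CONF, "foo"))
```

A node with no matching unfencedevice entry yields an empty plan, which
matches the "if an unfencedevice exists" condition in the outline.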

I think we can avoid the whole <unfence* structure, either by
overriding the default action="" for that fence method or possibly by
treating unfencing as a special-case method. The idea is to contain the
whole fence config for the node within the <clusternode> object rather
than spreading it out even more.

For example:

OR

(clearly, names and format are up for discussion)

> 
> [Note: we've talked about fence_scsi getting a device list from
>  /etc/cluster/fence_scsi.conf instead of from clvm.  It would require
>  more user configuration, but would create fewer problems and should
>  be more robust.]

I think we should really start a separate thread for this. It seems to
come up more and more often.

Fabio