[Linux-cluster] problem with fencing

Fri Aug 5 15:21:02 UTC 2005

On Fri, Aug 05, 2005 at 05:02:05PM +0200, Javi Polo wrote:
> Hi there
> 
> I'm trying to set up GFS to work with a SAN ... and I want to use a
> script for fencing instead of fence_manual, but it doesn't work :/
> 
> To test that, I run "ifconfig eth0 down" on gfstest2, and gfstest1's syslog says:
> Aug  5 16:51:13 gfstest1 fenced: gfstest2 not a cluster member after 0 sec post_fail_delay
> Aug  5 16:51:13 gfstest1 fenced: fencing node "gfstest2"
> Aug  5 16:51:13 gfstest1 fence_manual: Node 192.168.1.2 needs to be reset before recovery can procede.  Waiting for 192.168.1.2 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 192.168.1.2)
> 
> I want it to be automatic, so I modified fence_sanbox2.pl to fit
> our switch commands. (I attached it to another mail some days ago)
> 
> the script works fine if I run it manually:
> gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4      
> portDisable 4 
> success: disable 4
> gfstest1:~# fence_IBMswitch -a 10.1.1.1 -l admin -p tangerine -n 4 -o enable
> portEnable 4 
> success: enable 4
> gfstest1:~# 
> 
> could anybody give me a hint?
> I'm using lock_dlm

Did you update the cluster.conf file across all the nodes?  Could it be that
gfstest1 still has the old cluster.conf file?  That might account for the
manual fencing being run.  
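
One quick way to check for a stale copy is to compare config_version on every node. A rough Python sketch follows; the two XML snippets are hypothetical stand-ins for what each node's /etc/cluster/cluster.conf might contain, and how you collect the copies (scp, ssh, ...) is left out:

```python
import xml.etree.ElementTree as ET

def config_version(cluster_conf_text):
    """Return (cluster name, config_version) parsed from cluster.conf text."""
    root = ET.fromstring(cluster_conf_text)
    return root.get("name"), int(root.get("config_version"))

# Hypothetical copies of cluster.conf, one per node (illustration only).
copies = {
    "gfstest1": '<cluster name="test_cluster" config_version="3"/>',
    "gfstest2": '<cluster name="test_cluster" config_version="4"/>',
}

versions = {node: config_version(text)[1] for node, text in copies.items()}
newest = max(versions.values())
# Nodes whose copy is older than the newest version seen anywhere.
stale = sorted(node for node, v in versions.items() if v < newest)
print(stale)
```

Any node listed as stale would still be fencing according to the old file.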

Another way you could end up in manual fencing with this configuration: if
the first method ("fibre", which uses the "san" device) fails, the second
method ("single") is called.  What's odd is that there should still be
something in the logs showing the output of the first command.  I would hope
there would also be an error in the logs in the event that you forgot to put
"fence_IBM" in the path or make it executable.  I'd consider it a bug if
that weren't the case.
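
To make the fallback order concrete, here is a small Python sketch that lists, for one node, each fence method in file order together with the agent it resolves to. CONF is a trimmed copy of the cluster.conf quoted below; fence_plan is a hypothetical helper name, and the shutil.which lookup only reflects the PATH of whoever runs the script, which is not necessarily the one fenced uses:

```python
import shutil
import xml.etree.ElementTree as ET

CONF = """<cluster name="test_cluster" config_version="4">
  <fencedevices>
    <fencedevice name="human" agent="fence_manual"/>
    <fencedevice name="san" agent="fence_IBM" ipaddr="10.1.1.1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="gfstest2" nodeid="2" votes="1">
      <fence>
        <method name="fibre"><device name="san" port="4"/></method>
        <method name="single"><device name="human" ipaddr="192.168.1.2"/></method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>"""

def fence_plan(conf_text, node):
    """Return [(method name, agent)] for a node, in the order listed in the file."""
    root = ET.fromstring(conf_text)
    # Map each fencedevice name to the agent it runs.
    agents = {d.get("name"): d.get("agent") for d in root.iter("fencedevice")}
    plan = []
    for cn in root.iter("clusternode"):
        if cn.get("name") != node:
            continue
        for method in cn.iter("method"):
            for dev in method.iter("device"):
                plan.append((method.get("name"), agents.get(dev.get("name"))))
    return plan

for method, agent in fence_plan(CONF, "gfstest2"):
    # shutil.which returns None when no executable of that name is on this PATH.
    found = shutil.which(agent) if agent else None
    print(method, agent, found)
```

If the agent for the first method shows up as not found, that would be one reason for falling through to fence_manual.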

Lastly, I've not looked too closely at your script for fence_IBMswitch (I
think that's what you called it in the previous email... did you rename it to
fence_IBM?), but fenced determines success or failure not from the agent's
output but from its exit status.  If the agent exits with 0, the fencing
operation succeeded; any other status counts as a failure.  This might
explain why the second method is being called, but it wouldn't explain why
there is no output in the logs from the first.
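
To illustrate the exit-status point, here is a minimal Python sketch. It does not run your real agent: fake_agent is a made-up stand-in that prints a success message but exits with status 1, and fencing_succeeded is a hypothetical helper that applies the "0 means success" rule:

```python
import subprocess
import sys

def fencing_succeeded(argv):
    """Run a command and judge it by exit status alone: 0 is success."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode == 0, result.stdout

# Stand-in agent (assumption, for illustration): friendly output, non-zero exit.
fake_agent = [
    sys.executable, "-c",
    "print('success: disable 4'); raise SystemExit(1)",
]

ok, output = fencing_succeeded(fake_agent)
print(ok, output.strip())
```

So an agent that prints "success" on the command line can still count as a failed fencing operation. Running the real script by hand and checking `echo $?` right afterwards would show which case you are in.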

> this is my cluster.conf:
> 
> <?xml version="1.0"?>
> <cluster name="test_cluster" config_version="4">
> 
>         <fencedevices>
>           <fencedevice name="human" agent="fence_manual"/>
>           <fencedevice name="san" agent="fence_IBM" ipaddr="10.1.1.1" login="admin" passwd="tangerine"/>
>         </fencedevices>
> 
>         <fence_daemon clean_start="0">
>         </fence_daemon>
> 
>         <cman>
>         </cman>
> 
>         <clusternodes>
>           <clusternode name="gfstest1" nodeid="1" votes="1">
>              <fence>
>                <method name="fibre">
>                  <device name="san" port="5"/>
>                </method>
>                <method name="single">
>                  <device name="human" ipaddr="192.168.1.1"/>
>                </method>
>              </fence>
>           </clusternode>
> 
>           <clusternode name="gfstest2" nodeid="2" votes="1">
>              <fence>
>                <method name="fibre">
>                  <device name="san" port="4"/>
>                </method>
>                <method name="single">
>                  <device name="human" ipaddr="192.168.1.2"/>
>                </method>
>              </fence>
>           </clusternode>
> 
>           <clusternode name="gfstest3" nodeid="3" votes="1">
>              <fence>
>                <method name="fibre">
>                  <device name="san" port="3"/>
>                </method>
>                <method name="single">
>                  <device name="human" ipaddr="192.168.1.3"/>
>                </method>
>              </fence>
>           </clusternode>
>         </clusternodes>
> 
> </cluster>
> 
> -- 
> Javier Polo @ Datagrama
> 902 136 126
> 

-- 
Adam Manthei  <amanthei at redhat.com>