[Linux-cluster] RE: Fencing quandry

Wed Oct 15 07:53:04 UTC 2008

Jeff

If you do not need the fenced node to come back (in your case it cannot
come back because of the hardware issues), you can remove the "on" fence
action and have the fence device issue only an "off" command.
With no power-on step left to fail, this should return a success as soon
as the power-off succeeds.

In this case the fenced node will never return to life without human
interaction, but that is no worse than the situation you are in now.
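
As a rough sketch of that change, based on James's example quoted below:
the per-node fence section would keep only the power-off entry. The device
name "ilo01" is taken from his example, and this assumes the iLO agent on
your release accepts action="off" in the same way; it is untested on
Red Hat 4.7.

```xml
<!-- Sketch only: per-node fence section with the "on" step removed. -->
<!-- "ilo01" is the device name from James's example; substitute your own. -->
<fence>
        <method name="1">
                <device name="ilo01" action="off"/>
        </method>
</fence>
```

As with James's workaround, the version number in cluster.conf would need
to be incremented when making this change.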

Erling

On Wed, Oct 15, 2008 at 12:43 AM, Jeff Stoner <jstoner at opsource.net> wrote:
> Thanks for the response, James. Unfortunately, it doesn't fully answer
> my question or at least, I'm not following the logic. The bug report
> would seem to indicate a problem with using the default "reboot" method
> of the agent. The workaround simply replaces the single fence device
> ('reboot') with 2 fence devices ('off' followed by 'on') in the same
> fence method. If the server fails to power on, then, according to the
> FAQ, fencing still fails ("All fence devices within a fence method must
> succeed in order for the method to succeed").
>
> I'm back to fenced being a SPoF if hardware failures prevent a fenced
> node from powering on.
>
> --Jeff
> Performance Engineer
>
> OpSource, Inc.
> http://www.opsource.net
> "Your Success is Our Success"
>
>
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com
>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of
>> Hofmeister, James (WTEC Linux)
>> Sent: Tuesday, October 14, 2008 1:40 PM
>> To: linux clustering
>> Subject: [Linux-cluster] RE: Fencing quandry
>>
>> Hello Jeff,
>>
>> I am working with RedHat on a RHEL-5 fencing issue with
>> c-class blades...  We have bugzilla 433864 opened for this,
>> and my notes state that it is to be resolved in RHEL-5.3.
>>
>> We had a workaround in the RHEL-5 cluster configuration:
>>
>>   In the /etc/cluster/cluster.conf
>>
>>   *Update version number by 1.
>>   *Then edit the fence device section for "each" node for example:
>>
>>                         <fence>
>>                                 <method name="1">
>>                                         <device name="ilo01"/>
>>                                 </method>
>>                         </fence>
>>   change to  -->
>>                         <fence>
>>                                 <method name="1">
>>                                         <device name="ilo01" action="off"/>
>>                                         <device name="ilo01" action="on"/>
>>                                 </method>
>>                         </fence>
>>
>> Regards,
>> James Hofmeister
>> Hewlett Packard Linux Solutions Engineer
>>
>>
>>
>> |-----Original Message-----
>> |From: linux-cluster-bounces at redhat.com
>> |[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Stoner
>> |Sent: Tuesday, October 14, 2008 8:32 AM
>> |To: linux clustering
>> |Subject: [Linux-cluster] Fencing quandry
>> |
>> |We had a "that totally sucks" event the other night involving fencing.
>> |In short - Red Hat 4.7, 2 node cluster using iLO fencing with HP blade
>> |servers:
>> |
>> |- passive node determined active node was unresponsive (missed too many
>> |heartbeats)
>> |- passive node initiates take-over and begins fencing process
>> |- fencing agent successfully powers off blade server
>> |- fencing agent sits in an endless loop trying to power on the
>> |blade, which won't power up
>> |- the cluster appears "stalled" at this point because fencing
>> |won't complete
>> |
>> |I was able to complete the failover by swapping out the
>> |fencing agent with a shell script that does "exit 0". This
>> |allowed the fencing agent to complete so the resource manager
>> |could successfully relocate the service.
>> |
>> |My question becomes: why isn't a successful power off
>> |considered sufficient for a take-over of a service? If the
>> |power is off, you've guaranteed that all resources are
>> |released by that node. By requiring a successful power on
>> |(which may never happen due to hardware failure,) the fencing
>> |agent becomes a single point of failure in the cluster. The
>> |fencing agent should make an attempt to power on a down node
>> |but it shouldn't hold up the failover process if that attempt fails.
>> |
>> |
>> |
>> |--Jeff
>> |Performance Engineer
>> |
>> |OpSource, Inc.
>> |http://www.opsource.net
>> |"Your Success is Our Success"
>> |
>> |
>> |--
>> |Linux-cluster mailing list
>> |Linux-cluster at redhat.com
>> |https://www.redhat.com/mailman/listinfo/linux-cluster
>> |
>>
>