[Linux-cluster] pull plug on node, service never relocates

Corey Kovacs corey.kovacs at gmail.com
Sat May 15 03:59:23 UTC 2010


What happens when you do ...

fence_node 192.168.1.4

from any of the other nodes?

If that doesn't work, then fencing is not configured correctly and you
should try to invoke the fence agent directly.
Also, it would help if you included the APC model and firmware rev.
The fence_apc agent can be finicky about such things.
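
Something along these lines, going from memory so double-check the exact
flags with fence_apc -h on your build, using the values straight out of
your cluster.conf to bounce node 192.168.1.4's outlet:

fence_apc -a 192.168.1.20 -l device -p 'wonderwomanWasAPrettyCoolSuperhero' -n 4 -o reboot

(plus whatever option your agent version takes for switch=1 if the PDU is
a multi-bank unit). fenced hands those same device attributes (ipaddr,
login, passwd, port, switch) to the agent on stdin as name=value lines,
so if the hand-run command works but fence_node still fails, compare
what's in cluster.conf against what you typed by hand.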


Hope this helps.

-Core

On Fri, May 14, 2010 at 8:45 PM, Dusty <dhoffutt at gmail.com> wrote:
> Greetings,
>
> Using the stock "clustering" and "cluster-storage" packages from the RHEL5
> Update 4 x86_64 ISO.
>
> As an example, using my config below:
>
> Node1 is running service1, node2 is running service2, and so on; node5 is
> a spare, available for the relocation of any failover domain / cluster
> service.
>
> If I go into the APC PDU and turn off the electrical port to node1, node2
> will fence node1 (going into the APC PDU and doing an off/on on node1's
> port); this is fine and works well. When node1 comes back up, it shuts down
> service1 and service1 relocates to node5.
>
> Now if I go into the lab and literally pull the plug on node5 running
> service1, another node fences node5 via the APC - I can check the APC PDU log
> and see that it has done an off/on on node5's electrical port just fine.
>
> But I pulled the plug on node5, so resetting the power doesn't matter. I want
> to simulate a completely dead node, and have the service relocate in this
> case of complete node failure.
>
> In this RHEL5.4 cluster, the service never relocates. I can simulate this on
> any node for any service. What if a node's motherboard fries?
>
> What can I set to have the remaining nodes stop waiting for the reboot of a
> failed node and just go ahead and relocate the cluster service that had been
> running on the now-failed node?
>
> Thank you!
>
> versions:
>
> cman-2.0.115-1.el5
> openais-0.80.6-8.el5
> modcluster-0.12.1-2.el5
> lvm2-cluster-2.02.46-8.el5
> rgmanager-2.0.52-1.el5
> ricci-0.12.2-6.el5
>
> cluster.conf (sanitized, real scripts removed, all gfs2 mounts gone for
> clarity):
> <?xml version="1.0"?>
> <cluster config_version="1"
> name="alderaanDefenseShieldRebelAllianceCluster">
>     <fence_daemon clean_start="0" post_fail_delay="3" post_join_delay="60"/>
>     <clusternodes>
>         <clusternode name="192.168.1.1" nodeid="1" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="apc_pdu" port="1" switch="1"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="192.168.1.2" nodeid="2" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="apc_pdu" port="2" switch="1"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="192.168.1.3" nodeid="3" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="apc_pdu" port="3" switch="1"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="192.168.1.4" nodeid="4" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="apc_pdu" port="4" switch="1"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="192.168.1.5" nodeid="5" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="apc_pdu" port="5" switch="1"/>
>                 </method>
>             </fence>
>         </clusternode>
>     </clusternodes>
>     <cman expected_votes="6"/>
>     <fencedevices>
>         <fencedevice agent="fence_apc" ipaddr="192.168.1.20" login="device"
> name="apc_pdu" passwd="wonderwomanWasAPrettyCoolSuperhero"/>
>     </fencedevices>
>     <rm>
>         <failoverdomains>
>             <failoverdomain name="fd1" nofailback="0" ordered="1"
> restricted="1">
>                 <failoverdomainnode name="192.168.1.1" priority="1"/>
>                 <failoverdomainnode name="192.168.1.2" priority="2"/>
>                 <failoverdomainnode name="192.168.1.3" priority="3"/>
>                 <failoverdomainnode name="192.168.1.4" priority="4"/>
>                 <failoverdomainnode name="192.168.1.5" priority="5"/>
>             </failoverdomain>
>             <failoverdomain name="fd2" nofailback="0" ordered="1"
> restricted="1">
>                 <failoverdomainnode name="192.168.1.1" priority="5"/>
>                 <failoverdomainnode name="192.168.1.2" priority="1"/>
>                 <failoverdomainnode name="192.168.1.3" priority="2"/>
>                 <failoverdomainnode name="192.168.1.4" priority="3"/>
>                 <failoverdomainnode name="192.168.1.5" priority="4"/>
>             </failoverdomain>
>             <failoverdomain name="fd3" nofailback="0" ordered="1"
> restricted="1">
>                 <failoverdomainnode name="192.168.1.1" priority="4"/>
>                 <failoverdomainnode name="192.168.1.2" priority="5"/>
>                 <failoverdomainnode name="192.168.1.3" priority="1"/>
>                 <failoverdomainnode name="192.168.1.4" priority="2"/>
>                 <failoverdomainnode name="192.168.1.5" priority="3"/>
>             </failoverdomain>
>             <failoverdomain name="fd4" nofailback="0" ordered="1"
> restricted="1">
>                 <failoverdomainnode name="192.168.1.1" priority="3"/>
>                 <failoverdomainnode name="192.168.1.2" priority="4"/>
>                 <failoverdomainnode name="192.168.1.3" priority="5"/>
>                 <failoverdomainnode name="192.168.1.4" priority="1"/>
>                 <failoverdomainnode name="192.168.1.5" priority="2"/>
>             </failoverdomain>
>         </failoverdomains>
>         <resources>
>             <ip address="10.1.1.1" monitor_link="1"/>
>             <ip address="10.1.1.2" monitor_link="1"/>
>             <ip address="10.1.1.3" monitor_link="1"/>
>             <ip address="10.1.1.4" monitor_link="1"/>
>             <ip address="10.1.1.5" monitor_link="1"/>
>             <script file="/usr/local/bin/service1" name="service1"/>
>             <script file="/usr/local/bin/service2" name="service2"/>
>             <script file="/usr/local/bin/service3" name="service3"/>
>             <script file="/usr/local/bin/service4" name="service4"/>
>         </resources>
>         <service autostart="1" domain="fd1" exclusive="1" name="service1"
> recovery="relocate">
>             <ip ref="10.1.1.1"/>
>             <script ref="service1"/>
>         </service>
>         <service autostart="1" domain="fd2" exclusive="1" name="service2"
> recovery="relocate">
>             <ip ref="10.1.1.2"/>
>             <script ref="service2"/>
>         </service>
>         <service autostart="1" domain="fd3" exclusive="1" name="service3"
> recovery="relocate">
>             <ip ref="10.1.1.3"/>
>             <script ref="service3"/>
>         </service>
>         <service autostart="1" domain="fd4" exclusive="1" name="service4"
> recovery="relocate">
>             <ip ref="10.1.1.4"/>
>             <script ref="service4"/>
>         </service>
>     </rm>
> </cluster>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



