[Linux-cluster] fencing problem

Marcos David marcos.david at efacec.pt
Thu Dec 14 15:19:21 UTC 2006


Hello,
I still need help with this one ;)

help! please!

Thanks.

Marcos David wrote:
> Hello,
> I'm experiencing some problems with cluster fencing.
> First, let's start with the specs:
>
> It's a two-node cluster (Sun X4100) running RHEL4 Update 4 and RHCS 4.
>
> Both machines have an ILOM device that acts as the first level of fencing.
> A second level of fencing is performed by a UPS.
>
> My problem is the following:
> If I shut down one of the nodes (simulating a power failure), the other
> tries to fence the failed node. So far so good.
> The problem is that, since the ILOM in the failed node is offline, the
> surviving node keeps trying to fence via the ILOM device and never gives up!
>
> According to what I've read in the FAQ about fencing levels, if the
> first level fails, it should go on to the second level, and so on...
>
> But it never does this!
>
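The fallback described in the FAQ, as understood here, can be sketched in a few lines of Python. This is only an illustration of the expected behaviour, not how fenced is actually implemented: `fence_once`, `ilom_offline` and `ups_power_cycle` are hypothetical names, and the two agent functions merely model an unreachable ILOM and a working UPS.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a fence agent: returns True if the node was fenced.
Agent = Callable[[], bool]


def fence_once(methods: List[Tuple[str, Agent]]) -> Optional[str]:
    """Try each fence method in order; return the name of the first that succeeds."""
    for name, agent in methods:
        if agent():
            return name
    return None  # every level failed; a caller could retry the whole list


def ilom_offline() -> bool:
    # Models method "1": the ILOM has lost power, so the agent cannot connect.
    return False


def ups_power_cycle() -> bool:
    # Models method "2": the UPS is still reachable and cuts power to the node.
    return True


methods = [("1", ilom_offline), ("2", ups_power_cycle)]
print(fence_once(methods))  # prints 2
```

With method 1 failing and method 2 succeeding, the sketch falls through to the second level. In the log that follows, only fence_ipmilan output appears.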
> Here is a copy of /var/log/messages:
>
> Dec 11 17:50:28 node_b kernel: CMAN: removing node node_a from the 
> cluster : Missed too many heartbeats
> Dec 11 17:50:28 node_b fenced[3240]: node_a not a cluster member after 
> 0 sec post_fail_delay
> Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
> Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports: 
> Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect 
> after 30 seconds Failed
> Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection 
> descriptor received.
> Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid 
> request descriptor
> Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
> Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"
>
> The last 4 lines repeat forever....
>
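The timestamps in the excerpt above give a rough idea of how long each attempt takes. The Python sketch below only does the arithmetic on the quoted timestamps. `seconds_between` is a hypothetical helper, and it assumes both times fall on the same day.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Seconds elapsed between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# "fencing node" logged at 17:50:28, agent failure reported at 17:52:47
attempt = seconds_between("17:50:28", "17:52:47")
# failure logged at 17:52:47, next "fencing node" at 17:52:52
retry_gap = seconds_between("17:52:47", "17:52:52")

print(attempt)    # prints 139
print(retry_gap)  # prints 5
```

So a single fence_ipmilan attempt occupies about 139 seconds before fenced logs the failure, and the next attempt starts 5 seconds later.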
> Here is a copy of the cluster.conf:
>
>
> <?xml version="1.0"?>
> <cluster config_version="19" name="SERVER-A">
>        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>        <clusternodes>
>                <clusternode name="node-a" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="fence_node-a"/>
>                                </method>
>                                <method name="2">
>                                        <device name="UPS_node-a"/>
>                                </method>
>                        </fence>
>                </clusternode>
>                <clusternode name="node-b" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="fence_node-b"/>
>                                </method>
>                                <method name="2">
>                                        <device name="UPS_node-b"/>
>                                </method>
>                        </fence>
>                </clusternode>
>        </clusternodes>
>        <cman expected_votes="1" two_node="1"/>
>        <fencedevices>
>                <fencedevice agent="fence_ipmilan" auth="password" 
> ipaddr="172.18.57.17" login="root" name="fence_node-a" 
> passwd="changeme"/>
>                <fencedevice agent="fence_ipmilan" auth="password" 
> ipaddr="172.18.57.18" login="root" name="fence_node-b" 
> passwd="changeme"/>
>                <fencedevice agent="fence_apc" ipaddr="172.18.57.20" 
> login="power" name="UPS_node-a" passwd="power"/>
>                <fencedevice agent="fence_apc" ipaddr="172.18.57.21" 
> login="power" name="UPS_node-b" passwd="power"/>
>
>        </fencedevices>
>        <rm>
>                <failoverdomains>
>                        <failoverdomain name="Cluster_0" ordered="1" 
> restricted="0">
>                                <failoverdomainnode name="node-a" 
> priority="1"/>
>                                <failoverdomainnode name="node-b" 
> priority="1"/>
>                        </failoverdomain>
>                </failoverdomains>
>                <resources>
>                        <fs device="/dev/sdb1" force_fsck="1" 
> force_unmount="1" fsid="46144" fstype="ext3" mountpoint="/mnt/shared" 
> name="Storedge_Shared" options="" self_fence="1"/>
>                        <ip address="172.18.57.16" monitor_link="1"/>
>                        <ip address="172.18.57.11" monitor_link="1"/>
>                        <ip address="172.18.57.14" monitor_link="1"/>
>                </resources>
>                <service autostart="1" domain="Cluster_0" 
> name="postgresql">
>                        <ip ref="172.18.57.16">
>                                <fs ref="Storedge_Shared">
>                                        <script 
> file="/etc/init.d/postgresql" 
> name="PostgreSQL"/>
>                                </fs>
>                        </ip>
>                </service>
>                <service autostart="1" domain="Cluster_0" name="afs">
>                        <ip ref="172.18.57.14">
>                                <script file="/etc/init.d/afs" 
> name="AFS"/>
>                        </ip>
>                </service>
>        </rm>
> </cluster>
>
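To double-check which device and agent each fence method resolves to, the fencing parts of the configuration can be walked programmatically. Below is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. `CONF` is a trimmed copy of the file above (credentials and the `<rm>` section are left out), and `fence_plan` is a hypothetical helper, not part of RHCS.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fencing-related parts of the cluster.conf quoted above.
CONF = """<cluster config_version="19" name="SERVER-A">
  <clusternodes>
    <clusternode name="node-a" votes="1">
      <fence>
        <method name="1"><device name="fence_node-a"/></method>
        <method name="2"><device name="UPS_node-a"/></method>
      </fence>
    </clusternode>
    <clusternode name="node-b" votes="1">
      <fence>
        <method name="1"><device name="fence_node-b"/></method>
        <method name="2"><device name="UPS_node-b"/></method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="172.18.57.17" name="fence_node-a"/>
    <fencedevice agent="fence_ipmilan" ipaddr="172.18.57.18" name="fence_node-b"/>
    <fencedevice agent="fence_apc" ipaddr="172.18.57.20" name="UPS_node-a"/>
    <fencedevice agent="fence_apc" ipaddr="172.18.57.21" name="UPS_node-b"/>
  </fencedevices>
</cluster>"""


def fence_plan(conf_xml):
    """Map each node name to its ordered list of (method, device, agent)."""
    root = ET.fromstring(conf_xml)
    # Device name -> agent, taken from the <fencedevices> section.
    agents = {d.get("name"): d.get("agent") for d in root.iter("fencedevice")}
    plan = {}
    for node in root.iter("clusternode"):
        levels = []
        for method in node.findall("fence/method"):
            for dev in method.findall("device"):
                name = dev.get("name")
                # agent is None if the device is not defined in <fencedevices>
                levels.append((method.get("name"), name, agents.get(name)))
        plan[node.get("name")] = levels
    return plan


for node, levels in fence_plan(CONF).items():
    for method, device, agent in levels:
        print(node, method, device, agent)
```

For both nodes this lists the fence_ipmilan device under method 1 and the fence_apc device under method 2, and every referenced device name resolves to a defined `<fencedevice>`.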
> I would like to know a way to solve this problem.... :-)
>
> Thanks in advance,
>
> Marcos David
>
>
>
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>




