[Linux-cluster] fencing problem

Marcos David marcos.david at efacec.pt
Mon Dec 11 18:13:02 UTC 2006


Hello,
I'm experiencing some problems with cluster fencing.
First, let's start with the specs:

It's a two-node cluster (Sun X4100) running RHEL4 Update 4 and RHCS 4.

Both machines have an ILOM device that acts as the first level of fencing.
A second level of fencing is performed by a UPS.

My problem is the following:
If I shut down one of the nodes (simulating a power failure), the other
node tries to fence the failed node. So far so good.
The problem is that, since the ILOM in the failed node is offline, the
surviving node keeps trying to fence through the ILOM device and never gives up!

According to what I've read in the FAQ about fencing levels, if the
first level fails, fencing should move on to the second level, and so on...

But it never does this!
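
To make the expectation concrete, here is a minimal Python sketch of the fallback behaviour the FAQ describes. It is an illustration only, not fenced's actual code: `fence_node`, `ilom_offline` and `ups_ok` are hypothetical names, and the two agent functions merely simulate an unreachable ILOM and a working UPS.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-in for a fence agent: returns True if fencing succeeded.
Agent = Callable[[str], bool]


def fence_node(node: str, methods: Sequence[Tuple[str, Agent]]) -> Optional[str]:
    """Try each fencing method in order; return the name of the first that succeeds."""
    for name, agent in methods:
        if agent(node):
            return name
    return None  # every level failed; the caller decides whether to retry


def ilom_offline(node: str) -> bool:
    return False  # simulates the ILOM being unreachable (node has no power)


def ups_ok(node: str) -> bool:
    return True  # simulates the UPS successfully cutting power


# Method "1" fails, so method "2" should be the one that fences the node.
print(fence_node("node_a", [("1", ilom_offline), ("2", ups_ok)]))
```

With the first level failing, this sketch falls through to method "2", which is the behaviour I expected from the cluster.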

Here is a copy of /var/log/messages:

Dec 11 17:50:28 node_b kernel: CMAN: removing node node_a from the cluster : Missed too many heartbeats
Dec 11 17:50:28 node_b fenced[3240]: node_a not a cluster member after 0 sec post_fail_delay
Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect after 30 seconds Failed
Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection descriptor received.
Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid request descriptor
Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"

The last 4 lines repeat forever.
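
One way to confirm that only the first-level agent is ever invoked is to summarize the fenced messages. The following is a rough sketch, assuming the log format shown above; `summarize` is a hypothetical helper and the sample text is a shortened copy of the excerpt with wrapped lines rejoined.

```python
import re

# Sample lines copied from the log excerpt, with wrapped lines rejoined.
LOG = """\
Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect after 30 seconds Failed
Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"
"""


def summarize(log: str) -> dict:
    """Count fence attempts and failures, and list the agents that reported."""
    attempts = len(re.findall(r'fenced\[\d+\]: fencing node "', log))
    failures = len(re.findall(r'fenced\[\d+\]: fence "[^"]+" failed', log))
    agents = sorted(set(re.findall(r'fenced\[\d+\]: agent "([^"]+)" reports', log)))
    return {"attempts": attempts, "failures": failures, "agents": agents}


print(summarize(LOG))
```

If the second level were ever tried, `fence_apc` would show up in the `agents` list; in this excerpt only `fence_ipmilan` appears.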

Here is a copy of the cluster.conf:


<?xml version="1.0"?>
<cluster config_version="19" name="SERVER-A">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="node-a" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="fence_node-a"/>
                                </method>
                                <method name="2">
                                        <device name="UPS_node-a"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node-b" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="fence_node-b"/>
                                </method>
                                <method name="2">
                                        <device name="UPS_node-b"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" auth="password" ipaddr="172.18.57.17" login="root" name="fence_node-a" passwd="changeme"/>
                <fencedevice agent="fence_ipmilan" auth="password" ipaddr="172.18.57.18" login="root" name="fence_node-b" passwd="changeme"/>
                <fencedevice agent="fence_apc" ipaddr="172.18.57.20" login="power" name="UPS_node-a" passwd="power"/>
                <fencedevice agent="fence_apc" ipaddr="172.18.57.21" login="power" name="UPS_node-b" passwd="power"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="Cluster_0" ordered="1" restricted="0">
                                <failoverdomainnode name="node-a" priority="1"/>
                                <failoverdomainnode name="node-b" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <fs device="/dev/sdb1" force_fsck="1" force_unmount="1" fsid="46144" fstype="ext3" mountpoint="/mnt/shared" name="Storedge_Shared" options="" self_fence="1"/>
                        <ip address="172.18.57.16" monitor_link="1"/>
                        <ip address="172.18.57.11" monitor_link="1"/>
                        <ip address="172.18.57.14" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="Cluster_0" name="postgresql">
                        <ip ref="172.18.57.16">
                                <fs ref="Storedge_Shared">
                                        <script file="/etc/init.d/postgresql" name="PostgreSQL"/>
                                </fs>
                        </ip>
                </service>
                <service autostart="1" domain="Cluster_0" name="afs">
                        <ip ref="172.18.57.14">
                                <script file="/etc/init.d/afs" name="AFS"/>
                        </ip>
                </service>
        </rm>
</cluster>
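
For reference, the fencing order that this configuration declares can be listed with a short script. This is only a sketch, assuming the `clusternode`/`fence`/`method`/`device` layout shown above; `fence_plan` is a hypothetical helper and `CONF` is a trimmed copy of the file containing just node-a and its two devices.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above: node-a and its two fence devices only.
CONF = """<?xml version="1.0"?>
<cluster config_version="19" name="SERVER-A">
  <clusternodes>
    <clusternode name="node-a" votes="1">
      <fence>
        <method name="1"><device name="fence_node-a"/></method>
        <method name="2"><device name="UPS_node-a"/></method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="172.18.57.17" name="fence_node-a"/>
    <fencedevice agent="fence_apc" ipaddr="172.18.57.20" name="UPS_node-a"/>
  </fencedevices>
</cluster>
"""


def fence_plan(conf_xml: str) -> dict:
    """Map each node name to its fence methods, in file order, with the agents used."""
    root = ET.fromstring(conf_xml)
    # Look up which agent backs each named fence device.
    agents = {d.get("name"): d.get("agent") for d in root.iter("fencedevice")}
    plan = {}
    for node in root.iter("clusternode"):
        plan[node.get("name")] = [
            (m.get("name"),
             [agents.get(d.get("name"), "UNDEFINED") for d in m.findall("device")])
            for m in node.findall("fence/method")
        ]
    return plan


print(fence_plan(CONF))
```

`ET.fromstring` raises `ParseError` on XML that is not well-formed, so running this against the real file also serves as a basic syntax check. A device name with no matching `fencedevice` entry is reported as `UNDEFINED`.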

I would like to know how to solve this problem. :-)

Thanks in advance,

Marcos David






