[Linux-cluster] fencing problem

Thu Oct 16 15:29:23 UTC 2008

All,

I'll provide some more config details a little later, but thought
some cursory information might yield a response.  Simple four-node
cluster running RHEL4U7, latest RHEL cluster packages.  Three GFS
filesystems.  This morning one of our nodes remained responsive but
was having problems that required a reboot.  Unfortunately, most
commands run from the command line failed with Input/Output errors
(it seems the root filesystem may have been remounted read-only).  I
decided to fence the node from another node in the cluster using
fence_node <nodename>, which calls fence_drac.  The operation reported
success, and the node was fenced and rebooted.

After this fencing operation, all nodes reported their Membership
state (as shown by cman_tool status) as Transition-Master.  Per
http://sources.redhat.com/cluster/faq.html#gfs_fencefreeze, I
understand that GFS will freeze briefly after fencing is performed.
However, the filesystems never returned to a responsive state.  After
many transition restarts, all nodes left the cluster (as expected).
Some logs and cluster.conf below.

Shawn

Oct 16 10:09:12 hugin fence_node[3512]: Fence of "munin" was successful
Oct 16 10:09:32 hugin kernel: CMAN: removing node munin from the cluster : Missed too many heartbeats
Oct 16 10:09:32 hugin kernel: CMAN: Initiating transition, generation 69
Oct 16 10:09:47 hugin kernel: CMAN: Initiating transition, generation 70
Oct 16 10:10:02 hugin kernel: CMAN: Initiating transition, generation 71
Oct 16 10:10:17 hugin kernel: CMAN: Initiating transition, generation 72
Oct 16 10:10:32 hugin kernel: CMAN: Initiating transition, generation 73
Oct 16 10:10:47 hugin kernel: CMAN: Initiating transition, generation 74
Oct 16 10:11:02 hugin kernel: CMAN: Initiating transition, generation 75
Oct 16 10:11:17 hugin kernel: CMAN: Initiating transition, generation 76
Oct 16 10:11:32 hugin kernel: CMAN: Initiating transition, generation 77
Oct 16 10:11:47 hugin kernel: CMAN: Initiating transition, generation 78
Oct 16 10:12:02 hugin kernel: CMAN: Initiating transition, generation 79
Oct 16 10:12:14 hugin kernel: CMAN: removing node odin from the cluster : Inconsistent cluster view
Oct 16 10:12:14 hugin kernel: CMAN: Initiating transition, generation 80
Oct 16 10:12:14 hugin kernel: CMAN: removing node odin from the cluster : Inconsistent cluster view
Oct 16 10:12:14 hugin kernel: CMAN: Initiating transition, generation 81
Oct 16 10:12:16 hugin kernel: CMAN: removing node zeus from the cluster : Inconsistent cluster view
Oct 16 10:12:16 hugin kernel: CMAN: quorum lost, blocking activity
Oct 16 10:12:16 hugin clurgmgrd[8799]: <emerg> #1: Quorum Dissolved
Oct 16 10:12:16 hugin kernel: CMAN: removing node zeus from the cluster : Inconsistent cluster view
Oct 16 10:12:19 hugin ccsd[6330]: Cluster is not quorate.  Refusing connection.
Oct 16 10:12:19 hugin ccsd[6330]: Error while processing connect: Connection refused
Oct 16 10:12:29 hugin ccsd[6330]: Cluster is not quorate.  Refusing connection.
Oct 16 10:12:29 hugin ccsd[6330]: Error while processing connect: Connection refused
Oct 16 10:12:39 hugin ccsd[6330]: Cluster is not quorate.  Refusing connection.
Oct 16 10:13:47 hugin kernel: CMAN: node munin rejoining
Oct 16 10:13:47 hugin kernel: CMAN: Completed transition, generation 81
Oct 16 10:13:49 hugin ccsd[6330]: Cluster is not quorate.  Refusing connection.
Oct 16 10:13:49 hugin ccsd[6330]: Error while processing connect: Connection refused
-- previous error message repeated several times ---

Another node in the same cluster, after fencing munin from hugin:
Oct 16 10:09:31 zeus kernel: CMAN: removing node munin from the cluster : Missed too many heartbeats
Oct 16 10:09:31 zeus kernel: CMAN: Initiating transition, generation 69
Oct 16 10:09:46 zeus kernel: CMAN: Initiating transition, generation 70
Oct 16 10:10:01 zeus kernel: CMAN: Initiating transition, generation 71
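
For anyone comparing logs across nodes, here is a minimal sketch that pulls the "Initiating transition" lines out of syslog output like the above and reports the generation range and the gap between restarts. The function name and the regular expression are illustrative only (they are not part of any cluster tool), and the year is assumed from the date of this post because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches "Initiating transition" lines in the format shown in the logs above.
TRANSITION = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) kernel: "
    r"CMAN: Initiating transition, generation (\d+)"
)

# Assumption: syslog omits the year, so the year of this post is supplied.
ASSUMED_YEAR = "2008"


def transition_summary(lines):
    """Return (first_generation, last_generation, gaps_in_seconds), or None."""
    stamps, gens = [], []
    for line in lines:
        m = TRANSITION.match(line)
        if m:
            stamp = datetime.strptime(
                ASSUMED_YEAR + " " + m.group(1), "%Y %b %d %H:%M:%S"
            )
            stamps.append(stamp)
            gens.append(int(m.group(3)))
    if not gens:
        return None
    # Seconds between each consecutive pair of transition attempts.
    gaps = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
    return gens[0], gens[-1], gaps


sample = [
    "Oct 16 10:09:31 zeus kernel: CMAN: Initiating transition, generation 69",
    "Oct 16 10:09:46 zeus kernel: CMAN: Initiating transition, generation 70",
    "Oct 16 10:10:01 zeus kernel: CMAN: Initiating transition, generation 71",
]
print(transition_summary(sample))
```

Run against the zeus excerpt, this shows the transition being restarted every 15 seconds, the same cadence seen on hugin.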

cluster.conf:

<?xml version="1.0"?>
<cluster alias="tungsten" config_version="31" name="qualia">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="odin" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" name="odin-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="hugin" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" name="hugin-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="munin" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" name="munin-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="zeus" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" name="zeus-drac"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="0"/>
        <fencedevices>
                <resources/>
                <fencedevice name="odin-drac" agent="fence_drac" <redacted>/>
                <fencedevice name="hugin-drac" agent="fence_drac" <redacted>/>
                <fencedevice name="munin-drac" agent="fence_drac" <redacted>/>
                <fencedevice name="zeus-drac" agent="fence_drac" <redacted>/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>
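
As a quick cross-check of the vote settings in the config above, here is a minimal sketch that sums the per-node votes and reads the configured expected_votes so the two can be compared side by side. The sample is a trimmed stand-in for the real file (fence sections and the redacted fencedevice entries are left out so it parses as XML), and the helper name is hypothetical. Whether the difference between the two numbers has anything to do with the hang is not established here.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the cluster.conf above: the fence sections and the
# redacted fencedevice entries are omitted so the sample parses as XML.
SAMPLE_CONF = """<?xml version="1.0"?>
<cluster alias="tungsten" config_version="31" name="qualia">
        <clusternodes>
                <clusternode name="odin" votes="1"/>
                <clusternode name="hugin" votes="1"/>
                <clusternode name="munin" votes="1"/>
                <clusternode name="zeus" votes="1"/>
        </clusternodes>
        <cman expected_votes="1" two_node="0"/>
</cluster>
"""


def vote_summary(conf_text):
    """Return (sum of clusternode votes, configured expected_votes or None)."""
    root = ET.fromstring(conf_text)
    # Assumption: a node without a votes attribute counts as one vote.
    total = sum(int(n.get("votes", "1")) for n in root.iter("clusternode"))
    cman = root.find("cman")
    expected = None
    if cman is not None and cman.get("expected_votes") is not None:
        expected = int(cman.get("expected_votes"))
    return total, expected


print(vote_summary(SAMPLE_CONF))
```

For the values posted here this reports four configured votes against expected_votes="1".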