[Linux-cluster] fencing problem
Shawn Hood
shawnlhood at gmail.com
Thu Oct 16 15:29:23 UTC 2008
All,
I'll provide more config details a little later, but I thought some
cursory information might yield a response. We have a simple four-node
cluster running RHEL4U7 with the latest RHEL cluster packages and three
GFS filesystems. This morning one of our nodes remained responsive but
was having problems that required a reboot. Unfortunately, most
commands run from the command line failed with Input/Output errors (it
seems the root filesystem may have been remounted read-only). I decided
to fence the node from another node in the cluster using
fence_node <nodename>, which calls fence_drac. The operation reported
success, and the node was fenced and rebooted.
After this fencing operation, all nodes reported their Membership
state (as shown by cman_tool status) as Transition-Master. Per
http://sources.redhat.com/cluster/faq.html#gfs_fencefreeze, I
understand that GFS will freeze briefly after fencing is performed.
However, the filesystems never returned to a responsive state. After
many transition restarts, all nodes left the cluster (as expected).
Some logs and the cluster.conf are below.
Shawn
Oct 16 10:09:12 hugin fence_node[3512]: Fence of "munin" was successful
Oct 16 10:09:32 hugin kernel: CMAN: removing node munin from the cluster : Missed too many heartbeats
Oct 16 10:09:32 hugin kernel: CMAN: Initiating transition, generation 69
Oct 16 10:09:47 hugin kernel: CMAN: Initiating transition, generation 70
Oct 16 10:10:02 hugin kernel: CMAN: Initiating transition, generation 71
Oct 16 10:10:17 hugin kernel: CMAN: Initiating transition, generation 72
Oct 16 10:10:32 hugin kernel: CMAN: Initiating transition, generation 73
Oct 16 10:10:47 hugin kernel: CMAN: Initiating transition, generation 74
Oct 16 10:11:02 hugin kernel: CMAN: Initiating transition, generation 75
Oct 16 10:11:17 hugin kernel: CMAN: Initiating transition, generation 76
Oct 16 10:11:32 hugin kernel: CMAN: Initiating transition, generation 77
Oct 16 10:11:47 hugin kernel: CMAN: Initiating transition, generation 78
Oct 16 10:12:02 hugin kernel: CMAN: Initiating transition, generation 79
Oct 16 10:12:14 hugin kernel: CMAN: removing node odin from the cluster : Inconsistent cluster view
Oct 16 10:12:14 hugin kernel: CMAN: Initiating transition, generation 80
Oct 16 10:12:14 hugin kernel: CMAN: removing node odin from the cluster : Inconsistent cluster view
Oct 16 10:12:14 hugin kernel: CMAN: Initiating transition, generation 81
Oct 16 10:12:16 hugin kernel: CMAN: removing node zeus from the cluster : Inconsistent cluster view
Oct 16 10:12:16 hugin kernel: CMAN: quorum lost, blocking activity
Oct 16 10:12:16 hugin clurgmgrd[8799]: <emerg> #1: Quorum Dissolved
Oct 16 10:12:16 hugin kernel: CMAN: removing node zeus from the cluster : Inconsistent cluster view
Oct 16 10:12:19 hugin ccsd[6330]: Cluster is not quorate. Refusing connection.
Oct 16 10:12:19 hugin ccsd[6330]: Error while processing connect: Connection refused
Oct 16 10:12:29 hugin ccsd[6330]: Cluster is not quorate. Refusing connection.
Oct 16 10:12:29 hugin ccsd[6330]: Error while processing connect: Connection refused
Oct 16 10:12:39 hugin ccsd[6330]: Cluster is not quorate. Refusing connection.
Oct 16 10:13:47 hugin kernel: CMAN: node munin rejoining
Oct 16 10:13:47 hugin kernel: CMAN: Completed transition, generation 81
Oct 16 10:13:49 hugin ccsd[6330]: Cluster is not quorate. Refusing connection.
Oct 16 10:13:49 hugin ccsd[6330]: Error while processing connect: Connection refused
--- previous error message repeated several times ---
Another node in the same cluster, after fencing munin from hugin:
Oct 16 10:09:31 zeus kernel: CMAN: removing node munin from the cluster : Missed too many heartbeats
Oct 16 10:09:31 zeus kernel: CMAN: Initiating transition, generation 69
Oct 16 10:09:46 zeus kernel: CMAN: Initiating transition, generation 70
Oct 16 10:10:01 zeus kernel: CMAN: Initiating transition, generation 71
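The transition restarts in the logs above can be summarized with a short script. What follows is only a minimal sketch: the regular expression assumes the syslog layout seen in the excerpts in this message, the year is supplied by hand because syslog timestamps omit it, and the function names (parse_transitions, restart_gaps) are made up for illustration. It pulls out the "Initiating transition" and "Completed transition" events and reports the seconds between successive restarts.

```python
import re
from datetime import datetime

# A few lines copied from the hugin log in this message.
SAMPLE = """\
Oct 16 10:09:32 hugin kernel: CMAN: Initiating transition, generation 69
Oct 16 10:09:47 hugin kernel: CMAN: Initiating transition, generation 70
Oct 16 10:10:02 hugin kernel: CMAN: Initiating transition, generation 71
Oct 16 10:12:14 hugin kernel: CMAN: Initiating transition, generation 80
Oct 16 10:13:47 hugin kernel: CMAN: Completed transition, generation 81
"""

# Assumed layout: "<Mon DD HH:MM:SS> <host> kernel: CMAN: <verb> transition, generation <N>"
PATTERN = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) kernel: CMAN: "
    r"(Initiating|Completed) transition, generation (\d+)$"
)


def parse_transitions(text, year=2008):
    """Return (timestamp, host, verb, generation) for each transition line."""
    events = []
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match is None:
            continue  # not a transition message, skip it
        stamp = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
        )
        events.append(
            (stamp, match.group(2), match.group(3), int(match.group(4)))
        )
    return events


def restart_gaps(events):
    """Return (generation, seconds since the previous restart) pairs."""
    starts = [e for e in events if e[2] == "Initiating"]
    return [
        (cur[3], (cur[0] - prev[0]).total_seconds())
        for prev, cur in zip(starts, starts[1:])
    ]


if __name__ == "__main__":
    for generation, gap in restart_gaps(parse_transitions(SAMPLE)):
        print(f"generation {generation}: {gap:.0f}s after previous restart")
```

Run against the full log, this makes the regular 15-second spacing of the restarts between generations 69 and 79 easy to see.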
cluster.conf:
<?xml version="1.0"?>
<cluster alias="tungsten" config_version="31" name="qualia">
    <fence_daemon post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="odin" votes="1">
            <fence>
                <method name="1">
                    <device modulename="" name="odin-drac"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="hugin" votes="1">
            <fence>
                <method name="1">
                    <device modulename="" name="hugin-drac"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="munin" votes="1">
            <fence>
                <method name="1">
                    <device modulename="" name="munin-drac"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="zeus" votes="1">
            <fence>
                <method name="1">
                    <device modulename="" name="zeus-drac"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="0"/>
    <fencedevices>
        <resources/>
        <fencedevice name="odin-drac" agent="fence_drac" <redacted>/>
        <fencedevice name="hugin-drac" agent="fence_drac" <redacted>/>
        <fencedevice name="munin-drac" agent="fence_drac" <redacted>/>
        <fencedevice name="zeus-drac" agent="fence_drac" <redacted>/>
    </fencedevices>
    <rm>
        <failoverdomains/>
        <resources/>
    </rm>
</cluster>
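Since the logs show quorum being lost, it may help to compare the vote settings in this cluster.conf. The sketch below is illustrative only: it parses a trimmed copy of the file (the fence sections are left out because the redacted attributes do not parse), sums the per-node votes, and applies the simple-majority rule quorum = expected_votes / 2 + 1 with integer division. That rule is an assumption here, and the sketch does not claim to reproduce how the RHEL4 cman treats an expected_votes value lower than the vote total. The names vote_totals and majority_quorum are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf in this message (fencing omitted).
CONF = """<cluster alias="tungsten" config_version="31" name="qualia">
    <clusternodes>
        <clusternode name="odin" votes="1"/>
        <clusternode name="hugin" votes="1"/>
        <clusternode name="munin" votes="1"/>
        <clusternode name="zeus" votes="1"/>
    </clusternodes>
    <cman expected_votes="1" two_node="0"/>
</cluster>
"""


def vote_totals(conf_text):
    """Return (sum of node votes, configured expected_votes)."""
    root = ET.fromstring(conf_text)
    total = sum(int(node.get("votes", "1")) for node in root.iter("clusternode"))
    cman = root.find("cman")
    # Assumption: fall back to the vote total if expected_votes is absent.
    expected = total
    if cman is not None and cman.get("expected_votes") is not None:
        expected = int(cman.get("expected_votes"))
    return total, expected


def majority_quorum(expected_votes):
    """Simple-majority quorum (assumed rule, see the text above)."""
    return expected_votes // 2 + 1


if __name__ == "__main__":
    total, expected = vote_totals(CONF)
    print(f"node votes total:       {total}")
    print(f"expected_votes in conf: {expected}")
    print(f"majority of conf value: {majority_quorum(expected)}")
    print(f"majority of vote total: {majority_quorum(total)}")
```

The point of the comparison is only to surface any mismatch between the configured expected_votes and the sum of the node votes; whether such a mismatch is related to the behaviour in the logs is not established here.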