[Linux-cluster] GFS hangs if one clusternode is powered off

Arnd m_list at eshine.de
Tue Mar 21 17:46:11 UTC 2006


Hello list,

I'm setting up a Linux cluster with GFS on shared disks (SAN). All the
cluster nodes have the same LUN presented, and I went through all the
necessary steps to create a GFS on that disk. Everything works fine,
but when I power off one of the nodes the whole GFS hangs: no reads or
writes to the filesystem are possible, and every process accessing the
device blocks waiting for its I/O to complete.

The cluster nodes are: adnux1, adnux2, adnux3, adnux4, adlade1
During the tests only these hosts are running: adnux2, adnux3, adnux4

While GFS is running fine, I can check the cluster with cman_tool:

adnux2 / # cman_tool status     
Protocol version: 5.0.1
Config version: 1
Cluster name: adnuxCluster1
Cluster ID: 41625
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 3
Expected_votes: 1
Total_votes: 3
Quorum: 2  
Active subsystems: 6
Node name: adnux2
Node addresses: 192.168.1.152 

and

adnux2 / # cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           7   5 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[3 1 2]

DLM Lock Space:  "adnux"                             8   6 run       -
[1 3 2]

GFS Mount Group: "adnux"                             9   7 run       -
[1 3 2]


When adnux4 is powered off, the other two hosts cannot access the GFS
at all. /var/log/messages on one of the remaining nodes says:
...
Mar 21 18:35:37 adnux2 kernel: CMAN: node adnux4 has been removed from the cluster : Missed too many heartbeats
Mar 21 18:35:38 adnux2 fenced[5627]: adnux4 not a cluster member after 0 sec post_fail_delay
Mar 21 18:35:38 adnux2 fenced[5627]: fencing node "adnux4"
Mar 21 18:35:38 adnux2 fenced[5627]: fence "adnux4" failed

cman_tool reports that the cluster is still up:

adnux2 ~ # cman_tool status
Protocol version: 5.0.1
Config version: 1
Cluster name: adnuxCluster1
Cluster ID: 41625
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 1
Total_votes: 2
Quorum: 2  
Active subsystems: 6
Node name: adnux2
Node addresses: 192.168.1.152 

and

adnux2 ~ # cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           7   5 recover 2 -
[1 2]

DLM Lock Space:  "clvmd"                             2   3 recover 0 -
[1 2]

DLM Lock Space:  "adnux"                             8   6 recover 0 -
[1 2]

GFS Mount Group: "adnux"                             9   7 recover 0 -
[1 2]
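
The services still in recovery can be picked out of output in this
layout mechanically. A minimal sketch, assuming the service name is
always the double-quoted field and the state follows the GID and LID
columns as shown above; stuck_services is a hypothetical helper name,
not part of cman or fenced.

```shell
# Print the names of cluster services whose state is "recover".
# Reads text in the /proc/cluster/services layout from stdin.
# stuck_services is a hypothetical helper, not a cman/fenced tool.
stuck_services() {
    # Split on double quotes: field 2 is the service name,
    # field 3 holds GID, LID, state and code.
    # Header and member-list lines contain no quotes and are skipped.
    awk -F'"' 'NF >= 3 && $3 ~ / recover( |$)/ { print $2 }'
}

# Usage on a cluster node:
#   stuck_services < /proc/cluster/services
```

For the output above this would list all four services, the fence
domain included.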


Even running "fence_manual -n adnux4" and "fence_ack_manual -n adnux4"
successfully has no effect. Why is the GFS blocking? Surely it cannot
be that the failed node has to be fenced before GFS can function
again?!
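
To rule out lost quorum as the reason for the block, the two vote
counters in the cman_tool status output can be compared. A minimal
sketch, assuming the cluster counts as quorate when Total_votes is at
least Quorum; is_quorate is a made-up helper name, not a cman command.

```shell
# Succeed (exit 0) if the "cman_tool status" text on stdin shows
# Total_votes >= Quorum, fail (exit 1) otherwise.
# is_quorate is a hypothetical helper; the quorate rule is an assumption.
is_quorate() {
    awk -F': *' '
        $1 == "Total_votes" { total  = $2 + 0 }   # "+ 0" drops trailing blanks
        $1 == "Quorum"      { quorum = $2 + 0 }
        END { exit !(quorum > 0 && total >= quorum) }
    '
}

# Usage on a cluster node:
#   cman_tool status | is_quorate && echo "quorate" || echo "not quorate"
```

With the values shown after the power-off (Total_votes 2, Quorum 2) the
check succeeds, so under that assumption the remaining two nodes still
have quorum.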

I searched for ideas but only found a few people pointing to possibly
misconfigured fencing. I hope someone here can tell me where my mistake
is.

Thank you in advance.


-- configuration file /etc/cluster/cluster.conf --

<?xml version="1.0"?>
<cluster name="adnuxCluster1" config_version="1">

<cman expected_votes="1" quorum="1">
</cman>

<clusternodes>
  <clusternode name="adnux1" votes="1">
    <fence>
      <method name="single">
        <device name="human" nodename="adnux1"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="adnux2" votes="1">
    <fence>
      <method name="single">
        <device name="human" nodename="adnux2"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="adnux3" votes="1">
    <fence>
      <method name="single">
        <device name="human" nodename="adnux3"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="adnux4" votes="1">
    <fence>
      <method name="single">
        <device name="human" nodename="adnux4"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="adlade1" votes="1">
    <fence>
      <method name="single">
        <device name="human" nodename="adlade1"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<fence_devices>
  <device name="human" agent="fence_manual"/>
</fence_devices>

</cluster>



