[Linux-cluster] Why does my cluster stop working when one node goes down?

Tiago Cruz tiagocruz at forumgdh.net
Wed Apr 2 15:08:53 UTC 2008


Hello guys,

I have a cluster of two machines running RHEL 5.1 x86_64.
The storage device was imported using GNBD and formatted with GFS2, to be
mounted on both nodes:

[root at teste-spo-la-v1 ~]# gnbd_import -v -l
Device name : cluster
----------------------
    Minor # : 0
 sysfs name : /block/gnbd0
     Server : gnbdserv
       Port : 14567
      State : Open Connected Clear
   Readonly : No
    Sectors : 20971520

# gfs2_mkfs -p lock_dlm -t mycluster:export1 -j 2 /dev/gnbd/cluster
# mount /dev/gnbd/cluster /mnt/
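
For reference, this is how I double-check that the filesystem's lock
table matches the cluster name (the gfs2_tool syntax below is just what
I believe is right on RHEL 5, so please correct me if it is not):

# gfs2_tool sb /dev/gnbd/cluster table
(this should report the lock table as "mycluster:export1", matching the
cluster name "mycluster" in cluster.conf)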

Everything works gracefully until one node goes away (shutdown, network
stop, xm destroy...)


teste-spo-la-v1 clurgmgrd[3557]: <emerg> #1: Quorum Dissolved
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] entering GATHER state from 0. 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] Creating commit token because I am the rep. 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] Saving state aru 46 high seq received 46 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] Storing new sequence id for ring 4c 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] entering COMMIT state. 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] entering RECOVERY state. 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] position [0] member 10.25.0.251: 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] previous ring seq 72 rep 10.25.0.251 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] aru 46 high delivered 46 received flag 1 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] Did not need to originate any messages in recovery. 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] Sending initial ORF token 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] CLM CONFIGURATION CHANGE 
Apr  2 12:00:07 teste-spo-la-v1 kernel: dlm: closing connection to node 3
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] New Configuration: 
Apr  2 12:00:07 teste-spo-la-v1 clurgmgrd[3557]: <emerg> #1: Quorum Dissolved 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] 	r(0) ip(10.25.0.251)  
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] Members Left: 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] 	r(0) ip(10.25.0.252)  
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] Members Joined: 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CMAN ] quorum lost, blocking activity 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] CLM CONFIGURATION CHANGE 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] New Configuration: 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] 	r(0) ip(10.25.0.251)  
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] Members Left: 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] Members Joined: 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [SYNC ] This node is within the primary component and will provide service. 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [TOTEM] entering OPERATIONAL state. 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CLM  ] got nodejoin message 10.25.0.251 
Apr  2 12:00:07 teste-spo-la-v1 openais[1545]: [CPG  ] got joinlist message from node 2 
Apr  2 12:00:12 teste-spo-la-v1 ccsd[1539]: Cluster is not quorate.  Refusing connection. 
Apr  2 12:00:12 teste-spo-la-v1 ccsd[1539]: Error while processing connect: Connection refused 
Apr  2 12:00:16 teste-spo-la-v1 ccsd[1539]: Cluster is not quorate.  Refusing connection. 
Apr  2 12:00:17 teste-spo-la-v1 ccsd[1539]: Error while processing connect: Connection refused 
Apr  2 12:00:22 teste-spo-la-v1 ccsd[1539]: Cluster is not quorate.  Refusing connection. 


At that point my GFS mount point breaks: the terminal freezes when I try
to access the "/mnt" directory, and it only comes back when the second
node rejoins the cluster.
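
When it hangs, this is roughly what I run to confirm the quorum state
(standard RHEL 5 cluster suite commands; I am paraphrasing their purpose
from memory, so the exact output may differ):

# cman_tool status      <- shows the vote counts and whether the cluster is quorate
# cman_tool nodes       <- lists the member nodes and their current state
# group_tool ls         <- shows the fence/dlm/gfs group membership and state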


Here is the cluster.conf:

<?xml version="1.0"?>
<cluster name="mycluster" config_version="2">
	
<cman expected_votes="1">
</cman>

<fence_daemon post_join_delay="60">
</fence_daemon>

<clusternodes>
<clusternode name="node1.mycluster.com" nodeid="2">
	<fence>
		<method name="single">
			<device name="gnbd" ipaddr="10.25.0.251"/>
		</method>
	</fence>
</clusternode>
<clusternode name="node2.mycluster.com" nodeid="3">
	<fence>
		<method name="single">
			<device name="gnbd" ipaddr="10.25.0.252"/>
		</method>
	</fence>
</clusternode>
</clusternodes>

<fencedevices>
	<fencedevice name="gnbd" agent="fence_gnbd"/>
</fencedevices>
</cluster>
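
One thing I am not sure about: the two-node examples I have seen
elsewhere enable the special two-node mode on the <cman> element, so
that losing a single node does not dissolve quorum. Something like this
(just a guess on my part, I have not tested it here):

<cman two_node="1" expected_votes="1">
</cman>

Could the missing two_node attribute explain why quorum is lost as soon
as one node leaves?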


Thanks!

-- 
Tiago Cruz
http://everlinux.com
Linux User #282636




