[Linux-cluster] GFS2 becomes non-responsive, no fencing

Ross Vandegrift ross at kallisti.us
Mon Aug 25 23:29:41 UTC 2008

Hi everyone,

I've run into a strange problem on our RH cluster installation.  We
have a cluster that uses iSCSI shared storage for GFS2.  It's been
running for months with no problems.

Today, the app on one node died.  I logged in, assumed things were
fenced, and tried to go about my business of restarting it.  After
some fiddling, I got the box back in the cluster fine.

It just happened again, and I've dug in a bit more.  I was wrong - the
failed node has not been fenced.  The last thing in dmesg on the
failing node is:

GFS2: fsid=: Trying to join cluster "lock_dlm", "sensors:rrd_gfs"
GFS2: fsid=sensors:rrd_gfs.1: Joined cluster. Now mounting FS...
GFS2: fsid=sensors:rrd_gfs.1: jid=1, already locked for use
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Looking at journal...
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Done

Any read or write to the mounted filesystem hangs, as if the DLM can't
get locks.  Connectivity to the storage is good: no interfaces show
dropped packets or errors.  cman_tool reports the node as healthy:

[root@sensor01 ~]# cman_tool status
Version: 6.0.1
Config Version: 14
Cluster Name: sensors
Cluster Id: 14059
Cluster Member: Yes
Cluster Generation: 368
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2  
Active subsystems: 7
Ports Bound: 0 11  
Node name: sensor01.dc3
Node ID: 1
Multicast addresses: 

The missing vote is a third node that is not yet live, but it's been
in that state for weeks now with no problems.
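
As a sanity check on the vote arithmetic, here is a minimal sketch in
Python.  It assumes quorum is a simple majority of the expected votes
(expected // 2 + 1), which matches the numbers above but is my
assumption, not something taken from the cman source.  The helper names
parse_status and majority_quorum are made up for illustration.

```python
# Values copied from the `cman_tool status` output above.
STATUS = """\
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
"""


def parse_status(text):
    """Turn 'Key: value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def majority_quorum(expected_votes):
    """ASSUMPTION: quorum is a simple majority of the expected votes."""
    return expected_votes // 2 + 1


fields = parse_status(STATUS)
expected = int(fields["Expected votes"])
total = int(fields["Total votes"])
quorum = int(fields["Quorum"])

# Compare the reported quorum with the assumed majority rule,
# then check whether the live votes reach it.
print("assumed majority quorum:", majority_quorum(expected))
print("reported quorum:", quorum)
print("quorate:", total >= quorum)
```

If that assumption holds, two live votes out of three expected still
meet quorum, so loss of quorum by itself would not explain the hang.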

[root@sensor01 ~]# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   1   M    360   2008-08-25 16:24:29  sensor01.dc3
       Last fenced:   2008-08-25 16:04:25 by leaf8b-2.dc3
   2   M    364   2008-08-25 16:24:29  sensor02.dc3
   3   X    364                        sensor03.dc3
       Node has not been fenced since it went down

The fencing above is from when I rebooted the node: because processes
were hung on GFS I/O, I had to hard-reset the box, which caused the
other nodes to fence it.

Cluster LVM operations seem to work fine - I can query all LVM objects
without a problem.  But as soon as I try a filesystem operation, boom,
I hang.

Any hints on where I can start looking?

Ross Vandegrift
ross at kallisti.us

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37
