[Linux-cluster] GFS2 becomes non-responsive, no fencing

Tue Aug 26 17:50:04 UTC 2008

On Mon, Aug 25, 2008 at 07:29:41PM -0400, Ross Vandegrift wrote:
> Today, the app on one node died.  I logged in, assumed things were
> fenced, and tried to go about my business of restarting it.  After
> some fiddling, I got the box back in the cluster fine.
> 
> It just happened again, and I've dug in a bit more.  I was wrong - the
> failed node has not been fenced.  The last thing in dmesg on the
> failing node is:

Some more information gleaned today.  I left the node running last
night without fixing the GFS2 access.  Today, we noticed that
filesystem access has been restored for new processes - it's slow
(sometimes taking minutes to return an ls for 10 items), but
it eventually responds.  The application threads that are sleeping in D
still haven't received their data from reads issued yesterday
afternoon.
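
For anyone wanting to check the same thing on their own nodes, here is a minimal sketch of how the tasks stuck in D state could be listed. It assumes the usual Linux /proc/<pid>/stat layout ("pid (comm) state ..."); the function names are made up for illustration. It only looks at the main thread of each process; per-thread state lives under /proc/<pid>/task/<tid>/stat and would need one more loop.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc_root="/proc"):
    """Yield (pid, comm) for every process whose state is D."""
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the stat line was unreadable
        if state == "D":
            yield pid, comm


if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print(pid, comm)
```

Running it a few minutes apart shows whether the same PIDs stay blocked or whether the set changes over time.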

A cursory examination of DLM-related keys in /sys reveals that the
working and broken nodes are configured the same.  No major disparity
in terms of memory use, except the obvious fact that the broken node
shows very little disk IO.
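
A comparison like this can be made less cursory with a small script. The following is only a sketch: it walks a directory tree, records the contents of every readable file, and reports the keys whose values differ between two snapshots. The default path /sys/kernel/config/dlm is an assumption about where the DLM settings are exposed and may differ on a given kernel; snapshot and diff_snapshots are hypothetical helper names.

```python
import json
import os
import sys


def snapshot(root):
    """Map each readable file under root (relative path) to its stripped contents."""
    values = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path) as f:
                    values[os.path.relpath(path, root)] = f.read().strip()
            except (OSError, UnicodeDecodeError):
                continue  # write-only or binary attributes are skipped
    return values


def diff_snapshots(a, b):
    """Return {key: (a_value, b_value)} for keys that differ or exist on one side only."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Assumed location of the DLM keys; pass another path as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "/sys/kernel/config/dlm"
    json.dump(snapshot(root), sys.stdout, indent=2, sort_keys=True)
```

The idea is to run it on each node, save the JSON output, then load both files on one machine and pass them to diff_snapshots. An empty result means every key that could be read matches.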

I'm pretty much at a loss - any ideas would be very welcome.

-- 
Ross Vandegrift
ross at kallisti.us

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37