[Linux-cluster] GFS filesystem "hang" with cluster-1.03.00

Ramon van Alteren ramon at vanalteren.nl
Fri Oct 20 13:34:33 UTC 2006


Hi Josef,

Josef Whiter wrote:
> In your previous message you asked about the latency.  With gfs1, there is a
> certain amount of latency involved with stat calls, so ls -al, du, and df all
> take a great deal of time comparatively.  With these calls you first have to
> traverse the FS in order to cache all the information about the files, so every
> lookup requires a lock on each directory on the path to the file, and then a
> lock on the file itself in order to read its information off of the disk.  That
> is just the lookup; then we have to grab a shared lock again to get the stat
> information from the file.  Each lock, mind you, requires exporting the lock to
> all of the other nodes so they know about it and getting confirmation back on
> that lock.  So for every stat lookup you are looking at, at the very least, two
> separate locks: one for the lookup and then one for the stat.  Every subsequent
> call is faster because the lookups no longer require locks to look up the file,
> as the inode information is now cached, so we just need the lock for the file.

Yes, this matches our previous exchange. In that exchange Wendy Cheng 
advised me to switch from bonnie++, which we had been testing with, to 
iozone, because iozone makes fewer stat calls on files and so avoids 
that kind of repeated locking.
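
A quick way to see the effect Josef describes (paths are illustrative, 
assuming a GFS mount at the hypothetical /mnt/gfs):

    # first run: every inode pays the lookup + stat lock traffic
    time ls -alR /mnt/gfs/testdir > /dev/null
    # second run: the inode information is cached, so it is much faster
    time ls -alR /mnt/gfs/testdir > /dev/null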

My test run last night started at 4 AM with four iozone processes, each 
using a temp file in a different directory on the same filesystem / 
logical volume.

AFAIK this should avoid the problem you mention above?
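
For reference, the run was roughly along these lines (flags and paths 
are illustrative, not the exact command):

    # four concurrent writers, each in its own directory on the GFS mount
    for i in 1 2 3 4; do
        iozone -i 0 -i 1 -s 1g -r 64k \
               -f /mnt/gfs/test$i/iozone.tmp &
    done
    wait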

All iozone processes were in D state (uninterruptible sleep) by the time 
I woke up and had a look at 8 AM this morning.

I would expect gfs to handle this gracefully and return performance 
numbers for the concurrent writes, because:

* The processes are not in the same directory, so there is no directory 
  lock to pass around
* They write to different files

> If gfs_tool counters is stuck, you'll want to get a couple of instances of
> sysrq-t from all nodes and see if you can see who is hanging, whether it's in D
> state or if the particular process isn't making progress.

I have since interrupted the processes and found one node that was hanging.
I'm still completely clueless as to what is causing this.
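
For the record, this is roughly how I collected the traces on each node 
(standard sysrq interface, assuming sysrq is enabled; nothing 
GFS-specific):

    # allow sysrq via /proc if it isn't enabled already
    echo 1 > /proc/sys/kernel/sysrq

    # dump all task states (sysrq-t) to the kernel log, twice a few
    # seconds apart, so stuck tasks can be told apart from slow ones
    echo t > /proc/sysrq-trigger
    sleep 10
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/sysrq-t.$(hostname).txt

    # quick check for processes stuck in uninterruptible sleep
    ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'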

Any pointers, ideas on where to look, test cases to run, or any other 
info that might be helpful in finding the cause of the problem would be 
much appreciated.

I'm also opening a similar case with the Coraid support department to 
see if they have anything to say about it.

Thanx,

Ramon



