[Linux-cluster] gfs2 lockup
swhiteho at redhat.com
Tue Dec 2 10:24:25 UTC 2008
On Mon, 2008-12-01 at 16:46 -0600, Brian Kroth wrote:
> Given the recent discussion of GFS2's stability I thought I'd chime in
> with a problem test case.
> I've noticed a deadlock in the following situation:
> 3 node Debian (Lenny) cluster of esx based vm nodes using either fibre
> channel or open-iscsi based storage. Version 2.03.06 on the
> redhat-cluster-suite software, 0.80.3 openais, and 2.6.26 on the kernel.
I'm not that familiar with the Debian kernel, so I don't know what fixes
might have been added recently. You might find that the problem goes
away if you upgrade to a more recent kernel, however...
> cssh node1 node2 node3
> cd /gfs2/
> mkdir $HOSTNAME
> echo $HOSTNAME > $HOSTNAME/test
> rm -rf *
> The last command generally deadlocks at least one of the machines. Any
> access attempts to the /gfs2 volume simply hang. No logs in dmesg,
> messages, etc. On a few occasions about 24 hours later it'll get
> fenced, but usually it's just stuck indefinitely. I haven't had a
> chance to look into this in much more depth since I had to get something
> running so I just went back to OCFS2. I now have an opportunity to test
> with things again, so if someone would like more information or could
> possibly tell me what's wrong that would be nice.
The first thing to check is that you have debugfs mounted on each node.
You can then look at the glock dumps which are located
under /sys/kernel/debug/gfs2/<fsname>/glocks. There are a number of
lines in this file, each relating to a particular glock.
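A minimal sketch of that first step (needs root; the mount point is the
conventional default, and "myfs" below is just a placeholder name):

```shell
# Mount debugfs if it is not already mounted (do this on every node).
grep -q ' /sys/kernel/debug debugfs' /proc/mounts ||
    mount -t debugfs none /sys/kernel/debug
# The glock dump for each mounted GFS2 filesystem then appears under
# /sys/kernel/debug/gfs2/<fsname>/glocks, e.g.:
ls /sys/kernel/debug/gfs2/        # list filesystems with dumps available
cat /sys/kernel/debug/gfs2/myfs/glocks    # "myfs" is a placeholder
```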
Lines starting G: relate to a glock, and lines below that, indented by a
single space, also relate to that same glock. H: lines relate to the
holders of that glock, and if you look at the flags field, which starts
f:, then you can see if any of the holders are waiting for a lock (look
for the W (wait) flag). The holders are listed in order, granted holders
first (if any) and then waiting holders (if any). So the only
interesting holder in this case will be one with a W flag set that's
nearest to its associated glock.
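As an illustration, here is a tiny invented dump in roughly the format
described above (glock numbers, PIDs and flags are made up, not captured
from a real node), plus an awk one-liner that picks out glocks whose
first holder is waiting:

```shell
# Illustrative sample only; field layout follows the description above.
cat > /tmp/glocks.sample <<'EOF'
G: s:EX n:2/1a f:ly t:EX d:EX/0
 H: s:EX f:H e:0 p:3412 [rm] gfs2_unlink
G: s:SH n:2/2b f:lD t:UN d:UN/3424
 H: s:EX f:W e:0 p:3501 [rm] gfs2_unlink
EOF
# Print each glock together with its first holder, but only when that
# first holder has the W (wait) flag set in its f: field.
awk '/^G:/  { g = $0; first = 1; next }
     /^ H:/ { if (first && $0 ~ /f:[^ ]*W/) print g "\n" $0; first = 0 }' \
    /tmp/glocks.sample
```

With the sample above, only the second glock (n:2/2b) and its waiting
holder are printed; the first glock's holder is granted (f:H) and so is
skipped.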
Looking back at the associated G: line, there are various lock modes
listed. The s: field shows the current state of the glock. The t: field
shows the target state. The target state is only of interest if the l
(locked) flag is set on the glock itself (again f: is the flags field).
In that case it tells you that there is a remote lock request in
progress (i.e. a request has been sent to the DLM) to convert from the
current lock mode (s:) to the target lock mode (t:). Demote requests are
issued from the DLM when it receives a lock request which conflicts with
an existing holder. In that case, the D flag is set on the glock and the
d: field shows the state which has been requested along with the time
(in jiffies) since the demote request was received.
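Putting those fields together, an annotated example G: line might read
as follows (the line is invented and the exact field layout is an
assumption from the description above, not a captured dump):

```shell
# Invented example G: line, annotated field by field:
#
#   G: s:SH n:2/2b f:lD t:UN d:UN/3424
#      s:SH       current state: shared
#      n:2/2b     lock name (type/number)
#      f:lD       l = lock request in progress, D = demote requested
#      t:UN       target state: unlocked
#      d:UN/3424  demote to UN pending for 3424 jiffies
echo 'G: s:SH n:2/2b f:lD t:UN d:UN/3424'
```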
I know all that sounds quite complicated, but in fact it's usually pretty
easy to find the cause of deadlocks. It is usually just a matter of
first tracking down holders (H:) which are first in the queue (i.e.
immediately after a G:) with the W flag set, and then looking at the
lock with the same number (the n: field of the G: line) across the
cluster to see which node is still holding that lock (i.e. s: is not UN)
and then checking the remaining flags to see why that is the case.
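That cross-node search can be sketched roughly as below. The node names,
paths and dump contents are all invented for illustration; in real use
you would copy /sys/kernel/debug/gfs2/<fsname>/glocks from each node
(e.g. via ssh) into one directory first:

```shell
# Fabricated per-node dump copies, standing in for the real files.
mkdir -p /tmp/gdumps
cat > /tmp/gdumps/node1 <<'EOF'
G: s:UN n:2/2b f: t:UN d:UN/0
EOF
cat > /tmp/gdumps/node2 <<'EOF'
G: s:EX n:2/2b f:l t:EX d:UN/3424
 H: s:EX f:H e:0 p:2901 [rm] gfs2_unlink
EOF
lock='n:2/2b'   # the n: value from the G: line above the waiting holder
# Report every node whose dump shows this glock in a state other than UN.
for f in /tmp/gdumps/*; do
    state=$(awk -v l="$lock" '$0 ~ l { for (i = 1; i <= NF; i++)
        if ($i ~ /^s:/) print substr($i, 3) }' "$f")
    [ -n "$state" ] && [ "$state" != "UN" ] &&
        echo "${f##*/} holds $lock in state $state"
done
```

With the fabricated dumps above, node2 is reported as holding the glock
in state EX, which is the node to investigate further.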
There is a tool which does some of this automatically, although I've not
tried it myself as I tend to use the manual method still. If you get
stuck then please file a bug (just file it against Fedora/rawhide and
mark it as Debian in the comments somewhere, so we know which kernel it
is) and attach the glock dumps to it and then we can take a look at it.
I have it on my TODO list to write this up properly at some stage and
turn it into a GFS2 debugging FAQ or something like that. At the moment
the only documentation on glocks is the
linux-2.6/Documentation/filesystems/gfs2-glocks.txt file, although that's
aimed more at developers than users, I'm afraid.