[Linux-cluster] GFS 2 node hang in rm test
Daniel McNeil
daniel at osdl.org
Fri Dec 3 23:08:00 UTC 2004
I ran my test script
(http://developer.osdl.org/daniel/gfs_tests/test.sh) overnight.
It completed 17 test runs before hanging in an rm during a 2-node test.
/gfs_stripe5 is mounted on cl030 and cl031.
Process 28723 (rm) on cl030 is hung.
Process 29693 (updatedb) is also hung on cl030.
Process 29537 (updatedb) is hung on cl031.
I have stack traces, lockdump output, and lock debug output
from both nodes here:
http://developer.osdl.org/daniel/GFS/gfs_2node_rm_hang/
gfs_tool/decipher_lockstate_dump cl030.lockdump shows:
Glock (inode[2], 39860)
  gl_flags =
  gl_count = 6
  gl_state = shared[3]
  lvb_count = 0
  object = yes
  aspace = 2
  reclaim = no
  Holder
    owner = 28723
    gh_state = shared[3]
    gh_flags = atime[9]
    error = 0
    gh_iflags = promote[1] holder[6] first[7]
  Waiter2
    owner = none[-1]
    gh_state = unlocked[0]
    gh_flags = try[0]
    error = 0
    gh_iflags = demote[2] alloced[4] dealloc[5]
  Waiter3
    owner = 29693
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Inode: busy
gfs_tool/decipher_lockstate_dump cl031.lockdump shows:
Glock (inode[2], 39860)
  gl_flags = lock[1]
  gl_count = 5
  gl_state = shared[3]
  lvb_count = 0
  object = yes
  aspace = 1
  reclaim = no
  Request
    owner = 29537
    gh_state = exclusive[1]
    gh_flags = local_excl[5] atime[9]
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 29537
    gh_state = exclusive[1]
    gh_flags = local_excl[5] atime[9]
    error = 0
    gh_iflags = promote[1]
  Inode: busy
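In case it helps anyone read these, here is a rough sketch (my own throwaway script, not part of gfs_tool) of how I am mechanically splitting a deciphered dump into glock and holder records; the record names and field layout are just what appears in the two dumps above:

```python
# Rough parser for decipher_lockstate_dump output like the dumps above.
# Record types and field names are only those seen in this mail; the
# real format may well have more.

def parse_lockstate(text):
    """Split a deciphered lockdump into glocks with their holder records."""
    glocks = []
    current = None   # glock being filled in
    record = None    # Holder/Request/Waiter record being filled in
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Glock"):
            current = {"id": line[len("Glock "):].strip("()"),
                       "fields": {}, "records": []}
            glocks.append(current)
            record = None
        elif line in ("Holder", "Request", "Waiter2", "Waiter3"):
            record = {"type": line, "fields": {}}
            current["records"].append(record)
        elif line.startswith("Inode:"):
            current["inode"] = line.partition(":")[2].strip()
            record = None
        elif "=" in line:
            key, _, value = line.partition("=")
            target = record["fields"] if record else current["fields"]
            target[key.strip()] = value.strip()
    return glocks
```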
Is there any documentation on what these fields are?
What is the difference between Waiter2 and Waiter3?
If I understand this correctly, the updatedb (29537) on
cl031 is trying to go from shared -> exclusive while the
rm (28723) on cl030 is holding the glock shared and the
updatedb (29693) on cl030 is waiting to get the glock shared.
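As a sanity check on that reading, here is a toy compatibility model (my own, not GFS or DLM code) of why the exclusive request on cl031 cannot be granted while cl030 still holds the glock shared, and why cl030 ought to be told to drop it:

```python
# Toy shared/exclusive compatibility table for the situation above
# (illustrative only; the process names match the dumps).

COMPATIBLE = {
    ("shared", "shared"): True,
    ("shared", "exclusive"): False,
    ("exclusive", "shared"): False,
    ("exclusive", "exclusive"): False,
}

def grantable(request_mode, holders):
    """Grant a request only if it is compatible with every current holder."""
    return all(COMPATIBLE[(held_mode, request_mode)]
               for _, held_mode in holders)

holders = [("cl030 rm 28723", "shared")]

# cl031 updatedb 29537 asks for a shared -> exclusive promotion:
print(grantable("exclusive", holders))  # False: cl030 must drop shared first
# cl030 updatedb 29693 asks for shared:
print(grantable("shared", holders))     # True in this model, yet it waits
```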
Questions:
How does one know which node is the master for a lock?
Shouldn't cl030 know (via a bast) that the updatedb on cl031
is trying to go shared -> exclusive?
What does the gfs_tool/parse_lockdump script do?
I have included the output from /proc/cluster/lock_dlm/debug,
but I have no idea what that data means. Any hints?
Anything else I can do to debug this further?
Thanks,
Daniel