[Linux-cluster] GFS 2 node hang in rm test
Daniel McNeil
daniel at osdl.org
Sat Dec 4 00:36:31 UTC 2004
On Fri, 2004-12-03 at 15:08, Daniel McNeil wrote:
> I ran my test script
> (http://developer.osdl.org/daniel/gfs_tests/test.sh) overnight.
>
> It ran 17 test runs before hanging in a rm during a 2 node test.
> The /gfs_stripe5 is mounted on cl030 and cl031.
>
> process 28723 (rm) on cl030 is hung.
> process 29693 (updatedb) is also hung on cl030.
>
> process 29537 (updatedb) is hung on cl031.
>
> I have stack traces and lockdump and lock debug output
> from both nodes here:
>
> http://developer.osdl.org/daniel/GFS/gfs_2node_rm_hang/
>
>
> gfs_tool/decipher_lockstate_dump cl030.lockdump shows:
>
> Glock (inode[2], 39860)
> gl_flags =
> gl_count = 6
> gl_state = shared[3]
> lvb_count = 0
> object = yes
> aspace = 2
> reclaim = no
> Holder
> owner = 28723
> gh_state = shared[3]
> gh_flags = atime[9]
> error = 0
> gh_iflags = promote[1] holder[6] first[7]
> Waiter2
> owner = none[-1]
> gh_state = unlocked[0]
> gh_flags = try[0]
> error = 0
> gh_iflags = demote[2] alloced[4] dealloc[5]
> Waiter3
> owner = 29693
> gh_state = shared[3]
> gh_flags = any[3]
> error = 0
> gh_iflags = promote[1]
> Inode: busy
>
> gfs_tool/decipher_lockstate_dump cl031.lockdump shows:
>
> Glock (inode[2], 39860)
> gl_flags = lock[1]
> gl_count = 5
> gl_state = shared[3]
> lvb_count = 0
> object = yes
> aspace = 1
> reclaim = no
> Request
> owner = 29537
> gh_state = exclusive[1]
> gh_flags = local_excl[5] atime[9]
> error = 0
> gh_iflags = promote[1]
> Waiter3
> owner = 29537
> gh_state = exclusive[1]
> gh_flags = local_excl[5] atime[9]
> error = 0
> gh_iflags = promote[1]
> Inode: busy
>
> Is there any documentation on what these fields are?
>
> What is the difference between Waiter2 and Waiter3?
>
> If I understand this correctly, the updatedb (29537) on
> cl031 is trying to go from shared -> exclusive while the
> rm (28723) on cl030 is holding the glock shared and the
> updatedb (29693) on cl030 is waiting to get the glock shared.
>
Looking at the stack traces, what I said above does not
make sense. So now I am really confused.
updatedb should only need the glock shared since it is
only doing a readdir.
But the stack trace for the rm on cl030 shows that it is in
readdir as well.
So what does the Request with gh_state = exclusive on cl031 mean?
It still looks like updatedb is trying to go exclusive, but
I cannot tell why.
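
To make my reading of the two dumps concrete, here is a minimal sketch
of the shared/exclusive compatibility check I have in mind. It is a
simplified two-mode model, not GFS's actual glock code; the names
COMPATIBLE, can_grant and blocked_requests are made up for illustration,
and the holder and request values are copied from the dumps above.

```python
# Simplified two-mode lock model (an assumption for illustration only;
# this is not how GFS implements glocks).
COMPATIBLE = {
    ("shared", "shared"): True,
    ("shared", "exclusive"): False,
    ("exclusive", "shared"): False,
    ("exclusive", "exclusive"): False,
}


def can_grant(requested, held_modes):
    """Return True if `requested` is compatible with every mode in `held_modes`."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)


def blocked_requests(holders, requests):
    """Return the requests that conflict with the modes currently held."""
    held = [mode for modes in holders.values() for mode in modes]
    return [req for req in requests if not can_grant(req[2], held)]


# State of glock (inode[2], 39860) as read from the two lockdumps.
holders = {"cl030": ["shared"]}        # rm (28723) is the Holder, shared
requests = [
    ("cl031", 29537, "exclusive"),     # updatedb, listed under Request
    ("cl030", 29693, "shared"),        # updatedb, listed under Waiter3
]

for node, pid, mode in blocked_requests(holders, requests):
    print(f"{node}: pid {pid} wants {mode}, conflicts with current holders")
```

In this model only the exclusive request from 29537 conflicts with the
shared holder. It says nothing about why 29693, which only wants the
glock shared, is also waiting.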
Thanks for any help,
Daniel