[Linux-cluster] GFS 2 node hang in rm test

Sat Dec 4 00:36:31 UTC 2004

On Fri, 2004-12-03 at 15:08, Daniel McNeil wrote:
> I ran my test script
> (http://developer.osdl.org/daniel/gfs_tests/test.sh) overnight.
> 
> It ran 17 test runs before hanging in a rm during a 2 node test.
> The /gfs_stripe5 is mounted on cl030 and cl031.
> 
> process 28723 (rm) on cl030 is hung.
> process 29693 (updatedb) is also hung on cl030.
> 
> process 29537 (updatedb) is hung on cl031.
> 
> I have stack traces and lockdump and lock debug output
> from both nodes here:
> 
> http://developer.osdl.org/daniel/GFS/gfs_2node_rm_hang/
> 
> 
> gfs_tool/decipher_lockstate_dump cl030.lockdump shows:
> 
> Glock (inode[2], 39860)
>   gl_flags =
>   gl_count = 6
>   gl_state = shared[3]
>   lvb_count = 0
>   object = yes
>   aspace = 2
>   reclaim = no
>   Holder
>     owner = 28723
>     gh_state = shared[3]
>     gh_flags = atime[9]
>     error = 0
>     gh_iflags = promote[1] holder[6] first[7]
>   Waiter2
>     owner = none[-1]
>     gh_state = unlocked[0]
>     gh_flags = try[0]
>     error = 0
>     gh_iflags = demote[2] alloced[4] dealloc[5]
>   Waiter3
>     owner = 29693
>     gh_state = shared[3]
>     gh_flags = any[3]
>     error = 0
>     gh_iflags = promote[1]
>   Inode: busy
> 
> gfs_tool/decipher_lockstate_dump cl031.lockdump shows:
> 
> Glock (inode[2], 39860)
>   gl_flags = lock[1]
>   gl_count = 5
>   gl_state = shared[3]
>   lvb_count = 0
>   object = yes
>   aspace = 1
>   reclaim = no
>   Request
>     owner = 29537
>     gh_state = exclusive[1]
>     gh_flags = local_excl[5] atime[9]
>     error = 0
>     gh_iflags = promote[1]
>   Waiter3
>     owner = 29537
>     gh_state = exclusive[1]
>     gh_flags = local_excl[5] atime[9]
>     error = 0
>     gh_iflags = promote[1]
>   Inode: busy
> 
> Is there any documentation on what these fields are?
> 
> What is the difference between Waiter2 and Waiter3?
> 
> If I understand this correctly, the updatedb (29537) on
> cl031 is trying to go from shared -> exclusive while the 
> rm (28723) on cl030 is holding the glock shared and the
> updatedb (29693) on cl030 is waiting to get the glock shared.
> 

Looking at the stack traces, what I said above does not
make sense.  So now I am really confused.

updatedb should only need the glock shared since it is
only doing a readdir.

But the stack trace of the rm on cl030 shows that it is in
readdir as well.

So what does the Request entry with gh_state = exclusive mean?

It still looks like updatedb (29537) on cl031 is trying to go
exclusive, but I cannot tell why.
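
To line up the two dumps, a rough Python sketch like the one below could
list, per glock, who holds it and who is waiting. It assumes the
deciphered dump has exactly the layout pasted above (a "Glock (...)"
line, "key = value" fields, then Holder/Request/WaiterN entries). The
names parse_lockdump and summarize are made up for this illustration and
are not part of gfs_tool, and SAMPLE is a trimmed copy of the cl030
output.

```python
import re

# Trimmed copy of the cl030 output quoted above, used only as sample input.
SAMPLE = """\
Glock (inode[2], 39860)
  gl_flags =
  gl_count = 6
  gl_state = shared[3]
  Holder
    owner = 28723
    gh_state = shared[3]
  Waiter2
    owner = none[-1]
    gh_state = unlocked[0]
  Waiter3
    owner = 29693
    gh_state = shared[3]
  Inode: busy
"""

GLOCK_RE = re.compile(r"Glock \((\w+)\[\d+\], (\d+)\)")


def parse_lockdump(text):
    """Parse deciphered glock dump text into a list of dicts.

    Assumption: glock-level fields come before the first entry
    (Holder, Request, WaiterN), as in the output shown in this thread.
    """
    glocks = []
    glock = None
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = GLOCK_RE.match(line)
        if m:
            glock = {"type": m.group(1), "number": int(m.group(2)),
                     "fields": {}, "entries": []}
            glocks.append(glock)
            section = None
            continue
        if glock is None:
            continue  # ignore anything before the first Glock line
        if line.startswith("Inode:"):
            glock["fields"]["inode"] = line.split(":", 1)[1].strip()
            section = None
        elif "=" in line:
            key, _, value = line.partition("=")
            target = section["fields"] if section else glock["fields"]
            target[key.strip()] = value.strip()
        else:
            # A bare word such as Holder, Request or Waiter3 starts an entry.
            section = {"kind": line, "fields": {}}
            glock["entries"].append(section)
    return glocks


def summarize(node, glocks):
    """Return one line per holder/waiter entry for the given node name."""
    out = []
    for g in glocks:
        for e in g["entries"]:
            f = e["fields"]
            out.append(f"{node} {g['type']}/{g['number']} {e['kind']}: "
                       f"owner={f.get('owner')} gh_state={f.get('gh_state')}")
    return out


if __name__ == "__main__":
    for row in summarize("cl030", parse_lockdump(SAMPLE)):
        print(row)
```

Running the same thing over the cl031 dump and sorting the combined
output by glock number would put the Holder on one node next to the
Request on the other, which is the comparison I was doing by hand.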

Thanks for any help,

Daniel