[Linux-cluster] gfs deadlock situation

Wendy Cheng wcheng at redhat.com
Tue Feb 13 14:48:03 UTC 2007


Mark Hlawatschek wrote:
> Hi,
>
> we have the following deadlock situation:
>
> 2 node cluster consisting of node1 and node2. 
> /usr/local is placed on a GFS filesystem mounted on both nodes. 
> Lockmanager is dlm.
> We are using RHEL4u4
>
> an strace of ls -l /usr/local/swadmin/mnx/xml hangs in
> lstat("/usr/local/swadmin/mnx/xml",
>
> This happens on both cluster nodes.
>
> All processes trying to access the directory /usr/local/swadmin/mnx/xml are 
> stuck in "Waiting for IO (D)" state, so the system load is at about 400 ;-)
>
> Any ideas ?
>   
Quickly browsing this, it looks to me like the process with pid=5856 got 
stuck. That process holds the exclusive lock on the file or directory with 
inode number 627732 (probably /usr/local/swadmin/mnx/xml), so everyone else 
is waiting for it. The faulty process was apparently in the middle of 
obtaining another exclusive lock (and had almost got it). We need to know 
where pid=5856 was stuck at that time. If this occurs again, could you use 
"crash" to back-trace that process and show us the output?
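
For example, something like this on the node where pid 5856 is stuck (just a 
sketch; the vmlinux path assumes the matching kernel-debuginfo package is 
installed, so adjust it for your kernel version):

    # attach crash to the running kernel
    crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux

    # at the crash prompt, dump the kernel stack of the stuck process
    crash> bt 5856

If "crash" isn't handy, "echo t > /proc/sysrq-trigger" (with sysrq enabled) 
dumps all task stacks to the kernel log and would show much the same thing.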

-- Wendy
> a lockdump analysis with the decipher_lockstate_dump and parse_lockdump 
> scripts shows the following output (the whole file is too large for the 
> mailing list):
>
> Entries:  101939
> Glocks:  60112
> PIDs:  751
>
> 4 chain:
> lockdump.node1.dec Glock (inode[2], 1114343)
>   gl_flags = lock[1]
>   gl_count = 5
>   gl_state = shared[3]
>   req_gh = yes
>   req_bh = yes
>   lvb_count = 0
>   object = yes
>   new_le = no
>   incore_le = no
>   reclaim = no
>   aspace = 1
>   ail_bufs = no
>   Request
>     owner = 5856
>     gh_state = exclusive[1]
>     gh_flags = try[0] local_excl[5] async[6]
>     error = 0
>     gh_iflags = promote[1]
>   Waiter3
>     owner = 5856
>     gh_state = exclusive[1]
>     gh_flags = try[0] local_excl[5] async[6]
>     error = 0
>     gh_iflags = promote[1]
>   Inode: busy
> lockdump.node2.dec Glock (inode[2], 1114343)
>   gl_flags =
>   gl_count = 2
>   gl_state = unlocked[0]
>   req_gh = no
>   req_bh = no
>   lvb_count = 0
>   object = yes
>   new_le = no
>   incore_le = no
>   reclaim = no
>   aspace = 0
>   ail_bufs = no
>   Inode:
>     num = 1114343/1114343
>     type = regular[1]
>     i_count = 1
>     i_flags =
>     vnode = yes
> lockdump.node1.dec Glock (inode[2], 627732)
>   gl_flags = dirty[5]
>   gl_count = 379
>   gl_state = exclusive[1]
>   req_gh = no
>   req_bh = no
>   lvb_count = 0
>   object = yes
>   new_le = no
>   incore_le = no
>   reclaim = no
>   aspace = 58
>   ail_bufs = no
>   Holder
>     owner = 5856
>     gh_state = exclusive[1]
>     gh_flags = try[0] local_excl[5] async[6]
>     error = 0
>     gh_iflags = promote[1] holder[6] first[7]
>   Waiter2
>     owner = none[-1]
>     gh_state = shared[3]
>     gh_flags = try[0]
>     error = 0
>     gh_iflags = demote[2] alloced[4] dealloc[5]
>   Waiter3
>     owner = 32753
>     gh_state = shared[3]
>     gh_flags = any[3]
>     error = 0
>     gh_iflags = promote[1]
>   [...loads of Waiter3 entries...]
>   Waiter3
>     owner = 4566
>     gh_state = shared[3]
>     gh_flags = any[3]
>     error = 0
>     gh_iflags = promote[1]
>   Inode: busy
> lockdump.node2.dec Glock (inode[2], 627732)
>   gl_flags = lock[1]
>   gl_count = 375
>   gl_state = unlocked[0]
>   req_gh = yes
>   req_bh = yes
>   lvb_count = 0
>   object = yes
>   new_le = no
>   incore_le = no
>   reclaim = no
>   aspace = 0
>   ail_bufs = no
>   Request
>     owner = 20187
>     gh_state = shared[3]
>     gh_flags = any[3]
>     error = 0
>     gh_iflags = promote[1]
>   Waiter3
>     owner = 20187
>     gh_state = shared[3]
>     gh_flags = any[3]
>     error = 0
>     gh_iflags = promote[1]
>   [...loads of Waiter3 entries...]
>   Waiter3
>     owner = 10460
>     gh_state = shared[3]
>     gh_flags = any[3]
>     error = 0
>     gh_iflags = promote[1]
>   Inode: busy
> 2 requests
>
>   
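
For other folks reading along: decoded dumps like the above are typically 
gathered along these lines, run on each node (the exact script arguments may 
differ with your gfs version, so treat this purely as a sketch):

    gfs_tool lockdump /usr/local > lockdump.node1             # raw glock dump of the GFS mount
    decipher_lockstate_dump lockdump.node1 > lockdump.node1.dec
    parse_lockdump lockdump.node1.dec lockdump.node2.dec      # prints blocked lock chains like the "4 chain" above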



