[Linux-cluster] gfs deadlock situation

Wendy Cheng wcheng at redhat.com
Wed Feb 14 19:41:25 UTC 2007


Mark Hlawatschek wrote:
> On Tuesday 13 February 2007 20:56, Wendy Cheng wrote:
>   
>> Wendy Cheng wrote:
>>     
>>> So it is removing a file. It has obtained the directory lock and is
>>> waiting for the file lock. Looks to me like the DLM (LM_CB_ASYNC) callback
>>> never occurred. Do you have any abnormal messages in your /var/log/messages
>>> file? Dave, how do we dump the locks on the DLM side to see what DLM is thinking?
>>>       
>> shell> cman_tool services    /* find your lock space name */
>> shell> echo "lock-space-name-found-above" > /proc/cluster/dlm_locks
>> shell> cat /proc/cluster/dlm_locks
>>
>> Then try to find the lock (type 2, with 1114343 in hex, i.e. 1100e7) - cut
>> and paste the contents of that file here.
>>     
>
> syslog seems to be ok.
> Note that process 5856 is running on node1.
>   
What I was looking for is a lock (type=2 and lock number=0x1100e7 
(=1114343)) - that's the lock that hangs process pid=5856. Since pid=5856 
also holds an exclusive lock on the directory, nobody can access that 
directory.
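
(As a convenience: the resource name in the dlm_locks output is just the 
lock type followed by the inode number in hex, so once the dump is saved 
to a file, something like the two lines below should pull out that one 
entry. The file name dlm_dump_1 is only the name I suggest further down - 
adjust it to whatever you use.)

shell> printf "%x\n" 1114343    /* prints 1100e7 - the inode number in hex */
shell> grep -A 6 "1100e7" /tmp/dlm_dump_1    /* that resource entry plus (roughly) its queues */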

Apparently, from the GFS end, node 2 thinks 0x1100e7 is "unlocked" and 
node 1 is waiting for it. The only thing that can get node 1 out of this 
wait state is DLM's callback. If DLM doesn't have any record of this 
lock, pid=5856 will wait forever. Are you sure this is the complete DLM 
output? This lock has somehow disappeared from DLM, and I have no idea 
how we got into this state. If the files are too large, could you tar 
them and email them over? I would like to see complete glock and DLM 
lock dumps from both nodes (4 files so far). If possible, add the 
following two outputs as well (6 files total):

shell> cd /tmp  /* on both nodes */
shell> script /* this should generate a file called typescript in /tmp 
directory */
shell> crash
crash> foreach bt /* keep hitting the space bar until this command runs through */
crash> quit
shell> <cntl><d> /* this should close out typescript file */
shell> mv typescript nodex_crash /* x=1, 2 based on node1 or node2 */

Tar these 6 files (glock_dump_1, glock_dump_2, dlm_dump_1, dlm_dump_2, 
node1_crash, node2_crash) and email them to wcheng at redhat.com
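
Any tar/gzip command will do - for example, assuming the six files all 
end up in /tmp with the names above (gfs_debug.tar.gz is just a suggested 
archive name):

shell> cd /tmp
shell> tar czf gfs_debug.tar.gz glock_dump_1 glock_dump_2 \
           dlm_dump_1 dlm_dump_2 node1_crash node2_crash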

Thank you for the help.

-- Wendy

> Here's the dlm output:
>
> node1:
> Resource 0000010001218088 (parent 0000000000000000). Name (len=24) "       2          1100e7"
> Local Copy, Master is node 2
> Granted Queue
> Conversion Queue
> Waiting Queue
> 5eb00178 PR (EX) Master:     3eeb0117  LQ: 0,0x5
> [...]
> Resource 00000100f56f0618 (parent 0000000000000000). Name (len=24) "       5          1100e7"
> Local Copy, Master is node 2
> Granted Queue
> 5bc20257 PR Master:     3d9703e0
> Conversion Queue
> Waiting Queue
>
> node2:
> Resource 00000107e462c8c8 (parent 0000000000000000). Name (len=24) "       2          1100e7"
> Master Copy
> Granted Queue
> 3eeb0117 PR Remote:   1 5eb00178
> Conversion Queue
> Waiting Queue
> [...]
> Resource 000001079f7e81d8 (parent 0000000000000000). Name (len=24) "       5      1100e7"
> Master Copy
> Granted Queue
> 3d9703e0 PR Remote:   1 5bc20257
> 3e500091 PR
> Conversion Queue
> Waiting Queue
>
> Thanks for your help,
>
> Mark
>
>   
>>>>>> we have the following deadlock situation:
>>>>>>
>>>>>> A 2-node cluster consisting of node1 and node2.
>>>>>> /usr/local is placed on a GFS filesystem mounted on both nodes.
>>>>>> The lock manager is DLM.
>>>>>> We are using RHEL4u4.
>>>>>>
>>>>>> An strace of ls -l /usr/local/swadmin/mnx/xml ends up in
>>>>>> lstat("/usr/local/swadmin/mnx/xml",
>>>>>>
>>>>>> This happens on both cluster nodes.
>>>>>>
>>>>>> All processes trying to access the directory
>>>>>> /usr/local/swadmin/mnx/xml
>>>>>> are in "Waiting for IO (D)" state. I.e. system load is at about 400
>>>>>> ;-)
>>>>>>
>>>>>> Any ideas ?
>>>>>>             
>>>>> Quickly browsing this, it looks to me like the process with pid=5856 got
>>>>> stuck. That process held the exclusive lock on the file or directory
>>>>> (inode number 627732 - probably /usr/local/swadmin/mnx/xml), so everyone
>>>>> was waiting for it. The faulty process was apparently in the middle of
>>>>> obtaining another exclusive lock (and almost got it). We need to know
>>>>> where pid=5856 was stuck at that time. If this occurs again, could you
>>>>> use "crash" to back-trace that process and show us the output?
>>>>>           
>>>> Here's the crash output:
>>>>
>>>> crash> bt 5856
>>>> PID: 5856   TASK: 10bd26427f0       CPU: 0   COMMAND: "java"
>>>>  #0 [10bd20cfbc8] schedule at ffffffff8030a1d1
>>>>  #1 [10bd20cfca0] wait_for_completion at ffffffff8030a415
>>>>  #2 [10bd20cfd20] glock_wait_internal at ffffffffa018574e
>>>>  #3 [10bd20cfd60] gfs_glock_nq_m at ffffffffa01860ce
>>>>  #4 [10bd20cfda0] gfs_unlink at ffffffffa019ce41
>>>>  #5 [10bd20cfea0] vfs_unlink at ffffffff801889fa
>>>>  #6 [10bd20cfed0] sys_unlink at ffffffff80188b19
>>>>  #7 [10bd20cff30] filp_close at ffffffff80178e48
>>>>  #8 [10bd20cff50] error_exit at ffffffff80110d91
>>>>     RIP: 0000002a9593f649  RSP: 0000007fbfffbca0  RFLAGS: 00010206
>>>>     RAX: 0000000000000057  RBX: ffffffff8011026a  RCX: 0000002a9cc9c870
>>>>     RDX: 0000002ae5989000  RSI: 0000002a962fa3a8  RDI: 0000002ae5989000
>>>>     RBP: 0000000000000000   R8: 0000002a9630abb0   R9: 0000000000000ffc
>>>>     R10: 0000002a9630abc0  R11: 0000000000000206  R12: 0000000040115700
>>>>     R13: 0000002ae23294b0  R14: 0000007fbfffc300  R15: 0000002ae5989000
>>>>     ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b
>>>>
>>>>         
>>>>>> a lockdump analysis with the decipher_lockstate_dump and
>>>>>> parse_lockdump
>>>>>> shows the following output (The whole file is too large for the
>>>>>> mailing-list):
>>>>>>
>>>>>> Entries:  101939
>>>>>> Glocks:  60112
>>>>>> PIDs:  751
>>>>>>
>>>>>> 4 chain:
>>>>>> lockdump.node1.dec Glock (inode[2], 1114343)
>>>>>>   gl_flags = lock[1]
>>>>>>   gl_count = 5
>>>>>>   gl_state = shared[3]
>>>>>>   req_gh = yes
>>>>>>   req_bh = yes
>>>>>>   lvb_count = 0
>>>>>>   object = yes
>>>>>>   new_le = no
>>>>>>   incore_le = no
>>>>>>   reclaim = no
>>>>>>   aspace = 1
>>>>>>   ail_bufs = no
>>>>>>   Request
>>>>>>     owner = 5856
>>>>>>     gh_state = exclusive[1]
>>>>>>     gh_flags = try[0] local_excl[5] async[6]
>>>>>>     error = 0
>>>>>>     gh_iflags = promote[1]
>>>>>>   Waiter3
>>>>>>     owner = 5856
>>>>>>     gh_state = exclusive[1]
>>>>>>     gh_flags = try[0] local_excl[5] async[6]
>>>>>>     error = 0
>>>>>>     gh_iflags = promote[1]
>>>>>>   Inode: busy
>>>>>> lockdump.node2.dec Glock (inode[2], 1114343)
>>>>>>   gl_flags =
>>>>>>   gl_count = 2
>>>>>>   gl_state = unlocked[0]
>>>>>>   req_gh = no
>>>>>>   req_bh = no
>>>>>>   lvb_count = 0
>>>>>>   object = yes
>>>>>>   new_le = no
>>>>>>   incore_le = no
>>>>>>   reclaim = no
>>>>>>   aspace = 0
>>>>>>   ail_bufs = no
>>>>>>   Inode:
>>>>>>     num = 1114343/1114343
>>>>>>     type = regular[1]
>>>>>>     i_count = 1
>>>>>>     i_flags =
>>>>>>     vnode = yes
>>>>>> lockdump.node1.dec Glock (inode[2], 627732)
>>>>>>   gl_flags = dirty[5]
>>>>>>   gl_count = 379
>>>>>>   gl_state = exclusive[1]
>>>>>>   req_gh = no
>>>>>>   req_bh = no
>>>>>>   lvb_count = 0
>>>>>>   object = yes
>>>>>>   new_le = no
>>>>>>   incore_le = no
>>>>>>   reclaim = no
>>>>>>   aspace = 58
>>>>>>   ail_bufs = no
>>>>>>   Holder
>>>>>>     owner = 5856
>>>>>>     gh_state = exclusive[1]
>>>>>>     gh_flags = try[0] local_excl[5] async[6]
>>>>>>     error = 0
>>>>>>     gh_iflags = promote[1] holder[6] first[7]
>>>>>>   Waiter2
>>>>>>     owner = none[-1]
>>>>>>     gh_state = shared[3]
>>>>>>     gh_flags = try[0]
>>>>>>     error = 0
>>>>>>     gh_iflags = demote[2] alloced[4] dealloc[5]
>>>>>>   Waiter3
>>>>>>     owner = 32753
>>>>>>     gh_state = shared[3]
>>>>>>     gh_flags = any[3]
>>>>>>     error = 0
>>>>>>     gh_iflags = promote[1]
>>>>>>   [...loads of Waiter3 entries...]
>>>>>>   Waiter3
>>>>>>     owner = 4566
>>>>>>     gh_state = shared[3]
>>>>>>     gh_flags = any[3]
>>>>>>     error = 0
>>>>>>     gh_iflags = promote[1]
>>>>>>   Inode: busy
>>>>>> lockdump.node2.dec Glock (inode[2], 627732)
>>>>>>   gl_flags = lock[1]
>>>>>>   gl_count = 375
>>>>>>   gl_state = unlocked[0]
>>>>>>   req_gh = yes
>>>>>>   req_bh = yes
>>>>>>   lvb_count = 0
>>>>>>   object = yes
>>>>>>   new_le = no
>>>>>>   incore_le = no
>>>>>>   reclaim = no
>>>>>>   aspace = 0
>>>>>>   ail_bufs = no
>>>>>>   Request
>>>>>>     owner = 20187
>>>>>>     gh_state = shared[3]
>>>>>>     gh_flags = any[3]
>>>>>>     error = 0
>>>>>>     gh_iflags = promote[1]
>>>>>>   Waiter3
>>>>>>     owner = 20187
>>>>>>     gh_state = shared[3]
>>>>>>     gh_flags = any[3]
>>>>>>     error = 0
>>>>>>     gh_iflags = promote[1]
>>>>>>   [...loads of Waiter3 entries...]
>>>>>>   Waiter3
>>>>>>     owner = 10460
>>>>>>     gh_state = shared[3]
>>>>>>     gh_flags = any[3]
>>>>>>     error = 0
>>>>>>     gh_iflags = promote[1]
>>>>>>   Inode: busy
>>>>>> 2 requests
>>>>>>             
>>>>> --
>>>>>           
>
>   



