[Linux-cluster] mount hang during test runs

David Teigland teigland at redhat.com
Wed Jan 12 06:45:23 UTC 2005


On Mon, Jan 10, 2005 at 04:50:20PM -0800, Daniel McNeil wrote:

> I collected stack traces and a bunch of other info.  It is
> available here:
> http://developer.osdl.org/daniel/GFS/mount.hang.05jan2005/
> 
> Any ideas on debugging this one?


- Processes on cl032 and cl030 are blocked waiting for dlm responses from
  cl031.

- Processes on cl031 are blocked waiting for dlm responses to resource
  directory lookups (cl031 does not yet know the masters of resources
  10,0 and 3,11 and is looking them up).

- It looks like dlm_recvd may be stuck on cl031, which would prevent cl031
  from receiving both the requests from the other two nodes and the
  responses to its own lookup requests.  This is probably the crux of the
  problem.  Unfortunately, all we see for dlm_recvd on cl031 (from
  stack.cl031) is:

dlm_recvd     R running     0 29053      6         29054 29052 (L-TLB)
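
For anyone who wants to repeat this triage on the full stack.* files, here
is a rough sketch of how the task dumps could be scanned for blocked
processes.  It is not a tool from the cluster tree.  The regular expressions
are guesses based only on the excerpts in this message, and the names
parse_tasks and SAMPLE are made up for illustration.

```python
import re

# Task header, e.g. "lock_dlm1     D C170EF9C     0 29065      6 ..."
# (layout assumed from the excerpts in this message)
TASK_RE = re.compile(r"^(\S+)\s+([RSDTZ])\s+\S+\s+\d+\s+(\d+)\s+\d+")
# Frame, e.g. " [<f8968139>] lm_dlm_lock_sync+0x59/0x70 [lock_dlm]"
FRAME_RE = re.compile(
    r"^\s*\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)

def parse_tasks(text):
    """Return one dict per task header, with any call-trace frames that follow."""
    tasks = []
    for line in text.splitlines():
        m = TASK_RE.match(line)
        if m:
            tasks.append({"name": m.group(1), "state": m.group(2),
                          "pid": int(m.group(3)), "frames": []})
            continue
        m = FRAME_RE.match(line)
        if m and tasks:
            # (function, module); module is None for core kernel symbols
            tasks[-1]["frames"].append((m.group(1), m.group(2)))
    return tasks

# Lines copied from the cl031 excerpts in this message.
SAMPLE = """\
dlm_recvd     R running     0 29053      6         29054 29052 (L-TLB)

lock_dlm1     D C170EF9C     0 29065      6         29066 29054 (L-TLB)
d2e0ede8 00000046 f76d3850 c170ef9c 0003e354 00000018 00000008 00000000
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8968139>] lm_dlm_lock_sync+0x59/0x70 [lock_dlm]
 [<f8966443>] id_value+0x93/0x130 [lock_dlm]
"""

tasks = parse_tasks(SAMPLE)
for t in tasks:
    if not t["frames"]:
        print(f'{t["name"]} (pid {t["pid"]}): state {t["state"]}, no call trace')
        continue
    # First frame that lives in a module, i.e. who called into the wait.
    caller = next((f for f in t["frames"] if f[1]), t["frames"][0])
    print(f'{t["name"]} (pid {t["pid"]}): state {t["state"]}, in {caller[0]} [{caller[1]}]')
```

A task in R state with no frames, like dlm_recvd here, is reported
separately, since the dump gives no call trace to inspect for it.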




cl032 - requesting PR on 10,1 (mounting)
----------------------------------------

lock_dlm2     D C170F414     0 18399      4               18398 (L-TLB)
e6a1fe04 00000046 e7639930 c170f414 0003e36e 00000018 00000008 00000000 
       d5ea8d58 7505db9d 0003e36e db8ff348 e6a1fdf8 e7639930 00000000 c170f8c0 
       c170ef60 00000000 000138a5 7505df29 0003e36e f4377170 f43772d8 00000000 
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8968139>] lm_dlm_lock_sync+0x59/0x70 [lock_dlm]
 [<f8966163>] id_test_and_set+0xa3/0x260 [lock_dlm]
 [<f8966597>] claim_jid+0x47/0x120 [lock_dlm]
 [<f8966c3d>] process_start+0x46d/0x610 [lock_dlm]
 [<f896ca54>] dlm_async+0x274/0x3c0 [lock_dlm]
 [<c0134cca>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10


cl031 - requesting PR on 10,0
-----------------------------

lock_dlm1     D C170EF9C     0 29065      6         29066 29054 (L-TLB)
d2e0ede8 00000046 f76d3850 c170ef9c 0003e354 00000018 00000008 00000000 
       f6750838 30672ddf 0003e354 dbf900dc d2e0eddc f76d3850 00000000 c170f8c0 
       c170ef60 00000000 0002088a 306734a4 0003e354 f64d8710 f64d8878 00000000 
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8968139>] lm_dlm_lock_sync+0x59/0x70 [lock_dlm]
 [<f8966443>] id_value+0x93/0x130 [lock_dlm]
 [<f896650f>] id_find+0x2f/0x70 [lock_dlm]
 [<f896670a>] discover_jids+0x6a/0xa0 [lock_dlm]
 [<f8966ab8>] process_start+0x2e8/0x610 [lock_dlm]
 [<f896ca54>] dlm_async+0x274/0x3c0 [lock_dlm]
 [<c0134cca>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10


cl031 - requesting NL on 3,11
-----------------------------

df            D 00000008     0 29088  29086                     (NOTLB)
dd0e5c14 00000082 dd0e5c04 00000008 00000001 f8b3b571 00000008 dd0e5c0c 
       ecb0a568 dbf9002c d6e5415c 00000008 dd0e5c44 00000018 00000000 00000000 
       c170ef60 00000000 00000fec 4d5f5234 0003e3a1 f6789190 f67892f8 dd0e5c44 
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f896804b>] do_dlm_lock_sync+0x4b/0x60 [lock_dlm]
 [<f89683d4>] hold_null_lock+0xb4/0xd0 [lock_dlm]
 [<f8968470>] lm_dlm_hold_lvb+0x40/0x50 [lock_dlm]
 [<f8afff2c>] gfs_lm_hold_lvb+0x3c/0x50 [gfs]
 [<f8af49a1>] gfs_lvb_hold+0x41/0xe0 [gfs]
 [<f8b19c13>] gfs_ri_update+0x1d3/0x250 [gfs]
 [<f8b19d78>] gfs_rindex_hold+0xe8/0x100 [gfs]
 [<f8b1d781>] gfs_stat_gfs+0x21/0x80 [gfs]
 [<f8b131e0>] gfs_statfs+0x30/0xd0 [gfs]
 [<c015e8ac>] vfs_statfs+0x4c/0x70
 [<c015e9cb>] vfs_statfs64+0x1b/0x50
 [<c015eb07>] sys_statfs64+0x67/0xa0
 [<c010537d>] sysenter_past_esp+0x52/0x71


cl030 - requesting PR on 10,1
-----------------------------

lock_dlm2     D 00000008     0 14338      6               14337 (L-TLB)
cf1b4de8 00000046 cf1b4dd8 00000008 00000001 00000018 00000008 00000000 
       f600ec98 00000000 00000000 cbe5ed24 cf1b4ddc 00000000 f7b82054 cf1b4df8 
       c170ef60 00000000 00014966 b62fc6b6 00009f97 f6610730 f6610898 00000009 
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8b57139>] lm_dlm_lock_sync+0x59/0x70 [lock_dlm]
 [<f8b55443>] id_value+0x93/0x130 [lock_dlm]
 [<f8b5550f>] id_find+0x2f/0x70 [lock_dlm]
 [<f8b5570a>] discover_jids+0x6a/0xa0 [lock_dlm]
 [<f8b55ab8>] process_start+0x2e8/0x610 [lock_dlm]
 [<f8b5ba54>] dlm_async+0x274/0x3c0 [lock_dlm]
 [<c0134cca>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10


cl030 - requesting NL on 3,11
-----------------------------

df            D 00000008     0 14362  14360                     (NOTLB)
d10a3c14 00000086 d10a3c04 00000008 00000001 f8b3b571 00000008 d10a3c0c 
       f6b89818 cbe5ec74 c2015b28 00000008 d10a3c44 00000018 00000000 00000000 
       c170ef60 00000000 000305ef f0cf7f52 00009fe4 da6f0f10 da6f1078 d10a3c44 
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8b5704b>] do_dlm_lock_sync+0x4b/0x60 [lock_dlm]
 [<f8b573d4>] hold_null_lock+0xb4/0xd0 [lock_dlm]
 [<f8b57470>] lm_dlm_hold_lvb+0x40/0x50 [lock_dlm]
 [<f8afff2c>] gfs_lm_hold_lvb+0x3c/0x50 [gfs]
 [<f8af49a1>] gfs_lvb_hold+0x41/0xe0 [gfs]
 [<f8b19c13>] gfs_ri_update+0x1d3/0x250 [gfs]
 [<f8b19d78>] gfs_rindex_hold+0xe8/0x100 [gfs]
 [<f8b1d781>] gfs_stat_gfs+0x21/0x80 [gfs]
 [<f8b131e0>] gfs_statfs+0x30/0xd0 [gfs]
 [<c015e8ac>] vfs_statfs+0x4c/0x70
 [<c015e9cb>] vfs_statfs64+0x1b/0x50
 [<c015eb07>] sys_statfs64+0x67/0xa0
 [<c010537d>] sysenter_past_esp+0x52/0x71



cl032 (nodeid 3, mounting and looking for free jid)
---------------------------------------------------

Resource dfdbf26c (parent 00000000). Name (len=24) "      10               1"  
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
000102aa -- (PR) Master:     00000000  LQ: 3,0x9 (pid 18399)


cl031 (nodeid 2, jid 1)
-----------------------

Resource cc0100a4 (parent 00000000). Name (len=24) "      10               1"  
Master Copy
LVB: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
Granted Queue
000100d5 PR (pid 29066)
Conversion Queue
Waiting Queue

Resource e16fe26c (parent 00000000). Name (len=24) "      10               0"  
Local Copy, Master is node -1
Granted Queue
Conversion Queue
Waiting Queue

Resource e4b5573c (parent 00000000). Name (len=24) "       3              11"  
Local Copy, Master is node -1
Granted Queue
Conversion Queue
Waiting Queue


cl030 (nodeid 1, jid 0)
-----------------------

Resource cfb9054c (parent 00000000). Name (len=24) "      10               0"  
Master Copy
LVB: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
Granted Queue
000102c3 PR (pid 14338)
Conversion Queue
Waiting Queue

Resource d798911c (parent 00000000). Name (len=24) "      10               1"  
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
000103b7 -- (PR) Master:     00000000  LQ: 3,0x9 (pid 14338)

Resource d38d7b2c (parent 00000000). Name (len=24) "       3              11"  
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
0002022e -- (NL) Master:     00000000  LQ: 3,0x8 (pid 14362)
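
The lock dumps above can be summarized the same way.  The following is only
a sketch: the line formats are inferred from the dumps in this message, and
reading "--" as "nothing granted yet" with the requested mode in parentheses
is my assumption.  parse_lock_dump, unknown_masters and waiting are
hypothetical helper names.

```python
import re

RES_RE = re.compile(r'^Resource (\w+) .*Name \(len=\d+\) "([^"]*)"')
MASTER_RE = re.compile(r"^Local Copy, Master is node (-?\d+)")
# "000100d5 PR (pid 29066)" or "000102aa -- (PR) Master: ... (pid 18399)"
LOCK_RE = re.compile(r"^([0-9a-f]{8}) (\S+)(?: \((\w+)\))?.*\(pid (\d+)\)")
QUEUES = ("Granted", "Conversion", "Waiting")

def parse_lock_dump(text):
    """Parse one node's dump into a list of resources with their locks."""
    resources, queue = [], None
    for raw in text.splitlines():
        line = raw.rstrip()
        m = RES_RE.match(line)
        if m:
            # Name fields joined as "type,number", matching the notation above.
            resources.append({"name": ",".join(m.group(2).split()),
                              "master": None, "locks": []})
            queue = None
            continue
        if not resources:
            continue
        res = resources[-1]
        m = MASTER_RE.match(line)
        if m:
            res["master"] = int(m.group(1))
        elif line == "Master Copy":
            res["master"] = "self"
        elif line.endswith(" Queue") and line.split()[0] in QUEUES:
            queue = line.split()[0]
        else:
            m = LOCK_RE.match(line)
            if m and queue:
                lkid, granted, requested, pid = m.groups()
                res["locks"].append({"queue": queue, "id": lkid,
                                     "mode": requested or granted,
                                     "pid": int(pid)})
    return resources

def unknown_masters(resources):
    """Resources whose master is still -1 (directory lookup not answered)."""
    return [r["name"] for r in resources if r["master"] == -1]

def waiting(resources):
    """(resource, master, mode, pid) for every lock on a Waiting Queue."""
    return [(r["name"], r["master"], l["mode"], l["pid"])
            for r in resources for l in r["locks"] if l["queue"] == "Waiting"]

CL031 = '''\
Resource cc0100a4 (parent 00000000). Name (len=24) "      10               1"
Master Copy
LVB: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00
Granted Queue
000100d5 PR (pid 29066)
Conversion Queue
Waiting Queue

Resource e16fe26c (parent 00000000). Name (len=24) "      10               0"
Local Copy, Master is node -1
Granted Queue
Conversion Queue
Waiting Queue

Resource e4b5573c (parent 00000000). Name (len=24) "       3              11"
Local Copy, Master is node -1
Granted Queue
Conversion Queue
Waiting Queue
'''

CL030 = '''\
Resource d798911c (parent 00000000). Name (len=24) "      10               1"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
000103b7 -- (PR) Master:     00000000  LQ: 3,0x9 (pid 14338)

Resource d38d7b2c (parent 00000000). Name (len=24) "       3              11"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
0002022e -- (NL) Master:     00000000  LQ: 3,0x8 (pid 14362)
'''

cl031 = parse_lock_dump(CL031)
cl030 = parse_lock_dump(CL030)
print("cl031 unknown masters:", unknown_masters(cl031))
print("cl030 waiting locks:  ", waiting(cl030))
```

On these excerpts it lists 10,0 and 3,11 as having no known master on cl031,
and shows both of cl030's waiting requests pointing at node 2 (cl031).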



-- 
Dave Teigland  <teigland at redhat.com>



