[Linux-cluster] gfs deadlock situation
Wendy Cheng
wcheng at redhat.com
Tue Feb 13 15:00:23 UTC 2007
Wendy Cheng wrote:
> Mark Hlawatschek wrote:
>> Hi,
>>
>> we have the following deadlock situation:
>>
>> A 2-node cluster consisting of node1 and node2. /usr/local is placed
>> on a GFS filesystem mounted on both nodes. The lock manager is DLM.
>> We are using RHEL4 U4.
>>
>> An strace of "ls -l /usr/local/swadmin/mnx/xml" ends up hanging in
>> lstat("/usr/local/swadmin/mnx/xml",
>>
>> This happens on both cluster nodes.
>>
>> All processes trying to access the directory
>> /usr/local/swadmin/mnx/xml are in "waiting for IO" (D) state, i.e.
>> the system load is at about 400 ;-)
>>
>> Any ideas?
>>
> Quickly browsing this, it looks to me like the process with pid=5856
> got stuck. That process held an exclusive lock on the file or
> directory (inode number 627732 - probably /usr/local/swadmin/mnx/xml),
> so everyone else was waiting for it. The faulty process was apparently
> in the middle of obtaining another exclusive lock (and had almost got
> it). We need to know where pid=5856 was stuck at that time. If this
> occurs again, could you use "crash" to back trace that process and
> show us the output?
Or an "echo t > /proc/sysrq-trigger" to obtain *all* threads backtrace
would be better - but it has the risk of missing heartbeat that could
result cluster fence action since sysrq-t could stall the system for a
while.
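
For reference, something along these lines is what I have in mind - the
vmlinux path below is just an example and depends on where your
kernel-debuginfo package installs it:

    # Back trace the single stuck process on the live system with crash
    # (needs the kernel-debuginfo matching the running kernel):
    crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux
    crash> bt 5856

    # Or dump back traces of all threads via sysrq and grab them from
    # the kernel ring buffer (they usually also end up in
    # /var/log/messages):
    echo t > /proc/sysrq-trigger
    dmesg -s 1000000 > /tmp/sysrq-t.out

Keep the fencing risk mentioned above in mind before trying the sysrq-t
variant on a heavily loaded system.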
-- Wendy
>
>> A lockdump analysis with decipher_lockstate_dump and
>> parse_lockdump shows the following output (the whole file is too
>> large for the mailing list):
>>
>> Entries: 101939
>> Glocks: 60112
>> PIDs: 751
>>
>> 4 chain:
>> lockdump.node1.dec Glock (inode[2], 1114343)
>> gl_flags = lock[1]
>> gl_count = 5
>> gl_state = shared[3]
>> req_gh = yes
>> req_bh = yes
>> lvb_count = 0
>> object = yes
>> new_le = no
>> incore_le = no
>> reclaim = no
>> aspace = 1
>> ail_bufs = no
>> Request
>> owner = 5856
>> gh_state = exclusive[1]
>> gh_flags = try[0] local_excl[5] async[6]
>> error = 0
>> gh_iflags = promote[1]
>> Waiter3
>> owner = 5856
>> gh_state = exclusive[1]
>> gh_flags = try[0] local_excl[5] async[6]
>> error = 0
>> gh_iflags = promote[1]
>> Inode: busy
>> lockdump.node2.dec Glock (inode[2], 1114343)
>> gl_flags =
>> gl_count = 2
>> gl_state = unlocked[0]
>> req_gh = no
>> req_bh = no
>> lvb_count = 0
>> object = yes
>> new_le = no
>> incore_le = no
>> reclaim = no
>> aspace = 0
>> ail_bufs = no
>> Inode:
>> num = 1114343/1114343
>> type = regular[1]
>> i_count = 1
>> i_flags =
>> vnode = yes
>> lockdump.node1.dec Glock (inode[2], 627732)
>> gl_flags = dirty[5]
>> gl_count = 379
>> gl_state = exclusive[1]
>> req_gh = no
>> req_bh = no
>> lvb_count = 0
>> object = yes
>> new_le = no
>> incore_le = no
>> reclaim = no
>> aspace = 58
>> ail_bufs = no
>> Holder
>> owner = 5856
>> gh_state = exclusive[1]
>> gh_flags = try[0] local_excl[5] async[6]
>> error = 0
>> gh_iflags = promote[1] holder[6] first[7]
>> Waiter2
>> owner = none[-1]
>> gh_state = shared[3]
>> gh_flags = try[0]
>> error = 0
>> gh_iflags = demote[2] alloced[4] dealloc[5]
>> Waiter3
>> owner = 32753
>> gh_state = shared[3]
>> gh_flags = any[3]
>> error = 0
>> gh_iflags = promote[1]
>> [...loads of Waiter3 entries...]
>> Waiter3
>> owner = 4566
>> gh_state = shared[3]
>> gh_flags = any[3]
>> error = 0
>> gh_iflags = promote[1]
>> Inode: busy
>> lockdump.node2.dec Glock (inode[2], 627732)
>> gl_flags = lock[1]
>> gl_count = 375
>> gl_state = unlocked[0]
>> req_gh = yes
>> req_bh = yes
>> lvb_count = 0
>> object = yes
>> new_le = no
>> incore_le = no
>> reclaim = no
>> aspace = 0
>> ail_bufs = no
>> Request
>> owner = 20187
>> gh_state = shared[3]
>> gh_flags = any[3]
>> error = 0
>> gh_iflags = promote[1]
>> Waiter3
>> owner = 20187
>> gh_state = shared[3]
>> gh_flags = any[3]
>> error = 0
>> gh_iflags = promote[1]
>> [...loads of Waiter3 entries...]
>> Waiter3
>> owner = 10460
>> gh_state = shared[3]
>> gh_flags = any[3]
>> error = 0
>> gh_iflags = promote[1]
>> Inode: busy
>> 2 requests
>>
>>
>
>