[Cluster-devel] Why does dlm_lock function fails when downconvert a dlm lock?

Gang He ghe at suse.com
Thu Aug 12 05:44:53 UTC 2021


Hi Alexander,


On 2021/8/12 4:35, Alexander Aring wrote:
> Hi,
> 
> On Wed, Aug 11, 2021 at 6:41 AM Gang He <GHe at suse.com> wrote:
>>
>> Hello List,
>>
>> I am using kernel 5.13.4 (some older kernels have the same problem).
>> Node A held a dlm (EX) lock and node B tried to get the same lock, so node A got a BAST message.
>> Node A then downconverted the dlm lock to NL, and the dlm_lock function failed with error -16.
>> The failure does not always happen, but I hit it in some cases.
>> Why does dlm_lock fail when downconverting a dlm lock? Is there any documentation describing these error cases?
>> If the code ignores the dlm_lock error on node A, node B will never get the dlm lock.
>> How should we handle this situation? Call dlm_lock again to downconvert the dlm lock?
> 
> What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user (libdlm)?
ocfs2 file system.

> 
> I believe you are running into case [0]. Can you provide the
> corresponding log_debug() message? It's necessary to insert
> "log_debug=1" in your dlm.conf and it will be reported on KERN_DEBUG
> in your kernel log then.
[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 100000
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 100000
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: validate_lock_args -16 10 100000 10c 2 0 M0000000000000000046e0200000000
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M0000000000000000046e0200000000
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
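
For reference, here is a rough user-space decoding of the
validate_lock_args line above. It is only a sketch under assumptions:
the field order (rv, lkb_id, lkb_flags, args->flags, lkb_status,
lkb_wait_type, resource name) is my reading of the log_debug() format in
validate_lock_args() in fs/dlm/lock.c, and the flag values are copied
from my reading of include/uapi/linux/dlmconstants.h and
fs/dlm/dlm_internal.h, so please check both against your tree. The
helper names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed values; verify against your kernel headers. */
#define DLM_LKF_CONVERT        0x00000004u
#define DLM_IFL_OVERLAP_UNLOCK 0x00080000u
#define DLM_IFL_OVERLAP_CANCEL 0x00100000u
#define DLM_LKSTS_GRANTED      2

/* Hypothetical container for the fields of one log line. */
struct vla_fields {
    int rv;
    unsigned int lkb_id, lkb_flags, arg_flags;
    int status, wait_type;
    char res[65];
};

/* Parse a "validate_lock_args ..." log line; returns 0 on success. */
static int parse_validate_lock_args(const char *line, struct vla_fields *f)
{
    const char *p;

    assert(line && f);
    p = strstr(line, "validate_lock_args ");
    if (!p)
        return -1;
    if (sscanf(p, "validate_lock_args %d %x %x %x %d %d %64s",
               &f->rv, &f->lkb_id, &f->lkb_flags, &f->arg_flags,
               &f->status, &f->wait_type, f->res) != 7)
        return -1;
    return 0;
}

/* Mirror the order of the -EBUSY checks in the convert path. */
static const char *ebusy_reason(const struct vla_fields *f)
{
    if (!(f->arg_flags & DLM_LKF_CONVERT))
        return "not a convert request";
    if (f->status != DLM_LKSTS_GRANTED)
        return "lkb is not on the granted queue";
    if (f->wait_type)
        return "a reply is still pending";
    if (f->lkb_flags & (DLM_IFL_OVERLAP_UNLOCK | DLM_IFL_OVERLAP_CANCEL))
        return "an overlapping cancel/unlock is in flight";
    return "unknown";
}
```

If those assumptions hold, the line decodes to a convert request on lkb
0x10 that is granted (status 2) with no pending wait type, but with the
overlap-cancel flag (0x100000) still set, which would point at the
overlap check in the code Alex referenced at [0].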

The whole kernel log for this node is here:
https://pastebin.com/FBn8Uwsu
The other two node kernel log:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb

In fact, I can reproduce this problem reliably.
Is this error expected? The test does not put the cluster under any
extreme load.
Second, how should we handle this error? Should we call dlm_lock again?
The call may fail again, and repeated retries could lead to a kernel
soft lockup.
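
To make the retry question concrete, here is a minimal user-space sketch
of a bounded retry that backs off between attempts instead of spinning.
It is not ocfs2 code and not a claim that retrying is the right fix.
All names (downconvert_bounded, try_convert, wait, fake_lock) are
hypothetical; in the kernel the callback would stand in for the
dlm_lock() convert call and the wait step for a sleep or a requeue.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callbacks: one convert attempt, and one backoff step. */
typedef int (*convert_fn)(void *ctx);
typedef void (*backoff_fn)(int attempt, void *ctx);

/*
 * Retry the convert while it returns -EBUSY, at most max_tries times,
 * waiting between attempts. Any other result (success or a different
 * error) is returned immediately.
 */
static int downconvert_bounded(convert_fn try_convert, backoff_fn wait,
                               void *ctx, int max_tries)
{
    int rv = -EBUSY;
    int attempt;

    assert(try_convert && max_tries > 0);
    for (attempt = 0; attempt < max_tries; attempt++) {
        rv = try_convert(ctx);
        if (rv != -EBUSY)
            return rv;
        /* No point in waiting after the final attempt. */
        if (wait && attempt + 1 < max_tries)
            wait(attempt, ctx);
    }
    return rv; /* still -EBUSY: the caller must decide what to do */
}

/* Stand-in lock for illustration: busy for the first busy_left calls. */
struct fake_lock {
    int busy_left;
    int calls;
    int waits;
};

static int fake_convert(void *ctx)
{
    struct fake_lock *l = ctx;

    l->calls++;
    if (l->busy_left > 0) {
        l->busy_left--;
        return -EBUSY;
    }
    return 0;
}

static void fake_wait(int attempt, void *ctx)
{
    struct fake_lock *l = ctx;

    (void)attempt; /* a real backoff could scale the delay by attempt */
    l->waits++;
}
```

The bound and the wait step are what keep a persistent -EBUSY from
turning into a busy loop; what the caller should do once the bound is
reached is still the open question.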

Thanks
Gang

> 
> Thanks.
> 
> - Alex
> 
> [0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886
> 



