[Cluster-devel] FS/DLM module triggered kernel BUG

Gang He ghe at suse.com
Tue Aug 24 05:36:05 UTC 2021



On 2021/8/23 21:49, Alexander Aring wrote:
> Hi Gang He,
> 
> On Mon, Aug 23, 2021 at 1:43 AM Gang He <GHe at suse.com> wrote:
>>
>> Hello Guys,
>>
>> I am using kernel 5.13.8, and I sometimes encounter a kernel BUG triggered by the dlm module.
> 
> What exactly are you doing? I would like to test it on a recent upstream
> version, or could you do that?
I am not specifically testing the dlm kernel module.
I am doing ocfs2-related testing on openSUSE Tumbleweed, which
includes a very recent kernel version.
But sometimes the ocfs2 test cases were blocked or aborted due to this
DLM problem.

> 
>> Since the dlm kernel module here is not built from the latest source code, I am not sure whether this problem has already been fixed or not.
>>
> 
> could be, see below.
> 
>> The backtrace is as below,
>>
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 172204615
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_members 2 nodes
>> [Fri Aug 20 16:24:14 2021] dlm: connection 000000005ef82293 got EOF from 172204615
> 
> here we disconnect from nodeid 172204615.
> 
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
>> [Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: ffffdd99ffd16650
>> [Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
>> [Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
>> [Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
>> [Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
>> [Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G        W         5.13.8-1-default #1 openSUSE Tumbleweed
>> [Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
>> [Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
>> [Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
>> [Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
>> [Fri Aug 20 16:24:14 2021] RSP: 0018:ffffbd9a041ebd80 EFLAGS: 00010282
>> [Fri Aug 20 16:24:14 2021] RAX: 00003cc9c100ec00 RBX: 00000000000000dc RCX: 0000000000000830
>> [Fri Aug 20 16:24:14 2021] RDX: 0000000000000000 RSI: 0000000000000f48 RDI: ffffffffc06b4420
>> [Fri Aug 20 16:24:14 2021] RBP: ffffa0d028423974 R08: 0000000000000001 R09: 0000000000000004
>> [Fri Aug 20 16:24:14 2021] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0d028425000
>> [Fri Aug 20 16:24:14 2021] R13: 000000000a43a2f2 R14: ffffa0d028425770 R15: 000000000a43a2f2
>> [Fri Aug 20 16:24:14 2021] FS:  0000000000000000(0000) GS:ffffa0d03ed00000(0000) knlGS:0000000000000000
>> [Fri Aug 20 16:24:14 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [Fri Aug 20 16:24:14 2021] CR2: ffffdd99ffd16650 CR3: 0000000002696000 CR4: 00000000000406e0
>> [Fri Aug 20 16:24:14 2021] Call Trace:
>> [Fri Aug 20 16:24:14 2021]  dlm_receive_buffer+0x66/0x150 [dlm]
> 
> It would be interesting to know whether we received a message here from
> nodeid 172204615, and I think that is what happens. There may be a
> use-after-free going on; we should not receive any more messages from
> nodeid 172204615.
> I recently added some dlm tracing infrastructure. It should be simple
> to add a trace event here, print out the nodeid and compare
> timestamps.
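
For illustration only, here is a minimal sketch of such a trace event. The
event name, fields, and placement are hypothetical, not the actual events
from that tracing series; it assumes a header along the lines of
include/trace/events/dlm.h:

/* sketch only: hypothetical trace event for incoming dlm messages */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM dlm

#if !defined(_TRACE_DLM_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_DLM_H

#include <linux/tracepoint.h>
/* the full struct dlm_header definition (fs/dlm/dlm_internal.h) must be
 * visible in the compilation unit that defines CREATE_TRACE_POINTS */

TRACE_EVENT(dlm_recv_message,
	TP_PROTO(int nodeid, const struct dlm_header *hd),
	TP_ARGS(nodeid, hd),
	TP_STRUCT__entry(
		__field(int, nodeid)
		__field(u8, cmd)
	),
	TP_fast_assign(
		__entry->nodeid = nodeid;
		__entry->cmd = hd->h_cmd;	/* e.g. DLM_MSG or DLM_RCOM */
	),
	TP_printk("nodeid=%d cmd=%u", __entry->nodeid, __entry->cmd)
);

#endif /* _TRACE_DLM_H */

/* This part must be outside the include guard */
#include <trace/define_trace.h>

A call such as trace_dlm_recv_message(nodeid, &p->header) at the top of
dlm_receive_buffer() would then timestamp every incoming message, which
could be compared against the "got EOF from 172204615" disconnect message
in the log above.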
> 
> I recently fixed a synchronization issue which is not part of kernel
> 5.13.8 and is related to what you are seeing here.
> There is a workaround, which also serves as a simple test of whether this
> really affects you: create a dummy lockspace on all nodes, so that we
> never do any disconnects, and see whether you run into this issue again.
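
For reference, a minimal sketch of such a dummy-lockspace holder, assuming
the libdlm userspace library (dlm_create_lockspace()/dlm_release_lockspace())
and a running dlm_controld; the lockspace name "dummy" is arbitrary, and the
program would need to run as root on every node (link with -ldlm):

#include <stdio.h>
#include <unistd.h>
#include <libdlm.h>

int main(void)
{
	/* create (or join) a dummy lockspace; use the same name on all nodes */
	dlm_lshandle_t ls = dlm_create_lockspace("dummy", 0600);
	if (!ls) {
		perror("dlm_create_lockspace");
		return 1;
	}

	/* keep the lockspace, and therefore the dlm connections, alive */
	pause();

	dlm_release_lockspace("dummy", ls, 1);
	return 0;
}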
Which git commit is that? I do not want to see any kernel (warning)
messages from the DLM kernel module; sometimes DLM enters a stuck state
after such a message is printed.
Since there have been a few commits in the past weeks, I just wonder
whether there is a regression.

Thanks
Gang


> 
>> [Fri Aug 20 16:24:14 2021]  dlm_process_incoming_buffer+0x38/0x90 [dlm]
>> [Fri Aug 20 16:24:14 2021]  receive_from_sock+0xd4/0x1f0 [dlm]
>> [Fri Aug 20 16:24:14 2021]  process_recv_sockets+0x1a/0x20 [dlm]
>> [Fri Aug 20 16:24:14 2021]  process_one_work+0x1df/0x370
>> [Fri Aug 20 16:24:14 2021]  worker_thread+0x50/0x400
>> [Fri Aug 20 16:24:14 2021]  ? process_one_work+0x370/0x370
>> [Fri Aug 20 16:24:14 2021]  kthread+0x127/0x150
>> [Fri Aug 20 16:24:14 2021]  ? set_kthread_struct+0x40/0x40
>> [Fri Aug 20 16:24:14 2021]  ret_from_fork+0x22/0x30
> 
> - Alex
> 



