[Cluster-devel] FS/DLM module triggered kernel BUG

Alexander Aring aahringo at redhat.com
Mon Aug 23 13:49:02 UTC 2021


Hi Gang He,

On Mon, Aug 23, 2021 at 1:43 AM Gang He <GHe at suse.com> wrote:
>
> Hello Guys,
>
> I use kernel 5.13.8, and I have sometimes encountered a kernel BUG triggered by the dlm module.

What exactly do you do to trigger this? I would like to test it on a
recent upstream version, or could you try that yourself?

> Since the dlm kernel module here is not built from the latest source code, I am not sure whether this problem has already been fixed or not.
>

It could be; see below.

> The backtrace is as below,
>
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 172204615
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_members 2 nodes
> [Fri Aug 20 16:24:14 2021] dlm: connection 000000005ef82293 got EOF from 172204615

Here we disconnect from nodeid 172204615, the member that was just
removed.

> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
> [Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: ffffdd99ffd16650
> [Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
> [Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
> [Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
> [Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
> [Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G        W         5.13.8-1-default #1 openSUSE Tumbleweed
> [Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
> [Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
> [Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
> [Fri Aug 20 16:24:14 2021] RSP: 0018:ffffbd9a041ebd80 EFLAGS: 00010282
> [Fri Aug 20 16:24:14 2021] RAX: 00003cc9c100ec00 RBX: 00000000000000dc RCX: 0000000000000830
> [Fri Aug 20 16:24:14 2021] RDX: 0000000000000000 RSI: 0000000000000f48 RDI: ffffffffc06b4420
> [Fri Aug 20 16:24:14 2021] RBP: ffffa0d028423974 R08: 0000000000000001 R09: 0000000000000004
> [Fri Aug 20 16:24:14 2021] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0d028425000
> [Fri Aug 20 16:24:14 2021] R13: 000000000a43a2f2 R14: ffffa0d028425770 R15: 000000000a43a2f2
> [Fri Aug 20 16:24:14 2021] FS:  0000000000000000(0000) GS:ffffa0d03ed00000(0000) knlGS:0000000000000000
> [Fri Aug 20 16:24:14 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Fri Aug 20 16:24:14 2021] CR2: ffffdd99ffd16650 CR3: 0000000002696000 CR4: 00000000000406e0
> [Fri Aug 20 16:24:14 2021] Call Trace:
> [Fri Aug 20 16:24:14 2021]  dlm_receive_buffer+0x66/0x150 [dlm]

It would be interesting to know whether the message we are handling
here came from nodeid 172204615; I think that is what is happening.
There may be a use-after-free going on, because at this point we
should not be receiving any more messages from nodeid 172204615.
I recently added some dlm tracing infrastructure, so it should be
simple to add a trace event here, print out the nodeid and compare
the timestamps with the disconnect message above.
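
Something along these lines, for example (an untested sketch written
from memory, so the exact dlm_receive_buffer() signature and the best
place for the call may differ in your tree), using trace_printk() for
a quick check instead of a full trace event:

/* fs/dlm/lock.c -- untested sketch, not a real patch */
void dlm_receive_buffer(union dlm_packet *p, int nodeid)
{
	/* log which nodeid we are still receiving messages from, so the
	 * timestamp can be compared with the "got EOF from <nodeid>"
	 * disconnect message in the log */
	trace_printk("dlm: receive buffer from nodeid %d\n", nodeid);

	/* ... rest of dlm_receive_buffer() unchanged ... */
}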

I recently fixed a synchronization issue which is not yet part of
kernel 5.13.8 and which may have something to do with what you are
seeing here.
There is a workaround that also works as a simple test of whether
this really affects you: create a dummy lockspace on all nodes, so
that we never actually do any disconnects, and then check whether you
still run into this issue.
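
Creating the dummy lockspace should work with dlm_tool (something
like "dlm_tool join dummy", if I remember correctly) or with a few
lines against libdlm. A minimal sketch of the latter, written from
memory of libdlm.h, so please double-check the
dlm_create_lockspace()/dlm_release_lockspace() prototypes; build with
-ldlm and keep it running on every node while you test:

#include <stdio.h>
#include <libdlm.h>

int main(void)
{
	/* holding a lockspace open on every node keeps the dlm
	 * connections up, so the disconnect/EOF path is never taken */
	dlm_lshandle_t ls = dlm_create_lockspace("dummy", 0600);
	if (!ls) {
		perror("dlm_create_lockspace");
		return 1;
	}

	printf("dummy lockspace created, press enter to release it\n");
	getchar();

	/* force-release the lockspace once the test is done */
	dlm_release_lockspace("dummy", ls, 1);
	return 0;
}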

> [Fri Aug 20 16:24:14 2021]  dlm_process_incoming_buffer+0x38/0x90 [dlm]
> [Fri Aug 20 16:24:14 2021]  receive_from_sock+0xd4/0x1f0 [dlm]
> [Fri Aug 20 16:24:14 2021]  process_recv_sockets+0x1a/0x20 [dlm]
> [Fri Aug 20 16:24:14 2021]  process_one_work+0x1df/0x370
> [Fri Aug 20 16:24:14 2021]  worker_thread+0x50/0x400
> [Fri Aug 20 16:24:14 2021]  ? process_one_work+0x370/0x370
> [Fri Aug 20 16:24:14 2021]  kthread+0x127/0x150
> [Fri Aug 20 16:24:14 2021]  ? set_kthread_struct+0x40/0x40
> [Fri Aug 20 16:24:14 2021]  ret_from_fork+0x22/0x30

- Alex
