[Cluster-devel] FS/DLM module triggered kernel BUG

Tue Aug 24 20:31:14 UTC 2021

Hi Gang He,

On Tue, Aug 24, 2021 at 10:18 AM Alexander Aring <aahringo at redhat.com> wrote:
...
> > >
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
> > >> [Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: ffffdd99ffd16650
> > >> [Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
> > >> [Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
> > >> [Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
> > >> [Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
> > >> [Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G        W         5.13.8-1-default #1 openSUSE Tumbleweed
> > >> [Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> > >> [Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
> > >> [Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20

That looked suspicious to me, so I went through the v5.13.8 code again
and found an issue. I believe you are hitting an out-of-bounds array
access in __srcu_read_unlock(): a concurrent code path updated the idx
value that is passed in, so it was invalid at the moment of the unlock.
The idx handling can also be invalid in several other cases. This is
fixed in the current mainline kernel, but v5.13.8 is still broken. I
will send you a patch marked as RFC. Please test it and report back;
then I will resend it for v5.13.8.
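
To illustrate why a bad idx is dangerous, here is a small userspace toy
model. It is a sketch only, not the kernel implementation and not the
patch: struct model_srcu, model_read_lock(), model_read_unlock() and
model_readers() are hypothetical names, and the range check in the
unlock path exists only so the model can report the problem instead of
writing out of bounds. The assumption it mirrors is that the unlock side
uses idx directly as an index into a two-element counter array.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model of SRCU reader accounting (hypothetical names). */
#define MODEL_SLOTS 2

struct model_srcu {
	unsigned long lock_count[MODEL_SLOTS];
	unsigned long unlock_count[MODEL_SLOTS];
	unsigned int cur;	/* flipped by the modelled grace-period side */
};

/* Pick the current slot, count the reader, and hand the slot back. */
int model_read_lock(struct model_srcu *s)
{
	int idx = (int)(s->cur & 0x1);

	s->lock_count[idx]++;
	return idx;
}

/*
 * The caller must pass the idx returned by its own model_read_lock().
 * Returns 0 on success, or -1 if idx would index outside the array.
 * The check is part of this model only.
 */
int model_read_unlock(struct model_srcu *s, int idx)
{
	if (idx < 0 || idx >= MODEL_SLOTS)
		return -1;

	s->unlock_count[idx]++;
	return 0;
}

/* Number of readers still inside a read-side section for one slot. */
unsigned long model_readers(const struct model_srcu *s, int idx)
{
	assert(idx >= 0 && idx < MODEL_SLOTS);
	return s->lock_count[idx] - s->unlock_count[idx];
}
```

In this model, a reader that keeps idx in a local variable between lock
and unlock always passes 0 or 1, even if cur flips in between. If idx
lives in shared state that another context can overwrite, the unlock
may see an arbitrary value, and without the check the increment would
land outside the array.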

- Alex