[dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

Nikos Tsironis ntsironis at arrikto.com
Thu Oct 17 07:43:37 UTC 2019


On 10/17/19 8:58 AM, Guruswamy Basavaiah wrote:
>Hello Nikos,
>  Tested with your new patches. Issue is resolved. Thank you.

Hi Guru,

That's great. Thanks for testing the patches.

>  In the second patch I changed "struct wait_queue_head" to
> "wait_queue_head_t" for the variable in_progress_wait; otherwise
> compilation fails with the error
>  "error: field 'in_progress_wait' has incomplete type
>   struct wait_queue_head in_progress_wait;"

"struct wait_queue_head" was introduced by commit 9d9d676f595b50
("sched/wait: Standardize internal naming of wait-queue heads"), which
is included in kernels starting from v4.13.

So, the patch works fine with the latest kernel, but needs adapting for
older kernels, which I missed when rebasing the patches for the 4.4.x
kernel series.
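
To illustrate the naming difference, here is a minimal userspace mock-up,
not kernel code. It only imitates how the wait-queue head is named before
and after v4.13, as I understand it: older kernels define
"struct __wait_queue_head", newer ones "struct wait_queue_head", and both
provide the "wait_queue_head_t" typedef. MOCK_KERNEL_GE_4_13,
struct snapshot_mock and snapshot_mock_init() are hypothetical names made
up for this sketch, and the single "lock" member stands in for the real
contents of the structure.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mock-up only: these types imitate the naming used by
 * <linux/wait.h>, they are not the real kernel definitions. */
#ifdef MOCK_KERNEL_GE_4_13	/* hypothetical switch for this sketch */
struct wait_queue_head { int lock; };
typedef struct wait_queue_head wait_queue_head_t;
#else				/* naming used before v4.13 */
struct __wait_queue_head { int lock; };
typedef struct __wait_queue_head wait_queue_head_t;
#endif

/* Hypothetical stand-in for the structure the patch extends. */
struct snapshot_mock {
	int in_progress;
	/* The typedef resolves with either naming scheme. */
	wait_queue_head_t in_progress_wait;
	/*
	 * "struct wait_queue_head in_progress_wait;" would be an
	 * incomplete type in the pre-v4.13 branch above.
	 */
};

/* Put both fields into a known state before first use. */
static void snapshot_mock_init(struct snapshot_mock *s)
{
	assert(s != NULL);
	s->in_progress = 0;
	s->in_progress_wait.lock = 0;
}
```

Using the typedef in the field declaration is what lets the same source
build against both naming schemes.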

Nikos.

>  Attached the changed patch.
> 
> Guru
> 
> On Sat, 12 Oct 2019 at 14:16, Guruswamy Basavaiah <guru2018 at gmail.com> wrote:
>>
>> Hello Nikos,
>>  I am having some issues in our set-up; I will try to get the results ASAP.
>> Guru
>>
>>
>> On Fri, 11 Oct 2019 at 17:47, Nikos Tsironis <ntsironis at arrikto.com> wrote:
>>>
>>> On 10/11/19 2:39 PM, Nikos Tsironis wrote:
>>>> On 10/11/19 1:17 PM, Guruswamy Basavaiah wrote:
>>>>> Hello Nikos,
>>>>>  Applied these patches and tested.
>>>>>  We still see hung_task_timeout back traces and the drbd Resync is blocked.
>>>>>  Attached the back trace, please let me know if you need any other information.
>>>>>
>>>>
>>>> Hi Guru,
>>>>
>>>> Can you provide more information about your setup? The output of
>>>> 'dmsetup table', 'dmsetup ls --tree' and the DRBD configuration would
>>>> help to get a better picture of your I/O stack.
>>>>
>>>> Also, is it possible to describe the test case you are running and
>>>> exactly what it does?
>>>>
>>>> Thanks,
>>>> Nikos
>>>>
>>>
>>> Hi Guru,
>>>
>>> I believe I found the mistake. The in_progress variable was never
>>> initialized to zero.
>>>
>>> I attach a new version of the second patch correcting this.
>>>
>>> Can you please test again with this patch?
>>>
>>> Thanks,
>>> Nikos
>>>
>>>>>  In patch "0002-dm-snapshot-rework-COW-throttling-to-fix-deadlock.patch"
>>>>> I changed "struct wait_queue_head" to "wait_queue_head_t", as I was
>>>>> getting a compilation error with the former.
>>>>>
>>>>> On Thu, 10 Oct 2019 at 17:33, Nikos Tsironis <ntsironis at arrikto.com> wrote:
>>>>>>
>>>>>> On 10/10/19 9:34 AM, Guruswamy Basavaiah wrote:
>>>>>>> Hello,
>>>>>>> We use 4.4.184 in our builds and the patch fails to apply.
>>>>>>> Is it possible to give a patch for 4.4.x branch ?
>>>>>> Hi Guru,
>>>>>>
>>>>>> I attach the two patches fixing the deadlock rebased on the 4.4.x branch.
>>>>>>
>>>>>> Nikos
>>>>>>
>>>>>>>
>>>>>>> patching Logs.
>>>>>>> patching file drivers/md/dm-snap.c
>>>>>>> Hunk #1 succeeded at 19 (offset 1 line).
>>>>>>> Hunk #2 succeeded at 105 (offset -1 lines).
>>>>>>> Hunk #3 succeeded at 157 (offset -4 lines).
>>>>>>> Hunk #4 succeeded at 1206 (offset -120 lines).
>>>>>>> Hunk #5 FAILED at 1508.
>>>>>>> Hunk #6 succeeded at 1412 (offset -124 lines).
>>>>>>> Hunk #7 succeeded at 1425 (offset -124 lines).
>>>>>>> Hunk #8 FAILED at 1925.
>>>>>>> Hunk #9 succeeded at 1866 with fuzz 2 (offset -255 lines).
>>>>>>> Hunk #10 succeeded at 2202 (offset -294 lines).
>>>>>>> Hunk #11 succeeded at 2332 (offset -294 lines).
>>>>>>> 2 out of 11 hunks FAILED -- saving rejects to file drivers/md/dm-snap.c.rej
>>>>>>>
>>>>>>> Guru
>>>>>>>
>>>>>>> On Thu, 10 Oct 2019 at 01:33, Guruswamy Basavaiah <guru2018 at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hello Mike,
>>>>>>>>  I will get the testing result before end of Thursday.
>>>>>>>> Guru
>>>>>>>>
>>>>>>>> On Wed, 9 Oct 2019 at 21:34, Mike Snitzer <snitzer at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Oct 09 2019 at 11:44am -0400,
>>>>>>>>> Nikos Tsironis <ntsironis at arrikto.com> wrote:
>>>>>>>>>
>>>>>>>>>> On 10/9/19 5:13 PM, Mike Snitzer wrote:
>>>>>>>>>>> On Tue, Oct 01 2019 at  8:43am -0400,
>>>>>>>>>>> Nikos Tsironis <ntsironis at arrikto.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
>>>>>>>>>>>>> Hello Nikos,
>>>>>>>>>>>>>  Yes, issue is consistently reproducible with us, in a particular
>>>>>>>>>>>>> set-up and test case.
>>>>>>>>>>>>>  I will get the access to set-up next week, will try to test and let
>>>>>>>>>>>>> you know the results before end of next week.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> That sounds great!
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>> Nikos
>>>>>>>>>>>
>>>>>>>>>>> Hi Guru,
>>>>>>>>>>>
>>>>>>>>>>> Any chance you could try this fix that I've staged to send to Linus?
>>>>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=633b1613b2a49304743c18314bb6e6465c21fd8a
>>>>>>>>>>>
>>>>>>>>>>> Short of that, Nikos: do you happen to have a test scenario that teases
>>>>>>>>>>> out this deadlock?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Mike,
>>>>>>>>>>
>>>>>>>>>> Yes,
>>>>>>>>>>
>>>>>>>>>> I created a 50G LV and took a snapshot of the same size:
>>>>>>>>>>
>>>>>>>>>>   lvcreate -n data-lv -L50G testvg
>>>>>>>>>>   lvcreate -n snap-lv -L50G -s testvg/data-lv
>>>>>>>>>>
>>>>>>>>>> Then I ran the following fio job:
>>>>>>>>>>
>>>>>>>>>> [global]
>>>>>>>>>> randrepeat=1
>>>>>>>>>> ioengine=libaio
>>>>>>>>>> bs=1M
>>>>>>>>>> size=6G
>>>>>>>>>> offset_increment=6G
>>>>>>>>>> numjobs=8
>>>>>>>>>> direct=1
>>>>>>>>>> iodepth=32
>>>>>>>>>> group_reporting
>>>>>>>>>> filename=/dev/testvg/data-lv
>>>>>>>>>>
>>>>>>>>>> [test]
>>>>>>>>>> rw=write
>>>>>>>>>> timeout=180
>>>>>>>>>>
>>>>>>>>>> , concurrently with the following script:
>>>>>>>>>>
>>>>>>>>>> lvcreate -n dummy-lv -L1G testvg
>>>>>>>>>>
>>>>>>>>>> while true
>>>>>>>>>> do
>>>>>>>>>>  lvcreate -n dummy-snap -L1M -s testvg/dummy-lv
>>>>>>>>>>  lvremove -f testvg/dummy-snap
>>>>>>>>>> done
>>>>>>>>>>
>>>>>>>>>> This reproduced the deadlock for me. I also ran 'echo 30 >
>>>>>>>>>> /proc/sys/kernel/hung_task_timeout_secs', to reduce the hung task
>>>>>>>>>> timeout.
>>>>>>>>>>
>>>>>>>>>> Nikos.
>>>>>>>>>
>>>>>>>>> Very nice, well done.  Curious if you've tested with the fix I've staged
>>>>>>>>> (see above)?  If so, does it resolve the deadlock?  If you've had
>>>>>>>>> success I'd be happy to update the tags in the commit header to include
>>>>>>>>> your Tested-by before sending it to Linus.  Also, any review of the
>>>>>>>>> patch that you can do would be appreciated, and a formal Reviewed-by
>>>>>>>>> reply would be welcomed and folded in too.
>>>>>>>>>
>>>>>>>>> Mike
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guruswamy Basavaiah
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>
>>
>>
>> --
>> Guruswamy Basavaiah
> 
> 
> 



