[dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes deadlock

Nikos Tsironis ntsironis at arrikto.com
Fri Oct 11 12:17:00 UTC 2019


On 10/11/19 2:39 PM, Nikos Tsironis wrote:
> On 10/11/19 1:17 PM, Guruswamy Basavaiah wrote:
>> Hello Nikos,
>>  Applied these patches and tested.
>>  We still see hung_task_timeout backtraces and the DRBD resync is blocked.
>>  Attached is the backtrace; please let me know if you need any other information.
>>
> 
> Hi Guru,
> 
> Can you provide more information about your setup? The output of
> 'dmsetup table', 'dmsetup ls --tree' and the DRBD configuration would
> help to get a better picture of your I/O stack.
> 
> Also, is it possible to describe the test case you are running and
> exactly what it does?
> 
> Thanks,
> Nikos
> 

Hi Guru,

I believe I found the mistake. The in_progress variable was never
initialized to zero.
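
For reference, the missing piece is in snapshot_ctr(): the new
throttling state has to be initialized together with the rest of the
snapshot. Roughly like this (a sketch only, surrounding code
abbreviated; the waitqueue field name follows the attached patch):

    static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
    {
            struct dm_snapshot *s;
            /* ... allocation and parsing of arguments ... */

            /*
             * COW throttling state: number of in-flight kcopyd jobs and
             * the waitqueue that throttled writers sleep on. The snapshot
             * struct is not zero-allocated, so without these two lines
             * in_progress starts with garbage and the accounting is bogus.
             */
            s->in_progress = 0;
            init_waitqueue_head(&s->in_progress_wait);

            /* ... rest of the constructor ... */
    }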

I'm attaching a new version of the second patch that corrects this.

Can you please test again with this patch?
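
Regarding the wait_queue_head change you mentioned: that is expected on
4.4.x. "struct wait_queue_head" only exists in newer kernels; on 4.4 the
waitqueue head is still the typedef, so your substitution should be
equivalent. For illustration (simplified):

    /* include/linux/wait.h on 4.4.x */
    typedef struct __wait_queue_head wait_queue_head_t;

    /* so the member in struct dm_snapshot becomes: */
    wait_queue_head_t in_progress_wait;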

Thanks,
Nikos

>>  In the patch "0002-dm-snapshot-rework-COW-throttling-to-fix-deadlock.patch"
>> I changed "struct wait_queue_head" to "wait_queue_head_t", as I was
>> getting a compilation error with the former.
>>
>> On Thu, 10 Oct 2019 at 17:33, Nikos Tsironis <ntsironis at arrikto.com> wrote:
>>>
>>> On 10/10/19 9:34 AM, Guruswamy Basavaiah wrote:
>>>> Hello,
>>>> We use 4.4.184 in our builds and the patch fails to apply.
>>>> Is it possible to provide a patch for the 4.4.x branch?
>>> Hi Guru,
>>>
>>> I have attached the two patches fixing the deadlock, rebased on the 4.4.x branch.
>>>
>>> Nikos
>>>
>>>>
>>>> Patching log:
>>>> patching file drivers/md/dm-snap.c
>>>> Hunk #1 succeeded at 19 (offset 1 line).
>>>> Hunk #2 succeeded at 105 (offset -1 lines).
>>>> Hunk #3 succeeded at 157 (offset -4 lines).
>>>> Hunk #4 succeeded at 1206 (offset -120 lines).
>>>> Hunk #5 FAILED at 1508.
>>>> Hunk #6 succeeded at 1412 (offset -124 lines).
>>>> Hunk #7 succeeded at 1425 (offset -124 lines).
>>>> Hunk #8 FAILED at 1925.
>>>> Hunk #9 succeeded at 1866 with fuzz 2 (offset -255 lines).
>>>> Hunk #10 succeeded at 2202 (offset -294 lines).
>>>> Hunk #11 succeeded at 2332 (offset -294 lines).
>>>> 2 out of 11 hunks FAILED -- saving rejects to file drivers/md/dm-snap.c.rej
>>>>
>>>> Guru
>>>>
>>>> On Thu, 10 Oct 2019 at 01:33, Guruswamy Basavaiah <guru2018 at gmail.com> wrote:
>>>>>
>>>>> Hello Mike,
>>>>>  I will get the testing results before the end of Thursday.
>>>>> Guru
>>>>>
>>>>> On Wed, 9 Oct 2019 at 21:34, Mike Snitzer <snitzer at redhat.com> wrote:
>>>>>>
>>>>>> On Wed, Oct 09 2019 at 11:44am -0400,
>>>>>> Nikos Tsironis <ntsironis at arrikto.com> wrote:
>>>>>>
>>>>>>> On 10/9/19 5:13 PM, Mike Snitzer wrote:
>>>>>>>> On Tue, Oct 01 2019 at  8:43am -0400,
>>>>>>>> Nikos Tsironis <ntsironis at arrikto.com> wrote:
>>>>>>>>
>>>>>>>>> On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
>>>>>>>>>> Hello Nikos,
>>>>>>>>>>  Yes, the issue is consistently reproducible for us, in a particular
>>>>>>>>>> setup and test case.
>>>>>>>>>>  I will get access to the setup next week, and will try to test and let
>>>>>>>>>> you know the results before the end of next week.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That sounds great!
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> Nikos
>>>>>>>>
>>>>>>>> Hi Guru,
>>>>>>>>
>>>>>>>> Any chance you could try this fix that I've staged to send to Linus?
>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=633b1613b2a49304743c18314bb6e6465c21fd8a
>>>>>>>>
>>>>>>>> Short of that, Nikos: do you happen to have a test scenario that teases
>>>>>>>> out this deadlock?
>>>>>>>>
>>>>>>>
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> Yes,
>>>>>>>
>>>>>>> I created a 50G LV and took a snapshot of the same size:
>>>>>>>
>>>>>>>   lvcreate -n data-lv -L50G testvg
>>>>>>>   lvcreate -n snap-lv -L50G -s testvg/data-lv
>>>>>>>
>>>>>>> Then I ran the following fio job:
>>>>>>>
>>>>>>> [global]
>>>>>>> randrepeat=1
>>>>>>> ioengine=libaio
>>>>>>> bs=1M
>>>>>>> size=6G
>>>>>>> offset_increment=6G
>>>>>>> numjobs=8
>>>>>>> direct=1
>>>>>>> iodepth=32
>>>>>>> group_reporting
>>>>>>> filename=/dev/testvg/data-lv
>>>>>>>
>>>>>>> [test]
>>>>>>> rw=write
>>>>>>> timeout=180
>>>>>>>
>>>>>>> , concurrently with the following script:
>>>>>>>
>>>>>>> lvcreate -n dummy-lv -L1G testvg
>>>>>>>
>>>>>>> while true
>>>>>>> do
>>>>>>>  lvcreate -n dummy-snap -L1M -s testvg/dummy-lv
>>>>>>>  lvremove -f testvg/dummy-snap
>>>>>>> done
>>>>>>>
>>>>>>> This reproduced the deadlock for me. I also ran 'echo 30 >
>>>>>>> /proc/sys/kernel/hung_task_timeout_secs', to reduce the hung task
>>>>>>> timeout.
>>>>>>>
>>>>>>> Nikos.
>>>>>>
>>>>>> Very nice, well done.  Curious if you've tested with the fix I've staged
>>>>>> (see above)?  If so, does it resolve the deadlock?  If you've had
>>>>>> success I'd be happy to update the tags in the commit header to include
>>>>>> your Tested-by before sending it to Linus.  Also, any review of the
>>>>>> patch that you can do would be appreciated, and a formal
>>>>>> Reviewed-by reply would be welcomed and folded in too.
>>>>>>
>>>>>> Mike
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Guruswamy Basavaiah
>>>>
>>>>
>>>>
>>
>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-dm-snapshot-rework-COW-throttling-to-fix-deadlock.patch
Type: text/x-patch
Size: 7912 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20191011/39536853/attachment.bin>

