[dm-devel] Fix "dm kcopyd: Fix bug causing workqueue stalls" causes dead lock

Guruswamy Basavaiah guru2018 at gmail.com
Sat Oct 12 08:46:02 UTC 2019


Hello Nikos,
 I am having some issues with our set-up; I will try to get the results ASAP.
Guru


On Fri, 11 Oct 2019 at 17:47, Nikos Tsironis <ntsironis at arrikto.com> wrote:
>
> On 10/11/19 2:39 PM, Nikos Tsironis wrote:
> > On 10/11/19 1:17 PM, Guruswamy Basavaiah wrote:
> >> Hello Nikos,
> >>  Applied these patches and tested.
> >>  We still see hung_task_timeout back traces and the drbd Resync is blocked.
> >>  Attached the back trace, please let me know if you need any other information.
> >>
> >
> > Hi Guru,
> >
> > Can you provide more information about your setup? The output of
> > 'dmsetup table', 'dmsetup ls --tree' and the DRBD configuration would
> > help to get a better picture of your I/O stack.
> >
> > Also, is it possible to describe the test case you are running and
> > exactly what it does?
> >
> > Thanks,
> > Nikos
> >
>
> Hi Guru,
>
> I believe I found the mistake. The in_progress variable was never
> initialized to zero.
>
> I attach a new version of the second patch correcting this.
>
> Can you please test again with this patch?
>
> Thanks,
> Nikos
>
> >>  In patch "0002-dm-snapshot-rework-COW-throttling-to-fix-deadlock.patch"
> >> I changed "struct wait_queue_head" to "wait_queue_head_t", as I was
> >> getting a compilation error with the former.
> >>
> >> On Thu, 10 Oct 2019 at 17:33, Nikos Tsironis <ntsironis at arrikto.com> wrote:
> >>>
> >>> On 10/10/19 9:34 AM, Guruswamy Basavaiah wrote:
> >>>> Hello,
> >>>> We use 4.4.184 in our builds and the patch fails to apply.
> >>>> Is it possible to provide a patch for the 4.4.x branch?
> >>> Hi Guru,
> >>>
> >>> I attach the two patches fixing the deadlock rebased on the 4.4.x branch.
> >>>
> >>> Nikos
> >>>
> >>>>
> >>>> patching Logs.
> >>>> patching file drivers/md/dm-snap.c
> >>>> Hunk #1 succeeded at 19 (offset 1 line).
> >>>> Hunk #2 succeeded at 105 (offset -1 lines).
> >>>> Hunk #3 succeeded at 157 (offset -4 lines).
> >>>> Hunk #4 succeeded at 1206 (offset -120 lines).
> >>>> Hunk #5 FAILED at 1508.
> >>>> Hunk #6 succeeded at 1412 (offset -124 lines).
> >>>> Hunk #7 succeeded at 1425 (offset -124 lines).
> >>>> Hunk #8 FAILED at 1925.
> >>>> Hunk #9 succeeded at 1866 with fuzz 2 (offset -255 lines).
> >>>> Hunk #10 succeeded at 2202 (offset -294 lines).
> >>>> Hunk #11 succeeded at 2332 (offset -294 lines).
> >>>> 2 out of 11 hunks FAILED -- saving rejects to file drivers/md/dm-snap.c.rej
> >>>>
> >>>> Guru
> >>>>
> >>>> On Thu, 10 Oct 2019 at 01:33, Guruswamy Basavaiah <guru2018 at gmail.com> wrote:
> >>>>>
> >>>>> Hello Mike,
> >>>>>  I will get the testing result before end of Thursday.
> >>>>> Guru
> >>>>>
> >>>>> On Wed, 9 Oct 2019 at 21:34, Mike Snitzer <snitzer at redhat.com> wrote:
> >>>>>>
> >>>>>> On Wed, Oct 09 2019 at 11:44am -0400,
> >>>>>> Nikos Tsironis <ntsironis at arrikto.com> wrote:
> >>>>>>
> >>>>>>> On 10/9/19 5:13 PM, Mike Snitzer wrote:
> >>>>>>>> On Tue, Oct 01 2019 at  8:43am -0400,
> >>>>>>>> Nikos Tsironis <ntsironis at arrikto.com> wrote:
> >>>>>>>>
> >>>>>>>>> On 10/1/19 3:27 PM, Guruswamy Basavaiah wrote:
> >>>>>>>>>> Hello Nikos,
> >>>>>>>>>>  Yes, the issue is consistently reproducible for us, in a particular
> >>>>>>>>>> set-up and test case.
> >>>>>>>>>>  I will get access to the set-up next week, and will try to test and let
> >>>>>>>>>> you know the results before the end of next week.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> That sounds great!
> >>>>>>>>>
> >>>>>>>>> Thanks a lot,
> >>>>>>>>> Nikos
> >>>>>>>>
> >>>>>>>> Hi Guru,
> >>>>>>>>
> >>>>>>>> Any chance you could try this fix that I've staged to send to Linus?
> >>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=633b1613b2a49304743c18314bb6e6465c21fd8a
> >>>>>>>>
> >>>>>>>> Short of that, Nikos: do you happen to have a test scenario that teases
> >>>>>>>> out this deadlock?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi Mike,
> >>>>>>>
> >>>>>>> Yes,
> >>>>>>>
> >>>>>>> I created a 50G LV and took a snapshot of the same size:
> >>>>>>>
> >>>>>>>   lvcreate -n data-lv -L50G testvg
> >>>>>>>   lvcreate -n snap-lv -L50G -s testvg/data-lv
> >>>>>>>
> >>>>>>> Then I ran the following fio job:
> >>>>>>>
> >>>>>>> [global]
> >>>>>>> randrepeat=1
> >>>>>>> ioengine=libaio
> >>>>>>> bs=1M
> >>>>>>> size=6G
> >>>>>>> offset_increment=6G
> >>>>>>> numjobs=8
> >>>>>>> direct=1
> >>>>>>> iodepth=32
> >>>>>>> group_reporting
> >>>>>>> filename=/dev/testvg/data-lv
> >>>>>>>
> >>>>>>> [test]
> >>>>>>> rw=write
> >>>>>>> timeout=180
> >>>>>>>
> >>>>>>> , concurrently with the following script:
> >>>>>>>
> >>>>>>> lvcreate -n dummy-lv -L1G testvg
> >>>>>>>
> >>>>>>> while true
> >>>>>>> do
> >>>>>>>  lvcreate -n dummy-snap -L1M -s testvg/dummy-lv
> >>>>>>>  lvremove -f testvg/dummy-snap
> >>>>>>> done
> >>>>>>>
> >>>>>>> This reproduced the deadlock for me. I also ran 'echo 30 >
> >>>>>>> /proc/sys/kernel/hung_task_timeout_secs', to reduce the hung task
> >>>>>>> timeout.
> >>>>>>>
> >>>>>>> Nikos.
> >>>>>>
> >>>>>> Very nice, well done.  Curious if you've tested with the fix I've staged
> >>>>>> (see above)?  If so, does it resolve the deadlock?  If you've had
> >>>>>> success I'd be happy to update the tags in the commit header to include
> >>>>>> your Tested-by before sending it to Linus.  Also, any review of the
> >>>>>> patch that you can do would be appreciated, and a formal
> >>>>>> Reviewed-by reply would be welcomed and folded in too.
> >>>>>>
> >>>>>> Mike
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Guruswamy Basavaiah
> >>>>
> >>>>
> >>>>
> >>
> >>
> >>



--
Guruswamy Basavaiah
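
For anyone re-running the test, the reproduction steps quoted above (the
fio job, the snapshot create/remove loop and the shortened hung task
timeout) could be collected into one script along the following lines.
This is only a sketch under assumptions: DRY_RUN, VG, ITERATIONS and the
helper functions run, write_fio_job, setup, snapshot_loop and main are
illustrative names that do not appear in the thread, the job file path is
arbitrary, and the loop is bounded here whereas the original runs forever.
By default it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the reproducer quoted in this thread. All variable and
# function names are illustrative assumptions, not part of the thread.
VG=${VG:-testvg}
DRY_RUN=${DRY_RUN:-1}        # 1: only print the commands (default)

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Emit the fio job file from the quoted message.
write_fio_job() {
    cat <<EOF
[global]
randrepeat=1
ioengine=libaio
bs=1M
size=6G
offset_increment=6G
numjobs=8
direct=1
iodepth=32
group_reporting
filename=/dev/$VG/data-lv

[test]
rw=write
timeout=180
EOF
}

# Create the 50G origin, its snapshot and the 1G dummy volume.
setup() {
    run lvcreate -n data-lv -L50G "$VG"
    run lvcreate -n snap-lv -L50G -s "$VG/data-lv"
    run lvcreate -n dummy-lv -L1G "$VG"
}

# Create and remove a small snapshot $1 times.
# (The loop in the quoted message is unbounded.)
snapshot_loop() {
    i=0
    while [ "$i" -lt "$1" ]; do
        run lvcreate -n dummy-snap -L1M -s "$VG/dummy-lv"
        run lvremove -f "$VG/dummy-snap"
        i=$((i + 1))
    done
}

main() {
    job=${TMPDIR:-/tmp}/cow-deadlock.fio    # arbitrary path
    write_fio_job > "$job"
    run sh -c 'echo 30 > /proc/sys/kernel/hung_task_timeout_secs'
    setup
    run fio "$job" &                        # fio runs concurrently with the loop
    snapshot_loop "${ITERATIONS:-3}"
    wait
}

main
```

Running it as is prints the plan; setting DRY_RUN=0 and a larger ITERATIONS
(as root, on a scratch volume group) would execute the commands for real.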
