[dm-devel] I/O block when removing thin device on the same pool

Fri Jan 22 13:38:28 UTC 2016

On Thu, Jan 21, 2016 at 02:44:06PM -0500, Mike Snitzer wrote:
> > > On 20.1.2016 at 11:05, Dennis Yang wrote:
> > >>
> > >> Hi,
> > >>
> > >> I noticed that I/O requests to one thin device are blocked
> > >> while another thin device is being deleted. The root cause is
> > >> that deleting a thin device eventually calls dm_btree_del(),
> > >> which is slow and can block. This means that the delete
> > >> process has to hold the pool lock for a very long time
> > >> while this function deletes the whole data mapping subtree.
> > >> Since I/O to devices on the same pool needs to hold the same pool
> > >> lock to look up/insert/delete data mappings, all I/O is blocked
> > >> until the delete process finishes.
> > >>
> > >> For now, I have to discard all the mappings of a thin device before
> > >> deleting it to prevent I/O from being blocked. Since these discard
> > >> requests not only take a long time to finish but also hurt the pool
> > >> I/O throughput, I am still looking for a better solution to this
> > >> issue.
> > >>
> > >> I think the main problem is still the big pool lock in dm-thin, which
> > >> hurts both scalability and performance. I am wondering if there
> > >> is any plan to improve this, or any better fix for the I/O block
> > >> problem.
> 
> Just so I'm aware: which kernel are you using?
> 
> dm_pool_delete_thin_device() takes pmd->root_lock, so yes, it is very
> coarse-grained; especially when you consider that concurrent IO to
> another thin device from the same pool will call interfaces, like
> dm_thin_find_block(), which also take the same pmd->root_lock.

We have seen lvremove of thin snapshots sometimes take minutes,
even ~20 minutes.
So that means blocking IO to other devices in that pool
(e.g. the "origin", which is typically in use) for minutes.

That was, iirc, with ~10 TB origin, mostly allocated,
tens of "rotating" snapshots, 64k chunk size,
and considerable random write change rate on the origin.
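
To put a rough number on that: a back-of-the-envelope count of how many
chunk mappings a device of that size can have. This assumes binary units
(TiB, KiB) and a fully allocated device, neither of which I have checked
against the actual setup. How many of those a snapshot delete really has
to touch depends on how much is shared with the origin, which this does
not model.

```python
# Upper bound on data mappings for one thin device.
# Assumptions (not measured): 10 TiB virtual size, 64 KiB chunks,
# every chunk allocated.
KIB = 1024
TIB = 1024 ** 4

origin_bytes = 10 * TIB
chunk_bytes = 64 * KIB

mappings = origin_bytes // chunk_bytes
print(mappings)  # 167772160, i.e. roughly 168 million mappings
```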

I'd like to propose a different approach for lvremove of thin devices
(using "made up terms" instead of the correct device mapper vocabulary,
because I'm lazy):
on lvremove of a thin device, take all the locks you need,
even if that implies blocking IO to other devices,
BUT
then don't do all the "delete" right there while holding those
locks. Instead, convert the device into an "i-am-currently-removing-myself"
target, and release all the locks. That should be fast (enough).

Then this "i-am-currently-removing-myself" target would have its .open()
return some error, so it cannot even be opened anymore (or something
with similar effect), and it would start some kernel thread that does the
actual "wipe" and "unref/unmap" from the tree and all that stuff "in the
background", using much finer-grained, temporary locking for each
processed region.

If that then takes 20 minutes, someone may still care, but at least it
does not block IO to the other active devices in the pool.
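
To make the idea a bit more concrete, here is a userspace toy model in
Python. It is only a sketch of the locking pattern, not dm-thin code:
the Pool class, remove_async(), _reap() and the batch size are all made
up for illustration, a dict stands in for the mapping btree, and a
single threading.Lock stands in for pmd->root_lock. The point is that
the remove call only flips a flag under the lock, and the background
worker re-takes the lock once per batch, so lookups on other devices
can get in between batches.

```python
import threading


class Pool:
    """Toy stand-in for a thin pool with one coarse lock (hypothetical)."""

    def __init__(self):
        self.root_lock = threading.Lock()  # plays the role of pmd->root_lock
        self.devices = {}                  # dev_id -> {virtual block: data block}
        self.removing = set()              # devices in "removing myself" state

    def create(self, dev_id, mappings):
        with self.root_lock:
            self.devices[dev_id] = dict(mappings)

    def _check_usable(self, dev_id):
        # Caller must hold root_lock.
        if dev_id in self.removing or dev_id not in self.devices:
            raise OSError("device is gone or being removed")

    def open(self, dev_id):
        with self.root_lock:
            self._check_usable(dev_id)
            return dev_id

    def find_block(self, dev_id, vblock):
        with self.root_lock:
            self._check_usable(dev_id)
            return self.devices[dev_id].get(vblock)

    def remove_async(self, dev_id, batch=1024):
        # Fast path: only mark the device while holding the lock.
        with self.root_lock:
            self._check_usable(dev_id)
            self.removing.add(dev_id)
        worker = threading.Thread(target=self._reap, args=(dev_id, batch))
        worker.start()
        return worker

    def _reap(self, dev_id, batch):
        # Slow path: drop mappings in small batches, releasing the
        # lock between batches so other devices can make progress.
        while True:
            with self.root_lock:
                mappings = self.devices[dev_id]
                for _ in range(min(batch, len(mappings))):
                    mappings.popitem()
                if not mappings:
                    del self.devices[dev_id]
                    self.removing.discard(dev_id)
                    return
```

With something like this, remove_async() returns right away, open() on
the device being removed fails immediately, and find_block() on any
other device in the pool waits for at most one batch. A real
implementation would of course also have to keep that intermediate state
in the on-disk metadata so that it survives a crash, which this toy
ignores completely.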

Or is something like this already going on?

	Lars Ellenberg