[dm-devel] [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations

NeilBrown neilb at suse.com
Tue Sep 5 23:43:28 UTC 2017


On Mon, Sep 04 2017, Mikulas Patocka wrote:

> On Mon, 4 Sep 2017, NeilBrown wrote:
>
>> On Fri, Sep 01 2017, Mikulas Patocka wrote:
>> 
>> > On Fri, 1 Sep 2017, NeilBrown wrote:
>> >
>> >> On Thu, Aug 31 2017, Mikulas Patocka wrote:
>> >> 
>> >> >> 
>> >> >> Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
>> >> >> contains bios that were scheduled *before* the current one started, so
>> >> >
>> >> > These bios need to be offloaded too, otherwise you re-introduce this 
>> >> > deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
>> >> 
>> >> Thanks for the link.
>> >> In the example the bio that is stuck was created in step 4.  The call
>> >> to generic_make_request() will have placed it on current->bio_list[0].
>> >> The top-level generic_make_request call by Process A is still running,
>> >> so nothing will have moved the bio to ->bio_list[1].  That only happens
>> >> after the ->make_request_fn completes, which must be after step 7.
>> >> So the problem bio is on ->bio_list[0] and the code in my patch will
>> >> pass it to a workqueue for handling.
>> >> 
>> >> So I don't think the deadlock would be reintroduced.  Can you see
>> >> something that I am missing?
>> >> 
>> >> Thanks,
>> >> NeilBrown
>> >
>> > Offloading only current->bio_list[0] will work in a simple case described 
>> > above, but not in the general case.
>> >
>> > For example, suppose that we have a dm device where the first part is 
>> > linear and the second part is snapshot.
>> >
>> > * We receive bio A that crosses the linear/snapshot boundary
>> > * DM core creates bio B, submits it to the linear target and adds it to 
>> > current->bio_list[0]
>> > * DM core creates bio C, submits it to the snapshot target, the snapshot 
>> > target calls track_chunk for this bio and appends the bio to 
>> > current->bio_list[0]
>> > * Now, we return back to generic_make_request
>> > * We pop bio B from current->bio_list[0]
>> > * We execute bio_list_on_stack[1] = bio_list_on_stack[0], which moves bio C 
>> > to bio_list_on_stack[1] - and now we lose any possibility of offloading bio C 
>> > to the rescue thread
>> > * The kcopyd thread for the snapshot takes the snapshot lock and waits for 
>> > bio C to finish
>> 
>> Thanks for the explanation.
>> I cannot find this last step in the code.  "waits for BIO C to finish"
>> is presumably the call to __check_for_conflicting_io().  This is
>> called from kcopyd in merge_callback() -> snapshot_merge_next_chunks().
>> What lock is held at that time?
>
> The function pending_complete is called from the kcopyd callback. It takes 
> "down_write(&s->lock)" and calls __check_for_conflicting_io which waits 
> for I/Os to finish.
>
>> > * We process bio B - and if processing bio B reaches something that takes 
>> > the snapshot lock (for example an origin target for the snapshot), a 
>> > deadlock will happen. The deadlock could be avoided by offloading bio C to 
>> > the rescue thread, but bio C is already on bio_list_on_stack[1] and so it 
>> > won't be offloaded
>> 
>> I think the core issue behind this deadlock is that the same volume
>> can appear twice in the top-level device, in different regions.  An
>> outstanding request to one region can then interfere with a new request
>> to a different region.  That seems like a reasonable thing to do.
>> I cannot immediately see a generic way to handle this case that I am
>> happy with.  I'll keep hunting.
>
> You can have a snapshot and an origin for the same device in the same 
> tree - it is not common, but it is possible.
>
> Offloading bios queued on current->bio_list avoids the deadlock, but your 
> patch breaks it by offloading only the first list and not the second.
>

Thanks for the extra explanation.  I've thought some more about this and
I can see a way forward that I am comfortable with.  Please let me know
what you think.
I haven't tested this patch yet, but I believe that applying it first will
clear the way for my other patch to work without reintroducing
deadlocks.

Thanks,
NeilBrown


From: NeilBrown <neilb at suse.com>
Date: Wed, 6 Sep 2017 09:14:52 +1000
Subject: [PATCH] dm: ensure bio submission follows a depth-first tree walk.

A dm device can, in general, represent a tree of targets,
each of which handles a sub-range of the range of blocks handled
by the parent.

The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner.  Each call to a
make_request_fn() may submit bios to a single member device, and may
resubmit bios for a reduced region of the same device that the
make_request_fn() was called for.

In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.

This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device.  blk_queue_split() shows the
canonical approach.

dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available.  It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bio gets stuck behind the other in the queues
managed by generic_make_request().  This requires the 'rescue'
functionality provided by dm_offload*.

Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach.  That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately.  This delays
the handling until the first part is completely processed, so the
deadlock problems do not occur.

__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work().  When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request, current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.

To provide the clone bio when splitting, we use q->bio_split.  This
bioset was previously being freed to avoid having excess rescuer
threads.  As bio_split biosets no longer create rescuer threads, there
is little cost and much gain from leaving the ->bio_split bioset in
place.

Signed-off-by: NeilBrown <neilb at suse.com>
---
 drivers/md/dm.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d669fddd9290..81395f7550c0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1007,7 +1007,7 @@ static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
- * allowed for all bio types except REQ_PREFLUSH.
+ * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
  *
  * dm_accept_partial_bio informs the dm that the target only wants to process
  * additional n_sectors sectors of the bio and the rest of the data should be
@@ -1508,8 +1508,21 @@ static void __split_and_process_bio(struct mapped_device *md,
 	} else {
 		ci.bio = bio;
 		ci.sector_count = bio_sectors(bio);
-		while (ci.sector_count && !error)
+		while (ci.sector_count && !error) {
 			error = __split_and_process_non_flush(&ci);
+			if (current->bio_list && ci.sector_count && !error) {
+				/* Remainder must be passed to generic_make_request()
+				 * so that it gets handled *after* bios already submitted
+				 * have been completely processed.
+				 */
+				struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
+								 md->queue->bio_split);
+				bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
+				bio_chain(b, bio);
+				generic_make_request(b);
+				break;
+			}
+		}
 	}
 
 	/* drop the extra reference count */
@@ -1520,8 +1533,8 @@ static void __split_and_process_bio(struct mapped_device *md,
  *---------------------------------------------------------------*/
 
 /*
- * The request function that just remaps the bio built up by
- * dm_merge_bvec.
+ * The request function that remaps the bio to one target and
+ * splits off any remainder.
  */
 static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
 {
@@ -2056,12 +2069,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 	case DM_TYPE_DAX_BIO_BASED:
 		dm_init_normal_md_queue(md);
 		blk_queue_make_request(md->queue, dm_make_request);
-		/*
-		 * DM handles splitting bios as needed.  Free the bio_split bioset
-		 * since it won't be used (saves 1 process per bio-based DM device).
-		 */
-		bioset_free(md->queue->bio_split);
-		md->queue->bio_split = NULL;
 
 		if (type == DM_TYPE_DAX_BIO_BASED)
 			queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
-- 
2.14.0
