[dm-devel] [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
NeilBrown
neilb at suse.com
Tue Sep 5 23:43:28 UTC 2017
On Mon, Sep 04 2017, Mikulas Patocka wrote:
> On Mon, 4 Sep 2017, NeilBrown wrote:
>
>> On Fri, Sep 01 2017, Mikulas Patocka wrote:
>>
>> > On Fri, 1 Sep 2017, NeilBrown wrote:
>> >
>> >> On Thu, Aug 31 2017, Mikulas Patocka wrote:
>> >>
>> >> >>
>> >> >> Note that only current->bio_list[0] is offloaded. current->bio_list[1]
>> >> >> contains bios that were scheduled *before* the current one started, so
>> >> >
>> >> > These bios need to be offloaded too, otherwise you re-introduce this
>> >> > deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
>> >>
>> >> Thanks for the link.
>> >> In the example the bio that is stuck was created in step 4. The call
>> >> to generic_make_request() will have placed it on current->bio_list[0].
>> >> The top-level generic_make_request call by Process A is still running,
>> >> so nothing will have moved the bio to ->bio_list[1]. That only happens
>> >> after the ->make_request_fn completes, which must be after step 7.
>> >> So the problem bio is on ->bio_list[0] and the code in my patch will
>> >> pass it to a workqueue for handling.
>> >>
>> >> So I don't think the deadlock would be reintroduced. Can you see
>> >> something that I am missing?
>> >>
>> >> Thanks,
>> >> NeilBrown
>> >
>> > Offloading only current->bio_list[0] will work in a simple case described
>> > above, but not in the general case.
>> >
>> > For example, suppose that we have a dm device where the first part is
>> > linear and the second part is snapshot.
>> >
>> > * We receive bio A that crosses the linear/snapshot boundary
>> > * DM core creates bio B, submits it to the linear target and adds it to
>> > current->bio_list[0]
>> > * DM core creates bio C, submits it to the snapshot target, the snapshot
>> > target calls track_chunk for this bio and appends the bio to
>> > current->bio_list[0]
>> > * Now, we return back to generic_make_request
>> > * We pop bio B from current->bio_list[0]
>> > * We execute bio_list_on_stack[1] = bio_list_on_stack[0], that moves bio C
>> > to bio_list_on_stack[1] - and now we lose any possibility to offload bio C
>> > to the rescue thread
>> > * The kcopyd thread for the snapshot takes the snapshot lock and waits for
>> > bio C to finish
>>
>> Thanks for the explanation.
>> I cannot find this last step in the code. "waits for BIO C to finish"
>> is presumably the call to __check_for_conflicting_io(). This is
>> called from kcopyd in merge_callback() -> snapshot_merge_next_chunks().
>> What lock is held at that time?
>
> The function pending_complete is called from the kcopyd callback. It takes
> "down_write(&s->lock)" and calls __check_for_conflicting_io which waits
> for I/Os to finish.
>
>> > * We process bio B - and if processing bio B reaches something that takes
>> > the snapshot lock (for example an origin target for the snapshot), a
>> > deadlock will happen. The deadlock could be avoided by offloading bio C to
>> > the rescue thread, but bio C is already on bio_list_on_stack[1] and so it
>> > won't be offloaded
>>
>> I think the core issue behind this deadlock is that the same volume
>> can appear twice in the top-level device, in different regions. An
>> outstanding request to one region can then interfere with a new request
>> to a different region. That seems like a reasonable thing to do.
>> I cannot immediately see a generic way to handle this case that I am
>> happy with. I'll keep hunting.
>
> You can have a snapshot and an origin for the same device in the same
> tree - it is not common, but it is possible.
>
> Offloading bios queued on current->bio_list avoids the deadlock, but your
> patch breaks it by offloading only the first list and not the second.
>
Thanks for the extra explanation. I've thought some more about this and I
can see a way forward that I am comfortable with. Please let me know what
you think.
I haven't tested this patch yet, but I believe that applying it first will
clear the way for my other patch to work without reintroducing
deadlocks.
Thanks,
NeilBrown
From: NeilBrown <neilb at suse.com>
Date: Wed, 6 Sep 2017 09:14:52 +1000
Subject: [PATCH] dm: ensure bio submission follows a depth-first tree walk.
A dm device can, in general, represent a tree of targets,
each of which handles a sub-range of the range of blocks handled
by the parent.
The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner. Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.
In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.
This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. blk_queue_split() shows the
canonical approach.
dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available. It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bio gets stuck behind the other in the queues
managed by generic_make_request(). This requires the 'rescue'
functionality provided by dm_offload*.
Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach. That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately. This delays
the handling until the first part is completely processed, so the
deadlock problems do not occur.
__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work(). When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request, current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.
To provide the clone bio when splitting, we use q->bio_split. This
was previously being freed to avoid having excess rescuer threads. As
bio_split bio sets no longer create rescuer threads, there is little
cost and much gain from leaving the ->bio_split bio set in place.
Signed-off-by: NeilBrown <neilb at suse.com>
---
drivers/md/dm.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d669fddd9290..81395f7550c0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1007,7 +1007,7 @@ static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
- * allowed for all bio types except REQ_PREFLUSH.
+ * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
*
* dm_accept_partial_bio informs the dm that the target only wants to process
* additional n_sectors sectors of the bio and the rest of the data should be
@@ -1508,8 +1508,21 @@ static void __split_and_process_bio(struct mapped_device *md,
} else {
ci.bio = bio;
ci.sector_count = bio_sectors(bio);
- while (ci.sector_count && !error)
+ while (ci.sector_count && !error) {
error = __split_and_process_non_flush(&ci);
+ if (current->bio_list && ci.sector_count && !error) {
+ /* Remainder must be passed to generic_make_request()
+ * so that it gets handled *after* bios already submitted
+ * have been completely processed.
+ */
+ struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
+ md->queue->bio_split);
+ bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
+ bio_chain(b, bio);
+ generic_make_request(b);
+ break;
+ }
+ }
}
/* drop the extra reference count */
@@ -1520,8 +1533,8 @@ static void __split_and_process_bio(struct mapped_device *md,
*---------------------------------------------------------------*/
/*
- * The request function that just remaps the bio built up by
- * dm_merge_bvec.
+ * The request function that remaps the bio to one target and
+ * splits off any remainder.
*/
static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
{
@@ -2056,12 +2069,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
case DM_TYPE_DAX_BIO_BASED:
dm_init_normal_md_queue(md);
blk_queue_make_request(md->queue, dm_make_request);
- /*
- * DM handles splitting bios as needed. Free the bio_split bioset
- * since it won't be used (saves 1 process per bio-based DM device).
- */
- bioset_free(md->queue->bio_split);
- md->queue->bio_split = NULL;
if (type == DM_TYPE_DAX_BIO_BASED)
queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
--
2.14.0