[dm-devel] [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations

Mon Sep 4 16:54:44 UTC 2017

On Mon, 4 Sep 2017, NeilBrown wrote:

> On Fri, Sep 01 2017, Mikulas Patocka wrote:
> 
> > On Fri, 1 Sep 2017, NeilBrown wrote:
> >
> >> On Thu, Aug 31 2017, Mikulas Patocka wrote:
> >> 
> >> >> 
> >> >> Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> >> >> contains bios that were scheduled *before* the current one started, so
> >> >
> >> > These bios need to be offloaded too, otherwise you re-introduce this 
> >> > deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
> >> 
> >> Thanks for the link.
> >> In the example the bio that is stuck was created in step 4.  The call
> >> to generic_make_request() will have placed it on current->bio_list[0].
> >> The top-level generic_make_request call by Process A is still running,
> >> so nothing will have moved the bio to ->bio_list[1].  That only happens
> >> after the ->make_request_fn completes, which must be after step 7.
> >> So the problem bio is on ->bio_list[0] and the code in my patch will
> >> pass it to a workqueue for handling.
> >> 
> >> So I don't think the deadlock would be reintroduced.  Can you see
> >> something that I am missing?
> >> 
> >> Thanks,
> >> NeilBrown
> >
> > Offloading only current->bio_list[0] will work in a simple case described 
> > above, but not in the general case.
> >
> > For example, suppose that we have a dm device where the first part is 
> > linear and the second part is snapshot.
> >
> > * We receive bio A that crosses the linear/snapshot boundary
> > * DM core creates bio B, submits it to the linear target and adds it to 
> > current->bio_list[0]
> > * DM core creates bio C, submits it to the snapshot target, the snapshot 
> > target calls track_chunk for this bio and appends the bio to 
> > current->bio_list[0]
> > * Now, we return back to generic_make_request
> > * We pop bio B from current->bio_list[0]
> > * We execute bio_list_on_stack[1] = bio_list_on_stack[0], that moves bio C 
> > to bio_list_on_stack[1] - and now we lose any possibility to offload bio C 
> > to the rescue thread
> > * The kcopyd thread for the snapshot takes the snapshot lock and waits for 
> > bio C to finish
> 
> Thanks for the explanation.
> I cannot find this last step in the code.  "waits for BIO C to finish"
> is presumably the called to __check_for_conflicting_io().  This is
> called from kcopyd in merge_callback() -> snapshot_merge_next_chunks().
> What lock is held at that time?

The function pending_complete is called from the kcopyd callback. It takes 
"down_write(&s->lock)" and calls __check_for_conflicting_io which waits 
for I/Os to finish.

> > * We process bio B - and if processing bio B reaches something that takes 
> > the snapshot lock (for example an origin target for the snapshot), a 
> > deadlock will happen. The deadlock could be avoided by offloading bio C to 
> > the rescue thread, but bio C is already on bio_list_on_stack[1] and so it 
> > won't be offloaded
> 
> I think the core issue behind this deadlock is that the same volume
> can appear twice in the top-level device, in different regions.  An
> outstanding request to one region can then interfere with a new request
> to a different region.  That seems like a reasonable thing to do.
> I cannot immediately see a generic way to handle this case that I am
> happy with.  I'll keep hunting.

You can have a snapshot and an origin for the same device in the same 
tree - it is not common, but it is possible.

Offloafing bios queued on current->bio_list avoids the deadlock, but your 
patch breaks it by offloading only the first list and not the second.

> Thanks,
> NeilBrown

Mikulas