[dm-devel] [PATCH v2 15/17] libmultipath: make directio checker share io contexts

Martin Wilck mwilck at suse.com
Tue Feb 11 08:38:39 UTC 2020


On Mon, 2020-02-10 at 17:15 -0600, Benjamin Marzinski wrote:
> On Mon, Feb 10, 2020 at 06:08:14PM +0100, Martin Wilck wrote:
> > On Wed, 2020-02-05 at 12:58 -0600, Benjamin Marzinski wrote:
> > > On systems with a large number of cores (>500), io_destroy() can take
> > > tens to hundreds of milliseconds to complete, due to RCU
> > > synchronization. If there are a large number of paths using the
> > > directio checker on such a system, this can lead to multipath taking
> > > almost a minute to complete, with the vast majority of time taken up
> > > by io_destroy().
> > > 
> > > To solve this, the directio checker now allocates one aio context for
> > > every 1024 checkers. This reduces the io_destroy() delay to a
> > > thousandth of its previous level. However, this means that multiple
> > > checkers are sharing the same aio context, and must be able to handle
> > > getting results for other checkers. Because only one checker is ever
> > > running at a time, this doesn't require any locking. However, locking
> > > could be added in the future if necessary, to allow multiple checkers
> > > to run at the same time.
> > > 
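(Restating the bookkeeping described above in code, just to check my
understanding; AIO_GROUP_SIZE, aio_group, holders, MAX_GROUPS and
get_aio_group() are made-up names here and may not match directio.c:)

#include <libaio.h>

/* Sketch only; a fixed array stands in for whatever list the real code
 * uses to track groups. */
#define AIO_GROUP_SIZE 1024	/* checkers per shared io context */
#define MAX_GROUPS 16		/* illustrative bound, not from the patch */

struct aio_group {
	io_context_t ioctx;	/* kernel aio context shared by the group */
	int holders;		/* checkers currently attached */
	int used;		/* io_setup() has been called for this slot */
};

static struct aio_group groups[MAX_GROUPS];

/* Attach a checker to a group with a free slot; io_setup() (and the
 * expensive io_destroy() later) happens only once per 1024 checkers. */
static struct aio_group *get_aio_group(void)
{
	int i;

	for (i = 0; i < MAX_GROUPS; i++)
		if (groups[i].used && groups[i].holders < AIO_GROUP_SIZE) {
			groups[i].holders++;
			return &groups[i];
		}
	for (i = 0; i < MAX_GROUPS; i++)
		if (!groups[i].used) {
			groups[i].ioctx = 0;
			if (io_setup(AIO_GROUP_SIZE, &groups[i].ioctx) != 0)
				return NULL;
			groups[i].used = 1;
			groups[i].holders = 1;
			return &groups[i];
		}
	return NULL;	/* all groups full */
}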
> > > When checkers are freed, they usually no longer destroy the io
> > > context. Instead, they attempt to cancel any outstanding request. If
> > > that fails, they put the request on an orphan list, so that it can be
> > > freed by other checkers, once it has completed. IO contexts are only
> > > destroyed at three times, during reconfigure (to deal with the
> > > possibility that multipathd is holding more aio events than it needs
> > > to be, since there is a single limit for the whole system), when the
> > > checker class is unloaded, and in a corner case when checkers are
> > > freed. If an aio_group (which contains the aio context) is entirely
> > > full of orphaned requests, then no checker can use it. Since no
> > > checker is using it, there is no checker to clear out the orphaned
> > > requests. In this (likely rare) case, the last checker from that
> > > group to be freed and leave behind an orphaned request will call
> > > io_destroy() and remove the group.
> > > 
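(Likewise, my reading of the teardown path, again with made-up names
rather than the actual directio.c identifiers:)

#include <libaio.h>
#include <stdlib.h>

struct async_req {
	struct iocb iocb;
	int pending;		/* submitted, completion not yet reaped */
	struct async_req *next;	/* orphan-list linkage */
};

struct aio_group {
	io_context_t ioctx;	/* context shared by up to 1024 checkers */
	int holders;		/* checkers still attached to this group */
	struct async_req *orphans;
};

static void put_checker_req(struct aio_group *grp, struct async_req *req)
{
	struct io_event ev;

	grp->holders--;
	if (req->pending && io_cancel(grp->ioctx, &req->iocb, &ev) != 0) {
		/* Cancellation failed: park the request so another checker
		 * in this group can free it after its completion arrives. */
		req->next = grp->orphans;
		grp->orphans = req;
		if (grp->holders > 0)
			return;
		/* Corner case from the commit message: only orphans are
		 * left and no checker holds the group, so nobody will ever
		 * reap them.  io_destroy() cancels/waits for outstanding
		 * I/O, after which the orphans can be freed and the group
		 * removed. */
		io_destroy(grp->ioctx);
		while (grp->orphans) {
			struct async_req *o = grp->orphans;
			grp->orphans = o->next;
			free(o);
		}
		return;
	}
	free(req);
}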
> > > Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> > > ---
> > >  libmultipath/checkers/directio.c | 336 +++++++++++++++++++++++++------
> > >  multipath/multipath.conf.5       |   7 +-
> > >  2 files changed, 281 insertions(+), 62 deletions(-)
> > 
> > Although I concur now that your design is sound, I still have some
> > issues, see below.
> > ...
> >  	}
> > >  	ct->running++;
> > 
> > This looks to me as if in the case (ct->running && ct->req->state ==
> > PATH_PENDING), ct->running could become > 1, even though you don't
> > start a new IO. Is that intended? I don't think it matters because
> > you never decrement, but it looks weird.
> 
> That's necessary for how the checker currently works. In async mode the
> checker doesn't actually wait for a specific number of seconds (and
> didn't before my patch). It assumes it is called about once per second
> for pending paths, and times out once ct->running > timeout_secs.
> That's why the unit tests can get away with simply calling the checker
> repeatedly, without waiting, to make it time out. But I suppose that
> wait_for_pending_paths() will also make the checker time out quicker,
> so this should probably be changed. However, since check_path won't set
> a path's state to PATH_PENDING if it wasn't already, this isn't a very
> likely issue to occur.

Bah. I should have realized that, of course. Yet measuring the timeout
in *seconds*, and polling for it the way we currently do, is really 80s
design... I guess I was confused by the nanosecond timeout calculation
for the get_events() call (suggesting high-res timing), and by the use
of "ct->running" as both an indicator of running IO and an "abstract
run time".

I hope you admit that the logic in check_path() is convoluted and hard
to understand :-/. For example, here:

> 	while(1) {
> 		r = get_events(ct->aio_grp, timep);
> 
> 		if (ct->req->state != PATH_PENDING) {
> 			ct->running = 0;
> 			return ct->req->state;
> 		} else if (r == 0 || !timep)
> 			break;
> 
> 		get_monotonic_time(&currtime);
> 		timespecsub(&endtime, &currtime, &timeout);
> 		if (timeout.tv_sec < 0)
> 			timep = NULL;

We're past the timeout already, having seen some events, just not for
the path we're currently looking at. Why do we go through another
iteration?

> 	}
> 	if (ct->running > timeout_secs || sync) {
>  
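What I would have expected here, sketched with the identifiers from the
snippet above (untested, not a real patch): once the deadline has
passed, stop polling instead of going around again with timep = NULL:

	while (1) {
		r = get_events(ct->aio_grp, timep);

		if (ct->req->state != PATH_PENDING) {
			ct->running = 0;
			return ct->req->state;
		} else if (r == 0 || !timep)
			break;

		get_monotonic_time(&currtime);
		timespecsub(&endtime, &currtime, &timeout);
		/* deadline passed: give up rather than poll once more */
		if (timeout.tv_sec < 0)
			break;
	}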
> > See comment for get_events() above. Why don't you simply do this?
> > 
> >     timeout.tv_sec = timeout.tv_nsec = 0;
> > 
> 
> Sure.
> 
> So, if I post a separate patch that resolves these issues, can this
> one have an ack?

Yes. I agree we should move forward.

Regards
Martin
