[dm-devel] [PATCH v2 15/17] libmultipath: make directio checker share io contexts

Benjamin Marzinski bmarzins at redhat.com
Tue Feb 11 22:21:07 UTC 2020


On Tue, Feb 11, 2020 at 09:38:39AM +0100, Martin Wilck wrote:
> On Mon, 2020-02-10 at 17:15 -0600, Benjamin Marzinski wrote:
> > On Mon, Feb 10, 2020 at 06:08:14PM +0100, Martin Wilck wrote:
> > > On Wed, 2020-02-05 at 12:58 -0600, Benjamin Marzinski wrote:
> > > > On systems with a large number of cores (>500), io_destroy()
> > > > can take tens to hundreds of milliseconds to complete, due to
> > > > RCU synchronization. If there are a large number of paths using
> > > > the directio checker on such a system, this can lead to
> > > > multipath taking almost a minute to complete, with the vast
> > > > majority of the time taken up by io_destroy().
> > > > 
> > > > To solve this, the directio checker now allocates one aio
> > > > context for every 1024 checkers. This reduces the io_destroy()
> > > > delay to a thousandth of its previous level. However, this means
> > > > that multiple checkers are sharing the same aio context, and
> > > > must be able to handle getting results for other checkers.
> > > > Because only one checker is ever running at a time, this doesn't
> > > > require any locking. However, locking could be added in the
> > > > future if necessary, to allow multiple checkers to run at the
> > > > same time.
> > > > 
> > > > When checkers are freed, they usually no longer destroy the io
> > > > context. Instead, they attempt to cancel any outstanding request.
> > > > If that fails, they put the request on an orphan list, so that it
> > > > can be freed by other checkers once it has completed. IO
> > > > contexts are only destroyed at three times: during reconfigure
> > > > (to deal with the possibility that multipathd is holding more aio
> > > > events than it needs to be, since there is a single limit for the
> > > > whole system), when the checker class is unloaded, and in a
> > > > corner case when checkers are freed. If an aio_group (which
> > > > contains the aio context) is entirely full of orphaned requests,
> > > > then no checker can use it. Since no checker is using it, there
> > > > is no checker to clear out the orphaned requests. In this (likely
> > > > rare) case, the last checker from that group to be freed and
> > > > leave behind an orphaned request will call io_destroy() and
> > > > remove the group.
> > > > 
> > > > Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> > > > ---
> > > >  libmultipath/checkers/directio.c | 336 +++++++++++++++++++++++++------
> > > >  multipath/multipath.conf.5       |   7 +-
> > > >  2 files changed, 281 insertions(+), 62 deletions(-)
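
To make the sharing scheme described in the commit message concrete, here is
a rough sketch (not the patch itself, compile with -laio): one io_context per
group of up to 1024 checkers, with io_destroy() deferred until the last
holder leaves. The names aio_group and AIO_GROUP_SIZE follow the description
above; the orphan list and the real bookkeeping are deliberately omitted.

#include <libaio.h>
#include <stdlib.h>

#define AIO_GROUP_SIZE 1024	/* checkers sharing one io_context */

struct aio_group {
	io_context_t ioctx;	/* shared kernel AIO context */
	int holders;		/* checkers currently using this group */
};

/* Set up a context big enough for one in-flight request per checker. */
static struct aio_group *add_aio_group(void)
{
	struct aio_group *grp = calloc(1, sizeof(*grp));

	if (!grp)
		return NULL;
	if (io_setup(AIO_GROUP_SIZE, &grp->ioctx) != 0) {
		free(grp);
		return NULL;
	}
	return grp;
}

/* A new checker joins a group only if it still has a free slot. */
static int claim_slot(struct aio_group *grp)
{
	if (grp->holders >= AIO_GROUP_SIZE)
		return -1;
	grp->holders++;
	return 0;
}

/* Only the last holder pays the io_destroy() cost, so the expensive
 * RCU synchronization happens once per 1024 checkers, not per checker. */
static void release_slot(struct aio_group *grp)
{
	if (--grp->holders > 0)
		return;
	io_destroy(grp->ioctx);
	free(grp);
}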
> > > 
> > > Although I concur now that your design is sound, I still have some
> > > issues, see below.
> > > ...
> > >  	}
> > > >  	ct->running++;
> > > 
> > > This looks to me as if in the case (ct->running &&
> > > ct->req->state == PATH_PENDING), ct->running could become > 1,
> > > even though you don't start a new IO. Is that intended? I don't
> > > think it matters because you never decrement, but it looks weird.
> > 
> > That's necessary for how the checker currently works. In async mode
> > the checker doesn't actually wait for a specific number of seconds
> > (and didn't before my patch). It assumes one-second intervals
> > between calls for pending paths, and times out once ct->running >
> > timeout_secs. That's why the unit tests can get away with simply
> > calling the checker repeatedly, without waiting, to make it time
> > out. But I suppose that wait_for_pending_paths() will also make the
> > checker time out more quickly, so this should probably be changed.
> > However, since check_path() won't set a path's state to PATH_PENDING
> > if it wasn't already, it's not really a very likely issue to occur.
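
As a stand-alone illustration of the tick-based timeout Ben describes, where
each checker call counts as roughly one second and the path times out once
the count exceeds timeout_secs, here is a minimal model. check_once,
checker_ctx and the enum values are hypothetical; only the counting logic
mirrors the discussion.

#include <stdio.h>

enum path_state { PATH_PENDING, PATH_UP, PATH_DOWN };

struct checker_ctx {
	int running;		/* calls made while the IO is still pending */
	enum path_state state;	/* result of the async IO, if any */
};

static enum path_state check_once(struct checker_ctx *ct, int timeout_secs)
{
	if (ct->state != PATH_PENDING) {
		ct->running = 0;
		return ct->state;
	}
	ct->running++;			/* one more ~1s "tick" */
	if (ct->running > timeout_secs)
		return PATH_DOWN;	/* timed out by tick count alone */
	return PATH_PENDING;
}

int main(void)
{
	struct checker_ctx ct = { .running = 0, .state = PATH_PENDING };
	int i;

	/* Calling repeatedly without sleeping still times the path out,
	 * which is why the unit tests mentioned above can do just that. */
	for (i = 0; i < 5; i++)
		printf("call %d -> state %d\n", i + 1, check_once(&ct, 3));
	return 0;
}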
> 
> Bah. I should have realized that, of course. Yet measuring the timeout
> in *seconds*, and polling for it the way we currently do, is really 80s
> design... I guess I was confused by the use of ns timeout calculation
> for the get_events() call (suggesting high-res timing), and the use of
> "ct->running" as both indicator of running IO and "abstract run time".
> 
> I hope you admit that the logic in check_path() is convoluted and hard
> to understand :-/ . For example here:
> 
> > 	while(1) {
> > 		r = get_events(ct->aio_grp, timep);
> > 
> > 		if (ct->req->state != PATH_PENDING) {
> > 			ct->running = 0;
> > 			return ct->req->state;
> > 		} else if (r == 0 || !timep)
> > 			break;
> > 
> > 		get_monotonic_time(&currtime);
> > 		timespecsub(&endtime, &currtime, &timeout);
> > 		if (timeout.tv_sec < 0)
> > 			timep = NULL;
> 
> We're past the timeout already, having seen some events, just not for
> the path we're currently looking at. Why do we go through another
> iteration?

Well, in this case we are past the timeout now, but weren't when
io_getevents() completed, so the code loops one more time to check
(without waiting) if the current path's request has completed. Since we
call io_getevents() after setting the timeout to be all the remaining
time, this should only ever happen if io_getevents() exited early.
Granted, it likely didn't exit very early (unless the thread was
preempted afterwards), since it is now past the timeout.
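
For illustration only, a self-contained sketch of the pattern being
discussed: reap events until a monotonic deadline, recompute the remaining
budget after each wakeup, and finish with one non-blocking poll (zero
timeout, as suggested later in the thread) if the deadline has meanwhile
passed. wait_until and remaining_time are hypothetical stand-ins for
check_path()'s loop and the get_events()/get_monotonic_time() helpers; this
is not the directio.c code. Compile with -laio.

#include <libaio.h>
#include <time.h>

/* Time left until 'end' on the monotonic clock (may go negative). */
static void remaining_time(const struct timespec *end, struct timespec *left)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	left->tv_sec = end->tv_sec - now.tv_sec;
	left->tv_nsec = end->tv_nsec - now.tv_nsec;
	if (left->tv_nsec < 0) {
		left->tv_sec--;
		left->tv_nsec += 1000000000L;
	}
}

/* Reap events until the deadline; returns how many were collected. */
static int wait_until(io_context_t ioctx, const struct timespec *end)
{
	struct io_event events[128];
	struct timespec left;
	int n, total = 0;

	for (;;) {
		remaining_time(end, &left);
		if (left.tv_sec < 0) {
			/* Past the deadline (io_getevents() woke early and
			 * we lost the race to the clock): one final
			 * non-blocking check with a zero timeout. */
			left.tv_sec = 0;
			left.tv_nsec = 0;
		}
		n = io_getevents(ioctx, 1, 128, events, &left);
		if (n > 0)
			total += n;
		if (n <= 0 || (left.tv_sec == 0 && left.tv_nsec == 0))
			return total;
	}
}

int main(void)
{
	io_context_t ioctx = 0;
	struct timespec end;

	if (io_setup(128, &ioctx) != 0)
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &end);
	end.tv_sec += 1;		/* 1s budget; nothing submitted,  */
	wait_until(ioctx, &end);	/* so this just waits it out      */
	io_destroy(ioctx);
	return 0;
}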

-Ben

> 
> > 	}
> > 	if (ct->running > timeout_secs || sync) {
> >  
> > > See comment for get_events() above. Why don't you simply do this?
> > > 
> > >     timeout.tv_sec = timeout.tv_nsec = 0;
> > > 
> > 
> > Sure.
> > 
> > So, if I post a separate patch that resolves these issues, can this
> > one have an ack?
> 
> Yes. I agree we should move forward.
> 
> Regards
> Martin
> 



