[dm-devel] [PATCH v2 15/17] libmultipath: make directio checker share io contexts

Mon Feb 10 17:08:14 UTC 2020

On Wed, 2020-02-05 at 12:58 -0600, Benjamin Marzinski wrote:
> On systems with a large number of cores (>500), io_destroy() can take
> tens to hundreds of milliseconds to complete, due to RCU
> synchronization. If there are a large number of paths using the directio
> checker on such a system, this can lead to multipath taking almost a
> minute to complete, with the vast majority of time taken up by
> io_destroy().
> 
> To solve this, the directio checker now allocates one aio context for
> every 1024 checkers. This reduces the io_destroy() delay to a thousandth
> of its previous level. However, this means that muliple checkers are
> sharing the same aio context, and must be able to handle getting results
> for other checkers.  Because only one checker is ever running at a
> time, this doesn't require any locking.  However, locking could be added
> in the future if necessary, to allow multiple checkers to run at the
> same time.
> 
> When checkers are freed, they usually no longer destroy the io context.
> Instead, they attempt to cancel any outstanding request. If that fails,
> they put the request on an orphan list, so that it can be freed by other
> checkers, once it has completed. IO contexts are only destroyed at three
> times, during reconfigure (to deal with the possibility that multipathd
> is holding more aio events than it needs to be, since there is a single
> limit for the whole system), when the checker class is unloaded, and
> in a corner case when checkers are freed. If an aio_group (which
> contains the aio context) is entirely full of orphaned requests, then
> no checker can use it. Since no checker is using it, there is no checker
> to clear out the orphaned requests. In this (likely rare) case, the
> last checker from that group to be freed and leave behind an orphaned
> request will call io_destroy() and remove the group.
> 
> Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> ---
>  libmultipath/checkers/directio.c | 336 +++++++++++++++++++++++++------
>  multipath/multipath.conf.5       |   7 +-
>  2 files changed, 281 insertions(+), 62 deletions(-)

Although I concur now that your design is sound, I still have some
issues, see below.

> 
> diff --git a/libmultipath/checkers/directio.c b/libmultipath/checkers/directio.c
> index 1b00b775..740c68e5 100644
> --- a/libmultipath/checkers/directio.c
> +++ b/libmultipath/checkers/directio.c
> 
> +/* If an aio_group is completely full of orphans, then no checkers can
> + * use it, which means that no checkers can clear out the orphans. To
> + * avoid keeping the useless group around, simply remove remove the
> + * group */
> +static void
> +check_orphaned_group(struct aio_group *aio_grp)
> +{
> +	int count = 0;
> +	struct list_head *item;
> +
> +	if (aio_grp->holders < AIO_GROUP_SIZE)
> +		return;
> +	list_for_each(item, &aio_grp->orphans)
> +		count++;
> +	if (count >= AIO_GROUP_SIZE) {
> +		remove_aio_group(aio_grp);
> +		if (list_empty(&aio_grp_list))
> +			add_aio_group();

OK, but not beautiful. Can be improved later, I guess. In general, you
could delay allocation of an aio_group until it's actually needed (i.e.
when the first path is using it, in set_aio_group(), as you're already
doing it for 2nd and later groups).

> +	}
> +}
> +
> +int libcheck_load (void)
> +{
> +	if (add_aio_group() == NULL) {
> +		LOG(1, "libcheck_load failed: %s", strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +void libcheck_unload (void)
> +{
> +	struct aio_group *aio_grp, *tmp;
> +
> +	list_for_each_entry_safe(aio_grp, tmp, &aio_grp_list, node)
> +		remove_aio_group(aio_grp);
> +}

I have one concern here - this might cause delays during multipathd
shutdown, which we have struggled to eliminate in previous patches.
OTOH, according to what you wrote, with the current code the shutdown
delays will probably be higher, so this is actually an improvement.
We should take a mental note about the shutdown issue. Like with TUR,
avoiding hanging on shutdown is tricky if we consider possibly hanging
device I/O.

> +
> +int libcheck_reset (void)
> +{
> +	struct aio_group *aio_grp, *tmp, *reset_grp = NULL;
> +
> +	/* If a clean existing aio_group exists, use that. Otherwise add a
> +	 * new one */
> +	list_for_each_entry(aio_grp, &aio_grp_list, node) {
> +		if (aio_grp->holders == 0 &&
> +		    list_empty(&aio_grp->orphans)) {
> +			reset_grp = aio_grp;
> +			break;
> +		}
> +	}
> +	if (!reset_grp)
> +		reset_grp = add_aio_group();
> +	if (!reset_grp) {
> +		LOG(1, "checker reset failed");
> +		return 1;
> +	}
> +
> +	list_for_each_entry_safe(aio_grp, tmp, &aio_grp_list, node) {
> +		if (aio_grp != reset_grp)
> +			remove_aio_group(aio_grp);
> +	}
> +	return 0;
> +}
>  
>  int libcheck_init (struct checker * c)
>  {
>  	unsigned long pgsize = getpagesize();
>  	struct directio_context * ct;
> +	struct async_req *req = NULL;
>  	long flags;
>  
>  	ct = malloc(sizeof(struct directio_context));
> @@ -56,26 +201,31 @@ int libcheck_init (struct checker * c)
>  		return 1;
>  	memset(ct, 0, sizeof(struct directio_context));
>  
> -	if (io_setup(1, &ct->ioctx) != 0) {
> -		condlog(1, "io_setup failed");
> -		free(ct);
> -		return 1;
> +	if (set_aio_group(ct) < 0)
> +		goto out;
> +
> +	req = malloc(sizeof(struct async_req));
> +	if (!req) {
> +		goto out;
>  	}
> +	memset(req, 0, sizeof(struct async_req));
> +	INIT_LIST_HEAD(&req->node);
>  
> -	if (ioctl(c->fd, BLKBSZGET, &ct->blksize) < 0) {
> +	if (ioctl(c->fd, BLKBSZGET, &req->blksize) < 0) {
>  		c->msgid = MSG_DIRECTIO_BLOCKSIZE;
> -		ct->blksize = 512;
> +		req->blksize = 512;

You didn't change this, but I wonder if this is really a safe default.
IIUC it's safe (although perhaps a bit slower) to read a multiple of
the logical block size, but reading less than the logical block size
might fail. Perhaps we should default to 4k?

>  	}
> -	if (ct->blksize > 4096) {
> +	if (req->blksize > 4096) {
>  		/*
>  		 * Sanity check for DASD; BSZGET is broken
>  		 */
> -		ct->blksize = 4096;
> +		req->blksize = 4096;
>  	}
> -	if (!ct->blksize)
> +	if (!req->blksize)
>  		goto out;
> -	ct->buf = (unsigned char *)malloc(ct->blksize + pgsize);
> -	if (!ct->buf)
> +
> +	req->buf = (unsigned char *)malloc(req->blksize + pgsize);

Why don't you simply use posix_memalign()?

> +	if (!req->buf)
>  		goto out;
>  
>  	flags = fcntl(c->fd, F_GETFL);
> @@ -88,17 +238,22 @@ int libcheck_init (struct checker * c)
>  		ct->reset_flags = 1;
>  	}
>  
> -	ct->ptr = (unsigned char *) (((unsigned long)ct->buf + pgsize - 1) &
> +	req->ptr = (unsigned char *) (((unsigned long)req->buf + pgsize - 1) &
>  		  (~(pgsize - 1)));

See above.

>  
>  	/* Successfully initialized, return the context. */
> +	ct->req = req;
>  	c->context = (void *) ct;
>  	return 0;
>  
>  out:
> -	if (ct->buf)
> -		free(ct->buf);
> -	io_destroy(ct->ioctx);
> +	if (req) {
> +		if (req->buf)
> +			free(req->buf);
> +		free(req);
> +	}
> +	if (ct->aio_grp)
> +		ct->aio_grp->holders--;
>  	free(ct);
>  	return 1;
>  }
> @@ -106,6 +261,7 @@ out:
>  void libcheck_free (struct checker * c)
>  {
>  	struct directio_context * ct = (struct directio_context *)c->context;
> +	struct io_event event;
>  	long flags;
>  
>  	if (!ct)
> @@ -121,20 +277,72 @@ void libcheck_free (struct checker * c)
>  		}
>  	}
>  
> -	if (ct->buf)
> -		free(ct->buf);
> -	io_destroy(ct->ioctx);
> +	if (ct->running &&
> +	    (ct->req->state != PATH_PENDING ||
> +	     io_cancel(ct->aio_grp->ioctx, &ct->req->io, &event) == 0))
> +		ct->running = 0;
> +	if (!ct->running) {
> +		free(ct->req->buf);
> +		free(ct->req);
> +		ct->aio_grp->holders--;
> +	} else {
> +		ct->req->state = PATH_REMOVED;
> +		list_add(&ct->req->node, &ct->aio_grp->orphans);
> +		check_orphaned_group(ct->aio_grp);
> +	}
> +
>  	free(ct);
> +	c->context = NULL;
> +}
> +
> +static int
> +get_events(struct aio_group *aio_grp, struct timespec *timeout)
> +{
> +	struct io_event events[128];
> +	int i, nr, got_events = 0;
> +	struct timespec zero_timeout = {0};
> +	struct timespec *timep = (timeout)? timeout : &zero_timeout;

This isn't wrong, but the semantics of the "timeout" parameter are a
bit confusing, as io_getevents() would interpret a NULL timeout as
"forever", and get_events is mostly a wrapper around io_getevents().

> +
> +	do {
> +		errno = 0;
> +		nr = io_getevents(aio_grp->ioctx, 1, 128, events, timep);
> +		got_events |= (nr > 0);
> +
> +		for (i = 0; i < nr; i++) {
> +			struct async_req *req = container_of(events[i].obj, struct async_req, io);
> +
> +			LOG(3, "io finished %lu/%lu", events[i].res,
> +			    events[i].res2);
> +
> +			/* got an orphaned request */
> +			if (req->state == PATH_REMOVED) {
> +				list_del(&req->node);
> +				free(req->buf);
> +				free(req);
> +				aio_grp->holders--;
> +			} else
> +				req->state = (events[i].res == req->blksize) ?
> +					      PATH_UP : PATH_DOWN;
> +		}
> +		timep = &zero_timeout;
> +	} while (nr == 128); /* assume there are more events and try again */
> +
> +	if (nr < 0)
> +		LOG(3, "async io getevents returned %i (errno=%s)",
> +		    nr, strerror(errno));
> +
> +	return got_events;
>  }
>  static int
>  check_state(int fd, struct directio_context *ct, int sync, int timeout_secs)
>  {
>  	struct timespec	timeout = { .tv_nsec = 5 };

What's the purpose of these 5ns? Unless the device in question is an
NVDIMM, I'd say 5ns is practically equivalent to 0.

> -	struct io_event event;
>  	struct stat	sb;
> -	int		rc = PATH_UNCHECKED;
> +	int		rc;
>  	long		r;
> +	struct timespec currtime, endtime;
> +	struct timespec *timep = &timeout;
>  
>  	if (fstat(fd, &sb) == 0) {
>  		LOG(4, "called for %x", (unsigned) sb.st_rdev);
> @@ -145,50 +353,60 @@ check_state(int fd, struct directio_context *ct, int sync, int timeout_secs)
>  		timeout.tv_nsec = 0;
>  	}
>  
> -	if (!ct->running) {
> -		struct iocb *ios[1] = { &ct->io };
> +	if (ct->running) {
> +		if (ct->req->state != PATH_PENDING) {
> +			ct->running = 0;
> +			return ct->req->state;
> +		}
> +	} else {
> +		struct iocb *ios[1] = { &ct->req->io };
>  
>  		LOG(3, "starting new request");
> -		memset(&ct->io, 0, sizeof(struct iocb));
> -		io_prep_pread(&ct->io, fd, ct->ptr, ct->blksize, 0);
> -		if (io_submit(ct->ioctx, 1, ios) != 1) {
> +		memset(&ct->req->io, 0, sizeof(struct iocb));
> +		io_prep_pread(&ct->req->io, fd, ct->req->ptr,
> +			      ct->req->blksize, 0);
> +		ct->req->state = PATH_PENDING;
> +		if (io_submit(ct->aio_grp->ioctx, 1, ios) != 1) {
>  			LOG(3, "io_submit error %i", errno);
>  			return PATH_UNCHECKED;
>  		}
>  	}
>  	ct->running++;

This looks to me as if in the case (ct->running && ct->req->state ==
PATH_PENDING), ct->running could become > 1, even though you don't
start a new IO. Is that intended? I don't think it matters because you
never decrement, but it looks weird.

>  
> -	errno = 0;
> -	r = io_getevents(ct->ioctx, 1L, 1L, &event, &timeout);
> +	get_monotonic_time(&endtime);
> +	endtime.tv_sec += timeout.tv_sec;
> +	endtime.tv_nsec += timeout.tv_nsec;
> +	normalize_timespec(&endtime);
> +	while(1) {
> +		r = get_events(ct->aio_grp, timep);
>  
> -	if (r < 0 ) {
> -		LOG(3, "async io getevents returned %li (errno=%s)", r,
> -		    strerror(errno));
> -		ct->running = 0;
> -		rc = PATH_UNCHECKED;
> -	} else if (r < 1L) {
> -		if (ct->running > timeout_secs || sync) {
> -			struct iocb *ios[1] = { &ct->io };
> -
> -			LOG(3, "abort check on timeout");
> -			r = io_cancel(ct->ioctx, ios[0], &event);
> -			/*
> -			 * Only reset ct->running if we really
> -			 * could abort the pending I/O
> -			 */
> -			if (r)
> -				LOG(3, "io_cancel error %i", errno);
> -			else
> -				ct->running = 0;
> -			rc = PATH_DOWN;
> -		} else {
> -			LOG(3, "async io pending");
> -			rc = PATH_PENDING;
> -		}
> +		if (ct->req->state != PATH_PENDING) {
> +			ct->running = 0;
> +			return ct->req->state;
> +		} else if (r == 0 || !timep)
> +			break;
> +
> +		get_monotonic_time(&currtime);
> +		timespecsub(&endtime, &currtime, &timeout);
> +		if (timeout.tv_sec < 0)
> +			timep = NULL;

See comment for get_events() above. Why don't you simply do this?

    timeout.tv_sec = timeout.tv_nsec = 0;

> +	}
> +	if (ct->running > timeout_secs || sync) {
> +		struct io_event event;
> +
> +		LOG(3, "abort check on timeout");
> +
> +		r = io_cancel(ct->aio_grp->ioctx, &ct->req->io, &event);
> +		/*
> +		 * Only reset ct->running if we really
> +		 * could abort the pending I/O

... which will never happen ... but never mind.

> +		 */
> +		if (!r)
> +			ct->running = 0;
> +		rc = PATH_DOWN;
>  	} else {
> -		LOG(3, "io finished %lu/%lu", event.res, event.res2);
> -		ct->running = 0;
> -		rc = (event.res == ct->blksize) ? PATH_UP : PATH_DOWN;
> +		LOG(3, "async io pending");
> +		rc = PATH_PENDING;
>  	}
>  
>  	return rc;
> diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
> index dc103fd8..05a5e8ff 100644
> --- a/multipath/multipath.conf.5
> +++ b/multipath/multipath.conf.5
> @@ -494,9 +494,10 @@ Check the path state for LSI/Engenio/NetApp RDAC class as NetApp SANtricity E/EF
>  Series, and OEM arrays from IBM DELL SGI STK and SUN.
>  .TP
>  .I directio
> -(Deprecated) Read the first sector with direct I/O. This checker is being
> -deprecated, it could cause spurious path failures under high load.
> -Please use \fItur\fR instead.
> +(Deprecated) Read the first sector with direct I/O. If you have a large number
> +of paths, or many AIO users on a system, you may need to use sysctl to
> +increase fs.aio-max-nr. This checker is being deprecated, it could cause
> +spurious path failures under high load. Please use \fItur\fR instead.
>  .TP
>  .I cciss_tur
>  (Hardware-dependent)