[dm-devel] [PATCH v4 01/19] libmultipath: fix tur checker timeout

Martin Wilck mwilck at suse.de
Wed Oct 10 07:05:17 UTC 2018


On Tue, 2018-10-09 at 18:02 -0500, Benjamin Marzinski wrote:
> The code previously timed out if ct->thread was 0 but ct->running
> wasn't. This combination never happens. The idea was to time out if,
> for some reason, the path checker tried to cancel the thread but it
> didn't die. The correct thing to check for this is ct->holders.
> ct->holders will always be at least one when libcheck_check() is
> called, since libcheck_free() won't get called until the device is no
> longer being checked. So, if ct->holders is 2, that means that the tur
> thread has not shut down yet.
>
> Also, instead of timing out, the tur checker will switch to synchronous
> mode. The chance of this code path happening is very low. It simply
> exists because the old thread must not interfere with a new thread
> starting up. But if something does go very wrong, and a thread does get
> stuck, this solution will keep the checker from just ignoring the
> device forever.

Well, the previous tur thread hanging means that future attempts might
hang as well, in which case the synchronous approach would block _all_
path checkers. Wouldn't the following reasoning apply here?

commit 05cbea354172be5507ac83c98bbac8e02aa8cf3c
Author: Hannes Reinecke <hare at suse.de>
Date:   Fri Dec 13 13:12:42 2013 +0100

    multipath: do not call tur in sync mode if pthread_cancel fails
    
    When pthread_cancel fails the thread is stuck, most likely
    during I/O submission. So it would be pointless to call the
    tur checker in sync mode here, as this would be stuck, too.

I argued before that the current PATH_TIMEOUT return code is wrong, but
I think it's better than falling back to synchronous mode.

I'm fine with this patch if the return PATH_TIMEOUT remains for now,
and we vow to fix this for good soon.
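
To spell out the behaviour I'm asking for, here is a rough sketch. It is
not the real tur.c code: the sketch_* names and the plain int counter are
made up for illustration, and the real checker reads ct->holders with
uatomic_read() while holding ct->lock. The point is only that a leftover
reference from a cancelled thread leads to a timeout result, not to a
synchronous TUR that could block in the same place.

```c
/* Hypothetical stand-ins, not the real libmultipath definitions. */
enum sketch_state {
	SKETCH_PATH_UNCHECKED,	/* no result yet, a new thread may start */
	SKETCH_PATH_TIMEOUT	/* old thread still around, give up for now */
};

struct sketch_ctx {
	int holders;	/* 1 = checker only, 2 = checker + tur thread */
};

/* Returns 1 if an earlier thread still holds a reference to the context. */
static int sketch_old_thread_alive(const struct sketch_ctx *ct)
{
	return ct->holders > 1;
}

/*
 * Decide what to do when no asynchronous check is in flight.
 * If a cancelled thread has not released its reference, report a
 * timeout instead of issuing a synchronous TUR.
 */
static enum sketch_state sketch_check(const struct sketch_ctx *ct)
{
	if (sketch_old_thread_alive(ct))
		return SKETCH_PATH_TIMEOUT;
	/* Safe to start a new asynchronous checker thread here. */
	return SKETCH_PATH_UNCHECKED;
}
```

With holders == 1 this reports SKETCH_PATH_UNCHECKED (a new thread can be
started); with holders == 2 it reports SKETCH_PATH_TIMEOUT.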

>
> Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> ---
>  libmultipath/checkers/tur.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
> index bf8486d..3c5e236 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -355,12 +355,13 @@ int libcheck_check(struct checker * c)
>  		}
>  		pthread_mutex_unlock(&ct->lock);
>  	} else {
> -		if (uatomic_read(&ct->running) != 0) {
> -			/* pthread cancel failed. continue in sync mode */
> +		if (uatomic_read(&ct->holders) > 1) {
> +			/* The thread has been cancelled but hasn't
> +			 * quilt. Fail back to synchronous mode */

Typo: "quilt" should be "quit".

>  			pthread_mutex_unlock(&ct->lock);
> -			condlog(3, "%s: tur thread not responding",
> +			condlog(3, "%s: tur checker failing back to sync",
>  				tur_devt(devt, sizeof(devt), ct));
> -			return PATH_TIMEOUT;
> +			return tur_check(c->fd, c->timeout, copy_msg_to_checker, c);
>  		}
>  		/* Start new TUR checker */
>  		ct->state = PATH_UNCHECKED;

Regards,
Martin




