[dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

Fri Feb 9 17:26:07 UTC 2018

On Fri, Feb 09, 2018 at 04:15:34PM +0000, Bart Van Assche wrote:
> On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote:
> >  static void cleanup_func(void *data)
> >  {
> > -	int holders;
> > +	int running, holders;
> >  	struct tur_checker_context *ct = data;
> > -	pthread_spin_lock(&ct->hldr_lock);
> > -	ct->holders--;
> > -	holders = ct->holders;
> > -	ct->thread = 0;
> > -	pthread_spin_unlock(&ct->hldr_lock);
> > +
> > +	running = uatomic_xchg(&ct->running, 0);
> > +	holders = uatomic_sub_return(&ct->holders, 1);
> >  	if (!holders)
> >  		cleanup_context(ct);
> > +	if (!running)
> > +		pause();
> >  }
> 
> Hello Ben,
> 
> Why has the pause() call been added? I think it is safe to call pthread_cancel()
> for a non-detached thread that has finished so I don't think that pause() call
> is necessary.

Martin objected to having the threads getting detached as part of
cancelling them (I think. I'm a little fuzzy on what he didn't like).
But he definitely said he preferred the thread to start detached, so in
this version, it does.  That's why we need the pause().  If he's fine with
the threads getting detached later, I will happily replace the pause()
with

if (running)
	pthread_detach(pthread_self());

and add pthread_detach(ct->thread) after the calls to
pthread_cancel(ct->thread). Otherwise we need the pause() to solve your
original bug.

As an aside, Martin, if your problem is the thread detaching itself, we
can skip that if we are fine with a zombie thread hanging around until
the next time we call libcheck_check() or libcheck_free(). Then the
checker can always be in charge of detaching the thread.

> >  static int tur_running(struct tur_checker_context *ct)
> >  {
> > -	pthread_t thread;
> > -
> > -	pthread_spin_lock(&ct->hldr_lock);
> > -	thread = ct->thread;
> > -	pthread_spin_unlock(&ct->hldr_lock);
> > -
> > -	return thread != 0;
> > +	return (uatomic_read(&ct->running) != 0);
> >  }
> 
> Is such a one line function really useful?

Nope. I just left it there to keep the number of changes that the patch
makes lower, to make it more straightforward to review. I'm fine will
inlining it.

> I think the code will be easier to read if this function is inlined
> into its callers.

> > @@ -418,8 +396,12 @@ int libcheck_check(struct checker * c)
> >  		    (tur_status == PATH_PENDING || tur_status == PATH_UNCHECKED)) {
> >  			condlog(3, "%s: tur checker still running",
> >  				tur_devt(devt, sizeof(devt), ct));
> > -			ct->running = 1;
> >  			tur_status = PATH_PENDING;
> > +		} else {
> > +			int running = uatomic_xchg(&ct->running, 0);
> > +			if (running)
> > +				pthread_cancel(ct->thread);
> > +			ct->thread = 0;
> >  		}
> >  	}
> 
> Why has this pthread_cancel() call been added? I think that else clause can only be
> reached if ct->running == 0 so I don't think that the pthread_cancel() call will ever
> be reached.

It can be reached if ct->running is 1, as long as tur_status has been
updated.  In practice this means that the thread should have done
everything it needs to do, and all that's left is for it to shutdown.
However, if the thread doesn't shut itself down before the next time you
call libcheck_check(), the checker will give up and return PATH_TIMEOUT.

It seems pretty unlikely that this will happen, since there should be a
significant delay before calling libcheck_check() again. This
theoretical race has been in the code for a while, and AFAIK, it's never
occured. But there definitely is the possiblity that the thread will
still be running at the end of libcheck_check(), and it doesn't hurt
things to forceably shut the thread down, if it is. 

-Ben

> Thanks,
> 
> Bart.
> 
> 
>