[dm-devel] [PATCH v3 01/19] libmultipath: fix tur checker timeout
Martin Wilck
mwilck at suse.com
Fri Oct 5 10:11:18 UTC 2018
On Thu, 2018-10-04 at 11:31 -0500, Benjamin Marzinski wrote:
> On Mon, Oct 01, 2018 at 09:51:22PM +0200, Martin Wilck wrote:
> > On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote:
> > > The code previously was timing out mode if ct->thread was 0 but
> > > ct->running wasn't. This combination never happens. The idea was
> > > to
> > > timeout if for some reason the path checker tried to kill the
> > > thread,
> > > but it didn't die. The correct thing to check for this is ct-
> > > > holders.
> > >
> > > ct->holders will always be at least one when libcheck_check() is
> > > called,
> > > since libcheck_free() won't get called until the device is no
> > > longer
> > > being checked. So, if ct->holders is 2, that means that the tur
> > > thread
> > > is has not shut down yet.
> > >
> > > Also, instead of returning PATH_TIMEOUT whenever the thread
> > > hasn't
> > > died,
> > > it should only time out if the thread didn't successfully get a
> > > value,
> > > which means the previous state was already PATH_TIMEOUT.
> >
> > What about PATH_UNCHECKED?
> >
>
> I don't see how that could happen. In this state we've definitely set
> ct->state to PATH_PENDING when we started the thread. The thread will
> only change ct->state to the return of tur_check(), which doesn't
> ever return PATH_UNCHECKED. Am I missing something here?
OK. The only thing you missed was a comment :-)
This stuff is so subtle that we should describe exactly how
the locking works, which actor is supposed/entitled to set what field
to what value in what situation, and so on.
>
> > >
> > > Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> > > ---
> > > libmultipath/checkers/tur.c | 15 +++++++++------
> > > 1 file changed, 9 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/libmultipath/checkers/tur.c
> > > b/libmultipath/checkers/tur.c
> > > index bf8486d..275541f 100644
> > > --- a/libmultipath/checkers/tur.c
> > > +++ b/libmultipath/checkers/tur.c
> > > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c)
> > > }
> > > pthread_mutex_unlock(&ct->lock);
> > > } else {
> > > - if (uatomic_read(&ct->running) != 0) {
> > > - /* pthread cancel failed. continue in sync mode
> > > */
> > > - pthread_mutex_unlock(&ct->lock);
> > > - condlog(3, "%s: tur thread not responding",
> > > - tur_devt(devt, sizeof(devt), ct));
> > > - return PATH_TIMEOUT;
> > > + if (uatomic_read(&ct->holders) > 1) {
> > > + /* pthread cancel failed. If it didn't get the
> > > path
> > > + state already, timeout */
> > > + if (ct->state == PATH_PENDING) {
> > > + pthread_mutex_unlock(&ct->lock);
> > > + condlog(3, "%s: tur thread not
> > > responding",
> > > + tur_devt(devt, sizeof(devt),
> > > ct));
> > > + return PATH_TIMEOUT;
> > > + }
> > > }
> >
> > I'm not quite getting it yet. We're entering this code path if
> > ct->thread is 0. As you point out, if there are >1 holders, a
> > previously cancelled TUR thread must be still "alive". But now we
> > want
> > to check *again*. The previous code refused to do that - a single
> > hanging TUR thread could block further checks from happening,
> > forever.
> > The PATH_TIMEOUT return code was actually questionable - in this
> > libcheck_check() invocation, we didn't even try to check, so
> > nothing
> > could actually time out (but that's obviously independent of your
> > patch).
> >
> > With your patch, if ct->state != PATH_PENDING, we'd try to start
> > another TUR thread although there's still a hanging one. That's not
> > necessarily wrong, but I fail to see why it depends on ct->state.
> > Anyway, if we do this, we take the risk of starting more and more
> > TUR
> > threads which might all end up hanging; I wonder if we should stop
> > creating more threads if ct->holders grows further (e.g. > 5).
>
> It's been a while, and I'm not exactly sure what I was thinking here.
> But IIRC the idea was that if the state isn't set yet, then the old
> thread could mess with the results of a future thread. Also, it was
> to
> avoid a corner case where the tur checker was called, started the
> thread, got the result before the checker exitted, cancelled the
> thread
> and returned the result and then was called very shortly afterwards,
> before the thread could clean itself up. In that case, the right
> thing
> to do (I thought) would be to start a new thread, because the other
> one
> would be ending soon. In truth, starting a new thread at all is
> probably a bad idea, since the old thread will still mess with the
> checker context on exit.
>
> A better idea might be to simply fail back to a syncronous call to
> tur_check() when you notice a cancelled thread that hasn't exitted.
> That can cause all the other checkers to get delayed, but at least in
> that case you are still checking paths. The other option is to do as
> before and just not check this path, which won't effect the other
> checkers. It seems very unlikely that the thread could remain
> uncancelled forever, especially if the path was usable.
>
> thoughts?
Generally speaking, we're deeply in the realm of highly unlikely
situations I would say. But we should get it right eventually.
Maybe we can add logic to the tur thread to keep its hands off the
context if it's a "zombie", like below (just a thought, untested)?
Martin
diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index bf8486d3..e8493ca8 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -219,11 +219,18 @@ static void cleanup_func(void *data)
rcu_unregister_thread();
}
+#define tur_thread_quit_unless_owner(__ct) \
+ if (__ct->thread != pthread_self()) { \
+ pthread_mutex_unlock(&__ct->lock); \
+ pthread_exit(NULL); \
+ }
+
static void copy_msg_to_tcc(void *ct_p, const char *msg)
{
struct tur_checker_context *ct = ct_p;
pthread_mutex_lock(&ct->lock);
+ tur_thread_quit_unless_owner(ct);
strlcpy(ct->message, msg, sizeof(ct->message));
pthread_mutex_unlock(&ct->lock);
}
@@ -243,6 +250,7 @@ static void *tur_thread(void *ctx)
/* TUR checker start up */
pthread_mutex_lock(&ct->lock);
+ tur_thread_quit_unless_owner(ct);
ct->state = PATH_PENDING;
ct->message[0] = '\0';
pthread_mutex_unlock(&ct->lock);
@@ -252,6 +260,7 @@ static void *tur_thread(void *ctx)
/* TUR checker done */
pthread_mutex_lock(&ct->lock);
+ tur_thread_quit_unless_owner(ct);
ct->state = state;
pthread_cond_signal(&ct->active);
pthread_mutex_unlock(&ct->lock);
--
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSELinux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
More information about the dm-devel
mailing list