[dm-devel] [RFC PATCH] multipathd: Don't keep starting TUR threads, if they always hang.

Wed Mar 30 16:32:12 UTC 2022

On Wed, 2022-03-30 at 09:34 -0500, Benjamin Marzinski wrote:
> On Wed, Mar 30, 2022 at 09:44:39AM +0000, Martin Wilck wrote:
> 
> > 
> > Why did you choose 1? Perhaps we should make a few more attempts?
> 
> Oops. I forgot to include the Notes when formatting my patch (I need
> to
> make that the default). Here they are:
> 
> Notes:
> 
>     I'd be interested in knowing what people think of this solution.
> I'm
>     open to making multipathd start more threads before it gives up.
> We
>     also could make multipathd save the contexts from the stalled
>     threads, so that when it stops creating new ones, instead of just
>     waiting for the last thread to complete, it could start up again

I wouldn't recommend that. We drop all references to the old
context for a good reason: to be sure there are no races when the
hanging thread eventually does exit. Keeping such references would re-
open a Pandora's box which we sealed and closed in 2018.

> as
>     soon as any of the outstanding threads completed.  We could also
>     consider not stopping creating new threads entirely, but instead
>     having a delay before we create a new checker thread, where we
> wait
>     for the last thread to complete.

I wouldn't do this, either. The hang check is not done immediately
after cancelling the thread, but in the following libcheck_check()
invocation. That means there has been some delay already when we do the
check, at least a second. And this cancellation happened after the SCSI
timeout expired, anyway. How long are we going to wait for the normally
instanteneous cancellation to complete? I like the "just forget about
this thread" attitude which has saved us a lot of trouble lately, IMO.

>   If the delay doubled after evey
>     consecutive timeout, the number of threads created would stay at
> a
>     more reasonable level until someone got around to looking into
> what
>     was going wrong. Thoughts?
> 
> But to answer your question, there was no obvious number to choose,
> and
> you can make the case that if it fails once, that's a fluke. If it
> fails
> twice in a row, then it's likely to keep faiing.

That makes sense. I'm fine with the patch.

>  But I'm fine with
> picking a bigger number, or any of the other possibilities I listed.
> I
> just wanted to throw something out as a starting point.
> 
> -Ben
> 
> 
> > Other than that, this looks ok to me (assuming you've tested with
> > the
> > code from bdf55d6, or something similar).
> 
> Yep. I tested it with the zombie tur checker tests, with different
> sleep seconds and intervals.

Reviewed-by: Martin Wilck <mwilck at suse.com>