[dm-devel] [RFC PATCH] multipathd: Don't keep starting TUR threads, if they always hang.
Martin Wilck
martin.wilck at suse.com
Wed Mar 30 16:32:12 UTC 2022
On Wed, 2022-03-30 at 09:34 -0500, Benjamin Marzinski wrote:
> On Wed, Mar 30, 2022 at 09:44:39AM +0000, Martin Wilck wrote:
>
> >
> > Why did you choose 1? Perhaps we should make a few more attempts?
>
> Oops. I forgot to include the Notes when formatting my patch (I need
> to make that the default). Here they are:
>
> Notes:
>
> I'd be interested in knowing what people think of this solution. I'm
> open to making multipathd start more threads before it gives up. We
> also could make multipathd save the contexts from the stalled
> threads, so that when it stops creating new ones, instead of just
> waiting for the last thread to complete, it could start up again
I wouldn't recommend that. We drop all references to the old
context for a good reason: to be sure there are no races when the
hanging thread eventually does exit. Keeping such references would
re-open a Pandora's box which we sealed and closed in 2018.
> as soon as any of the outstanding threads completed. We could also
> consider not stopping creating new threads entirely, but instead
> having a delay before we create a new checker thread, where we wait
> for the last thread to complete.
I wouldn't do this, either. The hang check is not done immediately
after cancelling the thread, but in the following libcheck_check()
invocation. That means there has been some delay already when we do the
check, at least a second. And this cancellation happened after the SCSI
timeout expired, anyway. How long are we going to wait for the normally
instantaneous cancellation to complete? I like the "just forget about
this thread" attitude which has saved us a lot of trouble lately, IMO.
> If the delay doubled after every consecutive timeout, the number of
> threads created would stay at a more reasonable level until someone
> got around to looking into what was going wrong. Thoughts?
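
For illustration only, the doubling delay described in the quoted notes
could look roughly like the sketch below. This is not code from the
patch or from multipath-tools: struct tur_backoff, BACKOFF_MAX_TICKS and
the backoff_*() helpers are hypothetical names, and the cap of 64 checker
ticks is an arbitrary assumption.

```c
#include <stdbool.h>

/* Hypothetical per-path backoff state; not part of multipath-tools. */
struct tur_backoff {
	unsigned int consecutive_timeouts; /* timeouts since last completed check */
	unsigned int wait_ticks;           /* ticks left before a new thread may start */
};

/* Assumed upper bound for the delay, in checker ticks. */
#define BACKOFF_MAX_TICKS 64u

/* Delay after n consecutive timeouts: 1, 2, 4, ... capped at the maximum. */
static unsigned int backoff_delay(unsigned int n)
{
	unsigned int shift;

	if (n == 0)
		return 0;
	shift = n - 1;
	if (shift >= 6) /* 1 << 6 == BACKOFF_MAX_TICKS */
		return BACKOFF_MAX_TICKS;
	return 1u << shift;
}

/* A checker thread timed out: double the delay before the next one. */
static void backoff_on_timeout(struct tur_backoff *b)
{
	b->consecutive_timeouts++;
	b->wait_ticks = backoff_delay(b->consecutive_timeouts);
}

/* A checker thread completed: drop the delay entirely. */
static void backoff_on_completion(struct tur_backoff *b)
{
	b->consecutive_timeouts = 0;
	b->wait_ticks = 0;
}

/* Called once per checker tick; true if a new thread may be started now. */
static bool backoff_may_start(struct tur_backoff *b)
{
	if (b->wait_ticks > 0) {
		b->wait_ticks--;
		return false;
	}
	return true;
}
```

With these assumptions the number of abandoned threads grows only
logarithmically with elapsed time until the cap is reached.
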
>
> But to answer your question, there was no obvious number to choose,
> and you can make the case that if it fails once, that's a fluke. If it
> fails twice in a row, then it's likely to keep failing.
That makes sense. I'm fine with the patch.
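
To make the "once is a fluke, twice in a row is a pattern" rule
concrete, here is a minimal sketch. It is not the code from the patch:
struct tur_hang_state, MAX_CONSECUTIVE_HANGS and both helpers are
hypothetical names, and the limit of 1 simply mirrors the value
discussed above.

```c
#include <stdbool.h>

/* Assumed limit: one hung thread is tolerated, a second in a row is not. */
#define MAX_CONSECUTIVE_HANGS 1u

/* Hypothetical per-checker state; not part of multipath-tools. */
struct tur_hang_state {
	unsigned int nr_hangs; /* consecutive threads that failed to exit */
};

/*
 * Record that the previous checker thread did not exit after being
 * cancelled. Returns true if a replacement thread may still be started,
 * false if the checker should only wait for the last thread to finish.
 */
static bool tur_note_hang(struct tur_hang_state *s)
{
	s->nr_hangs++;
	return s->nr_hangs <= MAX_CONSECUTIVE_HANGS;
}

/* A checker thread finished normally: the streak of hangs is over. */
static void tur_note_completion(struct tur_hang_state *s)
{
	s->nr_hangs = 0;
}
```

Raising the limit would allow more attempts before giving up, at the
cost of more abandoned threads.
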
> But I'm fine with picking a bigger number, or any of the other
> possibilities I listed. I just wanted to throw something out as a
> starting point.
>
> -Ben
>
>
> > Other than that, this looks ok to me (assuming you've tested with
> > the code from bdf55d6, or something similar).
>
> Yep. I tested it with the zombie tur checker tests, with different
> sleep seconds and intervals.
Reviewed-by: Martin Wilck <mwilck at suse.com>