[dm-devel] [RFC PATCH] multipathd: Don't keep starting TUR threads, if they always hang.
Martin Wilck
martin.wilck at suse.com
Wed Mar 30 09:44:39 UTC 2022
On Tue, 2022-03-29 at 22:22 -0500, Benjamin Marzinski wrote:
> If tur thead hangs, multipathd was simply creating a new thread, and
> assuming that the old thread would get cleaned up eventually. I have
> seen a case recently where there were 26000 multipathd threads on a
> system, all stuck trying to send TUR commands to path devices. The
> root
> cause of the issue was a scsi kernel issue, but it shows that the way
> multipathd currently deals with stuck threads could use some
> refinement.
Alright, let's start the discussion about dead TUR threads again ;-)
For my own records, here are links to some previous ML threads:
https://listman.redhat.com/archives/dm-devel/2018-September/036102.html
https://listman.redhat.com/archives/dm-devel/2018-October/036156.html
https://listman.redhat.com/archives/dm-devel/2018-October/036159.html
https://listman.redhat.com/archives/dm-devel/2018-October/036240.html
https://listman.redhat.com/archives/dm-devel/2018-October/036362.html
Have you assessed why the SCSI issues were causing indefinite hangs?
Was it just the TUR command hanging forever in the kernel, or some
other issue?
I have seen similar issues in the past; they were also related to SCSI
problems, IIRC. I used to think that this is just a symptom: The "dead"
threads are detached and consume a minimal amount of non-shared memory
(they are a pain when analyzing a crash dump, though).
But I agree, just continuing to start new threads forever isn't
optimal.
>
> Now, when one tur thread hangs, multipathd will act as it did before.
> If a second one in a row hangs, multipathd will instead wait for it
> to
> complete before starting another thread. Once the thread completes,
> the
> count is reset.
>
> Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> ---
> libmultipath/checkers/tur.c | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/libmultipath/checkers/tur.c
> b/libmultipath/checkers/tur.c
> index c93e4625..1bcb7576 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -26,6 +26,7 @@
>
> #define TUR_CMD_LEN 6
> #define HEAVY_CHECK_COUNT 10
> +#define MAX_NR_TIMEOUTS 1
Why did you choose 1? Perhaps we should make a few more attempts?
Other than that, this looks ok to me (assuming you've tested with the
code from bdf55d6, or something similar).
Martin
>
> enum {
> MSG_TUR_RUNNING = CHECKER_FIRST_MSGID,
> @@ -54,6 +55,7 @@ struct tur_checker_context {
> int holders; /* uatomic access only */
> int msgid;
> struct checker_context ctx;
> + unsigned int nr_timeouts;
> };
>
> int libcheck_init (struct checker * c)
> @@ -358,8 +360,23 @@ int libcheck_check(struct checker * c)
> }
> } else {
> if (uatomic_read(&ct->holders) > 1) {
> + /* The thread has been cancelled but hasn't
> quit. */
> + if (ct->nr_timeouts == MAX_NR_TIMEOUTS) {
> + condlog(2, "%d:%d : waiting for
> stalled tur thread to finish",
> + major(ct->devt), minor(ct-
> >devt));
> + ct->nr_timeouts++;
> + }
> /*
> - * The thread has been cancelled but hasn't
> quit.
> + * Don't start new threads until the last
> once has
> + * finished.
> + */
> + if (ct->nr_timeouts > MAX_NR_TIMEOUTS) {
> + c->msgid = MSG_TUR_TIMEOUT;
> + return PATH_TIMEOUT;
> + }
> + ct->nr_timeouts++;
> + /*
> + * Start a new thread while the old one is
> stalled.
> * We have to prevent it from interfering
> with the new
> * thread. We create a new context and leave
> the old
> * one with the stale thread, hoping it will
> clean up
> @@ -375,13 +392,15 @@ int libcheck_check(struct checker * c)
> */
> if (libcheck_init(c) != 0)
> return PATH_UNCHECKED;
> + ((struct tur_checker_context *)c->context)-
> >nr_timeouts = ct->nr_timeouts;
>
> if (!uatomic_sub_return(&ct->holders, 1))
> /* It did terminate, eventually */
> cleanup_context(ct);
>
> ct = c->context;
> - }
> + } else
> + ct->nr_timeouts = 0;
> /* Start new TUR checker */
> pthread_mutex_lock(&ct->lock);
> tur_status = ct->state = PATH_PENDING;
More information about the dm-devel
mailing list