[dm-devel] [PATCH 1/3] multipathd: fix waiter thread cancelling
Martin Wilck
mwilck at suse.com
Mon Mar 19 12:59:24 UTC 2018
On Fri, 2018-03-16 at 16:31 -0500, Benjamin Marzinski wrote:
> multipathd was sending a signal to the per-device waiter thread after
> cancelling it. Since these threads are detached, there is a window
> between the cancel and the signalling where a new thread could have
> started with the old thread id, causing the signal to be delivered to
> the wrong thread. Simply reversing the order doesn't fix the issue,
> since the waiter threads exit immediately if they receive a signal,
> again opening a window for the cancel to be delivered to the wrong
> thread.
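
For illustration, a minimal sketch of the thread-id reuse race
described above (hypothetical names, not code from the patch):

    #include <pthread.h>
    #include <signal.h>
    #include <unistd.h>

    /* Detached waiter: exits as soon as cancellation is acted on. */
    static void *waiter_fn(void *arg)
    {
            for (;;)
                    pause();        /* pause() is a cancellation point */
            return NULL;
    }

    static void broken_stop(pthread_t waiter)
    {
            pthread_cancel(waiter);
            /*
             * The detached thread may already be gone here, and its
             * pthread_t may have been reused by a newly started thread,
             * so the signal can be delivered to the wrong thread.
             */
            pthread_kill(waiter, SIGUSR2);
    }

Reversing the two calls narrows the window but does not close it,
because, as described above, SIGUSR2 alone makes the waiter exit.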
>
> To fix this, multipathd does reverse the order, so that it signals the
> thread first (it needs to signal the thread, since the dm_task_run
> ioctl isn't a cancellation point) and then cancels it. However, it
> does this while holding a new mutex.
>
> The waiter thread can only exit without being cancelled for two
> reasons:
> 1. When it fails in update_multipath, which removes the device while
>    holding the vecs lock.
> 2. If it receives a SIGUSR2 signal while waiting for a dm event.
>
> Case 1 can never race with another thread removing the device, since
> removing a device always happens while holding the vecs lock. This
> means that if the device exists to be removed, then the waiter thread
> can't exit this way during the removal.
>
> Case 2 is now solved by grabbing the new mutex after dm_task_run()
> fails. With the mutex held, the thread checks whether it has been
> cancelled. If it wasn't cancelled, the thread continues.
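
To see why the mutex closes the window, here is a simplified sketch of
the two sides of the handshake (plain pthreads with hypothetical names;
the patch itself uses multipathd's own lock helpers):

    #include <pthread.h>
    #include <signal.h>

    static pthread_mutex_t waiter_mtx = PTHREAD_MUTEX_INITIALIZER;

    static void unlock_mtx(void *arg)
    {
            pthread_mutex_unlock(arg);
    }

    /* Stopping side: signal, then cancel, all under the mutex. */
    static void stop_side(pthread_t thread)
    {
            pthread_cleanup_push(unlock_mtx, &waiter_mtx);
            pthread_mutex_lock(&waiter_mtx);
            pthread_kill(thread, SIGUSR2);  /* interrupt the dm ioctl */
            pthread_cancel(thread);
            pthread_cleanup_pop(1);         /* pop handler; unlocks */
    }

    /*
     * Waiter side, after an interrupted wait: take the same mutex and
     * poll for cancellation.  Either the stop side has already run and
     * pthread_testcancel() terminates this thread (the cleanup handler
     * releases the mutex), or the stop side hasn't run yet and the
     * waiter simply reschedules, so the later SIGUSR2/cancel pair will
     * find it back in a cancellable state.
     */
    static void waiter_side(void)
    {
            pthread_cleanup_push(unlock_mtx, &waiter_mtx);
            pthread_mutex_lock(&waiter_mtx);
            pthread_testcancel();
            pthread_cleanup_pop(1);         /* pop handler; unlocks */
    }

Because the stop side holds the mutex across both calls, a waiter woken
by the signal cannot reach pthread_testcancel() until the cancellation
request has also been issued.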
>
> The reason that this uses a new mutex, instead of the vecs lock, is
> that using the vecs lock would keep the thread from ending until the
> vecs lock was released. Normally, this isn't a problem. But during
> reconfigure, the waiter threads for all devices are stopped, and new
> ones started, all while holding the vecs lock. For systems with a
> large number of multipath devices, this will cause multipathd to have
> double its already large number of waiter threads during reconfigure,
> all locked into memory.
>
> Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> ---
> multipathd/waiter.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/multipathd/waiter.c b/multipathd/waiter.c
> index cb9708b..e894294 100644
> --- a/multipathd/waiter.c
> +++ b/multipathd/waiter.c
> @@ -23,6 +23,7 @@
>  #include "waiter.h"
>
>  pthread_attr_t waiter_attr;
> +struct mutex_lock waiter_lock = { .mutex = PTHREAD_MUTEX_INITIALIZER };
>
>  static struct event_thread *alloc_waiter (void)
>  {
> @@ -59,8 +60,12 @@ void stop_waiter_thread (struct multipath *mpp, struct vectors *vecs)
>  		mpp->waiter);
>  	thread = mpp->waiter;
>  	mpp->waiter = (pthread_t)0;
> -	pthread_cancel(thread);
> +	pthread_cleanup_push(cleanup_lock, &waiter_lock);
> +	lock(&waiter_lock);
> +	pthread_testcancel();
What's the purpose of this pthread_testcancel() call?
>  	pthread_kill(thread, SIGUSR2);
> +	pthread_cancel(thread);
> +	lock_cleanup_pop(&waiter_lock);
>  }
>
> /*
> @@ -114,8 +119,13 @@ static int waiteventloop (struct event_thread *waiter)
>  	dm_task_destroy(waiter->dmt);
>  	waiter->dmt = NULL;
>
> -	if (!r)	/* wait interrupted by signal */
> -		return -1;
> +	if (!r) { /* wait interrupted by signal. check for cancellation */
> +		pthread_cleanup_push(cleanup_lock, &waiter_lock);
> +		lock(&waiter_lock);
> +		pthread_testcancel();
> +		lock_cleanup_pop(&waiter_lock);
Nitpick: I'd prefer pthread_cleanup_pop(1) here.
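
For reference, pthread_cleanup_pop(1) pops the handler pushed by the
matching pthread_cleanup_push() and runs it, so the lock is released on
the normal path as well as on cancellation. A minimal sketch, assuming
cleanup_lock simply unlocks the lock it is passed:

    pthread_cleanup_push(cleanup_lock, &waiter_lock);
    lock(&waiter_lock);
    pthread_testcancel();
    pthread_cleanup_pop(1);   /* pop and run cleanup_lock: unlocks */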
Regards,
Martin
> +		return 1;	/* If we weren't cancelled, just reschedule */
> +	}
>
>  	waiter->event_nr++;
>
--
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)