[dm-devel] [RFC PATCH 4/6] multipathd: cancel threads early during shutdown
Martin Wilck
mwilck at suse.com
Fri Aug 21 22:48:50 UTC 2020
On Fri, 2020-08-21 at 17:22 -0500, Benjamin Marzinski wrote:
> On Thu, Aug 20, 2020 at 10:39:19PM +0200, Martin Wilck wrote:
> > Hi Ben,
> >
> > I need to get back to this old discussion. I didn't resend this
> > patch
> > last year, because I tried to figure out how to solve the memory
> > leaks
> > you mentioned.
> >
> > On Thu, 2019-01-17 at 10:59 +0100, Martin Wilck wrote:
> > > On Wed, 2019-01-16 at 17:40 -0600, Benjamin Marzinski wrote:
> > > > On Fri, Jan 04, 2019 at 06:59:12PM +0100, Martin Wilck wrote:
> > > > > Cancel the other threads before taking vecs->lock. This
> > > > > avoids
> > > > > delays during shutdown caused e.g. by the checker thread
> > > > > holding
> > > > > the vecs lock.
> > > >
> > > > Before this change, multipathd was guaranteed that once a
> > > > thread
> > > > had
> > > > locked the vecs lock, and checked if it had been cancelled, it
> > > > could
> > > > not
> > > > be cancelled after that until it unlocked the vecs
> > > > lock. Undoing
> > > > this
> > > > guarantee will likely make it possible for multipathd to leak
> > > > memory
> > > > where it wasn't possible before.
> > >
> > > Thanks for pointing that out.
> > >
> > > I wasn't aware of this guarantee. In my latest valgrind tests,
> > > valgrind
> > > reported no leaks, but multipathd was also not "clean" in the
> > > sense
> > > that every chunk of memory malloc()d had been explicitly free()d
> > > at
> > > exit. IIRC that hadn't been caused by any patches added recently.
> > > I haven't had time to look at that further, I was satisfied with
> > > no
> > > real leaks being reported.
> > >
> > > We do free the global data structures in "vecs" when we exit. So
> > > any
> > > possible leaks caused by this patch must be cases where temporary
> > > memory is allocated without proper pthread_cleanup_push(), or
> > > where a
> > > thread was cancelled between allocation of memory and setting a
> > > reference to it in the global data structures - e.g. between
> > > allocation
> > > of a path and adding it to vecs->pathvec.
> > >
> > > I haven't audited either class of leaks. I believe the first
> > > class
> > > should have been eliminated by earlier phthread_cancel audits.
> > > Fixing
> > > the second class would require designing some really clever
> > > helpers,
> > > I
> > > guess. But as argued above, I really don't think it matters much
> > > if
> > > it
> > > concerns only leaks-at-shutdown.
> > >
> > > I'll put this on my todo list, but not at the highest prio.
> > > For this RFC, we need to decide whether it's more important to be
> > > leak-
> > > free on shutdown, or to react quickly on shutdown requests.
> >
> > Last year, I had started looking into this, and produced a first
> > patch,
> > bca3729 ("libmultipath: alias.c: prepare for cancel-safe
> > allocation").
> > I've just revisited this. I think what I did back then was broken.
> >
> > out:
> > + pthread_cleanup_push(free, alias);
> > fclose(f);
> > + pthread_cleanup_pop(0);
> >
> >
> > is confusing and hard to read. When I just saw it, several months
> > after
> > having written it myself, I thought it was a bug. Actually, it *is*
> > broken, as you pointed out already in
> > https://www.redhat.com/archives/dm-devel/2019-October/msg00211.html
> > ,
> > because we need to avoid leaking the fd, too.
> >
> > How would code look like that would protect both from the memory
> > and fd
> > leak due to cancellation? Maybe like this:
> >
> > char *get_user_friendly_alias(...)
> > {
> > char *alias = NULL;
> > ...
> >
> > /* simplified, we need a wrapper for fclose */
> > pthread_cleanup_push(fclose, f);
> >
> > id = lookup_binding(f, wwid, &alias, prefix);
> > if (id < 0)
> > goto out_fclose;
> >
> > pthread_cleanup_push(free, alias);
> > if (fflush(f) != 0) {
> > condlog(0, "cannot fflush bindings file stream : %s",
> > strerror(errno));
> > free(alias);
> > alias = NULL;
> > goto out_free;
> > } else if (can_write && !bindings_read_only && !alias)
> > alias = allocate_binding(fd, wwid, id, prefix);
> >
> > out_free:
> > /* This is necessary to preserve nesting */
> > pthread_cleanup_pop(0); /* free */
> > out_fclose:
> > pthread_cleanup_pop(0); /* fclose */
> > /* This is necessary because fclose() is a cancellation
> > point */
> > pthread_cleanup_push(free, alias);
> > fclose(f);
> > pthread_cleanup_pop(0); /* free */
> >
> > return alias;
> > }
> >
> > I hope you concur that this is awfully ugly. Everyone is invited to
> > find a solution that doesn't require 3x
> > pthread_cleanup_push()/pop(),
> > without completely rewriting the code.
>
> I assume you would do it something like:
>
> static void free_ptr(void *arg)
> {
> void *ptr;
>
> if (!arg)
> return
>
> ptr = *((void **) arg);
> if (ptr)
> free(ptr);
> }
>
> char *get_user_friendly_alias(...)
> {
> char *alias = NULL;
>
> pthread_cleanup_push(free_ptr, &alias);
> ...
>
> /* simplified, we need a wrapper for fclose */
> pthread_cleanup_push(fclose, f);
> ...
>
> out:
> pthread_cleanup_pop(1); /* fclose */
> pthread_cleanup_pop(0); /* free_ptr */
> return alias;
> }Regards
Hey, why didn't this occur to me? I literally shifted this code
back and forth for hours :-)
This actually looks acceptible, but...
>
> > IMO avoiding the fd leak is more important than avoiding the
> > memory
> > leak of "alias". I'm going to submit a patch that does exactly
> > that.
> >
> > In general: I think completely avoiding memory leaks in
> > multithreaded
> > code that allows almost arbitrary cancellation is not a worthwhile
> > goal. After all, except for the waiter and checker threads,
> > all other threads are only cancelled when multipathd exits. Yes,
> > this
> > makes it harder to assess potential memory leaks with valgrind,
> > because
> > we can't easily distinguish real memory leaks from leaks caused by
> > cancellation. But that's about it, and I think the distinction is
> > actually not that hard, because the leaks caused by cancellation
> > would
> > be sporadic, and wouldn't pile up during longer runs.
> >
> > So, I propose not to go further in this direction. IOW, we
> > shouldn't
> > write code like bca3729 any more. We don't have to avoid it at all
> > cost
> > (for example, it's always good to link allocated memory to some
> > global
> > data structure as soon as it makes sense to do so). But I think
> > that
> > "pthread_cleanup_push(free, xyz)" is often not worth the code
> > uglification it causes. If it conflicts with other cleanup actions,
> > and
> > can't be cleanly nested like above, we should definitely not do it.
>
> Sure. For threads that are only cancelled for shutdown, I'm fine with
> reasonable effort, balanced against other code considerations.
... we seem to be basically on the same boat here. Great!
Thanks,
Martin
>
> > Moreover, I believe that reacting quickly on cancellation / exit
> > requests is more important than avoiding cancellation-caused memory
> > leaks. Therefore I plan to resend the "cancel threads early" patch,
> > unless you come up with more strong reasons not to do so.
>
> go ahead.
>
> > Another possibility to "fix" cancellation issues and get rid of
> > ugly
> > pthread_cleanup_push() calls would be changing our cancellation
> > policy
> > to PTHREAD_CANCEL_DISABLE, and check for cancellation only at
> > certain
> > points during runtime (basically, before and after blocking /
> > waiting
> > on something). But I don't think that's going to work well, for the
> > same reason - we'd run high risk to get those multipathd shutdown
> > issues back which we've overcome only recently.
>
> I'm not a fan either. I'd rather deal with occasionally seeing fake
> memory leaks.
>
> -Ben
>
> > Thoughts?
> >
> > Martin
> >
> >
> >
More information about the dm-devel
mailing list