[dm-devel] [PATCH 5/5] libmultipath: don't dlclose tur checker DSO

Benjamin Marzinski bmarzins at redhat.com
Thu Nov 5 00:41:57 UTC 2020


On Wed, Nov 04, 2020 at 11:56:07PM +0000, Martin Wilck wrote:
> On Wed, 2020-11-04 at 17:27 -0600, Benjamin Marzinski wrote:
> > On Wed, Nov 04, 2020 at 10:39:39PM +0000, Martin Wilck wrote:
> > > On Mon, 2020-11-02 at 12:27 -0600, Benjamin Marzinski wrote:
> > > > On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote:
> > > > > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote:
> > > > > > The multipathd tur checker thread is designed to be able to
> > > > > > finish at
> > > > > > any time, even after the tur checker itself has been freed.
> > > > > > The
> > > > > > multipathd shutdown code makes sure all the checkers have
> > > > > > been
> > > > > > freed
> > > > > > before freeing the checker_class and calling dlclose() to
> > > > > > unload
> > > > > > the
> > > > > > DSO, but this doesn't guarantee that the checker threads have
> > > > > > finished.
> > > > > > If one hasn't, the DSO will get unloaded while the thread
> > > > > > still
> > > > > > running
> > > > > > code from it, causing a segfault. Unfortunately, it's not
> > > > > > possible to
> > > > > > be
> > > > > > sure that all tur checker threads have ended during shutdown,
> > > > > > without
> > > > > > making them joinable.
> > > > > > 
> > > > > > However, since libmultipath will never be reinitialized after
> > > > > > it
> > > > > > has
> > > > > > been uninitialzed, not dlclosing the tur checker DSO once a
> > > > > > thread is
> > > > > > started has minimal cost (keeping the DSO code around until
> > > > > > the
> > > > > > program
> > > > > > exits, which usually happens right after freeing the
> > > > > > checkers).
> > > > > 
> > > > > I'm not against this, but have you considered using an
> > > > > atomic  refcount
> > > > > for the DSO? With every tur thread starting, we could increase
> > > > > it,
> > > > > and
> > > > > decrease it in the cleanup function of the thread when it
> > > > > exits.
> > > > > That
> > > > > should be safe. If the refcount was positive when we exit, we
> > > > > could
> > > > > refrain from unloading the DSO.
> > > > 
> > > > I tried exactly that, and I would still get crashes, even if it
> > > > put
> > > > the
> > > > code that decrements the atomic variable in a function that's not
> > > > part
> > > > of the DSO, and put a pthread_exit() before the final
> > > > pthread_cleanup_pop() that would decrement it in tur_thread, so
> > > > that
> > > > after the cleanup code is called the thread should never return
> > > > to
> > > > code
> > > > that is in the DSO. I had to add sleeps in code to force various
> > > > orderings, but I couldn't find any way that worked for all
> > > > possible
> > > > orderings.  I would love it if this worked, and you're free to
> > > > try,
> > > > but
> > > > I could not get this method to work.
> > > 
> > > I've experimented a bit with a trivial test program, and I found
> > > that
> > > it worked stably if decrementing the refcount is really the last
> > > thing
> > > thread's cleanup function does. Could you provide some details
> > > about
> > > the sleeps that you'd put in that made this approach fail?
> > 
> > I believe the situation that continued to crash was where the
> > tur_thread() exitted naturally (so it set running to 0), although I'm
> > not sure that this is necessary, or if it would still crash when
> > running
> > the cleanup function because of a cancel.  I put the cleanup function
> > to
> > decrement the count in libmultipath, so that it wasn't part of the
> > DSO,
> > and then I put a sleep(5) as the last line of the cleanup function,
> > and
> > a sleep(10) as the last line of cleanup_checkers(). I also had to set
> > running to 0 before starting the thread, and then take out the code
> > to
> > pause the thread if running was aleady 0, to make sure the thread
> > acted
> > like it was the one to set running to 0. Then the idea is to stop
> > multipathd while there is a thread in its sleep period, so that
> > multipathd sees that the counter is 0, and closes the dso, and then
> > the thread finishes before multipathd shuts the rest of the way
> > down. 
> 
> I guess the key is that the thread's entry point must also be in
> libmultipath (i.e. outside the DSO). In pseudo-code:
> 
> entrypoint() {
>    refcount++;
>    pthread_cleanup_push(refcount--);
>    tur_thread(ct);
>    pthread_cleanup_pop(1);
> }
> 
> This way the thread can't be in DSO code any more when refcount goes to
> zero.

Oh! I didn't think of solving it that way, but it makes sense. So, were
you planning on posting a patch?

-Ben


> 
> Martin
> 
> -- 
> Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
> SUSE  Software Solutions Germany GmbH
> HRB 36809, AG Nürnberg GF: Felix
> Imendörffer
> 




More information about the dm-devel mailing list