[dm-devel] [PATCH 5/5] libmultipath: don't dlclose tur checker DSO

Martin Wilck martin.wilck at suse.com
Wed Nov 4 23:56:07 UTC 2020


On Wed, 2020-11-04 at 17:27 -0600, Benjamin Marzinski wrote:
> On Wed, Nov 04, 2020 at 10:39:39PM +0000, Martin Wilck wrote:
> > On Mon, 2020-11-02 at 12:27 -0600, Benjamin Marzinski wrote:
> > > On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote:
> > > > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote:
> > > > > The multipathd tur checker thread is designed to be able to
> > > > > finish at
> > > > > any time, even after the tur checker itself has been freed.
> > > > > The
> > > > > multipathd shutdown code makes sure all the checkers have
> > > > > been
> > > > > freed
> > > > > before freeing the checker_class and calling dlclose() to
> > > > > unload
> > > > > the
> > > > > DSO, but this doesn't guarantee that the checker threads have
> > > > > finished.
> > > > > If one hasn't, the DSO will get unloaded while the thread
> > > > > still
> > > > > running
> > > > > code from it, causing a segfault. Unfortunately, it's not
> > > > > possible to
> > > > > be
> > > > > sure that all tur checker threads have ended during shutdown,
> > > > > without
> > > > > making them joinable.
> > > > > 
> > > > > However, since libmultipath will never be reinitialized after
> > > > > it
> > > > > has
> > > > > been uninitialzed, not dlclosing the tur checker DSO once a
> > > > > thread is
> > > > > started has minimal cost (keeping the DSO code around until
> > > > > the
> > > > > program
> > > > > exits, which usually happens right after freeing the
> > > > > checkers).
> > > > 
> > > > I'm not against this, but have you considered using an
> > > > atomic  refcount
> > > > for the DSO? With every tur thread starting, we could increase
> > > > it,
> > > > and
> > > > decrease it in the cleanup function of the thread when it
> > > > exits.
> > > > That
> > > > should be safe. If the refcount was positive when we exit, we
> > > > could
> > > > refrain from unloading the DSO.
> > > 
> > > I tried exactly that, and I would still get crashes, even if it
> > > put
> > > the
> > > code that decrements the atomic variable in a function that's not
> > > part
> > > of the DSO, and put a pthread_exit() before the final
> > > pthread_cleanup_pop() that would decrement it in tur_thread, so
> > > that
> > > after the cleanup code is called the thread should never return
> > > to
> > > code
> > > that is in the DSO. I had to add sleeps in code to force various
> > > orderings, but I couldn't find any way that worked for all
> > > possible
> > > orderings.  I would love it if this worked, and you're free to
> > > try,
> > > but
> > > I could not get this method to work.
> > 
> > I've experimented a bit with a trivial test program, and I found
> > that
> > it worked stably if decrementing the refcount is really the last
> > thing
> > thread's cleanup function does. Could you provide some details
> > about
> > the sleeps that you'd put in that made this approach fail?
> 
> I believe the situation that continued to crash was where the
> tur_thread() exitted naturally (so it set running to 0), although I'm
> not sure that this is necessary, or if it would still crash when
> running
> the cleanup function because of a cancel.  I put the cleanup function
> to
> decrement the count in libmultipath, so that it wasn't part of the
> DSO,
> and then I put a sleep(5) as the last line of the cleanup function,
> and
> a sleep(10) as the last line of cleanup_checkers(). I also had to set
> running to 0 before starting the thread, and then take out the code
> to
> pause the thread if running was aleady 0, to make sure the thread
> acted
> like it was the one to set running to 0. Then the idea is to stop
> multipathd while there is a thread in its sleep period, so that
> multipathd sees that the counter is 0, and closes the dso, and then
> the thread finishes before multipathd shuts the rest of the way
> down. 

I guess the key is that the thread's entry point must also be in
libmultipath (i.e. outside the DSO). In pseudo-code:

entrypoint() {
   refcount++;
   pthread_cleanup_push(refcount--);
   tur_thread(ct);
   pthread_cleanup_pop(1);
}

This way the thread can't be in DSO code any more when refcount goes to
zero.

Martin

-- 
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE  Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix
Imendörffer






More information about the dm-devel mailing list