[dm-devel] [PATCH 5/5] libmultipath: don't dlclose tur checker DSO

Benjamin Marzinski bmarzins at redhat.com
Wed Nov 4 23:27:24 UTC 2020


On Wed, Nov 04, 2020 at 10:39:39PM +0000, Martin Wilck wrote:
> On Mon, 2020-11-02 at 12:27 -0600, Benjamin Marzinski wrote:
> > On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote:
> > > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote:
> > > > The multipathd tur checker thread is designed to be able to
> > > > finish at
> > > > any time, even after the tur checker itself has been freed. The
> > > > multipathd shutdown code makes sure all the checkers have been
> > > > freed
> > > > before freeing the checker_class and calling dlclose() to unload
> > > > the
> > > > DSO, but this doesn't guarantee that the checker threads have
> > > > finished.
> > > > If one hasn't, the DSO will get unloaded while the thread still
> > > > running
> > > > code from it, causing a segfault. Unfortunately, it's not
> > > > possible to
> > > > be
> > > > sure that all tur checker threads have ended during shutdown,
> > > > without
> > > > making them joinable.
> > > > 
> > > > However, since libmultipath will never be reinitialized after it
> > > > has
> > > > been uninitialzed, not dlclosing the tur checker DSO once a
> > > > thread is
> > > > started has minimal cost (keeping the DSO code around until the
> > > > program
> > > > exits, which usually happens right after freeing the checkers).
> > > 
> > > I'm not against this, but have you considered using an
> > > atomic  refcount
> > > for the DSO? With every tur thread starting, we could increase it,
> > > and
> > > decrease it in the cleanup function of the thread when it exits.
> > > That
> > > should be safe. If the refcount was positive when we exit, we could
> > > refrain from unloading the DSO.
> > 
> > I tried exactly that, and I would still get crashes, even if it put
> > the
> > code that decrements the atomic variable in a function that's not
> > part
> > of the DSO, and put a pthread_exit() before the final
> > pthread_cleanup_pop() that would decrement it in tur_thread, so that
> > after the cleanup code is called the thread should never return to
> > code
> > that is in the DSO. I had to add sleeps in code to force various
> > orderings, but I couldn't find any way that worked for all possible
> > orderings.  I would love it if this worked, and you're free to try,
> > but
> > I could not get this method to work.
> 
> I've experimented a bit with a trivial test program, and I found that
> it worked stably if decrementing the refcount is really the last thing
> thread's cleanup function does. Could you provide some details about
> the sleeps that you'd put in that made this approach fail?

I believe the situation that continued to crash was where the
tur_thread() exitted naturally (so it set running to 0), although I'm
not sure that this is necessary, or if it would still crash when running
the cleanup function because of a cancel.  I put the cleanup function to
decrement the count in libmultipath, so that it wasn't part of the DSO,
and then I put a sleep(5) as the last line of the cleanup function, and
a sleep(10) as the last line of cleanup_checkers(). I also had to set
running to 0 before starting the thread, and then take out the code to
pause the thread if running was aleady 0, to make sure the thread acted
like it was the one to set running to 0. Then the idea is to stop
multipathd while there is a thread in its sleep period, so that
multipathd sees that the counter is 0, and closes the dso, and then
the thread finishes before multipathd shuts the rest of the way down. 

-Ben

> Martin
> 
> -- Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107 SUSE
> Software Solutions Germany GmbH HRB 36809, AG Nürnberg GF: Felix
> Imendörffer
> 




More information about the dm-devel mailing list