[dm-devel] [PATCH 31/35] multipathd: uxlsnr: add idle notification

Benjamin Marzinski bmarzins at redhat.com
Thu Sep 16 16:10:58 UTC 2021


On Thu, Sep 16, 2021 at 05:54:14PM +0200, Martin Wilck wrote:
> On Thu, 2021-09-16 at 10:06 -0500, Benjamin Marzinski wrote:
> > On Thu, Sep 16, 2021 at 10:54:19AM +0200, Martin Wilck wrote:
> > > On Wed, 2021-09-15 at 23:14 -0500, Benjamin Marzinski wrote:
> > > > On Fri, Sep 10, 2021 at 01:41:16PM +0200, mwilck at suse.com wrote:
> > > > > From: Martin Wilck <mwilck at suse.com>
> > > > > 
> > > > > The previous patches added the state machine and the timeout
> > > > > handling, but there was no wakeup mechanism for the uxlsnr for
> > > > > cases where client connections were waiting for the vecs lock.
> > > > > 
> > > > > This patch uses the previously introduced wakeup mechanism of
> > > > > struct mutex_lock for this purpose. Processes which unlock the
> > > > > "global" vecs lock send an event on an eventfd which the uxlsnr
> > > > > loop is polling for.
> > > > > 
> > > > > As we are now woken up to service client handlers that are not
> > > > > waiting for input but for the lock, we need to set up the
> > > > > pollfds differently, and iterate over all clients when handling
> > > > > events, not only over the ones that are receiving. The hangup
> > > > > handling is changed, too: we have to look at every client, even
> > > > > if one has hung up. Note that I don't take client_lock for the
> > > > > loop in uxsock_listen(); it's not necessary and will be removed
> > > > > elsewhere in a follow-up patch.
> > > > > 
> > > > > With this in place, the lock need not be taken in
> > > > > execute_handler() any more. The uxlsnr only ever calls trylock()
> > > > > on the vecs lock, avoiding any waiting for other threads to
> > > > > finish.
> > > > > Signed-off-by: Martin Wilck <mwilck at suse.com>
> > > > > ---
> > > > >  multipathd/uxlsnr.c | 211 ++++++++++++++++++++++++++++++--------------
> > > > >  1 file changed, 143 insertions(+), 68 deletions(-)
> > > > > 
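
For illustration only, a minimal sketch of the kind of wakeup mechanism
described above might look like the code below. The struct layout and
helper names are guesses for the sake of the example, not the actual
lock.h/uxlsnr.c code; the point is just that the unlock path writes to an
eventfd which the uxlsnr can include in its ppoll() set.

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Sketch only: hypothetical stand-in for multipathd's struct mutex_lock */
struct mutex_lock {
	pthread_mutex_t mutex;
	int wakeup_fd;		/* eventfd the uxlsnr ppoll()s on */
};

static int init_lock(struct mutex_lock *lck)
{
	pthread_mutex_init(&lck->mutex, NULL);
	lck->wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
	return lck->wakeup_fd < 0 ? -1 : 0;
}

static void lock(struct mutex_lock *lck)
{
	pthread_mutex_lock(&lck->mutex);
}

static void unlock(struct mutex_lock *lck)
{
	static const uint64_t one = 1;
	ssize_t rc;

	pthread_mutex_unlock(&lck->mutex);
	/* Kick the poller. A failed write (EAGAIN) only means the eventfd
	 * counter is already non-zero, so the poller will wake up anyway;
	 * it is safe to ignore. */
	rc = write(lck->wakeup_fd, &one, sizeof(one));
	(void)rc;
}
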
> > > 
> > > > 
> > > > I do worry that if, for instance, a lot of uevents are coming in,
> > > > this could starve the uxlsnr thread: other threads keep grabbing
> > > > and releasing the vecs lock, but since it is usually held, the
> > > > uxlsnr thread may never attempt the grab at a moment when the lock
> > > > is actually free, and it will keep losing its place in line. Also,
> > > > every time the vecs lock is dropped between ppoll() calls, a wakeup
> > > > gets triggered, even if the lock has been grabbed by something else
> > > > before the ppoll thread runs.
> > > 
> > > I've thought about this too. It's true that the ppoll() ->
> > > pthread_mutex_trylock() sequence will never acquire the lock if some
> > > other thread calls lock() at the same time.
> > > 
> > > If multiple threads call lock(), the "winner" of the lock is random.
> > > Thus, in a way, this change actually adds some predictability: the
> > > uxlsnr will step back if some other thread is actively trying to
> > > grab the lock. IMO that's the right thing to do in almost all
> > > situations.
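
Just to make that sequence concrete, a minimal sketch of the listener
side might look like this (the names vecs_mutex, wakeup_idx and
service_lock_waiters() are made up for the example; this is not
uxlsnr.c): ppoll() on the client fds plus the lock's wakeup eventfd,
drain the eventfd, and then trylock() instead of ever blocking on the
vecs lock.

#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Sketch only: hypothetical globals standing in for vecs->lock and its
 * non-blocking wakeup eventfd, written to by the unlock path. */
extern pthread_mutex_t vecs_mutex;

/* Hypothetical helper: serve all clients that were only waiting for the
 * lock, not for input. */
extern void service_lock_waiters(void);

static void listen_loop(struct pollfd *pfds, nfds_t nfds, int wakeup_idx)
{
	for (;;) {
		uint64_t cnt;

		if (ppoll(pfds, nfds, NULL, NULL) < 0 && errno != EINTR)
			break;

		/* drain the eventfd so the next unlock() wakes us again */
		if (pfds[wakeup_idx].revents & POLLIN)
			while (read(pfds[wakeup_idx].fd, &cnt, sizeof(cnt)) > 0)
				;

		/* ... service clients that have data to read here ... */

		/* Never block on the vecs lock: if another thread holds it
		 * or wins the race, give up and wait for the next wakeup. */
		if (pthread_mutex_trylock(&vecs_mutex) == 0) {
			service_lock_waiters();
			pthread_mutex_unlock(&vecs_mutex);
		}
	}
}
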
> > > 
> > > We don't need to worry about "thundering herd" issues because the
> > > number of threads that might wait on the lock is rather small. In the
> > > worst case, 3 threads (checker, dmevents handler, and uevent
> > > dispatcher, plus the uxlsnr in ppoll()) wait for the lock at the same
> > > time. Usually one of them will have it grabbed. On systems that lack
> > > dmevent polling, the number of waiter threads may be higher, but
> > > AFAICS it's a very rare condition to have hundreds of dmevents
> > > delivered to different maps simultaneously, and if it happens, it's
> > > probably correct to have them serviced quickly.
> > > 
> > > The uevent dispatcher doesn't hold the lock continuously; it's taken
> > > and released for every event handled. Thus the uxlsnr has a real
> > > chance to jump in between uevents. The same holds for the dmevents
> > > thread: it takes the lock separately for every map affected. The only
> > > piece of code that holds the lock for an extended period of time
> > > (except reconfigure(), where it's unavoidable) is the path checker
> > > (that's bad, and next on the todo list).
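
To illustrate that "lock per event" pattern, a hypothetical dispatcher
loop might look like the following (pop_next_event(), handle_one_event()
and wakeup_unlock() are invented names, not the actual uevent dispatcher
code): the lock is dropped after every event, which is exactly what gives
the uxlsnr's trylock() a window.

#include <pthread.h>
#include <stddef.h>

struct uevent;					/* opaque for this sketch */

extern pthread_mutex_t vecs_mutex;		/* stands in for vecs->lock */
extern void wakeup_unlock(pthread_mutex_t *m);	/* unlock + eventfd write */
extern struct uevent *pop_next_event(void);	/* hypothetical queue pop */
extern void handle_one_event(struct uevent *ev);

static void dispatch_events(void)
{
	struct uevent *ev;

	while ((ev = pop_next_event()) != NULL) {
		pthread_mutex_lock(&vecs_mutex);
		handle_one_event(ev);
		/* Dropping the lock after every single event wakes the
		 * uxlsnr and lets its trylock() slip in between events. */
		wakeup_unlock(&vecs_mutex);
	}
}
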
> > > 
> > > The really "important" commands (shutdown, reconfigure) don't take
> > > the lock and return immediately; the lock is no issue for them. I
> > > don't see any other CLI command that needs to be served before
> > > uevents or dm events.
> > > 
> > > I haven't been able to test this on huge configurations with 1000s
> > > of LUNs, but I tested with artificial delays in the checker loop,
> > > the uevent handlers, and the dmevent handler, with lots of clients
> > > querying the daemon in parallel, and saw that clients were handled
> > > very nicely. Some timeouts are inevitable (e.g. if the checker simply
> > > holds the lock longer than the uxsock_timeout), but that is no
> > > regression.
> > > 
> > > Bottom line: I believe that because this patch reduces the busy-wait
> > > time, clients will be served more reliably and more quickly than
> > > before (more precisely: both the average and the standard deviation
> > > of the service delay will improve compared to before, and timeouts
> > > will occur less frequently). I encourage everyone to experiment and
> > > see if reality shows that I'm wrong.
> > > 
> > > > I suppose the only way to deal with that would be to move the
> > > > locking commands to a list handled by a separate thread, so that it
> > > > could block without stalling the non-locking commands.
> > > 
> > > Not sure if I understand correctly, just in case: non-locking
> > > commands are never stalled with my patch.
> > 
> > I realize that. I was saying that you could avoid starvation while
> > still allowing non-locking commands to complete by moving the locking
> > commands to a separate thread, which would block on the lock. I didn't
> > consider a ticketing system. Ideally, the checker loop would have the
> > lowest priority, since it isn't responding to any event and usually is
> > just verifying that nothing has changed. But you do make a good point
> > that when we are getting a lot of events, and the uxlsnr loop has a
> > chance of getting starved, we probably want to prioritize the event
> > handling anyway.
> > 
> 
> I have also thought about using additional threads for handling CLI
> commands. One could either use a single thread, similar to the udev
> listener/dispatcher pair (your suggestion, IIUC), or one thread per
> (blocking) client.
> 
> Moving client handling into separate thread(s) avoids the complexity of
> the state machine and the eventfd-based wakeup. But on the flip side, it
> introduces new multithreading-related complexity (of which we already
> have our fair share). Client tasks running lock(&vecs->lock) in order to
> serve commands like "multipathd show paths" might then starve event
> handling, which would be worse than the other way around, IMO.
> 
> Eventually, I found the idea of the poll/wakeup loop with no additional
> threads more appealing, and more suitable for the task. But I admit that
> it's a matter of personal taste. I tend to try to use pthreads as little
> as possible ;-).
> 
> So how do we proceed? 

I think your argument that we'll only risk starving the uxlsnr thread
when it makes sense to prioritize other threads is a good one. So, I'm
o.k. with this trylock() solution.

-Ben

> 
> Regards,
> Martin
> 



