[Libguestfs] [PATCH libnbd v2 2/3] fuse: Issue commands in parallel

Richard W.M. Jones rjones at redhat.com
Fri May 14 12:12:30 UTC 2021


On Fri, May 14, 2021 at 02:39:12PM +0300, Nir Soffer wrote:
> On Fri, May 14, 2021 at 12:07 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> >
> > On top of the previous commit which enabled multithreading but
> > continued to use the synchronous libnbd API, this allows each thread
> > to issue commands asynchronously.  Because there is still a single
> > handle, this introduces a single background thread to poll the file
> > descriptor and dispatch the commands.
> >
> > This is only a little bit faster (compare to the results in the
> > previous commit message):
> >
> >    READ: bw=250MiB/s (262MB/s), 62.4MiB/s-62.4MiB/s (65.4MB/s-65.5MB/s), io=4096MiB (4295MB), run=16398-16411msec
> >
> > A future multi-conn version of nbdfuse would likely use multiple
> > background threads (one per connection) to do the same job, but that
> > is left for future work.
> > ---
> >  fuse/nbdfuse.c    |   5 ++
> >  fuse/nbdfuse.h    |   1 +
> >  fuse/operations.c | 181 +++++++++++++++++++++++++++++++++++++++++-----
> >  3 files changed, 169 insertions(+), 18 deletions(-)
> >
> > diff --git a/fuse/nbdfuse.c b/fuse/nbdfuse.c
> > index fa35080..f91ff7f 100644
> > --- a/fuse/nbdfuse.c
> > +++ b/fuse/nbdfuse.c
> > @@ -426,6 +426,11 @@ main (int argc, char *argv[])
> >    if (nbd_is_read_only (nbd) > 0)
> >      readonly = true;
> >
> > +  /* Create the background thread which is used to dispatch NBD
> > +   * operations.
> > +   */
> > +  start_operations_thread ();
> > +
> >    /* This is just used to give an unchanging time when they stat in
> >     * the mountpoint.
> >     */
> > diff --git a/fuse/nbdfuse.h b/fuse/nbdfuse.h
> > index 1f8f703..016c325 100644
> > --- a/fuse/nbdfuse.h
> > +++ b/fuse/nbdfuse.h
> > @@ -36,5 +36,6 @@ extern char *filename;
> >  extern uint64_t size;
> >
> >  extern struct fuse_operations nbdfuse_operations;
> > +extern void start_operations_thread (void);
> >
> >  #endif /* LIBNBD_NBDFUSE_H */
> > diff --git a/fuse/operations.c b/fuse/operations.c
> > index 4da701e..1e81593 100644
> > --- a/fuse/operations.c
> > +++ b/fuse/operations.c
> > @@ -39,6 +39,7 @@
> >  #include <assert.h>
> >  #include <sys/types.h>
> >  #include <sys/stat.h>
> > +#include <pthread.h>
> >
> >  #include <libnbd.h>
> >
> > @@ -47,14 +48,90 @@
> >
> >  #define MAX_REQUEST_SIZE (32 * 1024 * 1024)
> >
> > -/* Wraps calls to libnbd functions and automatically checks for error,
> > - * returning errors in the format required by FUSE.  It also prints
> > - * out the full error message on stderr, so that we don't lose it.
> > +/* Number of seconds to wait for commands to complete when closing the file. */
> > +#define RELEASE_TIMEOUT 5
> > +
> > +/* This operations background thread runs while nbdfuse is running and
> > + * is responsible for dispatching AIO commands.
> > + *
> > + * The commands themselves are initiated by the FUSE threads (by
> > + * calling eg. nbd_aio_pread), and then those threads call
> > + * wait_for_completion() which waits for the command to retire.
> > + *
> > + * A condition variable is signalled by any FUSE thread when it has
> > + * started a new AIO command and wants the operations thread to start
> > + * processing (if it isn't doing so already).  To signal completion we
> > + * use a completion callback which signals a per-thread completion
> > + * condition.
> >   */
> > -#define CHECK_NBD_ERROR(CALL)                                   \
> > -  do { if ((CALL) == -1) return check_nbd_error (); } while (0)
> > +static void *operations_thread (void *);
> > +
> > +void
> > +start_operations_thread (void)
> > +{
> > +  int err;
> > +  pthread_t t;
> > +
> > +  err = pthread_create (&t, NULL, operations_thread, NULL);
> > +  if (err != 0) {
> > +    errno = err;
> > +    perror ("nbdfuse: pthread_create");
> > +    exit (EXIT_FAILURE);
> > +  }
> > +}
> > +
> > +static pthread_mutex_t start_mutex = PTHREAD_MUTEX_INITIALIZER;
> > +static pthread_cond_t start_cond = PTHREAD_COND_INITIALIZER;
> > +
> > +struct completion {
> > +  pthread_mutex_t mutex;
> > +  pthread_cond_t cond;
> > +  bool completed;
> > +} completion;
> > +
> > +static void *
> > +operations_thread (void *arg)
> > +{
> > +  while (1) {
> > +    /* Sleep until a command is in flight. */
> > +    pthread_mutex_lock (&start_mutex);
> > +    while (nbd_aio_in_flight (nbd) == 0)
> > +      pthread_cond_wait (&start_cond, &start_mutex);
> > +    pthread_mutex_unlock (&start_mutex);
> > +
> > +    /* Dispatch work while there are commands in flight. */
> > +    while (nbd_aio_in_flight (nbd) > 0)
> > +      nbd_poll (nbd, -1);
> 
> Just to make sure I understand this correctly - this runs all completion
> callbacks?

I wasn't sure if your comment refers only to nbd_poll or to the whole
while() loop, so I'll talk about both.  nbd_poll is tricky to
understand without looking at the implementation:

https://gitlab.com/nbdkit/libnbd/-/blob/4a48e3bf0f71bbc25608ce4925aa8b37e8084399/lib/poll.c#L31

You can see that each call to nbd_poll first sleeps until the file
descriptor is ready, then runs the state machine
[nbd_unlocked_aio_notify*] exactly once, until it would block again,
and then returns.
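
In terms of the public API, a single nbd_poll (nbd, -1) call is
roughly equivalent to the sketch below (simplified; the real code in
lib/poll.c calls the unlocked internals directly and has more error
checking):

#include <errno.h>
#include <poll.h>
#include <libnbd.h>

/* Simplified sketch of one nbd_poll (nbd, -1) call.  Assumes the
 * handle currently wants to read and/or write on the socket.
 */
static int
poll_once (struct nbd_handle *nbd)
{
  struct pollfd fds[1];
  unsigned dir = nbd_aio_get_direction (nbd);
  int r;

  fds[0].fd = nbd_aio_get_fd (nbd);
  fds[0].events = 0;
  fds[0].revents = 0;
  if (dir & LIBNBD_AIO_DIRECTION_READ)  fds[0].events |= POLLIN;
  if (dir & LIBNBD_AIO_DIRECTION_WRITE) fds[0].events |= POLLOUT;

  /* First sleep until the socket is ready ... */
  do
    r = poll (fds, 1, -1);
  while (r == -1 && errno == EINTR);
  if (r == -1)
    return -1;

  /* ... then run the state machine exactly once.  Zero, one or
   * several completion callbacks may fire inside these calls.
   */
  if ((fds[0].revents & POLLIN) != 0)
    return nbd_aio_notify_read (nbd);
  else if ((fds[0].revents & POLLOUT) != 0)
    return nbd_aio_notify_write (nbd);
  return 0;
}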

Now, how many completion callbacks might run during one nbd_poll is
not well defined.  It's possible to have a scenario where 2 or more
commands are in flight waiting for the server to reply, and a single
call to nbd_unlocked_aio_notify* completes all of them.  It's equally
possible that no commands complete.

As for the while loop, it continues until there are no more commands
in flight (all have retired).  After this we go back to sleeping on
start_cond.  Hopefully there isn't a race condition - I don't think
there is ...

> Would be nice to mention this in the comment.
> 
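
Sure - I could expand the comment in the dispatch loop along these
lines (just a sketch of the wording, based on the explanation above):

    /* Dispatch work while there are commands in flight.
     *
     * Each nbd_poll call waits for the file descriptor to become
     * ready and then advances the state machine exactly once.  Zero,
     * one or several completion callbacks may run during a single
     * call, so we keep polling until every command has retired, and
     * only then go back to sleeping on start_cond.
     */
    while (nbd_aio_in_flight (nbd) > 0)
      nbd_poll (nbd, -1);
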
> > +  }
> > +
> > +  /*NOTREACHED*/
> > +  return NULL;
> > +}
> > +
> > +/* Completion callback - called from the operations thread when a
> > + * command completes.
> > + */
> > +static int
> > +completion_callback (void *vp, int *error)
> > +{
> > +  struct completion *completion = vp;
> > +
> > +  /* Mark the command as completed. */
> > +  completion->completed = true;
> 
> I think setting this after locking the completion mutex would be
> better.  The mutex will ensure proper ordering, and it matches the
> code that checks this flag while the mutex is locked.

Yes, good point:

> > +
> > +  pthread_mutex_lock (&completion->mutex);

+  completion->completed = true;      <-- here I think.

> > +  pthread_cond_signal (&completion->cond);
> > +  pthread_mutex_unlock (&completion->mutex);
> > +
> > +  /* Don't retire the command.  We want to get the error indication in
> > +   * the FUSE thread.
> > +   */
> > +  return 0;
> 
> How does retiring the command affect the error indicator?
> 
> libnbd errors are stored in thread-local storage - how does that
> work when another thread is polling?

nbd_aio_command_completed is called from the FUSE thread and sets the
error, so I guess we're OK ...

https://gitlab.com/nbdkit/libnbd/-/blob/4a48e3bf0f71bbc25608ce4925aa8b37e8084399/lib/aio.c#L114
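
So with your suggestion folded in, the callback would end up looking
something like this (sketch only, not the committed version):

static int
completion_callback (void *vp, int *error)
{
  struct completion *completion = vp;

  pthread_mutex_lock (&completion->mutex);
  /* Mark the command as completed and wake the waiting FUSE thread. */
  completion->completed = true;
  pthread_cond_signal (&completion->cond);
  pthread_mutex_unlock (&completion->mutex);

  /* Return 0 so the command is not retired here.  The FUSE thread
   * retires it itself with nbd_aio_command_completed, which is also
   * where the error indication is picked up.
   */
  return 0;
}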

> > +}
> > +
> > +/* Report an NBD error and return -errno. */
> >  static int
> > -check_nbd_error (void)
> > +report_nbd_error (void)
> >  {
> >    int err;
> >
> > @@ -66,6 +143,55 @@ check_nbd_error (void)
> >      return -EIO;
> >  }
> >
> > +static int
> > +wait_for_completion (struct completion *completion, int64_t cookie)
> > +{
> > +  int r;
> > +
> > +  /* Signal to the operations thread to start work, in case it is sleeping. */
> > +  pthread_mutex_lock (&start_mutex);
> > +  pthread_cond_signal (&start_cond);
> > +  pthread_mutex_unlock (&start_mutex);
> > +
> > +  /* Wait until the completion_callback sets the completed flag.
> > +   *
> > +   * We cannot call nbd_aio_command_completed yet because that can
> > +   * lead to a possible deadlock where completion_callback holds the
> > +   * NBD handle lock and we try to acquire it by calling
> > +   * nbd_aio_command_completed.  That is the reason for the
> > +   * completion.completed flag.
> > +   */
> > +  pthread_mutex_lock (&completion->mutex);
> > +  while (!completion->completed)
> > +    pthread_cond_wait (&completion->cond, &completion->mutex);
> > +  pthread_mutex_unlock (&completion->mutex);
> > +
> > +  /* nbd_aio_command_completed returns:
> > +   *  0 => command still in flight (should be impossible)
> > +   *  1 => completed successfully
> > +   * -1 => error
> > +   */
> > +  r = nbd_aio_command_completed (nbd, cookie);
> > +  assert (r != 0);
> > +  return r;
> > +}
> > +
> > +/* Wrap calls to any asynch command and check the error. */
> > +#define CHECK_NBD_ASYNC_ERROR(CALL)                                     \
> > +  do {                                                                  \
> > +    struct completion completion =                                      \
> > +      { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false };   \
> > +    nbd_completion_callback cb =                                        \
> > +      { .callback = completion_callback, .user_data = &completion };    \
> > +    int64_t cookie = (CALL);                                            \
> > +    if (cookie == -1 || wait_for_completion (&completion, cookie) == -1) \
> > +      return report_nbd_error ();
> 
> This runs in the FUSE thread, but the error was detected in the
> operations thread.  Do we pass the error back to the thread that
> started the async I/O request in some way?

Yes, I think so.  When wait_for_completion (in the FUSE thread) calls
nbd_aio_command_completed, the error is set.
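
To spell that out, the FUSE-thread side of the error path is roughly
this (report_nbd_error is paraphrased from the part of the diff not
quoted above, and finish_command is just a made-up name for what the
CHECK_NBD_ASYNC_ERROR macro does after waiting):

#include <stdio.h>
#include <errno.h>
#include <stdint.h>
#include <libnbd.h>

/* Print the full error message and convert the thread-local errno
 * into the -errno convention that FUSE expects.
 */
static int
report_nbd_error (void)
{
  int err;

  fprintf (stderr, "%s\n", nbd_get_error ());
  err = nbd_get_errno ();
  if (err != 0)
    return -err;
  else
    return -EIO;
}

/* Runs in the FUSE thread once wait_for_completion has seen the
 * completed flag.  If the command failed, nbd_aio_command_completed
 * returns -1 and raises the error in *this* thread's thread-local
 * storage, so report_nbd_error reads it in the right thread.
 */
static int
finish_command (struct nbd_handle *nbd, int64_t cookie)
{
  if (nbd_aio_command_completed (nbd, cookie) == -1)
    return report_nbd_error ();
  return 0;
}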

> Looks good; I'm not sure there is a better way to use a single handle
> from multiple threads.
> 
> This looks like generic infrastructure that could be part of the
> library, like nbd_poll().

Yup - though I think I'd want something a bit more efficient, since
there appears to be a lot of overhead from the condition variables /
thread switching.

I'm still playing with this to see if more FUSE threads help.  At the
moment with 4 FIO jobs, FUSE runs about 4 threads (sometimes one
more), but I know that 4 commands in flight is not nearly enough to
keep the NBD server busy.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org



