[Libguestfs] [PATCH libnbd] ublk: Add new nbdublk program

Ming Lei ming.lei at redhat.com
Tue Aug 30 09:38:35 UTC 2022


On Tue, Aug 30, 2022 at 10:07:40AM +0100, Richard W.M. Jones wrote:
> On Tue, Aug 30, 2022 at 04:30:40PM +0800, Ming Lei wrote:
> > On Tue, Aug 30, 2022 at 09:04:07AM +0100, Richard W.M. Jones wrote:
> > > On Tue, Aug 30, 2022 at 10:32:02AM +0800, Ming Lei wrote:
> > > > Hi Jones,
> > > > 
> > > > On Thu, Aug 25, 2022 at 01:10:55PM +0100, Richard W.M. Jones wrote:
> > > > > This patch adds simple support for a ublk-based NBD client.
> > > > > It is also available here:
> > > > > https://gitlab.com/rwmjones/libnbd/-/tree/nbdublk/ublk
> > > > > 
> > > > > ublk is a way to write Linux block device drivers in userspace:
> > > > 
> > > > I just looked at your nbdublk implementation a bit; basically it is
> > > > good, and really nice work.
> > > > 
> > > > Two suggestions follow:
> > > > 
> > > > 1) the io_uring context is multiplexed with ublk io command handling, so
> > > > we should avoid blocking in both ->handle_io_async() and
> > > > ->handle_event(), otherwise performance may be bad
> > > 
> > > The nbd_aio_* calls don't block.
> > > 
> > > However I noticed that I made a mistake with the trim and zero paths
> > > because I am using synchronous (blocking) nbd_flush / nbd_trim /
> > > nbd_zero instead of nbd_aio_flush / nbd_aio_trim / nbd_aio_zero.  I
> > > will fix this soon.
> > > 
> > > Nothing in handle_event should block except for the call to
> > > pthread_mutex_lock.  This lock is necessary because new commands can
> > > be retired on the nbd_work_thread while handle_event is being called
> > > from the io_uring thread.
> > 
> > Yes, I meant the mutex.
> > 
> > Also, given that nbd_work_thread() is required, it is reasonable to move
> > the nbd_aio_* calls into nbd_work_thread(), to avoid waiting on the mutex
> > in the io_uring context; otherwise io commands may not be forwarded to
> > userspace in time.
> 
> nbd_aio_* calls shouldn't block.  But how can we move those calls to
> nbd_work_thread()?

Please see aio_submitter() in my patch; the call has already been moved to
the nbd work thread.

> If we queue up commands in handle_io_async we would still have to use a
> lock on that queue.  I don't see how we can avoid some kind of locking
> in handle_io_async.

pthread_spin_lock, which is what protects the list, isn't supposed to sleep.
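
Just to illustrate what I mean, a minimal sketch (not the code from my
patch; 'struct req' and the function names are made up):

  /* Spinlock-protected submit list shared between the io_uring thread
   * and the nbd work thread.  The spinlock is initialised once at
   * startup with pthread_spin_init(). */
  #include <pthread.h>
  #include <stddef.h>

  struct req {
      struct req *next;
      /* ... description of the IO ... */
  };

  static pthread_spinlock_t submit_lock;
  static struct req *submit_list;

  /* io_uring thread: queue the request; this never sleeps. */
  static void queue_req (struct req *r)
  {
      pthread_spin_lock (&submit_lock);
      r->next = submit_list;
      submit_list = r;
      pthread_spin_unlock (&submit_lock);
      /* ... then kick the work thread, e.g. by writing the eventfd ... */
  }

  /* nbd work thread: detach the whole list in one go, then call
   * nbd_aio_* for each request outside the lock. */
  static struct req *fetch_reqs (void)
  {
      struct req *list;

      pthread_spin_lock (&submit_lock);
      list = submit_list;
      submit_list = NULL;
      pthread_spin_unlock (&submit_lock);
      return list;
  }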

Going completely lockless is possible too, by using a per-queue bitmap to
mark which requests are submitted & completed, plus a per-queue array to
store the results.
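
Roughly along these lines (again just a sketch with invented names; each
tag has a single writer at a time, so C11 atomics are enough):

  #include <stdatomic.h>

  #define QUEUE_DEPTH 64              /* one slot per ublk tag; a depth
                                       * above 64 needs an array of words */

  struct queue_state {
      atomic_ulong completed;         /* bit N set => result[N] is valid */
      int result[QUEUE_DEPTH];
  };

  /* completion side (nbd work thread / completion callback): store the
   * result first, then publish the bit with release ordering. */
  static void mark_completed (struct queue_state *q, unsigned tag, int res)
  {
      q->result[tag] = res;
      atomic_fetch_or_explicit (&q->completed, 1UL << tag,
                                memory_order_release);
  }

  /* reaping side (io_uring thread): snapshot and clear the finished tags
   * in one atomic exchange (acquire pairs with the release above), then
   * complete each set bit. */
  static unsigned long reap_completed (struct queue_state *q)
  {
      return atomic_exchange_explicit (&q->completed, 0,
                                       memory_order_acquire);
  }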

But per my tests, nbd performance is still a bit low, so I am not sure this
kind of optimization makes a difference.

> 
> > BTW, in which context are the callbacks of nbd_aio_*() run?
> 
> It depends on whether nbd_aio_* can complete the command entirely
> without blocking.
> 
> The most common case is that nbd_aio_* calls a network function
> (eg. recv(2)) which returns EWOULDBLOCK.  Later, on nbd_work_thread,
> recv(2) is completed, the command is processed, and
> command_completed() is called.
> 
> It's unlikely, but possible, that the command_completed() function
> could be called directly from handle_io_async -> nbd_aio_* ->
> command_completed, eg. if recv(2) never blocks for some reason like
> it's talking over a Unix domain socket to a server which is very quick
> to reply.

OK, got it. I was just trying to understand the whole flow.
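
So in libnbd API terms I understand the flow roughly as the following
sketch ('struct my_cmd' is an invented stand-in for the real per-request
state, and error handling is trimmed):

  #include <stdio.h>
  #include <inttypes.h>
  #include <libnbd.h>

  struct my_cmd {
      void *buf;
      uint64_t offset;
      size_t len;
  };

  /* Runs once libnbd has processed the server's reply: normally from the
   * nbd work thread inside its poll loop, but possibly directly from the
   * submitting nbd_aio_pread() call if the socket never blocks. */
  static int
  command_completed (void *user_data, int *error)
  {
      struct my_cmd *cmd = user_data;

      if (*error)
          fprintf (stderr, "request at %" PRIu64 " failed\n", cmd->offset);
      /* ... tell the ublk side that this request is finished ... */
      return 1;      /* 1 => libnbd retires the command automatically */
  }

  /* Submission: only queues the command on the handle; it does not block. */
  static void
  submit_read (struct nbd_handle *h, struct my_cmd *cmd)
  {
      int64_t cookie =
          nbd_aio_pread (h, cmd->buf, cmd->len, cmd->offset,
                         (nbd_completion_callback) {
                           .callback = command_completed,
                           .user_data = cmd,
                         },
                         0);
      if (cookie == -1)
          fprintf (stderr, "%s\n", nbd_get_error ());
  }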

> 
> > > > 2) in the implementation of the nbd worker thread, there are two sleep
> > > > points (waiting for an incoming io command, and waiting on the network
> > > > FD); I'd suggest using poll to wait on both of them at once
> > > >
> > > > Recently I have been working on adding ublksrv io offloading/aio
> > > > interfaces for this sort of case in which io_uring can't be used,
> > > > which may simplify this area; please see the attached patch, which
> > > > applies the above two points against your patch. An obvious
> > > > improvement can be observed in my simple fio test (randread, io, 4k
> > > > bs, libaio) against a backend of 'nbdkit file'.
> > > >
> > > > But these interfaces aren't merged into the ublksrv github tree yet; you
> > > > can find them in the aio branch, and demo_event.c is one example of how
> > > > to use them:
> > > > 
> > > >   https://github.com/ming1/ubdsrv/tree/aio
> > > >
> > > > Actually this interface can be improved further for the nbdublk case,
> > > > and the request allocation isn't actually needed for this direct
> > > > offloading. But it is there to cover IOs which don't come from the ublk
> > > > driver, such as metadata, which is why 'struct ublksrv_aio' is
> > > > allocated. I will try my best to finalize these interfaces and merge
> > > > them into the master branch.
> > > 
> > > I didn't really understand what these patches to ubdsrv do when I
> > > looked at them before.  Maybe add some diagrams?
> > 
> > The patches were added recently. They are just for simplifying the
> > offloading of IO handling to another worker context, such as
> > nbd_work_thread.
> > 
> > It uses one eventfd to notify the target worker thread when new requests
> > are added to a list. Once the worker thread wakes up, it fetches requests
> > from the list; after handling these requests, the worker thread waits on
> > both the eventfd and the target IO (network FD) via poll().
> 
> In the patch, it seems there is no place which creates a pthread for
> nbd_work_thread.  Also ubdsrv aio branch does not call pthread_create.
> So I don't understand in what context nbd_work_thread is called.

The nbd work thread is created by the nbd target code just like before, but
its loop is changed to the following, basically bound to one aio_ctx:

  while (!ublksrv_aio_ctx_dead(aio_ctx)) {
      struct aio_list compl;

      aio_list_init(&compl);

      // retrieve requests from the submit list and submit each one via
      // aio_submitter(); any request that is already done gets added
      // to &compl
      ublksrv_aio_submit_worker(aio_ctx, aio_submitter, &compl);

      // move requests completed via command_completed() onto &compl
      pthread_spin_lock(&c->lock);
      aio_list_splice(&c->list, &compl);
      pthread_spin_unlock(&c->lock);

      // notify the io_uring thread about the completed requests in
      // &compl, so batched completion & re-issue can be done in
      // io_uring context
      ublksrv_aio_complete_worker(aio_ctx, &compl);

      // wait for network IO and the eventfd from io_uring at the same
      // time, so nbd_poll2() wakes up as soon as either one is ready
      if (nbd_poll2 (h, aio_ctx->efd, -1) == -1) {
          fprintf (stderr, "%s\n", nbd_get_error ());
          exit (EXIT_FAILURE);
      }
  }


> 
> I've never used eventfd before and the man page for it is very opaque.
> 
> > In your previous implementation, the nbd work thread may wait on one
> > pthread mutex and on aio_poll(); this may not be efficient, given that
> > while it is waiting on one kind of event, the other cannot be handled.
> 
> I'm not sure what "event" means in this context.  Does it mean
> "NBD command"?  Or poll(2) event?

Here 'event' is generic. I meant: NBD IO becoming ready (exactly what
aio_poll() waits for), or the io_uring eventfd becoming ready (a write done
from nbd_handle_io_async()).
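
Since the eventfd man page came up earlier: an eventfd is just a 64-bit
counter behind a file descriptor, so it can be polled together with the
network FD. A standalone illustration (not ublksrv code):

  #include <sys/eventfd.h>
  #include <poll.h>
  #include <unistd.h>
  #include <stdint.h>
  #include <stdio.h>

  int
  main (void)
  {
      /* write(2) adds to the counter and wakes up pollers; read(2)
       * returns the counter value and resets it to zero. */
      int efd = eventfd (0, 0);
      if (efd == -1) { perror ("eventfd"); return 1; }

      /* Producer side (e.g. the io_uring thread after queuing a
       * request): write any non-zero 8-byte value. */
      uint64_t one = 1;
      if (write (efd, &one, sizeof one) != sizeof one)
          perror ("write");

      /* Consumer side (e.g. the nbd work thread): poll the eventfd
       * together with any other fd -- in nbd_poll2() the other fd is
       * the NBD socket -- then read to clear the notification. */
      struct pollfd pfd = { .fd = efd, .events = POLLIN };
      if (poll (&pfd, 1, -1) == 1 && (pfd.revents & POLLIN)) {
          uint64_t count;
          if (read (efd, &count, sizeof count) == sizeof count)
              printf ("woken up, counter was %llu\n",
                      (unsigned long long) count);
      }

      close (efd);
      return 0;
  }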

> 
> There are multiple (usually 4) nbd_work threads, one for each NBD
> network connection.  Each NBD network connection can handle many
> commands in flight at once.

OK, but aio_poll() is supposed to get notified when a command is done, so
this is just an implementation detail; correct me if I am wrong.


Thanks, 
Ming

