[Libguestfs] [libnbd] Simultaneous read and write

Eric Blake eblake at redhat.com
Fri May 31 21:47:52 UTC 2019


On 5/31/19 11:01 AM, Richard W.M. Jones wrote:
> This is a continuation of a discussion we were having on IRC.  The
> problems with IRC are it's not recorded and it's hard to have deep
> technical conversations.  I hope this is a decent summary.
> 
> Problem simply stated: Certain NBD servers (qemu-nbd in particular)

Also nbdkit for a plugin with NBDKIT_THREAD_MODEL_PARALLEL. qemu-nbd
does it via coroutines and nbdkit via threads, but the concept is
similar: a request-reply cycle is handled within one thread, but that
thread relinquishes read control to any other thread prior to starting
to write the reply, so the next thread in line can be reading in parallel.
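
For illustration, the per-connection pattern in such a server looks
roughly like this (a minimal sketch, not actual qemu-nbd or nbdkit code;
the request/reply structs and the recv_request/handle_request/send_reply
helpers are placeholders for the real wire handling):

  #include <pthread.h>
  #include <stdint.h>

  struct request { uint64_t handle; };          /* simplified placeholders */
  struct reply   { uint64_t handle; int error; };

  extern int  recv_request (int sock, struct request *req);
  extern void handle_request (const struct request *req, struct reply *rep);
  extern int  send_reply (int sock, const struct reply *rep);

  static pthread_mutex_t read_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

  static void *
  connection_thread (void *opaque)
  {
    int sock = *(int *) opaque;
    struct request req;
    struct reply rep;

    for (;;) {
      pthread_mutex_lock (&read_lock);
      if (recv_request (sock, &req) == -1) {
        pthread_mutex_unlock (&read_lock);
        break;
      }
      /* Relinquish read control before replying, so the next thread in
       * line can already be reading the next request in parallel. */
      pthread_mutex_unlock (&read_lock);

      handle_request (&req, &rep);

      pthread_mutex_lock (&write_lock);   /* replies must not interleave */
      send_reply (sock, &rep);
      pthread_mutex_unlock (&write_lock);
    }
    return NULL;
  }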

> are able to simultaneously read and write on a socket.  ie. They can
> be simultaneously reading a request and writing the reply to a
> previous request.  However libnbd is unable to do this trick.
> Although multiple requests can be in flight, libnbd is either writing
> on the socket or reading from the socket, but never both.
> 
> Visualized it looks like this:
> 
>    write                      write           write
>    |===========|______________|=======|_______|===|  --> rq to server
>    _____________|============|_________|=====|_____  <-- rp from server
>                 read                   read
> 
> whereas an ideal libnbd which could write and read simultaneously,
> coupled with a server which can do the same:
> 
>    write
>    |===========||=======||===|  --> requests to server
>    |============||=====||=====  <-- replies from server
>    read
> 
> Eric already observed through testing that the ideal client can be up
> to twice as fast as libnbd, as is obvious from the diagram.

More concretely, the existing nbdkit-nbd plugin uses multiple threads
(as provided by nbdkit) to write requests to the server (using a mutex
so that requests are not interleaved), then a separate reader thread to
read responses from the server (with semaphores to learn which writer
thread to wake up when the response is complete).  The ideal division of
labor that would permit reads to be in a separate thread from writes
would require that the nbd_aio_pread() call (in the nbdkit thread) run
until the entire request is out the door, even if it temporarily blocks
on write() and has to wait for POLLOUT.  Meanwhile, the reader thread
would spin waiting for just POLLIN events, rather than having to service
the POLLOUT needs of the writer thread(s) running in parallel.
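
A stripped-down version of that existing layout might look like this
(hypothetical names throughout; the real nbdkit-nbd plugin differs in
detail, and error handling is elided):

  #include <pthread.h>
  #include <semaphore.h>
  #include <stdint.h>

  struct transaction {
    uint64_t handle;
    sem_t sem;                          /* posted by the reader thread */
    int error;
  };

  extern void remember_transaction (struct transaction *trans);
  extern struct transaction *find_transaction (uint64_t handle);
  extern int send_request (int sock, uint64_t handle);
  extern int recv_reply (int sock, uint64_t *handle, int *error);

  static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Runs in one of the nbdkit worker threads. */
  static int
  issue_and_wait (int sock, struct transaction *trans)
  {
    sem_init (&trans->sem, 0, 0);
    remember_transaction (trans);

    pthread_mutex_lock (&write_lock);   /* keep requests un-interleaved */
    send_request (sock, trans->handle);
    pthread_mutex_unlock (&write_lock);

    sem_wait (&trans->sem);             /* reader thread posts this */
    return trans->error;
  }

  /* The single reader thread. */
  static void *
  reply_reader (void *opaque)
  {
    int sock = *(int *) opaque;
    uint64_t handle;
    int error;

    while (recv_reply (sock, &handle, &error) == 0) {
      struct transaction *trans = find_transaction (handle);
      trans->error = error;
      sem_post (&trans->sem);           /* wake exactly that worker */
    }
    return NULL;
  }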

> 
> The reasons why libnbd can't do this are:
> 
> Problem (a) : Only one thread of control can hold the libnbd handle
> lock (h->lock), and since there can only be one thread of control
> running in each handle at any time, that thread can only be reading or
> writing.

Indeed, when the writer thread calls nbd_aio_pread(), it contends for
the same lock as the nbd_aio_notify_read() in the reader thread, so
we'd have to split h->lock into several finer-grained locks (maybe keep
all locking APIs grabbing h->lock at the beginning, but while holding
that lock also grab the h->rstate or h->wstate lock for the rest of the
call, and drop the main h->lock at that point even though the API
doesn't return until the state machine blocks again).
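
Concretely, the lock dance for a command-issuing API might then look
like the sketch below; the wstate_lock/rstate_lock fields and the two
helpers are hypothetical and do not exist in libnbd today:

  #include <pthread.h>
  #include <stdint.h>

  struct nbd_handle {
    pthread_mutex_t lock;         /* short-lived, handle-wide invariants */
    pthread_mutex_t wstate_lock;  /* held while the sending machine runs */
    pthread_mutex_t rstate_lock;  /* held while the receiving machine runs */
    /* ... */
  };

  extern int64_t queue_command (struct nbd_handle *h);
  extern void run_sending_machine (struct nbd_handle *h);

  static int64_t
  aio_command_sketch (struct nbd_handle *h)
  {
    int64_t cookie;

    pthread_mutex_lock (&h->lock);          /* validate handle state */
    cookie = queue_command (h);             /* append to the request queue */
    pthread_mutex_lock (&h->wstate_lock);   /* claim the sending side only */
    pthread_mutex_unlock (&h->lock);        /* nbd_aio_notify_read() can run */

    run_sending_machine (h);     /* until it would block on write() */

    pthread_mutex_unlock (&h->wstate_lock);
    return cookie;
  }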

With my initial use of libnbd, the division of labor for which thread
writes a packet falls into three classes:
- if the state machine is ready, the libnbd fd is polling only on
POLLIN; from there:
  - if the request is small, the entire request is written by the nbdkit
thread during nbd_aio_pwrite (in this case, our write to the
pipe-to-self is actually wasted work, because the reader loop thread
will spuriously wake up, check the direction, and see that it still
only needs POLLIN for the next server response)
  - if the request is large, the first half of the request is written
from the nbdkit thread during nbd_aio_pwrite, and the second half of the
request is written from the reader thread during nbd_aio_notify_write
(here, our write to the pipe-to-self is essential, because we HAD to
break the reader loop out of its POLLIN poll() in order to learn that it
now needs to poll for POLLIN|POLLOUT)
- if the state machine is not ready, the libnbd fd is either in the
middle of processing a reply (POLLIN) or a request (POLLIN|POLLOUT,
since we allow replies to interrupt the processing of a request).
nbd_aio_pwrite queues the new request and returns immediately, without
write()ing anything from the nbdkit thread, and the entire request gets
written by the reader thread during subsequent
nbd_aio_notify_{read,write} (in this case, our write to the pipe-to-self
is useful if libnbd was in the middle of processing a reply, but wasted
work if it was in the middle of processing a previous request)

> 
> Problem (b) : There is only one state machine per handle (h->state),
> whereas to handle the write and read sides separately requires two
> state machines.  In the IRC discussion we gave these the preliminary
> names h->wstate and h->rstate.
> 
> ----------------------------------------------------------------------
> 
> It's worth also saying how the current API works, although we might
> want to change it.
> 
> You grab the underlying file descriptor using nbd_aio_get_fd, which is
> what you poll on.  You also have to call nbd_aio_get_direction which
> returns READ, WRITE or BOTH (== READ|WRITE).  You then set up some
> events mechanism (eg. poll, epoll, etc.), poll until the file
> descriptor is ready, and call one of nbd_aio_notify_read or
> nbd_aio_notify_write.
> 
> The direction can change any time the handle state changes, which
> includes whenever you issue a command (eg. nbd_aio_pread), or whenever
> you call nbd_aio_notify_*.  You therefore have to call
> nbd_aio_get_direction frequently.
> 
> A typical loop using poll might look like:
> 
>   fd = nbd_aio_get_fd (nbd);
>   for (;;) {
>     /* <-- If you need to issue more commands, do that here. */
>     dir = nbd_aio_get_direction (nbd);
>     pollfd[0].fd = fd;
>     pollfd[0].events = 0;
>     if (dir & LIBNBD_AIO_DIRECTION_READ) pollfd[0].events |= POLLIN;
>     if (dir & LIBNBD_AIO_DIRECTION_WRITE) pollfd[0].events |= POLLOUT;
>     poll (pollfd, 1, -1);
>     if (pollfd[0].revents & LIBNBD_AIO_DIRECTION_READ)

Rather, if (pollfd[0].revents & POLLIN && dir & LIBNBD_AIO_DIRECTION_READ)

>       nbd_aio_notify_read ();
>     else if (pollfd[0].revents & LIBNBD_AIO_DIRECTION_WRITE)

and again

>       nbd_aio_notify_write ();
>   }

or in the nbdkit case:

nbdkit thread
  nbd_aio_pwrite ();
  write (selfpipe[1], &c, 1);

reader thread
  for (;;) {
    dir = nbd_aio_get_direction (nbd);
    pollfd[0].fd = fd;
    pollfd[0].events = 0;
    pollfd[1].fd = selfpipe[0];
    pollfd[1].events = POLLIN;
    if (dir & LIBNBD_AIO_DIRECTION_READ) pollfd[0].events |= POLLIN;
    if (dir & LIBNBD_AIO_DIRECTION_WRITE) pollfd[0].events |= POLLOUT;
    poll (pollfd, 2, -1);
    if (pollfd[0].revents & POLLIN && dir & LIBNBD_AIO_DIRECTION_READ)
       nbd_aio_notify_read ();
    else if (pollfd[0].revents & POLLOUT && dir & LIBNBD_AIO_DIRECTION_WRITE)
      nbd_aio_notify_write ();
    if (pollfd[1].revents & POLLIN)
      read (selfpipe[0], &c, 1);
  }

> 
> ----------------------------------------------------------------------
> 
> The above code is of course assuming a single thread.  But to
> simultaneously read and write we will need at least two threads.

If we have two threads, an obvious split would be to state that requests
must be done from one thread while replies are done from another, at
which point the state machine would look something like:

nbdkit thread
  id = nbd_aio_pwrite ();
  while (!nbd_aio_command_inflight(id)) {
    pollfd[0].fd = fd;
    pollfd[0].events = POLLOUT;
    poll (pollfd, 1, -1);
    if (pollfd[0].revents & POLLOUT)
      nbd_aio_notify_write ();
  }

reader thread
  while (!nbd_aio_is_rstate_ready()) {
    pollfd[0].fd = fd;
    pollfd[0].events = POLLIN;
    poll (pollfd, 1, -1);
    if (pollfd[0].revents & POLLIN)
       nbd_aio_notify_read ();
  }

where we could also add a new convenience API:

nbd_aio_pwrite_send()

that does the work mentioned for the nbdkit thread, and is documented to
block the thread up until the point in time where the state machine has
finished sending the request (even if it had to poll on POLLOUT in the
middle), but before waiting for the server's reply; it may lock out
other potential writers (that is, hold the h->wstate lock for the
entire duration), but must not lock out readers (that is, it must drop
h->lock as soon as we prove that we are okay sending the request).

There probably need to be checks for POLLHUP or POLLERR in there, to
ensure the loop quits if the server hangs up or if we encounter an error
that prevents further request/reply sequences.
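
Folding those checks in, the caller-visible behaviour of the proposed
nbd_aio_pwrite_send() would be roughly equivalent to the following
(sketch only; nbd_aio_command_inflight() is the hypothetical predicate
from the loop above):

  id = nbd_aio_pwrite (/* buf, count, offset, ... */);
  while (!nbd_aio_command_inflight (id)) {
    pollfd[0].fd = fd;
    pollfd[0].events = POLLOUT;
    if (poll (pollfd, 1, -1) == -1)
      break;                         /* real code would check errno/EINTR */
    if (pollfd[0].revents & (POLLERR | POLLHUP))
      break;                         /* server hung up or socket error */
    if (pollfd[0].revents & POLLOUT)
      nbd_aio_notify_write ();
  }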

> 
> It's hard for me to see a nice way to evolve the API to support
> multiple threads, but I guess issues include:
> 
>  - If one thread is waiting in poll(2) and another thread issues a
>    command, how do we get the first thread to return from poll and
>    reevaluate direction?  (Eric suggests a pipe-to-self for this)

That is, if you liked my self-pipe solution, we could somehow fold that
self-pipe into libnbd internals instead of making each caller
re-implement it. One idea I had for that was, at least on Linux, to use
an epoll fd, where the user's poll loop is as simple as:

  while (!nbd_aio_is_rstate_ready()) {
    pollfd[0].fd = fd;
    pollfd[0].events = POLLIN;
    poll (pollfd, 1, -1);
    if (pollfd[0].revents & POLLIN)
       nbd_aio_notify_read ();
  }

because pollfd[0].fd is really an epoll fd (and libnbd, under the hood,
actually calls epoll_wait() during nbd_aio_notify_read() to learn what
REALLY happened, whether it was the wstate machine adding a request to
the queue and tickling the self-pipe, or the rstate machine making
progress).  But in that model, all write()s are now done solely from the
single reader loop, during nbd_aio_notify_read(); the other threads
calling nbd_aio_pwrite() never call write directly, and we're back to
the question of how we can read and write to the same socket
simultaneously - so in addition to having two state machines, we might
also have to make libnbd itself run two separate threads per handle.
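
Under that idea, the setup inside libnbd might look something like this
(purely illustrative; the epfd, sock and selfpipe fields are
hypothetical):

  #include <sys/epoll.h>

  h->epfd = epoll_create1 (EPOLL_CLOEXEC);

  struct epoll_event ev = { .events = EPOLLIN, .data.fd = h->sock };
  epoll_ctl (h->epfd, EPOLL_CTL_ADD, h->sock, &ev);

  ev.data.fd = h->selfpipe[0];
  epoll_ctl (h->epfd, EPOLL_CTL_ADD, h->selfpipe[0], &ev);

  /* nbd_aio_get_fd() would then return h->epfd, and nbd_aio_notify_read()
   * would use epoll_wait() to see which of the two fds actually fired. */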

It's a shame that Linux does not yet have true aio for network packets:
 - glibc's implementation of POSIX aio_* creates a helper thread under
the hood, the same as I suggested we might do above
 - the Linux kernel's io_submit() still performs I/O in order rather
than in parallel
https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/
 - the Linux kernel's sendto(MSG_ZEROCOPY) can perform true async
network traffic (you can regain control of your single thread prior to
the full write going out the door, and thus can perform a read at the
same time from a single thread), although it is only efficient once you
reach the point of sending 10k or more at once
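
For reference, the MSG_ZEROCOPY flow (see the kernel's
Documentation/networking/msg_zerocopy.rst) looks roughly like this; the
request buffer has to stay untouched until a completion notification
shows up on the socket's error queue:

  int one = 1;
  setsockopt (sock, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof one);

  /* Returns once the pages are pinned and queued, not once the data is
   * on the wire, so this thread is free to go service reads. */
  send (sock, request, request_len, MSG_ZEROCOPY);

  /* Later: reap the completion so the request buffer can be reused. */
  char control[128];
  struct msghdr msg = { .msg_control = control,
                        .msg_controllen = sizeof control };
  recvmsg (sock, &msg, MSG_ERRQUEUE);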

> 
>  - If two threads are waiting in poll(2) and both are notified that
>    the fd is ready, how do we get one of them to read and the other to
>    write?  I think this implies that we don't have two threads doing
>    poll(2), but if not how do we farm out the read/write work to two
>    or more threads?

Either we document that when using two threads, the user runs one poll
loop strictly on POLLIN and the other loop strictly on POLLOUT (and
doesn't have to bother with direction changes) - but only after the nbd
handle has reached is_ready() for the first time (the handshake phase is
very much lock-step, where a single thread checking the direction all
the time really is the best solution).  Or we tell the user to run a
single loop while libnbd manages a second thread under the hood, so the
user only has to poll on POLLIN and libnbd handles the coordination with
its helper thread internally.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
