[Libguestfs] [PATCH libnbd v2 3/4] api: Implement concurrent writer.

Eric Blake eblake at redhat.com
Tue Jun 4 13:15:52 UTC 2019


On 6/4/19 4:59 AM, Richard W.M. Jones wrote:
> Callers may now optionally set up a concurrent writer thread.  The
> outline of this idea is explained here:
> 
> https://www.redhat.com/archives/libguestfs/2019-June/msg00010.html
> 
> The change is quite small, but here are some points which are true but
> may not be obvious:
> 
> * All writes return immediately with success (unless we run out of
>   memory), and writes never block.
> 
> * When going down the READY -> ISSUE_COMMAND.START (etc) path, because
>   writes never block, we never stop for a read notification, so we
>   always send all commands in the issue queue before starting to read
>   any replies.  (However this only means that the raw commands are
>   enqueued in the writer thread, which can still take its merry time
>   writing the requests to the socket.)
> 
> * Commands can be counted as "in flight" when they haven't been
>   written to the socket yet.  This is another reason for increasing
>   the in flight limits.  Does this matter?  Probably not.  The kernel
>   might also queue up packets before sending them, or they might be in
>   flight across the internet or queued at the receiver.  Nothing about
>   "in flight" ever meant that the server has received and is
>   processing those packets.
> 
> * Even with this change, full parallelism isn't quite achievable.
>   It's still possible to be in a state such as
>   REPLY.STRUCTURED_REPLY.RECV_* waiting for an unusually fast writer /
>   slow reader server.  If you then decide that you want to send yet

Or more likely, when NBD_CMD_READ is so large that the server can't
send it all within the TCP window: then we are guaranteed to have to
wait for POLLIN partway through the chunk, after reading the first few
packets (which frees up the server's ability to send again) and before
the rest of the chunk arrives.  As long as we are waiting in
REPLY.STRUCTURED_REPLY.RECV_OFFSET_DATA or
REPLY.SIMPLE_REPLY.RECV_READ_PAYLOAD, we are locking out our ability to
send more commands unless we rework the state machine a bit.

>   more commands then those commands will only be enqueued in the
>   handle, not dispatched to the writer thread.  To avoid this it is
>   best to send as many commands as possible as soon as possible before
>   entering poll, but to a certain extent this is unavoidable with
>   having only one state machine.

If the state machine can make decisions based on whether there is a
writer callback, perhaps we can allow more states to (conditionally)
process a pending request immediately, any time we are otherwise blocked
waiting for POLLIN.  I'll have to think more about this.
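
In the meantime the best a caller can do is keep its own issue queue
topped up before blocking, something like this (the helpers and the
in-flight counter are hypothetical application code, not libnbd API;
only the shape matters):

  /* Issue as many commands as we can before dropping into poll(2).
   * want_more_io(), issue_next_command() and MAX_IN_FLIGHT are
   * hypothetical application helpers, not libnbd API. */
  while (want_more_io () && in_flight < MAX_IN_FLIGHT) {
    issue_next_command (nbd);   /* e.g. an nbd_aio_pread/pwrite call */
    in_flight++;
  }
  /* ... then enter the poll loop as usual ... */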

> ---
>  docs/libnbd.pod     | 73 +++++++++++++++++++++++++++++++++++++++++++++
>  generator/generator | 29 ++++++++++++++++++
>  lib/handle.c        | 32 ++++++++++++++++++++
>  lib/internal.h      |  7 +++++
>  lib/socket.c        | 27 ++++++++++++++---
>  podwrapper.pl.in    |  3 +-
>  6 files changed, 166 insertions(+), 5 deletions(-)
> 
> diff --git a/docs/libnbd.pod b/docs/libnbd.pod
> index ede2539..07d259f 100644
> --- a/docs/libnbd.pod
> +++ b/docs/libnbd.pod
> @@ -400,6 +400,79 @@ If you are issuing multiple in-flight requests (see above) and
>  limiting the number, then the limit should be applied to each
>  individual NBD connection.
>  
> +=head2 Concurrent writer thread
> +
> +To achieve the maximum possible performance from libnbd and NBD
> +servers, as well as the above techniques you must also use a
> +concurrent writer thread.  This feature allows requests to be issued
> +on the NBD socket at the same time that replies are being read from
> +the socket.  In other words L<send(2)> and L<recv(2)> calls will be
> +running at the same time on the same socket from different threads.
> +
> +There is a full example using a concurrent writer available at
> +L<https://github.com/libguestfs/libnbd/blob/master/examples/concurrent-writer.c>
> +
> +To implement this, you change your ordinary AIO code in four ways:
> +
> +=over 4
> +
> +=item 1. Call nbd_set_concurrent_writer
> +
> + struct writer_data {
> +   struct nbd_handle *nbd;
> +   /* other data here as required */
> + } data;
> + 
> + nbd_set_concurrent_writer (nbd, &data, writer);
> +
> +This function can be called on the handle at any time, either after
> +the handle is created or after the connection and handshaking has
> +completed.
> +
> +=item 2. Implement non-blocking writer callback
> +
> +C<writer> is a I<non-blocking> callback which enqueues the buffer into
> +a ring or similar FIFO structure:
> +
> + struct ring_item {
> +   struct writer_data *data;
> +   const void *buf;
> +   size_t len;
> + };
> + 
> + void writer (void *data, const void *buf, size_t len)

Update to cover the int return type, and whether we add a flags argument.

> + {
> +   struct ring_item item;
> + 
> +   /* add (data, buf, len) to a shared ring */
> +   item.data = data;
> +   item.buf = malloc (len);
> +   memcpy (item.buf, buf, len);
> +   item.len = len;
> +   ring_add (&item);
> + 
> +   writer_signal ();   /* kick the writer thread */

Update to document immediate error return (ENOMEM)

> + }
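
For concreteness, once both of those updates are made I'd expect the
example to end up looking something like this (the int return
convention and the flags argument are my assumptions, not what v2
implements):

  int
  writer (void *data, const void *buf, size_t len, uint32_t flags)
  {
    struct ring_item item;

    item.data = data;
    item.buf = malloc (len);
    if (item.buf == NULL)
      return errno;           /* immediate error return, e.g. ENOMEM */
    memcpy ((void *) item.buf, buf, len);  /* cast: item.buf is const */
    item.len = len;
    ring_add (&item);

    writer_signal ();         /* kick the writer thread */
    return 0;
  }
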
> +
> +=item 3. Implement writer thread
> +
> +You must also supply another thread which picks up data off the ring
> +and writes it to the socket (see C<nbd_aio_get_fd>).  If there is an
> +error when writing to the socket, call C<nbd_concurrent_writer_error>
> +with the C<errno>.
> +
> +You have a choice of whether to implement one thread per nbd_handle or
> +one thread shared between all handles.
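
Perhaps also worth sketching that thread in the docs.  Reusing the
hypothetical ring helpers from the example above (writer_wait() and
ring_get() being the counterparts of writer_signal() and ring_add()),
it might look roughly like:

  static void *
  writer_thread (void *arg)
  {
    struct writer_data *data = arg;
    int fd = nbd_aio_get_fd (data->nbd);
    struct ring_item item;

    for (;;) {
      writer_wait ();              /* block until writer() signals us */
      while (ring_get (&item)) {   /* drain everything queued so far */
        const char *p = item.buf;
        size_t n = item.len;

        while (n > 0) {
          ssize_t r = send (fd, p, n, 0);
          if (r == -1) {
            if (errno == EINTR)
              continue;
            nbd_concurrent_writer_error (data->nbd, errno);
            free ((void *) item.buf);
            return NULL;
          }
          p += r;
          n -= r;
        }
        free ((void *) item.buf);
      }
    }
  }
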
> +
> +=item 4. Modify main loop
> +
> +Finally your main loop can unconditionally call
> +C<nbd_aio_notify_write> when C<nbd_aio_get_direction> returns C<WRITE>
> +or C<BOTH> (since the concurrent thread can always enqueue more data
> +and so is always "ready to write").

Should it likewise unconditionally poll for POLLIN, even if
aio_get_direction does not currently request reads (for the case where
our atomic read of the current state spots a transient state left by
some other thread partway through emitting a request)?  Or are we
trying to beef up the state machine so that h->state never exposes
transient states to other threads?
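
Either way, the loop I have in mind looks roughly like this, just
following what the patch documents (so it sidesteps the POLLIN question
above); LIBNBD_AIO_DIRECTION_* and nbd_aio_notify_* are the existing
aio API:

  int fd = nbd_aio_get_fd (nbd);

  while (!quit) {                   /* quit is application state */
    unsigned dir = nbd_aio_get_direction (nbd);
    struct pollfd pfd = { .fd = fd, .events = 0, .revents = 0 };

    if (dir & LIBNBD_AIO_DIRECTION_READ)
      pfd.events |= POLLIN;
    /* No POLLOUT: with a concurrent writer, socket writes happen in
     * the writer thread, never here. */

    if (poll (&pfd, 1, -1) == -1)
      break;

    if ((dir & LIBNBD_AIO_DIRECTION_READ) && (pfd.revents & POLLIN))
      nbd_aio_notify_read (nbd);

    /* The writer callback never blocks, so we are always "ready to
     * write" whenever the state machine wants to write. */
    if (dir & LIBNBD_AIO_DIRECTION_WRITE)
      nbd_aio_notify_write (nbd);
  }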

> +
> +=back
> +
>  =head1 ENCRYPTION AND AUTHENTICATION
>  
>  The NBD protocol and libnbd supports TLS (sometimes incorrectly called
> diff --git a/generator/generator b/generator/generator
> index ff6075d..718e253 100755
> --- a/generator/generator
> +++ b/generator/generator
> @@ -1094,6 +1094,35 @@ C<\"qemu:dirty-bitmap:...\"> for qemu-nbd
>  (see qemu-nbd I<-B> option).  See also C<nbd_block_status>.";
>    };
>  
> +  "set_concurrent_writer", {
> +    default_call with
> +    args = [ Opaque "data";
> +             CallbackPersist ("writer", [Opaque "data";
> +                                         BytesIn ("buf", "len")]) ];
> +    ret = RErr;
> +    permitted_states = [ Created; Connecting; Connected ];
> +    shortdesc = "set a concurrent writer thread";
> +    longdesc = "\
> +Provide an optional concurrent writer thread for better performance.
> +See L<libnbd(3)/Concurrent writer thread> for how to use this.";
> +  };
> +
> +  "concurrent_writer_error", {
> +    default_call with
> +    args = [ Int "err" ]; ret = RErr;
> +    shortdesc = "signal an error from the concurrent writer thread";
> +    longdesc = "\
> +This can be called from the concurrent writer thread to signal
> +that there was an error writing to the socket.  As there is no
> +way to recover from such errors, the connection will move to the
> +dead state soon after.
> +
> +The parameter is the C<errno> returned by the failed L<send(2)> call.
> +It must be non-zero.
> +
> +See L<libnbd(3)/Concurrent writer thread> for how to use this.";
> +  };

Do we also need a function for the writer thread to call when it is
confirming that the writer callback was passed a flag stating that no
further writes are needed? I'm trying to figure out if
nbd_shutdown/nbd_close should wait for acknowledgment that the writer
thread has reached clean shutdown; it may especially matter for clean
TLS shutdown.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
