[Libguestfs] [libnbd PATCH] states: Never block state machine inside REPLY

Richard W.M. Jones rjones at redhat.com
Tue Jun 25 09:03:13 UTC 2019


On Wed, Jun 19, 2019 at 01:18:01PM -0500, Eric Blake wrote:
> Oddly enough, I am not getting any measurable performance difference
> with this patch applied and using examples/threaded-reads-and-writes
> coupled with nbdkit.  My explanation is that in the common case, once
> a server has something to send, it is going to send the entire reply
> as fast as it can, rather than sending a partial reply and waiting;
> and our attempts to keep up to 64 commands in flight mean that our
> real bottleneck is not sending our next command (even if we have to
> wait for the server's reply to finish), but at the server (if we are
> saturating the server's amount of in-flight requests, the server won't
> read our next request until it has finished its current reply).
> Still, I'm sure that it is possible to construct corner cases where
> this shows more of an effect.

What I'm going to say is I think stating the obvious, but my intuition
is there are going to be two types of load.  In the first and easiest
case you want to read or write sequentially over the whole or a large
section of the image (think: qemu-img convert).  In this case you know
well in advance what parts of the image you want to read/write and can
keep the in flight queue full at all times.  This is what
threaded-reads-and-writes actually tests.

The harder case is random access (think: qemu with a database guest).
There is a short queue, for example because of data dependencies
between the requests, so the issued commands buffer is always short.
Also there may be server latency because of the random access pattern.
This kind of workload should show the benefit of this commit, but we
don't really have a test for this kind of workload.

> 
> Applies on top of my series to add nbd_aio_pread_callback.
> 
>  generator/generator                 | 22 +++++++++---------
>  generator/states-reply-simple.c     |  4 ++++
>  generator/states-reply-structured.c | 32 +++++++++++++++++++++++++
>  generator/states-reply.c            | 36 ++++++++++++++++++++++++++++-
>  lib/internal.h                      |  1 +
>  5 files changed, 83 insertions(+), 12 deletions(-)
> 
> diff --git a/generator/generator b/generator/generator
> index 630260b..68a4fdf 100755
> --- a/generator/generator
> +++ b/generator/generator
> @@ -682,14 +682,14 @@ and reply_state_machine = [
>      default_state with
>      name = "START";
>      comment = "Prepare to receive a reply from the remote server";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
> 
>    State {
>      default_state with
>      name = "RECV_REPLY";
>      comment = "Receive a reply from the remote server";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
> 
>    State {
> @@ -723,7 +723,7 @@ and simple_reply_state_machine = [
>      default_state with
>      name = "RECV_READ_PAYLOAD";
>      comment = "Receiving the read payload for a simple reply";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
>  ]
> 
> @@ -740,7 +740,7 @@ and structured_reply_state_machine = [
>      default_state with
>      name = "RECV_REMAINING";
>      comment = "Receiving the remaining part of a structured reply";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
> 
>    State {
> @@ -754,49 +754,49 @@ and structured_reply_state_machine = [
>      default_state with
>      name = "RECV_ERROR";
>      comment = "Receive a structured reply error header";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = []
>    };
> 
>    State {
>      default_state with
>      name = "RECV_ERROR_MESSAGE";
>      comment = "Receive a structured reply error message";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
> 
>    State {
>      default_state with
>      name = "RECV_ERROR_TAIL";
>      comment = "Receive a structured reply error tail";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
> 
>    State {
>      default_state with
>      name = "RECV_OFFSET_DATA";
>      comment = "Receive a structured reply offset-data header";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
> 
>    State {
>      default_state with
>      name = "RECV_OFFSET_DATA_DATA";
>      comment = "Receive a structured reply offset-data block of data";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
> 
>    State {
>      default_state with
>      name = "RECV_OFFSET_HOLE";
>      comment = "Receive a structured reply offset-hole header";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
> 
>    State {
>      default_state with
>      name = "RECV_BS_ENTRIES";
>      comment = "Receive a structured reply block-status payload";
> -    external_events = [ NotifyRead, "" ];
> +    external_events = [];
>    };
> 
>    State {
> diff --git a/generator/states-reply-simple.c b/generator/states-reply-simple.c
> index ddc91ce..3b63d07 100644
> --- a/generator/states-reply-simple.c
> +++ b/generator/states-reply-simple.c
> @@ -53,6 +53,10 @@
> 
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1:
> +    save_reply_state (h);
> +    SET_NEXT_STATE (%.READY);
> +    return 0;
>    case 0:
>      /* guaranteed by START */
>      assert (cmd);
> diff --git a/generator/states-reply-structured.c b/generator/states-reply-structured.c
> index 00659de..594525e 100644
> --- a/generator/states-reply-structured.c
> +++ b/generator/states-reply-structured.c
> @@ -38,6 +38,10 @@
>   REPLY.STRUCTURED_REPLY.RECV_REMAINING:
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1:
> +    save_reply_state (h);
> +    SET_NEXT_STATE (%.READY);
> +    return 0;
>    case 0:  SET_NEXT_STATE (%CHECK);
>    }
>    return 0;
> @@ -154,6 +158,10 @@
> 
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1:
> +    save_reply_state (h);
> +    SET_NEXT_STATE (%.READY);
> +    return 0;
>    case 0:
>      length = be32toh (h->sbuf.sr.structured_reply.length);
>      msglen = be16toh (h->sbuf.sr.payload.error.error.len);
> @@ -176,6 +184,10 @@
> 
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1:
> +    save_reply_state (h);
> +    SET_NEXT_STATE (%.READY);
> +    return 0;
>    case 0:
>      length = be32toh (h->sbuf.sr.structured_reply.length);
>      msglen = be16toh (h->sbuf.sr.payload.error.error.len);
> @@ -219,6 +231,10 @@
> 
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1:
> +    save_reply_state (h);
> +    SET_NEXT_STATE (%.READY);
> +    return 0;
>    case 0:
>      error = be32toh (h->sbuf.sr.payload.error.error.error);
>      type = be16toh (h->sbuf.sr.structured_reply.type);
> @@ -268,6 +284,10 @@
> 
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1:
> +    save_reply_state (h);
> +    SET_NEXT_STATE (%.READY);
> +    return 0;
>    case 0:
>      length = be32toh (h->sbuf.sr.structured_reply.length);
>      offset = be64toh (h->sbuf.sr.payload.offset_data.offset);
> @@ -324,6 +344,10 @@
> 
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1:
> +    save_reply_state (h);
> +    SET_NEXT_STATE (%.READY);
> +    return 0;
>    case 0:
>      length = be32toh (h->sbuf.sr.structured_reply.length);
>      offset = be64toh (h->sbuf.sr.payload.offset_data.offset);
> @@ -349,6 +373,10 @@
> 
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1:
> +    save_reply_state (h);
> +    SET_NEXT_STATE (%.READY);
> +    return 0;
>    case 0:
>      offset = be64toh (h->sbuf.sr.payload.offset_hole.offset);
>      length = be32toh (h->sbuf.sr.payload.offset_hole.length);
> @@ -410,6 +438,10 @@
> 
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1:
> +    save_reply_state (h);
> +    SET_NEXT_STATE (%.READY);
> +    return 0;
>    case 0:
>      length = be32toh (h->sbuf.sr.structured_reply.length);
> 
> diff --git a/generator/states-reply.c b/generator/states-reply.c
> index 54f98c5..99e54f6 100644
> --- a/generator/states-reply.c
> +++ b/generator/states-reply.c
> @@ -16,10 +16,43 @@
>   * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
>   */
> 
> -/* State machine for receiving reply messages from the server. */
> +#include <assert.h>
> +
> +/* State machine for receiving reply messages from the server.
> + *
> + * Note that we never block while in this sub-group. If there is
> + * insufficient data to finish parsing a reply, requiring us to block
> + * until POLLIN, we instead track where in the state machine we left
> + * off, then return to READY to actually block. Then, on entry to
> + * REPLY.START, we can tell if this is the start of a new reply (rlen
> + * is 0, stay put), a continuation of the preamble (reply_cmd is NULL,
> + * resume with RECV_REPLY), or a continuation from any other location
> + * (reply_cmd contains the state to jump to).
> + */
> +
> +static void
> +save_reply_state (struct nbd_handle *h)
> +{
> +  assert (h->reply_cmd);
> +  assert (h->rlen);
> +  h->reply_cmd->state = get_next_state (h);
> +}
> +
> +/*----- End of prologue. -----*/
> 
>  /* STATE MACHINE */ {
>   REPLY.START:
> +  /* If rlen is non-zero, we are resuming an earlier reply cycle. */
> +  if (h->rlen > 0) {
> +    if (h->reply_cmd) {
> +      assert (nbd_internal_is_state_processing (h->reply_cmd->state));
> +      SET_NEXT_STATE (h->reply_cmd->state);

This is essentially the "stack of states" idea that I had early on,
but with a maximum stack depth of 1.

I originally rejected the idea then because it means that the
generator would not longer have a complete view of all state
transitions in the state machine, and thus wouldn't be able to enforce
invariants such as not jumping to a state in the middle of a state
group.  (In fact with this patch that is now the case -- I thought the
generator would give an error about this, but maybe my test is wrong).

Nevertheless given that we need to do this, maybe it's better to drop
the idea that the generator needs to have a complete view.

In my original plan we would have had "push state" and "pop state"
operations (pop state is a general jump to "any state", which is why
it breaks the assumption in the generator).

> +    }
> +    else
> +      SET_NEXT_STATE (%RECV_REPLY);
> +    return 0;
> +  }
> +
>    /* This state is entered when a read notification is received in the
>     * READY state.  Therefore we know the socket is readable here.
>     * Reading a zero length now would indicate that the socket has been
> @@ -66,6 +99,7 @@
>   REPLY.RECV_REPLY:
>    switch (recv_into_rbuf (h)) {
>    case -1: SET_NEXT_STATE (%.DEAD); return -1;
> +  case 1: SET_NEXT_STATE (%.READY); return 0;
>    case 0: SET_NEXT_STATE (%CHECK_SIMPLE_OR_STRUCTURED_REPLY);
>    }
>    return 0;
> diff --git a/lib/internal.h b/lib/internal.h
> index a1e27df..662ff7a 100644
> --- a/lib/internal.h
> +++ b/lib/internal.h
> @@ -253,6 +253,7 @@ struct command_in_flight {
>    uint32_t count;
>    void *data; /* Buffer for read/write */
>    struct command_cb cb;
> +  enum state state; /* State to resume with on next POLLIN */
>    bool data_seen; /* For read, true if at least one data chunk seen */
>    uint32_t error; /* Local errno value */
>  };

The patch seems reasonable.  Does this obviate any need to split the
state machine?

ACK

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/




More information about the Libguestfs mailing list