[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [Libguestfs] Fwd: nbdkit async

The concern is that a client is blocked while a request is being processed. The nbdkit server design requires a thread per request being processed, regardless of the number of connections or clients. We want to run thousands of requests in parallel without needing a thread at the nbdkit layer per request in flight.

Our plugin layer is built around Boost.Asio: a few threads in a worker pool running an io service can be processing thousands of requests in parallel. (Our plugin is a gateway of sorts, and requests are sent back out over the network. While our plugin waits for the read or write data, we don't block in a thread; we handle other requests that are ready.)

The current nbdkit server design requires a thread per request in progress because it is built around a synchronous callback into the plugin layer, and the main recv_request_send_reply loop holds the only copy of the request handle that is needed to make the reply.

A more flexible design would split the recv_request_send_reply loop into a recv_request loop and a send_reply function. The recv_request loop forwards the request handle to the handle_request call. The existing nbdkit_plugin struct would function identically, with send_reply called after plugin.pread (plus any FUA flush) has finished.

An alternative plugin struct "nbdkit_plugin_lowlevel" could define a different interface, where an opaque pointer to the handle + connection + flags is passed in, and the plugin is required to call nbdkit_reply (void *op, ...) to send the reply to the NBD client, rather than nbdkit automatically sending the reply after the plugin function returns.

Some pseudo-code example changes for what I had in mind:

struct operation {
    struct connection *conn;
    uint64_t handle;
    uint32_t cmd;
    uint32_t flags;
    char *buf;
    uint32_t count;
};

// unchanged
struct nbdkit_plugin {
  int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset);
  int (*pwrite) (void *handle, const void *buf, uint32_t count, uint64_t offset);
  int (*flush) (void *handle);
  int (*trim) (void *handle, uint32_t count, uint64_t offset);
  int (*zero) (void *handle, uint32_t count, uint64_t offset, int may_trim);
};

// new lowlevel api
struct nbdkit_plugin_lowlevel {
  int (*pread) (void *op, void *handle, void *buf, uint32_t count, uint64_t offset);
  int (*pwrite) (void *op, void *handle, const void *buf, uint32_t count, uint64_t offset);
  int (*flush) (void *op, void *handle);
  int (*trim) (void *op, void *handle, uint32_t count, uint64_t offset);
  int (*zero) (void *op, void *handle, uint32_t count, uint64_t offset, int may_trim);
};

// Called by the lowlevel plugin to send a reply to the client
void
nbdkit_reply (void *vop)
{
  int r;
  bool flush_after_command;
  struct operation *op = (struct operation *) vop;

  flush_after_command = (op->flags & NBD_CMD_FLAG_FUA) != 0;
  if (!op->conn->can_flush || op->conn->readonly)
    flush_after_command = false;

  if (flush_after_command) {
    op->flags = 0; // clear FUA so the flush completion doesn't flush again
    r = plugin_flush_lowlevel (op, op->conn->handle);
    if (r == -1)
      /* do error stuff */;
  }
  else if (op->cmd == NBD_CMD_READ)
    send_reply (op, op->count, op->buf, 0);
  else
    send_reply (op, 0, NULL, 0);
}

int
my_plugin_pwrite_lowlevel (void *op, void *handle, const void *buf,
                           uint32_t count, uint64_t offset)
{
  if (is_readonly (handle)) {
    nbdkit_reply_error (op, EROFS);
    return 0;
  }

  if (/* some critically bad issue */)
    return -1;

  // this returns right away, before the write has completed; when it
  // does complete, the handler lambda is called
  my_storage.async_write (buf, count, offset,
      [op] (const boost::system::error_code &ec) {
        if (ec)
          nbdkit_reply_error (op, ec.value ());
        else
          nbdkit_reply (op);
      });

  return 0;
}

// connections.c
static int
_send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)
{
  int r;
  struct reply reply;

  reply.magic = htobe32 (NBD_REPLY_MAGIC);
  reply.handle = op->handle;
  reply.error = htobe32 (nbd_errno (error));

  if (error != 0) {
    /* Since we're about to send only the limited NBD_E* errno to the
     * client, don't lose the information about what really happened
     * on the server side.  Make sure there is a way for the operator
     * to retrieve the real error.
     */
    debug ("sending error reply: %s", strerror (error));
  }

  r = xwrite (op->conn->sockout, &reply, sizeof reply);
  if (r == -1) {
    nbdkit_error ("write reply: %m");
    return -1;
  }

  if (op->cmd == NBD_CMD_READ) { /* Send the read data buffer. */
    r = xwrite (op->conn->sockout, buf, count);
    if (r == -1) {
      nbdkit_error ("write data: %m");
      return -1;
    }
  }

  return 0;
}

// new mutex on writes due to the parallel nature of responding to the socket
static int
send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)
{
  int r;

  plugin_lock_reply (op->conn);
  r = _send_reply (op, count, buf, error);
  plugin_unlock_reply (op->conn);

  free (op);
  return r;
}

On Mon, Feb 20, 2017 at 4:03 AM, Richard W.M. Jones <rjones redhat com> wrote:
> ----- Forwarded message -----
> Date: Sat, 18 Feb 2017 22:21:19 -0500
> Subject: nbdkit async
> Hello,
> Hope this is the right person to contact regarding nbdkit design.
> I have a high latency massively parallel device that I am currently
> implementing as an nbdkit plugin in c++ and have run into some design
> limitations due to the synchronous callback interface nbdkit requires.

Is the concern that each client requires a single thread, consuming
memory (eg for stack space), but because of the high latency plugin
these threads will be hanging around not doing very much?  And/or is
it that the client is blocked while servicing each request?

> Nbdkit is currently designed to call the plugin
> pread/pwrite/trim/flush/zero ops as synchronous calls and expects when the
> plugin functions return that it can then send the nbd reply to the socket.
> Its parallel thread model is also not implemented as of yet

I think this bit can be fixed fairly easily.  One change which is
especially easy to make is to send back the NBD_FLAG_CAN_MULTI_CONN
flag (under control of the plugin).

Anyway this doesn't solve your problem ...

> but the
> current design still mandates a worker thread per parallel op in progress
> due to the synchronous design of the plugin calls.

And the synchronous / 1 thread per client design of the server.

> I would like to modify this to allow for an alternative operating mode
> where nbdkit calls the plugin functions and expects the plugin to callback
> to nbdkit when a request has completed rather than responding right after
> the plugin call returns to nbdkit.
> If you are familiar with fuse low level api design, something very similar
> to that.
> An example flow for a read request would be as follows:
> 1) nbdkit reads and validates the request from the socket
> 2) nbdkit calls handle_request but now also passing in the nbd request
> handle value
> 3) nbdkit bundles the nbd request handle value, bool flush_on_update, and
> read size into an opaque ptr to struct
> 4) nbdkit calls my_plugin.pread passing in the usual args + the opaque ptr

We can't change the existing API, so this would have to be exposed
through new plugin entry point(s).

> 5) my_plugin.pread makes an asynchronous read call with a handler set on
> completion to call nbdkit_reply_read(conn, opaque ptr, buf) or on error
> nbdkit_reply_error(conn, opaque_ptr, error)
> 6) my_plugin.pread returns back to nbdkit without error after it has
> started the async op but before it has completed
> 7) nbdkit doesn't send a response to the conn->sockout because when the
> async op has completed my_plugin will callback to nbdkit for it to send the
> response
> 8) nbdkit loop continues right away on the next request and it reads and
> validates the next request from conn->sockin without waiting for the
> previous request to complete
> *) Now requires an additional mutex on the conn->sockout for writing
> responses
> The benefit of this approach is that 2 threads (1 thread for reading
> requests from the socket and kicking off requests to the plugin and 1
> thread (or more) in a worker pool executing the async handler callbacks)
> can service 100s of slow nbd requests in parallel overcoming high latency.
> The current design with synchronous callbacks we can provide async in our
> plugin layer for pwrites and implement our flush to enforce it but we can't
> get around a single slow high latency read op blocking the entire pipe.
> I'm willing to work on this in a branch and push this up as opensource but
> first wanted to check if this functionality extension is in fact something
> redhat would want for nbdkit and if so if there were suggestions to the
> implementation.

It depends on how much it complicates the internals of nbdkit (which
are currently quite simple).  Would need to see the patches ...

You can help by splitting into simple changes which are generally
applicable (eg. supporting NBD_FLAG_CAN_MULTI_CONN), and other changes
which are more difficult to integrate.

> Initial implementation approach was going to be similar to the
> fuse_low_level approach and create an entirely separate header file for the
> asynchronous plugin api because the plugin calls now need an additional
> parameter (opaque ptr to handle for nbdkit_reply_). This header file
> nbdkit_plugin_async.h defines the plugin struct with slightly different
> function ptr prototypes that accept the opaque ptr to nbd request handle
> and some additional callback functions nbdkit_reply_error, nbdkit_reply,
> and nbdkit_reply_read. The user of this plugin interface is required to
> call either nbdkit_reply_error or nbdkit_reply[_read] in each of the
> pread/pwrite/flush/trim/zero ops.
> If you got this far thank you for the long read and please let me know if
> there is any interest.


Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
