<div dir="ltr"><div><div><div>The concern is a client is blocked while processing a request. The nbdkit server design requires a thread per request being processed regardless of the number of connections or clients. We want to run 1000's of requests in parallel without needing a thread at nbdkit layer per request in flight.<br><br>Our plugin layer is built around boost asio and a few threads in a worker pool running an io service can be processing 1000s of requests in parallel. (Our plugin is a gateway of sorts and requests are sent back out over the network. While our plugin waits for the read or write data we don't block in a thread we handle other requests that are ready).<br><br></div><div>The current nbdkit server design requires a thread per request in progress because it is built around a synchronous callback to the plugin layer and the main recv_request_send_reply loop holds the only copy of the request handle that is needed to make the reply.<br><br></div><div>A more flexible design would be the recv_request_send_reply loop is instead split into a recv_request loop and a send_reply func. The recv_request loop forwards the request handle to the handle_request call. The existing nbdkit_plugin struct would function identically and send_reply is called after the plugin.pread + fua is finished.<br><br></div><div>An alternative plugin struct "nbdkit_plugin_lowlevel" could define a different interface where an opaque ptr to the handle + connection + flags are passed in and the plugin is required to call nbdkit_reply (opaque *ptr, ...) to send the reply to the nbd client rather than nbdkit auto sending the reply after returning from the plugin function.<br><br></div><div>Some pseudo code example changes for what I had in mind.<br></div><br></div><div>struct operation {<br></div><div><div> struct connection *conn;<br> uint64_t handle;<br> uint32_t cmd;<br> uint32_t flags;<br></div><div> char *buf;<br></div><div> uint32_t count;<br></div><div>};<br><br></div><div>// unchanged<br></div><div>struct nbdkit_plugin {<br> ...<br></div><div></div></div> int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset);<br> int (*pwrite) (void *handle, const void *buf, uint32_t count, uint64_t offset);<br> int (*flush) (void *handle);<br> int (*trim) (void *handle, uint32_t count, uint64_t offset);<br> int (*zero) (void *handle, uint32_t count, uint64_t offset, int may_trim);<br>}<br><br></div>// new lowlevel api<br><div>struct nbdkit_plugin_lowlevel {<br> ...<br> int (*pread) (void *op, void *handle, void *buf, uint32_t count, uint64_t offset);<br> int (*pwrite) (void *op, void *handle, void *buf, uint32_t count, uint64_t offset);<br> int (*flush) (void *op, void *handle);<br> int (*trim) (void *op, void *handle, uint32_t count, uint64_t offset);<br> int (*zero) (void *op, void *handle, uint32_t count, uint64_t offset, int may_trim);<br>};<br><div><div><div><div><div class="gmail_extra"><br></div><div class="gmail_extra">// Called by the lowlevel api to send a reply to the client<br></div><div class="gmail_extra"><div class="gmail_extra">void<br>nbdkit_reply(void *vop)<br>{<br> int r;<br> bool flush_after_command;<br> struct operation *op = (struct operation *) vop;<br><br> flush_after_command = (op->flags & NBD_CMD_FLAG_FUA) != 0;<br> if (!op->conn->can_flush || op->conn->readonly)<br> flush_after_command = false;<br><br> if (flush_after_command) {<br> op->flags = 0; // clear flags<br> r = plugin_flush_lowlevel (op, op->conn->handle);<br></div><div class="gmail_extra"> if (r == -1)<br></div><div class="gmail_extra"> // do error stuff;<br></div><div class="gmail_extra"> }<br></div><div class="gmail_extra"> else if (op->cmd == NBD_CMD_READ)<br></div> send_reply(op, op->buf, op->count, 0);<br></div><div class="gmail_extra"> else<br></div><div class="gmail_extra"> send_reply(op, NULL, 0, 0);<br></div><div class="gmail_extra">}<br><br></div><div class="gmail_extra">int<br>my_plugin_pwrite_lowlevel (void *op, void *handle, void *buf,<br> uint32_t count, uint64_t offset)<br>{<br></div><div class="gmail_extra"> if (is_readonly(handle)) {<br></div><div class="gmail_extra"> nbdkit_reply_error (op, EROFS);<br></div><div class="gmail_extra"> return 0;<br> }<br><br></div><div class="gmail_extra"> if (some critically bad issue)<br></div><div class="gmail_extra"> return -1;<br></div><div class="gmail_extra"><br></div><div class="gmail_extra"> // this returns right away before write has completed and when it<br> // does complete calls the handler lambda<br></div><div class="gmail_extra"> my_storage.async_write(buf, count, offset,<br> [op, count](const boost::asio::error_code & ec) {<br></div><div class="gmail_extra"> if (ec)<br></div><div class="gmail_extra"> nbdkit_reply_error(op, ec.value());<br></div><div class="gmail_extra"> else<br> nbdkit_reply(op);<br> });<br><br></div><div class="gmail_extra"> return 0;<br></div><div class="gmail_extra">}<br><br></div><div class="gmail_extra">// connections.c<br></div><div class="gmail_extra">static int<br>_send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)<br>{<br> int r;<br> struct reply reply;<br> reply.magic = htobe32 (NBD_REPLY_MAGIC);<br> reply.handle = op->handle;<br> reply.error = htobe32 (nbd_errno (error));<br><br> if (error != 0) {<br> /* Since we're about to send only the limited NBD_E* errno to the<br> * client, don't lose the information about what really happened<br> * on the server side. Make sure there is a way for the operator<br> * to retrieve the real error.<br> */<br> debug ("sending error reply: %s", strerror (error));<br> }<br><br> r = xwrite (op->conn->sockout, &reply, sizeof reply);<br> if (r == -1) {<br> nbdkit_error ("write reply: %m");<br> return -1;<br> }<br><br> if (op->cmd == NBD_CMD_READ) { /* Send the read data buffer. */<br> r = xwrite (op->conn->sockout, buf, count);<br> if (r == -1) {<br> nbdkit_error ("write data: %m");<br> return -1;<br> }<br> }<br><br> return 0;<br>}<br><br></div><div class="gmail_extra">// new mutex on writes due to parallel nature of responding to the socket<br></div><div class="gmail_extra">int<br>send_reply (struct operation *op, uint32_t count, void *buf, uint32_t error)<br>{<br> int r;<br><br> plugin_lock_reply (op->conn);<br> r = _send_reply (op, count, buf, error);<br> plugin_unlock_reply (op->conn);<br><br></div><div class="gmail_extra"> free (op);<br></div><div class="gmail_extra"> return r;<br>}<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 20, 2017 at 4:03 AM, Richard W.M. Jones <span dir="ltr"><<a href="mailto:rjones@redhat.com" target="_blank">rjones@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">> ----- Forwarded message -----<br> ><br> > Date: Sat, 18 Feb 2017 22:21:19 -0500<br> > Subject: nbdkit async<br> ><br> > Hello,<br> ><br> > Hope this is the right person to contact regarding nbdkit design.<br> ><br> > I have a high latency massively parallel device that I am currently<br> > implementing as an nbdkit plugin in c++ and have run into some design<br> > limitations due to the synchronous callback interface nbdkit requires.<br> <br> </span>Is the concern that each client requires a single thread, consuming<br> memory (eg for stack space), but because of the high latency plugin<br> these threads will be hanging around not doing very much? And/or is<br> it that the client is blocked while servicing each request?<br> <span class="gmail-"><br> > Nbdkit is currently designed to call the plugin<br> > pread/pwrite/trim/flush/zero ops as synchronous calls and expects when the<br> > plugin functions return that it can then send the nbd reply to the socket.<br> ><br> > It's parallel thread model is also not implemented as of yet<br> <br> </span>I think this bit can be fixed fairly easily. One change which is<br> especially easy to make is to send back the NBD_FLAG_CAN_MULTI_CONN<br> flag (under control of the plugin).<br> <br> Anyway this doesn't solve your problem ...<br> <span class="gmail-"><br> > but the<br> > current design still mandates a worker thread per parallel op in progress<br> > due to the synchronous design of the plugin calls.<br> <br> </span>And the synchronous / 1 thread per client design of the server.<br> <span class="gmail-"><br> > I would like to modify this to allow for an alternative operating mode<br> > where nbdkit calls the plugin functions and expects the plugin to callback<br> > to nbdkit when a request has completed rather than responding right after<br> > the plugin call returns to nbdkit.<br> ><br> > If you are familiar with fuse low level api design, something very similar<br> > to that.<br> ><br> > An example flow for a read request would be as follows:<br> ><br> > 1) nbdkit reads and validates the request from the socket<br> > 2) nbdkit calls handle_request but now also passing in the nbd request<br> > handle value<br> > 3) nbdkit bundles the nbd request handle value, bool flush_on_update, and<br> > read size into an opaque ptr to struct<br> > 4) nbdkit calls my_plugin.pread passing in the usual args + the opaque ptr<br> <br> </span>We can't change the existing API, so this would have to be exposed<br> through new plugin entry point(s).<br> <span class="gmail-"><br> > 5) my_plugin.pread makes an asynchronous read call with a handler set on<br> > completion to call nbdkit_reply_read(conn, opaque ptr, buf) or on error<br> > nbdkit_reply_error(conn, opaque_ptr, error)<br> > 6) my_plugin.pread returns back to nbdkit without error after it has<br> > started the async op but before it has completed<br> > 7) nbdkit doesn't send a response to the conn->sockout beause when the<br> > async op has completed my_plugin will callback to nbdkit for it to send the<br> > response<br> > 8) nbdkit loop continues right away on the next request and it reads and<br> > validates the next request from conn->sockin without waiting for the<br> > previous request to complete<br> > *) Now requires an additional mutex on the conn->sockout for writing<br> > responses<br> ><br> > The benefit of this approach is that 2 threads (1 thread for reading<br> > requests from the socket and kicking off requests to the plugin and 1<br> > thread (or more) in a worker pool executing the async handler callbacks)<br> > can service 100s of slow nbd requests in parallel overcoming high latency.<br> ><br> > The current design with synchronous callbacks we can provide async in our<br> > plugin layer for pwrites and implement our flush to enforce it but we can't<br> > get around a single slow high latency read op blocking the entire pipe.<br> ><br> > I'm willing to work on this in a branch and push this up as opensource but<br> > first wanted to check if this functionality extension is in fact something<br> > redhat would want for nbdkit and if so if there were suggestions to the<br> > implementation.<br> <br> </span>It depends on how much it complicates the internals of nbdkit (which<br> are currently quite simple). Would need to see the patches ...<br> <br> You can help by splitting into simple changes which are generally<br> applicable (eg. supporting NBD_FLAG_CAN_MULTI_CONN), and other changes<br> which are more difficult to integrate.<br> <span class="gmail-"><br> > Initial implementation approach was going to be similar to the<br> > fuse_low_level approach and create an entirely separate header file for the<br> > asynchronous plugin api because the plugin calls now need an additional<br> > parameter (opaque ptr to handle for nbdkit_reply_). This header file<br> > nbdkit_plugin_async.h defines the plugin struct with slightly different<br> > function ptr prototypes that accepts the opaque ptr to nbd request handle<br> > and some additional callback functions nbdkit_reply_error, nbdkit_reply,<br> > and nbdkit_reply_read. The user of this plugin interface is required to<br> > call either nbdkit_reply_error or nbdkit_reply[_read] in each of the<br> > pread/pwrite/flush/trim/zero ops.<br> ><br> > If you got this far thank you for the long read and please let me know if<br> > there is any interest.<br> <br> </span>Rich.<br> <span class="gmail-im gmail-HOEnZb"><br> --<br> Richard Jones, Virtualization Group, Red Hat <a href="http://people.redhat.com/%7Erjones" rel="noreferrer" target="_blank">http://people.redhat.com/~<wbr>rjones</a><br> Read my programming and virtualization blog: <a href="http://rwmj.wordpress.com" rel="noreferrer" target="_blank">http://rwmj.wordpress.com</a><br> </span><div class="gmail-HOEnZb"><div class="gmail-h5">virt-top is 'top' for virtual machines. Tiny program with many<br> powerful monitoring features, net stats, disk stats, logging, etc.<br> <a href="http://people.redhat.com/%7Erjones/virt-top" rel="noreferrer" target="_blank">http://people.redhat.com/~<wbr>rjones/virt-top</a><br> </div></div></blockquote></div><br></div></div></div></div></div></div></div>