<div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-size:small;color:#000000"><span style="color:rgb(34,34,34)">On Fri, Mar 22, 2019 at 6:43 PM Eric Blake <<a href="mailto:eblake@redhat.com">eblake@redhat.com</a>> wrote:</span><br></div></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">While it may be counterintuitive at first, the introduction of<br> NBD_CMD_WRITE_ZEROES and NBD_CMD_BLOCK_STATUS has caused a performance<br> regression in qemu [1], when copying a sparse file. When the<br> destination file must contain the same contents as the source, but it<br> is not known in advance whether the destination started life with all<br> zero content, then there are cases where it is faster to request a<br> bulk zero of the entire device followed by writing only the portions<br> of the device that are to contain data, as that results in fewer I/O<br> transactions overall. In fact, there are even situations where<br> trimming the entire device prior to writing zeroes may be faster than<br> bare write zero request [2]. However, if a bulk zero request ever<br> falls back to the same speed as a normal write, a bulk pre-zeroing<br> algorithm is actually a pessimization, as it ends up writing portions<br> of the disk twice.<br> <br> [1] <a href="https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html" rel="noreferrer" target="_blank">https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html</a><br> [2] <a href="https://github.com/libguestfs/nbdkit/commit/407f8dde" rel="noreferrer" target="_blank">https://github.com/libguestfs/nbdkit/commit/407f8dde</a><br> <br> Hence, it is desirable to have a way for clients to specify that a<br> particular write zero request is being attempted for a fast wipe, and<br> get an immediate failure if the zero request would otherwise take the<br> same time as a write. 
Conversely, if the client is not performing a<br> pre-initialization pass, it is still more efficient in terms of<br> networking traffic to send NBD_CMD_WRITE_ZERO requests where the<br> server implements the fallback to the slower write, than it is for the<br> client to have to perform the fallback to send NBD_CMD_WRITE with a<br> zeroed buffer.<br> <br> Add a protocol flag and corresponding transmission advertisement flag<br> to make it easier for clients to inform the server of their intent. If<br> the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two<br> things: to perform a fallback to write when the client does not<br> request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the<br> lower network overhead); and to fail quickly with ENOTSUP if the<br> client requested the flag but the server cannot write zeroes more<br> efficiently than a normal write (so that the client is not penalized<br> with the time of writing data areas of the disk twice).<br></blockquote><div><br></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">I think the issue is not that zeroing is as slow as a normal write, but that it is not fast</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">enough to make zeroing the entire disk before writing data worthwhile.</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">For example, on a storage server we had in the past, the BLKZEROOUT rate was</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">50G/s. 
On another server, it can run anywhere from 1G/s to 100G/s, depending</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">on the allocation status of the zeroed range.</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)"><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> Note that the semantics are chosen so that servers should advertise<br> the new flag whether or not they have fast zeroing (that is, this is<br> NOT the server advertising that it has fast zeroes, but rather<br> advertising that the client can get feedback as needed on whether<br> zeroing is fast). It is also intentional that the new advertisement<br> includes a new errno value, ENOTSUP, with rules that this error should<br> not be returned for any pre-existing behaviors, must not happen when<br> the client does not request a fast zero, and must be returned quickly<br> if the client requested fast zero but anything other than the error<br> would not be fast; while leaving it possible for clients to<br> distinguish other errors like EINVAL if alignment constraints are not<br> met. Clients should not send the flag unless the server advertised<br> support, but well-behaved servers should already be reporting EINVAL<br> to unrecognized flags. If the server does not advertise the new<br> feature, clients can safely fall back to assuming that writing zeroes<br> is no faster than normal writes.</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <br> Note that the Linux fallocate(2) interface may or may not be powerful<br> enough to easily determine if zeroing will be efficient - in<br> particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that<br> insight; for block devices, it is known that ioctl(BLKZEROOUT) does<br> NOT have a way for userspace to probe if it is efficient or slow. 
But<br> with enough demand, the kernel may add another FALLOC_FL_ flag to use<br> with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed<br> ENOTSUP failures if a fast path cannot be taken. If a server cannot<br> easily determine if write zeroes will be efficient, it is better off<br> not advertising NBD_FLAG_SEND_FAST_ZERO.<br></blockquote><div><br></div><div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">I think this can work for file-based images. If fallocate() fails, the client</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">will quickly get ENOTSUP after the first call.</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">For block devices, we don't have any way to know whether fallocate() or BLKZEROOUT</div></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">will be fast, so I guess servers will never advertise FAST_ZERO.</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">In general, the usefulness of this new flag is limited. 
It will only help qemu-img convert</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">faster to file-based images.</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">Do we have performance measurements showing a significant speedup when </div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">zeroing the entire image before copying data, compared with zeroing only the </div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">unallocated ranges?</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">For example, if the best speedup we can get in a real-world scenario is 2%, is it </div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">worth complicating the protocol and using another bit?</div><div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)"></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> Signed-off-by: Eric Blake <<a href="mailto:eblake@redhat.com" target="_blank">eblake@redhat.com</a>><br> ---<br> <br> I will not push this without both:<br> - a positive review (for example, we may decide that burning another<br> NBD_FLAG_* is undesirable, and that we should instead have some sort<br> of NBD_OPT_ handshake for determining when the server supports<br> NBD_CMF_FLAG_FAST_ZERO)<br> - a reference client and server implementation (probably both via qemu,<br> since it was qemu that raised the problem in the first place)<br> <br> doc/proto.md | 44 +++++++++++++++++++++++++++++++++++++++++++-<br> 1 file changed, 43 insertions(+), 1 deletion(-)<br> <br> diff --git a/doc/proto.md b/doc/proto.md<br> index 8aaad96..1107766 100644<br> --- a/doc/proto.md<br> +++ b/doc/proto.md<br> @@ -1059,6 +1059,17 @@ The 
field has the following format:<br> which support the command without advertising this bit, and<br> conversely that this bit does not guarantee that the command will<br> succeed or have an impact.<br> +- bit 11, `NBD_FLAG_SEND_FAST_ZERO`: allow clients to detect whether<br> + `NBD_CMD_WRITE_ZEROES` is efficient. The server MUST set this<br> + transmission flag to 1 if the `NBD_CMD_WRITE_ZEROES` request<br> + supports the `NBD_CMD_FLAG_FAST_ZERO` flag, and MUST set this<br> + transmission flag to 0 if `NBD_FLAG_SEND_WRITE_ZEROES` is not<br> + set. Servers SHOULD NOT set this transmission flag if there is no<br> + quick way to determine whether a particular write zeroes request<br> + will be efficient, but the lack of an efficient write zero<br></blockquote><div><br></div><div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">I think we should use "fast" instead of "efficient". For example, when the kernel</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">falls back to manual zeroing, it is probably the most efficient way it can be done,</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">but it is not fast.</div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> + implementation SHOULD NOT prevent a server from setting this<br> + flag. Clients MUST NOT set the `NBD_CMD_FLAG_FAST_ZERO` request flag<br> + unless this transmission flag is set.<br> <br> Clients SHOULD ignore unknown flags.<br> <br> @@ -1636,6 +1647,12 @@ valid may depend on negotiation during the handshake phase.<br> MUST NOT send metadata on more than one extent in the reply. Client<br> implementors should note that using this flag on multiple contiguous<br> requests is likely to be inefficient.<br> +- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during<br> + `NBD_CMD_WRITE_ZEROES`. 
If set, but the server cannot perform the<br> + write zeroes any faster than it would for an equivalent<br> + `NBD_CMD_WRITE`, then the server MUST fail quickly with an error of<br> + `ENOTSUP`. The client MUST NOT set this unless the server advertised<br> + `NBD_FLAG_SEND_FAST_ZERO`.<br> <br> ##### Structured reply flags<br> <br> @@ -2004,7 +2021,10 @@ The following request types exist:<br> reached permanent storage, unless `NBD_CMD_FLAG_FUA` is in use.<br> <br> A client MUST NOT send a write zeroes request unless<br> - `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.<br> + `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags<br> + field. Additionally, a client MUST NOT send the<br> + `NBD_CMD_FLAG_FAST_ZERO` flag unless `NBD_FLAG_SEND_FAST_ZERO` was<br> + set in the transimssion flags field.<br> <br> By default, the server MAY use trimming to zero out the area, even<br> if it did not advertise `NBD_FLAG_SEND_TRIM`; but it MUST ensure<br> @@ -2014,6 +2034,23 @@ The following request types exist:<br> same area will not cause fragmentation or cause failure due to<br> insufficient space.<br> <br> + If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but<br> + `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail<br> + with `ENOTSUP`, even if the operation is no faster than a<br> + corresponding `NBD_CMD_WRITE`. Conversely, if<br> + `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with<br> + `ENOTSUP` unless the request can be serviced more efficiently than<br> + a corresponding `NBD_CMD_WRITE`. The server's determination of<br> + efficiency MAY depend on whether the request was suitably aligned,<br> + on whether the `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on<br> + whether a previous `NBD_CMD_TRIM` had been performed on the<br> + region. 
If the server did not advertise<br> + `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD NOT fail with `ENOTSUP`,<br> + regardless of the speed of servicing a request, and SHOULD fail<br> + with `EINVAL` if the `NBD_CMD_FLAG_FAST_ZERO` flag was set. A<br> + server MAY advertise `NBD_FLAG_SEND_FAST_ZERO` whether or not it<br> + can perform efficient zeroing.<br> +<br> If an error occurs, the server MUST set the appropriate error code<br> in the error field.<br> <br> @@ -2114,6 +2151,7 @@ The following error values are defined:<br> * `EINVAL` (22), Invalid argument.<br> * `ENOSPC` (28), No space left on device.<br> * `EOVERFLOW` (75), Value too large.<br> +* `ENOTSUP` (95), Operation not supported.<br> * `ESHUTDOWN` (108), Server is in the process of being shut down.<br> <br> The server SHOULD return `ENOSPC` if it receives a write request<br> @@ -2125,6 +2163,10 @@ request is not aligned to advertised minimum block sizes. Finally, it<br> SHOULD return `EPERM` if it receives a write or trim request on a<br> read-only export.<br> <br> +The server SHOULD NOT return `ENOTSUP` except as documented in<br> +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is<br> +supported.<br></blockquote><div><br></div><div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">This makes ENOTSUP less useful. I think servers should be allowed to return ENOTSUP</div></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">in response to other commands if needed.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> +<br> The server SHOULD return `EINVAL` if it receives an unknown command.<br> <br> The server SHOULD return `EINVAL` if it receives an unknown command flag. 
It<br> -- <br> 2.20.1<br></blockquote><div><br></div><div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">I think this makes sense, and should work, but we need more data supporting that this is</div></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">useful in practice.</div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)"><br></div><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">Nir</div></div></div>
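<div dir="ltr"><div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">To make the client-side consequence of the quoted semantics concrete, here is a minimal sketch of the pre-zeroing decision. It is only an illustration under stated assumptions: send_write_zeroes is a hypothetical callable standing in for a real NBD client (assumed to return 0 on success or an NBD error number), and zeroing the whole device in one request is a simplification. The flag bit and the ENOTSUP value are the ones listed in the quoted patch.</div>
<pre>
```python
# Sketch only: send_write_zeroes is a hypothetical stand-in for an NBD client.
# It is assumed to return 0 on success or an NBD error number on failure.

NBD_CMD_FLAG_FAST_ZERO = 2 ** 4  # command flag, bit 4 in the quoted patch
NBD_ENOTSUP = 95                 # error value added by the quoted patch


def prezero_if_fast(advertised, send_write_zeroes, size):
    """Try to zero the whole device with the fast-zero flag set.

    Returns True if the device was zeroed, so the copy can skip holes.
    Returns False if the caller should zero only the unallocated ranges.
    """
    if not advertised:
        # The client must not set the flag unless the server advertised
        # NBD_FLAG_SEND_FAST_ZERO, so assume zeroing is not fast.
        return False

    error = send_write_zeroes(offset=0, length=size,
                              flags=NBD_CMD_FLAG_FAST_ZERO)
    if error == 0:
        return True
    if error == NBD_ENOTSUP:
        # The server could not zero faster than a normal write and failed
        # quickly, so nothing was written twice.
        return False
    # Any other error (for example EINVAL) is a real failure.
    raise OSError(error, "write zeroes request failed")
```
</pre>
<div class="gmail_default" style="font-size:small;color:rgb(0,0,0)">When this returns False, the client falls back to zeroing only the unallocated ranges. Whether the True path is actually faster in practice is exactly the measurement question raised above.</div></div>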