<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Mon, Mar 12, 2018 at 2:57 PM Eric Blake <<a href="mailto:eblake@redhat.com">eblake@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 03/12/2018 07:13 AM, Nir Soffer wrote:<br> > On Mon, Mar 12, 2018 at 12:32 PM Richard W.M. Jones <<a href="mailto:rjones@redhat.com" target="_blank">rjones@redhat.com</a>><br> > wrote:<br> ><br> >> On Mon, Mar 12, 2018 at 07:13:52AM +0000, Nir Soffer wrote:<br> >>> On Fri, Mar 9, 2018 at 4:25 PM Richard W.M. Jones <<a href="mailto:rjones@redhat.com" target="_blank">rjones@redhat.com</a>><br> >> wrote:<br> >>><br> >>>> It has to be said it would be really convenient to have a 'zero'<br> >>>> and/or 'trim' method of some sort.<br> >>>><br> >>><br> >>> 'trim' means discard?<br> >><br> >> Yes. The 5 functions we could support are:<br> >><br> >> * pread - done<br> >> * pwrite - done<br> >> * flush - does fdatasync(2) on the block device<br> >><br> ><br> > Currently we do fsync() on every PUT request, so flush is not very<br> > useful.<br> ><br> ><br> >> * zero - write a range of zeroes without having to send zeroes<br> >> * trim - punch hole, can be emulated using zero if not possile<br> >><br> <br> trim is advisory in NBD, so it can also be emulated as a no-op while<br> still having correct semantics. If you want to guarantee reading back<br> zeroes after punching a hole, you have to use zero instead of trim.<br> <br> ><br> >> Also (not implemented in nbdkit today, but coming soon), pwrite, zero<br> >> and trim can be extended with a FUA (force unit access) flag, which<br> >> would mean that the range should be persisted to disk before<br> >> returning. It can be emulated by calling flush after the operation.<br> ><br> > It wasn't clear if anything in this process flushes the content to<br> >> disk. Is that what transfer.finalize does?<br> >><br> ><br> > All PUT requests fsync() before returning. We optimize for complete image<br> > trasfer, not for random io.<br> <br> In other words, you are already implicitly behaving as if FUA is already<br> set on every single request. It might be less efficient than what you<br> could otherwise achieve, but it's fine if consistency is more important<br> than speed.<br> <br> <br> >>> I would like to support only aligned offset and size - do you think it<br> >>> should work<br> >>> for qemu-img?<br> >><br> >> It depends a bit on what you mean by "aligned" and what the alignment<br> >> is. We'd probably have to work around it in the plugin so that it can<br> >> round in the request, issues a zero operation for the aligned part,<br> >> and writes zeroes at each end. There's no guarantee that qemu-img<br> >> will be well-behaved in the future even if it is now.<br> <br> qemu-img in general tries to send sector-aligned data by default (it's<br> unusual that qemu tries to access less than that at once). In 2.11,<br> qemu-io can be made to send byte-aligned requests across any NBD<br> connection; in 2.12, it's tightened so that NBD requests are<br> sector-aligned unless the server advertised support for byte-aligned<br> requests (nbdkit does not yet advertise this). As a client, qemu-io<br> will then manually write zeroes to any unaligned portion (if there are<br> any), and use the actual zero request for the aligned middle.<br> <br> >><br> ><br> > Aligned for direct I/O (we use direct I/O for read/write). We can support<br> > non-aligned ranges by doing similar emulation in the server, but I prefer<br> > to do<br> > it only if we have such requirement. If you need to do this in the client,<br> > we<br> > probably need to do this in the server otherwise all clients may need to<br> > emulate this.<br> ><br> > I think there is no reason that qemu-img will zero unaligned ranges, but<br> > I guess Eric can have a better answer.<br> <br> Yeah, for now, you are probably safe assuming that qemu-img will never<br> send unaligned ranges. You're also correct that not all NBD servers<br> support read-modify-write at unaligned boundaries, so well-behaved<br> clients have to implement it themselves; while conversely not all<br> clients are well-behaved so good NBD servers have to implement it -<br> which is a duplication of effort since both sides of the equation have<br> to worry about it when they want maximum cross-implementation<br> portability. But that's life.<br> <br> And my pending patches for FUA support in nbdkit also add a<br> --filter=blocksize, which at least lets nbdkit guarantee aligned data<br> handed to the plugin even when the client is not well-behaved.<br></blockquote><div><br></div><div>Thanks for the good input!</div><div><br></div><div>I posted documentation for the new API optimized for random I/O:</div><div><a href="https://gerrit.ovirt.org/#/c/89022/">https://gerrit.ovirt.org/#/c/89022/</a><br></div><div><br></div><div>I changed POST to PATCH to match the existing /tickets API, and</div><div>this also seems to be more standard way to do such operations.</div><div><br></div><div>Please check and comment if this makes sense and serves the v2v</div><div>use case or other uses case we missed.</div><div><br></div><div>I think we can implement all of this for 4.2.4, but:</div><div><br></div><div>- using simple zero loop, as in <a href="https://gerrit.ovirt.org/#/c/88793/">https://gerrit.ovirt.org/#/c/88793/</a>.</div><div> later we can make it more efficient.</div><div>- trim is a noop, maybe we will be able to support it in 4.3</div><div>- flush - may be noop now (all requests will implicitly flush).</div><div><br></div><div>I think we better have complete API with partial or simpler </div><div>implementation now, to minimize the hacks needed in v2v and</div><div>other clients.</div><div><br></div><div>Nir</div><div><br></div></div></div>