[Libguestfs] [libnbd PATCH v3 03/22] protocol: Add definitions for extended headers

Eric Blake eblake at redhat.com
Wed May 31 16:04:06 UTC 2023


On Wed, May 31, 2023 at 01:29:30PM +0200, Laszlo Ersek wrote:
> >> Putting aside alignment even, I don't understand why reducing "count" to
> >> uint16_t would be reasonable. With the current 32-bit-only block
> >> descriptor, we already need to write loops in libnbd clients, because we
> >> can't cover the entire remote image in one API call [*]. If I understood
> >> Eric right earlier, the 64-bit extensions were supposed to remedy that
> >> -- but as it stands, clients will still need loops ("chunking") around
> >> block status fetching; is that right?
> > 
> > While the larger extents reduce the need for looping, it does not
> > entirely eliminate it.  For example, just because the server can now
> > tell you that an image is entirely data in just one reply does not
> > mean that it will actually do so - qemu in particular limits block
> > status of a qcow2 file to reporting just one cluster at a time for
> > consistency reasons, where even if you use the maximum size of 2M
> > clusters, you can never get more than (2M/16)*2M = 256G status
> > reported in a single request.
> 
> I don't understand the calculation. I can imagine the following
> interpretation:
> 
> - QEMU never sends more than 128K block descriptors, and each descriptor
> covers one 2MB sized cluster --> 256 GB of the disk covered in one go.
> 
> But I don't understand where the (2M/16) division comes from, even
> though the quotient is 128K.

Ah, I need to provide more backstory on the qcow2 format.  A qcow2
image has a fixed cluster size, chosen between 512 bytes and 2M.  A
smaller cluster size wastes less space for small images, but incurs
more mapping overhead.  Each cluster's mapping has to be stored in an
L1 map, where pages of the map are themselves a cluster in length,
with 16 bytes per map entry.  So if you pick a cluster size of 512,
you get 512/16 or 32 entries per L1 page; if you pick a cluster size
of 2M, you get 2M/16 or 128k entries per L1 page.  When reporting
block status, qemu reads at most one L1 page and then reports how each
cluster referenced from that page is mapped.

https://gitlab.com/qemu-project/qemu/-/blob/master/docs/interop/qcow2.txt#L491
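
(As a back-of-the-envelope check of the 256G figure above, here is a
minimal OCaml sketch of the arithmetic, assuming the 16-byte map
entries just described; this is not code from qemu:)

  let () =
    let cluster = 2 * 1024 * 1024 in        (* 2M cluster size *)
    let entry_size = 16 in                  (* bytes per map entry *)
    let entries = cluster / entry_size in   (* 2M/16 = 128k entries *)
    (* bytes mapped by one L1 page; needs a 64-bit OCaml build *)
    let covered = entries * cluster in
    Printf.printf "%d entries cover %d GiB\n"
      entries (covered / (1024 * 1024 * 1024))
    (* prints: 131072 entries cover 256 GiB *)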

> 
> I can connect the constant "128K", and
> <https://github.com/NetworkBlockDevice/nbd/commit/926a51df>, to your
> paragraph [*] above, but not the division.

In this case, the qemu limit of reporting block status for at most one
L1 map page at a time happens to have no relationship to the NBD spec's
limit of at most 1M extents (8M bytes) in a single block status reply,
nor to the fact that qemu caps its NBD replies at 1M bytes (128k
extents) regardless of whether the underlying image is qcow2 or some
other format.
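
(Spelling out the arithmetic behind those constants; a quick OCaml
sketch assuming the 8-byte 32-bit extent descriptor, i.e. a uint32
length plus a uint32 flags field:)

  let () =
    let desc_size = 8 in                  (* uint32 length + uint32 flags *)
    let spec_cap = 8 * 1024 * 1024 in     (* NBD spec reply cap: 8M bytes *)
    let qemu_cap = 1024 * 1024 in         (* qemu reply cap: 1M bytes *)
    Printf.printf "spec allows %dk extents; qemu sends at most %dk\n"
      (spec_cap / desc_size / 1024)       (* 1024k = 1M extents *)
      (qemu_cap / desc_size / 1024)       (* 128k extents *)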

> 
> > But without 64-bit lengths, you are
> > guaranteed to need at least 2 (and possibly more) round trips to map
> > out an entire 6G image, while with 64-bit lengths, you can often get
> > an entire image map in one round trip.
> > 
> > Reducing 'count' to uint16_t for 64-bit responses would be possible if
> > we wanted to hard-code a server limit of never sending more than 64k
> > extents in one go; but above, I argued that existing servers currently
> > do exceed that cap even for 32-bit responses.
> > 
> 
> OK, understood. Looping is still required, but we expect to see it less
> exercised in practice, with extended headers.
> 
> >>
> >> Let me emphasize again that I'm not challenging the spec; also I don't
> >> secretly wish for the patches to do more than required by the spec. I
> >> guess I'd like to understand the intent of the spec, plus the
> >> consequences for client applications.
> >>
> >> [*] References (in this order!):
> >>
> >> - https://github.com/libguestfs/virt-v2v/commit/27c056cdc6aa0
> >> - https://gitlab.com/nbdkit/libnbd/-/commit/0e714a6e06e6
> >> - https://github.com/libguestfs/virt-v2v/commit/a2afed32d8b1f
> >>
> >> To be less cryptic, the first commit added "chunked" block status
> >> fetching to virt-v2v. Because our OCaml libnbd bindings weren't
> >> correct at the time, that loop could move backwards and get stuck. We
> >> fixed the bindings in the second commit, and then adapted virt-v2v's
> >> loop to the fixed bindings in the third commit. I've been hoping that
> >> such complexities could be eliminated in the future by virtue of not
> >> having to "chunk" the block status fetching.
> >>
> > 
> > We'll always have to deal with servers that send shorter replies than
> > we asked for, and where consecutive replies may have the same status.
> > The best the spec was able to do was recommend that the server return
> > as much as it can, and without consecutive status, where feasible.
> > 
> >> (
> >>
> >> BTW I'm foreseeing a problem: if the extended block descriptor can
> >> provide an unsigned 64-bit length, we're going to have trouble exposing
> >> that in OCaml, because OCaml only has signed 64-bit integers. So that's
> >> going to reproduce the same issue, only for OCaml callers of the *new* API.
> > 
> > At one point, I was playing with an idea to add a patch to the NBD
> > spec to REQUIRE that an export be capped at 2^63-1 bytes (that is, the
> > maximum of 'off_t').  It doesn't change any existing implementations,
> > and actually frees us up to use signed 64-bit numbers where negative
> > values are indeed error cases.  I'll probably try to revisit my
> > thoughts on that patch front, but don't think it holds up this series.
> > 
> >>
> >> I can see Eric's series includes patches like "ocaml: Add example for
> >> 64-bit extents" -- I've not looked at those yet; for now I'm just
> >> wondering what tricks we might need in the bindings generator. The
> >> method seen in the "middle patch" above won't work; we don't have a
> >> native OCaml "i128" type, for example, that we could use as an
> >> escape hatch for representing C's uint64_t.
> > 
> > I _did_ go with the (currently implicit) assumption that no server
> > will ever expose larger than off_t can represent, and therefore a
> > signed 64-bit size is good enough.
> 
> I'll have to see that patch later, but for now I think making that
> assumption explicit would be nice (assert, abort, etc).
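
To make the failure mode concrete: a uint64_t of 2^63 or more arrives
in OCaml with its sign bit set, so an explicit guard is cheap.  A
hypothetical sketch (check_export_size is an illustrative name, not
anything in this series):

  let check_export_size (size : int64) =
    (* A C uint64_t >= 2^63 shows up negative here, meaning the server
       exceeded off_t range. *)
    if size < 0L then failwith "export size exceeds 2^63-1 bytes"

  let () =
    check_export_size 42L;                (* fine *)
    (* OCaml hex int64 literals may exceed Int64.max_int and wrap to
       the signed interpretation, so 2^63 parses as Int64.min_int: *)
    check_export_size (Int64.of_string "0x8000000000000000")
    (* raises Failure *)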
> 
> > Flags has to be unsigned, but flags
> > is representing something different than image size.
> 
> I take it you mean "nbd_block_descriptor_ext.status_flags", which is a
> "uint64_t".

Yes

> 
> If it's only used as a bitmap, that should be safe; first because some
> spec (dependent on metacontext) will introduce each bit one by one, and
> probably not start with bit#63; second because it makes no sense to sum
> bitmaps.
> 
> However, Wouter mentions up-thread that (dependent on metacontext)
> status_flags could actually stand for a length too:
> 
> 'Some metadata namespaces treat the "flags" field as an unsigned
> integer, to specify offsets or counters. In that context, the flags
> field should indeed be extended to 64 bits.'
> 
> and then dealing with those particular metacontexts in OCaml could be a
> problem.
> 
> It's probably best to catch unrepresentable values when converting
> between OCaml's int64 and C's uint64_t (both ways) in the generator as
> centrally as possible.

Indeed; that's where I'm hoping Rich may chime in with ideas for what
is best for the C <=> OCaml translation.
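
One possible shape for such a centralized check (hypothetical helper
names; the real generator code may end up quite different):

  (* Central guards at the C <-> OCaml boundary; the raw bits of C's
     uint64_t are carried in an OCaml int64. *)
  let int64_of_uint64 bits =
    (* A set sign bit means the original value was >= 2^63 and has no
       non-negative OCaml representation. *)
    if bits < 0L then invalid_arg "uint64_t exceeds OCaml int64 range";
    bits

  let uint64_of_int64 v =
    (* Going the other way, negative sizes/offsets have no uint64_t
       meaning, so reject them too. *)
    if v < 0L then invalid_arg "negative value not representable";
    v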

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

