[Libguestfs] [libnbd PATCH v3 03/22] protocol: Add definitions for extended headers

Eric Blake eblake at redhat.com
Tue May 30 18:48:55 UTC 2023


On Tue, May 30, 2023 at 05:48:25PM +0200, Laszlo Ersek wrote:
> On 5/30/23 16:56, Wouter Verhelst wrote:
> > On Tue, May 30, 2023 at 01:50:59PM +0200, Laszlo Ersek wrote:
> >> On 5/25/23 15:00, Eric Blake wrote:
> 
> >>> +struct nbd_structured_reply_block_status_ext_hdr {
> >>> +  uint32_t context_id;          /* metadata context ID */
> >>> +  uint32_t count;               /* length of following array */
> >>
> >> (1) I think that "length of following array" is confusing here. Per
> >> spec, this is "descriptor count" -- that's clearer. "Length" could be
> >> mistaken for "number of bytes".
> >>
> >> Also, what was the justification for *not* making "count" uint64_t?
> >> (Just asking.) I do understand a server is permitted to respond with a
> >> block status reply that covers less than the requested range, so I
> >> understand a server, if it needs to, *can* truncate its response for the
> >> sake of fitting "count" into a uint32_t.
> > 
> > This is, as you say, the number of block descriptor messages we are
> > going to send to the client. Each such message is 8 bytes long.
> 
> 16 bytes, isn't it? (uint64_t length, uint64_t status_flags)

Existing block status was 8 bytes per extent (uint32_t * 2), but yes,
the extended extent info is 16 bytes (uint64_t * 2).
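For reference, a sketch of the two descriptor layouts being compared
(the extended form is what this series proposes; field names follow
the existing style and may differ slightly from the final patch):

  /* existing 32-bit descriptor: 8 bytes per extent */
  struct nbd_block_descriptor {
    uint32_t length;              /* length of block */
    uint32_t status_flags;        /* block type (hole etc) */
  } NBD_ATTRIBUTE_PACKED;

  /* extended 64-bit descriptor: 16 bytes per extent */
  struct nbd_block_descriptor_ext {
    uint64_t length;              /* length of block */
    uint64_t status_flags;        /* block type (hole etc) */
  } NBD_ATTRIBUTE_PACKED;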

> 
> > That would mean that by the time you overflow a uint32_t with the number
> > of extents that are to be sent, you'll be promising to send 2^32 * 8
> > (i.e., 2^36) bytes of data, or 32 GiB.
> 
> (Or 64 GiB if we agree that sizeof(nbd_block_descriptor_ext)=16.)
> 
> But... yes, I'm aware this is exorbitant, practically speaking.

We already coded into the NBD spec (commit 926a51df) that a server
SHOULD truncate its response rather than sending more than 2M extent
entries; I picked that number because 32M (the existing payload cap in
qemu) divided by 16 bytes per extended descriptor is 2M extents (with
the non-extended 8-byte descriptors, 2M entries works out to only 16M
of payload rather than 32M).  In that commit, I pointed out that qemu
actually caps things at 128k entries (a 1M payload without extended
headers), and nbdkit caps things at 1M entries.
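Spelled out, the arithmetic behind those caps (all figures from
above):

  32M payload / 16 bytes per extended extent  = 2M extents (spec limit)
  2M extents * 8 bytes (old descriptors)      = 16M of payload
  qemu today:   128k extents * 8 bytes        = 1M of payload
  nbdkit today: 1M extents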

> 
> My concern was that "practical considerations" must have been what led
> to the original 32-bit-only:
> 
> struct nbd_block_descriptor {
>   uint32_t length;              /* length of block */
>   uint32_t status_flags;        /* block type (hole etc) */
> } NBD_ATTRIBUTE_PACKED;
> 
> which is now proving too restrictive (being the basis for this entire
> work, IIUC).
> 
> A 2^(32+9) B = 2 TB image is not unthinkable. If the image used 512B
> sectors, and sectors alternated between being a hole and being
> allocated, then 2^32 extended descriptors would be justified.
> 
> May not be practical, but that's "policy"; the "mechanism" could still
> exist (if it doesn't come with undesirable costs).
> 
> > That would be a ridiculous amount of data, and no user is going to
> > wait for a client to finish parsing that amount of data without a hard
> > reboot of their client.
> 
> Over a 10gbit/s connection, transferring 64GB should take on the order
> of a minute.
> 
> ... *parsing* 4.3 billion entries is a different matter, of course ;)
> 
> OK!
> 
> > The only change that would be reasonable here is to reduce the size of
> > this field to 16 bits, rather than increasing it (but then we lose
> > alignment, so that's also not a good idea)
> 
> Even putting alignment aside, I don't understand why reducing "count" to
> uint16_t would be reasonable. With the current 32-bit-only block
> descriptor, we already need to write loops in libnbd clients, because we
> can't cover the entire remote image in one API call [*]. If I understood
> Eric right earlier, the 64-bit extensions were supposed to remedy that
> -- but as it stands, clients will still need loops ("chunking") around
> block status fetching; is that right?

While the larger extents reduce the need for looping, they do not
entirely eliminate it.  For example, just because the server can now
tell you that an image is entirely data in just one reply does not
mean that it will actually do so - qemu in particular limits block
status of a qcow2 file to the range covered by one cluster's worth of
L2 entries at a time for consistency reasons, so even with the maximum
cluster size of 2M, you can never get more than (2M/16)*2M = 256G of
status reported in a single request.  But without 64-bit lengths, you
are guaranteed to need at least 2 (and possibly more) round trips to
map out an entire 6G image, while with 64-bit lengths, you can often
get an entire image map in one round trip.
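To spell out that qcow2 math (assuming 2M clusters and 16-byte L2
entries):

  2M cluster / 16 bytes per L2 entry = 128k entries per L2 table
  128k entries * 2M per cluster      = 256G per block status reply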

Reducing 'count' to uint16_t for 64-bit responses would be possible if
we wanted to hard-code a server limit of never sending more than 64k
extents in one go; but above, I argued that existing servers currently
do exceed that cap even for 32-bit responses.

> 
> Let me emphasize again that I'm not challenging the spec; also I don't
> secretly wish for the patches to do more than required by the spec. I
> guess I'd like to understand the intent of the spec, plus the
> consequences for client applications.
> 
> [*] References (in this order!):
> 
> - https://github.com/libguestfs/virt-v2v/commit/27c056cdc6aa0
> - https://gitlab.com/nbdkit/libnbd/-/commit/0e714a6e06e6
> - https://github.com/libguestfs/virt-v2v/commit/a2afed32d8b1f
> 
> To be less cryptic, the first commit added "chunked" block status
> fetching to virt-v2v. Because our OCaml language libnbd mappings weren't
> proper at the time, that loop could move backwards and get stuck. We
> fixed that in the second commit (regarding the bindings) and then
> adapted virt-v2v's loop to the fixed bindings in the third commit. I've
> been hoping that such complexities could be eliminated in the future by
> virtue of not having to "chunk" the block status fetching.
>

We'll always have to deal with servers that send shorter replies than
we asked for, and where consecutive replies may have the same status.
The best the spec was able to do was recommend that the server return
as much as it can, and avoid giving consecutive extents the same
status, where feasible.
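In other words, a client loop will still look roughly like this even
with 64-bit extents (a minimal sketch; fetch_extents() is a
hypothetical helper standing in for whatever the real API ends up
being, e.g. a wrapper around the 64-bit block status call in this
series):

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical: ask the server for block status of
     [offset, offset+length) and return how many bytes it actually
     covered, 0 on error.  Servers may cover less than requested. */
  extern uint64_t fetch_extents (void *h, uint64_t offset,
                                 uint64_t length);

  /* Chunking loop: keep asking until the whole image is mapped. */
  static bool
  map_whole_image (void *h, uint64_t export_size)
  {
    uint64_t offset = 0;
    while (offset < export_size) {
      uint64_t covered = fetch_extents (h, offset,
                                        export_size - offset);
      if (covered == 0)
        return false;   /* no forward progress: give up */
      offset += covered;
    }
    return true;
  }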

> (
> 
> BTW I'm foreseeing a problem: if the extended block descriptor can
> provide an unsigned 64-bit length, we're going to have trouble exposing
> that in OCaml, because OCaml only has signed 64-bit integers. So that's
> going to reproduce the same issue, only for OCaml callers of the *new* API.

At one point, I was playing with an idea to add a patch to the NBD
spec to REQUIRE that an export be capped at 2^63-1 bytes (that is, the
maximum of 'off_t').  It doesn't change any existing implementations,
and actually frees us up to use signed 64-bit numbers where negative
values are indeed error cases.  I'll probably try to revisit my
thoughts on that patch front, but I don't think it holds up this series.

> 
> I can see Eric's series includes patches like "ocaml: Add example for
> 64-bit extents" -- I've not looked at those yet; for now I'm just
> wondering what tricks we might need in the bindings generator. The
> method seen in the "middle patch" above won't work; we don't have a
> native OCaml "i128" type, for example, that we could use as an escape
> hatch for representing C's uint64_t.

I _did_ go with the (currently implicit) assumption that no server
will ever expose an export larger than off_t can represent, and
therefore a signed 64-bit size is good enough.  Flags have to be
unsigned, but flags represent something different from an image size.
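Concretely, that assumption boils down to something like this on the
receiving side (a sketch; the error handling is illustrative):

  #include <errno.h>
  #include <stdint.h>

  /* A 64-bit export size arrives unsigned on the wire, but the API
     exposes it as signed int64_t (which OCaml's int64 can hold).
     Hypothetical validation, relying on the 2^63-1 cap discussed
     above: */
  static int64_t
  validate_size (uint64_t wire_size)
  {
    if (wire_size > (uint64_t) INT64_MAX) {
      errno = EPROTO;   /* server claims a size beyond off_t */
      return -1;
    }
    return (int64_t) wire_size;
  }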

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

