[Libguestfs] Supporting sparse disks in nbdkit

Fri Mar 8 16:50:35 UTC 2019

On 3/8/19 7:48 AM, Richard W.M. Jones wrote:
> I've posted a couple of patches towards the ultimate goal of
> implementing NBD_CMD_BLOCK_STATUS / base:allocation in nbdkit.  Before
> I can do the final patch I think we need to discuss how this would be
> exposed to plugins since at the end of the day they need to implement
> the feature.

Quick thoughts for now; I'll probably have to spend more time on this
over the weekend and reply with more details or ideas later.

> 
> Background reading:
>  - preparatory patches:
>      https://www.redhat.com/archives/libguestfs/2019-March/msg00013.html
>      https://www.redhat.com/archives/libguestfs/2019-March/msg00016.html
>  - NBD protocol, see in particular NBD_CMD_BLOCK_STATUS and
>      NBD_REPLY_TYPE_BLOCK_STATUS:
>      https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md
> 
> I think we shouldn't modify the pread() callback.  If we decide to
> implement Structured Replies properly at some point in the future we
> might need to do that, but it's not necessary now.

Agreed; pread should stay for simple, and a plugin/filter that supports
structured reads should have a new set of callbacks. A while ago, we had
a thread about split callbacks for better parallelism - where nbdkit
calls the plugin to start a request (which returns control to nbdkit
immediately), then the plugin calls a callback to complete the request.
That may be worth adding now.

> 
> We could introduce a new call ‘extents’ to return the list of extents.
> I believe it would look like this:
> 
>   struct nbdkit_extent {
>     uint64_t offset;
>     uint32_t length;  // XXX is 32 bit right here?

For now, the NBD spec only supports 32-bit lengths on an extent;
although there has been talk of how we would extend the protocol to
allow the client to make 64-bit length requests (not for read or write,
but definitely useful for flush, write zeros, block status); at which
point, a 64-bit block status returning a 64-bit extent length makes
sense. Perhaps we should require clients to use 64-bit lengths from the
outset, in anticipation of NBD spec improvements? (but then nbdkit has
to post-process those values down to a 32-bit result when talking to a
client that did not negotiate 64-bit commands)

>     uint32_t flag;    // hole, zero, data ... more in future?
>   };
> 
>   int extents (void *handle, uint32_t count, uint64_t offset,
>                uint32_t flags /* always 0? */,

Unless we support the REQ_ONE flag (where the client specifically wants
only one extent returned).

>                size_t *nr_extents, struct nbdkit_extent *extents);
> 
> The function is meant to scan [offset, offset+count-1] and return a
> list of all extents overlapping this range, and their status (hole,
> zero, data).
> 
> To make writing plugins easier we could say that extents don't need to
> be returned in order, and may include extents which don't actually
> overlap the requested range.  Also missing regions would mean "hole"
> (makes writing the VDDK plugin easier), and adjacent extents of the
> same type would be coalesced automatically.  But it's an error if
> returned extents overlap each other.

I'd go one step further - missing regions prior to a present region mean
"hole"; missing regions after the last present region mean "no data
available, don't advertise status to the client".  This is because the
NBD spec intentionally tries to allow both:

- a server has more information available cheaply, allow it to return
beyond the end so that a smart client can reduce the number of
subsequent queries needed (a dumb client will ignore the extra data and
probably query on the tail, but that doesn't hurt the server); but to
avoid too much client confusion, any extra info reported has to be
limited to a single extent, and coalesced with an extent that overlaps
the guest's actual request

- a server cannot cheaply obtain information over the entire length
requested by the client, so it returns just an initial answer over the
head, and the client has to ask again to find out about the tail

The NBD spec also took care to require a successful return to make
progress, so a client that wants to iterate over the entire export can
write a sane loop (even if the client has to coalesce identical status
in adjacent returns, it at least doesn't have to worry about getting
stuck on an inf-loop for a 0-length return).

> 
> nbdkit would need to do some massaging on this to get it into the
> right format for NBD_CMD_BLOCK_STATUS.  (I'm very confused about what
> NBD_CMD_FLAG_REQ_ONE is supposed to do.)

REQ_ONE says that a successful answer has to be exactly one extent,
covering only the head of the region. (That is, force the server to
behave as if obtaining information beyond the first extent is
expensive). It also tells the server that it must not report extra
information beyond the initial request (because at least qemu 3.0 would
assert that the server supplied too much information).

You're also right that nbdkit can do REQ_ONE massaging, so whether we
pass REQ_ONE on to the client or not makes it sound like the client
needs a tri-state opt-in (no block status support, block status but let
nbdkit handle REQ_ONE, block status and the plugin can handle REQ_ONE
efficiently itself)

> 
> We will also need a corresponding ‘can_extents’, which is analogous to
> ‘can_write’ etc and is what would control the output of
> NBD_OPT_{SET,LIST}_META_CONTEXT.

Yep, and I think it needs to be tri-state, like can_fua, as argued above.

> 
> For nbdkit-file-filter:

file-plugin ?

> 
> - Fairly simple implementation using SEEK_HOLE/SEEK_DATA.
> 
> - Not sure how we detect zeroes without reading the file.

Not detecting allocated zeroes is okay; reporting the holes as zeroes is
safe.

Or take a leaf from the existing nozero-filter, where we default to
maximum speed (no detection of allocated zeroes), but where a filter can
be added that does read-based zero detection (slower, but potentially
reports more zero blocks).

> 
> For nbdkit-memory-plugin:
> 
>  - Pretty simple implementation, which can even detect non-hole zeroes.
> 
> For VDDK:
> 
>  - VixDiskLib_QueryAllocatedBlocks can return allocated blocks, but
>    doesn't return holes separately (they are assumed from what is
>    omitted from the list).  No support for detecting zeroes that I can
>    see.
> 
> Some existing filters would have to be modified to correctly adjust
> ‘extents’ offsets:

Yeah, we'll have to think about every single filter, and whether it can
even allow .can_extents, or must override the plugin.

> 
>  - nbdkit-offset-filter
> 
>  - nbdkit-partition-filter
> 
>  - nbdkit-truncate-filter (? maybe not)

There's also an obvious use of the truncate-filter that treats the
underlying plugin as fully allocated (if the plugin does not
.can_extents), but where the truncated tail reports as sparse/zero.

> 
>  - nbdkit-xz-filter is complicated: XZ files support sparseness so in
>    theory we should try to return this data

cache and cow filters need some thought.

error needs a plan for whether we can inject errors on block_status
requests, as well as consideration for whether it munges plugin extents
based on current error policy.

log needs a patch to log the new callback.

> 
> Your thoughts on this appreciated,
> 

Looking forward to seeing what you come up with, but it sounds like you
are on track for supporting NBD_CMD_BLOCK_STATUS.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20190308/7798baae/attachment.sig>