[Libguestfs] [PATCH nbdkit 1/8] server: Implement extents/can_extents calls for plugins and filters.

Sat Mar 23 16:25:06 UTC 2019

On 3/20/19 5:11 PM, Richard W.M. Jones wrote:
> This pair of calls allows plugins to describe which extents in the
> virtual disk are allocated, holes or zeroes.
> ---

I see you've already made a couple of tweaks to your block-status branch
since posting this (skipping 0-legnth add_extent, and coalescing
consecutive add_extents with the same type), but here's some more review:

> +++ b/docs/nbdkit-filter.pod

> +It is also possible for filters to transform the extents list received
> +back from the layer below.  Without error checking it would look like
> +this:
> +
> + myfilter_extents (...)
> + {
> +   size_t i;
> +   struct nbdkit_extents *extents2;
> +   struct nbdkit_extent e;
> +
> +   extents2 = nbdkit_extents_new (offset);
> +   next_ops->extents (nxdata, count, offset, flags, extents2, err);
> +   for (i = 0; i < nbdkit_extents_size (extents2); ++i) {
> +     e = nbdkit_get_extent (extents2, i);
> +     e.offset = /* transform offset */;
> +     nbdkit_add_extent (extents, e.offset, e.length, e.type);
> +   }
> +   nbdkit_extents_free (extents2);
> + }
> +
> +More complicated transformations may require the filter to allocate a
> +new extents list using C<nbdkit_extents_new> and free the old one
> +using C<nbdkit_extents_free>.

This sentence sounds odd after the example. You'll probably want to
reword this a bit.

I see that you chose to track an extent per 'struct nbdkit_extent'
rather than leaving it implicit based solely on what struct
nbdkit_extents tracks, which means that the offset filter has to
allocate a new one to pass to the plugin, AND has to adjust each extent
before returning.

Conceptually, you have something like:
extents (starts at 1000): [
 { offset=1000, length=100, type=0 },
 { offset=1100, length=100, type=1 },
...
]

and since we allow plugins to add extents prior to the starting offset
(as a no-op), the offset filter has to create a new filter with a new
starting offset.

As a thought experiment, I wonder if it would be any easier to have
'struct nbdkit_extents' track merely a next-expected offset, as well as
a cumulative length, and then omit the offset from 'struct
nbdkit_extent'; also, give filters a way to adjust the next expected
offset.  Then you could do something like:

offsetfilter_extents (handle, offset=1000, count=500)
{
  // extents has length=0, next=1000, array=[ ]
  // implied start=1000
  nbdkit_extents_adjust_next (extents, h->offs);

  // extents has length=0, next=1000+h->offs, array=[ ]
  // implied start=1000+h->offs
  ret = next_ops->extents (nxdata, count, offsest + h->offs, flags,
extents, err);

  // client populated extents: length=500, next=1500+h->offs, array=[
  //   { length=100, type=0 }, // offset would compute as 1000+h->offs
  //   { length=100, type=1 }, // offset would compute as 1100+h->offs
  // ...
  // ], implied start=1000+h->offs

  nbdkit_extents_adjust_next (extents, -h->offs);
  // extents length=500, next=1500, array=[
  //   { length=100, type=0 }, // offset would compute as 1000
  //   { length=100, type=1 }, // offset would compute as 1100
  // ...
  // ], implied start=1000
  return ret;
}

Doing that may also make it easier to do write a filter that mixes its
own holes with the underlying status of the plugin:

  // append hole owned by filter:
  nbdkit_extent_add(extents, hole_length, offset, hole)
  // change expected next offset for plugin
  int64_t delta = nbdkit_extents_adjust_next (extents, 0)
  // let plugin append its extents
  next_ops->extents (nxdata, count - hole_length, 0, flags, extents, err)
  // adjust things back to normal
  nbdkit_extents_adjust_next (extents, delta)

Having typed that, it may require less computation time, but requires
more thought about how it actually all works, so keeping your
implementation (where the filter HAS to allocate a new extents, then
copy the plugins answers back to its own, rather than being able to
blindly reuse the plugins answers in place) is okay. As I said, mine was
just a thought experiment for comparison, so it doesn't bother me if you
don't use it.

> +
> +If there is an error, C<.extents> should call C<nbdkit_error> with an
> +error message B<and> return -1 with C<err> set to the positive errno
> +value to return to the client.
> +
> +=head3 Allocating and freeing nbdkit_extents list
> +
> +Two functions are provided to filters only for allocating and freeing
> +the map:
> +
> + struct nbdkit_extents *nbdkit_extents_new (uint64_t start);
> +
> +Allocates and returns a new, empty extents list.  The C<start>
> +parameter is the start of the range described in the list.  Normally
> +you would pass in C<offset> as this parameter.

Well, the offset that the plugin should start at (which might not be
your C<offset>, if your filter is adjusting the plugin's offsets)

> +
> +On error this function can return C<NULL>.  In this case it calls
> +C<nbdkit_error> and/or C<nbdkit_set_error> as required.
> +
> + void nbdkit_extents_free (struct nbdkit_extents *);
> +
> +Frees an existing extents list.
> +
> +=head3 Iterating over nbdkit_extents list
> +
> +Two functions are provided to filters only to iterate over the extents
> +in order:
> +
> + size_t nbdkit_extents_size (const struct nbdkit_extents *);
> +
> +Returns the number of extents in the list.
> +
> + struct nbdkit_extent {
> +   uint64_t offset;
> +   uint64_t length;
> +   uint32_t type;
> + };
> + struct nbdkit_extent nbdkit_get_extent (const struct nbdkit_extents *,
> +                                         size_t i);
> +
> +Returns a copy of the C<i>'th extent.

If you go with my thought experiment, it gets harder to provide this
signature.  This signature is certainly easier than the foreach()
iterator with callback function of your previous version, though, so I
do like how enforcing the client to work in ascending contiguous order
simplified things.

> +
>  =head1 ERROR HANDLING
>  
>  If there is an error in the filter itself, the filter should call
> diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod
> index 47545f3..a603f8b 100644
> --- a/docs/nbdkit-plugin.pod
> +++ b/docs/nbdkit-plugin.pod
> @@ -555,6 +555,20 @@ This callback is not required.  If omitted, then nbdkit always tries
>  C<.zero> first if it is present, and gracefully falls back to
>  C<.pwrite> if C<.zero> was absent or failed with C<EOPNOTSUPP>.
>  
> +=head2 C<.can_extents>
> +
> + int can_extents (void *handle);
> +
> +This is called during the option negotiation phase to find out if the
> +plugin supports detecting allocated (non-sparse) regions of the disk
> +with the C<.extents> callback.
> +
> +If there is an error, C<.can_extents> should call C<nbdkit_error> with
> +an error message and return C<-1>.
> +
> +This callback is not required.  If omitted, then we return true iff a
> +C<.extents> callback has been defined.

Should we default this to returning true always, _because_ we have the
sane fallback of treating the entire image as allocated when .extents is
absent?  (That is, a plugin has to explicitly opt out of advertising
NBD_CMD_BLOCK_STATUS support, because our default works well enough even
if the plugin didn't do anything). It would match how filters can call
next_ops->zero() even for plugins that do not have .zero (because
.can_zero defaults to true).

> +

> +Extents overlapping the range C<[offset...offset+count-1]> should be
> +returned if possible.  However nbdkit ignores extents E<lt> offset so
> +the plugin may, if it is easier to implement, return all extent
> +information for the whole disk.  The plugin may return extents beyond
> +the end of the range.  It may also return extent information for less
> +than the whole range, but it must return at least one extent.

must return at least one extent overlapping with C<[offset]>.

(if the plugin returns all extent information prior to offset, but stops
before offset is reached, that counts as not returning any extents).

> +
> +The extents B<must> be added in ascending order, and B<must> be
> +contiguous.
> +
> +The C<flags> parameter may contain the flag C<NBDKIT_FLAG_REQ_ONE>
> +which means that the client is only requesting information about the
> +extent overlapping C<offset>.  The plugin may ignore this flag, or as
> +an optimization it may return just a single extent for C<offset>.
> +(Note that implementing this optimization correctly is difficult and
> +subtle.  You must ensure that the upper limit of the single extent
> +returned is correct.  If unsure then the safest option is to ignore
> +this flag.)

I'm not sure we still need this parenthetical - with your simpler
implementation, I think it is harder for plugins to get things wrong
(the NBD protocol does NOT require the upper limit to be accurate,
merely that it makes progress - although the NBD protocol recommends
that servers return maximal upper bounds, it also warns clients that
they must be prepared to coalesce short returns with identical type).

> +++ b/include/nbdkit-filter.h
> @@ -1,5 +1,5 @@
>  /* nbdkit
> - * Copyright (C) 2013-2018 Red Hat Inc.
> + * Copyright (C) 2013-2019 Red Hat Inc.
>   * All rights reserved.
>   *
>   * Redistribution and use in source and binary forms, with or without
> @@ -44,6 +44,18 @@ extern "C" {
>  
>  #define NBDKIT_FILTER_API_VERSION 2
>  
> +struct nbdkit_extent {
> +  uint64_t offset;
> +  uint64_t length;
> +  uint32_t type;
> +};
> +
> +extern struct nbdkit_extents *nbdkit_extents_new (uint64_t start);
> +extern void nbdkit_extents_free (struct nbdkit_extents *);
> +extern size_t nbdkit_extents_size (const struct nbdkit_extents *);

'size' makes me wonder if this is the number of bytes covered by the
extents object, rather than the number of extent objects in the array.
Would calling it 'nbdkit_extents_count' be any easier to understand?

> +++ b/server/extents.c
> @@ -0,0 +1,171 @@

> +struct nbdkit_extents {
> +  struct nbdkit_extent *extents;
> +  size_t nr_extents, allocated;
> +  uint64_t start;
> +};

Is it also worth tracking cumulative length covered by extents?
Tracking it would give us an O(1) instead of O(n) answer to "how many
bytes did the plugin tell us about" - which MAY matter to filters that
want to append additional data about a hole after the underlying plugin
(so the filter can tell if the plugin filled in the entire range that
the filter requested, or if it filled in only a subset). But then again,
if the filter has to allocate a new map to pass to the plugin, and then
postprocess to copy from the temporary map back to the filter's incoming
map, the filter is already doing O(n) work and can answer the question
itself. (My thought proposal was a way to let such a filter do O(1) work
by letting the client append directly into the same list that the filter
will also be appending to - but I'm not sure it is necessary for the
complexity it introduces).

Is it worth tracking a maximum that the client asked for? When REQ_ONE
is in effect, the maximum is a hard stop for what we answer to the
client (but we should still let the client give more than we asked for);
when REQ_ONE is not in effect, the protocol says that we can't add any
more extents of a new type (we can only coalesce a larger end length to
the existing last extent). Either way, tracking a maximum means that we
could avoid wasting time and malloc()s on extents where the plugin gave
us more information than we need, for any nbdkit_add_extent() with an
offset beyond the maximum that does not coalesce with the existing final
extent.

> +
> +size_t
> +nbdkit_extents_size (const struct nbdkit_extents *exts)
> +{
> +  return exts->nr_extents;
> +}

Again, would nbdkit_extents_count() be a nicer name for this one?

> +int
> +nbdkit_add_extent (struct nbdkit_extents *exts,
> +                   uint64_t offset, uint64_t length, uint32_t type)
> +{
> +  uint64_t overlap;
> +
> +  /* If there are existing extents, the new extent must be contiguous. */

or else ignored because it is beyond a maximum, if you like my idea of
tracking the maximum the client cared about.  Here's also where you
added code to ignore 0-length plugin calls.

> +
> +  const struct nbdkit_extent e =
> +    { .offset = offset, .length = length, .type = type };
> +  return append_extent (exts, &e);

and here is where you added in code to coalesce plugin calls that
returned the same type.

> +++ b/server/plugins.c
> @@ -1,5 +1,5 @@

>  
> +static int
> +plugin_can_extents (struct backend *b, struct connection *conn)
> +{
> +  struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
> +
> +  assert (connection_get_handle (conn, 0));
> +
> +  debug ("can_extents");
> +
> +  if (p->plugin.can_extents)
> +    return p->plugin.can_extents (connection_get_handle (conn, 0));
> +  else
> +    return p->plugin.extents != NULL;

Again, making this default to true may be nicer, where a plugin has to
opt out of our sane default behavior.

We're getting closer.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20190323/b5f421df/attachment.sig>