[Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

Eric Blake eblake at redhat.com
Wed Mar 20 05:35:33 UTC 2019


On 3/19/19 11:34 AM, Richard W.M. Jones wrote:
> This pair of calls allows plugins to describe which extents in the
> virtual disk are allocated, holes or zeroes.
> ---


> + void nbdkit_extents_free (struct nbdkit_extents_map *);
> +
> +Frees an existing extents map.

Is this like free() where it is safe to call on NULL?

> +
> +=head3 Iterating over nbdkit_extents_map
> +
> +One function is provided to filters only to iterate over the extents
> +map:
> +
> + typedef int (*nbdkit_extents_callback) (uint64_t offset, uint64_t length,
> +                                         uint32_t type, void *opaque);
> +
> + int nbdkit_extents_foreach (
> +            const struct nbdkit_extents_map *extents_map,
> +            nbdkit_extents_callback fn, void *opaque,
> +            uint32_t flags,
> +            uint64_t range_offset, uint64_t range_length);
> +

> +=item C<NBDKIT_EXTENTS_FOREACH_FLAG_RANGE>
> +
> +If this flag is included then the C<range_offset> and C<range_length>
> +parameters are used, so only extents overlapping
> +C<[range_offset...range_offset+range_length-1]> are returned.

Must range_offset+range_length-1 lie within the size reported by
next_ops->get_size(), or can the range extend beyond the end of the plugin
(that is, can I pass a length of -1 to get everything from a given offset
to the end of the file, or do I have to compute the end of the range myself
to avoid internal errors)?

> +++ b/docs/nbdkit-plugin.pod

> +The callback should detect and return the list of extents overlapping
> +the range C<[offset...offset+count-1]>.  The C<extents_map> parameter
> +points to an opaque object which the callback should fill in by
> +calling C<nbdkit_extent_add> etc.  See L</Extents map> below.
> +
> +The C<flags> parameter may contain the flag C<NBDKIT_FLAG_REQ_ONE>
> +which means that the client is only requesting information about the
> +extent overlapping C<offset>.  The plugin may ignore this flag, or as
> +an optimization it may return just a single extent for C<offset>.
> +(Note that implementing this optimization correctly is difficult and
> +subtle.  You must ensure that the upper limit of the single extent
> +returned is correct.  If unsure then the safest option is to ignore
> +this flag.)

Indeed.  Or in other words: since the extent map coalesces adjacent
regions with the same type, merely calling nbdkit_extent_add() exactly once
with type 0 (data) won't work, because that coalesces with the extent map's
default of the entire image already being reported as data.  To report a
single data extent that ends sooner than the client's requested range, it
is also necessary to report a differently-typed extent starting at the next
byte.
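
To make the subtlety concrete, here is roughly what I think a plugin
honoring REQ_ONE has to write (untested sketch; my_extents,
next_hole_offset and EXTENT_HOLE_TYPE are invented names, and I'm guessing
at the final callback signature):

#include <stdint.h>
#include <nbdkit-plugin.h>

/* Invented placeholder for whatever non-data type constant the series
 * ends up exposing for holes.
 */
#define EXTENT_HOLE_TYPE 1

/* Invented helper: the offset at which the data run containing 'offset'
 * ends and a hole begins (plugin-specific knowledge).
 */
extern uint64_t next_hole_offset (void *handle, uint64_t offset);

static int
my_extents (void *handle, uint32_t count, uint64_t offset, uint32_t flags,
            struct nbdkit_extents_map *extents_map)
{
  uint64_t data_end = next_hole_offset (handle, offset);

  if (flags & NBDKIT_FLAG_REQ_ONE) {
    /* Adding only the data extent would be a no-op, because the map
     * already defaults to all-data.  The upper limit only becomes
     * visible once the differently-typed extent at data_end is added.
     */
    if (nbdkit_extent_add (extents_map, offset, data_end - offset, 0) == -1)
      return -1;
    return nbdkit_extent_add (extents_map, data_end, 1, EXTENT_HOLE_TYPE);
  }

  /* The non-REQ_ONE path is omitted from this sketch. */
  return 0;
}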

> +
> +If there is an error, C<.extents> should call C<nbdkit_error> with an
> +error message, and C<nbdkit_set_error> to record an appropriate error
> +(unless C<errno> is sufficient), then return C<-1>.
> +
> +=head3 Extents map
> +
> +The plugin C<extents> callback is passed an opaque pointer C<struct
> +nbdkit_extents_map *extents_map>.  This structure represents a list of
> +L<filesystem extents|https://en.wikipedia.org/wiki/Extent_(file_systems)>
> +describing which areas of the disk are allocated, which are sparse
> +(“holes”), and, if supported, which are zeroes.  Initially the map
> +describes a single allocated data extent covering the entire virtual
> +disk.
> +
> +The C<extents> callback has to fill in any holes or zeroes areas by
> +calling C<nbdkit_extent*> functions, to describe all extents
> +overlapping the range C<[offset...offset+count-1]>.  (nbdkit will
> +ignore extents outside this range so plugins may, if it is easier to
> +implement, return all extent information for the whole disk.)

Worth emphasizing that reporting DATA is always a safe fallback, and
that plugins should favor speed over precision?  (If working out whether to
report ZERO or SPARSE is not significantly faster than letting .pread just
read the extent in question, then reporting DATA is the better option.)
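
In other words, even something as lazy as this sketch is always a correct
.extents implementation (same guessed-at signature as above), because the
map starts out describing the whole disk as data:

/* The laziest possible .extents callback: since the map's initial state
 * already reports every byte as allocated data, returning success
 * without calling nbdkit_extent_add() at all is always correct, merely
 * imprecise.
 */
static int
lazy_extents (void *handle, uint32_t count, uint64_t offset, uint32_t flags,
              struct nbdkit_extents_map *extents_map)
{
  return 0;
}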


> +++ b/server/extents.c

> +/* Trivial implementation as an ordered list of extents.  We will
> + * probably have to change this to a more efficient data structure if
> + * we have plugins with lots of extents.
> + *
> + * Invariants:
> + * - extents are stored in ascending order
> + * - extents must not overlap
> + * - adjacent extents must not have the same type
> + *

Any invariant about not exceeding .get_size()?  (I guess that's not
easily possible unless nbdkit_extents_new() were to take a parameter, as
the max size may differ for a filter compared to what the filter passes
to the plugin).  I'm slightly worried about a malicious plugin that
purposefully slams us with add() calls way beyond the end of the image.
Maybe you want a magic type for marking EOF when the map is first
created, with different rules on how things coalesce?  (Then again, if
the plugin crashes because it manages memory poorly, that's not really
the end of the world; we're more worried about a client triggering bad
behavior, not a plugin).

Hmm, what if we gave it the signature nbdkit_extents_new(offset, length)
to bound the extents we bother to track?  We would still allow the caller
to pass information outside those bounds to nbdkit_extent_add(), but merely
return 0 in those cases without allocating any memory.  Then we are
guaranteed that the plugin cannot make us allocate more memory than is
proportional to the size of the client's request (which is currently capped
at 32 bits by the protocol).
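
Rough sketch of the early-out I mean, which would slot into the top of
nbdkit_extent_add() in server/extents.c (range_offset and range_length
would be new fields recorded by the hypothetical
nbdkit_extents_new(offset, length)):

#include <stdbool.h>
#include <stdint.h>

/* Sketch: true if an added extent lies entirely outside the range that
 * nbdkit_extents_new() asked us to track, so _add can return 0 without
 * storing anything.  Assumes both the range and offset+length have
 * already been validated against the INT64_MAX limit, so these sums
 * cannot wrap.
 */
static bool
outside_tracked_range (uint64_t range_offset, uint64_t range_length,
                       uint64_t offset, uint64_t length)
{
  return offset + length <= range_offset ||
         offset >= range_offset + range_length;
}

Extents that merely straddle the boundary could additionally be clipped to
the range so the stored map never grows beyond it.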


> +
> +/* Find the extent before or containing offset.  If offset is inside
> + * the extent, the extent index is returned.  If offset is not in any
> + * extent, but there is an extent with a lower offset, then the index
> + * of the nearest lower extent is returned.  If offset is before the
> + * zeroth extent then -1 is returned.

Technically, aren't gaps in the map treated as implicit extents of type
0? Then again, I guess you return the next lower extent (rather than the
gap) to make the coalescing nicer.


> +int
> +nbdkit_extent_add (struct nbdkit_extents_map *map,
> +                   uint64_t offset, uint64_t length, uint32_t type)
> +{
> +  ssize_t lo, hi;
> +  bool coalesce_below, coalesce_above;
> +  struct extent new_extent;
> +
> +  /* This might be considered an error, but be flexible for badly
> +   * written plugins.
> +   */
> +  if (length == 0)
> +    return 0;
> +
> +  /* Don't allow extents to exceeed the _signed_ 64 bit limit that is

exceed

> +   * used in the rest of nbdkit.
> +   */
> +  if (offset > INT64_MAX ||
> +      offset + length < offset ||
> +      offset + length > INT64_MAX) {
> +    nbdkit_error ("nbdkit_extent_add: "
> +                  "extent cannot exceed signed 64 bit limit: "
> +                  "offset=%" PRIu64 " length=%" PRIu64,
> +                  offset, length);
> +    return -1;
> +  }
> +
> + again:
> +  lo = find_extent (map, offset);
> +  hi = find_extent (map, offset + length - 1);
> +
> +  /* "lo" and "hi" are -1 or they point to the extent
> +   * overlapping-or-below the corresponding endpoints of the new
> +   * extent.
> +   *
> +   * The algorithm here has two parts: Firstly we remove any
> +   * overlapping extents (this may involve erasing extents completely
> +   * or shrinking them).  Secondly we insert the new extent at the
> +   * right place in the array, if necessary allowing it to coalesce
> +   * with the extent above or below.
> +   *
> +   * To remove overlapping extents we consider 7 cases:
> +   *
> +   * A.0   lo == -1 (there is no old extent before the new extent)
> +   *     => do nothing
> +   *
> +   * A.1   ┌─────┐     offset of new extent exactly at
> +   *       │ lo  │     beginning of lo and lo smaller/equal
> +   *       └─────┘     to new extent
> +   *       ↑
> +   *       ▐▓▓▓▓▓▓▓▓▓
> +   *     => erase lo
> +   *     => creates situation A.0 or A.5
> +   *
> +   * A.2   ┌──────────┐   offset of new extent exactly at
> +   *       │ lo       │   beginning of lo and lo bigger than
> +   *       └──────────┘   new extent
> +   *       ↑
> +   *       ▐▓▓▓▓▓▓▓▓▓
> +   *     => shrink lo “upwards”
> +   *     => creates situation A.0 or A.5
> +   *
> +   * A.3   ┌──────────────┐  new extent in the middle
> +   *       │ lo           │  of lo
> +   *       └──────────────┘
> +   *         ↑
> +   *         ▓▓▓▓▓▓▓▓▓
> +   *     => split lo
> +   *     => creates situation A.5
> +   *
> +   * A.4   ┌─────┐     offset of new extent in the middle
> +   *       │ lo  │     of lo
> +   *       └─────┘
> +   *         ↑
> +   *         ▓▓▓▓▓▓▓▓▓
> +   *     => shrink lo
> +   *     => creates situation A.5
> +   *
> +   * A.5   ┌─────┐     offset of new extent after lo
> +   *       │ lo  │
> +   *       └─────┘
> +   *               ↑
> +   *               ▓▓▓▓▓▓▓▓▓
> +   *     => do nothing
> +   *
> +   * B.0   lo == hi or hi == -1 (there is no higher extent)
> +   *     => do nothing
> +   *
> +   * B.1          ┌─────┐     extent hi is fully covered
> +   *              │ hi  │     by new extent
> +   *              └─────┘
> +   *                       ↑
> +   *       ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
> +   *     => erase hi
> +   *     => creates situation B.0
> +   *
> +   * B.2          ┌─────┐     extent hi is partly overlapped
> +   *              │ hi  │     by new extent
> +   *              └─────┘
> +   *                 ↑
> +   *       ▓▓▓▓▓▓▓▓▓▓▓
> +   *     => shrink hi
> +   *     => creates situation B.0
> +   *
> +   * In addition extents [lo+1...hi-1] (if they exist) are erased
> +   * since they must be fully covered by the new extent.
> +   *
> +   * At the end we're in situation A.0/A.5 and B.0, and so no old
> +   * extents overlap the new extent.
> +   *
> +   * The insertion can then be done with optional coalescing (see code
> +   * below for details - this is a lot simpler).
> +   */
> +
> +  /* A.0 */
> +  if (lo == -1)
> +    goto erase_middle_extents;
> +  /* A.5 */
> +  if (offset >= map->extents[lo].offset + map->extents[lo].length)
> +    goto erase_middle_extents;
> +
> +  /* A.1 or A.2 */
> +  if (offset == map->extents[lo].offset) {
> +    if (map->extents[lo].offset + map->extents[lo].length <= offset + length) {
> +      /* A.1 */
> +      erase_extents (map, lo, 1);
> +    }
> +    else {
> +      /* A.2 */
> +      uint64_t end = map->extents[lo].offset + map->extents[lo].length;
> +      map->extents[lo].offset = offset + length;
> +      map->extents[lo].length = end - map->extents[lo].offset;
> +    }
> +    goto again;
> +  }
> +
> +  /* A.3 */
> +  if (map->extents[lo].offset < offset &&
> +      map->extents[lo].offset + map->extents[lo].length > offset + length) {
> +    new_extent.offset = offset + length;
> +    new_extent.length =
> +      map->extents[lo].offset + map->extents[lo].length - new_extent.offset;
> +    new_extent.type = map->extents[lo].type;
> +    map->extents[lo].length = offset - map->extents[lo].offset;
> +    if (insert_extent (map, new_extent, lo+1) == -1)
> +      return -1;
> +    goto again;
> +  }
> +
> +  /* We must be in situation A.4 assuming analysis above is exhaustive. */
> +  assert (offset > map->extents[lo].offset);
> +  assert (offset < map->extents[lo].offset + map->extents[lo].length);
> +  assert (map->extents[lo].offset + map->extents[lo].length <= offset + length);
> +
> +  map->extents[lo].length = offset - map->extents[lo].offset;
> +
> + erase_middle_extents:
> +  /* Erase extents [lo+1..hi-1] if they exist as they must be
> +   * completely covered by the new extent.
> +   */
> +  if (hi >= 0 && lo+1 < hi) {
> +    erase_extents (map, lo+1, hi-lo-1);
> +    goto again;
> +  }
> +
> +  /* B.0 */
> +  if (lo == hi || hi == -1)
> +    goto insert;
> +
> +  /* B.1 */
> +  if (map->extents[hi].offset + map->extents[hi].length <= offset + length) {
> +    erase_extents (map, hi, 1);

For worst-case plugin behavior, this can result in several memmove()
calls during a single _add().  I guess the way to be more efficient is to
use a different data structure rather than an array; I also suspect that we
want to cater to the typical case of a plugin calling _add() in ascending
order (so actions at the end of the list should be fast, no matter whether
we use an array or some other structure).
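
For the ascending-order case, something along these lines could sit at the
top of nbdkit_extent_add() (untested sketch reusing the struct fields and
insert_extent() from this patch; it returns 1 to mean "fall through to the
general code"):

/* Fast path for extents added in ascending order: if the new extent
 * starts at or beyond the end of the current last extent there is
 * nothing to erase and no memmove() - either extend the last extent
 * (same type, touching) or append a new one.
 * Returns 0 on success, -1 on error, 1 if the general code must run.
 */
static int
add_at_end_fast_path (struct nbdkit_extents_map *map,
                      uint64_t offset, uint64_t length, uint32_t type)
{
  struct extent new_extent;
  size_t n = map->nr_extents;
  uint64_t last_end;

  if (n == 0)
    return 1;                 /* Empty map: let the general code handle it. */

  last_end = map->extents[n-1].offset + map->extents[n-1].length;
  if (offset < last_end)
    return 1;                 /* Overlaps or precedes the last extent. */

  if (offset == last_end && map->extents[n-1].type == type) {
    /* Coalesce with the existing last extent. */
    map->extents[n-1].length = offset + length - map->extents[n-1].offset;
    return 0;
  }

  new_extent.offset = offset;
  new_extent.length = length;
  new_extent.type = type;
  return insert_extent (map, new_extent, n);   /* Plain append at the end. */
}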

> +    goto again;
> +  }
> +
> +  /* We must be in situation B.2 if the analysis above is exhaustive. */
> +  assert (map->extents[hi].offset < offset + length);
> +
> +  uint64_t end = map->extents[hi].offset + map->extents[hi].length;
> +  map->extents[hi].offset = offset + length;
> +  map->extents[hi].length = end - map->extents[hi].offset;
> +
> + insert:
> +  /* Now we know there is no extent which overlaps with the new
> +   * extent, and the new extent must be inserted after lo.
> +   */
> +
> +  /* However we have to deal with possibly coalescing the new extent
> +   * with lo and/or lo+1.
> +   */

Can we also take the shortcut that if we are in a gap, AND the _add is
for type 0, we don't have to do anything (because coalescing with the
gap is still the correct behavior)?

> +  coalesce_below =
> +    lo >= 0 &&
> +    map->extents[lo].type == type &&
> +    map->extents[lo].offset + map->extents[lo].length == offset;
> +  coalesce_above =
> +    lo+1 < map->nr_extents &&
> +    map->extents[lo+1].type == type &&
> +    map->extents[lo+1].offset == offset + length;
> +
> +  if (coalesce_below && coalesce_above) {
> +    /* Coalesce lo + new extent + lo+1
> +     * => Extend lo to cover the whole range and remove lo+1.
> +     */
> +    map->extents[lo].length =
> +      map->extents[lo+1].offset + map->extents[lo+1].length
> +      - map->extents[lo].offset;
> +    erase_extents (map, lo+1, 1);
> +  }
> +  else if (coalesce_below) {
> +    /* Coalesce lo + new extent.
> +     * => Extend lo to cover new extent.
> +     */
> +    map->extents[lo].length = offset + length - map->extents[lo].offset;
> +  }
> +  else if (coalesce_above) {
> +    /* Coalesce new extent + lo+1
> +     * => Extend lo+1 to cover new extent.
> +     */
> +    map->extents[lo+1].length =
> +      map->extents[lo+1].offset + map->extents[lo+1].length - offset;
> +    map->extents[lo+1].offset = offset;
> +  }
> +  else {
> +    /* Normal case: Allocate a new extent and insert it after lo. */
> +    new_extent.offset = offset;
> +    new_extent.length = length;
> +    new_extent.type = type;
> +    if (insert_extent (map, new_extent, lo+1) == -1)
> +      return -1;
> +  }
> +
> +  return 0;

Hairy, but well-commented and appears to be correct on my late-night
reading.

> +}
> +
> +int
> +nbdkit_extents_foreach (const struct nbdkit_extents_map *map,
> +                        nbdkit_extents_callback fn, void *opaque,
> +                        uint32_t flags,
> +                        uint64_t range_offset, uint64_t range_length)
> +{
> +  const bool flag_range = flags & NBDKIT_EXTENTS_FOREACH_FLAG_RANGE;
> +  const bool flag_one = flags & NBDKIT_EXTENTS_FOREACH_FLAG_ONE;
> +  uint64_t offset, next;
> +  size_t i;
> +
> +  if (flag_range) {
> +    if (range_offset + range_length < range_offset) {
> +      nbdkit_error ("nbdkit_extents_foreach: invalid range: "
> +                    "offset=%" PRIu64 " length=%" PRIu64,
> +                    range_offset, range_length);
> +      return -1;
> +    }

Do we still need FLAG_RANGE for this call if we instead change _new() to
take a range?  (That is, if we ignore _add calls outside of the initial
range, then it is apparent that traversal will automatically be bounded
without having to request different bounds; and I'm having a hard time
seeing where a filter would want to request anything less than the full
range it passed to next_ops, other than when doing short reads with
FLAG_ONE.)

The NBD protocol allows for short replies (whether or not FLAG_ONE was
requested).  Should we have a fifth type, say END, which the plugin can
pass to _add() specifically to mark that it did not provide extent
information beyond that point, so that nbdkit must send a short reply
(instead of treating the rest of the range as data)?  (For example, in qemu
with a qcow2 file, a block status operation never reads more than one L2
cluster; requests that overlap regions of the disk that would require
reading a second L2 cluster are instead truncated.)  The END type would not
be transmitted over the NBD wire, but it may make some other operations
easier (such as addressing your own comment about a plugin honoring REQ_ONE
having subtle semantics if it doesn't take coalescing of subsequent offsets
into account).
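
From the plugin side I'm imagining something like this (entirely
hypothetical; EXTENT_TYPE_END is an invented name and value for the
proposed fifth type, and known_len stands for however much the plugin could
cheaply discover):

#include <stdint.h>

/* Invented placeholder for the proposed "end of provided information"
 * type; the real name and value would have to come from the nbdkit
 * headers if this idea is adopted.
 */
#define EXTENT_TYPE_END 4

/* Hypothetical: a qcow2-like plugin that only consulted one L2 table
 * reports what it knows, then marks where its knowledge stops so that
 * nbdkit sends a short reply instead of padding the rest with "data".
 */
static int
report_partial_extents (struct nbdkit_extents_map *extents_map,
                        uint64_t offset, uint64_t known_len)
{
  /* The part we actually examined (assumed here to be data, type 0). */
  if (nbdkit_extent_add (extents_map, offset, known_len, 0) == -1)
    return -1;

  /* Proposed END marker: never transmitted over the NBD wire, it only
   * truncates the reply at offset + known_len.
   */
  return nbdkit_extent_add (extents_map, offset + known_len, 1,
                            EXTENT_TYPE_END);
}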


> +  }
> +  else {
> +    range_offset = 0;
> +    range_length = INT64_MAX + UINT64_C(1);
> +  }
> +
> +  for (i = 0, offset = range_offset;
> +       offset < range_offset + range_length;
> +       offset = next) {
> +    next = offset;
> +
> +    if (i >= map->nr_extents) {
> +      /* Beyond last extent => Type 0 to end of range. */
> +      next = range_offset + range_length;
> +      if (fn (offset, next - offset, 0, opaque) == -1)
> +        return -1;
> +      if (flag_one)
> +        return 0;
> +      continue;
> +    }
> +
> +    if (offset < map->extents[i].offset) {
> +      /* In a gap before the i'th extent => Type 0 to next offset. */

Did we guarantee that type 0 always appears as a gap, or is it sometimes
a gap and sometimes an actual extent entry?

> +      next = MIN (map->extents[i].offset, range_offset + range_length);
> +      if (fn (offset, next - offset, 0, opaque) == -1)
> +        return -1;
> +      if (flag_one)
> +        return 0;
> +      continue;
> +    }
> +
> +    if (offset < map->extents[i].offset + map->extents[i].length) {
> +      /* In an extent. */
> +      next = MIN (map->extents[i].offset + map->extents[i].length,
> +                  range_offset + range_length);
> +      if (fn (offset, next - offset, map->extents[i].type, opaque) == -1)
> +        return -1;
> +      if (flag_one)
> +        return 0;
> +      continue;
> +    }
> +
> +    /* After an extent, increment i and iterate. */
> +    i++;
> +  }
> +
> +  return 0;
> +}
> +
> +#ifdef IN_TEST_EXTENTS

That's as far as I got today.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
