[Libguestfs] New extents structure proposal

Richard W.M. Jones rjones at redhat.com
Wed Mar 20 16:57:26 UTC 2019


On Wed, Mar 20, 2019 at 11:20:15AM -0500, Eric Blake wrote:
> On 3/20/19 10:31 AM, Richard W.M. Jones wrote:
> > I think the extents map is just too complicated and is unnecessarily
> > so.  How about instead we define the plugin interface to be:
> > 
> >   int can_extents (void *handle);  // as before
> >   int extents (void *handle, uint32_t count, uint64_t offset, uint32_t flags,
> >                struct nbdkit_extents_list *list);
> > 
> > and have the extents_list be a simple list.  The first extent you add
> > must start at offset.  And you're only allowed to append monotonically
> > increasing adjacent extents to it.  Plugins must return at least one
> > extent, but are not required to return more than one extent
> > (regardless of flags).
> 
> Well, required to return at least one for the command to succeed (if
> they return 0, then we synthesize an error, perhaps EIO, to keep the
> connection to the client alive).
> 
> I take it you'd still have to use an nbdkit_extent_add(list, count,
> type), so that additions to the list are still calling malloc() from
> nbdkit's perspective (as you don't want the plugin to be realloc()ing
> the list)?

Yes I think we should preserve an abstract data type, not least
because it would allow us to implement a different structure in
future.
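
From the plugin side the usage stays very simple.  Here is a rough
sketch (not a patch) of what a file-like plugin's .extents callback
could look like under this proposal.  The NBDKIT_EXTENT_HOLE /
NBDKIT_EXTENT_ZERO type bits are just assumed names for illustration,
and I'm pretending the handle is simply the file descriptor:

  #define _GNU_SOURCE               /* for SEEK_DATA / SEEK_HOLE */
  #include <errno.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <nbdkit-plugin.h>        /* for nbdkit_error */

  static int
  file_extents (void *handle, uint32_t count, uint64_t offset,
                uint32_t flags, struct nbdkit_extents_list *list)
  {
    int fd = *(int *) handle;       /* assume the handle is just the fd */
    uint64_t end = offset + count;

    while (offset < end) {
      off_t data = lseek (fd, offset, SEEK_DATA);

      if (data == -1) {
        if (errno == ENXIO)         /* hole from offset to end of file */
          return nbdkit_extent_add (list, end - offset,
                                    NBDKIT_EXTENT_HOLE|NBDKIT_EXTENT_ZERO);
        nbdkit_error ("lseek: %m");
        return -1;
      }

      uint64_t n;
      uint32_t type;

      if ((uint64_t) data > offset) {   /* hole before the next data */
        n = (uint64_t) data < end ? (uint64_t) data - offset : end - offset;
        type = NBDKIT_EXTENT_HOLE|NBDKIT_EXTENT_ZERO;
      }
      else {                            /* data up to the next hole */
        off_t hole = lseek (fd, offset, SEEK_HOLE);
        if (hole == -1) {
          nbdkit_error ("lseek: %m");
          return -1;
        }
        n = (uint64_t) hole < end ? (uint64_t) hole - offset : end - offset;
        type = 0;
      }

      /* Extents are appended in order and are adjacent, as required. */
      if (nbdkit_extent_add (list, n, type) == -1)
        return -1;
      offset += n;

      /* At least one extent has been added, which is all that the
       * REQ_ONE case (or a lazy plugin) needs.
       */
      if (flags & NBDKIT_FLAG_REQ_ONE)
        break;
    }

    return 0;
  }

Note the plugin only ever appends, so nbdkit keeps control of the
allocation and of the representation behind the list.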

> > This would be simple enough to implement for both VDDK and file.  (For
> > VDDK the plugin needs to synthesize the holes but that's only a slight
> > complication).
> > 
> > NBDKIT_FLAG_REQ_ONE is now simple to implement (if generating the
> > extents, stop after generating the first one; if it's the server
> > processing the extents, take the first extent in the list).
> > 
> > The ‘end’ that we discussed before is now implicit (it's the end of
> > the last extent in the list).
> 
> So a plugin may give us more extents than we need, but can also easily
> stop giving us extents as early as they want (and as long as they gave
> us at least one, the overall call is successful).  The requirement on
> the plugin giving us extents in monotonically increasing order is not
> too onerous; all your earlier implementations were naturally already
> doing that (in other words, the full flexibility of random-access extent
> probing rather than linear is over-engineered).
> 
> I like the idea.
> 
> I'm trying to figure out if filters would still need a foreach visitor.
> But my initial guess is no.  Consider - the earlier proposal required
> the offset filter to do:
> 
> +  struct nbdkit_extents_map *map2;
> +
> +  map2 = nbdkit_extents_new ();
> +  if (map2 == NULL)
> +    return -1;
> +  if (next_ops->extents (nxdata, count, offs + offset,
> +                         flags, map2, err) == -1) {
> +    nbdkit_extents_free (map2);
> +    return -1;
> +  }
> +
> +  /* Transform offsets in map2, return result in extents_map. */
> +  if (nbdkit_extents_foreach (map2, subtract_offset, extents_map,
> +                              NBDKIT_EXTENTS_FOREACH_FLAG_RANGE,
> +                              offset, range) == -1) {
> +    nbdkit_extents_free (map2);
> +    return -1;
> +  }
> +  nbdkit_extents_free (map2);
> 
> because the plugin was calling:
> 
> +    if (nbdkit_extent_add (extents_map, offset, n, type) == -1)
> 
> But if we drop offset from the interface, the offset filter simplifies to:
> 
> return next_ops->extents (nxdata, count, offs + offset, flags, list, err);
> 
> because the plugin will be calling:
> 
> if (nbdkit_extent_add (list, n, type) == -1)

For a simple case like this we might even allow filters to simply
update the offsets in the list in place.  That would require exposing
the inner workings of the data type to filters, though.
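
To make that concrete, the inner workings might be nothing more than a
start offset plus a growable array of (length, type) pairs, something
like the entirely hypothetical sketch below.  A filter shifting
offsets would then just rewrite ‘start’ in place, but it can only do
that because it sees the struct, which is exactly the coupling the
abstract type avoids:

  #include <stdint.h>
  #include <stdlib.h>

  struct nbdkit_extents_list {
    uint64_t start;                 /* offset of the first extent */
    size_t nr_extents, allocated;
    struct { uint64_t length; uint32_t type; } *extents;
  };

  int
  nbdkit_extent_add (struct nbdkit_extents_list *list,
                     uint64_t length, uint32_t type)
  {
    /* Grow the array as needed; plugins never see these fields, so we
     * could swap in a different structure later without touching them.
     */
    if (list->nr_extents >= list->allocated) {
      size_t new_alloc = list->allocated ? 2 * list->allocated : 16;
      void *p = realloc (list->extents,
                         new_alloc * sizeof list->extents[0]);
      if (p == NULL) {
        nbdkit_error ("realloc: %m");
        return -1;
      }
      list->extents = p;
      list->allocated = new_alloc;
    }
    list->extents[list->nr_extents].length = length;
    list->extents[list->nr_extents].type = type;
    list->nr_extents++;
    return 0;
  }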

> > Also an observation: qemu's nbd client only ever issues block status
> > requests with the req-one flag set, so perhaps we should optimize for
> > that case.
> 
> I hope to get to the point where future qemu doesn't send the req-one
> flag. There are several threads on the qemu list about how qemu-img is
> slower than it needs to be because it is throwing away useful
> information, and where it is aggravated by the kernel's abysmal lseek()
> performance on tmpfs.  But until qemu learns useful caching, you're
> right that most existing NBD clients that request block status do so one
> extent at a time (because I don't know of any other existing NBD clients
> that use BLOCK_STATUS yet).

Is it ever possible to cache block status results?  What happens in
the (admittedly unusual) case where two writers are hitting the same
NBD server?  For example, if the server is implementing a cluster
filesystem.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top



