[Libguestfs] [PATCH nbdkit] filters: Add copy-on-write filter.

Eric Blake eblake at redhat.com
Mon Jan 22 15:50:41 UTC 2018


On 01/20/2018 10:58 AM, Richard W.M. Jones wrote:
> ---
>  configure.ac                      |   5 +
>  filters/Makefile.am               |   4 +
>  filters/cow/Makefile.am           |  65 +++++++
>  filters/cow/cow.c                 | 392 ++++++++++++++++++++++++++++++++++++++
>  filters/cow/nbdkit-cow-filter.pod | 162 ++++++++++++++++
>  tests/Makefile.am                 |   6 +
>  tests/test-cow.sh                 |  98 ++++++++++
>  7 files changed, 732 insertions(+)
> 

> +
> +/* Notes on the design and implementation of this filter:
> + *
> + * The filter works by creating a large, sparse temporary file, the
> + * same size as the underlying device.  Being sparse, initially this
> + * takes up no space.
> + *
> + * We confine all pread/pwrite operations to the filesystem block
> + * size.  The blk_read and blk_write functions below always happen on
> + * whole filesystem block boundaries.  A smaller-than-block-size
> + * pwrite will turn into a read-modify-write of a whole block.  We
> + * also assume that the plugin returns the same immutable data for
> + * each pread call we make, and optimize on this basis.
> + *
> + * When reading a block we first check the temporary file to see if
> + * that file block is allocated or a hole.  If allocated, we return it
> + * from the temporary file.  If a hole, we issue a pread to the
> + * underlying plugin.

This is great on filesystems that support it.  However, the overlay may
well end up in /tmp (via $TMPDIR), and even recent Linux has had tmpfs
implementations with one of two problems.  Some did not support
SEEK_HOLE at all and fell back to treating the entire file as data,
even when it is sparse, which means your reads never reach the
underlying plugin.  Others supported it, but with cost that scales
poorly with the offset of the next hole (something like O(n^2) instead
of O(1)), which makes performance painful the larger the file gets.
You may want to document a caveat that this filter requires decent
SEEK_HOLE support from the filesystem.  Alternatively, I see your
followup patch uses a bitmap to avoid lseek entirely, which also
sidesteps the problem on straggler filesystems.
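To make the bitmap alternative concrete, here is a minimal sketch,
assuming one bit per overlay block held in memory.  The struct and
function names (overlay_bitmap, bitmap_init, bitmap_is_allocated,
bitmap_set_allocated, bitmap_free) are hypothetical and not taken from
the followup patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory allocation map: one bit per overlay block.
 * A set bit means "this block has been written to the overlay"; a
 * clear bit means "read it from the underlying plugin".  Unlike
 * lseek(SEEK_DATA), this does not depend on the filesystem.
 */
struct overlay_bitmap {
  uint8_t *bits;
  uint64_t nr_blocks;
};

/* Cover size bytes in blocks of blksize bytes, all initially clear.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int
bitmap_init (struct overlay_bitmap *bm, uint64_t size, uint32_t blksize)
{
  assert (blksize > 0);
  bm->nr_blocks = (size + blksize - 1) / blksize;  /* round up */
  bm->bits = calloc ((bm->nr_blocks + 7) / 8, 1);
  return (bm->bits == NULL && bm->nr_blocks > 0) ? -1 : 0;
}

static bool
bitmap_is_allocated (const struct overlay_bitmap *bm, uint64_t blknum)
{
  assert (blknum < bm->nr_blocks);
  return (bm->bits[blknum / 8] >> (blknum % 8)) & 1;
}

static void
bitmap_set_allocated (struct overlay_bitmap *bm, uint64_t blknum)
{
  assert (blknum < bm->nr_blocks);
  bm->bits[blknum / 8] |= (uint8_t) (1u << (blknum % 8));
}

static void
bitmap_free (struct overlay_bitmap *bm)
{
  free (bm->bits);
  bm->bits = NULL;
  bm->nr_blocks = 0;
}
```

In this scheme blk_read would consult bitmap_is_allocated to choose
between the overlay and a pread to the plugin, and blk_write would call
bitmap_set_allocated after a successful write to the overlay.  The cost
is one bit per block: a 1 TiB device with 4096-byte blocks needs 32 MiB
of bitmap.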

> + *
> + * When writing a block we unconditionally write the data to the
> + * temporary file (allocating a block in that file if it wasn't
> + * before).
> + *
> + * No locking is needed for blk_* calls, but there is a potential
> + * problem if multiple pwrite calls are doing a read-modify-write
> + * cycle because the last write would win, erasing earlier writes.  To
> + * avoid this we limit the thread model to SERIALIZE_ALL_REQUESTS so
> + * that there cannot be concurrent pwrite requests.  We could relax
> + * this restriction with a bit of work.

Yep, this is a pretty cool idea, when it works!
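The read-modify-write cycle described in the design notes can be
sketched as follows.  This assumes a fixed 4096-byte block size, and
blk_read/blk_write here are hypothetical stand-ins that operate on an
in-memory array so the example runs on its own; in the filter they
would go to the overlay file or the plugin.  The rmw_pwrite name is
likewise illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLKSIZE 4096            /* assumed filesystem block size */
#define NR_BLOCKS 4

/* Stand-in storage so the sketch is self-contained.  In the filter,
 * blk_read would check the overlay (or fall through to the plugin)
 * and blk_write would write to the temporary file.
 */
static uint8_t disk[NR_BLOCKS * BLKSIZE];

static int
blk_read (uint64_t blknum, uint8_t *block)
{
  memcpy (block, &disk[blknum * BLKSIZE], BLKSIZE);
  return 0;
}

static int
blk_write (uint64_t blknum, const uint8_t *block)
{
  memcpy (&disk[blknum * BLKSIZE], block, BLKSIZE);
  return 0;
}

/* Write count bytes at offset as a series of whole-block
 * read-modify-write cycles.  If two calls touching the same block ran
 * concurrently, both could read the old contents and the later
 * blk_write would discard the earlier change; hence the need to
 * serialize requests.
 */
static int
rmw_pwrite (const void *buf, uint32_t count, uint64_t offset)
{
  const uint8_t *src = buf;
  uint8_t block[BLKSIZE];

  assert (offset + count <= sizeof disk);

  while (count > 0) {
    uint64_t blknum = offset / BLKSIZE;
    uint32_t blkoffs = offset % BLKSIZE;
    uint32_t n = BLKSIZE - blkoffs;
    if (n > count)
      n = count;

    /* A full, aligned block is overwritten entirely: no read needed. */
    if (n < BLKSIZE && blk_read (blknum, block) == -1)
      return -1;
    memcpy (&block[blkoffs], src, n);
    if (blk_write (blknum, block) == -1)
      return -1;

    src += n;
    offset += n;
    count -= n;
  }
  return 0;
}
```

Relaxing SERIALIZE_ALL_REQUESTS would mean holding some lock (for
example per block or per range of blocks) across the blk_read and
blk_write of each cycle.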


> +static void
> +cow_load (void)
> +{
> +  const char *tmpdir;
> +  size_t len;
> +  char *template;
> +
> +  tmpdir = getenv ("TMPDIR");
> +  if (!tmpdir)
> +    tmpdir = "/var/tmp";
> +
> +  nbdkit_debug ("cow: temporary directory for overlay: %s", tmpdir);
> +
> +  len = strlen (tmpdir) + 8;
> +  template = alloca (len);
> +  snprintf (template, len, "%s/XXXXXX", tmpdir);
> +
> +  fd = mkostemp (template, O_CLOEXEC);
> +  if (fd == -1) {
> +    nbdkit_error ("mkostemp: %s: %m", tmpdir);
> +    exit (EXIT_FAILURE);
> +  }
> +
> +  unlink (template);
> +
> +  if (ioctl (fd, FIGETBSZ, &blksize) == -1) {
> +    nbdkit_error ("ioctl: FIGETBSZ: %m");
> +    exit (EXIT_FAILURE);
> +  }
> +  if (blksize <= 0) {
> +    nbdkit_error ("filesystem block size is < 0 or cannot be read");
> +    exit (EXIT_FAILURE);
> +  }

The file is still size zero here...

> +/* Force an early call to cow_get_size, consequently truncating the
> + * overlay to the correct size.
> + */
> +static int
> +cow_prepare (struct nbdkit_next_ops *next_ops, void *nxdata,
> +             void *handle)
> +{
> +  int64_t r;
> +
> +  r = cow_get_size (next_ops, nxdata, handle);
> +  return r >= 0 ? 0 : -1;

...so here, now that you've resized it, it may be worth probing with
lseek(SEEK_HOLE) at offset 0 (which must return 0) and lseek(SEEK_DATA)
(which must fail with ENXIO, because there is no data yet) to make sure
the temporary file has adequate seek support.  That way you can at
least kill the nbdkit process up front, rather than suffer from a
filesystem that reports the entire sparse file as data.
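A minimal sketch of that probe, assuming Linux with glibc (where
SEEK_DATA and SEEK_HOLE require _GNU_SOURCE) and that it runs only
after the overlay has been truncated to its final size but before
anything is written to it.  check_seek_support is a hypothetical
helper name.

```c
#define _GNU_SOURCE             /* for SEEK_DATA and SEEK_HOLE on glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 0 if fd, a freshly truncated and never-written sparse file,
 * reports its holes correctly; -1 otherwise.  A filesystem that treats
 * the whole file as data would make the filter return zeroes from the
 * overlay instead of reading through to the plugin.
 */
static int
check_seek_support (int fd)
{
  off_t r;

  /* The file is one big hole, so the first hole must start at 0. */
  r = lseek (fd, 0, SEEK_HOLE);
  if (r != 0)
    return -1;

  /* Nothing has been written yet, so SEEK_DATA must fail with ENXIO. */
  errno = 0;
  r = lseek (fd, 0, SEEK_DATA);
  if (r != -1 || errno != ENXIO)
    return -1;

  return 0;
}
```

On a size-zero file lseek(SEEK_HOLE) itself fails with ENXIO, so the
check only makes sense once cow_get_size has resized the overlay; in
cow_prepare a failure could be reported with nbdkit_error and turned
into a -1 return.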


> +=head1 CREATING A DIFF WITH QEMU-IMG
> +
> +Although nbdkit-cow-filter itself cannot save the differences, it is
> +possible to do this using an obscure feature of L<qemu-img(1)>.
> +B<nbdkit must remain continuously running during the whole operation,
> +otherwise all changes will be lost>.
> +
> +Run nbdkit:
> +
> + nbdkit --filter=cow file file=disk.img
> +
> +and then connect with a client and make whatever changes you need.
> +At the end, disconnect the client.
> +
> +Run these C<qemu-img> commands to construct a qcow2 file containing
> +the differences:
> +
> + qemu-img create -f qcow2 -b nbd:localhost diff.qcow2
> + qemu-img rebase -b disk.img diff.qcow2
> +
> +F<diff.qcow2> now contains the differences between the base
> +(F<disk.img>) and the changes stored in nbdkit-cow-filter.  C<nbdkit>
> +can now be killed.

Cute!


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
