[Libguestfs] [PATCH nbdkit] cache, cow: Do not round size down.

Eric Blake eblake at redhat.com
Wed Feb 17 14:57:36 UTC 2021


On 2/17/21 6:57 AM, Richard W.M. Jones wrote:
> Both the cache and cow filters have the annoying behaviour that the
> size of the underlying plugin is rounded down to the block size used
> by the filters (ie. 4K).  This is especially visible for single sector
> disk images:
> 
> $ ./nbdkit memory size=512 --filter=cow --run 'nbdinfo --size $uri'
> 0
> $ ./nbdkit memory size=512 --filter=cache --run 'nbdinfo --size $uri'
> 0
> 
> but affects all users of these filters somewhat randomly and
> unexpectedly.  (For example I had a GPT partitioned disk image from a
> customer which did not have a 4K aligned size, which was unexpectedly
> "corrupted" when I tried to open it using the cow filter - the reason
> for the corruption was the backup partition table at the end of the
> disk being truncated.)
> 
> Getting rid of this assumption is awkward: the general idea is that we
> round up the size of the backing file / bitmap by a full block.  But
> whenever we are asked to read or write to the underlying plugin we
> handle the tail case carefully.
> 
> This also tests various corner cases.
> ---

> +++ b/filters/cache/nbdkit-cache-filter.pod
> @@ -25,12 +25,6 @@ does not have effective caching, or (with C<cache=unsafe>) to defeat
>  flush requests from the client (which is unsafe and can cause data
>  loss, as the name suggests).
>  
> -Note that the use of this filter rounds the image size down to a
> -multiple of the caching granularity (the larger of 4096 or the
> -C<f_bsize> field of L<fstatvfs(3)>), to ease the implementation. If
> -you need to round the image size up instead to access the last few
> -bytes, combine this filter with L<nbdkit-truncate-filter(1)>.

Nice to see this restriction go.

Do you also want to tweak the sentence in nbdkit-truncate-filter.pod
that mentions cache and cow filters?

> +++ b/filters/cache/blk.c

>  int
>  blk_set_size (uint64_t new_size)
>  {
> -  if (bitmap_resize (&bm, new_size) == -1)
> +  size = new_size;
> +
> +  if (bitmap_resize (&bm, size) == -1)
>      return -1;
>  
> -  if (ftruncate (fd, new_size) == -1) {
> +  if (ftruncate (fd, ROUND_UP (size, blksize)) == -1) {

So the underlying fd is aligned...

>      nbdkit_error ("ftruncate: %m");
>      return -1;
>    }
>  
> -  if (lru_set_size (new_size) == -1)
> +  if (lru_set_size (size) == -1)

...while the lru size is still the original size.
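
For anyone following along, here is roughly what the bookkeeping looks
like for the single-sector example in the commit message (illustrative
only; the ROUND_UP definition below is my assumption of a usual
round-to-multiple helper, not copied from the patch):

  #include <assert.h>
  #include <stdint.h>

  /* Assumed definition, for illustration only. */
  #define ROUND_UP(n, d) (((n) + (d) - 1) / (d) * (d))

  int
  main (void)
  {
    uint64_t size = 512;         /* exact plugin size, kept in `size' */
    uint64_t blksize = 4096;     /* cache block size */

    /* The local cache file is rounded up to a whole block... */
    assert (ROUND_UP (size, blksize) == 4096);
    /* ...while the bitmap and LRU keep tracking the exact 512 bytes. */
    assert (size == 512);
    return 0;
  }

So the exported size is no longer rounded at all; only the scratch file
grows to the next block boundary.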

> @@ -199,9 +207,22 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
>                  "unknown");
>  
>    if (state == BLOCK_NOT_CACHED) { /* Read underlying plugin. */
> -    if (next_ops->pread (nxdata, block, blksize, offset, 0, err) == -1)
> +    unsigned n = blksize, tail = 0;
> +
> +    if (offset + n > size) {
> +      tail = offset + n - size;
> +      n -= tail;
> +    }
> +
> +    if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
>        return -1;

Good, you were careful to never read beyond EOF of the underlying plugin.

>  
> +    /* Normally we're reading whole blocks, but at the very end of the
> +     * file we might read a partial block.  Deal with that case by
> +     * zeroing the tail.
> +     */
> +    memset (block + n, 0, tail);
> +

I don't know if that tail can ever leak out to the client, but I agree
that it is safer to zero it anyway.
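
To make the clamping arithmetic concrete (again with the 512-byte
example from the commit message; the numbers are mine, not from the
patch):

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    uint64_t size = 512, offset = 0;
    unsigned blksize = 4096, n = blksize, tail = 0;

    /* Same clamping as the hunk above. */
    if (offset + n > size) {
      tail = offset + n - size;
      n -= tail;
    }

    /* Only 512 bytes are read from the plugin; the remaining 3584
     * bytes of the 4K cache block are zero-filled by the memset.
     */
    assert (n == 512 && tail == 3584);
    return 0;
  }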

>      /* If cache-on-read, copy the block to the cache. */
>      if (cache_on_read) {
>        nbdkit_debug ("cache: cache-on-read block %" PRIu64
> @@ -247,9 +268,22 @@ blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
>  
>    if (state == BLOCK_NOT_CACHED) {
>      /* Read underlying plugin, copy to cache regardless of cache-on-read. */
> -    if (next_ops->pread (nxdata, block, blksize, offset, 0, err) == -1)
> +    unsigned n = blksize, tail = 0;
> +
> +    if (offset + n > size) {
> +      tail = offset + n - size;
> +      n -= tail;
> +    }
> +
> +    if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
>        return -1;
>  
> +    /* Normally we're reading whole blocks, but at the very end of the
> +     * file we might read a partial block.  Deal with that case by
> +     * zeroing the tail.
> +     */
> +    memset (block + n, 0, tail);
> +

Duplicated code, but I'm not sure if it warrants a helper function.
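
If it ever does, something along these lines might work (just a sketch,
not from the patch; it assumes the file-scope `size' and `blksize'
variables and the includes that blk.c already has):

  /* Hypothetical helper: read one cache block's worth of data from the
   * underlying plugin, clamping the request at the plugin's EOF and
   * zero-filling whatever lies beyond it.
   */
  static int
  blk_read_from_plugin (struct nbdkit_next_ops *next_ops, void *nxdata,
                        uint8_t *block, uint64_t offset, int *err)
  {
    unsigned n = blksize, tail = 0;

    if (offset + n > size) {
      tail = offset + n - size;
      n -= tail;
    }

    if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
      return -1;

    memset (block + n, 0, tail);
    return 0;
  }

Both blk_read and blk_cache could then call it before doing their
cache-specific work.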

>      nbdkit_debug ("cache: cache block %" PRIu64 " (offset %" PRIu64 ")",
>                    blknum, (uint64_t) offset);
>  
> @@ -281,6 +315,12 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
>                    int *err)
>  {
>    off_t offset = blknum * blksize;
> +  unsigned n = blksize, tail = 0;
> +
> +  if (offset + n > size) {
> +    tail = offset + n - size;
> +    n -= tail;
> +  }
>  
>    reclaim (fd, &bm);
>  
> @@ -293,7 +333,7 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
>      return -1;
>    }
>  
> -  if (next_ops->pwrite (nxdata, block, blksize, offset, flags, err) == -1)
> +  if (next_ops->pwrite (nxdata, block, n, offset, flags, err) == -1)
>      return -1;

Okay, you're careful not to write beyond EOF of the plugin, regardless
of what extra data may live in the tail of the block in the local fd.
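
Worked numbers for the same single-sector case (mine, not from the
patch): blk_set_size already rounded the local cache fd up to
ROUND_UP (512, 4096) = 4096 bytes, so the filter keeps reading and
writing whole 4K blocks locally, but the pwrite above forwards only
n = 4096 - 3584 = 512 bytes to the plugin, so the plugin never sees a
request past its 512-byte EOF.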

LGTM

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org