[Libguestfs] [PATCH libnbd v2 2/3] copy: Use preferred block size for copying

Richard W.M. Jones rjones at redhat.com
Wed Jun 29 15:40:03 UTC 2022


On Wed, Jun 29, 2022 at 05:14:35PM +0200, Laszlo Ersek wrote:
> On 06/29/22 13:03, Richard W.M. Jones wrote:
> > You're not supposed to read or write NBD servers at a granularity less
> > than the advertised minimum block size.  nbdcopy has ignored this
> > requirement, and this is usually fine because the NBD servers we care
> > about support 512-byte sector granularity, and never advertise sizes /
> > extents less granular than sectors (even if it's a bit suboptimal in a
> > few cases).
> > 
> > However there is one new case where we do care: When writing to a
> > compressed qcow2 file, qemu advertises a minimum and preferred block
> > size of 64K, and it really means it.  You cannot write blocks smaller
> > than this because of the way qcow2 compression is implemented.
> > 
> > This commit attempts to do the least work possible to fix this.
> > 
> > The previous multi-thread-copying loop was driven by the extent map
> > received from the source.  I have modified the loop so that it
> > iterates over request_size blocks.  request_size is set from the
> > command line (--request-size) but will be adjusted upwards if either
> > the source or destination preferred block size is larger.  So this
> > will always copy blocks which are at least the preferred block size
> > (except for the very last block of the disk).
> 
> Does this mean that, if we have a large hole, we won't skip it on the
> source / zero it on the destination in one operation (assuming that's
> what we've been doing, that is), but read through it dutifully in
> request_size blocks?

I'm not sure I understand the question correctly, but we do skip holes
[actually extents marked as zeroes, not just holes] by writing zeroes
straight to the target.  They are not read from the source (neither
before, nor after this patch).

The change in this patch is that if you have a hole on the source which
is smaller than max(request-size, destination preferred block size),
then we do read it, e.g. in this situation:

    src                 dst
    |   data    |       |           |
    +-----------+       +-----------+
    |   zero    |       |           | \
    +-----------+       |           |  request size or preferred
    |   data    |       |           |  block size
    |   data    |       |           | /
    |   data    |       +-----------+

Basically because there is some data, and we're reading large blocks,
we ignore the hole here and read the whole source block for simplicity.

We could imagine reading at the granularity of the source preferred
block size, if it was smaller than the destination preferred block
size, and thus being able to skip the hole, but that's much too
complicated for me to implement sanely.
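To make that concrete, the "is this whole block zero?" test amounts to intersecting the request's byte range with the extent map.  A rough Python sketch (illustrative only; the real code is the only_zeroes function in C in copy/multi-thread-copying.c, and the extent representation here is simplified):

```python
# Sketch (not the real C code): decide whether a request-sized block can
# be zeroed on the destination instead of being read from the source.
# An extent is (offset, length, is_zero), covering the work block.

def only_zeroes(extents, offset, length):
    """True iff every extent overlapping [offset, offset+length) is zero."""
    end = offset + length
    for ext_off, ext_len, is_zero in extents:
        if ext_off + ext_len <= offset or ext_off >= end:
            continue            # extent does not overlap this block
        if not is_zero:
            return False        # some data in the block: read it all
    return True

# The situation in the diagram: a zero extent sandwiched between data,
# all inside one request-sized block, so the whole block must be read.
extents = [(0, 4096, False), (4096, 8192, True), (12288, 4096, False)]
print(only_zeroes(extents, 0, 16384))    # prints False: mixed, so read
print(only_zeroes(extents, 4096, 8192))  # prints True: pure zeroes, skip
```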

> If that's the case, I'm totally OK with that; it's just exactly the kind
> of "sacrifice" that I felt would be necessary here!
> 
> > While copying these blocks we consult the source extent map.  If it
> > contains only zero regions covering the whole block (only_zeroes
> > function) then we can skip straight to zeroing the target
> > (fill_dst_range_with_zeroes), else we do read + write as before.
> 
> OK, so we subdivide the request_size-sized block still. I'm curious (and
> will later look) whether we'll rely on a previously fetched extent map
> (fetched for the whole block device), or if we get the extent-set that
> covers the individual request. In the former (more efficient) case,
> we'll effectively have to build an intersection between the request byte
> interval, and the intervals in the extent map.

There are N threads (N=4 by default) and they take it in turns to grab
128M (THREAD_WORK_SIZE) work blocks.  For each work block the thread
first queries the extents of that 128M, and then does the copy.  After
copying 128M, it goes back to the work queue and requests another 128M
work block.

This patch doesn't change this behaviour.

So I guess the answer is we have the extent map up front ("exts" in
the code) before we decide whether to read from the source or not.
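In outline, the per-thread loop after this patch looks something like the following Python sketch (illustrative only; the real code is asynchronous C, and apart from THREAD_WORK_SIZE and the default request size the names here are made up):

```python
# Sketch (illustrative Python, not the real C) of the per-thread loop:
# claim a 128M work block, fetch its extent map once, then walk it in
# request-size chunks, choosing "zero dst" or "read+write" per chunk.
THREAD_WORK_SIZE = 128 * 1024 * 1024   # per-work-block size
REQUEST_SIZE = 256 * 1024              # default --request-size

def copy_disk(disk_size, get_extents, chunk_is_zero, read_write, zero_dst):
    offset = 0
    while offset < disk_size:          # stands in for the shared work queue
        work_len = min(THREAD_WORK_SIZE, disk_size - offset)
        exts = get_extents(offset, work_len)   # extent map fetched up front
        chunk = offset
        while chunk < offset + work_len:
            n = min(REQUEST_SIZE, offset + work_len - chunk)
            if chunk_is_zero(exts, chunk, n):
                zero_dst(chunk, n)             # skip the read entirely
            else:
                read_write(chunk, n)           # ordinary read + write
            chunk += n
        offset += work_len

ops = []
copy_disk(1024 * 1024,                          # 1M toy disk
          lambda o, l: "exts",                  # extent map placeholder
          lambda exts, o, l: o >= 256 * 1024,   # pretend all but the
          lambda o, l: ops.append(("rw", o, l)),      # first chunk is zero
          lambda o, l: ops.append(("zero", o, l)))
print(ops)   # one read/write chunk, then three zeroed chunks
```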

> > I only modified the multi-thread-copying loop, not the synchronous
> > loop.  That should be updated in the same way later.
> > 
> > One side effect of this change is it always makes larger requests,
> 
> Why?
> 
> Considering an NBD server where minimum=preferred=maximum block size,
> what is the difference?
> 
> (In fact I'd have thought this change causes nbdcopy to use *smaller*
> requests than before, exactly because now we'll advance in fixed size
> blocks, not in extent-sized jumps!)
> 
> > even for regions we know are sparse.
> 
> A sparse region is specifically where I'd expect this change to slow
> down progress (use smaller requests than what would be permitted by a
> big hole).

I didn't put it very well, and that sentence isn't fully correct anyway.

Before the patch, we worked through the extent list.  For extents
marked as zeroes, we wrote zeroes to the target.  Since nbdkit usually
returns very granular extents (e.g. 32K page size when using
nbdkit-data-plugin & nbdkit-memory-plugin), this would mean you'd
often see 32K zero requests being sent to the destination.

After the patch we split up the work block into request_size chunks
(256K by default), so requests appear generally larger.
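The request size itself is the command line value adjusted upwards to the preferred block sizes, roughly like this sketch (the real adjustment in copy/main.c may round differently; this just shows the intent):

```python
# Sketch: --request-size is raised to at least the larger of the source
# and destination preferred block sizes (all powers of two in practice),
# so requests never fall below what either side prefers.
def effective_request_size(request_size, src_preferred, dst_preferred):
    return max(request_size, src_preferred, dst_preferred)

# Default 256K request vs qemu's 64K preferred size: unchanged.
print(effective_request_size(256 * 1024, 4096, 64 * 1024))    # 262144
# A (hypothetical) 1M preferred size would force larger requests.
print(effective_request_size(256 * 1024, 4096, 1024 * 1024))  # 1048576
```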

> > This is clear in the
> > copy-sparse.sh and copy-sparse-allocated.sh tests which were
> > previously driven by the 32K sparse map granularity of the source.
> 
> What does this mean? Was the test image (well, "test device")
> intentionally populated by alternating, 32KB, "data" and "gap" extents?

In this test we set up an nbdkit-data-plugin disk:

$ nbdkit data data='
             1
             @1073741823 1
             @4294967295 1
             @4294967296 1 '

which means it has an allocated byte at offsets 0, 1073741823,
4294967295 & 4294967296.  Since nbdkit-data-plugin uses a sparse array
backed with 32K pages, the extent map looks like this:

$ nbdkit data data='
             1
             @1073741823 1
             @4294967295 1
             @4294967296 1 ' --run 'nbdinfo --map $uri'
         0       32768    0  data
     32768  1073676288    3  hole,zero
1073709056       32768    0  data
1073741824  3221192704    3  hole,zero
4294934528       32769    0  data

After this patch, with the default request-size (256K) it would write
256K of data at the beginning of the disk.  To make the test pass, I
set --request-size=32768 to match the source granularity.  (Previously
this was an implicit assumption of the test.)  Note that the preferred
block size in this case is 4K.
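You can reproduce that extent map with a few lines of Python (a sketch of the sparse-array behaviour, not nbdkit's actual code): each allocated byte marks its whole 32K page as data, and everything else is a hole:

```python
# Sketch: an allocated byte marks its entire 32K page as data; merging
# adjacent data pages and filling the gaps with holes reproduces the
# nbdinfo map shown above.
PAGE = 32768
disk_size = 4294967297          # last allocated byte is at 4294967296
offsets = [0, 1073741823, 4294967295, 4294967296]

data_pages = sorted({o // PAGE for o in offsets})
extents, pos = [], 0
for p in data_pages:
    start = p * PAGE
    if pos < start:                                  # gap before this page
        extents.append((pos, start - pos, "hole,zero"))
    end = min(start + PAGE, disk_size)               # clip page to disk end
    if extents and extents[-1][2] == "data" and \
            extents[-1][0] + extents[-1][1] == start:
        o, l, t = extents.pop()                      # merge adjacent data
        extents.append((o, l + end - start, "data"))
    else:
        extents.append((start, end - start, "data"))
    pos = end
if pos < disk_size:
    extents.append((pos, disk_size - pos, "hole,zero"))
for e in extents:
    print(e)
```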

BTW you could argue that --request-size should be smaller by default,
but we found in previous testing that using a small request size
causes dire performance.

> > Without changing these tests, they would make 256K reads & writes
> 
> Why? What advertizes a 256K preferred block size? nbdkit?
> 
> > (and also read from areas of the disk even though we know they are
> > sparse).
> 
> But that's unavoidable, no? (I mean it's inevitable that we now
> investigate (check for zeroes) the internals of these gaps, at request
> boundaries, against the extent map; we can still avoid actually reading
> those bytes.)

We could avoid it, but it's complicated to implement, so I went with
the easier option.

> > I adjusted these tests to use --request-size=32768 to force
> > the existing behaviour.
> > 
> > Note this doesn't attempt to limit the maximum block size when reading
> > or writing.  That is for future work.
> > 
> > This is a partial fix for https://bugzilla.redhat.com/2047660.
> > Further changes will be required in virt-v2v.
> > 
> > Link: https://lists.gnu.org/archive/html/qemu-block/2022-01/threads.html#00729
> > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2047660
> > ---
> >  TODO                          |   4 +-
> >  copy/copy-sparse-allocated.sh |   4 +-
> >  copy/copy-sparse.sh           |   7 +-
> >  copy/main.c                   |   7 ++
> >  copy/multi-thread-copying.c   | 143 +++++++++++++++++++++++++---------
> >  5 files changed, 119 insertions(+), 46 deletions(-)
> 
> I need to understand the design better here, before I can look at the code.

Note there's a v3!

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

