[Libguestfs] Some questions about nbdkit vs qemu performance affecting virt-v2v

Eric Blake eblake at redhat.com
Thu Jul 29 01:50:10 UTC 2021


On Tue, Jul 27, 2021 at 12:16:59PM +0100, Richard W.M. Jones wrote:
> Hi Eric, a couple of questions below about nbdkit performance.
> 
> Modular virt-v2v will use disk pipelines everywhere.  The input
> pipeline looks something like this:
> 
>   socket <- cow filter <- cache filter <-   nbdkit
>                                            curl|vddk
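
(For reference, the input side of such a pipeline can be brought up on
its own with something along these lines; the socket path and source
URL here are only placeholders:

  $ nbdkit -U /tmp/v2v-in.sock \
       --filter=cow --filter=cache \
       curl url=https://example.com/source-disk.img

where the first filter listed is the one nearest the client socket.)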
> 
> We found there's a notable slowdown in at least one case: when the
> source plugin is very slow (e.g. the curl plugin fetching from a slow,
> remote website, or VDDK in general), everything runs very slowly.
> 
> I made a simple test case to demonstrate this:
> 
> $ virt-builder fedora-33
> $ time ./nbdkit --filter=cache --filter=delay file /var/tmp/fedora-33.img delay-read=500ms --run 'virt-inspector --format=raw -a "$uri" -vx'
> 
> This uses a local file with the delay filter on top injecting half
> second delays into every read.  It "feels" a lot like the slow case we
> were observing.  Virt-v2v also does inspection as a first step when
> converting an image, so using virt-inspector is somewhat realistic.
> 
> Unfortunately this actually runs far too slowly for me to wait around
> - at least 30 mins, and probably a lot longer.  This compares to only
> 7 seconds if you remove the delay filter.
> 
> Reducing the delay to 50ms means at least it finishes in a reasonable time:
> 
> $ time ./nbdkit --filter=cache --filter=delay file /var/tmp/fedora-33.img \
>      delay-read=50ms \
>      --run 'virt-inspector --format=raw -a "$uri"'
> 
> real    5m16.298s
> user    0m0.509s
> sys     0m2.894s

Sounds like the reads are rather serialized: the application does not
issue a second read until it has the result of the first, rather than
reading multiple areas of the image in parallel (for example, starting
reads at two different offsets before knowing which of those offsets
is even useful).  There's also the question of how frequently a given
portion of the disk image is re-read: caching speeds things up if data
is revisited multiple times, but only adds overhead if each region is
read exactly once in the life of the process.
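
This is not what virt-inspector itself does, but nbdcopy makes the
serial-versus-parallel difference easy to see against the same delayed
server (the flag values below are just illustrative):

  $ nbdkit --filter=delay file /var/tmp/fedora-33.img delay-read=50ms \
       --run 'nbdcopy --connections=1 --requests=1 "$uri" null:'
  $ nbdkit --filter=delay file /var/tmp/fedora-33.img delay-read=50ms \
       --run 'nbdcopy --requests=64 "$uri" null:'

With one request in flight every 50ms delay is paid serially; with many
requests in flight the delays overlap and the total time drops
dramatically.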

> 
> In the above scenario the cache filter is not actually doing anything
> (since virt-inspector does not write).  Adding cache-on-read=true lets
> us cache the reads, avoiding going through the "slow" plugin in many
> cases, and the result is a lot better:
> 
> $ time ./nbdkit --filter=cache --filter=delay file /var/tmp/fedora-33.img \
>      delay-read=50ms cache-on-read=true \
>      --run 'virt-inspector --format=raw -a "$uri"'
> 
> real    0m27.731s
> user    0m0.304s
> sys     0m1.771s

Okay, that sounds like there is indeed frequent re-reading of portions
of the disk (or at least reading of nearby smaller offsets that fall
within the same larger granularity used by the cache).
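
One way to measure that rather than guess is to slip the stats filter
in underneath the cache, so it only counts the reads that miss the
cache and actually reach the slow plugin; something along these lines
(the statsfile location is arbitrary):

  $ nbdkit --filter=cache --filter=stats --filter=delay \
       file /var/tmp/fedora-33.img \
       cache-on-read=true statsfile=/dev/stderr delay-read=50ms \
       --run 'virt-inspector --format=raw -a "$uri"'

Comparing the counts printed on exit with and without cache-on-read=true
shows how much of the access pattern is genuinely repeated.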

> 
> However this is still slower than the old method which used qcow2 +
> qemu's copy-on-read.  It's harder to demonstrate this, but I modified
> virt-inspector to use the copy-on-read setting (which it doesn't do
> normally).  On top of nbdkit with 50ms delay and no other filters:
> 
> qemu + copy-on-read backed by nbdkit delay-read=50ms file:
> real    0m23.251s

qemu's copy-on-read relies on a qcow2 image backed by a read-only base
image; any read that the qcow2 can't satisfy causes the entire cluster
to be read from the backing image into the qcow2 file, even if that
cluster is larger than what the client was actually reading.  It
benefits from the same speedup of hitting a given region of the
backing file only once in the life of the process.

But it also assumes the presence of a backing chain.  If you try to
use copy-on-read on something that does not have a backing chain (such
as a direct use of an NBD link), performance suffers, as we discussed
on IRC.  My understanding is that for every read operation the COR
code does a block status query to see whether the data is local or
comes from the backing chain; but for an NBD image, which has no
backing chain from qemu's point of view, EVERY block status operation
reports the data as local, and the COR code has nothing further to do.
The performance penalty is the extra time spent on that block status
call, particularly if it results in another round-trip NBD command
over the wire before any reading happens.
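
In other words, COR only pays off when qemu actually sees a backing
chain, for example a local qcow2 overlay whose backing file is the NBD
connection.  Roughly (the socket and file names are placeholders):

  $ qemu-img create -f qcow2 \
       -b 'nbd+unix:///?socket=/tmp/nbdkit.sock' -F raw \
       /var/tmp/overlay.qcow2

and then the overlay is opened with copy-on-read enabled (for a guest,
something like -drive file=/var/tmp/overlay.qcow2,format=qcow2,copy-on-read=on),
so that clusters pulled over NBD are stored locally and the block
status query has something useful to report.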

> 
> So 23s is the time to beat.  (I believe that with longer delays, the
> gap between qemu and nbdkit increases in favour of qemu.)
> 
> Q1: What other ideas could we explore to improve performance?

Have you played with block sizing?  (Reading the git log, you have...)
Part of qemu's COR behavior is that for any read not found in the
qcow2 active layer, the entire cluster is copied up the backing chain;
a 512-byte client read becomes a 64k cluster read with the default
qcow2 cluster size.  Other block sizes may be more efficient, such as
1M per request actually sent over the wire.
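
On the nbdkit side, one way to experiment with this is the blocksize
filter: placing it above the cache rounds small client reads up to a
larger granularity before they are cached, which is roughly what
qemu's COR does with clusters.  Something like this, with 64K picked
arbitrarily:

  $ nbdkit --filter=blocksize --filter=cache --filter=delay \
       file /var/tmp/fedora-33.img \
       minblock=64K cache-on-read=true delay-read=50ms \
       --run 'virt-inspector --format=raw -a "$uri"'

Each cache miss then costs one delayed 64K read instead of several
small ones, at the price of reading data the client may never ask for.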

> 
> - - -
> 
> In real scenarios we'll actually want to combine cow + cache, where
> cow is caching writes, and cache is caching reads.
> 
>   socket <- cow filter <- cache filter   <-  nbdkit
>                        cache-on-read=true   curl|vddk
> 
> The cow filter is necessary to prevent changes being written back to
> the pristine source image.
> 
> This is actually surprisingly efficient, making no noticeable
> difference in this test:
> 
> time ./nbdkit --filter=cow --filter=cache --filter=delay \
>      file /var/tmp/fedora-33.img \
>      delay-read=50ms cache-on-read=true \
>      --run 'virt-inspector --format=raw -a "$uri"' 
> 
> real	0m27.193s
> user	0m0.283s
> sys	0m1.776s
> 
> Q2: Should we consider a "cow-on-read" flag to the cow filter (thus
> removing the need to use the cache filter at all)?

Since cow is already a form of caching (anything we touched now lives
locally, so we don't have to re-visit the original data source), yes,
it makes sense to have a cow-on-read mode that stores even reads
locally.
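
If such a flag were added, the test above would collapse to a single
filter, something along these lines (note that cow-on-read=true is the
proposed parameter, not one that exists today):

  $ nbdkit --filter=cow --filter=delay file /var/tmp/fedora-33.img \
       delay-read=50ms cow-on-read=true \
       --run 'virt-inspector --format=raw -a "$uri"'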

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



