[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy

Richard W.M. Jones rjones at redhat.com
Wed May 26 14:15:13 UTC 2021


On Wed, May 26, 2021 at 04:49:50PM +0300, Nir Soffer wrote:
> On Wed, May 26, 2021 at 4:03 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> > In my testing, nbdcopy is a clear 4x faster than qemu-img convert, with
> > 4 also happening to be the default number of connections/threads.
> > Why use nbdcopy --connections=1?  That completely disables threads in
> > nbdcopy.
> 
> Because qemu-nbd does not report multi-conn when writing, so in
> practice you get only one NBD handle for writing.

Let's see if we can fix that.  Crippling nbdcopy because of a missing
feature in qemu-nbd isn't right.  I wonder what Eric's reasoning is
for multi-conn not being safe.
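
(As an aside, assuming qemu-nbd is serving on localhost and the URI is
adjusted to your setup, nbdinfo shows whether multi-conn is advertised;
the exact output fields and options depend on your libnbd version:)

  $ nbdinfo nbd://localhost      # look for can_multi_conn in the output
  $ nbdinfo --can multi-conn nbd://localhost && echo multi-conn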

> > Also I'm not sure if --flush is fair (it depends on what
> > qemu-img does, which I don't know).
> 
> qemu is flushing at the end of the operation. Not flushing is cheating :-)

That's fair enough.  I will add that flag to my future tests.
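
(Concretely, that means adding nbdcopy's --flush flag so both tools pay
for the final flush; the URIs below are only placeholders for whatever
source and destination are being benchmarked:)

  $ nbdcopy --flush nbd://localhost/src nbd://localhost/dst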

I also pushed these commits to disable malloc checking outside tests:

  https://gitlab.com/nbdkit/libnbd/-/commit/88e72dcb1631b315957f5f98e3cdfcdd1fd0fe29
  https://gitlab.com/nbdkit/nbdkit/-/commit/6039780f3bb0617650fa1fa4c1399b0d3f1dcb26

> > The other interesting things are the qemu-img convert flags you're using:
> >
> >  -m 16  number of coroutines, default is 8
> 
> We use 8 in RHV since the difference is very small, and when running
> concurrent copies it does not matter. Since we use up to 64 concurrent
> requests in nbdcopy, it is useful to compare a similar setup in qemu.

I'm not really clear on the relationship (in qemu-img) between number
of coroutines, number of pthreads and number of requests in flight.
At this rate I'm going to have to look at the source :-)
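
(For comparison, the knobs on the nbdcopy side are --connections,
--requests and --threads; a sketch of the kind of "similar setup"
invocation, again with placeholder URIs, would be:)

  $ nbdcopy --connections=1 --requests=64 --flush \
      nbd://localhost/src nbd://localhost/dst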

> >  -W     out of order writes, but the manual says "This is only recommended
> >         for preallocated devices like host devices or other raw block
> >         devices" which is a very unclear recommendation to me.
> >         What's special about host devices versus (eg) files or
> >         qcow2 files which means -W wouldn't always be recommended?
> 
> This is how RHV uses qemu-img convert when copying to raw preallocated
> volumes. Using -W can be up to 6x faster. We use the same for imageio
> for any type of disk. This is the reason I tested this way.
> 
> -W is equivalent to the nbdcopy multithreaded copy using a single connection.
>
> qemu-img does N concurrent reads. If you don't specify -W, it writes
> the data in the right order (based on offset). If a read has not
> finished, the copy loop waits until the read completes before
> writing. This ensures exactly one concurrent write, and it is much
> slower.

Thanks - interesting.  Still not sure why you wouldn't want to use
this flag all the time.
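
(For the record, the kind of qemu-img command being compared here, with
placeholder URIs; -n skips creating the target, -m sets the number of
coroutines and -W permits out-of-order writes:)

  $ qemu-img convert -n -m 16 -W -O raw \
      nbd://localhost/src nbd://localhost/dst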

See also:
https://lists.nongnu.org/archive/html/qemu-discuss/2021-05/msg00070.html

...
> This shows that nbdcopy works better when the latency is
> (practically) zero, copying data from memory to memory. This is
> useful for minimizing overhead in nbdcopy, but when copying real
> images on real storage with direct I/O, the time to write the data
> to storage hides everything else.
>
> Would it be useful to add latency in the sparse-random plugin, so it
> behaves more like real storage? (or maybe it is already possible
> with a filter?)

We could use one of these filters:
https://libguestfs.org/nbdkit-delay-filter.1.html
https://libguestfs.org/nbdkit-rate-filter.1.html

Something like "--filter=delay wdelay=1ms" might be more realistic.
To simulate NVMe we might need to be able to specify microseconds there.
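
(A minimal sketch of what I have in mind, reusing the sparse-random
plugin; the size, seed and 1ms values are only placeholders:)

  $ nbdkit -U - --filter=delay sparse-random size=1T seed=3 \
      rdelay=1ms wdelay=1ms \
      --run 'nbdcopy --flush "$uri" null:'

(With a null: destination only the read delay matters; to exercise the
write path you would point the copy at a second, similarly delayed
nbdkit instance instead.)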

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
