[Libguestfs] [PATCH libnbd 2/2] copy: Set default request-size to 2**18 (262144 bytes)

Nir Soffer nsoffer at redhat.com
Sun Jun 20 19:21:52 UTC 2021


On Sun, Jun 20, 2021 at 7:46 PM Richard W.M. Jones <rjones at redhat.com> wrote:
>
> As Nir has often pointed out, our current default request buffer size
> (32MB) is too large, resulting in nbdcopy being as much as 2½ times
> slower than it could be.
>
> The optimum buffer size most likely depends on the hardware, and may
> even vary over time as machines get generally larger caches.  To
> explore the problem I used this command:
>
> $ hyperfine -P rs 15 25 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**{rs})) \$uri \$uri"'

This uses the same process for serving both reads and writes, which may
differ from real-world usage, where one process is used for reading and
another for writing.
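
For example, to take the shared process out of the picture, one could
nest two nbdkit instances so that nbdcopy reads from one server and
writes to the other. A rough, untested sketch along those lines, using
the memory plugin as a stand-in destination (names and sizes are only
illustrative):

$ nbdkit -U - sparse-random size=100G seed=1 \
    --run 'src=$uri; nbdkit -U - memory size=100G --run "nbdcopy --request-size=$((2**18)) $src \$uri"'

Note that this still keeps both ends in RAM, so it only removes the
shared process, not the latency of real storage.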

> On my 2019-era AMD server with 32GB of RAM and 64MB * 4 of L3 cache,
> 2**18 (262144) was the optimum when I tested all sizes between
> 2**15 (32K) and 2**25 (32M, the current default).
>
> Summary
>   'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"' ran
>     1.03 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
>     1.06 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
>     1.09 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'

The difference is very small up to this point.

>     1.23 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
>     1.26 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
>     1.39 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
>     1.45 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
>     1.61 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
>     1.94 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
>     2.47 ± 0.08 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'
>
> On my 2018-era Intel laptop with a measly 8 MB of L3 cache, the optimum
> size is one power-of-2 smaller (but 2**18 is still an improvement):
>
> Summary
>   'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"' ran

This matches the results I got when testing the libev example on a
Lenovo T480s (~2018) and a Dell Optiplex 9080 (~2012).

>     1.05 ± 0.19 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
>     1.06 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
>     1.10 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
>     1.22 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
>     1.29 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
>     1.33 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
>     1.35 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
>     1.38 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
>     1.45 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
>     1.63 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'
>
> To get an idea of the best request size on something rather different,
> this is a Raspberry Pi 4B.  I had to reduce the copy size down by a
> factor of 10 (to 10G) to make it run in a reasonable time.  2**18 is
> about 8% slower than the optimum choice (2**15).  It's still
> significantly better than our current default.
>
> Summary
>   'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"' ran
>     1.00 ± 0.04 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
>     1.03 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
>     1.04 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
>     1.05 ± 0.08 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
>     1.05 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
>     1.07 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
>     1.08 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
>     1.15 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
>     1.28 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
>     1.35 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'

But none of these results test a real-world copy; they test copying
from memory to memory with practically zero latency.

When I tested with real storage on a real server, I got the best
results using 16 requests, one connection, and a request size of 1 MiB.

Using 4 connections with 4 requests per connection and the same
request size seems to be ~10% faster under these conditions.

I posted more info on these tests here:
https://listman.redhat.com/archives/libguestfs/2021-May/msg00124.html

Of course, testing with other servers or storage may show different
results, and it is impossible to find one value that works best in all
cases.

I think we need to test both the number of requests and the number of
connections to improve the defaults.
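
For example, the same hyperfine approach can be extended to sweep the
number of connections and requests as well (the values below are only
illustrative, and this still measures a memory-to-memory copy unless
the URIs point at real servers):

$ hyperfine -L c 1,2,4 -L r 4,8,16 \
    'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --connections={c} --requests={r} --request-size=\$((2**20)) \$uri \$uri"'

hyperfine benchmarks every combination of the two parameter lists, so
the output makes it easy to compare connection/request pairs directly.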

> ---
>  copy/main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/copy/main.c b/copy/main.c
> index 0fddfc3..70534b5 100644
> --- a/copy/main.c
> +++ b/copy/main.c
> @@ -50,7 +50,7 @@ bool flush;                     /* --flush flag */
>  unsigned max_requests = 64;     /* --requests */
>  bool progress;                  /* -p flag */
>  int progress_fd = -1;           /* --progress=FD */
> -unsigned request_size = MAX_REQUEST_SIZE;  /* --request-size */
> +unsigned request_size = 1<<18;  /* --request-size */

But this is clearly a better default.

>  unsigned sparse_size = 4096;    /* --sparse */
>  bool synchronous;               /* --synchronous flag */
>  unsigned threads;               /* --threads */
> --
> 2.32.0
>

Nir




