[Libguestfs] [PATCH nbdkit] Experiment with parallel python plugin

Richard W.M. Jones rjones at redhat.com
Thu Aug 6 20:52:40 UTC 2020


On Thu, Aug 06, 2020 at 11:22:00PM +0300, Nir Soffer wrote:
> This is a quick hack to experiment with the parallel threading model in
> the python plugin.
> 
> Changes:
> 
> - Use aligned buffers to make it possible to use O_DIRECT. Using
>   parallel I/O does not buy us much when using buffered I/O.  pwrite()
>   copies data to the page cache, and pread() reads data from the page
>   cache.

O_DIRECT is unfortunately a bit too fragile to consider using
routinely.  But I wonder if one of the posix_fadvise(2) hints could be
used (eg. POSIX_FADV_SEQUENTIAL + POSIX_FADV_DONTNEED).  Adding
fadvise hints as a parameter for the file plugin is a very plausible
change.
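
For illustration only, here is a minimal Python sketch of how those two
hints could be combined on a sequential read. It is not code from the
file plugin or the python example; the function name read_with_hints
and the 2M chunk size are hypothetical, and whether the hints help at
all depends on the kernel and workload.

```python
import os

CHUNK = 2 * 1024 * 1024  # hypothetical chunk size, not taken from nbdkit


def read_with_hints(path, chunk=CHUNK):
    """Read a file front to back, advising the kernel about page cache use.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Declare sequential access over the whole file (length 0 means
        # "to the end of the file").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            data = os.pread(fd, chunk, offset)
            if not data:
                break
            # This range will not be read again, so the kernel may drop
            # the cached pages instead of letting them pile up.
            os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
            offset += len(data)
            total += len(data)
    finally:
        os.close(fd)
    return total
```

Unlike O_DIRECT, this places no alignment requirements on buffers,
offsets or lengths, and the hints are purely advisory.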

> - Disable extents in the file plugin. This way we can compare it with
>   the python file example.
> 
> - Implement flush in the file example.
> 
> With these changes, I could compare the file plugin with the new python
> file example, and it seems that the parallel threading model works
> nicely, and we get similar performance for the case of a fully allocated
> image.
> 
> I created a test image using:
> 
> $ virt-builder fedora-32 -o /var/tmp/fedora-32.raw --root-password=password:root
> 
> And a fully allocated test image using:
> 
> $ fallocate --length 6g /var/tmp/disk.raw
> $ dd if=/var/tmp/fedora-32.raw bs=8M of=/var/tmp/disk.raw iflag=direct oflag=direct conv=fsync,notrunc
> 
> $ qemu-img map --output json /var/tmp/disk.raw
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": false, "data": true, "offset": 0}]
> 
> For reference, copying this image with dd using direct I/O:
> 
> $ dd if=/var/tmp/disk.raw bs=2M of=/dev/shm/disk.raw iflag=direct conv=fsync status=progress
> 6442450944 bytes (6.4 GB, 6.0 GiB) copied, 10.4783 s, 615 MB/s
> 
> Copying the same image with qemu-img convert, disabling zero detection,
> using different numbers of coroutines:
> 
> $ time qemu-img convert -f raw -O raw -T none -S0 -m1 -W /var/tmp/disk.raw /dev/shm/disk.raw
> 
> real	0m11.527s
> user	0m0.102s
> sys	0m2.330s
> 
> $ time qemu-img convert -f raw -O raw -T none -S0 -m2 -W /var/tmp/disk.raw /dev/shm/disk.raw
> 
> real	0m5.971s
> user	0m0.080s
> sys	0m2.749s
> 
> $ time qemu-img convert -f raw -O raw -T none -S0 -m4 -W /var/tmp/disk.raw /dev/shm/disk.raw
> 
> real	0m3.674s
> user	0m0.071s
> sys	0m3.140s
> 
> $ time qemu-img convert -f raw -O raw -T none -S0 -m8 -W /var/tmp/disk.raw /dev/shm/disk.raw
> 
> real	0m3.408s
> user	0m0.069s
> sys	0m3.813s
> 
> $ time qemu-img convert -f raw -O raw -T none -S0 -m16 -W /var/tmp/disk.raw /dev/shm/disk.raw
> 
> real	0m3.305s
> user	0m0.054s
> sys	0m3.767s
> 
> Same with the modified file plugin, using direct I/O and without
> extents:
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t1 -f -r file file=/var/tmp/disk.raw
> $ time qemu-img convert -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m12.167s
> user	0m5.798s
> sys	0m2.477s
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t2 -f -r file file=/var/tmp/disk.raw
> $ time qemu-img convert -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m7.981s
> user	0m5.204s
> sys	0m2.740s
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t4 -f -r file file=/var/tmp/disk.raw
> $ time qemu-img convert -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m6.568s
> user	0m4.996s
> sys	0m3.167s
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t8 -f -r file file=/var/tmp/disk.raw
> $ time qemu-img convert -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m6.493s
> user	0m4.950s
> sys	0m3.492s
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t16 -f -r file file=/var/tmp/disk.raw
> $ time qemu-img convert -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m6.138s
> user	0m4.621s
> sys	0m3.550s
> 
> Finally, same with the python file example:
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t1 -f -r python ./plugins/python/examples/file.py file=/var/tmp/disk.raw
> $ time qemu-img convert -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m12.398s
> user	0m6.652s
> sys	0m2.484s
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t2 -f -r python ./plugins/python/examples/file.py file=/var/tmp/disk.raw
> $ time qemu-img convert -p -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m8.169s
> user	0m5.418s
> sys	0m2.736s
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t4 -f -r python ./plugins/python/examples/file.py file=/var/tmp/disk.raw
> $ time qemu-img convert -p -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m6.419s
> user	0m4.891s
> sys	0m3.103s
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t8 -f -r python ./plugins/python/examples/file.py file=/var/tmp/disk.raw
> $ time qemu-img convert -p -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m6.610s
> user	0m5.115s
> sys	0m3.377s
> 
> $ rm -f /tmp/nbd.sock && ./nbdkit -U /tmp/nbd.sock -t16 -f -r python ./plugins/python/examples/file.py file=/var/tmp/disk.raw
> $ time qemu-img convert -p -f raw -O raw -S0 -m16 -W nbd:unix:/tmp/nbd.sock /dev/shm/disk.raw
> 
> real	0m6.093s
> user	0m4.520s
> sys	0m3.567s

All pretty excellent for an interpreted programming language.
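
To make the comparison easier to read, the small Python sketch below
turns the "real" times quoted above into speedups relative to the -t1
run. The numbers are copied from the quoted results; the helper name
speedup is hypothetical, and these are single runs, so small
differences between the two plugins are probably noise.

```python
# Wall-clock ("real") seconds from the quoted runs, keyed by nbdkit -t value.
file_plugin = {1: 12.167, 2: 7.981, 4: 6.568, 8: 6.493, 16: 6.138}
python_plugin = {1: 12.398, 2: 8.169, 4: 6.419, 8: 6.610, 16: 6.093}


def speedup(times):
    """Return the speedup of each run relative to the single-thread run."""
    base = times[1]
    return {threads: round(base / secs, 2) for threads, secs in times.items()}


if __name__ == "__main__":
    file_speedup = speedup(file_plugin)
    python_speedup = speedup(python_plugin)
    print("threads  file  python")
    for threads in file_plugin:
        print(f"{threads:7d}  {file_speedup[threads]:4.2f}  "
              f"{python_speedup[threads]:6.2f}")
```

Both plugins gain roughly a factor of two going from 1 to 16 threads,
with most of the gain already reached at 4 threads.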

> I think this shows that the parallel threading model works for the python
> plugin as well as for the file plugin.
> ---
>  plugins/file/file.c             |  4 ++--
>  plugins/python/examples/file.py |  5 ++++-
>  server/plugins.c                | 20 ++++++++++++++------
>  server/threadlocal.c            |  7 +++++--
>  4 files changed, 25 insertions(+), 11 deletions(-)
> 
> diff --git a/plugins/file/file.c b/plugins/file/file.c
> index dc99f992..27316b9f 100644
> --- a/plugins/file/file.c
> +++ b/plugins/file/file.c
> @@ -170,7 +170,7 @@ file_open (int readonly)
>      return NULL;
>    }
>  
> -  flags = O_CLOEXEC|O_NOCTTY;
> +  flags = O_CLOEXEC|O_NOCTTY|O_DIRECT;
>    if (readonly)
>      flags |= O_RDONLY;
>    else
> @@ -551,7 +551,7 @@ file_can_extents (void *handle)
>      nbdkit_debug ("extents disabled: lseek: SEEK_HOLE: %m");
>      return 0;
>    }
> -  return 1;
> +  return 0;
>  }
>  
>  static int
> diff --git a/plugins/python/examples/file.py b/plugins/python/examples/file.py
> index 866b8244..3652eb52 100644
> --- a/plugins/python/examples/file.py
> +++ b/plugins/python/examples/file.py
> @@ -49,7 +49,7 @@ def open(readonly):
>          flags = os.O_RDONLY
>      else:
>          flags = os.O_RDWR
> -    fd = os.open(filename, flags)
> +    fd = os.open(filename, flags | os.O_DIRECT)
>      return { 'fd': fd }
>  
>  def get_size(h):
> @@ -65,3 +65,6 @@ def pwrite(h, buf, offset, flags):
>      n = os.pwritev(h['fd'], [buf], offset)
>      if n != len(buf):
>          raise RuntimeError("short write")
> +
> +def flush(h, flags):
> +    os.fsync(h['fd'])
> diff --git a/server/plugins.c b/server/plugins.c
> index d4364cd2..ce4700a3 100644
> --- a/server/plugins.c
> +++ b/server/plugins.c
> @@ -631,6 +631,8 @@ plugin_zero (struct backend *b, void *handle,
>    bool fast_zero = flags & NBDKIT_FLAG_FAST_ZERO;
>    bool emulate = false;
>    bool need_flush = false;
> +  void *zero_buffer = NULL;
> +  int buffer_size = MIN (MAX_REQUEST_SIZE, count);
>  
>    if (fua && backend_can_fua (b) != NBDKIT_FUA_NATIVE) {
>      flags &= ~NBDKIT_FLAG_FUA;
> @@ -669,19 +671,25 @@ plugin_zero (struct backend *b, void *handle,
>    threadlocal_set_error (0);
>    *err = 0;
>  
> +  *err = posix_memalign(&zero_buffer, 4096, buffer_size);
> +  if (*err != 0) {
> +      r = -1;
> +      goto done;
> +  }
> +
> +  memset(zero_buffer, 0, buffer_size);
> +
>    while (count) {
> -    /* Always contains zeroes, but we can't use const or else gcc 9
> -     * will use .rodata instead of .bss and inflate the binary size.
> -     */
> -    static /* const */ char buf[MAX_REQUEST_SIZE];
> -    uint32_t limit = MIN (count, sizeof buf);
> +    uint32_t limit = MIN (count, buffer_size);
>  
> -    r = plugin_pwrite (b, handle, buf, limit, offset, flags, err);
> +    r = plugin_pwrite (b, handle, zero_buffer, limit, offset, flags, err);
>      if (r == -1)
>        break;
>      count -= limit;
>    }
>  
> +  free(zero_buffer);
> +
>   done:
>    if (r != -1 && need_flush)
>      r = plugin_flush (b, handle, 0, err);
> diff --git a/server/threadlocal.c b/server/threadlocal.c
> index 90230028..04c82842 100644
> --- a/server/threadlocal.c
> +++ b/server/threadlocal.c
> @@ -195,13 +195,16 @@ threadlocal_buffer (size_t size)
>  
>    if (threadlocal->buffer_size < size) {
>      void *ptr;
> +    int err;
>  
> -    ptr = realloc (threadlocal->buffer, size);
> -    if (ptr == NULL) {
> +    err = posix_memalign (&ptr, 4096, size);
> +    if (err != 0) {
>        nbdkit_error ("threadlocal_buffer: realloc: %m");
>        return NULL;
>      }
> +
>      memset (ptr, 0, size);
> +    free(threadlocal->buffer);
>      threadlocal->buffer = ptr;
>      threadlocal->buffer_size = size;
>    }
> -- 
> 2.25.4

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html



