[Libguestfs] [PATCH libnbd 3/3] examples: copy-libev.c: Support extents

Sun Mar 7 23:26:44 UTC 2021

On Mon, Mar 8, 2021 at 12:59 AM Nir Soffer <nsoffer at redhat.com> wrote:

> On Mon, Mar 8, 2021 at 12:25 AM Richard W.M. Jones <rjones at redhat.com>
> wrote:
>
>> On Sat, Mar 06, 2021 at 11:21:22PM +0200, Nir Soffer wrote:
>> > Handling extents is complicated, in particular when using async block
>> > status. But I could not get this working using sync block status; It
>> > seems that libnbd eats the events on the source, and then we hang
>> > forever waiting for inflight reads.
>>
>> Is this a bug in libnbd?
>>
>
> Maybe, I can try to post a simple reproducer later. Even if this is not a
> bug
> it would be nice to show how sync calls can be mixed with async call with
> libev (or another event loop).
>
> Also is libev going to be faster than poll(2) [as used by nbdcopy] for
>> copying?  I would imagine that it should make hardly any difference
>> since we're only using a couple of file descriptors with poll(2) in
>> nbdcopy, and it's my understanding that poll(2) is only pessimal when
>> you have to poll large numbers of FDs.
>>
>
> I think that poll or select should be faster for 2 file descriptors, but
> the
> difference is negligible in this context.
>

Some data when copying fredora 32 qcow2 image to nbdkit memory plugin:

$ strace -c ./copy-libev $SRC $DST
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.07    0.271421          17     15728         4 recvfrom
 39.10    0.211987          30      6928           sendto
  8.25    0.044734           4      9114           epoll_wait
  1.13    0.006102           4      1335           epoll_ctl
  1.01    0.005484         457        12         6 wait4
  0.10    0.000522           4       122           mmap
  0.06    0.000311           3        78        48 openat
  0.05    0.000298          49         6           clone
  0.04    0.000227           5        43           read
  0.03    0.000143           3        42           close
  0.02    0.000135           2        52           rt_sigprocmask
  0.02    0.000132           3        37           rt_sigaction
  0.02    0.000105           3        32           mprotect
  0.02    0.000100           3        30        15 stat
  0.02    0.000095          47         2           execve
  0.01    0.000061           2        29           fstat
  0.01    0.000044           4         9           poll
  0.01    0.000038           2        16           pread64
  0.00    0.000027           5         5           pipe
  0.00    0.000021           1        13           brk
  0.00    0.000019           3         6           rt_sigreturn
  0.00    0.000017           1        11         2 ioctl
  0.00    0.000015           7         2           munmap
  0.00    0.000015           7         2           connect
  0.00    0.000010           5         2           socket
  0.00    0.000007           3         2           shutdown
  0.00    0.000007           7         1           sysinfo
  0.00    0.000005           2         2           getgid
  0.00    0.000004           0         5         4 access
  0.00    0.000004           2         2           getsockopt
  0.00    0.000004           4         1           uname
  0.00    0.000004           2         2           getuid
  0.00    0.000004           2         2           geteuid
  0.00    0.000004           2         2           getegid
  0.00    0.000004           1         4         2 arch_prctl
  0.00    0.000003           1         2         2 setsockopt
  0.00    0.000002           0         5           lseek
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         2           getpid
  0.00    0.000000           0         5         1 fcntl
  0.00    0.000000           0         1           getppid
  0.00    0.000000           0         1           getpgrp
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           eventfd2
  0.00    0.000000           0         1           epoll_create1
  0.00    0.000000           0         3           prlimit64
  0.00    0.000000           0         1           getrandom
------ ----------- ----------- --------- --------- ----------------
100.00    0.542115          16     33704        84 total

$ strace -f -c ../copy/nbdcopy --sparse=1048576 --request-size=1048576
--requests=16 --connections=1 $SRC $DST
strace: Process 1094611 attached
strace: Process 1094612 attached
strace: Process 1094613 attached
strace: Process 1094614 attached
strace: Process 1094615 attached
strace: Process 1094616 attached
strace: Process 1094617 attached
strace: Process 1094618 attached
strace: Process 1094619 attached
strace: Process 1094620 attached
strace: Process 1094621 attached
strace: Process 1094622 attached
strace: Process 1094623 attached
strace: Process 1094641 attached

(Not sure why we start so many threads with --connections=1 - bug?)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 58.10    0.881783      146963         6           futex
 18.70    0.283841          17     16634         2 recvfrom
 17.42    0.264326          31      8398         4 sendto
  2.73    0.041494           4      8877           poll
  1.97    0.029895        1299        23        10 wait4
  0.46    0.006970         240        29           madvise
  0.13    0.001903           6       311           mmap
  0.09    0.001330         190         7           execve
  0.07    0.001041          40        26           munmap
  0.05    0.000825           5       157        67 openat
  0.04    0.000557           5       100           mprotect
  0.04    0.000548           4       113           read
  0.04    0.000545          38        14           clone
  0.03    0.000440           2       151         6 close
  0.02    0.000329           3        97           fstat
  0.02    0.000267           2       125           rt_sigprocmask
  0.02    0.000251           3        66           pread64
  0.01    0.000214           1       148           rt_sigaction
  0.01    0.000188          18        10           statfs
  0.01    0.000145           5        28           brk
  0.01    0.000143           3        37           write
  0.01    0.000140           5        24        10 access
  0.01    0.000083           1        69        31 stat
  0.00    0.000063          10         6           socket
  0.00    0.000049           3        15         6 ioctl
  0.00    0.000047          15         3         1 lstat
  0.00    0.000045           7         6         4 connect
  0.00    0.000032           4         8           pipe
  0.00    0.000031           2        14         7 arch_prctl
  0.00    0.000028           2        10           rt_sigreturn
  0.00    0.000023           2         8           prlimit64
  0.00    0.000021           3         7           set_robust_list
  0.00    0.000019           3         6           set_tid_address
  0.00    0.000011          11         1           lgetxattr
  0.00    0.000009           4         2           shutdown
  0.00    0.000008           1         6           lseek
  0.00    0.000007           7         1         1 getxattr
  0.00    0.000006           0        15           getpid
  0.00    0.000006           6         1           sysinfo
  0.00    0.000005           5         1           uname
  0.00    0.000004           0        16           dup2
  0.00    0.000000           0         2         2 setsockopt
  0.00    0.000000           0         2           getsockopt
  0.00    0.000000           0         4         1 fcntl
  0.00    0.000000           0         1           chdir
  0.00    0.000000           0         9           getuid
  0.00    0.000000           0         9           getgid
  0.00    0.000000           0         9           geteuid
  0.00    0.000000           0         9           getegid
  0.00    0.000000           0         1           getppid
  0.00    0.000000           0         1           getpgrp
  0.00    0.000000           0         2           getdents64
  0.00    0.000000           0         1           getrandom
------ ----------- ----------- --------- --------- ----------------
100.00    1.517672          42     35626       152 total

If you want to use poll in libev, you can:

diff --git a/examples/copy-libev.c b/examples/copy-libev.c
index 6e6cbcb..1878f27 100644
--- a/examples/copy-libev.c
+++ b/examples/copy-libev.c
@@ -530,7 +530,7 @@ main (int argc, char *argv[])
 {
     int i;

-    loop = EV_DEFAULT;
+    loop = ev_loop_new (EVBACKEND_POLL);

     if (argc != 3)
         FAIL ("Usage: %s src-uri dst-uri", PROG);

But it does not look useful:

$ strace -c ./copy-libev $SRC $DST
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.50    0.261526          16     15626         1 recvfrom
 39.01    0.210365          30      7007           sendto
  8.24    0.044413           4      9214           poll

The advantage in using an event loop is an easier way to extend the
> application,
> for example I'm using async block status, this will be much harder to do
> in nbdcopy.
>
> Another advantage is adding more features like handling signals, timeouts,
> child processes, or other other network protocols (e.g. nbd <-> http).
>
>
>>
>> Anyway patch series is fine, ACK.
>>
>> Rich.
>>
>> > Since getting extents is asynchronous, and starting new request depends
>> > on being able to get the next extent, requests have now a new a waiting
>> > state. When a request is started when extents are not available, it is
>> > marked as waiting. When extents request is completed, we start all
>> > waiting requests.
>> >
>> > Here is example log showing whats going on while copying fedora 32
>> > image:
>> >
>> > The first request detect that we don't have extents yet, so it starts
>> > aync block status request.
>> >
>> > copy-libev: r0: start extents offset=0 count=134217728
>> > copy-libev: r0: received 14 extents for base:allocation
>> > copy-libev: r0: extents completed time=0.001098
>> >
>> > When extent request completes, we start al waiting requests. The first
>> > request (r0) looked into the first extent (e0) and consumed all of it:
>> >
>> > copy-libev: r0: start request offset=0
>> > copy-libev: e0: offset=0 len=65536 zero=0
>> > copy-libev: r0: extent offset=0 len=65536 zero=0
>> > copy-libev: r0: start read offset=0 len=65536
>> >
>> > The second request (r15) is looking into the second extent (e1) and
>> > consume all of it, starting a zero request:
>> >
>> > copy-libev: r15: start request offset=65536
>> > copy-libev: e1: offset=65536 len=983040 zero=1
>> > copy-libev: r15: extent offset=65536 len=983040 zero=1
>> > copy-libev: r15: start zero offset=65536 len=983040
>> >
>> > ...
>> >
>> > Request (r12) looked into the fifth extent (e4), but since this extent
>> > was large (10747904), it consume only 1m from it:
>> >
>> > copy-libev: r12: start request offset=2097152
>> > copy-libev: e4: offset=2097152 len=10747904 zero=0
>> > copy-libev: r12: extent offset=2097152 len=1048576 zero=0
>> > copy-libev: r12: start read offset=2097152 len=1048576
>> >
>> > The next request consumed the next 1m from the same extent (e4):
>> >
>> > copy-libev: r11: start request offset=3145728
>> > copy-libev: e4: offset=3145728 len=9699328 zero=0
>> > copy-libev: r11: extent offset=3145728 len=1048576 zero=0
>> > copy-libev: r11: start read offset=3145728 len=1048576
>> > copy-libev: r10: start request offset=4194304
>> >
>> > ..
>> >
>> > The last part of extent e4 was consumed, and we switched to extent e5:
>> >
>> > copy-libev: r2: start request offset=12582912
>> > copy-libev: e4: offset=12582912 len=262144 zero=0
>> > copy-libev: r2: extent offset=12582912 len=262144 zero=0
>> > copy-libev: r2: start read offset=12582912 len=262144
>> > copy-libev: r1: start request offset=12845056
>> > copy-libev: e5: offset=12845056 len=131072 zero=1
>> > copy-libev: r1: extent offset=12845056 len=131072 zero=1
>> > copy-libev: r1: start zero offset=12845056 len=131072
>> >
>> > ...
>> >
>> > Request (r11) consumed the last extent (e13), starting a zero request.
>> > This free the extents array:
>> >
>> > copy-libev: r11: start request offset=133955584
>> > copy-libev: e13: offset=133955584 len=262144 zero=1
>> > copy-libev: r11: extent offset=133955584 len=262144 zero=1
>> > copy-libev: r11: consumed all extents offset=134217728
>> > copy-libev: r11: start zero offset=133955584 len=262144
>> >
>> > ...
>> >
>> > Request (r12) started when extents array as cleared, so it started new
>> > block status request:
>> >
>> > copy-libev: r12: start extents offset=134217728 count=134217728
>> > ...
>> > copy-libev: r12: received 3 extents for base:allocation
>> > copy-libev: r12: extents completed time=0.003027
>> >
>> > ...
>> >
>> > The rest of the flow is same as before. When all requests are done,
>> > we shutdown the event loop and flush:
>> >
>> > copy-libev: r4: request completed offset=6438256640 len=1048576
>> time=0.000132
>> > copy-libev: r1: read completed offset=6442385408 len=65536
>> > copy-libev: r1: start write offset=6442385408 len=65536
>> > copy-libev: r14: request completed offset=6439305216 len=1048576
>> time=0.000126
>> > copy-libev: r8: request completed offset=6440353792 len=1048576
>> time=0.000151
>> > copy-libev: r2: request completed offset=6441402368 len=983040
>> time=0.000143
>> > copy-libev: r1: request completed offset=6442385408 len=65536
>> time=0.000142
>> > copy-libev: flush
>> >
>> > Signed-off-by: Nir Soffer <nsoffer at redhat.com>
>> > ---
>> >  examples/copy-libev.c | 232 ++++++++++++++++++++++++++++++++++++++++--
>> >  1 file changed, 224 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/examples/copy-libev.c b/examples/copy-libev.c
>> > index 84d5c03..3030955 100644
>> > --- a/examples/copy-libev.c
>> > +++ b/examples/copy-libev.c
>> > @@ -41,6 +41,7 @@
>> >   */
>> >  #define MAX_REQUESTS 16
>> >  #define REQUEST_SIZE (1024 * 1024)
>> > +#define EXTENTS_SIZE (128 * 1024 * 1024)
>> >
>> >  #define MIN(a,b) (a) < (b) ? (a) : (b)
>> >
>> > @@ -62,14 +63,18 @@ struct connection {
>> >      ev_io watcher;
>> >      struct nbd_handle *nbd;
>> >      bool can_zero;
>> > +    bool can_extents;
>> >  };
>> >
>> >  struct request {
>> > +    ev_timer watcher;       /* For starting on next loop iteration. */
>> >      int64_t offset;
>> >      size_t length;
>> > +    bool zero;
>> >      unsigned char *data;
>> >      size_t index;
>> >      ev_tstamp started;
>> > +    bool waiting;           /* Waiting for extents completion. */
>> >  };
>> >
>> >  static struct ev_loop *loop;
>> > @@ -77,11 +82,29 @@ static ev_prepare prepare;
>> >  static struct connection src;
>> >  static struct connection dst;
>> >  static struct request requests[MAX_REQUESTS];
>> > +
>> > +/* List of extents received from source server. Using the same format
>> returned
>> > + * by libnbd, array of uint32_t pairs. The first item is the length of
>> the
>> > + * extent, and the second is the extent flags.
>> > + *
>> > + * The number of extents is extents_len / 2. extents_pos is the index
>> of the
>> > + * current extent.
>> > + *
>> > + * extents_in_progress flag is set when we start asynchronous block
>> status
>> > + * request.
>> > + */
>> > +static uint32_t *extents;
>> > +static size_t extents_len;
>> > +static size_t extents_pos;
>> > +static bool extents_in_progress;
>> > +
>> >  static int64_t size;
>> >  static int64_t offset;
>> >  static int64_t written;
>> >  static bool debug;
>> >
>> > +static inline void start_request_soon (struct request *r);
>> > +static void start_request_cb (struct ev_loop *loop, ev_timer *w, int
>> revents);
>> >  static void start_request(struct request *r);
>> >  static void start_read(struct request *r);
>> >  static void start_write(struct request *r);
>> > @@ -133,23 +156,206 @@ get_events(struct connection *c)
>> >          default:
>> >              return 0;
>> >      }
>> > +}
>> > +
>> > +static int
>> > +extent_callback (void *user_data, const char *metacontext, uint64_t
>> offset,
>> > +                 uint32_t *entries, size_t nr_entries, int *error)
>> > +{
>> > +    struct request *r = user_data;
>> > +
>> > +    if (strcmp (metacontext, LIBNBD_CONTEXT_BASE_ALLOCATION) != 0) {
>> > +        DEBUG ("Unexpected meta context: %s", metacontext);
>> > +        return 1;
>> > +    }
>> > +
>> > +    extents = malloc (nr_entries * sizeof *extents);
>> > +    if (extents == NULL)
>> > +        FAIL ("Cannot allocated extents: %s", strerror (errno));
>> > +
>> > +    memcpy (extents, entries, nr_entries * sizeof *extents);
>> > +    extents_len = nr_entries;
>> > +
>> > +    DEBUG ("r%d: received %d extents for %s",
>> > +           r->index, nr_entries / 2, metacontext);
>> > +
>> > +    return 1;
>> > +}
>> > +
>> > +static int
>> > +extents_completed (void *user_data, int *error)
>> > +{
>> > +    struct request *r = (struct request *)user_data;
>> > +    int i;
>> > +
>> > +    DEBUG ("r%d: extents completed time=%.6f",
>> > +           r->index, ev_now (loop) - r->started);
>> > +
>> > +    extents_in_progress = false;
>> > +
>> > +    if (extents == NULL) {
>> > +        DEBUG ("r%d: received no extents, disabling extents",
>> r->index);
>> > +        src.can_extents = false;
>> > +    }
>> >
>> > +    /* Start requests waiting for extents completion on the next loop
>> > +     * iteration, to avoid deadlock if we need to start a read.
>> > +     */
>> > +    for (i = 0; i < MAX_REQUESTS; i++) {
>> > +        struct request *r = &requests[i];
>> > +        if (r->waiting) {
>> > +            r->waiting = false;
>> > +            start_request_soon (r);
>> > +       }
>> > +    }
>> >
>> > +    return 1;
>> > +}
>> > +
>> > +static bool
>> > +start_extents (struct request *r)
>> > +{
>> > +    size_t count = MIN (EXTENTS_SIZE, size - offset);
>> > +    int64_t cookie;
>> > +
>> > +    if (extents_in_progress) {
>> > +        r->waiting = true;
>> > +        return true;
>> > +    }
>> > +
>> > +    DEBUG ("r%d: start extents offset=%ld count=%ld", r->index,
>> offset, count);
>> > +
>> > +    cookie = nbd_aio_block_status (
>> > +        src.nbd, count, offset,
>> > +        (nbd_extent_callback) { .callback=extent_callback,
>> > +                                .user_data=r },
>> > +        (nbd_completion_callback) { .callback=extents_completed,
>> > +                                    .user_data=r },
>> > +        0);
>> > +    if (cookie == -1) {
>> > +        DEBUG ("Cannot get extents: %s", nbd_get_error ());
>> > +        src.can_extents = false;
>> > +        return false;
>> > +    }
>> > +
>> > +    r->waiting = true;
>> > +    extents_in_progress = true;
>> > +
>> > +    return true;
>> > +}
>> > +
>> > +/* Return next extent to process. */
>> > +static void
>> > +next_extent (struct request *r)
>> > +{
>> > +    uint32_t limit = MIN (REQUEST_SIZE, size - offset);
>> > +    uint32_t length = 0;
>> > +    bool is_zero;
>> > +
>> > +    assert (extents);
>> > +
>> > +    is_zero = extents[extents_pos + 1] & LIBNBD_STATE_ZERO;
>> > +
>> > +    while (length < limit) {
>> > +        DEBUG ("e%d: offset=%ld len=%ld zero=%d",
>> > +               extents_pos / 2, offset, extents[extents_pos], is_zero);
>> > +
>> > +        /* If this extent is too large, steal some data from it to
>> > +         * complete the request.
>> > +         */
>> > +        if (length + extents[extents_pos] > limit) {
>> > +            uint32_t stolen = limit - length;
>> > +
>> > +            extents[extents_pos] -= stolen;
>> > +            length += stolen;
>> > +            break;
>> > +        }
>> > +
>> > +        /* Consume the entire extent and start looking at the next
>> one. */
>> > +        length += extents[extents_pos];
>> > +        extents[extents_pos] = 0;
>> > +
>> > +        if (extents_pos + 2 == extents_len)
>> > +            break;
>> > +
>> > +        extents_pos += 2;
>> > +
>> > +        /* If next extent is different, we are done. */
>> > +        if ((extents[extents_pos + 1] & LIBNBD_STATE_ZERO) != is_zero)
>> > +            break;
>> > +    }
>> > +
>> > +    assert (length > 0 && length <= limit);
>> > +
>> > +    r->offset = offset;
>> > +    r->length = length;
>> > +    r->zero = is_zero;
>> > +
>> > +    DEBUG ("r%d: extent offset=%ld len=%ld zero=%d",
>> > +           r->index, r->offset, r->length, r->zero);
>> > +
>> > +    offset += length;
>> > +
>> > +    if (extents_pos + 2 == extents_len && extents[extents_pos] == 0) {
>> > +        /* Processed all extents, clear extents. */
>> > +        DEBUG ("r%d: consumed all extents offset=%ld", r->index,
>> offset);
>> > +        free (extents);
>> > +        extents = NULL;
>> > +        extents_pos = 0;
>> > +        extents_len = 0;
>> > +    }
>> > +}
>> > +
>> > +static inline void
>> > +start_request_soon (struct request *r)
>> > +{
>> > +    ev_timer_init (&r->watcher, start_request_cb, 0, 0);
>> > +    ev_timer_start (loop, &r->watcher);
>> > +}
>> > +
>> > +static void
>> > +start_request_cb (struct ev_loop *loop, ev_timer *w, int revents)
>> > +{
>> > +    struct request *r = (struct request *)w;
>> > +    start_request (r);
>> >  }
>> >
>> >  /* Start async copy or zero request. */
>> >  static void
>> >  start_request(struct request *r)
>> >  {
>> > -    assert (offset < size);
>> > +    /* Cancel the request if we are done. */
>> > +    if (offset == size)
>> > +        return;
>> >
>> >      r->started = ev_now (loop);
>> > -    r->length = MIN (REQUEST_SIZE, size - offset);
>> > -    r->offset = offset;
>> >
>> > -    start_read (r);
>> > -
>> > -    offset += r->length;
>> > +    /* If needed, get more extents from server. */
>> > +    if (src.can_extents && extents == NULL && start_extents (r))
>> > +        return;
>> > +
>> > +    DEBUG ("r%d: start request offset=%ld", r->index, offset);
>> > +
>> > +    if (src.can_extents) {
>> > +        /* Handle the next extent. */
>> > +        next_extent (r);
>> > +        if (r->zero) {
>> > +            if (dst.can_zero) {
>> > +                start_zero (r);
>> > +            } else {
>> > +                memset (r->data, 0, r->length);
>> > +                start_write (r);
>> > +            }
>> > +        } else {
>> > +            start_read (r);
>> > +        }
>> > +    } else {
>> > +        /* Extents not available. */
>> > +        r->length = MIN (REQUEST_SIZE, size - offset);
>> > +        r->offset = offset;
>> > +        start_read (r);
>> > +        offset += r->length;
>> > +    }
>> >  }
>> >
>> >  static void
>> > @@ -240,9 +446,11 @@ request_completed (void *user_data, int *error)
>> >          ev_break (loop, EVBREAK_ALL);
>> >      }
>> >
>> > -    /* If we have data to read, start a new read. */
>> > +    /* If we have more work, start a new request on the next loop
>> > +     * iteration, to avoid deadlock if we need to start a zero or
>> write.
>> > +     */
>> >      if (offset < size)
>> > -        start_request(r);
>> > +        start_request_soon(r);
>> >
>> >      return 1;
>> >  }
>> > @@ -304,11 +512,19 @@ main (int argc, char *argv[])
>> >
>> >      debug = getenv ("COPY_LIBEV_DEBUG") != NULL;
>> >
>> > +    /* Configure soruce to report extents. */
>> > +
>> > +    if (nbd_add_meta_context (src.nbd, LIBNBD_CONTEXT_BASE_ALLOCATION))
>> > +        FAIL ("Cannot add base:allocation: %s", nbd_get_error ());
>> > +
>> >      /* Connecting is fast, so use the syncronous API. */
>> >
>> >      if (nbd_connect_uri (src.nbd, argv[1]))
>> >          FAIL ("Cannot connect to source: %s", nbd_get_error ());
>> >
>> > +    src.can_extents = nbd_can_meta_context (
>> > +        src.nbd, LIBNBD_CONTEXT_BASE_ALLOCATION) > 0;
>> > +
>> >      if (nbd_connect_uri (dst.nbd, argv[2]))
>> >          FAIL ("Cannot connect to destination: %s", nbd_get_error ());
>> >
>> > --
>> > 2.26.2
>>
>> --
>> Richard Jones, Virtualization Group, Red Hat
>> http://people.redhat.com/~rjones
>> Read my programming and virtualization blog: http://rwmj.wordpress.com
>> virt-builder quickly builds VMs from scratch
>> http://libguestfs.org/virt-builder.1.html
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20210308/67a2c8c3/attachment.htm>