[Libguestfs] [PATCH libnbd 3/3] examples: copy-libev.c: Support extents
Nir Soffer
nsoffer at redhat.com
Sun Mar 7 23:26:44 UTC 2021
On Mon, Mar 8, 2021 at 12:59 AM Nir Soffer <nsoffer at redhat.com> wrote:
> On Mon, Mar 8, 2021 at 12:25 AM Richard W.M. Jones <rjones at redhat.com>
> wrote:
>
>> On Sat, Mar 06, 2021 at 11:21:22PM +0200, Nir Soffer wrote:
>> > Handling extents is complicated, in particular when using async block
>> > status. But I could not get this working using sync block status; It
>> > seems that libnbd eats the events on the source, and then we hang
>> > forever waiting for inflight reads.
>>
>> Is this a bug in libnbd?
>>
>
> Maybe, I can try to post a simple reproducer later. Even if this is not a
> bug
> it would be nice to show how sync calls can be mixed with async call with
> libev (or another event loop).
>
> Also is libev going to be faster than poll(2) [as used by nbdcopy] for
>> copying? I would imagine that it should make hardly any difference
>> since we're only using a couple of file descriptors with poll(2) in
>> nbdcopy, and it's my understanding that poll(2) is only pessimal when
>> you have to poll large numbers of FDs.
>>
>
> I think that poll or select should be faster for 2 file descriptors, but
> the
> difference is negligible in this context.
>
Some data when copying fredora 32 qcow2 image to nbdkit memory plugin:
$ strace -c ./copy-libev $SRC $DST
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
50.07 0.271421 17 15728 4 recvfrom
39.10 0.211987 30 6928 sendto
8.25 0.044734 4 9114 epoll_wait
1.13 0.006102 4 1335 epoll_ctl
1.01 0.005484 457 12 6 wait4
0.10 0.000522 4 122 mmap
0.06 0.000311 3 78 48 openat
0.05 0.000298 49 6 clone
0.04 0.000227 5 43 read
0.03 0.000143 3 42 close
0.02 0.000135 2 52 rt_sigprocmask
0.02 0.000132 3 37 rt_sigaction
0.02 0.000105 3 32 mprotect
0.02 0.000100 3 30 15 stat
0.02 0.000095 47 2 execve
0.01 0.000061 2 29 fstat
0.01 0.000044 4 9 poll
0.01 0.000038 2 16 pread64
0.00 0.000027 5 5 pipe
0.00 0.000021 1 13 brk
0.00 0.000019 3 6 rt_sigreturn
0.00 0.000017 1 11 2 ioctl
0.00 0.000015 7 2 munmap
0.00 0.000015 7 2 connect
0.00 0.000010 5 2 socket
0.00 0.000007 3 2 shutdown
0.00 0.000007 7 1 sysinfo
0.00 0.000005 2 2 getgid
0.00 0.000004 0 5 4 access
0.00 0.000004 2 2 getsockopt
0.00 0.000004 4 1 uname
0.00 0.000004 2 2 getuid
0.00 0.000004 2 2 geteuid
0.00 0.000004 2 2 getegid
0.00 0.000004 1 4 2 arch_prctl
0.00 0.000003 1 2 2 setsockopt
0.00 0.000002 0 5 lseek
0.00 0.000000 0 1 dup2
0.00 0.000000 0 2 getpid
0.00 0.000000 0 5 1 fcntl
0.00 0.000000 0 1 getppid
0.00 0.000000 0 1 getpgrp
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 eventfd2
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 3 prlimit64
0.00 0.000000 0 1 getrandom
------ ----------- ----------- --------- --------- ----------------
100.00 0.542115 16 33704 84 total
$ strace -f -c ../copy/nbdcopy --sparse=1048576 --request-size=1048576
--requests=16 --connections=1 $SRC $DST
strace: Process 1094611 attached
strace: Process 1094612 attached
strace: Process 1094613 attached
strace: Process 1094614 attached
strace: Process 1094615 attached
strace: Process 1094616 attached
strace: Process 1094617 attached
strace: Process 1094618 attached
strace: Process 1094619 attached
strace: Process 1094620 attached
strace: Process 1094621 attached
strace: Process 1094622 attached
strace: Process 1094623 attached
strace: Process 1094641 attached
(Not sure why we start so many threads with --connections=1 - bug?)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
58.10 0.881783 146963 6 futex
18.70 0.283841 17 16634 2 recvfrom
17.42 0.264326 31 8398 4 sendto
2.73 0.041494 4 8877 poll
1.97 0.029895 1299 23 10 wait4
0.46 0.006970 240 29 madvise
0.13 0.001903 6 311 mmap
0.09 0.001330 190 7 execve
0.07 0.001041 40 26 munmap
0.05 0.000825 5 157 67 openat
0.04 0.000557 5 100 mprotect
0.04 0.000548 4 113 read
0.04 0.000545 38 14 clone
0.03 0.000440 2 151 6 close
0.02 0.000329 3 97 fstat
0.02 0.000267 2 125 rt_sigprocmask
0.02 0.000251 3 66 pread64
0.01 0.000214 1 148 rt_sigaction
0.01 0.000188 18 10 statfs
0.01 0.000145 5 28 brk
0.01 0.000143 3 37 write
0.01 0.000140 5 24 10 access
0.01 0.000083 1 69 31 stat
0.00 0.000063 10 6 socket
0.00 0.000049 3 15 6 ioctl
0.00 0.000047 15 3 1 lstat
0.00 0.000045 7 6 4 connect
0.00 0.000032 4 8 pipe
0.00 0.000031 2 14 7 arch_prctl
0.00 0.000028 2 10 rt_sigreturn
0.00 0.000023 2 8 prlimit64
0.00 0.000021 3 7 set_robust_list
0.00 0.000019 3 6 set_tid_address
0.00 0.000011 11 1 lgetxattr
0.00 0.000009 4 2 shutdown
0.00 0.000008 1 6 lseek
0.00 0.000007 7 1 1 getxattr
0.00 0.000006 0 15 getpid
0.00 0.000006 6 1 sysinfo
0.00 0.000005 5 1 uname
0.00 0.000004 0 16 dup2
0.00 0.000000 0 2 2 setsockopt
0.00 0.000000 0 2 getsockopt
0.00 0.000000 0 4 1 fcntl
0.00 0.000000 0 1 chdir
0.00 0.000000 0 9 getuid
0.00 0.000000 0 9 getgid
0.00 0.000000 0 9 geteuid
0.00 0.000000 0 9 getegid
0.00 0.000000 0 1 getppid
0.00 0.000000 0 1 getpgrp
0.00 0.000000 0 2 getdents64
0.00 0.000000 0 1 getrandom
------ ----------- ----------- --------- --------- ----------------
100.00 1.517672 42 35626 152 total
If you want to use poll in libev, you can:
diff --git a/examples/copy-libev.c b/examples/copy-libev.c
index 6e6cbcb..1878f27 100644
--- a/examples/copy-libev.c
+++ b/examples/copy-libev.c
@@ -530,7 +530,7 @@ main (int argc, char *argv[])
{
int i;
- loop = EV_DEFAULT;
+ loop = ev_loop_new (EVBACKEND_POLL);
if (argc != 3)
FAIL ("Usage: %s src-uri dst-uri", PROG);
But it does not look useful:
$ strace -c ./copy-libev $SRC $DST
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
48.50 0.261526 16 15626 1 recvfrom
39.01 0.210365 30 7007 sendto
8.24 0.044413 4 9214 poll
The advantage in using an event loop is an easier way to extend the
> application,
> for example I'm using async block status, this will be much harder to do
> in nbdcopy.
>
> Another advantage is adding more features like handling signals, timeouts,
> child processes, or other other network protocols (e.g. nbd <-> http).
>
>
>>
>> Anyway patch series is fine, ACK.
>>
>> Rich.
>>
>> > Since getting extents is asynchronous, and starting new request depends
>> > on being able to get the next extent, requests have now a new a waiting
>> > state. When a request is started when extents are not available, it is
>> > marked as waiting. When extents request is completed, we start all
>> > waiting requests.
>> >
>> > Here is example log showing whats going on while copying fedora 32
>> > image:
>> >
>> > The first request detect that we don't have extents yet, so it starts
>> > aync block status request.
>> >
>> > copy-libev: r0: start extents offset=0 count=134217728
>> > copy-libev: r0: received 14 extents for base:allocation
>> > copy-libev: r0: extents completed time=0.001098
>> >
>> > When extent request completes, we start al waiting requests. The first
>> > request (r0) looked into the first extent (e0) and consumed all of it:
>> >
>> > copy-libev: r0: start request offset=0
>> > copy-libev: e0: offset=0 len=65536 zero=0
>> > copy-libev: r0: extent offset=0 len=65536 zero=0
>> > copy-libev: r0: start read offset=0 len=65536
>> >
>> > The second request (r15) is looking into the second extent (e1) and
>> > consume all of it, starting a zero request:
>> >
>> > copy-libev: r15: start request offset=65536
>> > copy-libev: e1: offset=65536 len=983040 zero=1
>> > copy-libev: r15: extent offset=65536 len=983040 zero=1
>> > copy-libev: r15: start zero offset=65536 len=983040
>> >
>> > ...
>> >
>> > Request (r12) looked into the fifth extent (e4), but since this extent
>> > was large (10747904), it consume only 1m from it:
>> >
>> > copy-libev: r12: start request offset=2097152
>> > copy-libev: e4: offset=2097152 len=10747904 zero=0
>> > copy-libev: r12: extent offset=2097152 len=1048576 zero=0
>> > copy-libev: r12: start read offset=2097152 len=1048576
>> >
>> > The next request consumed the next 1m from the same extent (e4):
>> >
>> > copy-libev: r11: start request offset=3145728
>> > copy-libev: e4: offset=3145728 len=9699328 zero=0
>> > copy-libev: r11: extent offset=3145728 len=1048576 zero=0
>> > copy-libev: r11: start read offset=3145728 len=1048576
>> > copy-libev: r10: start request offset=4194304
>> >
>> > ..
>> >
>> > The last part of extent e4 was consumed, and we switched to extent e5:
>> >
>> > copy-libev: r2: start request offset=12582912
>> > copy-libev: e4: offset=12582912 len=262144 zero=0
>> > copy-libev: r2: extent offset=12582912 len=262144 zero=0
>> > copy-libev: r2: start read offset=12582912 len=262144
>> > copy-libev: r1: start request offset=12845056
>> > copy-libev: e5: offset=12845056 len=131072 zero=1
>> > copy-libev: r1: extent offset=12845056 len=131072 zero=1
>> > copy-libev: r1: start zero offset=12845056 len=131072
>> >
>> > ...
>> >
>> > Request (r11) consumed the last extent (e13), starting a zero request.
>> > This free the extents array:
>> >
>> > copy-libev: r11: start request offset=133955584
>> > copy-libev: e13: offset=133955584 len=262144 zero=1
>> > copy-libev: r11: extent offset=133955584 len=262144 zero=1
>> > copy-libev: r11: consumed all extents offset=134217728
>> > copy-libev: r11: start zero offset=133955584 len=262144
>> >
>> > ...
>> >
>> > Request (r12) started when extents array as cleared, so it started new
>> > block status request:
>> >
>> > copy-libev: r12: start extents offset=134217728 count=134217728
>> > ...
>> > copy-libev: r12: received 3 extents for base:allocation
>> > copy-libev: r12: extents completed time=0.003027
>> >
>> > ...
>> >
>> > The rest of the flow is same as before. When all requests are done,
>> > we shutdown the event loop and flush:
>> >
>> > copy-libev: r4: request completed offset=6438256640 len=1048576
>> time=0.000132
>> > copy-libev: r1: read completed offset=6442385408 len=65536
>> > copy-libev: r1: start write offset=6442385408 len=65536
>> > copy-libev: r14: request completed offset=6439305216 len=1048576
>> time=0.000126
>> > copy-libev: r8: request completed offset=6440353792 len=1048576
>> time=0.000151
>> > copy-libev: r2: request completed offset=6441402368 len=983040
>> time=0.000143
>> > copy-libev: r1: request completed offset=6442385408 len=65536
>> time=0.000142
>> > copy-libev: flush
>> >
>> > Signed-off-by: Nir Soffer <nsoffer at redhat.com>
>> > ---
>> > examples/copy-libev.c | 232 ++++++++++++++++++++++++++++++++++++++++--
>> > 1 file changed, 224 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/examples/copy-libev.c b/examples/copy-libev.c
>> > index 84d5c03..3030955 100644
>> > --- a/examples/copy-libev.c
>> > +++ b/examples/copy-libev.c
>> > @@ -41,6 +41,7 @@
>> > */
>> > #define MAX_REQUESTS 16
>> > #define REQUEST_SIZE (1024 * 1024)
>> > +#define EXTENTS_SIZE (128 * 1024 * 1024)
>> >
>> > #define MIN(a,b) (a) < (b) ? (a) : (b)
>> >
>> > @@ -62,14 +63,18 @@ struct connection {
>> > ev_io watcher;
>> > struct nbd_handle *nbd;
>> > bool can_zero;
>> > + bool can_extents;
>> > };
>> >
>> > struct request {
>> > + ev_timer watcher; /* For starting on next loop iteration. */
>> > int64_t offset;
>> > size_t length;
>> > + bool zero;
>> > unsigned char *data;
>> > size_t index;
>> > ev_tstamp started;
>> > + bool waiting; /* Waiting for extents completion. */
>> > };
>> >
>> > static struct ev_loop *loop;
>> > @@ -77,11 +82,29 @@ static ev_prepare prepare;
>> > static struct connection src;
>> > static struct connection dst;
>> > static struct request requests[MAX_REQUESTS];
>> > +
>> > +/* List of extents received from source server. Using the same format
>> returned
>> > + * by libnbd, array of uint32_t pairs. The first item is the length of
>> the
>> > + * extent, and the second is the extent flags.
>> > + *
>> > + * The number of extents is extents_len / 2. extents_pos is the index
>> of the
>> > + * current extent.
>> > + *
>> > + * extents_in_progress flag is set when we start asynchronous block
>> status
>> > + * request.
>> > + */
>> > +static uint32_t *extents;
>> > +static size_t extents_len;
>> > +static size_t extents_pos;
>> > +static bool extents_in_progress;
>> > +
>> > static int64_t size;
>> > static int64_t offset;
>> > static int64_t written;
>> > static bool debug;
>> >
>> > +static inline void start_request_soon (struct request *r);
>> > +static void start_request_cb (struct ev_loop *loop, ev_timer *w, int
>> revents);
>> > static void start_request(struct request *r);
>> > static void start_read(struct request *r);
>> > static void start_write(struct request *r);
>> > @@ -133,23 +156,206 @@ get_events(struct connection *c)
>> > default:
>> > return 0;
>> > }
>> > +}
>> > +
>> > +static int
>> > +extent_callback (void *user_data, const char *metacontext, uint64_t
>> offset,
>> > + uint32_t *entries, size_t nr_entries, int *error)
>> > +{
>> > + struct request *r = user_data;
>> > +
>> > + if (strcmp (metacontext, LIBNBD_CONTEXT_BASE_ALLOCATION) != 0) {
>> > + DEBUG ("Unexpected meta context: %s", metacontext);
>> > + return 1;
>> > + }
>> > +
>> > + extents = malloc (nr_entries * sizeof *extents);
>> > + if (extents == NULL)
>> > + FAIL ("Cannot allocated extents: %s", strerror (errno));
>> > +
>> > + memcpy (extents, entries, nr_entries * sizeof *extents);
>> > + extents_len = nr_entries;
>> > +
>> > + DEBUG ("r%d: received %d extents for %s",
>> > + r->index, nr_entries / 2, metacontext);
>> > +
>> > + return 1;
>> > +}
>> > +
>> > +static int
>> > +extents_completed (void *user_data, int *error)
>> > +{
>> > + struct request *r = (struct request *)user_data;
>> > + int i;
>> > +
>> > + DEBUG ("r%d: extents completed time=%.6f",
>> > + r->index, ev_now (loop) - r->started);
>> > +
>> > + extents_in_progress = false;
>> > +
>> > + if (extents == NULL) {
>> > + DEBUG ("r%d: received no extents, disabling extents",
>> r->index);
>> > + src.can_extents = false;
>> > + }
>> >
>> > + /* Start requests waiting for extents completion on the next loop
>> > + * iteration, to avoid deadlock if we need to start a read.
>> > + */
>> > + for (i = 0; i < MAX_REQUESTS; i++) {
>> > + struct request *r = &requests[i];
>> > + if (r->waiting) {
>> > + r->waiting = false;
>> > + start_request_soon (r);
>> > + }
>> > + }
>> >
>> > + return 1;
>> > +}
>> > +
>> > +static bool
>> > +start_extents (struct request *r)
>> > +{
>> > + size_t count = MIN (EXTENTS_SIZE, size - offset);
>> > + int64_t cookie;
>> > +
>> > + if (extents_in_progress) {
>> > + r->waiting = true;
>> > + return true;
>> > + }
>> > +
>> > + DEBUG ("r%d: start extents offset=%ld count=%ld", r->index,
>> offset, count);
>> > +
>> > + cookie = nbd_aio_block_status (
>> > + src.nbd, count, offset,
>> > + (nbd_extent_callback) { .callback=extent_callback,
>> > + .user_data=r },
>> > + (nbd_completion_callback) { .callback=extents_completed,
>> > + .user_data=r },
>> > + 0);
>> > + if (cookie == -1) {
>> > + DEBUG ("Cannot get extents: %s", nbd_get_error ());
>> > + src.can_extents = false;
>> > + return false;
>> > + }
>> > +
>> > + r->waiting = true;
>> > + extents_in_progress = true;
>> > +
>> > + return true;
>> > +}
>> > +
>> > +/* Return next extent to process. */
>> > +static void
>> > +next_extent (struct request *r)
>> > +{
>> > + uint32_t limit = MIN (REQUEST_SIZE, size - offset);
>> > + uint32_t length = 0;
>> > + bool is_zero;
>> > +
>> > + assert (extents);
>> > +
>> > + is_zero = extents[extents_pos + 1] & LIBNBD_STATE_ZERO;
>> > +
>> > + while (length < limit) {
>> > + DEBUG ("e%d: offset=%ld len=%ld zero=%d",
>> > + extents_pos / 2, offset, extents[extents_pos], is_zero);
>> > +
>> > + /* If this extent is too large, steal some data from it to
>> > + * complete the request.
>> > + */
>> > + if (length + extents[extents_pos] > limit) {
>> > + uint32_t stolen = limit - length;
>> > +
>> > + extents[extents_pos] -= stolen;
>> > + length += stolen;
>> > + break;
>> > + }
>> > +
>> > + /* Consume the entire extent and start looking at the next
>> one. */
>> > + length += extents[extents_pos];
>> > + extents[extents_pos] = 0;
>> > +
>> > + if (extents_pos + 2 == extents_len)
>> > + break;
>> > +
>> > + extents_pos += 2;
>> > +
>> > + /* If next extent is different, we are done. */
>> > + if ((extents[extents_pos + 1] & LIBNBD_STATE_ZERO) != is_zero)
>> > + break;
>> > + }
>> > +
>> > + assert (length > 0 && length <= limit);
>> > +
>> > + r->offset = offset;
>> > + r->length = length;
>> > + r->zero = is_zero;
>> > +
>> > + DEBUG ("r%d: extent offset=%ld len=%ld zero=%d",
>> > + r->index, r->offset, r->length, r->zero);
>> > +
>> > + offset += length;
>> > +
>> > + if (extents_pos + 2 == extents_len && extents[extents_pos] == 0) {
>> > + /* Processed all extents, clear extents. */
>> > + DEBUG ("r%d: consumed all extents offset=%ld", r->index,
>> offset);
>> > + free (extents);
>> > + extents = NULL;
>> > + extents_pos = 0;
>> > + extents_len = 0;
>> > + }
>> > +}
>> > +
>> > +static inline void
>> > +start_request_soon (struct request *r)
>> > +{
>> > + ev_timer_init (&r->watcher, start_request_cb, 0, 0);
>> > + ev_timer_start (loop, &r->watcher);
>> > +}
>> > +
>> > +static void
>> > +start_request_cb (struct ev_loop *loop, ev_timer *w, int revents)
>> > +{
>> > + struct request *r = (struct request *)w;
>> > + start_request (r);
>> > }
>> >
>> > /* Start async copy or zero request. */
>> > static void
>> > start_request(struct request *r)
>> > {
>> > - assert (offset < size);
>> > + /* Cancel the request if we are done. */
>> > + if (offset == size)
>> > + return;
>> >
>> > r->started = ev_now (loop);
>> > - r->length = MIN (REQUEST_SIZE, size - offset);
>> > - r->offset = offset;
>> >
>> > - start_read (r);
>> > -
>> > - offset += r->length;
>> > + /* If needed, get more extents from server. */
>> > + if (src.can_extents && extents == NULL && start_extents (r))
>> > + return;
>> > +
>> > + DEBUG ("r%d: start request offset=%ld", r->index, offset);
>> > +
>> > + if (src.can_extents) {
>> > + /* Handle the next extent. */
>> > + next_extent (r);
>> > + if (r->zero) {
>> > + if (dst.can_zero) {
>> > + start_zero (r);
>> > + } else {
>> > + memset (r->data, 0, r->length);
>> > + start_write (r);
>> > + }
>> > + } else {
>> > + start_read (r);
>> > + }
>> > + } else {
>> > + /* Extents not available. */
>> > + r->length = MIN (REQUEST_SIZE, size - offset);
>> > + r->offset = offset;
>> > + start_read (r);
>> > + offset += r->length;
>> > + }
>> > }
>> >
>> > static void
>> > @@ -240,9 +446,11 @@ request_completed (void *user_data, int *error)
>> > ev_break (loop, EVBREAK_ALL);
>> > }
>> >
>> > - /* If we have data to read, start a new read. */
>> > + /* If we have more work, start a new request on the next loop
>> > + * iteration, to avoid deadlock if we need to start a zero or
>> write.
>> > + */
>> > if (offset < size)
>> > - start_request(r);
>> > + start_request_soon(r);
>> >
>> > return 1;
>> > }
>> > @@ -304,11 +512,19 @@ main (int argc, char *argv[])
>> >
>> > debug = getenv ("COPY_LIBEV_DEBUG") != NULL;
>> >
>> > + /* Configure soruce to report extents. */
>> > +
>> > + if (nbd_add_meta_context (src.nbd, LIBNBD_CONTEXT_BASE_ALLOCATION))
>> > + FAIL ("Cannot add base:allocation: %s", nbd_get_error ());
>> > +
>> > /* Connecting is fast, so use the syncronous API. */
>> >
>> > if (nbd_connect_uri (src.nbd, argv[1]))
>> > FAIL ("Cannot connect to source: %s", nbd_get_error ());
>> >
>> > + src.can_extents = nbd_can_meta_context (
>> > + src.nbd, LIBNBD_CONTEXT_BASE_ALLOCATION) > 0;
>> > +
>> > if (nbd_connect_uri (dst.nbd, argv[2]))
>> > FAIL ("Cannot connect to destination: %s", nbd_get_error ());
>> >
>> > --
>> > 2.26.2
>>
>> --
>> Richard Jones, Virtualization Group, Red Hat
>> http://people.redhat.com/~rjones
>> Read my programming and virtualization blog: http://rwmj.wordpress.com
>> virt-builder quickly builds VMs from scratch
>> http://libguestfs.org/virt-builder.1.html
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20210308/67a2c8c3/attachment.htm>
More information about the Libguestfs
mailing list