[Libguestfs] [nbdkit PATCH] retry: Tweak error message

Thu Apr 2 09:40:40 UTC 2020

On Wed, Apr 01, 2020 at 08:46:53AM -0500, Eric Blake wrote:
> On 4/1/20 5:12 AM, Richard W.M. Jones wrote:
> >On Tue, Mar 31, 2020 at 04:00:22PM -0500, Eric Blake wrote:
> >>The retry filter defaults to 5 retries, but when run with verbose
> >>details produces some confusing output:
> >>
> >>$ rm -f /tmp/inject; (sleep 5s; touch /tmp/inject)& ./nbdkit -f -v -U - \
> >>   null 1G --filter=retry --filter=noextents --filter=error error-rate=100% \
> >>   error-file=/tmp/inject --filter=delay rdelay=1 \
> >>   --run 'qemu-img convert $nbd out.img'
> >>...
> >>nbdkit: null[1]: debug: noextents: pread count=2097152 offset=14680064
> >>nbdkit: null[1]: debug: error: pread count=2097152 offset=14680064
> >>nbdkit: null[1]: error: injecting EIO error into pread
> >>nbdkit: null[1]: debug: retry 6: original errno = 5
> >>nbdkit: null[1]: debug: could not recover after 5 retries
> >>
> 
> >>   again:
> >>    /* Log the original errno since it will be lost when we retry. */
> >>-  nbdkit_debug ("retry %d: original errno = %d", data->retry+1, *err);
> >>+  nbdkit_debug ("retry attempt %d: original errno = %d", data->retry, *err);
> >>
> >>    if (data->retry >= retries) {
> >>      nbdkit_debug ("could not recover after %d retries", retries);
> >
> >I think adding 1 was correct, so that the retries count from 1.
> 
> But the problem is the timing of when the messages are printed.  And
> remember, this code is only reached after a failed attempt, not on
> success.  Pre-patch, we have:
> 
> initial try succeeds - no message
> 
> or (here, with retries limited to 2):
> 
> initial try fails:
>  retry 1: original error
> delay
> retry 1 fails:
>  retry 2: original error
> delay
> retry 2 fails:
>  retry 3: original error
>  could not recover after 2 retries
> 
> >
> >How about moving the message to make it conditional instead,
> >like this?
> >
> >again:
> >   /* Log the original errno since it will be lost when we retry. */
> >   if (data->retry < retries)
> >     nbdkit_debug ("retry %d: original errno = %d", data->retry+1, *err);
> >   else {
> >      nbdkit_debug ("could not recover after %d retries", retries);
> >   ...
> 
> with your suggestion, we'd have:
> 
> initial try fails:
>  retry 1: original error
> delay
> retry 1 fails:
>  retry 2: original error
> delay
> retry 2 fails:
>  could not recover after 2 retries
> 
> where we lost the log of the errno of retry 2.

It's not printed, but also it's not lost.  do_retry will immediately
return false, causing the called function to return this final errno
(hopefully back to the client).  The reason for printing the "original
errno" was because those intermediate errnos would otherwise be lost.

Attached is my suggested patch - it's a small evolution on top of
yours.  The output is now:

$ rm -f /tmp/inject ; (sleep 5s; touch /tmp/inject ) & ./nbdkit -f -v -U - null size=1G --filter=retry --filter=noextents --filter=error error-rate=100% error-file=/tmp/inject --filter=delay rdelay=1 retries=3 --run 'qemu-img convert $nbd /var/tmp/out.img'

nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: pread failed: original errno = 5
nbdkit: null[1]: debug: retry 1: waiting 2 seconds before retrying

nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: pread failed: original errno = 5
nbdkit: null[1]: debug: retry 2: waiting 4 seconds before retrying

nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: pread failed: original errno = 5
nbdkit: null[1]: debug: retry 3: waiting 8 seconds before retrying

nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: could not recover after 3 retries
nbdkit: null[1]: debug: sending error reply: Input/output error

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
-------------- next part --------------
>From 75b093d4cb2587c65f9c6cd8a4f08996cd761453 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjones at redhat.com>
Date: Thu, 2 Apr 2020 10:37:57 +0100
Subject: [PATCH] retry: Count retries from 1 and log the name of the method
 which failed.

After this change, the output is shown below:

$ rm -f /tmp/inject ; (sleep 5s; touch /tmp/inject ) & ./nbdkit -f -v -U - null size=1G --filter=retry --filter=noextents --filter=error error-rate=100% error-file=/tmp/inject --filter=delay rdelay=1 retries=3 --run 'qemu-img convert $nbd /var/tmp/out.img'

nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: pread failed: original errno = 5
nbdkit: null[1]: debug: retry 1: waiting 2 seconds before retrying

nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: pread failed: original errno = 5
nbdkit: null[1]: debug: retry 2: waiting 4 seconds before retrying

nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: pread failed: original errno = 5
nbdkit: null[1]: debug: retry 3: waiting 8 seconds before retrying

nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: could not recover after 3 retries
nbdkit: null[1]: debug: sending error reply: Input/output error
---
 filters/retry/retry.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/filters/retry/retry.c b/filters/retry/retry.c
index db91d7ca..8037f383 100644
--- a/filters/retry/retry.c
+++ b/filters/retry/retry.c
@@ -165,22 +165,23 @@ static bool
 do_retry (struct retry_handle *h,
           struct retry_data *data,
           struct nbdkit_next_ops *next_ops, void *nxdata,
-          int *err)
+          const char *method, int *err)
 {
   /* If it's the first retry, initialize the other fields in *data. */
   if (data->retry == 0)
     data->delay = initial_delay;
 
  again:
-  /* Log the original errno since it will be lost when we retry. */
-  nbdkit_debug ("retry attempt %d: original errno = %d", data->retry, *err);
-
   if (data->retry >= retries) {
     nbdkit_debug ("could not recover after %d retries", retries);
     return false;
   }
 
-  nbdkit_debug ("waiting %d seconds before retrying", data->delay);
+  /* Since we will retry, log the original errno otherwise it will be lost. */
+  nbdkit_debug ("%s failed: original errno = %d", method, *err);
+
+  nbdkit_debug ("retry %d: waiting %d seconds before retrying",
+                data->retry+1, data->delay);
   if (nbdkit_nanosleep (data->delay, 0) == -1) {
     /* We could do this but it would overwrite the more important
      * errno from the underlying data call.
@@ -225,7 +226,8 @@ retry_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
     r = -1;
   else
     r = next_ops->pread (nxdata, buf, count, offset, flags, err);
-  if (r == -1 && do_retry (h, &data, next_ops, nxdata, err)) goto again;
+  if (r == -1 && do_retry (h, &data, next_ops, nxdata, "pread", err))
+    goto again;
 
   return r;
 }
@@ -259,7 +261,8 @@ retry_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
   }
   else
     r = next_ops->pwrite (nxdata, buf, count, offset, flags, err);
-  if (r == -1 && do_retry (h, &data, next_ops, nxdata, err)) goto again;
+  if (r == -1 && do_retry (h, &data, next_ops, nxdata, "pwrite", err))
+    goto again;
 
   return r;
 }
@@ -293,7 +296,8 @@ retry_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
   }
   else
     r = next_ops->trim (nxdata, count, offset, flags, err);
-  if (r == -1 && do_retry (h, &data, next_ops, nxdata, err)) goto again;
+  if (r == -1 && do_retry (h, &data, next_ops, nxdata, "trim", err))
+    goto again;
 
   return r;
 }
@@ -317,7 +321,8 @@ retry_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
   }
   else
     r = next_ops->flush (nxdata, flags, err);
-  if (r == -1 && do_retry (h, &data, next_ops, nxdata, err)) goto again;
+  if (r == -1 && do_retry (h, &data, next_ops, nxdata, "flush", err))
+    goto again;
 
   return r;
 }
@@ -356,7 +361,8 @@ retry_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
   }
   else
     r = next_ops->zero (nxdata, count, offset, flags, err);
-  if (r == -1 && do_retry (h, &data, next_ops, nxdata, err)) goto again;
+  if (r == -1 && do_retry (h, &data, next_ops, nxdata, "zero", err))
+    goto again;
 
   return r;
 }
@@ -391,7 +397,8 @@ retry_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
     }
     r = next_ops->extents (nxdata, count, offset, flags, extents2, err);
   }
-  if (r == -1 && do_retry (h, &data, next_ops, nxdata, err)) goto again;
+  if (r == -1 && do_retry (h, &data, next_ops, nxdata, "extents", err))
+    goto again;
 
   if (r == 0) {
     /* Transfer the successful extents back to the caller. */
@@ -428,7 +435,8 @@ retry_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
   }
   else
     r = next_ops->cache (nxdata, count, offset, flags, err);
-  if (r == -1 && do_retry (h, &data, next_ops, nxdata, err)) goto again;
+  if (r == -1 && do_retry (h, &data, next_ops, nxdata, "cache", err))
+    goto again;
 
   return r;
 }
-- 
2.25.0