[Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption
Laszlo Ersek
lersek at redhat.com
Wed Feb 2 08:59:04 UTC 2022
On 02/02/22 09:23, Laszlo Ersek wrote:
> On 02/01/22 20:44, Eric Blake wrote:
>> These are the trivial patches modeled after Nir's commit 7ba6ef67,
>> fixing further places that fail to check *error during callbacks.
>
> I find these interfaces very difficult to use.
>
> https://libguestfs.org/nbd_aio_block_status.3.html
>
> We have two callbacks here, each takes a *error, and I have zero idea of
> the order the callbacks are supposed to be called in.
>
> The last paragraph of the commit message speaks about this, and the code
> updates seem to suggest some ordering, but I'm still unsure.
>
> ... In fact, I'm now uncertain whether ignoring "err" in virt-v2v's
> "lib/utils.ml", function "get_disk_allocated", is safe. It's a "sync"
> (not aio) nbd_block_status() call, but the documentation
> <https://libguestfs.org/nbd_block_status.3.html> says:
>
> "On entry to the callback, the error parameter contains the errno value
> of any previously detected error."
>
> So what should we do then, in "get_disk_allocated"? Return -1
> immediately?
>From patch#4, it seems that we need exactly this. Here's the function:
let get_disk_allocated ~dir ~disknr =
let socket = sprintf "%s/out%d" dir disknr
and alloc_ctx = "base:allocation" in
with_nbd_connect_unix ~socket ~meta_contexts:[alloc_ctx]
(fun nbd ->
if NBD.can_meta_context nbd alloc_ctx then (
(* Get the list of extents, using a 2GiB chunk size as hint. *)
let size = NBD.get_size nbd
and allocated = ref 0_L
and fetch_offset = ref 0_L in
while !fetch_offset < size do
let remaining = size -^ !fetch_offset in
let fetch_size = min 0x8000_0000_L remaining in
NBD.block_status nbd fetch_size !fetch_offset
(fun ctx offset entries err ->
assert (ctx = alloc_ctx);
for i = 0 to Array.length entries / 2 - 1 do
let len = entries.(i * 2)
and typ = entries.(i * 2 + 1) in
assert (len > 0_L);
if typ &^ 1_L = 0_L then
allocated := !allocated +^ len;
fetch_offset := !fetch_offset +^ len
done;
0
)
done;
Some !allocated
) else None
)
If our (nameless) function is entered with a nonzero "err", then we
cannot trust the "entries" array. Which means not only "allocated" may
be miscalculated, but even "fetch_offset" -- and the latter can as well
break the chunking (= outer) loop. This seems to imply we *must not*
continue the outer loop.
So how do we handle this in OCaml? If we return -1 from the (nameless)
extents callback, will NBD.block_status raise an exception?
Hmmmm: maybe. In the generated nbd_internal_ocaml_nbd_block_status()
function, in file "ocaml/nbd-c.c", we have:
caml_enter_blocking_section ();
r = nbd_block_status (h, count, offset, extent_callback, flags);
caml_leave_blocking_section ();
if (r == -1)
nbd_internal_ocaml_raise_error ();
However, in extent_wrapper() -> extent_wrapper_locked(), there is more
confusion for me:
metacontextv = caml_copy_string (metacontext);
offsetv = caml_copy_int64 (offset);
entriesv = nbd_internal_ocaml_alloc_int64_from_uint32_array (entries, nr_entries);
errorv = caml_alloc_tuple (1);
Store_field (errorv, 0, Val_int (*error));
args[0] = metacontextv;
args[1] = offsetv;
args[2] = entriesv;
args[3] = errorv;
rv = caml_callbackN_exn (data->fnv, 4, args);
*error = Int_val (Field (errorv, 0));
if (Is_exception_result (rv)) {
nbd_internal_ocaml_exception_in_wrapper ("extent", rv);
CAMLreturnT (int, -1);
}
r = Int_val (rv);
assert (r >= 0);
CAMLreturnT (int, r);
If I understand correctly, if we returned (-1) from the (nameless)
extents callback, that would trip the assert(); would it not? Thus, we
would not reach the nbd_internal_ocaml_raise_error() in the above-quoted
nbd_internal_ocaml_nbd_block_status() function.
Thanks,
Laszlo
> And if we do so, what is that going to mean for
> "NBD.block_status"? Will it raise an exception?
>
> Thanks,
> Laszlo
>
>> nbdcopy still has a bigger bug in that it is NOT properly checking for
>> *error in the pread/pwrite completion handlers, but fixing that
>> gracefully (rather than just quitting the process immediately, as we
>> can do in the examples), requires more invasive patches which I will
>> be posting separately for review, while the ones posted here are
>> trivial enough that I'm pushing them now (commits 23bcea4a..1373d423).
>>
>> I also audited other uses of completion callbacks:
>> - in tests, closure-lifetimes, errors, server-death, and
>> shutdown-flags are all safe (either the callback returns 0 and the
>> code later calls nbd_aio_command_completed, or we are explicitly
>> testing aspects of completion callbacks)
>> - in fuse/operations.c, the completion callback returns 0 and calls
>> nbd_aio_command_completed later
>> - in examples/strict-structured-reads.c, the completion callback
>> properly checks *error
>>
>> I also checked that nbdkit's nbd plugin properly checks *error.
>>
>> Eric Blake (4):
>> examples: aio-connect-read.c: Fix error handling
>> examples: glib-main-loop: don't strand commands when gsource
>> disappears
>> examples: glib-main-loop: Fix error handling
>> copy: Ignore extents if we encountered earlier error
>>
>> examples/aio-connect-read.c | 7 +++++++
>> examples/glib-main-loop.c | 14 ++++++++++++--
>> copy/nbd-ops.c | 4 ++--
>> 3 files changed, 21 insertions(+), 4 deletions(-)
>>
>
More information about the Libguestfs
mailing list