[Libguestfs] [nbdkit PATCH v2] blocksize: Avoid losing aligned writes to RMW race

Laszlo Ersek lersek at redhat.com
Fri May 27 06:59:23 UTC 2022


On 05/26/22 23:00, Eric Blake wrote:
> [We are still investigating if a CVE needs to be assigned.]
>
> We have a bug where the blocksize filter can lose the effects of an
> aligned write that happens in parallel with an overlapping unaligned
> write.  In general, a client should be extremely cautious about
> overlapping writes; it is not NBD's fault which of the two writes
> lands on the disk (or even if both writes land only partially, mixed
> at block-level granularity depending on which write landed last).
> However, a client should still be able to assume that even when two
> writes partially overlap, the final state of the non-overlapping
> portion of the disk will be that of the one write explicitly touching
> that portion, rather than stale data from before either write touched
> that block.
>
> Fix the bug by switching the blocksize filter locking from a mutex to
> a rwlock.  We still need mutual exclusion during all unaligned
> operations (whether reading or writing to the plugin), because we
> share the same bounce buffer among all such code, which we get via the
> mutual exclusion of wrlock.  But we now also add in shared exclusion
> for aligned write-like operations (pwrite, trim, zero); these can
> operate in parallel with one another, but must not be allowed to land
> in the window between an overlapping unaligned operation's read from
> the plugin and its later write-back, so that we do not end up
> re-writing stale contents that undo the portion of the aligned write
> that was outside the unaligned region [it's ironic that we only need
> this protection on write I/O, but get it by using the rdlock feature
> of the rwlock].  Read-like operations (pread, extents, cache) don't
> need to be protected from RMW operations, because it is already the
> user's problem if they execute parallel overlapping reads while
> modifying the same portion of the image; and they will still end up
> seeing either the state before or after the unaligned write,
> indistinguishable from if we had added more locking.
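
(Not part of the review proper, just to check my own reading: I assume
the ACQUIRE_WRLOCK/RDLOCK_FOR_CURRENT_SCOPE macros used in the diff
below are scope-guard wrappers around a pthread_rwlock_t, roughly along
the lines of this sketch; the real helpers live in nbdkit's cleanup.h
and may well differ in detail.)

    /* Sketch only: scope-bound rwlock guards built on the GCC/Clang
     * cleanup attribute.  Not nbdkit's actual implementation.
     */
    #include <pthread.h>

    static inline void
    scoped_rwlock_unlock (pthread_rwlock_t **ptr)
    {
      if (*ptr)
        pthread_rwlock_unlock (*ptr);
    }

    #define ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE(lock)            \
      __attribute__ ((cleanup (scoped_rwlock_unlock)))        \
      pthread_rwlock_t *_scoped_rwlock = (lock);              \
      pthread_rwlock_wrlock (_scoped_rwlock)

    #define ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE(lock)            \
      __attribute__ ((cleanup (scoped_rwlock_unlock)))        \
      pthread_rwlock_t *_scoped_rwlock = (lock);              \
      pthread_rwlock_rdlock (_scoped_rwlock)

With something like that, the lock is released automatically when the
enclosing block (e.g. the unaligned-head branch) exits, which is what
makes the "grab per branch" pattern in the diff safe.
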
>
> Whether this lock favors rdlock (aka aligned pwrite) or wrlock (aka
> unaligned access) is implementation-defined by the pthread
> implementation; if we need to be more precise, we'll have to introduce
> our own rwlock implementation on top of lower-level primitives
> (https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock mentions
> the use of two mutexes to favor readers, or a condvar/mutex to favor
> writers).  This is also not very fine-grained; it may turn out that we
> could get more performance if instead of locking the entire operation
> (across all clients, and regardless of whether the offsets overlap),
> we instead just lock metadata and then track whether a new operation
> overlaps with an existing unaligned operation, or even add in support
> for parallel unaligned operations by having more than one bounce
> buffer.  But if performance truly matters, you're better off fixing
> the client to do aligned access in the first place, rather than
> needing the blocksize filter.
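
(Also just as an aside on the "favor writers" remark: the condvar/mutex
variant that prefers the unaligned (wrlock) side could look roughly
like the following. Purely illustrative; the type and function names
are mine and nothing like this is in the patch.)

    /* Minimal writer-preferring rwlock sketch (illustrative only).
     * Initialize with:
     *   { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, false }
     */
    #include <pthread.h>
    #include <stdbool.h>

    typedef struct {
      pthread_mutex_t m;
      pthread_cond_t cv;
      unsigned readers;          /* active shared holders */
      unsigned writers_waiting;  /* queued exclusive requests */
      bool writer_active;
    } wp_rwlock;

    static void
    wp_rdlock (wp_rwlock *l)
    {
      pthread_mutex_lock (&l->m);
      /* New readers queue behind any waiting or active writer. */
      while (l->writer_active || l->writers_waiting > 0)
        pthread_cond_wait (&l->cv, &l->m);
      l->readers++;
      pthread_mutex_unlock (&l->m);
    }

    static void
    wp_wrlock (wp_rwlock *l)
    {
      pthread_mutex_lock (&l->m);
      l->writers_waiting++;
      while (l->writer_active || l->readers > 0)
        pthread_cond_wait (&l->cv, &l->m);
      l->writers_waiting--;
      l->writer_active = true;
      pthread_mutex_unlock (&l->m);
    }

    static void
    wp_unlock (wp_rwlock *l)
    {
      pthread_mutex_lock (&l->m);
      if (l->writer_active)
        l->writer_active = false;
      else
        l->readers--;
      pthread_cond_broadcast (&l->cv);
      pthread_mutex_unlock (&l->m);
    }
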
>
> Add a test that fails without the change to blocksize.c.  With some
> carefully timed setups (using the delay filter to stall writes
> reaching the plugin, and the plugin itself to stall reads coming back
> from the plugin, so that we are reasonably sure of the overall
> timeline), we can demonstrate the bug present in the unpatched code,
> where an aligned write is lost when it lands in the wrong part of an
> unaligned RMW cycle; the fixed code instead shows that the overlapping
> operations were serialized.  We may need to further tweak the test to
> be more reliable even under heavy load, but hopefully 2 seconds per
> stall (where a successful test requires 18 seconds) is sufficient for
> now.
>
> Reported-by: Nikolaus Rath <Nikolaus at rath.org>
> Fixes: 1aadbf9971 ("blocksize: Allow parallel requests", v1.25.3)
> ---
>
> In v2:
> - Implement the blocksize fix.
> - Drop RFC; I think this is ready, other than determining if we want
>   to tag it with a CVE number.
> - Improve the test: print out more details before assertions, to aid
>   in debugging if it ever dies under CI resources

Apparently there was a bug in the zero callback too (which I had
missed): the original dd invocation lacked of=$tmpdir/img, so the
zeroes went to the script's stdout rather than into the image:

    -+    zero='dd if=/dev/zero skip=$4 count=$3 iflag=count_bytes,skip_bytes' \
    ++    zero='dd if=/dev/zero of=$tmpdir/img skip=$4 count=$3 \
    ++      iflag=count_bytes,skip_bytes' \

After comparing the RFC against v2: if you can split the test case into
a separate patch (= the second patch in a two-part series), then I'd be
happy for you to carry my R-b forward. I don't feel confident enough to
review the change to the blocksize filter itself. (My confidence is
lacking in my knowledge of nbdkit, not in your patch.)

Thanks
Laszlo

>
>  tests/Makefile.am                |   2 +
>  filters/blocksize/blocksize.c    |  26 +++--
>  tests/test-blocksize-sharding.sh | 168 +++++++++++++++++++++++++++++++
>  3 files changed, 187 insertions(+), 9 deletions(-)
>  create mode 100755 tests/test-blocksize-sharding.sh
>
> diff --git a/tests/Makefile.am b/tests/Makefile.am
> index b6d30c0a..67e7618b 100644
> --- a/tests/Makefile.am
> +++ b/tests/Makefile.am
> @@ -1376,11 +1376,13 @@ TESTS += \
>  	test-blocksize.sh \
>  	test-blocksize-extents.sh \
>  	test-blocksize-default.sh \
> +	test-blocksize-sharding.sh \
>  	$(NULL)
>  EXTRA_DIST += \
>  	test-blocksize.sh \
>  	test-blocksize-extents.sh \
>  	test-blocksize-default.sh \
> +	test-blocksize-sharding.sh \
>  	$(NULL)
>
>  # blocksize-policy filter test.
> diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c
> index 03da4971..1c81d5e3 100644
> --- a/filters/blocksize/blocksize.c
> +++ b/filters/blocksize/blocksize.c
> @@ -51,10 +51,15 @@
>
>  #define BLOCKSIZE_MIN_LIMIT (64U * 1024)
>
> -/* In order to handle parallel requests safely, this lock must be held
> - * when using the bounce buffer.
> +/* Lock in order to handle overlapping requests safely.
> + *
> + * Grabbed for exclusive access (wrlock) when using the bounce buffer.
> + *
> + * Grabbed for shared access (rdlock) when doing aligned writes.
> + * These can happen in parallel with one another, but must not land in
> + * between the read and write of an unaligned RMW operation.
>   */
> -static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> +static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
>
>  /* A single bounce buffer for alignment purposes, guarded by the lock.
>   * Size it to the maximum we allow for minblock.
> @@ -255,7 +260,7 @@ blocksize_pread (nbdkit_next *next,
>
>    /* Unaligned head */
>    if (offs & (h->minblock - 1)) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      drop = offs & (h->minblock - 1);
>      keep = MIN (h->minblock - drop, count);
>      if (next->pread (next, bounce, h->minblock, offs - drop, flags, err) == -1)
> @@ -278,7 +283,7 @@ blocksize_pread (nbdkit_next *next,
>
>    /* Unaligned tail */
>    if (count) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      if (next->pread (next, bounce, h->minblock, offs, flags, err) == -1)
>        return -1;
>      memcpy (buf, bounce, count);
> @@ -306,7 +311,7 @@ blocksize_pwrite (nbdkit_next *next,
>
>    /* Unaligned head */
>    if (offs & (h->minblock - 1)) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      drop = offs & (h->minblock - 1);
>      keep = MIN (h->minblock - drop, count);
>      if (next->pread (next, bounce, h->minblock, offs - drop, 0, err) == -1)
> @@ -321,6 +326,7 @@ blocksize_pwrite (nbdkit_next *next,
>
>    /* Aligned body */
>    while (count >= h->minblock) {
> +    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
>      keep = MIN (h->maxdata, ROUND_DOWN (count, h->minblock));
>      if (next->pwrite (next, buf, keep, offs, flags, err) == -1)
>        return -1;
> @@ -331,7 +337,7 @@ blocksize_pwrite (nbdkit_next *next,
>
>    /* Unaligned tail */
>    if (count) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      if (next->pread (next, bounce, h->minblock, offs, 0, err) == -1)
>        return -1;
>      memcpy (bounce, buf, count);
> @@ -371,6 +377,7 @@ blocksize_trim (nbdkit_next *next,
>
>    /* Aligned body */
>    while (count) {
> +    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
>      keep = MIN (h->maxlen, count);
>      if (next->trim (next, keep, offs, flags, err) == -1)
>        return -1;
> @@ -413,7 +420,7 @@ blocksize_zero (nbdkit_next *next,
>
>    /* Unaligned head */
>    if (offs & (h->minblock - 1)) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      drop = offs & (h->minblock - 1);
>      keep = MIN (h->minblock - drop, count);
>      if (next->pread (next, bounce, h->minblock, offs - drop, 0, err) == -1)
> @@ -428,6 +435,7 @@ blocksize_zero (nbdkit_next *next,
>
>    /* Aligned body */
>    while (count >= h->minblock) {
> +    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
>      keep = MIN (h->maxlen, ROUND_DOWN (count, h->minblock));
>      if (next->zero (next, keep, offs, flags, err) == -1)
>        return -1;
> @@ -437,7 +445,7 @@ blocksize_zero (nbdkit_next *next,
>
>    /* Unaligned tail */
>    if (count) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      if (next->pread (next, bounce, h->minblock, offs, 0, err) == -1)
>        return -1;
>      memset (bounce, 0, count);
> diff --git a/tests/test-blocksize-sharding.sh b/tests/test-blocksize-sharding.sh
> new file mode 100755
> index 00000000..177668ed
> --- /dev/null
> +++ b/tests/test-blocksize-sharding.sh
> @@ -0,0 +1,168 @@
> +#!/usr/bin/env bash
> +# nbdkit
> +# Copyright (C) 2018-2022 Red Hat Inc.
> +#
> +# Redistribution and use in source and binary forms, with or without
> +# modification, are permitted provided that the following conditions are
> +# met:
> +#
> +# * Redistributions of source code must retain the above copyright
> +# notice, this list of conditions and the following disclaimer.
> +#
> +# * Redistributions in binary form must reproduce the above copyright
> +# notice, this list of conditions and the following disclaimer in the
> +# documentation and/or other materials provided with the distribution.
> +#
> +# * Neither the name of Red Hat nor the names of its contributors may be
> +# used to endorse or promote products derived from this software without
> +# specific prior written permission.
> +#
> +# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
> +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> +# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
> +# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
> +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
> +# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
> +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
> +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
> +# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> +# SUCH DAMAGE.
> +
> +# Demonstrate a fix for a bug where blocksize could lose aligned writes
> +# run in parallel with unaligned writes
> +
> +source ./functions.sh
> +set -e
> +set -x
> +
> +requires_plugin eval
> +requires_nbdsh_uri
> +
> +files='blocksize-sharding.img blocksize-sharding.tmp'
> +rm -f $files
> +cleanup_fn rm -f $files
> +
> +# Script a server that requires 16-byte aligned requests, and which delays
> +# 4s after reads if a witness file exists.  Couple it with the delay filter
> +# that delays 2s before writes. If an unaligned and aligned write overlap,
> +# and can execute in parallel, we would have this timeline:
> +#
> +#     T1 aligned write 1's to 0/16     T2 unaligned write 2's to 4/12
> +#t=0  blocksize: next->pwrite(0, 16)   blocksize: next->pread(0, 16)
> +#       delay: wait 2s                   delay: next->pread(0, 16)
> +#       ...                                eval: read 0's, wait 4s
> +#t=2    delay: next->pwrite(0, 16)         ...
> +#         eval: write 1's                  ...
> +#     return                               ...
> +#t=4                                     return 0's (now stale)
> +#                                      blocksize: next->pwrite(0, 16)
> +#                                        delay: wait 2s
> +#t=6                                     delay: next->pwrite(0, 16)
> +#                                          eval: write stale RMW buffer
> +#
> +# leaving us with a sharded 0000222222222222 (T1's write is lost).
> +# But as long as the blocksize filter detects the overlap, we should end
> +# up with either 1111222222222222 (aligned write completed first), or with
> +# 1111111111111111 (unaligned write completed first), either taking 8s,
> +# but with no sharding.
> +#
> +# We also need an nbdsh script that kicks off parallel writes.
> +export script='
> +import os
> +import time
> +
> +witness = os.getenv("witness")
> +
> +def touch(path):
> +    open(path, "a").close()
> +
> +# First pass: check that two aligned operations work in parallel
> +# Total time should be closer to 2 seconds, rather than 4 if serialized
> +print("sanity check")
> +ba1 = bytearray(b"1"*16)
> +ba2 = bytearray(b"2"*16)
> +buf1 = nbd.Buffer.from_bytearray(ba1)
> +buf2 = nbd.Buffer.from_bytearray(ba2)
> +touch(witness)
> +start_t = time.time()
> +h.aio_pwrite(buf1, 0)
> +h.aio_pwrite(buf2, 0)
> +
> +while h.aio_in_flight() > 0:
> +    h.poll(-1)
> +end_t = time.time()
> +os.unlink(witness)
> +
> +out = h.pread(16,0)
> +print(out)
> +t = end_t - start_t
> +print(t)
> +assert out in [b"1"*16, b"2"*16]
> +assert t >= 2 and t <= 3
> +
> +# Next pass: try to kick off unaligned first
> +print("unaligned first")
> +h.zero(16, 0)
> +ba3 = bytearray(b"3"*12)
> +ba4 = bytearray(b"4"*16)
> +buf3 = nbd.Buffer.from_bytearray(ba3)
> +buf4 = nbd.Buffer.from_bytearray(ba4)
> +touch(witness)
> +start_t = time.time()
> +h.aio_pwrite(buf3, 4)
> +h.aio_pwrite(buf4, 0)
> +
> +while h.aio_in_flight() > 0:
> +    h.poll(-1)
> +end_t = time.time()
> +os.unlink(witness)
> +
> +out = h.pread(16,0)
> +print(out)
> +t = end_t - start_t
> +print(t)
> +assert out in [b"4"*4 + b"3"*12, b"4"*16]
> +assert t >= 8
> +
> +# Next pass: try to kick off aligned first
> +print("aligned first")
> +ba5 = bytearray(b"5"*16)
> +ba6 = bytearray(b"6"*12)
> +buf5 = nbd.Buffer.from_bytearray(ba5)
> +buf6 = nbd.Buffer.from_bytearray(ba6)
> +h.zero(16, 0)
> +touch(witness)
> +start_t = time.time()
> +h.aio_pwrite(buf5, 0)
> +h.aio_pwrite(buf6, 4)
> +
> +while h.aio_in_flight() > 0:
> +    h.poll(-1)
> +end_t = time.time()
> +os.unlink(witness)
> +
> +out = h.pread(16,0)
> +print(out)
> +t = end_t - start_t
> +print(t)
> +assert out in [b"5"*4 + b"6"*12, b"5"*16]
> +assert t >= 8
> +'
> +
> +# Now run everything
> +truncate --size=16 blocksize-sharding.img
> +export witness="$PWD/blocksize-sharding.tmp"
> +nbdkit -U - --filter=blocksize --filter=delay eval delay-write=2 \
> +    config='ln -sf "$(realpath "$3")" $tmpdir/$2' \
> +    img="$PWD/blocksize-sharding.img" tmp="$PWD/blocksize-sharding.tmp" \
> +    get_size='echo 16' block_size='echo 16 64K 1M' \
> +    thread_model='echo parallel' \
> +    zero='dd if=/dev/zero of=$tmpdir/img skip=$4 count=$3 \
> +      iflag=count_bytes,skip_bytes' \
> +    pread='
> +      dd if=$tmpdir/img skip=$4 count=$3 iflag=count_bytes,skip_bytes
> +      if [ -f $tmpdir/tmp ]; then sleep 4; fi ' \
> +    pwrite='dd of=$tmpdir/img seek=$4 conv=notrunc oflag=seek_bytes' \
> +    --run 'nbdsh -u "$uri" -c "$script"'
>


