[Libguestfs] [nbdkit PATCH v2] blocksize: Avoid losing aligned writes to RMW race

Eric Blake eblake at redhat.com
Thu May 26 21:00:08 UTC 2022


[We are still investigating whether a CVE needs to be assigned.]

We have a bug where the blocksize filter can lose the effects of an
aligned write that happens in parallel with an overlapping unaligned
write.  In general, a client should be extremely cautious about
overlapping writes; NBD does not promise which of the two writes
lands on the disk (or, if both land partially at block-level
granularity, which write landed last in any given block).  However,
a client should still be able to assume that even when two writes
partially overlap, the final state of the non-overlapping portion of
the disk will be that of the one write explicitly touching that
portion, rather than stale data from before either write touched
that block.

Fix the bug by switching the blocksize filter locking from a mutex to
a rwlock.  We still need mutual exclusion during all unaligned
operations (whether reading or writing to the plugin), because we
share the same bounce buffer among all such code, which we get via the
mutual exclusion of wrlock.  But we now also take the lock for
shared access (rdlock) around aligned write-like operations (pwrite,
trim, zero); these can operate in parallel with one another, but
must not land in the window between when an overlapping unaligned
operation has read from the plugin and when it writes back, so that
we do not end up
re-writing stale contents that undo the portion of the aligned write
that was outside the unaligned region [it's ironic that we only need
this protection on write I/O, but get it by using the rdlock feature
of the rwlock].  Read-like operations (pread, extents, cache) don't
need to be protected from RMW operations, because it is already the
user's problem if they execute parallel overlapping reads while
modifying the same portion of the image; and they will still end up
seeing either the state before or after the unaligned write,
indistinguishable from if we had added more locking.

Whether this lock favors rdlock (aka aligned pwrite) or wrlock (aka
unaligned access) is left to the pthread implementation; if we need
to be more precise, we'll have to introduce our own rwlock
implementation on top of lower-level primitives
(https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock mentions
the use of two mutexes to favor readers, or a condvar/mutex to favor
writers).  This is also not very fine-grained; it may turn out that we
could get more performance if instead of locking the entire operation
(across all clients, and regardless of whether the offsets overlap),
we instead just lock metadata and then track whether a new operation
overlaps with an existing unaligned operation, or even add in support
for parallel unaligned operations by having more than one bounce
buffer.  But if performance truly matters, you're better off fixing
the client to do aligned access in the first place, rather than
needing the blocksize filter.

Add a test that fails without the change to blocksize.c.  With some
careful timing (using the delay filter to stall writes reaching the
plugin, and the plugin itself to stall reads, so that we are
reasonably sure of the overall timeline), we can demonstrate the bug
in the unpatched code, where an aligned write is lost when it lands
in the wrong part of an unaligned RMW cycle; with the fix, the test
instead shows that the overlapping operations were serialized.  We
may need to tweak the test further to be reliable under heavy load,
but hopefully 2 seconds per stall (a successful run takes 18 seconds)
is sufficient for now.

Reported-by: Nikolaus Rath <Nikolaus at rath.org>
Fixes: 1aadbf9971 ("blocksize: Allow parallel requests", v1.25.3)
---

In v2:
- Implement the blocksize fix.
- Drop RFC; I think this is ready, other than determining if we want
  to tag it with a CVE number.
- Improve the test: print out more details before assertions, to aid
  in debugging if it ever dies under CI resources

 tests/Makefile.am                |   2 +
 filters/blocksize/blocksize.c    |  26 +++--
 tests/test-blocksize-sharding.sh | 168 +++++++++++++++++++++++++++++++
 3 files changed, 187 insertions(+), 9 deletions(-)
 create mode 100755 tests/test-blocksize-sharding.sh

diff --git a/tests/Makefile.am b/tests/Makefile.am
index b6d30c0a..67e7618b 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -1376,11 +1376,13 @@ TESTS += \
 	test-blocksize.sh \
 	test-blocksize-extents.sh \
 	test-blocksize-default.sh \
+	test-blocksize-sharding.sh \
 	$(NULL)
 EXTRA_DIST += \
 	test-blocksize.sh \
 	test-blocksize-extents.sh \
 	test-blocksize-default.sh \
+	test-blocksize-sharding.sh \
 	$(NULL)

 # blocksize-policy filter test.
diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c
index 03da4971..1c81d5e3 100644
--- a/filters/blocksize/blocksize.c
+++ b/filters/blocksize/blocksize.c
@@ -51,10 +51,15 @@

 #define BLOCKSIZE_MIN_LIMIT (64U * 1024)

-/* In order to handle parallel requests safely, this lock must be held
- * when using the bounce buffer.
+/* Lock in order to handle overlapping requests safely.
+ *
+ * Grabbed for exclusive access (wrlock) when using the bounce buffer.
+ *
+ * Grabbed for shared access (rdlock) when doing aligned writes.
+ * These can happen in parallel with one another, but must not land in
+ * between the read and write of an unaligned RMW operation.
  */
-static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

 /* A single bounce buffer for alignment purposes, guarded by the lock.
  * Size it to the maximum we allow for minblock.
@@ -255,7 +260,7 @@ blocksize_pread (nbdkit_next *next,

   /* Unaligned head */
   if (offs & (h->minblock - 1)) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     drop = offs & (h->minblock - 1);
     keep = MIN (h->minblock - drop, count);
     if (next->pread (next, bounce, h->minblock, offs - drop, flags, err) == -1)
@@ -278,7 +283,7 @@ blocksize_pread (nbdkit_next *next,

   /* Unaligned tail */
   if (count) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     if (next->pread (next, bounce, h->minblock, offs, flags, err) == -1)
       return -1;
     memcpy (buf, bounce, count);
@@ -306,7 +311,7 @@ blocksize_pwrite (nbdkit_next *next,

   /* Unaligned head */
   if (offs & (h->minblock - 1)) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     drop = offs & (h->minblock - 1);
     keep = MIN (h->minblock - drop, count);
     if (next->pread (next, bounce, h->minblock, offs - drop, 0, err) == -1)
@@ -321,6 +326,7 @@ blocksize_pwrite (nbdkit_next *next,

   /* Aligned body */
   while (count >= h->minblock) {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
     keep = MIN (h->maxdata, ROUND_DOWN (count, h->minblock));
     if (next->pwrite (next, buf, keep, offs, flags, err) == -1)
       return -1;
@@ -331,7 +337,7 @@ blocksize_pwrite (nbdkit_next *next,

   /* Unaligned tail */
   if (count) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     if (next->pread (next, bounce, h->minblock, offs, 0, err) == -1)
       return -1;
     memcpy (bounce, buf, count);
@@ -371,6 +377,7 @@ blocksize_trim (nbdkit_next *next,

   /* Aligned body */
   while (count) {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
     keep = MIN (h->maxlen, count);
     if (next->trim (next, keep, offs, flags, err) == -1)
       return -1;
@@ -413,7 +420,7 @@ blocksize_zero (nbdkit_next *next,

   /* Unaligned head */
   if (offs & (h->minblock - 1)) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     drop = offs & (h->minblock - 1);
     keep = MIN (h->minblock - drop, count);
     if (next->pread (next, bounce, h->minblock, offs - drop, 0, err) == -1)
@@ -428,6 +435,7 @@ blocksize_zero (nbdkit_next *next,

   /* Aligned body */
   while (count >= h->minblock) {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
     keep = MIN (h->maxlen, ROUND_DOWN (count, h->minblock));
     if (next->zero (next, keep, offs, flags, err) == -1)
       return -1;
@@ -437,7 +445,7 @@ blocksize_zero (nbdkit_next *next,

   /* Unaligned tail */
   if (count) {
-    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
     if (next->pread (next, bounce, h->minblock, offs, 0, err) == -1)
       return -1;
     memset (bounce, 0, count);
diff --git a/tests/test-blocksize-sharding.sh b/tests/test-blocksize-sharding.sh
new file mode 100755
index 00000000..177668ed
--- /dev/null
+++ b/tests/test-blocksize-sharding.sh
@@ -0,0 +1,168 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018-2022 Red Hat Inc.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+# Demonstrate a fix for a bug where blocksize could lose aligned writes
+# run in parallel with unaligned writes
+
+source ./functions.sh
+set -e
+set -x
+
+requires_plugin eval
+requires_nbdsh_uri
+
+files='blocksize-sharding.img blocksize-sharding.tmp'
+rm -f $files
+cleanup_fn rm -f $files
+
+# Script a server that requires 16-byte aligned requests, and which delays
+# 4s after reads if a witness file exists.  Couple it with the delay filter
+# that delays 2s before writes. If an unaligned and aligned write overlap,
+# and can execute in parallel, we would have this timeline:
+#
+#     T1 aligned write 1's to 0/16     T2 unaligned write 2's to 4/12
+#t=0  blocksize: next->pwrite(0, 16)   blocksize: next->pread(0, 16)
+#       delay: wait 2s                   delay: next->pread(0, 16)
+#       ...                                eval: read 0's, wait 4s
+#t=2    delay: next->pwrite(0, 16)         ...
+#         eval: write 1's                  ...
+#     return                               ...
+#t=4                                     return 0's (now stale)
+#                                      blocksize: next->pwrite(0, 16)
+#                                        delay: wait 2s
+#t=6                                     delay: next->pwrite(0, 16)
+#                                          eval: write stale RMW buffer
+#
+# leaving us with a sharded 0000222222222222 (T1's write is lost).
+# But as long as the blocksize filter detects the overlap, we should end
+# up with either 1111222222222222 (aligned write completed first), or with
+# 1111111111111111 (unaligned write completed first), either taking 8s,
+# but with no sharding.
+#
+# We also need an nbdsh script that kicks off parallel writes.
+export script='
+import os
+import time
+
+witness = os.getenv("witness")
+
+def touch(path):
+    open(path, "a").close()
+
+# First pass: check that two aligned operations work in parallel
+# Total time should be closer to 2 seconds, rather than 4 if serialized
+print("sanity check")
+ba1 = bytearray(b"1"*16)
+ba2 = bytearray(b"2"*16)
+buf1 = nbd.Buffer.from_bytearray(ba1)
+buf2 = nbd.Buffer.from_bytearray(ba2)
+touch(witness)
+start_t = time.time()
+h.aio_pwrite(buf1, 0)
+h.aio_pwrite(buf2, 0)
+
+while h.aio_in_flight() > 0:
+    h.poll(-1)
+end_t = time.time()
+os.unlink(witness)
+
+out = h.pread(16,0)
+print(out)
+t = end_t - start_t
+print(t)
+assert out in [b"1"*16, b"2"*16]
+assert t >= 2 and t <= 3
+
+# Next pass: try to kick off unaligned first
+print("unaligned first")
+h.zero(16, 0)
+ba3 = bytearray(b"3"*12)
+ba4 = bytearray(b"4"*16)
+buf3 = nbd.Buffer.from_bytearray(ba3)
+buf4 = nbd.Buffer.from_bytearray(ba4)
+touch(witness)
+start_t = time.time()
+h.aio_pwrite(buf3, 4)
+h.aio_pwrite(buf4, 0)
+
+while h.aio_in_flight() > 0:
+    h.poll(-1)
+end_t = time.time()
+os.unlink(witness)
+
+out = h.pread(16,0)
+print(out)
+t = end_t - start_t
+print(t)
+assert out in [b"4"*4 + b"3"*12, b"4"*16]
+assert t >= 8
+
+# Next pass: try to kick off aligned first
+print("aligned first")
+ba5 = bytearray(b"5"*16)
+ba6 = bytearray(b"6"*12)
+buf5 = nbd.Buffer.from_bytearray(ba5)
+buf6 = nbd.Buffer.from_bytearray(ba6)
+h.zero(16, 0)
+touch(witness)
+start_t = time.time()
+h.aio_pwrite(buf5, 0)
+h.aio_pwrite(buf6, 4)
+
+while h.aio_in_flight() > 0:
+    h.poll(-1)
+end_t = time.time()
+os.unlink(witness)
+
+out = h.pread(16,0)
+print(out)
+t = end_t - start_t
+print(t)
+assert out in [b"5"*4 + b"6"*12, b"5"*16]
+assert t >= 8
+'
+
+# Now run everything
+truncate --size=16 blocksize-sharding.img
+export witness="$PWD/blocksize-sharding.tmp"
+nbdkit -U - --filter=blocksize --filter=delay eval delay-write=2 \
+    config='ln -sf "$(realpath "$3")" $tmpdir/$2' \
+    img="$PWD/blocksize-sharding.img" tmp="$PWD/blocksize-sharding.tmp" \
+    get_size='echo 16' block_size='echo 16 64K 1M' \
+    thread_model='echo parallel' \
+    zero='dd if=/dev/zero of=$tmpdir/img skip=$4 count=$3 \
+      iflag=count_bytes,skip_bytes' \
+    pread='
+      dd if=$tmpdir/img skip=$4 count=$3 iflag=count_bytes,skip_bytes
+      if [ -f $tmpdir/tmp ]; then sleep 4; fi ' \
+    pwrite='dd of=$tmpdir/img seek=$4 conv=notrunc oflag=seek_bytes' \
+    --run 'nbdsh -u "$uri" -c "$script"'
-- 
2.36.1


