[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency

Richard W.M. Jones rjones at redhat.com
Tue May 24 14:24:15 UTC 2022


On Tue, May 24, 2022 at 09:01:28AM -0500, Eric Blake wrote:
> On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:
> > Hi,
> > 
> > How does the blocksize filter take into account writes that end-up overlapping due to read-modify-write cycles?
> > 
> > Specifically, suppose there are two non-overlapping writes handled by two different threads, that, due to blocksize requirements, overlap when expanded. I think there is a risk that one thread may partially undo the work of the other here.
> > 
> > Looking at the code, it seems that writes of unaligned heads and
> > tails are protected with a global lock, but writes of aligned data
> > can occur concurrently.
> > 
> > However, does this not miss the case where there is one unaligned write that overlaps with an aligned one? 
> > 
> > For example, with blocksize 10, we could have:
> 
> The blocksize filter requires blocks to be sized as a power of 2,
> which 10 is not.  I will try to restate your question using hex
> boundaries; please correct me if I'm mis-stating things.
> 
> > 
> > Thread 1: receives write request for offset=0, size=10
> > Thread 2: receives write request for offset=4, size=16
> 
> minblock = 0x10
> Thread 1: receives write request for offset 0x00, size 0x10 (aligned request)
> Thread 2: receives write request for offset 0x04, size 0x16 (unaligned offset, unaligned size)
> 
> Graphically, we are wanting to write the following, given initial disk
> contents of I:
> 
>        0   0   0   0   1   1   1   1
>        0...4...8...c...0...4...8...c...
> start  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
> T1:    AAAAAAAAAAAAAAAA
> T2:        BBBBBBBBBBBBBBBBBBBBBB
> 
> Because both writes are issued simultaneously, we do not know whether
> bytes 0x04 through 0x0f will be written as A or B.  But assuming that
> blocks are written atomically, we expect exactly one of the two
> following results.  Either T1 beat T2:

The NBD protocol doesn't mention "atomic" anywhere.  Are writes
actually guaranteed to be atomic like this?  I'd be a bit more
convinced if this could be shown to happen in a non-overlapping case.

(It may be that we still want to fix the blocksize filter as you've
described -- using a rwlock seems like a reasonable and non-invasive
fix -- but I'd like to understand what is guaranteed and what is a
client's problem.)

>        0   0   0   0   1   1   1   1
>        0...4...8...c...0...4...8...c...
> end1:  AAAABBBBBBBBBBBBBBBBBBBBBBIIIIII
> 
> or where T2 beat T1:
> 
>        0   0   0   0   1   1   1   1
>        0...4...8...c...0...4...8...c...
> end2:  AAAAAAAAAAAAAAAABBBBBBBBBBIIIIII
> 
> However, you are worried that a third possibility occurs:
> 
> > Thread 1: acquires lock, reads bytes 0-4
> > Thread 2: does aligned write (no locking needed), writes bytes 0-10
> > Thread 1: writes bytes 0-10, overwriting data from Thread 2
> 
> These do not match your initial conditions above (why is thread 1
> reading; why is thread 2 doing aligned action).  Again, I'm rewriting
> this into something that I think matches what you intended to ask, but
> I welcome corrections:
> 
> T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> for the unaligned head (it only needs 0x00-0x03, but we have to read a
> block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
> 
> T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> blocking it.
> 
> T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
> 
>        0   0   0   0   1   1   1   1
>        0...4...8...c...0...4...8...c...
> end3:  IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
> 
> which does NOT reflect either of the possibilities where T1 and T2
> write atomically.  Basically, we have become the victim of sharding.
> 
> You are correct that it is annoying that this third possibility (where
> T1 appears to have never run) is possible with the blocksize filter.
> And we should probably consider it as a data-corruption bug.  Your
> blocksize example of 10 (or 0x10) bytes is unlikely in practice, but
> we are more likely to hit scenarios where an older guest assumes it
> is writing to 512-byte-aligned hardware, while the blocksize filter
> is used to guarantee atomic RMW access to modern 4K hardware.  The
> older client will be unaware that it must avoid parallel writes that
> are 512-aligned but land in the same 4K page, so it seems the
> blocksize filter should enforce that on its behalf.
> 
> You have just demonstrated that our current approach (grabbing a
> single semaphore only around the unaligned portions) does not do
> what we hoped.  So what WOULD protect us, while still allowing as much
> parallelism as possible?  It sounds like we want a RW-lock (note: not
> in the sense of parallel pread and exclusive pwrite operations, but
> rather in the sense of parallel aligned operations taking a rdlock
> whether it is a pread or pwrite operation, and exclusive unaligned
> operations taking a wrlock whether pread or pwrite).
> 
> pthread_rwlock_t would work, although it does not have guarantees on
> whether the implementation will favor readers or writers.  It is also
> possible to implement our own using 2 mutexes (favors readers), or a
> condvar and mutex (easy to favor writers) [1].  Favoring readers can
> starve writers indefinitely but gets the most concurrency; favoring
> writers avoids starvation but has less performance.  Other policies
> are RCU (wait-free for rdlock) or seqlock, borrowing ideas from the
> Linux kernel.  Maybe we make it a configurable knob which lock policy
> to use, based on how likely the client is to do unaligned operations
> that require a write lock?
> 
> [1] https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

