[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency
Richard W.M. Jones
rjones at redhat.com
Tue May 24 14:24:15 UTC 2022
On Tue, May 24, 2022 at 09:01:28AM -0500, Eric Blake wrote:
> On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:
> > Hi,
> >
> > How does the blocksize filter take into account writes that end-up overlapping due to read-modify-write cycles?
> >
> > Specifically, suppose there are two non-overlapping writes handled by two different threads, that, due to blocksize requirements, overlap when expanded. I think there is a risk that one thread may partially undo the work of the other here.
> >
> > Looking at the code, it seems that writes of unaligned heads and tails are protected with a global lock.,
> > but writes of aligned data can occur concurrently.
> >
> > However, does this not miss the case where there is one unaligned write that overlaps with an aligned one?
> >
> > For example, with blocksize 10, we could have:
>
> The blocksize filter requires blocks to be sized as a power of 2,
> which 10 is not. I will try to restate your question using hex
> boundaries; please correct me if I'm mis-stating things.
>
> >
> > Thread 1: receives write request for offset=0, size=10
> > Thread 2: receives write request for offset=4, size=16
>
> minblock = 0x10
> Thread 1: receives write request for offset 0x00, size 0x10 (aligned request)
> Thread 2: receives write request for offset 0x04, size 0x16 (unaligned offset, unaligned size)
>
> Graphically, we are wanting to write the following, given initial disk
> contents of I:
>
> 0 0 0 0 1 1 1 1
> 0...4...8...a...0...4...8...a...
> start IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
> T1: AAAAAAAAAAAAAAAA
> T2: BBBBBBBBBBBBBBBBBBBBBB
>
> Because both writes are issued simultaneously, we do not know whether
> bytes 0x04 through 0x0f will be written as A or B. But our assumption
> is that because blocks are written atomically, we hope to get exactly
> one of the two following results, where either T1 beat T2:
The NBD protocol doesn't mention "atomic" anywhere. Are writes
actually guaranteed to be atomic like this? I'd be a bit more
convinced if this could be shown to happen in a non-overlapping case.
(It may be that we still want to fix the blocksize filter as you've
described -- using a rwlock seems like a reasonable and non-invasive
fix -- but I'd like to understand what is guaranteed and what is a
client's problem.)
> 0 0 0 0 1 1 1 1
> 0...4...8...a...0...4...8...a...
> end1: AAAABBBBBBBBBBBBBBBBBBBBBBIIIIII
>
> or where T2 beat T1:
>
> 0 0 0 0 1 1 1 1
> 0...4...8...a...0...4...8...a...
> end2: AAAAAAAAAAAAAAAABBBBBBBBBBIIIIII
>
> However, you are worried that a third possibility occurs:
>
> > Thread 1: acquires lock, reads bytes 0-4
> > Thread 2: does aligned write (no locking needed), writes bytes 0-10
> > Thread 1: writes bytes 0-10, overwriting data from Thread 2
>
> These do not match your initial conditions above (why is thread 1
> reading; why is thread 2 doing aligned action). Again, I'm rewriting
> this into something that I think matches what you intended to ask, but
> I welcome corrections:
>
> T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> for the unaligned head (it only needs 0x00-0x03, but we have to read a
> block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
>
> T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> blocking it.
>
> T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
>
> 0 0 0 0 1 1 1 1
> 0...4...8...a...0...4...8...a...
> end3: IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
>
> which does NOT reflect either of the possibilities where T1 and T2
> write atomically. Basically, we have become the victim of sharding.
>
> You are correct that it is annoying that this third possibility (where
> T1 appears to have never run) is possible with the blocksize filter.
> And we should probably consider it as a data-corruption bug. Your
> blocksize example of 10 (or 0x10) bytes is unlikely, but we are more
> likely to hit scenarios where an older guest assumes it is writing to
> 512-byte aligned hardware, while using the blocksize filter to try and
> guarantee RMW atomic access to 4k modern hardware. The older client
> will be unaware that it must avoid parallel writes that are
> 512-aligned but land in the same 4k page, so it seems like the
> blocksize filter should be doing that.
>
> You have just demonstrated that our current approach (grabbing a
> single semaphore, only around the unaligned portions), does not do
> what we hoped. So what WOULD protect us, while still allowing as much
> parallelism as possible? It sounds like we want a RW-lock (note: not
> in the sense of parallel pread and exclusive pwrite operations, but
> rather in the sense of parallel aligned operations taking a rdlock
> whether it is a pread or pwrite operation, and exclusive unaligned
> operations taking a wrlock whether pread or pwrite).
>
> pthread_rwlock_t would work, although it does not have guarantees on
> whether the implementation will favor readers or writers. It is also
> possible to implement our own using 2 mutexes (favors readers), or a
> condvar and mutex (easy to favor writers) [1]. Favoring readers can
> starve writers indefinitely but gets the most concurrency; favoring
> writers avoids starvation but has less performance. Other policies
> are RCU (wait-free for rdlock) or seqlock, borrowing ideas from the
> Linux kernel. Maybe we make it a configurable knob which lock policy
> to use, based on how likely the client is to do unaligned operations
> that require a write lock?
>
> [1] https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
More information about the Libguestfs
mailing list