[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency

Eric Blake eblake at redhat.com
Wed May 25 18:28:17 UTC 2022


On Tue, May 24, 2022 at 08:45:02PM +0100, Nikolaus Rath wrote:
> > However, you are worried that a third possibility occurs:
> >
> > T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> > for the unaligned head (it only needs 0x00-0x03, but we have to read a
> > block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
> >
> > T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> > blocking it.
> >
> > T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
> >
> >        0   0   0   0   1   1   1   1
> >        0...4...8...a...0...4...8...a...
> > end3:  IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
> >
> > which does NOT reflect either of the possibilities where T1 and T2
> > write atomically.  Basically, we have become the victim of sharding.
> 
> Yes, this is the scenario that I am worried about.
> 
> I think this is a data corruption problem no matter whether we assume
> that writes should be atomic or not.

Writes to a single block should be atomic.  Writes larger than a block
need not be atomic.  But when we advertise a particular block size,
clients can safely assume two things: anything smaller than that block
size is inadvisable (it will either fail as too small or incur an RMW
penalty), and a request at the block size needs no client-side
protection as long as no other parallel request overlaps that block.
The second assumption is where our blocksize filter is currently
failing.

> 
> In this scenario, the client has issued exactly one request to write
> (among other things) bytes 0-4. This request was executed successfully,
> so bytes 0-4 should have the new contents. There was no other write that
> affected this byte range, so whether the write was done atomically or
> not does not matter.

Basically, because the blocksize filter advertised a block size of 1,
the client had no reason to coordinate writes that do not overlap at
1-byte granularity.  So yes, the more I think about this, the more I
see it as a bug in the blocksize filter.

> 
> > You are correct that it is annoying that this third possibility (where
> > T1 appears to have never run) is possible with the blocksize filter.
> > And we should probably consider it as a data-corruption bug.  Your
> > blocksize example of 10 (or 0x10) bytes is unlikely, but we are more
> > likely to hit scenarios where an older guest assumes it is writing to
> > 512-byte aligned hardware, while using the blocksize filter to try and
> > guarantee RMW atomic access to 4k modern hardware.  The older client
> > will be unaware that it must avoid parallel writes that are
> > 512-aligned but land in the same 4k page, so it seems like the
> > blocksize filter should be doing that.
> 
> 
> Yes, I was just picking a very small number to illustrate the problem. I
> have seen this happen in practice with much larger blocksizes (32 kB+).
> 
> > You have just demonstrated that our current approach (grabbing a
> > single semaphore, only around the unaligned portions), does not do
> > what we hoped.  So what WOULD protect us, while still allowing as much
> > parallelism as possible?
> 
> How about a per-block lock as implemented for the S3 plugin in
> https://gitlab.com/nbdkit/nbdkit/-/merge_requests/10?

That feels pretty heavyweight.  It may allow more parallelism, but at
the expense of more resources.  I'm hoping that a single rwlock will
do, although it may be dominated by starvation time when toggling
between unaligned actions (serialized) vs aligned actions (parallel).

> 
> It might be a bit harder to do in plain C because of the absence of
> set datatypes, but I think it should work for the blocksize filter as
> well.

Just because the language does not have a builtin set type doesn't
mean we can't code one; but you're right that it would be more code.

I'm still hoping to post a first draft of a rwlock in the blocksize
filter later today (the hardest part is trying to come up with a
testsuite that will reliably demonstrate the race without the patch).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
