[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency

Wed May 25 20:58:14 UTC 2022

On May 25 2022, Eric Blake <eblake at redhat.com> wrote:
> On Tue, May 24, 2022 at 08:45:02PM +0100, Nikolaus Rath wrote:
>> > However, you are worried that a third possibility occurs:
>> >
>> > T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
>> > for the unaligned head (it only needs 0x00-0x03, but we have to read a
>> > block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
>> >
>> > T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
>> > blocking it.
>> >
>> > T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
>> >
>> >        0   0   0   0   1   1   1   1
>> >        0...4...8...a...0...4...8...a...
>> > end3:  IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
>> >
>> > which does NOT reflect either of the possibilities where T1 and T2
>> > write atomically.  Basically, we have become the victim of sharding.
>> 
>> Yes, this is the scenario that I am worried about.
>> 
>> I this is a data corruption problem no matter if we assume that writes
>> should be atomically or not.
>
> Writes to a single block should be atomic.  Writes larger than a block
> need not be atomic.  But when we advertise a particular block size,
> clients should be safe in assuming that anything smaller than that
> block size is inadvisable (either it will fail as too small, or it
> will incur a RMW penalty), but that something at the block size should
> not need client-side protection when no other parallel requests are
> overlapping that block.  And that is where our blocksize filter is
> currently failing.

Agreed.

>> > You have just demonstrated that our current approach (grabbing a
>> > single semaphore, only around the unaligned portions), does not do
>> > what we hoped.  So what WOULD protect us, while still allowing as much
>> > parallelism as possible?
>> 
>> How about a per-block lock as implemented for the S3 plugin in
>> https://gitlab.com/nbdkit/nbdkit/-/merge_requests/10?
>
> That feels pretty heavyweight.  It may allow more parallelism, but at
> the expense of more resources.  I'm hoping that a single rwlock will
> do, although it may be dominated by starvation time when toggling
> between unaligned actions (serialized) vs aligned actions (parallel).

The resource usage is a single shared variable with an underlying mutex,
and a set (probably best implemented as a hash) containing at most as
many elements as block are being written in concurrently. This does not
sound like much to me. Am I missing something?

Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«