[Libguestfs] nbdkit blocksize filter, read-modify-write, and concurrency

Sun May 22 10:01:06 UTC 2022

On Sat, May 21, 2022 at 05:37:10PM +0100, Nikolaus Rath wrote:
> On May 21 2022, "Richard W.M. Jones" <rjones at redhat.com> wrote:
> > On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:
> >> Hi,
> >>
> >> How does the blocksize filter take into account writes that end up
> >> overlapping due to read-modify-write cycles?
> >>
> >> Specifically, suppose there are two non-overlapping writes handled
> >> by two different threads, that, due to blocksize requirements,
> >> overlap when expanded.  I think there is a risk that one thread may
> >> partially undo the work of the other here.
> >>
> >> Looking at the code, it seems that writes of unaligned heads and
> >> tails are protected with a global lock, but writes of aligned data
> >> can occur concurrently.
> >
> > I agree.
> >
> > Assuming the underlying plugin is NBDKIT_THREAD_MODEL_PARALLEL and no
> > other filters impose thread model limits, the blocksize filter does
> > not limit the thread model, so the thread model of nbdkit would also
> > be NBDKIT_THREAD_MODEL_PARALLEL.
> >
> > That means that two writes either on different connections or
> > pipelined on the same connection could happen at the same time.
> > “blocksize_pwrite” would be called concurrently for the two requests.
> >
> >> However, does this not miss the case where there is one unaligned
> >> write that overlaps with an aligned one?
> >>
> >> For example, with blocksize 10, we could have:
> >> 
> >> Thread 1: receives write request for offset=4, size=16
> >> Thread 2: receives write request for offset=0, size=10
> >> Thread 1: acquires lock, reads bytes 0-4
> >> Thread 2: does aligned write (no locking needed), writes bytes 0-10
> >> Thread 1: writes bytes 0-10, overwriting data from Thread 2
> >
> > I believe this analysis is correct.  (CC'd to Eric who knows a lot
> > more about this.)
> >
> > However I don't think it's a bug.  If a client doesn't want writes to
> > squash each other, then it shouldn't send overlapping requests.  I bet
> > the same thing happens with an SSD.
> 
> But the requests are not overlapping from the client point of view. They
> only become overlapping when the server applies its read-modify-write
> operation to align them to the blocksize.

I'm going to leave this one to Eric who's an expert on this ("write
tearing", I think).
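
For readers following the archive, the interleaving from the example can be replayed deterministically. The sketch below is illustrative Python, not nbdkit code: `aligned_span` and `replay` are hypothetical names, and the 20-byte `disk` stands in for the plugin's storage. It assumes the unaligned head is handled by reading the whole block, patching it, and writing it back, while the aligned write bypasses the lock.

```python
BLOCKSIZE = 10  # same toy block size as the example in the thread


def aligned_span(offset, size, blocksize=BLOCKSIZE):
    """Return (start, end) of the block-aligned range covering a request."""
    start = (offset // blocksize) * blocksize
    end = -(-(offset + size) // blocksize) * blocksize  # round up
    return start, end


def replay():
    """Replay the problematic interleaving step by step (no real threads)."""
    disk = bytearray(b"." * 20)  # "." marks the old contents

    # Unaligned writer (offset=4, size=16): its head touches bytes 4-9,
    # so it first reads block 0 in order to preserve bytes 0-3.
    bounce = bytearray(disk[0:BLOCKSIZE])

    # Aligned writer (offset=0, size=10): goes straight through,
    # because aligned writes are assumed not to take the lock.
    disk[0:10] = b"A" * 10

    # Unaligned writer patches its now-stale copy and writes the block back,
    # then writes its aligned remainder (bytes 10-19).
    bounce[4:10] = b"B" * 6
    disk[0:10] = bounce
    disk[10:20] = b"B" * 10
    return bytes(disk)


if __name__ == "__main__":
    print(aligned_span(4, 16))
    print(replay())
```

Either serial order of the two requests would leave bytes 0-3 holding the aligned writer's data. The replayed interleaving leaves the old contents there instead, which is the partial undo described in the original question.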

> I think you elsewhere said that the blocksize reported by the NBD server
> is only a preferred blocksize, so I'd be surprised if not following this
> "preference" results in data corruption.

This is true for NBD at the moment, but I think everyone accepts it's
a mistake in the protocol.  Eric was looking into this too.

> > NBD_CMD_FLAG_FUA is provided for clients that wish to ensure that a
> > write has been committed before sending another request.
> >
> > Do you have an example of a client which sends overlapping requests
> > and depends on particular behaviour of the server?  You may be able to
> > get it to work by using nbdkit-noparallel-filter which can be used to
> > serialize nbdkit.
> 
> I'm working with the kernel's NBD client, and this would explain all the
> mysterious data corruption issues that I've seen with the S3 plugin. But
> I have not yet confirmed definitively that this is the root cause.
> 
> For now, I'll avoid the blocksize filter and instead do the
> read-modify-write in the plugin with proper locking. If that fixes it,
> then I think we can conclude that the kernel is sending such requests
> (but, as I said above, I would not consider them overlapping nor would I
> consider this a bug).
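
As a rough illustration of "read-modify-write in the plugin with proper locking", here is a minimal Python sketch. It is an assumption about one possible approach, not the S3 plugin's actual code: `LockedBlockStore` and `run_once` are hypothetical names, the store is an in-memory buffer whose size is assumed to be a multiple of the block size, and a single lock covers every write, aligned or not.

```python
import threading


class LockedBlockStore:
    """Hypothetical backing store that only accepts whole-block writes."""

    def __init__(self, size, blocksize):
        self.blocksize = blocksize
        self.data = bytearray(size)   # size assumed to be a multiple of blocksize
        self.lock = threading.Lock()  # one lock for all writes

    def pwrite(self, buf, offset):
        bs = self.blocksize
        start = (offset // bs) * bs                 # round down to a block
        end = -(-(offset + len(buf)) // bs) * bs    # round up to a block
        with self.lock:
            # Read, modify, and write back under the same lock, so no other
            # write can land between the read and the write-back.
            bounce = bytearray(self.data[start:end])
            rel = offset - start
            bounce[rel:rel + len(buf)] = buf
            self.data[start:end] = bounce


def run_once():
    """Issue the two writes from the example on separate threads."""
    store = LockedBlockStore(size=20, blocksize=10)
    t1 = threading.Thread(target=store.pwrite, args=(b"A" * 10, 0))
    t2 = threading.Thread(target=store.pwrite, args=(b"B" * 16, 4))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return bytes(store.data)


if __name__ == "__main__":
    print(run_once())
```

With the lock held across the whole cycle, the result always matches one of the two serial orders. A single lock gives up parallelism between writes to unrelated blocks; per-block locks would be a finer-grained alternative at the cost of more bookkeeping.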

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit