[dm-devel] [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption

Thu Dec 5 22:34:01 UTC 2019

On Thu, 5 Dec 2019, Nikos Tsironis wrote:
> On 12/4/19 10:17 PM, Mike Snitzer wrote:
> > On Wed, Dec 04 2019 at  2:58pm -0500,
> > Eric Wheeler <dm-devel at lists.ewheeler.net> wrote:
> > 
> > > On Wed, 4 Dec 2019, Nikos Tsironis wrote:
> > >
> > > > The thin provisioning target maintains per thin device mappings that map
> > > > virtual blocks to data blocks in the data device.
> > > >
> > > > When we write to a shared block, in case of internal snapshots, or
> > > > provision a new block, in case of external snapshots, we copy the shared
> > > > block to a new data block (COW), update the mapping for the relevant
> > > > virtual block and then issue the write to the new data block.
> > > >
> > > > Suppose the data device has a volatile write-back cache and the
> > > > following sequence of events occurs:
> > >
> > > For those with NV caches, can the data disk flush be optional (maybe as a
> > > table flag)?
> > 
> > IIRC block core should avoid issuing the flush if not needed.  I'll have
> > a closer look to verify as much.
> > 
> 
> For devices without a volatile write-back cache block core strips off
> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
> completes empty REQ_PREFLUSH requests before entering the driver.
> 
> This happens in generic_make_request_checks():
> 
> 		/*
> 		 * Filter flush bio's early so that make_request based
> 		 * drivers without flush support don't have to worry
> 		 * about them.
> 		 */
> 		if (op_is_flush(bio->bi_opf) &&
> 		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> 		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
> 		        if (!nr_sectors) {
> 		                status = BLK_STS_OK;
> 		                goto end_io;
> 		        }
> 		}
> 
> If I am not mistaken, it all depends on whether the underlying device
> reports the existence of a write back cache or not.
> 
> You could check this by looking at /sys/block/<device>/queue/write_cache
> If it says "write back" then flushes will be issued.
> 
> In case the sysfs entry reports a "write back" cache for a device with a
> non-volatile write cache, I think you can change the kernel's view of
> the device by writing to this entry (you could also create a udev rule
> for this).
> 
> This way you can set the write cache as write through. This will
> eliminate the cache flushes issued by the kernel, without altering the
> device state (Documentation/block/queue-sysfs.rst).

Interesting, I'll remember that.  I think this is a documentation bug.
Isn't this backwards?
	'This means that it might not be safe to toggle the setting from 
	"write back" to "write through", since that will also eliminate
	cache flushes issued by the kernel.'
	[https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]
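
For reference, the write_cache check Nikos describes can be sketched in
userspace C. This is only a rough sketch, not anything from the patch set:
the function names wc_value_is_write_back() and disk_has_write_back_cache()
are made up for illustration, and it assumes the sysfs entry contains
either "write back" or "write through" followed by a newline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if the string read from
 * /sys/block/<device>/queue/write_cache says "write back", i.e. the
 * kernel believes the device has a volatile write-back cache and will
 * pass flushes down to it.
 */
bool wc_value_is_write_back(const char *value)
{
	size_t len;

	assert(value != NULL);
	/* sysfs appends a newline; compare only the keyword itself. */
	len = strcspn(value, "\n");
	return len == strlen("write back") &&
	       strncmp(value, "write back", len) == 0;
}

/*
 * Hypothetical helper: reads the write_cache entry for a disk name
 * such as "sda".  Returns 1 for "write back", 0 for any other value,
 * and -1 if the entry cannot be read.
 */
int disk_has_write_back_cache(const char *disk)
{
	char path[256], value[32];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path),
		     "/sys/block/%s/queue/write_cache", disk);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(value, sizeof(value), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	return wc_value_is_write_back(value) ? 1 : 0;
}
```

Calling disk_has_write_back_cache("sda") would then return 1 when flushes
are expected to reach the device. It only reports the kernel's view of
the cache, which may differ from what the hardware actually does.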

How does this work with stacking blockdevs?  Does it inherit from the
lower-level dev?  If an upper level is misconfigured, would a writeback at
higher levels clear the flush for lower levels?

--
Eric Wheeler

> Nikos
> 
> > Mike
> > 
>