[dm-devel] [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption

Nikos Tsironis ntsironis at arrikto.com
Mon Dec 9 14:25:10 UTC 2019


On 12/6/19 10:06 PM, Eric Wheeler wrote:
> On Fri, 6 Dec 2019, Nikos Tsironis wrote:
>> On 12/6/19 12:34 AM, Eric Wheeler wrote:
>>> On Thu, 5 Dec 2019, Nikos Tsironis wrote:
>>>> On 12/4/19 10:17 PM, Mike Snitzer wrote:
>>>>> On Wed, Dec 04 2019 at  2:58pm -0500,
>>>>> Eric Wheeler <dm-devel at lists.ewheeler.net> wrote:
>>>>>
>>>>>> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>>>>>>
>>>>>>> The thin provisioning target maintains per thin device mappings
>>>>>>> that map virtual blocks to data blocks in the data device.
>>>>>>>
>>>>>>> When we write to a shared block, in case of internal snapshots, or
>>>>>>> provision a new block, in case of external snapshots, we copy the
>>>>>>> shared block to a new data block (COW), update the mapping for the
>>>>>>> relevant virtual block and then issue the write to the new data
>>>>>>> block.
>>>>>>>
>>>>>>> Suppose the data device has a volatile write-back cache and the
>>>>>>> following sequence of events occur:
>>>>>>
>>>>>> For those with NV caches, can the data disk flush be optional
>>>>>> (maybe as a table flag)?
>>>>>
>>>>> IIRC block core should avoid issuing the flush if not needed.  I'll have
>>>>> a closer look to verify as much.
>>>>>
>>>>
>>>> For devices without a volatile write-back cache block core strips off
>>>> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
>>>> completes empty REQ_PREFLUSH requests before entering the driver.
>>>>
>>>> This happens in generic_make_request_checks():
>>>>
>>>>    /*
>>>>     * Filter flush bio's early so that make_request based
>>>>     * drivers without flush support don't have to worry
>>>>     * about them.
>>>>     */
>>>>    if (op_is_flush(bio->bi_opf) &&
>>>>        !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
>>>>            bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
>>>>            if (!nr_sectors) {
>>>>                    status = BLK_STS_OK;
>>>>                    goto end_io;
>>>>            }
>>>>    }
>>>>
>>>> If I am not mistaken, it all depends on whether the underlying device
>>>> reports the existence of a write back cache or not.
>>>>
>>>> You could check this by looking at /sys/block/<device>/queue/write_cache
>>>> If it says "write back" then flushes will be issued.
>>>>
>>>> In case the sysfs entry reports a "write back" cache for a device with a
>>>> non-volatile write cache, I think you can change the kernel's view of
>>>> the device by writing to this entry (you could also create a udev rule
>>>> for this).
>>>>
>>>> This way you can set the write cache as write through. This will
>>>> eliminate the cache flushes issued by the kernel, without altering the
>>>> device state (Documentation/block/queue-sysfs.rst).
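
[For reference, such a udev rule might look like the following sketch.
This is a hypothetical example: sdX is a placeholder for the actual
device name, and the rule file path is just a suggestion, e.g.
/etc/udev/rules.d/60-write-cache.rules.]

```
# Hypothetical rule: tell the kernel to treat sdX's cache as write
# through, suppressing flushes without changing the device state.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sdX", \
    ATTR{queue/write_cache}="write through"
```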
>>>
>>> Interesting, I'll remember that. I think this is a documentation bug, isn't
>>> this backwards:
>>>   'This means that it might not be safe to toggle the setting from
>>>   "write back" to "write through", since that will also eliminate
>>>   cache flushes issued by the kernel.'
>>>   [https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]
>>>
>>>
>>
>> If a device has a volatile cache then the write_cache sysfs entry will
>> be "write back" and we have to issue flushes to the device. In all other
>> cases write_cache will be "write through".
> 
> Forgive my misunderstanding, but if I have a RAID controller with a cache
> and BBU with the RAID volume set to write-back mode in the controller, are
> you saying that the sysfs entry should show "write through"? I had always
> understood that it was safe to disable flushes with a non-volatile cache
> and a non-volatile cache is called a write-back cache.
> 

From the device perspective, a non-volatile cache operating in
write-back mode is indeed called a write-back cache.

But, from the OS perspective, a non-volatile cache (whether it operates
in write-back or write-through mode), for all intents and purposes, is
equivalent to a write-through cache: when the device acknowledges a
write it's guaranteed that the written data won't be lost in case of
power loss.

So, in the case of a controller with a BBU and/or a non-volatile cache,
you don't care what the device does internally. All that matters is that
acked writes won't be lost in case of power failure.

I believe that the sysfs entry reports exactly that: whether the kernel
should treat the device as having a volatile write-back cache, in which
case we have to issue flushes to ensure the data are properly persisted,
or as having no cache or a write-through cache, in which case flushes
are not necessary.
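
To illustrate, here is a minimal Python sketch (not kernel code) of the
decision block core makes, mirroring the generic_make_request_checks()
snippet quoted above. The flag values are toy placeholders, not the real
kernel constants:

```python
REQ_PREFLUSH = 1 << 0  # toy flag values, for illustration only
REQ_FUA = 1 << 1

def filter_flush(opf, nr_sectors, has_writeback_cache):
    """Toy model of the flush filtering in generic_make_request_checks().

    Returns (opf, completed_early): the possibly stripped request flags
    and whether the request is completed before reaching the driver.
    """
    if (opf & (REQ_PREFLUSH | REQ_FUA)) and not has_writeback_cache:
        # No volatile write-back cache: strip the flush/FUA bits.
        opf &= ~(REQ_PREFLUSH | REQ_FUA)
        if nr_sectors == 0:
            # Empty flush: nothing left to do, complete immediately.
            return opf, True
    return opf, False  # pass the request on to the driver
```

So an empty flush to a "write through" device never reaches the driver,
while the same flush to a "write back" device is passed through intact.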

> It is strange to me that this terminology in the kernel would be backwards
> from how it is expressed in a RAID controller. Incidentally, I have an
> Avago MegaRAID 9460 with 2 volumes. The first volume (sda) is in
> write-back mode and the second volume is write-through. In both cases
> sysfs reports "write through":
> 
> [root at hv1-he ~]# cat /sys/block/sda/queue/write_cache
> write through
> [root at hv1-he ~]# cat /sys/block/sdb/queue/write_cache
> write through
> 
> This is running 4.19.75, so we can at least say that the 9460 does not
> support proper representation of the VD cache mode in sysfs, but which is
> correct? Should it not be that the sysfs entry reports the same cache mode
> of the RAID controller?
> 

My guess is that the controller reports to the kernel that it has a
write-through cache (or no cache at all) on purpose, to avoid
unnecessary flushes. Since it can ensure the persistence of acked writes
by other means, e.g., a BBU, as far as the kernel is concerned the
device can be treated as one with a write-through cache.

Moreover, I think that MegaRAID controllers, in the default write back
mode, automatically switch the write policy to write-through if the BBU
is low, has failed or is being charged.

So, I think it makes sense to report to the kernel that the device has a
write-through cache, even though internally the device operates the
cache in write-back mode.

Nikos

> -Eric
> 
>>
>> It's not safe to toggle write_cache from "write back" to "write through"
>> because this stops the kernel from sending flushes to the device, but
>> the device will continue caching writes. So, if something goes wrong,
>> e.g., a power failure, you might lose acknowledged writes or end up
>> with data corruption.
>>
>>> How does this work with stacking blockdevs?  Does it inherit from the
>>> lower-level dev? If an upper-level is misconfigured, would a writeback at
>>> higher levels would clear the flush for lower levels?
>>>
>>
>> As Mike already mentioned in another reply to this thread, the device
>> capabilities are stacked up when each device is created and are
>> inherited from component devices.
>>
>> The logic for device stacking is implemented in various functions in
>> block/blk-settings.c (blk_set_stacking_limits(), blk_stack_limits(),
>> etc.), which are used also by DM core in dm-table.c to set the
>> capabilities of DM devices.
>>
>> If an upper layer device reports a "write back" cache then flushes will
>> be issued to it by the kernel, no matter what the capabilities of the
>> underlying devices are.
>>
>> Normally an upper layer device would report a "write back" cache if at
>> least one underlying device supports flushes. But, some DM devices
>> report a "write back" cache irrespective of the underlying devices,
>> e.g., dm-thin, dm-clone, dm-cache. This is required so they can flush
>> their own metadata. They then pass the flush request down to the
>> underlying device and rely on block core to do the right thing: either
>> actually send the flush to the device, if it has a volatile cache, or
>> complete it immediately.
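
[As a toy illustration of the stacking behaviour described above. This
is a deliberately simplified model, not the actual blk_stack_limits()
logic; the function name and string values are mine:]

```python
def stacked_write_cache(component_caches, forces_writeback=False):
    """Simplified model of cache-capability stacking: the stacked
    device advertises a volatile write-back cache if any component
    device does, or if the target itself (e.g. dm-thin) forces it
    in order to be able to flush its own metadata."""
    if forces_writeback:
        return "write back"
    if any(cache == "write back" for cache in component_caches):
        return "write back"
    return "write through"
```

E.g., a linear target over one "write back" and one "write through"
device reports "write back", while dm-thin reports "write back" even
over all-"write through" components.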
>>
>> Nikos
>>
>>> --
>>> Eric Wheeler
>>>
>>>
>>>
>>>> Nikos
>>>>
>>>>> Mike
>>>>>
>>>>
>>
>>




More information about the dm-devel mailing list