[dm-devel] Fwd: dm-writecache - Unexpected Data After Host Crash

Mikulas Patocka mpatocka at redhat.com
Mon Jun 26 15:40:36 UTC 2023



On Thu, 15 Jun 2023, Marc Smith wrote:

> Hi Mikulas,
> 
> Apologies for the direct message, but I noticed you're the original
> author of dm-writecache and wanted to get your thoughts on the issue
> described below... I've been going through the code in dm-writecache.c
> and trying to educate myself; it seems that writecache_read_metadata()
> reads the full set of metadata only in the writecache_resume()
> function, while writecache_ctr() retrieves only the first block of
> metadata. So this seems to explain why, in my example below, after a
> system crash (e.g., reset / power off / panic) the dirty data that
> resides on the SSD is not populated on VG activation.
> 
> It made me wonder if this is perhaps a bug in LVM2 where the 'resume'
> message is not being issued to the DM target (and therefore not
> populating the in-memory metadata structure properly).
> 
> I also tried an older 5.4.x kernel, but the issue persists in that
> environment.
> 
> Thanks for your time.
> 
> 
> --Marc

Hi

I reproduced the issue that you reported, but I don't think this is a 
bug in writecache; it is expected behavior.

Note that O_DIRECT doesn't guarantee that the written data is flushed to 
stable storage. With O_DIRECT we bypass the kernel page cache, but the 
data may still be cached in the hardware write cache inside the disk, 
and if a power failure happens, it may be lost. You need to use the 
"fsync" or "fdatasync" syscall - it issues the flush command to the 
disk, and when it returns, the data is guaranteed to be persistent.
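
For example, the dd command from your report below can be made durable 
by adding "conv=fsync", which makes dd call fsync on the output device 
before it exits:

# dd if=/dev/zero of=/dev/dev_1_default/sys_dev_01 bs=1M count=10 \
    oflag=direct conv=fsync

When this dd returns, the flush has completed, so the zeros are durable 
across a power failure.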

Similarly, when data is written to dm-writecache, it is not guaranteed 
to be persistent when the I/O finishes. It may still sit uncommitted 
inside dm-writecache until "fsync" or "fdatasync" is called.
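
The cache can also be committed explicitly, without fsync, using the 
message interface that you already used in your test:

# dmsetup message dev_1_default-sys_dev_01 0 flush

This explicit flush is the reason the zeros from your dd step survived 
the reset.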

If I add "--end_fsync=1" to the fio command that you posted below, there 
is no longer any data loss - after a reboot, the device will contain 0xff. 
So, dm-writecache works as expected, flushing the data when "fsync" is 
called.
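
That is, your command with only the fsync flag appended:

# fio --bs=1m --direct=1 --rw=write --buffer_pattern=0xff \
    --ioengine=libaio --iodepth=1 --numjobs=1 --size=10M \
    --end_fsync=1 --output-format=terse \
    --name=/dev/dev_1_default/sys_dev_01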

Mikulas

> ---------- Forwarded message ---------
> From: Marc Smith <msmith626 at gmail.com>
> Date: Wed, Jun 14, 2023 at 5:29 PM
> Subject: dm-writecache - Unexpected Data After Host Crash
> To: <dm-devel at redhat.com>
> 
> 
> Hi,
> 
> I'm using dm-writecache via 'lvmcache' on Linux 5.4.229 (vanilla
> kernel.org source). I've been testing my storage server -- I'm using a
> couple NVMe drives in an MD RAID1 array that is the cache (fast)
> device, and using a 12-drive MD RAID6 array as the origin (backing)
> device.
> 
> I noticed that when the host crashes (power loss, forcefully reset,
> etc.) it seems the cached (via dm-writecache) LVM logical volume does
> not contain the bits I expect. Or perhaps I'm missing something in how
> I understand/expect dm-writecache to function...
> 
> I change the auto-commit settings to larger values so the data on the
> cache device is not flushed to the origin device:
> # lvchange --cachesettings "autocommit_blocks=1000000000000"
> --cachesettings "autocommit_time=3600000" dev_1_default/sys_dev_01
> 
> Then populate the start of the device (cached LV) with zeros:
> # dd if=/dev/zero of=/dev/dev_1_default/sys_dev_01 bs=1M count=10 oflag=direct
> 
> Force a flush from the cache device to the backing device (all zeros
> in the first 10 MiB):
> # dmsetup message dev_1_default-sys_dev_01 0 flush
> 
> Now write a different pattern to the first 10 MiB:
> # fio --bs=1m --direct=1 --rw=write --buffer_pattern=0xff
> --ioengine=libaio --iodepth=1 --numjobs=1 --size=10M
> --output-format=terse --name=/dev/dev_1_default/sys_dev_01
> 
> And then induce a reset:
> # echo b > /proc/sysrq-trigger
> 
> Now after the system boots back up, assemble the RAID arrays and
> activate the VG, then examine the data:
> # vgchange -ay dev_1_default
> # dd if=/dev/dev_1_default/sys_dev_01 bs=1M iflag=direct count=10
> status=noxfer | od -t x2
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 10+0 records in
> 10+0 records out
> 50000000
> 
> 
> So I'm expecting all "ffff" in the first 10 MiB, but instead I'm
> getting what's on the origin device: zeros, not what was written to
> the cache device.
> 
> Obviously in a crash scenario (power loss, reset, panic, etc.) the
> dirty data in the cache won't have been flushed to the origin device;
> however, I was expecting that when the DM device started on the
> subsequent boot (via activating the VG), all of the dirty data would
> be present -- it seems it is not.
> 
> 
> Thanks for any information/advice, it's greatly appreciated.
> 
> 
> --Marc
> 

