[dm-devel] dm-writecache - Unexpected Data After Host Crash

Lukas Straub lukasstraub2 at web.de
Fri Jun 16 21:12:39 UTC 2023


On Fri, 16 Jun 2023 16:43:47 -0400
Marc Smith <msmith626 at gmail.com> wrote:

> On Fri, Jun 16, 2023 at 12:33 PM Lukas Straub <lukasstraub2 at web.de> wrote:
> >
> > On Wed, 14 Jun 2023 17:29:17 -0400
> > Marc Smith <msmith626 at gmail.com> wrote:
> >  
> > > Hi,
> > >
> > > I'm using dm-writecache via 'lvmcache' on Linux 5.4.229 (vanilla
> > > kernel.org source). I've been testing my storage server -- I'm using a
> > > couple NVMe drives in an MD RAID1 array that is the cache (fast)
> > > device, and using a 12-drive MD RAID6 array as the origin (backing)
> > > device.
> > >
> > > I noticed that when the host crashes (power loss, forcefully reset,
> > > etc.) it seems the cached (via dm-writecache) LVM logical volume does
> > > not contain the bits I expect. Or perhaps I'm missing something in how
> > > I understand/expect dm-writecache to function...
> > >
> > > I change the auto-commit settings to larger values so the data on the
> > > cache device is not flushed to the origin device:
> > > # lvchange --cachesettings "autocommit_blocks=1000000000000"
> > > --cachesettings "autocommit_time=3600000" dev_1_default/sys_dev_01
> > >
> > > Then populate the start of the device (cached LV) with zeros:
> > > # dd if=/dev/zero of=/dev/dev_1_default/sys_dev_01 bs=1M count=10 oflag=direct  
> >
> > Missing flush/fsync.
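> >
> > For example (just a sketch: GNU dd's conv=fsync makes it call fsync()
> > on the output device after the last write, which is what generates the
> > flush):
> >
> > # dd if=/dev/zero of=/dev/dev_1_default/sys_dev_01 bs=1M count=10 oflag=direct conv=fsync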
> >  
> > > Force a flush from the cache device to the backing device (all zeros
> > > in the first 10 MiB):
> > > # dmsetup message dev_1_default-sys_dev_01 0 flush
> > >
> > > Now write a different pattern to the first 10 MiB:
> > > # fio --bs=1m --direct=1 --rw=write --buffer_pattern=0xff
> > > --ioengine=libaio --iodepth=1 --numjobs=1 --size=10M
> > > --output-format=terse --name=/dev/dev_1_default/sys_dev_01  
> >
> > Again, no flush/fsync is issued.  
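> >
> > Something like this should do it (a sketch: end_fsync=1 makes fio call
> > fsync() on the target once the writes are done; fsync=1 would do it
> > after every write):
> >
> > # fio --bs=1m --direct=1 --rw=write --buffer_pattern=0xff
> > --ioengine=libaio --iodepth=1 --numjobs=1 --size=10M --end_fsync=1
> > --output-format=terse --name=/dev/dev_1_default/sys_dev_01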
> 
> I'm doing direct I/O, so I wasn't anticipating the need for a flush/fsync.
> 
> 
> >  
> > > And then induce a reset:
> > > # echo b > /proc/sysrq-trigger
> > >
> > > Now after the system boots back up, assemble the RAID arrays and
> > > activate the VG, then examine the data:
> > > # vgchange -ay dev_1_default
> > > # dd if=/dev/dev_1_default/sys_dev_01 bs=1M iflag=direct count=10
> > > status=noxfer | od -t x2
> > > 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> > > *
> > > 10+0 records in
> > > 10+0 records out
> > > 50000000
> > >
> > >
> > > So I'm expecting all "ffff" in the first 10 MiB, but instead, I'm
> > > getting what's on the origin device, zeros (not what was written to
> > > the cache device).
> > >
> > > Obviously, in a crash scenario (power loss, reset, panic, etc.) the
> > > dirty data in the cache won't be flushed to the origin device;
> > > however, I was expecting that when the DM device started on the
> > > subsequent boot (via activating the VG), all of the dirty data would
> > > be present -- it seems like it is not.
> > >
> > >
> > > Thanks for any information/advice, it's greatly appreciated.  
> >
> > This is the expected behavior. If you don't issue flushes, no guarantees
> > are made about the durability of the newly written data.  
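> >
> > If you repeat the test but force the data out before the reset -- for
> > example by reusing the flush message from your first step:
> >
> > # dmsetup message dev_1_default-sys_dev_01 0 flush
> >
> > -- then the 0xff pattern should still be there after the crash.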
> 
> Interesting... was not expecting that. I guess I was thrown by the use
> of persistent media (SSD / PMEM). If dm-writecache has dirty data that
> isn't flushed to the origin device yet (no flush/fsync from the
> application) and we lose power, the data is gone... why not just use
> volatile RAM for the cache then?

Because flushing is the very thing that dm-writecache is meant to
accelerate. If your application isn't going to flush the data, it might
as well throw the data away.
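
Even from the shell you can issue one. For example (assuming coreutils
8.24 or newer, where sync accepts a file argument and fsync()s it; on a
block device that turns into a flush reaching the DM target):

# sync /dev/dev_1_default/sys_dev_01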

> 
> I'm still experimenting and learning the code, but from what I've seen
> so far, the dirty data blocks do reside on the SSD/PMEM device,

Not even that: many storage devices buffer written data internally in
volatile caches. This is well known for HDDs, which do it for
performance reasons, but SSDs too need to commit their internal
metadata before written data is visible after a crash.
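
For example, you can check how the kernel sees a device's cache mode
via sysfs (nvme0n1 is just a placeholder for your cache device; the
attribute has been there since around 4.7):

# cat /sys/block/nvme0n1/queue/write_cache
write back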

And consider other scenarios, such as running inside a VM with the
virtual disk backed by a file on a normal filesystem (perhaps not even
opened with O_DIRECT).
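
For example, with QEMU it is the cache= option of -drive that decides
what happens to guest flushes: cache=none and cache=writeback pass them
through to the host image, while cache=unsafe simply drops them
(hypothetical invocation):

# qemu-system-x86_64 ... -drive file=disk.img,format=raw,cache=none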

> it's
> just the entry map that lives in metadata that isn't up-to-date if a
> crash / power loss occurs. I assume writing out all of the metadata on
> each cache change would be very expensive in terms of I/O performance.
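
As for the metadata: it is only committed when the target sees a flush
(or when the autocommit_blocks/autocommit_time thresholds you raised
are hit), so entries written after the last commit are simply gone
after a crash. You can watch the counters with dmsetup status; IIRC on
5.4 the writecache status line is: error flag, total blocks, free
blocks, blocks under writeback.

# dmsetup status dev_1_default-sys_dev_01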
> 
> 
> >  
> > >
> > > --Marc
> > >
> >  


