[dm-devel] dm-cache questions

Paul B. Henson henson at acm.org
Tue Dec 10 01:56:03 UTC 2013


I'm building a small virtualization server on which I'd like to use SSD
caching to increase performance. While there seems to be a growing number
of options for SSD caching under Linux, I'd like to stick with something
that's part of the mainline kernel, which I think narrows the field to
bcache or dm-cache.

After reviewing the dm-cache documentation and mailing list archives, I had
a few questions I hope somebody might be able to answer; I apologize in
advance if any of them are silly or something I should've already found on
my own.

I've got four WD RE4 2TB drives that I plan to configure as RAID10 for the
data device, and two Samsung 840 Pro 256GB SSDs that I plan to configure as
RAID1 for the cache device. I'd like to set up write back caching to improve
both read and write performance. I was going to set up LVM on top of the
cached device and then use LVs as the backing store for KVM virtual
machines.
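
For concreteness, something along these lines is what I had in mind for the
arrays (the member device names are just placeholders for whatever they end
up being on this box):

# mdadm --create /dev/md3 --level=10 -n 4 /dev/sd[abcd]   (4x RE4, origin)
# mdadm --create /dev/md2 --level=1 -n 2 /dev/sd[ef]      (2x 840 Pro, cache)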

Is dm-cache considered ready for production deployment? From what I
understand, there are plans to add support for managing dm-cache to lvm2,
and without that it's a bit cryptic to use/set up. I see that Fedora has
deferred including support for dm-cache into their distribution pending that
lvm2 support, but other than easing configuration/management, are there any
reasons not to go ahead and deploy dm-cache in production now, working with
it directly rather than through lvm2?

What is the recommended kernel version for using dm-cache? Would 3.10 LTS be
suitable, or would it be better at this point to run the latest stable,
e.g. 3.12.x now and then 3.13.x once 3.12 goes EOL, to be sure to have the
latest bug fixes and performance enhancements?

From reviewing the documentation, in addition to the origin/backing device
and the cache device, a third device is necessary for metadata. Per the
documentation the rationale for having a separate device for metadata rather
than simply using the cache device is so that the metadevice can be
configured with different redundancy; the example given is that perhaps it
could be mirrored. I'm confused, though, as to what utility there is in having
a metadata device with a different level of redundancy than the cache
device. If the metadata device is mirrored, and the cache device is not, you
will still be able to access the metadata should the cache device fail, but
given the cache device has failed, what are you going to do with it?
Conversely, if the cache device is mirrored, and the metadata device is not,
should the metadata device fail, how are you going to use your cache? I can
see potentially having the origin device redundant, and the cache device
not, assuming you are not using write back caching, but I don't initially
see a scenario where you would configure a cache device and a metadevice
with different availability characteristics.

What are the performance requirements of the metadevice? For my system, I
can either put it on the cache device, on the origin device, or I have
another mirror of two USB sticks used for /boot that it could go on.
Intuitively it seems the metadata device should be fast/low latency, so my
first guess would be the best location would be on the SSD mirror I'm using
for cache. Based on the examples I've seen, you can either partition the
device into two pieces to separate metadata from cache, or use dm-linear.
I'm thinking I'll go with partitioning, as that seems simpler and I'm more
familiar with it, although I suppose that will result in a little bit of
waste for the partition table and alignment.
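
Concretely, I was picturing splitting the SSD mirror roughly like this (the
~20MB metadata size comes from the sizing calculation below, and the
partition names are arbitrary):

# parted -s /dev/md2 mklabel gpt
# parted -s /dev/md2 mkpart meta 1MiB 21MiB        (dm-cache metadata)
# parted -s /dev/md2 mkpart cache 21MiB 100%       (dm-cache cache blocks)

or, going the dm-linear route instead, something like

# dmsetup create md2-meta --table '0 40960 linear /dev/md2 2048'

with a second linear mapping over the remainder for the cache blocks.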

With bcache, they recommend selecting the bucket size and block size based
on the specifications of your SSD, is there any similar recommended
alignment with the underlying SSD for selecting dm-cache block size? The SSD
I am using has a 1024k erase block size and an 8k page size. Or should the
block size be tuned based more on the size of the origin device relative to
the cache device and your expected I/O sizes, with no particular regard for
the physical characteristics of your SSD?
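
For example, if I went with, say, a 256k block size, I assume the cache
target wants that expressed in 512-byte sectors:

262144 / 512 = 512 sectors

and each 1024k erase block would then span exactly 1048576 / 262144 = 4
cache blocks, which at least lines up evenly; whether that actually matters
to the SSD is part of what I'm asking.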

From what I've read, the rule-of-thumb formula for sizing your metadata
device is 4 MB + ( 16 bytes * nr_blocks ). Is that still accurate? So, if I
hypothetically selected a 256k block size, I would calculate it as:

# blockdev --getsize64 /dev/md2          (ssd mirror)
255926140928

4194304 + (16 * 255926140928 / 262144)  = 19814796
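
Or the same thing as a one-liner, just to double check myself (262144 being
the hypothetical 256k block size again):

# echo $(( 4194304 + 16 * $(blockdev --getsize64 /dev/md2) / 262144 ))
19814796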

So I would need to make a partition of size approximately 19MB for the
metadata? Then, assuming I partitioned md2 into md2p1 (metadata) and md2p2
(cache), and my origin device was md3, I could create the cache device via:

# blockdev --getsz /dev/md3 
7813531648

# dmsetup create md3-cached --table '0 7813531648 cache /dev/md2p1
/dev/md2p2 /dev/md3 512 1 writeback default 0'
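
After which I assume I can sanity check the result with something like:

# dmsetup table md3-cached        (confirm the loaded table)
# dmsetup status md3-cached       (usage / hit / dirty counters, I believe)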

For shutdown, you should then arrange to run 'dmsetup suspend md3-cached'
at reboot/halt so it goes down cleanly? From what I read, dm-cache should be
reasonably robust in the face of a crash/panic, so this is really more of an
optimization as opposed to a hard requirement?
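
For example, something like this wired into the shutdown sequence, after the
VMs and anything mounted on top have been stopped (just a sketch of what I
have in mind):

#!/bin/sh
# rc shutdown hook: quiesce the cache before the md arrays are taken down
dmsetup suspend md3-cached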

Just a couple more miscellaneous questions :) Is there any way to switch
between modes/policies without downtime on the cache device? For example, if
one of the SSD's failed and you wanted to switch to write through mode
rather than write back until you replaced it and the mirror was healthy
again?
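
I'm guessing (but don't know) that the usual suspend/reload/resume dance
would apply, something along the lines of:

# dmsetup suspend md3-cached
# dmsetup reload md3-cached --table '0 7813531648 cache /dev/md2p1
/dev/md2p2 /dev/md3 512 1 writethrough default 0'
# dmsetup resume md3-cached

but I don't know whether the cache target is happy to have its feature
arguments changed out from under it on a live reload like that.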

Is there any support or integration with SSD TRIM for the cache device? Not
necessarily in real time, as that can degrade performance, but occasionally
in batch a la fstrim for filesystems, to get dm-cache to TRIM all of the
blocks not in use at that time in order to optimize the SSD garbage
collector?

If you have read this far, thank you very much :), I'm sorry for such a long
message, but I'm trying to wrap my head around this and be sure I have a
good understanding before using it.

Thanks.



