[dm-devel] [ceph-users] Local SSD cache for ceph on each compute node.

Nick Fisk nick at fisk.me.uk
Wed Mar 30 13:02:14 UTC 2016


> >>
> >> On 03/29/2016 04:35 PM, Nick Fisk wrote:
> >>> One thing I picked up on when looking at dm-cache for doing caching
> >>> with RBDs is that it wasn't really designed to be used as a
> >>> writeback cache for new writes, in the way you would expect a
> >>> traditional writeback cache to work. It seems all the policies are
> >>> designed around the idea that writes go to cache only if the block
> >>> is already in the cache (through reads) or it's hot enough to
> >>> promote. Although there did seem to be some tunables to alter this
> >>> behaviour, posts on the mailing list seemed to suggest this wasn't
> >>> how it was designed to be used. I'm not sure if this has been
> >>> addressed since I last looked at it though.
> >>>
> >>> Depending on whether you are trying to accelerate all writes, or
> >>> just your "hot" blocks, this may or may not matter. Even <1GB local
> >>> caches can make a huge difference to sync writes.
> >> Hi Nick,
> >>
> >> Some of the caching policies have changed recently as the team has
> >> looked at different workloads.
> >>
> >> Happy to introduce you to them if you want to discuss offline or post
> >> comments over on their list: device-mapper development <dm-
> >> devel at redhat.com>
> >>
> >> thanks!
> >>
> >> Ric
> > Hi Ric,
> >
> > Thanks for the heads up. Just from a quick flick through I can see
> > there are now separate read and write promotion thresholds, which on
> > its own suggests it would be a lot more suitable for what I intended.
> > I might try and find some time to give it another test.
> >
> > Nick
> 
> Let us know how it works out for you; I know that they are very interested in
> making sure things are useful :)

Hi Ric,

I have given it another test and unfortunately it seems it's still not giving the improvements that I was expecting.

Here is a rough description of my test:

10GB RBD
1GB ZRAM kernel device for cache (Testing only)

0 20971520 cache 8 106/4096 64 32768/32768 2492 1239 349993 113194 47157 47157 0 1 writeback 2 migration_threshold 8192 mq 10 random_threshold 0 sequential_threshold 0 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 0 rw -
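
For reference, the cache device was built with a dmsetup table roughly along these lines. The device paths and target name below are placeholders rather than the exact ones I used, but the block size, writeback mode and mq tunables match the status output above:

# Placeholders: cache-meta/cache-blocks are carved out of the zram device,
# /dev/rbd0 is the 10GB origin. 64-sector (32kB) cache blocks, writeback
# mode, mq tuned so sequential writes aren't filtered out and new writes
# promote straight into the cache.
TABLE="0 20971520 cache /dev/mapper/cache-meta /dev/mapper/cache-blocks /dev/rbd0 64 1 writeback mq 6 sequential_threshold 0 random_threshold 0 write_promote_adjustment 0"
dmsetup create rbd-cached --table "$TABLE"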

I'm then running a direct I/O, 64kB sequential write, QD=1 benchmark with fio against the DM device.
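
In case it's useful for reproducing, the fio job was roughly this (the target name matches the placeholder above; without --size fio just writes through the whole device):

fio --name=seqwrite64k --filename=/dev/mapper/rbd-cached \
    --rw=write --bs=64k --iodepth=1 --numjobs=1 \
    --ioengine=libaio --direct=1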

What I expect to happen is for this sequential stream of 64kB IOs to be coalesced into 4MB IOs and written out to the RBD at as high a queue depth as possible/required, effectively meaning my 64kB sequential bandwidth should match the 4MB sequential bandwidth limit of my cluster. I'm more interested in replicating the behaviour of the write cache on a battery-backed RAID card than that of a read/write SSD cache, if that makes sense.

An example real-life scenario would be sitting underneath an iSCSI target; something like ESXi generates that IO pattern when moving VMs between datastores.

What I'm seeing is a sudden burst of speed at the start of the fio test, which then quickly drops down to the speed of the underlying RBD device. The dirty block counter never seems to get very high, so I don't think it's a cache-full problem; it sits at probably no more than about 40% when the slowdown starts and then drops to less than 10% for the remainder of the test as it crawls along. It feels like it hits some sort of throttle and never recovers.
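
(I was keeping an eye on the dirty count with something like the line below; on the kernel I'm testing it's the 14th field of the status output shown above.)

watch -n 1 "dmsetup status rbd-cached | awk '{print \$14}'"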

I've done similar tests with flashcache and it gives more stable performance over a longer period of time, but its set-associative hit-set behaviour seems to cause write misses with this sequential IO pattern, which limits overall top performance.


Nick

> 
> ric
> 





