[dm-devel] dm-cache issue

Sat Nov 19 17:07:50 UTC 2016

On 16.11.2016 16:06, Zdenek Kabelac wrote:
> Dne 16.11.2016 v 14:45 Teodor Milkov napsal(a):
>> On 16.11.2016 11:24, Zdenek Kabelac wrote:
>>> My first 'guess' in this reported case is that the disk I/O traffic seen
>>> is related to the 'reload' of cached chunks from disk back to cache.
>>>
>>> This will happen if there has been an unclean cache shutdown.
>>>
>>> However, what is unclear is why it slows down boot by hours.
>>> Is the cache too big?
>>
>> Indeed, the cache is quite big: an 800GB SSD. But I found experimentally
>> that this is the size where I get good cache hit ratios with my >10TB
>> data volume.
>
> Yep - that's the current trouble with the existing dm-cache target.
> It gets inefficient when maintaining more than 1 million
> cache block entries - recent versions of lvm2 do not even allow
> creating such a cache without forcing it.
> (so for 32k blocks it's ~30G of cache data)

I'm sorry for not being clear: similarly to the OP, my SSD is split among
10 LVs, so each cache is around 80GB.
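
For reference, the block-count arithmetic behind these figures can be
sketched as below. This is only a rough illustration: it assumes the 32k
block size quoted above, binary units (KiB/GiB), and the 1 million block
mark mentioned above; the function names are made up for the example and
are not part of lvm2 or dm-cache.

```python
KIB = 1024
GIB = 1024 ** 3

# Assumption: 32 KiB cache blocks, as in the quoted "32k blocks" remark.
BLOCK_BYTES = 32 * KIB


def cache_blocks(cache_bytes, block_bytes=BLOCK_BYTES):
    """Number of cache blocks needed to cover a cache of the given size."""
    return cache_bytes // block_bytes


def max_cache_bytes(max_blocks=1_000_000, block_bytes=BLOCK_BYTES):
    """Largest cache size that stays within max_blocks cache blocks."""
    return max_blocks * block_bytes


if __name__ == "__main__":
    for label, size in (("one 80G cache LV", 80 * GIB),
                        ("whole 800G SSD", 800 * GIB)):
        print(f"{label}: {cache_blocks(size):,} blocks")
    print(f"1,000,000 blocks cover {max_cache_bytes() / GIB:.1f} GiB")
```

Under these assumptions each 80GB cache works out to roughly 2.6 million
blocks, which is still above the 1 million mark but far fewer than the
roughly 26 million a single 800GB cache would need.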

>> As to the 'reload' vs 'flush': I think it is flushing, because IIRC
>> iostat showed lots of SSD reading and HDD writing, but I'm not really
>> sure and need to confirm that.
>>
>> So, are you saying that in case of an unclean shutdown this 'reload' is
>> inevitable?
>
> Yes - a clean shutdown is mandatory; otherwise the cache can't know its
> consistency and has to refresh itself. Another option would probably be to
> drop the cache and let it rebuild - but you lose the already gained
> 'knowledge' this way.
>
> Anyway, AFAIK there is an ongoing development and upstreaming process for a
> new cache target which will address a couple of other shortcomings and
> should perform much better. lvm2 will supposedly handle the transition to
> the new format in some way later.
>
>> How much time it takes obviously depends on the SSD size/speed and HDD
>> speed, but with an 800GB SSD it is reasonable to expect very long boot
>> times.
>>
>>> Can you provide full logs from 'deactivation' and following activation?
>>
>> Any hints as to how to collect "full logs from 'deactivation' and
>> following activation"? It happens early in the Debian boot process (I
>> think udev does the activation) and I'm not sure how to enable logging...
>> should I tweak /etc/lvm/lvm.conf?
>
> All you need to collect is basically the 'serial' console log from your
> machine - so if you have another box to capture the serial console log,
> that's the easiest option.
>
> But since you already said you use a cache ~30 times bigger than
> the size with 'reasonable' performance, I think it's already clear
> where your problem is hidden.
>
> Until the new target is deployed, please consider using a
> significantly smaller cache size so the number of cache chunks is not
> above 1 000 000.

Thank you very much for your help! I'll have another go at debugging
what the problem is.
I found dm-writeboost in write_around_mode (kinda write-through) works
well for me, so if I don't manage to get along with dm-cache I have a plan B.

Best regards,
Teodor