[dm-devel] dm-cache issue

Zdenek Kabelac zdenek.kabelac at gmail.com
Wed Nov 16 14:06:00 UTC 2016


On 16.11.2016 14:45, Teodor Milkov wrote:
> On 16.11.2016 11:24, Zdenek Kabelac wrote:
>> On 15.11.2016 13:38, Teodor Milkov wrote:
>>> On 14.11.2016 17:34, Zdenek Kabelac wrote:
>>>> On 14.11.2016 16:02, Alexander Pashaliyski wrote:
>>>>> The server has been booting for hours because of IO load. It seems a flush
>>>>> is triggered from the SSD disk (which is used as the cache device) to the
>>>>> raid controllers (which have slow SATA disks).
>>>>> I have 10 cached logical volumes in *writethrough mode*, each with 2T of
>>>>> data, spread over 2 raid controllers. I use a single SSD disk for the cache.
>>>>> The backup system runs lvm2-2.02.164-1 & kernel 4.4.30.
>>>>>
>>>>> Do you have any ideas why such a flush is triggered? In writethrough cache
>>>>> mode we shouldn't have dirty blocks in the cache.
>>>>>
>>>>
>>>> Have you ensured there was a proper shutdown?
>>>> The cache needs to be properly deactivated - if it's just turned off,
>>>> all metadata are marked dirty.
>>>>
>>>> Zdenek
>>>
>>> Hi,
>>>
>>> I'm seeing the same behavior described by Alexander. Even if we assume
>>> something is wrong with my shutdown scripts, how could dm-cache ever be
>>> dirty in writethrough mode? What about the case where the server crashes for
>>> whatever reason (kernel bug, power outage, operator error etc.)? Waiting
>>> several hours, or for a sufficiently large cache even days, for the system to
>>> come back up is not practical.
>>>
>>> I found this 2013 conversation, where Heinz Mauelshagen <heinzm redhat com>
>>> states that "in writethrough mode the cache will always be coherent after a
>>> crash": https://www.redhat.com/archives/dm-devel/2013-July/msg00117.html
>>>
>>> I'm thinking of a way to --uncache and recreate the cache devices on every
>>> boot, which should be safe in writethrough mode and takes a reasonable, and
>>> more importantly constant, amount of time.
>>
>> My first 'guess' in this reported case is that the disk I/O traffic seen is
>> related to the 'reload' of cached chunks from disk back into the cache.
>>
>> This will happen in the case where there has been an unclean cache shutdown.
>>
>> However, what is unclear is why it slows down boot by hours.
>> Is the cache too big??
>
> Indeed, the cache is quite big – an 800GB SSD, but I found experimentally that
> this is the size where I get good cache hit ratios with my >10TB data volume.

Yep - that's the current trouble with the existing dm-cache target.
It gets inefficient when maintaining more than 1 million cache block
entries - recent versions of lvm2 do not even allow creating such a
cache without forcing it.
(So for 32k blocks that's ~30G of cache data size.)
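
As a rough check (a sketch only - vg0/lv_data is a placeholder for your own
VG/LV names, and the reporting fields are those of recent lvm2):

  # chunk size and cache block usage as lvm2 reports them
  lvs -o lv_name,chunk_size,cache_total_blocks,cache_used_blocks,cache_dirty_blocks vg0/lv_data

  # the same counters straight from the kernel cache target
  dmsetup status vg0-lv_data

Back-of-the-envelope: 1,000,000 chunks * 32KiB/chunk is roughly 30GiB, so an
800GB cache with 32k chunks holds around 25 million entries.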


> As to the 'reload' vs 'flush' – I think it is flushing, because iirc iostat
> showed lots of SSD reading and HDD writing, but I'm not really sure and need
> to confirm that.
>
> So, are you saying that in the case of an unclean shutdown this 'reload' is inevitable?

Yes - a clean shutdown is mandatory - otherwise the cache can't know whether
it is consistent and has to refresh itself. The other option would probably be
to drop the cache and let it rebuild - but you lose the already gained
'knowledge' this way.
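
For the clean-shutdown part, it's usually enough to make sure the cached LVs
are deactivated before power-off - for example (a sketch; vg0 is a placeholder
for the VG holding the cached volumes):

  # deactivate all LVs in the VG so dm-cache can commit its metadata cleanly
  vgchange -an vg0

or per volume with "lvchange -an vg0/lv_data".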

Anyway, AFAIK there is an ongoing development and upstreaming process for a
new cache target which will address a couple of other shortcomings and should
perform much better. lvm2 will supposedly handle the transition to the new
format in some way later.


> How much time it takes obviously depends on the SSD size/speed & HDD speed,
> but with an 800GB SSD it is reasonable to expect very long boot times.
>
>> Can you provide full logs from 'deactivation' and following activation?
>
> Any hints as to how to collect "full logs from 'deactivation' and following
> activation"? It happens early in the Debian boot process (I think udev does
> the activation) and I'm not sure how to enable logging... should I tweak
> /etc/lvm/lvm.conf?


All you need to collect is basically the 'serial' console log from your
machine - so if you have another box to capture the serial console log,
that's the easiest option.
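
If you still want lvm2 itself to log verbosely during activation, something
roughly like this in /etc/lvm/lvm.conf should work (a sketch - the file path
is just an example; see the comments in your stock lvm.conf):

  log {
      verbose = 1
      level = 7
      file = "/var/log/lvm2-debug.log"
      activation = 1
  }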

But since you already said you use a cache size ~30 times bigger than the
size with 'reasonable' performance, I think it's already clear where your
problem is hidden.

Until the new target is deployed, please consider using a significantly
smaller cache size so that the number of cache chunks does not go above
1,000,000.
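
Something along these lines should get you there (a sketch only - vg0, lv_data
and /dev/sdb are placeholder names, adjust to your layout):

  # detach the existing oversized cache; safe in writethrough mode
  lvconvert --uncache vg0/lv_data

  # create a smaller cache pool on the SSD; 60GiB with 64KiB chunks stays
  # under the ~1,000,000 chunk mark (60GiB / 64KiB = 983,040 chunks)
  lvcreate --type cache-pool -L 60G --chunksize 64k -n lv_cache vg0 /dev/sdb

  # attach it to the origin LV in writethrough mode
  lvconvert --type cache --cachepool vg0/lv_cache --cachemode writethrough vg0/lv_data

A bigger --chunksize is the other knob - doubling the chunk size halves the
chunk count for the same cache size, at some cost in caching granularity.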

Regards

Zdenek






