[linux-lvm] Reserve space for specific thin logical volumes

Tue Sep 12 23:02:20 UTC 2017

On 13.9.2017 at 00:41, Gionatan Danti wrote:
> On 13-09-2017 00:16, Zdenek Kabelac wrote:
>> On 12.9.2017 at 23:36, Gionatan Danti wrote:
>>> On 12-09-2017 21:44, matthew patton wrote:
>>
>>> Again, please don't speak about things you don't know.
>>> I am *not* interested in thin provisioning itself at all; on the other 
>>> side, I find CoW and fast snapshots very useful.
>>
>>
>> Not going to comment on KVM storage architecture - but with this statement -
>> you have a VERY simple use case:
>>
>>
>> Just minimize chance for overprovisioning -
>>
>> let's go by example:
>>
>> you have 10 volumes of 10GiB each and you have 20 snapshots...
>>
>>
>> to not overprovision - you need 10 GiB * 30 LV = 300GiB thin-pool.
>>
>> If that sounds like too much,
>>
>> you can go with 150 GiB - to always 100% cover all 'base' volumes
>> and have some room for snapshots.
>>
>>
>> Now the fun begins - while monitoring is running -
>> you get a callback at 50%, 55%... 95%, 100%
>> and at each of these moments you can take whatever action you need.
>>
>>
>> So assume 100GiB is the bare minimum for base volumes - you ignore any
>> state with less than 66% occupancy of the thin-pool and you start solving
>> problems at 85% (~128GiB) - you know some snapshot had better be
>> dropped.
>> You may try 'harder' actions for higher percentages.
>> (you need to consider how many dirty pages you leave floating in your system
>> and other variables)
>>
>> Also you pick with some logic the snapshot which you want to drop -
>> Maybe the oldest ?
>> (see airplane :) URL link)....
>>
>> Anyway - you still have plenty of time to solve it at this moment
>> without any danger of losing a write operation...
>> All you can lose is some 'snapshot' which might have been present a
>> bit longer...  but that is supposedly fine with your model workflow...
>>
>> Of course you are getting into serious problems if you try to keep all
>> these demo-volumes within 50GiB with massive overprovisioning ;)
>>
>> There you have a much harder time deciding what should happen, what should
>> be removed, and where it is possibly better to STOP everything and let the
>> admin decide what the ideal next step is....
>>
> 
> Hi Zdenek,
> I fully agree with what you said above, and I sincerely thank you for taking 
> the time to reply.
> However, I am not sure I understand *why* reserving space for a thin volume 
> seems a bad idea to you.
> 
> Let's take a 100 GB thin pool and suppose we want to *never* run out of space 
> in spite of taking multiple snapshots.
> To achieve that, I need to a) carefully size the original volume, b) ask the 
> thin pool to reserve the needed space and c) count the "live" data (REFER 
> in ZFS terms) allocated inside the thin volume.
> 
> Step-by-step example:
> - create a 40 GB thin volume and subtract its size from the thin pool (USED 40 
> GB, FREE 60 GB, REFER 0 GB);
> - overwrite the entire volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
> - snapshot the volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
> - completely overwrite the original volume (USED 80 GB, FREE 20 GB, REFER 40 GB);
> - a new snapshot creation will fail (REFER is higher than FREE).
> 
> Result: thin pool is *never allowed* to fill. You need to keep track of 
> per-volume USED and REFER space, but thinp performance should not be impacted 
> in any manner. This is not theoretical: it is already working in this manner 
> with ZVOLs and refreservation, *without* involving/requiring any advanced 
> coupling/integration between block and filesystem layers.
> 
> Don't get me wrong: I am sure that, if you choose to not implement this 
> scheme, you have a very good reason to do that. Moreover, I understand that 
> patches are welcome :)
> 
> But I would like to understand *why* this possibility is ruled out with such 
> firmness.
> 
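
(For reference, the accounting in the quoted step-by-step example can be
written down as a small Python sketch. This is only a toy model of the
proposed scheme with a single thin volume; the class ReservingPool and its
methods are hypothetical names made up for illustration, not an LVM,
dm-thin or ZFS interface. Sizes are in GB.)

```python
class ReservingPool:
    """Toy model of the quoted USED/FREE/REFER accounting (sizes in GB)."""

    def __init__(self, size):
        self.size = size
        self.used = 0      # space charged to the pool, reservation included
        self.refer = 0     # live data referenced by the thin volume
        self.volume = 0    # size of the single thin volume
        self.shared = 0    # live data still shared with the latest snapshot

    @property
    def free(self):
        return self.size - self.used

    def create_volume(self, size):
        if size > self.free:
            raise RuntimeError("not enough free space to reserve the volume")
        self.volume = size
        self.used += size          # the full size is reserved up front

    def overwrite_all(self):
        # Data shared with a snapshot must be preserved, so rewriting it
        # consumes new space; unshared data is rewritten in place.
        self.used += self.shared
        self.shared = 0
        self.refer = self.volume

    def snapshot(self):
        if self.refer > self.free:
            raise RuntimeError("snapshot refused: REFER is higher than FREE")
        self.shared = self.refer

    def state(self):
        return (self.used, self.free, self.refer)


def walk_through():
    """Replay the five quoted steps and collect (USED, FREE, REFER)."""
    pool = ReservingPool(100)
    states = []
    pool.create_volume(40)
    states.append(pool.state())
    pool.overwrite_all()
    states.append(pool.state())
    pool.snapshot()
    states.append(pool.state())
    pool.overwrite_all()
    states.append(pool.state())
    try:
        pool.snapshot()
    except RuntimeError as err:
        states.append(str(err))
    return states
```

Calling walk_through() replays the five steps and ends with the refused
snapshot.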

There could be a simple answer and a complex one :)

I'd start with the simple one - already presented here -

when you write to an INDIVIDUAL thin volume target - the respective dm thin 
target DOES manipulate a single btree set - it does NOT care that there are 
some other snapshots and never influences them -

You ask here to heavily 'change' the thin-pool logic - so that writing to THIN 
volume A can remove/influence volume B - this is very problematic for many 
reasons.

We can go into details of BTree updates (that should really be discussed with 
its authors on the dm channel ;)) - but I think the key element is capturing 
the idea that the usage of thinLV A does not change thinLV B.

----

Now to your free 'reserved' space fiction :)
There is NO way to decide WHO deserves to use the reserve :)

Every thin volume is equal - (the fact we call some thin LV a snapshot is 
user-land fiction - in the kernel all thinLVs are just equal - every thinLV 
references a set of thin-pool chunks) -

(for late-night thinking -  what would be snapshot of snapshot which is fully 
overwritten ;))
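
To make the "every thinLV just references a set of chunks" point concrete,
here is a deliberately simplified Python sketch. It is NOT how the dm-thin
btrees are implemented; ThinPool and ThinVolume are hypothetical classes that
only model a mapping from virtual blocks to pool chunks with reference
counts. It shows that a "snapshot" is just another mapping, and that a write
to one volume only ever changes that volume's own mapping.

```python
class ThinPool:
    """Toy thin-pool: hands out chunk numbers and counts references."""

    def __init__(self, total_chunks):
        self.total_chunks = total_chunks
        self.refcount = {}       # chunk number -> number of mappings using it
        self.next_chunk = 0

    def allocate(self):
        if len(self.refcount) >= self.total_chunks:
            raise RuntimeError("thin-pool is full")
        chunk = self.next_chunk
        self.next_chunk += 1
        self.refcount[chunk] = 1
        return chunk

    def release(self, chunk):
        self.refcount[chunk] -= 1
        if self.refcount[chunk] == 0:
            del self.refcount[chunk]

    def used_chunks(self):
        return len(self.refcount)


class ThinVolume:
    """A thin LV is only a mapping: virtual block -> pool chunk."""

    def __init__(self, pool, mapping=None):
        self.pool = pool
        self.mapping = dict(mapping or {})

    def snapshot(self):
        # The "snapshot" is just another ThinVolume sharing the same chunks.
        for chunk in self.mapping.values():
            self.pool.refcount[chunk] += 1
        return ThinVolume(self.pool, self.mapping)

    def write(self, block):
        old = self.mapping.get(block)
        if old is not None and self.pool.refcount[old] == 1:
            return                   # chunk is exclusively ours: write in place
        new = self.pool.allocate()   # unmapped or shared: take a fresh chunk
        if old is not None:
            self.pool.release(old)   # drop our reference, others keep theirs
        self.mapping[block] = new


pool = ThinPool(total_chunks=8)
a = ThinVolume(pool)
for block in range(3):
    a.write(block)                   # provisions three chunks
b = a.snapshot()                     # no new chunks are consumed
b_before = dict(b.mapping)
a.write(0)                           # breaks sharing for block 0 of A only
```

After the last write, A points at a freshly allocated chunk for block 0,
while B's mapping is exactly what it was before. Nothing in the write path
of A ever looks at B.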

So when you now see that all thinLVs just map sets of chunks,
and all thinLVs can be active and running concurrently - how do you want to 
use reserves in the thin-pool :) ?
When do you decide it ?  (you need to see this is total race-land)
How do you actually orchestrate locking around this single point of failure ;) ?
You will surely come up with an idea of having a separate reserve for every thinLV ?
How big should it actually be ?
Are you going to 'refill' those reserves when the thin-pool gets emptier ?
How do you decide which thinLV deserves bigger reserves ;) ??

I assume you can start to SEE the whole point of this misery....

So instead - you can start with a normal thin-pool - keep it simple in the kernel,
and solve the complexity in user-space.

There you can decide - if you want to extend thin-pool...
You may drop some snapshot...
You may fstrim mounted thinLVs...
You can kill volumes way before the situation becomes unmaintainable....

All you need to accept is - you will kill them at 95% -
in your world with reserves it would be already reported as 100% full,
with totally unknown size of reserves :)
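
A user-space policy along these lines could look roughly like the Python
sketch below. It is only an illustration: the thresholds (66%, 85%, 95%) and
the 150GiB pool are the example numbers from this thread, the function and
action names are made up, the "watch" band between 66% and 85% and the
fallback when no snapshot exists are my assumptions, and how you obtain the
pool occupancy and execute the chosen action is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    name: str
    created: float   # creation time, e.g. seconds since the epoch


def threshold_gib(pool_gib, percent):
    """Convert an occupancy percentage into GiB for a pool of pool_gib."""
    return pool_gib * percent / 100


def choose_action(data_percent, snapshots):
    """Map thin-pool occupancy (0-100) to an (action, target) pair."""
    if data_percent < 66:
        return ("ignore", None)              # base volumes are still covered
    if data_percent < 85:
        return ("watch", None)               # assumption: just keep monitoring
    if data_percent < 95:
        if snapshots:
            # One possible logic: drop the oldest snapshot first.
            oldest = min(snapshots, key=lambda s: s.created)
            return ("drop_snapshot", oldest.name)
        return ("extend_or_trim", None)      # assumption: nothing left to drop
    return ("kill_volumes", None)            # last resort at 95% and above


snaps = [Snapshot("snap-new", 200.0), Snapshot("snap-old", 100.0)]
decision = choose_action(90, snaps)
```

With the two example snapshots above, 90% occupancy selects "snap-old" for
removal, and threshold_gib(150, 85) gives the ~128GiB figure quoted earlier.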

Regards

Zdenek