[linux-lvm] Reserve space for specific thin logical volumes
zkabelac at redhat.com
Tue Sep 12 23:02:20 UTC 2017
On 13.9.2017 at 00:41, Gionatan Danti wrote:
> On 13-09-2017 at 00:16, Zdenek Kabelac wrote:
>> On 12.9.2017 at 23:36, Gionatan Danti wrote:
>>> On 12-09-2017 at 21:44, matthew patton wrote:
>>> Again, please don't speak about things you don't know.
>>> I am *not* interested in thin provisioning itself at all; on the other
>>> side, I find CoW and fast snapshots very useful.
>> Not going to comment on KVM storage architecture - but with this statement -
>> you have a VERY simple use case:
>> Just minimize chance for overprovisioning -
>> let's go by example:
>> you have 10 10GiB volumes and you have 20 snapshots...
>> to not overprovision - you need 10 GiB * 30 LV = 300GiB thin-pool.
>> If that sounds like too much,
>> you can go with 150 GiB - to always 100% cover all 'base' volumes
>> and have some room for snapshots.
>> Now the fun begins - while monitoring is running -
>> you get callback for 50%, 55%... 95% 100%
>> at each moment you can do whatever action you need.
>> So assume 100GiB is bare minimum for base volumes - you ignore any
>> state with less than 66% occupancy of thin-pool and you start solving
>> problems with 85% (~128GiB)- you know some snapshot is better to be
>> You may try 'harder' actions for higher percentage.
>> (you need to consider how many dirty pages you leave floating in your system
>> and other variables)
>> Also you pick with some logic the snapshot which you want to drop -
>> Maybe the oldest ?
>> (see airplane :) URL link)....
>> Anyway - you have plenty of time to solve it still at this moment
>> without any danger of losing write operation...
>> All you can lose is some 'snapshot' which might have been present a
>> bit longer... but that is supposedly fine with your model workflow...
>> Of course you are getting in serious problem, if you try to keep all
>> these demo-volumes within 50GiB with massive overprovisioning ;)
>> There you have a much harder time deciding what should happen, what should be
>> removed and where is possibly better to STOP everything and let admin
>> decide what is the ideal next step....
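The sizing and threshold arithmetic from the example above can be checked with a few lines of Python. This is only a sketch: the function names and the snapshot list are hypothetical, and the 66% and 85% marks are simply the figures quoted above, not anything LVM enforces.

```python
# Sketch of the sizing example: 10 base volumes of 10 GiB plus 20 snapshots.
# All names here are illustrative, not part of any LVM tool.

def pool_sizes(n_volumes, volume_gib, n_snapshots):
    """Return (GiB covering all base volumes, GiB covering every thin LV)."""
    base = n_volumes * volume_gib
    full = (n_volumes + n_snapshots) * volume_gib
    return base, full

def threshold_gib(pool_gib, percent):
    """Convert a thin-pool occupancy percentage into GiB."""
    return pool_gib * percent / 100

def oldest_snapshot(snapshots):
    """Pick the oldest snapshot from (name, creation_time) pairs, or None."""
    return min(snapshots, key=lambda s: s[1])[0] if snapshots else None

base, full = pool_sizes(10, 10, 20)
print(base, full)               # 100 300
print(threshold_gib(150, 66))   # 99.0  (below this, nothing to do)
print(threshold_gib(150, 85))   # 127.5 (start dropping snapshots)
print(oldest_snapshot([("snap_b", 200), ("snap_a", 100)]))  # snap_a
```

Choosing the oldest snapshot is just one possible policy; any other ordering can be swapped in by changing the key function.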
> Hi Zdenek,
> I fully agree with what you said above, and I sincerely thank you for taking
> the time to reply.
> However, I am not sure I understand *why* reserving space for a thin volume
> seems a bad idea to you.
> Let's take a 100 GB thin pool and suppose we want to *never* run out of space
> in spite of taking multiple snapshots.
> To achieve that, I need to a) carefully size the original volume, b) ask the
> thin pool to reserve the needed space and c) count the "live" data (REFER
> in ZFS terms) allocated inside the thin volume.
> Step-by-step example:
> - create a 40 GB thin volume and subtract its size from the thin pool (USED 40
> GB, FREE 60 GB, REFER 0 GB);
> - overwrite the entire volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
> - snapshot the volume (USED 40 GB, FREE 60 GB, REFER 40 GB);
> - completely overwrite the original volume (USED 80 GB, FREE 20 GB, REFER 40 GB);
> - a new snapshot creation will fail (REFER is higher than FREE).
> Result: thin pool is *never allowed* to fill. You need to keep track of
> per-volume USED and REFER space, but thinp performance should not be impacted
> in any manner. This is not theoretical: it is already working in this manner
> with ZVOLs and refreservation, *without* involving/requiring any advanced
> coupling/integration between block and filesystem layers.
> Don't get me wrong: I am sure that, if you choose to not implement this
> scheme, you have a very good reason to do that. Moreover, I understand that
> patches are welcome :)
> But I would like to understand *why* this possibility is ruled out with such
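For readers following the quoted USED/FREE/REFER bookkeeping, here is a toy Python model of it. It is a sketch of the accounting as described in the quoted steps only; the class and method names are hypothetical, and it implements neither dm-thin nor ZFS refreservation.

```python
# Toy model of the quoted accounting (single volume, sizes in GB).

class ReservedPool:
    def __init__(self, size_gb):
        self.size = size_gb
        self.used = 0      # space charged against the pool
        self.refer = 0     # live data inside the volume
        self.shared = 0    # live data still pinned by the latest snapshot
        self.volume = 0

    @property
    def free(self):
        return self.size - self.used

    def create_volume(self, size_gb):
        if size_gb > self.free:
            raise RuntimeError("not enough free space to reserve the volume")
        self.volume = size_gb
        self.used += size_gb          # the whole volume is reserved up front

    def overwrite_all(self):
        # Rewriting data shared with a snapshot needs fresh space;
        # unshared data is rewritten inside the existing reservation.
        if self.shared > self.free:
            raise RuntimeError("pool would overflow")
        self.used += self.shared
        self.shared = 0
        self.refer = self.volume

    def snapshot(self):
        # Refuse the snapshot unless a full rewrite could still be absorbed.
        if self.refer > self.free:
            return False
        self.shared = self.refer
        return True

pool = ReservedPool(100)
pool.create_volume(40)
print(pool.used, pool.free, pool.refer)   # 40 60 0
pool.overwrite_all()
print(pool.used, pool.free, pool.refer)   # 40 60 40
print(pool.snapshot())                    # True
pool.overwrite_all()
print(pool.used, pool.free, pool.refer)   # 80 20 40
print(pool.snapshot())                    # False
```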
There could be a simple answer and complex one :)
I'd start with simple one - already presented here -
when you write to an INDIVIDUAL thin volume target - the respective dm thin
target manipulates a single btree set - it does NOT care that there are some
other snapshots and never influences them -
You ask here to heavily 'change' thin-pool logic - so writing to THIN volume A
can remove/influence volume B - this is very problematic for many reasons.
We can go into details of BTree updates (that should be really discussed with
its authors on dm channel ;)) - but I think the key element is capturing the
idea that the usage of thinLV A does not change thinLV B.
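A toy model may help picture this. The Python sketch below is an assumption-laden illustration, not the kernel's btree implementation: each thin LV is a plain dictionary from virtual block to pool chunk, a snapshot is a copy of that dictionary, and a write to a shared chunk remaps only the volume being written. All names are hypothetical.

```python
# Toy thin-pool: every thin LV is just a mapping {virtual block: pool chunk}.

class ThinPool:
    def __init__(self, n_chunks):
        self.free = list(range(n_chunks))
        self.volumes = {}

    def create(self, name):
        self.volumes[name] = {}

    def snapshot(self, origin, name):
        # A snapshot is simply another thin LV that starts with the same mapping.
        self.volumes[name] = dict(self.volumes[origin])

    def _shared(self, name, chunk):
        return any(chunk in mapping.values()
                   for other, mapping in self.volumes.items() if other != name)

    def write(self, name, block):
        mapping = self.volumes[name]
        chunk = mapping.get(block)
        # Unmapped or shared block: take a fresh chunk for THIS volume only.
        if chunk is None or self._shared(name, chunk):
            if not self.free:
                raise RuntimeError("thin-pool is full")
            mapping[block] = self.free.pop(0)

pool = ThinPool(4)
pool.create("A")
pool.write("A", 0)
pool.write("A", 1)
pool.snapshot("A", "B")
before = dict(pool.volumes["B"])
pool.write("A", 0)                      # breaks sharing for block 0 of A
print(pool.volumes["A"])                # {0: 2, 1: 1}
print(pool.volumes["B"] == before)      # True: B's mapping was never touched
```

Note that in this model "B" is in no way special: writing to "B" would behave exactly the same, leaving "A" untouched.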
Now to your free 'reserved' space fiction :)
There is NO way to decide WHO deserves to use the reserve :)
Every thin volume is equal - (the fact we call some thin LV snapshot is
user-land fiction - in kernel all thinLVs are just equal - every thinLV
references a set of thin-pool chunks) -
(for late-night thinking - what would be snapshot of snapshot which is fully
So when you now see that all thinLVs just map a set of chunks,
and all thinLVs can be active and running concurrently - how do you want to
use reserves in thin-pool :) ?
When do you decide it ? (you need to see this is total race-land)
How do you actually orchestrate locking around this single point of failure ;) ?
You will surely come up with an idea of having a separate reserve for every thinLV ?
How big it should actually be ?
Are you going to 'refill' those reserves when thin-pool gets emptier ?
How do you decide which thinLV deserves bigger reserves ;) ??
I assume you can start to SEE the whole point of this misery....
So instead - you can start with normal thin-pool - keep it simple in kernel,
and solve complexity in user-space.
There you can decide - if you want to extend thin-pool...
You may drop some snapshot...
You may fstrim mounted thinLVs...
You can kill volumes way before the situation becomes unmaintainable....
All you need to accept is - you will kill them at 95% -
in your world with reserves it would be already reported as 100% full,
with totally unknown size of reserves :)
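As a rough illustration of keeping this logic in user-space, the Python sketch below turns a thin-pool fill percentage into a list of commands to run. It only builds the command lists and executes nothing. The volume names, mount points, the "+10G" step and the exact ordering of actions are all assumptions made for the example; the 66%, 85% and 95% marks are the figures mentioned in this thread. The percentage is assumed to come from something like `lvs --noheadings -o data_percent vg/pool`, printed with a dot as decimal separator.

```python
# Sketch of a user-space policy: map thin-pool occupancy to escalating actions.
# Nothing is executed here; the caller decides whether to run the commands.

def parse_data_percent(lvs_output):
    """Parse the single value printed by lvs for the data_percent field."""
    return float(lvs_output.strip())

def plan(percent, pool, snapshots, mountpoints, can_extend):
    """Return a list of argv lists; snapshots are ordered oldest first."""
    if percent < 66:
        return []                                   # base volumes are covered
    if percent < 85:
        return [["fstrim", m] for m in mountpoints]  # cheap: give back unused chunks
    if percent < 95 and can_extend:
        return [["lvextend", "-L", "+10G", pool]]   # hypothetical step size
    if snapshots:
        return [["lvremove", "-y", snapshots[0]]]   # drop the oldest snapshot
    return []                                       # nothing left: admin decides

percent = parse_data_percent("  87.50\n")
print(plan(percent, "vg/pool", ["vg/snap_old", "vg/snap_new"], ["/mnt/a"], False))
# [['lvremove', '-y', 'vg/snap_old']]
```

The point of the sketch is that every one of these choices (which snapshot, how much to extend, when to stop) is a policy decision that is easy to change in a script and would be very hard to bake into the kernel target.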