[linux-lvm] Reserve space for specific thin logical volumes

Zdenek Kabelac zkabelac at redhat.com
Thu Sep 21 13:02:25 UTC 2017

On 21.9.2017 at 12:22, Xen wrote:
> Hi,
> thank you for your response once more.
> Zdenek Kabelac wrote on 21-09-2017 11:49:
>> Hi
>> Of course this decision makes some tasks harder (i.e. there are surely
>> problems which would not even exist if it would be done in kernel)  -
>> but lots of other things are way easier - you really can't compare
>> those....
> I understand. But many times the lack of integration of shared goals across 
> multiple projects is also a big problem in Linux.

And you also have projects that do try to integrate shared goals, like btrfs.

>>> However if we *can* standardize on some tag or way of _reserving_ this 
>>> space, I'm all for it.
>> Problems of a desktop user with a 0.5TB SSD are often different from those
>> of servers using 10PB across multiple network-connected nodes.
>> I see you call for one standard - but it's very, very difficult...
> I am pretty sure that if you start out with something simple, it can extend 
> into the complex.

We hope the community will provide some individual scripts...
It's not a big deal to integrate them into a repo dir...

>> We have spent a really long time thinking whether there is some sort of
>> 'one-ring-to-rule-them-all' solution - but we can't see it yet -
>> possibly because we know a wider range of use-cases compared with an
>> individual user-focused problem.
> I think you have to start simple.

It's mostly about what can be supported 'globally'
and what is rather 'individual' customization.

> You can never come up with a solution if you start out with the complex.
> The only thing I ever said was:
> - give each volume a number of extents or a percentage of reserved space if 
> needed

Which can't be delivered with current thin-p technology.
It's simply too computationally invasive for our targeted performance.

The only deliverable we have is - you create a 'cron' job that does the hard 
'computing' once in a while - and takes some 'action' when individual 
'volumes' go out of their preconfigured boundaries.  (Often such logic is 
implemented outside of lvm2 - in some DB engine - since lvm2 itself is 
really NOT a high-performing DB - the ascii format has its age....)

You can't get this 'percentage' logic online in the kernel (i.e. while you 
update an individual volume).
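The cron-driven check described above can be sketched in a few lines of shell. The volume names and the threshold below are hypothetical; in a real job the report would come from something like `lvs --noheadings --separator : -o lv_name,data_percent`, while here a fixed sample report stands in so the logic is self-contained:

```shell
#!/bin/sh
# Sketch of a periodic (cron) policy check - hypothetical volumes/threshold.
# A real job would generate the report with something like:
#   lvs --noheadings --separator : -o lv_name,data_percent vg
report="root:82.50
home:41.20"

check_volumes() {
    # $1 = "name:percent" report, $2 = per-volume threshold in percent
    echo "$1" | while IFS=: read -r lv pct; do
        # compare the integer part of data_percent against the threshold
        if [ "${pct%%.*}" -ge "$2" ]; then
            echo "ALERT $lv at ${pct}%"
        fi
    done
}

check_volumes "$report" 80
```

The 'action' taken on an alert (mail, removing an old snapshot, ...) is exactly the part that stays site-specific.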

> - for all the active volumes in the thin pool, add up these numbers
> - when other volumes require allocation, check against free extents in the pool

I assume you possibly missed this logic of thin-p:

When you update the origin - you always allocate FOR the origin, but the 
previously allocated chunk remains claimed by its snapshots (if there are any).

So if a snapshot shared all pages with the origin at the beginning (so basically 
consumed only some 'metadata' space and 0% real exclusively owned space) - after 
a full rewrite of the origin your snapshot suddenly 'holds' all the old chunks 
(100% of its size).

So when you 'write' to the ORIGIN - your snapshot becomes bigger in terms of 
individually/exclusively owned chunks - so if you have e.g. configured a snapshot 
to not consume more than XX% of your pool - you would simply need to recalculate 
this with every update of shared chunks....
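A worked example with invented numbers may make this concrete - a fully shared snapshot ends up exclusively owning 100% of its chunks once the origin is completely rewritten:

```shell
# Illustrative chunk accounting - all numbers are invented for the example.
chunks_total=100        # origin size, in chunks
shared_at_snap=100      # snapshot starts fully shared: 0 exclusive chunks
rewritten=100           # origin chunks rewritten after taking the snapshot

# each rewritten shared chunk becomes exclusively owned by the snapshot
if [ "$rewritten" -lt "$shared_at_snap" ]; then
    snap_exclusive=$rewritten
else
    snap_exclusive=$shared_at_snap
fi
pool_used=$((chunks_total + snap_exclusive))

echo "snapshot exclusively owns $snap_exclusive chunks"
echo "pool now holds $pool_used chunks for a ${chunks_total}-chunk origin"
```

So a per-snapshot "max XX% of the pool" rule would have to be re-evaluated on every write to the origin, not on writes to the snapshot itself.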

And as has already been said - this is currently unsupportable 'online'.

Another aspect here is - the thin-pool has no idea about the 'history' of volume 
creation - it does not know that volume X is a snapshot of volume Y - this is 
all only 'remembered' by lvm2 metadata.  In the kernel it's always just: 
volume X owns a set of chunks.
That's all the kernel needs to know for a single thin volume to work.

You can do it with a 'reasonable' delay in user-space, upon 'triggers' of a 
global threshold (thin-pool fullness).
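The pool-level trigger already exists as stock lvm.conf knobs; a sketch of the relevant settings (values are examples, not recommendations):

```
# /etc/lvm/lvm.conf (excerpt) - dmeventd reacts when the pool crosses
# the threshold; the numbers here are examples only.
activation {
    thin_pool_autoextend_threshold = 70   # act at 70% pool fullness
    thin_pool_autoextend_percent = 20     # grow the pool by 20%
}
```

Recent lvm2 also lets dmeventd hand the event to an external command (the dmeventd/thin_command setting), which is the natural hook for the per-volume policy logic discussed here.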

> - possibly deny allocation for these volumes

Unsupportable in the 'kernel' without a rewrite - though you can e.g. 'work 
around' this by placing 'error' targets in place of less important thinLVs...
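A sketch of that workaround - build an 'error' table line of the right length and swap it in with dmsetup. The device name is hypothetical, and the dmsetup calls are shown as comments since they need root and a live device:

```shell
# Build a replacement dm table for a 1 GiB volume (illustrative size).
# Device-mapper tables are expressed in 512-byte sectors.
size_sectors=$((1024 * 1024 * 1024 / 512))
table="0 $size_sectors error"
echo "$table"

# Then, as root, against a hypothetical /dev/mapper name:
#   dmsetup load vg-lessimportant_tlv --table "$table"
#   dmsetup resume vg-lessimportant_tlv   # all I/O to the LV now errors
```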

Imagine you would get pretty random 'denials' of your WRITE requests depending 
on interaction with other snapshots....

Surely if you use 'read-only' snapshots you may not see all the related 
problems, but such a very minor subclass of the whole provisioning solution is 
not worth special handling in the whole thin-p target.

> I did not know or did not realize the upgrade paths of the DM module(s) and 
> LVM2 itself would be so divergent.

lvm2 is a volume manager...

dm is the implementation layer for different 'segtypes' (in lvm2 terminology).

So e.g. anyone can write their own 'volume manager' and use 'dm' - it's fully 
supported - dm is not tied to lvm2 and is openly designed (and used by other 
projects).

> So my apologies for that but obviously I was talking about a full-system 
> solution (not partial).

yep - 2 different worlds....

i.e. crypto, multipath,...

>> You have origin and 2 snaps.
>> You set different 'thresholds' for these volumes  -
> I would not allow setting a threshold for snapshots.
> I understand that for dm thin target they are all the same.
> But for this model it does not make sense because LVM talks of "origin" and 
> "snapshots".
>> You then overwrite 'origin'  and you have to maintain 'data' for OTHER LVs.
> I don't understand. Other LVs == 2 snaps?

yes - other LVs are snaps in this example...

>> So you get into the position - when 'WRITE' to origin will invalidate
>> volume that is NOT even active (without lvm2 being even aware).
> I would not allow space reservation for inactive volumes.

You are not 'reserving' any space as the space already IS assigned to those 
inactive volumes.

What you would have to implement is to TAKE the space FROM them to satisfy the 
writing task to your 'active' volume and respect prioritization...

If you do not implement this 'active' chunk 'stealing' - you are really ONLY 
shifting the 'hit-the-wall' time-frame....  (worth possibly only a couple of 
seconds of your system load)...

In other words - tuning 'thresholds' in a userspace 'bash' script will give 
you the very same effect as focusing here on a very complex 'kernel' solution.


