[linux-lvm] Reserve space for specific thin logical volumes

Zdenek Kabelac zkabelac at redhat.com
Mon Sep 18 08:56:14 UTC 2017

On 17.9.2017 at 00:33, Xen wrote:
> Zdenek Kabelac wrote on 15-09-2017 11:22:
>> lvm2 makes them look the same - but underneath it's very different
>> (and it's not just by age - but also for targeting different purpose).
>> - old-snaps are good for short-time small snapshots - when there is
>> estimation for having low number of changes and it's not a big issue
>> if snapshot is 'lost'.
>> - thin-snaps are ideal for long-time living objects with possibility
>> to take snaps of snaps of snaps and you are guaranteed the snapshot
>> will not 'just dissapear' while you modify your origin volume...
>> Both have very different resources requirements and performance...
> Point being that short-time small snapshots are also perfectly served by thin...

If you take the other constraints into account - like the necessity of planning
small chunk sizes for the thin-pool to have reasonably efficient snapshots,
and the not-so-small memory footprint - there are cases where a short-lived
snapshot is simply the better choice.

> My root volume is not on thin and thus has an "old-snap" snapshot. If the 
> snapshot is dropped it is because of lots of upgrades but this is no biggy; 
> next week the backup will succeed. Normally the root volume barely changes.

And you can really have the VERY same behavior WITH thin-snaps.

All you need to do is 'erase' your inactive thin volume snapshot before the
thin-pool switches to out-of-space mode.

You really have A LOT of time (60 seconds) to do this - even when the thin-pool
hits 100% fullness.

All you need to do is write your 'specific' maintenance script that will
'erase' volumes tagged/named with some specific name, so you can easily find
those LVs and 'lvremove' them when the thin-pool is running out of space.
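A minimal sketch of the decision logic such a script needs, assuming the pool's fullness is read from `lvs -o data_percent`; the tag name `purgeable` and all VG/LV names are illustrative, not anything lvm2 ships:

```shell
#!/bin/sh
# Threshold check for a thin-pool "emergency purge" script.
# In real use the percentage would come from:
#   lvs --noheadings -o data_percent vg/pool0
over_threshold() {
    percent=$1    # as printed by lvs, e.g. "96.52"
    limit=$2      # integer threshold, e.g. 95
    [ "${percent%%.*}" -ge "$limit" ]
}

action=keep
if over_threshold "96.52" 95; then
    action=purge
    # A real script would now remove the expendable snapshots, e.g.:
    #   lvremove -y @purgeable     # all LVs carrying the (illustrative) tag
fi
echo "$action"
```

Run from cron or a dmeventd-triggered hook, this keeps the 'erase expendable LVs' policy entirely in user-space, which is exactly the point above.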

That's the advantage of 'inactive' snapshot.

If you have the snapshot 'active' - you need to kill the 'holders' (backup
software), umount the volume and remove it.

Again - quite reasonably simple task when you know all 'variables'.
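As a sketch, that kill/umount/remove sequence could look like this; all names are examples, and `DRYRUN=echo` only prints the commands (unset it to really run them):

```shell
#!/bin/sh
# Release an *active* snapshot: kill holders, umount, remove.
# DRYRUN=echo makes this a printout only; mount point and LV are examples.
DRYRUN=echo
MNT=/mnt/backup_snap
LV=vg/backup_snap

$DRYRUN fuser -km "$MNT"    # kill processes (e.g. backup software) holding the mount
$DRYRUN umount "$MNT"
$DRYRUN lvremove -y "$LV"
```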

Hardly doable at a generic level....

> So it would be possible to reserve regular LVM space for thin volumes as well 

'reserve' can't really be 'generic'.
Everyone has a different view on what a 'safe' reserve is.
And you lose a lot of space in unusable reserves...

I.e. think about 2000 LVs in a single thin-pool - and design the reserves....
Start to 'think big' instead of focusing on 3 thinLVs...

>> Thin-pool still does not support shrinking - so if the thin-pool
>> auto-grows to big size - there is not a way for lvm2 to reduce the
>> thin-pool size...
> Ah ;-). A detriment of auto-extend :p.

Yep - that's why we have not enabled 'autoresize' by default.

It's admin decision ATM whether the free space in VG should be used by 
thin-pool or something else.

It would be better if there were shrinking support - but it's not here yet...
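When an admin does opt in to auto-extension, it's a small lvm.conf change; the thresholds below are illustrative, not recommendations:

```
# /etc/lvm/lvm.conf (excerpt)
activation {
    # dmeventd extends the thin-pool once it crosses 70% fullness...
    thin_pool_autoextend_threshold = 70
    # ...growing it by 20% of its current size each time
    thin_pool_autoextend_percent = 20
}
```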

> No if you only kept some statistics that would not amount to all the mapping 
> data but only to a summary of it.

Why should the kernel be doing some complex statistics management?

(Again 'think big' - the kernel is not supposed to be parsing ALL metadata ALL
the time - really - in this case we could 'drop' all the user-space :) and shift
everything into the kernel - and we would end up with kernel code of similar
complexity to btrfs....)

> Say if you write a bot that plays a board game. While searching for moves the 
> bot has to constantly perform moves on the board. It can either create new 
> board instances out of every move, or just mutate the existing board and be a 
> lot faster.

Such a bot KNOWS all the combinations... - you are constantly forgetting that the
thin volume target maps a very small portion of the whole metadata set.

> A lot of this information is easier to update than to recalculate, that is, 
> the moves themselves can modify this summary information, rather than derive 
> it again from the board positions.

Maybe you should try to write a chess player then - AFAIK it's purely based on
brute CPU power and a massive library of known 'starts' & 'finishes'....

Your simplification proposal 'with summary' seems to be quite innovative here...

> This is what I mean by "updating the metadata without having to recalculate it".

What you propose is a very different thin-pool architecture - so you should try
to talk with its authors - I can only provide you with 'lvm2' abstraction
level details.

I cannot change kernel level....

The ideal upstreaming mechanism for a new target is to provide some at least 
basic implementation proving the concept can work.

And you should also show how this complicated kernel code gives any better
result than the current user-space solution we provide.

> You wouldn't have to keep the mapping information in RAM, just the amount of 
> blocks attributed and so on. A single number. A few single numbers for each 
> volume and each pool.

It really means the kernel would need to read ALL the data,
and do ALL the validation in kernel (work which is currently done in user-space).

Hopefully it's finally clear at this point.

> But if it's not active, can it still 'trace' another volume? Ie. it has to get 
> updated if it is really a snapshot of something right.

Inactive volume CANNOT change - so it doesn't need to be traced.

> If it doesn't get updated (and not written to) then it also does not allocate 
> new extents.

Allocation of new chunks always happen for an active thin LV.

> However volumes that see new allocation happening for them, would then always 
> reside in kernel memory right.
> You said somewhere else that overall data (for pool) IS available. But not for 
> volumes themselves?

Yes - the kernel knows how many 'free' chunks are in the POOL.
The kernel does NOT know how many individual chunks belong to a single thinLV.

> Regardless with one volume as "master" I think a non-ambiguous interpretation 
> arises?

There is no 'master' volume.

All thinLVs are equal - each just presents a set of mapped chunks.
Just some of those chunks can be mapped by more than one thinLV...

> So is or is not the number of uniquely owned/shared blocks known for each 
> volume at any one point in time?

Unless you parse all the metadata and build a big data structure for this info,
you do not have this information available.

>> You can use only very small subset of 'metadata' information for
>> individual volumes.
> But I'm still talking about only summary information...

I'm wondering how you would update such summary information when all you have
is simple 'fstrim' information.

To update such info - you would need to 'backtrace' ALL the 'released' blocks
of your fstrimmed thin volume - figure out how many OTHER thinLVs (snapshots)
were sharing the same blocks - and update all their summary information.

Effectively you again need pretty complex data processing (which otherwise
happens at the user-space level in the current design) to be shifted into the kernel.

I'm not saying it cannot be done - surely you can reach the goal (just like 
btrfs) - but it's simply different design requiring to write completely 
different kernel target and all user-land app.

It's not something we can reach with a few months of coding...

> However with the appropriate amount of user friendliness what was first only 
> for experts can be simply for more ordinary people ;-).

I assume you overestimate how many people work on the project...
We do the best we can...

> I mean, kuch kuch, if I want some SSD caching in Microsoft Windows, kuch kuch, 
> I right click on a volume in Windows Explorer, select properties, select 
> ReadyBoost tab, click "Reserve complete volume for ReadyBoost", click okay, 
> and I'm done.

Do you think it's fair to compare us with  MS capacity  :)  ??

> It literally takes some 10 seconds to configure SSD caching on such a machine.
> Would probably take me some 2 hours in Linux not just to enter the commands 
> but also to think about how to do it.

It's the open source world...

> So it made no sense to have to "figure this out" on your own. An enterprise 
> will be able to do so yes.
> But why not make it easier...

All which needs to happen is -  someone sits and write the code :)
Nothing else is really needed ;)

Hopefully my time invested into this low-level explanation will motivate 
someone to write something for users....

> Yes again, apologies, but I was basing myself on Kernel 4.4 in Debian 8 with 
> LVM 2.02.111 which, by now, is three years old hahaha.

Well, we are at 2.02.174 - so I'm really mainly interested in complaints
against the upstream version of lvm2.

There is not much point in discussing 3 years history...

> If the monitoring script can fail, now you need a monitoring script to monitor 
> the monitoring script ;-).

Maybe you start to see why  'reboot' is not such a bad option...

>> You can always use normal device - it's really about the choice and purpose...
> Well the point is that I never liked BTRFS.

Do not take this as some  'advocating' for usage of btrfs.

But all you are proposing here is mostly 'btrfs' design.

lvm2/dm  is quite different solution with different goals.

> BTRFS has its own set of complexities and people running around and tumbling 
> over each other in figuring out how to use the darn thing. Particularly with 
> regards to the how-to of using subvolumes, of which there seem to be many 
> different strategies.

It's been the BTRFS 'solution' for how to overcome those problems...

> And then Red Hat officially deprecates it for the next release. Hmmmmm.

Red Hat simply can't do everything for everyone...

> Sometimes there is annoying stuff like not being able to change a volume group 
> (name) when a PV is missing, but if you remove the PV how do you put it back 

You may possibly be missing the complexity behind those operations.

But we try to keep them at 'reasonable' minimum.

Again, please try to 'think big' - when you have e.g. hundreds of PVs attached
over the network... used in clusters...

There are surely things, which do look over-complicated when you have just 2 
disks in your laptop.....

But as it has been said - we address issues on 'generic' level...

You have states - and the transitions between states are defined in some way and
apply to system states XYZ....

> I guess certain things are difficult enough that you would really want a book 
> about it, and having to figure it out is fun the first time but after that a 
> chore.

It would be nice if someone had written a book about it ;)

> You mean use a different pool for that one critical volume that can't run out 
> of space.
> This goes against the idea of thin in the first place. Now you have to give up 
> the flexibility that you seek or sought in order to get some safety because 
> you cannot define any constraints within the existing system without 
> separating physically.

Nope - it's still well within.

Imagine you have a VG with 1TB of space.
You create a 0.2TB 'userdata' thin-pool with some thins,
and you create a 0.2TB 'criticalsystem' thin-pool with some thins.

Then you orchestrate the growth of those 2 thin-pools according to your rules and
needs - i.e. always keep 0.1TB of free space in the VG to provide some space for
the system thin-pool.  You may even start to remove the 'userdata' thin-pool in
case you would like to get some space for the 'criticalsystem' thin-pool.
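A sketch of such orchestration, with a hypothetical policy of keeping 100GiB of VG space in reserve for the critical pool; names and numbers are illustrative, and `DRYRUN=echo` only prints the commands:

```shell
#!/bin/sh
# Keep enough free VG space for the 'criticalsystem' pool; sacrifice
# 'userdata' when the reserve is violated.
DRYRUN=echo
VG=vg
RESERVE_GIB=100

# In real use the free space would come from:
#   vg_free=$(vgs --noheadings --units g --nosuffix -o vg_free "$VG" | tr -d ' ')
vg_free=80.00     # sample value for illustration

if [ "${vg_free%%.*}" -lt "$RESERVE_GIB" ]; then
    $DRYRUN lvremove -y "$VG/userdata"    # reclaim space for the critical pool
fi
```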

There is NO solution to protect you against running out of system space when
you overprovision.

It always ends with having a 1TB thin-pool with a 2TB volume on it.

You can't fit 2TB into 1TB, so at some point in time every overprovisioning
setup is going to hit a dead-end....

> I get that... building a wall between two houses is easier than having to 
> learn to live together.
> But in the end the walls may also kill you ;-).
> Now you can't share washing machine, you can't share vacuum cleaner, you have 
> to have your own copy of everything, including bath rooms, toilet, etc.
> Even though 90% of the time these things go unused.

When you share - you need to HEAVILY plan for everything.

There is always some price paid.

In many cases it's better to leave your vacuum cleaner unused for 99% of its
time, just to be sure you can grab it ANYTIME you need it....

You may also drop usage of modern CPUs which are 99% left unused....

So of course it's cheaper to share - but is it comfortable??
Does it pay off??

Your pick....

> I understand, but does this mean that the NUMBER of free blocks is also always 
> known?

Thin-pool knows how many blocks are 'free'.

> So isn't the NUMBER of used/shared blocks in each DATA volume also known?

It's not known per volume.

All you know is - the thin-pool has size X and has Y free blocks.
The pool does not know how many thin devices are there - unless you scan the metadata.

All known info is visible with  'dmsetup status'

Status report exposes all known info for thin-pool and for thin volumes.

All is described in kernel documentation for these DM targets.
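For example, the pool's data usage can be pulled straight out of the status line; the sample line below follows the documented thin-pool status field order, but its numbers are made up:

```shell
#!/bin/sh
# Extract used/total data blocks from a thin-pool 'dmsetup status' line.
# Real input would be:  status=$(dmsetup status vg-pool0-tpool)
# Field order:  <start> <len> thin-pool <transaction id>
#               <used meta>/<total meta> <used data>/<total data> ...
status="0 2097152 thin-pool 1 19/2048 130/4096 - rw discard_passdown queue_if_no_space"

data_field=$(echo "$status" | awk '{print $6}')   # "130/4096"
used=${data_field%/*}
total=${data_field#*/}
echo "used=$used total=$total free=$((total - used))"
```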

> What about the 'used space'. Could you, potentially, theoretically, set a 
> threshold for that? Or poll for that?

Clearly, used_space is 'whole_space - free_space'.

> IF you could change the device mapper, THEN could it be possible to reserve 
> allocation space for a single volume???

You probably need to start that discussion on the more kernel-oriented DM list then.

> Logically there are only two conditions:
> - virtual free space for critical volume is smaller than its reserved space
> - virtual free space for critical volume is bigger than its reserved space
> If bigger, then all the reserved space is necessary to stay free
> If smaller, then we don't need as much.

You can implement all this logic with existing lvm2 2.02.174.
Scripting gives you all the power to your hands.
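Those two conditions reduce to a simple min(); as a sketch with sample values (in real use the figures would be derived from `lvs` output):

```shell
#!/bin/sh
# Effective reserve = min(configured reserve, virtual free space of the volume).
# All sizes in GiB; the values are illustrative only.
virtual_size=40
virtual_used=35      # in real use, derived from lvs -o data_percent
reserve=10

virtual_free=$((virtual_size - virtual_used))   # 5
if [ "$virtual_free" -lt "$reserve" ]; then
    effective=$virtual_free   # volume can't allocate more than this anyway
else
    effective=$reserve
fi
echo "effective_reserve=${effective}GiB"
```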

> But it probably also doesn't hurt.
> So 40GB virtual volume has 5GB free but reserved space is 10GB.
> Now real reserved space also becomes 5GB.

Please try to stop thinking within your 'margins' and your 'conditions' -
every user/customer has a different view - sometimes you simply need to
'think big' in TiB or PiB ;)....

> Many things only work if the user follows a certain model of behaviour.
> The whole idea of having a "critical" versus a "non-critical" volume is that 
> you are going to separate the dependencies such that a failure of the 
> "non-critical" volume will not be "critical" ;-).

Already explained few times...

>> With 'reboot' you know where you are -  it's IMHO fair condition for this.
>> With frozen FS and paralyzed system and your 'fsfreeze' operation of
>> unimportant volumes actually has even eaten the space from thin-pool
>> which may possibly been used better to store data for important
>> volumes....
> Fsfreeze would not eat more space than was already eaten.

If you 'fsfreeze' - the filesystem has to be put into a consistent state -
so all unwritten 'data' & 'metadata' in your page-cache have to be pushed to
your disk.

This will cause a hardly 'predictable' amount of provisioning in your
thin-pool.  You can possibly estimate a 'maximum' number....

> If I freeze a volume only used by a webserver... I will only freeze the 
> webserver... not anything else?

A number of system apps do scans over the entire system....
Apps talk to each other and wait for answers...
Of course there would be lots of 'transiently' frozen apps, because other apps
are not well written for a parallel world...

Again - if you have a set of constraints - like you have a 'special' volume for
the web server which is ONLY used by the web server - you can make a better decision.

In this case it would likely be better to kill the 'web server' and umount the volume....

> We're going to prevent them from mapping new chunks ;-).

You can't prevent kernel from mapping new chunks....

But you can do ALL of it in userspace - though ATM you possibly need to use
'dmsetup' commands....


