[linux-lvm] Reserve space for specific thin logical volumes

Zdenek Kabelac zkabelac at redhat.com
Fri Sep 15 09:22:26 UTC 2017


On 15.9.2017 at 09:34, Xen wrote:
> Zdenek Kabelac wrote on 14-09-2017 21:05:
> 

>>> But if I do create snapshots (which I do every day) when the root and boot 
>>> snapshots fill up (they are on regular lvm) they get dropped which is nice,
>>
>> old snapshots are a different technology for a different purpose.
> 
> Again, what I was saying was to support the notion that having snapshots that 
> may grow a lot can be a problem.


lvm2 makes them look the same - but underneath they are very different (and 
the difference is not just age - they target different purposes).

- old-snaps are good for short-lived, small snapshots - when you expect a low 
number of changes and it's not a big issue if the snapshot is 'lost'.

- thin-snaps are ideal for long-lived objects, with the possibility to take 
snaps of snaps of snaps, and you are guaranteed the snapshot will not 'just 
disappear' while you modify your origin volume...

Both have very different resource requirements and performance...
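For comparison, a minimal sketch of how the two kinds are created on the lvm2 
command line (VG/LV names and sizes here are purely illustrative):

  # thin snapshot: no size given, it lives inside the same thin-pool as its origin
  lvcreate -s -n thin_snap vg/thin_lv

  # old (COW) snapshot: a fixed-size exception store next to the origin
  lvcreate -s -L 1G -n old_snap vg/linear_lv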

> I am not sure the purpose of non-thin vs. thin snapshots is all that different 
> though.
> 
> They are both copy-on-write in a certain sense.
> 
> I think it is the same tool with different characteristics.

There are cases where taking an old-snap of a thinLV is a quite valid option 
and it will pay off...

Exactly in the case where you use thin and you want to make sure your temporary 
snapshot will not 'eat' all your thin-pool space - you let the snapshot die instead.

Thin-pool still does not support shrinking - so if the thin-pool auto-grows to 
a big size, there is no way for lvm2 to reduce the thin-pool size...
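That combination looks roughly like this - by giving the snapshot of a thinLV an 
explicit size, lvm2 creates an old-style COW snapshot, so when it fills up only 
the snapshot is invalidated and the thin-pool itself is not consumed (names and 
sizes are only an example):

  # bounded, throw-away old-snap of a thin volume
  lvcreate -s -L 2G -n nightly_snap vg/thin_lv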



> That's just the sort of thing that in the past I have been keeping track of 
> continuously (in unrelated stuff) such that every mutation also updated the 
> metadata without having to recalculate it...

Would you prefer to spend all your RAM keeping the mapping information for all 
the volumes, and to put very complex code into the kernel to parse information 
which is technically already out-of-date the moment you get the result??

In 99.9% of runtime you simply don't need this info.

> But the purpose of what you're saying is that the number of uniquely owned 
> blocks by any snapshot is not known at any one point in time.

As long as a 'thinLV' (i.e. your snapshot thinLV) is NOT active, there is 
nothing in the kernel maintaining its dataset.  You can have lots of thinLVs 
active and lots of others inactive.
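You can observe this from user-space as well - a small illustration (LV names 
are placeholders):

  # deactivate a snapshot thinLV - the kernel then keeps no mapping tree for it
  lvchange -an vg/thin_snap

  # the lv_active field shows which LVs currently have a kernel presence
  lvs -o lv_name,lv_active vg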


> Well pardon me for digging this deeply. It just seemed so alien that this 
> thing wouldn't be possible.

I'd say it's very smart ;)

Only a very small subset of the 'metadata' information is needed for 
individual volumes.
> 
> It becomes a rather big enterprise to install thinp for anyone!!!

It's enterprise-level software ;)

> Because to get it running takes no time at all!!! But to get it running well 
> then implies huge investment.

In the most common scenarios the user knows when he runs out of space - it will 
not be a 'pleasant' experience - but the user's data should be safe.

Then it depends on how much energy/time/money the user wants to put into the 
monitoring effort to minimize downtime.

As has been said - disk-space is quite cheap.
So if you monitor and insert your new disk-space in time (enterprise...) you 
have a smaller set of problems than if you constantly try to fight with a 100% 
full thin-pool...

You still have problems even when you have 'enough' disk-space ;)
e.g. you select a small chunk-size and then want to extend the thin-pool data 
volume beyond its addressable capacity - each chunk-size has its final maximum 
data size....
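One way to estimate this up front is thin_metadata_size from 
thin-provisioning-tools: it computes how much metadata a given data size / 
chunk-size / number of thins would need, and since the metadata device itself 
is capped at about 16GiB, that also bounds how far the data volume can grow 
(the numbers below are only an example):

  # metadata needed for a 10TiB data LV with 64KiB chunks and up to 200 thins
  thin_metadata_size -b 64k -s 10t -m 200 -u g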

> That means for me and for others that may not be doing it professionally or in 
> a larger organisation, the benefit of spending all that time may not weigh up 
> to the cost it has and the result is then that you keep stuck with a deeply 
> suboptimal situation in which there is little or no reporting or fixing, all 
> because the initial investment is too high.

You can always use a normal device - it's really about choice and purpose...


> 
> While personally I also like the bigger versus smaller idea because you don't 
> have to configure it.

I'm still proposing to use different pools for different purposes...

Sometimes spreading the solution across existing logic is way easier than 
trying to achieve some super-intelligent universal one...

>> Script is called at  50% fullness, then when it crosses 55%, 60%, ...
>> 95%, 100%. When it drops below the threshold - you are called again once
>> the boundary is crossed...
> 
> How do you know when it is at 50% fullness?
> 
>> If you are proud sponsor of your electricity provider and you like the
>> extra heating in your house - you can run this in loop of course...
> 
>> Threshold are based on  mapped size for whole thin-pool.
>>
>> Thin-pool surely knows all the time how many blocks are allocated and free for
>> its data and metadata devices.
> 
> But didn't you just say you needed to process up to 16GiB to know this 
> information?

Of course the thin-pool has to be aware of how much free space it has.
You can somehow imagine this as a 'hidden' volume holding the FREE space...

So to give you this 'info' about free blocks in the pool, only a very small 
metadata subset is maintained - you don't need to know about all the other 
volumes...

If another volume is releasing or allocating chunks, your 'FREE space' gets 
updated....

It's complex underneath and the locking is very performance-sensitive - but for 
easy understanding this should give you the picture...
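That aggregate view is exactly what the reporting tools show - for instance 
(VG/pool names are placeholders, and the '-tpool' device name follows lvm2's 
usual naming, so adjust it to your setup):

  # data and metadata fullness of the whole pool as percentages
  lvs -o lv_name,data_percent,metadata_percent vg/pool

  # raw allocated/total block counts straight from the kernel
  dmsetup status vg-pool-tpool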

> 
> You may not know the size and attribution of each device but you do know the 
> overall size and availability?

The kernel supports one threshold setting - user-space (dmeventd) is woken up 
when usage passes it.

This value maps to the autoextend threshold in lvm.conf.

As a 'secondary' source, dmeventd checks the pool fullness every 10 seconds 
with a single ioctl() call, compares how the fullness has changed, and provides 
you with callbacks for those 50, 55, ... jumps
(as can be found in 'man dmeventd').

So when the autoextend threshold is passed you get an instant call.
For all the others there is an up-to-10-second delay for discovery.
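For reference, the corresponding lvm.conf knobs look roughly like this (the 
concrete numbers are just an example):

  activation {
      # dmeventd asks lvm2 to extend the pool once data usage crosses 70%
      thin_pool_autoextend_threshold = 70
      # each extension grows the pool data LV by 20% of its current size
      thin_pool_autoextend_percent = 20
  }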

>> In the single thin-pool  all thins ARE equal.
> 
> But you could make them unequal ;-).

I cannot ;)  - I'm an lvm2 coder -   dm thin-pool is Joe's/Mike's toy :)

In general, you can come up with many different kernel modules which take 
different approaches to the problem.

Worth noting - RH now has Permabit in its portfolio - so there can be more than 
one type of thin-provisioning supported in lvm2...

The Permabit solution has deduplication, compression and 4K blocks - but no 
snapshots....


> 
> The goal was more to protect the other volumes, supposing that log writing 
> happened on another one, for that other log volume not to impact the other 
> main volumes.

IMHO the best protection is a different pool for different thins...
You can more easily decide which pool can 'grow up'
and which one should rather be taken offline.

So your 'less important' data volumes may simply hit the wall hard,
while your 'strategically important' ones avoid overprovisioning as much as 
possible to keep running.
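A rough sketch of that split (pool and volume names are made up):

  # critical data gets its own, conservatively provisioned pool
  lvcreate --type thin-pool -L 100G -n pool_critical vg
  lvcreate -V 80G --thinpool pool_critical -n lv_database vg

  # logs live in a separate pool that is allowed to overprovision
  lvcreate --type thin-pool -L 20G -n pool_logs vg
  lvcreate -V 50G --thinpool pool_logs -n lv_logs vg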

Motto: keep it simple ;)

> So you have thin global reservation of say 10GB.
> 
> Your log volume is overprovisioned and starts eating up the 20GB you have 
> available and then runs into the condition that only 10GB remains.
> 
> The 10GB is a reservation maybe for your root volume. The system (scripts) (or 
> whatever) recognises that less than 10GB remains, that you have claimed it for 
> the root volume, and that the log volume is intruding upon that.
> 
> It then decides to freeze the log volume.

Of course you can play with 'fsfreeze' and other things - but all these things 
are very specific to individual users and their individual preferences.

Effectively, if you freeze your 'data' LV you may, as a reaction, paralyze the 
rest of your system - unless you know the 'extra' information about the user's 
usage pattern.

But do not take this as something to discourage you from trying it - you may 
come up with a perfect solution for your particular system - and some other 
user may find it useful in a similar pattern...

It's just something that lvm2 can't support globally.

But lvm2 will give you enough bricks for writing 'smart' scripts...
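As one hedged example of such a brick: dmeventd's thin plugin can run an 
external command (thin_command in lvm.conf) at those 50%, 55%, ... marks and 
exports the current fullness in environment variables (see 'man dmeventd'). A 
minimal sketch - the script path, the 95% policy and the frozen mount point are 
purely invented for illustration:

  #!/bin/sh
  # referenced as  thin_command = "/usr/local/sbin/pool_watch.sh"  in lvm.conf
  # DMEVENTD_THIN_POOL_DATA / DMEVENTD_THIN_POOL_METADATA hold fullness in percent

  DATA=${DMEVENTD_THIN_POOL_DATA:-0}
  # drop a possible fractional part before the integer comparison
  if [ "${DATA%%.*}" -ge 95 ]; then
      # sacrifice the unimportant log filesystem to protect the rest of the pool
      fsfreeze --freeze /var/log 2>/dev/null
  fi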

> Okay.. I understand. I guess I was deluded a bit by non-thin snapshot 
> behaviour (filled up really fast without me understanding why, and concluding 
> that it was doing 4MB copies).

Fast disks are now easily able to write gigabytes per second... :)

> 
> But attribution of an extent to a snapshot will still be done in extent-sizes 
> right?

The allocation unit in a VG is the 'extent' - it ranges from 1 sector to 4GiB,
and the default is 4MiB - yes....
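For reference, the extent size is fixed per VG and can be inspected or chosen 
at creation time (the device name is a placeholder):

  # show the extent size of the existing VGs
  vgs -o vg_name,vg_extent_size

  # pick a non-default extent size when creating a new VG
  vgcreate -s 8M vg /dev/sdb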

> 
> So I don't think the problems of freezing are bigger than the problems of 
> rebooting.

With 'reboot' you know where you are - IMHO it's a fair condition for this.

With a frozen FS you have a paralyzed system, and your 'fsfreeze' operation on 
the unimportant volumes has actually even eaten space from the thin-pool which 
could possibly have been better used to store data for the important volumes....
And there is even a big danger that you will 'freeze' yourself already during 
the call of fsfreeze (unless, of course, you put BIG margins around it).


> 
> "System is still running but some applications may have crashed. You will need 
> to unfreeze and restart in order to solve it, or reboot if necessary. But you 
> can still log into SSH, so maybe you can do it remotely without a console ;-)".

Compare with an email:

Your system has run out of space, all actions to gain some more space have 
failed - going to reboot into some 'recovery' mode.

> 
> So there is no issue with snapshots behaving differently. It's all the same 
> and all committed data will be safe prior to the fillup and not change afterward.

Yes - 'snapshot' is user-land language - in the kernel, all thins just map chunks...

If you can't map a new chunk, things are going to stop - and start to error 
out shortly...

Regards

Zdenek



