[linux-lvm] Reserve space for specific thin logical volumes

Tue Sep 12 11:34:35 UTC 2017

On 12/09/2017 13:01, Zdenek Kabelac wrote:
> There is very good reason why thinLV is fast - when you work with thinLV -
> you work only with data-set for single thin LV.
> 
> So you write to thinLV and either you modify existing exclusively owned
> chunk or you duplicate and provision new one.   Single thinLV does not
> care about other thin volume - this is very important to think about and
> it's important for reasonable performance and memory and cpu resources
> usage.

Sure, I grasp that.

> I think you need to think 'wider'.
> 
> You do not need to use a single thin-pool - you can have numerous 
> thin-pools,
> and for each one you can maintain separate thresholds (for now in your own
> scripting - but doable with today's  lvm2)
> 
> Why would you want to place 'critical' volume into the same pool
> as some non-critical one ??
> 
> It's simply way easier to have critical volumes in different thin-pool
> where you might not even use over-provisioning.

I need to take a step back: my main use for thinp is as a virtual machine 
backing store. Due to some limitations in libvirt and virt-manager, which 
basically do not recognize thin pools, I cannot use multiple thin pools 
or volumes.

Rather, I had to use a single, big thin volume with XFS on top.

> Seems to me - everyone here looks for a solution where thin-pool is used
> till the very last chunk in thin-pool is allocated - then some magical
> AI steps in, decides smartly which 'other already allocated chunk' can be
> trashed (possibly the one with minimal impact  :)) - and the whole thing
> will continue to run at full speed ;)
> 
> Sad/bad news here - it's not going to work this way....

No, I absolutely *do not want* thinp to automatically deallocate/trash 
some provisioned blocks. Rather, I am all for something like "if free 
space is lower than 30%, disable new snapshot *creation*".
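
To make the idea concrete, here is a minimal sketch of such a guard done in my own scripting rather than inside thinp. The function name snapshot_allowed and the names vg/pool, vg/vm and vm_snap are hypothetical placeholders; the 30% figure is just the example above.

```shell
#!/bin/sh
# Hypothetical helper, not part of lvm2: succeeds (exit status 0) only if
# the pool would still have at least min_free_pct percent unallocated.
snapshot_allowed() {
    used_pct=$1       # pool data usage in percent, e.g. 62.50
    min_free_pct=$2   # minimum free percentage required, e.g. 30
    # awk does the floating point comparison that plain sh cannot do
    awk -v used="$used_pct" -v min_free="$min_free_pct" \
        'BEGIN { exit !((100 - used) >= min_free) }'
}

# Usage sketch (all names are placeholders):
# used=$(lvs --noheadings -o data_percent vg/pool | tr -d ' ')
# if snapshot_allowed "$used" 30; then
#     lvcreate -s vg/vm -n vm_snap
# else
#     echo "pool too full, refusing to create snapshot" >&2
# fi
```

Of course this only protects snapshots created through the wrapper; anything calling lvcreate directly bypasses it, which is why I would prefer the check to live in lvm2 itself.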

> lvm2 also DOES protect you from creation of new thin-pool when the fullness
> is above lvm.conf defined threshold - so nothing really new here...

Maybe I am missing something: this threshold is about new thin pools or 
new snapshots within a single pool? I was really speaking about the latter.

>> [root@blackhole ~]# zfs destroy tank/vol1@snap1
>> [root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M count=500 oflag=direct
>> 500+0 records in
>> 500+0 records out
>> 524288000 bytes (524 MB) copied, 12.7038 s, 41.3 MB/s
>> [root@blackhole ~]# zfs list -t all
>> NAME        USED  AVAIL  REFER  MOUNTPOINT
>> tank        622M   258M    96K  /tank
>> tank/vol1   621M   378M   501M  -
>>
>> # Snapshot creation now FAILS!
> 
> ZFS is filesystem.
> 
> So let's repeat again :) amount of problems inside a single filesystem 
> is not comparable with block-device layer - it's entirely different 
> world of problems.
> 
> You can't really expect filesystem 'smartness' on block-layer.
> 
> That's the reason why we can see all those developers boldly stepping 
> into the 'dark waters' of  mixed filesystem & block layers.

In the examples above, I did not use any ZFS filesystem layer. I used 
ZFS as volume manager, with the intent to place an XFS filesystem on top 
of ZVOL block volumes.

The ZFS man page clearly warns about ENOSPC with sparse volumes. My point 
is that, by clever use of the refreservation property, I can engineer 
a setup where snapshots are generally allowed, unless free space is under 
a certain threshold. In that case, they are not allowed (but never 
automatically deleted!).
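
As I read the man page, with refreservation set a snapshot is only allowed if there is enough free pool space outside the reservation to accommodate the bytes currently referenced by the volume. A rough sketch of that rule as a shell check follows; zvol_snapshot_fits is a hypothetical helper of mine, and it only mirrors my reading of the rule, not the actual ZFS accounting.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the free pool space (in bytes) can hold
# the bytes currently referenced by the volume. This is a simplification
# of the refreservation rule, not ZFS's real space accounting.
zvol_snapshot_fits() {
    avail_bytes=$1   # free pool space outside reservations
    refer_bytes=$2   # bytes referenced by the volume
    [ "$avail_bytes" -ge "$refer_bytes" ]
}

# Usage sketch with the dataset names from the listing above:
# avail=$(zfs get -Hp -o value available tank)
# refer=$(zfs get -Hp -o value referenced tank/vol1)
# zvol_snapshot_fits "$avail" "$refer" || echo "snapshot would be refused" >&2
```

With the figures from the listing above (about 258M available in tank, 501M referenced by tank/vol1) the check fails, which matches the refused snapshot.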

> lvm2/dm trusts in different concept - it's possibly less efficient,
> but possibly way more secure - where you have different layers,
> and each layer could be replaced and is maintained separately.

And I really trust layer separation - it is for this very reason I am a 
big fan of thinp, but its fail behavior somewhat scares me.

> ATM thin-pool cannot somehow auto-magically 'drop'  snapshots on its own.

Let me repeat: I do *not* want thinp to automatically drop anything. I 
simply want it to disallow new snapshot/volume creation when unallocated 
space is too low.

> And that's the reason why we have those monitoring features provided 
> with dmeventd.   Where you monitor  occupancy of thin-pool and when the
> fullness goes above defined threshold  - some 'action' needs to happen.

And I really thank you for that - this is a big step forward.
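
For anyone doing this in their own scripts instead, the threshold check itself is small. Below is a minimal sketch, assuming an 80% threshold and a pool named vg/pool (both placeholders); pool_over_threshold is a hypothetical helper. It looks at metadata usage as well as data usage, since a thin pool can run out of either.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if either the data or the
# metadata usage of a thin pool is above the given threshold (in percent).
pool_over_threshold() {
    data_pct=$1
    meta_pct=$2
    threshold=$3
    awk -v d="$data_pct" -v m="$meta_pct" -v t="$threshold" \
        'BEGIN { exit !(d > t || m > t) }'
}

# Usage sketch (vg/pool and the 80% threshold are placeholders):
# set -- $(lvs --noheadings -o data_percent,metadata_percent vg/pool)
# if pool_over_threshold "$1" "$2" 80; then
#     logger "thin pool vg/pool is above 80%, action needed"
# fi
```
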

> AFAIK current kernel (4.13) with thinp & ext4 used with remount-ro on 
> error and lvm2 is safe to use in case of emergency - so surely you can 
> lose some uncommitted data but after reboot and some extra free space 
> made in thin-pool you should have consistent filesystem without any 
> damage after fsck.
> 
> There are not known simple bugs in this case - like system crashing on 
> dm related OOPS (like Xen seems to suggest... - we need to see his bug 
> report...)
> 
> However - when thin-pool gets full - the reboot and filesystem check is 
> basically mandatory  -  there is no support  (and no plan to start 
> support randomly dropping allocated chunks from other thin-volumes to 
> make space for your running one)
> 
> 
> I'd like to still see what you think is  'deadly'

Committed (fsynced) writes are safe, and this is very good. However, 
*many* applications do not properly issue fsync(); this is a fact of life.

I absolutely *do not expect* thinp to automatically cope well with these 
applications - I fully understand & agree that applications *must* issue 
proper fsyncs.

However, recognizing that the real world is quite different from my ideals, 
I want to exclude as many problems as possible: for this reason, I 
really want to prevent full thin pools even in the face of failed 
monitoring (or somnolent sysadmins).

In the past, I observed that XFS takes a relatively long time to 
recognize that a thin volume is unavailable - and many async writes can 
be lost in the process. Ext4 + data=journal did a better job, but a) 
it is not the default filesystem in RH anymore and b) data=journal is 
not the default option and has its share of problems.

Complex systems need to be monitored - true. And I do that; in fact, I 
have *two* monitoring systems in place (Zabbix and a custom shell-based 
one). However, having been bitten by a failed Zabbix agent in the past, I 
learned a good lesson: design systems where some types of problems simply 
cannot happen.

So if, in the face of a near-full pool, thinp refused to let me create a 
new filesystem, I would be happy :)

> And also I'd like to be explained what better thin-pool can do in terms
> of block device layer.

Thinp is doing a great job, and nobody wants to deny that.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8