[linux-lvm] Reserve space for specific thin logical volumes

Gionatan Danti g.danti at assyoma.it
Mon Sep 11 21:59:18 UTC 2017


Il 11-09-2017 12:35 Zdenek Kabelac ha scritto:
> The first question here is - why do you want to use thin-provisioning ?

Because the behavior of classic LVM snapshots (slow write speed and a 
linear performance decrease as snapshot count increases) makes them 
useful for nightly backups only.

On the other hand, thinp's very fast CoW behavior means snapshots are 
actually usable and can be taken frequently (which is very useful to 
recover from user errors).
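
For instance, taking a thin snapshot is nearly instantaneous and carries 
no classic-snapshot write penalty. A minimal sketch, assuming a thin 
volume vg/data backed by a thin pool vg/pool (names purely illustrative):

# thin snapshot: no size needed, it shares the pool with its origin
lvcreate -s -n data_snap1 vg/data
# thin snapshots carry the "activation skip" flag by default,
# so activate with -K when the snapshot is actually needed
lvchange -ay -K vg/data_snap1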

> As thin-provisioning is about 'promising the space you can deliver
> later when needed'  - it's not about hidden magic to make the space
> out-of-nowhere.

I fully agree. In fact, I was asking how to reserve space to *protect* 
critical thin volumes from "liberal" resource use by less important 
volumes. Fully-allocated thin volumes sound very interesting - even if 
I see them as a performance optimization rather than a safety measure.

> The idea of planning to operate thin-pool on 100% fullness boundary is
> simply not going to work well - it's  not been designed for that
> use-case - so if that's been your plan - you will need to seek for
> other solution.
> (Unless you seek for those 100% provisioned devices)

I do *not* want to run at 100% data usage. Actually, I want to avoid it 
entirely by setting aside reserved space which cannot be used for things 
such as snapshots. In other words, I would much rather see a snapshot 
fail than its origin volume become unavailable *and* corrupted.

Let me take a detour and use ZFS as an example (don't bash me for doing 
that!)

In ZFS terms, there are objects called ZVOLs - ZFS volumes/block devices 
- which can be either "fully preallocated" or "sparse".

By default, they are "fully preallocated": their entire nominal space is 
reserved and subtracted from the ZPOOL total capacity. Please note that 
this does *not* mean that space is really allocated on the ZPOOL, rather 
that the nominal space is accounted against other ZFS datasets/volumes 
when new objects are created. A filesystem sitting on top of such a ZVOL 
will never run out of space; rather, if the remaining capacity is not 
enough to guarantee this constraint, creating new volumes/snapshots is 
forbidden.

Example:
# 1 GB ZPOOL
[root@blackhole ~]# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  1008M   456K  1008M         -     0%     0%  1.00x  ONLINE  -

# Creating a 600 MB ZVOL (note the different USED vs REFER values)
[root@blackhole ~]# zfs create -V 600M tank/vol1
[root@blackhole ~]# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank        621M   259M    96K  /tank
tank/vol1   621M   880M    56K  -

# Snapshot creation - note that, since REFER is very low (I wrote
# nothing to the volume yet), creating the snapshot is allowed
[root@blackhole ~]# zfs snapshot tank/vol1@snap1
[root@blackhole ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              621M   259M    96K  /tank
tank/vol1         621M   880M    56K  -
tank/vol1@snap1     0B      -    56K  -

# Let's write something to the volume (note how REFER is now higher than
# the free, unreserved space)
[root@blackhole ~]# zfs destroy tank/vol1@snap1
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M count=500 oflag=direct
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 12.7038 s, 41.3 MB/s
[root@blackhole ~]# zfs list -t all
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank        622M   258M    96K  /tank
tank/vol1   621M   378M   501M  -

# Snapshot creation now FAILS!
[root@blackhole ~]# zfs snapshot tank/vol1@snap1
cannot create snapshot 'tank/vol1@snap1': out of space
[root@blackhole ~]# zfs list -t all
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank        622M   258M    96K  /tank
tank/vol1   621M   378M   501M  -

The above surely is safe behavior: when the free, unused space is too 
low to honor the reservation, snapshot creation is disallowed.

On the other hand, using the "-s" option you can create a "sparse" ZVOL 
- a volume whose nominal space is *not* accounted/subtracted from the 
total ZPOOL capacity. Such a volume comes with warnings similar to those 
for thin volumes. From the man page:

'Though not recommended, a "sparse volume" (also known as "thin 
provisioning") can be created by specifying the -s option to the zfs 
create -V command, or by changing the reservation after the volume has 
been created.  A "sparse volume" is a volume where the reservation is 
less than the volume size.  Consequently, writes to a sparse volume can 
fail with ENOSPC when the pool is low on space.  For a sparse volume, 
changes to volsize are not reflected in the reservation.'
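
For reference, creating such a sparse ZVOL is just a matter of adding 
-s (shown here on a hypothetical second volume, tank/vol2):

[root@blackhole ~]# zfs create -s -V 600M tank/vol2
[root@blackhole ~]# zfs get -H -o value refreservation tank/vol2
none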

The only real difference from a fully preallocated volume is the 
property carrying the space reservation. I can even switch at run-time 
between a fully preallocated and a sparse volume by simply changing that 
property. Indeed, a very important thing to understand is that this 
property can be set to *any value* between 0 ("none") and the maximum 
(nominal) volume size.

On a 600M fully preallocated volume:
[root@blackhole ~]# zfs get refreservation tank/vol1
NAME       PROPERTY        VALUE      SOURCE
tank/vol1  refreservation  621M       local

On a 600M sparse volume:
[root@blackhole ~]# zfs get refreservation tank/vol1
NAME       PROPERTY        VALUE      SOURCE
tank/vol1  refreservation  none       local
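
Switching between the two modes is a single property change, for 
example (the 621M value matches what ZFS computed above, i.e. volsize 
plus metadata overhead):

# make the volume sparse at run-time
zfs set refreservation=none tank/vol1
# or reserve the full nominal size again (assuming the pool still has
# enough free space left to honor the reservation)
zfs set refreservation=621M tank/vol1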

Now, a sparse (refreservation=none) volume *can* be snapshotted even if 
very little free space is available in the ZPOOL:

# The very same command that previously failed now completes successfully
[root@blackhole ~]# zfs snapshot tank/vol1@snap1
[root@blackhole ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              502M   378M    96K  /tank
tank/vol1         501M   378M   501M  -
tank/vol1@snap1     0B      -   501M  -

# Using a non-zero, but lower-than-nominal threshold
# (refreservation=100M) allows the snapshot to be taken:
[root@blackhole ~]# zfs set refreservation=100M tank/vol1
[root@blackhole ~]# zfs snapshot tank/vol1@snap1
[root@blackhole ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              602M   278M    96K  /tank
tank/vol1         601M   378M   501M  -
tank/vol1@snap1     0B      -   501M  -

# If free space drops under the lower-but-not-zero reservation
# (refreservation=100M), snapshot creation fails again:
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=1M count=300 oflag=direct
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 4.85282 s, 64.8 MB/s
[root@blackhole ~]# zfs list -t all
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              804M  76.3M    96K  /tank
tank/vol1         802M  76.3M   501M  -
tank/vol1@snap1   301M      -   501M  -
[root@blackhole ~]# zfs snapshot tank/vol1@snap2
cannot create snapshot 'tank/vol1@snap2': out of space

OK - now back to the original question: why can reserved space be 
useful? Consider the following two scenarios:

A) You want to use snapshots efficiently and *never* encounter an 
unexpectedly full ZPOOL. Your main constraint is to use at most 50% of 
the available space for your "critical" ZVOL. With such a setup, any 
"excessive" snapshot/volume creation will surely fail, but the main ZVOL 
will be unaffected;

B) You want to overprovision somewhat (taking worst-case snapshot 
behavior into account), but with a *large* operating margin. In this 
case, you can create a sparse volume with a lower (but non-zero) 
reservation. Any snapshot/volume creation attempted once this margin is 
crossed will fail. You will surely need to clean up some space (e.g. 
delete older snapshots), but you avoid the runaway effect of new 
snapshots being continuously created, consuming additional space.

Now let's leave the ZFS world and get back to thinp: it would be 
*really* cool to provide the same sort of functionality. Sure, space 
usage would have to be tracked both at the pool and at the volume level 
- but the safety increase would be massive. There is a big difference 
between a corrupted main volume and a failed snapshot: while the latter 
can be resolved without too much concern, the former (volume corruption) 
really is a scary thing.
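
As far as I know, today this can only be approximated from user space, 
and only at the pool level. A minimal sketch of what I mean (pool/volume 
names and the 80% threshold are made up, and the check is obviously 
racy):

#!/bin/bash
# refuse to take a new thin snapshot when the pool is already too full
VG=vg
POOL=pool
ORIGIN=data
THRESHOLD=80   # max data_percent allowed before refusing new snapshots

used=$(lvs --noheadings -o data_percent "$VG/$POOL" | tr -d ' ')
used=${used%%.*}   # drop the decimal part reported by lvs

if [ "$used" -ge "$THRESHOLD" ]; then
    echo "thin pool $VG/$POOL is ${used}% full, refusing to snapshot" >&2
    exit 1
fi

lvcreate -s -n "${ORIGIN}_snap_$(date +%Y%m%d%H%M%S)" "$VG/$ORIGIN"

Of course this only guards snapshots taken through the script, not the 
origin volume itself filling the pool - which is exactly why a native, 
per-volume reservation would be so valuable.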

Don't misunderstand me, Zdenek: I *REALLY* appreciate you core 
developers for the outstanding work on LVM. This is especially true in 
light of BTRFS's problems, and with Stratis (which is heavily based on 
thinp) becoming the next big thing. I appreciate even more that you are 
on the mailing list, replying to your users.

Thin volumes are really cool (and fast!), but they can fail in a deadly 
manner. A fail-safe approach (i.e. no new snapshots allowed) is much 
more desirable.

Thanks.

> 
> Regards
> 
> 
> Zdenek

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8



