[linux-lvm] thin handling of available space

Mon May 2 14:32:26 UTC 2016

On Fri, Apr 29, 2016 at 7:23 AM, Zdenek Kabelac <zkabelac at redhat.com> wrote:

> Thin-provisioning is NOT about providing device to the upper
> system levels and inform THEM about this lie in-progress.
> That's complete misunderstanding of the purpose.
>

I think this line of thought is a bit of a strawman.

Thin provisioning is entirely about presenting the upper layer with a
logical view which does not match the physical view, including the
possibility for such things as over provisioning. How much of this detail
is presented to the higher layer is an implementation detail and has
nothing to do with "purpose". The purpose or objective is to allow volumes
that are not fully allocated in advance. This is what "thin" means, as
compared to "thick".

> If you seek for a filesystem with over-provisioning - look at btrfs, zfs
> and other variants...
>

I have to say that I am disappointed with this view, particularly if this
is a view held by Red Hat. To me this represents a misunderstanding of the
purpose for over-provisioning, and a misunderstanding of why thin volumes
are required. It seems there is a focus on "filesystem" in the above
statement, and that this may be the point of debate.

When a storage provider (EMC, NetApp, ...) provides a block device and a
snapshot capability, I expect to be able to take snapshots with low
overhead. The previous LVM model for snapshots was really bad, in that it
was not low overhead. We use this capability for many purposes including:

1) Instantiating test environments or dev environments from a snapshot of
production, with copy-on-write to allow for very large full-scale
environments to be constructed quickly and with low overhead. One of our
examples involves about 1 TByte of JIRA and Confluence attachments
collected over several years. It is exposed over NFS by the NetApp device,
but in the backend it is a volume. This volume is snapshotted and then
exposed as a different volume with copy-on-write characteristics. The
storage allocation is monitored, and if it is exceeded, it is known that
there will be particular behaviour. I believe in our case, the behaviour
is that the snapshot becomes unusable.

2) Frequent snapshots. In many of our use cases, we may take snapshots
every 15 minutes, every hour, and every day, keeping 3 or more of each. If
this storage had to be allocated in full, this amounts to at least 10X the
storage cost. Using snapshots, and understanding the rate of churn, we can
use closer to 1X or 2X the storage overhead, instead of 10X the storage
overhead.

3) Snapshot as a means of achieving a consistent backup at low cost of
outage or storage overhead. If we "quiesce" the application (flush buffers,
put new requests on hold, etc.) take the snapshot, and then "resume" the
application, this can be achieved in a matter of seconds or less. Then, we
can mount the snapshot at a separate mount point and proceed with a more
intensive backup process against a particular consistent point-in-time.
This can be fast and require closer to 1X the storage overhead, instead of
2X the storage overhead.
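
The storage arithmetic behind the cases above can be sketched roughly as
follows. This is only an illustration: the 1 TB volume size and the churn
fractions are made-up assumptions, and the model simply charges each
snapshot for the blocks that changed while it was kept.

```python
# Rough comparison of full copies versus copy-on-write snapshots.
# All figures are illustrative assumptions, not measurements.

def full_copy_cost(volume_tb: float, snapshots: int) -> float:
    """Storage needed if every snapshot is a fully allocated copy."""
    return volume_tb * (1 + snapshots)

def cow_cost(volume_tb: float, churn_fractions: list[float]) -> float:
    """Storage needed if each snapshot only holds the blocks that changed
    while it was kept (its churn, as a fraction of the volume)."""
    return volume_tb * (1 + sum(churn_fractions))

volume_tb = 1.0
# 3 snapshots each at 15-minute, hourly, and daily intervals (assumed churn).
churn = [0.001] * 3 + [0.005] * 3 + [0.05] * 3

full = full_copy_cost(volume_tb, len(churn))
thin = cow_cost(volume_tb, churn)
print(f"full copies: {full:.2f} TB, copy-on-write: {thin:.2f} TB")
```

With nine retained snapshots, full copies cost 10X the volume, while the
copy-on-write figure depends entirely on the churn that is assumed.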

In all of these cases - we'll buy more storage if we need more storage.
But, we're not going to use BTRFS or ZFS to provide the above capabilities,
just because this is your opinion on the matter. Storage vendors of
reputation and market presence sell these capabilities as features, and we
pay a lot of money to have access to these features.

In the case of LVM... which is really the point of this discussion... LVM
is not necessarily going to be used or available on a storage appliance.
The LVM use case, at least for us, is for storage which is thinly
provisioned by the compute host instead of the backend storage appliance.
This includes:

1) Local disks, particularly local flash drives, which are kept local to
achieve higher levels of performance than can normally be achieved with a
remote storage appliance.

2) Local file systems, on remote storage appliances, using a protocol such
as iSCSI to access the backend block device. This might be the case where
we need better control of the snapshot process, or to abstract the
management of the snapshots from the backend block device. In our case, we
previously used an EMC over iSCSI for one of these use cases, and we are
switching to NetApp. However, instead of embedding NetApp-specific logic
into our code, we want to use LVM on top of iSCSI, and re-use the LVM thin
pool capabilities from the host, such that we don't care what storage is
used on the backend. The management scripts will work the same whether the
storage is local (the first case above) or not (the case we are looking
into now).

In both of these cases, we have a need to take snapshots and manage them
locally on the host, instead of managing them on a storage appliance. In
both cases, we want to take many light-weight snapshots of the block
device. You could argue that we should use BTRFS or ZFS, but you should
know full well that both of these have caveats as well. We want to use XFS
or EXT4 as our needs require, and still have the ability to take
light-weight snapshots.
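
As a minimal sketch of what such a host-side management script might
issue, the helper below builds (but does not run) the LVM commands for one
thin snapshot. The volume group, origin, and snapshot names are
hypothetical, and the origin is assumed to already be a thin volume in a
thin pool.

```python
import shlex

def thin_snapshot_commands(vg: str, origin: str, snap: str) -> list[list[str]]:
    """Return argv lists to create and activate a thin snapshot.
    Nothing is executed here; the caller decides how to run them."""
    return [
        # No size argument: a snapshot of a thin volume draws its space
        # from the same thin pool as the origin.
        ["lvcreate", "--snapshot", "--name", snap, f"{vg}/{origin}"],
        # Thin snapshots are created with the activation-skip flag set,
        # so --ignoreactivationskip (-K) is needed to activate one.
        ["lvchange", "--activate", "y", "--ignoreactivationskip", f"{vg}/{snap}"],
    ]

# Hypothetical names, for illustration only.
for argv in thin_snapshot_commands("vg0", "data", "data_snap1"):
    print(shlex.join(argv))
```

Because the commands only name a volume group and logical volumes, the
same script would apply whether the physical volume is a local flash
drive or an iSCSI LUN.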

Generally, I've seen that the people who argue that thin provisioning is a
"lie" tend not to be talking about snapshots. I have a sense that you are
talking more as storage providers for customers, and talking more about
thinly provisioning content for your customers. In this case - I think I
would agree that it is a "lie" if you don't make sure to have the storage
by the time it is required. But, I think this is a very small use case in
reality. I think large service providers would use Ceph or EMC or NetApp,
or some such technology to provision large amounts of storage per customer,
and LVM would be used more at the level of a single customer, or a single
machine. In these cases, I would expect that LVM thin volumes should not be
used across multiple customers without understanding the exact type of
churn expected, and therefore the maximum allocation that would be
required. In the case of our IT team and EMC or NetApp, they mostly avoid
the use of thin volumes for "cross customer" purposes, and instead use thin
volumes for a specific customer, for a specific need. In the case of Amazon
EC2, for example... I would use EBS for storage, and expect that even if it
is "thin", Amazon would make sure to have enough storage to meet my
requirement if I need it. But, I would use LVM on my Amazon EC2 instance,
and I would expect to be able to use LVM thin pool snapshots to
over-provision my own per-machine storage requirements by creating multiple
snapshots of the underlying storage, with a full understanding of the
amount of churn that I expect to occur, and a full understanding of the
need to monitor.

> Device target is definitely not here to solve  filesystem troubles.
> Thinp is about 'promising' - you as admin promised you will provide
> space -  we could here discuss maybe that LVM may possibly maintain
> max growth size we can promise to user - meanwhile - it's still the admin
> who creates thin-volume and gets WARNING if VG is not big enough when all
> thin volumes would be fully provisioned.
> And  THAT'S IT - nothing more.
> So please avoid making thinp target to be answer to ultimate question of
> life, the universe, and everything - as we all know  it's 42...

The WARNING is a cover-your-ass type warning that is showing up
inappropriately for us. It is warning me about something that I should already
know, and it is training me to ignore warnings. Thinp doesn't have to be
the answer to everything. It does, however, need to provide a block device
visible to the file system layer, and it isn't invalid for the file system
layer to be able to query about the nature of the block device, such as
"how much space do you *really* have left?"

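For context, that question can already be answered from user space by the
admin, though not by the file system itself. A minimal sketch, assuming an
LVM version whose `lvs` supports `--reportformat json` and assuming the
JSON has the shape of the hypothetical sample below (pool name and numbers
are made up):

```python
import json

def pool_free_bytes(lvs_json: str, pool_name: str) -> int:
    """Estimate unallocated data space in a thin pool from lvs JSON text."""
    report = json.loads(lvs_json)
    for section in report["report"]:
        for lv in section.get("lv", []):
            if lv["lv_name"] == pool_name:
                size = int(lv["lv_size"])              # bytes, with --units b
                used_pct = float(lv["data_percent"])   # pool data usage in %
                return int(size * (100.0 - used_pct) / 100.0)
    raise KeyError(pool_name)

# Hypothetical sample, shaped like the assumed output of:
#   lvs --reportformat json --units b --nosuffix -o lv_name,lv_size,data_percent
SAMPLE = '''{"report": [{"lv": [
  {"lv_name": "pool0", "lv_size": "107374182400", "data_percent": "75.00"}
]}]}'''

print(pool_free_bytes(SAMPLE, "pool0"))
```

This only reports what the pool has left at one moment; it says nothing
about how the file system layer could learn the same thing.
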
This seems to be a crux of this debate between you and the other people.
You think the block storage should be as transparent as possible, as if the
storage was not thin. Others, including me, think that this theory is
impractical, as it leads to edge cases where the file system could have
failed cleanly, but instead gets too far and hits a more dangerous failure
when it manages to allocate some blocks, but not others.

Exaggerating this to say that thinp would become everything, and the answer
to the ultimate question of life, weakens your point to me, as it means
that you are seeing things as far too black + white, whereas real life is
often not black + white.

It is your opinion that extending thin volumes to allow the file system to
have more information is breaking some fundamental law. But, in practice,
this sort of thing is done all of the time. "Size", "Read only",
"Discard/Trim Support", "Physical vs Logical Sector Size", ... are all
information queried from the device, and used by the file system. If it is
a general concept that applies to many different device targets, and it
will help the file system make better and smarter choices, why *shouldn't*
it be communicated? Who decides which ones are valid and which ones are not?
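
To make the point concrete, here is a small sketch of how those existing
attributes can be read on Linux through sysfs. The function name and the
`sysfs_root` parameter are illustrative choices, and the device name in
the usage note is hypothetical.

```python
from pathlib import Path

def block_device_attrs(dev: str, sysfs_root: str = "/sys/block") -> dict[str, int]:
    """Read a few attributes that a block device already advertises."""
    base = Path(sysfs_root) / dev

    def read_int(rel: str) -> int:
        return int((base / rel).read_text().strip())

    return {
        # The sysfs "size" file counts 512-byte units.
        "size_bytes": read_int("size") * 512,
        "read_only": read_int("ro"),
        "logical_sector": read_int("queue/logical_block_size"),
        "physical_sector": read_int("queue/physical_block_size"),
        # A discard granularity of 0 means discard is not supported.
        "discard_granularity": read_int("queue/discard_granularity"),
    }
```

For example, `block_device_attrs("sda")` would return these values for a
disk named sda, if one exists. None of them says how much space a thin
device really has left, which is the gap being discussed.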

I didn't disagree with all of your points. But, enough of them seemed to be
directly contradicting my perspective on the matter that I felt it
important to respond to them.

Mostly, I think everybody has a set of opinions and use cases in mind when
they come to their conclusions. Please don't ignore mine. If there is
something unreasonable above, please let me know.

-- 
Mark Mielke <mark.mielke at gmail.com>