[linux-lvm] thin handling of available space

Wed May 4 00:56:40 UTC 2016

On Tue, May 3, 2016 at 9:01 AM, matthew patton <pattonme at yahoo.com> wrote:

> On Mon, 5/2/16, Mark Mielke <mark.mielke at gmail.com> wrote:
> <quote>
>  very small use case in reality. I think large service
>  providers would use Ceph or EMC or NetApp, or some such
>  technology to provision large amounts of storage per
>  customer, and LVM would be used more at the level of a
>  single customer, or a single machine.
> </quote>
>
> Ceph?!? yeah I don't think so.
>

I don't use Ceph myself. I only listed it as it may be more familiar to
others, and because I was responding to a Red Hat engineer. We use NetApp
and EMC for the most part.

> If you thin-provision an EMC/Netapp volume and the block device runs out
> of blocks (aka Raid Group is full) all volumes on it will drop OFFLINE.
> They don't even go RO. Poof, they disappear. Why? Because there is no
> guarantee that every NFS client, every iSCSI client, every FC client is
> going to do the right thing. The only reliable means of telling everyone
> "shit just broke" is for the asset to disappear.
>

I think you are correct. Based upon experience, I don't recall this ever
happening, but upon reflection, it may just be that our IT team always
caught the situation before it became too bad, and either extended the
storage, or asked permission to delete snapshots.

> All in-flight writes to the volume that the array ACK'd are still good
> even if they haven't been de-staged to the intended device thanks to NVRAM
> and the array's journal device.
>

Right. A good feature. An outage occurs, but the data that was properly
written stays written.

> <quote>
>  In these cases, I
>  would expect that LVM thin volumes should not be used across
>  multiple customers without understanding the exact type of
>  churn expected, to understand what the maximum allocation
>  that would be required.
> </quote>
>
> sure, but that spells responsible sysadmin. Xen's post implied he didn't
> want to be bothered to manage his block layer, that magically the FS' job
> was to work closely with the block layer to suss out when it was safe to
> keep accepting writes. There's an answer to "works closely with block
> layer" - it's spelled BTRFS and ZFS.
>

I get a bit lost here in the push towards BTRFS and ZFS for people with
these expectations as I see BTRFS and ZFS as having a similar problem. They
can both still fill up. They just might get closer to 100% utilization
before they start to fail.

My use case isn't about getting closer to 100% utilization. For example,
when I first proposed our LVM thinp model for dealing with host-side
snapshots, there were people on my team who felt that "fstrim" should be
run very frequently (even every 15 minutes!), so as to make maximum use of
the available free space across multiple volumes and reduce the churn
captured in snapshots. I think anybody with this perspective really should
be looking at BTRFS or ZFS. Myself, I believe fstrim should run once a week
or less, and not really to save space, but more to hint to the flash device
which blocks are definitely not in use, so that it can make the best use of
the flash storage over time. If we start to pass 80% utilization, I raise
the alarm that we need to consider increasing the local storage, or moving
more content out of the thin volumes. Usually we find out that
more-than-normal churn occurred, and we just need to prune a few snapshots
to drop below 50% again. I still made them move the content that doesn't
need to be snapshotted out of the thin volume and onto a stand-alone LVM
thick volume, so as to entirely eliminate this churn from being trapped in
snapshots and accumulating.
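
As an illustration of the 80% alarm described above, here is a minimal
sketch of a usage check. It is not the tooling we actually run. It assumes
the report comes from "lvs" with the data_percent and metadata_percent
fields and "|" as the separator; the function names, the ALARM_PERCENT
constant, and the choice to treat a blank metadata_percent as 0 are
assumptions made for the example.

```python
import subprocess

ALARM_PERCENT = 80.0  # assumed threshold, taken from the 80% figure above


def parse_lvs_report(text):
    """Parse 'vg|lv|data%|meta%' lines into {"vg/lv": (data, meta)}.

    Rows with an empty data_percent (LVs that are not thin) are skipped.
    A blank metadata_percent is recorded as 0.0 (an assumption of this sketch).
    """
    usage = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 4 or not fields[2]:
            continue
        vg, lv, data, meta = fields
        usage[f"{vg}/{lv}"] = (float(data), float(meta) if meta else 0.0)
    return usage


def over_threshold(usage, limit=ALARM_PERCENT):
    """Return the sorted names whose data or metadata usage is at or past limit."""
    return sorted(
        name for name, (data, meta) in usage.items() if max(data, meta) >= limit
    )


def read_lvs():
    """Run lvs and return its raw report text (usually needs root)."""
    cmd = [
        "lvs", "--noheadings", "--separator", "|",
        "-o", "vg_name,lv_name,data_percent,metadata_percent",
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

Something like over_threshold(parse_lvs_report(read_lvs())) could be run
from cron, with any non-empty result treated as the signal to prune
snapshots or extend the storage.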

> LVM has no obligation to protect careless sysadmins doing dangerous things
> from themselves. There is nothing wrong with using THIN every which way you
> want just as long as you understand and handle the eventuality of extent
> exhaustion. Even thin snaps go invalid if it needs to track a change and
> can't allocate space for the 'copy'.
>

Right.

> > Amazon would make sure to have enough storage to meet my requirement if I
> > need them.
>
> Yes, because Amazon is a RESPONSIBLE sysadmin and has put in place tools
> to manage the fact they are thin-provisioning and to make damn sure they can
> cash the checks they are writing.
>

Right.

> > the nature of the block device, such as "how much space
> > do you *really* have left?"
>
> So you're going to write and then backport "second guess the block layer"
> code to all filesystems in common use and god knows how many versions back?
> Of course not. Just try to get on the EXT developer mailing list and ask
> them to write "block layer second-guessing code (aka branch on device
> flag=thin)" because THINP will cause problems for the FS when it runs out
> of extents. To which the obvious and correct response will be "Don't use
> THINP if you're not prepared to handle its prerequisites."
>

Bad things happen. Sometimes they happen very quickly. I don't intend to
dare fate, but if fate comes knocking, I prefer to be prepared. For
example, we had two monitoring systems in place for one particularly
critical piece of storage, where the application is particularly poor at
dealing with "out of space". No thin volumes in use here. Thick volumes all
the way. The system on the storage appliance stopped sending notifications
a few weeks prior as a result of some mistake during a reconfiguration or
upgrade. The separate monitoring system, using entirely different software
and configuration on a different host, also failed for a different reason
that I no longer recall. The volume became full, and the application data
was corrupted in a bad way that required recovery. My immediate reaction,
after addressing the corruption as best we could, was to demand three
monitoring systems instead of two. :-)
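
The failure above was a monitor that went silent without anyone noticing.
One way to catch that, sketched here purely as an illustration and not as
something we had in place, is to record when each monitor last reported and
flag any that have been quiet for too long. The monitor names, the one-day
limit, and the function name are hypothetical.

```python
import time

MAX_SILENCE_SECONDS = 24 * 3600  # assumed: every monitor reports at least daily


def silent_monitors(last_seen, now=None, max_silence=MAX_SILENCE_SECONDS):
    """Return sorted names of monitors whose last report is older than max_silence.

    last_seen maps a monitor name to the Unix timestamp of its last report.
    """
    now = time.time() if now is None else now
    return sorted(
        name for name, ts in last_seen.items() if now - ts > max_silence
    )
```

A check like this has to run somewhere independent of the monitors it
watches, otherwise it shares their failure modes.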

> > you and the other people. You think the block storage should
> > be as transparent as possible, as if the storage was not
> > thin. Others, including me, think that this theory is
> > impractical
>
> Then by all means go ahead and retrofit all known filesystems with the
> extra logic. ALL of the filesystems were written with the understanding
> that the block layer is telling the truth and that any "white lie" was
> benign in so much that it would be made good and thus could be assumed to
> be "truth" for practical purpose.

I think this relates more closely to your other response, which I will
respond to separately...

-- 
Mark Mielke <mark.mielke at gmail.com>