[linux-lvm] thin handling of available space

Mark Mielke mark.mielke at gmail.com
Wed May 4 01:25:11 UTC 2016

On Tue, May 3, 2016 at 8:00 AM, matthew patton <pattonme at yahoo.com> wrote:

> > written as required. If the file system has particular areas
> > of importance that need to be writable to prevent file
> > system failure, perhaps the file system should have a way of
> > communicating this to the volume layer. The naive approach
> > here might be to preallocate these critical blocks before
> >  proceeding with any updates to these blocks, such that the
> > failure situations can all be "safe" situations,
> > where ENOSPC can be returned without a danger of the file
> > system locking up or going read-only.
> why all of a sudden does each and every FS have to have this added code to
> second guess the block layer? The quickest solution is to mount the FS in
> sync mode. Go ahead and pay the performance piper. It's still not likely to
> be bullet proof but it's a sure step closer.

Not all of a sudden. From an "at work" perspective, LVM thinp as a technology
is relatively recent, and only recently being deployed in more places as we
migrate our systems from RHEL 5 to RHEL 6 to RHEL 7. I didn't consider
thinp an option before RHEL 7, and I didn't consider it stable even in RHEL
7 without significant testing on our part.

From an "at home" perspective, I have been using LVM thinp from the day it
was available in a Fedora release. The previous snapshot model was
unusable, and I wished upon a star that a better technology would arrive. I
tried BTRFS and while it did work - it was still marked as experimental, it
did not have the exact same behaviour as EXT4 or XFS from an applications
perspective, and I did encounter some early issues with subvolumes.
Frankly... I was happy to have LVM thinp, and glad that you LVM developers
provided it when you did. It is excellent technology from my perspective.
But, "at home", I was willing to accept some loose edge case behaviour. I
know when I use storage on my server at home, and if it fails, I can accept
the consequences for myself.

"At work", the situation is different. These are critical systems that I am
betting LVM on, and we are beginning to use it more broadly, after over a
year of success hosting our JIRA + Confluence instances on local flash using
LVM thinp for much of the application data, including PostgreSQL databases.
I am very comfortable with it from a "< 80% capacity" perspective. However,
every so often it passes 80%, and I have to raise the alarm, because I know
that there are edge cases that LVM / DM thinp + XFS don't handle quite so
well. It has never happened in production yet, but I've seen it happen many
times on designers' desktops using LVM thinp: their systems lock up and
require a reboot to recover.

I know there are smart people working on Linux, and smart people working on
LVM. Given the opportunity, and the perspective, I think the worst of these
cases are problems that deserve to be addressed, and probably are problems
that people have been working on with or without my contributions to the
subject.

> What you're saying is that when mounting a block device the layer needs to
> expose a "thin-mode" attribute (or the sysdmin sets such a flag via
> tune2fs). Something analogous to mke2fs can "detect" LVM raid mode geometry
> (does that actually work reliably?).
> Then there has to be code in every FS block de-stage path:
> IF thin {
>   tickle block layer to allocate the block (aka write zeros to it? - what
> about pre-existing data, is there a "fake write" BIO call that does
> everything but actually write data to a block but would otherwise trigger
> LVM thin's extent allocation logic?)
>    IF success, destage dirty block to block layer ELSE
>    inform userland of ENOSPC
> }
> In a fully journal'd FS (metadata AND data) the journal could be 'pinned'
> and likewise the main metadata areas if for no other reason they are zero'd
> at onset and or constantly being written to. Once written to, LVM thin
> isn't going to go back and yank away an allocated extent.

Yes. This is exactly the type of solution I was thinking of, including
pinning the journal! You used the correct terminology. I can read the terms
but not write them. :-)

You also managed to summarize it in only a few lines of text. As concepts
go, I think that makes it not-too-complex.

But, the devil is often in the details, and you are right that this is a
per-file system cost.
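To make the quoted pseudocode concrete, here is a minimal Python sketch of
the idea: pin the journal extents at mount time, and have the destage path
ask the thin layer to allocate the backing extent before writing, so the
failure mode is a clean ENOSPC rather than a wedged file system. The
`ThinPool` class and all names are invented for illustration; this is a toy
model, not kernel code.

```python
import errno

class ThinPool:
    """Toy model of a thin pool: a fixed number of free extents."""
    def __init__(self, free_extents):
        self.free_extents = free_extents
        self.allocated = set()

    def ensure_allocated(self, extent):
        # The "fake write": trigger extent allocation without writing data.
        if extent in self.allocated:
            return True
        if self.free_extents == 0:
            return False
        self.free_extents -= 1
        self.allocated.add(extent)
        return True

def destage_dirty_block(pool, extent):
    """Per-FS destage path from the proposal: allocate first, write second,
    so a full pool surfaces as ENOSPC instead of a half-written block."""
    if not pool.ensure_allocated(extent):
        return -errno.ENOSPC   # safe failure: nothing was written
    return 0                   # success: destage to the block layer

# Pin the journal's extents up front so metadata writes can't fail later.
pool = ThinPool(free_extents=2)
journal_extents = [0]
assert all(pool.ensure_allocated(e) for e in journal_extents)

assert destage_dirty_block(pool, 10) == 0               # last free extent
assert destage_dirty_block(pool, 11) == -errno.ENOSPC   # pool exhausted
assert destage_dirty_block(pool, 10) == 0               # already allocated
```

The point of the sketch is only the ordering: allocation is a separate,
failable step that happens before any data is committed.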

Balancing this, however, I am perhaps presuming that *all* systems will
eventually be thin volume systems, and that correct behaviour and highly
available behaviour will eventually require that *all* systems invest in
technology such as this. My view of the future is that fixed sized thick
partitions are very often a solution which is compromised from the start.
Most systems of significance grow over time, and the pressure to reduce
cost is real. I think we are taking baby steps to start, but that the
systems of the future will be thin volume systems. I see this as a problem
that needs to be understood and solved, except in the most limited of use
cases. This is my opinion, which I don't expect anybody to share.

> This at least should maintain FS integrity albeit you may end up in a
> situation where the journal can never get properly de-staged, so you're
> stuck on any further writes and need to force RO.

Interesting to consider. I don't see this as necessarily a problem - or
that it necessitates "RO" as a persistent state. For example, it would be
most practical if sufficient room were reserved to allow for content to be
removed, letting the file system become unwedged and "RW" again. Perhaps
there is always an edge case that would necessitate a persistent "RO" state,
requiring the volume to be extended to recover, but I think that edge case
could be refined to something that will tend to never happen?
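The "reserved room" idea could look something like the following toy model
(all names invented for illustration): ordinary writes fail once free space
drops to a reserve, but the small metadata allocations needed to delete
files may still dip into that reserve, so an unlink can always proceed and
free enough extents to make the file system writable again.

```python
class ReservedThinPool:
    """Thin pool that keeps a reserve so the FS can always be unwedged:
    ordinary writes fail early with ENOSPC, but the metadata updates
    needed to delete content may still use the reserved extents."""
    def __init__(self, total_extents, reserve):
        self.free = total_extents
        self.reserve = reserve

    def alloc(self, for_deletion=False):
        floor = 0 if for_deletion else self.reserve
        if self.free <= floor:
            return False    # ENOSPC to userland
        self.free -= 1
        return True

    def release(self):
        self.free += 1      # an extent freed by deleting content

pool = ReservedThinPool(total_extents=3, reserve=1)
assert pool.alloc()                   # ordinary write: 3 -> 2 free
assert pool.alloc()                   # ordinary write: 2 -> 1 free
assert not pool.alloc()               # at the reserve: ENOSPC, not RO
assert pool.alloc(for_deletion=True)  # unlink metadata may still proceed
pool.release()                        # the unlink frees two data extents...
pool.release()
assert pool.alloc()                   # ...and ordinary writes work again
```

The design choice is that "full" for user data is not "full" for the
operations that recover space, which is what avoids a persistent RO state.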

> > just want a sanely behaving LVM + XFS...)
> IMO if the system admin made a conscious decision to use thin AND
> overprovision (thin by itself is not dangerous), it's up to HIM to actively
> manage his block layer. Even on million dollar SANs the expectation is that
> the engineer will do his job and not drop the mic and walk away. Maybe the
> "easiest" implementation would be a MD layer job that the admin can tailor
> to fail all allocation requests once extent count drops below a number and
> thus forcing all FS mounted on the thinpool to go into RO mode.

Another interesting idea. I like the idea of automatically shutting down
our applications or PostgreSQL database if the thin pool reaches an unsafe
allocation, such as 90% or 95%. This would ensure the integrity of the
data, at the expense of an outage. This is something we could implement
today. Thanks.
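A monitoring job along these lines could be sketched as below. The
threshold logic is the interesting part; the `pool_data_percent` helper
shows one way to read the pool's usage via `lvs -o data_percent` (a real
LVM reporting field), though the volume group and pool names, the limits,
and the actual shutdown command (e.g. stopping PostgreSQL) are all
assumptions left to the operator.

```python
import subprocess

SOFT_LIMIT = 90.0   # raise the alarm, stop accepting new work
HARD_LIMIT = 95.0   # shut the database down to protect data integrity

def action_for(data_percent):
    """Map thin-pool data usage to an operator action."""
    if data_percent >= HARD_LIMIT:
        return "shutdown"
    if data_percent >= SOFT_LIMIT:
        return "alarm"
    return "ok"

def pool_data_percent(vg, pool):
    """Read the pool's data usage via lvs (requires LVM tools and root).
    The vg/pool names here are placeholders for your own layout."""
    out = subprocess.check_output(
        ["lvs", "--noheadings", "-o", "data_percent", f"{vg}/{pool}"])
    return float(out.strip())

assert action_for(50.0) == "ok"
assert action_for(92.5) == "alarm"
assert action_for(95.0) == "shutdown"
```

Run from cron or a systemd timer, "shutdown" would trigger something like
`systemctl stop postgresql` before the pool can fill completely, trading an
outage for guaranteed integrity, as described above.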

> But in any event it won't prevent irate users from demanding why the space
> they appear to have isn't actually there.

Users will always be irate. :-) I mostly don't consider that as a real
factor in my technical decisions... :-)

Thanks for entertaining this discussion, Matthew and Zdenek. I realize this
is an open source project, with passionate and smart people, whose time is
precious. I don't feel I have the capability of really contributing code
changes at this time, and I'm satisfied that the ideas are being considered
even if they ultimately don't get adopted. Even the mandatory warning about
snapshots exceeding the volume group size is something I can continue to
deal with using scripting and filtering. I mostly want to make sure that my
perspective is known and understood.

Mark Mielke <mark.mielke at gmail.com>