<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, May 3, 2016 at 8:00 AM, matthew patton <span dir="ltr"><<a href="mailto:pattonme@yahoo.com" target="_blank">pattonme@yahoo.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> written as required. If the file system has particular areas<br> > of importance that need to be writable to prevent file<br> > system failure, perhaps the file system should have a way of<br> </span><span class="">> communicating this to the volume layer. The naive approach<br> > here might be to preallocate these critical blocks before<br> > proceeding with any updates to these blocks, such that the<br> > failure situations can all be "safe" situations,<br> > where ENOSPC can be returned without a danger of the file<br> > system locking up or going read-only.<br> <br> </span>why all of a sudden does each and every FS have to have this added code to second guess the block layer? The quickest solution is to mount the FS in sync mode. Go ahead and pay the performance piper. It's still not likely to be bullet proof but it's a sure step closer.<br></blockquote><div><br></div><div>Not all of a sudden. From an "at work" perspective, LVM thinp as a technology is relatively recent, and it is only now being deployed in more places as we migrate our systems from RHEL 5 to RHEL 6 to RHEL 7. I didn't consider thinp an option before RHEL 7, and I didn't consider it stable even in RHEL 7 without significant testing on our part.</div><div><br></div><div>From an "at home" perspective, I have been using LVM thinp from the day it was available in a Fedora release. The previous snapshot model was unusable, and I wished upon a star that a better technology would arrive. 
I tried BTRFS, and while it did work, it was still marked as experimental, it did not have the exact same behaviour as EXT4 or XFS from an application's perspective, and I did encounter some early issues with subvolumes. Frankly... I was happy to have LVM thinp, and glad that you LVM developers provided it when you did. It is excellent technology from my perspective. But, "at home", I was willing to accept some loose edge case behaviour. I know when I use storage on my server at home, and if it fails, I can accept the consequences for myself.</div><div><br></div><div>"At work", the situation is different. These are critical systems for which I am betting on LVM. We are beginning to use it more broadly, after over a year of success in hosting our JIRA + Confluence instances on local flash using LVM thinp for much of the application data, including PostgreSQL databases. I am very comfortable with it from a "< 80% capacity" perspective. However, every so often it passes 80%, and I have to raise the alarm, because I know that there are edge cases that LVM / DM thinp + XFS don't handle quite so well. It's never happened in production yet, but I've seen it happen many times on designer desktops that use LVM: the system locks up and requires a reboot to recover.</div><div><br></div><div>I know there are smart people working on Linux, and smart people working on LVM. Given the opportunity, and the perspective, I think the worst of these cases are problems that deserve to be addressed, and that people have probably been working on them with or without my contributions to the subject.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> What you're saying is that when mounting a block device the layer needs to expose a "thin-mode" attribute (or the sysdmin sets such a flag via tune2fs). 
Something analogous to mke2fs can "detect" LVM raid mode geometry (does that actually work reliably?).<br> <br> Then there has to be code in every FS block de-stage path:<br> IF thin {<br> tickle block layer to allocate the block (aka write zeros to it? - what about pre-existing data, is there a "fake write" BIO call that does everything but actually write data to a block but would otherwise trigger LVM thin's extent allocation logic?)<br> IF success, destage dirty block to block layer ELSE<br> inform userland of ENOSPC<br> }<br> <br> In a fully journal'd FS (metadata AND data) the journal could be 'pinned' and likewise the main metadata areas if for no other reason they are zero'd at onset and or constantly being written to. Once written to, LVM thin isn't going to go back and yank away an allocated extent.<br></blockquote><div><br></div><div>Yes. This is exactly the type of solution I was thinking of, including pinning the journal! You used the correct terminology. I can read the terms but not write them. :-)</div><div><br></div><div>You also managed to summarize it in only a few lines of text. As concepts go, I think that makes it not-too-complex.</div><div><br></div><div>But the devil is often in the details, and you are right that this is a per-file-system cost.</div><div><br></div><div>Balancing this, however, I am perhaps presuming that *all* systems will eventually be thin volume systems, and that correct and highly available behaviour will eventually require that *all* systems invest in technology such as this. My view of the future is that fixed-size thick partitions are very often a solution which is compromised from the start. Most systems of significance grow over time, and the pressure to reduce cost is real. I think we are taking baby steps to start, but that the systems of the future will be thin volume systems. I see this as a problem that needs to be understood and solved, except in the most limited of use cases. 
This is my opinion, which I don't expect anybody to share.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> This at least should maintain FS integrity albeit you may end up in a situation where the journal can never get properly de-staged, so you're stuck on any further writes and need to force RO.<br></blockquote><div><br></div><div>Interesting to consider. I don't see this as necessarily a problem - or that it necessitates "RO" as a persistent state. For example, it would be most practical if sufficient room were reserved to allow content to be removed, so that the file system could become unwedged and return to "RW". Perhaps there is always an edge case that would necessitate a persistent "RO" state that requires the volume be extended to recover from, but I think the edge case could be refined to something that will tend to never happen?</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""> > just want a sanely behaving LVM + XFS...)<br></span>IMO if the system admin made a conscious decision to use thin AND overprovision (thin by itself is not dangerous), it's up to HIM to actively manage his block layer. Even on million dollar SANs the expectation is that the engineer will do his job and not drop the mic and walk away. Maybe the "easiest" implementation would be a MD layer job that the admin can tailor to fail all allocation requests once extent count drops below a number and thus forcing all FS mounted on the thinpool to go into RO mode.<br></blockquote><div><br></div><div>Another interesting idea. I like the idea of automatically shutting down our applications or PostgreSQL database if the thin pool reaches an unsafe allocation, such as 90% or 95%. This would ensure the integrity of the data, at the expense of an outage. This is something we could implement today. 
Thanks.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> But in any event it won't prevent irate users from demanding why the space they appear to have isn't actually there.</blockquote></div><div><br></div><div>Users will always be irate. :-) I mostly don't consider that as a real factor in my technical decisions... :-)</div><div><br></div><div>Thanks for entertaining this discussion, Matthew and Zdenek. I realize this is an open source project, with passionate and smart people, whose time is precious. I don't feel I have the capability of really contributing code changes at this time, and I'm satisfied that the ideas are being considered even if they ultimately don't get adopted. Even the mandatory warning about snapshots exceeding the volume group size is something I can continue to deal with using scripting and filtering. I mostly want to make sure that my perspective is known and understood.</div><div><br></div><div><br></div>-- <br><div class="gmail_signature">Mark Mielke <<a href="mailto:mark.mielke@gmail.com" target="_blank">mark.mielke@gmail.com</a>><br><br></div> </div></div>