[linux-lvm] thin handling of available space
list at xenhideout.nl
Thu Apr 28 18:20:15 UTC 2016
Let me just write down some thoughts here.
First of all you say that fundamental OS design is about higher layers
trusting lower layers and that certain types of communications should then
always be one way.
In this case it is about block layer vs. file system layer.
But you make certain assumptions about the nature of a block device.
A block device is defined by its access method (i.e. data organized in
blocks) rather than by its contiguousness or by having an unchanging,
"single block" address or access space. I know this goes pretty far.
In theory there is nothing against a hypothetical block device offering
ranges of blocks to a higher level (that might never change) or to be
dynamically notified of changes to that address pool.
To a process, virtual memory is transparent: the process cannot tell
whether that space is backed by paged memory (a swap file) or not. At the
same time it is not impossible to imagine an I/O scheduler for swap that
takes heed of values given by applications, such as nice or ionice
values. That would be one-way communication, though.
In general a higher level should be oblivious to what kind of lower
layer it is running on; you are right about that. Yet if all lower levels
exhibit the same kind of features, the point becomes moot: the higher
level still cannot know precisely what kind of layer it is running on,
although it would have more information.
So just theoretically speaking the only thing that is required to be
consistent is the API or whatever interface you design for it.
I think there are many cases where some software can run on some libraries
but not on others because those other libraries do not offer the full
feature set of whatever standard is being defined there. An example is
DLNA/UPNP, these are not layers but the standard is ill-defined and the
device you are communicating with might not support the full set.
Perhaps these are detrimental issues but there are plenty of cases where
one type of "lower level" will suffice but another won't, think maybe of
graphics drivers. Across the layer boundary, communication is two-way
anyway. The block device *does* supply endless streams of data to the
higher layer. The only thing that would change is that you would no longer
have this "always one contiguous block of blocks" but something that is
slightly more volatile.
When you "mkfs" the tool reads the size of the block device. Perhaps
subsequently the filesystem is unaware and depends on fixed values.
The feature I described (the use case) would allow the set of blocks that
is available to change dynamically. You are right that this would
apparently be a big departure from the current model.
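In fact one direction of size change is already routine: growing a filesystem online after its block device grows. A minimal sketch, assuming a hypothetical volume group vg0 with a logical volume data carrying ext4:

```shell
# Grow the LV by 10 GiB and resize the ext4 filesystem in one step;
# -r tells lvextend to invoke the filesystem resize tool itself.
lvextend -r -L +10G /dev/vg0/data

# Equivalent two-step form: grow the device, then the filesystem.
lvextend -L +10G /dev/vg0/data
resize2fs /dev/vg0/data
```

So "the set of available blocks just got bigger" is a notification filesystems already handle; the departure is letting it shrink or fragment.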
So I'm not saying it is easy, perfect, or well understood. I'm just saying
I like the idea.
I don't know what other applications it might have but it depends entirely
on correct "discard" behaviour from the filesystem.
The filesystem should be unaware of its underlying device, but discard is
never required for rotating disks as far as I can tell; it is an option
that assumes knowledge of the underlying device. From discard we can
basically infer that we are dealing either with a flash device or with
something that has some smartness about which blocks it retains and which
it does not (think cache).
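Whether a device is actually listening for discards can be inspected from userspace; a short sketch using standard util-linux tools (device names and mount points will differ per system):

```shell
# Columns DISC-GRAN and DISC-MAX are zero for devices that ignore
# discards (typical plain rotating disks) and non-zero when something
# below is listening: flash TRIM, SCSI UNMAP, or an LVM thin pool.
lsblk --discard

# One-shot trim of all free space on a mounted filesystem; the batch
# alternative to mounting with "-o discard".
fstrim --verbose /
```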
So in general this is already a change that reflects the changing
conditions and availability of block devices in general, and their
characteristic behaviour or demands from filesystems.
These are block devices that want more information to operate (well).
Coincidentally, discard also favours or enhances (possibly) lvmcache.
So it's not about doing something wildly strange here, it's about offering
a feature set that a filesystem may or may not use, or a block device may
or may not offer.
Contrary to what you say, there is nothing inherently bad about the idea.
The OS design principle violation you speak of is principle, not
practical reality. It's not that it can't be done; it's that you don't
want it to happen because it violates your principles. It's not that it
wouldn't work; it's that you don't like it to work.
At the same time I object to the notion of the system administrator being
this theoretical vastly differing role/person than the user/client.
We have no in-betweens on Linux. For fun you should do a search of your
filesystem with find -xdev based on the contents of /etc/passwd or
/etc/group. You will find that 99% of files are owned by root and the only
ones that aren't are usually user files in the home directory or specific
services in /var/lib.
Here is a script that would do it for groups:
cut -d: -f1 /etc/group | while read -r g; do
  printf '%-15s %6d\n' "$g" "$(find / -xdev -type f -group "$g" 2>/dev/null | wc -l)"
done
Probably. I can't run it here; it might crash my system (live DVD).
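A companion sketch for /etc/passwd, counting file owners rather than groups. It scans only /etc here so it can run quickly as a normal user; pointing SEARCH_ROOT at / and running as root gives the whole-system picture:

```shell
# Count files owned by each user in /etc/passwd under SEARCH_ROOT.
SEARCH_ROOT=/etc
cut -d: -f1 /etc/passwd | while read -r u; do
  n=$(find "$SEARCH_ROOT" -xdev -type f -user "$u" 2>/dev/null | wc -l)
  printf '%-15s %6d\n' "$u" "$n"
done
```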
Of about 170k files on an OpenSUSE system, 15 were group writable, mostly
due to my own interference probably. Of 170197 files (no xdev) 168161 were
owned by root.
Excluding man and my user, 69 files did not have "root" as the group. Part
of that was again due to my own changes.
At the same time in some debates you are presented with the ludicrous
notion that there is some ideal desktop user who doesn't need to ever see
anything of the internal system. She never opens a shell and certainly
does not come across ethernet device names (for example). The "desktop
user" does not care about the renaming of devices from eth0 to the newer
predictable interface names. The desktop user never uses anything other
than DHCP, etc. etc. etc.
The desktop user never can configure anything without the help of the
admin, if it is slightly more advanced.
It's that user vs. admin dichotomy that is never true on any desktop
system and I will venture it is not even true on the systems I am a client
of, because you often need to debate stuff with the vendor or ask for
features, offer solutions, etc.
In a store you are a client. There are employees and clients, nothing
else. At the same time I treat these girls as my neighbours because they
work in the block I live in.
You get the idea. Roles can be shifty. A person can occupy multiple roles
at the same time: he or she can be admin and user simultaneously.
Perhaps you are correct to state that the roles themselves should not be
watered down, that clear delimitations are required.
In your other email you allude to me not ever having done an OS design
course.
Offlist a friendly member suggested strongly I not use personal attacks in
my communications here. But of course this is precisely what you are doing
here, because as a matter of fact I did follow such a course.
I don't remember the book we used because apparently between my house
mate and me we only had one copy, and he ended up keeping it because I
was usually the one borrowing stuff from him.
At the same time university is way beyond my current reach (in living
conditions) so it is just an unwarranted allusion that does not have
anything to do with anything really.
Yes I think it was the dinosaur book:
Operating System Concepts by Silberschatz, Galvin and Gagne
Anyway, irrelevant here.
> Another way (haven't tested) to 'signal' the FS as to the true state of
> the underlying storage is to have a sparse file that gets shrunk over
> time.
You do realize you are trying to find ways around the limitation you just
imposed on yourself right?
> The system admin decided it was a bright idea to use thin pools in the
> first place so he necessarily signed up to be liable for the hazards and
> risks that choice entails. It is not the job of the FS to bail his ass
> out.
I don't think thin pools are that risky or should be that risky. They do
incur a management overhead compared to static filesystems because of
adding that second layer you need to monitor. At the same time the burden
of that can be lessened with tools.
As it stands I consider thin LVM the only reasonable way to snapshot a
running system without dedicating specific space to it in advance. I
could expect snapshotting to require stuff to be in the same volume
group. Without LVM thin, snapshotting requires at least some prior
investment in having a snapshot device ready for you in the same VG.
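The difference in up-front cost shows in the lvcreate invocations; the names here (vg0, root, thin_root) are hypothetical:

```shell
# Classic snapshot: copy-on-write space must be reserved in advance
# (-L), in the same VG, and the snapshot is invalidated if it fills up.
lvcreate --snapshot --size 1G --name root_snap vg0/root

# Thin snapshot: no size argument; it draws on the thin pool's space
# only as the origin and the snapshot diverge.
lvcreate --snapshot --name thin_root_snap vg0/thin_root
```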
Do not think btrfs and ZFS are without costs. You wrote:
> Then you want an integrated block+fs implementation. See BTRFS and ZFS.
> WAFL and friends.
But btrfs is not without complexity. It uses subvolumes that differ from
distribution to distribution as each makes its own choice. It requires
knowledge of more complicated tools and mechanics to do the simplest (or
most meaningful) of tasks. Working with LVM is easier. I'm not saying LVM
is perfect and....
Using snapshotting as a backup measure seems risky to me in the first
place, because it is a "partition table" operation which really you
shouldn't be doing on a consistent basis. So in order to use it
effectively you require tools that handle the safeguards for you: tools
that make sure you are not making some command line mistake, tools that
simply guard against misuse.
Regular users are not fit for being btrfs admins either.
It is going to confuse the hell out of people once they see that this is
what their systems run on and they are introduced to some of its
complexity.
You say swallow your pride. It has not much to do with pride.
It has to do with ending up in a situation I don't like. That is then
going to "hurt" me for the remainder of my days until I switch back or get
rid of it.
I have seen NOTHING NOTHING NOTHING inspiring about btrfs.
Not having partition tables and sending volumes across space and time to
other systems, is not really my cup of tea.
It is a vendor lock-in system and would result in other technologies
being pushed aside.
I am not alone in this opinion either.
Btrfs feels like a form of illness to me. It is living in a forest with
all deformed trees, instead of something lush and inspiring. If you've
ever played World of Warcraft, the only thing that comes a bit close is
the Felwood area ;-).
But I don't consider it beyond Plaguelands either.
I have felt like btrfs in my life. They have not been the happiest moments
of my life ;-).
I will respond more in another mail, this is getting too long.