[linux-lvm] thin handling of available space

Sat Apr 23 17:53:03 UTC 2016

Hi,

So here is my question. I was talking about it with someone, who also 
didn't know.

There seems to be a reason against creating volumes whose combined 
virtual size (V-size) exceeds the total size (L-size) of the thin pool. 
Staying below that total is great if you want spare space to create more 
volumes at will, but having a larger combined V-size is also an 
important use case.

Is there any way that user tools could ever be allowed to know about the 
real effective free space on these volumes?

My thinking goes like this:

- if LVM knows about allocated blocks then it should also be aware of 
blocks that have been freed.
- so it needs to receive some communication from the filesystem
- that means the filesystem really maintains a "claim" on used blocks, 
or at least notifies the underlying layer of its mutations.

- in that case a reverse communication could also exist where the block 
device communicates to the file system about the availability of 
individual blocks (such as might happen with bad sectors) or even the 
total amount of free blocks. That means the disk/volume manager (driver) 
could or would maintain a mapping or table of its own blocks. Something 
that needs to be persistent.

That means the question becomes this:

- is it either possible (theoretically) that LVM communicates to the 
filesystem about the real number of free blocks that could be used by 
the filesystem to make "educated decisions" about the real availability 
of data/space?

- or, is it possible (theoretically) that LVM communicates a "crafted" 
map of available blocks in which a certain (algorithmically determined) 
group of blocks would be considered "unavailable" due to actual real 
space restrictions in the thin pool? This would seem very suboptimal but 
would have the same effect.

Say the filesystem thinks it has 6GB available but really there is 
only 3GB because data is filling up the pool: does it currently get 
notified of this?

What happens if it does fill up?

Funny that we are using GB in this example. I remembered today using 
Stacker on MS-DOS disk where I had 20MB available and was able to 
increase it to 30MB ;-).

Someone else might use terabytes, but anyway.

If the filesystem normally has a fixed size and this size doesn't change 
after creation (without modifying the filesystem) then it is going to 
calculate its free space based on its knowledge of available blocks.

So there are three figures:

- total available space
- real available space
- data taken up by files.

"total - data" is not always the real figure, because there may still be 
open handles on deleted files, etc. The "du" of the visible, countable 
files + blocks still in use + available blocks should be ~ total blocks.

So we are only talking about blocks here, nothing else.

And if LVM can communicate about availability of blocks, a fourth figure 
comes into play:

total = used blocks + unused blocks + unavailable blocks.

If LVM were able to dynamically adjust this last figure, we might have a 
filesystem that truthfully reports actual available space. In a thin 
setting.

I do not even know whether this is not already the case, but I read 
something that indicated an importance of "monitoring available space" 
which would make the whole situation unusable for an ordinary user.

Then you would need GUI applets that said "The space on your thin volume 
is running out (but the filesystem might not report it)".

So the question is:

* is this currently 'provisioned' for?
* is this theoretically possible, if not?

If you take it to a tool such as "df", there are only three figures and 
they add up.

They are:

total = used + available

but we want

total = used + available + unavailable

either that or the total must be dynamically adjusted, but I think 
this is not a good solution.
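
To make the df side concrete: as far as I know, df derives its figures 
from the statvfs() counters of the mounted filesystem, and there is no 
counter in there for "unavailable". A minimal Python sketch of that 
(the name df_figures is just mine for illustration, and this is only 
an approximation of what df does):

```python
import os

def df_figures(path="/"):
    """Return (total, used, available) in 1K blocks for the filesystem at path."""
    st = os.statvfs(path)
    # f_blocks: size of the filesystem, counted in f_frsize units
    total = st.f_blocks * st.f_frsize // 1024
    # f_bfree: free blocks, including those reserved for root
    used = (st.f_blocks - st.f_bfree) * st.f_frsize // 1024
    # f_bavail: free blocks usable by unprivileged users
    avail = st.f_bavail * st.f_frsize // 1024
    return total, used, avail

if __name__ == "__main__":
    print("total=%d used=%d available=%d" % df_figures("/"))
```

On filesystems that reserve blocks for root, used + available can come 
out a little below total, so even today the three numbers only "add 
up" approximately.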

So another question:

*SHOULDN'T THIS simply be a feature of any filesystem?*

The provision of being able to know about the *real* number of blocks in 
case an underlying block device might not be "fixed, stable, and 
unchanging"?

As it is, you can "tell" Linux filesystems with fsck which blocks 
are bad and thus unavailable, probably reducing the number of 
"total" blocks.

From a user interface perspective, perhaps this would be an ideal 
solution, if you needed any solution at all. Personally I would probably 
prefer either the total space to be "hard limited" by the underlying 
(LVM) system, or for df to show a different output, but df output is 
often parsed by scripts.

In the former case, suppose a volume were filling up:

Filesystem     1K-blocks    Used Available Use% Mounted on
udev             1974288       0   1974288   0% /dev
tmpfs             404384   41920    362464  11% /run
/dev/sr2         1485120 1485120         0 100% /cdrom

(Just taking 3 random filesystems)

One filesystem would see its "used" space go up. The other two would see 
their "total" size go down, and the first one would see its total go 
down as well. That would be counterintuitive and you cannot really do 
this.

It's impossible to give this information to the user in a way that the 
numbers still add up.

Supposing:

real size 2000

total used avail
1000  500  500
1000  500  500
1000  500  500

combined virtual size 3000. Total usage 1500. Real free 500. Now the 
first volume uses another 250.

1000  750  250
1000  500  250
1000  500  250

The numbers no longer add up for the 2nd and 3rd system.

You *can* adjust total in a way that it still makes sense (a bit):

1000  750  250
  750  500  250
  750  500  250

You can also just ignore the discrepancy, or add another figure:

total used unav avail
1000  750    0  250
1000  500  250  250
1000  500  250  250

Whatever you do, you would have to simply calculate this adjusted number 
from the real number of available blocks.
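
That calculation is small enough to sketch. Below is a rough Python 
version that derives all three presentations from the pool's real free 
space. The function name and the (virtual size, used) tuples are made 
up for illustration, and it assumes every volume is shown the same 
pool-wide free figure, capped at its own virtual free space.

```python
def report(volumes, real_size):
    """volumes: list of (virtual_size, used) per thin volume.

    Returns the rows for the three presentation styles discussed above.
    """
    real_free = real_size - sum(used for _, used in volumes)
    rows = {"first": [], "second": [], "third": []}
    for size, used in volumes:
        # A volume can never offer more than the pool really has left.
        avail = min(size - used, real_free)
        # Blocks the filesystem believes are free but the pool cannot back.
        unav = size - used - avail
        rows["first"].append((size, used, avail))         # ignore the discrepancy
        rows["second"].append((used + avail, used, avail))  # shrink "total"
        rows["third"].append((size, used, unav, avail))   # add a fourth column
    return rows

if __name__ == "__main__":
    # State after the first volume has used another 250 (real size 2000).
    for style, table in report([(1000, 750), (1000, 500), (1000, 500)], 2000).items():
        print(style, table)
```

For that state the third style comes out as (1000, 750, 0, 250), 
(1000, 500, 250, 250), (1000, 500, 250, 250), which is the table above.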

Now the third volume takes another 100

First style:

1000  750  150
1000  500  150
1000  600  150

Second style:

  900  750  150
  650  500  150
  750  600  150

Third style:

total used unav avail
1000  750  100  150
1000  500  350  150
1000  600  250  150

There's nothing technically inconsistent about it; it is just rather 
difficult to grasp at first glance.

df uses filesystem data, but we are really talking about 
block-layer data now.

You would either need to communicate the number of available blocks (but 
which ones?) and let the filesystem calculate unavailable --- or 
communicate the number of unavailable blocks at which point you just do 
this calculation yourself. For each volume you reach a different number 
of "blocks" you need to withhold.

If you needed to make those blocks unavailable, you would now need to 
pick them (randomly, at the end of the volume, or by any other method) 
and mark them as unavailable to the filesystem layer beneath (or above).

Every write that filled up more blocks would be communicated to you 
(since you receive the write or the allocation) and would result in an 
immediate return of "spurious" mutations or an updated number of 
unavailable blocks, and you can also communicate both.

On every new allocation, the filesystem would be handed back blocks that 
you have "falsely" marked as unavailable. All of this only happens if 
the available real space becomes less than the available space of the 
individual volumes (by virtual size). The virtual "available" minus the 
"real available" is the number of blocks (extents) you are going to 
communicate as being "not there".

At every mutation from the filesystem, you respond with a like mutation: 
not to the filesystem that did the mutation, but to every other 
filesystem on every other volume.

Space being freed (deallocated) then means a reverse communication to 
all those other filesystems/volumes.

But it would work, if this was possible. This is the entire algorithm.
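
To show what I mean, here is a toy simulation in Python. It is only a 
sketch of the idea: the ThinPool class and its methods are invented for 
this mail and have nothing to do with real LVM or device-mapper 
interfaces, and the "notification" is just a returned dict of 
unavailable-block counts per other volume.

```python
class ThinPool:
    """Toy model: real_size blocks shared by over-provisioned volumes."""

    def __init__(self, real_size):
        self.real_size = real_size
        self.volumes = {}  # name -> [virtual_size, used]

    def add_volume(self, name, virtual_size, used=0):
        self.volumes[name] = [virtual_size, used]

    def real_free(self):
        return self.real_size - sum(used for _, used in self.volumes.values())

    def unavailable(self, name):
        # virtual "available" minus "real available", never negative
        size, used = self.volumes[name]
        return max(0, (size - used) - self.real_free())

    def _notify_others(self, name):
        # Every volume except the one that caused the mutation gets an update.
        return {o: self.unavailable(o) for o in self.volumes if o != name}

    def allocate(self, name, blocks):
        size, used = self.volumes[name]
        if blocks > min(size - used, self.real_free()):
            raise OSError("no space left in volume or pool")
        self.volumes[name][1] = used + blocks
        return self._notify_others(name)

    def free(self, name, blocks):
        if blocks > self.volumes[name][1]:
            raise ValueError("cannot free more than is in use")
        self.volumes[name][1] -= blocks
        return self._notify_others(name)

if __name__ == "__main__":
    pool = ThinPool(2000)
    for n in "abc":
        pool.add_volume(n, 1000, used=500)
    print(pool.allocate("a", 250))  # {'b': 250, 'c': 250}
    print(pool.allocate("c", 100))  # {'a': 100, 'b': 350}
```

In this model, once a volume is already constrained by the pool, its 
own unavailable count does not change when it allocates, since its 
virtual free space and the real free space shrink by the same amount. 
That is why only the other volumes need to be told.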

I'm sorry if this sounds like a lot of "talk" and very little "doing" 
and I am annoyed by that as well. Sorry about that. I wish I could 
actually be active with any of these things.

I am reminded of my father. He was in school for being a car mechanic 
but he had a scooter accident days before having to do his exam. They 
did the exam with him in a (hospital) bed. He only needed to give 
directions on what needed to be done and someone else did it for him :p.

That's how he passed his exam. It feels the same way for me.

Regards.