[linux-lvm] Why use thin_pool_autoextend_threshold < 100 ?

Chris Murphy lists at colorremedies.com
Thu Aug 2 17:42:16 UTC 2018


On Tue, Jul 31, 2018 at 8:43 PM, Chris Murphy <lists at colorremedies.com> wrote:
> On Tue, Jul 31, 2018 at 7:33 PM, John Stoffel <john at stoffel.org> wrote:
>>>>>>> "Chris" == Chris Murphy <lists at colorremedies.com> writes:
>>
>> Chris> On Fri, Jul 27, 2018 at 1:31 PM, John Stoffel <john at stoffel.org> wrote:
>>>>
>>>> Why don't you run quotas on your filesystems?  Also, none of the
>>>> filesystems in Linux land that I'm aware of supports shrinking the
>>>> filesystem while live; it's all an unmount, shrink FS, shrink volume
>>>> (carefully!) and then re-mount the filesystem.
>>
>> Chris> Btrfs supports grow and shrink resizes only when mounted. It's
>> Chris> not possible to resize when unmounted.
>>
>> That's... bizarre.  Good to know, but bizarre.  That does make it more
>> appealing to use in day to day situations for sure.  Any thoughts on
>> how stable this is in real life?
>
> I've never heard of it failing in many years of being on the Btrfs
> list. The resize leverages the same block group handling as the balance
> code, so the relocation of block groups during a resize is the same as
> you'd get with a filtered balance; it's integral to the file system's
> operation.
>
> The shrink operation first moves block groups out of the region subject
> to shrink (the part that's going away), and this is an atomic operation
> per block group. You could pull the plug on it in progress (and I have)
> and you'd just get a reversion to the state as of the last file system
> metadata and superblock commit (assuming the hardware isn't lying, and
> some hardware does lie). Once all the block groups are moved, and the
> dev and chunk trees are updated to reflect the new location of those
> chunks (block groups), the superblocks are updated to reflect the new
> device size.
>
> The shrink operation literally changes very little metadata: it's just
> moving block groups, and then the actual "resize" is merely a
> superblock change. The file system metadata doesn't change much
> because Btrfs uses internal logical block addressing to reference
> file extents, and those references stay the same during a resize. The
> mapping from logical block ranges to physical block ranges is a tiny
> update (maybe half a dozen 16K leaf and node writes), and those updates
> are always COW, not overwrites. That's also what makes this an atomic
> operation: if the block group copy fails, the dev and chunk trees that
> are used to translate between logical and physical block ranges never
> get updated.
>
>
> --
> Chris Murphy

Also, an fs resize always happens when doing a device add or device
remove, so resize is integral to Btrfs multiple-device support. Device
add and remove can likewise only be done while the file system is
mounted. Removing a device means migrating block groups off that
device, shrinking the file system by an amount equal to that device's
size, updating the superblocks on the remaining devices, and wiping
the Btrfs signature on the removed device. And there are similar
behaviors when converting block group profiles: e.g. from single to
raid1, single to DUP, DUP to single, raid5 to raid6 or vice versa, and
so on. Conversions are only possible while the file system is mounted.
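
For example (a rough sketch; /dev/sdb, /dev/sdc and /mnt are just
placeholders for whatever devices and mount point you actually have),
all of these run against a mounted file system:

  # add a device, then remove another; each one triggers an online resize
  btrfs device add /dev/sdc /mnt
  btrfs device remove /dev/sdb /mnt

  # convert data and metadata block group profiles to raid1
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt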

LVM pvmove isn't entirely different in concept. The LVM extents are
smaller (4MiB by default) than Btrfs block groups, which vary in size
dynamically but are most typically 1GiB for data, 256MiB for metadata,
and 32MiB for system block groups (Btrfs block groups are collections
of extents). But basically the file system just keeps on reading from
and writing to its usual LBAs, which LVM abstracts and translates into
physical LBAs on a real device. I don't know how atomic pvmove is
without the --atomic flag, or what the chances are of resuming a
pvmove after a crash or an urgent reboot.
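
For reference, the sort of thing I mean (device names are placeholders;
check the pvmove man page before trusting any of this):

  # move all allocated extents off one PV onto another
  pvmove /dev/sdb1 /dev/sdc1

  # --atomic makes the whole move succeed or fail as a unit
  pvmove --atomic /dev/sdb1 /dev/sdc1

  # an interrupted pvmove can be restarted from its last checkpoint, or backed out
  pvmove           # resume any unfinished pvmove
  pvmove --abort   # or abandon it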

The gotcha with ext4 and XFS is that they put file system metadata in
fixed locations on the block device, so on a shrink all of that
metadata, as well as data, has to be relocated to new fixed positions
based on the new block device size. The shrink operation is probably
sufficiently complicated for ext2/3/4 that they just don't want
concurrent read/write operations happening while shrinking. Resizing
also introduces inherent inefficiency in subsequent operation: the
greater the difference between the mkfs-time volume size and the
resized size, the greater the inefficiency. That applies to both ext4
and XFS, whether shrinking or growing; of course XFS doesn't have
shrink at all, since the expectation for its more sophisticated use
cases was that it would only ever be grown.
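
Concretely, an ext4 shrink on LVM is still the offline dance John
described (sketch only; vg/home, the sizes, and the mount point are
made up):

  umount /home
  e2fsck -f /dev/vg/home
  resize2fs /dev/vg/home 200G
  lvreduce -L 200G vg/home
  mount /dev/vg/home /home

whereas growing is online for both: lvextend -L +50G vg/home followed
by resize2fs /dev/vg/home (ext4) or xfs_growfs /home (XFS) on the
mounted file system.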

Whereas Btrfs has no fixed locations for any of its block groups, so
from its perspective a resize just isn't that unusual an operation; it
leverages code that's regularly exercised in normal operation anyway.
It also doesn't suffer from any resize inefficiencies; in fact,
depending on the operation, it might become more efficient.
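
The whole thing is a one-liner on a mounted file system (again just an
example; /mnt is whatever your mount point is):

  btrfs filesystem resize -50g /mnt    # shrink by 50GiB, online
  btrfs filesystem resize max /mnt     # grow back to fill the device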

Anyway, probably a better way of handling shrink with ext4 and XFS is
to put them on LVM thin volumes, and just use fstrim to remove unused
LVM extents from the LV, releasing them back to the pool for use by
any other LV in that pool. It's not exactly the same thing as a shrink
of course, but if the idea is to let a file system use the unused but
"reserved" space of a second file system, merely trimming the second
file system on a thin LV does achieve that. The bigger issue here is
that you can't then shrink the pool, so you can still get stuck in
some circumstances.
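
Something along these lines (a sketch with made-up names and sizes:
vg, pool, data1, data2, /srv/data1):

  # a thin pool plus two thin volumes that overcommit it
  lvcreate --type thin-pool -L 100G -n pool vg
  lvcreate --type thin -n data1 -V 80G --thinpool pool vg
  lvcreate --type thin -n data2 -V 80G --thinpool pool vg
  mkfs.xfs /dev/vg/data1 && mount /dev/vg/data1 /srv/data1

  # after deleting files on data1, hand the extents back to the pool
  fstrim -v /srv/data1
  lvs vg    # the pool's Data% drops, so data2 can now use that space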

-- 
Chris Murphy



