[linux-lvm] lvm commands hanging when run from inside a kubernetes pod

Tue Jun 7 17:42:02 UTC 2022

On Mon, Jun 06, 2022 at 05:01:15PM +0530, Abhishek Agarwal wrote:
> Hey Demi, can you explain how it will help to solve the problem? I'm
> actually not aware of that much low-level stuff but would like to learn
> about it.

By default, a systemd unit runs in the same namespaces as systemd
itself.  Therefore, it runs outside of any container, and has full
access to udev and the host filesystem.  This is what you want when
running lvm2 commands.

> Also, can you provide a few references for it on how I can use it?

The easiest method is the systemd-run command-line tool.  I believe the
following should work, with "$@" replaced by the actual LVM command you
want to run:

    systemd-run --working-directory=/ --pipe --quiet -- lvm "$@"

Be sure to pass --reportformat=json to get machine-readable JSON output.
The default output depends on configuration in /etc/lvm/lvm.conf, so you
don’t want to rely on it.  Alternatively, you can pass no arguments to
lvm and get an interactive shell, but that is a bit more complex to use.
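
As a rough sketch of the invocation above, a small shell wrapper might
look like the following.  The function name host_lvm and the SYSTEMD_RUN
override variable are hypothetical names chosen for illustration; they
are not part of systemd or LVM.  The sketch also assumes that the LVM
subcommand you call accepts --reportformat.

```shell
#!/bin/sh
# host_lvm SUBCOMMAND [ARGS...]
# Run an LVM subcommand on the host via a transient systemd unit.
# SYSTEMD_RUN is a hypothetical override hook (not a systemd feature)
# so the wrapper can be pointed at a stub for a dry run.
host_lvm() {
    if [ "$#" -lt 1 ]; then
        echo "usage: host_lvm SUBCOMMAND [ARGS...]" >&2
        return 2
    fi
    subcommand=$1
    shift
    # --reportformat=json is placed right after the subcommand so the
    # output does not depend on settings in /etc/lvm/lvm.conf.
    "${SYSTEMD_RUN:-systemd-run}" --working-directory=/ --pipe --quiet -- \
        lvm "$subcommand" --reportformat=json "$@"
}
```

For example, "host_lvm lvs -o lv_name,data_percent" would ask the host
for a JSON report of logical volumes, and "SYSTEMD_RUN=echo host_lvm lvs"
prints the arguments instead of running anything.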

To use this method, you will need to bind-mount the host’s system-wide
D-Bus instance into your container.  You will likely need to disable all
forms of security confinement and user namespacing as well.  This means
your container will have full control over the system, but LVM requires
full control over the system in order to function, so that does not
impact security much.  Your container can expose an API that imposes
whatever restrictions it desires.
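
A quick preflight check from inside the container can confirm that the
bind mount is in place before calling systemd-run.  This is only a
sketch: the function name have_system_bus is hypothetical, and the
default path /run/dbus/system_bus_socket is an assumption about where
the host's system bus socket ends up inside your container.

```shell
#!/bin/sh
# have_system_bus [PATH]
# Succeed if PATH exists and is a Unix socket.  The default path is an
# assumption; adjust it to wherever the host socket is bind-mounted.
have_system_bus() {
    [ -S "${1:-/run/dbus/system_bus_socket}" ]
}
```

A caller might use it as
"have_system_bus || echo 'host D-Bus socket is not mounted' >&2".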

Instead of systemd-run, you can use the D-Bus API exposed by PID 1
directly, but that requires slightly more work than just calling a
command-line tool.  I have never used D-Bus from Go so I cannot comment
on how easy this is.

There are some other caveats with LVM.  I am not sure if these matter
for your use-case, but I thought you might want to be aware of them:

- LVM commands are slow (0.2 to 0.4 seconds or so) and serialized with a
  per-volume-group lock.  According to prior mailing list discussion,
  performance of individual commands is not a high priority for LVM
  upstream.  The actual time that I/O is suspended is much shorter.

- If LVM gets SIGKILLd or OOM-killed, your system may be left in an
  inconsistent state that requires a reboot to fix.  The latter can be
  prevented by setting OOMScoreAdjust to -1000.

- If you use thin provisioning (via thin pools and/or VDO), be sure to
  have monitoring so you can prevent out-of-space conditions.  Running
  out of space will likely result in all volumes going offline, and
  recovery may require growing the pool.

- Thin pools are backed by the dm-thin device-mapper target, which is
  optimized for overwriting already allocated blocks.  Writing to shared
  blocks, and possibly allocating new blocks, appears to trigger a slow
  path in dm-thin.  Discards are only supported at the granularity of
  the thin pool's block size, which is typically greater than the block
  size of a filesystem.

- Deleting a thin volume does not pass down discards to the underlying
  block device, even if LVM is configured to discard deleted logical
  volumes.  You need to use blkdiscard before deleting the volume, but
  this can hang the entire pool unless you use the --step option to
  limit the amount of data discarded at once.

- If you are going to be exposing individual thinly-provisioned block
  devices to untrusted code (such as virtual machine guests), you need
  to prevent udev from scanning the thin volumes and keep zeroing of
  newly provisioned blocks enabled.  The latter is synchronous and slow.

- Shrinking thin or VDO pools is not supported.

- Old-style (not thin) snapshots are slow, and only intended for
  short-lived snapshots for backup purposes.
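
Two of the caveats above (OOM kills, and discarding before deleting a
thin volume) can be sketched in shell roughly as follows.  The names
host_run and discard_and_remove, the SYSTEMD_RUN override variable, and
the 1GiB default step are all assumptions made for illustration, and
the /dev/VG/LV path assumes the volume is active on the host.

```shell
#!/bin/sh
# host_run COMMAND [ARGS...]
# Run a command on the host in a transient unit that the OOM killer
# should not select.  SYSTEMD_RUN is a hypothetical override hook.
host_run() {
    "${SYSTEMD_RUN:-systemd-run}" --working-directory=/ --pipe --quiet \
        --property=OOMScoreAdjust=-1000 -- "$@"
}

# discard_and_remove VG/LV [STEP]
# Discard a thin volume in bounded steps, then remove it.  The removal
# only runs if the discard succeeded.
discard_and_remove() {
    if [ -z "${1:-}" ]; then
        echo "usage: discard_and_remove VG/LV [STEP]" >&2
        return 2
    fi
    lv=$1
    step=${2:-1GiB}   # assumed default; tune it for your pool
    host_run blkdiscard --step "$step" "/dev/$lv" &&
        host_run lvm lvremove --yes "$lv"
}
```

For example, "discard_and_remove vg0/thin1 512MiB" would discard the
volume 512 MiB at a time and then remove it.  A suitable step size
depends on the pool and its workload, so treat the default as a
placeholder.
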
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab