[libvirt] [RFC v2 0/4] LXC with block device and enabled userns

Wed Jun 13 10:46:37 UTC 2018

On Sun, Jun 10, 2018 at 12:14:22PM +0100, Radostin Stoyanov wrote:
> Hi all,
> 
> This patch series aims to resolve
> https://bugzilla.redhat.com/show_bug.cgi?id=1328946
> 
> For background information about the issue see v1 of this RFC.
> https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html
> 
> The current state of this series enables the start of LXC container with NBD
> file system and enabled user namespace.
> 
> However, container shutdown causes "kernel BUG at fs/buffer.c:3058!"
> https://pastebin.com/raw/y0ycSM0H
> 
> The reason for this is because qemu-nbd process is terminated/killed without
> unmounting the container root file system.
> 
> This issue has been reported in [1] and [2].
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110
> [2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html

This is not really a kernel bug at the end of the day. We have a filesystem
backed by an NBD block device, and we're killing the NBD block device. So
there's nothing the kernel can really do here if there's outstanding I/O
pending at this time.

There is also this BZ reported against libvirt that has more info:

  https://bugzilla.redhat.com/show_bug.cgi?id=1570902

> As a workaround we could unmount the root file system of container before shutdown.
> 
> For example with:
>     $ CT_PID=$(pidof libvirt_lxc)
>     $ sudo nsenter \
>         --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \
>         /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
> 
> I noticed that we already have the functions lxcContainerUnmountSubtree
> and virProcessRunInMountNamespace.
> 
> Any suggestions on how to properly implement this?

We can't unmount the filesystem directly because we don't have any process
running inside the container's mount namespace at this time. The libvirt_lxc
controller is running in a custom mount namespace that is different from what
the container has.
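
While a container is still running, the namespace difference can be seen by
comparing the mnt namespace links under /proc. This is only a minimal sketch:
same_mnt_ns is a hypothetical helper name, not an existing libvirt function.

```shell
# Hypothetical helper: return 0 if PIDs $1 and $2 share a mount namespace,
# 1 if they do not, 2 if either /proc entry cannot be read.
same_mnt_ns() {
    a=$(readlink "/proc/$1/ns/mnt") || return 2
    b=$(readlink "/proc/$2/ns/mnt") || return 2
    [ "$a" = "$b" ]
}
```

For example, "same_mnt_ns $(pidof libvirt_lxc) $INIT_PID" would be expected to
return 1, where INIT_PID is assumed to hold the host PID of the container's
init process.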

The first thing we need to do is take qemu-nbd out of the cgroups. This will
ensure that it doesn't get killed at the same time as we're killing off all
the container PIDs. It will also fix the OOM deadlocks we see when the memory
controller prevents qemu-nbd from allocating the RAM needed to process I/O.

Then, we can kill all processes in the container as normal. Once they are
all gone, we know the kernel will have cleaned up the mount namespace. We
can thus safely kill qemu-nbd at this point.
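
To make the ordering concrete, here is a rough shell sketch of the sequence.
It is not how libvirt implements it (that would be C code in the LXC driver).
The function names and all arguments are hypothetical, and it assumes a single
cgroup directory exposing cgroup.procs; with several cgroup v1 controllers the
move would have to be repeated per hierarchy.

```shell
# Sketch only: every path and PID is a placeholder supplied by the caller.

# Move PID $1 into the cgroup whose directory is $2.
move_to_cgroup() {
    printf '%s\n' "$1" > "$2/cgroup.procs"
}

# Succeed once cgroup directory $1 lists no processes.
# Give up after $2 polls (default 10), one second apart.
wait_cgroup_empty() {
    tries=${2:-10}
    while [ -n "$(cat "$1/cgroup.procs")" ]; do
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] || return 1
        sleep 1
    done
}

# $1 = qemu-nbd PID, $2 = container cgroup dir, $3 = cgroup dir outside it
stop_container() {
    # 1. Take qemu-nbd out of the container cgroup.
    move_to_cgroup "$1" "$3" || return 1
    # 2. Kill everything that is still in the container cgroup.
    while read -r pid; do
        kill -KILL "$pid" 2>/dev/null
    done < "$2/cgroup.procs"
    # 3. Only once the cgroup is empty is it safe to stop qemu-nbd.
    wait_cgroup_empty "$2" || return 1
    kill -TERM "$1"
}
```

The point of the design is step 3: qemu-nbd is only signalled after the
container cgroup has drained, so the filesystem is no longer in use when its
backing device goes away.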

Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN
is released (i.e. when the filesystem is unmounted). This is something you can
enable for loopback devices, but I'm not sure it works for NBD. This would
be a useful kernel enhancement if someone feels adventurous.
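
For loop devices the behaviour in question is the "autoclear" flag, which the
kernel exposes in sysfs. A small sketch for inspecting it follows. The
loop_autoclear name and the SYSFS_ROOT override are hypothetical and only
there to make the helper easy to try out.

```shell
# Hypothetical helper: print the autoclear flag of loop device $1
# (for example "loop0"). A value of 1 means the device detaches itself
# when its last user goes away.
loop_autoclear() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/loop/autoclear"
}
```

I'm not aware of an equivalent attribute for /dev/nbdNNN, which is the gap
described above.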

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|