[libvirt] [RFC v2 0/4] LXC with block device and enabled userns

Wed Jun 13 14:18:02 UTC 2018

On 13/06/18 11:46, Daniel P. Berrangé wrote:
> On Sun, Jun 10, 2018 at 12:14:22PM +0100, Radostin Stoyanov wrote:
>> Hi all,
>>
>> This patch series aims to resolve
>> https://bugzilla.redhat.com/show_bug.cgi?id=1328946
>>
>> For background information about the issue see v1 of this RFC.
>> https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html
>>
>> In its current state, this series allows an LXC container with an NBD-backed
>> file system and user namespaces enabled to start.
>>
>> However, container shutdown causes "kernel BUG at fs/buffer.c:3058!"
>> https://pastebin.com/raw/y0ycSM0H
>>
>> This happens because the qemu-nbd process is killed before the container
>> root file system has been unmounted.
>>
>> This issue has been reported in [1] and [2].
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110
>> [2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html
> This is not really a kernel bug at the end of the day. We have a filesystem
> backed by an NBD block device, and we're killing the NBD block device. So
> there's nothing the kernel can really do here if there's outstanding I/O
> pending at this time.
>
> There is also this BZ reported against libvirt that has more info:
>
>   https://bugzilla.redhat.com/show_bug.cgi?id=1570902
>
>> As a workaround we could unmount the container's root file system before
>> shutdown.
>>
>> For example with:
>>     $ CT_PID=$(pidof libvirt_lxc)
>>     $ sudo nsenter \
>>         --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \
>>         /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
>>
>> I noticed that we already have the functions lxcContainerUnmountSubtree
>> and virProcessRunInMountNamespace.
>>
>> Any suggestions on how to properly implement this?
> We can't unmount the filesystem directly because we don't have any process
> running inside the container's mount namespace at this time. The libvirt_lxc
> controller is running in a custom mount namespace that is different from what
> the container has.
>
> The first thing we need to do is take qemu-nbd out of the cgroups. This will
> ensure that it doesn't get killed at the same time as we're killing off all
> the container PIDs. It will also fix the OOM deadlocks we see when the memory
> controller prevents qemu-nbd allocating RAM needed to process I/O.
>
> Then, we can kill all processes in the container as normal. Once they are
> all gone, we know the kernel will have cleaned up the mount namespace. We
> can thus safely kill qemu-nbd at this point.
Thank you for the pointers!
> Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN
> was released (i.e. when the filesystem was unmounted). This is something you
> can enable for loopback devices, but I'm not sure it works for NBD. This would
> be a useful kernel enhancement if someone feels adventurous.
It seems like qemu-nbd terminates automatically when the last client
disconnects.

https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-nbd.c;h=51b9d38c72732c821cb4ee5bf362533406ce2494;hb=HEAD#l341

I will send a patch that takes qemu-nbd out of the cgroups and
disconnects qemu-nbd on container shutdown.
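
To make the ordering concrete, here is a rough shell sketch of the shutdown
sequence discussed above. It is not the patch itself: libvirt would do this
in C through its own cgroup helpers. The function names, the DRY_RUN and
CGROUP_ROOT variables, and the cgroup v1 layout (one directory per controller,
each with a cgroup.procs file) are assumptions made for illustration. Only
`qemu-nbd --disconnect` is an existing command.

```shell
#!/bin/sh
# Sketch only: shutdown ordering for an LXC guest whose root is on NBD.
# All function and variable names here are hypothetical, not libvirt APIs.

DRY_RUN=${DRY_RUN:-1}                       # 1 = print commands, do not run them
CGROUP_ROOT=${CGROUP_ROOT:-/sys/fs/cgroup}  # assumes a cgroup v1 hierarchy

# Run a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: move qemu-nbd into the root cgroup of every controller so it is
# neither killed with the container nor limited by the memory controller.
release_from_cgroups() {
    nbd_pid=$1
    for procs in "$CGROUP_ROOT"/*/cgroup.procs; do
        [ -e "$procs" ] || continue
        if [ "$DRY_RUN" = 1 ]; then
            echo "echo $nbd_pid > $procs"
        else
            echo "$nbd_pid" > "$procs"
        fi
    done
}

# Step 2 helper: block until the given PID no longer exists.
wait_for_exit() {
    gone_pid=$1
    while kill -0 "$gone_pid" 2>/dev/null; do
        sleep 1
    done
}

# Full sequence: release qemu-nbd, kill the container, then disconnect NBD.
shutdown_guest() {
    qemu_nbd_pid=$1
    init_pid=$2      # PID of the container's init process, as seen on the host
    nbd_dev=$3       # e.g. /dev/nbd0

    release_from_cgroups "$qemu_nbd_pid"

    # Killing init of a PID namespace makes the kernel kill every other
    # process in that namespace; the mount namespace goes away with them.
    run kill -KILL "$init_pid"
    [ "$DRY_RUN" = 1 ] || wait_for_exit "$init_pid"

    # Step 3: only now is it safe to detach the NBD device.
    run qemu-nbd --disconnect "$nbd_dev"
}
```

With the default DRY_RUN=1, calling `shutdown_guest <qemu-nbd pid> <init pid>
/dev/nbd0` only prints the commands in the order they would run, which is the
point of the sketch: qemu-nbd must outlive every process that can still hold
the container's root file system mounted.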

Radostin