[libvirt] [PATCH] lxc: Cleaning up mount setup

Thu Jan 8 13:36:36 UTC 2015

On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
> We have historically done a number of things with LXC that are
> somewhat questionable in retrospect:
> 
>  1. Mounted /proc/sys read-only, but then mounted
>     /proc/sys/net/ipv* read-write again
>  2. Mounted /sys read-only
>  3. Mounted /sys/fs/cgroup/NNN/the/guest/dir on /sys/fs/cgroup/NNN
>  4. FUSE mount on /proc/meminfo
> 
> Items 1 & 2 are pointless as they offer no security benefit either
> with or without user namespaces. Without userns it is always insecure,
> with userns it is always secure, no matter what the mount state is.

I agree. Thanks a lot for addressing this, Daniel!

> Item 3 is somewhat dubious, since /proc/self/cgroup paths for
> processes are now not visible at /sys/fs/cgroup. This really
> confuses systemd inside the container, making it create a broken
> layout.
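
To illustrate the mismatch described in item 3, here is a minimal sketch of
a check one could run inside a guest. It assumes the usual cgroup v1 layout
where each hierarchy is mounted at /sys/fs/cgroup/<controller-list> and
named hierarchies such as name=systemd are mounted at /sys/fs/cgroup/systemd.
The function names are hypothetical and not part of libvirt or systemd.

```python
import os


def parse_proc_cgroup(text):
    """Parse the content of /proc/<pid>/cgroup into (controllers, path) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line has the form: hierarchy-ID:controller-list:cgroup-path
        _hier_id, controllers, path = line.split(":", 2)
        entries.append((controllers, path))
    return entries


def missing_cgroup_dirs(text, root="/sys/fs/cgroup"):
    """Return the cgroup directories named in `text` that do not exist under `root`."""
    missing = []
    for controllers, path in parse_proc_cgroup(text):
        # Assumption: a named hierarchy "name=foo" is mounted at <root>/foo.
        if controllers.startswith("name="):
            mount_name = controllers[len("name="):]
        else:
            mount_name = controllers
        full = os.path.join(root, mount_name, path.lstrip("/"))
        if not os.path.isdir(full):
            missing.append(full)
    return missing


if __name__ == "__main__":
    with open("/proc/self/cgroup") as f:
        for directory in missing_cgroup_dirs(f.read()):
            print("not visible:", directory)
```

With the bind mount from item 3 in place, I would expect this to report the
paths from /proc/self/cgroup as not visible, since only the guest's own
subtree appears at the top of /sys/fs/cgroup/NNN.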

The question is, how to support systemd in containers?

As of now I'm not aware of a working concept.
With current libvirt it kind of works, but I recently found a very nasty issue.
See: https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html

Maybe it will work with cgroup namespaces, i.e. such that systemd can mount
cgroupfs within the container in a secure way.
The current discussion can be found here: https://lkml.org/lkml/2015/1/7/150

For now I have to drop all my systemd LXC guests and replace them with
a non-systemd distro, which is very sad. :-(

> Item 4 is somewhat dubious, since we're only changing some of the
> fields in /proc/meminfo. It helps apps which blindly parse
> /proc/meminfo to determine free system resources they can consume.
> Those apps are broken even without containers being involved though,
> since any application must expect to be placed inside a cgroup with
> limited resources. Faking /proc/meminfo is a pretty limited workaround
> that just delays the inevitable fixing of such apps.

You mean that tools like free(1) have to be patched to also query
memory limits from cgroupfs?
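
If so, I guess such a tool would have to do roughly the following. This is
only a minimal sketch: it assumes the limit is read from the cgroup v1 file
memory.limit_in_bytes (or memory.max on a unified hierarchy, where "max"
means unlimited) in the process's own cgroup directory, and the helper
names are made up for illustration.

```python
def meminfo_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo content and convert it to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Example line: "MemTotal:       16384256 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found in meminfo")


def effective_memory_limit(meminfo_text, cgroup_limit_text):
    """Return the smaller of the total reported by meminfo and the cgroup limit."""
    host_total = meminfo_total_bytes(meminfo_text)
    raw = cgroup_limit_text.strip()
    # "max" denotes "no limit" in the unified hierarchy; cgroup v1 reports a
    # very large number instead, which min() below handles.
    if raw == "max":
        return host_total
    return min(host_total, int(raw))


if __name__ == "__main__":
    # Assumed path; the real directory depends on /proc/self/cgroup.
    limit_file = "/sys/fs/cgroup/memory/memory.limit_in_bytes"
    with open("/proc/meminfo") as m, open(limit_file) as l:
        print(effective_memory_limit(m.read(), l.read()))
```

Every tool that parses /proc/meminfo would need something like this, which
is a lot of patching.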

> The patch that follows just removes the items 1 & 2, but I'm thinking
> we should go further and remove items 3 & 4 too.
> 
> Changing 4 in particular is certainly classed as a guest ABI
> change, though, so it is not something distros may wish to see when
> upgrading libvirt. There is scope to argue that 1-3 are guest ABI
> changes too.
> 
> In the full machine virt world, we deal with this using machine types.
> E.g. each new KVM version introduces a new machine type which models
> the guest ABI in a stable fashion. Guest machine types are fixed at
> time of first deployment. So when libvirt / KVM is upgraded, existing
> guests will not see any changes, but new guests will automatically
> get the new machine type.
> 
> I'm thinking we might want to make use of this in LXC before making
> these changes. E.g. introduce a new machine 'libvirt-lxc-1' to
> represent the current guest mount setup and make sure all existing
> guests get that machine type. Then introduce a new machine type
> libvirt-lxc-2 that removes all this cruft, which new guests will
> get by default.
> 
> Alternatively we could call them 'libvirt-lxc-compat-1' and
> 'libvirt-lxc-bare-1' to give a clearer indication of their
> functional difference and version them separately in the future?
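
To check my understanding of the machine type proposal, a rough sketch of
the selection logic. Only the names 'libvirt-lxc-1' and 'libvirt-lxc-2'
come from the mail above; the feature labels and functions are hypothetical
and do not correspond to anything in libvirt today.

```python
# Hypothetical labels for the four mount behaviours listed as items 1-4.
MOUNT_PROFILES = {
    # Current guest mount setup, kept for existing guests.
    "libvirt-lxc-1": frozenset(
        {"ro_proc_sys", "ro_sys", "cgroup_bind_mount", "fuse_meminfo"}
    ),
    # Proposed setup with all of the above removed.
    "libvirt-lxc-2": frozenset(),
}

LEGACY_MACHINE = "libvirt-lxc-1"
DEFAULT_MACHINE = "libvirt-lxc-2"


def resolve_machine(configured, is_new_guest):
    """Pick the machine type: explicit config wins, else depends on guest age."""
    if configured is not None:
        if configured not in MOUNT_PROFILES:
            raise ValueError("unknown machine type: %s" % configured)
        return configured
    # Existing guests keep the old ABI; new guests get the new default.
    return DEFAULT_MACHINE if is_new_guest else LEGACY_MACHINE


def mount_features(machine):
    """Return the set of mount behaviours enabled for a machine type."""
    return MOUNT_PROFILES[machine]
```

An existing guest without a machine type would resolve to 'libvirt-lxc-1'
and keep all four behaviours, while a newly defined guest would get none of
them.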

Can we have a new machine type which enforces user namespaces?

> Regards,
> Daniel
> 
> Daniel P. Berrange (1):
>   lxc: Stop mounting /proc and /sys read only
> 
>  src/lxc/lxc_container.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)

Acked-by: Richard Weinberger <richard at nod.at>

Thanks,
//richard