[libvirt] [PATCH] lxc: Cleaning up mount setup

Thu Jan 8 14:02:59 UTC 2015

On 08.01.2015 at 14:45, Daniel P. Berrange wrote:
> On Thu, Jan 08, 2015 at 02:36:36PM +0100, Richard Weinberger wrote:
>> On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
>>> We have historically done a number of things with LXC that are
>>> somewhat questionable in retrospect:
>>>
>>>  1. Mounted /proc/sys read-only, but then mounted
>>>     /proc/sys/net/ipv* read-write again
>>>  2. Mounted /sys read-only
>>>  3. Mounted /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
>>>  4. FUSE mount on /proc/meminfo
>>>
>>> Items 1 & 2 are pointless as they offer no security benefit either
>>> with or without user namespaces. Without userns it is always insecure,
>>> with userns it is always secure, no matter what the mount state is.
>>
>> I agree. Thanks a lot for addressing this, Daniel!
>>
>>> Item 3 is somewhat dubious, since /proc/self/cgroup paths for
>>> processes are now not visible at /sys/fs/cgroup. This really
>>> confuses systemd inside the container, making it create a broken
>>> layout.
>>
>> The question is, how to support systemd in containers?
>>
>> As of now I'm not aware of a working concept.
>> With current libvirt it kind of works but recently I found a very nasty issue:
>> See: https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html
> 
> That reply from Lennart suggests systemd should pretty much work,
> albeit in a hacky way.

What hack do you mean?
*confused*

> I've not done much in anger with systemd in containers, but I have
> found it sufficient for application containers - ie not full OS
> containers with interactive sessions.

My use case is different. Most of the time I need at least an init.
And if the distro is systemd-based...

>>
>> Maybe it works with cgroup namespaces, i.e. such that systemd can mount
>> cgroupfs within the container in a secure way.
>> The current discussion can be found here: https://lkml.org/lkml/2015/1/7/150
>>
>> As of now I have to drop all my systemd lxc guests and will replace them with
>> a non-systemd distro, which is very sad. :-(
>>
>>> Item 4 is somewhat dubious, since we're only changing some of the
>>> fields in /proc/meminfo. It helps apps which blindly parse
>>> /proc/meminfo to determine free system resources they can consume.
>>> Those apps are broken even without containers being involved though,
>>> since any application must expect to be placed inside a cgroup with
>>> limited resources. Faking /proc/meminfo is a pretty limited workaround
>>> that just delays the inevitable fixing of such apps.
>>
>> You mean that tools like free(1) have to be patched to also query
>> memory limits from cgroupfs?
> 
> Not necessarily. The 'free' tool is said to
> 
>    "Display amount of free and used memory in the system"
> 
> so it is arguably correct that it reports /proc/meminfo of the host
> as a whole.
> 
> What is broken are applications that are invoking 'free' and then
> believing that the values it reports correspond to what the
> application is able to use, i.e. the applications are not taking
> into account that they might not be able to use the entire system's
> resources due to cgroups or containers or both.
> 
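
To make the point above concrete, here is a minimal sketch of how an
application could bound its view of available memory by both the host
total and its cgroup limit, instead of trusting /proc/meminfo alone. The
function names are hypothetical, and where the limit file lives (for
example a memory.limit_in_bytes file under the memory controller mount)
is an assumption that depends on how cgroupfs is mounted; the sketch
therefore takes the file contents as plain text.

```python
def parse_meminfo_kib(meminfo_text, field="MemTotal"):
    """Return the value of one /proc/meminfo field, in KiB."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            # Lines look like "MemTotal:       16384 kB"
            return int(rest.split()[0])
    raise KeyError(field)


def effective_memory_bytes(meminfo_text, cgroup_limit_text):
    """Return the smaller of host memory and the cgroup memory limit.

    cgroup_limit_text is the content of the cgroup's memory limit file
    (assumed: a byte count, or the literal "max" for "no limit").
    """
    host_total = parse_meminfo_kib(meminfo_text) * 1024
    raw = cgroup_limit_text.strip()
    if raw == "max":
        return host_total
    # An "unlimited" cgroup typically reports a huge number, so taking
    # the minimum falls back to the host total in that case.
    return min(host_total, int(raw))
```

A caller would read /proc/meminfo and the limit file itself and pass
both strings in; keeping the file access outside the function makes the
logic easy to exercise without a real cgroup.
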
>>> The patch that follows just removes items 1 & 2, but I'm thinking
>>> we should go further and remove items 3 & 4 too.
>>>
>>> Changing 4 in particular is certainly classed as a guest ABI
>>> change though, so it is not something distros may wish to see when
>>> upgrading libvirt. There is scope to argue that 1-3 are guest ABI
>>> changes too.
>>>
>>> In the full machine virt world, we deal with this using machine types,
>>> e.g. each new KVM version introduces a new machine type which models
>>> the guest ABI in a stable fashion. Guest machine types are fixed at
>>> time of first deployment. So when libvirt / KVM is upgraded, existing
>>> guests will not see any changes, but new guests will automatically
>>> get the new machine type.
>>>
>>> I'm thinking we might want to make use of this in LXC before making
>>> these changes, e.g. introduce a new machine 'libvirt-lxc-1' to
>>> represent the current guest mount setup and make sure all existing
>>> guests get that machine type. Then introduce a new machine type
>>> libvirt-lxc-2 that removes all this cruft, which new guests will
>>> get by default.
>>>
>>> Alternatively we could call them 'libvirt-lxc-compat-1' and
>>> 'libvirt-lxc-bare-1' to give a clearer indication of their
>>> functional difference and version them separately in the future?
>>
>> Can we have a new machine type which enforces user namespaces?
> 
> Hmm, I'm not sure that would work. Not least because we need a way to
> assume the UID/GID mapping, and the filesystems used with the container
> need to have the right UID/GID permissions set up. IOW I don't think
> user ns is something we can transparently / automatically turn on.

Yeah, but we have to warn the user that she is doing something insecure
if no mappings are set up.
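
As a rough sketch of such a check, a process can look at the content of
its /proc/self/uid_map: each line holds three numbers (first ID inside
the namespace, first ID outside, range length), and a process in the
initial user namespace sees the single identity entry
"0 0 4294967295". The helper names and the warning text below are
hypothetical, and whether libvirt would perform the check this way
rather than on the container's configuration is not settled here.

```python
INITIAL_NS_RANGE = 4294967295  # range length of the initial namespace's map


def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def is_identity_uid_map(text):
    """True if the map is the identity map of the initial user namespace."""
    return parse_id_map(text) == [(0, 0, INITIAL_NS_RANGE)]


def userns_warning(text):
    """Return a warning string if no real UID mapping is in effect, else None."""
    if is_identity_uid_map(text):
        return ("warning: container runs without a user namespace mapping; "
                "root inside is root on the host")
    return None
```

The text would come from reading /proc/self/uid_map (or the map of the
container's init process); passing it in as a string keeps the sketch
independent of where it runs.
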

Thanks,
//richard