[libvirt] Notes from the KVM Forum relevant to libvirt

Wed Aug 24 14:46:33 UTC 2011

On Wed, Aug 24, 2011 at 03:20:57PM +0100, Stefan Hajnoczi wrote:
> On Tue, Aug 23, 2011 at 4:31 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
> > On Tue, Aug 23, 2011 at 04:24:46PM +0100, Stefan Hajnoczi wrote:
> >> On Tue, Aug 23, 2011 at 12:15 PM, Daniel P. Berrange
> >> <berrange at redhat.com> wrote:
> >> > I was at the KVM Forum / LinuxCon last week and there were many
> >> > interesting things discussed which are relevant to ongoing libvirt
> >> > development. Here was the list that caught my attention. If I have
> >> > missed any, fill in the gaps....
> >> >
> >> >  - Sandbox/container KVM.  The Solaris port of KVM puts QEMU inside
> >> >   a zone so that an exploit of QEMU can't escape into the full OS.
> >> >   Containers are Linux's parallel of Zones, and while not nearly as
> >> >   secure yet, it would still be worth using more containers support
> >> >   to confine QEMU.
> >>
> >> Can you elaborate on why Linux containers are "not nearly as secure"
> >> [as Solaris Zones]?
> >
> > Mostly because the Linux namespace functionality is far from complete,
> > notably lacking proper UID/GID/capability separation, and UID/GID
> > virtualization wrt filesystems. The longer answer is here:
> >
> >   https://wiki.ubuntu.com/UserNamespace
> >
> > So at this time you can't build a secure container on Linux, relying
> > just on DAC alone. You have to add in a MAC layer on top of the container
> > to get full security benefits, which obviously defeats the point of
> > using the container as a backup for failure in the MAC layer.
> 
> Thanks, that is interesting.  I still don't understand why that is a
> problem.  Linux containers (lxc) use a different pid namespace (no
> ptrace worries), file system root (restricted to a subdirectory tree),
> forbids most device nodes, etc.  Why does the user namespace matter
> for security in this case?

A number of reasons really...

If user ID '0' on the host starts a container, and a process inside
the container does 'setuid(500)', then any user outside the container
with UID 500 will be able to kill that process. Only user ID '0' should
have been allowed to do that.
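
To make the first point concrete, here is a rough sketch in Python of the
signal permission rule described in kill(2): without CAP_KILL, the sender's
real or effective UID must equal the target's real or saved set-UID. It is
only a model, not kernel code. The Proc class, its fields and may_signal()
are names made up for illustration, and the "container" field is carried
along purely to show that, without user namespaces, nothing in the check
looks at it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """Hypothetical stand-in for a process's credentials."""
    name: str
    container: str          # informational only; the check never reads it
    ruid: int               # real UID
    euid: int               # effective UID
    suid: int               # saved set-UID
    cap_kill: bool = False  # whether the process holds CAP_KILL


def may_signal(sender: Proc, target: Proc) -> bool:
    """Model of the kill(2) rule when all UIDs live in one global namespace."""
    if sender.cap_kill:
        return True
    target_ids = (target.ruid, target.suid)
    return sender.ruid in target_ids or sender.euid in target_ids


# A process started by host root inside a container, after setuid(500):
# real, effective and saved UIDs are all 500.
confined = Proc("qemu", "container-a", 500, 500, 500)

# Unrelated users on the host.
host_user_500 = Proc("shell", "host", 500, 500, 500)
host_user_501 = Proc("shell", "host", 501, 501, 501)
host_root = Proc("init", "host", 0, 0, 0, cap_kill=True)

print(may_signal(host_user_500, confined))  # UID match across the boundary
print(may_signal(host_user_501, confined))  # different UID
print(may_signal(host_root, confined))      # CAP_KILL
```

The host user with UID 500 passes the check simply because the number
matches, even though it had nothing to do with starting the container.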

It will also let non-root user IDs on the host OS start containers
and have root uid=0 inside the container.

Finally, any files created inside the container with, say, uid 500
will be accessible by any other process with UID 500, in either the
host or any other container.
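
The file ownership point can be sketched the same way. The following is a
deliberately simplified Python model of the classic owner/other permission
check, assuming the file is reachable from each process at all (e.g. via a
shared mount). Group bits, ACLs and capabilities are left out, and
dac_allows() plus the constants are hypothetical names used only for this
illustration.

```python
READ = 4   # permission bit values within one rwx triplet
WRITE = 2


def dac_allows(proc_uid: int, file_uid: int, file_mode: int, want: int) -> bool:
    """Simplified DAC check: owner bits if the UIDs match, else 'other' bits.

    Group permissions, ACLs and root/capability overrides are omitted.
    Note that no container identity is involved in the decision.
    """
    if proc_uid == file_uid:
        granted = (file_mode >> 6) & 0o7
    else:
        granted = file_mode & 0o7
    return (granted & want) == want


# A file created inside container A by uid 500 with mode 0600.
file_uid, file_mode = 500, 0o600

# The same numeric UID on the host or in another container is treated
# as the owner; any other UID gets only the (empty) 'other' bits.
for where, uid in [("container-a", 500), ("host", 500),
                   ("container-b", 500), ("host", 501)]:
    print(where, uid, dac_allows(uid, file_uid, file_mode, READ | WRITE))
```

With a user namespace mapping each container's UIDs onto distinct host
UIDs, the three "uid 500" processes would no longer compare equal.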

> I think it matters when giving multiple containers access to the same
> file system.  Is that what you'd like to do for libvirt?

Each container would have to share a (readonly) view onto the host
filesystem so it can see the QEMU emulator install / libraries. There
would also have to be some writable areas per QEMU container.  QEMU
inside the container would be set to run as some non-root UID (from
the container's POV). So both problems 1 & 3 above would impact the
security of this confinement.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|