[libvirt] Notes from the KVM Forum relevant to libvirt

Thu Aug 25 09:10:27 UTC 2011

On Wed, Aug 24, 2011 at 3:46 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
> On Wed, Aug 24, 2011 at 03:20:57PM +0100, Stefan Hajnoczi wrote:
>> On Tue, Aug 23, 2011 at 4:31 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
>> > On Tue, Aug 23, 2011 at 04:24:46PM +0100, Stefan Hajnoczi wrote:
>> >> On Tue, Aug 23, 2011 at 12:15 PM, Daniel P. Berrange
>> >> <berrange at redhat.com> wrote:
>> >> > I was at the KVM Forum / LinuxCon last week and there were many
>> >> > interesting things discussed which are relevant to ongoing libvirt
>> >> > development. Here was the list that caught my attention. If I have
>> >> > missed any, fill in the gaps....
>> >> >
>> >> >  - Sandbox/container KVM.  The Solaris port of KVM puts QEMU inside
>> >> >   a zone so that an exploit of QEMU can't escape into the full OS.
>> >> >   Containers are Linux's parallel of Zones, and while not nearly as
>> >> >   secure yet, it would still be worth using more container support
>> >> >   to confine QEMU.
>> >>
>> >> Can you elaborate on why Linux containers are "not nearly as secure"
>> >> [as Solaris Zones]?
>> >
>> > Mostly because the Linux namespace functionality is far from complete,
>> > notably lacking proper UID/GID/capability separation, and UID/GID
>> > virtualization wrt filesystems. The longer answer is here:
>> >
>> >   https://wiki.ubuntu.com/UserNamespace
>> >
>> > So at this time you can't build a secure container on Linux relying
>> > on DAC alone. You have to add a MAC layer on top of the container
>> > to get full security benefits, which obviously defeats the point of
>> > using the container as a backup for failure in the MAC layer.
>>
>> Thanks, that is interesting.  I still don't understand why that is a
>> problem.  Linux containers (lxc) use a different pid namespace (no
>> ptrace worries), file system root (restricted to a subdirectory tree),
>> forbids most device nodes, etc.  Why does the user namespace matter
>> for security in this case?
>
> A number of reasons really...
>
> If user ID '0' on the host starts a container, and a process inside
> the container does 'setuid(500)', then any user outside the container
> with UID 500 will be able to kill that process. Only user ID '0' should
> have been allowed to do that.
>
> It will also let non-root user IDs on the host OS start containers
> and have root uid=0 inside the container.
>
> Finally, any files created inside the container with, say, uid 500
> will be accessible by any other process with UID 500, in either the
> host or any other container.

These points mean that the host can peek inside containers and has
access to their processes/files.  But from the point of view of libvirt
running inside a container there is no security problem.

This is kind of like saying that root on the host can modify KVM guest
disk images.  That is true but I don't see it as a security problem
because the root on the host is the trusted part of the system.
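
To make the first point above concrete, here is a rough sketch of the
signal permission rule as I understand it from kill(2), modelled in
Python. It is not kernel code. The Proc type, the container labels and
the process names are made up for illustration, and "euid == 0" stands
in for holding CAP_KILL. The thing to notice is that the container label
is never consulted, because without a user namespace there is only one
global UID space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """Hypothetical model of a process; not a real kernel structure."""
    name: str
    ruid: int       # real UID
    euid: int       # effective UID
    suid: int       # saved set-user-ID
    container: str  # label only; the check below never looks at it


def may_signal(sender: Proc, target: Proc) -> bool:
    """Simplified kill(2) rule with a single, global UID space."""
    if sender.euid == 0:
        # Stands in for CAP_KILL in this model.
        return True
    target_ids = (target.ruid, target.suid)
    # Sender's real or effective UID must match target's real or saved UID.
    return sender.ruid in target_ids or sender.euid in target_ids


# A process that root started in a container and that then called setuid(500).
qemu = Proc("qemu", 500, 500, 500, "container-a")
# An unrelated, unprivileged user on the host who happens to have UID 500.
host_user = Proc("host-user", 500, 500, 500, "host")
# A different unprivileged host user.
other_user = Proc("other-user", 501, 501, 501, "host")

if __name__ == "__main__":
    print(may_signal(host_user, qemu))   # True: numeric UIDs match
    print(may_signal(other_user, qemu))  # False: no UID match
```

In this model the host user with UID 500 is allowed to signal the
process, which is the case Daniel describes. Whether the sender can
actually name the target PID is a separate question that the pid
namespace answers.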

>> I think it matters when giving multiple containers access to the same
>> file system.  Is that what you'd like to do for libvirt?
>
> Each container would have to share a (readonly) view onto the host
> filesystem so it can see the QEMU emulator install / libraries. There
> would also have to be some writable areas per QEMU container.  QEMU
> inside the container would be set to run as some non-root UID (from
> the container's POV). So both problem 1 & 3 above would impact the
> security of this confinement.
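
For reference, my reading of problem 3 is the plain DAC owner check,
sketched below in Python. This is a simplified model, not the kernel's
implementation: it ignores group bits, ACLs and MAC, treats UID 0 as
always allowed, and assumes the file is reachable from the caller's
filesystem view. FileMeta and the example UIDs and mode are hypothetical.

```python
import stat
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for the ownership fields of an inode."""
    owner_uid: int
    mode: int  # permission bits, e.g. 0o600


def may_read(uid: int, f: FileMeta) -> bool:
    """Simplified DAC read check: owner bits, else other bits (no groups)."""
    if uid == 0:
        # Root bypasses the check in this model.
        return True
    if uid == f.owner_uid:
        return bool(f.mode & stat.S_IRUSR)
    return bool(f.mode & stat.S_IROTH)


# A writable file created by QEMU running as UID 500 inside one container.
image = FileMeta(owner_uid=500, mode=0o600)

# Callers are identified only by numeric UID; the container they live in
# plays no part in the decision.
callers = {
    "qemu in container A": 500,
    "qemu in container B": 500,
    "host user": 500,
    "unrelated host user": 501,
}

if __name__ == "__main__":
    for who, uid in callers.items():
        print(who, may_read(uid, image))
```

Every caller with UID 500 passes the check regardless of where it runs,
so the separation between containers would rest entirely on what each
one can see in its filesystem view.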

But is there a way to escape confinement?  If not, then this is secure.

Stefan