[RFC DOCUMENT 09/12] kubevirt-and-kvm: Add Isolation page

Andrea Bolognani abologna at redhat.com
Wed Sep 16 16:54:25 UTC 2020


# Isolation

How is the QEMU process isolated from the host and from other VMs?

## Traditional virtualization


* Managed by libvirt


* libvirt is privileged and QEMU is protected by SELinux policies set
  by libvirt (sVirt)
* QEMU runs with the SELinux type `svirt_t`, as illustrated below
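
With sVirt, each guest gets a dynamically generated SELinux label; in
the domain XML this shows up roughly as follows (a sketch: the MCS
categories `c123,c456` are made-up values, since libvirt generates a
unique pair per guest):

```xml
<!-- Dynamic sVirt labeling: the QEMU process runs as svirt_t and
     its disk images are labeled svirt_image_t, both confined to an
     MCS category pair that no other guest shares -->
<seclabel type='dynamic' model='selinux' relabel='yes'>
  <label>system_u:system_r:svirt_t:s0:c123,c456</label>
  <imagelabel>system_u:object_r:svirt_image_t:s0:c123,c456</imagelabel>
</seclabel>
```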

## KubeVirt


* Managed by kubelet
  * No involvement from libvirt
* Memory limits
  * When using hard limits, the entire VM can be killed by Kubernetes
  * Memory consumption estimates are based on heuristics


* KubeVirt is not using sVirt and there are no plans to do so
* At the moment, the custom [KubeVirt SELinux policy][] is used to
  ensure libvirt has sufficient privilege to perform its own setup
  * The standard SELinux type used by containers is `container_t`
  * KubeVirt would like to eventually use the same for VMs as well


* The default set of capabilities is fairly conservative
  * Privileged operations should happen outside of the pod: in
    KubeVirt's case, a good candidate is `virt-handler`, the
    privileged component that runs at the node level
* Additional capabilities can be requested for a pod
  * However, this is frowned upon and considered a liability from the
    security point of view
  * The cluster admin may even set a security policy that prevents
    pods from using certain capabilities
    * In such a scenario, KubeVirt workloads may be entirely unable
      to run

## Specific examples

The following is a list of examples, either historical or current, of
scenarios where libvirt's approach to isolation clashed with
Kubernetes' and changes to one component or the other were necessary.


* libvirt's use of hugetlbfs for hugepages configuration is disallowed
  by `container_t`
  * Possibly fixable by using memfd; a sketch of such a configuration
    appears after this list
    * [libvirt memoryBacking docs][]
    * [KubeVirt memfd issue][]
* Use of libvirt+QEMU multiqueue tap support is disallowed by
  `container_t`
  * And there's no way to pass in this setup from outside the
    existing stack
  * The [KubeVirt multiqueue workaround][] consists of extending
    their SELinux policy to allow it
* Passing precreated tap devices to libvirt triggers
  relabelfrom+relabelto `tun_socket` SELinux access
  * This may not be the virt stack's fault; it seems to happen
    automatically when permissions aren't correct
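
A minimal sketch of the memfd-based memory backing mentioned above,
following the [libvirt memoryBacking docs][] (the 2 MiB page size is
only an example):

```xml
<!-- Back guest RAM with memfd: hugepages are still used, but no
     hugetlbfs mount point is involved, so no extra file labeling
     is needed inside the container -->
<memoryBacking>
  <source type='memfd'/>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
</memoryBacking>
```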


* libvirt performs memory locking for VFIO devices unconditionally
  * Previously KubeVirt had to grant `CAP_SYS_RESOURCE` to pods.
    KubeVirt worked around it by duplicating libvirt's memory locking
    calculations so that the libvirt action would be a no-op, but that
    is fragile and may cause the issue to resurface if libvirt's
    calculation logic changes; see the `<memtune>` sketch after this
    list
  * References: [libvir-list memlock thread][], [KubeVirt memlock
    PR][], [libvirt qemuDomainGetMemLockLimitBytes][], [KubeVirt
    VMI.getMemlockSize][]
* virtiofsd requires the `CAP_SYS_ADMIN` capability to perform some
  of its operations
  * This is required for certain use cases like running overlayfs in
    the VM on top of virtiofs, but is not a requirement for all of
    them; see the `<filesystem>` sketch after this list
  * References: [KubeVirt virtiofs PR][], [RHEL virtiofs bug][]
* KubeVirt uses libvirt for CPU pinning, which requires the pod to
  have `CAP_SYS_NICE`.
  * Long term, KubeVirt would like to handle that pinning in their
    privileged component virt-handler, so `CAP_SYS_NICE` can be
    dropped from the pod; see the `<cputune>` sketch after this list
    * Sidenote: libvirt unconditionally requires `CAP_SYS_NICE` when
      any other running VM is using CPU pinning; however, this sounds
      like a plain old bug
  * References: [KubeVirt CPU pinning PR][], [KubeVirt CPU pinning
    workaround PR][], [RHEL CPU pinning bug][]
* libvirt bridge usage used to require `CAP_NET_ADMIN`
  * This is a historical example for reference. libvirt usage of a
    bridge device always implied tap device creation, which required
    `CAP_NET_ADMIN` privileges for the pod
  * The fix was to teach libvirt to accept a precreated tap device
    and skip some setup operations on it
    * Example XML: `<interface type='ethernet'><target dev='mytap0'
      managed='no'/></interface>`
  * KubeVirt still hasn't fully managed to drop `CAP_NET_ADMIN`
  * References: [RHEL precreated TAP bug][], [libvirt precreated TAP
    patches][], [KubeVirt precreated TAP PR][], [KubeVirt NET_ADMIN
    PR][], [KubeVirt NET_ADMIN issue][]
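
For the memory locking example, the relevant libvirt knob is the hard
memory limit: when it is set in the domain XML, libvirt's
`qemuDomainGetMemLockLimitBytes` uses it directly instead of its own
heuristics. A sketch (the 8 GiB value is arbitrary):

```xml
<!-- When <hard_limit> is present, libvirt uses it as the memory
     locking limit for the guest rather than computing one from
     its VFIO heuristics -->
<memtune>
  <hard_limit unit='GiB'>8</hard_limit>
</memtune>
```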
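
For the virtiofs example, a minimal sketch made of two fragments of
the same domain XML (the host path and mount tag are made up); note
that virtiofs needs the guest memory to be shared with virtiofsd:

```xml
<!-- virtiofs requires shared memory backing... -->
<memoryBacking>
  <source type='memfd'/>
  <access mode='shared'/>
</memoryBacking>

<!-- ...and exports a host directory to the guest, which mounts it
     using the tag below -->
<filesystem type='mount' accessmode='passthrough'>
  <driver type='virtiofs'/>
  <source dir='/path/on/host'/>
  <target dir='shared_dir'/>
</filesystem>
```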
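
For the CPU pinning example, this is the kind of configuration that
libvirt applies on KubeVirt's behalf and that currently requires
`CAP_SYS_NICE` in the pod (the host CPU numbers are made up):

```xml
<!-- Pin each vCPU to a dedicated host CPU; applying this placement
     is the operation that needs CAP_SYS_NICE -->
<vcpu placement='static'>2</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='4'/>
  <vcpupin vcpu='1' cpuset='5'/>
</cputune>
```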

[KubeVirt CPU pinning PR]: https://github.com/kubevirt/kubevirt/pull/1381
[KubeVirt CPU pinning workaround PR]: https://github.com/kubevirt/kubevirt/pull/1648
[KubeVirt NET_ADMIN PR]: https://github.com/kubevirt/kubevirt/pull/3290
[KubeVirt NET_ADMIN issue]: https://github.com/kubevirt/kubevirt/issues/3085
[KubeVirt SELinux policy]: https://github.com/kubevirt/kubevirt/blob/master/cmd/virt-handler/virt_launcher.cil
[KubeVirt VMI.getMemlockSize]: https://github.com/kubevirt/kubevirt/blob/f5ffba5f84365155c81d0e2cda4aa709da062230/pkg/virt-handler/isolation/isolation.go#L206
[KubeVirt memfd issue]: https://github.com/kubevirt/kubevirt/issues/3781
[KubeVirt memlock PR]: https://github.com/kubevirt/kubevirt/pull/2584
[KubeVirt multiqueue workaround]: https://github.com/kubevirt/kubevirt/pull/2941/commits/bc55cb916003c54f6cbf329112a4e36d0d874836
[KubeVirt precreated TAP PR]: https://github.com/kubevirt/kubevirt/pull/2837
[KubeVirt virtiofs PR]: https://github.com/kubevirt/kubevirt/pull/3493
[RHEL CPU pinning bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1819801
[RHEL precreated TAP bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1723367
[RHEL virtiofs bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1854595
[libvir-list memlock thread]: https://www.redhat.com/archives/libvirt-users/2019-August/msg00046.html
[libvirt memoryBacking docs]: https://libvirt.org/formatdomain.html#elementsMemoryBacking
[libvirt precreated TAP patches]: https://www.redhat.com/archives/libvir-list/2019-August/msg01256.html
[libvirt qemuDomainGetMemLockLimitBytes]: https://gitlab.com/libvirt/libvirt/-/blob/84bb5fd1ab2bce88e508d416f4bcea520c803ea8/src/qemu/qemu_domain.c#L8712
