[libvirt PATCH] docs: add a kbase explaining security protections for QEMU passthrough

Fri Feb 7 15:27:45 UTC 2020

On Thu, Feb 06, 2020 at 01:05:37PM +0000, Daniel P. Berrangé wrote:

The core content reads very well.  A couple of minor nit-picks inline.

[...]

> diff --git a/docs/kbase/qemu-passthrough-security.rst b/docs/kbase/qemu-passthrough-security.rst
> new file mode 100644
> index 0000000000..7fb1f6fbdd
> --- /dev/null
> +++ b/docs/kbase/qemu-passthrough-security.rst
> @@ -0,0 +1,157 @@

[...]

> +XML document additions
> +======================
> +
> +To deal with the problem, libvirt introduced support for command line

Nit: s/command line/command-line/g  (there are a few occurrences)

> +passthrough of QEMU arguments. This is achieved by supporting a custom
> +XML namespace, under which some QEMU driver specific elements are defined.
> +
> +The canonical place to declare the namespace is on the top level ``<domain>``
> +element. At the very end of the document, arbitrary command line arguments
> +can now be added, using the namespace prefix ``qemu:``
> +
> +::

If you can stomach the syntax change, you can put the :: at the end of
the preceding sentence instead of on a line of its own.

> +
> +   <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> +     <name>QEMUGuest1</name>
> +     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
> +     ...
> +     <qemu:commandline>
> +       <qemu:arg value='-newarg'/>
> +       <qemu:arg value='parameter'/>

I'd guess you intentionally took a generic example, rather than a
specific QEMU command-line parameter, to illustrate the XML, in case
the parameter used in the example is later deprecated, etc.

> +       <qemu:env name='ID' value='wibble'/>
> +       <qemu:env name='BAR'/>
> +     </qemu:commandline>
> +   </domain>

Is it worth calling out that the 'env' elements are environment
variables?  It isn't obvious to those who don't dwell on libvirt/QEMU
daily.
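
To make the point concrete, here is a minimal sketch (mine, not part of
the patch) of how a reader might interpret the example: ``<qemu:arg>``
entries as separate argv words and ``<qemu:env>`` entries as environment
variables.  The helper name ``passthrough`` is hypothetical, and keeping
a missing ``value`` as ``None`` is just a choice for this sketch, not a
statement about what libvirt does with it.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the example in the patch.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"

# Trimmed copy of the patch's example (the elided "..." part is dropped).
DOMAIN_XML = """\
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>QEMUGuest1</name>
  <qemu:commandline>
    <qemu:arg value='-newarg'/>
    <qemu:arg value='parameter'/>
    <qemu:env name='ID' value='wibble'/>
    <qemu:env name='BAR'/>
  </qemu:commandline>
</domain>
"""


def passthrough(xml_text):
    """Return (args, env) found under <qemu:commandline> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    ns = {"qemu": QEMU_NS}
    # Each <qemu:arg> is one argv word, so "-newarg parameter" is two entries.
    args = [a.get("value") for a in root.findall("qemu:commandline/qemu:arg", ns)]
    # Each <qemu:env> is one environment variable; value is None if omitted.
    env = {e.get("name"): e.get("value")
           for e in root.findall("qemu:commandline/qemu:env", ns)}
    return args, env


args, env = passthrough(DOMAIN_XML)
print(args)
print(env)
```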

> +Note that when an argument takes a value eg ``-newarg parameter``, the argument
> +and the value must be passed as separate ``<qemu:arg>`` entries.
>
> +
> +Instead of declaring the XML namespace on the top level ``<domain>`` it is also
> +possible to declare it at time of use, which is more convenient for humans
> +writing the XML documents manually. So the following example is functionally
> +identical:
> +
> +::

Here too, you can put the :: at the end of the sentence, saving one
colon :D

> +
> +   <domain type='kvm'>
> +     <name>QEMUGuest1</name>
> +     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
> +     ...
> +     <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
> +       <arg value='-newarg'/>
> +       <arg value='parameter'/>
> +       <env name='ID' value='wibble'/>
> +       <env name='BAR'/>
> +     </commandline>
> +   </domain>
> +
> +Note that when querying the XML from libvirt, it will have been translated into
> +the canonical syntax once more with the namespace on the top level element.

Here you might want to use the rST "note" admonition:

.. note:: When querying the XML from libvirt, it will have been
          translated into the canonical syntax once more, with the
          namespace on the top-level element.

> +
> +Security confinement / sandboxing
> +=================================
> +
> +When libvirt launches a QEMU process it makes use of a number of security
> +technologies to confine QEMU and thus protect the host from malicious VM
> +breakouts.
> +
> +When configuring security protection, however, libvirt generally needs to know
> +exactly which host resources the VM is permitted to access. It gets this
> +information from the domain XML document. This only works for elements in the
> +regular schema, the arguments used with command line passthrough are completely
> +opaque to libvirt.
> +
> +As a result, if command line passthrough is used to expose a file on the host
> +to QEMU, the security protections will activate and either kill QEMU or deny it
> +access.
> +
> +There are two strategies for dealing with this problem, either figure out what
> +steps are needed to grant QEMU access to the device, or disable the security
> +protections.  The former is harder, but more secure, while the latter is simple.
> +
> +Granting access per VM
> +----------------------
> +
> +* SELinux - the file on the host needs an SELinux label that will grant access
> +  to QEMU's ``svirt_t`` policy.
> +
> +  - Read only access - use the ``virt_content_t`` label

Nit: s/Read only/Read-only/

> +  - Shared, write access - use the ``svirt_image_t:s0`` label (ie no MCS
> +    category appended)
> +  - Exclusive, write access - use the ``svirt_image_t:s0:MCS`` label for the VM.
> +    The MCS is auto-generated at boot time, so this may require re-configuring
> +    the VM to have a fixed MCS label
> +
> +* DAC - the file on the host needs to be readable/writable to the ``qemu``

Nit: let's please expand acronyms on first use: "Discretionary Access
Control (DAC)"; although DAC and ACL (below) might be common enough for
"Linux dwellers" that we don't have to be pedantic about it.  But MCS
(Multi-Category Security) is familiar only to those who are
SELinux-aware.

So, your choice: I don't want to make you expand every acronym, only
the obscure ones. :-)

> +  user or ``qemu`` group. This can be done by changing the file ownership to
> +  ``qemu``, or relaxing the permissions to allow world read, or adding file
> +  ACLs to allow access to ``qemu``.
> +
> +* Namespaces - a private ``mount`` namespace is used for QEMU by default
> +  which populates a new ``/dev`` with only the device nodes needed by QEMU.
> +  There is no way to augment the set of device nodes ahead of time.
> +
> +* Seccomp - libvirt launches QEMU with its built-in seccomp policy enabled with
> +  ``obsolete=deny``, ``elevateprivileges=deny``, ``spawn=deny`` and
> +  ``resourcecontrol=deny`` settings active. There is no way to change this
> +  policy on a per VM basis

Missing full stop at the end here ...

> +
> +* Cgroups - a custom cgroup is created per VM and this will either use the
> +  ``devices`` controller or a ``BPF`` rule to whitelist a set of device nodes.
> +  There is no way to change this policy on a per VM basis.
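
On the DAC bullet above: a small illustration of what "readable/writable
to the ``qemu`` user or ``qemu`` group" means might help readers debug
denials.  A minimal sketch, assuming only the classic owner/group/other
permission bits; it deliberately ignores file ACLs, root's override and
the permissions of parent directories.  The helper names and the
UID/GID 107 in the usage below are hypothetical.

```python
import os
import stat


def dac_allows(mode, file_uid, file_gid, uid, gids, want_write=False):
    """Classic Unix permission check for read (and optionally write) access.

    Exactly one class of bits applies: owner if the UID matches, else
    group if any of the process GIDs matches, else "other".
    """
    if uid == file_uid:
        read_bit, write_bit = stat.S_IRUSR, stat.S_IWUSR
    elif file_gid in gids:
        read_bit, write_bit = stat.S_IRGRP, stat.S_IWGRP
    else:
        read_bit, write_bit = stat.S_IROTH, stat.S_IWOTH
    needed = read_bit | (write_bit if want_write else 0)
    return (mode & needed) == needed


def dac_allows_path(path, uid, gids, want_write=False):
    """Same check, reading the mode and ownership from a file on disk."""
    st = os.stat(path)
    return dac_allows(st.st_mode, st.st_uid, st.st_gid, uid, gids, want_write)


# A file owned by 107:107 with mode 0640, accessed by UID 1000 in group 107:
# the group bits apply, so reading is allowed but writing is not.
print(dac_allows(0o640, 107, 107, uid=1000, gids={107}))
print(dac_allows(0o640, 107, 107, uid=1000, gids={107}, want_write=True))
```
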
> +
> +Disabling security protection per VM
> +------------------------------------
> +
> +Some of the security protections can be disabled per-VM:
> +
> +* SELinux - in the domain XML the ``<seclabel>`` model can be changed to
> +  ``none`` instead of ``selinux``, which will make the VM run unconfined.
> +
> +* DAC - in the domain XML an ``<seclabel>`` element with the ``dac`` model can
> +  be added, configured with a user / group account of ``root`` to make QEMU run
> +  with full privileges

... here,

> +* Namespaces - there is no way to disable this per VM
> +
> +* Seccomp - there is no way to disable this per VM
> +
> +* Cgroups - there is no way to disable this per VM
> +
> +Disabling security protection host-wide
> +---------------------------------------
> +
> +As a last resort it is possible to disable security protection host wide which
> +will affect all virtual machines. These settings are all made in
> +``/etc/libvirt/qemu.conf``

... and here.

> +
> +* SELinux - set ``security_default_confined = 0`` to make QEMU run unconfined by
> +  default, while still allowing explicit opt-in to SELinux for VMs.
> +
> +* DAC - set ``user = root`` and ``group = root`` to make QEMU run as the root
> +  account
> +
> +* SELinux, DAC - set ``security_driver = []`` to entirely disable both the
> +  SELinux and DAC security drivers.
> +
> +* Namespaces - set ``namespaces = []`` to disable use of the ``mount``
> +  namespaces, causing QEMU to see the normal fully populated ``dev``
> +
> +* Seccomp - set ``seccomp_sandbox = 0`` to disable use of the Seccomp sandboxing
> +  in QEMU
> +
> +* Cgroups - set ``cgroup_device_acl`` to include the desired device node, or
> +  ``cgroup_controllers = [...]`` to exclude the ``devices`` controller.

I'll let you pick what you want to address, as this doc is an
improvement as-is, FWIW:

Reviewed-by: Kashyap Chamarthy <kchamart at redhat.com>

-- 
/kashyap