[libvirt PATCH] docs: add kbase entry showing KVM real time guest config

Daniel P. Berrangé berrange at redhat.com
Mon Jun 1 11:44:17 UTC 2020


There are many different settings required to configure a KVM guest for
real time, low latency workloads. The documentation included here is
based on guidance developed & tested by the Red Hat KVM real time team.

Signed-off-by: Daniel P. Berrangé <berrange at redhat.com>
---
 docs/kbase.html.in          |   3 +
 docs/kbase/kvm-realtime.rst | 213 ++++++++++++++++++++++++++++++++++++
 2 files changed, 216 insertions(+)
 create mode 100644 docs/kbase/kvm-realtime.rst

diff --git a/docs/kbase.html.in b/docs/kbase.html.in
index c586e0f676..e663ca525f 100644
--- a/docs/kbase.html.in
+++ b/docs/kbase.html.in
@@ -36,6 +36,9 @@
 
         <dt><a href="kbase/virtiofs.html">Virtio-FS</a></dt>
         <dd>Share a filesystem between the guest and the host</dd>
+
+        <dt><a href="kbase/kvm-realtime.html">KVM real time</a></dt>
+        <dd>Run real time workloads in guests on a KVM hypervisor</dd>
       </dl>
     </div>
 
diff --git a/docs/kbase/kvm-realtime.rst b/docs/kbase/kvm-realtime.rst
new file mode 100644
index 0000000000..ac6102879b
--- /dev/null
+++ b/docs/kbase/kvm-realtime.rst
@@ -0,0 +1,213 @@
+==========================
+KVM Real Time Guest Config
+==========================
+
+.. contents::
+
+The KVM hypervisor is capable of running real time guest workloads. This page
+describes the key pieces of configuration required in the domain XML to achieve
+the low latency needs of real time workloads.
+
+For the most part, configuration of the host OS is out of scope of this
+documentation. Refer to the operating system vendor's guidance on configuring
+the host OS and hardware for real time. Note in particular that the default
+kernel used by most Linux distros is not suitable for low latency real time and
+must be replaced by a special kernel build.
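+
+As a rough sanity check, real time kernel builds typically advertise
+``PREEMPT_RT`` (or ``PREEMPT RT``) in their version string, and many also
+expose a ``/sys/kernel/realtime`` flag:
+
+::
+
+   # uname -v
+   # cat /sys/kernel/realtime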
+
+
+Host partitioning plan
+======================
+
+Running real time workloads requires carefully partitioning the host OS
+resources, such that the KVM / QEMU processes are strictly separated from any
+other workload running on the host, both userspace processes and kernel threads.
+
+As such, a subset of the host CPUs needs to be reserved exclusively for running
+KVM guests. This requires that the host kernel be booted using the ``isolcpus``
+kernel command line parameter. This parameter removes a set of CPUs from the
+scheduler, such that no kernel threads or userspace processes will ever get
+placed on those CPUs automatically. KVM guests are then manually placed onto
+these CPUs.
+
+Deciding which host CPUs to reserve for real time requires an understanding of
+the guest workload needs, balanced against the needs of the host OS. The trade
+off will also vary based on the physical hardware available.
+
+For the sake of illustration, this guide will assume a physical machine with two
+NUMA nodes, each with 2 sockets and 4 cores, giving a total of 16 CPUs on the
+host. Furthermore, it is assumed that hyperthreading is either not supported or
+has been disabled in the BIOS, since it is incompatible with real time. Each
+NUMA node is assumed to have 32 GB of RAM, giving 64 GB total for the host.
+
+It is assumed that 2 CPUs in each NUMA node are reserved for the host OS, with
+the remaining 6 CPUs available for KVM real time. With this in mind, the host
+kernel should be booted with ``isolcpus=2-7,10-15`` to reserve these CPUs.
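+
+On distros that provide the ``grubby`` tool, for example, the parameter can be
+added to the boot configuration as sketched below. The exact mechanism for
+editing the kernel command line varies by distro, so treat this as illustrative:
+
+::
+
+   # grubby --update-kernel=ALL --args="isolcpus=2-7,10-15"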
+
+To maximise efficiency of page table lookups for the guest, the host needs to be
+configured with most RAM exposed as huge pages, ideally 1 GB sized. 6 GB of RAM
+in each NUMA node will be reserved for general host OS usage as normal sized
+pages, leaving 26 GB for KVM usage as huge pages.
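+
+As an illustration, 1 GB pages are normally reserved via the kernel command line
+at boot time, since large contiguous allocations are hard to satisfy once memory
+has become fragmented. On kernels that support it, the reservation can instead
+be requested per NUMA node at runtime through sysfs. Both approaches are
+sketched below with page counts matching the hypothetical 26 GB per node plan:
+
+::
+
+   # Boot time reservation, spread across NUMA nodes by the kernel:
+   #     default_hugepagesz=1G hugepagesz=1G hugepages=52
+
+   # Runtime reservation per NUMA node, where supported:
+   echo 26 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
+   echo 26 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages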
+
+Once huge pages are reserved on the hypothetical machine, the ``virsh
+capabilities`` command output is expected to look approximately like:
+
+::
+
+   <topology>
+     <cells num='2'>
+       <cell id='0'>
+         <memory unit='KiB'>33554432</memory>
+         <pages unit='KiB' size='4'>1572864</pages>
+         <pages unit='KiB' size='2048'>0</pages>
+         <pages unit='KiB' size='1048576'>26</pages>
+         <distances>
+           <sibling id='0' value='10'/>
+           <sibling id='1' value='21'/>
+         </distances>
+         <cpus num='8'>
+           <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
+           <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
+           <cpu id='2' socket_id='0' core_id='2' siblings='2'/>
+           <cpu id='3' socket_id='0' core_id='3' siblings='3'/>
+           <cpu id='4' socket_id='1' core_id='0' siblings='4'/>
+           <cpu id='5' socket_id='1' core_id='1' siblings='5'/>
+           <cpu id='6' socket_id='1' core_id='2' siblings='6'/>
+           <cpu id='7' socket_id='1' core_id='3' siblings='7'/>
+         </cpus>
+       </cell>
+       <cell id='1'>
+         <memory unit='KiB'>33554432</memory>
+         <pages unit='KiB' size='4'>1572864</pages>
+         <pages unit='KiB' size='2048'>0</pages>
+         <pages unit='KiB' size='1048576'>26</pages>
+         <distances>
+           <sibling id='0' value='21'/>
+           <sibling id='1' value='10'/>
+         </distances>
+         <cpus num='8'>
+           <cpu id='8' socket_id='0' core_id='0' siblings='8'/>
+           <cpu id='9' socket_id='0' core_id='1' siblings='9'/>
+           <cpu id='10' socket_id='0' core_id='2' siblings='10'/>
+           <cpu id='11' socket_id='0' core_id='3' siblings='11'/>
+           <cpu id='12' socket_id='1' core_id='0' siblings='12'/>
+           <cpu id='13' socket_id='1' core_id='1' siblings='13'/>
+           <cpu id='14' socket_id='1' core_id='2' siblings='14'/>
+           <cpu id='15' socket_id='1' core_id='3' siblings='15'/>
+         </cpus>
+       </cell>
+     </cells>
+   </topology>
+
+Be aware that CPU ID numbers are not always allocated sequentially as shown
+here. It is not unusual to see IDs interleaved between sockets on the two NUMA
+nodes, such that ``0-3,8-11`` are on the first node and ``4-7,12-15`` are on
+the second node. Carefully check the ``virsh capabilities`` output to determine
+the CPU ID numbers when configuring both ``isolcpus`` and the guest ``cpuset``
+values.
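+
+The mapping of CPU IDs to NUMA nodes, sockets and cores can also be
+cross-checked with ``lscpu``, which prints one line per CPU:
+
+::
+
+   # lscpu -p=cpu,node,socket,core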
+
+Guest configuration
+===================
+
+What follows is an overview of the key parts of the domain XML that need to be
+configured to achieve low latency for real time workloads. The following example
+will assume a 4 CPU guest, requiring 16 GB of RAM. It is intended to be placed
+on the second host NUMA node.
+
+CPU configuration
+-----------------
+
+Real time KVM guests intended to run Linux should have a minimum of 2 CPUs.
+One vCPU is for running non-real time processes and performing I/O. The other
+vCPUs will run real time applications. Some non-Linux OSes may not require a
+special non-real time CPU to be available, in which case the 2 CPU minimum would
+not apply.
+
+Each guest CPU, even the non-real time one, needs to be pinned to a dedicated
+host core that is in the ``isolcpus`` reserved set. The QEMU emulator threads
+also need to be pinned to host CPUs that are not in the ``isolcpus`` reserved set.
+The vCPUs need to be given a real time CPU scheduler policy.
+
+When configuring the `guest CPU count <../formatdomain.html#elementsCPUAllocation>`_,
+do not include any CPU affinity at this stage:
+
+::
+
+   <vcpu placement='static'>4</vcpu>
+
+The guest CPUs now need to be placed individually. In this case, they will all
+be put within the same host socket, such that they can be exposed as core
+siblings. This is achieved using the `CPU tuning config <../formatdomain.html#elementsCPUTuning>`_:
+
+::
+
+   <cputune>
+     <emulatorpin cpuset="8-9"/>
+     <vcpupin vcpu="0" cpuset="12"/>
+     <vcpupin vcpu="1" cpuset="13"/>
+     <vcpupin vcpu="2" cpuset="14"/>
+     <vcpupin vcpu="3" cpuset="15"/>
+     <vcpusched vcpus='0-3' scheduler='fifo' priority='1'/>
+   </cputune>
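+
+Once the guest is running, the effective placement can be double checked from
+the host with ``virsh``. The domain name ``rt-guest`` below is just a
+placeholder for the real guest name:
+
+::
+
+   # virsh vcpupin rt-guest
+   # virsh emulatorpin rt-guest
+   # virsh vcpuinfo rt-guest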
+
+The `guest CPU model <../formatdomain.html#elementsCPU>`_ now needs to be
+configured to pass through the host model unchanged, with topology matching the
+placement:
+
+::
+
+   <cpu mode='host-passthrough'>
+     <topology sockets='1' dies='1' cores='4' threads='1'/>
+     <feature policy='require' name='tsc-deadline'/>
+   </cpu>
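+
+The ``tsc-deadline`` feature requirement assumes the host CPU provides the TSC
+deadline timer. On x86 hosts this can be confirmed from the CPU flags, for
+example:
+
+::
+
+   # grep -c tsc_deadline_timer /proc/cpuinfo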
+
+The performance monitoring unit virtualization needs to be disabled
+via the `hypervisor features <../formatdomain.html#elementsFeatures>`_:
+
+::
+
+   <features>
+     ...
+     <pmu state='off'/>
+   </features>
+
+
+Memory configuration
+--------------------
+
+The host memory used for guest RAM needs to be allocated from huge pages on the
+second NUMA node, and all other memory allocation needs to be locked into RAM
+with memory page sharing disabled.
+This is achieved by using the `memory backing config <../formatdomain.html#elementsMemoryBacking>`_:
+
+::
+
+   <memoryBacking>
+     <hugepages>
+       <page size="1" unit="G" nodeset="1"/>
+     </hugepages>
+     <locked/>
+     <nosharepages/>
+   </memoryBacking>
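+
+After the guest has been started, it is worth confirming on the host that the
+guest RAM really was taken from the reserved 1 GB pages on the intended NUMA
+node, for example by watching the free page counter:
+
+::
+
+   # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages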
+
+
+Device configuration
+--------------------
+
+Libvirt adds a few devices by default to maintain historical QEMU configuration
+behaviour. It is unlikely these devices are required by real time guests, so it
+is wise to disable them. Remove all USB controllers that may exist in the XML
+config and replace them with:
+
+::
+
+   <controller type="usb" model="none"/>
+
+Similarly, the memory balloon config should be changed to:
+
+::
+
+   <memballoon model="none"/>
+
+If the guest had a graphical console at installation time, this can also be
+disabled, with remote access instead provided over SSH and a minimal serial
+console kept for emergencies.
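+
+As a sketch, the graphics and video devices can simply be removed from the XML,
+while a minimal serial console might look like:
+
+::
+
+   <serial type='pty'>
+     <target port='0'/>
+   </serial>
+   <console type='pty'>
+     <target type='serial' port='0'/>
+   </console>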
-- 
2.26.2



