[PATCH 5/5] qemu: Prefer -numa cpu over -numa node,cpus=

Igor Mammedov imammedo at redhat.com
Wed Oct 6 13:32:54 UTC 2021


On Thu, 30 Sep 2021 14:08:34 +0200
Peter Krempa <pkrempa at redhat.com> wrote:

> On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
> > QEMU is trying to obsolete -numa node,cpus= because it uses an
> > ambiguous mapping from vCPU id to [socket, die, core, thread]. The
> > new form is:
> > 
> >   -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
> > 
> > which is repeated for every vCPU and places it at [S, D, C, T]
> > into guest NUMA node N.
> > 
> > While this mapping is magic in general, we can deal with it.
> > Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
> > is given then maxvcpus must be sockets * dies * cores * threads
> > (i.e. there are no 'holes').
> > Secondly, if no topology is given then libvirt itself places each
> > vCPU into a different socket (basically, it fakes topology of:
> > [maxvcpus, 1, 1, 1])
> > Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
> > onto topology, to make sure vCPUs don't start to move around.  
> 
> There's a problem with this premise though and unfortunately we don't
> seem to have qemuxml2argvtest for it.
> 
> On PPC64, in certain situations the CPU can be configured such that
> threads are visible only to VMs. This has substantial impact on how CPUs
> are configured using the modern parameters (until now used only for
> cpu hotplug purposes, and that's the reason vCPU hotplug has such
> complicated incantations when starting the VM).
> 
> In the above situation a CPU with topology of:
>  sockets=1, cores=4, threads=8 (thus 32 cpus)
> 
> will only expose 4 CPU "devices".
> 
>  core-id: 0,  core-id: 8, core-id: 16 and core-id: 24
> 
> yet the guest will correctly see 32 cpus when used as such.
> 
> You can see this in:
> 
> tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
> 
> Also note that the 'props' object does _not_ have any socket-id, and
> management apps are supposed to pass in 'props' as is. (There's a bunch
> of code to do that on hotplug).
> 
> The problem is that you need to query the topology first (unless we want
> to duplicate all of qemu's code that deals with topology state and
> keep up with changes to it) to know how it behaves on the current
> machine.  This historically was not possible. The supposed solution for
> this was the pre-config state, where we'd be able to query and set it up
> via QMP, but I haven't kept up sufficiently with that work, so I
> don't know if it's possible.
> 
> If preconfig is a viable option we IMO should start using it sooner
> rather than later and avoid duplicating qemu's logic here.

Using preconfig is the preferable variant; otherwise libvirt
would end up duplicating topology logic, which differs not only
between targets but also between machine/CPU types.

The closest example of how to use preconfig is the
pc_dynamic_cpu_cfg() test case. It uses query-hotpluggable-cpus
only for verification, but one can issue the command at the
preconfig stage to get the topology for a given -smp/-machine
type combination.
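For reference, a rough sketch of that flow (command names as in current QEMU; the machine type and -smp values are only an example, and the set-numa-node arguments would come from the query result rather than being hardcoded) could look like:

```
# start QEMU paused before machine creation
qemu-system-ppc64 -preconfig -qmp stdio \
    -machine pseries -smp 8,sockets=1,cores=4,threads=2 ...

# then, over QMP, while still in the preconfig state:
{ "execute": "qmp_capabilities" }
{ "execute": "query-hotpluggable-cpus" }   # learn the real topology/props
{ "execute": "set-numa-node",
  "arguments": { "type": "cpu", "node-id": 0, "core-id": 0 } }
{ "execute": "x-exit-preconfig" }          # continue with machine creation
```

The 'props' returned by query-hotpluggable-cpus are what gets fed back into set-numa-node, so no topology arithmetic needs to live in the management application.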

> 
> > Note, migration from old to new cmd line works and therefore
> > doesn't need any special handling.
> > 
> > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1678085
> > Signed-off-by: Michal Privoznik <mprivozn at redhat.com>
> > ---
> >  src/qemu/qemu_command.c                       | 112 +++++++++++++++++-
> >  .../hugepages-nvdimm.x86_64-latest.args       |   4 +-
> >  ...memory-default-hugepage.x86_64-latest.args |  10 +-
> >  .../memfd-memory-numa.x86_64-latest.args      |  10 +-
> >  ...y-hotplug-nvdimm-access.x86_64-latest.args |   4 +-
> >  ...ory-hotplug-nvdimm-align.x86_64-5.2.0.args |   4 +-
> >  ...ry-hotplug-nvdimm-align.x86_64-latest.args |   4 +-
> >  ...ory-hotplug-nvdimm-label.x86_64-5.2.0.args |   4 +-
> >  ...ry-hotplug-nvdimm-label.x86_64-latest.args |   4 +-
> >  ...mory-hotplug-nvdimm-pmem.x86_64-5.2.0.args |   4 +-
> >  ...ory-hotplug-nvdimm-pmem.x86_64-latest.args |   4 +-
> >  ...-hotplug-nvdimm-readonly.x86_64-5.2.0.args |   4 +-
> >  ...hotplug-nvdimm-readonly.x86_64-latest.args |   4 +-
> >  .../memory-hotplug-nvdimm.x86_64-latest.args  |   4 +-
> >  ...mory-hotplug-virtio-pmem.x86_64-5.2.0.args |   4 +-
> >  ...ory-hotplug-virtio-pmem.x86_64-latest.args |   4 +-
> >  .../numatune-hmat.x86_64-latest.args          |  18 ++-
> >  ...emnode-restrictive-mode.x86_64-latest.args |  38 +++++-
> >  .../numatune-memnode.x86_64-5.2.0.args        |  38 +++++-
> >  .../numatune-memnode.x86_64-latest.args       |  38 +++++-
> >  ...vhost-user-fs-fd-memory.x86_64-latest.args |   4 +-
> >  ...vhost-user-fs-hugepages.x86_64-latest.args |   4 +-
> >  ...host-user-gpu-secondary.x86_64-latest.args |   3 +-
> >  .../vhost-user-vga.x86_64-latest.args         |   3 +-
> >  24 files changed, 296 insertions(+), 34 deletions(-)
> > 
> > diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
> > index f04ae1e311..5192bd7630 100644
> > --- a/src/qemu/qemu_command.c
> > +++ b/src/qemu/qemu_command.c  
> 
> [...]
> 
> > @@ -7432,6 +7432,94 @@ qemuBuildNumaCPUs(virBuffer *buf,
> >  }
> >  
> >  
> > +/**
> > + * qemuTranlsatevCPUID:
> > + *
> > + * For the given vCPU @id and vCPU topology (@cpu) compute the
> > + * corresponding @socket, @die, @core and @thread. This assumes a
> > + * linear topology, that is, every [socket, die, core, thread]
> > + * combination maps to a valid vCPU ID and there are no 'holes'.
> > + * This is ensured by qemuValidateDomainDef() if
> > + * QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS is set.  
> 
> As noted above, this assumption does not hold on PPC64. There are indeed
> "holes" in certain cases, and even when filled with cpus you are e.g.
> unable to spread them across multiple NUMA nodes.
> 
> In fact, allowing two sibling threads to be spread across multiple
> NUMA nodes is also a nonsensical configuration, which we unfortunately
> allowed.
> 
> > + * Moreover, if @diesSupported is false (QEMU lacks
> > + * QEMU_CAPS_SMP_DIES) then @die is set to zero and @socket is
> > + * computed without taking the number of dies into account.
> > + *
> > + * The algorithm is shamelessly copied over from QEMU's
> > + * x86_topo_ids_from_idx() and its history (before introducing dies).
> > + */
> > +static void
> > +qemuTranlsatevCPUID(unsigned int id,
> > +                    bool diesSupported,
> > +                    virCPUDef *cpu,
> > +                    unsigned int *socket,
> > +                    unsigned int *die,
> > +                    unsigned int *core,
> > +                    unsigned int *thread)
> > +{
> > +    if (cpu && cpu->sockets) {
> > +        *thread = id % cpu->threads;
> > +        *core = id / cpu->threads % cpu->cores;
> > +        if (diesSupported) {
> > +            *die = id / (cpu->cores * cpu->threads) % cpu->dies;
> > +            *socket = id / (cpu->dies * cpu->cores * cpu->threads);
> > +        } else {
> > +            *die = 0;
> > +            *socket = id / (cpu->cores * cpu->threads) % cpu->sockets;
> > +        }
> > +    } else {
> > +        /* If no topology was provided, then qemuBuildSmpCommandLine()
> > +         * puts all vCPUs into a separate socket. */
> > +        *thread = 0;
> > +        *core = 0;
> > +        *die = 0;
> > +        *socket = id;
> > +    }
> > +}
> > +
> > +
> > +static void
> > +qemuBuildNumaNewCPUs(virCommand *cmd,
> > +                     virCPUDef *cpu,
> > +                     virBitmap *cpumask,
> > +                     size_t nodeid,
> > +                     virQEMUCaps *qemuCaps)
> > +{
> > +    const bool diesSupported = virQEMUCapsGet(qemuCaps, QEMU_CAPS_SMP_DIES);
> > +    ssize_t vcpuid = -1;
> > +
> > +    if (!cpumask)
> > +        return;
> > +
> > +    while ((vcpuid = virBitmapNextSetBit(cpumask, vcpuid)) >= 0) {
> > +        unsigned int socket;
> > +        unsigned int die;
> > +        unsigned int core;
> > +        unsigned int thread;
> > +
> > +        qemuTranlsatevCPUID(vcpuid, diesSupported, cpu,
> > +                            &socket, &die, &core, &thread);
> > +
> > +        virCommandAddArg(cmd, "-numa");
> > +
> > +        /* The simple fact that dies are supported by QEMU doesn't mean we can
> > +         * put them on the command line. QEMU will accept die-id only if -smp
> > +         * dies was set to a value greater than 1. On the other hand, this
> > +         * allows us to generate a shorter command line. */
> > +        if (diesSupported && cpu && cpu->dies > 1) {
> > +            virCommandAddArgFormat(cmd,
> > +                                   "cpu,node-id=%zu,socket-id=%u,die-id=%u,core-id=%u,thread-id=%u",
> > +                                   nodeid, socket, die, core, thread);
> > +        } else {
> > +            virCommandAddArgFormat(cmd,
> > +                                   "cpu,node-id=%zu,socket-id=%u,core-id=%u,thread-id=%u",
> > +                                   nodeid, socket, core, thread);
> > +        }
> > +    }
> > +}
> > +
> > +
> >  static int
> >  qemuBuildNumaCommandLine(virQEMUDriverConfig *cfg,
> >                           virDomainDef *def,  
> 
> [...]
> 
> > @@ -7484,6 +7573,17 @@ qemuBuildNumaCommandLine(virQEMUDriverConfig *cfg,
> >          qemuBuildMemPathStr(def, cmd, priv) < 0)
> >          goto cleanup;
> >  
> > +    /* Use the modern style of specifying vCPU topology only if:
> > +     * -numa cpu is available, introduced at the same time as -numa
> > +     *           dist, hence the slightly misleading capability test, and
> > +     * query-hotpluggable-cpus is available, because then
> > +     *                         qemuValidateDomainDef() ensures that if
> > +     *                         topology is specified it matches the max
> > +     *                         vCPU count and we can take some shortcuts
> > +     *                         in qemuTranlsatevCPUID().
> > +     */
> > +    newCpus = virQEMUCapsGet(qemuCaps, QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS);
> > +
> >      for (i = 0; i < ncells; i++) {
> >          if (virDomainNumaGetNodeCpumask(def->numa, i)) {
> >              masterInitiator = i;  
> 
> [...]
> 
> > diff --git a/tests/qemuxml2argvdata/hugepages-nvdimm.x86_64-latest.args b/tests/qemuxml2argvdata/hugepages-nvdimm.x86_64-latest.args
> > index 9f3c6fa63f..8af4b44758 100644
> > --- a/tests/qemuxml2argvdata/hugepages-nvdimm.x86_64-latest.args
> > +++ b/tests/qemuxml2argvdata/hugepages-nvdimm.x86_64-latest.args  
> 
> [...]
> 
> > diff --git a/tests/qemuxml2argvdata/vhost-user-vga.x86_64-latest.args b/tests/qemuxml2argvdata/vhost-user-vga.x86_64-latest.args
> > index 0bdd98d3b6..8c463496a1 100644
> > --- a/tests/qemuxml2argvdata/vhost-user-vga.x86_64-latest.args
> > +++ b/tests/qemuxml2argvdata/vhost-user-vga.x86_64-latest.args
> > @@ -16,7 +16,8 @@ XDG_CONFIG_HOME=/tmp/lib/domain--1-QEMUGuest1/.config \
> >  -overcommit mem-lock=off \
> >  -smp 1,sockets=1,cores=1,threads=1 \
> >  -object '{"qom-type":"memory-backend-memfd","id":"ram-node0","share":true,"size":224395264}' \
> > --numa node,nodeid=0,cpus=0,memdev=ram-node0 \
> > +-numa node,nodeid=0,memdev=ram-node0 \
> > +-numa cpu,node-id=0,socket-id=0,core-id=0,thread-id=0 \
> >  -uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \
> >  -display none \
> >  -no-user-config \  
> 
> None of the impacted tests have 'threads' set to anything other than 1,
> so we are not getting any 'thread-id' coverage. Please add some before
> this commit. Also, as noted, we'll need some PPC64 tests covering the
> impacted cases.
> 

