[PATCH 5/5] qemu: Prefer -numa cpu over -numa node,cpus=

Igor Mammedov imammedo at redhat.com
Thu Oct 21 07:13:26 UTC 2021


On Wed, 20 Oct 2021 13:07:59 +0200
Michal Prívozník <mprivozn at redhat.com> wrote:

> On 10/6/21 3:32 PM, Igor Mammedov wrote:
> > On Thu, 30 Sep 2021 14:08:34 +0200
> > Peter Krempa <pkrempa at redhat.com> wrote:
> >   
> >> On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:  
> >>> QEMU is trying to obsolete -numa node,cpus= because that option
> >>> relies on an ambiguous vCPU ID to [socket, die, core, thread]
> >>> mapping. The new form is:
> >>>
> >>>   -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
> >>>
> >>> which is repeated for every vCPU and places it at [S, D, C, T]
> >>> into guest NUMA node N.
> >>>
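> >>> For illustration, a sketch with made-up values (not taken from any
> >>> test file): with -smp 4,sockets=2,dies=1,cores=2,threads=1 and two
> >>> guest NUMA nodes, the old
> >>>
> >>>   -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3
> >>>
> >>> would become:
> >>>
> >>>   -numa cpu,node-id=0,socket-id=0,die-id=0,core-id=0,thread-id=0 \
> >>>   -numa cpu,node-id=0,socket-id=0,die-id=0,core-id=1,thread-id=0 \
> >>>   -numa cpu,node-id=1,socket-id=1,die-id=0,core-id=0,thread-id=0 \
> >>>   -numa cpu,node-id=1,socket-id=1,die-id=0,core-id=1,thread-id=0
> >>>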
> >>> While in general this is a magic mapping, we can deal with it.
> >>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if a topology
> >>> is given then maxvcpus must equal sockets * dies * cores * threads
> >>> (i.e. there are no 'holes').
> >>> Secondly, if no topology is given then libvirt itself places each
> >>> vCPU into a different socket (basically, it fakes a topology of
> >>> [maxvcpus, 1, 1, 1]; an example follows below).
> >>> Thirdly, we can copy whatever QEMU does when mapping vCPUs onto the
> >>> topology, to make sure vCPUs don't start to move around.
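> >>>
> >>> As an example of the second point (illustrative values): with no
> >>> topology given and maxvcpus=4, the faked topology places vCPU 2 at
> >>> [socket-id=2, die-id=0, core-id=0, thread-id=0].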
> >>
> >> There's a problem with this premise though, and unfortunately we
> >> don't seem to have a qemuxml2argvtest case for it.
> >>
> >> On PPC64, in certain situations the CPU can be configured such that
> >> threads are visible only to VMs. This has a substantial impact on
> >> how CPUs are configured using the modern parameters (until now used
> >> only for CPU hotplug purposes, which is why vCPU hotplug requires
> >> such complicated incantations when starting the VM).
> >>
> >> In the above situation a CPU with a topology of:
> >>  sockets=1, cores=4, threads=8 (thus 32 CPUs)
> >>
> >> will only expose 4 CPU "devices":
> >>
> >>  core-id: 0, core-id: 8, core-id: 16 and core-id: 24
> >>
> >> yet the guest will correctly see 32 CPUs when used as such.
> >>
> >> You can see this in:
> >>
> >> tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
> >>
> >> Also note that the 'props' object does _not_ have any socket-id, and
> >> management apps are supposed to pass in 'props' as is. (There's a bunch
> >> of code to do that on hotplug).
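> >>
> >> For reference, an entry in that file looks roughly like this (the
> >> exact 'type' value here is from memory, treat it as an assumption):
> >>
> >>  {"props": {"core-id": 8}, "vcpus-count": 8, "type": "host-spapr-cpu-core"}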
> >>
> >> The problem is that you need to query the topology first (unless we
> >> want to duplicate all of QEMU's topology code and keep up with
> >> changes to it) to know how it behaves on the current machine.
> >> Historically this was not possible. The supposed solution for this
> >> was the preconfig state, where we'd be able to query and set it up
> >> via QMP, but I haven't kept up sufficiently with that work, so I
> >> don't know if it's possible.
> >>
> >> If preconfig is a viable option, we IMO should start using it sooner
> >> rather than later and avoid duplicating QEMU's logic here.
> > 
> > Using preconfig is the preferable variant; otherwise libvirt
> > would end up duplicating topology logic which differs not only
> > between targets but also between machine/CPU types.
> > 
> > The closest example of how to use preconfig is the
> > pc_dynamic_cpu_cfg() test case. It uses query-hotpluggable-cpus
> > only for verification, but one can use the command at the preconfig
> > stage to get the topology for a given -smp/-machine type combination.
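> > 
> > A rough sketch of the whole flow (the exact sequence is my
> > assumption, the command names are the current QMP ones):
> > 
> >   qemu-system-x86_64 -S -preconfig -smp ... -qmp stdio ...
> >   {"execute": "qmp_capabilities"}
> >   {"execute": "query-hotpluggable-cpus"}
> >   {"execute": "set-numa-node", "arguments": {"type": "cpu",
> >    "node-id": 0, "socket-id": 0, "die-id": 0, "core-id": 0,
> >    "thread-id": 0}}
> >   {"execute": "x-exit-preconfig"}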
> 
> Alright, -preconfig should be pretty easy. However, I do have some
> points to raise/ask:
> 
> 1) currently, exit-preconfig is marked as experimental (hence its "x-"
> prefix). Before libvirt consumes it, QEMU should make it stable. Is
> there anything that stops QEMU from doing so or is it just a matter of
> sending patches (I volunteer to do that)?

If I recall correctly, it was made experimental due to the lack of
actual users (it was assumed that libvirt would consume it once
available, but that didn't happen for quite a long time).

So patches to make it a stable interface should be fine.

> 
> 2) In my experiments I try to mimic what libvirt does. Here's my cmd
> line:
> 
> qemu-system-x86_64 \
> -S \
> -preconfig \
> -cpu host \
> -smp 120,sockets=2,dies=3,cores=4,threads=5 \
> -object '{"qom-type":"memory-backend-memfd","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind"}' \
> -numa node,nodeid=0,memdev=ram-node0 \
> -no-user-config \
> -nodefaults \
> -no-shutdown \
> -qmp stdio
> 
> and here is my QMP log:
> 
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 6}, "package": "v6.1.0-1552-g362534a643"}, "capabilities": ["oob"]}}
> 
> {"execute":"qmp_capabilities"}
> {"return": {}}
> 
> {"execute":"query-hotpluggable-cpus"}
> {"return": [{"props": {"core-id": 3, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 3, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 2, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 1, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 0, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 2, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
> <snip/>
> {"props": {"core-id": 0, "thread-id": 0, "die-id": 0, "socket-id": 0}, "vcpus-count": 1, "type": "host-x86_64-cpu"}]}
> 
> 
> I can see that query-hotpluggable-cpus returns an array. Can I safely
> assume that vCPU ID == index in the array? I mean, if I had -numa
> node,cpus=X, can I do array[X] to obtain the mapping onto Core/Thread/
> Die/Socket, which would then be fed to the 'set-numa-node' command? If
> not, what is the proper way to do it?

From QEMU's point of view, you shouldn't assume anything about vCPU
ordering within the returned array. It's an internal implementation
detail and subject to change without notice.
What you can assume is that the CPU descriptions in the array will be
stable for a given combination of [machine version, smp option, CPU type].
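
I.e. take each returned array element as-is and feed its 'props'
members, together with the desired node-id, to set-numa-node,
e.g. (illustrative values):

 {"execute": "set-numa-node", "arguments": {"type": "cpu",
  "node-id": 0, "socket-id": 1, "die-id": 2, "core-id": 3,
  "thread-id": 4}}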

> And one more thing - if QEMU has to keep the vCPU ID mapping code,
> what's the point in obsoleting -numa node,cpus=? In the end it is
> still QEMU that does the ID -> [Core,Thread,Die,Socket] translation,
> just with extra steps for mgmt applications.

The point is that cpu_index is ambiguous and it's practically
impossible for the user to tell which vCPU exactly it is dealing
with, unless the user re-implements, and keeps in sync, the topology
code for
 f(board, machine version, smp option, CPU type)

So even if cpu_index is still used inside of QEMU for other purposes,
the external interfaces and API will use only the consistent topology
tuple [Core,Thread,Die,Socket] to describe and address vCPUs, the
same as device_add.
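
For example, vCPU hotplug already addresses a vCPU by that tuple
(the device id here is made up):

 {"execute": "device_add", "arguments": {"driver": "host-x86_64-cpu",
  "id": "vcpu24", "socket-id": 1, "die-id": 0, "core-id": 0,
  "thread-id": 0}}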


> Michal
> 
