[PATCH 5/5] qemu: Prefer -numa cpu over -numa node,cpus=

Michal Prívozník mprivozn at redhat.com
Wed Oct 20 11:07:59 UTC 2021


On 10/6/21 3:32 PM, Igor Mammedov wrote:
> On Thu, 30 Sep 2021 14:08:34 +0200
> Peter Krempa <pkrempa at redhat.com> wrote:
> 
>> On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
>>> QEMU is trying to obsolete -numa node,cpus= because that uses an
>>> ambiguous vCPU id to [socket, die, core, thread] mapping. The new
>>> form is:
>>>
>>>   -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
>>>
>>> which is repeated for every vCPU and places it at [S, D, C, T]
>>> into guest NUMA node N.
>>>
>>> While in general this is a magic mapping, we can deal with it.
>>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
>>> is given then maxvcpus must be sockets * dies * cores * threads
>>> (i.e. there are no 'holes').
>>> Secondly, if no topology is given then libvirt itself places each
>>> vCPU into a different socket (basically, it fakes topology of:
>>> [maxvcpus, 1, 1, 1])
>>> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
>>> onto topology, to make sure vCPUs don't start to move around.  
>>
>> There's a problem with this premise though and unfortunately we don't
>> seem to have qemuxml2argvtest for it.
>>
>> On PPC64, in certain situations the CPU can be configured such that
>> threads are visible only to VMs. This has substantial impact on how CPUs
>> are configured using the modern parameters (until now used only for
>> cpu hotplug purposes, and that's the reason vCPU hotplug has such
>> complicated incantations when starting the VM).
>>
>> In the above situation a CPU with topology of:
>>  sockets=1, cores=4, threads=8 (thus 32 cpus)
>>
>> will only expose 4 CPU "devices".
>>
>>  core-id: 0,  core-id: 8, core-id: 16 and core-id: 24
>>
>> yet the guest will correctly see 32 cpus when used as such.
>>
>> You can see this in:
>>
>> tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
>>
>> Also note that the 'props' object does _not_ have any socket-id, and
>> management apps are supposed to pass in 'props' as is. (There's a bunch
>> of code to do that on hotplug).
>>
>> The problem is that you need to query the topology first (unless we want
>> to duplicate all of the qemu code that has to do with topology state and
>> keep up with changes to it) to know how it behaves on the current
>> machine. This historically was not possible. The supposed solution for
>> this was the pre-config state where we'd be able to query and set it up
>> via QMP, but I was not keeping up sufficiently with that work, so I
>> don't know if it's possible.
>>
>> If preconfig is a viable option we IMO should start using it sooner
>> rather than later and avoid duplicating qemu's logic here.
> 
> using preconfig is the preferable variant; otherwise libvirt
> would end up duplicating topology logic which differs not only
> between targets but also between machine/cpu types.
> 
> The closest example of how to use preconfig is the pc_dynamic_cpu_cfg()
> test case. It uses query-hotpluggable-cpus only for
> verification, but one can use the command at the preconfig
> stage to get the topology for a given -smp/-machine type combination.

Alright, -preconfig should be pretty easy. However, I do have some
points to raise/ask:

1) Currently, exit-preconfig is marked as experimental (hence its "x-"
prefix). Before libvirt consumes it, QEMU should make it stable. Is
there anything that stops QEMU from doing so, or is it just a matter of
sending patches (I volunteer to do that)?
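To make 2) below concrete, the overall flow I have in mind looks roughly
like this (just a sketch; the exact ordering of commands within the
preconfig phase is my assumption, and the set-numa-node arguments are
placeholders):

  qemu-system-x86_64 -S -preconfig -smp ... -numa node,... -qmp stdio ...

  {"execute":"qmp_capabilities"}
  {"execute":"query-hotpluggable-cpus"}
  ... one set-numa-node call per vCPU (see 2) below) ...
  {"execute":"x-exit-preconfig"}
  ... regular startup continues; "cont" once everything else is set up ...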

2) In my experiments I'm trying to mimic what libvirt does. Here's my
command line:

qemu-system-x86_64 \
-S \
-preconfig \
-cpu host \
-smp 120,sockets=2,dies=3,cores=4,threads=5 \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind"}' \
-numa node,nodeid=0,memdev=ram-node0 \
-no-user-config \
-nodefaults \
-no-shutdown \
-qmp stdio

and here is my QMP log:

{"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 6}, "package": "v6.1.0-1552-g362534a643"}, "capabilities": ["oob"]}}

{"execute":"qmp_capabilities"}
{"return": {}}

{"execute":"query-hotpluggable-cpus"}
{"return": [{"props": {"core-id": 3, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 3, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 2, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 1, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 3, "thread-id": 0, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 2, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
<snip/>
{"props": {"core-id": 0, "thread-id": 0, "die-id": 0, "socket-id": 0}, "vcpus-count": 1, "type": "host-x86_64-cpu"}]}


I can see that query-hotpluggable-cpus returns an array. Can I safely
assume that vCPU ID == index in the array? That is, if I had -numa
node,cpus=X, can I do array[X] to obtain the mapping onto Core/Thread/
Die/Socket, which would then be fed to the 'set-numa-node' command? If
not, what is the proper way to do it?
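
To illustrate what I mean (a sketch only, assuming the answer is "yes";
whether array[X] really corresponds to vCPU X is exactly the question):
if the guest had -numa node,nodeid=0,cpus=X and array[X].props happened
to be the first element from the log above, i.e.
{"core-id": 3, "thread-id": 4, "die-id": 2, "socket-id": 1}, libvirt
would pass those props through verbatim and issue, while still at the
preconfig stage:

  {"execute":"set-numa-node","arguments":{"type":"cpu","node-id":0,"socket-id":1,"die-id":2,"core-id":3,"thread-id":4}}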


And one more thing - if QEMU has to keep the vCPU ID mapping code anyway,
what's the point in obsoleting -numa node,cpus=? In the end it is still
QEMU that does the ID -> [Core,Thread,Die,Socket] translation, just with
extra steps for mgmt applications.

Michal



