[PATCH 4/5] qemu: Prefer -numa cpu over -numa node,cpus=

Thu Jun 4 08:58:01 UTC 2020

On 5/27/20 3:58 PM, Igor Mammedov wrote:
> On Tue, 26 May 2020 17:31:09 +0200
> Michal Privoznik <mprivozn at redhat.com> wrote:
> 
>> On 5/26/20 4:51 PM, Igor Mammedov wrote:
>>> On Mon, 25 May 2020 10:05:08 +0200
>>> Michal Privoznik <mprivozn at redhat.com> wrote:
>>>    
>>>>
>>>> This is a problem. The domain XML that is provided can't be changed,
>>>> mostly because mgmt apps construct it on the fly and then just pass it
>>>> as a RO string to libvirt. While libvirt could create a separate cache,
>>>> there has to be a better way.
>>>>
>>>> I mean, I can add some more code that once the guest is running
>>>> preserves the mapping during migration. But that assumes a running QEMU.
>>>> When starting a domain from scratch, is it acceptable if the vCPU
>>>> topology changes? I suspect it is not.
>>> I'm not sure I got you but
>>> vCPU topology isn't changing but when starting QEMU, user has to map
>>> 'concrete vCPUs' to specific numa nodes. The issue here is that
>>> to specify concrete vCPUs user needs to get layout from QEMU first
>>> as it's a function of target/machine/-smp and possibly cpu type.
>>
>> Assume the following config: 4 vCPUs (2 sockets, 2 cores, 1 thread
>> topology) and 2 NUMA nodes and the following assignment to NUMA:
>>
>> node 0: cpus=0-1
>> node 1: cpus=2-3
>>
>> With old libvirt & qemu (and assuming x86_64 - not EPYC), I assume the
>> following topology is going to be used:
>>
>> node 0: socket=0,core=0,thread=0 (vCPU0)  socket=0,core=1,thread=0 (vCPU1)
>> node 1: socket=1,core=0,thread=0 (vCPU2)  socket=1,core=1,thread=0 (vCPU3)
>>
>> Now, user upgrades libvirt & qemu but doesn't change the config. And on
>> a fresh new start (no migration), they might get a different topology:
>>
>> node 0: socket=0,core=0,thread=0 (vCPU0)  socket=1,core=0,thread=0 (vCPU1)
>> node 1: socket=0,core=1,thread=0 (vCPU2)  socket=1,core=1,thread=0 (vCPU3)
> 
> that shouldn't happen at least for as long as machine version stays the same

Shouldn't as in it's bad if it happens or as in QEMU won't change 
topology for released machine types? Well, we are talking about libvirt 
generating the topology.

>> The problem here is not how to assign vCPUs to NUMA nodes, the problem
>> is how to translate vCPU IDs to socket=,core=,thread=.
> if you are talking about libvirt's vCPU IDs, then it's separate issue
> as it's user facing API, I think it should not rely on cpu_index.
> Instead it should map vCPU IDs to ([socket,]core[,thread]) tuple
> or maybe drop notion of vCPU IDs and expose ([socket,]core[,thread])
> to users if they ask for numa aware config.

And this is the thing I am asking. How to map vCPU IDs to 
socket,core,thread and how to do it reliably.

> 
> PS:
> I'm curious how libvirt currently implements numa mapping and
> how it's correlated with pinning to host nodes?
> Does it have any sort of code to calculate topology based on cpu_index
> so it could properly assign vCPUs to nodes or all the pain of
> assigning vCPU IDs to nodes is on the user's shoulders?

It's on users. In the domain XML they specify number of vCPUs, and then 
they can assign individual IDs to NUMA nodes. For instance:

   <vcpu>8</vcpu>

   <cpu>
      <numa>
        <cell id='0' cpus='0-3' memory='2097152' unit='KiB'/>
        <cell id='1' cpus='4-7' memory='2097152' unit='KiB'/>
      </numa>
   </cpu>

translates to:

   -smp 8,sockets=8,cores=1,threads=1
   -numa node,nodeid=0,cpus=0-3,mem=...
   -numa node,nodeid=1,cpus=4-7,mem=...

The sockets=,cores=,threads= is formatted every time, even if no 
topology was specified in the domain XML. If no topology was specified 
then every vCPU is in its own socket and has 1 core and 1 thread.
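
For illustration, the translation above can be sketched in a few lines of Python. This is not libvirt's code: the function name legacy_numa_args and the <domain> wrapper around the snippet are assumptions made for the example, and the mem= value is left elided exactly as in the command line above.

```python
import xml.etree.ElementTree as ET

# The XML snippet from this mail, wrapped in a <domain> element so that it
# parses as a single document (the wrapper is an assumption of this sketch).
DOMAIN_XML = """
<domain>
  <vcpu>8</vcpu>
  <cpu>
    <numa>
      <cell id='0' cpus='0-3' memory='2097152' unit='KiB'/>
      <cell id='1' cpus='4-7' memory='2097152' unit='KiB'/>
    </numa>
  </cpu>
</domain>
"""


def legacy_numa_args(domain_xml):
    """Hypothetical helper: build the old-style -smp / -numa node,cpus= args."""
    root = ET.fromstring(domain_xml)
    vcpus = int(root.findtext("vcpu"))
    # No <topology> in the XML: every vCPU sits in its own socket.
    args = ["-smp", f"{vcpus},sockets={vcpus},cores=1,threads=1"]
    for cell in root.findall("./cpu/numa/cell"):
        # Memory formatting is deliberately left elided ("mem=...").
        args += ["-numa",
                 f"node,nodeid={cell.get('id')},cpus={cell.get('cpus')},mem=..."]
    return args


print(" ".join(legacy_numa_args(DOMAIN_XML)))
```

Note that only vCPU IDs appear on the -numa lines; no socket, core or thread is involved at this point.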

If topology is specified then the -smp looks accordingly. But all that 
libvirt uses to assign vCPUs to NUMA nodes is vCPU ID. If it has to use 
sockets,cores,threads then so be it, but that means libvirt needs to 
learn the mapping of vCPU IDs to sockets=,cores=,threads=; because if it 
doesn't and generates the mapping differently to QEMU then for the above 
snippet vCPUs might move between NUMA nodes. I mean, if there is a 
domain with the above config it has some topology (that QEMU came up 
with). Now, after we change it and user updates libvirt & QEMU, libvirt 
might (in general) come up with a different topology and if the VM is 
booted again it will see, say, CPU1 move to NUMA#1 (for example).

This happened because libvirt came up with vCPU ID -> socket,core,thread 
mapping itself. I mean, in this patch the algorithm is copied from 
x86_topo_ids_from_idx(), but I bet there are different mappings (I can 
see x86_topo_ids_from_idx_epyc() and other architectures might have 
completely different mapping - POWER9 perhaps?).
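
To show what kind of mapping is meant, here is a minimal Python sketch of a linear index -> (socket, core, thread) decomposition. It is modelled on my reading of x86_topo_ids_from_idx() without the die level; the names TopoIDs and topo_ids_from_idx are made up for this example, and other targets (EPYC, POWER) may well compute this differently.

```python
from typing import NamedTuple


class TopoIDs(NamedTuple):
    """Hypothetical container for one vCPU's position in the topology."""
    socket: int
    core: int
    thread: int


def topo_ids_from_idx(cpu_index, cores, threads):
    """Split a linear index so that thread varies fastest, then core, then socket.

    Assumption: this mirrors the plain x86 scheme only; it is not a general
    rule for every machine type or architecture.
    """
    return TopoIDs(
        socket=cpu_index // (cores * threads),
        core=cpu_index // threads % cores,
        thread=cpu_index % threads,
    )


# -smp 8,sockets=2,cores=2,threads=2
for idx in range(8):
    print(idx, topo_ids_from_idx(idx, cores=2, threads=2))
```

If libvirt carried such a function and QEMU used a different one for some machine type, the same vCPU ID would land on a different socket/core/thread, which is exactly the worry described above.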

Maybe I'm misunderstanding cpu_index and vCPU ID? I thought it is the 
same thing.

> 
>>> that applies not only '-numa cpu' but also to -device cpufoo,
>>> that's why query-hotpluggable-cpus was introduced to let
>>> user get the list of possible CPUs (including topo properties needed to
>>> create them) for a given set of CLI options.
>>>
>>> If I recall right libvirt uses topo properties during cpu hotplug but
>>> treats it mainly as opaque info so it could feed it back to QEMU.
>>>
>>>    
>>>>>> tries to avoid that as much as it can.
>>>>>>      
>>>>>>>
>>>>>>> How to present it to libvirt user I'm not sure (give them that list perhaps
>>>>>>> and let select from it???)
>>>>>>
>>>>>> This is what I am trying to figure out in the cover letter. Maybe we
>>>>>> need to let users configure the topology (well, vCPU id to [socket, die,
>>>>>> core, thread] mapping), but then again, in my testing the guest ignored
>>>>>> that and displayed different topology (true, I was testing with -cpu
>>>>>> host, so maybe that's why).
>>>>> there is ongoing issue with EPYC VCPUs topology, but otherwise it should work.
>>>>> Just report bug to qemu-devel, if it's broken.
>>>>>         
>>>>>>      
>>>>>>> But it's irrelevant, to the patch, magical IDs for socket/core/...whatever
>>>>>>> should not be generated by libvirt anymore, but rather taken from QEMU for given
>>>>>>> machine + -smp combination.
>>>>>>
>>>>>> Taken when? We can do this for running machines, but not for freshly
>>>>>> started ones, can we?
>>>>>
>>>>> it can be used for freshly started as well,
>>>>> QEMU -S -preconfig -M pc -smp ...
>>>>> (QMP) query-hotpluggable-cpus
>>>>> (QMP) set-numa-node ...
>>>>> ...
>>>>> (QMP) exit-preconfig
>>>>> (QMP) other stuff libvirt does (like hot-plugging CPUs , ...)
>>>>> (QMP) cont
>>>>
>>>> I'm not sure this works. query-hotpluggable-cpus does not map vCPU ID
>>>> <-> socket/core/thread. For '-smp 2,sockets=2,cores=1,threads=1' the
>>>> 'query-hotpluggable-cpus' returns:
>>>>
>>>> {"return": [{"props": {"core-id": 0, "thread-id": 0, "socket-id": 1},
>>>> "vcpus-count": 1, "type": "qemu64-x86_64-cpu"}, {"props": {"core-id": 0,
>>>> "thread-id": 0, "socket-id": 0}, "vcpus-count": 1, "type":
>>>> "qemu64-x86_64-cpu"}]}
>>>
>>> that's the list I was talking about, which is implicitly ordered by cpu_index
>>
>> Aha! So in this case it would be:
>>
>> vCPU0 -> socket=1,core=0,thread=0
>> vCPU1 -> socket=0,core=0,thread=0
>>
>> But that doesn't feel right. Is the cpu_index increasing or decreasing
>> as I go through the array?
> it's array with decreasing order and index in it currently == cpu_index for
> present and possible CPUs. Content of array is immutable for given
> -M/-smp combination, to keep migration working. We can try to add
> x-cpu-index to cpu entries, so you won't have to rely on order to help with
> migrating from old CLI (but only for old machine types where old CLI actually
> worked).

That might help. So rather than hardcoding any mapping in libvirt, we 
would ask QEMU what it thinks the topology is. Cool.
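
A minimal Python sketch of how the reply quoted earlier could be turned into a vCPU ID -> props table. It relies on two assumptions: that the array is in decreasing cpu_index order (so the last element is cpu_index 0), and that every entry has vcpus-count 1 as in the x86 example. The function name vcpu_map_from_reply is made up for this example.

```python
import json

# The query-hotpluggable-cpus reply for '-smp 2,sockets=2,cores=1,threads=1'
# as quoted earlier in this thread.
REPLY = """
{"return": [{"props": {"core-id": 0, "thread-id": 0, "socket-id": 1},
"vcpus-count": 1, "type": "qemu64-x86_64-cpu"}, {"props": {"core-id": 0,
"thread-id": 0, "socket-id": 0}, "vcpus-count": 1, "type":
"qemu64-x86_64-cpu"}]}
"""


def vcpu_map_from_reply(reply_text):
    """Hypothetical helper: map cpu_index -> topology props.

    Assumption: the list is ordered by decreasing cpu_index, so it is
    reversed before enumerating. This ordering is implicit, which is why
    an explicit x-cpu-index field was suggested instead.
    """
    entries = json.loads(reply_text)["return"]
    return {idx: entry["props"] for idx, entry in enumerate(reversed(entries))}


for vcpu, props in vcpu_map_from_reply(REPLY).items():
    print(f"vCPU{vcpu} ->", props)
```

Under that reading vCPU0 ends up on socket 0 and vCPU1 on socket 1, i.e. the opposite of reading the array front to back.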

So it would work like this:

1) libvirt starts:
    qemu -preconfig -S -smp 8,sockets=2,cores=2,threads=2

2) libvirt uses "query-hotpluggable-cpus" to learn what topology it came 
up with, IOW what is the vCPU ID <-> socket,core,thread mapping

3) libvirt configures NUMA nodes, assigns vCPUs to them using 
[socket,core,thread] based on the mapping it learned in step 2)

4) preconfig is exited, machine resumed
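
Step 3 could look roughly like the following Python sketch, which only builds the QMP commands and does not talk to a QEMU instance. The helper name set_numa_node_cmds and its two input tables are assumptions of this example; memory configuration of the nodes is omitted on purpose.

```python
import json


def set_numa_node_cmds(node_cpus, vcpu_props):
    """Hypothetical helper: build set-numa-node commands for preconfig state.

    node_cpus:  {node_id: [vCPU ID, ...]} as the user configured it
    vcpu_props: {vCPU ID: {"socket-id": .., "core-id": .., "thread-id": ..}}
                as learned from query-hotpluggable-cpus in step 2
    """
    cmds = []
    for node_id, vcpus in sorted(node_cpus.items()):
        # Declare the node itself (memory settings left out of this sketch).
        cmds.append({"execute": "set-numa-node",
                     "arguments": {"type": "node", "nodeid": node_id}})
        for vcpu in vcpus:
            args = {"type": "cpu", "node-id": node_id}
            args.update(vcpu_props[vcpu])
            cmds.append({"execute": "set-numa-node", "arguments": args})
    return cmds


props = {0: {"socket-id": 0, "core-id": 0, "thread-id": 0},
         1: {"socket-id": 1, "core-id": 0, "thread-id": 0}}
for cmd in set_numa_node_cmds({0: [0], 1: [1]}, props):
    print(json.dumps(cmd))
```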

Very well. What I don't understand is why we need to have steps 2 and 3. 
Because in step 2, QEMU needs to report the mapping. Therefore it has to 
have some internal code that handles the mapping. Having said that, we 
can have new set of steps:

1) libvirt starts:
    qemu -preconfig -S -smp 8,sockets=2,cores=2,threads=2

2) libvirt configures NUMA nodes, assigns vCPUs to them using vCPU IDs, 
QEMU will use the internal code to map IDs to [socket,core,thread]

3) preconfig is exited, machine resumed

And since there is no need to preconfig anymore, we can have one step 
actually:

1) libvirt starts:
   qemu -S -smp 8,sockets=2,cores=2,threads=2 -numa node -numa cpu,cpus= 
-numa node -numa cpu,cpus=

Or, we can move the mapping into libvirt (that's what I tried to do in 
this patch). I'm not against it, but we will need to do it exactly like 
QEMU is doing now. Then we can do plain

1) qemu -S -smp 8,sockets=2,cores=2,threads=2 -numa node -numa 
cpu,socket=,core=,thread= -numa node -numa cpu,socket=,core=,thread=
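
Spelled out, that last command line could be generated like in the Python sketch below. The helper name numa_cpu_args is made up; the option spelling node-id=/socket-id=/core-id=/thread-id= is what I believe -numa cpu expects (the line above abbreviates it), and the topology tuples are taken as input so that the sketch does not depend on any particular ID mapping.

```python
def numa_cpu_args(smp, node_cpus):
    """Hypothetical helper: build -smp plus -numa node / -numa cpu arguments.

    smp:       (sockets, cores, threads)
    node_cpus: {node_id: [(socket, core, thread), ...]}
    """
    sockets, cores, threads = smp
    total = sockets * cores * threads
    args = ["-smp", f"{total},sockets={sockets},cores={cores},threads={threads}"]
    for node_id, cpus in sorted(node_cpus.items()):
        # Memory options of the node are omitted in this sketch.
        args += ["-numa", f"node,nodeid={node_id}"]
        for socket, core, thread in cpus:
            args += ["-numa",
                     f"cpu,node-id={node_id},socket-id={socket},"
                     f"core-id={core},thread-id={thread}"]
    return args


layout = {0: [(0, 0, 0)], 1: [(1, 0, 0)]}
print(" ".join(numa_cpu_args((2, 1, 1), layout)))
```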

> 
>> Also, how is this able to express holes? E.g.
>> there might be some CPUs that don't have linear topology, and for
>> instance while socket=0,core=0,thread=0 and socket=0,core=0,thread=2
>> exist, socket=0,core=0,thread=1 does not. How am I supposed to know that
>> by just looking at the array?
> speaking of x86, QEMU currently does not implement topologies with holes
> in [socket/core/thread] tuple but if it did it shouldn't matter as all
> CPUs and their relations with each other are described within that array.
> 
>   
>>>> And 'query-cpus' or 'query-cpus-fast' which map vCPU ID onto
>>>> socket/core/thread are not allowed in preconfig state.
>>> these 2 commands apply to present cpu only, if I'm not mistaken.
>>> query-hotpluggable-cpus shows not only present but also CPUs that
>>> could be hotplugged with device_add or used with -device.
>>
>> Fair enough. I haven't looked into the code that much.
>>
>>>
>>>    
>>>> But if I take a step back, the whole point of deprecating -numa
>>>> node,cpus= is that QEMU no longer wants to do vCPU ID <->
>>>> socket/core/thread mapping because it's ambiguous. So it feels a bit
>>>> weird to design a solution where libvirt would ask QEMU to provide the
>>>> mapping only so that it can be configured back. Not only because of the
>>>> extra step, but also because QEMU can't then remove the mapping anyway.
>>>> I might be misunderstanding the issue though.
>>> if '-numa node,cpus' is removed, we no longer will be using cpu_index as
>>> configuration interface with user, that would allow QEMU start pruning
>>> it from HMP/QMP interfaces and then probably remove it internally.
>>> (I haven't explored yet if we could get rid of it completely but
>>> I'd expect migration stream would be the only reason to keep it internally).
>>>
>>> I'm quite reluctant to add cpu_index to modern query-hotpluggable-cpus output,
>>> since the whole goal is to get rid of the index, which doesn't actually work
>>> with SPAPR where CPU entity is a core and threads are internal impl. detail
>>> (while cpu_index has 1:1 mapping with threads).
>>>
>>> However if it will let QEMU to drop '-numa node,cpus=', we can discuss
>>> adding optional 'x-cpu-index' to query-hotpluggable-cpus, that will be available
>>> for old machine types for the sole purpose to help libvirt map old CLI to new one.
>>> New machines shouldn't care about index though, since they should be using
>>> '-numa cpu'.
>>
>> The problem here is that so far, all that libvirt users see are vCPU
>> IDs. They use them to assign vCPUs to NUMA nodes. And in order to make
>> libvirt switch to the new command line it needs a way to map IDs to
>> socket=,core=,thread=. I will play more with the preconfig and let you know.
> 
> If libvirt's vCPU IDs are mirroring cpu_index, I'd say it shouldn't be doing
> so, see Daniel's response
> https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg04369.html

No. Libvirt is not mirroring anything. Daniel's right, and libvirt 
doesn't need to know actual qemu IDs. But it needs to ensure the 
topology. This is exactly what you objected to in the very first reply 
to the 4/5 patch. And I agree.

> 
> FYI:
> I didn't read through all the history of -preconfig patches but QEMU options
> for topology aware (sane) numa configuration on the table were:
>    1. -numa node,cpus[cpu_index]
>        libvirt needs to duplicate internal QEMU algorithms that map cpu_index
>        values to topology info and use it to map vCPUs to numa nodes
>        (and keep in sync with QEMU as it's machine versioned moving target)

I'm not sure I follow. So libvirt would continue to use -numa 
node,cpus=. How does topology step into that?

>    2. -numa cpu CLI option,
>        libvirt needs to duplicate internal QEMU algorithms that calculate
>        target-dependent values for socket/core/thread ids. (basically it's
>        the same as #1), the only difference is that CLI user interface is
>        expressed in topology properties.

Sure, this is what I tried to do. But you suggested using preconfig + 
query-hotpluggable-cpus.

>    3. when we discussed it in the past #2 wasn't going to fly as it still
>       had the same burden as #1 (duplicating code and keeping it in sync).
>       so we ended up with runtime configuration (-preconfig) to avoid QEMU
>       restart just for querying, where libvirt could get list of possible CPUs
>       from QEMU instance and complete numa configuration on the fly (at least
>       for the first time, results could be cached and re-used with -numa cpu).
> 

Yeah, this is the burden I am talking about. I feel like we are not 
talking about the same thing. Maybe I'm misunderstanding something.

Michal