[libvirt] [Qemu-devel] [RFC 0/6] enable numa configuration before machine_init() from HMP/QMP

Eduardo Habkost ehabkost at redhat.com
Fri Oct 20 20:07:03 UTC 2017


On Fri, Oct 20, 2017 at 10:07:27AM +0100, Daniel P. Berrange wrote:
> On Thu, Oct 19, 2017 at 05:56:49PM -0200, Eduardo Habkost wrote:
> > On Thu, Oct 19, 2017 at 04:28:59PM +0100, Daniel P. Berrange wrote:
> > > On Thu, Oct 19, 2017 at 11:21:22AM -0400, Igor Mammedov wrote:
> > > > ----- Original Message -----
> > > > > From: "Daniel P. Berrange" <berrange at redhat.com>
> > > > > To: "Igor Mammedov" <imammedo at redhat.com>
> > > > > Cc: "peter maydell" <peter.maydell at linaro.org>, pkrempa at redhat.com, ehabkost at redhat.com, cohuck at redhat.com,
> > > > > qemu-devel at nongnu.org, armbru at redhat.com, pbonzini at redhat.com, david at gibson.dropbear.id.au
> > > > > Sent: Wednesday, October 18, 2017 5:30:10 PM
> > > > > Subject: Re: [Qemu-devel] [RFC 0/6] enable numa configuration before machine_init() from HMP/QMP
> > > > > 
> > > > > On Tue, Oct 17, 2017 at 06:06:35PM +0200, Igor Mammedov wrote:
> > > > > > On Tue, 17 Oct 2017 16:07:59 +0100
> > > > > > "Daniel P. Berrange" <berrange at redhat.com> wrote:
> > > > > > 
> > > > > > > On Tue, Oct 17, 2017 at 09:27:02AM +0200, Igor Mammedov wrote:
> > > > > > > > On Mon, 16 Oct 2017 17:36:36 +0100
> > > > > > > > "Daniel P. Berrange" <berrange at redhat.com> wrote:
> > > > > > > >   
> > > > > > > > > On Mon, Oct 16, 2017 at 06:22:50PM +0200, Igor Mammedov wrote:
> > > > > > > > > > Series allows configuring NUMA mapping at runtime using the QMP/HMP
> > > > > > > > > > interface. For that to happen it introduces a new '-paused' CLI
> > > > > > > > > > option
> > > > > > > > > > which allows pausing QEMU before machine_init() is run and
> > > > > > > > > > adds new set-numa-node HMP/QMP commands which, in conjunction with
> > > > > > > > > > info hotpluggable-cpus/query-hotpluggable-cpus, allow configuring
> > > > > > > > > > the NUMA mapping for CPUs.
> > > > > > > > > 
> > > > > > > > > What's the problem we're seeking to solve here compared to what we
> > > > > > > > > currently
> > > > > > > > > do for NUMA configuration?
> > > > > > > > From RHBZ1382425
> > > > > > > > "
> > > > > > > > The current -numa CLI interface is quite limited in how it maps
> > > > > > > > CPUs to NUMA nodes, as it requires providing cpu_index values which
> > > > > > > > are non-obvious and depend on the machine/arch. As a result libvirt has to
> > > > > > > > assume/re-implement the cpu_index allocation logic to provide valid
> > > > > > > > values for the -numa cpus=... QEMU CLI option.
> > > > > > > 
> > > > > > > In broad terms, this problem applies to every device / object libvirt
> > > > > > > asks QEMU to create. For everything else libvirt is able to assign an
> > > > > > > "id" string, which it can then use to identify the thing later. The
> > > > > > > CPU stuff is different because libvirt isn't able to provide 'id'
> > > > > > > strings for each CPU - QEMU generates a pseudo-id internally which
> > > > > > > libvirt has to infer. The latter is the same problem we had with
> > > > > > > devices before '-device' was introduced allowing 'id' naming.
> > > > > > > 
> > > > > > > IMHO we should take the same approach with CPUs and start modelling
> > > > > > > the individual CPUs as something we can explicitly create with -object
> > > > > > > or -device. That way libvirt can assign names and does not have to
> > > > > > > care about CPU index values, and it all works just the same way as
> > > > > > > any other device / object we create.
> > > > > > > 
> > > > > > > ie instead of:
> > > > > > > 
> > > > > > >   -smp 8,sockets=4,cores=2,threads=1
> > > > > > >   -numa node,nodeid=0,cpus=0-3
> > > > > > >   -numa node,nodeid=1,cpus=4-7
> > > > > > > 
> > > > > > > we could do:
> > > > > > > 
> > > > > > >   -object numa-node,id=numa0
> > > > > > >   -object numa-node,id=numa1
> > > > > > >   -object cpu,id=cpu0,node=numa0,socket=0,core=0,thread=0
> > > > > > >   -object cpu,id=cpu1,node=numa0,socket=0,core=1,thread=0
> > > > > > >   -object cpu,id=cpu2,node=numa0,socket=1,core=0,thread=0
> > > > > > >   -object cpu,id=cpu3,node=numa0,socket=1,core=1,thread=0
> > > > > > >   -object cpu,id=cpu4,node=numa1,socket=2,core=0,thread=0
> > > > > > >   -object cpu,id=cpu5,node=numa1,socket=2,core=1,thread=0
> > > > > > >   -object cpu,id=cpu6,node=numa1,socket=3,core=0,thread=0
> > > > > > >   -object cpu,id=cpu7,node=numa1,socket=3,core=1,thread=0
> > > > > > the follow-up question would be where "socket=3,core=1,thread=0"
> > > > > > comes from; currently these options are a function of
> > > > > > (-M foo -smp ...) and can be queried via query-hotpluggable-cpus at
> > > > > > runtime after qemu parses the -M and -smp options.
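
[For reference, a trimmed sketch of what query-hotpluggable-cpus returns for an
x86 machine started with something like "-smp 8,sockets=4,cores=2,threads=1"; the
CPU type name and the exact property set are illustrative and vary by machine/arch:

  { "execute": "query-hotpluggable-cpus" }
  { "return": [
      { "type": "qemu64-x86_64-cpu", "vcpus-count": 1,
        "props": { "socket-id": 0, "core-id": 0, "thread-id": 0 } },
      { "type": "qemu64-x86_64-cpu", "vcpus-count": 1,
        "props": { "socket-id": 0, "core-id": 1, "thread-id": 0 } },
      ...
    ] }
]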
> > > > > 
> > > > > NB, I realize my example was open to mis-interpretation. The values I'm
> > > > > illustrating here for socket=3,core=1,thread=0 are *not* ID values; they
> > > > > are a plain enumeration of values, i.e. this is saying the 4th socket, the
> > > > > 2nd core and the 1st thread.  Internally QEMU might have the 2nd core
> > > > > with a core-id of 8, or 7038, or whatever architecture-specific numbering
> > > > > scheme makes sense, but that's not what the mgmt app gives at the CLI
> > > > > level.
> > > > The simplicity of fixed properties/values is tempting, and it might even
> > > > work for what we have implemented in qemu currently (well, SPAPR would need
> > > > refactoring (if possible) to meet the requirements, plus compat stuff for
> > > > current machines with sparse IDs).
> > > > But I have to disagree here and try to oppose it.
> > > > 
> > > > QEMU models concrete platforms/hw with certain non-abstract properties,
> > > > and it's libvirt's domain to translate platform-specific devices into
> > > > 'spherical' devices with abstract properties.
> > > > 
> > > > Now back to CPUs and the suggestion to fix the set of 'address' properties
> > > > and their values into a continuous enumeration range [0..N). That would:
> > > >   1. put the burden of hiding platform/device details on QEMU
> > > >       (which is already bad, as QEMU's job is to emulate them)
> > > >   2. with abstract 'address' properties and values, the user won't have
> > > >      a clue as to where a device is being attached (as qemu would magically
> > > >      remap that to fit specific machine needs)
> > > >   2.1. with abstract 'address' properties and values we can do away with
> > > >      socket/core/thread/whatnot, since they won't mean the same thing when
> > > >      considered from the platform's point of view, so we could just drop all
> > > >      this nonsense and go back to cpu-index, which has all the properties
> > > >      you've suggested (abstract, [0..N)).
> > > >   3. we currently settled on socket|core|thread-id properties as they are
> > > >      applicable to machines that support -device cpu, but it's up to the machine
> > > >      to pick which of these to use (x86: uses all, spapr: uses core-id only),
> > > >      and the current property set is open for extension if the need arises
> > > >      without having to redefine the interface. So a fixed list of properties
> > > >      [even ignoring the values impact] doesn't scale.
> > > 
> > > Note that from the libvirt POV we don't expose socket-id/core-id/thread-id in our
> > > guest XML, we just provide an overall count of sockets/cores/threads, which is
> > > portable. The only arch-specific thing we would have to do is express constraints
> > > about the ratios of these - e.g. indicate in some way that ppc doesn't allow
> > > multiple threads per core.
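
[For context, in libvirt guest XML the topology is carried only as counts, roughly
like this (element names as in current libvirt; values are just an example):

  <cpu>
    <topology sockets='4' cores='2' threads='1'/>
  </cpu>

so any per-CPU socket-id/core-id/thread-id values would have to be derived from it.]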
> > > 
> > > > We even have the cpu-add command, which takes cpu-index as an argument, and
> > > > the -numa node,cpus=0..X CLI option; good luck figuring out which cpu goes
> > > > where and whether it makes any sense from the platform's point of view.
> > > > 
> > > > That's why, when designing hot plug for the 'device_add cpu' interface, we ended up
> > > > with the new query-hotpluggable-cpus QMP command, which is currently used by libvirt
> > > > for hot plug:
> > > > 
> > > > The approach allows:
> > > >    1: the machine to publish properties/values that make sense from the emulated
> > > >       platform's point of view but are still understandable by a user of the given hw.
> > > >    2: the user to treat them as opaque mandatory properties when creating a cpu
> > > >       device if he/she doesn't care about where it's plugged.
> > > >    3: if the user cares about which cpu goes where, the properties defined by the
> > > >       machine provide that info from the emulated hw's point of view, including
> > > >       platform-specific details.
> > > >    4: the set of properties/values to be easily extended if the need arises without
> > > >       breaking users (provided the user puts them all in the -device/device_add
> > > >       options as intended).
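
[As an illustration of that hot-plug flow: the CPU type, id and property values
below are only examples; in practice they are taken verbatim from the "props" of
a preceding query-hotpluggable-cpus reply.

  { "execute": "device_add",
    "arguments": { "driver": "qemu64-x86_64-cpu", "id": "cpu-sock2",
                   "socket-id": 2, "core-id": 0, "thread-id": 0 } }
]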
> > > > 
> > > > But the current approach has a drawback: to call query-hotpluggable-cpus, the machine
> > > > has to be started first, which is fine for hot plug but not for specifying CLI options.
> > > > 
> > > > Currently that could be solved by starting qemu twice when 'defining a domain',
> > > > where on the first run mgmt queries the board layout and caches it for all the next
> > > > times the defined machine is started (a change in machine/version/-smp/-cpu would
> > > > invalidate the cache).
> > > > 
> > > > This series allows avoiding that first-time restart: when creating a domain for
> > > > the first time, mgmt can query the layout and then specify the NUMA mapping without
> > > > restarting. It can cache the defined mapping, as the commands exactly match the
> > > > corresponding CLI options, and reuse the cached options on subsequent domain starts.
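
[A rough sketch of the workflow described above; '-paused' and 'set-numa-node' are
the names introduced by this RFC, and the argument syntax shown for them is only
illustrative, not the exact form the patches define:

  First start (discover, then configure before machine_init()):
    qemu-system-x86_64 -paused -M pc -smp 8,sockets=4,cores=2,threads=1 ...
    (QMP) query-hotpluggable-cpus
    (QMP) set-numa-node ...      one call per node and per cpu<->node mapping
    ... then let QEMU continue into machine_init() via whatever resume
    command the series provides

  Subsequent starts (mgmt replays the cached mapping as equivalent CLI options):
    qemu-system-x86_64 -M pc -smp 8,sockets=4,cores=2,threads=1 \
      -numa node,nodeid=0 -numa node,nodeid=1 \
      -numa cpu,node-id=0,socket-id=0 -numa cpu,node-id=0,socket-id=1 \
      -numa cpu,node-id=1,socket-id=2 -numa cpu,node-id=1,socket-id=3 ...
]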
> > > > 
> > > > This approach could be extended further with a "device_add cpu" command,
> > > > so it would be possible to start qemu with -smp 0,... and allow mgmt to
> > > > create cpus with explicit IDs controlled by mgmt; again, mgmt may cache
> > > > these commands and reuse them on the CLI the next time the machine is started.
> > > > 
> > > > I think Eduardo's work on query-slots is a superset of query-hotpluggable-cpus,
> > > > but works toward the same goal: to allow mgmt to discover which hw is provided by
> > > > a specific machine and where/which hw could be plugged (like which slot supports
> > > > which kind of device and which 'address' should be used to attach a device
> > > > (socket|core... for cpus, bus/function for pci, ...)).
> > > 
> > > As mentioned elsewhere in the thread, the approach of defining the VM config
> > > incrementally via the monitor has significant downsides: it makes the config
> > > invisible in any logs of the ARGV, and it likely has a performance impact when
> > > starting up QEMU, particularly if it is used for more things going forward. To
> > > me these downsides are enough to make the suggested approach for CPUs impractical
> > > for libvirt to use.
> > 
> > Those downsides do exist, but we should weigh them against the
> > downsides of not allowing any information at all to flow from
> > QEMU to libvirt when starting a VM.
> > 
> > I believe the code in libvirt/src/qemu/qemu_domain_address.c is
> > a good illustration of those downsides.
> 
> Right, but for this NUMA / CPU scenario I don't think we're going to end up
> with complexity like that. I still believe we are able to come up with a
> way to represent it at the CLI without so much architecture-specific
> knowledge.

In the case of NUMA/CPU, I'm inclined to agree.


> 
> Even if that is not possible, though, from the libvirt POV the extra complexity
> is worth it if that is what we need to preserve fast startup times. The
> time to start a guest is very important to apps like libguestfs and libvirt-sandbox,
> so going down a direction which is likely to add 100's or even 1000's
> of milliseconds to the startup time is not desirable, even if it makes libvirt
> simpler.

I don't believe this is likely to add 100's or 1000's of
milliseconds to startup time, but I agree we need to keep an eye
on startup time while introducing new interfaces.

-- 
Eduardo



