[libvirt] RFC: New CPU hot(un)plug API and XML

Peter Krempa pkrempa at redhat.com
Thu Jun 16 07:14:12 UTC 2016


On Wed, Jun 15, 2016 at 18:10:01 +0100, Daniel Berrange wrote:
> On Mon, Jun 13, 2016 at 02:48:51PM +0200, Peter Krempa wrote:

TL;DR: I mostly agree with your suggestions although there's a slight
problem with the suggested XML format. See [1] at the bottom.

> 
> > # API #

[...]

> We generally use flags to control behaviour of a specific action,
> not to actually choose which action to run.
> 
> IOW, I'd prefer to see us have separate APIs for this
> 
>    virDomainVCPUPlug / virDomainVCPUUnplug

I thought so :)

> Or, have a single API with separate argument to reflect
> the state, eg
> 
>    virDomainVCPUSetHardware(virDomainPtr domain,
>                             unsigned int id,
>                             unsigned int present, /* 0 == unplugged, 1 == plugged */
>                             unsigned int flags);

I probably prefer this one, mostly to avoid adding yet more APIs :)
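
To illustrate, a client would end up doing something along these lines
(the prototype is just the one sketched above; none of this exists in
libvirt yet, and the use of the usual impact flags is only an
assumption):

    /* proposed prototype, not an existing libvirt API */
    int virDomainVCPUSetHardware(virDomainPtr domain,
                                 unsigned int id,
                                 unsigned int present, /* 0 == unplugged, 1 == plugged */
                                 unsigned int flags);

    /* hotplug vCPU 1 of a running domain ... */
    if (virDomainVCPUSetHardware(dom, 1, 1, VIR_DOMAIN_AFFECT_LIVE) < 0)
        /* handle error */;

    /* ... and unplug it again later */
    if (virDomainVCPUSetHardware(dom, 1, 0, VIR_DOMAIN_AFFECT_LIVE) < 0)
        /* handle error */;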

> 
> of course, inside the QEMU driver, it's fine if you want to have
> these separate APIs call into a common code path to do their work.
> 
> >   - misc:
> >     * VIR_DOMAIN_SET_VCPU_GUEST - use the guest agent instead of ACPI hotplug
> 
> I think in general it was/is a mistake to overload guest agent
> CPU online/offline-ing with real CPU hotplug/unplug. The two
> operations are really distinct from each other, and having them
> in the same API gives the misleading impression that they're
> just two ways of doing the same thing, whereas really they are
> two completely different beasts with very different security
> implications too.
> 
> IOW, I'd suggest we make this more explicit by having another
> method
> 
>        virDomainVCPUSetGuestUsage(virDomainPtr domain,
>                                   unsigned int id,
>                                   unsigned int online, /* 0 == disabled, 1 == enabled */
>                                   unsigned int flags);

Agreed. I'll add this separately then.
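
For clarity, an application would then explicitly choose between the
two operations, e.g. (both prototypes are only the ones sketched in
this thread, neither exists in libvirt yet):

    /* proposed prototypes, not existing libvirt APIs */
    int virDomainVCPUSetHardware(virDomainPtr domain, unsigned int id,
                                 unsigned int present, unsigned int flags);
    int virDomainVCPUSetGuestUsage(virDomainPtr domain, unsigned int id,
                                   unsigned int online, unsigned int flags);

    /* real ACPI hotplug of vCPU 2 ... */
    virDomainVCPUSetHardware(dom, 2, 1, 0);

    /* ... as opposed to asking the guest agent to online it */
    virDomainVCPUSetGuestUsage(dom, 2, 1, 0);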

> 
> >     * VIR_DOMAIN_SET_VCPU_NUMA_NODE - 'id' is the ID of a numa node where the
> >       cpu should be enabled/disabled rather than CPU id. This is a convenience
> >       flag that will allow adding a CPU to a given numa node rather than having
> >       to find the correct ID.
> 
> Using flags is fine if we want to retrofit this kind of thing into an
> existing API, but if we're deciding a new API, we should aim to represent
> NUMA node ID explicitly I think.
> 
> Then again, I wonder if the desire for this flag is really just a reflection
> on us making it too hard to figure out the correct vCPU ID for a NUMA node.

IMO it's pretty obvious how to get the data. You just need to collate it
appropriately. For some reason I wanted to add syntax sugar for that but
that's certainly not needed and avoiding it will decrease code
complexity.

> 
> >     * VIR_DOMAIN_SET_VCPU_CORE - use thread level hotplug (see [1]). This
> >                                  makes sure that the CPU will be plugged in
> >                                  on platforms that require plugging in multiple
> >                                  threads at once.
> 
> So the use of 'int id' rather strongly suggests we're only plugging
> a single vCPU at a time. Given that with PPC you're forced to plug
> an entire set of threads at once, in a single "core", I think perhaps
> we should have a bitmask instead of an 'int id', e.g.
> 
>     unsigned char *cpumask,
>     unsigned int cpumasklen

This is a good idea. I'll use this approach, but I'll limit it in the
docs and code so that it will (un)plug just one entity at a time, so
that we don't run into issues with partial failures.
> 
> The domain guest capabilities XML could describe the minimum required
> granularity for hotplugging/unplugging CPUs. So the app would query
> that and discover whether they need to plug 1 or 8 cpus at once.
> The app would be responsible for setting the CPU mask to reflect
> how many CPUs they want to plug at once.  That would remove the

You also need to make sure that you are plugging the correct cpus
together, but that should be an easy check.

> need to have an explicit VIR_DOMAIN_SET_VCPU_CORE flag I think.
> We'd simply return an error if the app didn't pay attention to the
> required CPU hotplug granularity described in the capabilities.
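
To make the granularity handling concrete, a client could build the
cpumask like this (VIR_CPU_MAPLEN and VIR_USE_CPU are the existing
libvirt cpumap helpers; the virDomainVCPUSetHardware variant taking a
mask is just the proposal above combined with the bitmask idea, and the
granularity lookup from the capabilities is hypothetical):

    #include <stdlib.h>
    #include <libvirt/libvirt.h>

    /* Plug (present=1) or unplug (present=0) one "entity" worth of
     * threads starting at vCPU id 'first'; 'granularity' would come
     * from the guest capabilities, e.g. 8 threads per core on POWER8. */
    static int
    set_vcpu_entity(virDomainPtr dom, unsigned int first,
                    unsigned int granularity, unsigned int present)
    {
        int maplen = VIR_CPU_MAPLEN(first + granularity);
        unsigned char *cpumask = calloc(maplen, 1);
        size_t i;
        int ret;

        if (!cpumask)
            return -1;

        /* set the bits for all threads of the entity */
        for (i = 0; i < granularity; i++)
            VIR_USE_CPU(cpumask, first + i);

        /* proposed API, does not exist in libvirt yet */
        ret = virDomainVCPUSetHardware(dom, cpumask, maplen, present, 0);

        free(cpumask);
        return ret;
    }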

[...]

> > The new API will require us to add new XML that will allow tracking the state
> > of VCPUs individually. Internally we now have a data structure that allows
> > keeping the relevant data in one place.
> 
> Presumably we're only going to track the plugged/unplugged state,
> and *not* the guest OS online/offline state. This would be another
> reason to deal with guest OS online/offline via a separate API
> from the real hotplug/unplug API.

Yes, dealing with the guest state would require adding guest agent
events for notifying us of changes.

[...]

> > <domain>
> >   [...]
> >   <vcpu current='1'>3</vcpu>
> >   <vcpus>
> >     <vcpu id='0' state='enabled'/> <-- option 1, no extra data
> >     <vcpu id='1' state='disabled' cell='1'/> <--- option 2, just numa node,
> >                                                   since it's non-obvious
> >     <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
> >      <!-- option 3 all the data duplicated -->
> >   </vcpus>
> >   [...]
> >   <cputune>
> >     <cpupin ... />
> >   </cputune>
> >   [...]
> >   <cpu>
> >     <numa>
> >       <cell id='0' cpus='0' memory='102400' unit='KiB'/>
> >       <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
> >     </numa>
> 
> In the 'virsh capabilities' XML, the CPUs are listed underneath the
> <cell>. So rather than adding <vcpu> as children of <vcpus>, I think
> perhaps we should just put them under the <cell> here too. That would
> avoid the need to add a cell=NN attribute, e.g.
> 
>    <cpu>
>      <numa>
>        <cell id='0' cpus='0' memory='102400' unit='KiB'>
>          <cpus>
>            <cpu id="0" state="enabled"/>
>          </cpus>
>        </cell>
>        <cell id='1' cpus='1-2' memory='102400' unit='KiB'>
>          <cpus>
>            <cpu id="1" state="disabled"/>
>            <cpu id="2" state="enabled"/>
>          </cpus>
>        </cell>
>      </numa>

[1]

I was thinking of doing the same at first, but there's one caveat: for
non-NUMA guests you don't get the <numa> element at all. In that case
the online/offline state would need to be kept somewhere else, and I
wanted to avoid having two places to put the data.

I've opted for adding a new element which works even for non-NUMA
guests, at the expense that you need to collate data from two places in
the case of NUMA guests.
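
So for a non-NUMA guest the per-vCPU state would live solely in the new
element, roughly like this (just a sketch of the idea using the
attributes proposed above, nothing of this is implemented yet):

  <vcpu current='1'>3</vcpu>
  <vcpus>
    <vcpu id='0' state='enabled'/>
    <vcpu id='1' state='disabled'/>
    <vcpu id='2' state='disabled'/>
  </vcpus>
  <!-- no <numa> element at all in this case -->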

> > # qemu/platform implementation caveats #
> > 
> > When starting the VM for the first time it might be necessary to start a
> > throw-away qemu process to query some details that we'll need to pass in on a
> > command line. I'm not sure if this is still necessary and I'll try to avoid it
> > at all cost.
> 
> If that is needed, then I think we would need to fix QEMU to let us
> introspect that at time we normally probe capabilities. Spawning an
> extra throw-away QEMU is absolutely not an option.

I was advocating for this to the qemu developers when I was discussing
this stuff with them. I hope they paid attention.



