[libvirt] [RFC] Memory hotplug for qemu guests and the relevant XML parts

Daniel P. Berrange berrange at redhat.com
Wed Jul 30 10:08:04 UTC 2014


On Tue, Jul 29, 2014 at 05:05:23PM +0100, Daniel P. Berrange wrote:
> On Tue, Jul 29, 2014 at 04:40:50PM +0200, Peter Krempa wrote:
> > On 07/24/14 17:03, Peter Krempa wrote:
> > > On 07/24/14 16:40, Daniel P. Berrange wrote:
> > >> On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
> > >>> On 07/24/14 16:21, Daniel P. Berrange wrote:
> > >>>> On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
> > 
> > >>
> > >>>> So from that POV, I'd say that when we initially configure the
> > >>>> NUMA / huge page information for a guest at boot time, we should
> > >>>> be doing that wrt the 'maxMemory' size, instead of the current
> > >>>> 'memory' size, i.e. the actual NUMA topology is all set up upfront
> > >>>> even though the DIMMs are not present for some of this topology.
> > >>>>
> > >>>>> "address" determines the address in the guest's memory space where the
> > >>>>> memory will be mapped. This is optional and not recommended to be set
> > >>>>> by the user (except for special cases).
> > >>>>>
> > >>>>> For expansion the model="pflash" device may be added.
> > >>>>>
> > >>>>> For migration the target VM needs to be started with the hotplugged
> > >>>>> modules already specified on the command line, which is in line with
> > >>>>> how we treat devices currently.
> > >>>>>
> > >>>>> My suggestion above contrasts with the approach Michal and Martin took
> > >>>>> when adding the numa and hugepage backing capabilities as they describe
> > >>>>> a node while this describes the memory device beneath it. I think those
> > >>>>> two approaches can co-exist whilst being mutually exclusive: when
> > >>>>> using memory hotplug, the memory will need to be specified using the
> > >>>>> memory modules. Non-hotplug guests could use the approach defined
> > >>>>> originally.
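
For illustration only, a per-module device element along those lines might
look roughly like the sketch below; the element and attribute names are
purely hypothetical, not the exact syntax proposed in the RFC:

    <memory model='dimm'>
      <target>
        <!-- a 512 MiB module attached to guest NUMA node 1 -->
        <size unit='KiB'>524288</size>
        <node>1</node>
      </target>
      <!-- optional; normally left for libvirt/qemu to assign -->
      <address base='0x100000000'/>
    </memory>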
> > >>>>
> > >>>> I don't think it is viable to have two different approaches for configuring
> > >>>> NUMA / huge page information. Apps should not have to change the way they
> > >>>> configure NUMA/hugepages when they decide they want to take advantage of
> > >>>> DIMM hotplug.
> > >>>
> > >>> Well, the two approaches are orthogonal in the information they store.
> > >>> The existing approach stores the memory topology from the point of view
> > >>> of the numa node whereas the <device> based approach stores it from the
> > >>> point of view of the memory module.
> > >>
> > >> Sure, they are clearly designed from different POV, but I'm saying that
> > >> from an application POV it is very unpleasant to have 2 different ways
> > >> to configure the same concept in the XML. So I really don't want us to
> > >> go down that route unless there is absolutely no other option to achieve
> > >> an acceptable level of functionality. If that really were the case, then
> > >> I would strongly consider reverting everything related to NUMA that we
> > >> have just done during this dev cycle and not releasing it as is.
> > >>
> > >>> The difference is that the existing approach currently wouldn't allow
> > >>> splitting a numa node into multiple memory devices to allow
> > >>> plugging/unplugging them.
> > >>
> > >> There's no reason why we have to assume 1 memory slot per guest or
> > >> per node when booting the guest. If the user wants the ability to
> > >> unplug, they could set their XML config so the guest has arbitrary
> > >> slot granularity, e.g. if I have a guest
> > >>
> > >>  - memory == 8 GB
> > >>  - max-memory == 16 GB
> > >>  - NUMA nodes == 4
> > >>
> > >> Then we could allow them to specify 32 memory slots each 512 MB
> > >> in size. This would allow them to plug/unplug memory from NUMA
> > >> nodes in 512 MB granularity.
> > 
> > In real hardware you still can plug in modules of different sizes. (eg
> > 1GiB + 2GiB) ...
> 
> I was just illustrating that as an example of the default we'd
> write into the XML if the app hadn't explicitly given any slot
> info themselves. If doing it manually you can of course list
> the slots with arbitrary sizes, each a different size.
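
To make the arithmetic concrete: in that example the 8 GB of initial memory
fills 16 of the 32 slots of 512 MiB each, i.e. 4 populated slots per NUMA
node, leaving 16 slots free for later hotplug. Written out as a sketch
(attribute names hypothetical, sizes in KiB), that default layout could be:

    <maxMemory slots='32' unit='KiB'>16777216</maxMemory>
    <memory unit='KiB'>8388608</memory>
    <cpu>
      <numa>
        <!-- 2 GiB per node == 4 populated 512 MiB slots per node -->
        <cell id='0' cpus='0' memory='2097152'/>
        <cell id='1' cpus='1' memory='2097152'/>
        <cell id='2' cpus='2' memory='2097152'/>
        <cell id='3' cpus='3' memory='2097152'/>
      </numa>
    </cpu>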
> 
> > > Well, while this makes it pretty close to real hardware, the emulated
> > > one doesn't have a problem with plugging "dimms" of weird
> > > (non-power-of-2) sizing. And we are losing flexibility due to that.
> > > 
> > 
> > Hmm, now that the rest of the hugepage stuff has been pushed and the
> > release is rather soon, what approach should I take? I'd rather avoid
> > crippling the interface for memory hotplug and having to add separate
> > APIs and other stuff, and mostly I'd like to avoid having to re-do it
> > after consumers of libvirt deem it inflexible.
> 
> NB, as a general point of design, it isn't our goal to always directly
> expose every possible way of configuring things that QEMU allows. If
> there are multiple ways to achieve the same end goal it is valid for
> libvirt to pick a particular approach and not expose all possible QEMU
> flexibility. This is especially true if this makes cross-hypervisor
> support of the feature more practical.
> 
> Looking at the big picture, we've got a bunch of memory-related
> configuration sets
> 
>  - Guest NUMA topology setup, assigning vCPUs and RAM to guest nodes
> 
>     <cpu>
>       <numa>
>         <cell id='0' cpus='0' memory='512000'/>
>         <cell id='1' cpus='1' memory='512000'/>
>         <cell id='2' cpus='2-3' memory='1024000'/>
>       </numa>
>     </cpu>
> 
>  - Request the use of huge pages, optionally with a different size
>    per guest NUMA node
> 
>     <memoryBacking>
>       <hugepages/>
>     </memoryBacking>
> 
>     <memoryBacking>
>       <hugepages>
>         <page size='2048' unit='KiB' nodeset='0,1'/>
>         <page size='1' unit='GiB' nodeset='2'/>
>       </hugepages>
>     </memoryBacking>
> 
>  - Mapping of guest NUMA nodes to host NUMA nodes
> 
>     <numatune>
>       <memory mode="strict" nodeset="1-4,^3"/>
>       <memnode cellid="0" mode="strict" nodeset="1"/>
>       <memnode cellid="1" mode="strict"  nodeset="2"/>
>     </numatune>
> 
> 
> At the QEMU level, aside from the size of the DIMM, the memory slot
> device lets you 
> 
>   1. Specify guest NUMA node to attach to
>   2. Specify host NUMA node to assign to
>   3. Request use of huge pages, optionally with size

[snip]

> So I think it is valid for libvirt to expose the memory slot feature
> by just specifying the RAM size and the guest NUMA node, and to infer
> huge page usage, huge page size and host NUMA node from the existing
> data that libvirt has elsewhere in its domain XML document.
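
As a sketch of what that minimal exposure could look like (element names
hypothetical), a module would only state its size and target guest node:

    <memory model='dimm'>
      <!-- only size and guest node are given explicitly; huge page
           usage/size and host NUMA placement are inferred from the
           <memoryBacking> and <numatune> settings for that node -->
      <target>
        <size unit='KiB'>524288</size>
        <node>1</node>
      </target>
    </memory>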

I meant to outline how I thought hotplug/unplug would interact with
the existing data.

When first booting the guest

 - If the XML does not include any memory slot info, we should
   add the minimum number of memory slots needed to match the
   per-guest NUMA node config.

 - If the XML does include slots, then we must validate that the
   sum of the memory for the slots listed against each guest NUMA
   node matches the memory set in /cpu/numa/cell/@memory, as
   sketched below.
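
For example (module syntax hypothetical), a guest NUMA cell declaring
2 GiB would have to be covered exactly by the slots targeting it:

    <cell id='0' cpus='0' memory='2097152'/>

    <!-- these two modules target node 0 and sum to 2097152 KiB,
         matching /cpu/numa/cell[@id='0']/@memory -->
    <memory model='dimm'>
      <target><size unit='KiB'>1048576</size><node>0</node></target>
    </memory>
    <memory model='dimm'>
      <target><size unit='KiB'>1048576</size><node>0</node></target>
    </memory>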

When hugepages are in use we need to make sure we validate that we're
adding slots whose size is a multiple of the huge page size. The code
should already be validating that each NUMA node's memory is a multiple
of the configured huge page size for that node.
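
For example, if node 0 is backed by 2048 KiB pages, every slot targeted
at that node has to be a whole number of pages (module syntax hypothetical):

    <memoryBacking>
      <hugepages>
        <page size='2048' unit='KiB' nodeset='0'/>
      </hugepages>
    </memoryBacking>

    <!-- 524288 KiB / 2048 KiB == 256 pages, so this slot is acceptable;
         a 500000 KiB slot targeting node 0 would have to be rejected -->
    <memory model='dimm'>
      <target><size unit='KiB'>524288</size><node>0</node></target>
    </memory>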

When hotplugging / unplugging

 - Libvirt would update the /cpu/numa/cell/@memory attribute
   and /memory element to reflect the newly added/removed DIMM
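
For illustration (sizes in KiB, values hypothetical), hotplugging a
1 GiB DIMM into guest node 1 would change the live XML roughly like:

    before:
      <memory unit='KiB'>2097152</memory>
      <cell id='1' cpus='1' memory='1048576'/>

    after:
      <memory unit='KiB'>3145728</memory>
      <cell id='1' cpus='1' memory='2097152'/>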

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|



