[libvirt] [RFC] Memory hotplug for qemu guests and the relevant XML parts

Tue Jul 29 16:05:23 UTC 2014

On Tue, Jul 29, 2014 at 04:40:50PM +0200, Peter Krempa wrote:
> On 07/24/14 17:03, Peter Krempa wrote:
> > On 07/24/14 16:40, Daniel P. Berrange wrote:
> >> On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
> >>> On 07/24/14 16:21, Daniel P. Berrange wrote:
> >>>> On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
> 
> >>
> >>>> So from that POV, I'd say that when we initially configure the
> >>>> NUMA / huge page information for a guest at boot time, we should
> >>>> be doing that wrt the 'maxMemory' size, instead of the current
> >>>> 'memory' size. ie the actual NUMA topology is all set up upfront
> >>>> even though the DIMMS are not present for some of this topology.
> >>>>
> >>>>> "address" determines the address in the guest's memory space where the
> >>>>> memory will be mapped. This is optional, and it is not recommended
> >>>>> that users set it (except for special cases).
> >>>>>
> >>>>> For expansion the model="pflash" device may be added.
> >>>>>
> >>>>> For migration the target VM needs to be started with the hotplugged
> >>>>> modules already specified on the command line, which is in line with
> >>>>> how we treat devices currently.
> >>>>>
> >>>>> My suggestion above contrasts with the approach Michal and Martin took
> >>>>> when adding the numa and hugepage backing capabilities as they describe
> >>>>> a node while this describes the memory device beneath it. I think those
> >>>>> two approaches can co-exist whilst being mutually-exclusive. Simply when
> >>>>> using memory hotplug, the memory will need to be specified using the
> >>>>> memory modules. Non-hotplug guests could use the approach defined
> >>>>> originally.
> >>>>
> >>>> I don't think it is viable to have two different approaches for configuring
> >>>> NUMA / huge page information. Apps should not have to change the way they
> >>>> configure NUMA/hugepages when they decide they want to take advantage of
> >>>> DIMM hotplug.
> >>>
> >>> Well, the two approaches are orthogonal in the information they store.
> >>> The existing approach stores the memory topology from the point of view
> >>> of the numa node whereas the <device> based approach stores it from the
> >>> point of view of the memory module.
> >>
> >> Sure, they are clearly designed from different POV, but I'm saying that
> >> from an application POV it is very unpleasant to have 2 different ways
> >> to configure the same concept in the XML. So I really don't want us to
> >> go down that route unless there is absolutely no other option to achieve
> >> an acceptable level of functionality. If that really were the case, then
> >> I would strongly consider reverting everything related to NUMA that we
> >> have just done during this dev cycle and not releasing it as is.
> >>
> >>> The difference is that the existing approach currently wouldn't allow
> >>> splitting a numa node into more memory devices to allow
> >>> plugging/unplugging them.
> >>
> >> There's no reason why we have to assume 1 memory slot per guest or
> >> per node when booting the guest. If the user wants the ability to
> >> unplug, they could set their XML config so the guest has arbitrary
> >> slot granularity. eg if i have a guest
> >>
> >>  - memory == 8 GB
> >>  - max-memory == 16 GB
> >>  - NUMA nodes == 4
> >>
> >> Then we could allow them to specify 32 memory slots each 512 MB
> >> in size. This would allow them to plug/unplug memory from NUMA
> >> nodes in 512 MB granularity.
> 
> In real hardware you still can plug in modules of different sizes. (eg
> 1 GiB + 2 GiB) ...

I was just illustrating that as an example of the default we'd
write into the XML if the app hadn't explicitly given any slot
info themselves. If doing it manually you can of course list
the slots with arbitrary sizes, each a different size.
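
To make the arithmetic of that default concrete, here is a rough
sketch only. The helper name default_slots is hypothetical (it is not
a libvirt API), it assumes the "GB"/"MB" figures above are binary
units, and it assumes the default spreads slots evenly across the
guest NUMA nodes, which is my reading of the example rather than
anything settled in this thread.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def default_slots(max_memory_bytes, numa_nodes, slot_bytes):
    """Hypothetical helper: split max memory into equal-sized slots.

    Returns a list of (slot index, guest NUMA node, slot size in bytes).
    Slots are assigned to nodes in contiguous, equal-sized groups
    (an assumption made for this sketch).
    """
    if max_memory_bytes % slot_bytes:
        raise ValueError("max memory is not a multiple of the slot size")
    count = max_memory_bytes // slot_bytes
    if count % numa_nodes:
        raise ValueError("slots cannot be spread evenly across the nodes")
    per_node = count // numa_nodes
    return [(i, i // per_node, slot_bytes) for i in range(count)]


# The example from the thread: max-memory 16 GB, 4 NUMA nodes, 512 MB slots.
slots = default_slots(16 * GIB, 4, 512 * MIB)
print(len(slots))                             # number of slots
print(sum(1 for s in slots if s[1] == 0))     # slots on guest node 0
```

With those inputs this gives the 32 slots mentioned above, 8 per
guest node. An app listing slots manually would simply not go through
such a default.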

> > Well, while this makes it pretty close to real hardware, the emulated
> > one doesn't have a problem with plugging "dimms" of weird
> > (non-power-of-2) sizing. And we are losing flexibility due to that.
> > 
> 
> Hmm, now that the rest of the hugepage stuff has been pushed and the
> release is rather soon, what approach should I take? I'd rather avoid
> crippling the interface for memory hotplug and having to add separate
> APIs and other stuff, and mostly I'd like to avoid having to re-do it
> after consumers of libvirt deem it to be inflexible.

NB, as a general point of design, it isn't our goal to always directly
expose every possible way of configuring things that QEMU allows. If
there are multiple ways to achieve the same end goal it is valid for
libvirt to pick a particular approach and not expose all possible QEMU
flexibility. This is especially true if this makes cross-hypervisor
support of the feature more practical.

Looking at the big picture, we've got a bunch of memory related
configuration sets:

 - Guest NUMA topology setup, assigning vCPUs and RAM to guest nodes

    <cpu>
      <numa>
        <cell id='0' cpus='0' memory='512000'/>
        <cell id='1' cpus='1' memory='512000'/>
        <cell id='2' cpus='2-3' memory='1024000'/>
      </numa>
    </cpu>

 - Request the use of huge pages, optionally different size
   per guest NUMA node

    <memoryBacking>
      <hugepages/>
    </memoryBacking>

    <memoryBacking>
      <hugepages>
        <page size='2048' unit='KiB' nodeset='0,1'/>
        <page size='1' unit='GiB' nodeset='2'/>
      </hugepages>
    </memoryBacking>

 - Mapping of guest NUMA nodes to host NUMA nodes

    <numatune>
      <memory mode="strict" nodeset="1-4,^3"/>
      <memnode cellid="0" mode="strict" nodeset="1"/>
      <memnode cellid="1" mode="strict" nodeset="2"/>
    </numatune>

At the QEMU level, aside from the size of the DIMM, the memory slot
device lets you:

  1. Specify guest NUMA node to attach to
  2. Specify host NUMA node to assign to
  3. Request use of huge pages, optionally with size

Item 1 is clearly needed.

Item 2 is something that I think is not relevant to expose in libvirt.
We already define a mapping of guest nodes to host nodes, so it can
be inferred from that. It is true that specifying host node explicitly
is more flexible, because it lets you map different DIMMS within a
guest node to different host nodes. I think this flexibility is a
feature in search of a problem. It doesn't make sense from a performance
optimization POV to have a single guest node with DIMMS mapped to more
than one host node. If you find yourself having to do that, it is a sign
that you didn't configure enough guest nodes in the first place.

Item 3 is a slightly more fuzzy one. If we inferred it from the existing
hugepage mapping, then any hotplugged memory would use the same page
size as the existing memory in that node. If explicitly specified then
you could configure a NUMA node with a mixture of 4 KB, 2 MB and 1 GB
pages. I could see why you might want this: if, say, you have set up a
1 GB page size for the node but only want to add 256 MB of RAM to
it, you'd have to use 2 MB pages. If I consider what it means to
the guest from a functional performance POV though, I'm pretty sceptical
that it is a sensible thing to want to do. People are using hugepages
so that the guest can get predictable memory access latency and better
TLB efficiency. Consider if we have a guest with 2 NUMA nodes, the
first node uses 4 KB pages, and the second node uses 2 MB or 1 GB pages.
Now in the guest OS an application needing the predictable latency
memory access can be bound to the second guest NUMA node to achieve
that. If we consider configuring a single NUMA node with a mixture
of page sizes, then there is no way for the guest administrator to
set up their guest applications to take advantage of the specific
huge pages allocated to the guest.
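
The 256 MB case above is just a divisibility constraint, and can be
sketched as follows. This is an illustration only: the function name
usable_page_sizes is hypothetical, the units are assumed to be binary,
and it assumes a DIMM has to be backed by a whole number of pages.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# The page sizes discussed in this thread.
PAGE_SIZES = {"4 KiB": 4 * KIB, "2 MiB": 2 * MIB, "1 GiB": 1 * GIB}


def usable_page_sizes(dimm_bytes):
    """Hypothetical helper: page sizes that evenly divide the DIMM size."""
    return [name for name, size in PAGE_SIZES.items()
            if dimm_bytes % size == 0]


print(usable_page_sizes(256 * MIB))   # a 256 MB DIMM
print(usable_page_sizes(2 * GIB))     # a 2 GB DIMM
```

The 256 MB DIMM cannot be backed by 1 GB pages, which is the only
reason a per-slot page size would be needed in a 1 GB page node.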

Now from the QEMU CLI there is the ability to configure all these
different options but that doesn't imply that all the configuration
possibilities are actually intended for use. ie, you need to be able
to specify the host NUMA node, huge page usage etc against the slot
but that doesn't mean it is intended that we use that to configure
different settings for multiple DIMMS within the same NUMA node.

So I think it is valid for libvirt to expose the memory slot feature
by just specifying the RAM size and the guest NUMA node, and to infer
huge page usage, huge page size and host NUMA node from existing data
that libvirt has elsewhere in its domain XML document.
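
As a rough sketch of what that inference could look like, using the
<memoryBacking> and <numatune> examples from earlier in this mail.
None of this is libvirt code: the function names are hypothetical, the
<domain> wrapper is added only so the snippet parses, and falling back
to the <memory> nodeset when a guest node has no <memnode> is an
assumption of this sketch.

```python
import xml.etree.ElementTree as ET

DOMAIN_XML = """
<domain>
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0,1'/>
      <page size='1' unit='GiB' nodeset='2'/>
    </hugepages>
  </memoryBacking>
  <numatune>
    <memory mode="strict" nodeset="1-4,^3"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
    <memnode cellid="1" mode="strict" nodeset="2"/>
  </numatune>
</domain>
"""


def parse_nodeset(spec):
    """Expand a nodeset string such as '1-4,^3' into a set of ints."""
    include, exclude = set(), set()
    for part in spec.split(","):
        part = part.strip()
        target = include
        if part.startswith("^"):          # '^N' removes N from the set
            target, part = exclude, part[1:]
        if "-" in part:
            low, high = part.split("-")
            target.update(range(int(low), int(high) + 1))
        else:
            target.add(int(part))
    return include - exclude


def infer_slot_settings(root, guest_node):
    """Return (host nodes, (page size, unit) or None) for a guest node."""
    host_nodes = None
    for memnode in root.findall("./numatune/memnode"):
        if int(memnode.get("cellid")) == guest_node:
            host_nodes = parse_nodeset(memnode.get("nodeset"))
    if host_nodes is None:
        # Assumption: no <memnode> means the <memory> default applies.
        default = root.find("./numatune/memory")
        if default is not None:
            host_nodes = parse_nodeset(default.get("nodeset"))
    page = None
    for entry in root.findall("./memoryBacking/hugepages/page"):
        if guest_node in parse_nodeset(entry.get("nodeset")):
            page = (int(entry.get("size")), entry.get("unit"))
    return host_nodes, page


root = ET.fromstring(DOMAIN_XML)
for node in (0, 1, 2):
    print(node, infer_slot_settings(root, node))
```

So a slot definition would only need to carry the size and the guest
node; a DIMM plugged into guest node 2, for example, would pick up
1 GiB pages and the default host nodeset without the app repeating
either.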

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|