[libvirt] Question about QEMU + hugepages + NUMA memory + migration

Thu Oct 6 04:58:54 UTC 2016

On Tue, Oct 04, 2016 at 11:28:15AM +0200, Martin Kletzander wrote:
> On Tue, Oct 04, 2016 at 02:34:57PM +1100, Sam Bobroff wrote:
> >On Mon, Oct 03, 2016 at 04:27:25PM +0200, Martin Kletzander wrote:
> >>On Mon, Aug 01, 2016 at 02:01:22PM +1000, Sam Bobroff wrote:
> >>>Hi libvirt people,
> >>>
> >>>I've been looking at a (probable) bug and I'm not sure how to progress. The
> >>>situation is a bit complicated and involves both QEMU and libvirt (and I think
> >>>it may have been looked at already) so I would really appreciate some advice on
> >>>how to approach it. I'm using a pretty recent master version of libvirt from
> >>>git and I'm testing on a ppc64le host with a similar guest but this doesn't
> >>>seem to be arch-specific.
> >>>
> >>
> >>Sorry I haven't replied earlier and I'm respawning this thread now, I
> >>just noticed this thread being marked for replying only after I fixed
> >>similar thing to what you describe.
> >
> >No problem! :-)
> >
> >>>If I create a QEMU guest (e.g. via virt-install) that requests both hugepage
> >>>backing on the host and NUMA memory placement on the host, the NUMA placement
> >>>seems to be ignored. If I do:
> >>>
> >>># echo 0 > /proc/sys/vm/nr_hugepages
> >>># echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages
> >>># virt-install --name tmp --memory=4096 --graphics none --memorybacking hugepages=yes --disk none --import --wait 0 --numatune=8
> >>>
> >>
> >>So to be clear, we want to use 16M hugepages, but only allocated from
> >>node 8 while the only allocated ones are on node 0.
> >
> >Yes.
> >
> >>>... then hugepages are allocated on node 0 and the machine starts successfully,
> >>>which seems like a bug.
> >>>
> >>
> >>This definitely is a bug.  But I'm afraid it's not in libvirt.  I'll do
> >>some explaining first.
> >
> >OK.
> >
> >>>I believe it should fail to start due to insufficient memory, and in fact that
> >>>is what happens if cgroup support isn't detected in the host: there seems to be
> >>>a fall-back path in libvirt (probably using mbind()) that works as I would
> >>>expect.
> >>>
> >>
> >>Yes, we are using multiple things to enforce NUMA binding:
> >>
> >> 1) cgroups - This restricts all the allocations made on the account of
> >>              the process, even when KVM does that.  We cannot use that
> >>              until the QEMU process is running because it needs to
> >>              allocate some data from the DMA region which is usually
> >>              only on one node.
> >>
> >> 2) numactl's mbind() - it doesn't apply to kernel allocations, so
> >>                        whatever KVM allocates it's not restricted by
> >>                        this.  We use this always on the process before
> >>                        exec()-ing qemu binary (if compiled with
> >>                        support to numactl, of course)
> >>
> >> 3) memory-backend-file's host-nodes parameter - This is the best
> >>                                     option that is used when QEMU
> >>                                     supports it, but due to migration
> >>                                     compatibility we use it only when
> >>                                     requested with <memnode/>
> >>                                     elements.
> >
> >OK good, that matches what I thought I'd worked out :-)
> >
> >>>Note: the relevant part of the guest XML seems to be this:
> >>>»·······<memoryBacking>
> >>>»·······»·······<hugepages/>
> >>>»·······</memoryBacking>
> >>>»·······<numatune>
> >>>»·······»·······<memory mode='strict' nodeset='8'/>
> >>>»·······</numatune>
> >>>
> >>>It seems fairly clear what is happening: although QEMU is capable of allocating
> >>>hugepages on specific NUMA nodes (using "memory-backend-file") libvirt is not
> >>>passing those options to QEMU in this situation.
> >>>
> >>
> >>I'm guessing you're talking about the host-nodes= parameter.
> >
> >I am, yes.
> >
> >>>I investigated this line of reasoning and if I hack libvirt to pass those
> >>>options to QEMU it does indeed fix the problem... but it renders the machine
> >>>state migration-incompatible with unfixed versions. This seems to have been why
> >>>this hasn't been fixed already :-(
> >>>
> >>>So what can we do?
> >>>
> >>
> >>From virt-install POV I would suggest adding some functionality that
> >>would probe whether it works with <memnode/> and use it if possible.  Or
> >>dig deep down in kvm or qemu to see why the allocations do not conform
> >>to the mbind().
> >
> >Sorry, I don't think I understand your answer.
> >
> >You seem to be saying that it doesn't matter that the above XML causes memory
> >to be allocated incorrectly, as long as we don't use it (by changing
> >virt-install to use something else). But shouldn't the XML work correctly even
> >if it was generated by hand? (Sorry if I'm misunderstanding!)
> >
> 
> It should, but libvirt is doing what it can (see later).  At least what
> I thought.  But while composing the mail I must say that's not totally
> true.
> 
> >And yes, it does work if you specify memnodes but wouldn't adding them change
> >the guest definition (in a guest visible way)? (Because memnode requires
> >cellid, a guest NUMA node ID, and my test case has no guest NUMA nodes.)
> >
> 
> I totally missed that you're not using NUMA at all.  You can do two
> things then.  Either add one NUMA node (which technically isn't NUMA).
> That will not change that much.  You can also add some memory DIMMs
> which can be configured to have host-nodes= used too.  However, those
> require NUMA as well IIRC, so they will automatically add NUMA with one
> node, I think.

Yes, either option should work around the problem (I hadn't considered adding
DIMMs like that -- interesting!) but I'm more concerned with working out if I
should try to patch libvirt, and if so, how to do it without creating migration
problems ;-)

> >I think I already know why QEMU's allocations don't conform to the mbind() set
> >up by libvirt: Doesn't libvirt start QEMU machines paused, and once the VCPU
> >threads have started, it mbinds() them and starts the guest? But, -mem-prealloc
> >has been used so QEMU allocates the VCPU memory immediately during startup,
> >from the main thread. By the time the mbind() is done by libvirt, it's too
> >late (and not on the main thread anyway). (It doesn't seem like this can be
> >considered a QEMU bug because it can do the mbind() correctly itself if passed
> >the correct options.)
> >
> 
> Well, what I thought is happening is, that libvirt does fork(), in the
> child it calls numa_set_membind() on itself, and then exec()s qemu.
> Looking at the code to make sure, I see it has changed and we don't use
> numa_set_membind() if cgroups are available.  That is due to us not
> being able to change the binding later on.  Although it looks like that
> fixes only some problems as it still won't be changable when there are
> no cgroups used.
> 
> >>>I assume it's not acceptible to just break migration with a bugfix, and I can
> >>>only think of two ways to fix migration:
> >>>
> >>
> >>Unfortunately it's not, especially when it's not a bugfix. and that's
> >>why there's the extra logic for it in.
> >>
> >>>(a) Add a new flag to the XML, and for guests without the flag, maintain the
> >>>old buggy behaviour (and therefore migration compatability).
> >>>
> >>
> >>It kinda is there, that's the <memnode/> setting.
> >>
> >>>(b) Hack QEMU so that migration can succeed between un-fixed and fixed
> >>>versions. (And possibly also in the reverse direction?)
> >>>
> >>
> >>That's probably not possible to do.
> >
> >Aaaactually, I've done some investigation and it does seem to be possible but
> >the question is "would it be a good idea".
> >
> 
> I must say I don't know, that's probably question for qemu-devel.  But
> if that's possible, it could fix a bunch of stuff we need to deal with.
> But my guess is that it's not working for a reason.

I'm happy to explore this on qemu-devel, but I think I should have a clear idea
of what to propose before I do :-) When I was experimenting with the QEMU code,
it was fairly easy to allow migrations between "mem-backend" machines and "no
mem-backend" machines as long as the number of memory regions didn't change.

It would allow machines like this:
-mem-prealloc -mem-path /dev/hugepages/...

To be compatible with machines like this:
-object memory-backend-file,id=mem0,prealloc=yes,mem-path=/dev/hugepages/...

(The memory size must be the same and there must be either no NUMA or a single
NUMA node.)

In these cases the only difference I've found in the machine state is the name
of the memory region. In the first case the name is a hard-coded value that
differs by architecture (e.g. "spapr_ppc.ram" on PowerPC). In the
memory-backend case, the name is "/objects/mem0" (based on the "id" field). My
proposal would be that the migration code should, when there is only a single
memory region (on both sides), with the correct size, to always accept it
regardless of it's name.

This is a pretty restricted case but it includes the case that I raised in my
original post. Would it also help in the case you talk about below? (Sorry, I
had a quick look at your patch but I couldn't immediately see if it would or
not.) Would this be useful?

On the other hand, if the number of memory regions differs, it's still possible
get it to work but it is much more complicated and I think less likely to be
accepted upstream.  Would this be a lot more useful? Is it worth pursuing
further to try to improve it?

(And yes, I wouldn't be surprised to find that there's some good reason why
this hasn't been done before!)

> >(I think this is the same case you were dealing with in your other recent work.)
> >
> 
> You are probably talking about ff3112f3dc2c276a7e387ff7bb86f4fbbdf7bf2c.
> Yes, that is one of the many "we need to use the old behaviour because
> migration would be broken otherwise even though not using the new one
> means some stuff won't work..." and such.
> 
> >(What actually causes the migration failure is a mismatch in the "ramblocks":
> >either the name, or name and number of them, change when memory backend objects
> >are added or removed, but these aren't visible to the guest at all and if the
> >migration code can accept the data, the guest is OK. There is no other problem
> >with the migration data. Really, I think it's a problem with the QEMU migration
> >format: if it specified the RAM state using "position + length" rather than
> >"RAMblock name + length" then there would be no migration problems with these
> >changes. But I'm no expert on QEMU migration...)
> >
> 
> Looks like way more expert than me ;)
> 
> >>>I don't like (a) because it's visible in the XML, and would have to be carried
> >>>forever (or at least a long time?).
> >>>
> >>>I don't really like (b) either because it's tricky, and even if it could be
> >>>made to work reliably, it would add mess and risk to the migration code. I'm
> >>>not sure how the QEMU community would feel about it either. However, I did hack
> >>>up some code and it worked at least in some simple cases.
> >>>
> >>>Can anyone see a better approach? Is anyone already working on this?
> >>>
> >>
> >>Again, sorry for not replying earlier.
> >
> >Thanks for your time, this issue is so complicated it's making my brain hurt ;-)
> >
> 
> As always when I'm trying to fix something migration-related regarding
> last two years of releases =)
> 
> >>Martin
> >
> >Cheers,
> >Sam.
> >