[libvirt][RFC PATCH] add a new 'default' option for attribute mode in numatune

Martin Kletzander mkletzan at redhat.com
Mon Aug 17 08:57:40 UTC 2020


On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>
>
>On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>
>>>
>>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>>> Hi Libvirt experts,
>>>>>
>>>>> I would like to enhance the numatune configuration. Here is an
>>>>> example snippet:
>>>>>
>>>>> <domain>
>>>>>  ...
>>>>>  <numatune>
>>>>>    <memory mode="strict" nodeset="1-4,^3"/>
>>>>>    <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>    <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>  </numatune>
>>>>>  ...
>>>>> </domain>
>>>>>
>>>>> Currently, the mode attribute is either 'interleave', 'strict', or
>>>>> 'preferred'. I propose adding a new 'default' option, for the
>>>>> following reason.
>>>>>
>>>>> Presume we are using cgroups v1. Libvirt sets cpuset.mems for all
>>>>> vCPU threads according to the 'nodeset' in the memory element, and
>>>>> translates each memnode element into qemu config options (--object
>>>>> memory-backend-ram) for the corresponding NUMA cell, which ends up
>>>>> invoking the mbind() system call. [1]
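>>>>>
>>>>> For illustration, the first snippet above roughly translates into
>>>>> qemu options like the following (the ids and sizes here are made up;
>>>>> the exact options depend on the machine config and qemu version):
>>>>>
>>>>>   -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=bind \
>>>>>   -numa node,nodeid=0,memdev=ram-node0 \
>>>>>   -object memory-backend-ram,id=ram-node2,size=512M,host-nodes=2,policy=preferred \
>>>>>   -numa node,nodeid=2,memdev=ram-node2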
>>>>>
>>>>> But what if we want to use the default memory policy and still have
>>>>> each guest NUMA cell pinned to different host memory nodes? We can't
>>>>> use mbind via the qemu config options, because (I quote here) "For
>>>>> MPOL_DEFAULT, the nodemask and maxnode arguments must specify the
>>>>> empty set of nodes." [2]
>>>>>
>>>>> So my solution is to introduce a new 'default' option for the mode
>>>>> attribute, e.g.:
>>>>>
>>>>> <domain>
>>>>>  ...
>>>>>  <numatune>
>>>>>    <memory mode="default" nodeset="1-2"/>
>>>>>    <memnode cellid="0" mode="default" nodeset="1"/>
>>>>>    <memnode cellid="1" mode="default" nodeset="2"/>
>>>>>  </numatune>
>>>>>  ...
>>>>> </domain>
>>>>>
>>>>> If the mode is 'default', libvirt should avoid generating the
>>>>> '--object memory-backend-ram' qemu command line options, and instead
>>>>> use cgroups to set cpuset.mems for each guest NUMA cell, combined
>>>>> with the NUMA topology config. Presume the NUMA topology is:
>>>>>
>>>>> <cpu>
>>>>>  ...
>>>>>  <numa>
>>>>>    <cell id='0' cpus='0-3' memory='512000' unit='KiB' />
>>>>>    <cell id='1' cpus='4-7' memory='512000' unit='KiB' />
>>>>>  </numa>
>>>>>  ...
>>>>> </cpu>
>>>>>
>>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and '2' for
>>>>> vcpus 4-7.
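>>>>>
>>>>> A minimal sketch of what that would mean at the cgroups v1 level (the
>>>>> cgroup path and names here are made up; the real layout depends on
>>>>> how libvirt names the machine scope):
>>>>>
>>>>>   CG=/sys/fs/cgroup/cpuset/machine.slice/<machine-scope>
>>>>>   for i in 0 1 2 3; do echo 1 > "$CG/vcpu$i/cpuset.mems"; done
>>>>>   for i in 4 5 6 7; do echo 2 > "$CG/vcpu$i/cpuset.mems"; done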
>>>>>
>>>>>
>>>>> Is this reasonable and feasible? Welcome any comments.
>>>>>
>>>>
>>>> There are a couple of problems here.  The memory is not (always)
>>>> allocated by the vCPU threads.  I also remember it sometimes not being
>>>> allocated by the process at all, but in KVM, in a way that was not
>>>> affected by the cgroup settings.
>>>
>>> Thanks for your reply. Maybe I don't get what you mean; could you give
>>> me more context? But what I proposed will have no effect on other
>>> memory allocations.
>>>
>>
>> Check how cgroups work.  We can set the memory nodes that a process will
>> allocate from.  However, to set the nodes for the process (thread), QEMU
>> needs to be started with the vCPU threads already spawned (albeit
>> stopped), and by that point QEMU has already allocated some memory.
>> Moreover, if extra memory is allocated after we set cpuset.mems, it is
>> not guaranteed that it will be allocated by the vCPU thread in that NUMA
>> cell; it might be done in the emulator instead, or by the KVM module in
>> the kernel, in which case it might not be accounted to the process
>> actually causing the allocation (as we've already seen with Linux).  In
>> all these cases cgroups will not do what you want them to do.  The last
>> case might be fixed; the first ones are by default not going to work.
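>>
>> To make the ordering concrete, a simplified sketch of how startup goes
>> today (not the exact libvirt code path):
>>
>>   1) libvirt starts qemu paused (-S); qemu already allocates guest RAM
>>      and emulator data structures at this point,
>>   2) libvirt learns the vCPU thread IDs, moves them into the per-vcpu
>>      cpuset cgroups and writes cpuset.mems,
>>   3) only allocations made after 2) by those threads themselves honour
>>      the new cpuset.mems; earlier allocations, and allocations done by
>>      the emulator or in KVM, do not.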
>>
>>>> That might be
>>>> fixed now,
>>>> however.
>>>>
>>>> But basically what we are up against is all the reasons why we started
>>>> using QEMU's command line arguments for all that.
>>>>
>>> I'm not proposing to use QEMU's command line arguments; on the contrary,
>>> I want to use a cgroups setting to support a new config/requirement. I'm
>>> proposing a solution for the case where we want the default memory
>>> policy together with per-cell NUMA memory pinning.
>>>
>>
>> And I'm suggesting you look at the commit log to see why we *had* to add
>> these command line arguments, even though I think I managed to describe
>> most of them above already (except for one that _might_ already be fixed
>> in the kernel).  I understand the git log is huge and the code around
>> NUMA memory allocation was changing a lot, so I hope my explanation will
>> be enough.
>>
>Thank you for the detailed explanation, I think I get it now. We can't
>guarantee that memory allocations match the requirement, since there is a
>time window before cpuset.mems is set.
>

That's one of the things, although this one could be avoided (by setting a
global cgroup before exec()).
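
Very roughly, and only as a sketch with made-up names (libvirt would do
this through its own cgroup code rather than the libcgroup tools):

  cgcreate -g cpuset:machine.slice/testvm
  echo 1-2 > /sys/fs/cgroup/cpuset/machine.slice/testvm/cpuset.mems
  echo 0-7 > /sys/fs/cgroup/cpuset/machine.slice/testvm/cpuset.cpus
  cgexec -g cpuset:machine.slice/testvm qemu-system-x86_64 ...

That way even qemu's early allocations are already constrained to the
requested nodes.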

>>> Thanks,
>>> Luyao
>>>> Sorry, but I think it will more likely break rather than fix stuff.
>>>> Maybe this
>>>> could be dealt with by a switch in `qemu.conf` with a huge warning above
>>>> it.
>>>>
>>> I'm not trying to fix something; I'm proposing how to support a new
>>> requirement, just as I stated above.
>>>
>>
>> I guess we should take a couple of steps back; I don't get what you are
>> trying to achieve.  Maybe if you describe your use case it will be easier
>> to reach a conclusion.
>>
>Yeah, I do have a use case I didn't mention before. It's a kernel feature
>that is not merged yet, called memory tiering.
>(https://lwn.net/Articles/802544/)
>
>If memory tiering is enabled on the host, DRAM is the top-tier memory and
>PMEM (persistent memory) is the second-tier memory; PMEM shows up as a
>NUMA node without CPUs. In short, pages can be migrated between DRAM and
>PMEM based on DRAM pressure and on how cold/hot they are.
>
>We can configure multiple memory migration paths. For example, with
>node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM
>we can make 0+2 one group and 1+3 another group. Within each group, pages
>are allowed to migrate down (demotion) and up (promotion).
>
>If **we want our VMs to utilize memory tiering and have a NUMA topology**,
>we need to handle the mapping of guest memory to host memory; that means
>we need to bind each guest NUMA node to a group of host memory nodes (DRAM
>node + PMEM node). For example, guest node 0 -> host nodes 0+2.
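>
>With the proposed 'default' mode, that mapping could be written as
>something like this (just a sketch, assuming a two-cell guest and the node
>numbers above):
>
><numatune>
>  <memory mode="default" nodeset="0-3"/>
>  <memnode cellid="0" mode="default" nodeset="0,2"/>
>  <memnode cellid="1" mode="default" nodeset="1,3"/>
></numatune>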
>
>However, only the cgroups setting can make memory tiering work; if we use
>the mbind() system call, demoted pages will never go back to DRAM. That's
>why I propose adding the 'default' option and bypassing mbind in QEMU.
>
>I hope I have made myself clear. I'd appreciate any suggestions you could
>give.
>

This comes around every couple of months/years and bites us in the back no
matter which way we go (every time there is someone who wants it the other
way).  That's why I think there could be a way for the user to specify
whether they will likely move the memory or not, and based on that we would
either pass `host-nodes` and `policy` to qemu or not.  I think I even
suggested this before (or probably delegated it to someone else for a
suggestion so that there would be more discussion), but nobody really
replied.

So what we need, I think, is a way for someone to set per-domain information
on whether we should bind the memory to nodes in a changeable fashion or
not.  I'd like to have that in as well.  The way we need to do it is,
probably, per-domain, because adding yet another switch for each place in
the XML where we can select a NUMA memory binding would be suicide.  There
should also be no need for this to be enabled per memory module or node, so
it should work fine.
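
Just to make it concrete, a purely hypothetical sketch (the attribute name
and placement are made up, not a proposal for the final syntax): a single
domain-wide knob, e.g.

  <numatune>
    <memory mode='strict' nodeset='1-2' migratable='yes'/>
  </numatune>

where 'migratable=yes' would mean we only set things up through cgroups and
do not pass host-nodes/policy on the qemu command line, so the kernel stays
free to move the memory later.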

Ideally we'd discuss it with others, but I think I am one of only a few
people who have dealt with issues in this regard.  Maybe Michal (Cc'd) also
dealt with some things related to the binding, so maybe he can chime in.

>regards,
>Luyao
>
>>>> Have a nice day,
>>>> Martin
>>>>
>>>>> Regards,
>>>>> Luyao
>>>>>
>>>>> [1]https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>>>
>>>>>
>>>>> [2]https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>>
>>>>> --
>>>>> 2.25.1
>>>>>
>>>
>