[libvirt][RFC PATCH] add a new 'default' option for attribute mode in numatune

Zhong, Luyao luyao.zhong at intel.com
Tue Aug 11 08:39:42 UTC 2020



On 8/7/2020 4:24 PM, Martin Kletzander wrote:
> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>
>>
>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>> Hi Libvirt experts,
>>>>
>>>> I would like to enhance the numatune snippet configuration. Given an
>>>> example snippet:
>>>>
>>>> <domain>
>>>>  ...
>>>>  <numatune>
>>>>    <memory mode="strict" nodeset="1-4,^3"/>
>>>>    <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>    <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>  </numatune>
>>>>  ...
>>>> </domain>
>>>>
>>>> Currently, the mode attribute is either 'interleave', 'strict', or
>>>> 'preferred'. I propose adding a new 'default' option, for the
>>>> following reason.
>>>>
>>>> Presume we are using cgroups v1. Libvirt sets cpuset.mems for all vCPU
>>>> threads according to the 'nodeset' in the memory element, and
>>>> translates each memnode element into a QEMU config option
>>>> (--object memory-backend-ram) per NUMA cell, which ends up invoking
>>>> the mbind() system call. [1]
>>>>
>>>> But what if we want to use the default memory policy and still have
>>>> each guest NUMA cell pinned to different host memory nodes? We can't
>>>> use mbind via the QEMU config options, because (quoting the man page)
>>>> "For MPOL_DEFAULT, the nodemask and maxnode arguments must specify
>>>> the empty set of nodes." [2]
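For illustration only, here is a minimal C sketch of that constraint at the
syscall level (using libnuma's numaif.h for the mbind() prototype; the
buffer, its size, and the node numbers are made-up examples, and this is
roughly what a "policy=bind,host-nodes=1" memory backend boils down to):

/* Illustrative sketch only: explicit node binding versus the default
 * policy, at the mbind() level. */
#include <numaif.h>      /* mbind(), MPOL_* (link with -lnuma) */
#include <sys/mman.h>
#include <stddef.h>

int main(void)
{
    size_t len = 2UL * 1024 * 1024;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* Explicit binding: a nodemask with host node 1 set. */
    unsigned long nodemask = 1UL << 1;
    mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0);

    /* MPOL_DEFAULT must be given an empty node set, so there is no way
     * to say "default policy, but restricted to node 1" through mbind. */
    mbind(buf, len, MPOL_DEFAULT, NULL, 0, 0);

    munmap(buf, len);
    return 0;
}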
>>>>
>>>> So my solution is to introduce a new 'default' option for the mode
>>>> attribute, e.g.:
>>>>
>>>> <domain>
>>>>  ...
>>>>  <numatune>
>>>>    <memory mode="default" nodeset="1-2"/>
>>>>    <memnode cellid="0" mode="default" nodeset="1"/>
>>>>    <memnode cellid="1" mode="default" nodeset="2"/>
>>>>  </numatune>
>>>>  ...
>>>> </domain>
>>>>
>>>> If the mode is 'default', libvirt should avoid generating the QEMU
>>>> command line option '--object memory-backend-ram', and instead use
>>>> cgroups to set cpuset.mems for each guest NUMA cell, combined with
>>>> the NUMA topology config. Presume the NUMA topology is:
>>>>
>>>> <cpu>
>>>>  ...
>>>>  <numa>
>>>>    <cell id='0' cpus='0-3' memory='512000' unit='KiB' />
>>>>    <cell id='1' cpus='4-7' memory='512000' unit='KiB' />
>>>>  </numa>
>>>>  ...
>>>> </cpu>
>>>>
>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and '2' for
>>>> vcpus 4-7.
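To make that concrete, a minimal sketch of what the per-vCPU cgroup side
could look like under cgroups v1 (the cgroup directory names below are
illustrative placeholders, not libvirt's exact layout):

/* Illustrative sketch: write "1" into cpuset.mems of the cgroups holding
 * vCPUs 0-3, and "2" for vCPUs 4-7.  Paths are placeholders. */
#include <stdio.h>

static int set_cpuset_mems(const char *cgroup_dir, const char *nodes)
{
    char path[256];
    snprintf(path, sizeof(path), "%s/cpuset.mems", cgroup_dir);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ret = (fputs(nodes, f) < 0) ? -1 : 0;
    fclose(f);
    return ret;
}

int main(void)
{
    char dir[256];

    /* guest cell 0 (vCPUs 0-3) -> host node 1 */
    for (int vcpu = 0; vcpu <= 3; vcpu++) {
        snprintf(dir, sizeof(dir),
                 "/sys/fs/cgroup/cpuset/machine/qemu-guest/vcpu%d", vcpu);
        set_cpuset_mems(dir, "1");
    }

    /* guest cell 1 (vCPUs 4-7) -> host node 2 */
    for (int vcpu = 4; vcpu <= 7; vcpu++) {
        snprintf(dir, sizeof(dir),
                 "/sys/fs/cgroup/cpuset/machine/qemu-guest/vcpu%d", vcpu);
        set_cpuset_mems(dir, "2");
    }
    return 0;
}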
>>>>
>>>>
>>>> Is this reasonable and feasible? Welcome any comments.
>>>>
>>>
>>> There are a couple of problems here.  The memory is not (always)
>>> allocated by the vCPU threads.  I also remember it not being allocated
>>> by the process, but in KVM, in a way that was not affected by the
>>> cgroup settings.
>>
>> Thanks for your reply. Maybe I don't get what you mean; could you give
>> me more context? But what I proposed will have no effect on other
>> memory allocations.
>>
> 
> Check how cgroups work.  We can set the memory nodes that a process will
> allocate from.  However, to set the node for the process (thread), QEMU
> needs to be started with the vCPU threads already spawned (albeit
> stopped).  And for that QEMU already allocates some memory.  Moreover,
> if extra memory is allocated after we set cpuset.mems, it is not
> guaranteed that it will be allocated by the vCPU in that NUMA cell; it
> might be done by the emulator instead, or by the KVM module in the
> kernel, in which case it might not be accounted to the process actually
> causing the allocation (as we've already seen with Linux).  In all these
> cases cgroups will not do what you want them to do.  The last case might
> be fixed; the first ones are by default not going to work.
> 
>>> That might be fixed now, however.
>>>
>>> But basically what we are up against is all the reasons why we started
>>> using QEMU's command line arguments for all of that.
>>>
>> I'm not proposing to use QEMU's command line arguments; on the contrary,
>> I want to use cgroup settings to support a new config/requirement. I'm
>> giving a solution for the case where we require the default memory
>> policy together with per-cell NUMA memory pinning.
>>
> 
> And I'm suggesting you look at the commit log to see why we *had* to add
> these command line arguments, even though I think I managed to describe
> most of them above already (except for one that _might_ already be fixed
> in the kernel).  I understand the git log is huge and the code around
> NUMA memory allocation was changing a lot, so I hope my explanation will
> be enough.
> 
Thank you for the detailed explanation, I think I get it now. We can't 
guarantee that the memory allocation matches the requirement, since there 
is a window of time before cpuset.mems is set.

>> Thanks,
>> Luyao
>>> Sorry, but I think it will more likely break things rather than fix
>>> them.  Maybe this could be dealt with by a switch in `qemu.conf` with
>>> a huge warning above it.
>>>
>> I'm not trying to fix something; I'm proposing how to support a new
>> requirement, as I stated above.
>>
> 
> I guess we should take a couple of steps back; I don't get what you are
> trying to achieve.  Maybe if you describe your use case it will be easier
> to reach a conclusion.
> 
Yeah, I do have a use case I didn't mention before. It relies on a kernel 
feature that is not merged yet, which we call memory tiering 
(https://lwn.net/Articles/802544/).

If memory tiering is enabled on the host, DRAM is the top-tier memory and 
PMEM (persistent memory) is the second-tier memory; PMEM shows up as a 
NUMA node without CPUs. In short, pages can be migrated between DRAM and 
PMEM based on DRAM pressure and how cold/hot they are.

We can configure multiple memory migration paths. For example, with
node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM,
we can make 0+2 one group, and 1+3 another group. Within each group, pages 
are allowed to migrate down (demotion) and up (promotion).

If **we want our VMs to utilize memory tiering and have a NUMA topology**, 
we need to handle the mapping of guest memory to host memory; that means 
we need to bind each guest NUMA node to a group of host memory nodes 
(a DRAM node + a PMEM node). For example, guest node 0 -> host nodes 0+2.

However, only a cgroup setting can make memory tiering work; if we use 
the mbind() system call, demoted pages will never go back to DRAM. 
That's why I propose adding the 'default' option and bypassing mbind in 
QEMU.
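To show the intended effect concretely: with the proposed 'default' mode,
guest node 0 would simply get the whole DRAM+PMEM group in cpuset.mems,
leaving the kernel free to demote and promote pages within that group. A
minimal sketch (the cgroup path is a placeholder; the real one depends on
libvirt's cgroup layout):

/* Illustrative sketch: guest cell 0 (vCPUs 0-3) is allowed to use the
 * whole DRAM+PMEM group, host nodes 0 and 2, instead of a single node. */
#include <stdio.h>

int main(void)
{
    for (int vcpu = 0; vcpu <= 3; vcpu++) {
        char path[256];
        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/cpuset/machine/qemu-guest/vcpu%d/cpuset.mems",
                 vcpu);
        FILE *f = fopen(path, "w");
        if (f) {
            fputs("0,2", f);   /* DRAM node 0 + PMEM node 2 */
            fclose(f);
        }
    }
    return 0;
}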

I hope I have made myself understandable. I'd appreciate it if you could 
give some suggestions.

Regards,
Luyao

>>> Have a nice day,
>>> Martin
>>>
>>>> Regards,
>>>> Luyao
>>>>
>>>> [1]https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379 
>>>>
>>>>
>>>> [2]https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>
>>>> -- 
>>>> 2.25.1
>>>>
>>



