[libvirt][RFC PATCH] add a new 'default' option for attribute mode in numatune

Zhong, Luyao luyao.zhong at intel.com
Fri Oct 16 12:33:51 UTC 2020


Hi Martin, Peter and other experts,

We previously reached a consensus that we need to introduce a new
"migratable" attribute. But during implementation I found that introducing
a new 'default' option for the existing mode attribute is still necessary.

I have an initial patch for 'migratable', and Peter has already given some
comments on it:
https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html

The current issue is that if 'migratable' is set, any 'mode' should be
ignored. Peter commented that I can't rely on the docs to tell users that
some config is invalid; I need to reject the config in the code, and I
completely agree with that. But the default value of 'mode' is 'strict',
which will always conflict with 'migratable', so in the end I still need to
introduce a new option for 'mode' that is a legal config when 'migratable'
is set.
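
For illustration (the attribute name and placement follow the 'migratable'
proposal discussed in the thread below, so treat the exact syntax as an
assumption), a config like this would have to be rejected, because the
implicit mode="strict" makes QEMU call mbind() and therefore conflicts with
'migratable':

<numatune>
  <memory nodeset="1-2" migratable="yes"/> <!-- mode omitted, defaults to 'strict' -->
</numatune>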

If we have the 'default' option, is 'migratable' still needed at all?

FYI: the 'mode' attribute corresponds to a memory policy, and there is
already a notion of a default memory policy in the kernel:
   quote:
     System Default Policy:  this policy is "hard coded" into the kernel.
(https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt)
So it might be easier to understand if we introduce a 'default' option 
directly.
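
As a sketch, with a 'default' mode the configuration from my original RFC
(quoted below) would simply become:

<numatune>
  <memory mode="default" nodeset="1-2"/>
  <memnode cellid="0" mode="default" nodeset="1"/>
  <memnode cellid="1" mode="default" nodeset="2"/>
</numatune>

and libvirt would skip the QEMU host-nodes/policy arguments entirely,
relying only on cgroups, so the kernel's default policy stays in effect and
pages remain migratable.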

Regards,
Luyao

On 8/26/2020 6:20 AM, Martin Kletzander wrote:
> On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
>>
>>
>> On 8/19/2020 11:24 PM, Martin Kletzander wrote:
>>> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Martin Kletzander <mkletzan at redhat.com>
>>>>> Sent: Monday, August 17, 2020 4:58 PM
>>>>> To: Zhong, Luyao <luyao.zhong at intel.com>
>>>>> Cc: libvir-list at redhat.com; Zang, Rui <rui.zang at intel.com>; Michal
>>>>> Privoznik
>>>>> <mprivozn at redhat.com>
>>>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option for
>>>>> attribute mode
>>>>> in numatune
>>>>>
>>>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>>>>> >
>>>>> >
>>>>> >On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>>>> >> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>>> >>>
>>>>> >>>
>>>>> >>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>>> >>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>>> >>>>> Hi Libvirt experts,
>>>>> >>>>>
>>>>> >>>>> I would like to enhance the numatune snippet configuration. Given
>>>>> >>>>> an example snippet:
>>>>> >>>>>
>>>>> >>>>> <domain>
>>>>> >>>>>  ...
>>>>> >>>>>  <numatune>
>>>>> >>>>>    <memory mode="strict" nodeset="1-4,^3"/>
>>>>> >>>>>    <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>> >>>>>    <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>> >>>>>  </numatune>
>>>>> >>>>>  ...
>>>>> >>>>> </domain>
>>>>> >>>>>
>>>>> >>>>> Currently, the mode attribute is either 'interleave', 'strict', or
>>>>> >>>>> 'preferred'. I propose adding a new 'default' option, for the
>>>>> >>>>> reasons given below.
>>>>> >>>>>
>>>>> >>>>> Presume we are using cgroups v1: libvirt sets cpuset.mems for all
>>>>> >>>>> vCPU threads according to 'nodeset' in the memory element, and
>>>>> >>>>> translates each memnode element into QEMU config options (--object
>>>>> >>>>> memory-backend-ram) per NUMA cell, which ends up invoking the
>>>>> >>>>> mbind() system call.[1]
>>>>> >>>>>
>>>>> >>>>> But what if we want to use the default memory policy and still
>>>>> >>>>> request that each guest NUMA cell be pinned to different host
>>>>> >>>>> memory nodes? We can't use mbind via QEMU config options, because
>>>>> >>>>> (quoting here) "For MPOL_DEFAULT, the nodemask and maxnode
>>>>> >>>>> arguments must specify the empty set of nodes." [2]
>>>>> >>>>>
>>>>> >>>>> So my solution is to introduce a new 'default' option for the
>>>>> >>>>> mode attribute, e.g.:
>>>>> >>>>>
>>>>> >>>>> <domain>
>>>>> >>>>>  ...
>>>>> >>>>>  <numatune>
>>>>> >>>>>    <memory mode="default" nodeset="1-2"/>
>>>>> >>>>>    <memnode cellid="0" mode="default" nodeset="1"/>
>>>>> >>>>>    <memnode cellid="1" mode="default" nodeset="2"/>
>>>>> >>>>>  </numatune>
>>>>> >>>>>  ...
>>>>> >>>>> </domain>
>>>>> >>>>>
>>>>> >>>>> If the mode is 'default', libvirt should avoid generating the QEMU
>>>>> >>>>> command line option '--object memory-backend-ram', and instead use
>>>>> >>>>> cgroups to set cpuset.mems per guest NUMA cell, combined with the
>>>>> >>>>> NUMA topology config. Presume the NUMA topology is:
>>>>> >>>>>
>>>>> >>>>> <cpu>
>>>>> >>>>>  ...
>>>>> >>>>>  <numa>
>>>>> >>>>>    <cell id='0' cpus='0-3' memory='512000' unit='KiB' />
>>>>> >>>>>    <cell id='1' cpus='4-7' memory='512000' unit='KiB' />
>>>>> >>>>>  </numa>
>>>>> >>>>>  ...
>>>>> >>>>> </cpu>
>>>>> >>>>>
>>>>> >>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and '2'
>>>>> >>>>> for vcpus 4-7.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> Is this reasonable and feasible? Welcome any comments.
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>> There are a couple of problems here.  The memory is not (always)
>>>>> >>>> allocated by the vCPU threads.  I also remember it not being
>>>>> >>>> allocated by the process, but in KVM in a way that was not affected
>>>>> >>>> by the cgroup settings.
>>>>> >>>
>>>>> >>> Thanks for your reply. Maybe I don't get what you mean, could you
>>>>> >>> give me more context? But what I proposed will have no effect on
>>>>> >>> other memory allocation.
>>>>> >>>
>>>>> >>
>>>>> >> Check how cgroups work.  We can set the memory nodes that a process
>>>>> >> will allocate from.  However to set the node for the process
>>>>> >> (thread) QEMU needs to be started with the vCPU threads already
>>>>> >> spawned (albeit stopped).  And for that QEMU already allocates some
>>>>> >> memory.  Moreover if extra memory was allocated after we set the
>>>>> >> cpuset.mems it is not guaranteed that it will be allocated by the
>>>>> >> vCPU in that NUMA cell, it might be done in the emulator instead or
>>>>> >> the KVM module in the kernel in which case it might not be accounted
>>>>> >> for the process actually causing the allocation (as we've already
>>>>> >> seen with Linux).  In all these cases cgroups will not do what you
>>>>> >> want them to do.  The last case might be fixed, the first ones are
>>>>> >> by default not going to work.
>>>>> >>
>>>>> >>>> That might be fixed now, however.
>>>>> >>>>
>>>>> >>>> But basically what we have against is all the reasons why we
>>>>> >>>> started using QEMU's command line arguments for all that.
>>>>> >>>>
>>>>> >>> I'm not proposing to use QEMU's command line arguments; on the
>>>>> >>> contrary, I want to use the cgroups settings to support a new
>>>>> >>> config/requirement. I gave a solution for the case where we require
>>>>> >>> the default memory policy together with memory NUMA pinning.
>>>>> >>>
>>>>> >>
>>>>> >> And I'm suggesting you look at the commit log to see why we *had* to
>>>>> >> add these command line arguments, even though I think I managed to
>>>>> >> describe most of them above already (except for one that _might_
>>>>> >> already be fixed in the kernel).  I understand the git log is huge
>>>>> >> and the code around NUMA memory allocation was changing a lot, so I
>>>>> >> hope my explanation will be enough.
>>>>> >>
>>>>> >Thank you for the detailed explanation, I think I get it now. We can't
>>>>> >guarantee that the memory allocation matches the requirement, since
>>>>> >there is a time window before cpuset.mems is set.
>>>>> >
>>>>>
>>>>> That's one of the things, although this one could be avoided (by
>>>>> setting a global cgroup before exec()).
>>>>>
>>>>> >>> Thanks,
>>>>> >>> Luyao
>>>>> >>>> Sorry, but I think it will more likely break rather than fix
>>>>> >>>> stuff.  Maybe this could be dealt with by a switch in `qemu.conf`
>>>>> >>>> with a huge warning above it.
>>>>> >>>>
>>>>> >>> I'm not trying to fix something; I'm proposing how to support a new
>>>>> >>> requirement, just as I stated above.
>>>>> >>
>>>>> >> I guess we should take a couple of steps back, I don't get what you
>>>>> >> are trying to achieve.  Maybe if you describe your use case it will
>>>>> >> be easier to reach a conclusion.
>>>>> >>
>>>>> >Yeah, I do have a use case I didn't mention before. It's a feature in
>>>>> >the kernel that is not merged yet; we call it memory tiering.
>>>>> >(https://lwn.net/Articles/802544/)
>>>>> >
>>>>> >If memory tiering is enabled on the host, DRAM is the top tier memory
>>>>> >and PMEM (persistent memory) is the second tier; PMEM shows up as a
>>>>> >NUMA node without CPUs. In short, pages can be migrated between DRAM
>>>>> >and PMEM based on DRAM pressure and how cold/hot they are.
>>>>> >
>>>>> >We could configure multiple memory migration paths. For example, with
>>>>> >node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we can make
>>>>> >0+2 a group and 1+3 a group. Within each group, pages are allowed to
>>>>> >migrate down (demotion) and up (promotion).
>>>>> >
>>>>> >If **we want our VMs to utilize memory tiering together with a NUMA
>>>>> >topology**, we need to handle the mapping of guest memory to host
>>>>> >memory; that means we need to bind each guest NUMA node to a group of
>>>>> >memory nodes (DRAM node + PMEM node) on the host. For example, guest
>>>>> >node 0 -> host nodes 0+2.
>>>>> >
>>>>> >However, only the cgroups setting makes memory tiering work; if we
>>>>> >use the mbind() system call, demoted pages will never go back to DRAM.
>>>>> >That's why I propose adding a 'default' option and bypassing mbind in
>>>>> >QEMU.
>>>>> >
>>>>> >I hope I have made myself clear. I'd appreciate any suggestions you
>>>>> >could give.
>>>>> >
>>>>>
>>>>> This comes around every couple of months/years and bites us in the
>>>>> back no matter what way we go (every time there is someone who wants it
>>>>> the other way).  That's why I think there could be a way for the user
>>>>> to specify whether they will likely move the memory or not and based on
>>>>> that we would specify `host-nodes` and `policy` to qemu or not.  I
>>>>> think I even suggested this before (or probably delegated it to someone
>>>>> else for a suggestion so that there is more discussion), but nobody
>>>>> really replied.
>>>>>
>>>>> So what we need, I think, is a way for someone to set per-domain
>>>>> information on whether we should bind the memory to nodes in a
>>>>> changeable fashion or not.  I'd like to have it in as well.  The way we
>>>>> need to do that is, probably, per-domain, because adding yet another
>>>>> switch for each place in the XML where we can select a NUMA memory
>>>>> binding would be a suicide.  There should also be no need for this to
>>>>> be enabled per memory-(module, node), so it should work fine.
>>>>>
>>>>
>>>> Thanks for letting us know your vision about this.
>>>> From what I understood, the "changeable fashion" means that the guest
>>>> NUMA cell binding can be changed out of band after the initial binding,
>>>> either by the system admin or the operating system (memory tiering in
>>>> our case), or whatever third party it is.  Is that perception correct?
>>>
>>> Yes.  If the user wants to have the possibility of changing the binding,
>>> then we use *only* cgroups.  Otherwise we use the qemu parameters that
>>> will make qemu call mbind() (as that has other pros mentioned above).
>>> The other option would be extra communication between QEMU and libvirt
>>> during start to let us know when to set what cgroups etc., but I don't
>>> think that's worth it.
>>>
>>>> It seems to me that the mbind() or set_mempolicy() system calls do not
>>>> offer that flexibility of changing afterwards. So in the case of
>>>> QEMU/KVM, I can only think of cgroups.
>>>> So to be specific, if we had this additional
>>>> "memory_binding_changeable" option specified, we would try to do the
>>>> guest NUMA constraining via cgroups whenever possible. There will
>>>> probably also be conflicts in options or things that cgroups cannot do.
>>>> For such cases we'd fail the domain.
>>>
>>> Basically we'll do what we're doing now and skip the qemu `host-nodes`
>>> and `policy` parameters with the new option.  And of course we can fail
>>> with a nice error message if someone wants to move the memory without
>>> the option selected and so on.
>>
>> Thanks for your comments.
>>
>> I'd like to get clearer on how to define the interface in the domain XML,
>> then I can go further into the implementation.
>>
>> As you mentioned, a per-domain option will be better than per-node. I went
>> through the libvirt domain format to look for a proper place for this
>> option, and I'm thinking we could still use the numatune element to
>> configure it.
>>
>> <numatune>
>>   <memory mode="strict" nodeset="1-4,^3"/>
>>   <memnode cellid="0" mode="strict" nodeset="1"/>
>>   <memnode cellid="2" mode="preferred" nodeset="2"/>
>> </numatune>
>>
>> Coincidentally, the optional memory element specifies how to allocate
>> memory for the domain process on a NUMA host. So can we utilize this
>> element and introduce a new mode like "changeable" or whatever? Do you
>> have a better name?
>>
> 
> Yeah, I was thinking something along the lines of:
> 
> <numatune>
>     <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no" />
>     <memnode cellid="0" mode="strict" nodeset="1"/>
>     <memnode cellid="2" mode="preferred" nodeset="2"/>
> </numatune>
> 
>> If the memory mode is set to 'changeable', we could ignore the mode
>> setting for each memnode, and then we would only configure via cgroups. I
>> have not dived into the code yet, but I expect it could work.
>>
> 
> Yes, the example above gives the impression of the attribute being
> available per-node.  But that could be handled in the documentation.
>
> Specifying it per-node seems very weird, why would you want the memory to
> be hard-locked, but for some guest nodes only?
> 
>> Thanks,
>> Luyao
>>
>>>
>>>> If you agree with the direction, I think we can dig deeper to see what
>>>> will
>>>> come out.
>>>>
>>>> Regards,
>>>> Zang, Rui
>>>>
>>>>
>>>>> Ideally we'd discuss it with others, but I think I am only one of a few
>>>>> people who dealt with issues in this regard.  Maybe Michal (Cc'd) also
>>>>> dealt with some things related to the binding, so maybe he can chime in.
>>>>>
>>>>> >regards,
>>>>> >Luyao
>>>>> >
>>>>> >>>> Have a nice day,
>>>>> >>>> Martin
>>>>> >>>>
>>>>> >>>>> Regards,
>>>>> >>>>> Luyao
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> [1]https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> [2]https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>> >>>>>
>>>>> >>>>> --
>>>>> >>>>> 2.25.1
>>>>> >>>>>
>>>>> >>>
>>>>> >
>>



