[libvirt][RFC PATCH] add a new 'default' option for attribute mode in numatune

Zhong, Luyao luyao.zhong at intel.com
Thu Nov 12 06:19:06 UTC 2020



On 11/9/2020 7:21 PM, Martin Kletzander wrote:
> On Sat, Nov 07, 2020 at 10:41:52AM +0800, Zhong, Luyao wrote:
>>
>>
>> On 11/4/2020 9:02 PM, Martin Kletzander wrote:
>>> On Fri, Oct 16, 2020 at 10:38:51PM +0800, Zhong, Luyao wrote:
>>>> On 10/16/2020 9:32 PM, Zang, Rui wrote:
>>>>>
>>>>> How about this: if “migratable” is set, “mode” is ignored/omitted,
>>>>> so any explicit setting of “mode” will be rejected with an error
>>>>> indicating an invalid configuration? We can say in the doc that
>>>>> “migratable” and “mode” shall not be set together, so even the
>>>>> default value of “mode” is not taken.
>>>>>
>>>> If "mode" is not set, it's the same as setting "strict" value ('strict'
>>>> is the default value). It involves some code detail, it will be
>>>> translated to enumerated type, the value is 0 when mode not set or set
>>>> to 'strict'. The code is in some fixed skeleton, so it's not easy to
>>>> modify.
>>>>
>>>
>>> Well, I see it as "strict". It does not mean "strict cgroup setting",
>>> because cgroups are just one of the ways to enforce this.  Look at it
>>> this way:
>>>
>>> mode can be:
>>>   - strict: only these nodes can be used for the memory
>>>   - preferred: these nodes should be preferred, but allocation should
>>>     not fail
>>>   - interleave: interleave the memory between these nodes
>>>
>>> Due to the naming this maps to cgroup settings 1:1.
>>>
> 
> Sorry, I misspoke, this does not map to cgroup settings at all, in 
> cgroups you
> can only set "strict" (by using cpuset.mems) and that's it.  There is no 
> way to
> set preferred or interleaved mapping, sorry.
> 
memory policy is independent of cpuset.mems

I quote here "Memory policies should not be confused with cpusets 
(Documentation/admin-guide/cgroup-v1/cpusets.rst) which is an 
administrative mechanism for restricting the nodes from which memory may 
be allocated by a set of processes. Memory policies are a programming 
interface that a NUMA-aware application can take advantage of. When both 
cpusets and policies are applied to a task, the restrictions of the 
cpuset takes priority. See Memory Policies and cpusets below for more 
details."[1]

So using cpuset.mems does not mean setting a "strict" memory policy, if I 
understand it correctly; we can combine cpuset.mems with any memory policy.

[1]https://www.infradead.org/~mchehab/kernel_docs/admin-guide/mm/numa_memory_policy.html
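
To make the point concrete, here is a minimal C sketch of what I mean (my 
own illustration, not code from libvirt or QEMU; it needs libnuma's 
<numaif.h> wrappers and is built with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Keep the kernel's system default policy.  MPOL_DEFAULT requires
     * the empty node set, so nodemask/maxnode are NULL/0; the same
     * restriction applies to mbind(). */
    if (set_mempolicy(MPOL_DEFAULT, NULL, 0) != 0) {
        perror("set_mempolicy");
        return EXIT_FAILURE;
    }

    /* Allocations made now are placed by the default policy, yet they
     * are still confined to the nodes listed in this task's cpuset.mems
     * (set from outside, e.g. by libvirt through cgroups).  The cpuset
     * restriction and the per-task policy are independent mechanisms. */
    void *buf = malloc(64UL << 20);
    if (!buf)
        return EXIT_FAILURE;
    free(buf);
    return EXIT_SUCCESS;
}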
>>> But now we have another way of enforcing this, using qemu cmdline
>>> option.  The
>>> names actually map 1:1 to those as well:
>>>
>>>
>>> https://gitlab.com/qemu-project/qemu/-/blob/master/qapi/machine.json#L901 
>>>
>>>
>>> So my idea was that we would add a movable/migratable/whatever attribute
>>> that would tell us which way of enforcing it we use, because there does
>>> not seem to be a "one size fits all" solution.  Am I misunderstanding
>>> this discussion?  Please correct me if I am.  Thank you.
>>>
>> Actually I need support for the default memory policy (the policy that is
>> 'hard coded' into the kernel); I thought "migratable" was enough to indicate
> 
> So I am getting on your track, yes.  What you mean is basically 
> MPOL_DEFAULT and
> that's where the naming probably comes from, right?  Anyway, what we're 
> trying
> to do is not restrict us from other options, even if they are only 
> possible in
> the future.  So instead of adding "default" which would actually mean 
> "strict"
> (because you still use cpuset.mems) which would restrict us from 
> potentially
> being able to migrate with a different policy than "strict" (even though it
> might not make sense for "preferred", for example) and it's also a bit 
> confusing
As I mentioned above, using "cpuset.mems" does not imply a "strict" memory 
policy.
> for users, I suggested we add "migratable" which restricts just the qemu
> options.  Of course, "migratable" only makes sense with "strict" now, 
> but that's
> fine.  The XML provides a possibility for something we don't support, 
> but we can
> forbid that combination for the sake of clarity of the other option that 
> _is_
> supported.
> 
> I'll try to propose my idea based on your patch from Nov 3rd and it might
> improve my communication.  It feels difficult for me to explain myself
> without the code.  I just need to deal with a lot of other emails first.
> 
Thank you in advance. Let's discuss further based on that patch.

Regards,
Luyao
>> that we rely on the operating system to handle the memory policy. So when
>> "migratable" is set, "mode" should not be set. But when I was coding, I
>> found that the default value of "mode" is "strict"; it is always "strict"
>> even if "migratable" is yes, which means we would configure two different
>> memory policies at the same time. So I still need a new option for "mode"
>> that does not conflict with "migratable", and if we have that new
>> option ("default") for "mode", it seems we can drop "migratable".
>>
>> Besides, we can make "mode" a "one size fits all" solution: just reject
>> a different "mode" value in a memnode element when "mode" is "default"
>> in the memory element.
>>
>> I summarized it in the new email:
>> https://www.redhat.com/archives/libvir-list/2020-November/msg00084.html
>>
>> Sorry I didn't make it easy to understand.
>>
>> Regards,
>> Luyao
>>>> So I need an option to indicate "I don't specify any mode".
>>>>
>>>>>> On Oct 16, 2020, at 20:34, Zhong, Luyao <luyao.zhong at intel.com> wrote:
>>>>>>
>>>>>> Hi Martin, Peter and other experts,
>>>>>>
>>>>>> We reached a consensus before that we need to introduce a new
>>>>>> "migratable" attribute. But during implementation, I found that
>>>>>> introducing a new 'default' option for the existing mode attribute
>>>>>> is still necessary.
>>>>>>
>>>>>> I have an initial patch for 'migratable' and Peter already gave some
>>>>>> comments:
>>>>>> https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html 
>>>>>>
>>>>>>
>>>>>> The current issue is that if I set 'migratable', any 'mode' should be
>>>>>> ignored. Peter commented that I can't rely on the docs to tell users
>>>>>> that some config is invalid; I need to reject the config in the code,
>>>>>> and I completely agree with that. But the default value of 'mode' is
>>>>>> 'strict', which will always conflict with 'migratable', so in the end
>>>>>> I still need to introduce a new option for 'mode' which can be a
>>>>>> legal config when 'migratable' is set.
>>>>>>
>>>>>> If we have 'default' option, is 'migratable' still needed then?
>>>>>>
>>>>>> FYI.
>>>>>> The 'mode' corresponds to a memory policy, and there is already a
>>>>>> notion of a default memory policy.
>>>>>>   quote:
>>>>>>     System Default Policy:  this policy is "hard coded" into the
>>>>>> kernel.
>>>>>> (https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt)
>>>>>> So it might be easier to understand if we introduce a 'default'
>>>>>> option directly.
>>>>>>
>>>>>> Regards,
>>>>>> Luyao
>>>>>>
>>>>>>> On 8/26/2020 6:20 AM, Martin Kletzander wrote:
>>>>>>>> On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/19/2020 11:24 PM, Martin Kletzander wrote:
>>>>>>>>> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Martin Kletzander <mkletzan at redhat.com>
>>>>>>>>>>> Sent: Monday, August 17, 2020 4:58 PM
>>>>>>>>>>> To: Zhong, Luyao <luyao.zhong at intel.com>
>>>>>>>>>>> Cc: libvir-list at redhat.com; Zang, Rui <rui.zang at intel.com>; 
>>>>>>>>>>> Michal
>>>>>>>>>>> Privoznik
>>>>>>>>>>> <mprivozn at redhat.com>
>>>>>>>>>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option for
>>>>>>>>>>> attribute mode
>>>>>>>>>>> in numatune
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>>>>>>>>>>>> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>>>>>>>>>>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>>>>>>>>>>>>>> Hi Libvirt experts,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to enhance the numatune snippet configuration.
>>>>>>>>>>>>>>>> Given an example snippet:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>>>>>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Currently, the mode attribute is either 'interleave',
>>>>>>>>>>>>>>>> 'strict', or 'preferred'; I propose to add a new 'default'
>>>>>>>>>>>>>>>> option. I give the reason as follows.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Presume we are using cgroups v1. Libvirt sets cpuset.mems
>>>>>>>>>>>>>>>> for all vcpu threads according to the 'nodeset' in the
>>>>>>>>>>>>>>>> memory element, and translates each memnode element into a
>>>>>>>>>>>>>>>> qemu config option (--object memory-backend-ram) for the
>>>>>>>>>>>>>>>> corresponding numa cell, which invokes the mbind() system
>>>>>>>>>>>>>>>> call in the end. [1]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But what if we want to use the default memory policy and
>>>>>>>>>>>>>>>> request that each guest numa cell be pinned to different
>>>>>>>>>>>>>>>> host memory nodes? We can't use mbind via the qemu config
>>>>>>>>>>>>>>>> options, because (I quote here) "For MPOL_DEFAULT, the
>>>>>>>>>>>>>>>> nodemask and maxnode arguments must specify the empty set
>>>>>>>>>>>>>>>> of nodes." [2]
>>>>>>>>>>>>>>>>
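
(To illustrate the constraint quoted above, here is a rough, self-contained
C sketch; this is my own illustration rather than QEMU code, and it needs
libnuma's <numaif.h> and -lnuma.)

#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 20;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* MPOL_DEFAULT only accepts the empty node set (NULL/0), so there is
     * no way to say "default policy, but only on host nodes 1-2" through
     * mbind(); any per-node placement has to come from somewhere else,
     * e.g. cpuset.mems. */
    if (mbind(p, len, MPOL_DEFAULT, NULL, 0, 0) != 0) {
        perror("mbind");
        return 1;
    }
    return 0;
}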
>>>>>>>>>>>>>>>> So my solution is to introduce a new 'default' option for
>>>>>>>>>>>>>>>> the mode attribute, e.g.:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>>>>     <memory mode="default" nodeset="1-2"/>
>>>>>>>>>>>>>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>>>>>>>>>>>>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>>>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If the mode is 'default', libvirt should avoid generating
>>>>>>>>>>>>>>>> the qemu command line option '--object memory-backend-ram',
>>>>>>>>>>>>>>>> and instead use cgroups to set cpuset.mems for each guest
>>>>>>>>>>>>>>>> numa cell, combined with the numa topology config. Presume
>>>>>>>>>>>>>>>> the numa topology is:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <cpu>
>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>   <numa>
>>>>>>>>>>>>>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
>>>>>>>>>>>>>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
>>>>>>>>>>>>>>>>   </numa>
>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>> </cpu>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3,
>>>>>>>>>>>>>>>> and '2'
>>>>>>>>>>>>>>>> for vcpus 4-7.
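
(To make the cgroup side of this concrete, here is a rough C sketch of the
kind of write involved; my own illustration, and the cgroup path below is
hypothetical -- libvirt's real per-vcpu cgroup layout differs per host and
per domain.)

#include <stdio.h>

int main(void)
{
    /* Hypothetical cgroup v1 cpuset path for the vcpu0-3 threads. */
    const char *path = "/sys/fs/cgroup/cpuset/machine/guest1/vcpu0/cpuset.mems";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    /* Restrict future allocations of the tasks in this cgroup to host
     * node 1.  Note that no MPOL_* policy is installed by this write;
     * the tasks keep whatever policy they already had (the system
     * default unless they called set_mempolicy()/mbind() themselves). */
    if (fprintf(f, "1\n") < 0 || fclose(f) != 0) {
        perror("write cpuset.mems");
        return 1;
    }
    return 0;
}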
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is this reasonable and feasible? Any comments are welcome.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are a couple of problems here.  The memory is not
>>>>>>>>>>>>>>> (always) allocated by the vCPU threads.  I also remember it
>>>>>>>>>>>>>>> being allocated not by the process but in KVM, in a way that
>>>>>>>>>>>>>>> was not affected by the cgroup settings.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your reply. Maybe I don't get what you mean,
>>>>>>>>>>>>>> could you
>>>>>>>>>>>>>> give me more context? But what I proposed will have no
>>>>>>>>>>>>>> effect on
>>>>>>>>>>>>>> other memory allocation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Check how cgroups work.  We can set the memory nodes that a
>>>>>>>>>>>>> process
>>>>>>>>>>>>> will allocate from.  However to set the node for the process
>>>>>>>>>>>>> (thread) QEMU needs to be started with the vCPU threads 
>>>>>>>>>>>>> already
>>>>>>>>>>>>> spawned (albeit stopped).  And for that QEMU already
>>>>>>>>>>>>> allocates some
>>>>>>>>>>>>> memory.  Moreover if extra memory was allocated after we set
>>>>>>>>>>>>> the
>>>>>>>>>>>>> cpuset.mems it is not guaranteed that it will be allocated by
>>>>>>>>>>>>> the
>>>>>>>>>>>>> vCPU in that NUMA cell, it might be done in the emulator
>>>>>>>>>>>>> instead or
>>>>>>>>>>>>> the KVM module in the kernel in which case it might not be
>>>>>>>>>>>>> accounted
>>>>>>>>>>>>> for the process actually causing the allocation (as we've
>>>>>>>>>>>>> already
>>>>>>>>>>>>> seen with Linux).  In all these cases cgroups will not do
>>>>>>>>>>>>> what you
>>>>>>>>>>>>> want them to do.  The last case might be fixed, the first
>>>>>>>>>>>>> ones are
>>>>>>>>>>>>> by default not going to work.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That might be
>>>>>>>>>>>>>>> fixed now,
>>>>>>>>>>>>>>> however.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But basically what we have against is all the reasons why we
>>>>>>>>>>>>>>> started using QEMU's command line arguments for all that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not proposing to use QEMU's command line arguments; on
>>>>>>>>>>>>>> the contrary, I want to use the cgroups settings to support a
>>>>>>>>>>>>>> new config/requirement. I am giving a solution for the case
>>>>>>>>>>>>>> where we require the default memory policy together with
>>>>>>>>>>>>>> per-numa-cell memory pinning.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> And I'm suggesting you look at the commit log to see why we
>>>>>>>>>>>>> *had* to
>>>>>>>>>>>>> add these command line arguments, even though I think I
>>>>>>>>>>>>> managed to
>>>>>>>>>>>>> describe most of them above already (except for one that 
>>>>>>>>>>>>> _might_
>>>>>>>>>>>>> already be fixed in the kernel).  I understand the git log
>>>>>>>>>>>>> is huge
>>>>>>>>>>>>> and the code around NUMA memory allocation was changing a
>>>>>>>>>>>>> lot, so I
>>>>>>>>>>>>> hope my explanation will be enough.
>>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for the detailed explanation, I think I get it now.
>>>>>>>>>>>> We can't guarantee that memory allocation matches the
>>>>>>>>>>>> requirement, since there is a window of time before
>>>>>>>>>>>> cpuset.mems is set.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That's one of the things, although this one could be avoided (by
>>>>>>>>>>> setting a global
>>>>>>>>>>> cgroup before exec()).
>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>>> Sorry, but I think it will more likely break rather than
>>>>>>>>>>>>>>> fix stuff.
>>>>>>>>>>>>>>> Maybe this
>>>>>>>>>>>>>>> could be dealt with by a switch in `qemu.conf` with a huge
>>>>>>>>>>>>>>> warning
>>>>>>>>>>>>>>> above it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not trying to fix something; I am proposing how to
>>>>>>>>>>>>>> support a new requirement, just as I stated above.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess we should take a couple of steps back, I don't get
>>>>>>>>>>>>> what you
>>>>>>>>>>>>> are trying to achieve.  Maybe if you describe your use case
>>>>>>>>>>>>> it will
>>>>>>>>>>>>> be easier to reach a conclusion.
>>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, I do have a use case I didn't mention before. It's a
>>>>>>>>>>>> kernel feature, not merged yet, that we call memory tiering.
>>>>>>>>>>>> (https://lwn.net/Articles/802544/)
>>>>>>>>>>>>
>>>>>>>>>>>> If memory tiering is enabled on the host, DRAM is the top-tier
>>>>>>>>>>>> memory and PMEM (persistent memory) is the second-tier memory;
>>>>>>>>>>>> PMEM shows up as a numa node without cpus. In short, pages can
>>>>>>>>>>>> be migrated between DRAM and PMEM based on DRAM pressure and on
>>>>>>>>>>>> how cold/hot they are.
>>>>>>>>>>>>
>>>>>>>>>>>> We could configure multiple memory migration paths. For
>>>>>>>>>>>> example, with node 0: DRAM, node 1: DRAM, node 2: PMEM,
>>>>>>>>>>>> node 3: PMEM, we can make 0+2 one group and 1+3 another group.
>>>>>>>>>>>> Within each group, pages are allowed to migrate down (demotion)
>>>>>>>>>>>> and up (promotion).
>>>>>>>>>>>>
>>>>>>>>>>>> If **we want our VMs to utilize memory tiering and to have a
>>>>>>>>>>>> NUMA topology**, we need to handle the mapping of guest memory
>>>>>>>>>>>> to host memory, which means we need to bind each guest numa
>>>>>>>>>>>> node to a group of host memory nodes (DRAM node + PMEM node).
>>>>>>>>>>>> For example, guest node 0 -> host nodes 0+2.
>>>>>>>>>>>>
>>>>>>>>>>>> However, only the cgroups setting makes memory tiering work; if
>>>>>>>>>>>> we use the mbind() system call, demoted pages will never go
>>>>>>>>>>>> back to DRAM. That's why I propose adding a 'default' option
>>>>>>>>>>>> and bypassing mbind in QEMU.
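
(For context, this is roughly the kind of binding QEMU sets up for a guest
node backed by memory-backend-ram with an explicit policy; a minimal C
sketch, my own illustration rather than QEMU code, with node numbers taken
from the example above; build with -lnuma.)

#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 20;   /* stand-in for a guest node's RAM block */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Bind the region to host nodes 0 and 2 (the DRAM+PMEM group from
     * the example).  With an explicit VMA policy like this in place,
     * the placement is fixed by mbind(); per the limitation described
     * above, the tiering promotion path cannot freely move demoted
     * pages back to the DRAM node. */
    unsigned long nodemask = (1UL << 0) | (1UL << 2);
    if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }
    return 0;
}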
>>>>>>>>>>>>
>>>>>>>>>>>> I hope I have made myself clear. I'd appreciate any
>>>>>>>>>>>> suggestions you could give.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This comes around every couple of months/years and bites us 
>>>>>>>>>>> in the
>>>>>>>>>>> back no
>>>>>>>>>>> matter what way we go (every time there is someone who wants it
>>>>>>>>>>> the
>>>>>>>>>>> other
>>>>>>>>>>> way).
>>>>>>>>>>> That's why I think there could be a way for the user to specify
>>>>>>>>>>> whether they will
>>>>>>>>>>> likely move the memory or not and based on that we would
>>>>>>>>>>> specify `host-
>>>>>>>>>>> nodes` and `policy` to qemu or not.  I think I even suggested 
>>>>>>>>>>> this
>>>>>>>>>>> before (or
>>>>>>>>>>> probably delegated it to someone else for a suggestion so that
>>>>>>>>>>> there
>>>>>>>>>>> is more
>>>>>>>>>>> discussion), but nobody really replied.
>>>>>>>>>>>
>>>>>>>>>>> So what we need, I think, is a way for someone to set a 
>>>>>>>>>>> per-domain
>>>>>>>>>>> information
>>>>>>>>>>> whether we should bind the memory to nodes in a changeable
>>>>>>>>>>> fashion or
>>>>>>>>>>> not.
>>>>>>>>>>> I'd like to have it in as well.  The way we need to do that is,
>>>>>>>>>>> probably, per-
>>>>>>>>>>> domain, because adding yet another switch for each place in the
>>>>>>>>>>> XML
>>>>>>>>>>> where we
>>>>>>>>>>> can select a NUMA memory binding would be a suicide.  There 
>>>>>>>>>>> should
>>>>>>>>>>> also be
>>>>>>>>>>> no need for this to be enabled per memory-(module, node), so it
>>>>>>>>>>> should work
>>>>>>>>>>> fine.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for letting us know your vision about this.
>>>>>>>>>> From what I understood, the "changeable fashion" means that the
>>>>>>>>>> guest numa cell binding can be changed out of band after the
>>>>>>>>>> initial binding, either by the system admin or by the operating
>>>>>>>>>> system (memory tiering in our case), or by whatever third party
>>>>>>>>>> it is.  Is that perception correct?
>>>>>>>>>
>>>>>>>>> Yes.  If the user wants to have the possibility of changing the
>>>>>>>>> binding,
>>>>>>>>> then we
>>>>>>>>> use *only* cgroups.  Otherwise we use the qemu parameters that
>>>>>>>>> will make
>>>>>>>>> qemu
>>>>>>>>> call mbind() (as that has other pros mentioned above).  The other
>>>>>>>>> option
>>>>>>>>> would
>>>>>>>>> be extra communication between QEMU and libvirt during start to
>>>>>>>>> let us
>>>>>>>>> know when
>>>>>>>>> to set what cgroups etc., but I don't think that's worth it.
>>>>>>>>>
>>>>>>>>>> It seems to me the mbind() or set_mempolicy() system calls do not
>>>>>>>>>> offer that flexibility of changing the binding afterwards. So in
>>>>>>>>>> the case of QEMU/KVM, I can only think of cgroups.
>>>>>>>>>> So to be specific, if this additional "memory_binding_changeable"
>>>>>>>>>> option were specified, we would try to do the guest numa
>>>>>>>>>> constraining via cgroups whenever possible. There will probably
>>>>>>>>>> also be conflicting options or things that cgroups cannot do. In
>>>>>>>>>> such cases we'd fail the domain.
>>>>>>>>>
>>>>>>>>> Basically we'll do what we're doing now and skip the qemu
>>>>>>>>> `host-nodes` and
>>>>>>>>> `policy` parameters with the new option.  And of course we can
>>>>>>>>> fail with
>>>>>>>>> a nice
>>>>>>>>> error message if someone wants to move the memory without the 
>>>>>>>>> option
>>>>>>>>> selected
>>>>>>>>> and so on.
>>>>>>>>
>>>>>>>> Thanks for your comments.
>>>>>>>>
>>>>>>>> I'd like to get clearer on how to define the interface in the
>>>>>>>> domain xml; then I can go further into the implementation.
>>>>>>>>
>>>>>>>> As you mentioned, a per-domain option will be better than a
>>>>>>>> per-node one. I went through the libvirt domain format to look for
>>>>>>>> a proper place for this option, and I'm thinking we could still
>>>>>>>> utilize the numatune element to configure it.
>>>>>>>>
>>>>>>>> <numatune>
>>>>>>>>    <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>    <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>    <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>> </numatune>
>>>>>>>>
>>>>>>>> Coincidentally, the optional memory element specifies how to
>>>>>>>> allocate memory for the domain process on a NUMA host. So can we
>>>>>>>> utilize this element and introduce a new mode like "changeable" or
>>>>>>>> whatever? Do you have a better name?
>>>>>>>>
>>>>>>> Yeah, I was thinking something along the lines of:
>>>>>>> <numatune>
>>>>>>>     <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no"/>
>>>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>> </numatune>
>>>>>>>> If the memory mode is set to 'changeable', we could ignore the mode
>>>>>>>> setting for each memnode, and then we would only configure via
>>>>>>>> cgroups. I have not dived into the code yet, but I expect it could
>>>>>>>> work.
>>>>>>>>
>>>>>>> Yes, the example above gives the impression of the attribute being
>>>>>>> available
>>>>>>> per-node.  But that could be handled in the documentation.
>>>>>>> Specifying it per-node seems very weird, why would you want the
>>>>>>> memory to be
>>>>>>> hard-locked, but for some guest nodes only?
>>>>>>>> Thanks,
>>>>>>>> Luyao
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> If you agree with the direction, I think we can dig deeper to
>>>>>>>>>> see what
>>>>>>>>>> will
>>>>>>>>>> come out.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Zang, Rui
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Ideally we'd discuss it with others, but I think I am only one
>>>>>>>>>>> of a
>>>>>>>>>>> few people
>>>>>>>>>>> who dealt with issues in this regard.  Maybe Michal (Cc'd) also
>>>>>>>>>>> dealt
>>>>>>>>>>> with some
>>>>>>>>>>> things related to the binding, so maybe he can chime in.
>>>>>>>>>>>
>>>>>>>>>>>> regards,
>>>>>>>>>>>> Luyao
>>>>>>>>>>>>
>>>>>>>>>>>>>>> Have a nice day,
>>>>>>>>>>>>>>> Martin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [2]https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>
>>



