[libvirt][RFC PATCH] add a new 'default' option for attribute mode in numatune

Zhong, Luyao luyao.zhong at intel.com
Fri Nov 20 05:17:29 UTC 2020



On 11/18/2020 8:42 PM, Martin Kletzander wrote:
> On Wed, Nov 18, 2020 at 05:40:04PM +0800, Zhong, Luyao wrote:
>>
>>
>> On 11/12/2020 5:27 PM, Martin Kletzander wrote:
>>> On Thu, Nov 12, 2020 at 02:19:06PM +0800, Zhong, Luyao wrote:
>>>>
>>>>
>>>> On 11/9/2020 7:21 PM, Martin Kletzander wrote:
>>>>> On Sat, Nov 07, 2020 at 10:41:52AM +0800, Zhong, Luyao wrote:
>>>>>>
>>>>>>
>>>>>> On 11/4/2020 9:02 PM, Martin Kletzander wrote:
>>>>>>> On Fri, Oct 16, 2020 at 10:38:51PM +0800, Zhong, Luyao wrote:
>>>>>>>> On 10/16/2020 9:32 PM, Zang, Rui wrote:
>>>>>>>>>
>>>>>>>>> How about if “migratable” is set, “mode” should be 
>>>>>>>>> ignored/omitted?
>>>>>>>>> So any setting of “mode” will be rejected with an error
>>>>>>>>> indicating an
>>>>>>>>> invalid configuration.
>>>>>>>>> We can say in the doc that “migratable” and “mode” shall not be 
>>>>>>>>> set
>>>>>>>>> together. So even the default value of “mode” is not taken.
>>>>>>>>>
>>>>>>>> If "mode" is not set, it's the same as setting it to "strict"
>>>>>>>> ('strict' is the default value). There is a code detail involved:
>>>>>>>> the mode is translated to an enumerated type whose value is 0 both
>>>>>>>> when the mode is not set and when it is set to 'strict'. The code
>>>>>>>> follows a fixed skeleton, so it's not easy to modify.
>>>>>>>>
>>>>>>>
>>>>>>> Well I see it as it is "strict". It does not mean "strict cgroup
>>>>>>> setting",
>>>>>>> because cgroups are just one of the ways to enforce this.  Look 
>>>>>>> at it
>>>>>>> this way:
>>>>>>>
>>>>>>> mode can be:
>>>>>>>   - strict: only these nodes can be used for the memory
>>>>>>>   - preferred: these nodes should be preferred, but allocation
>>>>>>> should not fail
>>>>>>>   - interleave: interleave the memory between these nodes
>>>>>>>
>>>>>>> Due to the naming this maps to cgroup settings 1:1.
>>>>>>>
>>>>>
>>>>> Sorry, I misspoke, this does not map to cgroup settings at all, in
>>>>> cgroups you
>>>>> can only set "strict" (by using cpuset.mems) and that's it.  There 
>>>>> is no
>>>>> way to
>>>>> set preferred or interleaved mapping, sorry.
>>>>>
>>>> memory policy is independent of cpuset.mems
>>>>
>>>
>>> Yes.
>>>
>>>> I quote here "Memory policies should not be confused with cpusets
>>>> (Documentation/admin-guide/cgroup-v1/cpusets.rst) which is an
>>>> administrative mechanism for restricting the nodes from which memory 
>>>> may
>>>> be allocated by a set of processes. Memory policies are a programming
>>>> interface that a NUMA-aware application can take advantage of.
>>>
>>> Pay attention to this part:
>>>
>>>> When both cpusets and policies are applied to a task, the restrictions
>>>> of the
>>>> cpuset takes priority.
>>>
>>>> See Memory Policies and cpusets below for more details."[1]
>>>>
>>>> So using cpuset.mems does not mean setting a "strict" memory policy if I
>>>> understand it correctly; we can set cpuset.mems with any memory policy.
>>>>
>>>
>>> That's not how I understand that.  Sure, it's independent of memory
>>> policy, but
>>> if you do not specify memory policy (which keeps it as "default") and 
>>> set
>>> cpuset.mems, then the process will only be permitted to allocate memory
>>> on NUMA
>>> nodes specified in the file.
>>>
>> Yes, it does not conflict with what I was saying; it's one kind of
>> combination.
>>
>> For instance, we can also set cpuset.mems to "1,2" and use mbind() to set
>> the memory policy to MPOL_PREFERRED with preferred node "1". That means we
>> will allocate pages from node 1 first and then fall back to other nodes
>> (only node 2 in this case, since cpuset.mems restricts the memory nodes)
>> if the preferred node is low on free memory. If the preferred node is "0",
>> we will not allocate pages from node 0, since cpuset.mems takes priority.
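(To make that combination concrete, a minimal C sketch, illustration only and
not from any patch: it assumes the calling task already sits in a cpuset with
cpuset.mems set to "1-2", and it needs linking with -lnuma.)

#include <numaif.h>      /* set_mempolicy(), MPOL_PREFERRED */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    unsigned long nodemask = 1UL << 1;   /* prefer NUMA node 1 */

    /* MPOL_PREFERRED: allocate on node 1 while it has free pages and fall
     * back elsewhere otherwise.  The cpuset restriction still applies on
     * top, so the fallback can only reach node 2 in the assumed setup. */
    if (set_mempolicy(MPOL_PREFERRED, &nodemask, 8 * sizeof(nodemask)) != 0) {
        perror("set_mempolicy");
        return EXIT_FAILURE;
    }

    /* Anonymous memory touched after this point follows the policy. */
    char *buf = malloc(1 << 20);
    if (buf)
        memset(buf, 0, 1 << 20);
    free(buf);
    return EXIT_SUCCESS;
}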
>>
>> Do you mean cpuset.mems + MPOL_DEFAULT policy == MPOL_BIND policy? They
>> might be functionally similar if no specific policies are implemented in
>> the kernel, but I don't think they are exactly the same.
>>
> 
> It's not the same, but has similar outcome.  If you do 
> numa_set_membind() or
> set_mempolicy(MPOL_BIND) on nodes 2-3, then the memory allocations will 
> only be
> done on those nodes.  If there is no space on them the allocation will 
> fail.  If
> you write 2-3 into cpuset.mems, then similar thing happens: allocations 
> will
> only be done on the mentioned nodes and will fail if there is not enough 
> space.
> 
> If you are talking about mbind() in qemu, that is done *after* the 
> allocation,
> but *before* any write to that.  If that is anonymous memory or a private
> file-backed memory, then writes that allocate will be allocated on the 
> specified
> nodes.  If you write to cpuset.mems *after* the allocations were done, 
> then they
> might be moved to those nodes based on cpuset.memory_migrate and couple 
> of other
> things.
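To illustrate the cgroup-only path for myself, a rough C sketch (not libvirt
code; the cgroup directory is made up and cgroup v1 with the cpuset controller
is assumed):

#include <stdio.h>

/* Write a single value into a cpuset control file. */
static int write_cgroup(const char *dir, const char *file, const char *val)
{
    char path[512];
    FILE *f;

    snprintf(path, sizeof(path), "%s/%s", dir, file);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int main(void)
{
    /* hypothetical per-domain cpuset directory */
    const char *dom = "/sys/fs/cgroup/cpuset/machine/vm1";

    /* With memory_migrate enabled, a later change of cpuset.mems also moves
     * the pages that were already allocated, which is what makes the
     * binding changeable after the fact. */
    write_cgroup(dom, "cpuset.memory_migrate", "1");
    write_cgroup(dom, "cpuset.mems", "2-3");
    return 0;
}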
> 
>> Because the default policy indicates that we are using the policy
>> implemented in the kernel (transparent to the process), just like the
>> memory tiering (https://lwn.net/Articles/802544/) case I stated before.
>> So cpuset.mems + MPOL_DEFAULT policy is not MPOL_BIND policy in this case.
>>
> 
> Yes, there are some other minor details between mpol_default and 
> mpol_bind.  Are
> you referring to the fact that mpol_default prefers same node as the 
> thread is
> running on whereas mpol_bind allocates on the node with the lowest ID 
> from the
> set, no matter what node the thread requesting the allocation is running 
> on,
> etc.?
> 
> If yes, then sure, I completely agree that someone might want to use 
> cpuset.mems
> with mpol_default.  That's what we're both trying to achieve here, 
> right?  I
> think the misunderstanding here comes from how we each understand what the
> libvirt XML is trying to describe.
> 
yes

> In libvirt XML we have the setting for "mode" which accepts possible 
> values of
> "strict", "interleave" and "preferred".  It *does not* say *how* that is
> achieved (in fact in the past we were using different mechanisms for 
> that).  If
> you say "default" in terms of the libvirt XML, it means the value used if no
> setting was specified.  But since we do not expose "memory policy" per 
> se it is
> very misleading to add such value.  In my opinion it would also be very
> difficult to describe its outcome with regard to the "mode".
> 
> Given what I described earlier, the main difference between cgroups and 
> libnuma
> APIs is (at least from what I remember) that cgroup settings can be changed
> (even if the API settings could be changed we'd need cooperation from
> QEMU to
> change them at runtime).  And if cpuset.memory_migrate is set it can be 
> migrated
> to another node.  That was also requested a couple of times in the past
> and that's
> what I based my suggestion on.
> 
> I now understand better what you mean when you are looking at it from 
> the other
> side needing to keep MPOL_DEFAULT, but with restricted set of NUMA nodes 
> allowed
> for allocation.  You don't really care whether it is going to be 
> migratable or
> not so that naming does not make sense for your use case (which also 
> means it
> would not make sense for all of our users).  So my naming is bad too, 
> probably
> worse than your suggestion.
> 
> So let's finish this sooner rather than later.  Let's remove the
> "migratable/movable" attribute and add another mode for the way memory
> allocations are going to be treated.  But let's name it something else than
> "default" and let's explain the name properly in man pages and 
> documentation.
> Since naming is hard and names I come up with are usually bad I can only 
> suggest
> a lowest bottom fallback if we can't come up with anything else =) Let's 
> say
> something like "restrictive".
> 
I have no better name than yours, so I think "restrictive" is good. I
could use it and send out the patch first; other reviewers might then
come up with a new name, otherwise we keep it.

> Sure, I don't like that the only difference between "strict" and 
> "the_new_name"
> would be whether we pass the options to qemu or not, but saying that 
> would not
> really help users very much.
> 
> Does that sound OK to you?
> 

Yes, it's good. Thank you so much for your comments.

>> Regards,
>> Luyao
>>
>>>> [1]https://www.infradead.org/~mchehab/kernel_docs/admin-guide/mm/numa_memory_policy.html 
>>>>
>>>>
>>>>>>> But now we have another way of enforcing this, using qemu cmdline
>>>>>>> option.  The
>>>>>>> names actually map 1:1 to those as well:
>>>>>>>
>>>>>>>
>>>>>>> https://gitlab.com/qemu-project/qemu/-/blob/master/qapi/machine.json#L901 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So my idea was that we would add a movable/migratable/whatever
>>>>>>> attribute that would tell us which way of enforcing this we use,
>>>>>>> because there does not seem to be a "one size fits all" solution.
>>>>>>> Am I misunderstanding this
>>>>>>> discussion?
>>>>>>> Please
>>>>>>> correct me if I am.  Thank you.
>>>>>>>
>>>>>> Actually I need support for the default memory policy (the memory
>>>>>> policy that is 'hard coded' into the kernel). I thought "migratable"
>>>>>> was enough to indicate
>>>>>
>>>>> So I am getting on your track, yes.  What you mean is basically
>>>>> MPOL_DEFAULT and
>>>>> that's where the naming probably comes from, right?  Anyway, what 
>>>>> we're
>>>>> trying
>>>>> to do is not restrict us from other options, even if they are only
>>>>> possible in
>>>>> the future.  So instead of adding "default" which would actually mean
>>>>> "strict"
>>>>> (because you still use cpuset.mems) which would restrict us from
>>>>> potentially
>>>>> being able to migrate with a different policy than "strict" (even
>>>>> though it
>>>>> might not make sense for "preferred", for example) and it's also a bit
>>>>> confusing
>>>> as I mentioned above, using "cpuset.mems" does not mean "strict" memory
>>>> policy.
>>>>> for users, I suggested we add "migratable" which restricts just the 
>>>>> qemu
>>>>> options.  Of course, "migratable" only makes sense with "strict" now,
>>>>> but that's
>>>>> fine.  The XML provides a possibility for something we don't support,
>>>>> but we can
>>>>> forbid that combination for the sake of clarity of the other option 
>>>>> that
>>>>> _is_
>>>>> supported.
>>>>>
>>>>> I'll try to propose my idea based on your patch from Nov 3rd and it
>>>>> might
>>>>> improve my communication.  It feels difficult for me to explain myself
>>>>> without
>>>>> the code.  I just need to deal with a lot of other emails first.
>>>>>
>>>> Thank you in advance. Let's discuss later based on the patch.
>>>>
>>>> Regards,
>>>> Luyao
>>>>>> that we rely on the operating system to handle the memory policy. So
>>>>>> when "migratable" is set, "mode" should not be set. But when I was
>>>>>> coding, I found that the default value of "mode" is "strict"; it is
>>>>>> always "strict" even if "migratable" is yes, which means we configure
>>>>>> two different memory policies at the same time. So I still need a new
>>>>>> option for "mode" that does not conflict with "migratable", and if we
>>>>>> have the new option ("default") for "mode", it seems we can drop
>>>>>> "migratable".
>>>>>>
>>>>>> Besides, we can make "mode" a "one size fits all" solution: just
>>>>>> reject a different "mode" value configured in a memnode element when
>>>>>> "mode" is "default" in the memory element.
>>>>>>
>>>>>> I summarized it in the new email:
>>>>>> https://www.redhat.com/archives/libvir-list/2020-November/msg00084.html 
>>>>>>
>>>>>>
>>>>>> Sorry I didn't make it easy to understand.
>>>>>>
>>>>>> Regards,
>>>>>> Luyao
>>>>>>>> So I need an option to indicate "I don't specify any mode."
>>>>>>>>
>>>>>>>>>> On Oct 16, 2020, at 20:34, Zhong, Luyao <luyao.zhong at intel.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Martin, Peter and other experts,
>>>>>>>>>>
>>>>>>>>>> We reached a consensus before that we need to introduce a new
>>>>>>>>>> "migratable" attribute. But during implementation, I found that
>>>>>>>>>> introducing a new 'default' option for the existing mode attribute
>>>>>>>>>> is still necessary.
>>>>>>>>>>
>>>>>>>>>> I have an initial patch for 'migratable' and Peter already gave
>>>>>>>>>> some comments.
>>>>>>>>>> https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The current issue is: if I set 'migratable', any 'mode' should be
>>>>>>>>>> ignored. Peter commented that I can't rely on the docs to tell
>>>>>>>>>> users that some config is invalid; I need to reject the config in
>>>>>>>>>> the code, and I completely agree with that. But the 'mode' default
>>>>>>>>>> value is 'strict', so it will always conflict with 'migratable';
>>>>>>>>>> in the end I still need to introduce a new option for 'mode' which
>>>>>>>>>> can be a legal config when 'migratable' is set.
>>>>>>>>>>
>>>>>>>>>> If we have 'default' option, is 'migratable' still needed then?
>>>>>>>>>>
>>>>>>>>>> FYI.
>>>>>>>>>> The 'mode' corresponds to the memory policy, and there is already
>>>>>>>>>> a notion of a default memory policy.
>>>>>>>>>>   quote:
>>>>>>>>>>     System Default Policy:  this policy is "hard coded" into the
>>>>>>>>>> kernel.
>>>>>>>>>> (https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt) 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So it might be easier to understand if we introduce a 'default'
>>>>>>>>>> option directly.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Luyao
>>>>>>>>>>
>>>>>>>>>>> On 8/26/2020 6:20 AM, Martin Kletzander wrote:
>>>>>>>>>>>> On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 8/19/2020 11:24 PM, Martin Kletzander wrote:
>>>>>>>>>>>>> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Martin Kletzander <mkletzan at redhat.com>
>>>>>>>>>>>>>>> Sent: Monday, August 17, 2020 4:58 PM
>>>>>>>>>>>>>>> To: Zhong, Luyao <luyao.zhong at intel.com>
>>>>>>>>>>>>>>> Cc: libvir-list at redhat.com; Zang, Rui <rui.zang at intel.com>;
>>>>>>>>>>>>>>> Michal
>>>>>>>>>>>>>>> Privoznik
>>>>>>>>>>>>>>> <mprivozn at redhat.com>
>>>>>>>>>>>>>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default'
>>>>>>>>>>>>>>> option for
>>>>>>>>>>>>>>> attribute mode
>>>>>>>>>>>>>>> in numatune
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>>>>>>>>>>>>>>>> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>>>>>>>>>>>>>>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> Hi Libvirt experts,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would like to enhance the numatune snippet
>>>>>>>>>>>>>>>>>>>> configuration. Given an example snippet:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>>>>>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>>>>>>>>>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>>>>>>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Currently, the mode attribute is either 'interleave',
>>>>>>>>>>>>>>>>>>>> 'strict', or 'preferred'; I propose to add a new
>>>>>>>>>>>>>>>>>>>> 'default' option. I give the reason as follows.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Presume we are using cgroups v1. Libvirt sets
>>>>>>>>>>>>>>>>>>>> cpuset.mems for all vcpu threads according to 'nodeset'
>>>>>>>>>>>>>>>>>>>> in the memory element, and translates the memnode
>>>>>>>>>>>>>>>>>>>> elements to qemu config options (--object
>>>>>>>>>>>>>>>>>>>> memory-backend-ram) for each numa cell, which invokes
>>>>>>>>>>>>>>>>>>>> the mbind() system call at the end.[1]
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> But what if we want to use the default memory policy and
>>>>>>>>>>>>>>>>>>>> request that each guest numa cell be pinned to different
>>>>>>>>>>>>>>>>>>>> host memory nodes? We can't use mbind via qemu config
>>>>>>>>>>>>>>>>>>>> options, because (I quote here) "For MPOL_DEFAULT, the
>>>>>>>>>>>>>>>>>>>> nodemask and maxnode arguments must specify the empty
>>>>>>>>>>>>>>>>>>>> set of nodes." [2]
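(As an aside, a rough C sketch of what the mbind()-based path amounts to for
one guest cell, illustration only and not QEMU code; it also shows why
MPOL_DEFAULT cannot carry a per-cell nodeset. Link with -lnuma.)

#include <numaif.h>      /* mbind(), MPOL_BIND, MPOL_DEFAULT */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t len = 512UL << 20;   /* backing memory of one guest cell */
    void *cell = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (cell == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Pin this cell's pages to host node 1 only. */
    unsigned long nodemask = 1UL << 1;
    if (mbind(cell, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0) != 0) {
        perror("mbind");
        return EXIT_FAILURE;
    }

    /* The only legal "default" form is mbind(cell, len, MPOL_DEFAULT,
     * NULL, 0, 0): no nodes may be named, hence no per-cell pinning. */
    munmap(cell, len);
    return EXIT_SUCCESS;
}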
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So my solution is introducing a new 'default' option 
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> attribute
>>>>>>>>>>>>>>>>>>>> mode. e.g.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>>>>>>>>     <memory mode="default" nodeset="1-2"/>
>>>>>>>>>>>>>>>>>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>>>>>>>>>>>>>>>>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>>>>>>>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If the mode is 'default', libvirt should avoid
>>>>>>>>>>>>>>>>>>>> generating the qemu command line '--object
>>>>>>>>>>>>>>>>>>>> memory-backend-ram', and instead use cgroups to set
>>>>>>>>>>>>>>>>>>>> cpuset.mems for each guest numa cell, combined with the
>>>>>>>>>>>>>>>>>>>> numa topology config. Presume the numa topology is:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <cpu>
>>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>>>   <numa>
>>>>>>>>>>>>>>>>>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
>>>>>>>>>>>>>>>>>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
>>>>>>>>>>>>>>>>>>>>   </numa>
>>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>>> </cpu>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Then libvirt should set cpuset.mems to '1' for vcpus 
>>>>>>>>>>>>>>>>>>>> 0-3,
>>>>>>>>>>>>>>>>>>>> and '2'
>>>>>>>>>>>>>>>>>>>> for vcpus 4-7.
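(In code terms that would amount to something like the following rough
sketch; the per-vCPU cgroup directory names are hypothetical and only meant
to show the mapping of guest cells to host nodesets.)

#include <stdio.h>

int main(void)
{
    /* hypothetical domain cpuset directory with one sub-group per vCPU */
    const char *base = "/sys/fs/cgroup/cpuset/machine/vm1";
    char path[512];
    int vcpu;

    for (vcpu = 0; vcpu < 8; vcpu++) {
        /* guest cell 0 (vcpus 0-3) -> host node 1,
         * guest cell 1 (vcpus 4-7) -> host node 2 */
        const char *mems = (vcpu < 4) ? "1" : "2";
        FILE *f;

        snprintf(path, sizeof(path), "%s/vcpu%d/cpuset.mems", base, vcpu);
        f = fopen(path, "w");
        if (!f)
            continue;
        fprintf(f, "%s\n", mems);
        fclose(f);
    }
    return 0;
}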
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is this reasonable and feasible? Any comments are welcome.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are a couple of problems here.  The memory is not
>>>>>>>>>>>>>>>>>>> (always)
>>>>>>>>>>>>>>>>>>> allocated by the vCPU threads.  I also remember it to
>>>>>>>>>>>>>>>>>>> not be
>>>>>>>>>>>>>>>>>>> allocated by the process, but in KVM in a way that 
>>>>>>>>>>>>>>>>>>> was not
>>>>>>>>>>>>>>>>>>> affected
>>>>>>>>>>>>>>>>>>> by the cgroup settings.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for your reply. Maybe I don't get what you mean,
>>>>>>>>>>>>>>>>>> could you
>>>>>>>>>>>>>>>>>> give me more context? But what I proposed will have no
>>>>>>>>>>>>>>>>>> effect on
>>>>>>>>>>>>>>>>>> other memory allocation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Check how cgroups work.  We can set the memory nodes 
>>>>>>>>>>>>>>>>> that a
>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>> will allocate from.  However to set the node for the
>>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>> (thread) QEMU needs to be started with the vCPU threads
>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>> spawned (albeit stopped).  And for that QEMU already
>>>>>>>>>>>>>>>>> allocates some
>>>>>>>>>>>>>>>>> memory.  Moreover if extra memory was allocated after we
>>>>>>>>>>>>>>>>> set
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> cpuset.mems it is not guaranteed that it will be
>>>>>>>>>>>>>>>>> allocated by
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> vCPU in that NUMA cell, it might be done in the emulator
>>>>>>>>>>>>>>>>> instead or
>>>>>>>>>>>>>>>>> the KVM module in the kernel in which case it might not be
>>>>>>>>>>>>>>>>> accounted
>>>>>>>>>>>>>>>>> for the process actually causing the allocation (as we've
>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>> seen with Linux).  In all these cases cgroups will not do
>>>>>>>>>>>>>>>>> what you
>>>>>>>>>>>>>>>>> want them to do.  The last case might be fixed, the first
>>>>>>>>>>>>>>>>> ones are
>>>>>>>>>>>>>>>>> by default not going to work.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That might be
>>>>>>>>>>>>>>>>>>> fixed now,
>>>>>>>>>>>>>>>>>>> however.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> But basically what we have against is all the reasons
>>>>>>>>>>>>>>>>>>> why we
>>>>>>>>>>>>>>>>>>> started using QEMU's command line arguments for all 
>>>>>>>>>>>>>>>>>>> that.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not proposing using QEMU's command line arguments; on
>>>>>>>>>>>>>>>>>> the contrary, I want to use the cgroups setting to support
>>>>>>>>>>>>>>>>>> a new config/requirement. I give a solution for the case
>>>>>>>>>>>>>>>>>> where we require the default memory policy and memory numa
>>>>>>>>>>>>>>>>>> pinning.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And I'm suggesting you look at the commit log to see 
>>>>>>>>>>>>>>>>> why we
>>>>>>>>>>>>>>>>> *had* to
>>>>>>>>>>>>>>>>> add these command line arguments, even though I think I
>>>>>>>>>>>>>>>>> managed to
>>>>>>>>>>>>>>>>> describe most of them above already (except for one that
>>>>>>>>>>>>>>>>> _might_
>>>>>>>>>>>>>>>>> already be fixed in the kernel).  I understand the git 
>>>>>>>>>>>>>>>>> log
>>>>>>>>>>>>>>>>> is huge
>>>>>>>>>>>>>>>>> and the code around NUMA memory allocation was changing a
>>>>>>>>>>>>>>>>> lot, so I
>>>>>>>>>>>>>>>>> hope my explanation will be enough.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for the detailed explanation, I think I get it
>>>>>>>>>>>>>>>> now. We can't guarantee that memory allocation matches the
>>>>>>>>>>>>>>>> requirement, since there is a time window before
>>>>>>>>>>>>>>>> cpuset.mems is set.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That's one of the things, although this one could be
>>>>>>>>>>>>>>> avoided (by
>>>>>>>>>>>>>>> setting a global
>>>>>>>>>>>>>>> cgroup before exec()).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>>>>>>> Sorry, but I think it will more likely break rather than
>>>>>>>>>>>>>>>>>>> fix stuff.
>>>>>>>>>>>>>>>>>>> Maybe this
>>>>>>>>>>>>>>>>>>> could be dealt with by a switch in `qemu.conf` with a 
>>>>>>>>>>>>>>>>>>> huge
>>>>>>>>>>>>>>>>>>> warning
>>>>>>>>>>>>>>>>>>> above it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not trying to fix something; I'm proposing how to
>>>>>>>>>>>>>>>>>> support a new requirement, just as I stated above.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I guess we should take a couple of steps back, I don't get
>>>>>>>>>>>>>>>>> what you
>>>>>>>>>>>>>>>>> are trying to achieve.  Maybe if you describe your use 
>>>>>>>>>>>>>>>>> case
>>>>>>>>>>>>>>>>> it will
>>>>>>>>>>>>>>>>> be easier to reach a conclusion.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yeah, I do have a use case I didn't mention before. It's a
>>>>>>>>>>>>>>>> kernel feature that is not merged yet; we call it memory
>>>>>>>>>>>>>>>> tiering (https://lwn.net/Articles/802544/).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If memory tiering is enabled on the host, DRAM is the top
>>>>>>>>>>>>>>>> tier memory and PMEM (persistent memory) is the second tier
>>>>>>>>>>>>>>>> memory; PMEM is shown as a numa node without cpus. In short,
>>>>>>>>>>>>>>>> pages can be migrated between DRAM and PMEM based on DRAM
>>>>>>>>>>>>>>>> pressure and how cold/hot they are.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We could configure multiple memory migration paths. For
>>>>>>>>>>>>>>>> example, with node 0: DRAM, node 1: DRAM, node 2: PMEM,
>>>>>>>>>>>>>>>> node 3: PMEM, we can make 0+2 a group and 1+3 a group. In
>>>>>>>>>>>>>>>> each group, pages are allowed to migrate down (demotion)
>>>>>>>>>>>>>>>> and up (promotion).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If **we want our VMs to utilize memory tiering and have a
>>>>>>>>>>>>>>>> NUMA topology**, we need to handle the mapping of guest
>>>>>>>>>>>>>>>> memory to host memory; that means we need to bind each guest
>>>>>>>>>>>>>>>> numa node to a group of memory nodes (DRAM node + PMEM node)
>>>>>>>>>>>>>>>> on the host. For example, guest node 0 -> host nodes 0+2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, only the cgroups setting can make memory tiering
>>>>>>>>>>>>>>>> work; if we use the mbind() system call, demoted pages will
>>>>>>>>>>>>>>>> never go back to DRAM. That's why I propose to add a
>>>>>>>>>>>>>>>> 'default' option and bypass mbind in QEMU.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I hope I have made myself understandable. I'd appreciate it
>>>>>>>>>>>>>>>> if you could give some suggestions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This comes around every couple of months/years and bites us
>>>>>>>>>>>>>>> in the
>>>>>>>>>>>>>>> back no
>>>>>>>>>>>>>>> matter what way we go (every time there is someone who
>>>>>>>>>>>>>>> wants it
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>> way).
>>>>>>>>>>>>>>> That's why I think there could be a way for the user to
>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>> whether they will
>>>>>>>>>>>>>>> likely move the memory or not and based on that we would
>>>>>>>>>>>>>>> specify `host-
>>>>>>>>>>>>>>> nodes` and `policy` to qemu or not.  I think I even 
>>>>>>>>>>>>>>> suggested
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>> before (or
>>>>>>>>>>>>>>> probably delegated it to someone else for a suggestion so 
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>> is more
>>>>>>>>>>>>>>> discussion), but nobody really replied.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So what we need, I think, is a way for someone to set a
>>>>>>>>>>>>>>> per-domain
>>>>>>>>>>>>>>> information
>>>>>>>>>>>>>>> whether we should bind the memory to nodes in a changeable
>>>>>>>>>>>>>>> fashion or
>>>>>>>>>>>>>>> not.
>>>>>>>>>>>>>>> I'd like to have it in as well.  The way we need to do that
>>>>>>>>>>>>>>> is,
>>>>>>>>>>>>>>> probably, per-
>>>>>>>>>>>>>>> domain, because adding yet another switch for each place in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> XML
>>>>>>>>>>>>>>> where we
>>>>>>>>>>>>>>> can select a NUMA memory binding would be a suicide.  There
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>> also be
>>>>>>>>>>>>>>> no need for this to be enabled per memory-(module, node),
>>>>>>>>>>>>>>> so it
>>>>>>>>>>>>>>> should work
>>>>>>>>>>>>>>> fine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for letting us know your vision about this.
>>>>>>>>>>>>>>  From what I understood, the "changeable fashion" means that
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> guest
>>>>>>>>>>>>>> numa
>>>>>>>>>>>>>> cell binding can be changed out of band after initial 
>>>>>>>>>>>>>> binding,
>>>>>>>>>>>>>> either
>>>>>>>>>>>>>> by system
>>>>>>>>>>>>>> admin or the operating system (memory tiering in our 
>>>>>>>>>>>>>> case), or
>>>>>>>>>>>>>> whatever the
>>>>>>>>>>>>>> third party is.  Is that perception correct?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes.  If the user wants to have the possibility of changing 
>>>>>>>>>>>>> the
>>>>>>>>>>>>> binding,
>>>>>>>>>>>>> then we
>>>>>>>>>>>>> use *only* cgroups.  Otherwise we use the qemu parameters that
>>>>>>>>>>>>> will make
>>>>>>>>>>>>> qemu
>>>>>>>>>>>>> call mbind() (as that has other pros mentioned above).  The
>>>>>>>>>>>>> other
>>>>>>>>>>>>> option
>>>>>>>>>>>>> would
>>>>>>>>>>>>> be extra communication between QEMU and libvirt during 
>>>>>>>>>>>>> start to
>>>>>>>>>>>>> let us
>>>>>>>>>>>>> know when
>>>>>>>>>>>>> to set what cgroups etc., but I don't think that's worth it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems to me mbind() or set_mempolicy() system calls do not
>>>>>>>>>>>>>> offer that
>>>>>>>>>>>>>> flexibility of changing afterwards. So in case of QEMU/KVM, I
>>>>>>>>>>>>>> can only
>>>>>>>>>>>>>> think
>>>>>>>>>>>>>> of cgroups.
>>>>>>>>>>>>>> So to be specific, if we had this additional
>>>>>>>>>>>>>> "memory_binding_changeable"
>>>>>>>>>>>>>> option specified, we will try to do the guest numa
>>>>>>>>>>>>>> constraining via
>>>>>>>>>>>>>> cgroups
>>>>>>>>>>>>>> whenever possible. There will probably also be conflicts in
>>>>>>>>>>>>>> options or
>>>>>>>>>>>>>> things
>>>>>>>>>>>>>> that cgroups can not do. For such cases we'd fail the domain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Basically we'll do what we're doing now and skip the qemu
>>>>>>>>>>>>> `host-nodes` and
>>>>>>>>>>>>> `policy` parameters with the new option.  And of course we can
>>>>>>>>>>>>> fail with
>>>>>>>>>>>>> a nice
>>>>>>>>>>>>> error message if someone wants to move the memory without the
>>>>>>>>>>>>> option
>>>>>>>>>>>>> selected
>>>>>>>>>>>>> and so on.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to get clearer about defining the interface in the
>>>>>>>>>>>> domain xml before I go further into the implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> As you mentioned, a per-domain option will be better than
>>>>>>>>>>>> per-node. I went through the libvirt domain format to look for
>>>>>>>>>>>> a proper position to place this option, and I'm thinking we
>>>>>>>>>>>> could still utilize the numatune element for the configuration.
>>>>>>>>>>>>
>>>>>>>>>>>> <numatune>
>>>>>>>>>>>>    <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>>>>>    <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>>    <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>>> </numatune>
>>>>>>>>>>>>
>>>>>>>>>>>> Coincidentally, the optional memory element specifies how to
>>>>>>>>>>>> allocate memory for the domain process on a NUMA host. So can we
>>>>>>>>>>>> utilize this element and introduce a new mode like "changeable"
>>>>>>>>>>>> or whatever? Do you have a better name?
>>>>>>>>>>>>
>>>>>>>>>>> Yeah, I was thinking something along the lines of:
>>>>>>>>>>> <numatune>
>>>>>>>>>>>     <memory mode="strict" nodeset="1-4,^3"
>>>>>>>>>>> movable/migratable="yes/no" />
>>>>>>>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>> </numatune>
>>>>>>>>>>>> If the memory mode is set to 'changeable', we could ignore the
>>>>>>>>>>>> mode setting for each memnode and then configure only via
>>>>>>>>>>>> cgroups. I have not dived into the code yet, but I expect it
>>>>>>>>>>>> could work.
>>>>>>>>>>>>
>>>>>>>>>>> Yes, the example above gives the impression of the attribute 
>>>>>>>>>>> being
>>>>>>>>>>> available
>>>>>>>>>>> per-node.  But that could be handled in the documentation.
>>>>>>>>>>> Specifying it per-node seems very weird, why would you want the
>>>>>>>>>>> memory to be
>>>>>>>>>>> hard-locked, but for some guest nodes only?
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Luyao
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you agree with the direction, I think we can dig deeper to
>>>>>>>>>>>>>> see what
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>> come out.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Zang, Rui
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ideally we'd discuss it with others, but I think I am 
>>>>>>>>>>>>>>> only one
>>>>>>>>>>>>>>> of a
>>>>>>>>>>>>>>> few people
>>>>>>>>>>>>>>> who dealt with issues in this regard.  Maybe Michal (Cc'd)
>>>>>>>>>>>>>>> also
>>>>>>>>>>>>>>> dealt
>>>>>>>>>>>>>>> with some
>>>>>>>>>>>>>>> things related to the binding, so maybe he can chime in.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Have a nice day,
>>>>>>>>>>>>>>>>>>> Martin
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [2] https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>



