[libvirt][RFC PATCH] add a new 'default' option for attribute mode in numatune

Martin Kletzander mkletzan at redhat.com
Wed Nov 4 13:02:27 UTC 2020


On Fri, Oct 16, 2020 at 10:38:51PM +0800, Zhong, Luyao wrote:
>On 10/16/2020 9:32 PM, Zang, Rui wrote:
>>
>> How about this: if “migratable” is set, “mode” must be omitted? Any setting of “mode” would then be rejected with an error indicating an invalid configuration.
>> We can say in the docs that “migratable” and “mode” shall not be set together, so that not even the default value of “mode” is applied.
>>
>If "mode" is not set, it's the same as setting "strict" value ('strict'
>is the default value). It involves some code detail, it will be
>translated to enumerated type, the value is 0 when mode not set or set
>to 'strict'. The code is in some fixed skeleton, so it's not easy to modify.
>

Well, I see it as "strict".  It does not mean "strict cgroup setting",
because cgroups are just one of the ways to enforce this.  Look at it this way:

mode can be:
  - strict: only these nodes can be used for the memory
  - preferred: these nodes should be preferred, but allocation should not fail
  - interleave: interleave the memory between these nodes

Due to the naming this maps to cgroup settings 1:1.

But now we have another way of enforcing this, using a qemu cmdline option.  The
names actually map 1:1 to those as well:

   https://gitlab.com/qemu-project/qemu/-/blob/master/qapi/machine.json#L901
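
For illustration only (this is not libvirt code, just a sketch of the naming
correspondence): the numatune mode names line up with the kernel's MPOL_*
constants from libnuma's <numaif.h>, which both the cgroup path and the QEMU
path ultimately rely on.

  #include <numaif.h>
  #include <string.h>

  /* Sketch: map a numatune mode name onto a kernel memory policy. */
  static int mode_to_mpol(const char *mode)
  {
      if (strcmp(mode, "strict") == 0)
          return MPOL_BIND;        /* only these nodes can be used */
      if (strcmp(mode, "preferred") == 0)
          return MPOL_PREFERRED;   /* prefer these nodes, don't fail */
      if (strcmp(mode, "interleave") == 0)
          return MPOL_INTERLEAVE;  /* interleave across these nodes */
      return MPOL_DEFAULT;         /* no mode set: system default policy */
  }

  int main(void)
  {
      return mode_to_mpol("strict") == MPOL_BIND ? 0 : 1;
  }
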

So my idea was that we would add a movable/migratable/whatever attribute that
would tell us which enforcement mechanism to use, because there does not seem
to be a "one size fits all" solution.  Am I misunderstanding this discussion?
Please correct me if I am.  Thank you.

>So I need an option that indicates "I don't specify any mode."
>
>>> On Oct 16, 2020, at 20:34, Zhong, Luyao <luyao.zhong at intel.com> wrote:
>>>
>>> Hi Martin, Peter and other experts,
>>>
>>> We reached a consensus earlier that we need to introduce a new
>>> "migratable" attribute. But during implementation, I found that
>>> introducing a new 'default' option for the existing mode attribute is
>>> still necessary.
>>>
>>> I have an initial patch for 'migratable', and Peter has already given some comments.
>>> https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html
>>>
>>> The current issue is that if 'migratable' is set, any 'mode' should be
>>> ignored. Peter commented that I can't rely on the docs to tell users that
>>> some config is invalid; I need to reject the config in the code, and I
>>> completely agree with that. But the default value of 'mode' is 'strict',
>>> which will always conflict with 'migratable', so in the end I still need
>>> to introduce a new 'mode' option that is a legal config when 'migratable'
>>> is set.
>>>
>>> If we have a 'default' option, is 'migratable' still needed then?
>>>
>>> FYI.
>>> The 'mode' attribute corresponds to a memory policy, and there is already a notion of a default memory policy.
>>>   quote:
>>>     System Default Policy:  this policy is "hard coded" into the kernel.
>>> (https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt)
>>> So it might be easier to understand if we introduce a 'default' option directly.
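
(For reference, that system default policy is what a task falls back to when
it passes an empty node set; a one-call sketch, assuming libnuma's <numaif.h>
and linking with -lnuma:)

  #include <numaif.h>

  int main(void)
  {
      /* NULL nodemask + maxnode 0 == "the empty set of nodes", i.e. the
       * calling task returns to the system default policy. */
      return set_mempolicy(MPOL_DEFAULT, NULL, 0);
  }
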
>>>
>>> Regards,
>>> Luyao
>>>
>>>> On 8/26/2020 6:20 AM, Martin Kletzander wrote:
>>>>> On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
>>>>>
>>>>>
>>>>> On 8/19/2020 11:24 PM, Martin Kletzander wrote:
>>>>>> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Martin Kletzander <mkletzan at redhat.com>
>>>>>>>> Sent: Monday, August 17, 2020 4:58 PM
>>>>>>>> To: Zhong, Luyao <luyao.zhong at intel.com>
>>>>>>>> Cc: libvir-list at redhat.com; Zang, Rui <rui.zang at intel.com>; Michal
>>>>>>>> Privoznik
>>>>>>>> <mprivozn at redhat.com>
>>>>>>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option for
>>>>>>>> attribute mode
>>>>>>>> in numatune
>>>>>>>>
>>>>>>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>>>>>>>>> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>>>>>>>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>>>>>>>>>>> Hi Libvirt experts,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to enhance the numatune snippet configuration. Given an
>>>>>>>>>>>>> example snippet:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently, the mode attribute is either 'interleave', 'strict', or
>>>>>>>>>>>>> 'preferred'. I propose to add a new 'default' option, for the
>>>>>>>>>>>>> reasons given below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Presume we are using cgroups v1: libvirt sets cpuset.mems for all
>>>>>>>>>>>>> vcpu threads according to the 'nodeset' in the memory element, and
>>>>>>>>>>>>> translates each memnode element into qemu config options (--object
>>>>>>>>>>>>> memory-backend-ram) for the corresponding numa cell, which ends up
>>>>>>>>>>>>> invoking the mbind() system call.[1]
>>>>>>>>>>>>>
>>>>>>>>>>>>> But what if we want to use the default memory policy and still have
>>>>>>>>>>>>> each guest numa cell pinned to different host memory nodes? We can't
>>>>>>>>>>>>> use mbind via the qemu config options, because (I quote here) "For
>>>>>>>>>>>>> MPOL_DEFAULT, the nodemask and maxnode arguments must specify
>>>>>>>>>>>>> the empty set of nodes." [2]
>>>>>>>>>>>>>
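
(For reference, a minimal sketch of the mbind(2) constraint quoted above; it
is not part of the patch, and assumes libnuma's <numaif.h> plus -lnuma:)

  #include <numaif.h>
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
      /* Error checks omitted for brevity. */
      size_t len = 2 * 1024 * 1024;
      void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      /* Valid: MPOL_DEFAULT with the empty set of nodes. */
      long ok = mbind(addr, len, MPOL_DEFAULT, NULL, 0, 0);

      /* Fails with EINVAL: MPOL_DEFAULT cannot carry a nodeset, which is
       * exactly why a per-cell 'default' mode cannot go through QEMU. */
      unsigned long nodemask = 1UL << 1;   /* host node 1 */
      long bad = mbind(addr, len, MPOL_DEFAULT, &nodemask,
                       sizeof(nodemask) * 8, 0);

      printf("empty nodeset: %ld, nodeset given: %ld\n", ok, bad);
      return 0;
  }
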
>>>>>>>>>>>>> So my solution is to introduce a new 'default' option for the mode
>>>>>>>>>>>>> attribute, e.g.:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>     <memory mode="default" nodeset="1-2"/>
>>>>>>>>>>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>>>>>>>>>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the mode is 'default', libvirt should avoid generating the qemu
>>>>>>>>>>>>> command line option '--object memory-backend-ram', and instead use
>>>>>>>>>>>>> cgroups to set cpuset.mems for each guest numa cell, combined with
>>>>>>>>>>>>> the numa topology config. Presume the numa topology is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <cpu>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>   <numa>
>>>>>>>>>>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
>>>>>>>>>>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
>>>>>>>>>>>>>   </numa>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>> </cpu>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and '2'
>>>>>>>>>>>>> for vcpus 4-7.
>>>>>>>>>>>>>
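
(A minimal sketch of the cgroup v1 side of this proposal; the per-vcpu cgroup
path below is a hypothetical stand-in for whatever libvirt actually creates
for the domain:)

  #include <stdio.h>

  /* Write a host nodeset into one vcpu group's cpuset.mems file. */
  static int set_vcpu_mems(const char *root, int vcpu, const char *nodes)
  {
      char path[512];
      snprintf(path, sizeof(path), "%s/vcpu%d/cpuset.mems", root, vcpu);
      FILE *f = fopen(path, "w");
      if (!f)
          return -1;
      fprintf(f, "%s\n", nodes);
      return fclose(f);
  }

  int main(void)
  {
      /* Hypothetical domain cgroup path. */
      const char *root = "/sys/fs/cgroup/cpuset/machine/qemu-guest";
      for (int v = 0; v < 4; v++)
          set_vcpu_mems(root, v, "1");  /* guest cell 0 -> host node 1 */
      for (int v = 4; v < 8; v++)
          set_vcpu_mems(root, v, "2");  /* guest cell 1 -> host node 2 */
      return 0;
  }
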
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this reasonable and feasible? Any comments are welcome.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> There are a couple of problems here.  The memory is not (always)
>>>>>>>>>>>> allocated by the vCPU threads.  I also remember it being allocated
>>>>>>>>>>>> not by the process but in KVM, in a way that was not affected by
>>>>>>>>>>>> the cgroup settings.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your reply. Maybe I don't get what you mean; could you
>>>>>>>>>>> give me more context? In any case, what I proposed will have no
>>>>>>>>>>> effect on other memory allocations.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Check how cgroups work.  We can set the memory nodes that a process
>>>>>>>>>> will allocate from.  However, to set the nodes for a particular
>>>>>>>>>> (vCPU) thread, QEMU needs to be started with the vCPU threads
>>>>>>>>>> already spawned (albeit stopped), and by that point QEMU has
>>>>>>>>>> already allocated some memory.  Moreover, if extra memory is
>>>>>>>>>> allocated after we set cpuset.mems, it is not guaranteed to be
>>>>>>>>>> allocated by the vCPU in that NUMA cell; it might be allocated in
>>>>>>>>>> the emulator instead, or in the KVM module in the kernel, in which
>>>>>>>>>> case it might not be accounted to the process actually causing the
>>>>>>>>>> allocation (as we've already seen with Linux).  In all these cases
>>>>>>>>>> cgroups will not do what you want them to do.  The last case might
>>>>>>>>>> be fixed; the first ones are by default not going to work.
>>>>>>>>>>
>>>>>>>>>>>> That might be
>>>>>>>>>>>> fixed now,
>>>>>>>>>>>> however.
>>>>>>>>>>>>
>>>>>>>>>>>> But basically what we are up against is all the reasons why we
>>>>>>>>>>>> started using QEMU's command line arguments for all of this.
>>>>>>>>>>> I'm not proposing to use QEMU's command line arguments; on the
>>>>>>>>>>> contrary, I want to use the cgroups settings to support a new
>>>>>>>>>>> config/requirement. I'm offering a solution for the case where we
>>>>>>>>>>> require the default memory policy together with NUMA memory pinning.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> And I'm suggesting you look at the commit log to see why we *had* to
>>>>>>>>>> add these command line arguments, even though I think I managed to
>>>>>>>>>> describe most of them above already (except for one that _might_
>>>>>>>>>> already be fixed in the kernel).  I understand the git log is huge
>>>>>>>>>> and the code around NUMA memory allocation was changing a lot, so I
>>>>>>>>>> hope my explanation will be enough.
>>>>>>>>>>
>>>>>>>>> Thank you for the detailed explanation; I think I get it now. We can't
>>>>>>>>> guarantee that the memory allocation matches the requirement, since
>>>>>>>>> there is a window of time before cpuset.mems is set.
>>>>>>>>>
>>>>>>>>
>>>>>>>> That's one of the things, although this one could be avoided (by
>>>>>>>> setting a global cgroup before exec()).
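
(A rough sketch of that idea, with a hypothetical cpuset path: the child joins
a cgroup whose cpuset.mems is already constrained before exec(), so even
QEMU's earliest allocations land on the intended nodes:)

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <unistd.h>

  int main(void)
  {
      pid_t pid = fork();
      if (pid == 0) {
          /* Child: join the pre-configured cpuset, then exec QEMU. */
          int fd = open("/sys/fs/cgroup/cpuset/machine/emulator/tasks",
                        O_WRONLY);
          char buf[16];
          int n = snprintf(buf, sizeof(buf), "%d\n", (int)getpid());
          if (fd >= 0 && write(fd, buf, n) == n)
              execlp("qemu-system-x86_64", "qemu-system-x86_64",
                     "-display", "none", (char *)NULL);
          _exit(1);
      }
      return 0;
  }
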
>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Luyao
>>>>>>>>>>>> Sorry, but I think it will more likely break rather than fix stuff.
>>>>>>>>>>>> Maybe this could be dealt with by a switch in `qemu.conf` with a
>>>>>>>>>>>> huge warning above it.
>>>>>>>>>>>>
>>>>>>>>>>> I'm not trying to fix something; I'm proposing how to support a new
>>>>>>>>>>> requirement, just as I stated above.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I guess we should take a couple of steps back; I don't get what you
>>>>>>>>>> are trying to achieve.  Maybe if you describe your use case it will
>>>>>>>>>> be easier to reach a conclusion.
>>>>>>>>>>
>>>>>>>>> Yeah, I do have a use case I didn't mention before. It's a kernel
>>>>>>>>> feature that is not merged yet; we call it memory tiering.
>>>>>>>>> (https://lwn.net/Articles/802544/)
>>>>>>>>>
>>>>>>>>> If memory tiering is enabled on the host, DRAM is the top-tier memory
>>>>>>>>> and PMEM (persistent memory) is the second tier; PMEM shows up as a
>>>>>>>>> numa node without cpus. In short, pages can be migrated between DRAM
>>>>>>>>> and PMEM based on DRAM pressure and how cold/hot they are.
>>>>>>>>>
>>>>>>>>> We can configure multiple memory migration paths. For example, with
>>>>>>>>> node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we can make
>>>>>>>>> 0+2 one group and 1+3 another. Within each group, pages are allowed
>>>>>>>>> to migrate down (demotion) and up (promotion).
>>>>>>>>>
>>>>>>>>> If **we want our VMs to utilize memory tiering and have a NUMA
>>>>>>>>> topology**, we need to handle the mapping of guest memory to host
>>>>>>>>> memory; that means we need to bind each guest numa node to a group of
>>>>>>>>> host memory nodes (DRAM node + PMEM node). For example, guest node 0
>>>>>>>>> -> host nodes 0+2.
>>>>>>>>>
>>>>>>>>> However, only the cgroups setting makes memory tiering work; if we
>>>>>>>>> use the mbind() system call, demoted pages will never go back to DRAM.
>>>>>>>>> That's why I propose to add a 'default' option and bypass mbind in QEMU.
>>>>>>>>>
>>>>>>>>> I hope I have made myself clear. I'd appreciate any suggestions you
>>>>>>>>> could give.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This comes around every couple of months/years and bites us in the back
>>>>>>>> no matter which way we go (every time there is someone who wants it the
>>>>>>>> other way).  That's why I think there could be a way for the user to
>>>>>>>> specify whether they will likely move the memory or not, and based on
>>>>>>>> that we would specify `host-nodes` and `policy` to qemu or not.  I
>>>>>>>> think I even suggested this before (or probably delegated it to someone
>>>>>>>> else for a suggestion so that there would be more discussion), but
>>>>>>>> nobody really replied.
>>>>>>>>
>>>>>>>> So what we need, I think, is a way for someone to set per-domain
>>>>>>>> information on whether we should bind the memory to nodes in a
>>>>>>>> changeable fashion or not.  I'd like to have that in as well.  The way
>>>>>>>> to do it is, probably, per-domain, because adding yet another switch to
>>>>>>>> each place in the XML where we can select a NUMA memory binding would
>>>>>>>> be suicide.  There should also be no need for this to be enabled per
>>>>>>>> memory module or node, so it should work fine.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks for letting us know your vision on this.
>>>>>>> From what I understood, the "changeable fashion" means that the guest
>>>>>>> numa cell binding can be changed out of band after the initial binding,
>>>>>>> either by the system admin or the operating system (memory tiering in
>>>>>>> our case), or whatever the third party is.  Is that perception correct?
>>>>>>
>>>>>> Yes.  If the user wants to have the possibility of changing the binding,
>>>>>> then we use *only* cgroups.  Otherwise we use the qemu parameters that
>>>>>> will make qemu call mbind() (as that has other pros mentioned above).
>>>>>> The other option would be extra communication between QEMU and libvirt
>>>>>> during start to let us know when to set what cgroups etc., but I don't
>>>>>> think that's worth it.
>>>>>>
>>>>>>> It seems to me the mbind() or set_mempolicy() system calls do not offer
>>>>>>> that flexibility of changing afterwards. So in the case of QEMU/KVM, I
>>>>>>> can only think of cgroups.
>>>>>>> To be specific, if this additional "memory_binding_changeable" option
>>>>>>> were specified, we would try to do the guest numa constraining via
>>>>>>> cgroups whenever possible. There will probably also be conflicts
>>>>>>> between options, or things that cgroups cannot do. In such cases we'd
>>>>>>> fail the domain.
>>>>>>
>>>>>> Basically we'll do what we're doing now and skip the qemu `host-nodes`
>>>>>> and `policy` parameters with the new option.  And of course we can fail
>>>>>> with a nice error message if someone wants to move the memory without
>>>>>> the option selected, and so on.
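
(To make that concrete, a small sketch of the two shapes of generated
arguments; the id and size values are made up, and only the presence or
absence of host-nodes/policy matters here:)

  #include <stdio.h>

  /* With the new option set we would emit only the plain backend object and
   * rely on cgroups; without it we keep today's host-nodes/policy properties
   * so that QEMU itself calls mbind(). */
  static void print_backend(int cell, const char *nodeset, int changeable)
  {
      if (changeable)
          printf("-object memory-backend-ram,id=ram-node%d,size=512M\n",
                 cell);
      else
          printf("-object memory-backend-ram,id=ram-node%d,size=512M,"
                 "host-nodes=%s,policy=bind\n", cell, nodeset);
  }

  int main(void)
  {
      print_backend(0, "1", 0);   /* today's behaviour */
      print_backend(0, "1", 1);   /* with the proposed option */
      return 0;
  }
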
>>>>>
>>>>> Thanks for your comments.
>>>>>
>>>>> I'd like to get clearer about defining the interface in the domain xml;
>>>>> then I can go further into the implementation.
>>>>>
>>>>> As you mentioned, a per-domain option will be better than per-node. I
>>>>> went through the libvirt domain format to look for a proper place for
>>>>> this option, and I'm thinking we could still use the numatune element
>>>>> for the configuration.
>>>>>
>>>>> <numatune>
>>>>>    <memory mode="strict" nodeset="1-4,^3"/>
>>>>>    <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>    <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>> </numatune>
>>>>>
>>>>> Coincidentally, the optional memory element specifies how to allocate
>>>>> memory for the domain process on a NUMA host. So could we use this
>>>>> element and introduce a new mode like "changeable", or whatever? Do
>>>>> you have a better name?
>>>>>
>>>> Yeah, I was thinking something along the lines of:
>>>> <numatune>
>>>>     <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no" />
>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>> </numatune>
>>>>> If the memory mode is set to 'changeable', we could ignore the mode
>>>>> setting for each memnode and then configure only via cgroups. I haven't
>>>>> dived into the code yet, but I expect it would work.
>>>>>
>>>> Yes, the example above gives the impression of the attribute being
>>>> available per-node, but that could be handled in the documentation.
>>>> Specifying it per-node seems very weird: why would you want the memory
>>>> to be hard-locked, but only for some guest nodes?
>>>>> Thanks,
>>>>> Luyao
>>>>>
>>>>>>
>>>>>>> If you agree with the direction, I think we can dig deeper to see what
>>>>>>> will come out.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Zang, Rui
>>>>>>>
>>>>>>>
>>>>>>>> Ideally we'd discuss it with others, but I think I am one of only a
>>>>>>>> few people who have dealt with issues in this regard.  Maybe Michal
>>>>>>>> (Cc'd) also dealt with some things related to the binding, so maybe
>>>>>>>> he can chime in.
>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> Luyao
>>>>>>>>>
>>>>>>>>>>>> Have a nice day,
>>>>>>>>>>>> Martin
>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2] https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>
>