[libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune

Zhong, Luyao luyao.zhong at intel.com
Thu Apr 1 02:57:59 UTC 2021



>-----Original Message-----
>From: Martin Kletzander <mkletzan at redhat.com>
>Sent: Wednesday, March 31, 2021 5:37 PM
>To: Zhong, Luyao <luyao.zhong at intel.com>
>Cc: Daniel P. Berrangé <berrange at redhat.com>; libvir-list at redhat.com
>Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
>
>On Wed, Mar 31, 2021 at 06:33:28AM +0000, Zhong, Luyao wrote:
>>
>>
>>>-----Original Message-----
>>>From: Martin Kletzander <mkletzan at redhat.com>
>>>Sent: Wednesday, March 31, 2021 12:21 AM
>>>To: Zhong, Luyao <luyao.zhong at intel.com>
>>>Cc: Daniel P. Berrangé <berrange at redhat.com>; libvir-list at redhat.com
>>>Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in
>>>numatune
>>>
>>>On Tue, Mar 30, 2021 at 08:53:19AM +0000, Zhong, Luyao wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Martin Kletzander <mkletzan at redhat.com>
>>>>> Sent: Thursday, March 25, 2021 10:28 PM
>>>>> To: Daniel P. Berrangé <berrange at redhat.com>
>>>>> Cc: Zhong, Luyao <luyao.zhong at intel.com>; libvir-list at redhat.com
>>>>> Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode
>>>>> in numatune
>>>>>
>>>>> On Thu, Mar 25, 2021 at 02:14:47PM +0000, Daniel P. Berrangé wrote:
>>>>> >On Thu, Mar 25, 2021 at 03:10:56PM +0100, Martin Kletzander wrote:
>>>>> >> On Thu, Mar 25, 2021 at 09:11:02AM +0000, Zhong, Luyao wrote:
>>>>> >> >
>>>>> >> >
>>>>> >> > > -----Original Message-----
>>>>> >> > > From: Martin Kletzander <mkletzan at redhat.com>
>>>>> >> > > Sent: Thursday, March 25, 2021 4:46 AM
>>>>> >> > > To: Daniel P. Berrangé <berrange at redhat.com>
>>>>> >> > > Cc: Zhong, Luyao <luyao.zhong at intel.com>;
>>>>> >> > > libvir-list at redhat.com
>>>>> >> > > Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive'
>>>>> >> > > mode in numatune
>>>>> >> > >
>>>>> >> > > On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P. Berrangé wrote:
>>>>> >> > > >On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
>>>>> >> > > >> Before this patch set, numatune only has three memory modes:
>>>>> >> > > >> strict, interleave and preferred. These memory policies
>>>>> >> > > >> are ultimately set by the mbind() system call.
>>>>> >> > > >>
>>>>> >> > > >> Memory policy could be 'hard coded' into the kernel, but
>>>>> >> > > >> none of the above policies fits our requirement in this case.
>>>>> >> > > >> mbind() supports the default memory policy, but it requires a
>>>>> >> > > >> NULL nodemask. So obviously setting the allowed memory nodes
>>>>> >> > > >> is cgroups' mission in this case.
>>>>> >> > > >> So we introduce a new option for mode in numatune named
>>>>> >> > > >> 'restrictive'.
>>>>> >> > > >>
>>>>> >> > > >> <numatune>
>>>>> >> > > >>    <memory mode="restrictive" nodeset="1-4,^3"/>
>>>>> >> > > >>    <memnode cellid="0" mode="restrictive" nodeset="1"/>
>>>>> >> > > >>    <memnode cellid="2" mode="restrictive" nodeset="2"/>
>>>>> >> > > >> </numatune>
>>>>> >> > > >
>>>>> >> > > >'restrictive' is rather a weird name and doesn't really
>>>>> >> > > >tell me what the memory policy is going to be. As far as I
>>>>> >> > > >can tell from the patches, it seems this causes us to not
>>>>> >> > > >set any memory allocation policy at all. IOW, we're using
>>>>> >> > > >some undefined host default policy.
>>>>> >> > > >
>>>>> >> > > >Given this I think we should be calling it either "none" or "default"
>>>>> >> > > >
>>>>> >> > >
>>>>> >> > > I was against "default" because having such option possible,
>>>>> >> > > but the actual default being different sounds stupid.
>>>>> >> > > Similarly "none" sounds like no restrictions are applied or
>>>>> >> > > that it is the same as if nothing was specified.  It is
>>>>> >> > > funny to imagine the situation when I am explaining to
>>>>> >> > > someone how to achieve this solution:
>>>>> >> > >
>>>>> >> > >    "The default is 'strict', you need to explicitly set it to 'default'."
>>>>> >> > >
>>>>> >> > > or
>>>>> >> > >
>>>>> >> > >    "What setting did you use?"
>>>>> >> > >    "None"
>>>>> >> > >    "As in no mode or in mode='none'?"
>>>>> >> > >
>>>>> >> > > As I said before, please come up with any name, but not
>>>>> >> > > these that are IMHO actually more confusing.
>>>>> >> > >
>>>>> >> >
>>>>> >> > Hi Daniel and Martin, thanks for your reply, just as Martin
>>>>> >> > said current default mode is "strict", so "default" was
>>>>> >> > deprecated at the beginning when I proposed this change.  And
>>>>> >> > actually we have cgroups restricting the memory resource so
>>>>> >> > could we call this a "none" mode? I still don't have a better
>>>>> >> > name. ☹
>>>>> >> >
>>>>> >>
>>>>> >> Me neither as figuring out the names when our names do not
>>>>> >> precisely map to anything else (since we are using multiple
>>>>> >> solutions to get as close to the desired result as possible) is
>>>>> >> difficult because there is no similar pre-existing setting.  And
>>>>> >> using anything like "cgroups-only" would limit us in the future,
>>>>> >> probably.
>>>>> >
>>>>> >What I'm still really missing in this series is a clear statement
>>>>> >of what the problem with the current modes is, and what this new
>>>>> >mode provides to solve it. The documentation for the new XML
>>>>> >attribute is not clear on this and neither are the commit
>>>>> >messages. There's a pointer to an enormous mailing list thread,
>>>>> >but reading through 50 messages is not a viable way to learn the
>>>>> >answer.
>>>>> >
>>>>> >I'm not even certain that we should be introducing a new mode
>>>>> >value at all, as opposed to a separate attribute.
>>>>> >
>>>>>
>>>>> Yes, Luyao, could you summarize the reason for the new mode?  I
>>>>> think that the difference in behaviour between using cgroups and
>>>>> memory binding as opposed to just using cgroups should be enough
>>>>> for others to be able to figure out when to use this mode and when not.
>>>>>
>>>>Sure.
>>>>Let me give a concrete use case first. There is a new kernel feature,
>>>>not merged yet, that we call memory tiering
>>>>(https://lwn.net/Articles/802544/). If memory tiering is enabled on
>>>>the host, DRAM is top-tier memory and PMEM (persistent memory) is
>>>>second-tier memory; PMEM shows up as a NUMA node without CPUs. Pages
>>>>can be migrated between the DRAM node and the PMEM node based on DRAM
>>>>pressure and how cold/hot they are. *This memory policy* is implemented
>>>>in the kernel. So we need a default mode here, but from libvirt's
>>>>perspective the "default" mode is "strict", not the MPOL_DEFAULT
>>>>(https://man7.org/linux/man-pages/man2/mbind.2.html) defined in the kernel.
>>>>Besides, to make memory tiering work well, the cgroups setting is
>>>>necessary, since it restricts page migration to the DRAM and PMEM
>>>>nodes that we specified (NUMA affinity support).
>>>>
>>>>Apart from the use case above, we might have some scenarios that only
>>>>require the cgroups restriction.
>>>>That's why "restrictive" mode is proposed.
>>>>
>>>>In short, if a user requires the default mode (MPOL_DEFAULT) and requires
>>>>cgroups to restrict memory allocation, "restrictive" mode will be useful.
>>>>
>>>
>>>Yeah, I also seem to recall something about the fact that just using
>>>cgroups with multiple nodes in the nodeset makes kernel decide on
>>>which node (out of those in the restricted set) to allocate on, but
>>>specifying "strict" basically allocates it sequentially (on the first
>>>one until it is full, then on the next one and so on). I do not have
>>>anything to back this, so do you remember if that was the case as well,
>>>or does my memory serve me poorly?
>>>
>>Yeah, exactly. 😊
>>
>>cpuset.mems just specifies the list of memory nodes on which the processes
>>are allowed to allocate memory.
>>https://man7.org/linux/man-pages/man7/cpuset.7.html
>>
>>This link gives a detailed introduction of "strict" mode:
>>https://man7.org/linux/man-pages/man2/mbind.2.html
>>
>
>So, the behaviour I remembered was the case before Linux 2.6.26, not any more.
>But anyway there are still some more differences:
>
Not only before 2.6.26: it still allocates sequentially after 2.6.26. The change is just from
"based on node id" to "based on distance", I think.
>- The default setting uses the system default memory policy, which is the
>   same as 'bind' most of the time.  It is closer to 'interleave'
>   during the system boot (which does not concern us), but the fact that
>   it is the same as 'bind' might change in the future (as Luyao said).
>
>- If we change the memory policy (what happens with 'strict') then we
>   cannot change that later on as only the threads can change the
>   nodemask (or the policy) for themselves.  AFAIK QEMU does not provide
>   an API for this, neither should it have the permissions to do it.
>   We, however, can do that if we just use cgroups.  And 'virsh numatune'
>   already provides that for the whole domain (we just don't have an API
>   to do that per memory).
>
>These should definitely be noted in the documentation and, ideally, hinted at in
>the commit message as well.  I just do not know how to do that nicely without
>just pointing to the libnuma man pages.
>
Yes, the current doc is not clear enough. I'll try my best to explain the new mode
in a later patch update.
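
As a starting point for that explanation, here is a rough Python sketch (not
libvirt code) of how the example from the cover letter can be read. It expands
a nodeset string such as "1-4,^3" into node IDs and summarizes what 'strict'
and 'restrictive' would apply, as discussed in this thread. The names
parse_nodeset(), MODE_POLICY and describe() are made up for illustration,
exclusions are assumed to be single IDs applied in order, and the mapping of
'strict' to MPOL_BIND is an assumption on top of what the thread states.

```python
def parse_nodeset(spec):
    """Expand a nodeset string such as "1-4,^3" into a sorted list of node IDs."""
    nodes = set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("^"):
            # Assumption: "^N" removes a single node that was added earlier.
            nodes.discard(int(item[1:]))
        elif "-" in item:
            first, last = (int(part) for part in item.split("-", 1))
            nodes.update(range(first, last + 1))
        else:
            nodes.add(int(item))
    return sorted(nodes)


# Hypothetical summary of the two modes as described in this thread.
# Both restrict allocation through cgroups (cpuset.mems); only the
# memory policy differs.
MODE_POLICY = {
    "strict": "MPOL_BIND",          # assumption: memory binding on top of cgroups
    "restrictive": "MPOL_DEFAULT",  # cgroups only, kernel default policy
}


def describe(mode, nodeset):
    """Return a one-line summary of what a mode would apply for a nodeset."""
    nodes = ",".join(str(node) for node in parse_nodeset(nodeset))
    return f"{mode}: policy {MODE_POLICY[mode]}, cpuset.mems={nodes}"


print(describe("restrictive", "1-4,^3"))
```

With the nodeset from the example, both modes end up restricted to nodes 1, 2
and 4; the difference is only whether a memory policy is set in addition.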

@Daniel P. Berrangé, do you still have concerns about what this
mode is for, and do you have any suggestions about naming this mode?

>Thoughts?
>
>>>>BR,
>>>>Luyao
>>>>
>>>>> >Regards,
>>>>> >Daniel
>>>>> >--
>>>>> >|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>>>>> >|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>>>>> >|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



