[libvirt] [RFCv2 PATCH 3/5] conf: rename cachetune to restune

bing.niu bing.niu at intel.com
Fri Jul 6 06:22:17 UTC 2018



On 07/06/2018 06:40, John Ferlan wrote:
> 
> 
> On 07/03/2018 12:10 AM, bing.niu wrote:
>> Hi John,
>> Thanks for reviewing! Since the major questions are in this thread, I
>> think we can start from it.
>>
>>> On 06/30/2018 06:47, John Ferlan wrote:
>>>
>>>
>>> On 06/15/2018 05:29 AM, bing.niu at intel.com wrote:
>>>> From: Bing Niu <bing.niu at intel.com>
>>>>
>>>> resctrl not only supports cache tuning, but also memory bandwidth
>>>> tuning. Rename cachetune to restune(resource tuning) to reflect
>>>> that.
>>>>
>>>> Signed-off-by: Bing Niu <bing.niu at intel.com>
>>>> ---
>>>>    src/conf/domain_conf.c  | 44
>>>> ++++++++++++++++++++++----------------------
>>>>    src/conf/domain_conf.h  | 10 +++++-----
>>>>    src/qemu/qemu_process.c | 18 +++++++++---------
>>>>    3 files changed, 36 insertions(+), 36 deletions(-)
>>>>
>>>
>>> Not so sure I'm liking [R|r]estune[s]... Still @cachetunes is an array
>>> of a structure that contains a vCPU bitmap and a virResctrlAllocPtr for
>>> the /cputune/cachetunes data.  The virResctrlAllocPtr was changed to add
>>> the bandwidth allocation, so does this really have to change?
>>>
>> The cachetunes from Cache Allocation Technology (CAT) and memorytunes
>> from Memory Bandwidth Allocation (MBA) are both features of Resource
>> Director Technology. RDT is a collection of resource distribution
>> technologies; right now it includes CAT and MBA. Detailed information
>> can be found in 17.18.1 of Intel's SDM:
>> https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
>>
> 
> tl;dr :-)
> 
> So some day soon it'll be more complicated and have even more rules?!?!
Hopefully not, and Intel does backward compatibility well. :)
So the overall programming model should stay stable.

> 
>>
>> The RDT capabilities and configuration methods are exposed to user land
>> by the kernel in the form of the resctrl file system. The basic
>> programming model for the resctrl fs interface is as follows:
>>
>> 1. Create a directory under /sys/fs/resctrl. This directory is called a
>> resctrl group.
>>
>> 2. Write the desired CAT and MBA config to the file
>> /sys/fs/resctrl/<YOUR_GROUP_NAME>/schemata to allocate cache or
>> memory bandwidth. You can set CAT and MBA together or either one alone.
>>
>> 3. Move the threads you want into that resctrl group by writing their
>> PIDs to the /sys/fs/resctrl/<YOUR_GROUP_NAME>/tasks file, so that those
>> threads are assigned the resources allocated in step 2. Those resources
>> are shared by all threads in the group.
>>
>> *NOTE*: A thread can be put in only one resctrl group. This requirement
>> follows from how RDT and resctrl work.
> 
> IOW: A vCPU or the vCPUs being used by CAT and MBA would need to be the
> same because in "this world" the group *is* the single vCPU or the range
> of vCPUs.

Yup! A vCPU is actually a QEMU thread from the host's point of view.
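For illustration, the whole flow above boils down to roughly the following
minimal sketch (not libvirt code; the group name, schemata values and the
PID are made up, and error handling is omitted):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    FILE *f;

    /* 1. create a resctrl group */
    mkdir("/sys/fs/resctrl/my_group", 0755);

    /* 2. write the desired CAT and/or MBA config to its schemata file */
    f = fopen("/sys/fs/resctrl/my_group/schemata", "w");
    fprintf(f, "L3:0=ff;1=ff\nMB:0=20;1=20\n");
    fclose(f);

    /* 3. move a thread (e.g. a QEMU vCPU thread) into the group */
    f = fopen("/sys/fs/resctrl/my_group/tasks", "w");
    fprintf(f, "%d\n", 12345);   /* made-up vCPU thread PID */
    fclose(f);

    return 0;
}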
> 
>> Each resctrl group has a closid. The resource configuration from step 2
>> above is written to that closid's MSR (IA32_L?_QoS) to describe the
>> resource allocation for that closid.
>> A thread then uses this closid to claim the allocated resources during a
>> context switch (scheduled_in()) in the kernel, by writing the closid to
>> the MSR IA32_PQR_ASSOC.
>> With the closid written to IA32_PQR_ASSOC, the RDT hardware knows the
>> currently running closid and its allocated resources from that closid's
>> IA32_L?_QoS MSR. That is why a thread can be put into only one resctrl
>> group.
>>
>> Details description of resctrl can be found:
>> https://github.com/torvalds/linux/blob/v4.17/Documentation/x86/intel_rdt_ui.txt
>>
>>
>> :)
> 
> More lengthy things that I didn't read...  Thanks for the pointers. I
> think keeping track of all the rules and relationships is going to be a
> real "challenge".

Yes, that is a real "challenge". However, the kernel usually provides a
consistent and unified interface, which will help a lot. We should be
able to support new features by extending the existing logic.
> 
>>
>> Based on the above information, my understanding is that
>> virResctrlAllocPtr is not bound only to cachetunes. It should be a
>> backing object describing a resctrl group, and the current virresctrl
>> class already works that way.
>> Libvirt creates one resctrl group for each virResctrlAllocPtr through
>> the virResctrlAllocCreate() interface.
>>
>> So from a component view, the big picture is something like below.
> 
> Yay, pictures!  Thanks for taking the time.
> 
>>
>>
>>                 <memorytune vcpus='0-3'>
>>                     +---------+
>>                               |    <cachetune vcpus='0-3'>
>>         XML                   |           +
>>        parser                 +-----------+
>>                               |
>>                               |
>>                  +------------------------------+
>>                               |
>>                               |
>> internal resctrl      +------v--------------+
>> backing object        |  virResctrlAllocPtr |
>>                        +------+--------------+
>>                               |
>>                               |
>>                  +------------------------------+
>>                               |
>>                            +--v-------+
>>                            |          |
>>                            | schemata |
>>           /sys/fs/resctrl  | tasks    |
>>                            |   .      |
>>                            |   .      |
>>                            |          |
>>                            +----------+
>>
>> Does this make sense? :)
>>
> 
> Yes, and I think this further has me believing that the vCPUs in this
> case are the groups and perhaps become the internal "key" for all this.
> I didn't go look, but have to assume multiple cachetune entries create
> multiple groups into which the vCPUs for the cachetune entries are
> placed. So far it's been easy because there's no way to overlap cachetune
> vCPU values (my assumption, I didn't look at the code).
> 
> Now we add memtune and dictate that if it uses a vCPU or set of vCPUs
> that it either matches the existing cachetune entries or is itself
> unique and doesn't overlap any cachetune.
> 
> Assume the following:
> 
>    <cachetune vcpus='0-1'>
>    <cachetune vcpus='2-4'>
>    <cachetune vcpus='5-7'>
> 
> A valid memtune must also be '0-1', '2-4', '5-7', or any new grouping
> from 8->maxvcpus, such as '8-11'.
> 
> Conversely, if <memtune vcpus='0-4'> was used, that's invalid. Similarly
> <memtune vcpus='3'> would be invalid. For the former, it's crossing two
> boundaries and for the latter it's "part of" some other boundary.
> 
> Hopefully that's right...
You are definitely right! That is how it is implemented in 
virDomainMemorytuneDefParse() of [RFCv2 PATCH 4/5], as shown below.

+    for (i = 0; i < def->nrestunes; i++) {
+        /* The vcpus group has already been created; directly reuse the
+         * existing one and just update its memory allocation information.
+         */
+        if (virBitmapEqual(def->restunes[i]->vcpus, vcpus)) {
+            alloc = def->restunes[i]->alloc;
+            break;
+        }
+
+        /* vcpus cannot overlap */
+        if (virBitmapOverlaps(def->restunes[i]->vcpus, vcpus)) {
+            virReportError(VIR_ERR_XML_ERROR, "%s",
+                           _("Overlapping vcpus in memorytunes"));
+            goto cleanup;
+        }
+    }
+    if (!alloc) {
+        alloc = virResctrlAllocNew();
+        new_alloc = true;
+    }

> 
>>> I think the question is - if both cachetunes and memtunes exist in
>>> domain XML, then for which does the @vcpus relate to. That is, from the
>>> cover there's:
>>>
>>>       <memorytune vcpus='0-1'>
>>>         <node id='0' bandwidth='20'/>
>>>       </memorytune>
>>>
>>> and then there's a
>>>
>>>       <cachetune vcpus='0-3'>
>>>         <cache id='0' level='3' type='both' size='3' unit='MiB'/>
>>>         <cache id='1' level='3' type='both' size='3' unit='MiB'/>
>>>       </cachetune>
>>>
>>> Now what?  I haven't looked at patch 4 yet to even know how this "works".
>>
>> I also raised this concern in the first round of review. Ideally, if we
>> could have a domain XML like this:
>>
>> <resourcetune vcpu='0-3'>
>>       <memorytune>
>>          <node id='0' bandwidth='20'/>
>>       </memorytune>
>>        <cachetune>
>>          <cache id='0' level='3' type='both' size='3' unit='MiB'/>
>>          <cache id='1' level='3' type='both' size='3' unit='MiB'/>
>>        </cachetune>
>> </resourcetune>
>>
>> That would be clearer. The above domain XML means: one resctrl group
>> with memory bandwidth 20 at node 0, and 3MiB of L3 cache at node 0 and
>> node 1; vcpus '0-3' are put into this resctrl group so those resources
>> are shared among the vcpus.
>> However, Pavel pointed out that we must keep backward compatibility, and
>> we need to keep <cachetune vcpus='0-3'>.
> 
> I think I was right above based on this part...
> 
> Unfortunately forward thinking about memorytune didn't happen, so what
> you propose above is, I think, what we end up with "internally" once
> we've read the <cachetune> and <memorytune> elements. And yes, it makes
> sense why you renamed things now and why the vcpus are essentially
> reused between the two.
Indeed. It's hard to leave room for feature extension without enough 
information. Introducing a parent element <resourcetune> for the single 
child <cachetune>, without <memorytune>, would also confuse people.

> 
> Not a fan of restunes; RsrcDir is closer, but not a perfect shorter name
> for ResourceDirectory. Also, rather than essentially cut-copy-paste the
> vcpus logic from cachetunes into memtunes, that should be extracted
> into its own helper.
> 
OK. I will pack the vcpus logic into a helper function in the next 
version; see the sketch below.
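Something along these lines, purely as a sketch based on the parsing loop
quoted above (the name virDomainResctrlVcpuMatch and the error message are
made up for illustration):

static int
virDomainResctrlVcpuMatch(virDomainDefPtr def,
                          virBitmapPtr vcpus,
                          virResctrlAllocPtr *alloc)
{
    size_t i;

    for (i = 0; i < def->nrestunes; i++) {
        /* Reuse an existing group with exactly the same vCPU list. */
        if (virBitmapEqual(def->restunes[i]->vcpus, vcpus)) {
            *alloc = def->restunes[i]->alloc;
            return 0;
        }

        /* Partially overlapping vCPU lists are rejected. */
        if (virBitmapOverlaps(def->restunes[i]->vcpus, vcpus)) {
            virReportError(VIR_ERR_XML_ERROR, "%s",
                           _("Overlapping vcpus in resource tuning"));
            return -1;
        }
    }

    /* No matching group found; the caller creates a new allocation. */
    return 0;
}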

>>
>> In RFC v1, I also proposed putting memory bandwidth into the cachetune
>> section, like:
>>    <cachetune vcpus='0-3'>
>>      <cache id='0' level='3' type='both' size='3' unit='MiB'/>
>>      <node id='0' bandwidth='20'/>
>>    </cachetune>
>>
>> The maintainer also commented on this: describing memory bandwidth
>> inside a cachetune section is "not a clear XML definition". That's why
>> we came up with a separate memorytune section.
> 
> I didn't go back to the RFC when I reviewed. I tried to have a fresh
> look. After scanning Pavel's comments now though, I do agree having a
> <node> inside a <cachetune> to go along with a <cache> element is a bit
> confusing. But I do now see where the "math" is for the total bandwidth
> that can be used - it's the per node value.  Something that I didn't see
> during my first pass.  Of course I may have realized that had the
> example in the commit had more than 1 vCPU and 1 node.
Great! So we all agree to make <memtune> a separate section.
Besides more than 1 vCPU and 1 node, there is also another possibility: 
put the host supporting threads (QEMU I/O thread and monitor thread) into 
another group, so that memory bandwidth can be distributed between the 
vCPUs and the host supporting threads.
I will put the previous review link in the cover letter for easy 
reference in the future.
> 
>> In your example, the current implementation will stop creating the
>> resctrl group and throw an error. The vcpu lists cannot overlap between
>> memorytune and cachetune, but they can be the same, since a vcpu can be
>> put into only one resctrl group.
>>
>> My understanding is that using <resourcetune> to describe cachetune and
>> memorytune together can be a long-term goal, since that requires
>> deprecating cachetune's vcpu list, and I am not sure about the procedure
>> for removing existing XML elements; maybe you can point it out.
>> For the short term, we can use this separate <memorytune vcpus='0-1'>
>> section. What do you think? :)
>>
> 
> There is no deprecation possible.  You will always have to support the
> cachetune as a direct child of cputune. If you "fix" things up such that
> you introduce some other means to have a cputune/resourcetune/cachetune
> working before you introduce memtune as a child of resourcetune, then
> that works. But you'll have to know whether you read cputune/cachetune
> and then format out in that manner.  See how the <auth> element has been
> handled as a "child" of the disk element originally, but now possible as
> a child of the source element of the disk.  It's possible, but messy.
Hmmm..., I just learned about the <auth> part. That cross-reference 
between disk/auth and source/auth does make the code complex and not so 
straightforward.
So it's better that we keep cputune/cachetune and introduce cputune/memtune.

> 
>>
>>> I think what you want to do is create a virDomainMemtuneDef that mimics
>>> virDomainCachetuneDef, but I'm not 100% sure.
>>>
>>> Of course that still leaves me wondering how much of virResctrlAllocPtr
>>> then becomes wasted. You chose an integration point, but essentially
>>> have separate technologies. I'm trying to make sense of it all and am
>>> starting to get lost in the process of doing that.
>>
>> From the resctrl group perspective, I don't think extending
>> virResctrlAllocPtr with memory bandwidth information is a waste. I think
>> this is a feature extension.  :)
>>
> 
> Now that the picture is becoming clearer, I can see your point.
Great~ :)
> 
>>>
>>> If @bandwidth can not be > 100... Can we duplicate the @id somehow? So
>> Something like
>>   <cachetune vcpus='0-3'>
>>          <cache id='0' level='3' type='both' size='3' unit='MiB'/>
>>          <node id='0' bandwidth='20'/>
>>   </cachetune>
>>
>> or
>>
>>   <cachetune vcpus='0-3'>
>>          <cache id='0' level='3' type='both' size='3' unit='MiB'
>> bandwidth='20'/>
>>   </cachetune>
>> ???
>> That also works. This is something I did in RFC v1, but there was a
>> comment saying it's better not to mix memory bandwidth with cachetune,
>> as that may confuse people. :(
>>
> 
> Right - as noted. I didn't read that originally. I jumped into a review
> for this series mainly because no one else had started reviewing it.
Thanks for jumping in. I have been waiting for quite a while. :(
Your comments are absolutely helpful.

> 
>>> how does that free buffer that got 100% and then gets subtracted from
>>> really work in this model? I probably need to study the code some more,
>>> but it's not clear it requires using the "same" sort of logic that
>>> cachetune uses where there could be different types (instructs, data, or
>>> both) that would be drawing from the size, right? I think that's how
>>> that worked.  So what's the corollary for memtunes? You set a range from
>>> 0 to 100, right and that's it.  OK, too many questions and it's late so
>>> I'll stop here...  Hopefully I didn't leave any stray thoughts in the
>>> review of patch 1 or 2.
>>
>> The policy we use for memory bandwidth is consistent with the existing
>> cachetune. The current cache allocation logic forbids cache allocations
>> from overlapping; this is to prevent resource contention, I think. The
>> same strategy applies to the memory allocation part, since total
>> bandwidth beyond 100 will lead to bandwidth competition.
> 
> I think I've explained what I missed above... Essentially your example
> had 1 vCPU and thus 1 node and so it didn't make sense about the 100%.
> Of course if there were 3 vCPU's and 3 nodes each having some
> percentage, yeah then it would have been more obvious.
> 
> My suggestion, let's try to get through reworking how the cachetunes
> data is stored. Some sort of common API to "add" a new vCPU's value that
> can be used by cachetune now and eventually memtune later. The vCPUs
> being the obvious key rather than cachetune.
The current cachetunes data is stored in a pointer-based fashion. The 
advantage of that is that a NULL pointer means there is no cachetune 
information for a particular cache level/type. The vCPUs are stored in 
virDomainCachetuneDef in domain_conf, and the overlap checking happens in 
the domain_conf part. The reason the vCPUs are not part of virresctrl is 
probably that domain_conf needs this information to format the domain's 
XML for storage.
As you suggested, I changed virDomainCachetuneDef --> 
virDomainRestuneDef in this patch, since it will hold information for 
both cachetune and memtune (see the rough sketch below). And yes, we can 
revisit virresctrl to extract the common parts of cachetune and memtune.
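Per the description earlier in this thread (a vCPU bitmap plus a
virResctrlAllocPtr per group), the renamed structure looks roughly like
this; treat it as a sketch rather than the exact patch contents:

typedef struct _virDomainRestuneDef virDomainRestuneDef;
typedef virDomainRestuneDef *virDomainRestuneDefPtr;
struct _virDomainRestuneDef {
    virBitmapPtr vcpus;        /* vCPUs placed into this resctrl group */
    virResctrlAllocPtr alloc;  /* cache and/or memory bandwidth allocation */
};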
> 
> Fair warning - I have some days out of the office coming up over the
> next week or so.
Thanks for letting me know. I think we are aligned based on the above 
discussion. I will reply to your comments on PATCH 1, PATCH 2 and PATCH 4 
and revise, especially the docs/schemas/*.rng you pointed out. At the same 
time, we can see whether Pavel will comment as well.  ;)

Bing
> 
> John
> 
>>
>> The existing cache allocation logic achieves this by first populating
>> the virResctrlInfoPtr structure. Libvirt fills virResctrlInfoPtr while
>> traversing the host's cache hierarchy. When allocating cache with
>> virResctrlAllocCreate, libvirt first creates a virResctrlAllocPtr based
>> on the information from virResctrlInfoPtr. This virResctrlAllocPtr
>> represents the maximum resources available (i.e. with no previous
>> allocations). Libvirt then traverses all resctrl groups already created
>> (in /sys/fs/resctrl/) and subtracts the resources they claim, so what is
>> left is the free resource. Libvirt compares this free resource with the
>> resource requested in virResctrlAllocCreate; if enough free resource is
>> available, the new resctrl group is created.
>> The memory bandwidth part follows the same idea; the only difference is
>> that memory bandwidth uses a percentage number while cache uses a
>> bitmap.
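As a purely illustrative toy, detached from the real libvirt code, the
"subtract what is already claimed" idea for memory bandwidth percentages
looks like this (all numbers are made up):

#include <stdio.h>

int main(void)
{
    unsigned int total = 100;              /* MBA is expressed as a percentage */
    unsigned int claimed[] = { 20, 30 };   /* already claimed by existing groups */
    unsigned int request = 60;             /* new <node bandwidth='60'/> request */
    unsigned int avail = total;
    size_t i;

    /* subtract the bandwidth already claimed by existing resctrl groups */
    for (i = 0; i < sizeof(claimed) / sizeof(claimed[0]); i++)
        avail -= claimed[i];

    if (request > avail)
        printf("rejected: only %u%% bandwidth left\n", avail);
    else
        printf("accepted: %u%% bandwidth left afterwards\n", avail - request);

    return 0;
}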
>>
>> Bing
>>>
>>> John
>>>
[.....]



