[libvirt] [PATCH 02/10] util: add interface retrieving CMT capability

John Ferlan jferlan at redhat.com
Fri Sep 7 16:49:05 UTC 2018



On 09/07/2018 03:56 AM, Wang, Huaqiang wrote:
> 
> 
>> -----Original Message-----
>> From: John Ferlan [mailto:jferlan at redhat.com]
>> Sent: Wednesday, September 5, 2018 7:58 PM
>> To: Wang, Huaqiang <huaqiang.wang at intel.com>; libvir-list at redhat.com
>> Cc: Feng, Shaohe <shaohe.feng at intel.com>; Niu, Bing <bing.niu at intel.com>;
>> Ding, Jian-feng <jian-feng.ding at intel.com>; Zang, Rui <rui.zang at intel.com>
>> Subject: Re: [libvirt] [PATCH 02/10] util: add interface retrieving CMT capability
>>
>>
>>
>> On 08/27/2018 07:23 AM, Wang Huaqiang wrote:
>>> Introduce function for reporting CMT capability through going through
>>> files under /sys/fs/info/L3_MON.
>>> This patch is co-work with later patches and report these information
>>> to domain.
>>
>> Do you mean you're setting the basis for future patches to provide the capability
>> data for monitor info?
> 
> Yes, patches 1 to 4 add the host capability for the cache monitor.
> This patch introduces a new structure: a 'virResctrlInfoMonPtr' is added to
> @resctrl (type 'virResctrlInfoPtr') to keep the resctrl monitoring group
> capabilities, and another 'virResctrlInfoMonPtr' is added to each cache
> @bank (type 'virCapsHostCacheBankPtr') to store the cache monitor capabilities.
> The subsequent patch 3 exposes the cache monitor capability information in the
> host capability query.
> 

I pushed the src/conf/capabilities.c in patch 1, so let's take it off
the table.

Suggestion... When you're done with patches 2 -> 4 a/k/a the host piece
of the cache monitor, post it. Once we are "good" with that it can be
pushed.

If at the same time you want to introduce the refactorings that don't
include cache monitor, then do so in a separate series.

Having everything in one series makes an impending review of 10-20
patches less desirable to start. If you get lucky, sometimes when
there's a few 1-5 patch series you get multiple reviewers rather than
waiting for 1 reviewer to complete a long series.

>>
>>>
>>> Signed-off-by: Wang Huaqiang <huaqiang.wang at intel.com>
>>> ---
>>>  src/conf/capabilities.c |   6 ++-
>>>  src/conf/capabilities.h |   1 +
>>>  src/util/virresctrl.c   | 120 ++++++++++++++++++++++++++++++++++++++++++++++--
>>>  src/util/virresctrl.h   |  17 ++++++-
>>>  4 files changed, 137 insertions(+), 7 deletions(-)
>>>
>>
>> Caveat - I didn't go back and read all the previous history on this.
>> Sorry there's just too much. I hope that mkletzan will also take a look at the
>> series since he was involved previously.
>>
>> There's two things going on in this patch:
>>
>>   1. The actual fetch of the data into resctrl structures
> 
> This is accomplished by filling information to @virResctrlInfo->monitor_info.
> 
>>
>>   2. The movement/copy of some of that data into @bank
>>
> 
> This is accomplished by filling information to
> @virCapsHostCacheBankPtr->monitor   
> 
>> Splitting the patches such that item 1 is separate and then item 2 is combined
>> with patches 3 and 4 along with some doc adjustments to describe the output.
>>
> 
> Makes sense. Will split this patch as you suggested, and add doc content.
> But patch 4 is a test patch for the newly introduced cache monitor capability; do
> you think it should be merged with patch 3 and the second half of patch 2?
> 

Why not? You'd be testing what you introduced in patch3.

>> Of course I just peeked looking for "cache" and "bank" in docs/*.in and found
>> nothing /-|... Looks like docs/formatcaps.html.in needs some love to describe
>> <cache> and <memory_bandwidth> (how come this stuff comes to me
>> afterwards...)
> 
> I'll provide the descriptions for cache monitor. 
> 

Catching up with "existing" can be a separate patch/series...

>>
>>> diff --git a/src/conf/capabilities.c b/src/conf/capabilities.c
>>> index 326bd15..5280348 100644
>>> --- a/src/conf/capabilities.c
>>> +++ b/src/conf/capabilities.c
>>> @@ -1626,6 +1626,9 @@ virCapsHostCacheBankFree(virCapsHostCacheBankPtr ptr)
>>>      virBitmapFree(ptr->cpus);
>>>      for (i = 0; i < ptr->ncontrols; i++)
>>>          VIR_FREE(ptr->controls[i]);
>>> +    if (ptr->monitor && ptr->monitor->features)
>>> +        virStringListFree(ptr->monitor->features);
>>> +    VIR_FREE(ptr->monitor);
>>>      VIR_FREE(ptr->controls);
>>>      VIR_FREE(ptr);
>>>  }
>>> @@ -1801,7 +1804,8 @@ virCapabilitiesInitCaches(virCapsPtr caps)
>>>                                             bank->level,
>>>                                             bank->size,
>>>                                             &bank->ncontrols,
>>> -                                           &bank->controls) < 0)
>>> +                                           &bank->controls,
>>> +                                           &bank->monitor) < 0)
>>
>> I wonder if perhaps it'd be better to have virResctrlInfoGetCache just take
>> @bank as a parameter instead of continually adding more... I'm also not
>> convinced @bank is the best place considering it's the same data that repeatedly
>> gets fetched/stored.
> 
> This should be OK, since the 'virCapsHostCacheBankPtr' structure is fully exposed
> in the 'capabilities.h' file.
> The suggestion will be followed.
> 

The point wasn't that @bank is exposed, it's whether it is the right place.
I became less convinced as I went on.

>>
>>>                      goto cleanup;
>>>
>>>                  if (VIR_APPEND_ELEMENT(caps->host.caches,
>>> diff --git a/src/conf/capabilities.h b/src/conf/capabilities.h
>>> index 046e275..3ed2523 100644
>>> --- a/src/conf/capabilities.h
>>> +++ b/src/conf/capabilities.h
>>> @@ -149,6 +149,7 @@ struct _virCapsHostCacheBank {
>>>      virBitmapPtr cpus;  /* All CPUs that share this bank */
>>>      size_t ncontrols;
>>>      virResctrlInfoPerCachePtr *controls;
>>> +    virResctrlInfoMonPtr monitor;
>>
>> This structure notes usage for specific @level; however, from how I read the
>> path to the data, the data is only provided in L3. Since on output <bank> can
>> specify a specific level, I assume that L1 or L2 would be possible for it; however,
>> given how the code is written wouldn't that mean @monitor data is included as
>> well?
>>
> 
> @level could be 1, 2 or 3 for a 3-level cache system. @monitor is part of the
> structure for every cache level. @monitor will be set to NULL if the hardware
> does not support monitoring. For example, no L1D cache supports resource
> monitoring, so the corresponding 'virCapsHostCacheBank's @monitor will be
> set to NULL.
> 

OK, but the path used in the subsequent patch is info/L3_MON/num_rmids -
IOW: L3 only - hence the query.  If there can be multiple levels of
monitor cache, then it further makes me believe that it shouldn't be a
child of @bank.

>> Additionally, from how I read things it seems the same data is repeated for each
>> bank id='#' found. So is the data unique to a bank by id, or is it unique to the level
>> regardless of the bank?  If the former, then the data needs to be properly split so
>> it can be shown to be different for each id. If the latter, then <monitor>
>> wouldn't seem to need to be a child of <bank>. It's not clear whether it's then a
>> child or peer of <cache>.
>>
> 
> Data is only kept in a cache 'bank' that supports the cache monitor feature;
> for a cache that does not support the monitoring feature, @monitor will be set
> to 'NULL'.
> And, yes, it might be duplicated several times on a multi-node system,
> because multiple cache 'bank's with the cache monitoring feature were found.
> 

Maybe I asked the wrong question, but if I look at the data presented in
patch 4 - I didn't see anything that said to me that it should be a
child of each <bank id='#'...>.... If it is a child of @bank, then the
@monitor would need to reference a bank by id, but instead it's
disjoint from that AFAICT.

> Seems you have a big concern for my arrangement of putting @monitor inside
> 'virCapsHostCacheBankPtr' structure.

Yep, but it may just be a topological concern with how things are printed.

> Before introducing my considerations for this arrangement, let me clarify the
> definition of 'cache bank' and the cache topology. I may make mistakes here;
> if so, please correct me.

I'm hoping you know it better than I do!

The on disk structure I see starts at:

/sys/fs/resctrl (SYSFS_RESCTRL_PATH)

and then for each "subsystem" the subdirectory structure is:

GetCacheInfo
 => info/%s/{num_closids|cbm_mask|min_cbm_bits}

where "%s" is "L1", "L2", "L3", etc.

GetMemoryBandwidthInfo
 => info/MB/{bandwidth_gran|min_bandwidth|num_closids}

GetMonitorInfo
 => info/L3_MON/{num_rmids|mon_features}

This maps to _virCapsHost:

...
    size_t ncaches;
    virCapsHostCacheBankPtr *caches;

    size_t nnodes;
    virCapsHostMemBWNodePtr *nodes;
...

Where _virCapsHostCacheBank has:
...
    unsigned int id;
    unsigned int level; /* 1=L1, 2=L2, 3=L3, etc. */
...


The code places virResctrlInfoMonPtr as a child of each caches entry.
However, when loading the data for each cache level the same "L3_MON"
entry is read regardless of level and regardless of id. Thus, in my mind
there's *a lot* of seemingly needless duplication. That is "bank=1" gets
the same entry as "bank=2" and so on.
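
To make that concrete, the shape I keep picturing is more like the
following - purely hypothetical type and field names, not a patch - where
the single info/L3_MON reading is stored once instead of being copied
into every bank:

    /* hypothetical: keep one copy of the L3_MON data for the whole host */
    typedef struct _virCapsHostCacheInfo virCapsHostCacheInfo;
    typedef virCapsHostCacheInfo *virCapsHostCacheInfoPtr;
    struct _virCapsHostCacheInfo {
        size_t nbanks;
        virCapsHostCacheBankPtr *banks;

        /* read once from /sys/fs/resctrl/info/L3_MON/ */
        virResctrlInfoMonPtr monitor;
    };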

> A 'cache bank' has a group of attributes, including 'id', 'level', 'type',
> 'size' and 'cpus'; these attributes are defined by the Linux kernel. You can find
> the values for a specific CPU cache block in this directory:
> /sys/devices/system/cpu/cpu*/cache/*
> 
> My understanding of the libvirt cache 'bank' (or 'cache bank'):
> 'virCapsHostCacheBankPtr' represents the 'cache bank'.
> 'virCapsHostCacheBankEquals', listed below, tells us that
> two 'cache bank's with exactly the same attribute values are the same
> 'cache bank'; otherwise, they are different 'cache bank's.
> 
>      bool
>      virCapsHostCacheBankEquals(virCapsHostCacheBankPtr a,
>                                 virCapsHostCacheBankPtr b)
>      {
>          return (a->id == b->id &&
>                  a->level == b->level &&
>                  a->type == b->type &&
>                  a->size == b->size &&
>                  virBitmapEqual(a->cpus, b->cpus));
>      }
> 
> How many 'cache bank's are in the system, and what is the relationship between a
> 'cache bank' and a hardware cache block?
> For convenience, take a 2-socket E5-2699v4 system as an example: there are two
> CPU nodes in the system, each CPU (or node) has 22 hardware CPU cores, each core is
> equipped with two private L1 caches and one private L2 cache, and each node
> has an L3 (or LLC) cache shared among its cores.

So now you're mixing in the /sys/devices/system/cpu/cpu*/cache/* with
the /sys/fs/resctrl/info/*. This is going to get confusing real fast. So
far "banks" have been associated with some cpumask map as has the memory
bandwidth via node id.

Still nothing in the monitor code leads me to "see" how things would be
different for each range of cpus by bank id.

> Let's use the concept of 'cache bank' to describe the caches of this 2-socket
> system: there are two L3 'cache bank's (assuming CDP is disabled), one for
> each socket, shared by the associated CPU cores. There are 44 CPU cores in
> total, and each CPU core has three private 'cache bank's: the private L1D
> 'cache bank', L1I 'cache bank' and L2 'cache bank'. In total, 44x3+2=134
> 'cache bank's exist in the system.

OK, but the TLA's and "knowledge" of levels and private or other
variously named 'cache bank' just isn't knowledge I keep in my STM
(short term memory) let alone my LTM (long).

> 
> Here are my considerations for adding @monitor to each 'cache bank':
> 1. This leverages the design of @controls/@ncontrols.
>      a.) @controls/@ncontrols was designed for the cache allocation (CAT)
>      feature; only the last level cache of a particular CPU supports cache
>      allocation, and private 'cache bank's do not support cache allocation.
>      b.) @monitor follows the same policy as @controls/@ncontrols. For a
>      'cache bank' that supports cache monitoring, populate the appropriate
>      information through 'virCapsHostCacheBankPtr'->monitor; for a 'cache bank'
>      that does not support the cache monitor feature, just leave @monitor empty.

I think you've lost me, but I do see that <controls> to some degree
equates to some level of calculated values based on the GetCacheInfo
data. Again, not code I keep fresh in mind.

I don't see the same for monitor - it's just raw data not related to
anything bank related.

I see "maxAlloc" a/k/a _virResctrlInfoMon->max_allocation a/k/a the read
"/info/L3_MON/num_rmids" value.

I see "threshold" a/k/a _virResctrlInfoMon->cache_threshold a/k/a the
read "/info/L3_MON/max_threshold_occupancy" value.

Nothing to do with performing calculations like the control data has.
Although not shown in the sample output - it's not clear whether the
values could be different between banks. If they cannot, then it's too
bad we have so much duplication.


> 2. Cache 'monitor' is designed as an accompanying concept to cache 'control':
>    cache 'control' mostly covers the CAT technology, and cache 'monitor' now
>    refers to the CMT technology.
> 
> You also mentioned "it seems the same data is repeated".
> Yes, it does for some system, for example, a 2-socket E5-2600v4 system.

It's the same because the same file is read. There's nothing that I see
that shows any differently.

The *only* files read are in the path:

    /sys/fs/resctrl/info/L3_MON/

Now, if they were in something like:

    /sys/fs/resctrl/info/%s/L3_MON/

where %s relates to some bank id number, and further if the L3 was L%d
where %d = 1, 2, 3, etc. - then I can see the topology you've created.

But the fact remains, it's one file, it's not different, so there's no
need for the same data to be replicated.

> 
> This also confused me a lot when I began to write the POC code.
> As I said, the 'cache monitor' leverages the 'cache allocation' design:
> the cache allocation capability is also stored in each 'cache bank' through
> @ncontrols and @controls. An empty @controls with @ncontrols of 0 means
> that this 'cache bank' has no cache allocation capability;
> otherwise, @controls/@ncontrols points to an array of
> 'virResctrlInfoPerCachePtr' pointers, which indicates the capabilities of the
> cache allocation (CAT) feature. The content of @controls/@ncontrols may also
> be duplicated.
> So libvirt has the capability (by design) to keep per-'cache bank'
> information, but resctrl cannot provide such 'cache bank'-specific
> information, even if you have a multi-socket system where each socket is
> populated with a different CPU.
> The reality is that resctrl only reports system-wide CAT, CMT, MBA, MBM
> capabilities; it does not report CAT, CMT, MBA, MBM capabilities per 'cache bank'.
> 
> I think there should be an explanation for the necessity of keeping the
> cache allocation capability per 'cache bank', despite the fact that
> resctrl only provides one copy of the cache allocation capability data.
> 
> Maybe it is not wise to leverage the design of @bank/@controls here,
> or maybe I totally misunderstood the design of @bank/@controls.
> 

Sorry - I'm no help here. Martin did the original review and perhaps
understands the model best.

>>>  };
>>>
>>>  typedef struct _virCapsHostMemBWNode virCapsHostMemBWNode;
>>> diff --git a/src/util/virresctrl.c b/src/util/virresctrl.c
>>> index 4b5442f..2f6923a 100644
>>> --- a/src/util/virresctrl.c
>>> +++ b/src/util/virresctrl.c
>>> @@ -146,6 +146,8 @@ struct _virResctrlInfo {
>>>      size_t nlevels;
>>>
>>>      virResctrlInfoMemBWPtr membw_info;
>>> +
>>> +    virResctrlInfoMonPtr monitor_info;
>>>  };
>>>
>>>
>>> @@ -171,6 +173,9 @@ virResctrlInfoDispose(void *obj)
>>>          VIR_FREE(level);
>>>      }
>>>
>>> +    if (resctrl->monitor_info)
>>> +        virStringListFree(resctrl->monitor_info->features);
>>> +    VIR_FREE(resctrl->monitor_info);
>>>      VIR_FREE(resctrl->membw_info);
>>>      VIR_FREE(resctrl->levels);
>>>  }
>>> @@ -556,6 +561,81 @@ virResctrlGetMemoryBandwidthInfo(virResctrlInfoPtr resctrl)
>>>
>>>
>>
>> Could add a few comments here to describe what's being provided and
>> how it fits (regardless of where in the schema of things it ends up).
>>
> 
> how about adding these as function comments:                                                                                                                                                                                                  
> /*
>  * Retrieve the capability of resource monitoring group by checking the kernel
>  * resource control interface.
>  * the capability information includes:

s/the/The/  (and it can be on the preceding line)

be sure to leave a blank line before the next one though - makes it easier
to read (at least for me)

>  * max_allocation: the maximum number of monitoring groups could be created.

Have monitoring groups been introduced?  "could be created"?

blank line for readability

>  * mon_features: the monitoring features supported, which could be
>  *   'llc_occupancy' for the feature of reporting how much last level cache using. 
>  *   'mbm_total_bytes' for the feature of reporting total memory bandwidth using.
>  *   'mbm_local_bytes' for the feature of reporting local memory bandwidth using.

These are essentially text strings for monitoring capabilities - how
they're used is something for later on. IOW: This is what's allowed

blank line for readability

>  * cache_threshold: this affects the actual destroy of resource monitoring
>  * group, mainly affects the hardware resource reclaim, a greater value of this, in
>  * bytes, will make the request of creating new resource monitoring group  more
>  * likely to fail if the existing number of monitoring groups reaches up to
>  * 'max_allocation'.

ah, what, OK - Doesn't help me. It's a *capability* - changing it would
seem to be the chore of someone sufficiently blessed in understanding of
how all of this works.

>  */
> 
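Putting the earlier nits together, the comment could end up looking
something like this (rough sketch only, wording is yours to improve):

/*
 * virResctrlGetMonitorInfo:
 * @resctrl: resctrl info object to fill in
 *
 * Retrieve the resource monitoring capabilities from the kernel resctrl
 * interface (info/L3_MON). The information collected is:
 *
 * max_allocation: the maximum number of monitoring groups that can exist
 * at the same time (num_rmids).
 *
 * features: the list of monitoring feature names the kernel reports in
 * mon_features, e.g. llc_occupancy, mbm_total_bytes, mbm_local_bytes.
 *
 * cache_threshold: max_threshold_occupancy, the occupancy (in bytes)
 * below which an RMID is considered free for reuse.
 *
 * Returns 0 on success, -1 on error.
 */
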
>>>  static int
>>> +virResctrlGetMonitorInfo(virResctrlInfoPtr resctrl)
>>> +{
>>> +    int rv = -1;
>>> +    char *featurestr = NULL;
>>> +    char **lines = NULL;
>>> +    size_t nlines = 0;
>>> +    size_t i = 0;
>>> +    int ret = -1;
>>> +    virResctrlInfoMonPtr info = NULL;
>>> +
>>> +    if (VIR_ALLOC(info) < 0)
>>> +        return -1;
>>> +
>>> +    rv = virFileReadValueUint(&info->max_allocation,
>>> +                              SYSFS_RESCTRL_PATH "/info/L3_MON/num_rmids");
>>> +    if (rv == -2) {
>>> +        /* The file doesn't exist, so it's unusable for us,
>>> +         * probably resource monitoring feature unsupported */
>>> +        VIR_WARN("The path '" SYSFS_RESCTRL_PATH "/info/L3_MON/num_rmids' "
>>> +                 "does not exist");
>>> +
>>> +        ret = 0;
>>> +        goto cleanup;
>>> +    } else if (rv < 0) {
>>> +        /* Other failures are fatal, so just quit */
>>> +        goto cleanup;
>>> +    }
>>> +
>>> +    rv = virFileReadValueUint(&info->cache_threshold,
>>> +                              SYSFS_RESCTRL_PATH
>>> +
>>> + "/info/L3_MON/max_threshold_occupancy");
>>> +
>>> +    if (rv == -2) {
>>> +        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>>> +                       _("Cannot get max_threshold_occupancy from resctrl"
>>> +                         " info"));
>>> +    }
>>> +    if (rv < 0)
>>> +        goto cleanup;
>>> +
>>> +    rv = virFileReadValueString(&featurestr,
>>> +                                SYSFS_RESCTRL_PATH
>>> +                                "/info/L3_MON/mon_features");
>>> +    if (rv == -2)
>>> +        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>>> +                       _("Cannot get mon_features from resctrl info"));
>>> +    if (rv < 0)
>>> +        goto cleanup;
>>> +
>>> +    lines = virStringSplitCount(featurestr, "\n", 0, &nlines);
>>> +
>>> +    for (i = 0; i < nlines; i++) {
>>> +        if (STREQLEN(lines[i], "llc_", strlen("llc_")) ||
>>> +            STREQLEN(lines[i], "mbm_", strlen("mbm_"))) {
>>
>> Consider using STRPREFIX.
> 
> OK. To be fixed.
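(FWIW, IIRC the STRPREFIX form collapses that to just - sketch:

        if (STRPREFIX(lines[i], "llc_") ||
            STRPREFIX(lines[i], "mbm_")) {

same semantics, without repeating the literal.)
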
> 
>>
>>> +            if (virStringListAdd(&info->features, lines[i]) < 0)
>>> +                 goto cleanup;
>>> +            info->nfeatures++;
>>> +        }
>>
>> So we get a list filtered by prefixes "llc_" and "mbm_"
>>
>> Eventually this list gets pared down again to just "llc_". None of the subsequent
>> patches do anything with "mbm_" other than list it in capabilities XML output.
>>
>> Sure "mbm_" could be used in the future, but the question that comes to mind is
>> why are initially filtering at all? That is, why not just replace lines/nlines with
>> info->features and info->nfeatures? That then provides "everything" supported
>> in the tree, right?
>>
> 
> Agree. Will remove the filter, just report the content from resctrl.
> 
>> I would figure this code would be just mirroring what's available. It's then up to
>> the upper layers or other code to decide what to do with the list.
>>
> 
> Agree. Let upper layer make decision.
> 
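Right - and then the loop body reduces to copying everything over,
something like (sketch, not compiled):

        lines = virStringSplitCount(featurestr, "\n", 0, &nlines);
        for (i = 0; i < nlines; i++) {
            /* skip any blank entry, e.g. from a trailing newline */
            if (!*lines[i])
                continue;
            if (virStringListAdd(&info->features, lines[i]) < 0)
                goto cleanup;
            info->nfeatures++;
        }
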
>>> +    }
>>> +
>>> +    VIR_FREE(featurestr);
>>> +    virStringListFree(lines);
>>> +    resctrl->monitor_info = info;
>>> +    return 0;
>>
>> consider using cleanup as part of the processing and thus removing the need to
>> have multiple VIR_FREE(featurestr) and virStringListFree(lines).
>>
>> If you add a char **features = NULL; which you use to perform the
>> virStringListAdd calls and then
>>
>>     VIR_STEAL_PTR(info->features, features);
>>     VIR_STEAL_PTR(resctrl->monitor_info, info);
>>     ret = 0;
>>
>> and fall through allowing the featurestr, lines, and features to be used below.
>>
>> Of course that all assumes it's really necessary to do the filtering.
>>
>> There's also the VIR_AUTOPTR stuff added which I'm not exactly sure how to
>> use yet as I wasn't part of that effort.
>>
> 
> Thanks for the advice, really helpful. Will pay attention to the 'cleanup'/'error'
> labels and the logic behind them.
> 
>>> +
>>> + cleanup:
>>> +    VIR_FREE(featurestr);
>>> +    virStringListFree(lines);
>>> +    virStringListFree(info->features);
>>> +    VIR_FREE(info);
>>> +    return ret;
>>> +}
>>> +
>>> +
>>> +static int
>>>  virResctrlGetInfo(virResctrlInfoPtr resctrl)
>>>  {
>>>      DIR *dirp = NULL;
>>> @@ -569,6 +649,10 @@ virResctrlGetInfo(virResctrlInfoPtr resctrl)
>>>      if (ret < 0)
>>>          goto cleanup;
>>>
>>> +    ret = virResctrlGetMonitorInfo(resctrl);
>>> +    if (ret < 0)
>>> +        goto cleanup;
>>> +
>>>      ret = virResctrlGetCacheInfo(resctrl, dirp);
>>>      if (ret < 0)
>>>          goto cleanup;
>>> @@ -654,16 +738,21 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl,
>>>                         unsigned int level,
>>>                         unsigned long long size,
>>>                         size_t *ncontrols,
>>> -                       virResctrlInfoPerCachePtr **controls)
>>> +                       virResctrlInfoPerCachePtr **controls,
>>> +                       virResctrlInfoMonPtr *monitor)
>>>  {
>>>      virResctrlInfoPerLevelPtr i_level = NULL;
>>>      virResctrlInfoPerTypePtr i_type = NULL;
>>> +    virResctrlInfoMonPtr cachemon = NULL;
>>>      size_t i = 0;
>>>      int ret = -1;
>>>
>>>      if (virResctrlInfoIsEmpty(resctrl))
>>>          return 0;
>>>
>>> +    if (VIR_ALLOC(cachemon) < 0)
>>> +        return -1;
>>> +
>>>      /* Let's take the opportunity to update the number of last level
>>>       * cache. This number of memory bandwidth controller is same with
>>>       * last level cache */
>>> @@ -716,14 +805,35 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl,
>>>          memcpy((*controls)[*ncontrols - 1], &i_type->control, sizeof(i_type->control));
>>>      }
>>>
>>> -    ret = 0;
>>> - cleanup:
>>> -    return ret;
>>> +    cachemon->max_allocation = 0;
>>> +
>>> +    if (resctrl->monitor_info) {
>>> +        virResctrlInfoMonPtr info = resctrl->monitor_info;
>>> +
>>> +        cachemon->max_allocation = info->max_allocation;
>>> +        cachemon->cache_threshold = info->cache_threshold;
>>> +        for (i = 0; i < info->nfeatures; i++) {
>>> +            /* Only cares about last level cache */
>>> +            if (STREQLEN(info->features[i], "llc_", strlen("llc_")))
>>> + {
>>
>> Again use STRPREFIX instead.
> 
> Will be fixed.
> 
>>
>>> +                if (virStringListAdd(&cachemon->features,
>>> +                                     info->features[i]) < 0)
>>> +                     goto error;
>>> +                cachemon->nfeatures++;
>>> +            }
>>> +        }
>>
>> This code further filters our "llc_" and "mbm_" list into just "llc_".
>> Not sure I understand why we only "care about last level cache" values just yet.
>> Of course I see that is all that gets added in the subsequent patch's capabilities
>> output, but is that really what's wanted?  Are we going to start seeing patches
>> that start expanding this list? Why limit/filter now?
>>
> 
> 
> Hope I addressed your concerns:

Not sure until I see the next version of course. I think we're perhaps
far enough apart to make me concerned.

> In resctrl, there are three resource monitoring features:
>   'llc_occupancy' for cache monitor.

Hence why you placed it where you did...

>   'mbm_total_bytes' and 'mbm_local_bytes' for memory bandwidth monitor.

does this mean these would be placed somehow under <memory_bandwidth> in
a similar manner?

> In this patch series I only introduced the cache monitor; the memory
> bandwidth monitor is planned to be introduced in a series of patches
> afterwards.

I think if we can come to an agreement on how the capability format
should look it will be better. You have my feedback, now it's up to you
to take the next step. I'm not sure I have the cycles to engineer this.

> All the resctrl features listed above ('llc_'s and 'mbm_'s) are kept in
> @resctrl->monitor_info. Then, when libvirt wants to get the feature list for,
> and just for, the cache monitor, function virResctrlGetCacheInfo will be invoked,
> and only the 'llc_' features are of interest; this is the reason why the
> limit/filter is applied here.
> (Will use a full-name filter instead of a prefix form such as 'llc_*'.)
> 
> virResctrlInfoGetCache will be called for any 'cache bank' that supports the
> 'llc_occupancy' feature, and the result will be stored in the 'cache bank' private
> data area (with data type virCapsHostCacheBankPtr).
> Of course, as you mentioned, the data may be duplicated.
> 
> The monitor feature list will be expanded when formatting the cache monitor
> capabilities string while reporting the host capabilities, as illustrated in the
> following code:
> 
>     function virCapabilitiesFormatCaches at capabilities.c
>     ...
>         if (bank->monitor &&
>             bank->monitor->nfeatures) {
>             virBufferAsprintf(&childrenBuf,
>                               "<monitor threshold='%u' unit='B' "
>                               "maxAllocs='%u'>\n",
>                               bank->monitor->cache_threshold,
>                               bank->monitor->max_allocation);
>             /* expanding cache monitor feature list */
>             for (j = 0; j < bank->monitor->nfeatures; j++) {
>                 virBufferAdjustIndent(&childrenBuf, 2);
>                 virBufferAsprintf(&childrenBuf,
>                                   "<feature name='%s'/>\n",
>                                   bank->monitor->features[j]);
>                 virBufferAdjustIndent(&childrenBuf, -2);
>             }
>             virBufferAddLit(&childrenBuf, "</monitor>\n");
>         }
>     ...
> 
> 
>>> +    }
>>> +
>>> +    if (cachemon->features)
>>> +        *monitor = cachemon;
>>
>> if (!cachemon->features), then @cachemon is leaked, consider using:
>>
>>         VIR_STEAL_PTR(*monitor, cachemon);
>>
> 
> You caught my bug. Thanks.
> 
>> in the if condition, then
>>
>>     VIR_FREE(cachemon);
>>
>> or just the VIR_FREE(cachemon); as an else. IDC either way. Of course, it's still
>> not quite clear what's going on.
>>
>> Perhaps, you should have an API that gets all the names of the values prefixed by
>> some string, IOW:
>>
>>     virResctrlInfoGetMonitorPrefix(resctrl, filter)
>>
>> where filter is a "const char *filter"
>>
>> it would return that cachemon list whether it's NULL, empty, or full anything. Let
>> the caller decide what to do with it.
>>
> 
> Looks reasonable and makes the code more concise, especially when adding the
> memory bandwidth monitor. Will be implemented in the next version of the patch.
> 
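To be concrete about the shape I had in mind - just a prototype sketch,
name and signature entirely up for grabs:

    int
    virResctrlInfoGetMonitorPrefix(virResctrlInfoPtr resctrl,
                                   const char *prefix,
                                   virResctrlInfoMonPtr *monitor);

It would return the monitor info restricted to the features matching
@prefix (or everything when @prefix is NULL/empty) and let the caller
decide what an empty list means.
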
>> I haven't looked beyond the first 4 patches, so how things are used later on may
>> determine what API's you could need. The relationship to <cache> and <bank>
>> isn't clear.
>>
> 
> "The relationship to <cache> and <bank> isn't clear." -- Not catching your idea
> too much, do you mean the relationship to 'physical CPU cache' and the software
> scope 'cache bank' is not clear? If yes, please refer to my upper replies, the
> clarification paragraph of 'cache bank' with an example of 2S E5-2699v4
> system.
> 

If it's not clear by this response, I'm afraid we'll just be too far
apart going forward.

Maybe the capability output should:

<monitor level='3' threshold='%u' maxAlloc='%u' >
  <feature name='%s'/>
  <feature name='%s'/>
...
</monitor>

Where the '3' is because you read from "L3_MON", and it's only important if
you feel 1 or 2 or something else would be generated eventually.

And then you correlate however you have to "later".  You "know" that
<cache> would be related to "llc_occupancy" and take it from there.

I see no way for each feature to have a different num_rmids or
max_threshold_occupancy value, so that's why I'm putting them as
attributes of <monitor>. What is of concern is how someone knows
<monitor> relates to both <cache> and <memory_bandwidth> - I guess that
has to be left for the documentation portion.  If you wanted to name it
something different than <monitor> that's fine - naming is hard (TM).
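
In the formatting code you pasted above that would just mean the opening
tag becomes something like the following - sketch only, untested, and
'monitor_level' is a placeholder for wherever the '3' ends up being stored:

            virBufferAsprintf(&childrenBuf,
                              "<monitor level='%u' threshold='%u' unit='B' "
                              "maxAllocs='%u'>\n",
                              monitor_level,
                              bank->monitor->cache_threshold,
                              bank->monitor->max_allocation);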


>>> +
>>> +    return 0;
>>>   error:
>>>      while (*ncontrols)
>>>          VIR_FREE((*controls)[--*ncontrols]);
>>>      VIR_FREE(*controls);
>>> -    goto cleanup;
>>> +    virStringListFree(cachemon->features);
>>> +    VIR_FREE(cachemon);
>>> +    return ret;
>>>  }
>>>
>>>
>>> diff --git a/src/util/virresctrl.h b/src/util/virresctrl.h
>>> index cfd56dd..51bb68b 100644
>>> --- a/src/util/virresctrl.h
>>> +++ b/src/util/virresctrl.h
>>> @@ -61,6 +61,19 @@ struct _virResctrlInfoMemBWPerNode {
>>>      unsigned int max_allocation;
>>>  };
>>>
>>> +typedef struct _virResctrlInfoMon virResctrlInfoMon;
>>> +typedef virResctrlInfoMon *virResctrlInfoMonPtr;
>>> +/* Information about resource monitoring group */
>>> +struct _virResctrlInfoMon {
>>> +    /* null-terminal string list for hw supported monitor feature */
>>> +    char **features;
>>> +    size_t nfeatures;
>>> +    /* Maximum number of simultaneous allocations */
>>> +    unsigned int max_allocation;
>>
>> What kind of allocations? From your cover you state the maximum number of
>> monitoring groups that could be created, but it's impossible to know how this
>> value is expected to be used from what's provided as a comment here.
>>
> 
> Changing the comment
> from
>   /* Maximum number of simultaneous allocations */
> to
>   /* Maximum number of monitoring groups that could be created */
> 
>> The code shows this is the value from /info/L3_MON/num_rmids - I don't see
>> the correlation in FS name to structure name. Since you later print a unit of 'B' I
>> assume it's bytes of something, but the comment seems to imply a maximum
>> simultaneous number of something.
>>
> 
> It looks like the capability string confused you.
> The XML output of the cache monitor capability looks like this (leveraging the
> format of the 'cache control' capability):
> 
>         <monitor threshold='270336' unit='B' maxAllocs='176'>
>           <feature name='llc_occupancy'/>
>         </monitor>
> 
> 'B' is the unit for 'threshold'; 'maxAllocs' is for max_allocation (parsed
> from /info/L3_MON/num_rmids).
> The 'unit' is not limited to 'B'; it could be 'KiB', etc.

True, but you read a 'B' and never change that anywhere - it's also not
clear threshold is a 'B' value. At least when I see 'size' followed by
'unit' - I'm certain the size is related to the unit. To me "threshold" is
just a "value", not necessarily byte related. Could be a count of
something. Of course the longer wording "max_threshold_occupancy"
doesn't help me much either.  A maximum threshold occupancy of what?
It's not unique to each feature name, it's unique to the L3_MON.



> This tells me docs are necessary for interpreting these settings.

clearly!

> 
>     "monitor": describes cache monitor capability.
>     "threshold": This is cache occupancy threshold value used in kernel
>          resource control system, and affects the actual release of  hardware
>          resource, the RMID (resource monitoring ID). A greater value of this
>          will make the request of creating a new resource monitoring group more
>          likely to fail if the existing number of monitoring groups reaches up
>          to 'maxAlloc'.

Again it's not something a "normal consumer" would probably change...

>     "unit": This is the unit of "threashold", could be 'B', 'KiB', 'MiB' or 'GiB'.

Unless you do the logic to present it as a calculated value, what's
the purpose? I know there's code out there that will "prettify" output
such that for the example from patch 4 a value of 270336 bytes is
printed as '264 KiB'.  If you're not going to do that, then just present it
as a value and note that it's a byte value.  /me now wonders if you
should be sure to store this in an "unsigned long long" field since you
mention 'GiB' - an 'unsigned int' only gets you so far.
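
IOW, something like (sketch):

    /* bytes; unsigned long long so larger thresholds still fit */
    unsigned long long cache_threshold;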

>     "maxAllocs": This is a number that maximum number of  monitoring groups could
>          be created.
>     "feature": describes the feature name supported by 'monitor'.
> 
> Hope this documentation clears your confusion.
> 
>> Perhaps if documentation was added I would have had my answer without
>> needing to go research that is.
>>
> 
> See docs above.
> 
>>> +    /* determines the occupancy at which an RMID can be freed */
>>
>> Again, alone this comment is difficult to decipher as it relates to the structure
>> field name. The code shows the value read is
>> "/info/L3_MON/max_threshold_occupancy".  It's not clear what "occupancy"
>> means. Is there something related to this number that some consumer could
>> change that would improve some performance?
>>
> 
> "max_threshold_occupancy" is a concept involved by kernel resctrl. It is a
> cache value, in bytes, affects the release of hardware 'RMID', thus, affects
> the maximum number of monitor group could be created. Get more information from                                                                                                                                                               
> the cache monitor attribute 'threshold''s description.
> 
> Thanks for your efforts on the review.

Thanks for the detailed reply - I'm still not sure I really
understand the maxAllocs and threshold values.  I see them purely as
display values at this point. I cannot imagine providing an interface or
description that would help some consumer adjust the value to fix some
problem on their host. There are patch series in the virtual bit bucket
that have tried to do that in other areas.

John

>> John
>>
>>> +    unsigned int cache_threshold;
>>> +};
>>> +
>>>  typedef struct _virResctrlInfo virResctrlInfo;
>>>  typedef virResctrlInfo *virResctrlInfoPtr;
>>>
>>> @@ -72,7 +85,9 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl,
>>>                         unsigned int level,
>>>                         unsigned long long size,
>>>                         size_t *ncontrols,
>>> -                       virResctrlInfoPerCachePtr **controls);
>>> +                       virResctrlInfoPerCachePtr **controls,
>>> +                       virResctrlInfoMonPtr *monitor);
>>> +
>>>
>>>  int
>>>  virResctrlInfoGetMemoryBandwidth(virResctrlInfoPtr resctrl,
>>>



