[libvirt] [PATCHv8 16/17] qemu: Report cache occupancy (CMT) with domstats

Wang, Huaqiang huaqiang.wang at intel.com
Tue Nov 20 13:25:51 UTC 2018



> -----Original Message-----
> From: Wang, Huaqiang
> Sent: Thursday, November 15, 2018 8:41 PM
> To: John Ferlan <jferlan at redhat.com>; libvir-list at redhat.com
> Cc: Feng, Shaohe <shaohe.feng at intel.com>; Ding, Jian-feng <jian-
> feng.ding at intel.com>; Zang, Rui <rui.zang at intel.com>
> Subject: RE: [PATCHv8 16/17] qemu: Report cache occupancy (CMT) with
> domstats
> 
> 
> 
> > -----Original Message-----
> > From: John Ferlan [mailto:jferlan at redhat.com]
> > Sent: Thursday, November 15, 2018 12:18 AM
> > To: Wang, Huaqiang <huaqiang.wang at intel.com>; libvir-list at redhat.com
> > Cc: Feng, Shaohe <shaohe.feng at intel.com>; Ding, Jian-feng <jian-
> > feng.ding at intel.com>; Zang, Rui <rui.zang at intel.com>
> > Subject: Re: [PATCHv8 16/17] qemu: Report cache occupancy (CMT) with
> > domstats
> >
> >
> >
> > On 11/12/18 8:31 AM, Wang Huaqiang wrote:
> > > Adding the interface in qemu to report CMT statistic information
> > > through command 'virsh domstats --cpu-total'.
> > >
> > > Below is a typical output:
> > >
> > >          # virsh domstats 1 --cpu-total
> > >          Domain: 'ubuntu16.04-base'
> > >            ...
> > >            cpu.cache.monitor.count=2
> > >            cpu.cache.monitor.0.name=vcpus_1
> > >            cpu.cache.monitor.0.vcpus=1
> > >            cpu.cache.monitor.0.bank.count=2
> > >            cpu.cache.monitor.0.bank.0.id=0
> > >            cpu.cache.monitor.0.bank.0.bytes=4505600
> > >            cpu.cache.monitor.0.bank.1.id=1
> > >            cpu.cache.monitor.0.bank.1.bytes=5586944
> > >            cpu.cache.monitor.1.name=vcpus_4-6
> > >            cpu.cache.monitor.1.vcpus=4,5,6
> > >            cpu.cache.monitor.1.bank.count=2
> > >            cpu.cache.monitor.1.bank.0.id=0
> > >            cpu.cache.monitor.1.bank.0.bytes=17571840
> > >            cpu.cache.monitor.1.bank.1.id=1
> > >            cpu.cache.monitor.1.bank.1.bytes=29106176
> > >
> > > Signed-off-by: Wang Huaqiang <huaqiang.wang at intel.com>
> > > ---
> > >  src/libvirt-domain.c   |   9 +++
> > >  src/qemu/qemu_driver.c | 198 +++++++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 207 insertions(+)
> > >
> > > diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c
> > > index 7690339..4895f9f 100644
> > > --- a/src/libvirt-domain.c
> > > +++ b/src/libvirt-domain.c
> > > @@ -11345,6 +11345,15 @@ virConnectGetDomainCapabilities(virConnectPtr conn,
> > >   *     "cpu.user" - user cpu time spent in nanoseconds as unsigned long
> long.
> > >   *     "cpu.system" - system cpu time spent in nanoseconds as unsigned
> long
> > >   *                    long.
> > > + *     "cpu.cache.monitor.count" - tocal cache monitoring groups
> > > + *     "cpu.cache.monitor.M.name" - name of cache monitoring group
> 'M'
> > > + *     "cpu.cache.monitor.M.vcpus" - vcpus for cache monitoring group
> 'M'
> > > + *     "cpu.cache.monitor.M.bank.count" - total bank number of cache
> > monitoring
> > > + *                    group 'M'
> > > + *     "cpu.cache.monitor.M.bank.N.id" - OS assigned cache bank id for
> cache
> > > + *                    'N' in cache monitoring group 'M'
> > > + *     "cpu.cache.monitor.M.bank.N.bytes" - monitor's cache occupancy
> of
> > cache
> > > + *                    bank 'N' in cache monitoring group 'M'
> >
> > I'll comment on these in your update...
> >
> > >   *
> > >   * VIR_DOMAIN_STATS_BALLOON:
> > >   *     Return memory balloon device information.
> > > diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> > > index 89d46ee..d41ae66 100644
> > > --- a/src/qemu/qemu_driver.c
> > > +++ b/src/qemu/qemu_driver.c
> > > @@ -19698,6 +19698,199 @@ typedef enum {
> > >  #define HAVE_JOB(flags) ((flags) & QEMU_DOMAIN_STATS_HAVE_JOB)
> > >
> > >
> > > +typedef struct _virQEMUCpuResMonitorData virQEMUCpuResMonitorData;
> > > +typedef virQEMUCpuResMonitorData *virQEMUCpuResMonitorDataPtr;
> > > +struct _virQEMUCpuResMonitorData{
> >
> > Data {
> >
> 
> Got it. One space before '{'.
> 
> > > +    const char *name;
> > > +    char *vcpus;
> > > +    virResctrlMonitorType tag;
> > > +    virResctrlMonitorStatsPtr stats;
> > > +    size_t nstats;
> > > +};
> > > +
> > > +
> > > +static int
> > > +qemuDomainGetCpuResMonitorData(virDomainObjPtr dom,
> > > +                               virQEMUCpuResMonitorDataPtr mondata)
> > > +{
> > > +    virDomainResctrlDefPtr resctrl = NULL;
> > > +    size_t i = 0;
> > > +    size_t j = 0;
> > > +    size_t l = 0;
> > > +
> > > +    for (i = 0; i < dom->def->nresctrls; i++) {
> > > +        resctrl = dom->def->resctrls[i];
> > > +
> > > +        for (j = 0; j < resctrl->nmonitors; j++) {
> > > +            virDomainResctrlMonDefPtr domresmon = NULL;
> > > +            virResctrlMonitorPtr monitor = resctrl->monitors[j]->instance;
> > > +
> > > +            domresmon = resctrl->monitors[j];
> > > +            mondata[l].tag = domresmon->tag;
> >
> > Why "l" (BTW: 'k' would be preferred because '1' and 'l' are very
> > difficult to delineate).
> 
> This 'l' and the occurrences below will be substituted with 'k'.
> 
> >
> > > +
> > > +            /* If virBitmapFormat successfully returns a vcpu string, then
> > > +             * mondata[l].vcpus is assigned a memory buffer holding it;
> > > +             * let this newly allocated buffer be freed along with
> > > +             * the free of 'mondata' */
> > > +            if (!(mondata[l].vcpus = virBitmapFormat(domresmon->vcpus)))
> > > +                return -1;
> > > +
> > > +            if (!(mondata[l].name = virResctrlMonitorGetID(monitor))) {
> > > +                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
> > > +                               _("Could not get monitor ID"));
> > > +                return -1;
> > > +            }
> > > +
> > > +            if (domresmon->tag == VIR_RESCTRL_MONITOR_TYPE_CACHE) {
> >
> > Something doesn't quite add up with this... Since we're only filling
> > in types with 'cache' types and erroring out otherwise ... see [1]
> > data points below...
> >
> > > +                if (virResctrlMonitorGetCacheOccupancy(monitor,
> > > +                                                       &mondata[l].stats,
> > > +                                                       &mondata[l].nstats) < 0)
> > > +                    return -1;
> > > +            } else {
> > > +                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
> > > +                               _("Invalid CPU resource type"));
> > > +                return -1;
> >
> > [1] Perhaps this should be done up front and "avoided" since this
> > helper should only care about getting cache stats...
> 
> Regarding this error message, it might be over-checking, since we have
> already made a safety check when building *domresmon->tag.
> I will remove the error report lines, since *domresmon->tag can currently
> only be VIR_RESCTRL_MONITOR_TYPE_CACHE.
> 
> [R1]:
> In my plan this function is to be used for both
> VIR_RESCTRL_MONITOR_TYPE_CACHE and
> VIR_RESCTRL_MONITOR_TYPE_MEMBW.
> 
> Besides this function, qemuDomainGetStatsCpuResMonitorPerTag (name
> will be refined) and qemuDomainGetStatsCpuResMonitor (name will be
> refined) are also planned to support both
> VIR_RESCTRL_MONITOR_TYPE_CACHE monitors and
> VIR_RESCTRL_MONITOR_TYPE_MEMBW monitors, even though the names are
> spelled with the word 'CpuRes'.
> 
> For me the memBW monitor is also in the scope of 'CPU'. Here is my
> understanding:
> 
> 1. CAT/CMT is a technology for the cache; obviously it belongs to the
> scope of 'CPU'.
> 
> 2. The name of MBM, memory bandwidth monitoring, may be confusing, but
> it gets the memory bandwidth utilization information by tracking L3
> cache usage per CPU thread, not by reading DRAM counters, so the MBM
> technology is closer to the cache than to the DRAM controller.
> This is the reason I think MBM is also a technology in the scope of 'CPU'.
> Some links on MBM and the kernel resctrl interface for your reference:
> https://software.intel.com/en-us/articles/introduction-to-memory-bandwidth-monitoring
> https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
> 
> Based on the above understanding, in naming the three newly introduced
> functions I chose names with the word 'cpures' (CPU resource): I think
> the cache is a CPU resource and memBW is a CPU resource, and the new
> functions will handle both resource types. So in this patch my plan is
> to get cache and memory bandwidth statistics in one function,
> qemuDomainGetStatsCpuResMonitor; the interface connected to the
> 'domstats' function would be written as:
> 
> *** One function solution ***
> static int
> qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
>                       virDomainObjPtr dom,
>                       virDomainStatsRecordPtr record,
>                       int *maxparams,
>                       unsigned int privflags ATTRIBUTE_UNUSED)
> {
>      if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
>          return -1;
> 
>      /* Get cache and memory bandwidth statistics */
>      /* <-- one function for both cache and memBW */
>      if (qemuDomainGetStatsCpuResMonitor(dom, record, maxparams) < 0)
>          return -1;
> 
>      return 0;
> }
> 
> Otherwise, if you still think it is not wise to process memBW information
> and cache in one function, since they have a very obvious boundary, then
> we'd better add two functions and not use the words 'cpu resource'/'cpures'.
> Like this:
> 
> *** Two functions solution ***
> static int
> qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
>                       virDomainObjPtr dom,
>                       virDomainStatsRecordPtr record,
>                       int *maxparams,
>                       unsigned int privflags ATTRIBUTE_UNUSED)
> {
>      if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
>          return -1;
> 
>      /* Get cache statistics */
>      /* <-- function only for cache. */
>      if (qemuDomainGetStatsCacheMonitor(dom, record, maxparams) < 0)
>          return -1;
> 
>      return 0;
> }
> 
> /* A new function for getting memory bandwidth statistics */
> static int
> qemuDomainGetStatsMemStat(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
>                           virDomainObjPtr dom,
>                           virDomainStatsRecordPtr record,
>                           int *maxparams,
>                           unsigned int privflags ATTRIBUTE_UNUSED)
> {
>      /* Read memBW monitors and output ...
>       *
>       * memorybandwidth.monitor.count=...
>       * memorybandwidth.monitor.0.name=...
>       * memorybandwidth.monitor.0.vcpus=...
>       * ...
>       */
>      return 0;
> }
> 
> What do you think? Which do you prefer, the one-function interface or
> the two-function interface?
> (Function names might be refined.)
> 
> >
> > IOW:
> >
> >        for (j = 0; j < resctrl->nmonitors; j++) {
> >            virDomainResctrlMonDefPtr domresmon = NULL;
> >            virResctrlMonitorPtr monitor = resctrl->monitors[j]->instance;
> >
> >            domresmon = resctrl->monitors[j];
> >
> >            if (domresmon->tag != VIR_RESCTRL_MONITOR_TYPE_CACHE)
> >                continue;
> >
> > Thus this loop only fetches cache information and then the 'l' (or
> > preferably 'k') makes more sense.
> 
> Before memBW is introduced, it might be better to keep the code clearer,
> just as you wrote; when memBW comes we can make the changes then.
> 
> But I still need your opinion on using a one-function interface or an
> interface of two separate functions for cache and memory statistics.
> 
> For now we will add 'domstats' output something like:
> cpu.cache.monitor.count = ...
> cpu.cache.monitor.0.name=...
> cpu.cache.monitor.0.vcpus=...
> ...
> 
> Would the following arrangement for the memBW monitor be acceptable?
> 
> cpu.memorybandwidth.monitor.count=...
> cpu.memorybandwidth.monitor.0.name=...
> cpu.memorybandwidth.monitor.0.vcpus=...
> ...
> 
> 

I have had some discussions with Intel engineers about MBM
and CMT. It is known that these are very similar technologies in the
underlying hardware, but they track different sets of CPU-internal
counters. For CMT the concept is very clear: it monitors the last
level cache and reports the cache usage. MBM reports memory bandwidth
by tracking counters on the cache side; since that reflects the DRAM
bandwidth, there is no problem in calling it memory bandwidth.
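
For background, both counter types are exposed per monitoring group by
the kernel resctrl interface described in the links above, roughly like
this (<group> and the values are only illustrative; the exact path
depends on how the groups were created):

# cat /sys/fs/resctrl/mon_groups/<group>/mon_data/mon_L3_00/llc_occupancy
4505600
# cat /sys/fs/resctrl/mon_groups/<group>/mon_data/mon_L3_00/mbm_total_bytes
12345678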

I'll submit the updated code for patches 16 and 17, addressing
your review comments and suggestions.

Since the cache monitor is the only monitor we support now, just
as you pointed out, I'll leave the memBW monitor tag out to avoid
confusion.
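
Concretely, with your suggestions applied, the data-collecting function
would look roughly like this (an untested sketch; names still to be
refined):

static int
qemuDomainGetCpuResMonitorData(virDomainObjPtr dom,
                               virQEMUCpuResMonitorDataPtr mondata)
{
    size_t i, j, k = 0;

    for (i = 0; i < dom->def->nresctrls; i++) {
        virDomainResctrlDefPtr resctrl = dom->def->resctrls[i];

        for (j = 0; j < resctrl->nmonitors; j++) {
            virDomainResctrlMonDefPtr domresmon = resctrl->monitors[j];
            virResctrlMonitorPtr monitor = domresmon->instance;

            /* Only cache monitors are supported for now; skip any
             * other tag instead of reporting an error. */
            if (domresmon->tag != VIR_RESCTRL_MONITOR_TYPE_CACHE)
                continue;

            mondata[k].tag = domresmon->tag;

            /* The vcpu string is freed together with 'mondata' */
            if (!(mondata[k].vcpus = virBitmapFormat(domresmon->vcpus)))
                return -1;

            if (!(mondata[k].name = virResctrlMonitorGetID(monitor))) {
                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                               _("Could not get monitor ID"));
                return -1;
            }

            if (virResctrlMonitorGetCacheOccupancy(monitor,
                                                   &mondata[k].stats,
                                                   &mondata[k].nstats) < 0)
                return -1;

            k++;
        }
    }

    return 0;
}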

For MBM, I'll submit RFC patches to raise a discussion about its
interface.

I'll also rename the newly added data structure and function
names; please have a review.
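
For reference, here is a rough client-side sketch (untested; error
handling trimmed) of how the new counters could be consumed through the
public stats API, i.e. the same data 'virsh domstats --cpu-total'
prints:

#include <stdio.h>
#include <string.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
    virDomainStatsRecordPtr *records = NULL;
    int nrecords;
    int i, j;

    if (!conn)
        return 1;

    nrecords = virConnectGetAllDomainStats(conn,
                                           VIR_DOMAIN_STATS_CPU_TOTAL,
                                           &records, 0);

    for (i = 0; i < nrecords; i++) {
        printf("Domain: '%s'\n", virDomainGetName(records[i]->dom));

        for (j = 0; j < records[i]->nparams; j++) {
            virTypedParameterPtr p = &records[i]->params[j];

            /* Pick out only the new cache monitor parameters */
            if (strncmp(p->field, "cpu.cache.monitor.", 18))
                continue;

            if (p->type == VIR_TYPED_PARAM_UINT)
                printf("  %s=%u\n", p->field, p->value.ui);
            else if (p->type == VIR_TYPED_PARAM_STRING)
                printf("  %s=%s\n", p->field, p->value.s);
        }
    }

    if (records)
        virDomainStatsRecordListFree(records);
    virConnectClose(conn);
    return 0;
}

(Compile with 'gcc demo.c -lvirt'; 'demo.c' is a placeholder file name.)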

BR
Huaqiang

> >
> > > +            }
> > > +
> > > +            l++;
> > > +        }
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +
> >
> > Might be nice to add comments...
> >
> > > +static int
> > > +qemuDomainGetStatsCpuResMonitorPerTag(virQEMUCpuResMonitorDataPtr mondata,
> > > +                                      size_t nmondata,
> > > +                                      virResctrlMonitorType tag,
> > > +                                      virDomainStatsRecordPtr record,
> > > +                                      int *maxparams)
> > > +{
> > > +    char param_name[VIR_TYPED_PARAM_FIELD_LENGTH];
> > > +    unsigned int nmonitors = 0;
> > > +    const char *resname = NULL;
> > > +    const char *resnodename = NULL;
> > > +    size_t i = 0;
> > > +
> > > +    for (i = 0; i < nmondata; i++) {
> > > +        if (mondata[i].tag == tag)
> > > +            nmonitors++;
> > > +    }
> > > +
> > > +    if (!nmonitors)
> > > +        return 0;
> >
> > I'd place the above two below the following hunk - perf wise and
> > logically since @tag is the important piece here.  However, it may not
> > be important to compare the [i].tag == tag as long as you've done your
> > collection of *only* one type.  Hope this makes sense!
> >
> > Thus your code is just:
> >
> >     if (nmondata == 0)
> >         return 0;
> >
> 
> Yes. At least currently there is no need to add up *nmonitors again.
> 
> > > +
> > > +    if (tag == VIR_RESCTRL_MONITOR_TYPE_CACHE) {
> > > +        resname = "cache";
> > > +        resnodename = "bank";
> > > +    } else if (tag == VIR_RESCTRL_MONITOR_TYPE_MEMBW) {
> > > +        resname = "memBW";
> > > +        resnodename = "node";
> >
> > [1] In the current scheme of qemuDomainGetStatsCpuResMonitor, how
> > could we get this tag?
> 
> Another tag, VIR_RESCTRL_MONITOR_TYPE_MEMBW, will be added to
> qemuDomainGetStatsCpuResMonitor when the memBW monitor is introduced,
> according to my plan.
> 
> But I know you are challenging this plan. I'll make the change according
> to your suggestion.
> 
> >
> > If your goal was to make a "generic" printing API to be used by both
> > cpustats and memstats (eventually), then perhaps the name of the
> > helper should just be qemuDomainGetStatsResMonitor (or
> > *MonitorParams), since @tag is a parameter and it's fairly easy to see
> > its use, and the formatting of the params is based purely on the tag
> > in the generically output data.
> >
> > The helper should also be appropriately placed in qemu_driver.c such
> > that when memBW stats support is added it can just be used and
> > doesn't need to be moved.
> >
> 
> As I stated in [R1], I regard memstats as one kind of CPU resource.
> 
> If we choose the two-function scheme, things will be handled as you
> stated.
> 
> 
> > > +    } else {
> > > +        return 0;
> > > +    }
> > > +
> > > +    snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > > +             "cpu.%s.monitor.count", resname);
> > > +    if (virTypedParamsAddUInt(&record->params, &record->nparams,
> > > +                              maxparams, param_name, nmonitors) < 0)
> > > +        return -1;
> > > +
> > > +    for (i = 0; i < nmonitors; i++) {
> > > +        size_t l = 0;
> >
> > Similar 'l' concerns here; use 'j' instead
> 
> 'l' will be modified to 'j'.
> 
> >
> > > +
> > > +        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > > +                 "cpu.%s.monitor.%zd.name", resname, i);
> > > +        if (virTypedParamsAddString(&record->params,
> > > +                                    &record->nparams,
> > > +                                    maxparams,
> > > +                                    param_name,
> > > +                                    mondata[i].name) < 0)
> > > +            return -1;
> > > +
> > > +        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > > +                 "cpu.%s.monitor.%zd.vcpus", resname, i);
> > > +        if (virTypedParamsAddString(&record->params, &record->nparams,
> > > +                                    maxparams, param_name,
> > > +                                    mondata[i].vcpus) < 0)
> > > +            return -1;
> > > +
> > > +        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > > +                 "cpu.%s.monitor.%zd.%s.count", resname, i, resnodename);
> > > +        if (virTypedParamsAddUInt(&record->params, &record->nparams,
> > > +                                  maxparams, param_name,
> > > +                                  mondata[i].nstats) < 0)
> > > +            return -1;
> > > +
> > > +        for (l = 0; l < mondata[i].nstats; l++) {
> > > +            snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > > +                     "cpu.%s.monitor.%zd.%s.%zd.id",
> > > +                     resname, i, resnodename, l);
> > > +            if (virTypedParamsAddUInt(&record->params, &record->nparams,
> > > +                                      maxparams, param_name,
> > > +                                      mondata[i].stats[l].id) < 0)
> > > +                return -1;
> > > +
> > > +            snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > > +                     "cpu.%s.monitor.%zd.%s.%zd.bytes",
> > > +                     resname, i, resnodename, l);
> > > +            if (virTypedParamsAddUInt(&record->params, &record->nparams,
> > > +                                      maxparams, param_name,
> > > +                                      mondata[i].stats[l].val) < 0)
> > > +                return -1;
> > > +        }
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +
> > > +static int
> > > +qemuDomainGetStatsCpuResMonitor(virDomainObjPtr dom,
> > > +                                virDomainStatsRecordPtr record,
> > > +                                int *maxparams)
> > > +{
> > > +    virDomainResctrlDefPtr resctrl = NULL;
> > > +    virQEMUCpuResMonitorDataPtr mondata = NULL;
> > > +    unsigned int nmonitors = 0;
> > > +    size_t i = 0;
> > > +    int ret = -1;
> > > +
> > > +    if (!virDomainObjIsActive(dom))
> > > +        return 0;
> > > +
> > > +    for (i = 0; i < dom->def->nresctrls; i++) {
> > > +        resctrl = dom->def->resctrls[i];
> > > +        nmonitors += resctrl->nmonitors;
> >
> > [1] To further the above conversation, @nmonitors won't be limited by
> > cache stats only, but the collection of the @mondata is. So I think
> > here nmonitors should only include tags w/ *TYPE_CACHE.
> >
> 
> > > +    }
> > > +
> > > +    if (!nmonitors)
> > > +        return 0;
> > > +
> > > +    if (VIR_ALLOC_N(mondata, nmonitors) < 0)
> > > +        return -1;
> > > +
> > > +    if (qemuDomainGetCpuResMonitorData(dom, mondata) < 0)
> >
> > You could pass @nmonitors here and use as 'nmondata' in the above
> > function to "ensure" that the nmonitors looping and filling of mondata
> > doesn't go beyond bounds - although that shouldn't happen, so it's not
> > a requirement as long as the algorithm to calculate @nmonitors and
> > allocate @mondata is the same algorithm used to fill in @mondata.
> >
> 
> qemuDomainGetCpuResMonitorData gets all monitor statistics and stores
> them in mondata; in my design it is not only for cache statistics.
> But since we only have the cache monitor currently, I'll do as you
> suggested.
> 
> 
> > > +        goto cleanup;
> > > +
> > > +    for (i = VIR_RESCTRL_MONITOR_TYPE_UNSUPPORT + 1;
> > > +         i < VIR_RESCTRL_MONITOR_TYPE_LAST; i++) {
> > > +        if (qemuDomainGetStatsCpuResMonitorPerTag(mondata, nmonitors, i,
> > > +                                                  record, maxparams) < 0)
> > > +            goto cleanup;
> > > +    }
> >
> > Similar comment here - this is only being called from
> > qemuDomainGetStatsCpu thus it should only pass
> > VIR_RESCTRL_MONITOR_TYPE_CACHE and not be run in a loop. If
> > eventually memBW data was filled in - it would be printed next to
> > cpu stats data and that doesn't necessarily make sense, does it?
> >
> 
> Only for cache, OK.
> 
> >
> > > +
> > > +    ret = 0;
> > > + cleanup:
> > > +    for (i = 0; i < nmonitors; i++)
> > > +        VIR_FREE(mondata[i].vcpus);
> > > +    VIR_FREE(mondata);
> > > +
> > > +    return ret;
> > > +}
> > > +
> > > +
> > >  static int
> > >  qemuDomainGetStatsCpuCgroup(virDomainObjPtr dom,
> > >                              virDomainStatsRecordPtr record,
> > > @@ -19747,6 +19940,11 @@ qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
> > >  {
> > >      if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
> > >          return -1;
> > > +
> > > +    if (qemuDomainGetStatsCpuResMonitor(dom, record, maxparams) < 0)
> > > +        return -1;
> > > +
> > > +    return 0;
> >
> > This obviously would have some differences based on my comments from
> > patch15.
> >
> 
> Got it. I still remember the changes you made.
> 
> > Rather than have you post patches 1-15 again just to fix 16, I'll push
> > 1-15 and let you work on 16 and 17.  We still have time before the
> > next release.
> >
> 
> I am looking forward to your opinion on whether I can regard the
> 'memBW monitor' (MBM) as a kind of CPU resource in libvirt and report
> the memory bandwidth statistics through the command 'virsh domstats --cpu-total'.
> 
> I still think the MBM bandwidth might differ significantly from the DRAM
> memory bandwidth, because the cache hierarchy has changed significantly
> since the Skylake-SP platform: on Skylake-SP not all data travels from
> DRAM to the CPU through the L3 cache; data might be read into the L2
> cache directly from DRAM.
> I am asking the internal team about the difference between the
> MBM-observed bandwidth and the DRAM bandwidth, and I will post an
> update once I get answers.
> 
> > Besides once I push, we'll find out fairly quickly whether some other
> > arch has build/compile problems!
> >
> > John
> >
> 
> Thanks for the review.
> Huaqiang
> 
> > >  }
> > >
> > >
> > >



