[libvirt] [PATCHv3 2/8] qemu: bulk stats: implement CPU stats group

Wed Sep 10 13:46:56 UTC 2014

----- Original Message -----
> From: "Wang Rui" <moon.wangrui at huawei.com>
> To: "Francesco Romani" <fromani at redhat.com>
> Cc: libvir-list at redhat.com, pkrempa at redhat.com
> Sent: Wednesday, September 10, 2014 10:56:47 AM
> Subject: Re: [libvirt] [PATCHv3 2/8] qemu: bulk stats: implement CPU stats group
> 
> On 2014/9/8 21:05, Francesco Romani wrote:
> > This patch implements the VIR_DOMAIN_STATS_CPU_TOTAL
> > group of statistics.
> > 
> > Signed-off-by: Francesco Romani <fromani at redhat.com>
> > ---
> >  include/libvirt/libvirt.h.in |  1 +
> >  src/libvirt.c                |  9 ++++++++
> >  src/qemu/qemu_driver.c       | 51
> >  ++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 61 insertions(+)
> [...]
> > diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> > index 2950a4b..cfc5941 100644
> > --- a/src/qemu/qemu_driver.c
> > +++ b/src/qemu/qemu_driver.c
> > @@ -96,6 +96,7 @@
> >  #include "storage/storage_driver.h"
> >  #include "virhostdev.h"
> >  #include "domain_capabilities.h"
> > +#include "vircgroup.h"
> 
> Hi, Francesco.
> I see the file including relationship. 'qemu_driver.c' includes
> 'qemu_cgroup.h' which
> includes 'vircgroup.h'. There are other virCgroupGet* functions called in
> qemu_driver.c
> now. So I think here "include vircgroup.h" is not necessary.

Thanks for the research, I'll remove

> >  #define VIR_FROM_THIS VIR_FROM_QEMU
> >  
> > @@ -17338,6 +17339,55 @@ qemuDomainGetStatsState(virConnectPtr conn
> > ATTRIBUTE_UNUSED,
> >  }
> >  
> >  
> > +static int
> > +qemuDomainGetStatsCpu(virConnectPtr conn ATTRIBUTE_UNUSED,
> > +                      virDomainObjPtr dom,
> > +                      virDomainStatsRecordPtr record,
> > +                      int *maxparams,
> > +                      unsigned int privflags ATTRIBUTE_UNUSED)
> > +{
> > +    qemuDomainObjPrivatePtr priv = dom->privateData;
> > +    unsigned long long cpu_time = 0;
> > +    unsigned long long user_time = 0;
> > +    unsigned long long sys_time = 0;
> > +    int ncpus = 0;
> > +    int err;
> > +
> > +    ncpus = nodeGetCPUCount();
> > +    if (ncpus > 0 &&
> > +        virTypedParamsAddUInt(&record->params,
> > +                              &record->nparams,
> > +                              maxparams,
> > +                              "cpu.count",
> > +                              (unsigned int)ncpus) < 0)
> > +        return -1;
> > +
> > +    err = virCgroupGetCpuacctUsage(priv->cgroup, &cpu_time);
> > +    if (!err && virTypedParamsAddULLong(&record->params,
> > +                                        &record->nparams,
> > +                                        maxparams,
> > +                                        "cpu.time",
> > +                                        cpu_time) < 0)
> > +        return -1;
> > +    err = virCgroupGetCpuacctStat(priv->cgroup, &user_time, &sys_time);
> > +    if (!err && virTypedParamsAddULLong(&record->params,
> > +                                        &record->nparams,
> > +                                        maxparams,
> > +                                        "cpu.user",
> > +                                        user_time) < 0)
> > +        return -1;
> > +    if (!err && virTypedParamsAddULLong(&record->params,
> > +                                        &record->nparams,
> > +                                        maxparams,
> > +                                        "cpu.system",
> > +                                        sys_time) < 0)
> > +        return -1;
> 
> 1. If any of the 'err's is not zero, the function may returns 0 as success.
>    Is this the expected return? Or at least we can give a warning that we
>    miss some parameters.
> 2. I think it's better to report an error or warning log before return -1.

The idea here (well, at least my idea :) ) is to gather as much as data as possible,
and to silently skip failures here. The lack of expected output is a good enough
indicator that a domain is unresponsive.
-1 is reported for really unrecoverable error, like memory (re)allocation failure.

Otherwise, how could the client code be meaningfully informed that some group
failed, maybe partially? It is possible that different groups fail for different
domains: how could we convey this information?

I have no problems to go this route if there is consensus this is the preferred
way to go, but then we'll need to discuss how to convey a meaningful error convention.

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani