[libvirt] [PATCH RESEND RFC v4 5/6] qemu: Implement cfs_period and cfs_quota's modification

Thu Jul 21 13:20:25 UTC 2011

On 07/21/2011 06:01 AM, Daniel P. Berrange wrote:
> On Thu, Jul 21, 2011 at 10:11:32AM +0800, Wen Congyang wrote:
>> @@ -5851,7 +5953,6 @@ static int qemuSetSchedulerParametersFlags(virDomainPtr dom,
>>          virTypedParameterPtr param = &params[i];
>>  
>>          if (STREQ(param->field, "cpu_shares")) {
>> -            int rc;
>>              if (param->type != VIR_TYPED_PARAM_ULLONG) {
>>                  qemuReportError(VIR_ERR_INVALID_ARG, "%s",
>>                                  _("invalid type for cpu_shares tunable, expected a 'ullong'"));
>> @@ -5870,19 +5971,47 @@ static int qemuSetSchedulerParametersFlags(virDomainPtr dom,
>>              }
>>  
>>              if (flags & VIR_DOMAIN_AFFECT_CONFIG) {
>> -                persistentDef = virDomainObjGetPersistentDef(driver->caps, vm);
>> -                if (!persistentDef) {
>> -                    qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>> -                                    _("can't get persistentDef"));
>> +                vmdef->cputune.shares = params[i].value.ul;
>> +            }
>> +        } else if (STREQ(param->field, "cfs_period")) {
> 
> [snip]
> 
>> +        } else if (STREQ(param->field, "cfs_quota")) {
> 
> 
> In the XML file we now have
> 
>   <cputune>
>     <shares>1024</shares>
>     <period>90000</period>
>     <quota>0</quota>
>   </cputune>
> 
> 
> But the schedinfo parameter are being named
> 
>  cpu_shares: 1024
>  cfs_period: 90000
>  cfs_quota: 0
> 
> These new tunables should be named 'cpu_period' and 'cpu_quota' too.

You mean 'cfs_period' and 'cfs_quota', right?  The only problem I see
with that naming (specifically for the quota) is that it implies a
direct relationship to the tunable in cgroupfs.  While true for
'cpu_shares', it is not true for quota since the proper setting to apply
depends on the threading behavior of qemu.

>> +static int
>> +qemuGetVcpusBWLive(virDomainObjPtr vm, virCgroupPtr cgroup,
>> +                   unsigned long long *period, long long *quota)
>> +{
>> +    virCgroupPtr cgroup_vcpu = NULL;
>> +    qemuDomainObjPrivatePtr priv = NULL;
>> +    int rc;
>> +    int ret = -1;
>> +
>> +    priv = vm->privateData;
>> +    if (priv->nvcpupids == 0 || priv->vcpupids[0] == vm->pid) {
>> +        /* We do not create sub dir for each vcpu */
>> +        rc = qemuGetVcpuBWLive(cgroup, period, quota);
>> +        if (rc < 0)
>> +            goto cleanup;
>> +
>> +        if (*quota > 0)
>> +            *quota /= vm->def->vcpus;
>> +        goto out;
>> +    }
>> +
>> +    /* get period and quota for vcpu0 */
>> +    rc = virCgroupForVcpu(cgroup, 0, &cgroup_vcpu, 0);
>> +    if (!cgroup_vcpu) {
>> +        virReportSystemError(-rc,
>> +                             _("Unable to find vcpu cgroup for %s(vcpu: 0)"),
>> +                             vm->def->name);
>> +        goto cleanup;
>> +    }
>> +
>> +    rc = qemuGetVcpuBWLive(cgroup_vcpu, period, quota);
>> +    if (rc < 0)
>> +        goto cleanup;
> 
> This is also bad IMHO, giving different semantics for the operation
> of the API, depending on whether the QEMU impl has VCPUs threads or
> not. Again this just says to me that we need this to apply at the VM
> level not the VCPU level.

IMO, this firms up the semantics of the libvirt API.  As these patches
are written, the libvirt semantics are: "Apply cpu capping to this
domain.  Period is the measurement interval.  Quota is the amount of
time each vcpu may run within the period."  We are able to insulate the
users from details of the underlying qemu -- whether it supports threads
or not.

Another reason a per-vcpu quota setting is more desirable than a global
setting is seen when changing the number of vcpus assigned to a domain.
 Each time you change the vcpu count, you would have to change the quota
number too.  Imagine all of the complaints from users when they increase
the vcpu count and find their domain actually performs worse.

> If we really do need finer per-VCPU controls, in addition to a per-VM
> control, I think that should likely be done as a separate tunable,
> or separate CPU
> 
> eg, we could have 2 sets of tunables, one that applies to the VM and
> the second set that adds further controls to the VCPUs.

How would this behave when qemu doesn't support vcpu threads?  We'd
either have to throw an error, or just ignore the vcpu_ settings.
either way, you're exposing users to qemu internals.  They would need to
change their domain.xml files when moving from a multi-threaded qemu to
a single-threaded qemu.

>   <cputune>
>     <shares>1024</shares>
>     <period>90000</period>
>     <quota>0</quota>
> 
>     <vcpu_shares>1024</vcpu_shares>
>     <vcpu_period>90000</vcpu_period>
>     <vcpu_quota>0</vcpu_quota>
>   </cputune>

I actually do like the <vcpu_*> naming because it makes two things more
clear: Units apply at the vcpu level, and these stats don't necessarily
map directly to cgroupfs.  However, I would still maintain that the
current logic should be used for applying the settings (per-vcpu if
supported, multiply by nvcpus and apply globally if vcpu threads are not
supported).

-- 
Adam Litke
IBM Linux Technology Center