[libvirt] Overhead for a default cpu cg placement scheme

Thu Jun 18 09:30:31 UTC 2015

On Thu, Jun 18, 2015 at 12:09 PM, Daniel P. Berrange
<berrange at redhat.com> wrote:
> On Wed, Jun 17, 2015 at 10:55:35PM +0300, Andrey Korolyov wrote:
>>
>> Sorry for the delay. Running 'perf numa numa-mem -p 8 -t 2 -P 384 -C 0
>> -M 0 -s 200 -zZq --thp 1 --no-data_rand_walk' exposes a difference of
>> 0.96 versus 1. The trick I did before (and successfully forgot) is
>> setting the value of cfs_quota in a machine-wide group, one level up
>> from the individual vCPUs.
>>
>> Right now, libvirt sets values from
>> <cputune>
>> <period>100000</period>
>> <quota>200000</quota>
>> </cputune>
>> for each vCPU thread cgroup, which is a bit wrong by my understanding, like
>> /cgroup/cpu/machine/vmxx/vcpu0: period=100000, quota=200000
>> /cgroup/cpu/machine/vmxx/vcpu1: period=100000, quota=200000
>> /cgroup/cpu/machine/vmxx/vcpu2: period=100000, quota=200000
>> /cgroup/cpu/machine/vmxx/vcpu3: period=100000, quota=200000
>>
>>
>> In other words, the user (me) assumed that he had limited the total
>> consumption of the VM to two cores, though every thread can
>> consume up to a single CPU, resulting in four-core consumption
>> instead. With different vCPU count/quota/host CPU count ratios there
>> would be different practical limits for the same period to quota
>> ratio, whereas a single total quota results in a much more predictable
>> peak consumption. I had put the same quota to period ratio in the
>> VM-level directory to meet the expectations from the config setting,
>> and that is where one can observe the mentioned performance drop.
>>
>> With default placement there is no difference in the performance
>> numbers, but the behavior of libvirt itself is somewhat controversial
>> here. The documentation says that this is the right behavior as well,
>> but I think that limiting the vCPU group with a total quota is far
>> more flexible than per-vCPU limits, which can negatively impact
>> single-threaded processes in the guest. Also, the overall consumption
>> has to be recalculated every time the host core count or guest core
>> count changes. Sorry for not mentioning the custom scheme before; if
>> my assumption about execution flexibility is plainly wrong, I'll
>> withdraw my concerns above. I have been using 'my' scheme for a
>> couple of years in production and it has proved (for me) to be far
>> less complex for workload balancing on a CPU-congested hypervisor
>> than the generic one.
>
> As you say, there are two possible directions libvirt could have taken
> when implementing the scheduler tunables: either apply them to the
> VM as a whole, or apply them to the individual vCPUs. We debated this
> a fair bit, but in the end we took the per-vCPU approach. There were
> two real compelling reasons. First, if users have 2 guests with
> identical configurations, but give one of the guests 2 vCPUs and the
> other guest 4 vCPUs, the general expectation is that the one with
> 4 vCPUs will have twice the performance. If we apply the CFS tuning
> at the VM level, then as you add vCPUs you'd get no increase in
> performance.  The second reason was that people wanted to be able to
> control performance of the emulator threads, separately from the
> vCPU threads. Now we also have dedicated I/O threads that can have
> different tuning set. This would be impossible if we were always
> setting stuff at the VM level.
>
> It would in theory be possible for us to add a further tunable to the
> VM config which allowed VM-level tuning, e.g. we could define something
> like
>
>  <vmtune>
>    <period>100000</period>
>    <quota>200000</quota>
>  </vmtune>
>
> Semantically, if <vmtune> was set, we would then forbid use of the
> <cputune> and <emulatortune> configurations, as they'd be mutually
> exclusive. In such a case we'd avoid creating the sub-cgroups for
> vCPUs and emulator threads, etc.
>
> The question is whether the benefit would outweigh the extra code
> complexity to deal with this. I appreciate you would desire this
> kind of setup, but I think we'd probably need more than one person
> requesting use of this kind of setup in order to justify the work
> involved.
>

Thanks for the excellent explanation! I see: what is obvious for
Xen-era hosting (more vCPUs means more power) was not obvious to me.
I agree that, with the approach I proposed, fewer and more powerful
cores are always preferable to a large set of 'weak on average' cores.
What is still confusing is that one has to keep *three* distinct
things in mind when setting a limit in the current scheme: the real or
HT core count, the VM's core count, and the quota to period ratio
itself, in order to determine the upper cap for a given VM's
consumption. It gets even more confusing when we talk about share
ratios. For me, it is completely unclear how two VMs with a 2:1 share
ratio for both vCPUs and emulator would behave: will the emulator
thread starve first under CPU congestion or vice versa; will the many
vCPU processes with a share equal to the emulator's have enough
influence inside a capped node to shift the actually available
bandwidth away from 2:1; will the guest emulator spread the workload
between vCPUs fairly, so that their host scheduling can meet the
mentioned ratio; and so on.
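
To make the arithmetic concrete, here is a rough sketch of the upper
cap under both schemes. It is only an illustration, not libvirt code:
the function names are hypothetical, emulator and I/O threads are
ignored, quota is assumed to be positive (no "unlimited" value), and I
assume a single vCPU thread can never occupy more than one host CPU at
a time.

```python
def per_vcpu_cap(host_cpus, vcpus, period, quota):
    """Upper bound, in host CPUs, when quota/period is applied to each
    vCPU thread cgroup (the per-vCPU <cputune> scheme discussed above)."""
    # One thread cannot run on more than one CPU at once, so a
    # per-thread ratio above 1.0 has no further effect.
    per_thread = min(1.0, quota / period)
    return min(float(host_cpus), vcpus * per_thread)


def vm_level_cap(host_cpus, vcpus, period, quota):
    """Upper bound, in host CPUs, when the same quota/period is applied
    once to the VM-level group (the hypothetical 'total' scheme)."""
    return min(float(host_cpus), float(vcpus), quota / period)


if __name__ == "__main__":
    # Values from the example in this thread; the 8-CPU host is assumed.
    period, quota = 100000, 200000
    print(per_vcpu_cap(8, 4, period, quota))  # 4.0 cores
    print(vm_level_cap(8, 4, period, quota))  # 2.0 cores
```

With the per-vCPU scheme the result depends on all three inputs (host
CPUs, vCPU count, ratio), while the VM-level cap for the same ratio
stays at two cores however many vCPUs are added.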

The 'total' cap is also a bit fairer here by my understanding,
because managing individual quotas and shares requires deep
knowledge of how a given emulator (QEMU) behaves on a congested
hypervisor node. I am not bringing up the cases where there is spare
CPU time, because in those all the mentioned problems, except the
actual calculation complexity, would probably not influence the
performance numbers.

Thanks again for taking the time to explain the roots of the
existing solution. I hope I have only made a different point, which
should not be regarded as a call for a parallel control implementation,
because it reflects only my approach to resource calculation,
probably nothing more.