[libvirt] CFS Hardlimits and the libvirt cgroups implementation

Fri Jun 10 13:10:01 UTC 2011

On 06/10/2011 04:20 AM, Daniel P. Berrange wrote:
> On Wed, Jun 08, 2011 at 02:20:23PM -0500, Adam Litke wrote:
>> Hi all.  In this post I would like to bring up 3 issues which are
>> tightly related: 1. unwanted behavior when using cfs hardlimits with
>> libvirt, 2. Scaling cputune.share according to the number of vcpus, 3.
>> API proposal for CFS hardlimits support.
>>
>>
>> === 1 ===
>> Mark Peloquin (on cc:) has been looking at implementing CFS hard limit
>> support on top of the existing libvirt cgroups implementation and he has
>> run into some unwanted behavior when enabling quotas that seems to be
>> affected by the cgroup hierarchy being used by libvirt.
>>
>> Here are Mark's words on the subject (posted by me while Mark joins this
>> mailing list):
>> ------------------
>> I've conducted a number of measurements using CFS.
>>
>> The system config is a 2 socket Nehalem system with 64GB ram. Installed
>> is RHEL6.1-snap4. The guest VMs being used have RHEL5.5 - 32bit. I've
>> replaced the kernel with 2.6.39-rc6+ with patches from
>> Paul-V6-upstream-breakout.tar.bz2 for CFS bandwidth. The test config
>> uses 5 VMs of various vcpu and memory sizes. Being used are 2 VMs with 2
>> vcpus and 4GB of memory, 1 VM with 4vcpus/8GB, another VM with
>> 8vcpus/16GB and finally a VM with 16vcpus/16GB.
>>
>> Thus far the tests have been limited to cpu intensive workloads. Each VM
>> runs a single instance of the workload. The workload is configured to
>> create one thread for each vcpu in the VM. The workload is then capable
>> of completely saturation each vcpu in each VM.
>>
>> CFS was tested using two different topologies.
>>
>> First vcpu cgroups were created under each VM created by libvirt. The
>> vcpu threads from the VM's cgroup/tasks were moved to the tasks list of
>> each vcpu cgroup, one thread to each vcpu cgroup. This tree structure
>> permits setting CFS quota and period per vcpu. Default values for
>> cpu.shares (1024), quota (-1) and period (500000us) was used in each VM
>> cgroup and inherited by the vcpu croup. With these settings the workload
>> generated system cpu utilization (measured in the host) of >99% guest,
>>> 0.1 idle, 0.14% user and 0.38 system.
>>
>> Second, using the same topology, the CFS quota in each vcpu's cgroup was
>> set to 250000us allowing each vcpu to consume 50% of a cpu. The cpu
>> workloads was run again. This time the total system cpu utilization was
>> measured at 75% guest, ~24% idle, 0.15% user and 0.40% system.
>>
>> The topology was changed such that a cgroup for each vcpu was created in
>> /cgroup/cpu.
>>
>> The first test used the default/inherited shares and CFS quota and
>> period. The measured system cpu utilization was >99% guest, ~0.5 idle,
>> 0.13 user and 0.38 system, similar to the default settings using vcpu
>> cgroups under libvirt.
>>
>> The next test, like before the topology change, set the vcpu quota
>> values to 250000us or 50% of a cpu. In this case the measured system cpu
>> utilization was ~92% guest, ~7.5% idle, 0.15% user and 0.38% system.
>>
>> We can see that moving the vcpu cgroups from being under libvirt/qemu
>> make a big difference in idle cpu time.
>>
>> Does this suggest a possible problems with libvirt?
>> ------------------
> 
> I can't really understand from your description what the different
> setups are. You're talking about libvirt vcpu cgroups, but nothing
> in libvirt does vcpu based cgroups, our cgroup granularity is always
> per-VM.
> 

I should have been more clear.  In Mark's testing, he found unacceptable
levels of idle time when mixing domain level cgroups with cfs.  The idle
time was reduced to an acceptable level when each vcpu thread was
confined in its own cgroup.  To achieve the per-vcpu scenarios, he
reassigned the cgroups to vcpus after the domain was started to override
the default libvirt configuration.

>> === 2 ===
>> Something else we are seeing is that libvirt's default setting for
>> cputune.share is 1024 for any domain (regardless of how many vcpus are
>> configured.  This ends up hindering performance of really large VMs
>> (with lots of vcpus) as compared to smaller ones since all domains are
>> given equal share.  Would folks consider changing the default for
>> 'shares' to be a quantity scaled by the number of vcpus such that bigger
>> domains get to use proportionally more host cpu resource?
> 
> Well that's just the kernel default setting actually. The intent
> of the default cgroups configuration for a VM, is that it should
> be identical to the configuration if the VM was *not* in any
> cgroups. So I think that gives some justification for setting
> the cpu shares relative to the # of vCPUs by default, otherwise
> we have a regression vs not using cgroups.

Yes, I agree.

>> === 3 ===
>> Besides the above issues, I would like to open a discussion on what the
>> libvirt API for enabling cpu hardlimits should look like.  Here is what
>> I was thinking:
>>
>> Two additional scheduler parameters (based on the names given in the
>> cgroup fs) will be recognized for qemu domains: 'cfs_period' and
>> 'cfs_quota'.  These can use the existing
>> virDomain[Get|Set]SchedulerParameters() API.  The Domain XML schema
>> would be updated to permit the following:
>>
>> --- snip ---
>> <cputune>
>>   ...
>>   <cfs_period>1000000</cfs_period>
>>   <cfs_quota>500000</cfs_quota>
>> </cputune>
>> --- snip ---
> 
> I don't think 'cfs_' should be in the names here. These absolute
> limits on CPU time could easily be applicable to non-CFS schedulars
> or non-Linux hypervisors.

Sure, changing the names would be fine with me.  I wonder, is the
concept of a period and quota common to any likely alternative
implementation of cpu accounting?  I would imagine so.

>> To actuate these configuration settings, we simply apply the values to
>> the appropriate cgroup(s) for the domain.  We would prefer that each
>> vcpu be in its own cgroup to ensure equal and fair scheduling across all
>> vcpus running on the system.  (We will need to resolve the issues
>> described by Mark in order to figure out where to hang these cgroups).
> 
> The reason for putting VMs in cgroups is that, because KVM is multithreaded,
> using Cgroups is the only way to control settings of the VM as a whole. If
> you just want to control individual VCPU settings, then that can be done
> without cgroups just be setting the process' schedpriority via the normal
> APIs. Creating cgroups at the granularity of individual vCPUs is somewhat
> troublesome, because if the administrator has mounted other cgroups
> controllers at the same location as the 'cpu' controller, then putting
> each VCPU in a separate cgroup will negatively impact other aspects of
> the VM. Also KVM has a number of other non-VCPU threads which consume a
> non-trivial amount of CPU time, which often come & go over time. So IMHO
> the smallest cgroup granularity should remain per-VM.

Is it possible to have a hierarchy like:
libvirt->qemu->domain->vcpu[0-n] ?  Then, share and other tunables could
still be governed at the domain level while capping could be applied to
the vcpu level?  Since the only reason we are pursuing per-vcpu groups
is because of the buggy behavior of domain-level quotas, we should
probably chase down any potential cgroups bugs before making this
change.  I do think it would be a good idea to preserve the ability to
switch to per-vcpu cgroups in the future (perhaps using the hierarchy I
suggest above) just in case we determine that the problem cannot be
fixed another way.

-- 
Adam Litke
IBM Linux Technology Center