[libvirt] Overhead for a default cpu cg placement scheme

Andrey Korolyov andrey at xdel.ru
Thu Jun 11 13:24:18 UTC 2015


On Thu, Jun 11, 2015 at 4:13 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
> On Thu, Jun 11, 2015 at 04:06:59PM +0300, Andrey Korolyov wrote:
>> On Thu, Jun 11, 2015 at 2:33 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
>> > On Thu, Jun 11, 2015 at 02:16:50PM +0300, Andrey Korolyov wrote:
>> >> On Thu, Jun 11, 2015 at 2:09 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
>> >> > On Thu, Jun 11, 2015 at 01:50:24PM +0300, Andrey Korolyov wrote:
>> >> >> Hi Daniel,
>> >> >>
>> >> >> Would it be possible to adopt an optional tunable for the virCgroup
>> >> >> mechanism that disables nested (per-thread) cgroup creation? Those
>> >> >> cgroups bring visible overhead for many-threaded guest workloads,
>> >> >> almost 5% even when the host CPUs are not congested, primarily
>> >> >> because the host scheduler has to make many more decisions with
>> >> >> those cgroups than without them. We also experienced a lot of host
>> >> >> lockups a couple of years ago with the cgroup placement exploited
>> >> >> at the time and with the nested behaviour disabled. Although the
>> >> >> current patch simply carves out the mentioned behaviour, leaving
>> >> >> only top-level per-machine cgroups, it could serve as a basis for
>> >> >> upstream after some adaptation; that's why I'm asking about the
>> >> >> chance of its acceptance. This message is a kind of feature
>> >> >> request: it can either be taken further or dropped on our side, or
>> >> >> someone may lend a hand and redo it from scratch. The detailed
>> >> >> benchmarks were taken on a 3.10.y host; if anyone is interested in
>> >> >> the numbers for the latest stable kernel, I can update them.
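
To make the 'nested (per-thread) cgroup creation' above concrete, a rough
Python sketch along these lines (cgroup v1 formatting of /proc assumed, the
QEMU PID passed on the command line is a placeholder) counts how many distinct
cpu-controller cgroups the threads of one guest end up in; with the per-thread
placement that is one cgroup per vCPU/emulator/IO thread, with only a
per-machine cgroup it is a single entry:

#!/usr/bin/env python3
# Rough sketch, not part of any patch: count how many distinct cpu-controller
# cgroups the threads of one QEMU process sit in. cgroup v1 formatting of
# /proc/<pid>/task/<tid>/cgroup is assumed.
import os
import sys
from collections import Counter

def cpu_cgroup_of_threads(pid):
    """Map each thread (TID) of `pid` to the cgroup path of its cpu controller."""
    placement = {}
    for tid in os.listdir("/proc/%d/task" % pid):
        with open("/proc/%d/task/%s/cgroup" % (pid, tid)) as f:
            for line in f:
                _, controllers, path = line.strip().split(":", 2)
                if "cpu" in controllers.split(","):
                    placement[tid] = path
                    break
    return placement

if __name__ == "__main__":
    qemu_pid = int(sys.argv[1])   # PID of the qemu process for one guest
    counts = Counter(cpu_cgroup_of_threads(qemu_pid).values())
    for path, count in counts.most_common():
        print("%4d thread(s) in %s" % (count, path))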
>> >> >
>> >> > When you say nested cgroup creation, are you referring to the modern
>> >> > libvirt hierarchy, or the legacy hierarchy - as described here:
>> >> >
>> >> >   http://libvirt.org/cgroups.html
>> >> >
>> >> > The current libvirt setup used for a year or so now is much shallower
>> >> > than previously, to the extent that we'd consider performance problems
>> >> > with it to be the job of the kernel to fix.
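
The cgroups.html page describes both layouts; to see how deep the hierarchy
actually is on a given host, something like the following sketch (the mount
point is an assumption: cgroup v1 with the cpu controller at
/sys/fs/cgroup/cpu, adjust for the distro in question) walks the controller
mount and prints every cgroup directory indented by its depth:

#!/usr/bin/env python3
# Quick look at how deep the cpu cgroup hierarchy actually is on a host.
import os

CPU_MOUNT = "/sys/fs/cgroup/cpu"   # assumption: v1 cpu controller mount point

def dump_tree(root):
    for dirpath, _dirnames, _files in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue
        depth = rel.count(os.sep)
        print("  " * depth + rel.split(os.sep)[-1] + "/")

if __name__ == "__main__":
    dump_tree(CPU_MOUNT)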
>> >>
>> >> Thanks, I'm referring to the 'new nested' hierarchy for the overhead
>> >> mentioned above. The host crashes I mentioned happened with the old
>> >> hierarchy a while back; I forgot to mention this. Even with the
>> >> flattened topology of the current scheme, it should be possible to
>> >> disable fine-grained group creation for the VM threads for users who
>> >> don't need per-vCPU pinning/accounting (the overhead is caused by the
>> >> placement in the cpu cgroup, not by the accounting/pinning ones; I'm
>> >> assuming such a disablement would apply equally to all nested-aware
>> >> cgroup types), that's the point for now.
>> >
>> > Ok, so the per-vCPU cgroups are used for a couple of things
>> >
>> >  - Setting scheduler tunables - period/quota/shares/etc
>> >  - Setting CPU pinning
>> >  - Setting NUMA memory pinning
>> >
>> > In addition to the per-VCPU cgroup, we have one cgroup for each
>> > I/O thread, and also one more for general QEMU emulator threads.
>> >
>> > In the case of CPU pinning we already have automatic fallback to
>> > sched_setaffinity if the CPUSET controller isn't available.
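
That fallback boils down to pinning the thread directly instead of going
through cpuset.cpus. A minimal sketch of the same idea (not libvirt's actual
code; the TID and CPU set are placeholders, and the cgroup branch is elided):

#!/usr/bin/env python3
# Minimal sketch of the pinning fallback mentioned above.
import os

CPUSET_MOUNT = "/sys/fs/cgroup/cpuset"   # assumption: v1 mount point

def pin_vcpu_thread(tid, host_cpus):
    if os.path.isdir(CPUSET_MOUNT):
        # Normally the CPU list would be written into the vCPU cgroup's
        # cpuset.cpus file here (elided for brevity).
        pass
    else:
        # Fallback: ask the scheduler directly, no cgroup involved.
        os.sched_setaffinity(tid, host_cpus)

if __name__ == "__main__":
    pin_vcpu_thread(12345, {2, 3})   # hypothetical vCPU TID, host CPUs 2-3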
>> >
>> > We could in theory start off without the per-vCPU/emulator/I/O
>> > cgroups and only create them as & when the feature is actually
>> > used. The concern I would have though is that changing the cgroups
>> > layout on the fly may cause unexpected side effects in the behaviour
>> > of the VM. More critically, there would be a lot of places in the code
>> > where we would need to deal with this, which could hurt maintainability.
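
In the abstract, the lazy variant could look something like this toy sketch
(not a proposal for libvirt's actual code layout; paths assume a v1 cpu
controller mounted at /sys/fs/cgroup/cpu and the per-machine directory is a
placeholder), where the per-vCPU directory is only created the first time a
tunable is actually set for that vCPU:

#!/usr/bin/env python3
# Toy sketch of lazy per-vCPU cgroup creation: until a tunable is set, the
# vCPU thread stays in the per-machine cgroup.
import os

class LazyVcpuCgroups:
    def __init__(self, machine_cgroup_dir):
        self.machine_cgroup_dir = machine_cgroup_dir  # per-VM cgroup directory

    def _vcpu_dir(self, vcpu):
        path = os.path.join(self.machine_cgroup_dir, "vcpu%d" % vcpu)
        os.makedirs(path, exist_ok=True)              # created on first use
        return path

    def set_cpu_shares(self, vcpu, shares):
        with open(os.path.join(self._vcpu_dir(vcpu), "cpu.shares"), "w") as f:
            f.write(str(shares))

if __name__ == "__main__":
    # Placeholder path; substitute the real per-machine cgroup directory.
    cg = LazyVcpuCgroups("/sys/fs/cgroup/cpu/machine/example-guest")
    cg.set_cpu_shares(0, 1024)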
>> >
>> > How confident are you that the performance problems you see are inherent
>> > to the actual use of the cgroups, and not instead the result of some
>> > particularly bad choice of default parameters we might have left in the
>> > cgroups?  In general I'd prefer to try to eliminate the perf impact
>> > before we consider the complexity of disabling this feature.
>> >
>> > Regards,
>> > Daniel
>>
>> Hm, what are you proposing to begin with in terms of testing? By my
>> understanding, excessive cgroup usage along with small scheduler quanta
>> *will* lead to some overhead anyway. Let's look at the numbers which I
>> will bring tomorrow; the five percent I mentioned was measured with a
>> guest 'perf numa xxx' run for different kinds of mappings and host
>> behaviour (post-3.8): memory auto-migration on/off, a kind of 'NUMA
>> passthrough' (grouping vCPU threads according to the host and emulated
>> guest NUMA topologies), and totally scattered, unpinned threads within
>> a single NUMA node and across multiple NUMA nodes. As the result for
>> 3.10.y, there was a five-percent difference between the best-performing
>> case with thread-level cpu cgroups and the 'totally scattered' case on
>> a simple mid-range two-headed node. If you think the choice of an
>> emulated workload is wrong, please let me know; I was afraid that a
>> non-synthetic workload in the guest might suffer from a range of side
>> factors, and therefore chose perf for this task.
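
Since the exact perf invocation is elided above ('perf numa xxx'), the sketch
below shows only the comparison structure, with 'perf bench numa mem' as a
stand-in workload; the vCPU placement itself is switched between runs outside
the script (e.g. via the guest's libvirt configuration):

#!/usr/bin/env python3
# Time the same in-guest NUMA benchmark once per placement case and compare.
# The perf options behind the numbers quoted in this thread are not
# reproduced here; the command is a stand-in.
import subprocess
import time

CASES = ["pinned-to-host-topology", "totally-scattered"]  # placeholder labels

def time_benchmark():
    start = time.monotonic()
    subprocess.run(["perf", "bench", "numa", "mem"], check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    results = {}
    for case in CASES:
        input("Reconfigure vCPU placement for '%s', then press Enter " % case)
        results[case] = time_benchmark()
    best = min(results.values())
    for case, secs in sorted(results.items(), key=lambda kv: kv[1]):
        print("%-26s %8.1fs  (+%.1f%% vs best)"
              % (case, secs, 100.0 * (secs / best - 1.0)))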
>
> Benchmarking isn't my area of expertise, but you should be able to just
> disable the CPUSET controller entirely in qemu.conf. If we got some
> comparative results with & without CPUSET, that'd be an interesting
> place to start. If it shows a clear difference, I might be able to get
> some of the Red Hat performance team to dig into what's going wrong at
> either the libvirt or kernel level.
>
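
Disabling CPUSET that way means dropping "cpuset" from the cgroup_controllers
list in /etc/libvirt/qemu.conf and restarting the guest. Afterwards a quick
check like the sketch below (cgroup v1 formatting of /proc/<pid>/cgroup is
assumed; the PID is a placeholder) shows where the QEMU process actually ended
up: if the cpuset entry is "/" or absent, libvirt is no longer placing the
domain in a dedicated cpuset cgroup.

#!/usr/bin/env python3
# Report which cgroup path each controller puts a given QEMU process in.
import sys

def controller_paths(pid):
    paths = {}
    with open("/proc/%d/cgroup" % pid) as f:
        for line in f:
            _, controllers, path = line.strip().split(":", 2)
            for ctrl in controllers.split(","):
                if ctrl:
                    paths[ctrl] = path
    return paths

if __name__ == "__main__":
    pid = int(sys.argv[1])   # PID of the guest's qemu process
    paths = controller_paths(pid)
    print("cpuset placement:", paths.get("cpuset", "<controller not used>"))
    print("cpu placement:   ", paths.get("cpu", "<controller not used>"))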
Thanks, let's wait for the numbers. I mentioned cpuset only for the sake
of a good/bad comparison; the main suspect for me is still the scheduler
and the quotas/weights in the cpu cgroup.
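
For comparing those quotas/weights between the nested and flat cases, a small
helper along these lines (the v1 mount point and the machine.slice path are
assumptions; substitute the actual per-machine cgroup directory) dumps
cpu.shares, cpu.cfs_period_us and cpu.cfs_quota_us for a machine's cpu cgroup
and every nested child:

#!/usr/bin/env python3
# Dump CFS weight/bandwidth tunables for a cgroup subtree so the defaults in
# the per-vCPU/emulator children can be compared against the flat case.
import os

TUNABLES = ("cpu.shares", "cpu.cfs_period_us", "cpu.cfs_quota_us")

def dump(cgroup_dir):
    for dirpath, _dirs, _files in os.walk(cgroup_dir):
        values = []
        for name in TUNABLES:
            try:
                with open(os.path.join(dirpath, name)) as f:
                    values.append("%s=%s" % (name, f.read().strip()))
            except OSError:
                values.append("%s=?" % name)
        print(os.path.relpath(dirpath, cgroup_dir), " ".join(values))

if __name__ == "__main__":
    # Placeholder path; substitute the actual per-machine cgroup directory.
    dump("/sys/fs/cgroup/cpu/machine.slice")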



