[libvirt] Overhead for a default cpu cg placement scheme

Thu Jun 11 13:13:12 UTC 2015

On Thu, Jun 11, 2015 at 04:06:59PM +0300, Andrey Korolyov wrote:
> On Thu, Jun 11, 2015 at 2:33 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
> > On Thu, Jun 11, 2015 at 02:16:50PM +0300, Andrey Korolyov wrote:
> >> On Thu, Jun 11, 2015 at 2:09 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
> >> > On Thu, Jun 11, 2015 at 01:50:24PM +0300, Andrey Korolyov wrote:
> >> >> Hi Daniel,
> >> >>
> >> >> would it be possible to add an optional tunable to the virCgroup
> >> >> mechanism that disables the creation of nested (per-thread)
> >> >> cgroups? Those bring visible overhead for many-threaded guest
> >> >> workloads, almost 5% when the host CPUs are not congested, primarily
> >> >> because the host scheduler has to make many more decisions with
> >> >> those cgroups than without them. We also experienced a lot of host
> >> >> lockups with the cgroup placement as currently used, and disabled
> >> >> the nested behavior a couple of years ago. Though the current patch
> >> >> simply carves out the mentioned behavior, leaving only top-level
> >> >> per-machine cgroups, it could go upstream after some adaptation,
> >> >> which is why I'm asking about its chance of acceptance. This
> >> >> message is a kind of feature request: our patch can either be
> >> >> accepted or dropped, or someone may give a hand and redo it from
> >> >> scratch. The detailed benchmarks were taken on a 3.10.y host; if
> >> >> anyone is interested in numbers for the latest stable, I can update
> >> >> them.
> >> >
> >> > When you say nested cgroup creation, are you referring to the modern
> >> > libvirt hierarchy, or the legacy hierarchy - as described here:
> >> >
> >> >   http://libvirt.org/cgroups.html
> >> >
> >> > The current libvirt setup used for a year or so now is much shallower
> >> > than previously, to the extent that we'd consider performance problems
> >> > with it to be the job of the kernel to fix.
> >>
> >> Thanks, I'm referring to the 'new nested' hierarchy for the overhead
> >> mentioned above. The host crashes I mentioned happened with the old
> >> hierarchy a while back; I forgot to mention this. Despite the
> >> flattening of the topology in the current scheme, it should be
> >> possible to disable fine-grained group creation for the VM threads
> >> for users who don't need per-vCPU CPU pinning/accounting (though the
> >> overhead is caused by placement in the cpu cgroup, not by the
> >> accounting/pinning ones, I'm assuming such a switch would apply
> >> equally to all nested-aware cgroup types). That's the point for now.
> >
> > Ok, so the per-vCPU cgroups are used for a couple of things
> >
> >  - Setting scheduler tunables - period/quota/shares/etc
> >  - Setting CPU pinning
> >  - Setting NUMA memory pinning
> >
> > In addition to the per-vCPU cgroup, we have one cgroup for each
> > I/O thread, and also one more for general QEMU emulator threads.
> >
> > In the case of CPU pinning we already have automatic fallback to
> > sched_setaffinity if the CPUSET controller isn't available.
> >
> > We could in theory start off without the per-vCPU/emulator/I/O
> > cgroups and only create them as & when the feature is actually
> > used. The concern I would have though is that changing the cgroups
> > layout on the fly may cause unexpected side effects in behaviour of
> > the VM. More critically, there would be a lot of places in the code
> > where we would need to deal with this which could hurt maintainability.
> >
> > How confident are you that the performance problems you see are inherent
> > to the actual use of the cgroups, and not instead as a result of some
> > particular bad choice of default parameters we might have left in the
> > cgroups ?  In general I'd have a desire to try to work to eliminate the
> > perf impact before we consider the complexity of disabling this feature
> >
> > Regards,
> > Daniel
> 
> Hm, what do you propose to begin with in terms of testing? By my
> understanding, excessive cgroup usage along with small scheduler
> quanta *will* lead to some overhead anyway. Let's look at the numbers,
> which I will bring tomorrow. The mentioned five percent was caught
> with a guest 'perf numa xxx' run for different kinds of mappings and
> host behavior (post-3.8): memory automigration on/off, a kind of 'numa
> passthrough' (grouping vcpu threads according to the host and
> emulated guest NUMA topologies), and totally scattered and unpinned
> threads within a single NUMA node and across multiple NUMA nodes. As a
> result, for 3.10.y there was a five-percent difference between the
> best-performing case with thread-level cpu cgroups and a 'totally
> scattered' case on a simple mid-range two-headed node. If you think
> that the choice of an emulated workload is wrong, please let me know;
> I was afraid that a non-synthetic workload in the guest might suffer
> from a range of side factors, and therefore chose perf for this task.

Benchmarking isn't my area of expertise, but you should be able to just
disable the CPUSET controller entirely in qemu.conf. If we got some
comparative results for with & without CPUSET, that'd be an interesting
place to start. If it shows a clear difference, I might be able to get
some of the Red Hat performance team to dig into what's going wrong
at either the libvirt or kernel level.
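
As a rough illustration of the qemu.conf change, here is a minimal sketch,
assuming the setting meant here is the cgroup_controllers list and that the
list of controllers matches the commented-out example in the stock qemu.conf
(both are assumptions, so check your own file). The helper name
controllers_line is made up for this example; it only builds the line to
paste into qemu.conf, and libvirtd would need a restart for the change to
take effect.

```python
# Sketch only: build a qemu.conf "cgroup_controllers" line without cpuset.
# ASSUMED_DEFAULT_CONTROLLERS is an assumption based on the commented-out
# example in the stock qemu.conf; verify it against your installed file.

ASSUMED_DEFAULT_CONTROLLERS = ["cpu", "devices", "memory", "blkio", "cpuset", "cpuacct"]


def controllers_line(controllers, disabled=("cpuset",)):
    """Return a qemu.conf line listing every controller not in `disabled`."""
    kept = [name for name in controllers if name not in disabled]
    quoted = ", ".join('"{}"'.format(name) for name in kept)
    return "cgroup_controllers = [ {} ]".format(quoted)


if __name__ == "__main__":
    # Prints the line to place in qemu.conf for the "without CPUSET" run.
    print(controllers_line(ASSUMED_DEFAULT_CONTROLLERS))
```

Running the same guest benchmark once with the stock list and once with the
generated line would give the with/without CPUSET comparison.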

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|