[libvirt] [REPOST 0/4] Adjustment to recent cgroup/cpuset changes (for 1.3.1)

Daniel P. Berrange berrange at redhat.com
Thu Jan 14 13:15:06 UTC 2016


On Thu, Jan 14, 2016 at 02:09:52PM +0100, Henning Schild wrote:
> On Thu, 14 Jan 2016 12:37:18 +0000
> "Daniel P. Berrange" <berrange at redhat.com> wrote:
> 
> > On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
> > > > Since this has been puzzling us for a while, let me recap on the
> > > cgroup setup in general.
> > > 
> > > First, I'll describe how it used to work *before* Henning's patches
> > > were merged, on a systemd based host.
> > > 
> > >  - The QEMU driver forks a child process, but does *not* exec QEMU
> > > yet. The cgroup placement at this point is inherited from libvirtd.
> > > It may look like this:
> > > 
> > >      10:freezer:/
> > >      9:cpuset:/
> > >      8:perf_event:/
> > >      7:hugetlb:/
> > >      6:blkio:/system.slice
> > >      5:memory:/system.slice
> > >      4:net_cls,net_prio:/
> > >      3:devices:/system.slice/libvirtd.service
> > >      2:cpu,cpuacct:/system.slice
> > >      1:name=systemd:/system.slice/libvirtd.service
> > > 
> > >  - The QEMU driver calls virCgroupNewMachine()
> > > 
> > >       - We call virSystemdCreateMachine with pidleader=$child
> > > 
> > >            - Systemd creates the initial machine scope unit under
> > > 	     the machine slice unit, for the "systemd" controller.
> > > 	     It may also add the PID to *zero* or more other
> > > 	     resource controllers. So at this point the cgroup
> > > 	     placement may look like this:
> > > 
> > >               10:freezer:/
> > >               9:cpuset:/
> > >               8:perf_event:/
> > >               7:hugetlb:/
> > >               6:blkio:/
> > >               5:memory:/
> > >               4:net_cls,net_prio:/
> > >               3:devices:/
> > >               2:cpu,cpuacct:/
> > >               1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > > 
> > >              Or may look like this:
> > > 
> > >               10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >               9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> > >               8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >               7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >               6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >               5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >               4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >               3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >               2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> > >               1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > > 
> > >              Or anywhere in between. We have *ZERO* guarantee about
> > >              what other resource controllers we may have been placed
> > >              in by systemd. There is some fairly complex logic that
> > >              determines this, based on what other tasks currently
> > >              exist in sibling cgroups, and what tasks have
> > >              *previously* existed in the cgroups. IOW, you should
> > >              consider the list of extra resource controllers
> > >              essentially non-deterministic.
> > > 
> > >       - We call virCgroupAddTask with pid=$child
> > > 
> > >         This places the pid in any resource controllers we need,
> > >         which systemd has not already set up. IOW, it guarantees
> > >         that we now have placement that should look like this,
> > >         regardless of what systemd has done:
> > > 
> > >               10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >               9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> > >               8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >               7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >               6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >               5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >               4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >               3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >               2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> > >               1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > > 
> > >  - The QEMU driver now lets the child process exec QEMU. QEMU
> > > creates its vCPU threads at this point. All QEMU threads (emulator,
> > > vcpu and I/O threads) now have the cgroup placement shown above.
> > > 
> > >  - We create the emulator cgroup for the cpuset, cpu, cpuacct
> > > controllers and move all threads into this new cgroup. All threads
> > > (emulator, vcpu and I/O threads) thus now have placement of:
> > > 
> > >            10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >            9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
> > >            8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >            7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >            6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >            5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >            4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >            3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >            2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
> > >            1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > > 
> > >    Yes, we really did move the vcpu threads into the emulator
> > > group...
> > > 
> > >  - We now ask QEMU which are the vCPU & I/O threads.
> > > 
> > >     - For each vCPU thread we create a new vCPU cgroup and move the
> > >       thread into it:
> > > 
> > >            10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >            9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> > >            8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >            7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >            6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >            5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >            4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >            3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >            2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> > >            1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > > 
> > >     - For each I/O thread we create a new iothread cgroup and move
> > >       the thread into it:
> > > 
> > >            10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > >            9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> > >            8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > >            7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > >            6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > >            5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > >            4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > >            3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > >            2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> > >            1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope  
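
The placement listings quoted above use the /proc/<pid>/cgroup format
(hierarchy-ID:controller-list:path). As a rough sketch, not taken from the
libvirt sources, the following Python parses such text and reports which
controllers are still outside the machine scope, i.e. the ones that the
virCgroupAddTask step described above would have to deal with. The function
names and the SAMPLE text are made up for illustration.

```python
# Sketch only: parse text in the /proc/<pid>/cgroup format into a
# {controller: path} map. All names here are hypothetical, not libvirt API.

# Illustrative input, mixing "still in the root" and "already in the scope".
SAMPLE = r"""
3:devices:/
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
"""


def parse_cgroup_placement(text):
    """Map each controller name to the cgroup path it is placed in."""
    placement = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Format is hierarchy-ID:controller-list:path. Split at most twice,
        # since the path itself could contain a ':'.
        _hier_id, controllers, path = line.split(":", 2)
        # Co-mounted controllers (e.g. "cpu,cpuacct") share one line.
        for controller in controllers.split(","):
            placement[controller] = path
    return placement


def controllers_outside(placement, scope_path):
    """Return the sorted controllers whose path is not at or below scope_path."""
    return sorted(
        name
        for name, path in placement.items()
        if path != scope_path and not path.startswith(scope_path + "/")
    )


if __name__ == "__main__":
    scope = r"/machine.slice/machine-qemu\x2dserial.scope"
    print(controllers_outside(parse_cgroup_placement(SAMPLE), scope))
```

On a live host the same parser could be fed the contents of
/proc/<pid>/cgroup for the QEMU child, before and after each step.
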
> > 
> > BTW, on a slight tangent, the kernel is throwing a spanner in the
> > works in the near future. They have just accepted cgroupv2 into
> > mainline. Broadly speaking this is very nice because they got rid
> > of the idea of separate mount point for each controller, and instead
> > have a single filesystem tree. The problem is that they decided the
> > granularity of placement is at a *process* level, not a *thread*
> > level. So it will no longer be possible for us to have the cgroups
> > for emulator, vcpus & i/o threads. Everything will have to live in
> > the same cgroup :-( For cpu accounting and cpu affinity I think we
> > can still achieve what we need by using a combination of cgroups
> > and sched_setaffinity and /proc. I'm not sure what we'll do about
> > per-thread scheduler policies for period + quota though - not sure
> > if there's an API for setting those or not ?!?!
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v2.txt
> 
> Good to know. Do you have that on the agenda for libvirt? I guess
> eventually v1 will get deprecated...

We'll have no choice but to use cgroupv2 as soon as systemd starts
using it....
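
To make the "sched_setaffinity and /proc" idea from the quoted text a bit
more concrete, here is a minimal Python sketch, assuming a Linux host. It
lists the threads of a process via /proc/<pid>/task, reads per-thread CPU
time from the utime and stime fields of /proc/<pid>/task/<tid>/stat, and
pins a single thread with sched_setaffinity. The helper names are
hypothetical, and this is not how libvirt implements anything. It also does
nothing for period + quota, which remains the open question.

```python
import os


def list_thread_ids(pid):
    """Return the thread IDs of a process, as listed in /proc/<pid>/task."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/task"))


def cpu_ticks_from_stat(stat_line):
    """Extract (utime, stime) in clock ticks from one /proc stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces or ')', so
    # split after the last ')'. Field 3 (state) is then index 0, which
    # puts utime (field 14) at index 11 and stime (field 15) at index 12.
    fields = stat_line.rsplit(")", 1)[1].split()
    return int(fields[11]), int(fields[12])


def thread_cpu_seconds(pid, tid):
    """Return user + system CPU time consumed by one thread, in seconds."""
    with open(f"/proc/{pid}/task/{tid}/stat") as stat_file:
        utime, stime = cpu_ticks_from_stat(stat_file.read())
    return (utime + stime) / os.sysconf("SC_CLK_TCK")


def pin_thread(tid, cpus):
    """Restrict one thread to the given CPUs (no cgroup involved)."""
    # On Linux the underlying syscall accepts a thread ID here.
    os.sched_setaffinity(tid, cpus)


if __name__ == "__main__":
    pid = os.getpid()
    for tid in list_thread_ids(pid):
        print(tid, thread_cpu_seconds(pid, tid), os.sched_getaffinity(tid))
```
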


Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|




More information about the libvir-list mailing list