[libvirt] [REPOST 0/4] Adjustment to recent cgroup/cpuset changes (for 1.3.1)

Henning Schild henning.schild at siemens.com
Thu Jan 14 13:09:52 UTC 2016


On Thu, 14 Jan 2016 12:37:18 +0000
"Daniel P. Berrange" <berrange at redhat.com> wrote:

> On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
> > Since this has been puzzling us for a while, let me recap on the
> > cgroup setup in general.
> > 
> > First, I'll describe how it used to work *before* Henning's patches
> > were merged, on a systemd based host.
> > 
> >  - The QEMU driver forks a child process, but does *not* exec QEMU
> >    yet. The cgroup placement at this point is inherited from
> >    libvirtd. It may look like this:
> > 
> >      10:freezer:/
> >      9:cpuset:/
> >      8:perf_event:/
> >      7:hugetlb:/
> >      6:blkio:/system.slice
> >      5:memory:/system.slice
> >      4:net_cls,net_prio:/
> >      3:devices:/system.slice/libvirtd.service
> >      2:cpu,cpuacct:/system.slice
> >      1:name=systemd:/system.slice/libvirtd.service
> > 
> >  - The QEMU driver calls virCgroupNewMachine()
> > 
> >       - We call virSystemdCreateMachine with pidleader=$child
> > 
> >            - Systemd creates the initial machine scope unit under
> >              the machine slice unit, for the "systemd" controller.
> >              It may also add the PID to *zero* or more other
> >              resource controllers. So at this point the cgroup
> >              placement may look like this:
> > 
> >               10:freezer:/
> >               9:cpuset:/
> >               8:perf_event:/
> >               7:hugetlb:/
> >               6:blkio:/
> >               5:memory:/
> >               4:net_cls,net_prio:/
> >               3:devices:/
> >               2:cpu,cpuacct:/
> >               1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > 
> >              Or may look like this:
> > 
> >               10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >               9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> >               8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >               7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >               6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >               5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >               4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >               3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >               2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> >               1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > 
> >              Or anywhere in between. We have *ZERO* guarantee about
> >              what other resource controllers we may have been placed
> >              in by systemd. There is some fairly complex logic that
> >              determines this, based on what other tasks currently
> >              exist in sibling cgroups, and what tasks have
> >              *previously* existed in the cgroups. IOW, you should
> >              consider the list of extra resource controllers
> >              essentially non-deterministic.
> > 
> >       - We call virCgroupAddTask with pid=$child
> > 
> >         This places the pid in any resource controllers we need,
> >         which systemd has not already set up. IOW, it guarantees
> >         that we now have placement that should look like this,
> >         regardless of what systemd has done:
> > 
> >               10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >               9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> >               8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >               7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >               6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >               5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >               4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >               3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >               2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> >               1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > 
> >  - The QEMU driver now lets the child process exec QEMU. QEMU
> >    creates its vCPU threads at this point. All QEMU threads
> >    (emulator, vcpu and I/O threads) now have the cgroup placement
> >    shown above.
> > 
> >  - We create the emulator cgroup for the cpuset, cpu, cpuacct
> >    controllers and move all threads into this new cgroup. All
> >    threads (emulator, vcpu and I/O threads) thus now have
> >    placement of:
> > 
> >            10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >            9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
> >            8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >            7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >            6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >            5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >            4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >            3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >            2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
> >            1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > 
> >    Yes, we really did move the vcpu threads into the emulator
> >    group...
> > 
> >  - We now ask QEMU which are the vCPU & I/O threads.
> > 
> >     - For each vCPU thread we create a new vCPU cgroup and move
> >       the thread into it:
> > 
> >            10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >            9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> >            8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >            7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >            6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >            5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >            4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >            3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >            2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> >            1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > 
> >     - For each I/O thread we create a new iothread cgroup and move
> >       the thread into it:
> > 
> >            10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> >            9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> >            8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> >            7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> >            6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> >            5:memory:/machine.slice/machine-qemu\x2dserial.scope
> >            4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> >            3:devices:/machine.slice/machine-qemu\x2dserial.scope
> >            2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> >            1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope  
> 
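
[Editorial aside on the recap above: the listings use the
"hierarchy-id:controller-list:path" format of /proc/<pid>/cgroup. The
following is a minimal Python sketch, not libvirt code, of how one might
check which controllers still sit outside the machine scope after systemd
has done its part. The function names and the sample data are
hypothetical and exist only for illustration.]

```python
def parse_proc_cgroup(text):
    """Map each cgroup v1 controller name to its path.

    Expects the "hierarchy-id:controller-list:path" line format shown in
    the listings above (the format of /proc/<pid>/cgroup).
    """
    placement = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        _hier_id, controllers, path = line.split(":", 2)
        # Co-mounted controllers share one line, e.g. "cpu,cpuacct".
        for controller in controllers.split(","):
            placement[controller] = path
    return placement


def controllers_outside(placement, scope):
    """Return, sorted, the controllers whose path is not at or below scope."""
    return sorted(
        name for name, path in placement.items()
        if path != scope and not path.startswith(scope + "/")
    )


# Hypothetical sample: systemd only placed the PID in its own hierarchy.
SCOPE = r"/machine.slice/machine-qemu\x2dserial.scope"
SAMPLE = "\n".join([
    "9:cpuset:/",
    "5:memory:/",
    "2:cpu,cpuacct:/",
    "1:name=systemd:" + SCOPE,
])

if __name__ == "__main__":
    print(controllers_outside(parse_proc_cgroup(SAMPLE), SCOPE))
```

[Sub-cgroups such as .../emulator or .../vcpuN count as inside the scope
here, because the check accepts any path at or below the scope path.]
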
> BTW, on a slight tangent, the kernel is throwing a spanner in the
> works in the near future. They have just accepted cgroupv2 into
> mainline. Broadly speaking this is very nice because they got rid
> of the idea of a separate mount point for each controller, and instead
> have a single filesystem tree. The problem is that they decided the
> granularity of placement is at a *process* level, not a *thread*
> level. So it will no longer be possible for us to have the cgroups
> for emulator, vcpus & i/o threads. Everything will have to live in
> the same cgroup :-( For cpu accounting and cpu affinity I think we
> can still achieve what we need by using a combination of cgroups
> and sched_setaffinity and /proc. I'm not sure what we'll do about
> per-thread scheduler policies for period + quota though - not sure
> if there's an API for setting those or not ?!?!
> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v2.txt
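
[Editorial aside: a rough Python sketch of the "sched_setaffinity and
/proc" idea for per-thread affinity and CPU accounting only. It is not
libvirt code and says nothing about period + quota. It assumes Linux,
where the pid argument of sched_setaffinity may be a thread ID, and where
utime and stime are fields 14 and 15 of /proc/<pid>/task/<tid>/stat. The
function names are hypothetical.]

```python
import os


def parse_thread_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from a /proc stat line."""
    # The command name is parenthesised and may itself contain spaces or
    # parentheses, so split after the last ')'.
    fields = stat_line[stat_line.rindex(")") + 1:].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]), int(fields[12])


def thread_cpu_ticks(pid, tid):
    """Read the CPU time consumed so far by one thread of a process."""
    with open("/proc/%d/task/%d/stat" % (pid, tid)) as f:
        return parse_thread_cpu_ticks(f.read())


def pin_thread(tid, cpus):
    """Restrict one thread to the given CPUs and return the new mask."""
    os.sched_setaffinity(tid, cpus)
    return os.sched_getaffinity(tid)
```

[For example, pin_thread(tid, {2, 3}) would confine one vCPU thread to
CPUs 2 and 3, and two thread_cpu_ticks() samples taken some time apart
give that thread's CPU usage over the interval.]
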

Good to know. Do you have that on the agenda for libvirt? I guess
eventually v1 will get deprecated...

> Regards,
> Daniel



