[libvirt] [REPOST 0/4] Adjustment to recent cgroup/cpuset changes (for 1.3.1)
Daniel P. Berrange
berrange at redhat.com
Thu Jan 14 13:15:06 UTC 2016
On Thu, Jan 14, 2016 at 02:09:52PM +0100, Henning Schild wrote:
> On Thu, 14 Jan 2016 12:37:18 +0000
> "Daniel P. Berrange" <berrange at redhat.com> wrote:
>
> > On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
> > > Since this has been puzzling us for a while, let me recap on the
> > > cgroup setup in general.
> > >
> > > First, I'll describe how it used to work *before* Henning's patches
> > > were merged, on a systemd based host.
> > >
> > > - The QEMU driver forks a child process, but does *not* exec QEMU
> > > yet. The cgroup placement at this point is inherited from libvirtd.
> > > It may look like this:
> > >
> > > 10:freezer:/
> > > 9:cpuset:/
> > > 8:perf_event:/
> > > 7:hugetlb:/
> > > 6:blkio:/system.slice
> > > 5:memory:/system.slice
> > > 4:net_cls,net_prio:/
> > > 3:devices:/system.slice/libvirtd.service
> > > 2:cpu,cpuacct:/system.slice
> > > 1:name=systemd:/system.slice/libvirtd.service
> > >
> > > - The QEMU driver calls virCgroupNewMachine()
> > >
> > > - We call virSystemdCreateMachine with pidleader=$child
> > >
> > > - Systemd creates the initial machine scope unit under
> > > the machine slice unit, for the "systemd" controller.
> > > It may also add the PID to *zero* or more other
> > > resource controllers. So at this point the cgroup
> > > placement may look like this:
> > >
> > > 10:freezer:/
> > > 9:cpuset:/
> > > 8:perf_event:/
> > > 7:hugetlb:/
> > > 6:blkio:/
> > > 5:memory:/
> > > 4:net_cls,net_prio:/
> > > 3:devices:/
> > > 2:cpu,cpuacct:/
> > > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > > Or may look like this:
> > >
> > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> > > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> > > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > > Or anywhere in between. We have *ZERO* guarantee about
> > > what other resource controllers we may have been placed in by
> > > systemd. There is some fairly complex logic that
> > > determines this, based on what other tasks currently exist in sibling
> > > cgroups, and what tasks have *previously* existed in
> > > the cgroups. IOW, you should consider the list of extra resource
> > > controllers essentially non-deterministic.
> > >
> > > - We call virCgroupAddTask with pid=$child
> > >
> > > This places the pid in any resource controllers we need,
> > > which systemd has not already set up. IOW, it guarantees that we now
> > > have placement that should look like this, regardless of
> > > what systemd has done:
> > >
> > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> > > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> > > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > > - The QEMU driver now lets the child process exec QEMU. QEMU
> > > creates its vCPU threads at this point. All QEMU threads (emulator,
> > > vcpu and I/O threads) now have the cgroup placement shown above.
> > >
> > > - We create the emulator cgroup for the cpuset, cpu, cpuacct
> > > controllers and move all threads into this new cgroup. All threads
> > > (emulator, vcpu and I/O threads) thus now have placement of:
> > >
> > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
> > > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
> > > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > > Yes, we really did move the vcpu threads into the emulator
> > > group...
> > >
> > > - We now ask QEMU which are the vCPU & I/O threads.
> > >
> > > - For each vCPU thread we create a new vCPU cgroup and move the
> > > thread into this place
> > >
> > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> > > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> > > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> > >
> > > - For each I/O thread we create a new iothread cgroup and move the
> > > thread into this place
> > >
> > > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> > > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> > > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
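
To make the listings above easier to compare, here is a minimal Python
sketch that parses text in the /proc/<pid>/cgroup format into a map of
controller to path, then reports which controllers are not yet in the
machine scope. This is purely illustrative and is not how libvirt does
it; the function names and the "wanted" controller list are assumptions
made up for this example.

```python
def parse_cgroup_placement(text):
    """Map each controller name to its cgroup path.

    Each line has the form "hierarchy-ID:controller-list:path", where
    controller-list may name several co-mounted controllers separated
    by commas (e.g. "cpu,cpuacct").
    """
    placement = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        _hier_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            placement[controller] = path
    return placement


def missing_controllers(placement, wanted, scope):
    """Return the wanted controllers whose path is not exactly the scope."""
    return sorted(c for c in wanted if placement.get(c) != scope)


# Sample taken from the "systemd only" case quoted above (abbreviated).
SCOPE = r"/machine.slice/machine-qemu\x2dserial.scope"
SAMPLE = "\n".join([
    "9:cpuset:/",
    "5:memory:/",
    "2:cpu,cpuacct:/",
    "1:name=systemd:" + SCOPE,
])

placement = parse_cgroup_placement(SAMPLE)
todo = missing_controllers(
    placement, ["cpuset", "memory", "cpu", "cpuacct"], SCOPE)
print(todo)
```

With the sample above every listed resource controller is still at "/",
so all four are reported as needing placement, which is the gap that
virCgroupAddTask closes in the description.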
> >
> > BTW, on a slight tangent, the kernel is throwing a spanner in the
> > works in the near future. They have just accepted cgroupv2 into
> > mainline. Broadly speaking this is very nice because they got rid
> > of the idea of separate mount point for each controller, and instead
> > have a single filesystem tree. The problem is that they decided the
> > granularity of placement is at a *process* level, not a *thread*
> > level. So it will no longer be possible for us to have the cgroups
> > for emulator, vcpus & i/o threads. Everything will have to live in
> > the same cgroup :-( For cpu accounting and cpu affinity I think we
> > can still achieve what we need by using a combination of cgroups
> > and sched_setaffinity and /proc. I'm not sure what we'll do about
> > per-thread scheduler policies for period + quota though - not sure
> > if there's an API for setting those or not ?!?!
> >
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v2.txt
>
> Good to know. Do you have that on the agenda for libvirt? I guess
> eventually v1 will get deprecated...
We'll have no choice but to use cgroupv2 as soon as systemd starts
using it....
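
As a rough illustration of what telling the two layouts apart could
look like, here is a small Python sketch that classifies
/proc/<pid>/cgroup content. It relies on the cgroup v2 entry having the
form "0::<path>", as described in the kernel's cgroup v2 documentation;
the function name and the "hybrid" label are assumptions for this
example, not anything libvirt defines.

```python
def cgroup_layout(text):
    """Classify /proc/<pid>/cgroup content as 'v1', 'v2' or 'hybrid'.

    A v2 entry has hierarchy ID 0 and an empty controller list
    ("0::/some/path"); any other well-formed line is treated as a
    v1 (per-controller hierarchy) entry.
    """
    has_v1 = False
    has_v2 = False
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        hier_id, controllers, _path = parts
        if hier_id == "0" and controllers == "":
            has_v2 = True
        else:
            has_v1 = True
    if has_v1 and has_v2:
        return "hybrid"
    if has_v2:
        return "v2"
    if has_v1:
        return "v1"
    return "unknown"


v1_sample = "9:cpuset:/\n2:cpu,cpuacct:/system.slice"
v2_sample = "0::/machine.slice/example.scope"  # hypothetical path

print(cgroup_layout(v1_sample))
print(cgroup_layout(v2_sample))
```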
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|