[libvirt] [REPOST 0/4] Adjustment to recent cgroup/cpuset changes (for 1.3.1)
Henning Schild
henning.schild at siemens.com
Thu Jan 14 12:12:41 UTC 2016
On Thu, 14 Jan 2016 11:57:44 +0000
"Daniel P. Berrange" <berrange at redhat.com> wrote:
> On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
> > Reposting my cgroup fixes series:
> >
> > http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
> >
> > partially because I originally forgot to CC the author (Henning
> > Schild) of the original series for which these patch fix a couple
> > of issues discovered during regression testing (virt-test memtune
> > failures in Red Hat regression environment), but also to bring them
> > up to date with the top of libvirt git.
> >
> > NB: I did send Henning the changes after the fact, but my resend
> > using the same message-id skills so that replies are left in the
> > onlist series are lacking. Henning has looked at the first patch -
> > with a response here:
> >
> > http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
> >
> > Finally, I think these changes should go into 1.3.1 since that's
> > when the regression was introduced.
>
> Since this has been puzzelling us for a while, let me recap on the
> cgroup setup in general.
>
> First, I'll describe how it used to work *before* Henning's patches
> were merged, on a systemd based host.
>
> - The QEMU driver forks a child process, but does *not* exec QEMU yet
> The cgroup placement at this point is inherited from libvirtd. It
> may look like this:
>
> 10:freezer:/
> 9:cpuset:/
> 8:perf_event:/
> 7:hugetlb:/
> 6:blkio:/system.slice
> 5:memory:/system.slice
> 4:net_cls,net_prio:/
> 3:devices:/system.slice/libvirtd.service
> 2:cpu,cpuacct:/system.slice
> 1:name=systemd:/system.slice/libvirtd.service
>
> - The QEMU driver calls virCgroupNewMachine()
>
> - We calll virSystemdCreateMachine with pidleader=$child
>
> - Systemd creates the initial machine scope unit under
> the machine slice unit, for the "systemd" controller.
> It may also add the PID to *zero* or more other
> resource controllers. So at this point the cgroup
> placement may look like this:
>
> 10:freezer:/
> 9:cpuset:/
> 8:perf_event:/
> 7:hugetlb:/
> 6:blkio:/
> 5:memory:/
> 4:net_cls,net_prio:/
> 3:devices:/
> 2:cpu,cpuacct:/
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> Or may look like this:
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> Or anywhere in between. We have *ZERO* guarantee about
> what other resource controllers we may have been placed in by
> systemd. There is some fairly complex logic that
> determines this, based on what other tasks current exist in sibling
> cgroups, and what tasks have *previously* existed in the
> cgroups. IOW, you should consider the list of etra
> resource controllers essentially non-deterministic
>
> - We call virCgroupAddTask with pid=$child
>
> This places the pid in any resource controllers we need, which
> systemd has not already setup. IOW, it guarantees that we now
> have placement that should look like this, regardless of what
> systemd has done:
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> - The QEMU driver now lets the child process exec QEMU. QEMU creates
> its vCPU threads at this point. All QEMU threads (emulator, vcpu
> and I/O threads) now have the cgroup placement shown above.
>
> - We create the emulator cgroup for the cpuset, cpu, cpuacct
> controllers move all threads into this new cgroup. All threads
> (emulator, vcpu and I/O threads) thus now have placement of:
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> Yes, we really did move the vcpu threads into the emulator group...
>
> - We now ask QEMU which are the vCPU & I/O threads.
>
> - Foreach CPU thread we new vCPU cgroups and move them into this
> place
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> - Foreach I/O thread we new vCPU cgroups and move them into this
> place
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
>
>
> Now, lets Henning's three patches
>
>
> commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5
> Author: Henning Schild <henning.schild at siemens.com>
> Date: Wed Dec 9 17:58:10 2015 +0100
>
> util: cgroups do not implicitly add task to new machine cgroup
>
> This alters virCgroupNewMachine so that it no longer calls
> virCgroupAddTask. instead the callers are responsible for doing it.
> This is a functional no-opp change, so ostensibly harmless at this
> point.
>
> The second patch
>
> commit a41c00b472efaa192d2deae51ab732e65903238f
> Author: Henning Schild <henning.schild at siemens.com>
> Date: Mon Dec 14 15:48:05 2015 -0500
>
> qemu: do not put a task into machine cgroup
>
> This stops qemuInitCgroup from calling virCgroupAddTask. Instead it
> has added a call to vriCgroupAddTask in qemuSetupCgroupForEmulator(),
> which only affects the cpu, cpuset & cpuacct controllers.
>
> This means we no longer have any guarantee about which resource
> controllers the QEMU processes in general are in. All we can say is
> that they are in the 'systemd' controller, and the cpu, cpuset &
> cpuacct controllers. Systemd may have also placed it into *all* the
> other resource controllers, or it may have placed it into none of the
> them.
>
> As such, after this change, we are potentially not in the correct
> cgroup for the memory, blkio, netcls, devices controllers.
>
> For added fun, the qemuSetupCgroupForEmulator() still has a call to
> virCgroupMoveTask, so the addition of virCgroupAddTask to this method
> was useless.
>
>
> The final patch
>
> commit 90b721e43ec9232b5b218e891437bed04548e841
> Author: Henning Schild <henning.schild at siemens.com>
> Date: Mon Dec 14 15:58:05 2015 -0500
>
> qemu cgroups: move new threads to new cgroup after cpuset is set
> up
>
> This delays the point at which we call virCgroupAddTask for the
> vCPU and I/O thread cgroups, until after we have configured properties
> on those cgroups. This change is fine
>
>
> So I think we need to revert the first 2 of Hennings patches - the 2nd
> one is the real serious problem, but once we revert that, the 1st
> change becomes pointless and so should also be reverted.
>
>
> Going back to the key problem Henning was trying to address....
>
> The issue is that we only setup the emulator cgroup /after/ QEMU has
> already started running, so there's a period of time where threads are
> not confined by the cgroup affinity.
>
> I think we can solve that more simply by just moving the call to
> qemuSetupCgroupForEmulator(), into qemuSetupCgroup(). That way we
> place the emulator thread into the correct cgroup *before* we even
> exec QEMU. So we never have the window where we run in the unconfined
> cpuset controller.
Thanks for that very thorough analysis and description, very helpful
indeed! I fully agree with your conclusion and suggestion on how the
code should be changed.
Henning
More information about the libvir-list
mailing list